regex - Create new list from old using re.sub() in Python 2.7


My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, and then work on the remaining text.

I started with this, which works to remove the XML tags, but only from the entire XML file:

    from urllib import urlopen
    import re

    url = '[url of xml file here]'  # the URL of the file to search
    raw = urlopen(url).read()       # open the file and read it into a variable

    exp = re.compile(r'<.*?>')      # non-greedy match for anything that looks like a tag
    text_only = exp.sub('', raw).strip()

I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).

But I can't figure out how to apply the regex to the list resulting from soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations, but I keep getting this error: TypeError: expected string or buffer

What am I doing wrong?

You don't want to use regex for this. Instead, use BeautifulSoup's existing support for grabbing text:

    quoted_blocks = soup.find_all('quoted-block')
    text_chunks = [block.get_text() for block in quoted_blocks]
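As for the error itself: find_all returns Tag objects rather than strings, so handing one directly to exp.sub raises the TypeError. The `if` in that comprehension also only filters the items; it does not transform them. Here is a minimal, self-contained sketch of the get_text approach. The sample XML, the extract_quoted_text helper, and the choice of the built-in "html.parser" are assumptions made for illustration, not something taken from the question's actual file:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded XML; the real file will differ.
SAMPLE_XML = """
<bill>
  <section>
    <quoted-block><para>First <term>quoted</term> passage.</para></quoted-block>
    <para>Text outside any quoted block.</para>
    <quoted-block><para>Second passage.</para></quoted-block>
  </section>
</bill>
"""


def extract_quoted_text(markup):
    """Return the tag-free text of every quoted-block element in markup."""
    # "html.parser" ships with Python; an XML-specific parser may be a
    # better fit for real XML input (assumption: the tag names are lowercase).
    soup = BeautifulSoup(markup, "html.parser")
    quoted_blocks = soup.find_all("quoted-block")
    # get_text() joins all nested strings, dropping the tags themselves.
    return [block.get_text().strip() for block in quoted_blocks]


text_chunks = extract_quoted_text(SAMPLE_XML)
print(text_chunks)
```

If you really do want to keep the regex, converting each element first, as in exp.sub('', str(item)), should avoid the TypeError, but get_text is the simpler route.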
