regex - Create new list from old using re.sub() in Python 2.7
My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, and then work on the remaining text.
I started with this, which works to remove the XML tags, but from the entire XML file:
from urllib import urlopen
import re

url = '[url of xml file here]'  # the URL of the file to search
raw = urlopen(url).read()  # open the file and read it into a variable
exp = re.compile(r'<.*?>')  # non-greedy match for anything that looks like a tag
text_only = exp.sub('', raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()], and variations of it, but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want a regex for this. Instead, use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
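As for the TypeError itself: find_all returns Tag objects rather than strings, and exp.sub() only accepts strings. The attempted comprehension also uses if, which filters items instead of transforming them. Here is a minimal, self-contained sketch of both approaches. The sample XML string is made up for illustration, and it assumes the bs4 package (BeautifulSoup 4) with the built-in "html.parser"; your real document and parser choice may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, standing in for the XML fetched from the URL.
raw = (
    "<doc>"
    "<quoted-block><p>First <b>quoted</b> text.</p></quoted-block>"
    "<other>skip me</other>"
    "<quoted-block>Second block.</quoted-block>"
    "</doc>"
)

# Assumption: the stdlib-backed "html.parser" is good enough for this sample.
soup = BeautifulSoup(raw, "html.parser")
quoted_blocks = soup.find_all("quoted-block")  # a list of Tag objects, not strings

# Preferred: let BeautifulSoup extract the text of each element.
text_chunks = [block.get_text().strip() for block in quoted_blocks]

# If you really want the regex, convert each Tag to its markup string first,
# and put the transformation in the expression part of the comprehension.
exp = re.compile(r"<.*?>")
regex_chunks = [exp.sub("", str(block)).strip() for block in quoted_blocks]

print(text_chunks)
print(regex_chunks)
```

Both lists come out the same for this sample, but get_text() is the safer choice because the regex can be fooled by things like a > character inside an attribute value.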