regex - Create new list from old using re.sub() in Python 2.7

- August 15, 2010

My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, and then work on the remaining text.

I started with this, which works to remove the XML tags, but from the entire XML file:

    from urllib import urlopen
    import re

    url = [url of xml file here]  # the URL of the file to search
    raw = urlopen(url).read()  # open the file and read it into a variable
    exp = re.compile(r'<.*?>')
    text_only = exp.sub('', raw).strip()

I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).

But I can't figure out how to apply the regex to the list resulting from soup.find_all. I've tried text_only = [item for item in text2 if exp.sub('',item).strip()] and variations, but I keep getting this error: TypeError: expected string or buffer

What am I doing wrong?

You don't want a regex for this. Instead, use BeautifulSoup's existing support for grabbing text:

    quoted_blocks = soup.find_all('quoted-block')
    text_chunks = [block.get_text() for block in quoted_blocks]
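As for the error itself: find_all returns Tag objects rather than strings, and re.sub only accepts strings, hence the TypeError. The attempted comprehension also uses the substitution as a filter condition instead of as the value being collected. Here is a minimal sketch of both approaches side by side. The sample_xml string and the emph tag are made up for illustration, and the built-in "html.parser" backend is an assumption (the question doesn't say which parser is in use):

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample document; the real XML comes from the URL in the question.
sample_xml = (
    "<doc>"
    "<quoted-block>First <emph>quoted</emph> passage.</quoted-block>"
    "<section>Not quoted.</section>"
    "<quoted-block>  Second passage.  </quoted-block>"
    "</doc>"
)

# Assumption: the built-in html.parser backend, which needs no extra install.
soup = BeautifulSoup(sample_xml, "html.parser")
quoted_blocks = soup.find_all("quoted-block")

# Preferred: let BeautifulSoup collect the text of each element.
text_chunks = [block.get_text().strip() for block in quoted_blocks]

# The regex route also works once each Tag is converted to a string first,
# and the substitution is the collected value rather than the filter.
exp = re.compile(r"<.*?>")
regex_chunks = [exp.sub("", str(block)).strip() for block in quoted_blocks]
```

On this sample both lists hold the same two strings, but get_text() is the safer choice in general because it does not depend on a pattern matching every tag correctly.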
