html - Eliminating Span Elements in a nested TD using BeautifulSoup
I'm pretty new to web scraping, and I wrote a small script to extract player scores from this site: http://fold.it/portal/players
Here's the code:
    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://www.fold.it/portal/players").read())

    for row in soup('tr', {'class': 'even'}):
        rank = row('td')[0].string
        td2 = row('td')[1]
        # the player name is inside the link in the second cell
        for name in td2('a'):
            user = name.text
        score = row('td')[2].string
        print rank, user, score
Now, this works pretty well, except that the user has 2 other scores in the name as well. Looking at the HTML, it seems there are 2 span elements after the href.
My first thought was to split 'user' on white space, but some names have spaces in them, so that didn't work. Then I thought about looking for a numeric value, but some users have numeric names as well.
I figure eliminating the span elements is the best option. However, I'm not sure what the best way to parse them out would be. Any help is appreciated!
The scores are in separate span tags, so use that:
    for row in soup('tr', {'class': 'even'}):
        cells = row('td')

        # find the first text node inside the link - that is our name
        name = cells[1].a.find(text=True).strip()

        # the ranks are in 2 separate `span` tags
        rank1, rank2 = cells[1].find_all("span")

        print name, rank1.text, rank2.text
Prints:
    galaxie 1 3
    smilingone 2 35
    locioiling 3 9
    desnouck maarten 4 153
    ...
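For anyone who wants to try this without hitting the site, here is a minimal self-contained sketch. The `SAMPLE_HTML` markup is hypothetical: it only assumes, based on the question, that each name cell holds a link whose text is followed by two span elements. The function names are made up for illustration. The first function takes the first text node of the link, as in the loop above (using `string=True`, the newer spelling of `text=True`); the second shows what "eliminating the spans" might look like with `decompose()`. It uses the built-in `html.parser` and runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the rows of the players table.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td>1</td>
    <td><a href="/user/1">galaxie <span>1</span> <span>3</span></a></td>
    <td>100</td>
  </tr>
  <tr class="even">
    <td>4</td>
    <td><a href="/user/4">desnouck maarten <span>4</span> <span>153</span></a></td>
    <td>90</td>
  </tr>
</table>
"""


def names_via_first_text_node(html):
    """Read only the first text node of each link and collect the span texts."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="even"):
        link = row.find_all("td")[1].a
        # The first text node is the name; the spans come after it.
        name = link.find(string=True).strip()
        spans = [span.get_text(strip=True) for span in link.find_all("span")]
        results.append((name, spans))
    return results


def names_via_span_removal(html):
    """Delete the span elements first, then read whatever link text is left."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="even"):
        link = row.find_all("td")[1].a
        for span in link.find_all("span"):
            span.decompose()  # removes the tag and its contents from the tree
        results.append(link.get_text().strip())
    return results


if __name__ == "__main__":
    print(names_via_first_text_node(SAMPLE_HTML))
    print(names_via_span_removal(SAMPLE_HTML))
```

Both variants keep names that contain spaces or digits intact, because neither splits the text. Note that the removal variant modifies the parsed tree, so the span values have to be read before calling `decompose()` if they are still needed.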