html - Eliminating Span Elements in a nested TD using BeautifulSoup

- September 15, 2013

I'm pretty new to web scraping, and I wrote a small script to extract player scores from this site: http://fold.it/portal/players

Here's the code:

# Python 2 (urllib2, print statement)
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://www.fold.it/portal/players").read())

for row in soup('tr', {'class': 'even'}):
    rank = row('td')[0].string
    td2 = row('td')[1]
    for name in td2('a'):
        user = name.text
    score = row('td')[2].string
    print rank, user, score

Now, this works pretty well, except that the user has 2 other scores in the name as well. Looking at the HTML, it seems there are 2 span elements after the href.

My first thought was to split 'user' on white space, but some names have spaces in them, so that didn't work. Then I thought about looking for numeric values, but some users have numeric names as well.

I figure eliminating the spans is the best option. However, I'm not sure what the best way to parse them out would be. Any help would be appreciated!

The scores are in separate span tags, so you can use that:

for row in soup('tr', {'class': 'even'}):
    cells = row('td')
    rank = cells[0].string

    # find the first text node inside the link: this is the name
    name = cells[1].a.find(text=True).strip()

    # the ranks are in 2 separate `span` tags
    rank1, rank2 = cells[1].find_all("span")

    print name, rank1.text, rank2.text

Prints:

galaxie 1 3
smilingone 2 35
locioiling 3 9
desnouck maarten 4 153
...
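
If you would rather remove the spans, as the question suggests, something along these lines might work. This is only a sketch in Python 3: `SAMPLE_HTML` is a made-up fragment that assumes the two spans sit inside the link next to the name, and `parse_rows` is a hypothetical helper name, so the real page may need different selectors. It reads the span values first, then drops the spans with `decompose()` so that only the name is left in the link text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td>4</td>
    <td><a href="/portal/user/1">Desnouck Maarten<span>4</span><span>153</span></a></td>
    <td>12345</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (rank, name, span_values, score) tuples for each 'even' row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="even"):
        cells = row.find_all("td")
        link = cells[1].a

        # Keep the span values before removing the tags from the tree.
        spans = link.find_all("span")
        span_values = [span.get_text(strip=True) for span in spans]
        for span in spans:
            span.decompose()

        # With the spans gone, the link text is just the name, spaces included.
        name = link.get_text(strip=True)
        rank = cells[0].get_text(strip=True)
        score = cells[2].get_text(strip=True)
        results.append((rank, name, span_values, score))
    return results


if __name__ == "__main__":
    for record in parse_rows(SAMPLE_HTML):
        print(record)
```

Because the spans are removed from the parsed tree rather than split out of a string, names containing spaces or digits are left untouched.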
