html - rvest package read_html() function stops reading at "<" symbol
I am wondering whether this behavior is intentional in the rvest package. When rvest sees a < character, it stops reading the HTML.
library(rvest)
read_html("<html><title>under 30 years = < 30 years <title></html>")
prints:
[1] <head>\n <title>under 30 = </title>\n</head>
If this is intentional, is there a workaround?
Yes, this is normal for rvest, because it is normal for HTML.
See the W3Schools HTML Entities page. < and > are reserved characters in HTML, so their literal values have to be written another way, using specific character entities. Here is the entity table from the linked page, giving some commonly used HTML characters and their respective HTML entities.
XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result          Description Entity Name Entity Number
# 1           non-breaking space      &nbsp;        &#160;
# 2       <            less than        &lt;         &#60;
# 3       >         greater than        &gt;         &#62;
# 4       &            ampersand       &amp;         &#38;
# 5       ¢                 cent      &cent;        &#162;
# 6       £                pound     &pound;        &#163;
# 7       ¥                  yen       &yen;        &#165;
# 8       €                 euro      &euro;       &#8364;
# 9       ©            copyright      &copy;        &#169;
# 10      ® registered trademark       &reg;        &#174;
So you will have to replace those values, perhaps with gsub(), or manually if there aren't many. You can see that the document parses properly once those characters are replaced with the correct entity.
library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "
You could use gsub(), like the following:
txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "
I used the XML package here, but the same applies to other packages that process HTML.
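Since the question was about rvest, here is a rough sketch of the same workaround using xml2, the package that read_html() comes from. The helper name escape_bare_lt is made up for this example and is not part of rvest or xml2. It assumes that a stray < is always followed by whitespace or a digit, which holds for the string in the question but is only a heuristic for real pages.

```r
library(xml2)  # rvest re-exports read_html() from xml2

txt <- "<html><title>under 30 years = < 30 years </title></html>"

# Hypothetical helper (not from any package): escape a "<" that is followed
# by whitespace or a digit, because such a "<" cannot be the start of a tag.
# This is a heuristic for simple cases, not a general HTML sanitizer.
escape_bare_lt <- function(x) {
  gsub("<(?=[[:space:][:digit:]])", "&lt;", x, perl = TRUE)
}

# Parse the escaped string and pull out the title text
doc <- read_html(escape_bare_lt(txt))
title_text <- xml_text(xml_find_first(doc, "//title"))
title_text
```

The lookahead leaves real tags such as <title> and </title> untouched, so only the bare < in the text is rewritten. If the stray characters can appear in other positions, replacing them by hand or fixing them at the source is probably safer than a regex.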