html - rvest package read_html() function stops reading at "<" symbol -


I'm wondering whether this behavior is intentional in the rvest package. When rvest sees the < character, it stops reading the HTML.

library(rvest)
read_html("<html><title>under 30 years = < 30 years <title></html>")

prints:

[1] <head>\n  <title>under 30 = </title>\n</head> 

If it is intentional, is there a workaround?

Yes, this is normal for rvest because it's normal for HTML.

See the W3Schools HTML Entities page. < and > are reserved characters in HTML, and their literal values have to be written another way, using specific character entities. Here is the entity table from the linked page, giving some commonly used HTML characters and their respective HTML entities.

XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result          Description Entity Name Entity Number
# 1           non-breaking space      &nbsp;        &#160;
# 2       <            less than        &lt;         &#60;
# 3       >         greater than        &gt;         &#62;
# 4       &            ampersand       &amp;         &#38;
# 5       ¢                 cent      &cent;        &#162;
# 6       £                pound     &pound;        &#163;
# 7       ¥                  yen       &yen;        &#165;
# 8       €                 euro      &euro;       &#8364;
# 9       ©            copyright      &copy;        &#169;
# 10      ® registered trademark       &reg;        &#174;

So you'll have to replace those characters, perhaps with gsub() or manually if there aren't too many. You can see that the document parses correctly once the characters are replaced with the correct entity.

library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "

You could use gsub(), like the following:

txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "

I used the XML package here, but the same applies to other packages that process HTML.
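The fixed-string gsub() call only catches a < that has a space on each side. If the stray characters can also appear as <= or directly before a digit, a regular expression with a lookahead may be a bit more forgiving. This is only a rough sketch in base R: escape_bare_lt() is a made-up helper name, not something rvest or XML provides, and it assumes that any < followed by a letter, /, ! or ? is real markup. That means text such as "x <y" would still slip through.

```r
# Hypothetical helper (not part of rvest or XML): escape every "<" that
# does not look like the start of a tag, closing tag, comment or declaration.
escape_bare_lt <- function(html) {
  # Negative lookahead: match "<" only when it is NOT followed by
  # a letter, "/", "!" or "?". Requires perl = TRUE.
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

txt <- "<html><title>under 30 years = < 30 years </title></html>"
escaped <- escape_bare_lt(txt)
print(escaped)
# [1] "<html><title>under 30 years = &lt; 30 years </title></html>"
```

The escaped string can then be passed to read_html() or htmlParse() as before.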

