I am getting an error that says I need to use the XML_PARSE_HUGE option in rvest, but I can't figure out how to set it. As a workaround I tried to grab a list of URLs from a site with the XML package, but that didn't grab the complete list either. I can explain in more detail if needed, but I will just paste the code I am using to see if anyone knows a way to make either approach work.
```r
library(rvest)

url <- "http://www.example-website.com/url-list.html"
list <- read_html(url) %>% xml_nodes("dd a")
```

This option fails while reading the website and tells me I need to use the XML_PARSE_HUGE option. I looked through the help documentation and read a few other answers here, but they didn't help much. I installed a version of the xml2 package that was supposed to force the option, but that didn't do anything either. The other option gets the list (in a clumsy way), but doesn't get the full list.
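For what it's worth, `read_html()` is `xml2::read_html()` under the hood, and it accepts an `options` argument that can include `"HUGE"` (libxml2's XML_PARSE_HUGE flag, which lifts the parser's hardcoded size limits). A minimal sketch, demonstrated on an inline snippet since I can't test against the real site; the same call should work with a URL:

```r
library(xml2)

# "HUGE" lifts libxml2's size limits; the other three flags are
# read_html()'s defaults, repeated here so nothing else changes.
opts <- c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")

# With a real page this would be:
#   doc <- read_html("http://www.example-website.com/url-list.html", options = opts)
doc <- read_html('<dl><dd><a href="a.htm">A</a></dd></dl>', options = opts)

nodes <- xml_find_all(doc, "//dd/a")
length(nodes)
```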
```r
library(XML)
library(magrittr)  # for %>%

htmlTreeParse(url, options = HUGE) %>% xmlRoot -> check
check[[2]] %>% paste %>% strsplit(split = '"') -> check2
url.list <- paste(url, check2[[3]][grep(".htm", check2[[3]])], sep = "")
```

As I understand it (and this may be wrong), this grabs the HTML. The second element seemed to have all the links I wanted, so I pasted the HTML as a character string, split it on quotation marks, and searched the pieces for ".htm" to find the links. This is probably a clumsy way to extract the links, but it actually works fine. The problem with this option is that `htmlTreeParse` doesn't actually get the full list: I checked, and it stops a bit short of what I wanted. The rest of the code successfully grabs all of the links that `htmlTreeParse` does.
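Instead of deparsing the tree to text and splitting on quotes, you can ask the parser for the `href` attributes directly. A sketch with xml2; the `//dd/a` selector and the base URL are guesses based on the snippets above, and the inline HTML is a stand-in for the parsed page:

```r
library(xml2)

# Stand-in for read_html(url, options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE"))
doc <- read_html('
  <dl>
    <dd><a href="first.htm">First</a></dd>
    <dd><a href="second.htm">Second</a></dd>
    <dd><a href="notes.txt">Not a page</a></dd>
  </dl>')

# Pull the href attribute off every <a> inside a <dd>.
hrefs <- xml_attr(xml_find_all(doc, "//dd/a"), "href")

# Keep only the .htm links, then resolve them against the site root.
hrefs <- hrefs[grepl("\\.htm$", hrefs)]
url.list <- url_absolute(hrefs, "http://www.example-website.com/")
print(url.list)
```

This sidesteps the string-splitting entirely and keeps the filtering (`.htm` only) explicit.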
I just started teaching myself web scraping yesterday, so I am sure I am doing a lot of boneheaded things, but any advice that can be given would be appreciated.