XML : Parse RSS Feeds with variable XML structures in R

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

  library(RCurl)
  library(XML)
  xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
  script <- getURL(xml.url)
  doc <- xmlParse(script)
  doc.children <- xpathApply(doc, "//entry", xmlChildren)

This procedure works well on other feeds, where the individual news releases are stored in <item> nodes. It does not work in this particular case, where they are stored in <entry> nodes: the call returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can this approach be adapted to handle both cases, where the individual news stories are stored in either <item> or <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.
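
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the file name suggests an Atom feed, and Atom documents normally declare the default namespace http://www.w3.org/2005/Atom on the root element. In XPath 1.0 an unprefixed name such as entry only matches elements in no namespace, which would explain the empty list. Matching on local-name() sidesteps the namespace and covers both <item> and <entry>. The two small feeds and the helper get_stories below are hypothetical stand-ins, not the real feed contents.

```r
library(XML)

# Hypothetical miniature feeds standing in for the real ones (assumed structure).
atom_txt <- '<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Demo Atom feed</title>
  <entry><title>Story A</title><id>a</id></entry>
  <entry><title>Story B</title><id>b</id></entry>
</feed>'

rss_txt <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <title>Demo RSS feed</title>
  <item><title>Story C</title><link>http://example.com/c</link></item>
</channel></rss>'

atom_doc <- xmlParse(atom_txt, asText = TRUE)
rss_doc  <- xmlParse(rss_txt, asText = TRUE)

# XPath that matches <item> or <entry> by local name, ignoring any namespace.
story_xpath <- "//*[local-name()='item' or local-name()='entry']"

# Hypothetical helper: return the story nodes of a parsed feed.
get_stories <- function(doc) getNodeSet(doc, story_xpath)

# Hypothetical helper: return the title text of every story.
get_titles <- function(doc) {
  xpathSApply(doc, paste0(story_xpath, "/*[local-name()='title']"), xmlValue)
}

# The plain name test finds nothing in the namespaced Atom document ...
plain_entries <- getNodeSet(atom_doc, "//entry")

# ... while binding a prefix (here "a", an arbitrary choice) to the Atom namespace works.
ns_entries <- getNodeSet(atom_doc, "//a:entry",
                         namespaces = c(a = "http://www.w3.org/2005/Atom"))

atom_stories <- get_stories(atom_doc)
rss_stories  <- get_stories(rss_doc)

# Children of each story, as in the original xpathApply call.
atom_children <- lapply(atom_stories, xmlChildren)

print(get_titles(atom_doc))
print(get_titles(rss_doc))
```

The local-name() variant is the more forgiving choice when the feed type is unknown in advance. The prefixed variant is stricter and may be preferable once the feed is known to be Atom.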
