I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml. Along the way, I ran into two questions:
1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:
    library(RCurl)
    library(XML)
    xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
    script <- getURL(xml.url)
    doc <- xmlParse(script)
    doc.children = xpathApply(doc,"//entry",xmlChildren)
Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.
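One possible explanation, offered as a sketch rather than a confirmed diagnosis: Atom feeds normally declare the default namespace http://www.w3.org/2005/Atom on the root element, and an unprefixed XPath step such as //entry only matches elements in no namespace. The example below assumes the feed follows that convention. The inline document, the prefix "a", and the variable names are hypothetical stand-ins, not taken from the real feed.

```r
library(XML)

# Hypothetical inline Atom document standing in for the downloaded feed
atom_txt <- '<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First story</title></entry>
  <entry><title>Second story</title></entry>
</feed>'
atom_doc <- xmlParse(atom_txt, asText = TRUE)

# Bind an arbitrary prefix ("a") to the Atom namespace URI and use it in the XPath
atom_ns <- c(a = "http://www.w3.org/2005/Atom")
entries <- xpathApply(atom_doc, "//a:entry", xmlChildren, namespaces = atom_ns)

# Number of <entry> nodes matched once the namespace is supplied
length(entries)
```

The prefix is only a label inside the XPath expression; what has to match the document is the namespace URI.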
2) More generally: can I implement this approach so that it handles both cases, where the individual news stories are stored either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?
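A minimal sketch of one way this might be done, assuming the only difference between the feeds is the element name and the presence of a namespace: the XPath function local-name() compares element names without their namespace, so a single expression can match both <item> and <entry>. The helper get_story_nodes and both inline documents are hypothetical and only illustrate the idea.

```r
library(XML)

# Hypothetical helper: returns the children of every story node,
# whether the feed uses <item> (RSS) or <entry> (Atom)
get_story_nodes <- function(doc) {
  story_xpath <- "//*[local-name()='item' or local-name()='entry']"
  xpathApply(doc, story_xpath, xmlChildren)
}

# Hypothetical Atom-style document (default namespace declared)
atom_doc <- xmlParse('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First story</title></entry>
  <entry><title>Second story</title></entry>
</feed>', asText = TRUE)

# Hypothetical RSS-style document (no namespace)
rss_doc <- xmlParse('<rss version="2.0"><channel>
  <item><title>Only story</title></item>
</channel></rss>', asText = TRUE)

atom_stories <- get_story_nodes(atom_doc)
rss_stories <- get_story_nodes(rss_doc)

# Number of story nodes found in each document
length(atom_stories)
length(rss_stories)
```

Matching on local-name() ignores namespaces entirely, so it would also pick up elements called item or entry from unrelated vocabularies; for feeds that mix namespaces, an explicit prefix mapping is the stricter option.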
Any help is very much appreciated, thank you.