XML parsing in R: Attributes of Child nodes not accessible



I'm new to XML parsing and have worked through a variety of tutorials. Right now, I am trying to parse an XML document of search results from the PsychInfo database. My XML document seems to have a nesting structure that I don't fully understand. The file itself is very large, so I put the raw data in a GitHub Gist to make this a reproducible example.


The following code initializes the workspace and reads in the XML data.



library(XML)
library(RCurl)

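# Download the file at `url` to a temporary file and parse it as XML.
# Extra arguments in ... are passed on to xmlParse().
read.url <- function(url, ...) {
  tmpFile <- tempfile()
  download.file(url, destfile = tmpFile, method = "curl")
  url.data <- xmlParse(tmpFile, ...)
  return(url.data)
}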


Here is the full path to the data on Gist. (Attempts at shortening the URL with bit.ly prevent it from being read properly. Any suggestions for addressing this secondary problem would be helpful!)


DF <- read.url("http://ift.tt/1uplqm2")


The following code first checks that the result is of the XML document class. The subsequent lines extract the root node and its children.



class(DF) #Check to see it is an XML document
RootNode <- xmlRoot(DF) #Obtain root node
ChildNodes <- xmlChildren(RootNode) #Obtain children of root


Now I am examining the first record of the XML file (the first child node of the root). When I look at the attributes of this item with xmlAttrs(), it shows only one list item, as indicated by the $. But there are a number of different fields that I want to access, such as the article title and authors.



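pietyScale <- ChildNodes[[1]]  # First record
xmlName(pietyScale)            # Element name of the record
xmlSize(pietyScale)            # Number of child nodes
xmlValue(pietyScale)           # Text content of the record and its descendants
xmlChildren(pietyScale)        # Child nodes of the record
xmlAttrs(pietyScale)           # Shows only one item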
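One possible reading of the output above is that fields such as the title and authors are stored as child elements of the record rather than as attributes, in which case xmlAttrs() would not return them. The sketch below illustrates the difference on a tiny inline document. The element and attribute names (rec, resultID, title, author) are assumptions made up for illustration, not the real schema of the export; names(xmlChildren(pietyScale)) should show the actual names.

```r
library(XML)

# Hypothetical record layout: the names below are assumptions,
# not the real structure of the PsychInfo export.
toy <- paste0(
  "<records>",
  "<rec resultID=\"1\">",
  "<title>A piety scale</title>",
  "<author>Smith, A.</author>",
  "<author>Jones, B.</author>",
  "</rec>",
  "</records>"
)

doc  <- xmlParse(toy, asText = TRUE)
rec1 <- getNodeSet(doc, "//rec")[[1]]

# Attributes live on the opening tag itself: only resultID here.
xmlAttrs(rec1)

# Title and authors are child elements, so they are reached by name
# or by a relative XPath expression instead.
title   <- xmlValue(rec1[["title"]])
authors <- xpathSApply(rec1, "./author", xmlValue)
```

If the real records follow a similar pattern, the same two calls with the actual element names substituted should reach the fields of interest.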


Ideally, I would like to get the data into a data frame in long format (one row per author, to handle articles with multiple authors). Feel free to direct me to other postings or resources that may address this issue.
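As a rough sketch of what such a long-format result could look like, the example below builds one row per record and author from a small inline document. It assumes each record is a rec element with one title child and zero or more author children. These names are hypothetical and would need to be replaced with those in the real file.

```r
library(XML)

# Hypothetical two-record document; element names are assumptions.
toy <- paste0(
  "<records>",
  "<rec><title>First paper</title>",
  "<author>Smith, A.</author><author>Jones, B.</author></rec>",
  "<rec><title>Second paper</title>",
  "<author>Lee, C.</author></rec>",
  "</records>"
)

doc  <- xmlParse(toy, asText = TRUE)
recs <- getNodeSet(doc, "//rec")

# Build one small data frame per record, then stack them.
rows <- lapply(seq_along(recs), function(i) {
  rec     <- recs[[i]]
  authors <- xpathSApply(rec, "./author", xmlValue)
  # Keep records without authors as a single row with a missing author.
  if (length(authors) == 0) authors <- NA_character_
  data.frame(
    record = i,
    title  = xmlValue(rec[["title"]]),
    author = as.character(authors),
    stringsAsFactors = FALSE
  )
})

long <- do.call(rbind, rows)
long
```

The record index and title are recycled across each record's authors, so an article with two authors contributes two rows. For a very large file, stacking many small data frames with rbind may be slow, and collecting the columns as vectors first could be a better fit.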

