XPath and R - create a key table



I'm new to the XML package for R and new to XPath. I have a very large XML file that I am parsing. I wrote some loop-based code that works but takes too long, so I am rewriting it more efficiently with XPath. The XML looks something like this:



...
<person personId="1">
  <personNames>
    <personName nameId="1000">
      <first>Joe</first>
      <last>Jones</last>
    </personName>
    <personName nameId="1001">
      <first>Joseph</first>
      <last>Jones</last>
    </personName>
    <personName nameId="1002">
      <first>The One and only Joe</first>
    </personName>
  </personNames>
</person>
...


Some people have one name, some have more. Some people have first and last names, some have just a first name or just a last name. So, I need to be careful.


I was able to efficiently create a data frame of first and last names using XPath:



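library(XML)
# Parse into an internal (C-level) tree so XPath queries can run against it
doc <- xmlTreeParse("People.xml", useInternalNodes = TRUE)
top <- xmlRoot(doc)
# First names, with the nameId of each personName that has a <first> child
First <- as.character(xpathApply(top, "//person/personNames/personName/first", xmlValue))
name_id <- as.integer(xpathApply(top, "//person/personNames/personName[first]/@nameId"))
FirstNames <- data.frame(name_id = name_id, first = First)
# Last names, with the nameId of each personName that has a <last> child
Last <- as.character(xpathApply(top, "//person/personNames/personName/last", xmlValue))
name_id <- as.integer(xpathApply(top, "//person/personNames/personName[last]/@nameId"))
LastNames <- data.frame(name_id = name_id, last = Last)
# Full outer join on name_id so names missing a first or last part are kept
Names <- merge(x = FirstNames, y = LastNames, by = "name_id", all = TRUE)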


My Names data frame looks good. It has the nameId, first name, and last name of every person. If a first or last name is missing, it is NA. It was generated in a few minutes (610K rows!). Awesome.


The problem is associating these names with the parent personId. I assume I need to loop through the names in my data frame, and grab the personId that has the correct nameId attribute, but I am unable to do this. For example, the following code gives me a null result:



xpathSApply(top, "//person/personNames/personName[@nameId='1000']/@personId")


I am expecting a result of 1. What is the most efficient way to add a column in my data frame for personId?
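
One possible approach, offered as a sketch rather than a definitive answer: personId is an attribute of the person element, not of personName, so the path above looks for an attribute that does not exist on that node. The predicate can instead be moved onto person, or, to build the whole key table, each personName node can be walked up two levels with xmlParent. The inline XML string, the people root element, and the second person below are made-up stand-ins for People.xml, and Key and one_person are hypothetical names.

```r
library(XML)

# Hypothetical inline sample standing in for People.xml
xml_text <- '<people>
  <person personId="1">
    <personNames>
      <personName nameId="1000"><first>Joe</first><last>Jones</last></personName>
      <personName nameId="1002"><first>The One and only Joe</first></personName>
    </personNames>
  </person>
  <person personId="2">
    <personNames>
      <personName nameId="2000"><last>Smith</last></personName>
    </personNames>
  </person>
</people>'
doc <- xmlParse(xml_text, asText = TRUE)

# Single lookup: put the predicate on <person>, then read its own attribute
one_person <- as.integer(xpathSApply(
  doc, "//person[personNames/personName/@nameId='1000']/@personId"))

# Key table: visit every personName once and climb to its <person> ancestor
# (personName -> personNames -> person)
name_nodes <- getNodeSet(doc, "//person/personNames/personName")
name_id <- as.integer(sapply(name_nodes, xmlGetAttr, "nameId"))
person_id <- as.integer(sapply(name_nodes, function(n) {
  xmlGetAttr(xmlParent(xmlParent(n)), "personId")
}))

Key <- data.frame(name_id = name_id, person_id = person_id)
```

The key table could then be joined onto the existing data frame with something like merge(Names, Key, by = "name_id"). Note that a single path such as //personName/../../@personId would not work for this, because XPath returns a node set with duplicates removed, giving one personId per person rather than one per name.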

