I've had a number of projects requiring XML processing in R and I've always struggled. The problem is always the same: parsing someone else's complex XML structure into a workable dataframe.
Below is an example of my usual problem. With the data I work with, node names are not necessarily consistent between files. I usually just want to flatten the structure so that each dataframe row corresponds to a node at the deepest level, and then fill the columns with the names or attributes of its parent (ancestor) nodes.
I want to get from this:
library(XML)
# Small example extract
# Most data points removed
xml_extract <- xmlParse("
<COMPARISON ID=\"CMP-001\" NO=\"1\">
<NAME>Incomplete resection (HGG)</NAME>
<DICH_SUBGROUP CHI2=\"0.0\" CI_END=\"0.0\" CI_START=\"0.0\">
<NAME>iMRI</NAME>
<DICH_DATA CI_END=\"0.9640231041199472\" CI_START=\"0.017586933339032232\"/>
</DICH_SUBGROUP>
<DICH_SUBGROUP CHI2=\"0.0\" CI_END=\"0.0\" CI_START=\"0.0\">
<NAME>5-ALA</NAME>
<DICH_DATA CI_END=\"0.7124078544369572\" CI_START=\"0.4242461206130219\"/>
</DICH_SUBGROUP>
<DICH_SUBGROUP CHI2=\"0.0\" CI_END=\"0.0\" CI_START=\"0.0\">
<NAME>DTI-neuronavigation</NAME>
<DICH_DATA CI_END=\"0.6302184844574396\" CI_START=\"0.19776580326143214\"/>
</DICH_SUBGROUP>
</COMPARISON>
")
To this:
(I know two of these columns have the same NAME, part of the problem. Not my XML.)
I use XML and have had a look at XML2R. I'm familiar-ish with XPath. Standard xmlToDataFrame type commands don't work. Standard Apply approaches such as xmlSApply or plyr for lists usually require completely standardised node names.
Is what I'm looking for possible? A recursive function that runs through and flattens an XML structure. I know I have conveniently omitted the bits I don't want from the XML extract in the table :) Thank you in advance!
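To make the idea concrete, here is a rough sketch of the kind of recursive flattening described above, not a tested general solution. The helper names `flatten_node` and `rows_to_df` are hypothetical (they are not part of the XML package), and the sketch makes several assumptions: columns are named `TAG.ATTRIBUTE` or `TAG.CHILD` to get around the duplicate NAME problem, a child element with no attributes and no child elements (such as NAME) is treated as a column of its parent, and any element with nothing more complex beneath it becomes one row. The sample XML is a trimmed copy of the extract above.

```r
library(XML)

# Hypothetical helper: walk the tree recursively and return a list with one
# named character vector per "row" element.
flatten_node <- function(node, inherited = character()) {
  tag <- xmlName(node)
  children <- getNodeSet(node, "*")  # child elements only, text nodes skipped

  # Attributes of this element become columns prefixed with its tag name
  cols <- inherited
  attrs <- xmlAttrs(node)
  if (length(attrs) > 0) {
    cols <- c(cols, setNames(as.character(attrs),
                             paste(tag, names(attrs), sep = ".")))
  }

  # Assumption: a child with no attributes and no child elements is "simple"
  # and is stored as a column of its parent rather than as a row of its own
  is_simple <- vapply(children, function(ch) {
    length(getNodeSet(ch, "*")) == 0 && length(xmlAttrs(ch)) == 0
  }, logical(1))
  for (i in which(is_simple)) {
    ch <- children[[i]]
    cols[paste(tag, xmlName(ch), sep = ".")] <- xmlValue(ch)
  }

  # Nothing more complex below this element: emit it as one row
  complex <- which(!is_simple)
  if (length(complex) == 0) {
    return(list(cols))
  }

  # Otherwise recurse, passing the columns collected so far down the tree
  do.call(c, lapply(complex, function(i) flatten_node(children[[i]], cols)))
}

# Hypothetical helper: bind the rows, filling columns a row lacks with NA
rows_to_df <- function(rows) {
  all_cols <- unique(unlist(lapply(rows, names)))
  filled <- lapply(rows, function(r) {
    out <- setNames(rep(NA_character_, length(all_cols)), all_cols)
    out[names(r)] <- r
    out
  })
  as.data.frame(do.call(rbind, filled), stringsAsFactors = FALSE)
}

# Trimmed copy of the extract above
xml_text <- '
<COMPARISON ID="CMP-001" NO="1">
  <NAME>Incomplete resection (HGG)</NAME>
  <DICH_SUBGROUP CHI2="0.0">
    <NAME>iMRI</NAME>
    <DICH_DATA CI_END="0.9640231041199472" CI_START="0.017586933339032232"/>
  </DICH_SUBGROUP>
  <DICH_SUBGROUP CHI2="0.0">
    <NAME>5-ALA</NAME>
    <DICH_DATA CI_END="0.7124078544369572" CI_START="0.4242461206130219"/>
  </DICH_SUBGROUP>
</COMPARISON>'

doc <- xmlParse(xml_text, asText = TRUE)
df <- rows_to_df(flatten_node(xmlRoot(doc)))
print(df)
```

On this sample the result should have one row per DICH_DATA element, with columns such as COMPARISON.NAME, DICH_SUBGROUP.NAME and DICH_DATA.CI_END. All values come back as character, so numeric columns would still need converting. Two limitations of the naming scheme: same-named elements nested inside each other would collide, and repeated simple children under one parent would overwrite one another.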