XML Stack Overflow: Using getNodeSet on XMLNodeSet (XML package)

I am having trouble using the R XML package for a specific application. Consider the following example document. I want the information in the b node inside the first a node. The nature of my application, however, is such that I first need to identify all the a nodes in the document, then subset this node set to get the first a node, and only then get the b node. The first step is easy:


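    library(XML)

    # Example document held in a string
    doc <- "
    <div></div>
    <a id='1'><b id='3'>text1</b></a>
    <a id='2'><b id='4'>text2</b></a>
    "
    # asText = TRUE makes it explicit that doc is markup, not a file name
    parsed <- htmlParse(doc, asText = TRUE)

    # Step 1: collect every a node in the document
    step1 <- getNodeSet(parsed, "//a")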

    > step1
    [[1]]
    <a id="1">
      <b id="3">text1</b>
    </a> 

    [[2]]
    <a id="2">
      <b id="4">text2</b>
    </a> 

    attr(,"class")
    [1] "XMLNodeSet"

This yields the expected result. The next step in my application is to extract the b nodes from the first a node. If I use getNodeSet on step1[[1]], however, I get the b nodes from both a nodes in the step1 node set.


    step2 <- getNodeSet(step1[[1]], "//b")
    step2

    [[1]]
    <b id="3">text1</b> 

    [[2]]
    <b id="4">text2</b> 

    attr(,"class")
    [1] "XMLNodeSet"

I figured out that I could use the XPath "b" to get the information in this example, but ultimately I need "//b" to work here. As I understand how the XML package works, this behaviour is not a bug but a consequence of step1[[1]] being a reference into the C-level representation of the whole document, so "//b" still searches from the document root. Is there any way I can achieve this "two-step" process? I essentially want step1[[1]] to work like a fresh document.
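
For reference, here is a minimal sketch of two ways this two-step process might be approached. It is not necessarily the only or best route. It reuses the example document from the question. The first option assumes a relative XPath (".//b", anchored at the context node) is acceptable in place of "//b". The second assumes that copying the node into its own document with xmlDoc() is affordable for the real data. The variable names abs_hits, rel_hits, sub_doc and fresh_hits are made up for this illustration.

```r
library(XML)

doc <- "
<div></div>
<a id='1'><b id='3'>text1</b></a>
<a id='2'><b id='4'>text2</b></a>
"
parsed <- htmlParse(doc, asText = TRUE)
step1  <- getNodeSet(parsed, "//a")

# "//b" is an absolute path: it starts at the root of the document that
# step1[[1]] belongs to, so it also finds the b node under the second a.
abs_hits <- getNodeSet(step1[[1]], "//b")

# Option 1: the leading "." anchors the search at the context node, so only
# descendants of the first a node are considered.
rel_hits <- getNodeSet(step1[[1]], ".//b")

# Option 2: copy the node into a new, standalone document. Inside that
# document the first a node is the root, so the unchanged "//b" query can
# only see its own b nodes.
sub_doc    <- xmlDoc(step1[[1]])
fresh_hits <- getNodeSet(sub_doc, "//b")

# Inspect the text content of each result
sapply(rel_hits, xmlValue)
sapply(fresh_hits, xmlValue)
```

Option 1 avoids copying anything but requires changing the query. Option 2 keeps the "//b" query untouched at the cost of duplicating the subtree, which may matter for large documents.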


Wednesday, 7 January 2015
