XML : Python lxml XPath : preceding keyword does not give expected result

I am trying to parse an XML document as follows:

  import re
  from lxml.html.soupparser import fromstring

  inString = """
  <doc>
    <q></q>
    <p1>
      <p2 dd="ert" ji="pp">
        <p3>1</p3>
        <p3>2</p3>
        <p3>ABC</p3>
        <p3>3</p3>
      </p2>
      <p2 dd="ert" ji="pp">
        <p3>4</p3>
        <p3>5</p3>
        <p3>ABC</p3>
        <p3>6</p3>
      </p2>
    </p1>
    <r></r>
    <p1>
      <p2 dd="ert" ji="pp">
        <p3>7</p3>
        <p3>8</p3>
        <p3>ABC</p3>
        <p3>9</p3>
      </p2>
      <p2 dd="ert" ji="pp">
        <p3>10</p3>
        <p3>11</p3>
        <p3>ABC</p3>
        <p3>12</p3>
      </p2>
    </p1>
  </doc>
  """
  root = fromstring(inString)

  nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]//preceding::p2//p3")

  print " ".join([re.sub('[\s+]', ' ', para.text.encode('utf-8').strip()) for para in nodes])

So, for each <p1> tag, I want to get the <p3> tags inside each <p2>, but only those that come before the <p3> whose text contains ABC. However, if I run the above code, I get:

  1 2 ABC 3 4 5 ABC 6 7 8 ABC 9    

The desired output is:

  1 2 4 5 7 8 10 11    

Also, if I make this change:

  nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]")    

I get:

  ABC ABC ABC ABC    

So it looks like the second query finds every <p3> node containing ABC across the entire document, which is fine. Why doesn't my first query work?

How do I get the desired output?
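One likely explanation, offered as a sketch rather than a definitive answer: `preceding::p2` selects whole <p2> elements that end before the ABC node, and the trailing `//p3` then returns every <p3> inside them, including ABC and whatever follows it. The <p2> that contains a given ABC node is its ancestor, so the `preceding` axis never selects it, which would explain why 10 11 ABC 12 are missing from the output. The `preceding-sibling` axis stays inside the same <p2>. The example below assumes Python 3 and parses with `lxml.etree.fromstring` instead of the soupparser, so the root is <doc> itself and the paths start at `//p1` rather than `./doc`. The names `XML`, `broad` and `narrow` are only illustrative.

```python
from lxml import etree

# Same structure as the document in the question, written compactly.
XML = """
<doc>
  <q></q>
  <p1>
    <p2 dd="ert" ji="pp"><p3>1</p3><p3>2</p3><p3>ABC</p3><p3>3</p3></p2>
    <p2 dd="ert" ji="pp"><p3>4</p3><p3>5</p3><p3>ABC</p3><p3>6</p3></p2>
  </p1>
  <r></r>
  <p1>
    <p2 dd="ert" ji="pp"><p3>7</p3><p3>8</p3><p3>ABC</p3><p3>9</p3></p2>
    <p2 dd="ert" ji="pp"><p3>10</p3><p3>11</p3><p3>ABC</p3><p3>12</p3></p2>
  </p1>
</doc>
"""

# Assumption: plain lxml.etree parsing, so the root element is <doc>.
root = etree.fromstring(XML)

# Original idea: jump to every earlier <p2>, then take all of its <p3> children.
broad = root.xpath("//p1/p2/p3[contains(text(),'ABC')]//preceding::p2//p3")

# Alternative: from each ABC node, look only at earlier <p3> siblings
# that share the same <p2> parent.
narrow = root.xpath("//p1/p2/p3[contains(text(),'ABC')]/preceding-sibling::p3")

broad_text = " ".join(p.text.strip() for p in broad)
narrow_text = " ".join(p.text.strip() for p in narrow)

print(broad_text)   # 1 2 ABC 3 4 5 ABC 6 7 8 ABC 9
print(narrow_text)  # 1 2 4 5 7 8 10 11
```

With the soupparser root from the question, the same `preceding-sibling::p3` step should apply to the original `./doc//p1/p2/p3[...]` path, though I have not checked how the soupparser wraps the document.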
