I am trying to parse an XML document as follows:
import re
from lxml.html.soupparser import fromstring

inString = """
<doc>
<q></q>
<p1>
    <p2 dd="ert" ji="pp">
        <p3>1</p3>
        <p3>2</p3>
        <p3>ABC</p3>
        <p3>3</p3>
    </p2>
    <p2 dd="ert" ji="pp">
        <p3>4</p3>
        <p3>5</p3>
        <p3>ABC</p3>
        <p3>6</p3>
    </p2>
</p1>
<r></r>
<p1>
    <p2 dd="ert" ji="pp">
        <p3>7</p3>
        <p3>8</p3>
        <p3>ABC</p3>
        <p3>9</p3>
    </p2>
    <p2 dd="ert" ji="pp">
        <p3>10</p3>
        <p3>11</p3>
        <p3>ABC</p3>
        <p3>12</p3>
    </p2>
</p1>
</doc>
"""

root = fromstring(inString)
nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]//preceding::p2//p3")
print " ".join([re.sub('[\s+]', ' ', para.text.encode('utf-8').strip()) for para in nodes])
So, for each <p1> tag, I want to get to the <p3> tags inside <p2>. Then I only want the <p3> tags up to the tag whose text contains ABC. However, if I run the above code, I get:
1 2 ABC 3 4 5 ABC 6 7 8 ABC 9
The desired output is:
1 2 4 5 7 8 10 11
Also, if I make this change:
nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]")
I get:
ABC ABC ABC ABC
So it looks like the second approach is able to get all the <p3> nodes from the entire document as per the XPath, which is fine. Why doesn't my first query work?
How do I get the desired output?
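One possible explanation, offered as a sketch rather than a definitive answer: the `preceding::` axis selects every matching node that comes earlier in the document and is not an ancestor of the context node. From each ABC node, `preceding::p2` therefore reaches all earlier <p2> blocks (but never the ABC node's own parent), and `//p3` then returns all of their children. The `preceding-sibling::` axis stays inside the same <p2>. The example below assumes the input is well-formed XML, so it uses `lxml.etree` instead of `soupparser`, drops the `./doc` prefix because the root element is then <doc> itself, and uses Python 3 syntax. The helper `texts` is a hypothetical name introduced only for illustration.

```python
from lxml import etree

XML = """<doc>
  <q></q>
  <p1>
    <p2 dd="ert" ji="pp"><p3>1</p3><p3>2</p3><p3>ABC</p3><p3>3</p3></p2>
    <p2 dd="ert" ji="pp"><p3>4</p3><p3>5</p3><p3>ABC</p3><p3>6</p3></p2>
  </p1>
  <r></r>
  <p1>
    <p2 dd="ert" ji="pp"><p3>7</p3><p3>8</p3><p3>ABC</p3><p3>9</p3></p2>
    <p2 dd="ert" ji="pp"><p3>10</p3><p3>11</p3><p3>ABC</p3><p3>12</p3></p2>
  </p1>
</doc>"""

# Assumption: the input is well-formed, so the strict XML parser is enough.
root = etree.fromstring(XML)


def texts(nodes):
    """Hypothetical helper: stripped text of each element, in result order."""
    return [node.text.strip() for node in nodes]


# Same axis as the question's query: every <p3> of every earlier <p2>.
# The first ABC has no earlier <p2>, and the last <p2> is never "preceding"
# any ABC, so 10, 11 and 12 are missing while ABC and 3, 6, 9 are included.
too_wide = root.xpath("//p1/p2/p3[contains(text(),'ABC')]//preceding::p2//p3")

# Sibling axis: only the <p3> elements before ABC inside the same <p2>.
wanted = root.xpath("//p1/p2/p3[contains(text(),'ABC')]/preceding-sibling::p3")

print(" ".join(texts(too_wide)))  # prints: 1 2 ABC 3 4 5 ABC 6 7 8 ABC 9
print(" ".join(texts(wanted)))    # prints: 1 2 4 5 7 8 10 11
```

With the original `soupparser` root, the same change of axis would presumably apply to the `./doc//p1/p2/p3[...]` expression; that variant has not been run here.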