I have the following way of parsing an XML document:

import re
from lxml.html.soupparser import fromstring

inString = """
<doc>
  <q></q>
  <p1>
    <p2 dd="ert" ji="pp">
      <p3>1</p3>
      <p3>2</p3>
      <p3>32</p3>
      <p3>3</p3>
    </p2>
    <p2 dd="ert" ji="pp">
      <p3>4</p3>
      <p3>5</p3>
      <p3>ABC</p3>
      <p3>6</p3>
    </p2>
  </p1>
  <r></r>
  <p1>
    <p2 dd="ert" ji="pp">
      <p3>7</p3>
      <p3>8</p3>
      <p3>ABC</p3>
      <p3>9</p3>
    </p2>
    <p2 dd="ert" ji="pp">
      <p3>10</p3>
      <p3>11</p3>
      <p3>XYZ</p3>
      <p3>12</p3>
    </p2>
  </p1>
</doc>
"""

root = fromstring(inString)
#nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC') or contains(text(),'XYZ')]/preceding-sibling::p3")
ns = {"re": "http://exslt.org/regular-expressions"}
nodes = root.xpath(".//p3[re:match(.,'XYZ') or re:match(.,'ABC')]/preceding-sibling::p3", namespaces=ns)

This gives me:

4 5 7 8 10 11

So it completely skips the first <p2>, because that <p2> has no <p3> containing ABC or XYZ and the XPath only selects siblings that precede such a match. My ideal output is:

1 2 32 3 4 5 7 8 10 11

In other words, if I can't find a <p3>ABC</p3> or <p3>XYZ</p3> in a <p2>, I still want all the <p3>s of that <p2>. Is that possible?
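
One possible way to get that output, sketched under some assumptions: instead of selecting the siblings that precede a match, walk each <p2> and collect its <p3> values until a marker value is seen. If no marker appears, every <p3> of that <p2> is kept. This sketch swaps in the standard library's xml.etree.ElementTree rather than lxml's soupparser, and it compares the text exactly against ABC and XYZ rather than using a regular expression. The names MARKERS and p3_before_marker are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the question.
XML = """<doc>
  <q></q>
  <p1>
    <p2 dd="ert" ji="pp"><p3>1</p3><p3>2</p3><p3>32</p3><p3>3</p3></p2>
    <p2 dd="ert" ji="pp"><p3>4</p3><p3>5</p3><p3>ABC</p3><p3>6</p3></p2>
  </p1>
  <r></r>
  <p1>
    <p2 dd="ert" ji="pp"><p3>7</p3><p3>8</p3><p3>ABC</p3><p3>9</p3></p2>
    <p2 dd="ert" ji="pp"><p3>10</p3><p3>11</p3><p3>XYZ</p3><p3>12</p3></p2>
  </p1>
</doc>"""

# Assumption: a marker is a <p3> whose text is exactly one of these values.
MARKERS = {"ABC", "XYZ"}


def p3_before_marker(root):
    """Collect <p3> texts of each <p2>, stopping at the first marker in that <p2>.

    A <p2> without any marker contributes all of its <p3> texts.
    """
    values = []
    for p2 in root.iter("p2"):
        for p3 in p2.findall("p3"):
            text = (p3.text or "").strip()
            if text in MARKERS:
                break  # ignore the marker and everything after it
            values.append(text)
    return values


root = ET.fromstring(XML)
result = p3_before_marker(root)
print(" ".join(result))  # 1 2 32 3 4 5 7 8 10 11
```

If you would rather stay with a single XPath in lxml, the same idea can probably be expressed by selecting every <p3> that is not itself a marker and has no marker among its preceding siblings, for example .//p3[not(. = 'ABC' or . = 'XYZ') and not(preceding-sibling::p3[. = 'ABC' or . = 'XYZ'])]. Treat that expression as an untested suggestion for your exact parser setup.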