Extract text with lxml

I have this text:


INTRODUCTION
This is a test document for xml.
I need to extract this sentence.

Conclusion
It should hopefully..

The line "I need to extract this sentence." is in italics. The XML of the file looks like:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://ift.tt/JiuBoL" xmlns:mc="http://ift.tt/pzd6Lm" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://ift.tt/1bA4cfb" xmlns:m="http://ift.tt/JiuBoH" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://ift.tt/1bA4bYX" xmlns:wp="http://ift.tt/JiuBoF" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://ift.tt/JiuBoE" xmlns:w14="http://ift.tt/1bA4bYS" xmlns:w15="http://ift.tt/Ua2VHY" xmlns:wpg="http://ift.tt/JiuB8i" xmlns:wpi="http://ift.tt/1bA4bYO" xmlns:wne="http://ift.tt/JiuB8g" xmlns:wps="http://ift.tt/1bA4djs" mc:Ignorable="w14 w15 wp14">
   <w:body>
      <w:p w:rsidR="00470EEF" w:rsidRDefault="00456755">
         <w:pPr>
            <w:rPr>
               <w:b/>
            </w:rPr>
         </w:pPr>
         <w:r w:rsidRPr="00456755">
            <w:rPr>
               <w:b/>
            </w:rPr>
            <w:t>INTRODUCTION</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00456755" w:rsidRPr="00B042E3" w:rsidRDefault="00456755">
         <w:pPr>
            <w:rPr>
               <w:color w:val="FFFF00"/>
            </w:rPr>
         </w:pPr>
         <w:r w:rsidRPr="00B042E3">
            <w:rPr>
               <w:color w:val="FFFF00"/>
            </w:rPr>
            <w:t>This is a test document for xml.</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00456755" w:rsidRDefault="00E971E1">
         <w:r>
            <w:rPr>
               <w:i/>
            </w:rPr>
            <w:t>I need to extract this sentence.</w:t>
         </w:r>
         <w:bookmarkStart w:id="0" w:name="_GoBack"/>
         <w:bookmarkEnd w:id="0"/>
      </w:p>
      <w:p w:rsidR="00456755" w:rsidRDefault="00456755"/>
      <w:p w:rsidR="00456755" w:rsidRDefault="00456755">
         <w:pPr>
            <w:rPr>
               <w:b/>
            </w:rPr>
         </w:pPr>
         <w:r w:rsidRPr="00456755">
            <w:rPr>
               <w:b/>
            </w:rPr>
            <w:t>Conclusion</w:t>
         </w:r>
      </w:p>
      <w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755">
         <w:r w:rsidRPr="00456755">
            <w:t>It should hopefully</w:t>
         </w:r>
         <w:r>
            <w:t>..</w:t>
         </w:r>
      </w:p>
      <w:sectPr w:rsidR="00456755" w:rsidRPr="00456755">
         <w:pgSz w:w="11906" w:h="16838"/>
         <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
         <w:cols w:space="708"/>
         <w:docGrid w:linePitch="360"/>
      </w:sectPr>
   </w:body>
</w:document>

I tried:


from lxml import etree as ET

tree = ET.parse(doc_xml)  # doc_xml: path or file object for the XML shown above
[b.tag for b in tree.iterfind(".//i")]

The above returns an empty list.

I've searched a lot but couldn't figure out how to do this, since the italic formatting is only marked by an empty <w:i/> element and the text itself sits in a separate <w:t> element. I have seen this question where this was done easily using BeautifulSoup.
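
For reference, one likely reason the attempt returns nothing is that ".//i" looks for an element named i with no namespace, while every element in the file lives in the w: namespace. Also, <w:i/> carries no text itself; it only flags the enclosing run (<w:r>) as italic. Below is a minimal sketch, not a definitive solution, that collects the <w:t> text of every run whose <w:rPr> contains <w:i>. The helper name italic_texts and the cut-down SAMPLE document are made up for illustration, and the sketch assumes the root element is w:document so the namespace URI can be read from its tag.

```python
try:
    from lxml import etree as ET          # preferred parser
except ImportError:
    import xml.etree.ElementTree as ET    # stdlib fallback; same API for this sketch

# Cut-down stand-in for the document above (same w: namespace URI as in the post).
SAMPLE = b"""<w:document xmlns:w="http://ift.tt/JiuBoE">
  <w:body>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:i/></w:rPr><w:t>I need to extract this sentence.</w:t></w:r></w:p>
    <w:p><w:r><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def italic_texts(root):
    """Return the text of every run (w:r) whose properties contain w:i."""
    # Assumption: the root element is w:document, so its tag is "{uri}document".
    ns = {"w": root.tag[1:].split("}")[0]}
    texts = []
    for run in root.iterfind(".//w:r", ns):
        # A run counts as italic here if its w:rPr has a w:i child.
        if run.find("w:rPr/w:i", ns) is not None:
            texts.append("".join(t.text or "" for t in run.iterfind("w:t", ns)))
    return texts


root = ET.fromstring(SAMPLE)
print(italic_texts(root))
```

For a file on disk, ET.parse(doc_xml).getroot() can be passed in place of root. The sketch ignores two cases: a <w:i> element that switches italics off through its w:val attribute, and italics inherited from a paragraph or character style.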
