XML: parsing XML from a .docx with lxml raises IOError (Python)

I am reading XML from a .docx file into a variable called xml_content. The XML looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                xmlns:o="urn:schemas-microsoft-com:office:office"
                xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
                xmlns:v="urn:schemas-microsoft-com:vml"
                xmlns:w10="urn:schemas-microsoft-com:office:word"
                xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
       <w:body>
          <w:p>
             <w:pPr>
                <w:pStyle w:val="Normal" />
                <w:ind w:left="5070" w:right="0" w:hanging="0" />
                <w:rPr>
                   <w:rFonts w:cs="Book Antiqua" w:ascii="Book Antiqua" w:hAnsi="Book Antiqua" />
                </w:rPr>
             </w:pPr>
             <w:r>
                <w:rPr>
                   <w:rFonts w:cs="Book Antiqua" w:ascii="Book Antiqua" w:hAnsi="Book Antiqua" />
                </w:rPr>
                <w:t xml:space="preserve">   
       </w:t>
                <w:pict>
                   <v:rect id="shape_0" stroked="f" style="position:absolute;margin-left:405pt;margin-top:0pt;width:80.9pt;height:71.9pt">
                      <v:imagedata r:id="rId2" detectmouseclick="t" />
                      <v:wrap v:type="none" />
                      <v:stroke color="#3465a4" joinstyle="round" endcap="flat" />
                   </v:rect>
                </w:pict>
                <w:pict>
                   <v:rect id="shape_0" stroked="f" style="position:absolute;margin-left:0.05pt;margin-top:0pt;width:71.9pt;height:70.1pt">
                      <v:imagedata r:id="rId3" detectmouseclick="t" />
                      <v:wrap v:type="none" />
                      <v:stroke color="#3465a4" joinstyle="round" endcap="flat" />
                   </v:rect>
                </w:pict>
             </w:r>
          </w:p>
          ...
       </w:body>
    </w:document>

I want to parse this XML with lxml. My code looks like this:

    import zipfile

    import lxml.etree

    document = zipfile.ZipFile('test.docx')
    xml_content = document.read('word/document.xml')
    tree = lxml.etree.parse(xml_content)  # this is the call that raises IOError

When I run this code, I get this error:

    Traceback (most recent call last):
      File "import.py", line 29, in <module>
        tree = lxml.etree.parse(xml_content)
      File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src/lxml/lxml.etree.c:72453)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105915)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106214)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105213)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100163)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94286)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95722)
      File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94754)
    IOError
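
A likely explanation, judging from the `_parseDocumentFromURL` frame in the traceback: `lxml.etree.parse()` expects a filename, URL, or file-like object, so the XML text in xml_content is treated as a path that does not exist. The sketch below shows two possible workarounds: `etree.fromstring()` for data already in memory, or wrapping the bytes in `io.BytesIO` before calling `etree.parse()`. The helper names (`build_sample_docx`, `parse_document_xml`, `parse_document_tree`) and the tiny in-memory archive are hypothetical stand-ins for test.docx, and the `ImportError` fallback to the standard library is only there so the sketch runs where lxml is not installed.

```python
import io
import zipfile

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this demo.
    import xml.etree.ElementTree as etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx():
    """Return the bytes of a tiny, hypothetical stand-in for test.docx."""
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
        '<w:t>hello</w:t></w:r></w:p></w:body></w:document>' % W_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buf.getvalue()


def parse_document_xml(docx_file):
    """Parse word/document.xml from in-memory bytes and return the root element."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_content = archive.read("word/document.xml")  # bytes, not a path
    # fromstring() parses data; parse() would interpret it as a filename.
    return etree.fromstring(xml_content)


def parse_document_tree(docx_file):
    """Same idea, but keep using parse() by giving it a file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_content = archive.read("word/document.xml")
    return etree.parse(io.BytesIO(xml_content))


if __name__ == "__main__":
    root = parse_document_xml(io.BytesIO(build_sample_docx()))
    for text_node in root.iter("{%s}t" % W_NS):
        print(text_node.text)
```

With a real file, `parse_document_xml('test.docx')` would be the equivalent call. Note that `fromstring()` returns the root element, while `parse()` returns an element tree whose root is available through `getroot()`.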
