Parsing a docx using the ElementTree module

I have a .docx document that I need to parse into its XML equivalent. Basically, I need an ElementTree object, but I can't get one that works. I have tried many different combinations and have yet to figure it out. Here's what I did:


import zipfile as zf
import xml.etree.ElementTree as ET

# A .docx file is a zip archive; the main body lives in word/document.xml
z = zf.ZipFile("INTRODUCTION.docx")
doc_xml = z.read("word/document.xml")
print doc_xml           # type(doc_xml) is str

Since doc_xml is a string, I used the following to get an Element:


rooted = ET.fromstring(doc_xml)    #type(rooted) is 'Element'
type(rooted)

and this too:


tree = ET.ElementTree(doc_xml)  #type(tree) is 'ElementTree'
type(tree)

I thought this worked, but when I try to iterate over the tree, I get the error shown after the loop:


for branch in tree.iter():
    print branch  

AttributeError: 'str' object has no attribute 'iter'

The variable tree is of type ElementTree. How do I resolve this?
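
One possible explanation, going by the error message: the first argument of the ET.ElementTree constructor is expected to be an already parsed root Element, so ET.ElementTree(doc_xml) simply stores the raw string as the root, and tree.iter() then fails on that string. Below is a minimal sketch of wrapping the parsed Element instead. The helper name load_document_tree and the tiny in-memory archive are assumptions made purely for illustration (they stand in for INTRODUCTION.docx), and the print() calls are written so they run on Python 3 as well.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def load_document_tree(docx_file):
    """Hypothetical helper: return an ElementTree for word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        doc_xml = archive.read("word/document.xml")
    root = ET.fromstring(doc_xml)   # parse the XML text into an Element
    return ET.ElementTree(root)     # wrap the parsed Element, not the raw string


# Stand-in for a real .docx: a zip archive built in memory that contains
# only word/document.xml (illustrative content, not a complete Word file).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
sample_xml = (
    '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
    "<w:t>Hello</w:t></w:r></w:p></w:body></w:document>" % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

tree = load_document_tree(buffer)
tags = [branch.tag for branch in tree.iter()]
for tag in tags:
    print(tag)
```

With a real file, load_document_tree("INTRODUCTION.docx") should work the same way, since zipfile.ZipFile accepts either a path or a file-like object. If only the root Element is needed, rooted.iter() can also be called directly on the result of ET.fromstring, without building an ElementTree at all.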
