I created a Word document to check how parsing its XML fares. This is what I did:
import xml.etree.ElementTree as ET
import zipfile as zf
z = zf.ZipFile("INTRODUCTION.docx")
doc_xml = z.open("word/document.xml")
tree = ET.parse(doc_xml)
NAMESPACE_PREFIXES = {
'w': 'http://ift.tt/JiuBoE'
}
text_elements = [element for element in tree.iter() if element.tag ==
'{' + NAMESPACE_PREFIXES['w'] + '}t']
for node in text_elements:
    print(node.text)
NAMESPACE_PREFIXES holds the namespace URI so that I can match the fully qualified w:t tag and skip everything else. The node.text values got printed as:
INTRODUCTION
This is a test document for xml
.
Lets
see how this works.
Conclusion
It should hopefully
..
In my original document, "Lets see how this works." is on a single line. Similarly, the full stops end up in separate nodes. How do I solve this? Here's the XML:
'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://ift.tt/JiuBoL" xmlns:mc="http://ift.tt/pzd6Lm" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://ift.tt/1bA4cfb" xmlns:m="http://ift.tt/JiuBoH" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://ift.tt/1bA4bYX" xmlns:wp="http://ift.tt/JiuBoF" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://ift.tt/JiuBoE" xmlns:w14="http://ift.tt/1bA4bYS" xmlns:w15="http://ift.tt/Ua2VHY" xmlns:wpg="http://ift.tt/JiuB8i" xmlns:wpi="http://ift.tt/1bA4bYO" xmlns:wne="http://ift.tt/JiuB8g" xmlns:wps="http://ift.tt/1bA4djs" mc:Ignorable="w14 w15 wp14"><w:body><w:p w:rsidR="00470EEF" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid 
w:linePitch="360"/></w:sectPr></w:body></w:document>'
I noticed elements with w:type="spellStart" and w:type="gramStart", which seem to be the reason why "Lets" appears in a different node. Is there a way to skip over these?
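One possible way around this, as a minimal sketch rather than a definitive fix: since Word splits a paragraph into several w:r runs (for example around the w:proofErr markers), the text can be collected per w:p paragraph instead of per w:t node. The SAMPLE string and the paragraph_texts helper below are hypothetical names used only for illustration, and the namespace URI is assumed to be the standard WordprocessingML one that the shortened ift.tt links in the dump appear to stand in for.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard WordprocessingML namespace URI.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hypothetical, trimmed-down stand-in for word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '<w:p><w:proofErr w:type="spellStart"/><w:r><w:t>Lets</w:t></w:r>'
    '<w:proofErr w:type="spellEnd"/>'
    '<w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p>'
    '<w:p/>'
    '</w:body></w:document>'
)


def paragraph_texts(root):
    """Return one string per non-empty w:p, joining the text of its w:t nodes."""
    paragraphs = []
    for p in root.iter(W + "p"):
        # Concatenate every run's text in document order; w:proofErr and
        # other non-text elements contribute nothing and are skipped.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


root = ET.fromstring(SAMPLE)
lines = paragraph_texts(root)
for line in lines:
    print(line)
```

For the real file, the root could presumably be obtained with ET.parse(z.open("word/document.xml")).getroot() instead of ET.fromstring(SAMPLE), provided the namespace URI matches the one declared in that file. Paragraphs nested inside other paragraphs (text boxes, for instance) would be counted twice by this sketch.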