I have an XML structure in which some elements are not unique. I managed to sort the subtrees, and I can correctly detect the elements that occur more than once. But the remove function does not seem to have any effect.
Simplified, my XML structure looks like this:
<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text><!-- line should be removed -->
    <text>2nd blabla blub again unique</text>
  </page>
</root>
I want to remove duplicate strings on each page, so I iterate over the pages and over the elements of each page in two nested for loops (an extract of the important lines; I hope I didn't forget anything):
import xml.etree.ElementTree as ET

self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:  # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)  # equal function works well
            page.remove(elem)  # <---- this seems not to work
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml")  # duplicate lines still there....
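One thing that may be worth ruling out: the loop removes children from page while iterating over page itself. As with a plain Python list, changing an Element during iteration can cause following children to be skipped. Below is a minimal sketch that iterates over a snapshot, list(page), instead. It is not a confirmed diagnosis of the code above. The elements_equal function here is a hypothetical stand-in for the one in my real code, the XML is parsed from a string rather than from path, and resetting prev for every page is an assumption about the intended behavior.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the simplified structure in the question.
XML = """<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>
</root>"""


def elements_equal(a, b):
    # Hypothetical stand-in for the helper in the question:
    # two elements count as equal if tag, text and attributes match.
    if a is None or b is None:
        return False
    return a.tag == b.tag and a.text == b.text and a.attrib == b.attrib


def remove_adjacent_duplicates(root):
    # Remove every element that equals its preceding sibling on the same page.
    removed = 0
    for page in root:
        prev = None  # assumption: duplicates are only compared within one page
        for elem in list(page):  # iterate over a copy, mutate the original
            if elements_equal(elem, prev):
                page.remove(elem)
                removed += 1
                continue
            prev = elem
    return removed


root = ET.fromstring(XML)
removed_count = remove_adjacent_duplicates(root)
remaining = [[t.text for t in page] for page in root]
print(removed_count, remaining)
```

If the elements being compared come from a sorted copy of the children rather than from page directly, the copy and the tree may need to be checked separately, since removing from one does not touch the other.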