Python: remove duplicate elements from an XML tree

I have an XML structure in which some elements are not unique. I have managed to sort the subtrees, and I can correctly detect the elements that occur more than once, but the remove function does not seem to take effect.

Simplified, my XML structure looks like this:


<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text><!-- line should be removed -->
    <text>2nd blabla blub again unique</text>
  </page>
</root>

I want to remove the duplicated strings on each page, so I iterate over the pages and over the elements of each page in two nested for loops (this is an extract of the important lines; I hope I didn't forget anything):


import xml.etree.ElementTree as ET
self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:                     # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            page.remove(elem) # <---- this seems not to work
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml") # duplicate lines still there....
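
One likely cause, assuming the loop runs as shown: page.remove(elem) changes the list of children while the for loop is still iterating over it, so the element right after each removed one gets skipped. Below is a minimal sketch that iterates over a copy of the children instead. elements_equal is not shown in the question, so the version here is only a hypothetical stand-in, and remove_adjacent_duplicates is a name made up for illustration. Resetting prev for each page is also an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the question (comments dropped).
XML = """<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>
</root>"""


def elements_equal(a, b):
    """Hypothetical stand-in for the question's helper:
    compares tag, text and attributes; None never matches."""
    if a is None or b is None:
        return False
    return a.tag == b.tag and a.text == b.text and a.attrib == b.attrib


def remove_adjacent_duplicates(root):
    """Remove every child that equals its preceding sibling.
    Returns the number of removed elements."""
    removed = 0
    for page in root:
        prev = None                 # assumption: duplicates are per page
        for elem in list(page):     # iterate over a snapshot, not the live list
            if elements_equal(elem, prev):
                page.remove(elem)   # safe: the snapshot is unaffected
                removed += 1
                continue
            prev = elem
    return removed


root = ET.fromstring(XML)
removed_count = remove_adjacent_duplicates(root)
remaining_texts = [[t.text for t in page] for page in root]
```

With the sample document this should remove one text element per page. Since the question already sorts the subtrees, comparing each element only with its predecessor should be enough; for unsorted input a set of already-seen values would be needed instead.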
