UnicodeEncodeError: how to encode xml tree parsed with ElementTree

I have an XML file with this structure:


<doc>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
 ...
</doc>

I read the file in Python through the nltk package and parse the tree with ElementTree like this:


import nltk
from xml.etree.ElementTree import ElementTree

wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)  # parse() returns the root element, here <doc>

Then I try to print the text of the 'two' elements that contain a keyword, like this:


for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))

And I get the infamous error:


Traceback (most recent call last):
   File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)

I know this is caused by a mismatch between the ASCII and UTF-8 encodings. The XML file is encoded in UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') here and there, or passing encoding='utf-8' to data.find), but the examples I found were quite different from mine, so I couldn't adapt those answers to my case. As you can imagine, I am new to Python.

How can I avoid the error and print the content I need? Thanks.
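
For reference, here is a minimal sketch of one way around the error. The traceback points at line 3 of the loop, which is the str(two.text) call: on Python 2, str() on a unicode string encodes it with the ASCII codec, and that fails on a character like u'\xe0'. The sketch keeps the text as unicode for the keyword test and encodes to UTF-8 only at the moment of output. The sample XML, the helper names find_matching and write_utf8, and the BytesIO buffer standing in for the terminal are all made up for illustration; they are not part of my real file or of nltk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample with the same structure as the file in the question.
# \u00e0 is the character (a with grave accent) named in the traceback.
SAMPLE = u"""<doc>
 <content>
  <one>Title</one>
  <two>keyword: citt\u00e0 bla bla</two>
 </content>
 <content>
  <one>Title</one>
  <two>bla bla bla bla</two>
 </content>
</doc>"""


def find_matching(root, keyword):
    """Return the text of every <content>/<two> element containing keyword."""
    matches = []
    for content in root.findall("content"):
        for two in content.findall("two"):
            text = two.text or u""  # .text is None for an empty element
            if keyword in text:     # no str() call: the text stays unicode
                matches.append(text)
    return matches


def write_utf8(text, stream):
    """Write text plus a newline to a binary stream as UTF-8 bytes."""
    stream.write(text.encode("utf-8") + b"\n")


# Parse UTF-8 bytes, as ElementTree would read them from the file.
root = ET.fromstring(SAMPLE.encode("utf-8"))
matches = find_matching(root, u"keyword")

buf = io.BytesIO()  # stand-in for a binary output stream
for text in matches:
    write_utf8(text, buf)
```

With a real terminal, the stream would be sys.stdout on Python 2 or sys.stdout.buffer on Python 3, assuming the terminal displays UTF-8. Encoding explicitly at the output step means the result does not depend on whatever default codec the interpreter picks for stdout.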
