UnicodeEncodeError: how to encode an XML tree parsed with ElementTree



I have an XML file with this structure:



<doc>
  <content>
    <one>Title</one>
    <two>bla bla bla bla</two>
  </content>
  <content>
    <one>Title</one>
    <two>bla bla bla bla</two>
  </content>
  ...
</doc>


I locate the file in Python through the nltk package and parse it with ElementTree like this:



import nltk
from xml.etree.ElementTree import ElementTree

# Locate the file via nltk, then parse it; parse() returns the root element
wow = nltk.data.find('/path/file.xml')
tree = ElementTree().parse(wow)


Then I try to print the text of the 'two' elements that contain a keyword, like this:



for i, content in enumerate(tree.findall('content')):
    for two in content.findall('two'):
        if 'keyword' in str(two.text):
            print("%s" % (two.text))


And I get the infamous error:



Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 21: ordinal not in range(128)


I understand this comes from a mismatch between the ASCII and UTF-8 encodings. The XML file is encoded in UTF-8. I tried several solutions found here on Stack Overflow (mainly adding .encode('UTF-8') or .decode('UTF-8') in various places, and passing encoding='utf-8' to data.find), but the examples I found were quite different from my case, so I could not adapt those answers. As you can imagine, I am new to Python.
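For reference, the problem seems to be reproducible without my file. The sketch below is only an illustration under assumptions: the inline XML string is a made-up stand-in for the real document, and the explicit .encode('ascii') call stands in for whatever implicit ASCII conversion happens in my loop (my guess is the str() call, but I am not sure). It should run on both Python 2 and Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: UTF-8 bytes with one non-ASCII
# character (U+00E0) inside a <two> element.
xml_bytes = (u'<doc><content><one>Title</one>'
             u'<two>keyword voil\u00e0</two></content></doc>').encode('utf-8')

root = ET.fromstring(xml_bytes)
text = root.find('content').find('two').text  # decoded text of <two>

# Forcing the text through the ASCII codec fails, like in the traceback.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding the same text as UTF-8 succeeds.
utf8_bytes = text.encode('utf-8')

# The keyword test itself works on the decoded text without any conversion.
has_keyword = u'keyword' in text
```

Here the ASCII encode fails while the UTF-8 encode and the keyword test succeed, which makes me think the conversion step is the culprit rather than the parsing.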


How can I avoid the error and print the content I need? Thanks.

