UnicodeEncodeError in Python Code that Generates XML



I'm new to Python, and I'm trying to generate XML that maps words to numbers (a probability distribution). The code counts the number of times each noun-phrase word occurs and writes those counts to XML. The XML should look like this:



<?xml version="1.0" encoding="UTF-8" ?>
<root>
<Durapipe type="int">1</Durapipe>
<EXPLAIN type="int">2</EXPLAIN>
<woods type="int">2</woods>
<hanging type="int">3</hanging>
<hastily type="int">2</hastily>
<localized type="int">1</localized>
<Schuster type="int">5</Schuster>
</root>


I recently added some code to keep tokens like "." from being counted as words. Here's what my Python looks like:



from __future__ import unicode_literals

import nltk.corpus
from nltk import FreqDist
from dicttoxml import dicttoxml, xml_escape

#corpus
words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
fd = FreqDist(words)
afd = {xml_escape(k):v for k,v in fd.items()}

# special key for sum
afd['__sum__']=fd.N()

xml = dicttoxml(afd)

f=open('frequencies.xml', 'w')
f.write(xml)
f.close()


I'm currently getting the following error when I run the script:



UnicodeEncodeError Traceback (most recent call last)
C:\Users\David Naber\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
195 else:
196 filename = fname
--> 197 exec compile(scripttext, filename, 'exec') in glob, loc
198 else:
199 def execfile(fname, *where):

C:\Users\David Naber\workspace\AttributeExtraction\libs\freq2xml.py in <module>()
6
7 #corpus
----> 8 words = [w.decode('utf-8', errors='replace') for w in nltk.corpus.reuters.words()]
9 fd = FreqDist(words)
10 afd = {xml_escape(k):v for k,v in fd.items()}

C:\Users\David Naber\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)


If anyone could tell me what I can do to fix this, that would be great. Any help would be much appreciated. Thanks in advance!


This question explained why the error happens, but not what I can do to fix it: Handle wrongly encoded character in Python unicode string
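
For what it's worth, the traceback hints at one likely cause: in Python 2, calling .decode() on an object that is already unicode first encodes it implicitly with the ASCII codec, and that step fails on u'\xfc'. Below is a minimal sketch of a workaround, assuming the corpus reader returns unicode for at least some tokens. The helper name to_text and the sample_tokens list are hypothetical stand-ins, not part of nltk or dicttoxml.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(token, encoding='utf-8'):
    """Return token as text, decoding it only if it is a byte string.

    Hypothetical helper for illustration; not part of nltk or dicttoxml.
    """
    if isinstance(token, bytes):  # `bytes` is an alias of `str` on Python 2
        return token.decode(encoding, 'replace')
    return token  # already text (unicode on Python 2), so leave it untouched


# Stand-in for nltk.corpus.reuters.words(): a mix of text and UTF-8 bytes.
sample_tokens = ['\xfcber', b'\xc3\xbcber', 'woods']
words = [to_text(w) for w in sample_tokens]
```

If every token turns out to be unicode already, the .decode() call on line 8 of the script could simply be dropped; the type check just makes the list comprehension safe in either case.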

