Getting a Unicode error while parsing an XML file



I have a directory of XML files, where each file has the following form:



<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>Brand</word>
<lemma>brand</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>5</CharacterOffsetEnd>
<POS>NN</POS>
<NER>O</NER>
</token>
<token id="2">
<word>Blogs</word>
<lemma>blog</lemma>
<CharacterOffsetBegin>6</CharacterOffsetBegin>
<CharacterOffsetEnd>11</CharacterOffsetEnd>
<POS>NNS</POS>
<NER>O</NER>
</token>
<token id="3">
<word>Capture</word>
<lemma>capture</lemma>
<CharacterOffsetBegin>12</CharacterOffsetBegin>
<CharacterOffsetEnd>19</CharacterOffsetEnd>
<POS>VBP</POS>
<NER>O</NER>
</token>


I parse each XML file, collect the text inside the <word> tags, and then find the 100 most frequent words.


This is what I am doing:



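import os
from collections import Counter

from bs4 import BeautifulSoup


def find_top_words(xml_directory):
    file_list = []
    temp_list = []
    file_list2 = []
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page, "xml")
                # Collect the text of every <word> element
                for word in soup.find_all('word'):
                    # This is the line that raises the UnicodeEncodeError
                    file_list.append(str(word.string.strip()))
            f.close()
    # Lowercase everything so counting is case-insensitive
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    # Sort words by descending frequency and keep the first 100
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]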


But I get this error:



  File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)


What does this mean, and how do I fix it?
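

For reference, here is a minimal sketch of what the message points to and one possible way around it. BeautifulSoup returns Unicode strings, and on Python 2 calling str() on a Unicode string encodes it with the ASCII codec, which fails as soon as a word contains a non-ASCII character such as u'\xef'. The sketch simply never converts the text to a byte string. Several things in it are assumptions and not part of the original code: it uses the standard library's xml.etree.ElementTree in place of BeautifulSoup so that it runs without extra packages, the helper reproduce_error and the limit parameter are hypothetical names, and the sample word in reproduce_error is made up for illustration.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter


def reproduce_error():
    """Trigger the same kind of failure on a made-up word and return the exception."""
    try:
        # Encoding non-ASCII text with the ASCII codec is what str() does
        # implicitly to a Unicode string on Python 2.
        u"na\xefve".encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


def find_top_words(xml_directory, limit=100):
    """Return up to `limit` lowercased <word> values, most frequent first."""
    counts = Counter()
    for name in os.listdir(xml_directory):
        path = os.path.join(xml_directory, name)
        if not os.path.isfile(path):
            continue
        # ET.parse reads the file as bytes and decodes it using the
        # encoding named in the XML declaration.
        tree = ET.parse(path)
        for word in tree.iter("word"):
            if word.text:
                # Keep the text as-is; no str() call, so no ASCII encoding step.
                counts[word.text.strip().lower()] += 1
    return [w for w, _ in counts.most_common(limit)]
```

The same idea should carry over to the BeautifulSoup version: appending word.string.strip() without wrapping it in str() keeps the text as Unicode, and Counter handles Unicode keys without trouble. If byte strings are really needed later, encoding explicitly with .encode('utf-8') at that point is one option.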

