Getting a Unicode error while parsing an XML file

I have a directory of XML files, each of the following form:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Brand</word>
            <lemma>brand</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>5</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
          </token>
          <token id="2">
            <word>Blogs</word>
            <lemma>blog</lemma>
            <CharacterOffsetBegin>6</CharacterOffsetBegin>
            <CharacterOffsetEnd>11</CharacterOffsetEnd>
            <POS>NNS</POS>
            <NER>O</NER>
          </token>
          <token id="3">
            <word>Capture</word>
            <lemma>capture</lemma>
            <CharacterOffsetBegin>12</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>VBP</POS>
            <NER>O</NER>
          </token>

I parse each XML file, collect the text inside the <word> tags, and then find the 100 most frequent words.

This is what I am doing:


import os
from collections import Counter

from bs4 import BeautifulSoup


def find_top_words(xml_directory):
    file_list = []
    temp_list=[]
    file_list2=[]
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page,"xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]

But I'm getting this error:


  File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)

What does this mean, and how do I fix it?
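
A minimal sketch of one way around the error, assuming Python 2 (the u'\xef' in the traceback suggests it). There, str() applied to a unicode string encodes it implicitly with the ASCII codec, which fails on any character above 127, such as u'\xef'. The sketch keeps the text as unicode instead of calling str(). It swaps in the standard library's xml.etree.ElementTree for BeautifulSoup only to stay dependency-free; count_words and the n parameter are names made up for illustration.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter


def count_words(xml_bytes, counts):
    """Add every <word> value of one XML document to `counts`."""
    # Passing bytes lets the parser honour the encoding="UTF-8" declaration.
    root = ET.fromstring(xml_bytes)
    for node in root.iter('word'):
        if node.text:
            # No str() call here, so nothing is implicitly encoded as ASCII.
            counts[node.text.strip().lower()] += 1


def find_top_words(xml_directory, n=100):
    """Return the n most frequent lower-cased words across all files."""
    counts = Counter()
    for name in os.listdir(xml_directory):
        path = os.path.join(xml_directory, name)
        if os.path.isfile(path):
            # Binary mode: open() does no decoding of its own.
            with open(path, 'rb') as f:
                count_words(f.read(), counts)
    return [word for word, _ in counts.most_common(n)]
```

With BeautifulSoup the same idea would presumably be to drop the str() call and append word.string.strip() directly, so the words stay unicode all the way through.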
