How to use Beautiful Soup to parse this XML file?

I am trying to use Beautiful Soup to parse XML files. I read the Beautiful Soup documentation at Crummy but could not find anything satisfactory on XML parsing. So far, this is all I have been able to figure out:


from bs4 import BeautifulSoup

# Read the whole XML file into memory
with open("input.xml") as file:
    page = file.read()

soup = BeautifulSoup(page, "xml")
for word in soup.find_all('word'):
    word_attr = dict(word.attrs)
    netag = word.find('ner')
    nertag = dict(netag)
    print("STOP", nertag['ner'])

But it does not do anything. My XML file has this form:


<?xml version="1.0" encoding="utf-8"?>
<root>
 <document>
  <sentences>
   <sentence id="1">
    <tokens>
     <token id="1">
      <word>
       Starbucks
      </word>
      <lemma>
       Starbucks
      </lemma>
      <CharacterOffsetBegin>
       0
      </CharacterOffsetBegin>
      <CharacterOffsetEnd>
       9
      </CharacterOffsetEnd>
      <POS>
       NNP
      </POS>
      <NER>
       ORGANIZATION
      </NER>
     </token>
     <token id="2">
      <word>
       to
      </word>
      <lemma>
       to
      </lemma>
      <CharacterOffsetBegin>
       10
      </CharacterOffsetBegin>
      <CharacterOffsetEnd>
       12
      </CharacterOffsetEnd>
      <POS>
       TO
      </POS>
      <NER>
       O
      </NER>
     </token>
     <token id="5">
      <word>
       .
      </word>
      <lemma>
       .
      </lemma>
      <CharacterOffsetBegin>
       263
      </CharacterOffsetBegin>
      <CharacterOffsetEnd>
       264
      </CharacterOffsetEnd>
      <POS>
       .
      </POS>
      <NER>
       O
      </NER>
     </token>
    </tokens>
   </sentence>
  </sentences>
 </document>
</root>

What I am trying to do is extract the NER values, replace the period with "STOP", and write the result to another text file.

For example, the sentence "Starbucks in New York is good." (as encoded in the XML) should give: ORGANIZATION in LOCATION is good STOP

Can somebody please help me do this, or point me to sufficient documentation for XML parsing with Beautiful Soup?
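
For reference, here is a minimal sketch of one way the transformation described above could be done. It is not a definitive solution and rests on a few assumptions: the bs4 package is used with its "xml" feature (which relies on lxml being installed), a token counts as a named entity whenever its NER text is anything other than "O", and consecutive tokens with the same entity tag (such as "New" and "York") collapse into a single tag. The helper names render_tokens, token_pairs and convert are hypothetical, not part of Beautiful Soup.

```python
def render_tokens(pairs):
    """Turn (word, ner) pairs into one output line."""
    out = []
    previous_ner = "O"
    for word, ner in pairs:
        if ner != "O":
            # Named entity: emit the tag once per run of identical tags
            # (assumption: "New York" should become a single LOCATION).
            if ner != previous_ner:
                out.append(ner)
        elif word == ".":
            # The sentence-final period is replaced by the STOP marker.
            out.append("STOP")
        else:
            out.append(word)
        previous_ner = ner
    return " ".join(out)


def token_pairs(xml_text):
    """Yield (word, NER) text for every <token> element, in document order."""
    # Imported here so render_tokens stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(xml_text, "xml")
    for token in soup.find_all("token"):
        # <word> and <NER> are both children of <token>, and XML tag
        # names are case-sensitive, so "NER" must be upper case.
        word = token.find("word").get_text(strip=True)
        ner = token.find("NER").get_text(strip=True)
        yield word, ner


def convert(in_path, out_path):
    """Read the XML at in_path and write the rendered line to out_path."""
    with open(in_path, encoding="utf-8") as src:
        line = render_tokens(token_pairs(src.read()))
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(line + "\n")
```

Calling convert("input.xml", "output.txt") would then write the result to a text file (the output file name is only an example). Note that this sketch joins all tokens of the document into one line; if one line per sentence is wanted, the same loop could be run per sentence element instead.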
