I am trying to use Beautiful Soup to parse XML files. I read the Beautiful Soup documentation at Crummy but could not find anything satisfactory on XML parsing. So far, this is all I have been able to figure out:
file = open("input.xml")
page = file.read()
soup = BeautifulSoup(page, "xml")
for word in soup.findAll('word'):
    word_attr = dict(word.attrs)
    netag = word.find('ner')
    nertag = dict(netag)
    print ("STOP", nertag['ner'])
But it does not do anything. My XML file has this form:
<?xml version="1.0" encoding="utf-8"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>
Starbucks
</word>
<lemma>
Starbucks
</lemma>
<CharacterOffsetBegin>
0
</CharacterOffsetBegin>
<CharacterOffsetEnd>
9
</CharacterOffsetEnd>
<POS>
NNP
</POS>
<NER>
ORGANIZATION
</NER>
</token>
<token id="2">
<word>
to
</word>
<lemma>
to
</lemma>
<CharacterOffsetBegin>
10
</CharacterOffsetBegin>
<CharacterOffsetEnd>
12
</CharacterOffsetEnd>
<POS>
TO
</POS>
<NER>
O
</NER>
</token>
<token id="5">
<word>
.
</word>
<lemma>
.
</lemma>
<CharacterOffsetBegin>
263
</CharacterOffsetBegin>
<CharacterOffsetEnd>
264
</CharacterOffsetEnd>
<POS>
.
</POS>
<NER>
O
</NER>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>
What I am trying to do is extract the NER values, replace the period with "STOP", and write the result to another txt file.
For example, the sentence "Starbucks in New York is good." (stored as XML) should give: ORGANIZATION in LOCATION is good STOP
Can somebody please help me do this, or point me to sufficient documentation on XML parsing with Beautiful Soup?
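For reference, here is a minimal sketch of the transformation described above. It is not Beautiful Soup: it swaps in the standard library's xml.etree.ElementTree so that it runs without extra packages. The function name ner_lines and the SAMPLE string are made up for illustration, and the rule "keep the word when NER is O" is an assumption inferred from the example output. Two things in the attempt above look like likely culprits: <NER> is a sibling of <word> inside <token>, not a child of it, and with the "xml" parser tag names are case-sensitive, so 'ner' does not match <NER>. The same loop in Beautiful Soup would presumably use soup.find_all("token") and token.find("NER").

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as the file above (shortened).
SAMPLE = """
<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>
      Starbucks
    </word><NER>
      ORGANIZATION
    </NER></token>
    <token id="2"><word>to</word><NER>O</NER></token>
    <token id="3"><word>.</word><NER>O</NER></token>
  </tokens></sentence>
</sentences></document></root>
"""


def ner_lines(root):
    """Return one output string per <sentence> element under root."""
    lines = []
    for sentence in root.iter("sentence"):
        out = []
        for token in sentence.iter("token"):
            # <word> and <NER> are both direct children of <token>.
            word = (token.findtext("word") or "").strip()
            ner = (token.findtext("NER") or "O").strip()
            if word == ".":
                out.append("STOP")   # period becomes STOP
            elif ner != "O":
                out.append(ner)      # named entity: emit its label
            else:
                out.append(word)     # assumption: plain words are kept
        lines.append(" ".join(out))
    return lines


print(ner_lines(ET.fromstring(SAMPLE)))  # ['ORGANIZATION to STOP']
```

For a real file, ET.parse("input.xml").getroot() gives the root element to pass in, and the returned lines can be joined with newlines and written to the output txt file with an ordinary open(..., "w").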