I need some help finding the text offsets of certain tags in an XML document. My data set follows the format illustrated below: the ROOT element contains several RECORD elements, and each RECORD contains exactly one TEXT element. The text may contain several TAG elements that annotate parts of the text. Using Python, I need to convert these annotations to another format that requires the begin and end offsets of each tag.
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
</TEXT>
</RECORD>
</ROOT>
Basically, I would like to convert the format above to the following format:
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at December 29th to illustrate the problem.
</TEXT>
<TAG TYPE="DATE" BEGIN="36" END="49"/>
</RECORD>
</ROOT>
I've tried using BeautifulSoup but could not find a way of extracting the tag offsets. Ideas anyone?
Thankful for any help!
/Jakob
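
One possible approach, sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is only a sketch under some assumptions: the TAG elements are direct children of TEXT and are not nested, offsets are 0-based character positions within the TEXT content (counting the leading newline), and END is exclusive. The helper name convert_record is made up for this illustration. Under those assumptions the sample record gives BEGIN=36 and END=49, as in the desired output.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
</TEXT>
</RECORD>
</ROOT>"""


def convert_record(record):
    """Move inline TAG annotations out of TEXT into stand-off TAG elements.

    Hypothetical helper: assumes TAG elements are direct, non-nested
    children of TEXT.
    """
    text_el = record.find("TEXT")
    pieces = [text_el.text or ""]      # text before the first TAG
    offset = len(pieces[0])            # running offset into the flattened text
    annotations = []

    # Iterate over a copy because children are removed inside the loop.
    for tag in list(text_el):
        inner = "".join(tag.itertext())    # the annotated text
        tail = tag.tail or ""              # text between this TAG and the next
        begin, end = offset, offset + len(inner)
        annotations.append((dict(tag.attrib), begin, end))
        pieces.extend([inner, tail])
        offset = end + len(tail)
        text_el.remove(tag)                # also drops the tail, saved above

    text_el.text = "".join(pieces)

    # Append one stand-off TAG per annotation, keeping the original attributes.
    for attrib, begin, end in annotations:
        standoff = ET.SubElement(record, "TAG", attrib)
        standoff.set("BEGIN", str(begin))
        standoff.set("END", str(end))
    return record


root = ET.fromstring(SAMPLE)
# findall returns a list, so modifying each record while looping is safe.
for record in root.findall("RECORD"):
    convert_record(record)

print(ET.tostring(root, encoding="unicode"))
```

Slicing the flattened TEXT content with the stored offsets should return the annotated text again, which makes for an easy sanity check. If the offsets should instead be 1-based or should ignore the leading whitespace, the starting value of offset would need to be adjusted accordingly.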