XML: XML parsing (removing parent nodes)

Hi, I'm seriously stuck trying to filter my XML document. Here is an example of the contents:

  <sentence id="1" document_id="Perseus:text:1999.02.0029" >
      <primary>millermo</primary>
      <word id="1" />
      <word id="2" />
      <word id="3" />
      <word id="4" />
  </sentence>
  <sentence id="2" document_id="Perseus:text:1999.02.0029" >
      <primary>millermo</primary>
      <word id="1" />
      <word id="2" />
      <word id="3" />
      <word id="4" />
      <word id="5" />
      <word id="6" />
      <word id="7" />
      <word id="8" />
  </sentence>

There are many sentences (over 3000), and all I want to do is write some code (preferably in Java or Python) that goes through my XML file and removes every sentence that has more than 5 word elements. In other words, I want to be left with only the sentence elements that contain 5 or fewer word elements. Thanks. (Just to note, my XML isn't great; I get mixed up with nodes, tags, elements, and ids.)
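One possible approach, sketched in Python with the standard library's `xml.etree.ElementTree`. It assumes the sentences sit inside a single root element (called `treebank` here purely for illustration; the post does not show the real root) and that "word ids" means the number of `<word>` child elements. The function names, the `max_words` parameter, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: the <treebank> root is an assumption, since the
# snippet in the question shows only the <sentence> elements.
SAMPLE = """<treebank>
  <sentence id="1" document_id="Perseus:text:1999.02.0029">
    <primary>millermo</primary>
    <word id="1" /><word id="2" /><word id="3" /><word id="4" />
  </sentence>
  <sentence id="2" document_id="Perseus:text:1999.02.0029">
    <primary>millermo</primary>
    <word id="1" /><word id="2" /><word id="3" /><word id="4" />
    <word id="5" /><word id="6" /><word id="7" /><word id="8" />
  </sentence>
</treebank>"""


def prune_long_sentences(root, max_words=5):
    """Remove each <sentence> that has more than max_words <word> children.

    Returns the number of sentences removed.
    """
    removed = 0
    # ElementTree elements do not know their parent, so treat every
    # element as a candidate parent and remove matching children from it.
    for parent in list(root.iter()):
        # Copy the child list first so removal does not disturb iteration.
        for sentence in list(parent.findall("sentence")):
            if len(sentence.findall("word")) > max_words:
                parent.remove(sentence)
                removed += 1
    return removed


def filter_file(src_path, dst_path, max_words=5):
    """Read src_path, drop the long sentences, and write dst_path."""
    tree = ET.parse(src_path)
    removed = prune_long_sentences(tree.getroot(), max_words)
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return removed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    removed = prune_long_sentences(root)
    kept_ids = [s.get("id") for s in root.iter("sentence")]
    print(removed, kept_ids)
```

For a real file, something like `filter_file("input.xml", "filtered.xml")` would apply the same logic (both file names are placeholders). On terminology: a `<sentence>` is an element (also called a node or tag), `<word>` elements are its children, and `id` is an attribute on each element, so the filter counts child elements rather than ids.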
