Monday, 8 December 2014

Unable to parse XML tags within other tags using SAX Parser in Groovy



I have this sample XML that I am trying to parse:



<records>
<ae_documentTitleBegin /><ae_subDocumentTitleGenerated generatedTitle="Introduction" /><ae_750584b7e5364775bf21d91c5020b965_clauseBegin /><ae_clauseTitleBegin />Introduction<ae_clauseTitleEnd /><ae_clauseBodyBegin />ABL <ae_definedTermInstanceBegin />CREDIT AGREEMENT<ae_definedTermInstanceEnd /><ae_documentTitleEnd />
<car name='HSV Maloo' make='Holden' year='2006'>
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms </u>.<ae_clauseTitleEnd />

</car>
<car name='P50' make='Peel' year='1962'>
<ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
</car>
<car name='Royale' make='Bugatti' year='1931'>
<ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
</car>
</records>


The SAX parser that I have implemented looks like this:



import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.*

class SAXXMLParser extends DefaultHandler {
    ArrayList<String> DefinedTermTitles = new ArrayList<>();
    ArrayList<String> ClauseTitles = new ArrayList<>();
    ArrayList<String> DocumentTitles = new ArrayList<>();
    String currentMessage;
    boolean countryFlag = false;
    StringBuilder message = new StringBuilder();

    void startElement(String ns, String localName, String qName, Attributes atts) {
        switch (qName) {
            case 'ae_clauseTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true;
                break

            case 'ae_documentTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true;
                break

            case 'ae_definedTermTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true;
                break
        }
    }

    void characters(char[] chars, int offset, int length) {
        if (countryFlag) {
            message.append(new String(chars, offset, length));
            //println(currentMessage)
        }
    }

    void endElement(String ns, String localName, String qName) {
        switch (qName) {
            case 'ae_clauseTitleEnd':
                ClauseTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break
            case 'ae_documentTitleEnd':
                DocumentTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break
            case 'ae_definedTermTitleEnd':
                DefinedTermTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break
        }
    }
}


The output that I am getting is:



Calling XML Parser
[Australia, Isle of Man, France] <<-- DefinedTermTitles

[Introduction, 1.02 Accounting Terms ., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million] <<-- ClauseTitles

[] <<-- DocumentTitles
ENd of XML Parser


This is wrong. As the second list shows, Introduction ended up in ClauseTitles, when it should be in the third list, DocumentTitles (I add the text between the document title tags to that list). The value itself is also incomplete: it should be Introduction ABL CREDIT AGREEMENT. I have no idea why this is happening; my guess is that it is caused by having tags within tags. I need a way to get only the text, ignoring the tags.
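
One possible direction, sketched here under assumptions rather than offered as a confirmed fix: every ae_...Begin and ae_...End tag in the sample is an empty element, so the text is never a child of those tags, and the single shared flag and buffer are reset by the first End marker that arrives (ae_clauseTitleEnd), which leaves nothing for ae_documentTitleEnd to collect. The sketch below keeps one buffer per open region instead. The class name MarkerRegionHandler, the openBuffers and titles fields, and the rule of deriving a region name by stripping the Begin/End suffix are all hypothetical and not part of the original code.

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: one buffer per open marker region, so an inner
// End marker does not wipe the text of an outer region that is still open.
class MarkerRegionHandler extends DefaultHandler {
    // Collected text per region (marker name without its Begin/End suffix).
    Map<String, List<String>> titles = [
        ae_documentTitle   : [],
        ae_clauseTitle     : [],
        ae_definedTermTitle: []
    ]
    // Buffers for the regions that are currently open.
    Map<String, StringBuilder> openBuffers = [:]

    @Override
    void startElement(String ns, String localName, String qName, Attributes atts) {
        if (qName.endsWith('Begin')) {
            String region = qName.substring(0, qName.length() - 'Begin'.length())
            // Only track the regions we want to collect.
            if (titles.containsKey(region)) {
                openBuffers[region] = new StringBuilder()
            }
        }
    }

    @Override
    void characters(char[] chars, int offset, int length) {
        // Text belongs to every region that is open at this point.
        String text = new String(chars, offset, length)
        openBuffers.values().each { it.append(text) }
    }

    @Override
    void endElement(String ns, String localName, String qName) {
        if (qName.endsWith('End')) {
            String region = qName.substring(0, qName.length() - 'End'.length())
            StringBuilder buffer = openBuffers.remove(region)
            if (buffer != null) {
                titles[region] << buffer.toString().trim()
            }
        }
    }
}

// Trimmed copy of the sample from the question.
def xml = '''<records>
<ae_documentTitleBegin /><ae_clauseTitleBegin />Introduction<ae_clauseTitleEnd /><ae_clauseBodyBegin />ABL <ae_definedTermInstanceBegin />CREDIT AGREEMENT<ae_definedTermInstanceEnd /><ae_documentTitleEnd />
<car name="HSV Maloo">
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms </u>.<ae_clauseTitleEnd />
</car>
</records>'''

def handler = new MarkerRegionHandler()
def parser = SAXParserFactory.newInstance().newSAXParser()
parser.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')), handler)
println handler.titles
```

With this approach the clause title Introduction is still collected, and the document title region also receives it. Note that the sample has no whitespace between Introduction and ABL, so the document title comes out as IntroductionABL CREDIT AGREEMENT; getting a space there would need an extra rule, such as inserting a separator whenever a marker tag is crossed.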

