Sunday, 10 August 2014

How to resolve question marks when parsing a DTD-validated XML document with the Java SAXParser?



I am wondering why my SAX parser seems unable to resolve certain entities defined in an external DTD file. I am processing a huge XML file that starts with the header shown below. The input (heavily reduced :-)) is:



// myxml.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE authors SYSTEM "mydtd.dtd">
<authors>
<author>
Bal&aacute;zs Hidasi
</author>
</authors>


And this is the incorrect output:



Bal
?zs Hidasi


Obviously, &aacute; was not resolved!


This is how I have set up the parser:



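// MySaxParser.java

import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParser extends DefaultHandler {

    // Local name of the element most recently opened (set in startElement).
    private String currentTag;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        currentTag = localName;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if ("author".equals(currentTag)) {
            // Prints every chunk of character data on its own line.
            System.out.println(new String(ch, start, length));
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
        spf.setNamespaceAware(true);
        spf.setValidating(true); // From what I understood from the API, this combined
                                 // with '<!DOCTYPE authors SYSTEM "mydtd.dtd">' from
                                 // the file myxml.xml should do the trick. What am I missing?

        SAXParser saxParser = spf.newSAXParser();
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(new MySaxParser());
        // MyErrorHandler is my own org.xml.sax.ErrorHandler implementation (not shown).
        xmlReader.setErrorHandler(new MyErrorHandler(System.err));

        xmlReader.parse("file:/path/to/myxml.xml");
    }
}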


What am I missing? Do I have to do more than spf.setValidating(true) to make the parser aware of the DTD referenced in the XML file header?


I should mention that the DTD and the XML are syntactically and semantically correct. The DTD contains <!ENTITY aacute "&#225;" ><!-- small a, acute accent --> as the mapping for this entity. I downloaded the files from a trusted source, so the error has to be in my code.
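
One possible explanation, offered as a guess rather than a confirmed diagnosis: the entity may be resolved correctly after all. SAX allows a parser to deliver the text of one element through several characters() calls, and a split at the entity boundary would explain the line break after "Bal". The "?" would then come from printing "á" to a console whose encoding cannot represent it. The sketch below buffers the chunks until endElement and prints through a UTF-8 PrintStream. The class name AuthorTextCollector, the SAMPLE string, and the internal DTD subset (used here instead of the external mydtd.dtd so the example is self-contained) are all assumptions made for illustration.

```java
import java.io.PrintStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AuthorTextCollector extends DefaultHandler {

    // Hypothetical sample input: the entity is declared in an internal DTD
    // subset so that no external mydtd.dtd file is needed.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE authors [<!ENTITY aacute \"&#225;\">]>"
        + "<authors><author>Bal&aacute;zs Hidasi</author></authors>";

    private final List<String> authors = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("author".equals(qName)) {
            text.setLength(0); // start a fresh buffer for each <author>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of printing.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("author".equals(qName)) {
            authors.add(text.toString().trim()); // the element's text is complete here
        }
    }

    // Parses the given XML string and returns the text of every <author> element.
    static List<String> parseAuthors(String xml) throws Exception {
        AuthorTextCollector handler = new AuthorTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.authors;
    }

    public static void main(String[] args) throws Exception {
        // Write UTF-8 explicitly instead of relying on the platform default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (String author : parseAuthors(SAMPLE)) {
            out.println(author);
        }
    }
}
```

If the terminal itself is not set to UTF-8, the accented character can still be displayed incorrectly even though the parsed string is right. Comparing the string in code, or writing it to a file with an explicit encoding, avoids that ambiguity.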

