Wednesday, 4 February 2015

Reading compressed XML with Java



I noticed a problem reading XML with Java. I am using javax.xml.parsers.*, and for a given InputStream stream I do the following:



DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

org.w3c.dom.Element docElem = db.parse(stream).getDocumentElement();


My files are generally encoded as UTF-8, but they don't actually contain any non-ASCII characters. Nevertheless, the encoding is declared as <?xml version="1.0" encoding="UTF-8" ?>. Some of the XML files are rather large, so I generally compress them using gzip file.xml. I use the following method to get an InputStream depending on the extension of the file name:



private static InputStream getInputStream(File file) throws IOException {
    // Everything after the last '.' is the extension ("" if there is none).
    String extension = "";
    String fileName = file.getName();

    int i = fileName.lastIndexOf('.');
    if (i > 0) {
        extension = fileName.substring(i + 1);
    }

    InputStream stream = new FileInputStream(file);

    if ("gz".equals(extension)) {
        // Decompress on the fly.
        return new GZIPInputStream(stream);
    }
    else {
        if (!"xml".equals(extension)) {
            LOGGER.warning(String.format("Unknown extension: %s, assuming plain XML", extension));
        }

        return stream;
    }
}


When I parse a file with a gz extension using the snippets above, I get the following exception:



com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:691)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1743)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1614)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1652)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:205)


The problem does not show up when I use gunzip to decompress the file before reading in the XML. What am I doing wrong here?
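
One way to narrow this down might be to look at the first bytes the parser would actually receive. The sketch below is a diagnostic, not a confirmed fix: it reads the start of the decompressed stream and reports whether it looks like XML, whether it is still gzip data (which would suggest the file was compressed more than once), or something else. The class GzipXmlProbe and its method names are made up for illustration; only the java.io and java.util.zip classes are standard.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the original code.
public class GzipXmlProbe {

    /** True if the data starts with the gzip magic number 0x1f 0x8b. */
    static boolean looksGzipped(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    /** True if the data starts with a UTF-8 byte order mark (EF BB BF). */
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xff) == 0xef
                && (head[1] & 0xff) == 0xbb
                && (head[2] & 0xff) == 0xbf;
    }

    /** Reads up to n bytes; the result is shorter if the stream ends early. */
    static byte[] readHead(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) {
                break;
            }
            off += r;
        }
        return Arrays.copyOf(buf, off);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    /** Classifies the first bytes of a stream that should already be decompressed. */
    static String describe(InputStream decompressed) throws IOException {
        byte[] head = readHead(decompressed, 16);
        if (looksGzipped(head)) {
            return "still gzip data: the file may have been compressed more than once";
        }
        int start = hasUtf8Bom(head) ? 3 : 0;
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail here.
        String text = new String(head, start, head.length - start, StandardCharsets.ISO_8859_1);
        if (text.startsWith("<")) {
            return "looks like XML: " + text;
        }
        return "unexpected leading bytes: " + toHex(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java GzipXmlProbe file.xml.gz");
            return;
        }
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]))) {
            System.out.println(describe(in));
        }
    }
}
```

If this reports that the data looks like XML, the decompression itself is probably fine and it may be worth checking that the stream handed to db.parse is really the one returned by getInputStream rather than a raw FileInputStream.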

