Basically, I parse the XML output that Apache Tika produces for several files to get metadata (via meta tags) and a list of embedded files (via <div class="embedded" id="content">). However, I found that my map contained several keys of the form Unknown tag (0x...). I wonder whether this is caused by incomplete tag output from Tika, because the only errors I get are about unclosed tags, which I suspect occur in the body of the XML rather than in the parts I want (meta, div). That seems illogical, though: the only code that writes into the map handles meta tags and divs with the embedded class, and those make up only a small part of the document.
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Parse {

    private class internalXMLReader extends DefaultHandler {
        public final Map<String, Object> entityList = new HashMap<>();

        @Override
        public void startElement(String uri, String localname, String qName, Attributes attributes) throws SAXException {
            if (qName.equalsIgnoreCase("meta")) {
                String key = attributes.getValue("name");
                String content = attributes.getValue("content");
                if (key == null) {
                    return; // a meta tag without a name attribute gives us no key to store
                }
                if (key.contains("Content-Type") && content != null) {
                    // keep only the media type, e.g. "text/html; charset=UTF-8" becomes "text/html"
                    // (the original replace(' ', '\0') inserted NUL characters instead of removing spaces)
                    String[] tmp = content.split(";");
                    if (tmp.length > 1) {
                        content = tmp[0].trim();
                    }
                }
                entityList.put(key, content);
            } else if (qName.equalsIgnoreCase("div")) {
                if ("embedded".equalsIgnoreCase(attributes.getValue("class"))) {
                    String key = "embedded";
                    List<String> inlist;
                    if (entityList.get(key) instanceof List) {
                        // only this handler stores a list under "embedded", so the cast is safe
                        @SuppressWarnings("unchecked")
                        List<String> existing = (List<String>) entityList.get(key);
                        inlist = existing;
                    } else {
                        inlist = new LinkedList<>();
                        entityList.put(key, inlist);
                    }
                    inlist.add(attributes.getValue("id"));
                }
            }
        }

        @Override
        public void endElement(String uri, String localname, String qName) throws SAXException {
            // intentionally empty: I did not want to validate anything here
        }

        @Override
        public void characters(char ch[], int start, int length) throws SAXException {
            // intentionally empty: we don't read <something>this</something> yet
        }
    }

    public Entity parse(String xml, Entity in) {
        try {
            InputSource xmlinput = new InputSource(new StringReader(xml));
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            internalXMLReader handler = new internalXMLReader();
            parser.parse(xmlinput, handler);
            in.addMeta(handler.entityList);
        } catch (IOException | ParserConfigurationException | SAXException ex) {
            Logger.getLogger(TikaParseNCluste.class.getName()).log(Level.SEVERE, null, ex);
        }
        return in;
    }
}

Perhaps I should take a look at my 800+ XML files.
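One way to take that look without opening 800+ files by hand is a small diagnostic pass. The sketch below rests on an assumption that is worth verifying rather than taking for granted: that the Unknown tag (0x...) keys arrive verbatim in the name attribute of meta tags in Tika's output, and are not produced by the unclosed-tag errors. Because SAX is a streaming API, meta tags in the head are reported before a malformed body makes the parser give up, so both things can happen in the same file. The names UnknownTagScan and metaNames are hypothetical helpers, not part of the code above or of Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of the original Parse class.
class UnknownTagScan {

    /** Returns the name attribute of every meta tag seen before parsing ended or failed. */
    static List<String> metaNames(String xml) throws IOException, ParserConfigurationException {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                String name = attributes.getValue("name");
                if (qName.equalsIgnoreCase("meta") && name != null) {
                    names.add(name);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException ex) {
            // Malformed XML later in the document: names collected so far are kept.
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Directory of Tika XML outputs; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path file : files) {
                String xml = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                for (String name : metaNames(xml)) {
                    if (name.startsWith("Unknown tag")) {
                        System.out.println(file + ": " + name);
                    }
                }
            }
        }
    }
}
```

If the files it prints turn out to be, say, images or other formats with tags that the metadata extractor cannot name, the keys are probably legitimate (if unhelpful) metadata and can be filtered out in startElement instead of being treated as a parsing error.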