XML: Parsing Apache Tika XML Output Returns Unknown Tag

Basically, I parse several XML outputs from Apache Tika to get the metadata (via meta tags) and the list of embedded files (via <div class="embedded" id="content">). However, I found that my map had several keys named Unknown tag (0x...). I wonder if this is caused by incomplete tag output from Tika, because the only errors I get are about unclosed tags, which I suspect occur in the body of the XML rather than in the parts I actually read (meta, div). That seems illogical, though: the only code that writes into the map handles meta tags and divs with the embedded class, which are only a small part of the document.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class Parse {

        private class internalXMLReader extends DefaultHandler {
            public final Map<String, Object> entityList = new HashMap<>();

            @Override
            public void startElement(String uri, String localname, String qName, Attributes attributes) throws SAXException {
                if (qName.equalsIgnoreCase("meta")) {
                    String key = attributes.getValue("name");
                    String content = attributes.getValue("content");
                    // A <meta> without a name attribute has nothing to use as a map key.
                    if (key == null) {
                        return;
                    }
                    if (content != null && key.contains("Content-Type")) {
                        // Keep only the media type, e.g. "text/html" from "text/html; charset=UTF-8".
                        int semicolon = content.indexOf(';');
                        if (semicolon >= 0) {
                            content = content.substring(0, semicolon).trim();
                        }
                    }
                    entityList.put(key, content);
                } else if (qName.equalsIgnoreCase("div")) {
                    // Collect the id of every <div class="embedded"> under the "embedded" key.
                    if ("embedded".equalsIgnoreCase(attributes.getValue("class"))) {
                        String key = "embedded";
                        List<String> inlist;
                        if (entityList.get(key) instanceof List) {
                            @SuppressWarnings("unchecked")
                            List<String> existing = (List<String>) entityList.get(key);
                            inlist = existing;
                        } else {
                            inlist = new LinkedList<>();
                            entityList.put(key, inlist);
                        }
                        inlist.add(attributes.getValue("id"));
                    }
                }
            }

            @Override
            public void endElement(String uri, String localname, String qName) throws SAXException {
                // Intentionally empty: no validation is done here.
            }

            @Override
            public void characters(char ch[], int start, int length) throws SAXException {
                // Intentionally empty: element text (<something>this</something>) is not read yet.
            }
        }

        public Entity parse(String xml, Entity in) {
            try {
                InputSource xmlinput = new InputSource(new StringReader(xml));
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                internalXMLReader handler = new internalXMLReader();
                parser.parse(xmlinput, handler);
                in.addMeta(handler.entityList);
            } catch (IOException | ParserConfigurationException | SAXException ex) {
                Logger.getLogger(TikaParseNCluste.class.getName()).log(Level.SEVERE, null, ex);
            }
            return in;
        }
    }

Perhaps I should take a look at my 800+ XML files.
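One possibility worth ruling out while going through those files is that the keys are already present as meta names in the Tika output itself, rather than being produced by the parser. The following is only a minimal sketch of such a check, assuming the output is well-formed enough for SAX to read. The class name `UnknownTagScan` and the method `unknownMetaNames` are hypothetical and not part of the code above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of the original Parse class.
final class UnknownTagScan {

    // Returns the name attribute of every <meta> element whose name
    // starts with "Unknown tag", in document order.
    static List<String> unknownMetaNames(String xml)
            throws IOException, ParserConfigurationException, SAXException {
        List<String> hits = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (!"meta".equalsIgnoreCase(qName)) {
                    return;
                }
                String name = attributes.getValue("name");
                if (name != null && name.startsWith("Unknown tag")) {
                    hits.add(name);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return hits;
    }
}
```

Calling `UnknownTagScan.unknownMetaNames(xml)` on each file's content and printing the non-empty results would show which files, if any, carry such names before they ever reach the map.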
