Wednesday, 24 September 2014

Extract complete content from XML with Tika



I want to extract the complete content of an XML file with Tika. That is, Tika should keep the tags in the output instead of extracting only the text inside the elements and discarding the markup.


The output of the content should look like this:



content:
<?xml version="1.0" encoding="UTF-8" ?>
<xml>
<tag1>text</tag1>
<tag2>text</tag2>
</xml>


But the result is always this:



content:





text
text


Program code:



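import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.ContentHandler;

// The class name is illustrative; the original post showed only main().
public class XmlContentTest {
    public static void main(String[] args) {
        // try-with-resources closes the stream even if parsing fails
        try (InputStream input = new FileInputStream("D:/SolrTestFileSystem/Test_Files/test.xml")) {
            // Collects the character data that the parser emits
            ContentHandler textHandler = new WriteOutContentHandler();
            Metadata metadata = new Metadata();
            XMLParser parser = new XMLParser();
            ParseContext context = new ParseContext();
            parser.parse(input, textHandler, metadata, context);
            System.out.println("content: " + textHandler.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}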


The xml file:



<?xml version="1.0" encoding="UTF-8" ?>
<xml>
<tag1>text</tag1>
<tag2>text</tag2>
</xml>
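
A possible workaround, offered as a sketch rather than a confirmed answer: as far as I can tell, Tika's XMLParser forwards only the character data of the elements to the content handler, so the tags never reach WriteOutContentHandler. If the goal is simply the raw markup as a string, one option is to read the file directly and use Tika only for metadata. The class name RawXmlContent and the helper readRaw are made up for illustration, and the code assumes the file is UTF-8 encoded, as the XML declaration in the test file states.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Tika.
public class RawXmlContent {

    // Returns the file content unchanged, tags included.
    // Assumption: the file is UTF-8 encoded.
    static String readRaw(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java RawXmlContent <path-to-xml>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("content:");
        System.out.println(readRaw(file));
    }
}
```

Called with the path to test.xml, this prints the document as it is stored on disk. It does no parsing at all, so it will not detect malformed XML; if validation matters, the file could additionally be run through a standard SAX or DOM parser.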
