TagSoup breaks well-formed XML



While cleaning an XML file with TagSoup I got unexpected results: it closes the parent tag too soon, so some child elements end up orphaned outside their parent. It also lowercases the parent tag's name.


Before TagSoup:



<Objects>
<Object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://ift.tt/1wDs0ae;
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
</Object>
<Object>
<ObjectID>122</ObjectID>
[...]


After TagSoup:



<Objects>
<object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
</object>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://ift.tt/1wDs0ae;
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
<object>
<ObjectID>122</ObjectID>
[...]


I'm working on a Java project that uses these classes:



import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;


I'm using Java 6.


Any clues? Since the input is already valid XML, I would expect the output to be essentially the same file: minor details might change, but not the structure. Is that expectation wrong?
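
To make the expectation concrete, here is a minimal sketch of a structure-preserving round trip. It assumes the input is already well-formed and uses the JDK's built-in identity transformer instead of TagSoup, so it is a comparison point rather than a fix for the TagSoup behavior. The class name `XmlRoundTrip`, the method `roundTrip`, and the sample string are hypothetical and only for illustration. As far as I understand, TagSoup is meant for parsing HTML as found in the wild, so it may be treating `<Object>` and `<Title>` as HTML elements; that is an assumption about the cause, not something confirmed here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlRoundTrip {

    // Copies well-formed XML through the JDK's identity transformer.
    // No HTML-specific rules are applied, so element names and nesting
    // are expected to come out as they went in.
    public static String roundTrip(String xml) throws TransformerException {
        // newTransformer() with no stylesheet gives an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> declaration to keep the output easy to compare.
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical sample modeled on the structure shown above.
        String xml = "<Objects><Object><Status>Not Ready</Status>"
                   + "<Title>Some description</Title></Object></Objects>";
        System.out.println(roundTrip(xml));
    }
}
```

If a plain XML round trip like this keeps `<Title>` inside `<Object>`, that would suggest the restructuring comes from TagSoup's own handling rather than from the input file.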

