XML : In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the metadata items is a collection of tags, which are comma-delimited. Here is an example:

  <tags>Research skills, Searching&#44; evaluating and referencing</tags>

The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the element, which returns a String.

The problem is that, as far as I'm aware, I cannot distinguish the encoded comma from the unencoded ones in the String I receive, because the parser has already resolved the character reference by the time I see the text.
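To make the symptom concrete, here is a minimal sketch of what I am seeing. It uses the JDK's built-in DOM parser rather than dom4j so that it runs without extra dependencies; that substitution is an assumption for illustration, and the class and method names (TagTextDemo, readText) are hypothetical. As far as I can tell, dom4j's getText() gives the same decoded result.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TagTextDemo {
    // Parses the XML string and returns the text content of the root element.
    // Assumption: the JDK DOM parser stands in for dom4j here.
    static String readText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tags>Research skills, Searching&#44; evaluating and referencing</tags>";
        String text = readText(xml);
        // The character reference &#44; has already been turned into a plain comma.
        System.out.println(text);
        // Splitting on commas therefore yields three pieces instead of the intended two tags.
        System.out.println(text.split(",").length);
    }
}
```

The parsed text contains no trace of the original &#44;, so a naive split on "," breaks the second tag in two.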

Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
