Are HTML/XML character entities just the first code point under UTF-8 encoding?



Recently I came across a project and found a method that converts special characters to the corresponding HTML/XML character entities for display.


The method is simple: using a regular expression, it replaces every special character in the source string (under UTF-8 encoding) with its first code point (obtained with the codePointAt(0) method), prefixed with "&#" and suffixed with ";".
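To make the question concrete, here is a minimal Java sketch of how I understand that method. The class name EntityEncoder, the method name toNumericEntities, and the choice of which characters count as "special" (any non-ASCII character plus & < > " ') are my own assumptions for illustration, not the project's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityEncoder {
    // Assumption: "special" means any non-ASCII character plus & < > " '
    private static final Pattern SPECIAL =
            Pattern.compile("[^\\x00-\\x7F]|[&<>\"']");

    // Replaces each special character with "&#" + decimal code point + ";"
    public static String toNumericEntities(String source) {
        Matcher m = SPECIAL.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The match is a single character (one or two Java chars),
            // so codePointAt(0) yields its full Unicode code point.
            int codePoint = m.group().codePointAt(0);
            m.appendReplacement(out,
                    Matcher.quoteReplacement("&#" + codePoint + ";"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "caf&#233; &#60; bar"
        System.out.println(toNumericEntities("caf\u00e9 < bar"));
    }
}
```

Java's regex engine matches a character outside the Basic Multilingual Plane as one code point, so in this sketch a surrogate pair becomes a single reference rather than two.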


I have run some further tests with this conversion, and the results were all correct.


I have found a lot of discussion about how to convert special characters to HTML/XML character entities in Java, some of which even involves third-party libraries. So I wonder: if the source string can be obtained in UTF-8, can the conversion be done simply by extracting the first code point of each special character?

