Convert Chinese characters into XML/HTML-style numerical entities and into Unicode UTF-8?



I have a mixture of English words and Chinese characters, and I would like to convert the text into a mixture of English words and the XML/HTML-style numerical entities of the Chinese characters.


For example, the following mixture of English words, numbers and Chinese characters



Title: 目录.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C语言概述
Level: 1
PageNumber: 13
Begin
Title: 1.1 C语言的发展过程
Level: 2
PageNumber: 13
Begin
Title: 1.2 当代最优秀的程序设计语言


would be turned into the following, with the Chinese characters replaced by their XML/HTML-style numerical entities:



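Title: &#30446;&#24405;.doc
Level: 1
PageNumber: 1
Begin
Title: 1 C&#35821;&#35328;&#27010;&#36848;
Level: 1
PageNumber: 13
Begin
Title: 1.1 C&#35821;&#35328;&#30340;&#21457;&#23637;&#36807;&#31243;
Level: 2
PageNumber: 13
Begin
Title: 1.2 &#24403;&#20195;&#26368;&#20248;&#31168;&#30340;&#31243;&#24207;&#35774;&#35745;&#35821;&#35328;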


I am not familiar with encodings, and I wonder whether I can program this in Python.
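For illustration, here is a minimal sketch of the conversion I have in mind, assuming Python 3 and that the text is already a decoded `str`. It relies on the built-in `xmlcharrefreplace` error handler, which replaces every character that ASCII cannot represent with a decimal `&#NNNN;` reference. The function name `to_numeric_entities` is made up for this example.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    # Encoding to ASCII fails for Chinese characters; the built-in
    # "xmlcharrefreplace" handler substitutes &#<code point>; instead.
    # ASCII characters (English words, digits, punctuation) pass through.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_entities("Title: 目录.doc"))
# prints: Title: &#30446;&#24405;.doc
```

Lines that contain only ASCII, such as `Level: 1`, should come back unchanged.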


Is it also possible to write a program that turns the Chinese characters into their Unicode UTF-8 codes?
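By "Unicode UTF-8 code" I mean something like the sketch below, again assuming Python 3: `ord()` gives the Unicode code point of a character, and `str.encode("utf-8")` gives the bytes used to store it in UTF-8. The helper name `utf8_hex` is hypothetical.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for ch in "目录":
    # Show the character, its Unicode code point, and its UTF-8 bytes.
    print(ch, f"U+{ord(ch):04X}", utf8_hex(ch))
# prints:
# 目 U+76EE E7 9B AE
# 录 U+5F55 E5 BD 95
```

The code point (for example U+76EE, which is 30446 in decimal) is the same number that appears in the numerical entity, while the UTF-8 bytes are a different, multi-byte representation of that code point.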


Thanks in advance!

