I'm looking for a way to parse an XML file with a non-standard encoding in which the element names (not only the content) are encoded as well, in Simplified Chinese. I have managed to build a ContentHandler using SAX (in Python) that works quite well. Here's a sample of the XML (note that variableA, variableB, listOfRuns etc. are just stand-in English words; in the real XML there is Chinese everywhere):
<?xml version="1.0" encoding="gb-2312" standalone="yes"?>
<some_container>
  <variableA>SO201601240103</variableA>
  <variableB>SO</variableB>
  <listOfRuns>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
  </listOfRuns>
</some_container>
My ContentHandler maps the Chinese 'text nodes' quite nicely:
# -*- coding: utf-8 -*-
import sys, codecs, os, shutil
import xml.sax

class ExplorationProgramHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.parentFlag = False
        self.CurrentData = ""
        self.ProgramNumber = ""
        self.ProgramCategory = ""
        self.PayloadModeSwitchingList = []

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.parentFlag = True
        if tag == u"listOfRuns":
            print 'found payload switching list directive'
            self.PayloadModeSwitchingList.append(tag)
            self.parentFlag = False
        self.CurrentData = tag

    # Called when an element ends / FIXME!
    def endElement(self, tag):
        if self.CurrentData == u"variableA":
            print "ProgramNumber:", self.ProgramNumber
        elif self.CurrentData == u"variableB":
            print "ProgramCategory:", self.ProgramCategory
        elif self.CurrentData == u"listOfRuns":
            print "PayloadModeSwitchingList:", self.PayloadModeSwitchingList

    def characters(self, content):
        if self.CurrentData == u"variableA":
            self.ProgramNumber = content
        if self.CurrentData == u"variableB":
            self.ProgramCategory = content
        if self.CurrentData == u"listOfRuns":
            self.PayloadModeSwitchingList = None  # FIXME

if ( __name__ == "__main__"):
    enc = "gb2312"  # from header
    input_fname = sys.argv[1]
    shutil.copy(input_fname, "tmp.xml")
    f = open("tmp.xml", 'r').read()
    data = f.decode(enc).replace(enc, "utf-8").encode("utf-8")
    # print data gives a bunch of Chinese!
    #
    # recode to utf-8
    foo = open("tmp.utf.xml", "w")
    foo.write(data)
    foo.close()
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # override the default ContentHandler
    Handler = ExplorationProgramHandler()
    parser.setContentHandler( Handler )
    #p = xml.sax.parseString(data,Handler)
    parser.parse("tmp.utf.xml")
    # cleanup
    os.remove("tmp.xml")
    os.remove("tmp.utf.xml")
Running this on the sample XML:
python saxParser.py myTest.xml
produces the following result:
ProgramNumber: SO201601240103
ProgramCategory: SO
found payload switching list directive
which is close to what I want but misses the nested run nodes along with their attributes. I've checked the other Stack Overflow questions; the only one that comes close is for code written in Java (parse xml nested nodes that repeat with SAX), which is not really helpful since I don't really know how to create the handler for the run element.
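One way the run elements might be picked up, offered as a minimal sketch rather than a definitive answer: since each run is an empty element, everything of interest arrives in startElement through its attributes argument, so nothing needs to happen in characters. The class name RunCollectingHandler and the in_list and runs fields are hypothetical names invented for this sketch. It is written for Python 3 (the script above is Python 2), and the sample is declared as utf-8 on the assumption that the file has already been re-encoded.

```python
import xml.sax


class RunCollectingHandler(xml.sax.ContentHandler):
    """Collects the attributes of every <run> nested inside <listOfRuns>."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_list = False   # True while the parser is inside <listOfRuns>
        self.runs = []         # one dict of attributes per <run> element

    def startElement(self, tag, attributes):
        if tag == "listOfRuns":
            self.in_list = True
        elif tag == "run" and self.in_list:
            # attributes is a mapping-like object; copy it into a plain dict
            self.runs.append(dict(attributes.items()))

    def endElement(self, tag):
        if tag == "listOfRuns":
            self.in_list = False


# Same structure as the sample XML, already re-encoded to UTF-8 (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<some_container>
  <variableA>SO201601240103</variableA>
  <variableB>SO</variableB>
  <listOfRuns>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
  </listOfRuns>
</some_container>
"""

handler = RunCollectingHandler()
xml.sax.parseString(SAMPLE, handler)
for run in handler.runs:
    print(run["name"], run["start"], run["original"])
```

Tracking in_list keeps the handler from picking up run elements that appear elsewhere in the document; if that cannot happen in the real files, the flag could be dropped.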
Note that I could try parsing the XML with other tools such as ElementTree or minidom, but then I could not do the kind of nice mapping from Chinese tags into English that SAX lets me do.
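To illustrate the kind of mapping meant here, a minimal sketch under stated assumptions: the Chinese tag names in TAG_MAP and in the sample are placeholders invented for this example (the real names are not shown in this post), and MappingHandler with its current, buffer and values fields are likewise hypothetical. It is written for Python 3 and assumes the input has already been re-encoded to UTF-8.

```python
import xml.sax

# Hypothetical Chinese tag names, invented for this sketch only.
TAG_MAP = {
    u"计划编号": "ProgramNumber",
    u"计划类别": "ProgramCategory",
}


class MappingHandler(xml.sax.ContentHandler):
    """Stores the text of mapped elements under their English names."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.current = None   # English name of the element being read, if mapped
        self.buffer = []      # text chunks collected for the current element
        self.values = {}      # English name -> element text

    def startElement(self, tag, attributes):
        self.current = TAG_MAP.get(tag)
        self.buffer = []

    def characters(self, content):
        # characters() may be called more than once for a single text node
        if self.current is not None:
            self.buffer.append(content)

    def endElement(self, tag):
        if self.current is not None:
            self.values[self.current] = "".join(self.buffer)
        self.current = None


SAMPLE = u"""<?xml version="1.0" encoding="utf-8"?>
<容器>
  <计划编号>SO201601240103</计划编号>
  <计划类别>SO</计划类别>
</容器>
""".encode("utf-8")

handler = MappingHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.values)
```

Keeping the mapping in one dictionary means new tags can be added without touching the handler logic, and resetting current in endElement avoids the whitespace between elements overwriting a value that was already read.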
Thanks a lot for the help in advance!