XML: nested XML parsing with SAX and Python

I'm looking for a way to parse an XML file in a non-standard encoding (GB2312) whose element names, not only the content, are in simplified Chinese. I have built a ContentHandler using SAX (and Python) which works quite well. Here is a sample of the XML (note that variableA, variableB, listOfRuns etc. are just stand-in English words; in the real XML there is Chinese everywhere):

    <?xml version="1.0" encoding="gb-2312" standalone="yes"?>
    <some_container>
      <variableA>SO201601240103</variableA>
      <variableB>SO</variableB>
      <listOfRuns>
        <run name="test" start="2016-01-24T02:59:17" original="N"/>
        <run name="test" start="2016-01-24T02:59:17" original="N"/>
      </listOfRuns>
    </some_container>

My ContentHandler maps the Chinese 'text nodes' quite nicely:

    # -*- coding: utf-8 -*-
    import sys, codecs, os, shutil
    import xml.sax

    class ExplorationProgramHandler(xml.sax.ContentHandler):
        def __init__(self):
            self.parentFlag = False
            self.CurrentData = ""
            self.ProgramNumber = ""
            self.ProgramCategory = ""
            self.PayloadModeSwitchingList = []

        # Call when an element starts
        def startElement(self, tag, attributes):
            self.parentFlag = True
            if tag == u"listOfRuns":
                print 'found payload switching list directive'
                self.PayloadModeSwitchingList.append(tag)
                self.parentFlag = False
            self.CurrentData = tag

        # Call when an element ends / FIXME!
        def endElement(self, tag):
            if self.CurrentData == u"variableA":
                print "ProgramNumber:", self.ProgramNumber
            elif self.CurrentData == u"variableB":
                print "ProgramCategory:", self.ProgramCategory
            elif self.CurrentData == u"listOfRuns":
                print "PayloadModeSwitchingList:", self.PayloadModeSwitchingList

        def characters(self, content):
            if self.CurrentData == u"variableA":
                self.ProgramNumber = content
            if self.CurrentData == u"variableB":
                self.ProgramCategory = content
            if self.CurrentData == u"listOfRuns":
                self.PayloadModeSwitchingList = None # FIXME

    if ( __name__ == "__main__"):
        enc = "gb2312" # from header
        input_fname = sys.argv[1]
        shutil.copy(input_fname,"tmp.xml")
        f = open("tmp.xml",'r').read()
        data = f.decode(enc).replace(enc,"utf-8").encode("utf-8")
        # print data gives a bunch of Chinese!
        #
        # recode to utf-8
        foo = open("tmp.utf.xml","w")
        foo.write(data)
        foo.close()

        # create an XMLReader
        parser = xml.sax.make_parser()
        # turn off namespaces
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        # override the default ContentHandler
        Handler = ExplorationProgramHandler()
        parser.setContentHandler( Handler )
        #p = xml.sax.parseString(data,Handler)
        parser.parse("tmp.utf.xml")

        # cleanup
        os.remove("tmp.xml")
        os.remove("tmp.utf.xml")

Running this on the sample XML (python saxParser.py myTest.xml) produces the following result:

    ProgramNumber: SO201601240103
    ProgramCategory: SO
    found payload switching list directive

This is close to what I want, but it misses the nested run nodes and their attributes. I've checked the other Stack Overflow questions; the only one that comes close is for code written in Java (parse xml nested nodes that repeat with SAX), which is not really helpful since I don't know how to create the handler for the run element.
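For illustration, one possible way to handle the nested run elements is to keep a stack of open element names and read the attributes in startElement. This is only a minimal sketch, assuming Python 3 rather than the Python 2 of the script above; the class name RunListHandler and its fields (stack, fields, runs) are hypothetical, and the sample is declared as UTF-8 so that it can be fed to the parser directly.

```python
import xml.sax


class RunListHandler(xml.sax.ContentHandler):
    """Collects leaf text nodes and the attributes of nested <run> elements."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.text = []     # character chunks seen since the last start tag
        self.fields = {}   # leaf element name -> text content
        self.runs = []     # one dict of attributes per <run>

    def startElement(self, tag, attributes):
        self.stack.append(tag)
        self.text = []
        # Accept <run> only when its direct parent is <listOfRuns>.
        if tag == "run" and len(self.stack) > 1 and self.stack[-2] == "listOfRuns":
            self.runs.append(dict(attributes.items()))

    def characters(self, content):
        # SAX may deliver a single text node in several chunks.
        self.text.append(content)

    def endElement(self, tag):
        if tag in ("variableA", "variableB"):
            self.fields[tag] = "".join(self.text).strip()
        self.text = []
        self.stack.pop()


# Sample from the question, with the declaration changed to UTF-8 (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<some_container>
  <variableA>SO201601240103</variableA>
  <variableB>SO</variableB>
  <listOfRuns>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
    <run name="test" start="2016-01-24T02:59:17" original="N"/>
  </listOfRuns>
</some_container>
"""

handler = RunListHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.fields)
print(handler.runs)
```

The stack makes the parent of each element available without extra flags, so the same pattern could be extended to deeper nesting if the real file needs it.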

Note that I could try parsing the XML with other tools such as ElementTree or minidom, but then I cannot do the kind of nice mapping from Chinese tags into English that SAX permits me to do.
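To make the tag mapping concrete, here is a rough sketch of how a dictionary lookup inside a SAX handler might translate Chinese element names into English keys. It assumes Python 3; the Chinese tag names, TAG_MAP, TranslatingHandler and parse_gb2312 are all made up for illustration, since the real element names are not shown here. The input is decoded from GB2312 and re-encoded as UTF-8 in memory, mirroring the recoding step of the script above without temporary files.

```python
import xml.sax

# Hypothetical Chinese element names; the real file's names are not shown.
TAG_MAP = {
    "计划编号": "ProgramNumber",
    "计划类别": "ProgramCategory",
}


class TranslatingHandler(xml.sax.ContentHandler):
    """Stores the text of mapped elements under their English names."""

    def __init__(self, tag_map):
        super().__init__()
        self.tag_map = tag_map
        self.current = None   # English name of the open element, if mapped
        self.chunks = []
        self.record = {}

    def startElement(self, tag, attributes):
        self.current = self.tag_map.get(tag)  # None for unmapped tags
        self.chunks = []

    def characters(self, content):
        if self.current is not None:
            self.chunks.append(content)

    def endElement(self, tag):
        if self.current is not None:
            self.record[self.current] = "".join(self.chunks).strip()
        self.current = None


def parse_gb2312(raw_bytes, handler):
    """Decode GB2312 bytes, rewrite the declaration, and parse as UTF-8."""
    text = raw_bytes.decode("gb2312")
    text = text.replace('encoding="gb2312"', 'encoding="utf-8"', 1)
    xml.sax.parseString(text.encode("utf-8"), handler)


# Simulated file content, encoded the way the real file is assumed to be.
raw = (
    '<?xml version="1.0" encoding="gb2312"?>'
    "<容器><计划编号>SO201601240103</计划编号><计划类别>SO</计划类别></容器>"
).encode("gb2312")

handler = TranslatingHandler(TAG_MAP)
parse_gb2312(raw, handler)
print(handler.record)
```

Keeping the mapping in a plain dictionary means the handler itself contains no Chinese literals, and new tags can be supported by adding entries rather than new branches.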

Thanks a lot for the help in advance!
