Any tips on importing lxml.etree start events into a pandas.DataFrame? The following code does some simple lxml parsing and converts each entry into a pandas DataFrame using from_records. [NOTE: I tried from_dict, but it needed a list per attribute, while from_records seems to handle dictionaries better.]
pd.DataFrame.from_records fails while coercing the element attributes, with this error:
TypeError: Argument must be bytes or unicode, got 'int'
Thanks in advance for any tips.
CODE SNIPPET:
x2 = """<m2>
<entry attrm201=1 attrm202 attrm203=1>m0201_t</entry>
<entry attrm201=1 attrm0203=1>m0202_t</entry>
<entry displevel=1 entrytype=1>m0202_t</entry>
</m2>"""

import pandas as pd
objDF = pd.DataFrame()

import io
srcIO = io.StringIO(x2)
#srcIO = io.BytesIO(str.encode(x2))

from lxml import etree
for event, e in etree.iterparse(srcIO, recover=True, html=True, events=('start', 'end')):
    if event != 'start' : continue
    if e.tag != 'entry' : continue
    elmDict = e.attrib
    elmDict[e.tag] = e.text
    df = pd.DataFrame.from_records(elmDict, index=[0])
    objDF = pd.concat(objDF, df)

print(event, objDF)
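For comparison, here is a minimal sketch of one way the same loop might be made to work. It is not necessarily the only fix. It rests on a few assumptions: that the TypeError comes from handing pandas the lxml attribute proxy (`e.attrib`) instead of a plain dict, that reading `e.text` is only reliable on the 'end' event, and that collecting plain dicts in a list (the name `rows` is made up for this example) and building the frame once is acceptable. Note also that `pd.concat` expects a list of frames, not two positional frames.

```python
import io

import pandas as pd
from lxml import etree

x2 = """<m2>
<entry attrm201=1 attrm202 attrm203=1>m0201_t</entry>
<entry attrm201=1 attrm0203=1>m0202_t</entry>
<entry displevel=1 entrytype=1>m0202_t</entry>
</m2>"""

# Feed bytes to the parser; this is the alternative that was commented out
# in the original snippet.
srcIO = io.BytesIO(x2.encode("utf-8"))

rows = []  # hypothetical name: one plain dict per <entry>
for event, e in etree.iterparse(srcIO, recover=True, html=True, events=("end",)):
    # Listen for 'end' rather than 'start': lxml does not guarantee that
    # e.text has been read yet when the 'start' event fires.
    if e.tag != "entry":
        continue
    # Copy the attribute proxy into an ordinary dict so pandas never sees
    # the lxml object itself.
    row = dict(e.attrib)
    row[e.tag] = e.text
    rows.append(row)

# Build the frame once from the list of dicts; attributes missing from an
# entry become NaN in that row.
objDF = pd.DataFrame.from_records(rows)
print(objDF)
```

Building the DataFrame once at the end also avoids calling `pd.concat` on every iteration, which copies the accumulated frame each time.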