How to convert an XML file to a nice pandas DataFrame?

Let's assume that I have an XML file like this:


<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="28a45eb2460899763d709ca00ddbb665" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="626726ba8d34d15d02b6d043c55fe691" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="http://ift.tt/1vnsWoG"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
        </document>
    </documents>
</author>

I would like to read this XML file and convert it to a pandas DataFrame like this:


key                                       type  language  feature  web                    data
e95324a9a6c790ecb95e46cf15bE232ee517651   XXX   EN        xx       http://ift.tt/1vnsWoG  A large text with lots of strings and punctuations symbols [...]
e95324a9a6c790ecb95e46cf15bE232ee517651   XXX   EN        xx       http://ift.tt/1vnsWoG  A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce  XXX   EN        xx       http://ift.tt/1vnsWoG  A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42      XXX   EN        xx       http://ift.tt/1vnsWoG  A large text with lots of strings and punctuations symbols [...]
28a45eb2460823499763d70vdf9ca00ddbb665    XXX   EN        xx       http://ift.tt/1vnsWoG  A large text with lots of strings and punctuations symbols [...]

This is what I have already tried, but I am getting some errors, and there is probably a more efficient way to do this task:


from lxml import objectify
import pandas as pd

path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))

for i in range(0,len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Could anybody suggest a better approach to this problem? Thanks in advance.
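
One possible approach, given as a minimal sketch and not a definitive answer: collect one dict per document and build the DataFrame in a single call, since `DataFrame.append` was removed in pandas 2.0 and appending row by row is slow anyway. The sketch uses the standard library's `xml.etree.ElementTree` in place of `lxml.objectify`. It assumes the element names are `author`, `documents` and `document`, as the closing tags in the sample suggest. The names `SAMPLE`, `parse_rows` and `to_dataframe` are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file. The element names are an assumption
# based on the closing tags shown in the question.
SAMPLE = """<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="2">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="http://ift.tt/1vnsWoG"><![CDATA[First text
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="http://ift.tt/1vnsWoG"><![CDATA[Second text
]]>
        </document>
    </documents>
</author>"""


def parse_rows(root):
    """Return one dict per <document>, mixing author-level and document-level attributes."""
    rows = []
    for doc in root.iter("document"):
        rows.append({
            "key": doc.get("KEY"),
            "type": root.get("type"),
            "language": root.get("language"),
            "feature": root.get("feature"),
            "web": doc.get("web"),
            # The parser exposes CDATA content as ordinary element text.
            "data": (doc.text or "").strip(),
        })
    return rows


def to_dataframe(rows):
    """Build the DataFrame once from all rows instead of appending in a loop."""
    import pandas as pd  # imported here so parse_rows can run without pandas
    columns = ["key", "type", "language", "feature", "web", "data"]
    return pd.DataFrame(rows, columns=columns)


rows = parse_rows(ET.fromstring(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`, and `to_dataframe(rows)` would then return a frame with the six columns from the desired layout.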
