I've been trying to parse an XML file (JMdict_e.xml) for translation purposes. However, when I parse the whole file, the entry I look up comes back with what appears to be incomplete data.
Code:
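import xml.etree.ElementTree as ET
tree2 = ET.ElementTree(file="JMdict_e.xml")
root2 = tree2.getroot()
# Child tags of the entry at index 55711, then the text of its fifth child (the sense)
print([i.tag for i in root2[55711]])
print([i.text for i in root2[55711][4]])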
returns the following entries:
Result:
['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense']
["Godan verb with `ru' ending", 'intransitive verb', 'to become less capable', 'to grow dull', 'to become blunt', 'to weaken']
Conversely, when the single entry is extracted from the original xml database, the following is obtained:
Code:
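# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file="new.xml")
root = tree.getroot()
print([i.tag for i in root[1]])
for i in root[1]:
    # Prints an empty list for every child that is not a <sense>
    print([j.text for j in i if i.tag == "sense"])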
result:
['ent_seq', 'k_ele', 'r_ele', 'r_ele', 'sense', 'sense', 'sense', 'sense', 'sense']
##Truncated empty lists
['にぶい', 'adjective (keiyoushi)', 'dull (e.g. a knife)', 'blunt']
['のろい is usu. in kana', 'thickheaded', 'obtuse', 'stupid']
['にぶい', 'dull (sound, color, etc.)', 'dim (light)']
['slow', 'sluggish', 'inert', 'lethargic']
['のろい', 'indulgent (esp. to the opposite sex)', 'doting']
I've been picking apart the data for a while but have not been able to identify the cause. I suspect that another entry in the XML file may be overriding what is shown.
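One thing that may be worth checking is whether index 55711 in the full file really points at the same entry as the extracted one: the first result reports a "Godan verb" while the second reports an "adjective (keiyoushi)", which could mean the two lookups hit different entries rather than one entry being truncated. Below is a minimal sketch that looks an entry up by its ent_seq value instead of by position. The names SAMPLE, find_entry and sense_texts are hypothetical helpers made up for this illustration, and SAMPLE is a shortened stand-in for JMdict_e.xml with entity definitions supplied inline as an assumption so the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for JMdict_e.xml (assumption: the real file declares
# entities such as &n; and &adj-i; in its own DOCTYPE; the expansions here
# are placeholders for this sketch).
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun (common) (futsuumeishi)">
<!ENTITY adj-i "adjective (keiyoushi)">
]>
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele><reb>ヽ</reb></r_ele>
<sense><pos>&n;</pos><gloss>repetition mark in katakana</gloss></sense>
</entry>
<entry>
<ent_seq>1582430</ent_seq>
<k_ele><keb>鈍い</keb></k_ele>
<r_ele><reb>にぶい</reb></r_ele>
<r_ele><reb>のろい</reb></r_ele>
<sense>
<stagr>にぶい</stagr><pos>&adj-i;</pos>
<gloss>dull (e.g. a knife)</gloss><gloss>blunt</gloss>
</sense>
<sense>
<gloss>slow</gloss><gloss>sluggish</gloss>
</sense>
</entry>
</JMdict>"""


def find_entry(root, ent_seq):
    """Return (index, entry) for the <entry> whose <ent_seq> matches, else None."""
    for index, entry in enumerate(root):
        if entry.findtext("ent_seq") == str(ent_seq):
            return index, entry
    return None


def sense_texts(entry):
    """Return one list of child texts per <sense> element of the entry."""
    return [[child.text for child in sense] for sense in entry.findall("sense")]


root = ET.fromstring(SAMPLE)
index, entry = find_entry(root, 1582430)
print(index, entry.findtext("k_ele/keb"))
for texts in sense_texts(entry):
    print(texts)
```

Run against the full file (using ET.parse("JMdict_e.xml").getroot() in place of ET.fromstring(SAMPLE)), the returned index can be compared with 55711, and printing root2[55711].findtext("ent_seq") shows which entry that position actually holds.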
XML fragments
<JMdict>
  <entry>
    <ent_seq>1000000</ent_seq>
    <r_ele>
      <reb>ヽ</reb>
    </r_ele>
    <r_ele>
      <reb>くりかえし</reb>
    </r_ele>
    <sense>
      <pos>&n;</pos>
      <gloss>repetition mark in katakana</gloss>
    </sense>
  </entry>
  <entry>
    <ent_seq>1582430</ent_seq>
    <k_ele>
      <keb>鈍い</keb>
      <ke_pri>ichi1</ke_pri>
      <ke_pri>news2</ke_pri>
      <ke_pri>nf30</ke_pri>
    </k_ele>
    <r_ele>
      <reb>にぶい</reb>
      <re_pri>ichi1</re_pri>
      <re_pri>news2</re_pri>
      <re_pri>nf30</re_pri>
    </r_ele>
    <r_ele>
      <reb>のろい</reb>
      <re_pri>ichi1</re_pri>
    </r_ele>
    <sense>
      <stagr>にぶい</stagr>
      <pos>&adj-i;</pos>
      <gloss>dull (e.g. a knife)</gloss>
      <gloss>blunt</gloss>
    </sense>
    <sense>
      <s_inf>のろい is usu. in kana</s_inf>
      <gloss>thickheaded</gloss>
      <gloss>obtuse</gloss>
      <gloss>stupid</gloss>
    </sense>
    <sense>
      <stagr>にぶい</stagr>
      <gloss>dull (sound, color, etc.)</gloss>
      <gloss>dim (light)</gloss>
    </sense>
    <sense>
      <gloss>slow</gloss>
      <gloss>sluggish</gloss>
      <gloss>inert</gloss>
      <gloss>lethargic</gloss>
    </sense>
    <sense>
      <stagr>のろい</stagr>
      <gloss>indulgent (esp. to the opposite sex)</gloss>
      <gloss>doting</gloss>
    </sense>
  </entry>
</JMdict>
XML file in question