Iterparse truncating XML elements



I have a large XML file (about 600 MB) that I am trying to parse using cElementTree with iterparse. This is my first time attempting this.


I am iterating on 'product' tags and calling elem.clear() after I process each product. Within my parsing I have a function, parse_trips, which uses a for loop to parse the trip tags nested within each product's trips tag (each product could potentially have hundreds of these, each of which is hundreds of lines long).



for trip in trips:
    dump(trip)
    get_date(trip, product)
    set_price(trip, product)
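
For context, the overall loop described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the wrapper element names (products, trips), the inline sample document, and the bodies of parse_trips and parse_products are hypothetical stand-ins, not the real code. It also assumes iterparse is listening for "end" events, since an element is only guaranteed to be fully built when its "end" event is emitted.

```python
import io
import xml.etree.ElementTree as ET  # cElementTree exposes the same API on Python 2

# Hypothetical sample document; the real file is far larger.
SAMPLE = b"""<products>
<product><trips>
<trip><code>t1</code><date>2014-08-10</date></trip>
<trip><code>t2</code><date>2014-08-17</date></trip>
</trips></product>
</products>"""


def parse_trips(product):
    """Hypothetical stand-in: collect the date text of every trip under a product."""
    return [trip.findtext("date") for trip in product.iter("trip")]


def parse_products(source):
    """Process each product once its closing tag has been parsed, then free it."""
    results = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            results.append(parse_trips(elem))
            elem.clear()  # drop the product's children to keep memory bounded
    return results


dates = parse_products(io.BytesIO(SAMPLE))
print(dates)
```

If the real loop reacts to "start" events instead, the children of the element may not all be present yet at the point where it is processed.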


However, when I dump(trips) I see that these tags are getting truncated/closed out early without any error being thrown. The parser seems to reach a maximum length for the elem in memory and then just won't hold any more.


The raw xml:



<trip>
<code>text</code>
<name>text</name>
<image>img.jpg</image>
<date>2014-08-10</date>
<pricing>

</pricing>
<itinerary>
<code>1</code>
<events>
<event>
eventInfo 1
</event>
<event>
eventInfo 2
</event>
<event>
eventInfo 3
</event>
<event>
eventInfo 4
</event>
<event>
eventInfo 5
</event>
<event>
eventInfo 6
</event>
<event>
eventInfo 7
</event>
<event>
eventInfo 8
</event>
</events>
</itinerary>
</trip>


While there might be 6 such groups, when I reach the second trip in the group, the output of dump(trip) looks like this:



<trip>
<code>text</code>
<name>text</name>
<image>img.jpg</image>
<date>2014-08-10</date>
<pricing></pricing>
<itinerary>
<code>1</code>
<events>
<event>
eventInfo 1
</event>
<event>
eventInfo 2
</event>
<event>
eventInfo 3
</event>
</events>
</itinerary>
</trip>


and every later trip is gone. I tried looping through and just incrementing an integer i to count how many tags there are, and it only reaches the second one which it truncates and then ends the for loop.
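
The counting check described above could be sketched like this. It is a rough sketch under assumptions: the inline sample, the trips wrapper element, and the helper name count_events_per_trip are hypothetical, and it waits for each trip's "end" event before counting so that the element is complete when inspected.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: two trips with three and two events respectively.
SAMPLE = b"""<trips>
<trip><itinerary><events>
<event>eventInfo 1</event><event>eventInfo 2</event><event>eventInfo 3</event>
</events></itinerary></trip>
<trip><itinerary><events>
<event>eventInfo 1</event><event>eventInfo 2</event>
</events></itinerary></trip>
</trips>"""


def count_events_per_trip(source):
    """Return the number of event children found in each fully parsed trip."""
    counts = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "trip":
            counts.append(len(elem.findall("./itinerary/events/event")))
    return counts


event_counts = count_events_per_trip(io.BytesIO(SAMPLE))
print(event_counts)
```

Comparing these counts against the raw file is one way to tell whether elements are really being cut short or are simply being inspected before the parser has finished building them.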


Is there a way to view/configure the size of the element iterparse can grab? Or a way to use iter again once I get to trips to grab ALL of its child nodes?

