Python XML Parsing Algorithm Speed



I'm currently parsing a large XML file of the following form in a Python/Flask webapp on Heroku:



<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>


The code that I use to parse it, analyze it, and display it through Flask is the following:



from flask import render_template, session
from lxml import etree

parser = etree.XMLParser(recover=True)
with open("books/filename.xml") as f:
    tree = etree.parse(f, parser)
root = tree.getroot()

def getChapter(volume, chapter):
    # Walk the chapter's children by index until we run off the end
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter - 1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data is None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    # Map 1-based volume numbers to volume names
    names = volumeList()
    numbering = {}
    for i, name in enumerate(names, start=1):
        numbering[i] = name
    return numbering

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'], session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")


The issue that I am having is that whenever Flask tries to render a volume with many chapters (whenever chapterCount(session['volume']) is above about 50 or so), loading and processing the page takes a very long time. By comparison, if the app loads a volume with, say, under 10-15 chapters, loading is almost instantaneous, even as a live webapp. Is there a good way to optimize this and improve the speed and performance? Thanks a lot!
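(To illustrate where the time goes: each page render calls chapterCount, which calls getChapter once per chapter, and each getChapter call in turn re-runs volumeList and walks the children one index at a time, so the work grows roughly with chapters × items. A minimal sketch of one possible optimization, assuming the same lxml tree as above: counting chapters with a single XPath count() so no Python-level loop runs. The function name fastChapterCount and the inline sample document are illustrative, not from the app.)

```python
from lxml import etree

# Illustrative sample document with the same structure as the file above
xml = b"""
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li></chapter>
  </volume>
</book>
"""
tree = etree.fromstring(xml)

def fastChapterCount(tree, volume):
    # count() is evaluated inside libxml2 in one pass, instead of
    # calling getChapter() once per chapter from Python
    return int(tree.xpath('count(//volume[@name="%s"]/chapter)' % volume))

print(fastChapterCount(tree, "volume1name"))  # -> 2
```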


(PS: For reference, this is my old getChapter function, which I stopped using since I don't want to refer to an individual 'li' in the code and want the code to work with any generic XML file. It was considerably faster than the current getChapter function, though!:



def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()' % (volume, chapter))
    if data == []:
        data = None
    return data
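(As a sketch of how the fast XPath version might be kept without hard-coding 'li': the wildcard * matches any child element of a chapter. This is my guess at what "generic" would need, assuming chapter content always lives in the text of the chapter's direct children; genericGetChapter and the inline sample document are illustrative names, not from the app.)

```python
from lxml import etree

# Illustrative sample document with the same structure as the file above
xml = b"""
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>
"""
tree = etree.fromstring(xml)

def genericGetChapter(volume, chapter):
    # '*' matches any child element of <chapter>, so this no longer
    # depends on the children being named <li>
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/*/text()' % (volume, chapter))
    return data or None

print(genericGetChapter("volume1name", 1))  # -> ['li 1 content', 'li 2 content']
```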


Thanks a lot!

