Python XML Parsing Algorithm Speed



I'm currently parsing a large XML file of the following form in a Python/Flask webapp on Heroku:



<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>


The code that I use to parse it, analyze it, and display it through Flask is the following:



from flask import render_template, session
from lxml import etree

parser = etree.XMLParser(recover=True)
with open("books/filename.xml") as f:
    tree = etree.parse(f, parser)
root = tree.getroot()

def getChapter(volume, chapter):
    # Walk the chapter's children by index until we run off the end
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter - 1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data is None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    # Map 1-based volume numbers to volume names
    names = volumeList()
    numbering = {}
    for i, name in enumerate(names, start=1):
        numbering[i] = name
    return numbering

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'], session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")


The issue that I am having is that whenever Flask tries to render a volume with many chapters (whenever chapterCount(session['volume']) is above about 50 or so), loading and processing the page takes a very long time. By comparison, if the app loads a volume with, say, under 10-15 chapters, loading is almost instantaneous, even as a live webapp. Is there a good way to optimize this and improve the speed and performance? Thanks a lot!
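(To illustrate where the time goes: each page render calls chapterCount, which calls getChapter once per chapter, and each getChapter call in turn re-runs volumeList and walks the children one index at a time, so the work grows roughly with chapters × items. A minimal sketch of one possible optimization, assuming the same lxml tree as above: counting chapters with a single XPath count() so no Python-level loop runs. The function name fastChapterCount and the inline sample document are illustrative, not from the app.)

```python
from lxml import etree

# Illustrative sample document with the same structure as the file above
xml = b"""
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li></chapter>
  </volume>
</book>
"""
tree = etree.fromstring(xml)

def fastChapterCount(tree, volume):
    # count() is evaluated inside libxml2 in one pass, instead of
    # calling getChapter() once per chapter from Python
    return int(tree.xpath('count(//volume[@name="%s"]/chapter)' % volume))

print(fastChapterCount(tree, "volume1name"))  # -> 2
```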


(PS: For reference, this is my old getChapter function, which I stopped using since I don't want to refer to an individual 'li' in the code and want the code to work with any generic XML file. It was considerably faster than the current getChapter function, though!:



def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()' % (volume, chapter))
    if data == []:
        data = None
    return data
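(As a sketch of how the fast XPath version might be kept without hard-coding 'li': the wildcard * matches any child element of a chapter. This is my guess at what "generic" would need, assuming chapter content always lives in the text of the chapter's direct children; genericGetChapter and the inline sample document are illustrative names, not from the app.)

```python
from lxml import etree

# Illustrative sample document with the same structure as the file above
xml = b"""
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>
"""
tree = etree.fromstring(xml)

def genericGetChapter(volume, chapter):
    # '*' matches any child element of <chapter>, so this no longer
    # depends on the children being named <li>
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/*/text()' % (volume, chapter))
    return data or None

print(genericGetChapter("volume1name", 1))  # -> ['li 1 content', 'li 2 content']
```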


Thanks a lot!

