Scraping data using XPath with Python lxml



I am trying to scrape data from a PDF file to generate statistics for a programme. The PDF file looks like this: http://ift.tt/VVEEG0 (page 177)


What I want is the total score of each candidate, so that I can calculate the corresponding aggregate percentage. Currently, I convert the PDF to XML and use XPath to navigate the data. The problem is that the marks are not children of the candidate element, and they have no unique attributes within the page. Is there a way to start an XPath query from the position just after a particular text element?
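To make the question concrete, here is a minimal sketch of the kind of query I mean, using the XPath `following-sibling::` axis in lxml. The XML layout (flat `<text>` elements under a `<page>`), the `Candidate:` label, and the function name `marks_by_candidate` are assumptions for illustration only; the real converter output will differ.

```python
from lxml import etree

# Hypothetical sample: a flat list of <text> siblings, as a PDF-to-XML
# converter might emit. The real file's structure is an assumption here.
SAMPLE = b"""<page number="177">
  <text>Candidate: 1001</text>
  <text>45</text>
  <text>52</text>
  <text>Candidate: 1002</text>
  <text>60</text>
  <text>38</text>
</page>"""


def marks_by_candidate(xml_bytes):
    """Sum the numeric <text> siblings that follow each candidate element."""
    root = etree.fromstring(xml_bytes)
    totals = {}
    # Find every <text> element whose content starts with the label.
    for cand in root.xpath('//text[starts-with(normalize-space(.), "Candidate:")]'):
        cand_id = cand.text.split(":", 1)[1].strip()
        total = 0
        # Evaluate XPath relative to the candidate: all later siblings,
        # returned in document order.
        for node in cand.xpath("following-sibling::text"):
            value = (node.text or "").strip()
            if value.startswith("Candidate:"):
                break  # reached the next candidate, stop collecting
            if value.isdigit():
                total += int(value)
        totals[cand_id] = total
    return totals


totals = marks_by_candidate(SAMPLE)
print(totals)
```

Calling `xpath()` on an element rather than on the root is what makes the query start from that element. The aggregate percentage would then be each total divided by the maximum possible marks, which I have not included because it depends on the programme.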


Am I doing it right by converting to XML? Is there any other way to do this?

