I'm trying to parse a webpage in Python: an AJAX response which basically looks like this XML:
    <table class="tab02">
      <tr>
        <th>Skrót</th>
        <th>Pełna nazwa</th>
      </tr>
      <tr>
        <td><a href="http://www.gpw.pl/karta_spolki/PLATAL000046/">1AT</a></td>
        <td><a href="http://www.gpw.pl/karta_spolki/PLATAL000046/">ATAL SPÓŁKA AKCYJNA</a></td>
      </tr>
    </table>

If I put this markup in a Python file as a variable and parse it with the lxml library (see below), everything is parsed successfully and the whole result is well formatted:
    from lxml import etree

    # xml holds the table markup shown above
    root = etree.fromstring(xml)
    print(etree.tounicode(root))
    # print(etree.tostring(root))

The problem appears when parsing the data from the webpage (see the example code below):
    # link2page is the address of the AJAX response
    magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    root = etree.parse(link2page, magical_parser)
    print(etree.tounicode(root))

As a result, all the < and > characters from the table are changed to &lt; and &gt;:

    <response> <html> &lt;table class="tab02"&gt; &lt;tr&gt; &lt;th&gt;Skrót&lt;/th&gt; &lt;th&gt;Pełna nazwa&lt;/th&gt; &lt;/tr&gt; etc.

I've also tried fetching the link with urllib first and parsing it as HTML, but I fail every time. Can anyone give me a hint, please?
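One possible reading of the output above is that the server wraps the table as entity-escaped text inside the <html> element, so the XML parser correctly sees it as a plain string and not as child elements. Under that assumption, a minimal sketch is to parse the outer response, take the text of <html> (the parser unescapes it), and parse that text a second time. The response_body string and the extract_rows helper below are hypothetical, made up for illustration. The sketch uses the standard library's xml.etree.ElementTree so it runs without extra dependencies; with lxml the corresponding calls would be etree.fromstring for the outer document and lxml.html.fromstring for the inner markup, which is more forgiving of real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the AJAX response body. Assumption: the table
# markup arrives entity-escaped inside <response><html>...</html></response>.
response_body = (
    '<response><html>'
    '&lt;table class="tab02"&gt;'
    '&lt;tr&gt;&lt;th&gt;Skrót&lt;/th&gt;&lt;th&gt;Pełna nazwa&lt;/th&gt;&lt;/tr&gt;'
    '&lt;tr&gt;'
    '&lt;td&gt;&lt;a href="http://www.gpw.pl/karta_spolki/PLATAL000046/"&gt;1AT&lt;/a&gt;&lt;/td&gt;'
    '&lt;td&gt;&lt;a href="http://www.gpw.pl/karta_spolki/PLATAL000046/"&gt;ATAL SPÓŁKA AKCYJNA&lt;/a&gt;&lt;/td&gt;'
    '&lt;/tr&gt;'
    '&lt;/table&gt;'
    '</html></response>'
)


def extract_rows(body):
    """Return one list of (link text, href) pairs per data row of the table."""
    # First pass: parse the outer XML wrapper.
    outer = ET.fromstring(body)
    # The text of <html> comes back unescaped, i.e. as real markup.
    inner_markup = outer.findtext('html')
    # Second pass: parse the unescaped table markup itself.
    # Assumption: the inner markup is well-formed enough to parse as XML.
    table = ET.fromstring(inner_markup)
    rows = []
    for tr in table.iter('tr'):
        links = tr.findall('td/a')
        if links:  # the header row has no <td> cells, so it is skipped
            rows.append([(a.text, a.get('href')) for a in links])
    return rows


rows = extract_rows(response_body)
for row in rows:
    print(row)
```

If the real response nests the payload differently, only the path passed to findtext would need to change.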