XML: parsing an AJAX response (HTML/XML) with lxml changes < > characters to &lt; &gt;

I'm trying to parse a webpage in Python: an AJAX response which basically looks like this XML:

    <table class="tab02">
        <tr>
            <th>Skrót</th>
            <th>Pełna nazwa</th>
        </tr>
        <tr>
            <td><a href="http://www.gpw.pl/karta_spolki/PLATAL000046/">1AT</a></td>
            <td><a href="http://www.gpw.pl/karta_spolki/PLATAL000046/">ATAL SPÓŁKA AKCYJNA</a></td>
        </tr>
    </table>

Link: http://www.gpw.pl/ajaxindex.php?action=GPWCompanySearch&start=listForLetter&letter=A&listTemplateName=GPWCompanySearch%2FajaxList_PL

If I put this markup in a Python file as a variable and parse it with the lxml library (see below), everything parses successfully and the whole result is well formatted:

    from lxml import etree

    # xml holds the table markup shown above, as a string
    root = etree.fromstring(xml)
    print(etree.tounicode(root))  # or: print(etree.tostring(root))

The problem occurs when parsing the data from the webpage itself (see the example code below):

    from lxml import etree

    # recover=True asks lxml to keep going past malformed markup
    magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    root = etree.parse(link2page, magical_parser)
    print(etree.tounicode(root))

As a result, all the < and > characters from the table are changed to &lt; and &gt;:

    <response>
    <html>
    &lt;table class="tab02"&gt;
        &lt;tr&gt;
            &lt;th&gt;Skrót&lt;/th&gt;
            &lt;th&gt;Pełna nazwa&lt;/th&gt;
        &lt;/tr&gt;
    etc.

I've also tried fetching the link with urllib first, and parsing it as HTML, but I fail every time. Can anyone give me a hint, please?
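
One possible reading of the output above, offered as a sketch and not a confirmed diagnosis: the server may wrap the table as escaped text (or CDATA) inside the `<html>` element of a `<response>` envelope. In that case the parser sees a single text node, not child elements, and serializing the tree escapes that text again. Under that assumption, the text of `<html>` can be parsed a second time. `SAMPLE_RESPONSE` and `extract_table` are hypothetical names made up for this illustration, and the sample string only imitates the output shown above.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch also runs without lxml installed
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the server's reply: the table is carried as
# escaped text inside <html>, not as real child elements.
SAMPLE_RESPONSE = (
    "<response><html>"
    "&lt;table class=\"tab02\"&gt;"
    "&lt;tr&gt;&lt;th&gt;Skrót&lt;/th&gt;&lt;th&gt;Pełna nazwa&lt;/th&gt;&lt;/tr&gt;"
    "&lt;/table&gt;"
    "</html></response>"
)


def extract_table(response_xml):
    """Parse the outer envelope, then parse the text of <html> as markup."""
    envelope = etree.fromstring(response_xml)
    # findtext returns the unescaped string, i.e. '<table class="tab02">...'
    inner_markup = envelope.findtext("html")
    return etree.fromstring(inner_markup)


table = extract_table(SAMPLE_RESPONSE)
headers = [th.text for th in table.iter("th")]
print(headers)
```

If the inner markup turns out not to be well-formed XML, `lxml.html.fromstring` is a more forgiving choice for the second parsing step.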
