I know there are many ways to do this using third-party libraries such as resources, pyparsing, selenium, etc., but I'm looking for a quick and dirty way to do it without any third-party modules.
Basically, what I want to do is take the HTML from a webpage's page source and parse it into XML (probably using xml.etree.ElementTree). I've tried this:
import urllib.request
import xml.etree.ElementTree as ET
data = urllib.request.urlopen(website)
# read() has to be called; it returns bytes, which fromstring() accepts
tree = ET.fromstring(data.read())
However, when I do this I get either mismatched-tag errors or an unknown-symbol error complaining about the UTF-8 encoding, even though the page source is definitely UTF-8. I was under the assumption that a functioning HTML page wouldn't have mismatched tags, so I'm thinking there's something I'm missing.
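I'm guessing part of the problem is that real-world HTML usually isn't well-formed XML (unclosed tags, bare & characters, and so on), which would explain the errors ElementTree is giving me. If that's the case, maybe the standard library's html.parser is the better fit. Here's a rough sketch of what I have in mind; the example URL and the handler that just collects link hrefs are placeholders for the small set of information I actually need:

from html.parser import HTMLParser
import urllib.request

class LinkCollector(HTMLParser):
    # Placeholder extraction logic: collect href values from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

with urllib.request.urlopen("https://example.com/") as response:
    # Decode using the charset the server declares, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"
    page = response.read().decode(charset, errors="replace")

parser = LinkCollector()
parser.feed(page)
print(parser.links)

That stays within the standard library, but it's more verbose than the ElementTree version I was hoping for.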
The whole reason I don't want to use a third-party library is that I only need to grab a small amount of information, and I don't think that's enough to justify pulling in another module.