Parse HTML source code into an XML tree



I know there are many ways to do this using third-party libraries such as resources, pyparsing, selenium, etc., but I'm looking for a quick and dirty way to do it without any third-party modules.


Basically, what I want to do is take the HTML from a webpage's page source and parse it into XML (probably using xml.etree.ElementTree). I've tried this:



import urllib.request
import xml.etree.ElementTree as ET

# website holds the URL of the page to fetch
data = urllib.request.urlopen(website)
# read() must be called to get the page bytes
tree = ET.fromstring(data.read())


However, when I do this I get either a mismatched-tag error or an unknown-symbol error for UTF-8, which is definitely the page's encoding. I had assumed that a functioning HTML page wouldn't have mismatched tags, so I think I'm missing something.
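
For illustration, here is a minimal standard-library sketch of one possible way around the mismatched-tag errors. Valid HTML allows void elements such as <br> and <img> that never get a closing tag, which a strict XML parser reports as a mismatch. The sketch feeds the page through the lenient html.parser.HTMLParser and builds the tree with ET.TreeBuilder. The names VOID_TAGS, TreeBuildingParser, and html_to_tree, and the synthetic "document" root element, are hypothetical choices made for this example, and the sample string stands in for a real page. It does not implement HTML's full implicit-closing rules (for example, a second <p> nests inside an unclosed first one), so treat it as a rough starting point rather than a complete HTML-to-XML converter.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that HTML allows without a closing tag (assumed list for this sketch)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Hypothetical helper: turns lenient HTML events into an ElementTree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Synthetic root so stray top-level text or elements stay in one tree
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings
        self.builder.start(tag, {k: (v if v is not None else "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # void elements are closed immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, or closing tag of a void element
        # Close any elements left open inside the one being closed
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def close(self):
        super().close()
        while self.open_tags:  # close whatever the page left open
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def html_to_tree(html_text):
    parser = TreeBuildingParser()
    parser.feed(html_text)
    return parser.close()


sample = "<html><body><p>Hello<br>world <img src='a.png'></p><p>unclosed</body></html>"
root = html_to_tree(sample)
print([p.text for p in root.iter("p")])
```

With a real page, the text passed to html_to_tree would come from something like urllib.request.urlopen(website).read().decode("utf-8"), assuming the page really is UTF-8 encoded.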


The whole reason I don't want to use a third-party library is that I only need to grab a small set of information, and I don't think that's enough to justify another module.

