This might be a real beginner question, but I don't get an error, so I don't know what's going on.
This is my code:
# -*- coding: utf-8 -*-
import urllib2
from urllib2 import urlopen
import re
import cookielib
from cookielib import CookieJar
import time

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    with open('word_list.txt') as f:
        word_list = f.readlines()
    try:
        pages = open('rss_sources.txt').readlines()
        for rss_resource in pages:
            sourceCode = opener.open(rss_resource).read()
            #print sourceCode
            try:
                titles = re.findall(r'<title>(.*?)</title>', sourceCode)
                for title in titles:
                    if any(word.lower() in title.lower() for word in word_list):
                        print title
            except Exception, e:
                print str(e)
    except Exception, e:
        print str(e)

main()
My example RSS sources are:
http://www.finanzen.de/news/feed
http://www.welt.de/wirtschaft/?service=Rss
Issues: The first RSS source works fine on its own, and the script prints the titles that contain keywords from word_list.txt. But once I add the second RSS source to the .txt file, I get no output at all, and no error message either. Not even the first RSS resource gives me anything.
Is there a problem with the second resource? How would I handle that error? And why isn't the first resource parsed correctly?
Please point me in the right direction so I can take care of this :)
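One thing worth checking, offered as a possible cause rather than a confirmed diagnosis: readlines() keeps the trailing newline on every line it returns. With two entries in the sources file, the first URL would then end in '\n', and every keyword except possibly the last would too, so a keyword like 'euro\n' can never appear inside a single-line title. A minimal sketch of stripping the lines before use follows; the helper names read_clean_lines and matching_titles are hypothetical and not part of the original script, and the code avoids print so it runs on both Python 2 and Python 3.

```python
import re


def read_clean_lines(path):
    """Return the non-empty lines of a text file without surrounding whitespace.

    Hypothetical helper: iterating over a file (like readlines()) keeps the
    trailing newline on each line, so strip() is applied to every line and
    blank lines are skipped.
    """
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def matching_titles(source_code, word_list):
    """Return the <title> contents that contain any keyword, ignoring case.

    Uses the same regular expression as the original script.
    """
    titles = re.findall(r'<title>(.*?)</title>', source_code)
    return [title for title in titles
            if any(word.lower() in title.lower() for word in word_list)]
```

In the script above, this would presumably replace the two readlines() calls, for example word_list = read_clean_lines('word_list.txt') and pages = read_clean_lines('rss_sources.txt'). Skipping blank lines also matters for the keyword check, because an empty string counts as contained in every title.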