Why is XMLFeedSpider failing to iterate through the designated nodes?



I'm trying to parse PLoS's RSS feed to pick up new publications. The feed URL is the one used in start_urls below.


Below is my spider:



from scrapy.contrib.spiders import XMLFeedSpider


class PLoSSpider(XMLFeedSpider):
    name = "plos"
    itertag = 'entry'
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://ift.tt/1zUCcm0'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]

    def parse_node(self, response, node):
        pass


This configuration produces the following log output:



$ scrapy crawl plos
2015-02-06 00:19:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: plos)
2015-02-06 00:19:08+0100 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-02-06 00:19:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'plos.spiders', 'SPIDER_MODULES': ['plos.spiders'], 'BOT_NAME': 'plos'}
2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-06 00:19:08+0100 [scrapy] INFO: Enabled item pipelines:
2015-02-06 00:19:08+0100 [plos] INFO: Spider opened
2015-02-06 00:19:08+0100 [plos] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-06 00:19:08+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-06 00:19:08+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-06 00:19:09+0100 [plos] DEBUG: Crawled (200) <GET http://ift.tt/16m4Qi0> (referer: None)
2015-02-06 00:19:09+0100 [plos] ERROR: Spider error processing <GET http://ift.tt/16m4Qi0>
	Traceback (most recent call last):
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 638, in _tick
	    taskObj._oneWorkUnit()
	  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
	    result = next(self._iterator)
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 57, in <genexpr>
	    work = (callable(elem, *args, **named) for elem in iterable)
	--- <exception caught here> ---
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 96, in iter_errback
	    yield next(it)
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
	    for x in result:
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
	    return (_set_referer(r) for r in result or ())
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
	    return (r for r in result or () if _filter(r))
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
	    for selector in nodes:
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
	    for node in xmliter(response, self.itertag):
	  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/iterators.py", line 31, in xmliter
	    yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
	exceptions.IndexError: list index out of range

2015-02-06 00:19:09+0100 [plos] INFO: Closing spider (finished)
2015-02-06 00:19:09+0100 [plos] INFO: Dumping Scrapy stats:
	{'downloader/request_bytes': 282,
	 'downloader/request_count': 1,
	 'downloader/request_method_count/GET': 1,
	 'downloader/response_bytes': 7590,
	 'downloader/response_count': 1,
	 'downloader/response_status_count/200': 1,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2015, 2, 5, 23, 19, 9, 379574),
	 'log_count/DEBUG': 3,
	 'log_count/ERROR': 1,
	 'log_count/INFO': 7,
	 'response_received_count': 1,
	 'scheduler/dequeued': 1,
	 'scheduler/dequeued/memory': 1,
	 'scheduler/enqueued': 1,
	 'scheduler/enqueued/memory': 1,
	 'spider_exceptions/IndexError': 1,
	 'start_time': datetime.datetime(2015, 2, 5, 23, 19, 8, 834428)}
2015-02-06 00:19:09+0100 [plos] INFO: Spider closed (finished)
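
For what it's worth, the failing line in the traceback shows xmliter selecting nodes with Selector(text=nodetext, type='xml').xpath('//' + nodename), so the iteration should be roughly equivalent to the manual check below (a sketch; response.xpath is the Scrapy 0.24 selector shortcut, and the local-name() expression is just a namespace-agnostic alternative I'm including for comparison):

$ scrapy shell 'http://ift.tt/1zUCcm0?unformattedQuery=*%3A*&sort=Date%2C+newest+first'
>>> response.xpath('//entry')                    # what itertag = 'entry' boils down to
>>> response.xpath('//*[local-name()="entry"]')  # matches <entry> in any namespace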


Changing itertag = "entry" to itertag = "//entry" makes the error go away, but still no items are scraped. I also tried using scrapy.log.msg to log a message from within parse_node, but nothing appears in the log and the stats report no scraped nodes, as shown in the sketch below.
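
Concretely, the logging variant looks like this (a minimal sketch; the message text is illustrative, but log.msg is the stock logging call in Scrapy 0.24):

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider


class PLoSSpider(XMLFeedSpider):
    name = "plos"
    itertag = '//entry'  # plain 'entry' raises the IndexError shown above
    allowed_domains = ["plosone.org"]
    start_urls = [
        ('http://ift.tt/1zUCcm0'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]

    def parse_node(self, response, node):
        # This message never appears in the crawl log, which suggests
        # parse_node is never called at all.
        log.msg('parse_node got: %s' % node.extract()[:80], level=log.INFO)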


What am I doing wrong?

