I am trying to scrape the x and y coordinates of the shots taken in the match between Everton and Aston Villa from the Squawka web page: http://ift.tt/1t6mgXk.
I've used the Firebug element inspector to obtain the XPaths for the circles (e.g. /html/body/div[2]/div[3]/div[2]/div[1]/div/div[15]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/svg/g[22]/circle). The pixel coordinates for each shot circle are contained in the cx and cy attributes.
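For what it's worth, I can pull cx/cy out of a hand-written SVG snippet with Python's standard library, so I believe the attribute idea itself is sound. (This is my own toy snippet, not the live page, and it omits the xmlns declaration that real inline SVG may carry, which would require namespace handling.)

```python
# Toy sanity check: the coordinates live in attributes, not element text.
import xml.etree.ElementTree as ET

svg = """
<svg>
  <g><circle cx="12.5" cy="34.0" r="5"/></g>
  <g><circle cx="56.0" cy="78.5" r="5"/></g>
</svg>
"""

root = ET.fromstring(svg)
# Read the cx/cy attribute values of every circle under a g element.
coords = [(c.get('cx'), c.get('cy')) for c in root.findall('.//g/circle')]
print(coords)  # [('12.5', '34.0'), ('56.0', '78.5')]
```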
I have tried to scrape these values using the Scrapy module in Python, but without success. I am very new to this and have essentially adapted the code from the Scrapy tutorial. The items file:
import scrapy

class SquawkaItem(scrapy.Item):
    cx = scrapy.Field()
    cy = scrapy.Field()
The spider file:
import scrapy
from squawka.items import SquawkaItem

class SquawkaSpider(scrapy.Spider):
    name = "squawka"
    allowed_domains = ["squawka.com"]
    start_urls = ["http://ift.tt/1t6mgXk"]

    def parse(self, response):
        for sel in response.xpath('/html/body/div/div/div/div/div/div/div/div/div/div/div/div/svg/g/circle'):
            cx = sel.xpath('[@cx]').extract()
            cy = sel.xpath('[@cy]').extract()
            print cx, cy
When I run this spider from my Linux terminal with the 'scrapy crawl squawka' command, I get the following output:
2014-10-26 12:49:53+0000 [scrapy] INFO: Scrapy 0.25.0-222-g675fd5b started (bot: squawka)
2014-10-26 12:49:53+0000 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-10-26 12:49:53+0000 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'squawka.spiders', 'SPIDER_MODULES': ['squawka.spiders'], 'BOT_NAME': 'squawka'}
2014-10-26 12:49:54+0000 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, SpiderState
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-10-26 12:49:55+0000 [scrapy] INFO: Enabled item pipelines:
2014-10-26 12:49:55+0000 [squawka] INFO: Spider opened
2014-10-26 12:49:55+0000 [squawka] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-10-26 12:49:55+0000 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-10-26 12:49:56+0000 [squawka] DEBUG: Crawled (200) <GET http://ift.tt/12JrT5a> (referer: None)
2014-10-26 12:49:56+0000 [squawka] INFO: Closing spider (finished)
2014-10-26 12:49:56+0000 [squawka] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 300,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16169,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 10, 26, 12, 49, 56, 402920),
'log_count/DEBUG': 1,
'log_count/INFO': 3,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 10, 26, 12, 49, 55, 261954)}
2014-10-26 12:49:56+0000 [squawka] INFO: Spider closed (finished)
As you can see, the spider does fetch the page (one 200 response in the stats), but it scrapes no items and prints no data. I've got no idea how to change my code to get the data I want. Any suggestions for changes to my code, or other techniques I could use, would be gratefully received. Thanks.
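One other thing I noticed in the stats is that the response is only 16,169 bytes, which makes me wonder whether the shot circles are even present in the raw HTML or are injected later by JavaScript. This is a quick check I could run on a saved copy of the response body (the count_circles helper and the sample string here are mine, just to illustrate):

```python
import re

def count_circles(html):
    """Count <circle ...> opening tags in a raw HTML/SVG string."""
    return len(re.findall(r'<circle\b', html))

# Hypothetical sample standing in for a saved response body.
sample = '<svg><g><circle cx="1" cy="2"/></g></svg>'
print(count_circles(sample))  # 1
```

If that count came back as 0 on the real response, no XPath would ever match, regardless of syntax.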