python - Not able to go on with the Request object and callback mechanism of a Scrapy spider-crawler
I am new to Scrapy and Python. This is my spider-crawler:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from tutorial.settings import *
from tutorial.items import *

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["m.timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)
        torrent = DmozItem()
        items = []
        links = sel.xpath('//div[@class="gapleftm"]/ul[@class="content"]/li')
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            yield DmozItem(link=url)
            yield Request(url, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('A response from my_parse arrived!')
        for text in sel.xpath("//body/text()").extract():
            yield DmozItem(desc=text)
        pass
Here I am trying to collect the URLs from the a tags and call a callback function for each of them, but the code fails to enter the my_parse function. Am I missing something?
This is the console log:
root@yogesh-system-model:~/pythontest/tutorial# scrapy crawl dmoz -o mypune13.txt
2014-02-06 16:15:01+0530 [scrapy] info: scrapy 0.22.0 started (bot: tutorial)
2014-02-06 16:15:01+0530 [scrapy] info: optional features available: ssl, http11, boto, django
2014-02-06 16:15:01+0530 [scrapy] info: overridden settings: {'newspider_module': 'tutorial.spiders', 'spider_modules': ['tutorial.spiders'], 'feed_uri': 'mypune13.txt', 'bot_name': 'tutorial'}
2014-02-06 16:15:01+0530 [scrapy] info: enabled extensions: feedexporter, logstats, telnetconsole, closespider, webservice, corestats, spiderstate
2014-02-06 16:15:02+0530 [scrapy] info: enabled downloader middlewares: httpauthmiddleware, downloadtimeoutmiddleware, useragentmiddleware, retrymiddleware, defaultheadersmiddleware, metarefreshmiddleware, httpcompressionmiddleware, redirectmiddleware, cookiesmiddleware, chunkedtransfermiddleware, downloaderstats
2014-02-06 16:15:02+0530 [scrapy] info: enabled spider middlewares: httperrormiddleware, offsitemiddleware, referermiddleware, urllengthmiddleware, depthmiddleware
2014-02-06 16:15:02+0530 [scrapy] info: enabled item pipelines:
2014-02-06 16:15:02+0530 [dmoz] info: spider opened
2014-02-06 16:15:02+0530 [dmoz] info: crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-06 16:15:02+0530 [scrapy] debug: telnet console listening on 0.0.0.0:6023
2014-02-06 16:15:02+0530 [scrapy] debug: web service listening on 0.0.0.0:6080
2014-02-06 16:15:03+0530 [dmoz] debug: crawled (200) <get http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> (referer: none)
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'front page'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times city'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times nation'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'auto expo 2014'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times global'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'editorial'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times business'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times sport'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'pune times'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'news digest'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'cong denied pranab chance pm: modi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'mom, daughter badly hurt in mishap @ theme park'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'13 indians head major global firms,4 studied @ st stephens'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'9.7cr new voters added across india in 5 years'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'exit bond money afmc grads hiked rs 30 lakh'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'sc revisiting death sentences, stays 3 more'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'jr college teachers call off hsc exams boycott plan'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'tourists 180 countries visa on arrival now'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'50 of 58 new rajya sabha members crorepatis'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'2g spectrum bids touch rs 50,000 crore'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'discoms loss may tata powers gain'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'colleges, schools work till last min give hall tickets'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'front page'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times city'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times nation'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'auto expo 2014'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times global'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'editorial'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times business'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times sport'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'pune times'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: filtered offsite request 'mobiletoi.timesofindia.com': <get http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi>
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+city&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+city&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+nation&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+nation&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=auto+expo+2014&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+auto+expo+2014&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+global&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+global&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=editorial&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+editorial&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+business&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+business&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+sport&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+sport&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=pune+times&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+pune+times&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00300&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00301&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00302&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] info: closing spider (finished)
2014-02-06 16:15:03+0530 [dmoz] info: stored jsonlines feed (62 items) in: mypune13.txt
2014-02-06 16:15:03+0530 [dmoz] info: dumping scrapy stats: {'downloader/request_bytes': 279, 'downloader/request_count': 1, 'downloader/request_method_count/get': 1, 'downloader/response_bytes': 11226, 'downloader/response_count': 1, 'downloader/response_status_count/200': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2014, 2, 6, 10, 45, 3, 542688), 'item_scraped_count': 62, 'log_count/debug': 66, 'log_count/info': 8, 'request_depth_max': 1, 'response_received_count': 1, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2014, 2, 6, 10, 45, 2, 127946)}
2014-02-06 16:15:03+0530 [dmoz] info: spider closed (finished)
Your console log shows that the request to http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi was filtered:
filtered offsite request 'mobiletoi.timesofindia.com'
Scrapy has OffsiteMiddleware enabled by default:
This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute.
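To make the host-name check concrete, here is a minimal sketch of the kind of comparison involved. It is a simplified illustration, not Scrapy's actual implementation: the function name is_offsite is hypothetical, and the rule assumed here is that a host is allowed when it equals an entry in allowed_domains or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is not covered by allowed_domains.

    Hypothetical helper for illustration only; it mimics the idea of the
    offsite check, not Scrapy's real code.
    """
    if not allowed_domains:
        # With no allowed_domains set, nothing is treated as offsite.
        return False
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # Allowed if the host is the domain itself or one of its subdomains.
        if host == domain or host.endswith("." + domain):
            return False
    return True


url = "http://mobiletoi.timesofindia.com/mobile.aspx?article=yes"
print(is_offsite(url, ["m.timesofindia.com"]))
print(is_offsite(url, ["m.timesofindia.com", "mobiletoi.timesofindia.com"]))
```

Note that mobiletoi.timesofindia.com is a sibling of m.timesofindia.com, not a subdomain of it, which is why the first check reports the URL as offsite.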
You need to include 'mobiletoi.timesofindia.com' in allowed_domains, like this:
allowed_domains = ["m.timesofindia.com", "mobiletoi.timesofindia.com"]
Otherwise, the Scrapy spider middleware OffsiteMiddleware will receive the requests you yield with yield Request(url, callback=self.my_parse), see that the domain doesn't match, and discard them, with no callback being called at all.
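The effect on the spider's output can be sketched with a toy filter that runs without Scrapy. Everything here is an assumption made for illustration: FakeRequest is a hypothetical stand-in for scrapy.http.Request, and filter_offsite only imitates the idea of a spider middleware that passes items through and drops requests to hosts outside allowed_domains.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Hypothetical stand-in for scrapy.http.Request, just enough for this sketch.
FakeRequest = namedtuple("FakeRequest", ["url", "callback"])


def filter_offsite(spider_output, allowed_domains):
    """Yield items unchanged; yield requests only if their host is allowed."""
    for obj in spider_output:
        if isinstance(obj, FakeRequest):
            host = urlparse(obj.url).hostname or ""
            allowed = not allowed_domains or any(
                host == d or host.endswith("." + d) for d in allowed_domains
            )
            if not allowed:
                # Dropped here, so the callback is never scheduled.
                continue
        yield obj


def parse_output():
    """Imitates what parse() yields for one link: an item, then a request."""
    url = "http://mobiletoi.timesofindia.com/mobile.aspx?article=yes"
    yield {"link": url}
    yield FakeRequest(url, callback="my_parse")


before = list(filter_offsite(parse_output(), ["m.timesofindia.com"]))
after = list(filter_offsite(
    parse_output(), ["m.timesofindia.com", "mobiletoi.timesofindia.com"]
))
print(len(before), len(after))
```

With only "m.timesofindia.com" allowed, the item survives but the request is dropped, which matches the log: items are scraped, yet my_parse is never reached. Adding "mobiletoi.timesofindia.com" lets the request through.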