python - Not able to go on with the Request object and callback mechanism of Scrapy spider-crawler


I am new to Scrapy and Python. This is my spider-crawler:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from tutorial.settings import *
from tutorial.items import *


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["m.timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)
        torrent = DmozItem()
        items = []
        links = sel.xpath('//div[@class="gapleftm"]/ul[@class="content"]/li')
        # One item per link title.
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)
        # One item per link URL, plus a follow-up request for that URL.
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            yield DmozItem(link=url)
            yield Request(url, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('a response my_parse arrived!')
        for text in sel.xpath("//body/text()").extract():
            yield DmozItem(desc=text)
            pass

Here I am trying to collect the URLs in the a tags and call the callback function for each of them, but the code fails to enter the my_parse function. I must be missing something.

This is the console log:

root@yogesh-system-model:~/pythontest/tutorial# scrapy crawl dmoz -o mypune13.txt
2014-02-06 16:15:01+0530 [scrapy] info: scrapy 0.22.0 started (bot: tutorial)
2014-02-06 16:15:01+0530 [scrapy] info: optional features available: ssl, http11, boto, django
2014-02-06 16:15:01+0530 [scrapy] info: overridden settings: {'newspider_module': 'tutorial.spiders', 'spider_modules': ['tutorial.spiders'], 'feed_uri': 'mypune13.txt', 'bot_name': 'tutorial'}
2014-02-06 16:15:01+0530 [scrapy] info: enabled extensions: feedexporter, logstats, telnetconsole, closespider, webservice, corestats, spiderstate
2014-02-06 16:15:02+0530 [scrapy] info: enabled downloader middlewares: httpauthmiddleware, downloadtimeoutmiddleware, useragentmiddleware, retrymiddleware, defaultheadersmiddleware, metarefreshmiddleware, httpcompressionmiddleware, redirectmiddleware, cookiesmiddleware, chunkedtransfermiddleware, downloaderstats
2014-02-06 16:15:02+0530 [scrapy] info: enabled spider middlewares: httperrormiddleware, offsitemiddleware, referermiddleware, urllengthmiddleware, depthmiddleware
2014-02-06 16:15:02+0530 [scrapy] info: enabled item pipelines:
2014-02-06 16:15:02+0530 [dmoz] info: spider opened
2014-02-06 16:15:02+0530 [dmoz] info: crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-06 16:15:02+0530 [scrapy] debug: telnet console listening on 0.0.0.0:6023
2014-02-06 16:15:02+0530 [scrapy] debug: web service listening on 0.0.0.0:6080
2014-02-06 16:15:03+0530 [dmoz] debug: crawled (200) <get http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> (referer: none)
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'front page'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times city'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times nation'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'auto expo 2014'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times global'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'editorial'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times business'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times sport'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'pune times'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'news digest'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'cong denied pranab chance pm: modi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'mom, daughter badly hurt in mishap @ theme park'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'13 indians head major global firms,4 studied @ st stephens'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'9.7cr new voters added across india in 5 years'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'exit bond money afmc grads hiked rs 30 lakh'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'sc revisiting death sentences, stays 3 more'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'jr college teachers call off hsc exams boycott plan'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'tourists 180 countries visa on arrival now'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'50 of 58 new rajya sabha members crorepatis'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'2g spectrum bids touch rs 50,000 crore'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'discoms loss may tata powers gain'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'colleges, schools work till last min give hall tickets'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'front page'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times city'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times nation'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'auto expo 2014'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times global'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'editorial'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times business'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'times sport'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'title': u'pune times'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: filtered offsite request 'mobiletoi.timesofindia.com': <get http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi>
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+city&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+city&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+nation&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+nation&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=auto+expo+2014&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+auto+expo+2014&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+global&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+global&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=editorial&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+editorial&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+business&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+business&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=times+sport&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+times+sport&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=pune+times&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+pune+times&edname=&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00300&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00301&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] debug: scraped <200 http://mobiletoi.timesofindia.com/htmldbtoi/toipu/20140206/toipu_articles__20140206.html> {'link': u'http://mobiletoi.timesofindia.com/mobile.aspx?article=yes&pageid=3&sectid=edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune&edname=&articleid=ar00302&publabel=toi'}
2014-02-06 16:15:03+0530 [dmoz] info: closing spider (finished)
2014-02-06 16:15:03+0530 [dmoz] info: stored jsonlines feed (62 items) in: mypune13.txt
2014-02-06 16:15:03+0530 [dmoz] info: dumping scrapy stats:
    {'downloader/request_bytes': 279,
     'downloader/request_count': 1,
     'downloader/request_method_count/get': 1,
     'downloader/response_bytes': 11226,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 2, 6, 10, 45, 3, 542688),
     'item_scraped_count': 62,
     'log_count/debug': 66,
     'log_count/info': 8,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 2, 6, 10, 45, 2, 127946)}
2014-02-06 16:15:03+0530 [dmoz] info: spider closed (finished)

Your console log shows that the request to http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes&sectname=front+page&edid=&edlabel=toipu&mydatehid=06-02-2014&pubname=times+of+india+-+pune+-+front+page&edname=&publabel=toi was filtered:

filtered offsite request 'mobiletoi.timesofindia.com' 

Scrapy has OffsiteMiddleware enabled by default:

This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute.

You need to include 'mobiletoi.timesofindia.com' in allowed_domains, like this:

allowed_domains = ["m.timesofindia.com", "mobiletoi.timesofindia.com"] 

Otherwise, Scrapy's spider middleware OffsiteMiddleware will receive the requests you yield with yield Request(url, callback=self.my_parse), see that the domain doesn't match, and discard them, so no callback is called at all.
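
To see why the request is dropped, here is a rough standalone sketch of the kind of host check described above. It is not Scrapy's actual source: the function names build_host_regex and is_offsite are made up for illustration, and it assumes Python 3 and only the standard library. The idea is that a host passes if it equals one of the allowed domains or is a subdomain of one.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a pattern matching each allowed domain and any of its subdomains.

    Hypothetical helper for illustration; not part of Scrapy's public API.
    """
    if not allowed_domains:
        # With no allowed domains configured, nothing is treated as offsite.
        return re.compile("")
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is outside the allowed domains."""
    host = urlparse(url).hostname or ""
    return build_host_regex(allowed_domains).search(host) is None


url = "http://mobiletoi.timesofindia.com/mobile.aspx?sect_articles=yes"

# The original setting: the host is not m.timesofindia.com or a subdomain of it.
print(is_offsite(url, ["m.timesofindia.com"]))  # True

# The corrected setting from the answer.
print(is_offsite(url, ["m.timesofindia.com", "mobiletoi.timesofindia.com"]))  # False

# Under this subdomain rule, the parent domain alone would also cover both hosts.
print(is_offsite(url, ["timesofindia.com"]))  # False
```

If the check really works this way in your Scrapy version, listing just "timesofindia.com" in allowed_domains should be an equivalent and shorter fix, but the two-entry list from the answer is the safer, more explicit choice.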

