I'm crawling a website that has roughly 100 pages of data. My Scrapy code looks like this:
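import scrapy

# Both methods belong to the spider class.
def parse(self, response):
    # ... parse the data on the first page
    for page in range(2, 101):  # pretend there are 99 more pages
        url = "xxxxxxxxxx&page=%d" % page
        yield scrapy.Request(url, callback=self.parseChild)

def parseChild(self, response):
    # ... parse the data on the other 99 pages
    pass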
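A minimal sketch of how the 99 follow-up URLs could be built, assuming the page number is carried in a `page` query parameter as in the logged URL. The helper names `with_page` and `page_urls` and the `example.com` URL are placeholders, not part of my spider:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every parameter except an existing `page`, then append the new one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(first_url, last_page):
    """Yield the URLs for pages 2..last_page (page 1 is handled by parse)."""
    for page in range(2, last_page + 1):
        yield with_page(first_url, page)


# Placeholder URL; the real one is elided in the post.
urls = list(page_urls("https://example.com/list?count=15&page=1", 100))
print(len(urls), urls[0], urls[-1])
```

A bounded loop like this yields exactly 99 requests, whereas a literal `while True` would never stop producing them.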
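I attach a proxy (IPProxy) to every request. However, I've noticed that apart from the first page, which is scraped normally, the other 99 pages keep setting up their proxy information over and over. The log looks like this: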
+++++++++++++++++++++++++++++++++++++++++ Requesting via proxy http://221.231.91.164:4532,
“ https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=91 ”
-----------------------------separator--------------------------------- The retry messages follow:
“ 2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=68> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=67> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=66> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=65> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=64> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=63> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=62> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=61> (failed 1 times): User timeout caused connection failure.”
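One possible reading of the log above: a request rescheduled by Scrapy's retry middleware passes through the downloader middlewares again, so a proxy is assigned (and the "requesting via proxy" message printed) once per attempt, not once per page. The sketch below illustrates that with a stand-in request object instead of a real crawl. The class name `RotatingProxyMiddleware`, the `tried_proxies` meta key, and the second proxy address are assumptions for illustration; only `request.meta["proxy"]` and the `process_request(request, spider)` hook are Scrapy conventions.

```python
import random
from types import SimpleNamespace

# Hypothetical pool: the first address comes from the log above,
# the second is a placeholder.
PROXIES = ["http://221.231.91.164:4532", "http://127.0.0.1:8888"]


class RotatingProxyMiddleware:
    """Sketch of a Scrapy-style downloader middleware (hypothetical name)."""

    def __init__(self, proxies, rng=None):
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Runs for every request reaching the downloader, including retried
        # copies, so it executes once per attempt.
        tried = request.meta.setdefault("tried_proxies", [])
        # Prefer proxies this request has not used yet; fall back to the
        # whole pool once all of them have been tried.
        candidates = [p for p in self.proxies if p not in tried] or self.proxies
        proxy = self.rng.choice(candidates)
        tried.append(proxy)
        request.meta["proxy"] = proxy  # Scrapy reads the proxy from this key
        return None  # let Scrapy continue processing the request


# Stand-in for a scrapy.Request: only the attributes used above.
request = SimpleNamespace(url="https://example.com/list?page=68", meta={})
mw = RotatingProxyMiddleware(PROXIES, rng=random.Random(0))

mw.process_request(request, spider=None)  # first attempt
first = request.meta["proxy"]
mw.process_request(request, spider=None)  # simulated retry after a timeout
second = request.meta["proxy"]
print(first != second)
```

If the timeouts come from dead proxies, switching proxy on each attempt like this may help more than retrying through the same one. Scrapy's `RETRY_TIMES` and `DOWNLOAD_TIMEOUT` settings bound how many attempts are made and how long each one waits.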