I'm scraping a website that has roughly 100 pages of data. My Scrapy code looks like this:
def parse(self, response):
    # ... parse the data on the first page
    for page in range(2, 101):  # pretend there are 99 more pages
        url = "xxxxxxxxxx&page=%d" % page
        yield scrapy.Request(url, callback=self.parseChild)

def parseChild(self, response):
    # ... parse the data on the remaining 99 pages
    pass
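Every request gets a proxy assigned by my IPProxy middleware. However, only the first page is fetched successfully; for the other 99 pages the middleware keeps re-running its proxy setup over and over. The log looks like this: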
+++++++++++++++++++++++++++++++++++++++++requesting via proxy http://221.231.91.164:4532:
“ https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=91 ”
----------------------------- separator --------------------------------- The retry messages follow:
“ 2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=68> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=67> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=66> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=65> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=64> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=63> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=62> (failed 1 times): User timeout caused connection failure.
2018-11-07 16:05:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://recommd.xyq.cbg.163.com/cgi-bin/recommend.py?callback=Request.JSONP.request_map.request_0&_=1539830579182&level_min=109&level_max=109&expt_gongji=17&expt_fangyu=17&expt_fashu=17&expt_kangfa=17&bb_expt_gongji=17&bb_expt_fangyu=17&bb_expt_fashu=17&bb_expt_kangfa=17&act=recommd_by_role&count=15&search_type=overall_search_role&view_loc=overall_search&page=61> (failed 1 times): User timeout caused connection failure.”
1
ooh 2018-11-07 16:29:44 +08:00
Please post your code.
Or try a different proxy source: https://github.com/imWildCat/scylla

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # curl https://m.douban.com/book/subject/26628811/ -x http://127.0.0.1:8081
        request.meta['proxy'] = 'http://127.0.0.1:8081'
2
moxiaowei OP @ooh Here is my downloader middleware. Could you help me spot the problem? Thanks.
def process_request(self, request, spider):
    print("---------------------------------", request.url)
    canUsedProxyIps = list()
    proxyIps = proxyIpModel.getTwoHundredIp()  # fetch proxy addresses from the database
    print("-----------------------------------", proxyIps)
    if len(proxyIps) == 0:  # nothing came back from the database
        print("**********************database is empty, pulling proxies from Jiguang")
        ipList = self.fromJiguangGetProxyIps()
        for item in ipList:  # insert each one into the database
            proxyModel = proxyIpModel()
            proxyModel.ipAddr = item
            proxyModel.insertOne()
            canUsedProxyIps.append(item)
    else:  # not empty, so check whether each proxy is usable
        print("**********************database is not empty, verifying each proxy one by one")
        for item in proxyIps:  # verify in a loop
            if self.runTestProxy(item.ipAddr):  # the proxy works
                canUsedProxyIps.append(item.ipAddr)
            else:  # the proxy does not work
                item.deleteOne()  # delete this proxy from the database
    # if there are not enough usable proxies, ask Jiguang for more
    while len(canUsedProxyIps) <= 20:
        print("**********************fewer than 20 proxies, pulling more from Jiguang")
        ipList = self.fromJiguangGetProxyIps()
        for item in ipList:  # insert each one into the database
            proxyModel = proxyIpModel()
            proxyModel.ipAddr = item
            proxyModel.insertOne()
            canUsedProxyIps.append(item)
    suijiNum = random.randint(0, len(canUsedProxyIps)-1)  # random index into the list
    request.meta["proxy"] = canUsedProxyIps[suijiNum]
    print("+++++++++++++++++++++++++++++++++++++++++requesting via proxy " + canUsedProxyIps[suijiNum] + ": " + request.url)
    return None

The overall idea: if the database has no usable proxy IPs, pull 100 proxies from Jiguang, verify them one by one, and put the working ones into both the database and the canUsedProxyIps list. If the database does have proxy IPs, take 200 of them, verify those 200, and put the working ones into canUsedProxyIps. Then loop to check whether canUsedProxyIps holds 20 spare proxy IPs; if it has fewer than 20, pull another batch from Jiguang, verify them, and add the working ones to canUsedProxyIps and the database. Finally, set the proxy on the request.
3
m16FJ06l0m6I2amP 2018-11-07 17:02:10 +08:00
I think the problem lies in canUsedProxyIps = list(): canUsedProxyIps is a local variable, so the list is rebuilt from scratch for every single request.
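To illustrate the point about the local variable, here is a minimal sketch (not the OP's actual code) of a downloader middleware that keeps the verified pool on the instance and only refreshes it when it is empty or older than max_age seconds. The names CachedProxyMiddleware, fetch_proxies, is_alive and max_age are made up for this example; fetch_proxies and is_alive stand in for the OP's database/Jiguang lookup and runTestProxy. Wiring the constructor arguments into Scrapy (for instance through from_crawler) is left out.

```python
import random
import time


class CachedProxyMiddleware:
    """Sketch: keep the verified proxy pool on the instance, not in a local."""

    def __init__(self, fetch_proxies, is_alive, max_age=300.0):
        self.fetch_proxies = fetch_proxies  # hypothetical: () -> list of proxy URLs
        self.is_alive = is_alive            # hypothetical: proxy URL -> bool
        self.max_age = max_age              # seconds before the pool is re-verified
        self.pool = []                      # survives across process_request calls
        self.refreshed_at = None

    def _refresh(self):
        # Verify candidates once and remember when we did it.
        self.pool = [p for p in self.fetch_proxies() if self.is_alive(p)]
        self.refreshed_at = time.monotonic()

    def process_request(self, request, spider):
        stale = (
            self.refreshed_at is None
            or time.monotonic() - self.refreshed_at > self.max_age
        )
        if stale or not self.pool:
            self._refresh()
        if self.pool:
            request.meta["proxy"] = random.choice(self.pool)
        return None  # let Scrapy continue processing the request
```

With this shape the expensive verification runs once per refresh instead of once per request. The check itself is still synchronous, so it will still stall the crawl briefly each time it runs.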
8
ooh 2018-11-07 18:51:00 +08:00
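@moxiaowei But the requests going through the proxy are clearly failing. I suspect the proxy IPs have already expired by the time they are used, which is why I asked you to run curl -x by hand on the command line: see whether the request succeeds and whether the proxy has been blocked. Otherwise, how are you going to troubleshoot this?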
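If it helps, a rough Python equivalent of that manual curl -x check could look like the sketch below, using only the standard library. proxy_works is a hypothetical helper name and the 5 second default timeout is an arbitrary choice; it only tells you whether one request through the proxy succeeded, not why it failed.

```python
import http.client
import urllib.error
import urllib.request


def proxy_works(proxy_url, test_url, timeout=5.0):
    """Return True if test_url can be fetched through proxy_url."""
    # Route both http and https traffic through the given proxy.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connection, timeout, bad response, HTTP error status, etc.
        return False
```

For example, proxy_works('http://127.0.0.1:8081', 'https://m.douban.com/book/subject/26628811/') mirrors the curl command quoted earlier in the thread.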