Scrapy: some URLs occasionally return 404 while crawling. How do I fix this? - V2EX

This topic was created 2192 days ago; the information in it may have since developed or changed.

2019-01-05 10:34:15 [csrc][scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201806/t20180622_340245.htm> (referer: http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_9.html)
2019-01-05 10:34:15 [csrc][scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201806/t20180622_340245.htm>: HTTP status code is not handled or not allowed
2019-01-05 10:34:21 [csrc][scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201806/t20180622_340247.htm> (referer: http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_9.html)
2019-01-05 10:34:28 [csrc][scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.csrc.gov.cn/pub/zjhpublic/G00306202/201806/t20180622_340246.htm> (referer: http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_9.html)

URL: http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index.html

Some pages occasionally return 404. How can I fix this? I have already added all the parameters.

3 replies • 2019-01-06 16:06:11 +08:00

1

dreasky

2019-01-05 18:00:45 +08:00

Use Scrapy's retry middleware. Just change the retry status codes and the retry count in the settings file.
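
A minimal sketch of what that settings change might look like, assuming the project uses a standard settings.py. The setting names are Scrapy's own; the values are illustrative assumptions and do not come from this thread. The default retry codes vary by Scrapy version but do not include 404, so it has to be added explicitly.

```python
# settings.py (sketch; the values below are illustrative assumptions)

# Keep the built-in RetryMiddleware active (it is enabled by default).
RETRY_ENABLED = True

# Maximum number of retries per request, in addition to the first attempt.
RETRY_TIMES = 5

# HTTP status codes that trigger a retry. 404 is not in Scrapy's default
# list, so it is added here alongside some common server-error codes.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 404]
```

With this in place no custom middleware or subclass is needed; the stock middleware reads these settings. Note that a page that is genuinely missing will also be retried RETRY_TIMES times before Scrapy gives up on it.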

2

Ewig

OP

2019-01-06 15:57:08 +08:00

@dreasky I wrote a separate .py file for this. Do I need to subclass the built-in middleware?

3

Ewig

OP

2019-01-06 16:06:11 +08:00

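@dreasky I originally wrote my own middleware that overrode the built-in one, but for now I'm going back to the stock one. One question: can the interval between retries be configured?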
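
The thread does not answer this. As far as I know, the stock RetryMiddleware has no dedicated retry-interval setting: a retried request goes back to the scheduler like any other request, so the global throttling settings are what space it out. A hedged sketch of that workaround, using real Scrapy setting names with illustrative values that are assumptions:

```python
# settings.py (sketch; values are illustrative assumptions)

# Minimum wait, in seconds, between consecutive requests to the same site.
# This applies to every request, including retried ones.
DOWNLOAD_DELAY = 3

# Randomize the actual wait around DOWNLOAD_DELAY
# (this is Scrapy's default behavior).
RANDOMIZE_DOWNLOAD_DELAY = True

# Optionally let AutoThrottle adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound on the delay in seconds
```

This slows the whole crawl rather than only the retries. A delay that applies to retries alone would need a custom middleware, which is beyond what this thread covers.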
