Help with a Scrapy crawl error

2019-11-25 13:38:08 +08:00
 yifengs

I'm crawling the Sunshine Government Affairs site (阳光政务) with Scrapy and an error shows up, although the data still comes out. How do I fix these two errors? The output is as follows:

[scrapy.robotstxt] WARNING: Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
Traceback (most recent call last):
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.sun0769.com/error/404.htm>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/python/.virtualenvs/webspider/lib/python3.5/site-packages/scrapy/robotstxt.py", line 15, in decode_robotstxt
    robotstxt_body = robotstxt_body.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 327: invalid start byte

{'content': '东莞南城周溪东径北街 6 号天台严重违建,现在还出租了,没有跟进后续情况',
 'content_img': [],
 'href': 'http://wz.sun0769.com/html/question/201911/436799.shtml',
 'publish_date': '2019-11-25 11:58:44',
 'title': '东莞南城周溪东径北街 6 号天台严重违建现在还出租了,相关部门没有跟进后续情况'}

The item at the bottom is the scraped data.
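The UnicodeDecodeError in the second traceback can be reproduced outside Scrapy. The following is a minimal sketch, assuming the server answered the robots.txt request with a GBK-encoded HTML page (the log only shows that the response came from the site's 404 page, not which encoding it used). The function name `decode_robotstxt_sketch` and the sample bytes are hypothetical and are not Scrapy's actual code.

```python
# Hypothetical stand-in for the body returned for /robots.txt.
# Assumption: the real page was GBK-encoded; the log does not confirm this.
sample_body = "东莞".encode("gbk")


def decode_robotstxt_sketch(raw: bytes) -> str:
    """Imitate the behaviour described in the warning: if the body is not
    valid UTF-8, treat it as an empty robots.txt file."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes in the range 0x80-0xBF (such as 0xb6) can only continue a
        # UTF-8 sequence, never start one, hence "invalid start byte".
        return ""


if __name__ == "__main__":
    print(repr(decode_robotstxt_sketch(sample_body)))
    print(repr(decode_robotstxt_sketch(b"User-agent: *\nDisallow: /")))
```

If this assumption holds, the warning says nothing about the Scrapy installation itself: the crawl continues because the unreadable robots.txt is simply treated as empty.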

4696 views
Node: Python
3 replies
zdnyp
2019-11-25 14:04:09 +08:00
Failure while parsing robots.txt. File either contains garbage or is in an encoding other than UTF-8, treating it as an empty file.
You can set ROBOTSTXT_OBEY to False in settings.py.
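The setting this reply refers to is presumably Scrapy's `ROBOTSTXT_OBEY`. A minimal sketch of the change, assuming a standard project layout where the settings live in the project's `settings.py`:

```python
# settings.py of the Scrapy project (assumed standard layout).
# With ROBOTSTXT_OBEY = False, Scrapy does not download or parse robots.txt,
# so the parsing warning no longer appears. The crawler also stops honouring
# the site's robots.txt rules, so use this with care.
ROBOTSTXT_OBEY = False
```

The same setting can be overridden for a single run from the command line, for example `scrapy crawl myspider -s ROBOTSTXT_OBEY=False`, where `myspider` is a placeholder spider name.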
yifengs
2019-11-25 14:08:44 +08:00
Thanks, the error is gone. Was my Scrapy not installed properly? Why would parsing robots.txt fail?
yifengs
2019-11-25 14:13:02 +08:00
Oh, I see now, the robots protocol doesn't allow it. Thanks!


https://www.v2ex.com/t/622887
