Scraping Douban photo albums returns 403, browser spoofing isn't working, calling HQ for help...

2015-03-24 12:02:11 +08:00

dedewei

Googling turned up two ways to spoof a browser:
Option 1: https://gist.github.com/jianjiao2021/2c34d12dc2b327e62966

Option 2: https://gist.github.com/jianjiao2021/05f9bbed66e79c24c9dc

Both still return 403. What am I doing wrong?

Full code: https://gist.github.com/jianjiao2021/7a8069afab52b12b0c76

11757 views

Node

Python

39 replies

em70

2015-03-24 13:54:27 +08:00

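Douban has been monitoring request frequency for a long time. In my testing, 40 requests per minute is the threshold; if you wait 1 second after each fetch you definitely won't have a problem.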
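A minimal sketch of the throttling idea above, using only the standard library. The function name `crawl_politely` and its `fetch` and `sleep` parameters are made up for illustration. The 1.5 second default is my own assumption, not the poster's number: 60 / 40 = 1.5, so it stays at or under 40 requests per minute even if every fetch returned instantly, whereas a 1 second pause relies on the fetch itself taking some time.

```python
import time


def crawl_politely(urls, fetch, delay=1.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    `fetch` is any callable that downloads one URL (hypothetical here).
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Usage would look like `crawl_politely(album_urls, fetch=download_page)`, where `download_page` is whatever function already does the single-page request.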

fork3rt

2015-03-24 13:58:57 +08:00

Why not use requests + beautifulsoup?

vjnjc

2015-03-24 14:11:57 +08:00

Looks fun. OP, let me borrow your program. I hear there are lots of hidden beauties on Douban, and I can pick up some Python along the way ^^

caoz

2015-03-24 16:28:37 +08:00

Use Douban's API (http://developers.douban.com/wiki/?title=photo_v2) with the apikey that the Douban client uses, and you won't get banned no matter how much you scrape~

e.g. http://api.douban.com/v2/group/taotaopaoxiao/topics?alt=json&apikey=08f332d3675ca9d71ad9987a3615fd85
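A rough sketch of calling that endpoint from Python with the standard library only. The URL shape is taken from the example above; the helper names `group_topics_url` and `fetch_group_topics` are invented here, and whether the endpoint still answers, and what the JSON looks like, is an assumption you would need to check yourself.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "http://api.douban.com/v2"


def group_topics_url(group, apikey):
    """Build the group topics URL in the same shape as the example above."""
    query = urllib.parse.urlencode({"alt": "json", "apikey": apikey})
    return "{}/group/{}/topics?{}".format(API_ROOT, urllib.parse.quote(group), query)


def fetch_group_topics(group, apikey, timeout=10):
    """Download and decode the JSON response (assumes a UTF-8 JSON body)."""
    url = group_topics_url(group, apikey)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `fetch_group_topics("taotaopaoxiao", "YOUR_APIKEY")` would request the same URL as above with your own key substituted.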

happywowwow

2015-03-24 16:37:56 +08:00

http://www.douban.com/group/haixiuzu/
"Don't be shy"
I wrote a scraper for this one a while back
hhh

muyi

2015-03-24 17:06:58 +08:00

Simulating a browser easily gets your IP banned. As mentioned above, use the official client's apikey and scrape through the API.

AnyOfYou

2015-03-24 17:12:41 +08:00

http://doc.scrapy.org/en/0.24/topics/practices.html#bans
The Scrapy docs have a few tips on how to keep a crawler from getting banned:

rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
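The first three tips in that list could be sketched roughly like this. `COOKIES_ENABLED` and `DOWNLOAD_DELAY` are the Scrapy settings named in the quoted docs; the middleware class, the user agent strings, and the `myproject.middlewares` path are placeholders I made up, so treat this as an outline rather than a drop-in config.

```python
import random

# --- settings.py fragment ---
COOKIES_ENABLED = False  # don't send cookies that could be used to spot a bot
DOWNLOAD_DELAY = 2       # seconds to wait between requests

# Hypothetical module path; point it at wherever the class below lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 543,
}

# Placeholder strings; replace them with real, current browser user agents.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


# --- middlewares.py fragment ---
class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        return None  # returning None lets Scrapy continue handling the request
```

Both fragments are shown in one block only to keep the sketch short; in a real project they would live in separate files.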

justlikemaki

2015-03-24 17:34:46 +08:00

..I've run into sites that deliberately return an error status code and still send back the page content along with it.
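If a site behaves that way, the body is still worth inspecting. A small sketch with `urllib`: `urllib.error.HTTPError` is also a response object, so its body can be read just like a normal response. The function name `fetch_status_and_body` and the injectable `opener` parameter are my own additions for illustration.

```python
import urllib.error
import urllib.request


def fetch_status_and_body(url, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes), even when the server answers 4xx/5xx."""
    try:
        with opener(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response, so the page content is still readable.
        return err.code, err.read()
```

Checking the returned bytes on a 403 shows whether the server really refused the request or just attached an error code to a normal page.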

darmody

2015-03-24 18:49:09 +08:00

Your code doesn't seem to add any delay or anything like that, so my guess is the problem is the request rate.

aliao0019

2015-03-25 00:39:38 +08:00

Pay attention to the bid in Douban's header

aliao0019

2015-03-25 00:41:48 +08:00

@aliao0019 headers
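One way to act on the bid hint, as a sketch only. The thread doesn't say where bid lives; this assumes it is a cookie that a real browser session sends in the `Cookie` header, and that copying that value into your own request is what is meant. The `build_request` helper and the default user agent string are made up for illustration.

```python
import urllib.request


def build_request(url, bid, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build a request carrying a browser-like User-Agent and a bid cookie.

    Assumption: `bid` is a cookie value copied from a real browser session.
    """
    headers = {
        "User-Agent": user_agent,
        "Cookie": "bid=" + bid,
    }
    return urllib.request.Request(url, headers=headers)
```

The resulting object can be passed straight to `urllib.request.urlopen`.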

dedewei

2015-03-25 08:20:03 +08:00

@terrychang I didn't quite follow, but thanks. I'll try it next time I run into this.

dedewei

2015-03-25 08:22:49 +08:00

@lerry lxml and Requests: everyone seems to be recommending these, so I'll keep learning. Thanks for the pointers!

dedewei

2015-03-25 08:28:22 +08:00

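@caoz Thanks a lot. I did a quick Google search at the time, didn't find anything, and gave up. I've never used an API before, so I'm going to try it right now. Much appreciated.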

dedewei

2015-03-25 08:29:02 +08:00

@happywowwow Hahaha~ great material, off to scrape it!!!!!!!!!!

dedewei

2015-03-25 08:30:24 +08:00

@AnyOfYou mark..... I'll go through it properly once I'm a bit more experienced......

lerry

2015-03-25 09:56:26 +08:00

@dedewei I use PyQuery. It lets you manipulate DOM elements the way jQuery does, very convenient.

penjianfeng

2015-03-25 10:00:41 +08:00

@happywowwow Went in and took a look. Now I finally get why people used to say Douban is the real NSFW site -_-||

zjuster

2015-03-25 10:52:19 +08:00

Douban's anti-scraping measures were all forced on them by you guys.. haha. Please don't take it the wrong way, I mean no harm..

Page 2 of 2


https://www.v2ex.com/t/178984
