Douban girl-photo scraper, my Python "Virgo" (处女座, a typo for 处女作, "debut work"). Feedback welcome

withrock

2014-07-15 13:06:46 +08:00

I have one too.
http://git.oschina.net/mktime/python-learn

chenggiant

2014-07-15 15:02:08 +08:00

@O21 Yeah, I just tried it on a Mac. I still had to read the source to figure out how to enter the path, though...

O21

2014-07-16 00:54:51 +08:00

@chenggiant :) Hehe, it has been updated now. It collects proxies automatically and picks one at random to scrape with.

O21

2014-07-16 01:09:44 +08:00

@Fotix
@shyrock
@sujin190
@dingyaguang117
@puyo
@eslizn
@WhyLiam
@kawaiiushio
@paulw54jrn
@payne
@deslife
@sunjourney
@cocalrush
@vigoss
@1130335361
@qdsearoc
@gelupk
@zephyryu

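The code has been updated, hehe.

What's new:
The program collects proxies automatically
After collecting them, it picks one at random
Then it downloads and saves the images automatically

Now you only need to enter how many to collect. The download link is still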

http://162.244.92.122/DouBanMZ.zip
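
The "pick one proxy at random" step described above could look roughly like the sketch below. This is not the code in the zip; `build_opener_with_random_proxy` is a hypothetical name, and the proxy list is assumed to already be available as `host:port` strings.

```python
import random
import urllib.request


def build_opener_with_random_proxy(proxies):
    """Pick one 'host:port' entry at random and build an opener that uses it.

    `proxies` is assumed to be a list of 'host:port' strings that was
    collected elsewhere (hypothetical input, not taken from the original tool).
    """
    if not proxies:
        raise ValueError("proxy list is empty")
    proxy = random.choice(proxies)
    # ProxyHandler maps a URL scheme to the proxy that should serve it.
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    return proxy, urllib.request.build_opener(handler)
```

Calling `opener.open(url)` on the returned opener sends the request through the chosen proxy; picking again for each batch spreads requests across the list.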

ChiangDi

2014-07-16 01:11:16 +08:00

I just took a look at that group. It's strange, why do so many people post photos of themselves there?

reorx

2014-07-16 01:16:13 +08:00

Thanks to the OP for telling me about this group. It showed me how vast the world is…

O21

2014-07-16 05:54:33 +08:00

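I stayed up all night and finished fetching from any group, simulating browser access, prettifying the program, default input values and so on. I'll post it after I get some sleep. Python is really fun, but sadly I don't know how to write multithreaded code, which is a pain.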

paulw54jrn

2014-07-16 07:18:47 +08:00

@O21
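Have one main process parse the img URLs and put them all into a Multiprocessing Queue, then share that queue among several worker threads and let the workers do the actual fetching. Different workers can use different proxies to avoid getting blocked.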
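
A minimal sketch of that layout, under some assumptions: it uses `queue.Queue` from the standard library rather than a multiprocessing queue, since the workers here are threads inside one process; `run_workers` is a hypothetical name, and `fetch(url, proxy)` is a placeholder for whatever function actually downloads an image.

```python
import queue
import threading


def run_workers(img_urls, proxies, fetch, num_workers=4):
    """Fetch every URL with worker threads; each worker is bound to one proxy.

    `fetch(url, proxy)` is supplied by the caller (hypothetical signature).
    Returns a list of (url, proxy, outcome) tuples, where outcome is either
    the return value of `fetch` or the exception it raised.
    """
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker(proxy):
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this worker
                break
            try:
                outcome = fetch(url, proxy)
            except Exception as exc:  # keep going if one download fails
                outcome = exc
            with results_lock:
                results.append((url, proxy, outcome))

    # Assign proxies to workers round-robin.
    threads = [
        threading.Thread(target=worker, args=(proxies[i % len(proxies)],))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for url in img_urls:  # the main thread acts as the producer
        tasks.put(url)
    for _ in threads:  # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```

The main thread only fills the queue, so parsing and downloading stay separate; results come back in completion order, not in input order.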

hging

2014-07-16 09:55:04 +08:00

I never believed in star signs, until the company hired two Virgos........

I'm really not here to troll.... don't hit me.....

zouyun5152

2014-07-16 10:37:13 +08:00

Haha, awesome

shyrock

2014-07-16 11:06:54 +08:00

Python newbie here, curious: what do the x+=1 on line 38 and the i+=1 on line 51 do?

Pete

2014-07-16 12:33:58 +08:00

Thanks, OP, I've discovered a vast world. But what happens if this group finds out you're quietly scraping it..

yangkuku

2014-07-16 12:41:04 +08:00

Thanks to the OP for making this awesome program, but it won't run on my 32-bit Win7, which is a pity

O21

2014-07-16 12:53:53 +08:00

@hging I meant to write 处女作 (debut work); it was a typo

O21

2014-07-16 12:54:38 +08:00

@yangkuku But I compiled a 32-bit build yesterday

O21

2014-07-16 16:18:15 +08:00

http://cn-proxy.com/ What I did to this site... its database connection is timing out... I didn't take it down, did I...

yangkuku

2014-07-16 17:53:11 +08:00

A timid question: how do I run the source on a Mac?

Owenjia

2014-07-16 20:03:01 +08:00

Isn't there already a site for this~~ http://www.dbmeizi.com/

jptiancai

2014-08-19 15:12:40 +08:00

@O21 I also looked at the recommendation from the fellow below, and it's very good too. Support open source!
@Owenjia

linKnowEasy

2014-08-31 16:34:01 +08:00

#coding:utf-8
import urllib.request
import re
import time
import sys
import os
from imp import reload
reload(sys)
print ('#'*50)
print ('This program is mainly collecting watercress <Do not be shy> group picture')
print ('#'*50)
print ('Collected before the need to enter a proxy server address, so we can prevent the douban shielding.')
print ('Recommend a proxy address: http://cn-proxy.com/')
print ('Only need to input the server address and port number, do not need to input HTTP')
print ('Demo:127.0.0.1:8080')
print ('#'*50)
proxy_input = input('127.0.0.1:8087:')
proxy_handler = urllib.request.ProxyHandler({'http':'%s'%proxy_input})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
img_LuJ = input('input path:')
img_LuJ2 = os.path.abspath(img_LuJ)
print(img_LuJ2)
def gethtml2(url2):
    req = urllib.request.Request(url2)
    html2 = urllib.request.urlopen(req).read()
    return html2

def gettoimg(html2):
    reg2 = r'http://www.douban.com/group/topic/\d+'
    html2 = html2.decode('utf-8')
    toplist = re.findall(reg2,html2)
    x = 0
    for topicurl in toplist:
        x+=1
    return topicurl

def download(topic_page):
    reg3 = r'http://img3.douban.com/view/group_topic/large/public/.+\.jpg'
    imglist = re.findall(b'reg3',topic_page)
    i = 1
    download_img = None
    for imgurl in imglist:
        img_numlist = re.findall(r'p\d{7}',imgurl)
        for img_num in img_numlist:
            download_img = urllib.request.urlretrieve(imgurl,img_LuJ2 + '/%s.jpg'%img_num)
            time.sleep(1)
            i+=1
            print (imgurl)
    return download_img

page_end = int(input('Please enter the page number:'))
num_end = page_end*25
num = 0
page_num = 1
while num<=num_end:
    html2 = gethtml2('http://www.douban.com/group/haixiuzu/discussion?start=%d'%num)
    topicurl = gettoimg(html2)
    topic_page = gethtml2(topicurl)
    download_img=download(topic_page)
    num = page_num*25
    page_num+=1

else:
    print('Program to collect complete')

This is my modified version of your code. It runs under Python 3 but doesn't fetch any images. Could you take a look?
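
One likely reason no images are saved: in `download`, the call `re.findall(b'reg3',topic_page)` searches for the literal bytes `reg3` instead of the pattern stored in the variable `reg3`, so `imglist` is always empty and the loop body never runs. Below is a minimal sketch of a fix, assuming the topic page is UTF-8 and still uses the `img3.douban.com` URL layout from the post. `find_image_urls` and `image_id` are hypothetical helper names, and the non-greedy `.+?` is a change from the posted pattern so that several URLs on one line are matched separately.

```python
import re

# Same URL layout as reg3 in the post, with dots escaped and a non-greedy
# match (both are changes made for this sketch).
IMG_PATTERN = re.compile(
    r'http://img3\.douban\.com/view/group_topic/large/public/.+?\.jpg'
)


def find_image_urls(topic_page):
    """Return all image URLs found in the raw bytes of a topic page."""
    # urlopen(...).read() returns bytes; decode first so that a str pattern
    # can be applied and the resulting URLs are str as well.
    html = topic_page.decode('utf-8', errors='replace')
    return IMG_PATTERN.findall(html)


def image_id(img_url):
    """Extract the 'p' + 7 digits identifier used as the file name, if any."""
    match = re.search(r'p\d{7}', img_url)
    return match.group(0) if match else None
```

Inside `download`, `imglist = find_image_urls(topic_page)` would then yield `str` URLs that can be passed to `urllib.request.urlretrieve`. Separately, `gettoimg` returns only a single topic URL per list page, so at most one topic per page is visited even once the pattern is fixed.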
