PHP scraper for Taobao and Tmall shop products

2014-04-29 22:07:06 +08:00
 hpxl
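It can get around Taobao's anti-scraping measures and uses proxies to scrape a shop's products quickly. Product information and images are stored in the ./data directory by default.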

https://github.com/hpxl/fetch-taobao-goods
If you find it useful, a star is welcome.
15008 views
Node: Programmers
18 replies
sadara
2014-04-29 22:49:51 +08:00
I recall there is a Taobao affiliate (淘宝客) program called 单店宝.
mahone3297
2014-04-29 23:35:37 +08:00
Forked...
leyle
2014-04-29 23:55:55 +08:00
This is interesting. Following it for now; I'll take a look on my computer during the day.
bigshan
2014-04-30 01:49:46 +08:00
I'll take a look on my computer tomorrow.
huangsong
2014-04-30 10:35:31 +08:00
Forking it.
aWangami
2014-04-30 12:40:28 +08:00
C:\Users\Administrator\Desktop\Fetch-Taobao>php fetch.php 'http://shop65262430.taobao.com'
PHP Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13
PHP Stack trace:
PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:13

Warning: file_put_contents(/tmp/fetchgoods.pid): failed to open stream: No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 13

Call Stack:
0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0010 128008 2. file_put_contents() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:13

PHP Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59
PHP Stack trace:
PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

Notice: Undefined index: scheme in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59

Call Stack:
0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

PHP Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59
PHP Stack trace:
PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

Notice: Undefined index: host in C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php on line 59

Call Stack:
0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50

shop_url:'http://shop65262430.taobao.com' ... start_time:04-29 15:19:11 ... start!
PHP Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php on line 127
PHP Stack trace:
PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29

Fatal error: Call to undefined function curl_init() in C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php on line 127

Call Stack:
0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29

PHP Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 15
PHP Stack trace:
PHP 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
PHP 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
PHP 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
PHP 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29
PHP 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
PHP 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15

Warning: unlink(/tmp/fetchgoods.pid): No such file or directory in C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php on line 15

Call Stack:
0.0010 127528 1. {main}() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0068 192584 2. FetchGoods->run() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:33
0.0068 193152 3. FetchGoods->fetchOneShop() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:50
0.0215 197880 4. HttpFetch->get() C:\Users\Administrator\Desktop\Fetch-Taobao\FetchGoods.class.php:74
0.0215 197896 5. HttpFetch->disguise_curl() C:\Users\Administrator\Desktop\Fetch-Taobao\HttpFetch.class.php:29
0.0342 194016 6. removePidFile() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:0
0.0342 194128 7. unlink() C:\Users\Administrator\Desktop\Fetch-Taobao\fetch.php:15


C:\Users\Administrator\Desktop\Fetch-Taobao>
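
The output above seems to show three separate problems on Windows. The PID file path /tmp/fetchgoods.pid does not exist there. The single quotes around the URL appear to reach PHP literally (the log prints shop_url:'http://shop65262430.taobao.com' with the quotes), which would explain the missing scheme and host. And curl_init() is undefined, which means the cURL extension is not loaded. A minimal sketch of how a script could guard against each, assuming PHP 7.1 or later; the helper names normalizeShopUrl, pidFilePath and requireCurl are hypothetical and do not come from fetch-taobao-goods:

```php
<?php
// Hypothetical helpers for illustration; not part of fetch-taobao-goods.

/** Strip stray quotes and whitespace from a CLI argument, then require an absolute URL. */
function normalizeShopUrl(string $raw): string
{
    $url = trim($raw, " \t\n\r\"'");
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: {$raw}");
    }
    return $url;
}

/** Build the PID file path under the platform's temp directory instead of a fixed /tmp. */
function pidFilePath(string $name = 'fetchgoods.pid'): string
{
    return rtrim(sys_get_temp_dir(), '/\\') . DIRECTORY_SEPARATOR . $name;
}

/** Fail early with a readable message when the cURL extension is missing. */
function requireCurl(): void
{
    if (!extension_loaded('curl')) {
        throw new RuntimeException('The PHP cURL extension is not loaded; enable it in php.ini.');
    }
}
```

On cmd.exe, wrapping the URL in double quotes instead of single quotes (php fetch.php "http://shop65262430.taobao.com") should also avoid the quoting problem, since cmd.exe does not treat single quotes as quoting characters.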
andyhu
2014-05-01 04:45:49 +08:00
Marking this to follow. Scraping with PHP is a bit too painful, though.
ptsa
2014-05-02 22:28:53 +08:00
@sadara $1199. The Taobao affiliate business is hard to make work these days, isn't it?
ptsa
2014-05-02 22:30:23 +08:00
@sadara And it's still last year's version; no idea whether it works well.
hanchengluo
2014-05-03 10:22:41 +08:00
@andyhu I scrape with PHP too. 2 GB of data took me almost a month. Do you have a better recommendation?
andyhu
2014-05-03 10:43:43 +08:00
@hanchengluo Try node.js + request + cheerio. I actually use PHP at work, but for jobs like fetching remote pages, once you've used that combination, going back to PHP feels very painful.
andyhu
2014-05-03 10:45:02 +08:00
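@ptsa What exactly makes the Taobao affiliate business hard? I've heard that 蘑菇街 (Mogujie) and 美丽说 (Meilishuo) have both pivoted. What is the situation there?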
hanchengluo
2014-05-03 10:52:18 +08:00
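@andyhu The job is mostly extracting tags and storing them in a database, so the main pressure should be fetch speed and database I/O. I don't think it has much to do with the language or program used.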
www.smartweb.cn
andyhu
2014-05-03 10:57:42 +08:00
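HTML parsing also takes time. On top of that, PHP doesn't support multithreading, so every request has to wait, which is slow. For the database I use MongoDB, and it's quite fast.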
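
On the point about each request having to wait: without threads, PHP's cURL extension still offers a multi interface that keeps several transfers in flight from a single process. A minimal sketch, assuming the cURL extension is loaded and PHP 7 or later; fetchAll is a hypothetical helper name, and nothing in this thread indicates that fetch-taobao-goods works this way:

```php
<?php
/**
 * Fetch several URLs concurrently with the curl_multi_* functions.
 * Returns an array mapping each distinct URL to its body, or null when the
 * final HTTP status is not 2xx (including connection failures).
 * Hypothetical helper for illustration only.
 */
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail immediately on some systems; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$url] = ($code >= 200 && $code < 300) ? curl_multi_getcontent($ch) : null;
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

A call such as fetchAll(['http://example.com/a', 'http://example.com/b']) (placeholder URLs) waits roughly as long as the slowest response rather than the sum of all of them. Parsing and database writes still run one at a time afterwards.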
andyhu
2014-05-03 11:01:49 +08:00
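@hanchengluo I just looked at your site. What do you use for the page snapshots? Is it done with phantomjs? Node has thumbbot, which is quite powerful and handles thumbnail previews for web pages, images, and video alike. It is built on phantomjs, though, so if you need to capture pages with Flash you would probably need a specially customized build; the old phantomjs no longer supports Flash. Overall, when it comes to scraping, PHP and node.js are simply not comparable. Even Python is much nicer to use than PHP, and it has quite a few specialized crawler modules.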
hanchengluo
2014-05-03 11:14:40 +08:00
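@andyhu Thanks for stopping by. I only use CI on PHP and am not familiar with JS. I once wanted to build a crawler and learn GoLang, but I didn't stick with it and went back to PHP. I'm getting old and can't pick up new things anymore. I'm planning to turn the site into a small portal and am still thinking it through: without scraping there is no content, but I'm afraid that scraping will get the site penalized.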
laodao
2014-05-03 12:14:27 +08:00
ym1623
2014-09-03 14:30:13 +08:00
I found that this project doesn't work; it still gets blocked by Tmall...

https://www.v2ex.com/t/110556
