一个诡异的爬虫,求分析。

2014-08-05 12:47:25 +08:00
 sanp
这是我截取的access log. 其中/{xxx}代表的是我网站的某个路径,其他的都是原始的log未做改动。
这个爬虫IP不固定,封了后过一会会有新的IP爬过来。
这个爬虫从大概2年前就开始爬我的站,中间我的站关掉了一年左右,现在重新开,没想到这个爬虫居然还在。不知道什么路数。很有可能我关站的这段时间他还在爬。大家给分析分析

89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1"
89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a0" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
89.248.162.170 - - [05/Aug/2014:04:42:36 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a4" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a9" "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
89.248.162.170 - - [05/Aug/2014:04:42:37 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a6" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_4) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.65 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:42:56 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/10.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:42:57 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1"
94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.04 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a8" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a7" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:42:58 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11"
94.102.49.31 - - [05/Aug/2014:04:43:00 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a1" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1"
94.102.49.31 - - [05/Aug/2014:04:43:01 +0000] "GET /{xxx} HTTP/1.1" 301 193 "http://www.google.com/#q=a2" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.16) Gecko/20120427 Firefox/15.0a1"
3943 次点击
所在节点    问与答
15 条回复
liangdi
2014-08-05 12:48:57 +08:00
log呢?
sanp
2014-08-05 12:50:30 +08:00
@liangdi 刚才按了Enter居然自动发布了。。。我刚编辑了。
plprapper
2014-08-05 13:16:23 +08:00
这哪里是爬虫, 简直是癞皮狗。。。。
liangdi
2014-08-05 13:26:59 +08:00
是采集器吧 lz什么站?
sintrb
2014-08-05 13:28:21 +08:00
这爬虫好可怜。。
popbones
2014-08-05 14:20:01 +08:00
IP : 89.248.162.170
Host : server156950.santrex.net
Country : Netherlands

IP : 94.102.49.31
Host : ?
Country : Netherlands
captainhcg
2014-08-05 14:34:48 +08:00
http://www.projecthoneypot.org/ip_94.102.49.213
貌似是发送垃圾评论的,你的站点是不是用了wordpress之类的框架?
ChanneW
2014-08-05 14:35:14 +08:00
怎么看出不是真 google 的
avrillavigne
2014-08-05 15:04:17 +08:00
http://antivirus.neu.edu.cn/ssh/lists/base_30days.txt 东北大学把它列进黑名单了 - -
vicacheung
2014-08-05 15:06:38 +08:00
@sanp 现在可以编辑主题了?
sanp
2014-08-05 15:22:05 +08:00
@liangdi 一个工具类的站,查询数据的,对方是遍历抓取的。我奇怪的是我站都关了一年多。重新开了,他居然还在。
sanp
2014-08-05 15:22:40 +08:00
@vicacheung 刚发布时候可以编辑的。
sanp
2014-08-05 15:24:02 +08:00
@plprapper 确实,一般的爬虫禁了就行了,这个是禁了吗,过会就有别的IP过来,而且抓取很频繁,基本不停的爬。
sanp
2014-08-05 15:25:03 +08:00
@captainhcg 没有用wordpress。这个爬虫是遍历网站页面,然后就不停的爬。
sanp
2014-08-05 15:25:55 +08:00
@avrillavigne 确实被互联网上不少地方列黑名单了。我就是奇怪他咋就不停的爬。

这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/126195

V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX