Google ignores robots.txt entirely and still indexes the site's titles and URLs

2016-12-01 20:56:23 +08:00
 nikoo
I set up a wiki on a VPS under a domain name that had never been used before.
From the moment the site became reachable, /robots.txt has always been:
User-agent: *
Disallow: /

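It ran like this for more than two months while I imported a lot of articles. Today I ran a site: query on Google and got a shock: Google disregarded the robots.txt restriction and indexed two pages of results. Every result shows the title and URL, and the third line (the description) always reads: "A description for this result is not available because of this site's robots.txt. Learn more"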

I submitted a removal request through the Google Remove URLs Tool and got no response either.

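Isn't Google's motto "don't be evil"? Why does it still index the titles and URLs of pages that are explicitly disallowed?
How can I completely stop Google from indexing anything on the site?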

btw: other search engines such as Yahoo, Bing, and Baidu all behaved and indexed nothing from this site.
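
For reference, a minimal sketch of how the two-line robots.txt above is interpreted by a standards-following parser, using Python's standard `urllib.robotparser`. The host `example.com` and the path are placeholders, not the poster's site. Note that this only answers "may a crawler fetch this URL?"; it says nothing about whether a search engine may list the URL itself.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Placeholder URL (assumption): any path on the site is covered by "Disallow: /".
blocked = not parser.can_fetch("Googlebot", "http://example.com/wiki/SomePage")
print("crawling blocked:", blocked)
```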
3645 views
Node: Q&A
14 replies
auzeonfung
2016-12-01 20:59:22 +08:00
Ban Google's IPs on the server.
stamaimer
2016-12-01 21:03:05 +08:00
@auzeonfung Do you know how many IPs Google has?
xmoiduts
2016-12-01 21:07:11 +08:00
OP, try searching for taobao?
imcocc
2016-12-01 21:07:27 +08:00
Search for "block junk crawlers".
Block them by matching the User-Agent.
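
A minimal sketch of User-Agent matching, written as a WSGI middleware. The token list and the demo app `hello_app` are assumptions for illustration; a real deployment would more likely do this in the web server configuration. Keep in mind that the User-Agent header can be set to anything by the client, so this only stops crawlers that identify themselves.

```python
# Assumption: substrings to look for in the User-Agent, compared case-insensitively.
BLOCKED_UA_TOKENS = ("googlebot",)


def is_blocked_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


def block_crawlers(app):
    """Wrap a WSGI app so that matching crawlers get 403 instead of content."""
    def middleware(environ, start_response):
        if is_blocked_user_agent(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Hypothetical stand-in for the wiki application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = block_crawlers(hello_app)
```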
gogohigh
2016-12-01 21:09:43 +08:00
@stamaimer
gfwlist
nikoo
2016-12-01 21:16:31 +08:00
Some findings from my research:
Why do Google search results include pages disallowed in robots.txt?
http://webmasters.stackexchange.com/questions/24569/why-do-google-search-results-include-pages-disallowed-in-robots-txt
Does Google ignore robots.txt
http://webmasters.stackexchange.com/questions/54879/does-google-ignore-robots-txt

To summarize the conclusions of the two threads above:
Google does list pages that robots.txt disallows. The fix is to add the following to every page:
<meta name="robots" content="noindex, nofollow">
Google's explanation is that a page will be indexed regardless of robots.txt as long as some other indexed page links to it.

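That doesn't seem right to me, because the articles imported into my wiki are not, and could not be, linked from any other site. So how did they get indexed, title and URL included?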
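
A small sketch for checking whether a rendered page actually carries the robots meta tag quoted above, using only Python's standard `html.parser`. The class and function names are made up for this example, and the HTML strings are samples rather than output from the poster's wiki.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noindex(html: str) -> bool:
    """Return True if the page declares noindex in a robots meta tag."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(sample))
```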
caiych
2016-12-01 21:19:57 +08:00
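While looking up the details of robots.txt I came across Google's documentation, which says:
> If you want to block your own page from search results, use another method such as password protection or a noindex tag or directive.
I don't know whether the OP has set this up...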

https://support.google.com/webmasters/answer/6062608?visit_id=1-636161949805851671-2329679117&hl=zh-Hans&rd=2
nikoo
2016-12-01 21:34:16 +08:00
@caiych Thank you very much, that document was very helpful. I think there's a flaw in the way Google handles this:

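robots.txt directives can't prevent other sites from referencing your URLs
Although Google won't crawl or index content that robots.txt blocks, we may still find a disallowed URL elsewhere on the web and index it. As a result, the URL and other publicly available information (such as anchor text in links to the site) can still appear in Google search results. You can keep your URL out of Google search results completely by using other URL-blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.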

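So here's the problem. According to
"Using meta tags to block search engines from indexing your pages" https://support.google.com/webmasters/answer/93710
Google's crawler cannot reach the "noindex meta tag" because robots.txt blocks it, so setting a "noindex meta tag" on my own pages is theoretically ineffective (because of the robots.txt restriction).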
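
The "response header" option mentioned in the quoted documentation is the `X-Robots-Tag` header. Below is a minimal sketch that adds it to every response of a WSGI app; `wiki_app` is a hypothetical stand-in for the real application. As the post points out, a crawler can only see this header (or the meta tag) if it is allowed to fetch the page, so this approach presumably only works once the `Disallow: /` rule no longer blocks the crawler.

```python
def add_noindex_header(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex, nofollow."""
    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return middleware


def wiki_app(environ, start_response):
    """Hypothetical stand-in for the wiki application."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>article</body></html>"]


application = add_noindex_header(wiki_app)
```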
khaki
2016-12-01 21:44:24 +08:00
The documentation here is more detailed: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt . Could it be a subdomain issue?
auzeonfung
2016-12-01 21:59:55 +08:00
@stamaimer deny from 104.132.0.0/21
deny from 104.132.12.0/24
deny from 104.132.128.0/24
deny from 104.132.129.0/24
deny from 104.132.13.0/26
deny from 104.132.13.112/28
deny from 104.132.13.128/25
deny from 104.132.13.64/27
deny from 104.132.13.96/28
deny from 104.132.130.0/24
deny from 104.132.131.0/24
deny from 104.132.132.0/24
deny from 104.132.133.0/24
deny from 104.132.134.0/24
deny from 104.132.135.0/24
deny from 104.132.136.0/23
deny from 104.132.138.0/24
deny from 104.132.139.0/24
deny from 104.132.14.0/23
deny from 104.132.140.0/24
deny from 104.132.141.0/26
deny from 104.132.141.112/28
deny from 104.132.141.128/25
deny from 104.132.141.64/27
deny from 104.132.141.96/28
deny from 104.132.142.0/24
deny from 104.132.143.0/24
deny from 104.132.144.0/24
deny from 104.132.145.0/24
deny from 104.132.146.0/24
deny from 104.132.147.0/24
deny from 104.132.148.0/23
deny from 104.132.150.0/24
deny from 104.132.151.0/24
deny from 104.132.152.0/24
deny from 104.132.153.0/24
deny from 104.132.154.0/23
deny from 104.132.156.0/24
deny from 104.132.157.0/24
deny from 104.132.158.0/24
deny from 104.132.159.0/24
deny from 104.132.16.0/24
deny from 104.132.160.0/24
deny from 104.132.161.0/24
deny from 104.132.162.0/24
deny from 104.132.163.0/24
deny from 104.132.164.0/23
deny from 104.132.166.0/24
deny from 104.132.167.0/24
deny from 104.132.168.0/24
deny from 104.132.169.0/24
deny from 104.132.17.0/26
deny from 104.132.17.112/28
deny from 104.132.17.128/25
deny from 104.132.17.64/27
deny from 104.132.17.96/28
deny from 104.132.170.0/24
deny from 104.132.171.0/24
deny from 104.132.172.0/22
deny from 104.132.176.0/23
deny from 104.132.178.0/24
deny from 104.132.179.0/24
deny from 104.132.18.0/24
deny from 104.132.180.0/24
deny from 104.132.181.0/24
deny from 104.132.182.0/24
deny from 104.132.183.0/24
deny from 104.132.184.0/24
deny from 104.132.185.0/24
deny from 104.132.186.0/24
deny from 104.132.187.0/24
deny from 104.132.188.0/24
deny from 104.132.189.0/24
deny from 104.132.19.0/24
deny from 104.132.190.0/23
deny from 104.132.192.0/22
deny from 104.132.196.0/24
deny from 104.132.197.0/24
deny from 104.132.198.0/23
deny from 104.132.20.0/24
deny from 104.132.200.0/23
deny from 104.132.202.0/24
deny from 104.132.203.0/24
deny from 104.132.204.0/24
deny from 104.132.205.0/24
deny from 104.132.206.0/23
deny from 104.132.208.0/24
deny from 104.132.209.0/24
deny from 104.132.21.0/26
deny from 104.132.21.112/28
deny from 104.132.21.128/25
deny from 104.132.21.64/27
deny from 104.132.21.96/28
deny from 104.132.210.0/23
deny from 104.132.212.0/22
deny from 104.132.216.0/21
deny from 104.132.22.0/24
deny from 104.132.224.0/19
deny from 104.132.23.0/24
deny from 104.132.24.0/26
deny from 104.132.24.128/25
deny from 104.132.24.64/26
deny from 104.132.25.0/24
deny from 104.132.26.0/24
deny from 104.132.27.0/24
deny from 104.132.28.0/24
deny from 104.132.29.0/24
deny from 104.132.30.0/23
deny from 104.132.32.0/24
deny from 104.132.33.0/24
deny from 104.132.34.0/24
deny from 104.132.35.0/24
deny from 104.132.36.0/22
deny from 104.132.40.0/21
deny from 104.132.48.0/22
deny from 104.132.52.0/23
deny from 104.132.54.0/24
deny from 104.132.55.0/24
deny from 104.132.56.0/21
deny from 104.132.64.0/18
deny from 104.132.8.0/22
deny from 104.133.0.0/17
deny from 104.133.128.0/18
deny from 104.133.192.0/19
deny from 104.133.224.0/20
deny from 104.133.240.0/21
deny from 104.133.248.0/24
deny from 104.133.249.0/24
deny from 104.133.250.0/23
deny from 104.133.252.0/22
deny from 104.134.0.0/16
deny from 104.135.0.0/17
deny from 104.135.128.0/18
deny from 104.135.192.0/19
deny from 104.135.224.0/19
deny from 104.154.0.0/15
deny from 104.196.0.0/15
deny from 104.198.0.0/16
deny from 104.199.0.0/17
deny from 104.199.128.0/20
deny from 104.199.144.0/23
deny from 104.199.146.0/24
deny from 104.199.147.0/24
deny from 104.199.148.0/22
deny from 104.199.152.0/21
deny from 104.199.160.0/19
deny from 104.199.192.0/18
deny from 107.167.160.0/19
deny from 107.178.192.0/18
deny from 108.170.192.0/20
deny from 108.170.208.0/21
deny from 108.170.216.0/24
deny from 108.170.217.0/25
deny from 108.170.217.128/28
deny from 108.170.217.160/27
deny from 108.170.217.192/26
deny from 108.170.218.0/23
deny from 108.170.220.0/22
deny from 108.170.224.0/19
deny from 108.177.0.0/17
deny from 108.59.80.0/24
deny from 108.59.81.0/27
deny from 108.59.82.0/23
deny from 108.59.84.0/22
deny from 108.59.88.0/22
deny from 108.59.92.0/27
deny from 108.59.92.128/26
deny from 108.59.92.192/27
deny from 108.59.92.96/27
deny from 108.59.93.0/27
deny from 108.59.93.192/26
deny from 108.59.93.32/29
deny from 108.59.93.40/31
deny from 108.59.93.43/32
deny from 108.59.93.44/30
deny from 108.59.93.48/28
deny from 108.59.93.64/26
deny from 108.59.94.0/28
deny from 108.59.94.128/26
deny from 108.59.94.16/29
deny from 108.59.94.192/28
deny from 108.59.94.208/29
deny from 108.59.94.240/28
deny from 108.59.94.32/27
deny from 108.59.94.64/26
deny from 108.59.95.0/24
deny from 12.216.80.0/24
deny from 12.234.149.240/29
deny from 125.16.7.72/30
deny from 125.17.82.112/30
deny from 128.177.109.0/26
deny from 128.177.119.128/25
deny from 128.177.163.0/25
deny from 130.211.0.0/16
deny from 142.250.0.0/15
deny from 146.148.0.0/17
deny from 162.216.148.0/22
deny from 162.222.176.0/21
deny from 172.102.8.0/21
deny from 172.217.0.0/16
deny from 172.253.0.0/16
deny from 173.194.0.0/18
deny from 173.194.100.0/22
deny from 173.194.104.0/21
deny from 173.194.112.0/20
deny from 173.194.128.0/17
deny from 173.194.64.0/19
deny from 173.194.96.0/24
deny from 173.194.97.0/24
deny from 173.194.98.0/24
deny from 173.194.99.0/24
deny from 173.255.112.0/22
deny from 173.255.116.0/25
deny from 173.255.116.128/26
deny from 173.255.116.192/27
deny from 173.255.117.128/25
deny from 173.255.117.32/27
deny from 173.255.117.64/26
deny from 173.255.118.0/23
deny from 173.255.120.0/24
deny from 173.255.121.0/25
deny from 173.255.121.128/26
deny from 173.255.122.128/26
deny from 173.255.122.64/26
deny from 173.255.123.0/24
deny from 173.255.124.0/27
deny from 173.255.124.128/29
deny from 173.255.124.144/28
deny from 173.255.124.160/27
deny from 173.255.124.192/27
deny from 173.255.124.232/29
deny from 173.255.124.240/29
deny from 173.255.124.32/28
deny from 173.255.124.48/29
deny from 173.255.124.64/26
deny from 173.255.125.0/27
deny from 173.255.125.128/25
deny from 173.255.125.72/29
deny from 173.255.125.80/28
deny from 173.255.125.96/27
deny from 173.255.126.0/23
deny from 180.87.33.64/26
deny from 192.104.160.0/23
deny from 192.158.28.0/22
deny from 192.178.0.0/15
deny from 195.16.45.144/29
deny from 198.108.100.192/28
deny from 199.192.112.0/25
deny from 199.192.112.128/26
deny from 199.192.112.192/27
deny from 199.192.112.224/29
deny from 199.192.113.0/25
deny from 199.192.113.128/27
deny from 199.192.113.176/28
deny from 199.192.113.192/26
deny from 199.192.114.0/25
deny from 199.192.114.192/26
deny from 199.192.115.0/28
deny from 199.192.115.128/25
deny from 199.192.115.80/28
deny from 199.192.115.96/27
deny from 199.223.232.0/21
deny from 203.222.167.144/28
deny from 206.160.135.240/28
deny from 207.223.160.0/20
deny from 208.184.125.240/28
deny from 208.21.209.0/28
deny from 208.44.48.240/29
deny from 208.46.199.160/29
deny from 209.185.108.128/25
deny from 209.85.128.0/17
deny from 213.155.151.128/26
deny from 213.200.103.128/26
deny from 213.200.99.192/26
deny from 216.109.75.80/28
deny from 216.136.145.128/27
deny from 216.239.32.0/24
deny from 216.239.33.0/29
deny from 216.239.33.104/29
deny from 216.239.33.112/28
deny from 216.239.33.128/25
deny from 216.239.33.16/28
deny from 216.239.33.32/29
deny from 216.239.33.40/29
deny from 216.239.33.48/28
deny from 216.239.33.64/27
deny from 216.239.33.8/29
deny from 216.239.33.96/29
deny from 216.239.34.0/24
deny from 216.239.35.0/24
deny from 216.239.36.0/23
deny from 216.239.38.0/24
deny from 216.239.39.0/24
deny from 216.239.40.0/22
deny from 216.239.44.0/23
deny from 216.239.46.0/23
deny from 216.239.48.0/22
deny from 216.239.52.0/23
deny from 216.239.54.0/24
deny from 216.239.55.0/28
deny from 216.239.55.128/27
deny from 216.239.55.16/29
deny from 216.239.55.160/29
deny from 216.239.55.168/29
deny from 216.239.55.176/28
deny from 216.239.55.192/26
deny from 216.239.55.24/29
deny from 216.239.55.32/27
deny from 216.239.55.64/26
deny from 216.239.56.0/21
deny from 216.252.220.0/22
deny from 216.33.229.144/29
deny from 216.33.229.160/29
deny from 216.34.7.176/28
deny from 216.58.192.0/19
deny from 216.74.130.48/28
deny from 216.74.153.0/27
deny from 217.118.234.96/28
deny from 23.236.48.0/20
deny from 23.251.128.0/19
deny from 4.3.2.0/24
deny from 41.206.188.128/26
deny from 61.246.190.124/30
deny from 61.246.224.136/30
deny from 63.158.137.224/29
deny from 63.161.156.0/24
deny from 63.166.17.128/25
deny from 63.226.245.56/29
deny from 63.237.119.112/29
deny from 63.88.22.0/23
deny from 64.124.98.104/29
deny from 64.233.160.0/23
deny from 64.233.162.0/24
deny from 64.233.163.0/24
deny from 64.233.164.0/22
deny from 64.233.168.0/21
deny from 64.233.176.0/20
deny from 64.41.146.208/28
deny from 64.41.221.192/28
deny from 64.68.64.64/26
deny from 64.68.80.0/20
deny from 64.71.148.240/29
deny from 64.9.224.0/19
deny from 65.167.144.64/28
deny from 65.170.13.0/28
deny from 65.171.1.144/28
deny from 65.216.183.0/24
deny from 65.220.13.0/24
deny from 66.102.0.0/21
deny from 66.102.12.0/23
deny from 66.102.14.0/25
deny from 66.102.14.128/30
deny from 66.102.14.132/31
deny from 66.102.14.134/31
deny from 66.102.14.136/29
deny from 66.102.14.144/28
deny from 66.102.14.160/27
deny from 66.102.14.192/26
deny from 66.102.15.0/24
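
A sketch of how a deny list in this form could be checked programmatically with Python's standard `ipaddress` module. The three rules below are copied from the list above as a short excerpt; the function names are made up for this example, and whether these ranges still belong to Google's crawler is not something this code can verify.

```python
import ipaddress

# Short excerpt of the "deny from" rules posted above.
DENY_RULES = """\
deny from 104.132.0.0/21
deny from 66.102.0.0/21
deny from 216.239.32.0/24
"""


def parse_deny_rules(text):
    """Turn 'deny from <cidr>' lines into ip_network objects, skipping other lines."""
    networks = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].lower() == "deny" and parts[1].lower() == "from":
            networks.append(ipaddress.ip_network(parts[2], strict=False))
    return networks


def is_denied(ip, networks):
    """Return True if the address falls inside any denied network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = parse_deny_rules(DENY_RULES)
print(is_denied("66.102.1.1", networks))
```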
xiaoz
2016-12-01 22:01:01 +08:00
Check your site's robots.txt with Google's webmaster tools. I once had Google report an error because my robots.txt contained a BOM.
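
A quick sketch for detecting and removing a UTF-8 byte order mark at the start of a robots.txt file, assuming the file content has already been read as bytes. The function name is made up for this example.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return (content without a leading UTF-8 BOM, whether a BOM was found)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


# Sample content with a BOM in front of the first directive.
raw = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
cleaned, had_bom = strip_utf8_bom(raw)
print(had_bom, cleaned.splitlines()[0])
```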
Vicer
2016-12-01 23:48:02 +08:00
Taking notes.
Showfom
2016-12-02 09:08:13 +08:00
@stamaimer Google's crawler IPs are basically all listed here:

http://bgp.he.net/AS15169#_prefixes

Block all of them and the problem is solved. I've tested it myself.
stamaimer
2016-12-02 10:13:11 +08:00
Learned something, comrades.

https://www.v2ex.com/t/324666
