Do crawler bots no longer honor the robots.txt rules set in the site root?

The robots.txt rules set in the site root do not seem to take effect for GPTBot or Facebook's crawler.

User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

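It has been more than 30 hours since I put this robots.txt in place. If none of them honor robots.txt, the only option left is to block them in the nginx configuration.

The crawlers have completely saturated my 10 Mbps link.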
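
As a sanity check that the rules themselves are not the problem, the snippet below feeds the robots.txt above to Python's standard `urllib.robotparser` and asks whether two of the user agents from the access log may fetch one of the logged URLs. This is only a sketch: the helper name `is_allowed` is made up for illustration, and a compliant parser saying "disallowed" says nothing about whether a real crawler actually obeys.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as posted in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it only wraps the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "/?s=search/index/cid/323/bid/24/scid/85C4/peid/27/ov/new-asc.html"
    # GPTBot has its own group; meta-externalagent falls under "User-agent: *".
    for agent in ("GPTBot", "meta-externalagent"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, url))
```

Both agents come back as not allowed, so the file is well-formed; the traffic in the log is a compliance problem rather than a syntax problem.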

20.171.207.130 - - [17/Oct/2025:09:16:41 +0800] "GET /?s=search/index/cid/323/bid/24/scid/85C4/peid/27/ov/new-asc.html HTTP/1.1" 200 38211 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
117.50.153.198 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/cid/316/scid/85C4/poid/33/bid/8/ov/new-asc/peid/7.html HTTP/1.1" 200 38340 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"
57.141.0.25 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/444/peid/17/bid/12.html HTTP/1.1" 200 637932 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.12 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/631/peid/29/bid/28/ov/price-asc.html HTTP/1.1" 200 637644 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.74 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/C4/cid/608/peid/7/ov/new-asc.html HTTP/1.1" 200 635769 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
57.141.0.63 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/peid/29/bid/24/scid/C4/cid/570/ov/access-desc.html HTTP/1.1" 200 618851 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
117.50.153.198 - - [17/Oct/2025:09:16:43 +0800] "GET /?s=search/index/cid/321/bid/29/scid/85C4/ov/new-desc/peid/7/poid/33.html HTTP/1.1" 200 38368 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"
57.141.0.34 - - [17/Oct/2025:09:16:43 +0800] "GET /?s=search/index/poid/33/peid/18/ov/new-desc/scid/9EBB198E982B/bid/8/cid/367.html HTTP/1.1" 200 467003 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
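
To see which user agents are eating the bandwidth, log lines in this format can be summed by response size. The following is a minimal sketch, assuming the log uses the nginx "combined" layout shown above; the function name `bytes_by_user_agent` and the regular expression are illustrative, and the sample lines are copied from the log excerpt in this post.

```python
import re
from collections import Counter

# Matches the "combined" access-log layout seen in the excerpt above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

# Three lines taken verbatim from the excerpt in this post.
SAMPLE_LINES = [
    '20.171.207.130 - - [17/Oct/2025:09:16:41 +0800] "GET /?s=search/index/cid/323/bid/24/scid/85C4/peid/27/ov/new-asc.html HTTP/1.1" 200 38211 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '57.141.0.25 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/444/peid/17/bid/12.html HTTP/1.1" 200 637932 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"',
    '57.141.0.12 - - [17/Oct/2025:09:16:42 +0800] "GET /?s=search/index/poid/33/scid/9EBB198E982B/cid/631/peid/29/bid/28/ov/price-asc.html HTTP/1.1" 200 637644 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"',
]


def bytes_by_user_agent(lines):
    """Sum the response-size field per user-agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        size = 0 if match["bytes"] == "-" else int(match["bytes"])
        totals[match["ua"]] += size
    return totals


if __name__ == "__main__":
    for ua, total in bytes_by_user_agent(SAMPLE_LINES).most_common():
        print(f"{total:>10}  {ua}")
```

Run against the full access log (for example by passing an open file object instead of `SAMPLE_LINES`), this gives a quick ranking of which agents to block first.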

Addendum 1 · 2 days ago

I now simply `return 403;` at the nginx level.

Peace and quiet at last.

27 replies • 2025-10-18 12:22:33 +08:00

Configuration

2 days ago

1. robots.txt is a gentlemen's agreement.
2. The UA can be forged.

keer

2 days ago

@Configuration Seen that way, they are not gentlemen at all.

SuperGeorge

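2 days ago

Calling out YisouSpider by name: robots.txt has no effect on it at all. Even after I blacklisted both its UA and its IP ranges, it keeps crawling like crazy, and the 403 status-code alerts have never stopped.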

iugo

2 days ago

Reference: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/

1. The UA is `meta-externalagent`.
2. Check whether the IP is one of the crawler IPs that Meta declares.

As for OpenAI's crawler, no comment.
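
The two checks listed in this reply (UA string plus declared IP range) might be combined roughly as follows. This is a sketch under loud assumptions: `ASSUMED_CRAWLER_NETWORKS` is a placeholder chosen only because it covers the 57.141.0.x addresses in the log above, not Meta's published list, which should be taken from the page linked in this reply. The function name is likewise hypothetical.

```python
import ipaddress

# PLACEHOLDER, not Meta's real published ranges. It merely covers the
# 57.141.0.x addresses seen in the access log in this thread.
ASSUMED_CRAWLER_NETWORKS = [ipaddress.ip_network("57.141.0.0/24")]


def is_declared_meta_crawler(user_agent, remote_ip,
                             networks=ASSUMED_CRAWLER_NETWORKS):
    """Return True only if both the UA and the source IP match.

    A request that claims the crawler UA but comes from an address
    outside the declared networks is treated as a forgery.
    """
    if "meta-externalagent" not in user_agent.lower():
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    ua = "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
    print(is_declared_meta_crawler(ua, "57.141.0.25"))     # UA and IP both match
    print(is_declared_meta_crawler(ua, "117.50.153.198"))  # UA matches, IP does not
```

Requiring both conditions addresses the point raised earlier in the thread that a UA alone can be forged.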

1up

2 days ago

They have never honored it.

Goooooos

2 days ago

AI crawlers these days do not even consider themselves crawlers; they do whatever they like.

liuidetmks

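2 days ago

Once a request is identified as an AI crawler, could we serve it randomly shuffled fake text?

Search engines at least send traffic back to the site; AI just drinks your blood.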
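
One naive reading of the "shuffled fake text" idea is to reorder the words of the real page before serving it to a detected crawler. The sketch below shows only that shuffling step; `scramble_words` is a hypothetical helper, it splits on whitespace (so it does nothing useful for unsegmented Chinese text), and whether this actually degrades anyone's training data is untested.

```python
import random
from typing import Optional


def scramble_words(text: str, seed: Optional[int] = None) -> str:
    """Return text with its whitespace-separated words in random order.

    Passing a seed makes the output reproducible, which helps caching.
    """
    words = text.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    page_text = "search engines send traffic back to the site"
    print(scramble_words(page_text, seed=2025))
```

The response still costs bandwidth, so on a saturated link this is a weaker remedy than refusing the request outright.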

bgm004

2 days ago

AI crawlers are just like Xunlei (Thunder) back in the day.

picone

2 days ago

I noticed this too and now return 403 directly based on the UA. They really do run wild.

laobaiguolai

2 days ago

I use Cloudflare; its detection and blocking are pretty decent.

opengps

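2 days ago

I happened to work on exactly this recently. Search-engine crawlers at least send an explicit UA. A UA is easy to forge, but if you do not want the official crawlers, you can start by blocking them on the UA. (Ethically speaking, an official crawler is at least unlikely to forge its UA blatantly.)
As a bonus, here are the UA keywords of the main AI crawlers I have noticed recently: "mj12bot","openai","gptbot","claudebot","semrushbot","siteauditbot"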
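
The keyword list in this reply can be applied as a case-insensitive substring match on the User-Agent header. The sketch below assumes that matching rule; the function name `matches_blocked_keyword` is made up, and the keywords are exactly the ones posted in this reply.

```python
# Keywords exactly as posted in this reply.
AI_CRAWLER_KEYWORDS = (
    "mj12bot", "openai", "gptbot", "claudebot", "semrushbot", "siteauditbot",
)


def matches_blocked_keyword(user_agent, keywords=AI_CRAWLER_KEYWORDS):
    """Return True if any keyword occurs in the UA, ignoring case."""
    ua = user_agent.lower()
    return any(keyword in ua for keyword in keywords)


if __name__ == "__main__":
    gptbot_ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
                 "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
    meta_ua = ("meta-externalagent/1.1 "
               "(+https://developers.facebook.com/docs/sharing/webmasters/crawler)")
    print(matches_blocked_keyword(gptbot_ua))  # matches "gptbot" and "openai"
    print(matches_blocked_keyword(meta_ua))    # no keyword in the list matches
```

Note that the list as posted does not match the `meta-externalagent` UA from the original post's log, so that string would have to be added separately to cover it.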