AI Crawler User-Agents
AI platforms typically operate multiple crawlers for distinct purposes. The user-agent string identifies which system is accessing your site. Whether to allow or block a particular crawler depends on its purpose, not just the company that operates it.
What is the difference between a training crawler and a retrieval crawler?
Training crawlers fetch content to build or update model weights. Blocking them stops your content from being used in future training runs. It has no effect on whether that platform’s AI products currently cite or surface your site.
Retrieval crawlers index or fetch content for live search and answer generation. Blocking them removes your site from that platform’s AI search results.
The two are separate systems. Some operators split them cleanly across different user-agents: OpenAI uses GPTBot for training and OAI-SearchBot for search indexing. Anthropic uses ClaudeBot for training, Claude-SearchBot for search indexing, and Claude-User for live page fetches when a user’s query requires it. Others draw the line less cleanly: Perplexity’s PerplexityBot serves both indexing and live retrieval, alongside a separate Perplexity-User agent for live retrieval.
Because the decisions are independent, you can block training crawlers while leaving retrieval crawlers untouched, or vice versa.
Which AI crawlers are currently active?
| User-agent | Purpose |
|---|---|
| GPTBot (OpenAI) [1] | Training |
| OAI-SearchBot (OpenAI) [1] | ChatGPT Search indexing |
| ChatGPT-User (OpenAI) [1] | Live retrieval (user-triggered) |
| ClaudeBot (Anthropic) [2] | Training |
| Claude-SearchBot (Anthropic) [2] | Search indexing |
| Claude-User (Anthropic) [2] | Live retrieval (user-triggered) |
| PerplexityBot (Perplexity) [3] | Search indexing and live retrieval |
| Perplexity-User (Perplexity) [3] | Live retrieval |
| Google-Extended (Google) [4] | Gemini and Vertex AI training |
| CCBot (Common Crawl) [5] | Open training dataset (used by many providers) |
| Amazonbot (Amazon) [6] | Training and product improvement |
| Applebot (Apple) [7] | Siri and Spotlight indexing |
| Applebot-Extended (Apple) [7] | Generative AI features training opt-out |
| meta-externalagent (Meta) [8] | AI training |
| YouBot (You.com) [9] | Search indexing |
| ByteSpider (ByteDance) | Training and search |
Note on ChatGPT-User: OpenAI updated its documentation in December 2025 to remove its commitment to honouring robots.txt for ChatGPT-User [10]. Unlike the other crawlers listed here, you cannot rely on robots.txt alone to control it.
Note on ByteSpider: ByteDance does not publish IP verification data and ByteSpider has a documented history of robots.txt non-compliance. IP-level blocking is more reliable than robots.txt for this crawler.
How do you verify a crawler is legitimate?
A user-agent string is self-reported. Any client can claim to be GPTBot or Googlebot. Verification requires checking the source IP, not the string.
Reverse DNS lookup: perform a reverse DNS lookup on the IP that made the request. The resulting hostname should fall under the domain the operator documents for that crawler (e.g. googlebot.com for Googlebot; for other crawlers, whatever domain the operator publishes). Then perform a forward DNS lookup on that hostname and confirm it resolves back to the same IP. This forward-confirmed reverse DNS check is the standard verification method. See Log File Analysis for how to run this check against your server logs.
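The forward-confirmed reverse DNS check described above can be sketched in Python with the standard `socket` module. This is a minimal illustration, not a production verifier: the function name `is_verified_crawler`, its parameters, and the injectable resolver arguments are all hypothetical, and the list of acceptable domains must come from each operator's own documentation.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def _reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP using the system resolver."""
    hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    return hostname


def _forward_lookup(hostname: str) -> List[str]:
    """Return every address (IPv4 and IPv6) the hostname resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def _same_ip(a: str, b: str) -> bool:
    """Compare two IP strings by value rather than by spelling."""
    try:
        return ipaddress.ip_address(a) == ipaddress.ip_address(b)
    except ValueError:
        return False


def is_verified_crawler(
    ip: str,
    allowed_domains: Iterable[str],
    reverse_lookup: Callable[[str], str] = _reverse_lookup,
    forward_lookup: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for one request IP.

    allowed_domains is an assumption supplied by the caller,
    e.g. ["googlebot.com"] when checking a claimed Googlebot request.
    """
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed

    # The hostname must equal a documented domain or be a subdomain of it.
    # A bare endswith("googlebot.com") would also accept "evilgooglebot.com".
    domains = [d.lower().lstrip(".") for d in allowed_domains]
    if not any(hostname == d or hostname.endswith("." + d) for d in domains):
        return False

    try:
        forward_ips = forward_lookup(hostname)
    except OSError:
        return False

    # The forward lookup must lead back to the IP that made the request.
    return any(_same_ip(ip, candidate) for candidate in forward_ips)
```

A call such as `is_verified_crawler(request_ip, ["googlebot.com"])` performs two live DNS queries, so in practice the result is worth caching per IP. The resolver arguments exist only so the logic can be exercised without network access.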
Published IP lists: several operators provide machine-readable IP ranges:
- OpenAI: published in the bots documentation [1]
- Anthropic: published in the crawler help article [2]
- Perplexity: https://www.perplexity.com/perplexitybot.json and https://www.perplexity.com/perplexity-user.json [3]
- Common Crawl: https://index.commoncrawl.org/ccbot.json [5]
- Amazon: published in the Amazonbot documentation [6]
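Once an operator's ranges are downloaded, checking a request IP against them is a CIDR membership test. The sketch below uses Python's standard `ipaddress` module. The JSON layout it parses (a `prefixes` array of objects with `ipv4Prefix` or `ipv6Prefix` keys) is an assumption, as are the helper names `extract_prefixes` and `ip_in_ranges`; inspect each operator's actual file and adjust the keys if its structure differs. The sample data uses documentation-only address blocks, not real crawler ranges.

```python
import ipaddress
import json
from typing import Iterable, List


def extract_prefixes(document: dict) -> List[str]:
    """Collect CIDR strings from an IP-range document.

    ASSUMPTION: the layout is
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}.
    """
    prefixes = []
    for entry in document.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                prefixes.append(entry[key])
    return prefixes


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Return True if the IP falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # IPv4 addresses are never members of IPv6 networks, and vice versa.
        if address.version == network.version and address in network:
            return True
    return False


# Invented sample built from reserved documentation ranges.
SAMPLE = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
""")
SAMPLE_PREFIXES = extract_prefixes(SAMPLE)

if __name__ == "__main__":
    for candidate in ("192.0.2.55", "198.51.100.7", "2001:db8::1"):
        print(candidate, ip_in_ranges(candidate, SAMPLE_PREFIXES))
```

Published ranges change over time, so a real deployment would refresh the downloaded file on a schedule rather than hard-coding the prefixes.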
For crawlers without a published IP list (ByteSpider, YouBot, meta-externalagent), log-based pattern analysis and reverse DNS are the available options.
Which crawlers should you allow or block?
The robots.txt syntax for targeting specific crawlers is covered in Crawlability and robots.txt. The decision of what to do depends on your situation:
No specific concern: the default, where no explicit rules apply to these agents, allows all well-behaved crawlers. Most sites are in this position.
Want AI citations, not training contribution: block training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) while leaving retrieval crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User) untouched. Blocking training crawlers has no effect on search citations.
Paywalled or licensed content: blocking all AI crawlers is a defensible position. Training crawlers extract value without payment; retrieval crawlers may surface excerpts without driving traffic. Be aware that ChatGPT-User may not reliably respect robots.txt directives.
ByteSpider: add IP-level blocks via your hosting platform or CDN in addition to robots.txt rules, given the documented non-compliance.
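A policy such as "AI citations, not training contribution" can be sanity-checked before a robots.txt file is deployed. The sketch below is one way to do that with Python's standard `urllib.robotparser`: it generates rules that block the four training agents named above and confirms the retrieval agents remain allowed. `build_rules`, `allowed_agents`, and the example.com URL are hypothetical names used for illustration. Python's parser applies its own matching logic, so this shows what the file says, not how a given crawler will interpret or honour it.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

# Agent groupings taken from the scenario described above.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
RETRIEVAL = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Claude-SearchBot", "Claude-User",
]


def build_rules(blocked_agents: Iterable[str]) -> str:
    """Return robots.txt text that disallows the whole site per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    return "\n".join(groups)


def allowed_agents(
    robots_txt: str,
    agents: Iterable[str],
    url: str = "https://example.com/article",  # placeholder URL
) -> Dict[str, bool]:
    """Map each agent name to whether the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    rules = build_rules(TRAINING)
    print(rules)
    print(allowed_agents(rules, TRAINING + RETRIEVAL))
```

Because no catch-all `User-agent: *` group is generated, any agent without an explicit rule is treated as allowed, which matches the default position described above.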
Frequently asked questions
Does blocking GPTBot affect ChatGPT search results?
No. OAI-SearchBot handles ChatGPT Search indexing. GPTBot is a training crawler only. Blocking it has no effect on whether your site appears in ChatGPT search answers.
Does blocking ClaudeBot affect Claude’s answers?
No. ClaudeBot is Anthropic’s training crawler. Claude-SearchBot handles search indexing, and Claude-User handles live retrieval. Blocking ClaudeBot only affects whether your content is used in future training data.
Can I block one platform’s training crawler but not another’s?
Yes. Each has a distinct user-agent string. You can write separate robots.txt rules for each.
Do all AI crawlers respect robots.txt?
No. Compliance is voluntary: there is no legal requirement to honour robots.txt. Most major operators comply by policy. ByteSpider is a documented exception, and OpenAI removed its robots.txt commitment for ChatGPT-User from its documentation in December 2025 [10].
How do I see which AI crawlers are actually hitting my site?
Server log analysis is the most accurate method. See Log File Analysis.
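As a rough starting point, the sketch below tallies requests per AI crawler from access-log lines. It assumes the common "combined" log format, where the user-agent is the last quoted field; the function name `count_ai_crawlers`, the regular expression, and the sample log lines (including their simplified user-agent strings) are invented for illustration. Google-Extended and Applebot-Extended are left out because, as far as the operators' documentation indicates, they are robots.txt control tokens rather than user-agents that appear in requests.

```python
import re
from collections import Counter
from typing import Iterable

# User-agent tokens from the table above that can appear in request logs.
AI_AGENT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot", "Amazonbot",
    "Applebot", "meta-externalagent", "YouBot", "ByteSpider",
]

# ASSUMPTION: combined log format, user-agent is the final quoted field.
LINE_RE = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"\s*$')


def count_ai_crawlers(lines: Iterable[str]) -> Counter:
    """Count log lines whose user-agent contains a known AI crawler token."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in agent:
                counts[token] += 1
                break
    return counts


# Invented sample lines; IPs come from reserved documentation ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [10/Jan/2026:10:00:05 +0000] "GET /a HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:10:00:09 +0000] "GET /b HTTP/1.1" 200 700 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
    '192.0.2.10 - - [10/Jan/2026:10:00:12 +0000] "GET /c HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

if __name__ == "__main__":
    for token, hits in count_ai_crawlers(SAMPLE_LOG).most_common():
        print(token, hits)
```

These counts reflect claimed user-agents only. Because the string is self-reported, the source IPs behind any significant volume still need to be verified before the numbers are trusted.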