AI Crawler User-Agents

Last updated 14 July 2026

AI platforms typically operate multiple crawlers for distinct purposes. The user-agent string identifies which system is accessing your site. Whether to allow or block a particular crawler depends on its purpose, not just the company that operates it.

What are the three types of AI crawler?

Training crawlers fetch content to build or update model weights. Blocking them stops your content from being used in future training runs. It has no effect on whether that platform’s AI products currently cite or surface your site.

Search crawlers index content so it can be retrieved and cited when an AI product answers a question, much as Googlebot indexes for Search. Blocking them removes your site from that platform’s AI search results.

Agent crawlers fetch a page live, at the moment a user or an agent asks for it, rather than indexing it in advance. ChatGPT-User, Claude-User and Perplexity-User sit here. They behave less like a crawler and more like a browser acting on someone’s behalf. That is the argument their operators use for exempting them from robots.txt, and it is why two of the three, ChatGPT-User and Perplexity-User, are the hardest to control.

The three are separate systems, and the decisions about them are independent. Some operators split them cleanly across user-agents: OpenAI uses GPTBot for training, OAI-SearchBot for search indexing and ChatGPT-User for live fetches. Anthropic mirrors that with ClaudeBot, Claude-SearchBot and Claude-User. Others blur the lines: Perplexity’s PerplexityBot serves both indexing and live retrieval.

This three-way split is no longer just a useful mental model. It is the categorisation Cloudflare uses to gate crawler access at the network layer (see what changes at the CDN below), which means it increasingly determines what actually reaches your server, regardless of what your robots.txt says.

Which AI crawlers are currently active?

User-agent	Purpose
`GPTBot` (OpenAI)¹	Training
`OAI-SearchBot` (OpenAI)¹	ChatGPT Search indexing
`ChatGPT-User` (OpenAI)¹	Live retrieval (user-triggered)
`ClaudeBot` (Anthropic)²	Training
`Claude-SearchBot` (Anthropic)²	Search indexing
`Claude-User` (Anthropic)²	Live retrieval (user-triggered)
`claude-code` (Anthropic)²	Claude Code CLI URL fetches
`PerplexityBot` (Perplexity)³	Search indexing and live retrieval
`Perplexity-User` (Perplexity)³	Live retrieval
`Google-Extended` (Google)⁴	Gemini and Vertex AI training
`CCBot` (Common Crawl)⁵	Open training dataset (used by many providers)
`Amazonbot` (Amazon)⁶	Training and product improvement
`Applebot` (Apple)⁷	Siri and Spotlight indexing
`Applebot-Extended` (Apple)⁷	Generative AI features training opt-out
`meta-externalagent` (Meta)⁸	AI training
`YouBot` (You.com)⁹	Search indexing
`ByteSpider` (ByteDance)¹⁰	Training and search
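As a rough illustration of how the table can be applied to request headers, the sketch below maps a raw user-agent string to the purposes listed above. The function name, the purpose labels and the sample strings are assumptions made for this example. The mapping is a subset of the table: Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens rather than strings sent in request headers, and claude-code is left out because the table does not assign it to one of the three categories.

```python
from typing import Dict, Optional, Tuple

# Subset of the table above. Purpose labels ("training", "search", "agent")
# are this sketch's own shorthand for the three crawler types.
CRAWLER_PURPOSES: Dict[str, Tuple[str, ...]] = {
    "GPTBot": ("training",),
    "OAI-SearchBot": ("search",),
    "ChatGPT-User": ("agent",),
    "ClaudeBot": ("training",),
    "Claude-SearchBot": ("search",),
    "Claude-User": ("agent",),
    "PerplexityBot": ("search", "agent"),
    "Perplexity-User": ("agent",),
    "CCBot": ("training",),
    "meta-externalagent": ("training",),
    "YouBot": ("search",),
}


def classify_user_agent(ua: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (token, purposes) for the first known token found in the string."""
    lowered = ua.lower()
    for token, purposes in CRAWLER_PURPOSES.items():
        # Case-insensitive substring match on the self-reported header.
        if token.lower() in lowered:
            return token, purposes
    return None


if __name__ == "__main__":
    # Illustrative header value, not copied from any operator's documentation.
    sample = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
    print(classify_user_agent(sample))
```

The result only says what the client claims to be. It does not confirm the claim, which is what the verification methods later in this article are for.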

Note on ChatGPT-User: OpenAI’s documentation now states that its robots.txt tags apply to OAI-SearchBot and GPTBot,¹ having removed ChatGPT-User from that list in a December 2025 revision.¹¹ Unlike most of the crawlers listed here, you cannot rely on robots.txt alone to control it.

Note on ByteSpider: ByteDance does not publish IP verification data and ByteSpider has a documented history of robots.txt non-compliance.¹⁰ IP-level blocking is more reliable than robots.txt for this crawler.

Note on Microsoft Web IQ: Microsoft’s Web IQ grounding API draws from Bing’s existing index. Jordi Ribas, Microsoft’s President of Search and AI, has said it is used directly in Copilot and by ChatGPT “for some of its web answers”,¹² so it accounts for part of ChatGPT’s web retrieval rather than all of it. No separate Web IQ crawler user-agent exists: BingBot governs what Web IQ can access, and Web IQ inherits Bing’s existing robots.txt compliance. Microsoft has stated it is engaging with the IETF on formalising publisher controls for the AI era.

How do you verify a crawler is legitimate?

A user-agent string is self-reported. Any client can claim to be GPTBot or Googlebot. Verification requires checking the source IP, not the string.

Reverse DNS lookup: perform a reverse DNS lookup on the IP that made the request. The resulting hostname should match the crawler’s documented domain (e.g. googlebot.com for Googlebot). Then perform a forward DNS lookup on that hostname and confirm it resolves back to the same IP. This forward-confirmed reverse DNS check is the standard verification method. See Log File Analysis for how to run this check against your server logs.
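The forward-confirmed reverse DNS check described above can be sketched with Python’s standard library. This is a minimal sketch, not a production verifier: the function names are invented for this example, the resolver functions are passed in as parameters so the logic can be exercised without network access, and the list of accepted domains has to come from each operator’s own documentation.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return every address the hostname resolves to (IPv4 and IPv6)."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_verified_crawler(
    ip: str,
    allowed_domains: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for one request IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False
    # Match on a label boundary so "evil-googlebot.com" is not accepted.
    if not any(hostname == d or hostname.endswith("." + d) for d in allowed_domains):
        return False
    try:
        resolved = forward(hostname)
    except OSError:
        return False
    target = ipaddress.ip_address(ip)
    # Drop any IPv6 zone suffix ("%eth0") before comparing.
    return any(ipaddress.ip_address(a.split("%")[0]) == target for a in resolved)
```

Calling `is_verified_crawler(request_ip, ["googlebot.com"])` with the default resolvers performs two live DNS queries per IP, so in practice the results would usually be cached rather than recomputed on every request.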

Published IP lists: several operators provide machine-readable IP ranges:

OpenAI: published in the bots documentation¹
Perplexity: https://www.perplexity.com/perplexitybot.json and https://www.perplexity.com/perplexity-user.json³
Common Crawl: https://index.commoncrawl.org/ccbot.json⁵
Amazon: published in the Amazonbot documentation⁶
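Where an operator publishes a range list, checking a request IP against it is a containment test. The sketch below assumes the file has a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` key. That schema is an assumption for illustration and should be confirmed against each operator’s actual file. The sample data uses reserved documentation ranges, not real crawler addresses.

```python
import ipaddress
import json
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical payload in the assumed schema, using documentation-only ranges.
SAMPLE_RANGES = """
{"prefixes": [
  {"ipv4Prefix": "192.0.2.0/24"},
  {"ipv6Prefix": "2001:db8::/32"}
]}
"""


def load_prefixes(raw_json: str) -> List[Network]:
    """Parse a published range file into network objects."""
    data = json.loads(raw_json)
    networks: List[Network] = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(ip: str, networks: List[Network]) -> bool:
    """True if the address falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network, and vice versa.
    return any(addr in net for net in networks)


if __name__ == "__main__":
    nets = load_prefixes(SAMPLE_RANGES)
    print(ip_in_ranges("192.0.2.17", nets))
```

Published ranges change, so a check like this is only as current as the last time the file was fetched.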

Anthropic does not publish IP ranges for its crawlers. Its bots run on public cloud provider addresses, making IP-based blocking unreliable. Reverse DNS verification is the only reliable method for confirming Anthropic crawler identity.

For crawlers without a published IP list (Anthropic, ByteSpider, YouBot, meta-externalagent), log-based pattern analysis and reverse DNS are the available options.

What is replacing reverse-DNS verification?

Reverse DNS and published IP lists are the current standard, but both are workarounds for a missing piece: there is no cryptographic way for a client to prove its identity at the moment of the request. Two separate standardisation efforts are trying to close that gap, and they solve different problems.

Signed agents (bot identity). The Web Bot Auth effort uses HTTP Message Signatures so that an automated client can cryptographically sign its requests, letting a server confirm a crawler is who it claims to be without depending on IP ranges. This is the direction that would eventually replace reverse-DNS checks for verifying agents like GPTBot or PerplexityBot. It is early-stage and not yet widely implemented.

Personhood tokens (human-in-the-loop). In June 2026 Cloudflare, Google, Microsoft, Mozilla and Shopify announced PACT (Private Access Control Tokens), a protocol building on Privacy Pass (RFC 9576) that lets a site issue anonymous tokens proving a real person, or an agent authorised by one, is behind a request, without tracking the user.¹³ PACT does not verify crawler identity and does not replace the user-agent and reverse-DNS methods above; it answers the separate question of whether an automated client is acting on a genuine user’s behalf. See Cloudflare and Browser Makers Announce PACT for detail.

Both are proposed standards in development with no deployment timeline. For now, reverse DNS and published IP lists remain the practical verification methods. The value in tracking these efforts is understanding where access control is heading: from self-reported user-agent strings towards cryptographic proof of both who a client is and on whose behalf it acts.

Which crawlers should you allow or block?

The robots.txt syntax for targeting specific crawlers is covered in Crawlability and robots.txt. The decision of what to do depends on your situation:

No specific concern: the default, where no explicit rules apply to these agents, allows all well-behaved crawlers. Most sites are in this position.

Want AI citations, not training contribution: block training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) while leaving the search and agent crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User) untouched. Blocking training crawlers has no effect on search citations. If you enforce this at a CDN rather than in robots.txt, check first that the training block does not also catch Googlebot, Bingbot or Applebot.
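One way to sanity-check a citations-not-training policy before deploying it is to run the draft rules through Python’s `urllib.robotparser`. The robots.txt body and the helper name below are illustrative assumptions. The parser reflects Python’s own matching behaviour, which is not guaranteed to match how each crawler interprets the file.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

# Draft policy: one group that disallows the four training crawlers named above.
# Agents with no matching group fall through to "allowed".
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
"""


def allowed_agents(
    robots_txt: str,
    agents: Iterable[str],
    url: str = "https://example.com/article",
) -> Dict[str, bool]:
    """Map each agent token to whether the draft rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot", "ChatGPT-User"]
    for agent, ok in allowed_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

A check like this confirms only what the file says. It says nothing about whether a given crawler honours the file, or about what a CDN in front of the site does with the request.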

Paywalled or licensed content: blocking all AI crawlers is a defensible position. Training crawlers extract value without payment; search and agent crawlers may surface excerpts without driving traffic. Be aware that ChatGPT-User may not reliably respect robots.txt directives.

ByteSpider: add IP-level blocks via your hosting platform or CDN in addition to robots.txt rules, given the documented non-compliance.

Perplexity-User: robots.txt is not effective. Use server-side access controls (IP blocking or authentication) to prevent Perplexity-User from fetching content. PerplexityBot (the indexing crawler) does respect robots.txt and can be controlled normally.

In August 2025, Cloudflare published research documenting a secondary Perplexity crawler that spoofs a standard Chrome on macOS user-agent string and rotates IP ranges when declared Perplexity crawlers are blocked, generating millions of additional daily requests to affected sites.¹⁴ Cloudflare subsequently de-listed Perplexity as a verified bot and added managed-rule heuristics to detect and block the behaviour. Perplexity disputed the characterisation, stating the activity relates to user-triggered AI Assistant requests rather than automated crawling. Site owners who need to restrict Perplexity access should use CDN-level managed rules rather than relying on IP blocklists alone.

What changes at the CDN

robots.txt asks. A CDN enforces. For a growing share of sites, the decision about whether an AI crawler gets a response is now made before the request reaches the file at your document root.

On 1 July 2026 Cloudflare gave all customers, free accounts included, the ability to sort AI crawlers into the three categories above and manage each separately. From 15 September 2026, Training and Agent crawlers are blocked by default on ad-supported pages for new sites and new customers, while Search crawlers remain allowed.¹⁵ For those sites the default answer flips from allow to deny, without anyone editing a robots.txt file.

The trap: blocking training can block Googlebot. Cloudflare applies the strictest matching rule to crawlers that serve more than one purpose. Googlebot, Bingbot and Applebot all fetch for both search indexing and AI training, so a site that blocks the Training category can find it has blocked the search crawler its traffic depends on.¹⁵ ¹⁶ The symptom is a crawl collapse with a robots.txt that looks entirely correct, because the block never reaches robots.txt. This is the single most expensive mistake available in this area, and it is easy to make by accident while trying to do the right thing about AI training.
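The strictest-match behaviour can be modelled in a few lines to show why the trap occurs. This is a toy model of the rule as described above, not Cloudflare’s API or configuration format. The purpose assignments and function name are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

# Purposes per crawler, as characterised in this article.
PURPOSES_BY_CRAWLER: Dict[str, FrozenSet[str]] = {
    "GPTBot": frozenset({"training"}),
    "OAI-SearchBot": frozenset({"search"}),
    "ChatGPT-User": frozenset({"agent"}),
    "Googlebot": frozenset({"search", "training"}),
    "Bingbot": frozenset({"search", "training"}),
    "Applebot": frozenset({"search", "training"}),
}


def collateral_search_blocks(blocked_categories: Iterable[str]) -> List[str]:
    """Search crawlers that end up blocked even though Search is allowed.

    Strictest match: a crawler is blocked if ANY of its purposes is blocked.
    """
    blocked = set(blocked_categories)
    return sorted(
        name
        for name, purposes in PURPOSES_BY_CRAWLER.items()
        if "search" in purposes and "search" not in blocked and purposes & blocked
    )


if __name__ == "__main__":
    # Blocking only Training still catches the dual-purpose search crawlers.
    print(collateral_search_blocks({"training"}))
```

Under this model, blocking the Training category alone returns Applebot, Bingbot and Googlebot, which is the crawl collapse described above.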

The practical consequence for the allow-or-block decisions above: check what your CDN is configured to do before assuming your robots.txt is the operative control. See robots.txt and crawlability for the file-level detail.

Frequently asked questions

Does blocking GPTBot affect ChatGPT search results?
No. OAI-SearchBot handles ChatGPT Search indexing. GPTBot is a training crawler only. Blocking it has no effect on whether your site appears in ChatGPT search answers.

Does blocking ClaudeBot affect Claude’s answers?
No. ClaudeBot is Anthropic’s training crawler. Claude-SearchBot handles search indexing, and Claude-User handles live retrieval. Blocking ClaudeBot only affects whether your content is used in future training data.

Can I block one platform’s training crawler but not another’s?
Yes. Each has a distinct user-agent string. You can write separate robots.txt rules for each.

Do all AI crawlers respect robots.txt?
There is no legal requirement to do so. Most major operators comply by policy. ByteSpider is a documented exception.¹⁰ OpenAI removed ChatGPT-User from the list of crawlers covered by its robots.txt commitment in a December 2025 documentation revision.¹¹ Perplexity-User does not respect robots.txt by design; server-side access controls are required to block it.³

How do I see which AI crawlers are actually hitting my site?
Server log analysis is the most accurate method. See Log File Analysis. For a free graphical option, Microsoft Clarity’s Bot Analytics added a robots.txt violations view in June 2026 that reports which bots are ignoring your directives, broken down by operator, bot name, and the URLs they hit.¹⁷ It requires connecting a supported CDN (Fastly, CloudFront, or Cloudflare), so it is not zero-setup, and its bot list reflects Microsoft’s own detection rather than an industry standard.
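For a first pass over raw logs, a short script can count hits per claimed AI crawler. The sketch below assumes the common combined log format, where the user-agent is the last double-quoted field on each line. The token list and function name are illustrative, and the counts reflect self-reported user-agents that still need verifying by IP.

```python
import re
from collections import Counter
from typing import Iterable

# Tokens to look for; extend with whichever crawlers matter for your site.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot",
]

# Combined log format: the user-agent is the final quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines: Iterable[str]) -> Counter:
    """Count requests per crawler token, based on the claimed user-agent."""
    counts: Counter = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break  # count each request once
    return counts


if __name__ == "__main__":
    # "access.log" is a placeholder path for this example.
    with open("access.log", encoding="utf-8", errors="replace") as handle:
        for token, hits in count_ai_crawler_hits(handle).most_common():
            print(f"{token}: {hits}")
```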

See also

Log File Analysis for SEO

robots.txt and crawlability

HTTP User-Agents in SEO
