robots.txt Reference
robots.txt is a plain text file placed at the root of a domain (https://example.com/robots.txt) that tells compliant crawlers which URLs they may and may not fetch. It is part of the Robots Exclusion Protocol, a convention rather than an enforced standard — well-behaved crawlers follow it, malicious ones do not.
For the SEO implications of crawl control, see Crawlability and robots.txt.
File location and format
The file must be at the root of the domain. It applies only to the domain it is hosted on — a robots.txt at example.com does not apply to subdomain.example.com, and vice versa.
Requirements:
- Plain text, UTF-8 encoded
- One directive per line
- Lines beginning with # are comments and are ignored by crawlers
- Blank lines separate rule groups
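For programmatic checks, Python's standard library includes a Robots Exclusion Protocol parser. A minimal sketch, using an illustrative domain and paths:

```python
# Minimal sketch: fetch a site's robots.txt and test whether a URL may be crawled.
# The domain and paths are illustrative, not taken from any real site.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # the file lives at the domain root
parser.read()  # downloads and parses the file

# Rules are evaluated per user agent; agents without their own group fall back to *.
print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))
print(parser.can_fetch("*", "https://example.com/blog/post"))
```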
Directives
User-agent
Specifies which crawler the following rules apply to. Must appear at the start of each rule group, before any Disallow or Allow directives.
User-agent: Googlebot
User-agent: *
- * matches all crawlers not covered by a specific rule group
- Multiple User-agent lines can precede a shared set of rules
- Rules are applied per user agent; a crawler only follows rules matching its own user agent string or *
- If both a specific rule and * exist, the specific rule takes precedence for that crawler
Disallow
Blocks the specified path. The crawler will not fetch any URL beginning with this path.
Disallow: /admin/
Disallow: /search
Disallow: /
- An empty Disallow: value means “allow everything” (equivalent to no restriction)
- Disallow: / blocks the entire site
- Paths are case-sensitive: Disallow: /Admin/ and Disallow: /admin/ are different rules
- The rule matches the beginning of the URL path: Disallow: /blog matches /blog, /blog/, and /blog-post (see the sketch after this list)
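The prefix behaviour can be expressed in a few lines; a sketch of the idea, not of any particular crawler's implementation:

```python
# Sketch of prefix matching for a plain Disallow rule (no wildcards).
# A rule blocks any URL path that starts with the rule's value.
def is_blocked(path: str, disallow: str) -> bool:
    if disallow == "":              # an empty Disallow: allows everything
        return False
    return path.startswith(disallow)

print(is_blocked("/blog-post", "/blog"))  # True: prefix match
print(is_blocked("/Admin/", "/admin/"))   # False: paths are case-sensitive
```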
Allow
Explicitly permits a path within a disallowed directory. Used to create exceptions.
Disallow: /private/
Allow: /private/public-page/
- Only meaningful within a Disallow context; Allow: / with no Disallow does nothing
- When both Allow and Disallow rules match a URL, the more specific (longer) rule wins
- If rules are the same length, Allow wins
Sitemap
Declares the location of an XML sitemap. Not a crawl rule — it is a hint to crawlers about where to find URLs.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
- Use absolute URLs, not relative paths
- Multiple Sitemap directives are allowed
- Can appear anywhere in the file, but convention places them at the end
- Googlebot also discovers sitemaps submitted via Search Console; this directive covers other crawlers
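When reading a file programmatically, the same standard-library parser shown earlier can list the declared sitemaps (Python 3.8+); the domain is illustrative:

```python
# Sketch: read the Sitemap declarations from a robots.txt file (Python 3.8+).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # illustrative domain
parser.read()

# Returns a list of declared sitemap URLs, or None if the file declares none.
print(parser.site_maps())
```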
Crawl-delay
Specifies a delay (in seconds) between successive requests from the crawler.
Crawl-delay: 2
- Googlebot does not honour Crawl-delay. To control Googlebot’s crawl rate, use the crawl rate settings in Google Search Console.
- Bing, Yandex, and many other crawlers do honour it
- Useful on low-resource servers to prevent crawlers from overwhelming the host
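If you operate a crawler, the declared delay can be read and used to pace requests; a minimal sketch in which the user agent and URLs are placeholders:

```python
# Sketch: honour a site's Crawl-delay when crawling politely.
# "mybot" and the URLs are illustrative placeholders.
import time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

delay = parser.crawl_delay("mybot")  # seconds, or None if not declared
for url in ["https://example.com/a", "https://example.com/b"]:
    # ... fetch url here ...
    time.sleep(delay or 1)  # fall back to a modest default pause
```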
Wildcards
* (asterisk)
Matches any sequence of characters, including none.
Disallow: /*.pdf$
Disallow: /search?*
Disallow: /*/print/
- Supported by Googlebot; behaviour varies across other crawlers
- Disallow: /search?* blocks any URL beginning with /search? followed by anything (the trailing * is redundant, since matching is already prefix-based)
- Disallow: /*/print/ blocks any path with /print/ as a segment
$ (end of string)
Anchors the pattern to the end of the URL.
Disallow: /*.pdf$
- Matches only URLs ending with the specified pattern
- Disallow: /*.pdf$ blocks /document.pdf but not /pdf-guide/
- Supported by Googlebot; behaviour varies across other crawlers
- Without $, the pattern is not anchored to the end, so it also matches URLs that continue past it, such as /document.pdf?download=1 (see the sketch below)
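Both wildcards can be approximated by translating a rule into a regular expression: * becomes "any sequence of characters" and a trailing $ becomes an end anchor. A sketch of the idea, not a faithful reimplementation of any crawler's matcher:

```python
# Sketch: approximate Googlebot-style wildcard matching for Disallow/Allow values.
import re

def rule_to_regex(rule: str) -> re.Pattern:
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(pattern + ("$" if anchored else ""))

def matches(rule: str, path: str) -> bool:
    # Rules match from the start of the URL path (plus query string, if present).
    return rule_to_regex(rule).match(path) is not None

print(matches("/*.pdf$", "/files/document.pdf"))       # True: ends with .pdf
print(matches("/*.pdf$", "/document.pdf?download=1"))   # False: $ anchors the end
print(matches("/*/print/", "/articles/2024/print/"))    # True: /print/ as a segment
```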
Rule precedence
When multiple rules match a URL, crawlers use the following precedence:
- The most specific (longest) matching rule wins
- If two rules are equal length, Allow wins over Disallow
- User-agent-specific rules take precedence over User-agent: * rules
Example:
User-agent: *
Disallow: /private/
Allow: /private/public/
A request to /private/public/page matches both rules. /private/public/ (length 16) is longer than /private/ (length 9), so Allow wins and the URL is accessible.
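The longest-match logic is simple enough to sketch directly (plain prefix rules only, wildcards ignored):

```python
# Sketch: pick the governing rule for a path using longest-match precedence.
# Rules are (directive, value) pairs from a single user-agent group.
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    matching = [(d, v) for d, v in rules if path.startswith(v)]
    if not matching:
        return True  # no rule matches: allowed by default
    # The longest value wins; on a tie, Allow sorts ahead of Disallow.
    directive, _ = max(matching, key=lambda r: (len(r[1]), r[0] == "Allow"))
    return directive == "Allow"

rules = [("Disallow", "/private/"), ("Allow", "/private/public/")]
print(is_allowed("/private/public/page", rules))  # True: the longer Allow rule wins
print(is_allowed("/private/secret", rules))       # False: only Disallow matches
```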
Common user agent strings
| User agent | Crawls |
|---|---|
| Googlebot | Web search (applies to all Googlebot variants unless overridden) |
| Googlebot-Image | Google Images |
| Googlebot-Video | Google Video |
| Googlebot-News | Google News |
| Google-Extended | Google AI (Gemini training and products) |
| AdsBot-Google | Google Ads landing page quality |
| Mediapartners-Google | AdSense |
Other search engines
| User agent | Crawls |
|---|---|
| Bingbot | Bing web search |
| Slurp | Yahoo (uses Bing index) |
| DuckDuckBot | DuckDuckGo |
| Baiduspider | Baidu |
| YandexBot | Yandex |
AI crawlers
| User agent | Purpose |
|---|---|
| GPTBot | OpenAI training crawler |
| OAI-SearchBot | OpenAI ChatGPT Search indexing |
| ChatGPT-User | OpenAI live retrieval at query time |
| ClaudeBot | Anthropic training crawler |
| anthropic-ai | Anthropic (alternative user agent) |
| PerplexityBot | Perplexity indexing crawler |
| Perplexity-User | Perplexity live retrieval |
| cohere-ai | Cohere training crawler |
| CCBot | Common Crawl (used by many model providers) |
| meta-externalagent | Meta AI training crawler |
Common configurations
Block all crawlers
User-agent: *
Disallow: /
Use during development. Remove before launch — this is the most common cause of sites not appearing in search after deployment.
Block AI training and indexing crawlers
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
Note: the group above blocks training and search-indexing crawlers, but not the live retrieval agents (ChatGPT-User, Perplexity-User) that AI assistants use to fetch pages at query time. Blocking training crawlers alone does not prevent AI search products (ChatGPT Search, Perplexity) from citing your content; training, indexing, and retrieval use different bots.
Block internal search results
User-agent: *
Disallow: /search
Disallow: /search/
Block admin and account areas
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /login
Disallow: /checkout/
Block parameterised URLs
User-agent: *
Disallow: /*?
Blocks all parameterised URLs. Use with caution — this can inadvertently block legitimate pages if any use query strings.
Block print and feed versions
User-agent: *
Disallow: /*/print/
Disallow: /*/feed/
Disallow: /*/amp/
Allow Googlebot, block all others
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
An empty Disallow after a specific user agent allows full access for that crawler. Useful for staging environments where you want Google to crawl but no other bots.
What robots.txt does not do
It does not prevent indexing. A URL blocked in robots.txt can still appear in search results if it has inbound links. Google will index the URL without crawling it, showing the page title from anchor text and no description. To prevent indexing, use a noindex meta tag (<meta name="robots" content="noindex">), but note that the page must be crawlable for Google to read it.
It does not secure content. robots.txt is public and explicitly advertises what you are hiding. Do not rely on it for sensitive content; use authentication.
It does not apply to subdomains. example.com/robots.txt has no authority over blog.example.com. Each subdomain needs its own file.
It does not apply to non-HTTP protocols. FTP, email, and other protocols are not covered.
It does not stop malicious bots. Only well-behaved crawlers follow robots.txt. Scrapers, spam bots, and vulnerability scanners ignore it.