robots.txt Reference

robots.txt is a plain text file placed at the root of a domain (https://example.com/robots.txt) that tells compliant crawlers which URLs they may and may not fetch. It is part of the Robots Exclusion Protocol, a convention rather than an enforced standard — well-behaved crawlers follow it, malicious ones do not.

For the SEO implications of crawl control, see Crawlability and robots.txt.


File location and format

The file must be at the root of the domain. It applies only to the domain it is hosted on — a robots.txt at example.com does not apply to subdomain.example.com, and vice versa.
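
The governing file for any page can be derived from that page's scheme and host. A minimal Python sketch (the URLs are placeholders):

from urllib.parse import urlsplit

def robots_txt_url(page_url):
    # robots.txt always sits at the root of the scheme and host the page is served from
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://example.com/blog/post"))   # https://example.com/robots.txt
print(robots_txt_url("https://blog.example.com/post"))   # https://blog.example.com/robots.txt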

Requirements:

  • Plain text, UTF-8 encoded
  • One directive per line
  • Lines beginning with # are comments and are ignored by crawlers
  • Blank lines separate rule groups

Directives

User-agent

Specifies which crawler the following rules apply to. Must appear at the start of each rule group, before any Disallow or Allow directives.

User-agent: Googlebot
User-agent: *
  • * matches all crawlers not covered by a specific rule group
  • Multiple User-agent lines can precede a shared set of rules
  • Rules are applied per user agent; a crawler only follows rules matching its own user agent string or *
  • If both a specific rule and * exist, the specific rule takes precedence for that crawler

Disallow

Blocks the specified path. The crawler will not fetch any URL beginning with this path.

Disallow: /admin/
Disallow: /search
Disallow: /
  • An empty Disallow: value means “allow everything” (equivalent to no restriction)
  • Disallow: / blocks the entire site
  • Paths are case-sensitive: Disallow: /Admin/ and Disallow: /admin/ are different rules
  • The rule matches the beginning of the URL path — Disallow: /blog matches /blog, /blog/, and /blog-post

Allow

Explicitly permits a path within a disallowed directory. Used to create exceptions.

Disallow: /private/
Allow: /private/public-page/
  • Only meaningful within a Disallow context; Allow: / with no Disallow does nothing
  • When both Allow and Disallow rules match a URL, the more specific (longer) rule wins
  • If rules are the same length, Allow wins

Sitemap

Declares the location of an XML sitemap. Not a crawl rule — it is a hint to crawlers about where to find URLs.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
  • Use absolute URLs, not relative paths
  • Multiple Sitemap directives are allowed
  • Can appear anywhere in the file, but convention places them at the end
  • Googlebot also discovers sitemaps submitted via Search Console; this directive covers other crawlers

Crawl-delay

Specifies a delay (in seconds) between successive requests from the crawler.

Crawl-delay: 2
  • Googlebot does not honour Crawl-delay. To control Googlebot’s crawl rate, use the crawl rate settings in Google Search Console.
  • Bing, Yandex, and many other crawlers do honour it
  • Useful on low-resource servers to prevent crawlers from overwhelming the host
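
For crawlers written in Python, the standard library's urllib.robotparser exposes the directive through crawl_delay(). A minimal sketch of a polite fetch loop (the site, URL list, and MyCrawler agent name are placeholders, and the actual fetch is omitted):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

delay = rp.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay applies to this agent

for url in ["https://example.com/a", "https://example.com/b"]:
    if rp.can_fetch("MyCrawler", url):
        ...  # fetch the URL here
        time.sleep(delay)  # wait between successive requests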

Wildcards

* (asterisk)

Matches any sequence of characters, including none.

Disallow: /*.pdf$
Disallow: /search?*
Disallow: /*/print/
  • Supported by Googlebot; behaviour varies across other crawlers
  • Disallow: /search?* blocks every URL beginning with /search? (that is, /search followed by any query string); the trailing * is redundant because rules already match by prefix
  • Disallow: /*/print/ blocks any path with /print/ as a segment

$ (end of string)

Anchors the pattern to the end of the URL.

Disallow: /*.pdf$
  • Matches only URLs ending with the specified pattern
  • Disallow: /*.pdf$ blocks /document.pdf but not /pdf-guide/
  • Supported by Googlebot; behaviour varies across other crawlers
  • Without $, Disallow: /*.pdf also matches URLs such as /file.pdf/preview, where .pdf is not at the end
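
A rough way to see what these patterns match is to translate them into regular expressions, with each * becoming .* and a trailing $ becoming an end anchor. This is a simplified sketch, not how any particular crawler implements matching:

import re

def pattern_to_regex(pattern):
    # Convert a robots.txt path pattern with * and $ into a prefix-matching regex (simplified)
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

print(bool(pattern_to_regex("/*.pdf$").match("/document.pdf")))           # True
print(bool(pattern_to_regex("/*.pdf$").match("/pdf-guide/")))             # False
print(bool(pattern_to_regex("/*/print/").match("/articles/print/page")))  # True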

Rule precedence

When multiple rules match a URL, crawlers use the following precedence:

  1. The most specific (longest) matching rule wins
  2. If two rules are equal length, Allow wins over Disallow
  3. User-agent-specific rules take precedence over User-agent: * rules

Example:

User-agent: *
Disallow: /private/
Allow: /private/public/

A request to /private/public/page matches both rules. /private/public/ (length 16) is longer than /private/ (length 9), so Allow wins and the URL is accessible.
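
The precedence rule is small enough to express directly. A sketch for plain path prefixes (no wildcards), not a full parser:

def is_allowed(path, rules):
    # rules: list of (path_prefix, is_allow) pairs from a single user-agent group
    matches = [(prefix, allow) for prefix, allow in rules if path.startswith(prefix)]
    if not matches:
        return True  # nothing matches: crawling is allowed by default
    longest = max(len(prefix) for prefix, _ in matches)
    # The longest matching rule wins; on a length tie, any Allow outranks Disallow
    return any(allow for prefix, allow in matches if len(prefix) == longest)

rules = [("/private/", False), ("/private/public/", True)]
print(is_allowed("/private/public/page", rules))  # True  (the longer Allow rule wins)
print(is_allowed("/private/secret", rules))       # False (only Disallow: /private/ matches)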


Common user agent strings

Google

User agent            Crawls
Googlebot             Web search (applies to all Googlebot variants unless overridden)
Googlebot-Image       Google Images
Googlebot-Video       Google Video
Googlebot-News        Google News
Google-Extended       Google AI (Gemini training and products)
AdsBot-Google         Google Ads landing page quality
Mediapartners-Google  AdSense

Other search engines

User agent   Crawls
Bingbot      Bing web search
Slurp        Yahoo (uses Bing index)
DuckDuckBot  DuckDuckGo
Baiduspider  Baidu
YandexBot    Yandex

AI crawlers

User agent          Purpose
GPTBot              OpenAI training crawler
OAI-SearchBot       OpenAI ChatGPT Search indexing
ChatGPT-User        OpenAI live retrieval at query time
ClaudeBot           Anthropic training crawler
anthropic-ai        Anthropic (alternative user agent)
PerplexityBot       Perplexity indexing crawler
Perplexity-User     Perplexity live retrieval
cohere-ai           Cohere training crawler
CCBot               Common Crawl (used by many model providers)
meta-externalagent  Meta AI training crawler

Common configurations

Block all crawlers

User-agent: *
Disallow: /

Use during development. Remove before launch — this is the most common cause of sites not appearing in search after deployment.

Block all AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

Note: blocking AI training crawlers does not prevent AI search products (ChatGPT Search, Perplexity) from citing your content if they retrieve it via their search crawlers (OAI-SearchBot, Perplexity-User). Training and retrieval use different bots.

Block internal search results

User-agent: *
Disallow: /search
Disallow: /search/

Block admin and account areas

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /login
Disallow: /checkout/

Block parameterised URLs

User-agent: *
Disallow: /*?

Blocks every URL that contains a query string. Use with caution: any legitimate page that relies on query parameters will be blocked as well.

Block print and feed versions

User-agent: *
Disallow: /*/print/
Disallow: /*/feed/
Disallow: /*/amp/

Allow Googlebot, block all others

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

An empty Disallow after a specific user agent allows full access for that crawler. Useful for staging environments where you want Google to crawl but no other bots.
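
The behaviour can be verified with Python's urllib.robotparser, which also treats an empty Disallow as permitting everything (the URL is a placeholder):

import urllib.robotparser

lines = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Disallow:",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

print(rp.can_fetch("Googlebot", "https://example.com/page"))      # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))   # False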


What robots.txt does not do

It does not prevent indexing. A URL blocked in robots.txt can still appear in search results if it has inbound links. Google will index the URL without crawling it, showing the page title from anchor text and no description. To prevent indexing, use a noindex meta tag — but the page must be crawlable for Google to read it.

It does not secure content. robots.txt is public and explicitly advertises what you are hiding. Do not rely on it for sensitive content; use authentication.

It does not apply to subdomains. example.com/robots.txt has no authority over blog.example.com. Each subdomain needs its own file.

It does not apply to non-HTTP protocols. FTP, email, and other protocols are not covered.

It does not stop malicious bots. Only well-behaved crawlers follow robots.txt. Scrapers, spam bots, and vulnerability scanners ignore it.