robots.txt Reference

robots.txt is a plain text file placed at the root of a domain (https://example.com/robots.txt) that tells compliant crawlers which URLs they may and may not fetch. It is part of the Robots Exclusion Protocol, a convention rather than an enforced standard — well-behaved crawlers follow it, malicious ones do not.

For the SEO implications of crawl control, see Crawlability and robots.txt.


File location and format

The file must be at the root of the domain. It applies only to the domain it is hosted on — a robots.txt at example.com does not apply to subdomain.example.com, and vice versa.
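
The governing file for any page can be derived from that page's scheme and host. A minimal Python sketch (the URLs are placeholders):

from urllib.parse import urlsplit

def robots_txt_url(page_url):
    # robots.txt always sits at the root of the scheme and host the page is served from
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

print(robots_txt_url("https://example.com/blog/post"))   # https://example.com/robots.txt
print(robots_txt_url("https://blog.example.com/post"))   # https://blog.example.com/robots.txt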

Requirements:

  • Plain text, UTF-8 encoded
  • One directive per line
  • Lines beginning with # are comments and are ignored by crawlers
  • Blank lines separate rule groups

Directives

User-agent

Specifies which crawler the following rules apply to. Must appear at the start of each rule group, before any Disallow or Allow directives.

User-agent: Googlebot
User-agent: *
  • * matches all crawlers not covered by a specific rule group
  • Multiple User-agent lines can precede a shared set of rules
  • Rules are applied per user agent; a crawler only follows rules matching its own user agent string or *
  • If both a specific rule and * exist, the specific rule takes precedence for that crawler

Disallow

Blocks the specified path. The crawler will not fetch any URL beginning with this path.

Disallow: /admin/
Disallow: /search
Disallow: /
  • An empty Disallow: value means “allow everything” (equivalent to no restriction)
  • Disallow: / blocks the entire site
  • Paths are case-sensitive: Disallow: /Admin/ and Disallow: /admin/ are different rules
  • The rule matches the beginning of the URL path — Disallow: /blog matches /blog, /blog/, and /blog-post

Allow

Explicitly permits a path within a disallowed directory. Used to create exceptions.

Disallow: /private/
Allow: /private/public-page/
  • Only meaningful within a Disallow context; Allow: / with no Disallow does nothing
  • When both Allow and Disallow rules match a URL, the more specific (longer) rule wins
  • If rules are the same length, Allow wins

Sitemap

Declares the location of an XML sitemap. Not a crawl rule — it is a hint to crawlers about where to find URLs.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
  • Use absolute URLs, not relative paths
  • Multiple Sitemap directives are allowed
  • Can appear anywhere in the file, but convention places them at the end
  • Googlebot also discovers sitemaps submitted via Search Console; this directive covers other crawlers

Crawl-delay

Specifies a delay (in seconds) between successive requests from the crawler.

Crawl-delay: 2
  • Googlebot does not honour Crawl-delay. To control Googlebot’s crawl rate, use the crawl rate settings in Google Search Console.
  • Bing, Yandex, and many other crawlers do honour it
  • Useful on low-resource servers to prevent crawlers from overwhelming the host
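
For crawlers written in Python, the standard library's urllib.robotparser exposes the directive through crawl_delay(). A minimal sketch of a polite fetch loop (the site, URL list, and MyCrawler agent name are placeholders, and the actual fetch is omitted):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

delay = rp.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay applies to this agent

for url in ["https://example.com/a", "https://example.com/b"]:
    if rp.can_fetch("MyCrawler", url):
        ...  # fetch the URL here
        time.sleep(delay)  # wait between successive requests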

Wildcards

* (asterisk)

Matches any sequence of characters, including none.

Disallow: /*.pdf$
Disallow: /search?*
Disallow: /*/print/
  • Supported by Googlebot; behaviour varies across other crawlers
  • Disallow: /search?* blocks every URL beginning with /search? (that is, /search followed by any query string); the trailing * is redundant because rules already match by prefix
  • Disallow: /*/print/ blocks any path with /print/ as a segment

$ (end of string)

Anchors the pattern to the end of the URL.

Disallow: /*.pdf$
  • Matches only URLs ending with the specified pattern
  • Disallow: /*.pdf$ blocks /document.pdf but not /pdf-guide/
  • Supported by Googlebot; behaviour varies across other crawlers
  • Without $, Disallow: /*.pdf also matches URLs such as /file.pdf/preview, where .pdf is not at the end
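
A rough way to see what these patterns match is to translate them into regular expressions, with each * becoming .* and a trailing $ becoming an end anchor. This is a simplified sketch, not how any particular crawler implements matching:

import re

def pattern_to_regex(pattern):
    # Convert a robots.txt path pattern with * and $ into a prefix-matching regex (simplified)
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

print(bool(pattern_to_regex("/*.pdf$").match("/document.pdf")))           # True
print(bool(pattern_to_regex("/*.pdf$").match("/pdf-guide/")))             # False
print(bool(pattern_to_regex("/*/print/").match("/articles/print/page")))  # True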

Rule precedence

When multiple rules match a URL, crawlers use the following precedence:

  1. The most specific (longest) matching rule wins
  2. If two rules are equal length, Allow wins over Disallow
  3. User-agent-specific rules take precedence over User-agent: * rules

Example:

User-agent: *
Disallow: /private/
Allow: /private/public/

A request to /private/public/page matches both rules. /private/public/ (length 16) is longer than /private/ (length 9), so Allow wins and the URL is accessible.
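
The precedence rule is small enough to express directly. A sketch for plain path prefixes (no wildcards), not a full parser:

def is_allowed(path, rules):
    # rules: list of (path_prefix, is_allow) pairs from a single user-agent group
    matches = [(prefix, allow) for prefix, allow in rules if path.startswith(prefix)]
    if not matches:
        return True  # nothing matches: crawling is allowed by default
    longest = max(len(prefix) for prefix, _ in matches)
    # The longest matching rule wins; on a length tie, any Allow outranks Disallow
    return any(allow for prefix, allow in matches if len(prefix) == longest)

rules = [("/private/", False), ("/private/public/", True)]
print(is_allowed("/private/public/page", rules))  # True  (the longer Allow rule wins)
print(is_allowed("/private/secret", rules))       # False (only Disallow: /private/ matches)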


Common user agent strings

Google

User agent            Crawls
Googlebot             Web search (applies to all Googlebot variants unless overridden)
Googlebot-Image       Google Images
Googlebot-Video       Google Video
Googlebot-News        Google News
Google-Extended       Google AI (Gemini training and products)
AdsBot-Google         Google Ads landing page quality
Mediapartners-Google  AdSense

Other search engines

User agent   Crawls
Bingbot      Bing web search
Slurp        Yahoo (uses Bing index)
DuckDuckBot  DuckDuckGo
Baiduspider  Baidu
YandexBot    Yandex

AI crawlers

User agent          Purpose
GPTBot              OpenAI training crawler
OAI-SearchBot       OpenAI ChatGPT Search indexing
ChatGPT-User        OpenAI live retrieval at query time
ClaudeBot           Anthropic training crawler
anthropic-ai        Anthropic (alternative user agent)
PerplexityBot       Perplexity indexing crawler
Perplexity-User     Perplexity live retrieval
cohere-ai           Cohere training crawler
CCBot               Common Crawl (used by many model providers)
meta-externalagent  Meta AI training crawler

Common configurations

Block all crawlers

User-agent: *
Disallow: /

Use during development. Remove before launch — this is the most common cause of sites not appearing in search after deployment.

Block all AI training crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

Note: blocking AI training crawlers does not prevent AI search products (ChatGPT Search, Perplexity) from citing your content if they retrieve it via their search crawlers (OAI-SearchBot, Perplexity-User). Training and retrieval use different bots.

Block internal search results

User-agent: *
Disallow: /search
Disallow: /search/

Block admin and account areas

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /login
Disallow: /checkout/

Block parameterised URLs

User-agent: *
Disallow: /*?

Blocks every URL that contains a query string. Use with caution: any legitimate page that relies on query parameters will be blocked as well.

Block print and feed versions

User-agent: *
Disallow: /*/print/
Disallow: /*/feed/
Disallow: /*/amp/

Allow Googlebot, block all others

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

An empty Disallow after a specific user agent allows full access for that crawler. Useful for staging environments where you want Google to crawl but no other bots.
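
The behaviour can be verified with Python's urllib.robotparser, which also treats an empty Disallow as permitting everything (the URL is a placeholder):

import urllib.robotparser

lines = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Disallow:",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

print(rp.can_fetch("Googlebot", "https://example.com/page"))      # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))   # False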


What robots.txt does not do

It does not prevent indexing. A URL blocked in robots.txt can still appear in search results if it has inbound links. Google will index the URL without crawling it, showing the page title from anchor text and no description. To prevent indexing, use a noindex meta tag — but the page must be crawlable for Google to read it.

It does not secure content. robots.txt is public and explicitly advertises what you are hiding. Do not rely on it for sensitive content; use authentication.

It does not apply to subdomains. example.com/robots.txt has no authority over blog.example.com. Each subdomain needs its own file.

It does not apply to non-HTTP protocols. FTP, email, and other protocols are not covered.

It does not stop malicious bots. Only well-behaved crawlers follow robots.txt. Scrapers, spam bots, and vulnerability scanners ignore it.