Crawlability and robots.txt
Crawlability is the degree to which search engine bots can access the pages of your site. It is the foundation of indexing, and a site with broken crawlability cannot rank no matter how good its content is. robots.txt is the file that tells compliant bots which URLs they may and may not fetch.
How do crawlers discover content?
Search engine crawlers (Googlebot, Bingbot, etc.) discover content through:
- Internal links from already-known pages. The most common discovery path. Pages without internal links are functionally invisible.
- External links from other sites. Backlinks introduce new URLs to crawlers.
- XML sitemaps. Explicit lists of URLs you want indexed.
- Direct submission. Manual URL submission via Search Console (limited use; for individual urgent pages).
Crawlers fetch pages, parse HTML, follow links, and add new URLs to their queue. The process repeats indefinitely, with frequency determined by site authority, change rate, and crawl budget.
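The fetch, parse, follow, queue loop can be sketched as a breadth-first traversal. This is a minimal illustration, not how any real search engine schedules its crawl: the LINKS graph and the crawl function are hypothetical, and a dictionary lookup stands in for fetching and parsing a page.

```python
from collections import deque

# Hypothetical link graph: each page maps to the URLs it links to.
LINKS = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/post-1", "/"],
    "/products/": ["/products/widget"],
    "/blog/post-1": [],
    "/products/widget": ["/"],
    "/orphan": ["/"],  # no page links here
}


def crawl(seeds, links):
    """Breadth-first discovery: visit a URL, then queue every unseen link on it."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # stands in for "fetch the page and parse its HTML"
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


discovered = crawl(["/"], LINKS)
print(discovered)
print("/orphan" in discovered)  # False: unreachable through internal links
```

Seeding the queue with an extra URL, as in crawl(["/", "/orphan"], LINKS), plays the role of a sitemap entry or manual submission: it is the only way the orphan page enters the queue.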
What does robots.txt do?
robots.txt is a plain text file at the root of the domain (https://example.com/robots.txt) that tells well-behaved crawlers which URLs they can and cannot fetch. It uses simple syntax:
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /
User-agent: Googlebot
Disallow: /no-google/
Sitemap: https://example.com/sitemap.xml
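User-agent specifies which crawler the rules that follow apply to (* matches all crawlers). Disallow blocks paths. Allow permits paths, including paths inside a disallowed directory. Sitemap declares one or more sitemap locations. A crawler obeys only the group that most specifically matches its user agent, so in this example Googlebot follows the Googlebot group alone and is not bound by the /admin/ and /search/ rules in the * group.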
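One way to check how a file like this is interpreted is Python's standard-library urllib.robotparser. The sketch below parses the example rules from a string; "ExampleBot" is a made-up agent name used to exercise the * group. Note one assumption: the standard-library parser applies the first matching rule in file order, whereas Google documents a most-specific-match rule, so results can differ for files where Allow and Disallow rules overlap in a different order.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parsed rules whether this agent may fetch this path."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("ExampleBot", "/admin/users"))    # falls under the * group
print(allowed("ExampleBot", "/blog/"))
print(allowed("Googlebot", "/admin/users"))     # only the Googlebot group applies
print(allowed("Googlebot", "/no-google/page"))
print(parser.site_maps())                       # requires Python 3.8 or later
```

The third call is the one that surprises people: because Googlebot has its own group, the /admin/ rule in the * group does not apply to it.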
What does robots.txt NOT do?
It does not prevent indexing. Pages disallowed in robots.txt can still appear in search results if Google discovers them through external links. The page won’t be crawled, but the URL might be indexed (with a warning that no description is available). To prevent indexing, use a noindex meta tag, which requires the page to be crawlable.
It does not enforce compliance. robots.txt is a request, not a rule. Well-behaved crawlers (Google, Bing, established AI crawlers) honour it. Malicious or aggressive scrapers ignore it.
It does not secure content. Disallowing a path in robots.txt makes its existence public. Anyone can read the file. For genuinely sensitive content, use authentication, not robots.txt.
Common crawlability problems
Accidentally blocking the entire site. A misplaced Disallow: / blocks all crawling. This is the most common (and most damaging) robots.txt mistake. Always verify the live robots.txt after deployment.
Blocking CSS and JavaScript. Googlebot renders pages with their associated CSS and JavaScript to evaluate layout, mobile usability, and content. Blocking these resources damages how Google understands the page. Allow them.
Disallow vs noindex confusion. Disallowing a page in robots.txt while the page also has a noindex meta tag means Google can’t crawl the page to see the noindex directive. The page may end up indexed without a description, the worst of both worlds.
Blocking pagination, faceted filters, or parameterised URLs incorrectly. Some parameter URLs should be canonicalised, some should be noindexed, and some should be allowed. Blanket blocking via robots.txt is rarely the right answer.
Orphan pages. Pages with no internal links are crawled rarely if at all. Sitemaps help, but internal links remain the strongest discovery signal.
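The disallow-versus-noindex conflict described above can be detected mechanically. The following is a rough sketch using only the Python standard library; the helper names (RobotsMetaParser, has_noindex, noindex_is_hidden) are illustrative, it looks only at the robots meta tag (not an X-Robots-Tag header), and it assumes you already have the page HTML from your own crawl.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def has_noindex(html):
    """True if the page's robots meta tag includes noindex."""
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in meta.directives


def noindex_is_hidden(robots_txt, agent, url, html):
    """True when the page carries noindex but robots.txt stops the crawler seeing it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(html) and not rules.can_fetch(agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_is_hidden(robots, "Googlebot", "https://example.com/private/page", page))
```

Any URL this flags needs a decision: remove the Disallow so the noindex can be read, or accept that the URL may be indexed without a description.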
Crawl budget
Crawl budget is the number of URLs Googlebot will crawl on your site within a given time period. For most sites under 10,000 pages, crawl budget is not a constraint. For very large sites (e-commerce with millions of SKUs, news publishers with deep archives), it becomes a real consideration.
The factors that influence crawl budget:
- Site authority. Higher-authority sites get crawled more.
- Site speed. Faster, more stable servers let Googlebot fetch more URLs in the same period without overloading the host.
- Content freshness. Sites that update frequently get crawled more.
- Discoverable URL count. Larger sites need to share crawl budget across more URLs.
Wasted crawl budget (crawlers fetching low-value URLs at the expense of important ones) is the failure mode. Reduce it by:
- Using robots.txt to block genuinely worthless URLs (admin, search results pages, infinite faceted filter combinations)
- Returning proper status codes for non-existent pages. A soft 404 is a page that returns a 200 OK status but contains no real content: a “not found” message served as a live page, an empty search result, or a removed product page with nothing left on it. Google identifies these algorithmically and may deindex them; they consume crawl budget that should go to real pages. Return 404 or 410 for removed URLs, or restore content so the page earns its 200.
- Consolidating duplicate or near-duplicate content via canonicals
- Cleaning up sitemaps to include only canonical, indexable URLs
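The sitemap and status-code points above can be combined into a simple triage pass over crawl data. This is a sketch under stated assumptions: the Page record, the classify function, and the MIN_WORDS threshold are all hypothetical, and the word-count test is only a crude stand-in for soft 404 detection, which Google performs algorithmically by means it does not publish.

```python
from dataclasses import dataclass

MIN_WORDS = 50  # arbitrary illustrative threshold, not a documented Google value


@dataclass
class Page:
    url: str
    status: int       # HTTP status code returned for the URL
    canonical: str    # rel="canonical" target, assumed already extracted
    noindex: bool     # whether the page carries a noindex directive
    word_count: int   # words of main content, assumed already counted


def classify(page):
    """Return 'keep' for sitemap-worthy URLs, otherwise the reason to drop them."""
    if page.status in (404, 410):
        return "gone"
    if page.status != 200:
        return "non-200"
    if page.noindex:
        return "noindex"
    if page.canonical != page.url:
        return "non-canonical"
    if page.word_count < MIN_WORDS:
        return "possible soft 404"
    return "keep"


pages = [
    Page("https://example.com/a", 200, "https://example.com/a", False, 800),
    Page("https://example.com/a?ref=x", 200, "https://example.com/a", False, 800),
    Page("https://example.com/old", 410, "https://example.com/old", False, 0),
    Page("https://example.com/empty", 200, "https://example.com/empty", False, 12),
]
sitemap_urls = [p.url for p in pages if classify(p) == "keep"]
print(sitemap_urls)
```

Pages labelled "possible soft 404" are candidates for manual review: either restore real content or return 404 or 410.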
AI crawlers
AI systems operate two distinct types of crawler, and robots.txt can address each separately.
Training crawlers fetch content to build or update model training datasets. Blocking them prevents your content being used to train LLMs, but has no effect on whether AI search products cite your site.
Search and retrieval crawlers fetch content to power AI search products: either by indexing it (ChatGPT Search, Perplexity) or by fetching live context at query time. Blocking them removes your site from those AI search channels.
The two decisions are independent. You can block training crawlers while allowing search retrieval crawlers, allow all of them, or block all of them.
| User agent | Provider | Purpose |
|---|---|---|
| Google-Extended | Google | Training |
| GPTBot | OpenAI | Training |
| OAI-SearchBot | OpenAI | Search indexing (ChatGPT Search) |
| ChatGPT-User | OpenAI | Live retrieval |
| ClaudeBot | Anthropic | Training |
| Claude-SearchBot | Anthropic | Search indexing (Claude) |
| Claude-User | Anthropic | Live retrieval |
| PerplexityBot | Perplexity | Search indexing |
| Perplexity-User | Perplexity | Live retrieval |
| CCBot | Common Crawl | Training dataset (feeds multiple providers) |
Decision framework:
- You want AI citation: allow the search and retrieval crawlers (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User). Training crawlers are irrelevant to citation. Note that ChatGPT-User removed its robots.txt commitment in December 2025; Perplexity-User explicitly ignores robots.txt by design. Blocking either reliably requires IP-level or WAF rules rather than robots.txt alone.
- You want to limit AI training on your content: block GPTBot, ClaudeBot, Google-Extended, CCBot. This does not affect AI search visibility.
- You are a paywalled publisher: blocking all AI crawlers is a reasonable position. Training crawlers use your content without payment; retrieval crawlers may surface excerpts without sending traffic.
- You have no specific concern: the default (no explicit rules for these agents) allows everything. Most sites are in this position.
A robots.txt configuration that blocks training crawlers while keeping AI search open:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Leaving OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, and Perplexity-User with no explicit rule allows them. ChatGPT-User and Perplexity-User may not honour robots.txt in any case, so for those two agents the file has little practical effect either way.
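A configuration like this can be sanity-checked before deployment. The sketch below feeds the rules to Python's standard-library urllib.robotparser and reports what each agent in the table would be told; the article URL is a placeholder. It verifies only what the file says, not whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
SEARCH = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site behaves the same under "Disallow: /".
status = {
    agent: parser.can_fetch(agent, "https://example.com/article")
    for agent in TRAINING + SEARCH
}

for agent, ok in status.items():
    print(f"{agent:18} {'allowed' if ok else 'blocked'}")
```

If a search or retrieval agent shows as blocked here, the usual cause is a broader rule elsewhere in the file, such as a * group with Disallow: /.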
AI crawlers represent a measurable and growing share of bot traffic. GPTBot requests grew by more than 300% between May 2024 and May 2025 [1]. As with Googlebot, compliance is voluntary: well-behaved providers honour robots.txt; malicious scrapers do not.
robots.txt vs. the Search Console AI blocking toggle
From June 2026, Google is testing a toggle in Search Console that lets site owners opt their site out of appearing in AI Overviews, AI Mode, and AI Overviews in Discover. This is a different mechanism from robots.txt: blocking Google-Extended stops the training crawler but does not remove your site from AI-generated answers. The Search Console toggle controls appearance in AI search features directly, without affecting crawling or traditional rankings. The toggle is currently limited to a subset of UK website owners. See AI Overviews for detail.
Auditing crawlability
The basic audit:
- Read your live robots.txt. Check it is what you expect. Verify nothing critical is disallowed.
- Use the Google Search Console URL Inspection tool. Spot-check key URLs for crawl status, indexing status, and any blocking signals.
- Crawl your own site. Tools like Screaming Frog or Sitebulb crawl as Googlebot would and report blocked URLs, redirect chains, broken links, and orphan pages.
- Review the Crawl Stats report in Search Console. Track crawl rate, response times, and host status over time. Sudden changes warrant investigation.
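The first audit step lends itself to automation. Below is a minimal sketch, assuming the robots.txt text has already been retrieved; audit_robots, the example.com origin, and the list of critical paths are hypothetical. It uses Python's urllib.robotparser, whose first-match rule order can differ from Google's most-specific-match behaviour on files with overlapping Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, critical_paths, agent="Googlebot",
                 origin="https://example.com"):
    """Report whether the homepage is blocked and which critical paths are disallowed."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    blocked = [p for p in critical_paths if not rules.can_fetch(agent, origin + p)]
    return {
        "site_blocked": not rules.can_fetch(agent, origin + "/"),
        "blocked_critical": blocked,
    }


critical = ["/", "/products/", "/blog/"]

# A leftover staging rule: everything is disallowed.
print(audit_robots("User-agent: *\nDisallow: /\n", critical))

# A sensible production file: only /admin/ is disallowed.
print(audit_robots("User-agent: *\nDisallow: /admin/\n", critical))
```

To run this against a live site, RobotFileParser also offers set_url() and read() to fetch the file directly; running the check as a post-deployment step catches an accidental Disallow: / before crawlers do.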
For a full crawlability checklist, see the Technical SEO Audit Checklist.
Frequently asked questions
Is robots.txt case-sensitive?
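Yes, for paths. Disallow: /Admin/ and Disallow: /admin/ block different URLs. Directive names such as Disallow and user-agent tokens such as Googlebot are matched case-insensitively, and the file itself must be named robots.txt in lowercase.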
Does robots.txt affect rankings?
Indirectly. By controlling what gets crawled and how crawl budget is spent, it influences indexing efficiency and the freshness of indexed content. The file itself is not a ranking signal.
Should I disallow the search results pages on my site?
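Generally yes. Internal search results pages are typically low-value and near-infinite in URL combinations, and they should not appear in Google’s index. Disallow them in robots.txt. Do not rely on an on-page noindex as a backup for disallowed URLs, because a crawler that cannot fetch the page never sees the tag. If search pages are already indexed, leave them crawlable with noindex until they drop out, then add the Disallow.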