Crawlability and robots.txt
Crawlability is the degree to which search engine bots can access the pages of your site. It is the foundation of indexing, and a site with broken crawlability cannot rank no matter how good its content is. robots.txt is the file that controls what bots can and can’t fetch.
How crawlers discover content
Search engine crawlers (Googlebot, Bingbot, etc.) discover content through:
- Internal links from already-known pages. The most common discovery path. Pages without internal links are functionally invisible.
- External links from other sites. Backlinks introduce new URLs to crawlers.
- XML sitemaps. Explicit lists of URLs you want indexed (a minimal example follows this list).
- Direct submission. Manual URL submission via Search Console (limited use; for individual urgent pages).
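For reference, a minimal XML sitemap is just a list of <loc> entries inside a <urlset>. The URLs and date below are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/pricing</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/crawlability</loc>
  </url>
</urlset>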
Crawlers fetch pages, parse HTML, follow links, and add new URLs to their queue. The process repeats indefinitely, with frequency determined by site authority, change rate, and crawl budget.
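Conceptually, the loop looks something like the toy Python sketch below: a breadth-first queue of URLs, each fetched and parsed for further links. Real crawlers add robots.txt checks, politeness delays, rendering, and prioritisation; the seed URL here is a placeholder.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collects href values from <a> tags as the HTML is parsed.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(seed, max_pages=50):
    queue = deque([seed])   # URLs waiting to be fetched
    seen = {seed}           # URLs already discovered
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue        # unreachable pages are skipped
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the same host and avoid re-queueing known URLs.
            if urlparse(absolute).netloc == urlparse(seed).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

discovered = crawl("https://example.com/")
print(len(discovered), "URLs discovered")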
What robots.txt does
robots.txt is a plain text file at the root of the domain (https://example.com/robots.txt) that tells well-behaved crawlers which URLs they can and cannot fetch. It uses simple syntax:
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /
User-agent: Googlebot
Disallow: /no-google/
Sitemap: https://example.com/sitemap.xml
User-agent specifies which crawler the rules apply to (* is all). Disallow blocks paths. Allow permits paths within disallowed directories. Sitemap declares one or more sitemap locations.
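To check how a file resolves for a given crawler and URL, Python's standard-library urllib.robotparser evaluates the same Allow/Disallow logic. A small sketch using the rules above (note that Python's parser applies rules in file order rather than Google's longest-match behaviour, so treat it as a rough check):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: Googlebot
Disallow: /no-google/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/admin/users"))         # False
print(parser.can_fetch("*", "https://example.com/blog/post"))           # True
print(parser.can_fetch("Googlebot", "https://example.com/no-google/"))  # False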
What robots.txt does NOT do
It does not prevent indexing. Pages disallowed in robots.txt can still appear in search results if Google discovers them through external links. The page won’t be crawled, but the URL might be indexed (with a warning that no description is available). To prevent indexing, use a noindex meta tag, which requires the page to be crawlable.
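For reference, the two standard ways to express noindex, both of which Google can only act on if it is allowed to fetch the page:

<meta name="robots" content="noindex">   (in the page's <head>)
X-Robots-Tag: noindex                    (as an HTTP response header)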
It does not enforce compliance. robots.txt is a request, not a rule. Well-behaved crawlers (Google, Bing, established AI crawlers) honour it. Malicious or aggressive scrapers ignore it.
It does not secure content. Disallowing a path in robots.txt makes its existence public. Anyone can read the file. For genuinely sensitive content, use authentication, not robots.txt.
Common crawlability problems
Accidentally blocking the entire site. A misplaced Disallow: / blocks all crawling. This is the most common (and most damaging) robots.txt mistake. Always verify the live robots.txt after deployment.
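The typical culprit is a staging or maintenance configuration shipped to production:

User-agent: *
Disallow: /

A quick check after every deploy, for example fetching https://example.com/robots.txt with curl or a browser and reading it, catches this immediately.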
Blocking CSS and JavaScript. Googlebot renders pages with their associated CSS and JavaScript to evaluate layout, mobile usability, and content. Blocking these resources damages how Google understands the page. Allow them.
Disallow vs noindex confusion. Disallowing a page in robots.txt while the page also has a noindex meta tag means Google can’t crawl the page to see the noindex directive. The page may end up indexed without a description, the worst of both worlds.
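Concretely, the broken combination looks like this (the path is a placeholder):

robots.txt:    Disallow: /old-campaign/
page <head>:   <meta name="robots" content="noindex">

The fix is to remove the Disallow rule so the page can be crawled and the noindex directive seen.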
Blocking pagination, faceted filters, or parameterised URLs incorrectly. Some parameter URLs should be canonicalised, some should be noindexed, and some should be allowed. Blanket blocking via robots.txt is rarely the right answer.
Orphan pages. Pages with no internal links are crawled rarely if at all. Sitemaps help, but internal links remain the strongest discovery signal.
Crawl budget
Crawl budget is the number of URLs Googlebot will crawl on your site within a given time period. For most sites under 10,000 pages, crawl budget is not a constraint. For very large sites (e-commerce with millions of SKUs, news publishers with deep archives), it becomes a real consideration.
The factors that influence crawl budget:
- Site authority. Higher-authority sites get crawled more.
- Site speed. Faster responses let Googlebot fetch more URLs in the same amount of time.
- Content freshness. Sites that update frequently get crawled more.
- Discoverable URL count. Larger sites need to share crawl budget across more URLs.
Wasted crawl budget (crawlers fetching low-value URLs at the expense of important ones) is the failure mode. Reduce it by:
- Using robots.txt to block genuinely worthless URLs (admin, search results pages, infinite faceted filter combinations); see the sketch after this list
- Returning proper status codes for non-existent pages (404 or 410, not soft 404s)
- Consolidating duplicate or near-duplicate content via canonicals
- Cleaning up sitemaps to include only canonical, indexable URLs
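A sketch of the robots.txt side of this, with placeholder paths and parameter names; each pattern needs checking against real URLs before going live:

User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?sessionid=

The * wildcard (and the $ end-anchor) is supported by Google and Bing, which is what makes parameter patterns like these practical.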
AI crawlers
The robots.txt rules apply to AI crawlers too, with each major model provider operating one or more user agents:
- Google-Extended. Google's control token for AI training use. It is honoured as its own user-agent in robots.txt, but the fetching is done by Google's regular crawlers rather than a separate bot.
- GPTBot. OpenAI’s training crawler.
- OAI-SearchBot. OpenAI’s search crawler (used by ChatGPT Search).
- ChatGPT-User. OpenAI’s live retrieval at query time.
- ClaudeBot. Anthropic’s training crawler.
- PerplexityBot. Perplexity’s indexing crawler.
- Perplexity-User. Perplexity’s live retrieval at query time.
- CCBot. Common Crawl, used by many model providers.
Allowing or blocking each is a publisher decision. For sites that want AI citation, allowing all of them is the standard recommendation. For sites where use of their content for AI training is a concern (paywalled publishers, original research producers), selective blocking is appropriate.
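As an illustration, a policy that stays open to AI search and live retrieval while opting out of training crawls might look like the following; whether this particular split is right is a per-publisher judgement:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /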
Auditing crawlability
The basic audit:
- Read your live robots.txt. Check it is what you expect. Verify nothing critical is disallowed; a scripted version of this check follows this list.
- Use the Google Search Console URL Inspection tool. Spot-check key URLs for crawl status, indexing status, and any blocking signals.
- Crawl your own site. Tools like Screaming Frog or Sitebulb crawl as Googlebot would and report blocked URLs, redirect chains, broken links, and orphan pages.
- Review the Crawl Stats report in Search Console. Track crawl rate, response times, and host status over time. Sudden changes warrant investigation.
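A scripted version of the first check, sketched in Python with the standard library: it loads the live robots.txt and flags any sitemap URL that is blocked for Googlebot. It assumes a single flat sitemap at /sitemap.xml rather than a sitemap index, and example.com is a placeholder.

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser
from xml.etree import ElementTree

SITE = "https://example.com"  # placeholder domain

# Load and parse the live robots.txt.
robots = RobotFileParser(SITE + "/robots.txt")
robots.read()

# Pull every <loc> out of the sitemap (assumes one flat sitemap, not an index).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
tree = ElementTree.parse(urlopen(SITE + "/sitemap.xml"))
urls = [loc.text for loc in tree.iter(NS + "loc")]

# Any URL listed for indexing but blocked from crawling is a contradiction worth fixing.
for url in urls:
    if not robots.can_fetch("Googlebot", url):
        print("In sitemap but disallowed:", url)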
Frequently asked questions
Is robots.txt case-sensitive?
Path matching is: Disallow: /Admin/ and Disallow: /admin/ block different URLs. The directive names themselves (User-agent, Disallow, Allow) are not case-sensitive.
Does robots.txt affect rankings?
Indirectly. By controlling what gets crawled and how crawl budget is spent, it influences indexing efficiency and the freshness of indexed content. The file itself is not a ranking signal.
Should I disallow the search results pages on my site?
Generally yes. Internal search results pages are typically low-value, near-infinite in URL combinations, and should not appear in Google's index. Disallow them in robots.txt, or leave them crawlable and noindex them on-page; doing both at once means Google never sees the noindex (see Disallow vs noindex confusion above).