Meta Robots Tags and Crawl Directives

The meta robots tag and x-robots-tag HTTP header give you per-page control over how search engines handle individual URLs: whether to index them, follow their links, display a text snippet, or show a cached copy. They work at the page level; robots.txt, by contrast, works at the crawl-access level, which is what makes the two distinct.

The meta robots tag

The meta robots tag sits in the <head> of an HTML document:

<meta name="robots" content="noindex, nofollow">

The name attribute can target a specific crawler (googlebot, bingbot) or all robots (robots). The content attribute is a comma-separated list of directives.

Multiple directives combine: content="noindex, noarchive" tells Google not to index the page and not to show a cached version.
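
For example, a page can carry a stricter rule for Google alone while other crawlers fall back to the defaults:

<meta name="googlebot" content="noindex">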

Available directives

Index and crawl control

Directive   Effect
index       Default. Google may index this page.
noindex     Do not include this page in search results.
follow      Default. Follow links on this page.
nofollow    Do not follow links on this page (does not pass PageRank).
none        Equivalent to noindex, nofollow.
all         Equivalent to index, follow. Rarely needed as it is the default.

Snippet and display control

Directive                     Effect
nosnippet                     Do not show a text snippet or video preview in results.
max-snippet: [n]              Allow a snippet of up to n characters.
noarchive                     Do not show a “Cached” link in results.
noimageindex                  Do not index images on this page.
max-image-preview: [setting]  Control image preview size: none, standard, or large.
max-video-preview: [n]        Limit video preview to n seconds.
notranslate                   Do not offer a translation of this page in results.
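
The display-control directives combine in a single tag just like the index directives. An illustrative combination (the character and second limits are arbitrary):

<meta name="robots" content="max-snippet:120, max-image-preview:standard, max-video-preview:10">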

The x-robots-tag HTTP header

The x-robots-tag header delivers the same directives as the meta robots tag, but via an HTTP response header rather than HTML. This makes it the only option for file types without an HTML <head>, such as PDFs, images, and other binary files.

Example server configuration (Apache):

Header set X-Robots-Tag "noindex, noarchive"
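
In practice the header is usually scoped to a file type rather than set globally. For instance, to apply it to every PDF on an Apache server (requires mod_headers; the pattern is illustrative):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>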

The x-robots-tag supports all the same directives as the meta robots tag and can also target specific crawlers:

X-Robots-Tag: googlebot: noindex
X-Robots-Tag: bingbot: noindex, nofollow

For HTML pages, either approach works. The HTTP header has the advantage of requiring no change to the document itself, so it can be set programmatically for large groups of URLs.
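
As a sketch of that programmatic control, an nginx location block could attach the header to an entire URL group at once (the path is hypothetical):

location /internal-search/ {
    # every response under this prefix carries the directive
    add_header X-Robots-Tag "noindex" always;
}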

How this differs from robots.txt

Robots.txt and meta robots are frequently confused because both appear to “hide” pages from search engines. They do different things at different stages of the crawl-index pipeline.

Robots.txt controls whether Googlebot requests a URL at all. A Disallow rule tells the crawler not to visit the URL. It does not prevent indexing: Google can index a disallowed URL if it discovers it via links, though it will have no content to display in the snippet.

Meta robots controls what Google does with a page once it has crawled and read it. Noindex, nosnippet, and the other directives only take effect after Googlebot has successfully downloaded and parsed the page. If Googlebot cannot access the page (because robots.txt blocks it), it cannot read any meta robots instructions.

This creates a practical problem: adding noindex to a page blocked by robots.txt achieves nothing. Googlebot never reads the noindex because it cannot visit the URL.

When to use robots.txt: To reduce crawl load on URLs that do not need to be crawled (URL parameters, internal search results, admin paths). Not as the primary mechanism for excluding pages from search results.
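
A typical robots.txt for this purpose might look like the following (paths illustrative):

User-agent: *
Disallow: /search
Disallow: /admin/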

When to use noindex: To exclude specific pages from search results while keeping them crawlable. Thank-you pages, gated content, duplicate versions of content, and staging pages are common candidates.
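
For example, a thank-you page that should stay crawlable but out of results would carry:

<meta name="robots" content="noindex, follow">

Spelling out follow is optional, since it is the default, but it makes the intent explicit: exclude the page, still pass its links.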

Common mistakes

Noindex on a disallowed URL. Googlebot cannot read the noindex if it cannot crawl the page. If you want a page excluded from results, allow crawling and use noindex. If you want to block crawling for resource reasons, use robots.txt and accept that the URL may still be indexed as a stub.
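
To make the failure mode concrete: with the robots.txt rule below in place, a noindex on the page is never read, because Googlebot never fetches it (the path is hypothetical):

# robots.txt — blocks the crawl, so the page's noindex is invisible
User-agent: *
Disallow: /thank-you

<!-- /thank-you: never seen by Googlebot while the Disallow stands -->
<meta name="robots" content="noindex">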

Conflicting directives on the same page. A noindex in the meta robots tag combined with the same URL listed in an XML sitemap sends conflicting signals. Google generally honours the noindex, but including noindexed URLs in the sitemap wastes crawl resources and produces confusing errors in Google Search Console.

Using noindex for privacy. Noindex is not a security measure. It prevents Google from surfacing the page in results, but the URL remains accessible to anyone who knows it. Use authentication or server-level access control for content that should not be publicly accessible.

Forgetting noindex on staging environments. Development and staging sites should have a site-wide noindex directive (often set at the CMS level) or be blocked from crawling via robots.txt. Without this, staging content can be indexed and create duplicate content problems in production.
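
One way to enforce this at the web-server level is a single header on the staging host (nginx shown; the always flag adds the header to error responses too):

# staging server block only — never deploy this to production
add_header X-Robots-Tag "noindex, nofollow" always;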