Semantic HTML
Semantic HTML is the use of HTML elements that describe the meaning of content, not just how it should look. A <div> is a generic container with no inherent meaning. An <article> tells the browser, the crawler, and the AI retrieval system that this region is a standalone piece of content. An <h2> declares the start of a new subtopic. The elements themselves carry signal.
This is distinct from structured data and schema markup. Those add an explicit vocabulary layer on top of content. Semantic HTML works through the document structure itself.
How does semantic HTML differ from schema markup?
Both serve machine interpretation, but through different mechanisms.
Schema markup is a separate layer of explicit labelling: a JSON-LD block in a <script> tag that declares “this page is an Article, authored by this Person, published on this date.” It lives outside the visible content and is processed by systems that read it as structured metadata.
Semantic HTML is built into the document structure itself. There is no separate block to add. You choose <article> instead of <div>, <h2> instead of a styled paragraph, <nav> instead of a generic container. The elements communicate what each part of the page is for, through the HTML that the content already lives in.
The practical difference for machine interpretation: schema labels entities and their properties; semantic HTML defines the document boundaries and hierarchy that machines use to parse and chunk the page. Neither replaces the other. A well-structured page with correct semantic HTML and complete schema markup gives machines two distinct, complementary signals.
Why does semantic HTML matter for traditional SEO?
Three reasons, none of them a direct ranking factor.
Crawlability. Googlebot parses HTML to extract content. Clean semantic structure means the crawler reliably identifies what is navigation, what is primary content, what is supplementary. Div-soup with no landmark elements forces the crawler to infer these boundaries, though it usually gets this right.
Accessibility. Screen readers rely on semantic elements to navigate: jumping between <h2> headings, skipping to <main>, identifying <nav>. Accessibility and search crawlability use the same underlying signal. A page that passes accessibility review tends to be a page that crawlers parse cleanly.
Rendering. Semantic elements carry browser defaults that ensure content remains readable even when CSS fails to load. This is a minor resilience signal, not a ranking one.
Why does semantic HTML matter for AI retrieval?
This is where semantic HTML has become more relevant.
AI retrieval systems (the pipelines behind Google AI Overviews, Bing/Copilot, Perplexity) extract content by chunking pages into passages. The most common chunking strategy is heading-based: the content under each heading becomes its own retrievable unit. When a user query matches a specific subtopic, the system can retrieve that subtopic’s passage directly, rather than working with the whole page.
This means a heading boundary is, functionally, a retrieval boundary. The content under your <h2> is a candidate to be cited as a standalone answer. If that section’s heading is vague, skipped (jumping from <h2> to <h4>), or absent, the chunking model either merges content that should be separate or cannot determine the correct relationship between sections.
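To make the idea of a heading as a retrieval boundary concrete, here is a minimal sketch of heading-based chunking using Python's standard `html.parser`. It is an illustration only, not how any particular retrieval system works: the class name `HeadingChunker`, the function `chunk_by_heading`, and the sample markup are all assumptions made for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChunker(HTMLParser):
    """Split a document into chunks, starting a new chunk at every heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # one dict per heading: level, heading, text
        self._current = None      # the chunk currently being filled
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A heading closes the previous chunk and opens a new one.
            self._current = {"level": int(tag[1]), "heading": "", "text": ""}
            self.chunks.append(self._current)
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._current is None:
            return  # text before the first heading is ignored in this sketch
        key = "heading" if self._in_heading else "text"
        self._current[key] += data


def chunk_by_heading(html):
    """Return the chunks with whitespace normalised."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return [
        {
            "level": c["level"],
            "heading": " ".join(c["heading"].split()),
            "text": " ".join(c["text"].split()),
        }
        for c in parser.chunks
    ]


# Hypothetical sample page, invented for this example.
SAMPLE = """
<article>
  <h1>Semantic HTML</h1>
  <p>Elements that describe meaning.</p>
  <h2>Why it matters for SEO</h2>
  <p>Crawlers identify primary content.</p>
  <h2>Why it matters for AI retrieval</h2>
  <p>Headings act as retrieval boundaries.</p>
</article>
"""

chunks = chunk_by_heading(SAMPLE)
for c in chunks:
    print(c["level"], c["heading"], "->", c["text"])
```

Each heading yields one chunk. If the second <h2> were a styled paragraph instead, its text would be merged into the previous chunk, which is the failure mode described above.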
Landmark elements (<article>, <main>, <section>, <nav>) help AI systems distinguish primary content from navigation, sidebars, and boilerplate. Systems that process pages for retrieval use these signals to identify which regions are worth extracting from. A page with clear landmark structure is easier to extract from than a page where primary content and navigation both live in generic <div> containers.
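As a rough sketch of how landmark elements make primary content easy to isolate, the following Python snippet keeps only the text inside <main> and drops nested <nav>, <aside>, and <footer> regions. The choice of which regions count as boilerplate, the names `MainTextExtractor` and `extract_main_text`, and the sample page are assumptions for this example; real extraction pipelines are considerably more involved.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these regions are treated as boilerplate.
SKIP_TAGS = {"nav", "aside", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text inside <main>, skipping boilerplate regions nested in it."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._main_depth = 0   # greater than 0 while inside <main>
        self._skip_depth = 0   # greater than 0 while inside a skipped region

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self._main_depth += 1
        elif tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main" and self._main_depth:
            self._main_depth -= 1
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._main_depth and not self._skip_depth:
            self.parts.append(data)


def extract_main_text(html):
    """Return the whitespace-normalised text of the primary content region."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical sample page, invented for this example.
PAGE = """
<body>
  <nav><a href="/">Home</a> <a href="/glossary">Glossary</a></nav>
  <main>
    <article>
      <h1>Semantic HTML</h1>
      <p>Elements carry signal.</p>
      <aside>Related: schema markup</aside>
    </article>
  </main>
  <footer>Contact us</footer>
</body>
"""

main_text = extract_main_text(PAGE)
print(main_text)
```

With the same content in generic <div> containers, this rule has nothing to key on and would return an empty string, so an extractor has to fall back on heuristics.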
One reason the semantic HTML signal is cleaner than the JSON-LD signal for AI systems: semantic elements are part of the body markup, whereas JSON-LD lives inside a <script> tag. Preprocessing pipelines that strip scripts (as FineWeb-style training pipelines do) discard the JSON-LD block but leave the semantic structure intact. The structure survives.
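The effect of script stripping can be sketched in a few lines of Python. This is a simplified illustration, not a reproduction of any real preprocessing pipeline: the class `ScriptFreeOutline` and the sample page are assumptions for this example. It records which tags and text remain once everything inside <script> is dropped.

```python
from html.parser import HTMLParser


class ScriptFreeOutline(HTMLParser):
    """Record the tags and text that remain once <script> content is dropped."""

    def __init__(self):
        super().__init__()
        self.tags = []      # opening tags seen outside <script>
        self.text = []      # non-empty text seen outside <script>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        else:
            self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script and data.strip():
            self.text.append(data.strip())


# Hypothetical sample page, invented for this example.
PAGE = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Semantic HTML"}
</script>
<main>
  <article>
    <h1>Semantic HTML</h1>
    <h2>Key elements</h2>
    <p>Elements carry signal.</p>
  </article>
</main>
"""

outline = ScriptFreeOutline()
outline.feed(PAGE)
outline.close()

print(outline.tags)
print(outline.text)
```

The JSON-LD declaration disappears entirely, while the <main>, <article>, and heading elements are still available to whatever step comes next.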
What are the key semantic HTML elements?
| Element | What it communicates |
|---|---|
| <h1> | The primary topic of the page. One per page. |
| <h2> – <h6> | Subtopic hierarchy. H2 = major subtopic; H3 = section within an H2. Do not skip levels. |
| <article> | A standalone, self-contained piece of content: a blog post, a news item, a product description. |
| <section> | A thematic grouping within a larger piece. Should have a heading. |
| <main> | The primary content region of the page. One per page. Excludes navigation, header, footer. |
| <nav> | Navigation: primary, secondary, breadcrumb. Helps systems identify what is not primary content. |
| <aside> | Supplementary content tangentially related to the main content: sidebars, callouts, related links. |
| <header> / <footer> | Page or section boundary markers. |
| <figure> / <figcaption> | Self-contained media with a caption. Tells parsers that the image and its description belong together. |
Common mistakes
Skipping heading levels. Going from <h2> to <h4> breaks the hierarchy. Chunking models use heading level to determine the relationship between sections: skipping a level signals a structural relationship that does not exist in the content.
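A skipped level is easy to detect mechanically. The sketch below, using Python's standard `html.parser`, lists every place where a heading is more than one level deeper than the heading before it. The names `HeadingLevels` and `find_skipped_levels` and the two sample strings are assumptions for this example; treating a first heading other than <h1> as a skip is also a choice made here, not a universal rule.

```python
from html.parser import HTMLParser


class HeadingLevels(HTMLParser):
    """Collect heading levels (1 to 6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as <hr> or <header> are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def find_skipped_levels(html):
    """Return (previous_level, level) pairs where the hierarchy skips a level."""
    parser = HeadingLevels()
    parser.feed(html)
    parser.close()
    skips = []
    previous = 0  # so a first heading deeper than <h1> is flagged too
    for level in parser.levels:
        if level > previous + 1:
            skips.append((previous, level))
        previous = level
    return skips


# Hypothetical samples, invented for this example.
GOOD = "<h1>A</h1><h2>B</h2><h3>C</h3><h2>D</h2>"
BAD = "<h1>A</h1><h2>B</h2><h4>C</h4>"

good_skips = find_skipped_levels(GOOD)
bad_skips = find_skipped_levels(BAD)
print(good_skips)
print(bad_skips)
```

Moving back up the hierarchy (from <h3> to <h2>) is not flagged, since closing a subsection is normal; only downward jumps of more than one level are reported.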
<div> for everything. A page built entirely from <div> containers gives machines nothing to anchor to. No content boundaries, no landmark regions, no topic hierarchy: just undifferentiated markup.
No landmark elements. Without <main> and <nav>, systems parsing for primary content have to infer what is navigation and what is the page’s substance. This is usually inferable, but the inference can fail on complex layouts.
<strong> for structure, not emphasis. <strong> marks important text within a flow, not headings or section titles. Using it in place of heading elements removes the hierarchical signal.
<table> for layout. Tabular elements imply data relationships between rows and columns. Using them for visual layout confuses parsers that read tables as structured data.
Frequently asked questions
Does semantic HTML improve rankings directly?
No. Semantic HTML is not a direct ranking signal. The indirect effects are real: cleaner crawlability, better accessibility scores, and (for AI retrieval) better passage extraction. But you will not see a rankings lift from adding <article> tags to a page that is otherwise unchanged.
Is semantic HTML the same as structured data?
No. Structured data is the concept of organising content for machine interpretation. Semantic HTML is one way to achieve that through native HTML elements. Schema markup (JSON-LD) is another way: it adds an explicit vocabulary layer. They are complementary, not the same thing.
Does it help with Google AI Overviews specifically?
The mechanism is indirect but real. AI Overviews extract passages from pages. Heading-based chunking defines which passage gets retrieved for which query. Clear heading hierarchy means each subtopic is a clean, independently citable unit. Vague or missing headings collapse distinct subtopics into a single undifferentiated passage, reducing the chance that a specific question maps to your content.
How do I audit semantic HTML on my site?
Open any page and inspect the heading hierarchy: does it run H1 → H2 → H3 without skips? Check for <main>, <nav>, and <article> or <section> elements on content pages. Accessibility tools (Axe, Lighthouse’s accessibility audit) flag most semantic HTML violations alongside WCAG issues.
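For a quick scripted pass over the landmark part of this audit, a small Python check can count the relevant elements and report what is missing. This is a minimal sketch and no substitute for Axe or Lighthouse: the names `TagCounter` and `audit_landmarks`, the exact rules (exactly one <h1>, exactly one <main>, at least one <nav>, at least one <article> or <section>), and the sample pages are assumptions for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each element appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def audit_landmarks(html):
    """Return a list of human-readable issues; an empty list means no findings."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    c = parser.counts
    issues = []
    if c["h1"] != 1:
        issues.append(f"expected exactly one <h1>, found {c['h1']}")
    if c["main"] != 1:
        issues.append(f"expected exactly one <main>, found {c['main']}")
    if c["nav"] == 0:
        issues.append("no <nav> element")
    if c["article"] + c["section"] == 0:
        issues.append("no <article> or <section> element")
    return issues


# Hypothetical samples, invented for this example.
SEMANTIC_PAGE = """
<nav><a href="/">Home</a></nav>
<main><article><h1>Semantic HTML</h1><p>Body text.</p></article></main>
"""
DIV_SOUP = """
<div class="menu"><a href="/">Home</a></div>
<div class="content"><div class="title">Semantic HTML</div></div>
"""

semantic_issues = audit_landmarks(SEMANTIC_PAGE)
soup_issues = audit_landmarks(DIV_SOUP)
print(semantic_issues)
for issue in soup_issues:
    print(issue)
```

A check like this only confirms that the elements exist. Whether they wrap the right content still needs a manual look or a dedicated accessibility tool.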