Structured Data and Schema Markup

Structured data and schema markup are machine-readable metadata embedded in a webpage that explicitly declare the meaning of its content. They use the schema.org vocabulary, maintained jointly by Google, Microsoft, Yahoo, and Yandex, and are most commonly delivered as JSON-LD. Implemented correctly, schema markup drives rich results, improves AI parsing, and reinforces entity and topic associations.

What is the difference between structured data, schema, and semantic HTML?

These three terms get used interchangeably, but they refer to different things.

Structured data is the concept: organising information so machines can interpret it reliably. It is not a format or a tool: it is the goal.

Schema markup is one way to implement it: a standardised vocabulary from schema.org, most often delivered as JSON-LD, that explicitly labels what the content on a page is. That is what this article covers.

Semantic HTML is another way to implement it: native HTML elements (<article>, <section>, <h2>) that communicate meaning and structure through the document itself, without a separate vocabulary layer.

Both schema and semantic HTML serve machine interpretation, but they operate differently. Schema labels entities and properties explicitly. Semantic HTML defines the document structure that machines (including AI retrieval systems) use to identify and chunk content. Neither replaces the other.

What does structured data do?

The HTML on a page describes how content should be displayed. Structured data describes what the content actually means. A <p> tag containing “Dr. Sarah Wilson” tells a browser to render the text in paragraph style; a Person schema with name “Dr. Sarah Wilson” and a knowsAbout array tells search engines that this string refers to a specific real person with declared expertise.

Schema also resolves ambiguity. A page about “Mercury” can declare itself as being about the planet (Place), the chemical element (ChemicalSubstance), or the band (MusicGroup), so parsers do not have to infer the subject from surrounding text.

The benefits cascade through several systems:

  • Search engines use structured data to power rich results (star ratings, recipe cards, product information, event details, breadcrumbs).
  • AI retrieval systems (Bing/Copilot, Google AI Overviews, Perplexity) use it to extract metadata about authors, dates, publishers, and content type at index time.
  • Knowledge graphs use it to construct entity relationships across the web.
  • Voice assistants use it for spoken responses.

Implementation: JSON-LD as the standard

Three formats are valid for structured data: JSON-LD, Microdata, and RDFa. Use JSON-LD. It is Google’s recommended format, sits in a separate <script> block independent of HTML structure, and is significantly easier to maintain than the alternatives.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Structured Data and Schema Markup",
  "datePublished": "2026-04-25",
  "author": {
    "@type": "Person",
    "@id": "https://example.com/#author"
  }
}
</script>

JSON-LD blocks can be placed in <head> or <body>. <head> is conventional. Multiple JSON-LD blocks per page are permitted; a common pattern is one Article block, one BreadcrumbList block, plus any rich-result-specific blocks.

Component pattern. Build a reusable schema component (Astro, React, Vue) that takes a JSON object and emits it as a <script type="application/ld+json">. Each page builds its own schema object based on its content, keeping schema in sync with page data without manual updates.
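
The component pattern can be sketched outside any particular framework. The Python version below is a minimal illustration, not a prescribed implementation: render_json_ld, article_schema, and the page field names are hypothetical, and an Astro, React, or Vue component would follow the same two steps (build the object from page data, then serialise it into the script element).

```python
import json


def render_json_ld(data: dict) -> str:
    """Serialise a schema object into a JSON-LD <script> element."""
    payload = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "<" so a value containing "</script>" cannot close the element early.
    payload = payload.replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def article_schema(page: dict) -> dict:
    """Build the Article object from page data (field names are hypothetical)."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": page["title"],
        "datePublished": page["published"],
        "author": {"@type": "Person", "@id": page["author_id"]},
    }


page = {
    "title": "Structured Data and Schema Markup",
    "published": "2026-04-25",
    "author_id": "https://example.com/#author",
}
html_snippet = render_json_ld(article_schema(page))
print(html_snippet)
```

Because the schema object is derived from the same page data that drives the template, a changed title or date flows into the markup without a separate edit.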

Which schema types should I implement?

The types below are grouped by site model. Where a type’s rich result eligibility has changed, that is noted: both FAQPage and HowTo retain value for AI extraction despite losing their Google Search rich result formats.

Editorial and publication sites: Article (or NewsArticle, TechArticle), Person (for authors), Organization (for publisher), BreadcrumbList.

E-commerce: Product, Offer, AggregateRating, Review, BreadcrumbList, Organization. HowTo on product-related instruction pages.

Local businesses: LocalBusiness (or a subtype such as Restaurant, Dentist, Plumber), PostalAddress, OpeningHoursSpecification, AggregateRating.

SaaS and software: SoftwareApplication, Organization, AggregateRating, Review, BreadcrumbList.

Personal sites and portfolios: Person, CreativeWork, AboutPage, BreadcrumbList.

Recipe sites: Recipe (with recipeIngredient, cookingMethod, nutrition), AggregateRating, Review.

Event sites: Event (with location, performer, offers), Place.

Video-heavy sites: VideoObject (with thumbnailUrl, uploadDate, duration), embedded on the page hosting the video.
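
The duration property on VideoObject is written as an ISO 8601 duration rather than a number of seconds, which is easy to get wrong by hand. The sketch below shows one way to generate it; the helper names and the example video are hypothetical.

```python
def iso8601_duration(total_seconds: int) -> str:
    """Format a length in seconds as an ISO 8601 duration such as PT1M33S."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    # Always emit at least one component, so zero seconds becomes PT0S.
    if seconds or text == "PT":
        text += f"{seconds}S"
    return text


def video_object(name: str, thumbnail_url: str, upload_date: str, length_seconds: int) -> dict:
    """Build a VideoObject for the page that hosts the video."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "duration": iso8601_duration(length_seconds),
    }


video = video_object(
    name="Adding JSON-LD to a page",
    thumbnail_url="https://example.com/thumbs/json-ld.jpg",
    upload_date="2026-04-25",
    length_seconds=93,
)
print(video["duration"])  # PT1M33S
```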

FAQPage. Apply to pages with a genuine FAQ section. Google removed FAQ rich results from Search in May 2026 (restrictions had already limited eligibility to government and health sites since 2023). FAQPage schema no longer produces a visible result in Google Search but retains value for AI extraction: Google AI Overviews, ChatGPT Search, and Perplexity use structured data to identify Q&A content.
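
As an illustration, FAQPage markup can be generated from the same question and answer pairs that are rendered on the page, which keeps the two in step. The faq_page helper below is a hypothetical name, and the pairs are taken from this article’s own FAQ section.

```python
import json


def faq_page(pairs: list) -> dict:
    """Build FAQPage markup from (question, answer) pairs shown on the page."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


pairs = [
    (
        "Can structured data be added retrospectively to old content?",
        "Yes. Adding schema to existing pages is a frequent quick-win SEO project.",
    ),
    (
        "How much schema is too much?",
        "Schema should describe the page accurately and completely.",
    ),
]
faq = faq_page(pairs)
print(json.dumps(faq, indent=2))
```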

HowTo. For pages with sequential instructional content. HowTo rich results were deprecated in September 2023 and no longer appear in Google Search. The schema retains value for AI extraction.

WebSite (with SearchAction). (Deprecated) Google’s Sitelinks Search Box (the visual search input shown in branded SERP results) was deprecated and no longer appears in search results. A SearchAction declared on WebSite schema through the potentialAction property retains some relevance for agentic and AI-powered search systems that use it to understand site search capabilities, but it produces no visual rich result in Google Search.
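
For reference, a WebSite object with a SearchAction typically looks like the sketch below. The search URL on example.com is an assumption, and search_url is a hypothetical helper that shows how a consumer could expand the template into a concrete query.

```python
from urllib.parse import quote_plus

website = {
    "@context": "https://schema.org",
    "@type": "WebSite",
    "@id": "https://example.com/#website",
    "url": "https://example.com/",
    "potentialAction": {
        "@type": "SearchAction",
        # The placeholder name must match the name given in query-input.
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://example.com/search?q={search_term_string}",
        },
        "query-input": "required name=search_term_string",
    },
}


def search_url(site: dict, query: str) -> str:
    """Expand the SearchAction template for one query, URL-encoding the term."""
    template = site["potentialAction"]["target"]["urlTemplate"]
    return template.replace("{search_term_string}", quote_plus(query))


print(search_url(website, "schema markup"))
```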

The @id graph pattern

For sites with multiple schema entities (Person, Organization, WebSite), the most powerful pattern is to give each entity a stable @id and reference them across pages.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Person",
      "@id": "https://example.com/#person",
      "name": "Author Name"
    },
    {
      "@type": "WebSite",
      "@id": "https://example.com/#website",
      "author": { "@id": "https://example.com/#person" }
    },
    {
      "@type": "Article",
      "@id": "https://example.com/article/#article",
      "author": { "@id": "https://example.com/#person" },
      "isPartOf": { "@id": "https://example.com/#website" }
    }
  ]
}

Cross-referencing the same @id across every article an author writes gives knowledge graph systems an explicit signal linking content to a person, regardless of whether that relationship is stated in prose. This is especially valuable for personal brands and organisations that have not yet established Wikipedia or Wikidata presence.
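
One practical risk with this pattern is a reference that points at an @id nothing defines. The sketch below checks a single @graph for such dangling references; the function names are hypothetical, and it assumes every referenced entity is defined in the same block (a site that defines an entity on another page would need to merge those blocks first).

```python
def defined_ids(doc: dict) -> set:
    """Return the @id of every node declared at the top level of @graph."""
    return {node["@id"] for node in doc["@graph"] if "@id" in node}


def referenced_ids(value):
    """Yield each @id used as a pure reference, i.e. an object whose only key is @id."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for child in value.values():
                yield from referenced_ids(child)
    elif isinstance(value, list):
        for item in value:
            yield from referenced_ids(item)


def dangling_references(doc: dict) -> list:
    """List referenced @id values that no node in the graph defines."""
    return sorted(set(referenced_ids(doc["@graph"])) - defined_ids(doc))


doc = {
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Person", "@id": "https://example.com/#person", "name": "Author Name"},
        {
            "@type": "WebSite",
            "@id": "https://example.com/#website",
            "author": {"@id": "https://example.com/#person"},
        },
        {
            "@type": "Article",
            "@id": "https://example.com/article/#article",
            "author": {"@id": "https://example.com/#person"},
            "isPartOf": {"@id": "https://example.com/#website"},
        },
    ],
}
print(dangling_references(doc))  # []
```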

How do you validate structured data?

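Two tools to use before shipping any schema change:

  • Schema Markup Validator (validator.schema.org). Checks markup against the schema.org vocabulary.
  • Google Rich Results Test. Checks whether a page meets Google’s requirements for specific rich results.

Both should pass. The Schema Markup Validator alone is not sufficient because Google has additional requirements (image sizes, required properties for specific rich results) that go beyond the base schema.org spec.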
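
Before a page reaches either tool, a local syntax check can catch JSON-LD that does not parse at all. The sketch below uses only the Python standard library; JsonLdExtractor is a hypothetical name, and the check covers JSON syntax only, so it complements the validators rather than replacing them.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect and parse every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []  # successfully parsed JSON-LD objects
        self.errors = []  # parse error messages for broken blocks

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError as exc:
                self.errors.append(str(exc))


page_html = """
<head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article"}</script>
<script type="application/ld+json">{"@type": "BreadcrumbList",}</script>
</head>
"""
extractor = JsonLdExtractor()
extractor.feed(page_html)
extractor.close()
print(len(extractor.blocks), "valid,", len(extractor.errors), "broken")
```

The second block in the sample has a trailing comma, which is invalid JSON, so it is reported as broken while the first parses normally.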

Whether structured data influences AI search is more contested than most writing on the topic suggests.

For retrieval-augmented systems (Bing/Copilot, Google AI Overviews, Perplexity) there is platform-level confirmation that structured data is used. Fabrice Canel of Microsoft confirmed in March 2025 that schema helps Bing’s systems understand content.[1] Google’s AI Overviews are built on years of structured data investment. In these pipelines, schema can influence how content is classified and retrieved.

For pure LLM inference, the evidence is weaker. The only empirical study to date (Search Atlas, December 2024) found no correlation between schema coverage and citation rates across OpenAI, Gemini, and Perplexity.[2] One reason: JSON-LD lives in <script> tags, which preprocessing pipelines typically strip before LLM training, meaning schema may never enter the model’s weights at all.

What this means in practice: schema is worth implementing for the retrieval-augmented use cases where its benefit is confirmed, and for entity reinforcement via @id. Treating it as a direct LLM citation lever, without evidence for a specific platform, overstates what is known.

Can LLMs read schema?

Information declared only in your JSON-LD may not reach LLMs the way you intend. LLMs read <script> blocks as plain text, not as structured metadata: tests using deliberately broken schema with fake types and invalid properties found LLMs still returning the content as if it were valid. If your author name, expertise, or key claims matter, state them in the prose too. Schema is worth implementing for rich results, Bing/Copilot, and entity recognition, but it should reinforce what the page already says, not substitute for it.

Common technical mistakes

Schema describing content not visible on the page. Adding FAQ schema for questions not actually shown to users is a guidelines violation that has resulted in manual actions. The general rule: if the user cannot see it, do not mark it up.
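
A rough automated check for this rule is to confirm that each marked-up question also appears in the page’s text. The sketch below is an approximation under stated assumptions: the names are hypothetical, it ignores script and style content, and it cannot detect text hidden with CSS.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so formatting differences do not matter."""
    return " ".join(text.split()).casefold()


def questions_not_on_page(page_html: str, faq: dict) -> list:
    """Return FAQ question names that do not occur in the page text."""
    parser = VisibleText()
    parser.feed(page_html)
    parser.close()
    page_text = normalise(" ".join(parser.chunks))
    return [
        question["name"]
        for question in faq.get("mainEntity", [])
        if normalise(question["name"]) not in page_text
    ]


page_html = """
<main>
  <h3>How much schema is too much?</h3>
  <p>Schema should describe the page accurately and completely.</p>
</main>
"""
faq = {
    "@type": "FAQPage",
    "mainEntity": [
        {"@type": "Question", "name": "How much schema is too much?"},
        {"@type": "Question", "name": "Is schema a ranking factor?"},
    ],
}
missing = questions_not_on_page(page_html, faq)
print(missing)  # ['Is schema a ranking factor?']
```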

Conflicting @type values across blocks. A page with two competing Article schemas confuses parsers. Have one Article block per page.

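Missing required properties. Google defines required properties for each rich result type (Recipe needs name and image; Product needs name plus at least one of review, aggregateRating, or offers), while Article has only recommended properties such as headline. Schema.org itself marks nothing as required, so markup that omits these properties can still be valid schema.org, but the page is not eligible for the corresponding rich result.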
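
A build-time check can flag nodes that lack the properties a target rich result needs. In the sketch below the function name is hypothetical and the REQUIRED_BY_TYPE table is illustrative only; the real lists should be taken from Google’s documentation for each rich result type.

```python
def missing_properties(node: dict, required: dict) -> list:
    """Return required properties that are absent or empty on a schema node."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]  # @type may be a single string or a list of types
    missing = []
    for type_name in types:
        for prop in required.get(type_name, []):
            if not node.get(prop) and prop not in missing:
                missing.append(prop)
    return missing


# Illustrative table, an assumption for this example rather than a full list.
REQUIRED_BY_TYPE = {"Recipe": ["name", "image"]}

recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Flatbread",
}
print(missing_properties(recipe, REQUIRED_BY_TYPE))  # ['image']
```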

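Using string IDs that aren’t URLs. @id should be an absolute URL (IRI). A bare string like "@id": "author-1" is treated as a relative reference and resolved against the URL of the page it appears on, so it names a different node on every page and cannot link entities across the site.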
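
This is straightforward to lint. The sketch below walks a JSON-LD object and reports every @id that is not an absolute http or https URL; the function names are hypothetical, and restricting the check to those two schemes is an assumption that suits typical websites.

```python
from urllib.parse import urlsplit


def is_absolute_url(value: str) -> bool:
    """True when the value has an http(s) scheme and a host."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def collect_ids(value):
    """Yield every @id string found anywhere in a JSON-LD structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            if key == "@id" and isinstance(child, str):
                yield child
            else:
                yield from collect_ids(child)
    elif isinstance(value, list):
        for item in value:
            yield from collect_ids(item)


article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "@id": "https://example.com/article/#article",
    "author": {"@type": "Person", "@id": "author-1"},
}
bad_ids = [value for value in collect_ids(article) if not is_absolute_url(value)]
print(bad_ids)  # ['author-1']
```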

Schema only on some pages. Inconsistent application across the site fragments the entity graph. Apply schema systematically.

Inaccurate dateModified. Updating dateModified on every build (a common mistake with static-site generators) makes every page look freshly edited and destroys the freshness signal. Set it from the actual content modification date.
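
One way to tie dateModified to real edits is to store a hash of the page content alongside the date and change the date only when the hash changes. The sketch below assumes the build can persist a small record per page; resolve_date_modified and the record layout are hypothetical. Where content lives in version control, the last commit date of the source file is a common alternative.

```python
import hashlib
from datetime import date
from typing import Optional


def resolve_date_modified(content: str, previous: Optional[dict], today: date) -> dict:
    """Keep the stored dateModified unless the content hash has changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if previous and previous["hash"] == digest:
        return previous  # rebuild without edits: date stays as it was
    return {"hash": digest, "dateModified": today.isoformat()}


first = resolve_date_modified("Original body text", None, date(2026, 4, 25))
rebuilt = resolve_date_modified("Original body text", first, date(2026, 5, 1))
edited = resolve_date_modified("Revised body text", rebuilt, date(2026, 5, 2))

print(rebuilt["dateModified"])  # 2026-04-25
print(edited["dateModified"])   # 2026-05-02
```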

Maintenance and monitoring

Schema implementations drift. Changes to the properties of schema types, framework upgrades that change rendering, and CMS template edits can all silently break schema. Monitoring approaches:

  • Search Console > Enhancements. Reports rich result eligibility and errors per schema type.
  • Periodic spot-checks. Run the Rich Results Test on a sample of important URLs quarterly.
  • Crawl-based audits. Screaming Frog and Sitebulb extract structured data from every URL during a crawl, surfacing missing or invalid schema across the site.

Frequently asked questions

Does structured data improve rankings directly?
No. Structured data is not a direct ranking factor; any ranking movement comes from its indirect effects (rich results, better CTR, clearer topic signals, AI citation).

How do I know which schema types my site is eligible for?
Google’s search gallery lists current rich result types and their requirements. Not all schema types produce rich results; those that don’t still serve indexing and AI purposes.

Can structured data be added retrospectively to old content?
Yes. Adding schema to existing pages is a frequent quick-win SEO project. Pages that previously had no schema often gain rich result eligibility within weeks of implementation.

How much schema is too much?
Schema should describe the page accurately and completely. There is no penalty for detailed markup, provided every schema block reflects real on-page content.

Footnotes

  1. Microsoft Bing/Copilot use schema for its LLMs — Search Engine Land

  2. The Limits of Schema Markup for AI Search — Search Atlas