Generative Web Standard

Version 1.0 — Public Draft

Published: 2026
Status: Public Draft (open for technical comment)
Supersedes: None (inaugural standard)
Issued by: Institute for Generative Web Standards (IGWS)
Normative deps: schema.org v29.x · IETF RFC 9309 · WCAG 2.2 · C2PA 2.x
Next Review: 12 months from ratification

§0

Scope

This standard defines the minimum architectural requirements for web entities seeking legitimate, consistent citation by AI-mediated information retrieval systems, including large language model search interfaces, generative answer engines, and AI overview systems.

Conformance does not guarantee citation. It establishes the structural conditions under which citation becomes technically possible and editorially probable.

This standard applies to publicly accessible web properties, organizational and personal web entities, and multi-entity topic architectures operating under the Citation Mesh Standard (CMS v1.0, forthcoming).

This standard is concerned with structural and technical conditions of citation eligibility. It does not address content quality, factual accuracy, editorial values, or the merits of any particular content. Those concerns are the proper domain of the publishing entity, not of a structural standard.

§1

Definitions

AI Retrieval System: Any system that uses a language model to synthesize information from indexed web sources into a generated response. The class includes large language model search interfaces, generative answer engines, and AI overview systems.
Web Entity: A domain-level web property with a distinct organizational or authorial identity, a resolvable canonical URL structure, and independently maintained content.
Practitioner Posture: Content authored from the direct operational experience of the named author. Grounded in first-person accounts of work performed; cites internal evidence (work product, methods, results) as primary support; external citation is supplementary.
Analytical Posture: Content authored as synthesis of evidence drawn from multiple external sources. Author position is interpretive, comparative, or evaluative rather than experiential. External citation is primary support.
Narrative Posture: Content built around case study, experience report, or sequential account establishing context and trajectory. The unit of value is the specific account, not transferable principle.
Criterion Tier: Each criterion is classified Required (non-conformance disqualifies certification), Recommended (best practice; weighted in conformance scoring), or Conditional (applies only when the stated condition holds).
Retrieval-Excluded: A conformance designation for entities that satisfy all Required technical criteria but have published an editorial policy excluding their content from AI retrieval crawling. Such entities are conformant; their exclusion is editorial, not structural.
Audit Test: A reproducible procedure for verifying conformance with a single criterion, including its sample selection rule.

§2

Technical Architecture

GWS-01 Required

Canonical URL Structure

Requirement

Every page must declare a canonical URL via <link rel="canonical">. Duplicate content across subdomains, protocol variants, trailing-slash variants, or query-parameter variants must resolve to a single canonical.

Rationale

AI retrieval systems use canonical URLs as content identifiers. When several URLs resolve to the same content, they fragment the entity's citation surface, splitting authority signals and making attribution less reliable.

Audit Test

Sample 10 pages: the homepage, the three highest-traffic content pages, and six pages selected at random from the sitemap. For each, confirm that a canonical tag is present, that the tag on the canonical URL is self-referencing, and that every non-canonical variant points to the canonical URL.
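Informative sketch (non-normative): one way to automate the tag check, in Python with the standard library only. The names CanonicalFinder, canonical_of, and check_variants are illustrative and not defined by this standard. The sketch assumes the auditor has already fetched the HTML for each sampled URL and its variants.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"].strip())


def canonical_of(html):
    """Return the single declared canonical URL, or None if absent or ambiguous."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    unique = set(finder.canonicals)
    return unique.pop() if len(unique) == 1 else None


def check_variants(canonical_url, pages):
    """pages maps each fetched URL (the canonical and its variants) to its HTML.

    Returns the URLs whose declared canonical is missing or differs from
    canonical_url. An empty list means the sample passes.
    """
    return [url for url, html in pages.items() if canonical_of(html) != canonical_url]
```

A page that declares two different canonical URLs is reported as failing, on the assumption that an ambiguous declaration is equivalent to a missing one.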

GWS-02 Required

Crawl Accessibility

Requirement

The entity must maintain a valid robots.txt per IETF RFC 9309. Known AI crawler user agents (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, CCBot) must not be disallowed unless the exclusion is covered by the entity's published AI crawler policy (GWS-18).

Rationale

AI retrieval systems require crawl access to index content. Inadvertent exclusion through outdated or imprecise robots directives is the most common single cause of citation failure for otherwise high-quality entities.

Audit Test

Fetch /robots.txt. Parse per RFC 9309. Confirm that none of the named user agents is disallowed except as stated in the entity's AI crawler policy.
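Informative sketch (non-normative): the check can be approximated with Python's urllib.robotparser. That module predates RFC 9309 and may differ from it in matching edge cases, so a result here is indicative rather than authoritative. The function name blocked_ai_crawlers and the allowed_exclusions parameter (the agents the entity's published policy deliberately excludes) are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the GWS-02 requirement.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "OAI-SearchBot",
    "CCBot",
]


def blocked_ai_crawlers(robots_txt, url="/", allowed_exclusions=()):
    """Return named AI crawlers that may not fetch url and are not
    covered by the entity's published AI crawler policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]
    return [agent for agent in blocked if agent not in allowed_exclusions]
```

An auditor would typically call this once per sampled path rather than only for the root, since a Disallow rule may target a single content section.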

GWS-03 Required

Page Speed Threshold

Requirement

Largest Contentful Paint (LCP) must be at or below 2.5 s on mobile for at least three of four sampled pages (the homepage and three content pages), measured via the PageSpeed Insights API (CrUX field data preferred; Lighthouse lab data acceptable when CrUX is unavailable).

Rationale

AI retrieval crawlers operate under crawl budgets. Slow-loading pages are deprioritized in crawl scheduling and may be indexed with stale or partial content. Page speed is also a documented ranking signal for the search systems whose indices feed several AI retrieval pipelines.

Audit Test

Run PageSpeed Insights API on homepage and three content pages. Confirm mobile LCP <= 2.5s for 3 or more of 4 pages.
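Informative sketch (non-normative): the pass rule applied to already-fetched PageSpeed Insights responses. The field paths used here (loadingExperience.metrics.LARGEST_CONTENTFUL_PAINT_MS.percentile for CrUX data and lighthouseResult.audits["largest-contentful-paint"].numericValue for lab data) reflect the v5 response format as commonly documented and should be verified against the current API reference. The function names are illustrative.

```python
LCP_LIMIT_MS = 2500  # 2.5 s threshold from the requirement


def lcp_ms(psi_response):
    """Return mobile LCP in milliseconds, preferring CrUX field data
    and falling back to Lighthouse lab data. None if neither is present."""
    field = (
        psi_response.get("loadingExperience", {})
        .get("metrics", {})
        .get("LARGEST_CONTENTFUL_PAINT_MS")
    )
    if field and "percentile" in field:
        return float(field["percentile"])
    audit = (
        psi_response.get("lighthouseResult", {})
        .get("audits", {})
        .get("largest-contentful-paint")
    )
    if audit and "numericValue" in audit:
        return float(audit["numericValue"])
    return None


def passes_gws03(responses, min_passing=3):
    """True when at least min_passing sampled pages meet the LCP limit.
    Pages with no measurable LCP count as failing."""
    values = [lcp_ms(response) for response in responses]
    passing = sum(1 for value in values if value is not None and value <= LCP_LIMIT_MS)
    return passing >= min_passing
```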

GWS-04 Recommended

HTTPS with Valid Certificate

Requirement

All content served over HTTPS with no mixed-content warnings and a non-expired certificate from a recognized CA. HSTS header (Strict-Transport-Security) recommended with max-age >= 31536000.

Rationale

AI retrieval systems treat insecure transport as a negative trust signal. Mixed-content warnings and expired certificates correlate with abandoned or compromised properties.

Audit Test

Run an SSL Labs scan and confirm a rating of A or above. Verify that the Strict-Transport-Security header is present.
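Informative sketch (non-normative): checking the recommended max-age once the Strict-Transport-Security header value has been read from a response. The function names are illustrative. The sketch does not evaluate certificate validity or mixed content, which the scan above covers.

```python
MIN_MAX_AGE = 31536000  # one year in seconds, per the requirement


def hsts_max_age(header_value):
    """Extract max-age (seconds) from a Strict-Transport-Security value."""
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isdigit():
                return int(value)
    return None


def hsts_ok(header_value):
    """True when the header is present and max-age meets the minimum."""
    if header_value is None:
        return False
    age = hsts_max_age(header_value)
    return age is not None and age >= MIN_MAX_AGE
```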

GWS-22 Required

Content Language Declaration

Requirement

Every page must declare its primary content language via the lang attribute on the root <html> element (BCP 47 tag). Multilingual entities must use hreflang annotations linking equivalent-content pages across language variants with bidirectional links.

Rationale

AI retrieval systems use declared language to disambiguate entities and prevent cross-language attribution errors. Undeclared language reduces citation accuracy in non-English markets and produces translation artifacts.

Audit Test

Inspect <html lang> on five randomly sampled pages plus the canonical homepage. For multilingual entities, sample three language pairs and verify bidirectional hreflang annotation.
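Informative sketch (non-normative): verifying that hreflang annotations are bidirectional. The input structure (a mapping from each page URL to the language-to-URL pairs it declares) and the function name are assumptions made for illustration. Extracting the annotations from pages, headers, or sitemaps is left to the auditor.

```python
def missing_return_links(alternates):
    """alternates maps a page URL to {language_tag: alternate_url}, taken
    from that page's hreflang annotations.

    Returns sorted (source, target) pairs where source links to target
    but target does not link back to source.
    """
    missing = []
    for source, links in alternates.items():
        for target in links.values():
            if target == source:
                continue  # self-referencing annotation, nothing to reciprocate
            back_links = alternates.get(target, {})
            if source not in back_links.values():
                missing.append((source, target))
    return sorted(missing)
```

A target page that was never fetched appears as missing its return link, which prompts the auditor to sample it.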

GWS-23 Recommended

Semantic HTML5 Structure

Requirement

Content pages should use HTML5 semantic elements (<article>, <header>, <main>, <nav>, <section>, <aside>) to demarcate document regions. Primary content must be wrapped in <main>; bylined article content must be wrapped in <article>.

Rationale

Semantic structure improves AI extraction accuracy by signaling which page region contains primary content versus navigation or supplementary material. It also reduces extraction errors in which AI systems quote navigation copy or boilerplate as substantive content.

Audit Test

Inspect five content pages for presence of <main> containing primary content and <article> wrapping bylined material.

GWS-24 Required

Accessibility Floor

Requirement

Pages must conform to WCAG 2.2 Level AA. Automated accessibility audit must show zero Level A failures and no more than two Level AA failures per audited page.

Rationale

Human legibility is a precondition of machine legibility. Failures correlated with citation failure include missing alt text on content images, absent or scrambled heading hierarchy, and rendering of substantive content as inaccessible images.

Audit Test

Run automated accessibility audit (axe, WAVE, or Lighthouse Accessibility) on homepage and four content pages. Confirm thresholds.

GWS-25 Required

URL Stability Semantics

Requirement

Canonical URLs must not contain session identifiers, tracking parameters (utm_*, fbclid, gclid, _ga, mc_*), or user-specific tokens. Redirect chains from http:// and trailing-slash variants must resolve in two hops or fewer.

Rationale

AI retrieval systems use canonical URLs as content identity. URLs that vary per session or carry tracking parameters fragment that identity. Long redirect chains cause crawl abandonment and increase the rate at which content is indexed under non-canonical URLs.

Audit Test

Inspect canonical URLs on 10 sampled pages. Trace redirect chains from http:// and trailing-slash variants of the homepage and three content pages.
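Informative sketch (non-normative): flagging the tracking parameters named in the requirement and applying the two-hop limit to a recorded redirect chain. The function names are illustrative. Session identifiers and user-specific tokens are not detected here because their parameter names vary by platform.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names and prefixes listed in the GWS-25 requirement.
TRACKING_EXACT = {"fbclid", "gclid", "_ga"}
TRACKING_PREFIXES = ("utm_", "mc_")


def tracking_params(url):
    """Return the tracking parameter names found in the URL's query string."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return [
        name
        for name in names
        if name.lower() in TRACKING_EXACT or name.lower().startswith(TRACKING_PREFIXES)
    ]


def hops_ok(chain, max_hops=2):
    """chain lists every URL visited, from the starting URL to the final one.
    The number of hops is one less than the number of URLs."""
    return len(chain) - 1 <= max_hops
```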

§3

Entity & Identity

GWS-05 Required

Organization Schema

Requirement

Root domain must include a JSON-LD Organization (or appropriate subtype) block containing at minimum: name, url, description, foundingDate. The sameAs array must include at least one external authoritative reference within 90 days of public launch.

Rationale

The Organization schema and its sameAs array are the primary mechanism by which AI knowledge graphs ground a domain to an external entity. Without external grounding, the entity is unverifiable and treated as a low-trust source.

Audit Test

Validate JSON-LD via the schema.org validator. Confirm sameAs populated with at least one tiered reference. Resolve each sameAs URI; confirm corroboration of organization identity.
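Informative sketch (non-normative): a pre-check for the minimum Organization properties before running the validator. It assumes the JSON-LD block is a single top-level object rather than an @graph array, and it does not resolve sameAs URIs or verify the @type subtype. The function name is illustrative.

```python
import json

# Minimum properties named in the GWS-05 requirement.
REQUIRED_PROPERTIES = ("name", "url", "description", "foundingDate")


def organization_gaps(jsonld_text):
    """Return a list of problems found in an Organization JSON-LD block."""
    data = json.loads(jsonld_text)
    problems = [f"missing {prop}" for prop in REQUIRED_PROPERTIES if not data.get(prop)]
    same_as = data.get("sameAs") or []
    if isinstance(same_as, str):
        same_as = [same_as]  # a single URI is permitted in place of an array
    if not same_as:
        problems.append("sameAs is empty")
    return problems
```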

GWS-06 Required

Named Authorship

Requirement

Content intended for citation must carry a named bylined author. Anonymous bylines, generic staff bylines (Staff Writer, Editorial Team), and uncredited content do not satisfy this criterion. Each named author must have a corresponding Person schema block, either inline or referenced via @id.

Rationale

AI retrieval systems weight author identity as a primary expertise signal. Unattributed content is treated as low-confidence and is preferentially passed over for citation when attributed alternatives exist.

Audit Test

Inspect five content pages for byline presence and corresponding Person schema.

GWS-07 Required

Author Identity Continuity

Requirement

Person schema for each bylined author must include either a url resolving to a stable internal author profile page that links to external corroborating profiles, or sameAs references to external profiles. Self-controlled portfolios without external corroboration do not satisfy this criterion alone.

Rationale

AI retrieval systems verify author claims by traversing identity chains. A claim to expertise that resolves only to assertions on the same domain forms a closed loop and is weighted accordingly.

Audit Test

For five sampled bylined authors, follow the Person schema URL and sameAs references; confirm at least one external corroborating profile exists and is consistent with the byline.
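Informative sketch (non-normative): separating external sameAs references from self-controlled ones in a Person block. It covers only the sameAs branch of the requirement. Following an internal author profile page to its outbound links, and judging whether a profile is consistent with the byline, remain manual steps. The function names are illustrative.

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased hostname with a leading "www." removed."""
    name = (urlsplit(url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def external_profiles(person, site_domain):
    """Return sameAs URLs hosted outside site_domain and its subdomains."""
    same_as = person.get("sameAs") or []
    if isinstance(same_as, str):
        same_as = [same_as]
    site = host("https://" + site_domain)
    external = []
    for url in same_as:
        name = host(url)
        if name and name != site and not name.endswith("." + site):
            external.append(url)
    return external
```

An empty result indicates the closed loop described in the rationale: every identity reference resolves to the publishing domain itself.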

GWS-08 Recommended

Wikidata Entity

Requirement

The organization should maintain a Wikidata entry classified as instance of: organization (or relevant subtype), with P856 (official website) matching the canonical domain. Where Wikidata notability requirements are not met, the criterion is waived; an industry-registry equivalent satisfies in its place.

Rationale

Wikidata is the highest-trust grounding reference for AI knowledge graphs, including the Google Knowledge Graph and the entity layers of major LLM providers. The waiver acknowledges that Wikidata's notability bar legitimately excludes some conformant entities.

Audit Test

Query Wikidata for entity; confirm P856 matches domain. If absent, confirm GWS-05 tier-2 reference is present.

GWS-09 Recommended

Editorial Independence Statement

Requirement

A published editorial policy or about page must declare ownership structure, funding sources where these could materially affect editorial decisions, sponsorship and affiliate-relationship disclosure policy, and any material financial relationships with subjects of coverage.

Rationale

Editorial independence is a documented input to AI retrieval source-weighting. Undeclared relationships, when later discovered, produce systematic down-weighting of the entire entity.

Audit Test

Locate and review editorial policy. Confirm presence of ownership, funding, and material-relationship declarations.

GWS-26 Recommended

Author Affiliation

Requirement

Person schema for bylined authors should include an affiliation property linked to a resolvable Organization schema entity, either internal (the publishing entity itself) or external (the author's primary professional affiliation).

Rationale

Affiliation grounding allows AI retrieval systems to weight author expertise claims against the affiliated organization's authority. Unaffiliated person entities are weaker citation candidates than equivalently expert affiliated persons.

Audit Test

Inspect Person schema on five bylined content pages; confirm affiliation property resolves to an Organization entity.

§4

Content Posture

GWS-10 Recommended

Declared Epistemic Posture

Requirement

Each content section should declare a primary epistemic posture (Practitioner, Analytical, or Narrative) and apply it consistently. Posture consistency within a section means 80% or more of pages in that section conform to the declared posture.

Rationale

AI retrieval systems route queries against source posture. Operational queries route preferentially to Practitioner sources; comparative queries to Analytical sources; specific-account queries to Narrative sources. Posture confusion within a content section reduces citation routing efficiency.

Audit Test

Sample five content pages from each declared section. Classify posture per the §1 definitions. Confirm 80% or more match within each section.

GWS-11 Required

Question Alignment

Requirement

Content pages must align title, H1, and meta description around an answerable question or topic. The page must contain a clear answer, claim, or thesis statement within the first 200 words of body content, in language consistent with the title and H1.

Rationale

AI retrieval systems extract candidate answers from early-position content. Pages whose primary thesis is buried below the fold, or whose title misrepresents the body, are routinely misquoted, partially quoted, or skipped.

Audit Test

For five content pages, identify the question or topic implied by title/H1/meta description. Verify a clear corresponding answer or thesis appears within the first 200 words of body content.

GWS-12 Recommended

FAQ Schema on Eligible Pages

Requirement

Pages addressing multiple discrete questions should implement FAQPage schema with mainEntity question/answer pairs. Note: FAQ rich-result eligibility was narrowed by Google in 2023; this criterion exists for AI retrieval purposes independent of search rich-result entitlement.

Rationale

FAQ schema remains a strong signal to AI retrieval systems regardless of its current status as a search rich-result feature. Question/answer pairs are directly extractable as citation units.

Audit Test

Identify pages addressing multiple discrete questions; validate FAQ schema presence on those pages.

GWS-13 Recommended

Content Freshness Signal

Requirement

datePublished and dateModified must be declared in both visible page content and Article or WebPage schema. dateModified must reflect actual content modification, not a regeneration timestamp; pages whose dateModified updates on every render do not satisfy this criterion.

Rationale

AI retrieval systems use freshness as a routing signal for time-sensitive queries. Inaccurate dateModified (regeneration-stamping) is detected by retrieval systems and trains them to ignore the entity's freshness signal entirely.

Audit Test

Inspect schema dateModified on five content pages. Cross-reference against visible content modification dates and archive snapshots to confirm dates correspond to actual content changes.
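Informative sketch (non-normative): detecting regeneration-stamping from a series of snapshots of the same page. The approach (hash the whitespace-normalized body text and flag any case where dateModified changed while the hash did not) is one possible heuristic, not a method prescribed by this standard. The function names and the snapshot structure are illustrative.

```python
import hashlib


def content_fingerprint(body_text):
    """SHA-256 of the body text with whitespace runs collapsed."""
    normalized = " ".join(body_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_regeneration_stamped(snapshots):
    """snapshots is a chronological list of (date_modified, body_text) pairs.

    Returns True when two consecutive snapshots carry different
    dateModified values but identical body content.
    """
    for (date_a, body_a), (date_b, body_b) in zip(snapshots, snapshots[1:]):
        same_content = content_fingerprint(body_a) == content_fingerprint(body_b)
        if date_a != date_b and same_content:
            return True
    return False
```

The body text passed in should exclude page regions that legitimately change on every render, such as related-article lists, or the heuristic will miss stamped pages.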

GWS-27 Required

AI-Generated Content Disclosure

Requirement

Content that is wholly or substantially AI-generated, or AI-assisted in ways that materially affect substance, must declare this via: (a) C2PA Content Credentials assertion, or (b) schema.org disclosure via creativeWorkStatus and wasGeneratedBy linking to a SoftwareApplication entity. Entities that publish no AI-assisted content must state this in their editorial policy.

Rationale

AI retrieval systems increasingly filter or down-weight AI-generated content to avoid recursive synthesis loops. Disclosure protects editorial integrity and citation eligibility. This criterion aligns with the EU AI Act Article 50 transparency provisions, effective August 2026.

Audit Test

For five content pages, verify either C2PA manifest is attached or schema disclosure is present. For entities declaring no AI-assisted content, verify the editorial policy contains an explicit statement.

§5

Structured Data

GWS-14 Required

Valid JSON-LD Implementation

Requirement

All schema markup must be implemented as JSON-LD per the schema.org v29.x vocabulary or later. The schema.org validator must report no errors or warnings for any page's primary entity markup. Microdata and RDFa formats do not satisfy this criterion.

Rationale

JSON-LD is the format universally supported across major AI retrieval systems and the schema.org-recommended format. Errors in primary entity markup cause the entire entity block to be discarded by parsers.

Audit Test

Validate five pages via the schema.org validator. Confirm zero errors and warnings on primary entity markup.

GWS-15 Required

BreadcrumbList on Content Pages

Requirement

Non-root content pages must implement BreadcrumbList schema reflecting the page's position in the site hierarchy, with position and item properties on each ListItem.

Rationale

Breadcrumb structure provides AI retrieval systems with the site-hierarchical context that disambiguates content meaning. A page titled Pricing is interpreted differently when its breadcrumb places it under /products/enterprise/ versus /legal/.

Audit Test

Inspect five content pages for BreadcrumbList schema with valid ListItem entries.
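Informative sketch (non-normative): checking a parsed BreadcrumbList object for the position and item properties the requirement names. The expectation that positions run sequentially from 1 is an assumption of this sketch, not a rule stated in the requirement. The function name is illustrative.

```python
def breadcrumb_problems(breadcrumb):
    """Return a list of problems found in a parsed BreadcrumbList object."""
    problems = []
    if breadcrumb.get("@type") != "BreadcrumbList":
        problems.append("@type is not BreadcrumbList")
    entries = breadcrumb.get("itemListElement") or []
    if not entries:
        problems.append("itemListElement is empty")
    for index, entry in enumerate(entries, start=1):
        if entry.get("@type") != "ListItem":
            problems.append(f"entry {index}: @type is not ListItem")
        # Assumption: positions are sequential from 1; numeric strings accepted.
        if str(entry.get("position")) != str(index):
            problems.append(f"entry {index}: unexpected position {entry.get('position')!r}")
        if not entry.get("item"):
            problems.append(f"entry {index}: item is missing")
    return problems
```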

GWS-16 Required

Article or WebPage Schema on Content

Requirement

Each content page must implement either Article (with author, datePublished, publisher) or WebPage schema as the primary page entity. The choice between Article and WebPage must be appropriate to content type.

Rationale

Primary entity declaration tells AI retrieval systems how to interpret the page. Misuse (Article schema on a product page; WebPage on a bylined essay) produces citation errors.

Audit Test

Validate primary entity schema on five content pages; confirm appropriate subtype.

GWS-17 Recommended

SearchAction Schema

Requirement

Root domain should implement SearchAction schema declaring the entity's internal search endpoint and query parameter.

Rationale

SearchAction enables AI retrieval systems to surface and use the entity's internal search capability when handling exploratory user queries.

Audit Test

Check homepage JSON-LD for SearchAction with target and query-input.
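Informative sketch (non-normative): locating a SearchAction with both target and query-input in a parsed homepage JSON-LD object. It assumes the action is declared under potentialAction on a single top-level object, which is the common pattern for a WebSite entity. The function name is illustrative.

```python
def has_search_action(entity):
    """True when entity declares a SearchAction with target and query-input."""
    actions = entity.get("potentialAction") or []
    if isinstance(actions, dict):
        actions = [actions]  # a single action may be given without an array
    for action in actions:
        if (
            action.get("@type") == "SearchAction"
            and action.get("target")
            and action.get("query-input")
        ):
            return True
    return False
```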

GWS-28 Recommended

Inline Citation Markup

Requirement

Content that quotes, references, or relies on external sources should mark these references using CreativeWork.citation properties or inline <cite> elements with linked source URIs.

Rationale

AI retrieval systems trace citation chains to evaluate source authority. Pages with declared citations score higher for trustworthiness; pages making external claims without traceable citation are weighted as opinion.

Audit Test

Inspect five long-form content pages for inline <cite> elements or schema citation properties when sources are referenced.

GWS-29 Required

Image and Multimodal Content Semantics

Requirement

All non-decorative images must include descriptive alt text. Primary content images (lead images, infographics, charts, diagrams) should be marked with ImageObject schema including caption, creditText, and license where applicable.

Rationale

Multimodal AI retrieval treats image semantics as content. Undescribed images are unreadable to current retrieval systems and exclude visual evidence from citation eligibility. Decorative images are exempt to avoid alt-text spam.

Audit Test

Inspect 10 non-decorative images across five content pages for substantive alt text. Verify ImageObject schema on at least three primary content images.

GWS-30 Recommended

Content Licensing Declaration

Requirement

Content pages should declare license terms via CreativeWork.license linking to a license URI (Creative Commons license, proprietary terms page, or other recognized license document).

Rationale

AI retrieval and training pipelines increasingly filter on declared licensing. Undeclared license is treated by some systems as conservative-default (no reuse), reducing citation eligibility. Entities seeking maximum citation should declare a license that explicitly permits citation with attribution.

Audit Test

Inspect license property on five content pages and the root domain.

GWS-31 Required

Canonical Content Parity

Requirement

Schema markup must reflect content that is visible on the rendered page. Schema fields containing data not present in human-readable content (hidden review counts, fabricated ratings, headlines differing from the visible H1) constitute schema spam and disqualify conformance.

Rationale

Search and AI retrieval systems penalize divergence between schema and rendered content. The criterion exists to prevent gaming and to align AI-extractable claims with reader-visible claims.

Audit Test

For five content pages, verify Article or WebPage schema fields (headline, datePublished, author, body excerpts) match visible page content.
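Informative sketch (non-normative): comparing the schema headline with the visible H1, one of the parity checks listed above. The comparison ignores case and whitespace differences, which is an assumption of this sketch, and it expects a single H1 per page. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class H1Text(HTMLParser):
    """Collects the text content of <h1> elements, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def normalize(text):
    """Collapse whitespace and fold case for a tolerant comparison."""
    return " ".join(text.split()).casefold()


def headline_matches_h1(html, schema):
    """True when the schema headline equals the visible H1 text."""
    parser = H1Text()
    parser.feed(html)
    parser.close()
    return normalize("".join(parser.parts)) == normalize(schema.get("headline", ""))
```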

GWS-32 Recommended

OpenGraph and Card Metadata

Requirement

Pages should include OpenGraph metadata (og:title, og:description, og:url, og:image, og:type) and Twitter Card metadata consistent with primary page content and the JSON-LD schema fields. Inconsistency between OG, Card, and JSON-LD on the same page is a defect.

Rationale

OG and Card metadata are widely consumed by AI retrieval systems for preview generation and entity grounding alongside JSON-LD. Inconsistency across the three vocabularies presents AI systems with conflicting signals about primary entity claims.

Audit Test

Inspect OG and Card metadata on homepage and four content pages; verify consistency with JSON-LD.

§6

AI Retrieval Configuration

GWS-18 Required

AI Crawler Policy Documentation

Requirement

The entity must publish a documented AI crawler policy distinguishing retrieval/citation crawling from training-data crawling. The policy must be machine-readable via /robots.txt directives, /llms.txt declaration, or C2PA do_not_train assertions. Conformance does not require permitting either form of crawling; entities permitting neither receive the Retrieval-Excluded conformance designation (§7).

Rationale

Editorial autonomy over AI participation is a legitimate exercise of publisher rights and is not equivalent to non-conformance. The standard requires declaration and machine-readability, not participation. Retrieval-Excluded designation allows the standard to apply to opted-out publishers for whom structural conformance remains meaningful.

Audit Test

Locate the published AI crawler policy. Verify machine-readable expression matches the stated policy. Verify policy distinguishes retrieval from training where the entity intends a distinction.

GWS-19 Required

Sitemap Currency

Requirement

XML sitemap must be present at /sitemap.xml (or referenced via robots.txt Sitemap: directive), registered with Google Search Console and Bing Webmaster Tools, and reflect all canonical content pages. lastmod dates must be accurate per GWS-13.

Rationale

Sitemap currency is the primary mechanism by which AI retrieval systems discover new and modified content. Stale or absent sitemaps produce systematic under-indexing.

Audit Test

Fetch sitemap. Verify lastmod accuracy on five sampled URLs by cross-reference with page-visible modification dates.
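Informative sketch (non-normative): extracting lastmod values from a sitemap and comparing them with page-visible modification dates gathered separately by the auditor. Only the calendar date (the first ten characters of each value) is compared, so time-of-day and time-zone differences are ignored. That simplification and the function names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps 0.9 protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_by_url(xml_bytes):
    """Map each <loc> in a sitemap to its <lastmod> value (None if absent)."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default=None, namespaces=NS)
        result[loc] = lastmod.strip() if lastmod else None
    return result


def lastmod_mismatches(sitemap_dates, page_dates):
    """Return sampled URLs whose sitemap lastmod date is missing or differs
    from the page-visible modification date (both as ISO 8601 strings)."""
    mismatched = []
    for url, page_date in page_dates.items():
        sitemap_date = sitemap_dates.get(url)
        if sitemap_date is None or sitemap_date[:10] != page_date[:10]:
            mismatched.append(url)
    return mismatched
```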

GWS-20 Recommended

llms.txt Implementation

Requirement

The entity should publish /llms.txt per the llmstxt.org specification, providing AI retrieval systems with a structured Markdown summary of the entity's content, purpose, key documents, and preferred citation format.

Rationale

llms.txt is an emerging convention with documented partial adoption among AI retrieval systems. Entities that adopt early gain the benefit of the convention where it is honored, with no detriment where it is not.

Audit Test

Fetch /llms.txt. Validate against the llmstxt.org specification structure.

GWS-21 Recommended

Speakable Content Schema

Requirement

Pages containing content appropriate for voice or audio AI interfaces should implement speakable schema annotations identifying the most citable passages.

Rationale

Voice-mediated AI retrieval systems use speakable annotations to select passages for spoken-output excerpts. Annotated pages are preferentially selected over unannotated equivalents for voice-context queries.

Audit Test

For pages with content appropriate to voice context (news, reference, FAQ), check for speakable schema presence.

GWS-33 Conditional

Sitemap Segmentation

Condition: Applies to entities with more than 50,000 canonical URLs.

Requirement

Entities meeting the condition must publish a sitemap index (sitemap_index.xml) referencing per-content-type or per-section sitemaps. Each constituent sitemap file must stay within the sitemap protocol limits of 50 MB uncompressed and 50,000 URLs.

Rationale

Single-file sitemaps exceeding protocol limits are silently truncated by crawlers, hiding content from AI retrieval. The condition exists because the criterion is irrelevant to small entities.

Audit Test

For entities meeting the condition, verify sitemap index presence; verify each constituent file's URL count and uncompressed size compliance.
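Informative sketch (non-normative): checking one constituent sitemap file against the protocol limits. The input is the uncompressed file content as bytes, and 50 MB is taken as 50 × 1024 × 1024 bytes. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed


def sitemap_stats(xml_bytes):
    """Return the URL count and uncompressed size of a sitemap file."""
    root = ET.fromstring(xml_bytes)
    return {"urls": len(root.findall("sm:url", NS)), "bytes": len(xml_bytes)}


def within_limits(xml_bytes, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """True when the file respects both the URL-count and size limits."""
    stats = sitemap_stats(xml_bytes)
    return stats["urls"] <= max_urls and stats["bytes"] <= max_bytes
```

For very large files, a streaming parser would avoid loading the whole document into memory; the sketch favors brevity over that optimization.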

GWS-34 Recommended

Change Feed

Requirement

Entities should publish either an RSS or Atom feed of recent or modified content at a discoverable URL (linked via <link rel='alternate' type='application/rss+xml'>), or maintain accurate sitemap lastmod declarations per GWS-13.

Rationale

AI retrieval systems use change signals to prioritize re-crawling. Either an explicit feed or accurate lastmod semantics serve this purpose; absence of both forces full periodic re-crawl, which is deprioritized.

Audit Test

Verify either feed presence and <link rel='alternate'> discoverability, or accurate sitemap lastmod per GWS-13 audit results.

§7

Conformance Levels

Certified — Full Conformance

All Required criteria satisfied. All applicable Conditional criteria satisfied. At least 9 of the 15 Recommended criteria satisfied.

Certified — Core Conformance

All Required criteria satisfied. All applicable Conditional criteria satisfied. Fewer than 9 Recommended criteria satisfied.

Conformance — Retrieval-Excluded

All Required criteria satisfied as applicable to a non-participating entity. All applicable Conditional criteria satisfied. AI crawler policy (GWS-18) declares retrieval exclusion. Designation acknowledges that the entity has implemented the standard's structural and editorial criteria while electing to opt out of AI retrieval participation.

Non-Conformant

One or more Required criteria not satisfied, or one or more applicable Conditional criteria not satisfied.

Certification is issued per domain. It is valid for 12 months from the audit date and is tied to the GWS version current at the time of audit. Material changes to the audited entity within the validity period require re-audit.
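Informative sketch (non-normative): the level definitions above expressed as a decision function. The enum and function names are illustrative. The order of the checks, in which a declared retrieval exclusion takes precedence over the Full and Core levels, is an assumption of this sketch; the text above does not state a precedence.

```python
from enum import Enum


class Level(Enum):
    FULL = "full"
    CORE = "core"
    RETRIEVAL_EXCLUDED = "retrieval-excluded"
    NON_CONFORMANT = "non-conformant"


RECOMMENDED_THRESHOLD = 9  # minimum Recommended criteria for Full Conformance


def conformance_level(required_met, conditional_met, recommended_count,
                      retrieval_excluded=False):
    """Classify an audited entity.

    required_met / conditional_met: all Required criteria and all applicable
    Conditional criteria are satisfied.
    recommended_count: number of Recommended criteria satisfied.
    retrieval_excluded: the GWS-18 policy declares retrieval exclusion.
    """
    if not (required_met and conditional_met):
        return Level.NON_CONFORMANT
    if retrieval_excluded:
        # Assumed precedence: exclusion is reported ahead of Full or Core.
        return Level.RETRIEVAL_EXCLUDED
    if recommended_count >= RECOMMENDED_THRESHOLD:
        return Level.FULL
    return Level.CORE
```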

§8

Standards Committee

The IGWS Standards Committee will be constituted at v1.0 ratification. Committee composition will reflect representation from web infrastructure engineering, information retrieval research, publishing and editorial operations, regulatory affairs, and at least one liaison to an established standards body (W3C, IETF, schema.org, or C2PA).

Founding member invitations are open. Inquiries: standards@igws.org

Public technical comments on this draft are invited via standards@igws.org, the public GitHub repository, and the W3C AI-Mediated Retrieval Community Group (upon chartering).

§9

Normative and Informative References

Normative

Robots Exclusion Protocol. IETF RFC 9309. September 2022. https://www.rfc-editor.org/rfc/rfc9309.html

Schema.org Vocabulary v29.x. Schema.org. https://schema.org/

Web Content Accessibility Guidelines (WCAG) 2.2. W3C. https://www.w3.org/TR/WCAG22/

Content Credentials Specification v2.x. C2PA (Coalition for Content Provenance and Authenticity). https://spec.c2pa.org/

Sitemaps Protocol 0.9. Sitemaps.org. https://www.sitemaps.org/protocol.html

BCP 47: Tags for Identifying Languages. IETF.

The llms.txt Specification. llmstxt.org. https://llmstxt.org/

Informative

GEO: Generative Engine Optimization. Aggarwal et al. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. https://arxiv.org/abs/2311.09735

Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 50. European Union. Effective 2 August 2026.

NLWeb: Conversational AI Interfaces over Structured Data. Microsoft, Schema.org. 2025.

Model Context Protocol (MCP). Anthropic. 2024-2025. https://modelcontextprotocol.io/

Search Central: Structured Data and AI Overviews. Google. https://developers.google.com/search/docs