Generative Web Standard
Version 1.0 — Public Draft
Scope
This standard defines the minimum architectural requirements for web entities seeking legitimate, consistent citation by AI-mediated information retrieval systems, including large language model search interfaces, generative answer engines, and AI overview systems.
Conformance does not guarantee citation. It establishes the structural conditions under which citation becomes technically possible and editorially probable.
This standard applies to publicly accessible web properties, organizational and personal web entities, and multi-entity topic architectures operating under the Citation Mesh Standard (CMS v1.0, forthcoming).
This standard is concerned with structural and technical conditions of citation eligibility. It does not address content quality, factual accuracy, editorial values, or the merits of any particular content. Those concerns are the proper domain of the publishing entity, not of a structural standard.
Definitions
- AI Retrieval System: Any system that uses a language model to synthesize information from indexed web sources into a generated response. The class includes large language model search interfaces, generative answer engines, and AI overview systems.
- Web Entity: A domain-level web property with a distinct organizational or authorial identity, a resolvable canonical URL structure, and independently maintained content.
- Practitioner Posture: Content authored from the direct operational experience of the named author. It is grounded in first-person accounts of work performed and cites internal evidence (work product, methods, results) as primary support; external citation is supplementary.
- Analytical Posture: Content authored as synthesis of evidence drawn from multiple external sources. Author position is interpretive, comparative, or evaluative rather than experiential. External citation is primary support.
- Narrative Posture: Content built around a case study, experience report, or sequential account establishing context and trajectory. The unit of value is the specific account, not a transferable principle.
- Criterion Tier: Each criterion is classified Required (non-conformance disqualifies certification), Recommended (best practice; weighted in conformance scoring), or Conditional (applies only when the stated condition obtains).
- Retrieval-Excluded: A conformance designation for entities that satisfy all Required technical criteria but have published an editorial policy excluding their content from AI retrieval crawling. Such entities are conformant; their exclusion is editorial, not structural.
- Audit Test: A reproducible procedure for verifying conformance with a single criterion, including its sample selection rule.
Technical Architecture
Canonical URL Structure
Every page must declare a canonical URL via <link rel="canonical">. Duplicate content across subdomains, protocol variants, trailing-slash variants, or query-parameter variants must resolve to a single canonical.
AI retrieval systems use canonical URLs as content identifiers. Multiple URLs resolving to the same content fragment the entity's citation surface, splitting authority signals and reducing reliable attribution.
Sample 10 pages: the homepage, three highest-traffic content pages, and six randomly selected from the sitemap. Confirm that a canonical tag is present on each page, that it is self-referencing on each canonical URL, and that each non-canonical variant points to the canonical URL.
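The per-page check in this audit can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the names `CanonicalFinder`, `canonical_of`, and `check_canonical` are hypothetical, the HTML is assumed to have been fetched already, and URLs are compared as exact strings without normalization.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_of(html):
    """Return the single declared canonical URL, or None if absent or conflicting."""
    finder = CanonicalFinder()
    finder.feed(html)
    unique = set(finder.canonicals)
    return unique.pop() if len(unique) == 1 else None


def check_canonical(page_url, html, expected_canonical):
    """Classify one fetched page against the URL expected to be canonical."""
    declared = canonical_of(html)
    if declared is None:
        return "missing-or-ambiguous"
    if declared != expected_canonical:
        return "points-elsewhere"
    if page_url == expected_canonical:
        return "self-referencing"
    return "variant-points-to-canonical"
```

An auditor would call `check_canonical` once for the canonical URL and once for each variant (protocol, trailing slash, query parameter) of every sampled page.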
Crawl Accessibility
The entity must maintain a valid robots.txt per IETF RFC 9309. Known AI crawler user agents (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot, CCBot) must not appear in Disallow directives unless covered by the entity's published AI crawler policy (GWS-18).
AI retrieval systems require crawl access to index content. Inadvertent exclusion through outdated or imprecise robots directives is the most common single cause of citation failure for otherwise high-quality entities.
Fetch /robots.txt. Parse per RFC 9309. Confirm none of the named user agents are disallowed except in agreement with the entity's stated AI crawler policy.
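A rough version of this check can be written with Python's `urllib.robotparser`. Two caveats: that module implements the older robots exclusion convention and may not apply every RFC 9309 matching rule, and the sketch probes a single URL only. The function name `unexplained_blocks`, the `policy_exclusions` parameter, and the probe URL are assumptions introduced for illustration.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the criterion.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "OAI-SearchBot", "CCBot",
]


def unexplained_blocks(robots_txt, policy_exclusions=frozenset(),
                       probe_url="https://example.org/"):
    """Return named AI crawlers that are disallowed but not covered by the
    entity's published AI crawler policy (passed in as policy_exclusions)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [ua for ua in AI_CRAWLERS if not parser.can_fetch(ua, probe_url)]
    return [ua for ua in blocked if ua not in policy_exclusions]
```

An empty result means every block found is accounted for by the stated policy; a non-empty result lists the user agents an auditor should query.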
Page Speed Threshold
Largest Contentful Paint (LCP) must score at or below 2.5s on mobile across the homepage and three sampled content pages, measured via the PageSpeed Insights API (CrUX field data preferred; Lighthouse lab data acceptable when CrUX is unavailable).
AI retrieval crawlers operate under crawl budgets. Slow-loading pages are deprioritized in crawl scheduling and may be indexed with stale or partial content. Page speed is also a documented ranking signal for the search systems whose indices feed several AI retrieval pipelines.
Run PageSpeed Insights API on homepage and three content pages. Confirm mobile LCP <= 2.5s for 3 or more of 4 pages.
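The threshold logic can be sketched as follows, assuming the four PageSpeed Insights responses have already been fetched and decoded into dictionaries. The response field names used here (`loadingExperience.metrics.LARGEST_CONTENTFUL_PAINT_MS.percentile` for CrUX field data and `lighthouseResult.audits["largest-contentful-paint"].numericValue` for lab data, both in milliseconds) reflect our understanding of the v5 response shape and should be verified against the API reference; the function names are hypothetical.

```python
LCP_LIMIT_MS = 2500  # 2.5 s expressed in milliseconds


def lcp_ms(psi_response):
    """Prefer CrUX field data; fall back to Lighthouse lab data; None if neither exists."""
    field = (psi_response.get("loadingExperience", {})
             .get("metrics", {})
             .get("LARGEST_CONTENTFUL_PAINT_MS", {}))
    if "percentile" in field:
        return float(field["percentile"])
    lab = (psi_response.get("lighthouseResult", {})
           .get("audits", {})
           .get("largest-contentful-paint", {}))
    if "numericValue" in lab:
        return float(lab["numericValue"])
    return None


def passes_lcp_audit(responses, min_passing=3):
    """Apply the audit rule: at least 3 of the 4 sampled pages at or below the limit."""
    values = [lcp_ms(r) for r in responses]
    passing = sum(1 for v in values if v is not None and v <= LCP_LIMIT_MS)
    return passing >= min_passing
```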
HTTPS with Valid Certificate
All content served over HTTPS with no mixed-content warnings and a non-expired certificate from a recognized CA. HSTS header (Strict-Transport-Security) recommended with max-age >= 31536000.
AI retrieval systems treat insecure transport as a negative trust signal. Mixed-content warnings and expired certificates are correlated with abandoned or compromised properties.
Run an SSL Labs scan; confirm a rating of A or above. Verify HSTS header presence.
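The HSTS part of this check reduces to parsing one response header. A minimal sketch, assuming the header value has already been read from an HTTPS response; `hsts_max_age` and `meets_hsts_recommendation` are hypothetical helper names.

```python
def hsts_max_age(header_value):
    """Extract max-age (in seconds) from a Strict-Transport-Security value."""
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            try:
                return int(value.strip().strip('"'))
            except ValueError:
                return None
    return None


def meets_hsts_recommendation(header_value, minimum=31536000):
    """True when the header is present and max-age is at least one year."""
    if header_value is None:
        return False
    age = hsts_max_age(header_value)
    return age is not None and age >= minimum
```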
Content Language Declaration
Every page must declare its primary content language via the lang attribute on the root <html> element (BCP 47 tag). Multilingual entities must use hreflang annotations linking equivalent-content pages across language variants with bidirectional links.
AI retrieval systems use declared language to disambiguate entities and prevent cross-language attribution errors. Undeclared language reduces citation accuracy in non-English markets and produces translation artifacts.
Inspect <html lang> on five randomly sampled pages plus the canonical homepage. For multilingual entities, sample three language pairs and verify bidirectional hreflang annotation.
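The bidirectional requirement can be checked mechanically once the hreflang annotations of each sampled page have been collected. The sketch below assumes they are already available as a dictionary mapping each page URL to its `{hreflang: target URL}` declarations; the function name `missing_return_links` is hypothetical.

```python
def missing_return_links(annotations):
    """Return (source, target) pairs where target does not link back to source.

    annotations maps each page URL to the {hreflang: target URL} pairs it declares.
    """
    missing = []
    for source, alternates in annotations.items():
        for target in alternates.values():
            if target == source:
                continue  # self-reference needs no return link
            declared_by_target = annotations.get(target, {})
            if source not in declared_by_target.values():
                missing.append((source, target))
    return sorted(missing)
```

An empty list indicates that every sampled language pair links in both directions.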
Semantic HTML5 Structure
Content pages should use HTML5 semantic elements (<article>, <header>, <main>, <nav>, <section>, <aside>) to demarcate document regions. Primary content must be wrapped in <main>; bylined article content must be wrapped in <article>.
Semantic structure improves AI extraction accuracy by signaling which page region contains primary content versus navigation or supplementary material. It reduces extraction errors in which AI systems quote navigation copy or boilerplate as substantive content.
Inspect five content pages for presence of <main> containing primary content and <article> wrapping bylined material.
Accessibility Floor
Pages must conform to WCAG 2.2 Level AA. Automated accessibility audit must show zero Level A failures and no more than two Level AA failures per audited page.
Human legibility is a precondition of machine legibility. Failures correlated with citation failure include missing alt text on content images, absent or scrambled heading hierarchy, and rendering of substantive content as inaccessible images.
Run automated accessibility audit (axe, WAVE, or Lighthouse Accessibility) on homepage and four content pages. Confirm thresholds.
URL Stability Semantics
Canonical URLs must not contain session identifiers, tracking parameters (utm_*, fbclid, gclid, _ga, mc_*), or user-specific tokens. Redirect chains from http:// and trailing-slash variants must resolve in two hops or fewer.
AI retrieval systems use canonical URLs as content identity. URLs that vary per session or carry tracking parameters fragment that identity. Long redirect chains cause crawl abandonment and increase the rate at which content is indexed under non-canonical URLs.
Inspect canonical URLs on 10 sampled pages. Trace redirect chains from http:// and trailing-slash variants of the homepage and three content pages.
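Both halves of this audit can be approximated with the standard library. The sketch covers only the tracking parameters named in the criterion; detecting session identifiers or user-specific tokens needs site-specific knowledge and is not attempted. The names `tracking_params` and `within_hop_limit` are hypothetical, and the redirect chain is assumed to have been recorded by whatever HTTP client the auditor uses.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlsplit

# Parameter patterns named in the criterion.
TRACKING_PATTERNS = ["utm_*", "fbclid", "gclid", "_ga", "mc_*"]


def tracking_params(url):
    """Return the query parameter names in url that match a tracking pattern."""
    names = [name for name, _ in
             parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    return [n for n in names
            if any(fnmatchcase(n.lower(), p) for p in TRACKING_PATTERNS)]


def within_hop_limit(redirect_chain, max_hops=2):
    """redirect_chain lists every URL visited, starting URL first, final URL last."""
    return len(redirect_chain) - 1 <= max_hops
```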
Entity & Identity
Organization Schema
Root domain must include a JSON-LD Organization (or appropriate subtype) block containing at minimum: name, url, description, foundingDate. The sameAs array must include at least one external authoritative reference within 90 days of public launch.
The Organization schema and its sameAs array are the primary mechanism by which AI knowledge graphs ground a domain to an external entity. Without external grounding, the entity is unverifiable and treated as a low-trust source.
Validate JSON-LD via the schema.org validator. Confirm sameAs populated with at least one tiered reference. Resolve each sameAs URI; confirm corroboration of organization identity.
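As an illustration of the minimum field set, the following sketch checks a JSON-LD block for the properties the criterion names. It assumes a single top-level object (not an `@graph` array) and does not resolve the `sameAs` URIs, which remains a manual step. The organization details in the sample are placeholders, and `organization_gaps` is a hypothetical helper.

```python
import json

REQUIRED_ORG_FIELDS = ("name", "url", "description", "foundingDate")

# Placeholder data for illustration only.
EXAMPLE_ORG = """
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Press",
  "url": "https://example.org/",
  "description": "Independent publisher of field guides.",
  "foundingDate": "2019-04-01",
  "sameAs": ["https://www.wikidata.org/wiki/Q0"]
}
"""


def organization_gaps(jsonld_text):
    """Return the criterion's minimum fields that are missing or empty."""
    data = json.loads(jsonld_text)
    gaps = [field for field in REQUIRED_ORG_FIELDS if not data.get(field)]
    same_as = data.get("sameAs") or []
    if isinstance(same_as, str):
        same_as = [same_as]  # schema.org allows a single value or a list
    if not same_as:
        gaps.append("sameAs")
    return gaps
```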
Named Authorship
Content intended for citation must carry a named bylined author. Anonymous bylines, generic staff bylines (Staff Writer, Editorial Team), and uncredited content do not satisfy this criterion. Each named author must have a corresponding Person schema block, either inline or referenced via @id.
AI retrieval systems weight author identity as a primary expertise signal. Unattributed content is treated as low-confidence and is preferentially passed over for citation when attributed alternatives exist.
Inspect five content pages for byline presence and corresponding Person schema.
Author Identity Continuity
Person schema for each bylined author must include either a url resolving to a stable internal author profile page that links to external corroborating profiles, or sameAs references to external profiles. Self-controlled portfolios without external corroboration do not satisfy this criterion alone.
AI retrieval systems verify author claims by traversing identity chains. A claim to expertise that resolves only to assertions on the same domain forms a closed loop and is weighted accordingly.
For five sampled bylined authors, follow the Person schema URL and sameAs references; confirm at least one external corroborating profile exists and is consistent with the byline.
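The closed-loop condition can be screened for automatically before the manual consistency review. This sketch covers only the `sameAs` path: it filters a Person block's `sameAs` values down to those hosted outside the publishing domain. Checking an internal profile page for outbound links would need a fetch and is left out. `external_profiles` and the `publisher_host` parameter are hypothetical names.

```python
from urllib.parse import urlsplit


def external_profiles(person, publisher_host):
    """Return sameAs URLs whose host is neither the publisher nor one of its subdomains."""
    same_as = person.get("sameAs") or []
    if isinstance(same_as, str):
        same_as = [same_as]

    def is_external(url):
        host = (urlsplit(url).hostname or "").lower()
        return (bool(host)
                and host != publisher_host
                and not host.endswith("." + publisher_host))

    return [url for url in same_as if is_external(url)]
```

An empty result flags an author whose identity chain never leaves the publishing domain.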
Wikidata Entity
The organization should maintain a Wikidata entry classified as instance of: organization (or relevant subtype), with P856 (official website) matching the canonical domain. Where Wikidata notability requirements are not met, the criterion is waived; an industry-registry equivalent satisfies in its place.
Wikidata is the highest-trust grounding reference for AI knowledge graphs, including the Google Knowledge Graph and the entity layers of major LLM providers. The waiver acknowledges that Wikidata's notability bar legitimately excludes some conformant entities.
Query Wikidata for entity; confirm P856 matches domain. If absent, confirm GWS-05 tier-2 reference is present.
Editorial Independence Statement
A published editorial policy or about page must declare ownership structure, funding sources where these could materially affect editorial decisions, sponsorship and affiliate-relationship disclosure policy, and any material financial relationships with subjects of coverage.
Editorial independence is a documented input to AI retrieval source-weighting. Undeclared relationships, when later discovered, produce systematic down-weighting of the entire entity.
Locate and review editorial policy. Confirm presence of ownership, funding, and material-relationship declarations.
Author Affiliation
Person schema for bylined authors should include an affiliation property linked to a resolvable Organization schema entity, either internal (the publishing entity itself) or external (the author's primary professional affiliation).
Affiliation grounding allows AI retrieval systems to weight author expertise claims against the affiliated organization's authority. Unaffiliated person entities are weaker citation candidates than equivalently expert affiliated persons.
Inspect Person schema on five bylined content pages; confirm affiliation property resolves to an Organization entity.
Content Posture
Declared Epistemic Posture
Each content section should declare a primary epistemic posture (Practitioner, Analytical, or Narrative) and apply it consistently. Posture consistency within a section means 80% or more of pages in that section conform to the declared posture.
AI retrieval systems route queries against source posture. Operational queries route preferentially to Practitioner sources; comparative queries to Analytical sources; specific-account queries to Narrative sources. Posture confusion within a content section reduces citation routing efficiency.
Sample five content pages from each declared section. Classify posture per the §1 definitions. Confirm 80% or more match within each section.
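With a five-page sample, the 80% threshold means at least four of the five pages must match the declared posture. A small sketch of that arithmetic, assuming a human reviewer has already classified each sampled page; `posture_consistent` is a hypothetical name.

```python
def posture_consistent(declared, observed, threshold=0.8):
    """True when the share of sampled pages matching the declared posture meets the threshold."""
    if not observed:
        return False
    matches = sum(1 for posture in observed if posture == declared)
    return matches / len(observed) >= threshold
```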
Question Alignment
Content pages must align title, H1, and meta description around an answerable question or topic. The page must contain a clear answer, claim, or thesis statement within the first 200 words of body content, in language consistent with the title and H1.
AI retrieval systems extract candidate answers from early-position content. Pages whose primary thesis is buried below the fold, or whose title misrepresents the body, are routinely misquoted, partially quoted, or skipped.
For five content pages, identify the question or topic implied by title/H1/meta description. Verify a clear corresponding answer or thesis appears within the first 200 words of body content.
FAQ Schema on Eligible Pages
Pages addressing multiple discrete questions should implement FAQPage schema with mainEntity question/answer pairs. Note: FAQ rich-result eligibility was narrowed by Google in 2023; this criterion exists for AI retrieval purposes independent of search rich-result entitlement.
FAQ schema remains a strong signal to AI retrieval systems regardless of its current status as a search rich-result feature. Question/answer pairs are directly extractable as citation units.
Identify pages addressing multiple discrete questions; validate FAQ schema presence on those pages.
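For reference, a FAQPage block has the shape produced by the sketch below: a `mainEntity` array of `Question` items, each with an `acceptedAnswer` of type `Answer`. The helper names `faq_page_jsonld` and `as_script_tag` are hypothetical, and the question text passed in must match what is visible on the page.

```python
import json


def faq_page_jsonld(pairs):
    """Build a FAQPage object from (question, answer) string pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(document):
    """Serialize for embedding; '</' is escaped so the JSON cannot close the script element."""
    payload = json.dumps(document, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```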
Content Freshness Signal
datePublished and dateModified must be declared in both visible page content and Article or WebPage schema. dateModified must reflect actual content modification, not a regeneration timestamp; pages whose dateModified updates on every render do not satisfy this criterion.
AI retrieval systems use freshness as a routing signal for time-sensitive queries. Inaccurate dateModified (regeneration-stamping) is detected by retrieval systems and trains them to ignore the entity's freshness signal entirely.
Inspect schema dateModified on five content pages. Cross-reference against visible content modification dates and archive snapshots to confirm dates correspond to actual content changes.
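One way to detect regeneration-stamping is to compare snapshots of the same page: if the visible body is byte-identical across captures but `dateModified` differs, the date is tracking renders rather than edits. The sketch assumes the auditor has already extracted `(dateModified, visible body text)` pairs from repeated fetches or archive captures; the function name is hypothetical.

```python
import hashlib


def looks_regeneration_stamped(snapshots):
    """snapshots: iterable of (date_modified, visible_body_text) for one page.

    Returns True when identical body text appears under more than one dateModified.
    """
    dates_by_content = {}
    for date_modified, body in snapshots:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        dates_by_content.setdefault(digest, set()).add(date_modified)
    return any(len(dates) > 1 for dates in dates_by_content.values())
```

Exact hashing is deliberately strict; pages with rotating boilerplate inside the body would need that boilerplate stripped first.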
AI-Generated Content Disclosure
Content that is wholly or substantially AI-generated, or AI-assisted in ways that materially affect substance, must declare this via: (a) C2PA Content Credentials assertion, or (b) schema.org disclosure via creativeWorkStatus and wasGeneratedBy linking to a SoftwareApplication entity. Entities that publish no AI-assisted content must state this in their editorial policy.
AI retrieval systems increasingly filter or down-weight AI-generated content to avoid recursive synthesis loops. Disclosure protects editorial integrity and citation eligibility. The criterion aligns with the EU AI Act Article 50 transparency provisions effective August 2026.
For five content pages, verify either C2PA manifest is attached or schema disclosure is present. For entities declaring no AI-assisted content, verify the editorial policy contains an explicit statement.
Structured Data
Valid JSON-LD Implementation
All schema markup must be implemented as JSON-LD per the schema.org v29.x vocabulary or later. No errors or warnings on the schema.org validator for any page's primary entity markup. Microdata and RDFa formats do not satisfy this criterion.
JSON-LD is the format universally supported across major AI retrieval systems and the schema.org-recommended format. Errors in primary entity markup cause the entire entity block to be discarded by parsers.
Validate five pages via the schema.org validator. Confirm zero errors and warnings on primary entity markup.
BreadcrumbList on Content Pages
Non-root content pages must implement BreadcrumbList schema reflecting the page's position in the site hierarchy, with position and item properties on each ListItem.
Breadcrumb structure provides AI retrieval systems with the site-hierarchical context that disambiguates content meaning. A page titled Pricing is interpreted differently when its breadcrumb places it under /products/enterprise/ versus /legal/.
Inspect five content pages for BreadcrumbList schema with valid ListItem entries.
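The ListItem check can be sketched as below, operating on an already-parsed BreadcrumbList object. Requiring positions to run 1, 2, 3 in order is an assumption added here for illustration; the criterion itself only requires that `position` and `item` be present. `breadcrumb_problems` is a hypothetical name.

```python
def breadcrumb_problems(breadcrumb):
    """Return a list of human-readable defects; an empty list means the block passes."""
    if breadcrumb.get("@type") != "BreadcrumbList":
        return ["primary type is not BreadcrumbList"]
    problems = []
    items = breadcrumb.get("itemListElement") or []
    if not items:
        problems.append("no ListItem entries")
    for expected, entry in enumerate(items, start=1):
        if entry.get("@type") != "ListItem":
            problems.append(f"entry {expected}: not a ListItem")
        if entry.get("position") != expected:
            # Assumption: positions are sequential starting at 1.
            problems.append(f"entry {expected}: position should be {expected}")
        if not entry.get("item"):
            problems.append(f"entry {expected}: missing item")
    return problems
```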
Article or WebPage Schema on Content
Each content page must implement either Article (with author, datePublished, publisher) or WebPage schema as the primary page entity. The choice between Article and WebPage must be appropriate to content type.
Primary entity declaration tells AI retrieval systems how to interpret the page. Misuse (Article schema on a product page; WebPage on a bylined essay) produces citation errors.
Validate primary entity schema on five content pages; confirm appropriate subtype.
SearchAction Schema
Root domain should implement SearchAction schema declaring the entity's internal search endpoint and query parameter.
SearchAction enables AI retrieval systems to surface and use the entity's internal search capability when handling exploratory user queries.
Check homepage JSON-LD for SearchAction with target and query-input.
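A sketch of this check, operating on the homepage's parsed WebSite object. It accepts `target` either as a URL template string or as an object with `urlTemplate`, and `query-input` either as the string form (`required name=...`) or as an object with `valueName`. The function name and the sample endpoint in the usage note are assumptions.

```python
def search_action_problems(website):
    """Return defects in the first SearchAction found under potentialAction."""
    actions = website.get("potentialAction") or []
    if isinstance(actions, dict):
        actions = [actions]
    searches = [a for a in actions if a.get("@type") == "SearchAction"]
    if not searches:
        return ["no SearchAction found"]
    action = searches[0]
    problems = []

    target = action.get("target")
    template = target.get("urlTemplate") if isinstance(target, dict) else target
    if not template:
        problems.append("missing target")

    query_input = action.get("query-input")
    if not query_input:
        problems.append("missing query-input")
        return problems

    if isinstance(query_input, dict):
        name = query_input.get("valueName", "")
    else:
        tail = query_input.partition("name=")[2].split()
        name = tail[0] if tail else ""
    # The placeholder declared in query-input should appear in the target template.
    if template and name and "{" + name + "}" not in template:
        problems.append("target does not use the query-input placeholder")
    return problems
```

For example, a target of `https://example.org/search?q={search_term_string}` paired with `"query-input": "required name=search_term_string"` yields no problems.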
Inline Citation Markup
Content that quotes, references, or relies on external sources should mark these references using CreativeWork.citation properties or inline <cite> elements with linked source URIs.
AI retrieval systems trace citation chains to evaluate source authority. Pages with declared citations score higher for trustworthiness; pages making external claims without traceable citation are weighted as opinion.
Inspect five long-form content pages for inline <cite> elements or schema citation properties when sources are referenced.
Image and Multimodal Content Semantics
All non-decorative images must include descriptive alt text. Primary content images (lead images, infographics, charts, diagrams) should be marked with ImageObject schema including caption, creditText, and license where applicable.
Multimodal AI retrieval treats image semantics as content. Undescribed images are unreadable to current retrieval systems and exclude visual evidence from citation eligibility. Decorative images are exempt to avoid alt-text spam.
Inspect 10 non-decorative images across five content pages for substantive alt text. Verify ImageObject schema on at least three primary content images.
Content Licensing Declaration
Content pages should declare license terms via CreativeWork.license linking to a license URI (Creative Commons license, proprietary terms page, or other recognized license document).
AI retrieval and training pipelines increasingly filter on declared licensing. Undeclared license is treated by some systems as conservative-default (no reuse), reducing citation eligibility. Entities seeking maximum citation should declare a license that explicitly permits citation with attribution.
Inspect license property on five content pages and the root domain.
Canonical Content Parity
Schema markup must reflect content that is visible on the rendered page. Schema fields containing data not present in human-readable content (hidden review counts, fabricated ratings, headlines differing from the visible H1) constitute schema spam and disqualify conformance.
Search and AI retrieval systems penalize divergence between schema and rendered content. The criterion exists to prevent gaming and to align AI-extractable claims with reader-visible claims.
For five content pages, verify Article or WebPage schema fields (headline, datePublished, author, body excerpts) match visible page content.
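The headline part of this parity check can be automated. The sketch extracts the text of the first `<h1>` from rendered HTML and compares it with the schema `headline` after collapsing whitespace and ignoring case. Those normalization choices are assumptions (a stricter audit might require an exact match), and `H1Text` and `headline_matches_h1` are hypothetical names.

```python
from html.parser import HTMLParser


class H1Text(HTMLParser):
    """Collects the text content of the first <h1> element."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.finished = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.finished:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.inside:
            self.inside = False
            self.finished = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def normalise(text):
    """Collapse runs of whitespace and ignore case."""
    return " ".join(text.split()).casefold()


def headline_matches_h1(html, schema_headline):
    parser = H1Text()
    parser.feed(html)
    return normalise("".join(parser.parts)) == normalise(schema_headline)
```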
OpenGraph and Card Metadata
Pages should include OpenGraph metadata (og:title, og:description, og:url, og:image, og:type) and Twitter Card metadata consistent with primary page content and the JSON-LD schema fields. Inconsistency between OG, Card, and JSON-LD on the same page is a defect.
OG and Card metadata are widely consumed by AI retrieval systems for preview generation and entity grounding alongside JSON-LD. Inconsistency across the three vocabularies presents AI systems with conflicting signals about primary entity claims.
Inspect OG and Card metadata on homepage and four content pages; verify consistency with JSON-LD.
AI Retrieval Configuration
AI Crawler Policy Documentation
The entity must publish a documented AI crawler policy distinguishing retrieval/citation crawling from training-data crawling. The policy must be machine-readable via /robots.txt directives, /llms.txt declaration, or C2PA do_not_train assertions. Conformance does not require permitting either form of crawling; entities permitting neither receive the Conformance -- Retrieval-Excluded designation.
Editorial autonomy over AI participation is a legitimate exercise of publisher rights and is not equivalent to non-conformance. The standard requires declaration and machine-readability, not participation. Retrieval-Excluded designation allows the standard to apply to opted-out publishers for whom structural conformance remains meaningful.
Locate the published AI crawler policy. Verify machine-readable expression matches the stated policy. Verify policy distinguishes retrieval from training where the entity intends a distinction.
Sitemap Currency
XML sitemap must be present at /sitemap.xml (or referenced via robots.txt Sitemap: directive), registered with Google Search Console and Bing Webmaster Tools, and reflect all canonical content pages. lastmod dates must be accurate per GWS-13.
Sitemap currency is the primary mechanism by which AI retrieval systems discover new and modified content. Stale or absent sitemaps produce systematic under-indexing.
Fetch sitemap. Verify lastmod accuracy on five sampled URLs by cross-reference with page-visible modification dates.
llms.txt Implementation
The entity should publish /llms.txt per the llmstxt.org specification, providing AI retrieval systems with a structured Markdown summary of the entity's content, purpose, key documents, and preferred citation format.
llms.txt is an emerging convention with documented partial adoption among AI retrieval systems. Entities that adopt early gain the benefit of the convention where it is honored, with no detriment where it is not.
Fetch /llms.txt. Validate against the llmstxt.org specification structure.
Speakable Content Schema
Pages containing content appropriate for voice or audio AI interfaces should implement speakable schema annotations identifying the most citable passages.
Voice-mediated AI retrieval systems use speakable annotations to select passages for spoken-output excerpts. Annotated pages are preferentially selected over unannotated equivalents for voice-context queries.
For pages with content appropriate to voice context (news, reference, FAQ), check for speakable schema presence.
Sitemap Segmentation
Condition: Applies to entities with more than 50,000 canonical URLs.
Entities meeting the condition must publish a sitemap index (sitemap_index.xml) referencing per-content-type or per-section sitemaps. Each constituent sitemap file must not exceed 50MB uncompressed or 50,000 URLs, per the sitemap protocol.
Single-file sitemaps exceeding protocol limits are silently truncated by crawlers, hiding content from AI retrieval. The condition exists because the criterion is irrelevant to small entities.
For entities meeting the condition, verify sitemap index presence; verify each constituent file's URL count and uncompressed size compliance.
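The size and count checks can be sketched with `xml.etree.ElementTree`. The sketch assumes each file has already been downloaded and decompressed into bytes, and takes 50MB to mean 50 × 1024 × 1024 bytes; `child_sitemaps` and `sitemap_within_limits` are hypothetical names.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # uncompressed size limit


def child_sitemaps(index_bytes):
    """Return the <loc> of every constituent sitemap listed in a sitemap index."""
    root = ET.fromstring(index_bytes)
    locs = root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc")
    return [loc.text.strip() for loc in locs if loc.text]


def sitemap_within_limits(xml_bytes):
    """True when one uncompressed sitemap file respects both protocol limits."""
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{SITEMAP_NS}url"))
    return url_count <= MAX_URLS and len(xml_bytes) <= MAX_BYTES
```

An auditor would call `child_sitemaps` on the index, fetch each listed file, and run `sitemap_within_limits` on every one.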
Change Feed
Entities should publish either an RSS or Atom feed of recent or modified content at a discoverable URL (linked via <link rel='alternate' type='application/rss+xml'>), or maintain accurate sitemap lastmod declarations per GWS-13.
AI retrieval systems use change signals to prioritize re-crawling. Either an explicit feed or accurate lastmod semantics serves this purpose; absence of both forces full periodic re-crawl, which is deprioritized.
Verify either feed presence and <link rel='alternate'> discoverability, or accurate sitemap lastmod per GWS-13 audit results.
Conformance Levels
All Required criteria satisfied. All applicable Conditional criteria satisfied. At least 9 of the 13 Recommended criteria satisfied.
All Required criteria satisfied. All applicable Conditional criteria satisfied. Fewer than 9 Recommended criteria satisfied.
All Required criteria satisfied as applicable to a non-participating entity. All applicable Conditional criteria satisfied. AI crawler policy (GWS-18) declares retrieval exclusion. Designation acknowledges that the entity has implemented the standard's structural and editorial criteria while electing to opt out of AI retrieval participation.
One or more Required criteria not satisfied, or one or more applicable Conditional criteria not satisfied.
Certification is issued per domain. It is valid for 12 months from the audit date and is tied to the GWS version current at the time of audit. Material changes to the audited entity within the validity period require re-audit.
Standards Committee
The IGWS Standards Committee will be constituted at v1.0 ratification. Committee composition will reflect representation from web infrastructure engineering, information retrieval research, publishing and editorial operations, regulatory affairs, and at least one liaison to an established standards body (W3C, IETF, schema.org, or C2PA).
Founding member invitations are open. Inquiries: standards@igws.org
Public technical comments on this draft are invited via standards@igws.org, the public GitHub repository, and the W3C AI-Mediated Retrieval Community Group (upon chartering).