Does Schema Markup Predict AI Citation?
A Cross-Platform Empirical Study of Structured Data and Generative Engine Optimization
Kurt Fischman · Growth Marshal · February 2026 · Preprint — Not yet peer-reviewed
Key Findings
730 AI citations · 75 queries · 1,006 pages analyzed
Generic schema markup does not predict AI citation probability.
After correcting for Google's ranking-algorithm confound, JSON-LD schema presence produced a null result (OR = 0.678, p = .296). Entity richness scores and schema-to-query alignment were similarly non-significant.
Google organic rank position is the dominant predictor of AI citation.
Position-1 pages were cited in 43% of queries, declining to 5% at position 7. Each one-position drop in rank (for example, from position 3 to position 4) reduces citation odds by approximately 24% (OR = 0.762, p < .001).
Product and Review schema with concrete attributes is the significant exception.
Pages with attribute-rich schema (pricing, aggregateRating, specifications) were cited at 61.7% vs. 41.6% for generic types (p = .012). The advantage was most pronounced among lower-authority domains (DR ≤ 60).
Sophisticated entity-linking techniques remain untested.
Wikidata sameAs links, genuine @id cross-referencing, and nested entity structures appeared on fewer than 4% of schema-present pages. The upper bound of schema's potential contribution remains unknown.
Abstract
This study examines whether JSON-LD schema markup independently predicts the probability that a web page will be cited in AI-generated responses. We collected 730 AI citations from ChatGPT (GPT-4o with web browsing) and Gemini (1.5 Pro with search grounding) across 75 commercial queries spanning five categories: SaaS and Technology, Health and Medical, Finance and Insurance, Professional Services, and How-To and DIY. Google top-10 organic results for the same queries were collected via SerpAPI as a control set, yielding 1,006 total unique pages analyzed for schema characteristics and domain authority (Ahrefs DR).
Initial pooled analysis produced a significant negative association between schema presence and AI citation (OR = 0.546, p < .001), which would suggest that schema actively reduced citation probability. This finding proved to be a methodological artifact: Google's ranking algorithm leaves the top-10 organic results systematically enriched for schema-bearing pages, which inflates schema prevalence in the non-cited control population, a population drawn entirely from those results. A within-Google diagnostic revealed that schema prevalence among AI-cited and non-cited Google pages was statistically indistinguishable (43.1% vs. 44.8%), collapsing the apparent effect entirely. Corrected models using Generalized Estimating Equations with query-clustered standard errors produced null results for schema presence (OR = 0.678, p = .296), entity richness score (OR = 1.001, p = .833), and schema-to-query alignment (OR = 1.068, p = .626).
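The pooling artifact can be illustrated with toy numbers. The sketch below does not use the study's data: the counts are invented so that schema prevalence is identical (40%) among cited and non-cited Google pages, while AI-cited pages found outside Google's top 10 carry schema less often (20%). Pooling both cited groups against a Google-only control then yields an odds ratio below 1, even though the within-Google odds ratio is 1.

```python
def odds_ratio(exposed_cases, total_cases, exposed_controls, total_controls):
    """Odds ratio of exposure (schema present) for cases (cited) vs. controls (not cited)."""
    case_odds = exposed_cases / (total_cases - exposed_cases)
    control_odds = exposed_controls / (total_controls - exposed_controls)
    return case_odds / control_odds

# Hypothetical toy counts (NOT the study's data): (pages with schema, total pages)
cited_in_google = (40, 100)        # 40% schema prevalence
cited_outside_google = (20, 100)   # 20% schema prevalence
not_cited_in_google = (160, 400)   # 40% schema prevalence (the control set)

# Naive pooling: every cited page, regardless of source, vs. the Google-only control.
pooled_cited = (
    cited_in_google[0] + cited_outside_google[0],
    cited_in_google[1] + cited_outside_google[1],
)

pooled_or = odds_ratio(*pooled_cited, *not_cited_in_google)
within_google_or = odds_ratio(*cited_in_google, *not_cited_in_google)

print(f"pooled OR:        {pooled_or:.3f}")
print(f"within-Google OR: {within_google_or:.3f}")
```

In this toy setup the apparent negative effect comes entirely from where the comparison pages were sampled, not from anything schema does, which is the pattern the within-Google diagnostic is designed to expose.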
The dominant predictor of AI citation was Google organic rank position (OR = 0.762 per position, p < .001). Position-1 pages were cited in 43% of queries in which they appeared, declining to 5% at position 7. The odds ratio implies that each one-position drop in rank reduces citation odds by approximately 24%, and that AI citation behavior is substantially mediated by the search backend ranking that precedes AI-level content evaluation.
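The 24% figure follows directly from the odds ratio: an OR of 0.762 multiplies the odds by 0.762, a reduction of 1 − 0.762 ≈ 24%. The sketch below works through that arithmetic and, purely as an illustration, compounds the per-position OR over six positions starting from the raw 43% position-1 rate. Applying a model-adjusted OR to raw rates is a simplifying assumption made here, so the extrapolated value should not be expected to reproduce the observed 5% at position 7. The helper names are hypothetical.

```python
OR_PER_POSITION = 0.762  # reported odds ratio per one-position drop in rank

def percent_change_in_odds(odds_ratio):
    """Convert an odds ratio to a percent change in odds (negative = reduction)."""
    return (odds_ratio - 1.0) * 100.0

def shift_probability(p, odds_ratio, steps=1):
    """Apply an odds ratio `steps` times to a starting probability."""
    odds = p / (1.0 - p) * odds_ratio ** steps
    return odds / (1.0 + odds)

per_position = percent_change_in_odds(OR_PER_POSITION)
six_step_multiplier = OR_PER_POSITION ** 6  # position 1 -> position 7
implied_p7 = shift_probability(0.43, OR_PER_POSITION, steps=6)

print(f"change in odds per position:             {per_position:.1f}%")
print(f"odds multiplier from position 1 to 7:    {six_step_multiplier:.3f}")
print(f"constant-OR extrapolation to position 7: {implied_p7:.1%}")
```

Note that odds and probabilities are not interchangeable: a 24% reduction in odds is not a 24% reduction in citation rate, which is why the helper converts to odds before applying the ratio.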
One significant exception emerged: pages implementing Product or Review schema with populated concrete attribute fields — pricing, aggregateRating, specifications — were cited at substantially higher rates than pages implementing generic schema types such as Article, Organization, or BreadcrumbList (61.7% vs. 41.6%, p = .012). This attribute-rich advantage was most pronounced among lower-authority domains (DR ≤ 60), consistent with the interpretation that factual payload in structured data partially compensates for weak authority signals.
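To make the distinction concrete, the sketch below builds one attribute-rich Product object and one generic Article object as JSON-LD from Python dictionaries. The property names (offers, price, priceCurrency, aggregateRating, ratingValue, reviewCount, additionalProperty, reviewRating) are standard schema.org vocabulary. The product, its values, and the is_attribute_rich helper are hypothetical illustrations and are not the study's coding scheme.

```python
import json

# Hypothetical product; all values are placeholders for illustration only.
attribute_rich = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Standing Desk",
    "offers": {"@type": "Offer", "price": "399.00", "priceCurrency": "USD"},
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "212",
    },
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "Height range", "value": "71-121 cm"},
    ],
}

# Generic markup: describes the page but carries no extractable facts.
generic = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Choosing a Standing Desk",
}

CONCRETE_FIELDS = ("offers", "aggregateRating", "additionalProperty", "reviewRating")

def is_attribute_rich(node):
    """Hypothetical heuristic: Product/Review with at least one populated concrete field.

    Assumes "@type" is a single string; JSON-LD also allows a list of types.
    """
    if node.get("@type") not in ("Product", "Review"):
        return False
    return any(node.get(field) for field in CONCRETE_FIELDS)

print(json.dumps(attribute_rich, indent=2))
print(is_attribute_rich(attribute_rich), is_attribute_rich(generic))
```

The serialized object would be embedded in a page inside a script element of type application/ld+json.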
These findings support a more precise version of the schema-helps hypothesis than the practitioner consensus has articulated: attribute-rich schema that provides extractable factual content may confer modest citation advantages for lower-authority domains, while generic schema provides none. The dominant practical implication is that traditional organic rank position remains the primary lever for AI visibility, and that GEO-specific optimization efforts are most productive when directed at content quality and authority rather than generic structured data implementation.
Study Design
Data collection: 730 AI citations from ChatGPT (GPT-4o with web browsing) and Gemini (1.5 Pro with search grounding) across 75 commercial queries in five categories. Google top-10 organic results collected via SerpAPI as a control set. 1,006 total unique pages analyzed.
Schema analysis: Two-pass extraction procedure — first pass assessed schema presence and type inventory; second pass scored entity richness across seven dimensions including @id specificity, sameAs link quality, nesting depth, and property completeness.
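A first pass of this kind (schema presence and type inventory) could look roughly like the following. This is a minimal standard-library sketch, not the study's extraction code: the class and function names are hypothetical, and it reads only JSON-LD in script elements, ignoring Microdata and RDFa.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed markup is skipped rather than counted as schema

def type_inventory(html):
    """Return the sorted set of @type values found anywhere in a page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = set()

    def walk(node):
        if isinstance(node, dict):
            types = node.get("@type") or []
            if isinstance(types, str):
                types = [types]
            found.update(t for t in types if isinstance(t, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(parser.documents)
    return sorted(found)

sample = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Demo",
 "offers": {"@type": "Offer", "price": "10.00", "priceCurrency": "USD"}}
</script></head><body></body></html>"""

print(type_inventory(sample))
```

An empty inventory corresponds to "schema absent" under this sketch; a second pass would then score the parsed objects on dimensions such as nesting depth and property completeness.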
Statistical approach: Four-stage analysis progressing from naïve pooled logistic regression through within-Google diagnostics to corrected Generalized Estimating Equations (GEE) with query-clustered standard errors. Results validated using mixed-effects logistic regression with query-level random intercepts.
Domain authority: Ahrefs Domain Rating (DR) collected for all 1,006 pages. DR included in all models as a confound control (correlation between DR and schema implementation: r = 0.31, p < .001).
Practical Implications
For GEO practitioners: The most reliable path to AI citation is ranking higher in Google's organic results. Schema optimization should focus on attribute-rich implementations (Product, Review schema with populated pricing, ratings, and specifications) rather than generic schema types. Generic Article, Organization, and BreadcrumbList schema showed no citation advantage.
For lower-authority domains: Attribute-rich schema shows its largest measurable advantage where authority signals are weakest. For domains with DR ≤ 60, implementing Product or Review schema with concrete attribute fields may partially compensate for lower domain authority.
For the GEO field: Optimization recommendations should be tested against observed citation behavior, not derived from AI-generated advice about AI systems. The feedback loop between LLM-generated recommendations and practitioner adoption creates a self-reinforcing consensus that may not reflect actual system behavior.
Read the Full Paper
The complete preprint is available on Zenodo via the DOI in the citation below.
Cite this paper
Fischman, K. (2026). Does schema markup predict AI citation? A cross-platform empirical study of structured data and generative engine optimization. Preprint. https://doi.org/10.5281/zenodo.18728697
About This Research
Kurt Fischman is the founder of Growth Marshal, an AI search agency specializing in generative engine optimization. This research was conducted independently and was not supported by external funding.
This study began as an internal challenge to Growth Marshal's own methodology. The MKA (Modular Knowledge Asset) framework assigns significant weight to schema implementation as a pathway to AI visibility. The null result for generic schema required honest engagement with that assumption. The findings have informed revisions to the MKA framework, redirecting emphasis from generic schema deployment toward attribute-rich implementation and content quality optimization.
Claude (Anthropic) contributed to research design discussion, statistical interpretation, and manuscript drafting. AI contribution is acknowledged in accordance with emerging norms for transparency in human-AI collaborative research. The human author carried out all empirical data collection, made all analysis decisions, and drew all conclusions.
Contact: hello@growthmarshal.io
Last updated: February 22, 2026 · Version 1.0