Data study · GEO · Updated June 2026 · 11 min read · By Vincent Wesley Couey

Last reviewed: June 10, 2026 · Next review due: September 2026 · Snapshot data: June 7, 2026

How Do AI Search Engines Choose Their Sources? (162-Citation Study, 2026)


How do AI search engines choose their sources?

AI engines recommend a tool but cite someone other than the tool to justify it: roughly 80 percent of Perplexity's citations pointed to independent third-party pages, not the vendor's own site.

The single most important fact about AI source selection is that it routes around the vendor. When we logged the full citation list behind 20 buying-intent answers, the recommended brand's own website was the minority of what the engine actually cited. Out of 162 logged Perplexity citations (verified 2026-06-07), only about one in five pointed to the vendor's own domain. The other four in five were independent: review roundups, comparison blogs, aggregators, and forum threads. The engine names the tool, then reaches for an outside page to support naming it. This is the mechanic the CONSENSUS Protocol calls off-vendor weighting, and it is the part of the system most marketing teams measure last.

That single ratio reframes the entire game. A brand that pours its budget into its own marketing site is optimising the roughly 20 percent of the citation surface that was already most likely to point to it. The roughly 80 percent that decides most answers lives on pages the brand does not control. Source selection, in practice, is a referendum held on other people's domains.

Q: So is on-site optimisation pointless?

A: No, but it is the smaller lever. Your own page still has to be structured, dated, and extractable to win the ~20 percent of citations that do go to vendors, and a clean vendor page can become the canonical fact the third parties quote. The point is sequencing: if you only work your own domain, you are ignoring the four-fifths of the citation surface that actually moves the answer. Earning and shaping third-party coverage is the larger, slower, off-domain lever.

What type of page does an AI engine cite most?

The third-party review or listicle was the most-cited page type, making up roughly 58 percent of the classified Perplexity citations, with vendor pages a distant 20 percent.

Source selection has a clear pecking order by page type. We classified each of the 162 citations from its domain signature into four buckets. The table below is the centrepiece of this study: it is the source-type Share of Voice that an AI answer is built from. Read it as the answer to "what kind of page do I need to be on to get cited," not as a leaderboard of individual sites.

Source type	What it is	Share of citations	Reading
Third-party review / listicle	Independent "best X 2026" roundups and review blogs (daveswift.com, marketermilk.com, saleshandy.com)	~58%	The dominant surface
Vendor first-party	The recommended tool's own site (surferseo.com, instantly.ai, clay.com)	~20%	Minority slice
Forum / community / UGC	Reddit, YouTube, LinkedIn: first-person buyer discussion	~16%	High coverage breadth
Comparison aggregator	Dedicated review platforms (G2, TrustRadius)	~6%	Tie-breaker layer

Source: Lucreya original measurement, data.json sourceTypeMix_perplexity_estimate. Classified from domain signatures across ~162 Perplexity citations; proportions are approximate and labelled as estimates. Perplexity-only fidelity. Snapshot 2026-06-07. License CC BY 4.0. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
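The classification step behind the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the rubric used for the study: the bucket sets are seeded only with the example domains named in the table (plus the usual g2.com and trustradius.com hosts), the helper names `classify`, `source_type_mix`, and `off_vendor_weight` are hypothetical, and the sample URLs are made up rather than taken from the citation log.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical bucket lists, seeded only with example domains named in the table.
BUCKETS = {
    "third_party_review": {"daveswift.com", "marketermilk.com", "saleshandy.com"},
    "vendor": {"surferseo.com", "instantly.ai", "clay.com"},
    "forum": {"reddit.com", "youtube.com", "linkedin.com"},
    "aggregator": {"g2.com", "trustradius.com"},
}


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify(url: str) -> str:
    """Map a cited URL to a source-type bucket by its domain signature."""
    domain = domain_of(url)
    for bucket, domains in BUCKETS.items():
        if domain in domains:
            return bucket
    return "unclassified"  # left for manual review rather than guessed


def source_type_mix(urls: list[str]) -> dict[str, float]:
    """Share of citations per bucket, as fractions that sum to 1."""
    counts = Counter(classify(u) for u in urls)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def off_vendor_weight(urls: list[str]) -> float:
    """Fraction of citations not pointing to a vendor domain.

    Note: 'unclassified' URLs count as off-vendor here, which is an assumption.
    """
    labels = [classify(u) for u in urls]
    return sum(label != "vendor" for label in labels) / len(labels)


# Made-up sample for illustration; not the study's citation log.
sample = [
    "https://www.reddit.com/r/sales/comments/example",
    "https://marketermilk.com/blog/example-roundup",
    "https://www.g2.com/categories/example",
    "https://clay.com/pricing",
    "https://saleshandy.com/blog/example",
]
print(source_type_mix(sample))
print(off_vendor_weight(sample))
```

A domain-signature approach like this is coarse by design: it cannot tell a vendor's own comparison blog from an independent one, which is one reason the proportions above are labelled as estimates.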

The reading is blunt. If you want to be the source an engine cites, the highest-probability surface is an independent comparison page, not your own marketing site and not a sponsored placement. Vendor pages earn citations, but they earn the smallest share, and they earn it only when they are structured well enough to be quoted. The aggregator layer (G2, TrustRadius) is a thin tie-breaker, not the main event. For the playbook on turning these surfaces into your citations, see our sibling guide on how to get cited in ChatGPT.

Why does Reddit get cited so often?

Reddit appeared in 15 of 20 answers (75 percent share of voice), more than twice the reach of any other domain, because forum threads carry the unsponsored buyer language engines reach for.

One community source dominated coverage, and it was not a publisher. Reddit showed up in the citation set of 15 of the 20 answers (75 percent; verified 2026-06-07), the single most-cited domain in the entire study. The next two were Zapier at 30 percent (6 of 20 queries) and YouTube at 25 percent (5 of 20). After that, citations scattered across more than 100 distinct domains, most appearing in just one query. So the concentration is in coverage, not in a single dominant publisher: one forum is cited almost everywhere, while everything else is long-tail.

Domain	Queries cited (of 20)	Coverage SoV
reddit.com	15	75%
zapier.com	6	30%
youtube.com	5	25%

Source: Lucreya June 2026 study, data.json concentration. Beyond these three, citations spread across 100+ distinct domains. Perplexity-only fidelity. Snapshot 2026-06-07. Verified 2026-06-07.
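Coverage Share of Voice counts a domain once per query, however many times it is cited inside a single answer. The Python sketch below shows one way to compute it. The function name `coverage_sov` and the input shape (a mapping from query to its cited URLs) are assumptions for illustration, and the toy URLs are invented. The final loop simply re-derives the three percentages in the table from the reported counts.

```python
from collections import Counter
from urllib.parse import urlparse


def coverage_sov(citations_by_query: dict[str, list[str]]) -> dict[str, float]:
    """Fraction of queries whose citation set includes each domain.

    A domain cited several times in one answer still counts once for that
    query, which is what separates coverage from raw citation share.
    """
    n_queries = len(citations_by_query)
    hits = Counter()
    for urls in citations_by_query.values():
        # Deduplicate per query by collecting domains into a set first.
        domains = {urlparse(u).netloc.lower().removeprefix("www.") for u in urls}
        hits.update(domains)
    return {domain: count / n_queries for domain, count in hits.items()}


# Toy input with invented URLs: two queries, Reddit cited in both.
toy = {
    "q1": [
        "https://www.reddit.com/r/a",
        "https://reddit.com/r/b",
        "https://zapier.com/blog/x",
    ],
    "q2": ["https://www.reddit.com/r/c"],
}
print(coverage_sov(toy))

# Re-deriving the table's percentages from the reported counts (of 20 queries).
reported = {"reddit.com": 15, "zapier.com": 6, "youtube.com": 5}
for domain, queries in reported.items():
    print(f"{domain}: {queries / 20:.0%}")
```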

The mechanism is intuitive once you see it. A forum thread is full of first-person, unsponsored experience ("I switched from X to Y and here is what broke"), and that is exactly the signal an engine wants when it has to defend a recommendation. A vendor page asserts; a forum thread testifies. The strategic implication for a brand is uncomfortable but clear: the conversation about you on Reddit and YouTube is a citation surface you do not own and largely cannot buy, only earn. That is the off-domain leverage the source-type table quantifies.

Q: Can I just post about my own tool on Reddit to get cited?

A: That backfires. The reason Reddit carries weight is that the language reads as unsponsored; an obvious vendor plant is the opposite of the signal the engine values, and communities punish it. The durable move is to be genuinely good enough that organic threads recommend you, and to make sure the structured third-party roundups (the ~58 percent surface) name you accurately. You influence the forum layer indirectly, by earning it, not by seeding it.

What does a page that gets cited actually look like?

The coded structural exemplar was a Zapier comparison page: Article and BreadcrumbList schema, a comparison table, real pricing, a 2026 date, a named structure, and about 2,600 words.

To move from "third parties win" to "here is the page shape that wins," we visited and coded one cited page directly. We picked a Zapier comparison article from the cited set as the structural archetype and read its anatomy line by line. It is the template the rest of the cited corpus rhymes with: structured, priced, schema-marked, and recently dated.

Coded structural exemplar: a Zapier comparison page

Schema: Article + BreadcrumbList (machine-readable structure the engine can parse)
Length: About 2,600 words (enough to cover the comparison in depth)
Table: An explicit comparison table (extractable, side-by-side)
Pricing: Real pricing data, not "affordable" hand-waving
Freshness: A visible 2026 date (engines down-weight stale comparison content)
Type: Third-party comparison roundup, not a vendor brochure
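For readers who have not worked with the markup named in the schema line, here is a minimal sketch of Article plus BreadcrumbList JSON-LD, built in Python so it can be serialised into a `script` tag of type `application/ld+json`. The property names are standard Schema.org vocabulary. Every value (headline, URLs, dates, author) is a placeholder, the `build_jsonld` helper is hypothetical, and this is not the markup copied from the coded Zapier page.

```python
import json


def build_jsonld(headline, url, date_published, date_modified, author_name, breadcrumbs):
    """Return an Article + BreadcrumbList graph as a JSON-serialisable dict.

    `breadcrumbs` is an ordered list of (name, url) pairs, from site root to page.
    """
    article = {
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": url,
        "datePublished": date_published,
        "dateModified": date_modified,  # the visible freshness signal
        "author": {"@type": "Person", "name": author_name},
    }
    breadcrumb_list = {
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "item": item_url}
            for i, (name, item_url) in enumerate(breadcrumbs, start=1)
        ],
    }
    return {"@context": "https://schema.org", "@graph": [article, breadcrumb_list]}


# Placeholder values only.
doc = build_jsonld(
    headline="Example comparison: Tool A vs Tool B",
    url="https://example.com/blog/tool-a-vs-tool-b",
    date_published="2026-01-15",
    date_modified="2026-06-01",
    author_name="Example Author",
    breadcrumbs=[
        ("Blog", "https://example.com/blog"),
        ("Tool A vs Tool B", "https://example.com/blog/tool-a-vs-tool-b"),
    ],
)
print(json.dumps(doc, indent=2))
```

The markup covers only the schema line of the exemplar. The comparison table, pricing, and date still have to exist in the visible page content for an engine to extract them.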

The reading from coding it: original first-party measurement was nearly absent from the entire cited set. Almost everything that got cited was consensus aggregation, pages that restate the same comparison everyone else publishes. That absence is the opening. A page carrying original data (a dated study, a real benchmark, a logged dataset) is differentiated in a field where almost nothing is original, which is precisely the structural bet behind publishing studies like this one. We expand the build mechanics in how to rank in AI answers, and the tool landscape that competes for these citations is tracked at the best GEO tools.

Can these source findings be trusted across every engine?

The source-type findings are Perplexity-grounded only; what we can report across all three engines is tool-recommendation agreement, not citation-level data.

This is the most important honesty caveat in the study, so we state it before any takeaway. Citation-URL data was captured at high fidelity for one engine only. Perplexity exposes a native numbered source list, so its 162 citations are real, logged URLs. ChatGPT renders citations as in-product chips without stable, extractable URLs. Google AI Overviews collapses its citation list and intermixes it with organic results. So every source-type number on this page (the ~80 percent off-vendor weight, the ~58/20/16/6 source-type mix, the 75 percent Reddit coverage) describes Perplexity, and we do not generalise it to the other engines as if their citations were captured.

What we can report across all three engines is a different metric: how often they agree on the recommended tool. That is a tool-agreement reading, not citation data. Across the 14 category and intent queries (verified 2026-06-07), the three engines named the same top tool on only 5 (36 percent full agreement), agreed two of three on 6 (43 percent), and named three completely different top tools on 3 (21 percent full divergence). All three full-divergence queries fell in the GEO category itself.
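The three-way split above reduces to counting distinct top tools per query. The Python sketch below is one way to express it. The function names, the name normalisation (lowercase, trimmed), and the placeholder tool names are assumptions. The closing lines re-derive the reported percentages from the 5 / 6 / 3 counts over 14 queries.

```python
from collections import Counter

LEVELS = {1: "full_agreement", 2: "two_of_three", 3: "full_divergence"}


def agreement_level(top_tools: tuple[str, str, str]) -> str:
    """Classify one query by how many distinct top tools three engines named."""
    distinct = {tool.strip().lower() for tool in top_tools}
    return LEVELS[len(distinct)]


def agreement_breakdown(results: list[tuple[str, str, str]]) -> dict[str, float]:
    """Share of queries at each agreement level."""
    counts = Counter(agreement_level(r) for r in results)
    return {level: counts[level] / len(results) for level in LEVELS.values()}


# Placeholder tool names, ordered (ChatGPT, Perplexity, Google AI Overviews).
toy = [
    ("Tool A", "Tool A", "tool a"),  # full agreement
    ("Tool A", "Tool B", "Tool A"),  # two of three
    ("Tool A", "Tool B", "Tool C"),  # full divergence
    ("Tool B", "Tool B", "Tool C"),  # two of three
]
print(agreement_breakdown(toy))

# Re-deriving the reported percentages from the counts over 14 queries.
reported = {"full_agreement": 5, "two_of_three": 6, "full_divergence": 3}
total = sum(reported.values())
for level, n in reported.items():
    print(f"{level}: {n}/{total} = {n / total:.0%}")
```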

Read the two numbers as different things: the source-type Share of Voice is Perplexity citation-level data (high fidelity, single engine). The 36 percent full-agreement / 21 percent full-divergence figure is the cross-engine tool-agreement metric (all three engines, but at the recommendation level, not the citation level). We do not present per-engine citation sources for ChatGPT or Google AI Overviews, because we did not capture them at that fidelity. Conflating the two would overstate the data, which is exactly what this caveat exists to prevent.

Source: Lucreya figures, data.json sourceTypeMix_perplexity_estimate + crossEngineAnalysis. Snapshot 2026-06-07. Perplexity citation fidelity; ChatGPT/Google AIO not captured at URL level.

For comparison, the controlled-benchmark literature points the same direction on what wins a citation. The 2023 Princeton GEO study (Aggarwal et al.) found, in a single-engine benchmark, that adding statistics, quotations, and citations to a source could raise its generative-engine visibility by up to roughly 40 percent. That is an external finding from a controlled setting, labelled as such; our contribution is the field measurement of which kinds of pages actually carry those signals in live answers.

Find out which sources are deciding your category

The free AI Visibility Audit runs the first step of the CONSENSUS Protocol: it returns your Engine-Consensus flag across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date, in minutes. No blended black-box score, just where you actually stand on the surfaces this study measured.


What should a brand do about how engines pick sources?

Stop optimising only the 20 percent you own and start earning the 80 percent you do not: structured third-party coverage, accurate comparison-page mentions, and genuine community presence.

The data converts cleanly into a priority order. Because the citation surface is roughly four-fifths off-vendor, the highest-leverage work is off your own domain. First, make sure the independent "best X 2026" roundups (the ~58 percent surface) name you, and name you accurately, because an engine quoting a roundup that omits you cannot cite you. Second, ensure the comparison and aggregator layer (G2, TrustRadius, dedicated "X versus Y" pages) has current, correct entries for you. Third, treat the community layer (Reddit, YouTube) as earned, not bought. Fourth, and only fourth, harden your own pages to the exemplar shape so the ~20 percent of vendor citations that exist go to a page worth quoting.

Off-vendor weight

The share of an engine's citations that point to third-party pages rather than the brand's own site. In our Perplexity capture, roughly four in five. A high off-vendor weight means you cannot self-publish your way into the answer.

Source-type SoV

The rate at which a class of source (third-party review, vendor, forum, aggregator) appears in the citation set across a fixed query set, expressed as coverage rather than a blended score.

None of this is a one-time fix. AI answers are volatile: one query that showed a Google AI Overview on an earlier probe did not trigger one on a timed re-run, so any source-selection reading is a dated snapshot, not a permanent map. The honest discipline is to re-run on a cadence and treat every figure as decaying. For the full reproducible standard behind these readings, the parent guide is the CONSENSUS Protocol, and the productised version of running it for you is our monitoring and placement retainer. The starting-point definition lives in what is GEO, and the auditing method in our GEO audit methodology.
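Treating every figure as decaying can be made mechanical with a simple age check on the snapshot date. The sketch below is an assumption-laden illustration: the `is_stale` helper is hypothetical, and the 90-day default is chosen only to roughly match the quarterly review window stated at the top of this page, not a cadence the CONSENSUS Protocol prescribes.

```python
from datetime import date, timedelta


def is_stale(snapshot: date, today: date, max_age_days: int = 90) -> bool:
    """True when a snapshot is older than the allowed re-run window.

    max_age_days=90 is an assumed default, not a prescribed cadence.
    """
    return today - snapshot > timedelta(days=max_age_days)


snapshot = date(2026, 6, 7)  # snapshot date of this study
print(is_stale(snapshot, today=date(2026, 6, 10)))  # checked at the last review
print(is_stale(snapshot, today=date(2026, 9, 30)))  # checked at the end of September
```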

For the wider tool landscape that competes for these citations, our colleagues at Nesyona's AI SEO tools index track the platforms feeding these answers.


What this study does not claim: The source-type findings describe Perplexity, the one engine whose citation URLs we captured at high fidelity; we do not present per-engine citation sources for ChatGPT or Google AI Overviews. The cross-engine 36 percent full-agreement / 21 percent full-divergence figure is a tool-recommendation metric, not citation data. Source-type proportions are domain-classified estimates, labelled as such. The ~40 percent Princeton figure is an external controlled-benchmark finding, not ours. Every figure is a dated snapshot from a volatile system (2026-06-07). The dataset is deposited at Zenodo under DOI 10.5281/zenodo.20632768 (CC BY 4.0). We make no claim that any Lucreya page is itself cited by an engine.

Frequently asked questions

How do AI search engines choose their sources?

AI search engines lean heavily on independent third-party pages and community discussion rather than the recommended brand's own website. In our June 2026 measurement, roughly four in five of 162 logged Perplexity citations pointed to off-vendor sources: independent roundups, review blogs, comparison aggregators, and forums. Reddit alone appeared in 15 of 20 answers (75 percent), the single most-cited domain. The engine recommends a tool but cites someone other than the tool to justify it, so the page that wins is usually a structured, dated, third-party comparison rather than the vendor's marketing site.

What type of page does an AI engine cite most often?

The most-cited page type in our Perplexity capture was the third-party review or listicle, at roughly 58 percent of classified citations. Vendor first-party pages were about 20 percent, forum and community sources (Reddit, YouTube, LinkedIn) about 16 percent, and dedicated comparison aggregators such as G2 and TrustRadius about 6 percent. These proportions are domain-classified estimates from a single-engine snapshot and are labelled as such.

Do AI engines prefer vendor pages or third-party pages?

Third-party pages, by a wide margin in our data. Only about one in five of the 162 logged Perplexity citations went to the recommended vendor's own site; the other roughly 80 percent were independent. A brand that optimises only its own website is working the smallest slice of the citation surface. The pages that decide most answers are review roundups, comparison pages, and forum threads the brand does not own.

Why does Reddit get cited so often by AI search engines?

Reddit appeared in 15 of 20 Perplexity answers (75 percent share of voice) in our June 2026 run, more than any other domain. Forum threads carry first-person, unsponsored buyer language that reads as genuine experience, which is the signal an engine reaches for when justifying a recommendation. No other domain appeared in more than 30 percent of queries (Zapier 30 percent, YouTube 25 percent), so the concentration is in coverage breadth, not a single dominant publisher.

Can these findings be trusted across every AI engine?

The source-type findings are Perplexity-grounded, and we state that limit plainly. Perplexity exposes a native numbered citation list, so its URLs were captured at high fidelity. ChatGPT renders citations as in-product chips without stable URLs, and Google AI Overviews collapses and intermixes its citations, so neither was captured at the same fidelity. What we can report across all three engines is tool-recommendation agreement: full three-engine agreement in 36 percent of category and intent queries, and full divergence in 21 percent. That agreement metric is cross-engine; the per-source-type autopsy is Perplexity-only.

Bottom line

AI search engines choose their sources by reaching past the vendor and citing independent third parties, so the leverage to get cited is mostly off your own domain. In our June 2026 study, roughly 80 percent of the 162 logged Perplexity citations were off-vendor; the third-party review or listicle was the dominant page type at about 58 percent; Reddit led all domains with 75 percent coverage; and the page that gets cited looks like the coded Zapier exemplar: Article and BreadcrumbList schema, a comparison table, real pricing, a 2026 date, about 2,600 words. The source-type findings are Perplexity-grounded by fidelity; across all three engines we can report only the tool-agreement metric (36 percent full agreement, 21 percent full divergence). Run the standard on your brand with our free AI Visibility Audit, read the parent method in the CONSENSUS Protocol, or see the full evidence in the Who AI Recommends: GTM 2026 study.

Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 logged Perplexity citations. Snapshot date 2026-06-07. lucreya.com/research/who-ai-recommends-gtm-2026/. CC BY 4.0. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. External controlled-benchmark finding: statistic, quotation, and citation additions can raise generative-engine visibility by up to ~40 percent on a single engine. Labelled external.
Perplexity AI. perplexity.ai. The one engine in this study that exposes a native numbered source list; all source-type figures derive from its high-fidelity citation capture.
Google. Generative AI in Search: Let Google do the searching for you. blog.google/products/search/generative-ai-search/. Reference for Google AI Overviews behaviour and why its citations collapse into organic results.
Schema.org. Article and BreadcrumbList specifications. schema.org/Article. The markup the coded structural exemplar carried.
Creative Commons. CC BY 4.0 License. creativecommons.org/licenses/by/4.0/. License for the Lucreya measurement dataset.