How Do AI Search Engines Choose Their Sources? (162-Citation Study, 2026)
How do AI search engines choose their sources?
AI engines recommend a tool but cite someone other than the tool to justify it: roughly 80 percent of Perplexity's citations pointed to independent third-party pages, not the vendor's own site.
The single most important fact about AI source selection is that it routes around the vendor. When we logged the full citation list behind 20 buying-intent answers, the recommended brand's own website was the minority of what the engine actually cited. Out of 162 logged Perplexity citations (verified 2026-06-07), only about one in five pointed to the vendor's own domain. The other four in five were independent: review roundups, comparison blogs, aggregators, and forum threads. The engine names the tool, then reaches for an outside page to support naming it. This is the mechanic the CONSENSUS Protocol calls off-vendor weighting, and it is the part of the system most marketing teams measure last.
That single ratio reframes the entire game. A brand that pours its budget into its own marketing site is optimising the roughly 20 percent of the citation surface that was already most likely to point to it. The roughly 80 percent that decides most answers lives on pages the brand does not control. Source selection, in practice, is a referendum held on other people's domains.
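The off-vendor ratio is simple arithmetic over a logged citation list. The sketch below is a minimal illustration, not the study's actual pipeline: the function names, the `www.` stripping rule, and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse


def normalised_host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed (a naive rule)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def off_vendor_weight(citations, vendor_domains):
    """Share of citations whose host is NOT one of the vendor's own domains."""
    if not citations:
        raise ValueError("no citations logged")
    vendor = {d.lower() for d in vendor_domains}
    off_vendor = 0
    for url in citations:
        host = normalised_host(url)
        # A host counts as vendor if it is the domain itself or a subdomain.
        is_vendor = any(host == d or host.endswith("." + d) for d in vendor)
        if not is_vendor:
            off_vendor += 1
    return off_vendor / len(citations)


# Hypothetical citation list for one answer that recommended clay.com.
sample = [
    "https://www.clay.com/pricing",
    "https://www.reddit.com/r/sales/comments/example",
    "https://zapier.com/blog/example-roundup",
    "https://www.g2.com/products/example",
    "https://marketermilk.com/blog/example",
]
weight = off_vendor_weight(sample, ["clay.com"])
print(f"off-vendor weight: {weight:.0%}")
```

In this made-up sample four of five citations are off-vendor, which happens to mirror the roughly four-in-five ratio reported above.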
What type of page does an AI engine cite most?
The third-party review or listicle was the most-cited page type, making up roughly 58 percent of the classified Perplexity citations, with vendor pages a distant 20 percent.
Source selection has a clear pecking order by page type. We classified each of the 162 citations from its domain signature into four buckets. The table below is the centrepiece of this study: it is the source-type Share of Voice that an AI answer is built from. Read it as the answer to "what kind of page do I need to be on to get cited," not as a leaderboard of individual sites.
| Source type | What it is | Share of citations (est.) | Reading |
|---|---|---|---|
| Third-party review / listicle | Independent "best X 2026" roundups and review blogs (daveswift.com, marketermilk.com, saleshandy.com) | ~58% | The dominant surface |
| Vendor first-party | The recommended tool's own site (surferseo.com, instantly.ai, clay.com) | ~20% | Minority slice |
| Forum / community / UGC | Reddit, YouTube, LinkedIn: first-person buyer discussion | ~16% | High coverage breadth |
| Comparison aggregator | Dedicated review platforms (G2, TrustRadius) | ~6% | Tie-breaker layer |
Source: Lucreya original measurement, data.json sourceTypeMix_perplexity_estimate. Classified from domain signatures across ~162 Perplexity citations; proportions are approximate and labelled as estimates. Perplexity-only fidelity. Snapshot 2026-06-07. License CC BY 4.0. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
The reading is blunt. If you want to be the source an engine cites, the highest-probability surface is an independent comparison page, not your own marketing site and not a sponsored placement. Vendor pages earn citations, but they earn the smallest share, and they earn it only when they are structured well enough to be quoted. The aggregator layer (G2, TrustRadius) is a thin tie-breaker, not the main event. For the playbook on turning these surfaces into your citations, see our sibling guide on how to get cited in ChatGPT.
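Classifying a citation "from its domain signature" can be sketched as a lookup against known domain sets. The version below is illustrative only: the sets are seeded with the example domains named in the table, and treating every unrecognised domain as a third-party review is an assumption of this sketch, not a rule stated by the study.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative lookup sets, seeded from the example domains in the table.
VENDOR = {"surferseo.com", "instantly.ai", "clay.com"}
FORUM = {"reddit.com", "youtube.com", "linkedin.com"}
AGGREGATOR = {"g2.com", "trustradius.com"}


def host_of(url: str) -> str:
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matches(host: str, domains) -> bool:
    """True if host equals a listed domain or is one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url: str) -> str:
    host = host_of(url)
    if matches(host, VENDOR):
        return "vendor"
    if matches(host, FORUM):
        return "forum"
    if matches(host, AGGREGATOR):
        return "aggregator"
    # Assumption: anything unrecognised is an independent review or listicle.
    return "third_party_review"


def source_type_mix(citations):
    """Map each bucket to its share of the citation list."""
    counts = Counter(classify(u) for u in citations)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


# Hypothetical URLs, not rows from the dataset.
urls = [
    "https://marketermilk.com/blog/example",
    "https://daveswift.com/example",
    "https://www.reddit.com/r/example",
    "https://instantly.ai/pricing",
]
print(source_type_mix(urls))
```

A real classification would need a reviewed domain list, since a default bucket like this one inflates the third-party share whenever a vendor or forum domain is missing from the lookup sets.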
Why does Reddit get cited so often?
Reddit appeared in 15 of 20 answers (75 percent share of voice), more than twice the reach of any other domain, because forum threads carry the unsponsored buyer language engines reach for.
One community source dominated coverage, and it was not a publisher. Reddit showed up in the citation set of 15 of the 20 answers (75 percent; verified 2026-06-07), the single most-cited domain in the entire study. The next two were Zapier at 30 percent (6 of 20 queries) and YouTube at 25 percent (5 of 20). After that, citations scattered across more than 100 distinct domains, most appearing in just one query. So the concentration is in coverage, not in a single dominant publisher: one forum is cited almost everywhere, while everything else is long-tail.
| Domain | Queries cited (of 20) | Coverage SoV |
|---|---|---|
| reddit.com | 15 | 75% |
| zapier.com | 6 | 30% |
| youtube.com | 5 | 25% |
Source: Lucreya June 2026 study, data.json concentration. Beyond these three, citations spread across 100+ distinct domains. Perplexity-only fidelity. Snapshot 2026-06-07. Verified 2026-06-07.
The mechanism is intuitive once you see it. A forum thread is full of first-person, unsponsored experience ("I switched from X to Y and here is what broke"), and that is exactly the signal an engine wants when it has to defend a recommendation. A vendor page asserts; a forum thread testifies. The strategic implication for a brand is uncomfortable but clear: the conversation about you on Reddit and YouTube is a citation surface you do not own and largely cannot buy, only earn. That is the off-domain leverage the source-type table quantifies.
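Coverage share of voice, as used in the domain table, counts the number of queries in which a domain appears at least once, divided by the total number of queries. A minimal sketch of that calculation follows. The per-query assignment of domains is synthetic and exists only to reproduce the reported counts (15, 6, and 5 of 20); the function name is hypothetical.

```python
from collections import Counter


def coverage_sov(domains_by_query):
    """Fraction of queries in which each domain is cited at least once."""
    total = len(domains_by_query)
    if total == 0:
        raise ValueError("no queries supplied")
    seen = Counter()
    for domains in domains_by_query.values():
        # set() so several citations of one domain in one answer count once.
        seen.update(set(domains))
    return {domain: n / total for domain, n in seen.items()}


# Synthetic layout: only the totals (15, 6, 5 of 20) come from the study.
queries = {}
for i in range(20):
    cited = []
    if i < 15:
        cited.append("reddit.com")
    if i < 6:
        cited.append("zapier.com")
    if i < 5:
        cited.append("youtube.com")
    queries[f"query_{i + 1}"] = cited

for domain, share in sorted(coverage_sov(queries).items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {share:.0%}")
```

Counting per query rather than per citation is what makes this a coverage metric: a domain cited five times in one answer still scores one query out of twenty.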
What does a page that gets cited actually look like?
The coded structural exemplar was a Zapier comparison page: Article and BreadcrumbList schema, a comparison table, real pricing, a 2026 date, a named structure, and about 2,600 words.
To move from "third parties win" to "here is the page shape that wins," we visited and coded one cited page directly. We picked a Zapier comparison article from the cited set as the structural archetype and read its anatomy line by line. It is the template the rest of the cited corpus rhymes with: structured, priced, schema-marked, and recently dated.
- Schema: Article + BreadcrumbList (machine-readable structure the engine can parse)
- Length: About 2,600 words (enough to cover the comparison in depth)
- Table: An explicit comparison table (extractable, side-by-side)
- Pricing: Real pricing data, not "affordable" hand-waving
- Freshness: A visible 2026 date (engines down-weight stale comparison content)
- Type: Third-party comparison roundup, not a vendor brochure
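For readers unfamiliar with the schema item in the list above, here is a hedged sketch of what Article plus BreadcrumbList markup can look like when generated as JSON-LD. It is not the markup of the coded Zapier page, which this article does not reproduce: the helper name, URLs, headline, and dates are placeholders, and only the schema.org types and property names are real.

```python
import json


def exemplar_jsonld(headline, url, published, modified, word_count, crumbs):
    """Build an Article + BreadcrumbList JSON-LD graph as a plain dict."""
    article = {
        "@type": "Article",
        "headline": headline,
        "url": url,
        "datePublished": published,
        "dateModified": modified,
        "wordCount": word_count,
    }
    breadcrumbs = {
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "item": link}
            for i, (name, link) in enumerate(crumbs, start=1)
        ],
    }
    return {"@context": "https://schema.org", "@graph": [article, breadcrumbs]}


# Placeholder values only; example.com is not a page from the study.
doc = exemplar_jsonld(
    headline="Best example tools compared (2026)",
    url="https://example.com/blog/best-example-tools",
    published="2026-01-15",
    modified="2026-05-01",
    word_count=2600,
    crumbs=[
        ("Blog", "https://example.com/blog"),
        ("Best example tools", "https://example.com/blog/best-example-tools"),
    ],
)
print(json.dumps(doc, indent=2))
```

The printed JSON would normally be embedded in the page inside a `script` element of type `application/ld+json`.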
The reading from coding it: original first-party measurement was nearly absent from the entire cited set. Almost everything that got cited was consensus aggregation, pages that restate the same comparison everyone else publishes. That absence is the opening. A page carrying original data (a dated study, a real benchmark, a logged dataset) is differentiated in a field where almost nothing is original, which is precisely the structural bet behind publishing studies like this one. We expand the build mechanics in how to rank in AI answers, and the tool landscape that competes for these citations is tracked at the best GEO tools.
Can these source findings be trusted across every engine?
The source-type findings are Perplexity-grounded only; what we can report across all three engines is tool-recommendation agreement, not citation-level data.
This is the most important honesty caveat in the study, so we state it before any takeaway. Citation-URL data was captured at full fidelity for only one engine. Perplexity exposes a native numbered source list, so its 162 citations are real, logged URLs. ChatGPT renders citations as in-product chips without stable, extractable URLs. Google AI Overviews collapses its citation list and intermixes it with organic results. So every source-type number on this page (the ~80 percent off-vendor weight, the ~58/20/16/6 source-type mix, the 75 percent Reddit coverage) describes Perplexity, and we do not generalise it to the other engines as if their citations were captured.
What we can report across all three engines is a different metric: how often they agree on the recommended tool. That is a tool-agreement reading, not citation data. Across the 14 category and intent queries (verified 2026-06-07), the three engines named the same top tool on only 5 (36 percent full agreement), agreed two of three on 6 (43 percent), and named three completely different top tools on 3 (21 percent full divergence). All three full-divergence queries fell in the GEO category itself.
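The tool-agreement reading reduces to counting distinct top picks per query: one distinct tool means full agreement, two means two of three, three means full divergence. The sketch below reproduces the reported split under stated assumptions: the tool names and query labels are hypothetical, and only the counts (5, 6, and 3 of 14) come from the study.

```python
from collections import Counter

LEVELS = {1: "full_agreement", 2: "two_of_three", 3: "full_divergence"}


def agreement_level(top_picks):
    """Classify one query from the top tool named by each of three engines."""
    if len(top_picks) != 3:
        raise ValueError("expected exactly three top picks, one per engine")
    return LEVELS[len(set(top_picks))]


def agreement_summary(picks_by_query):
    """Return {level: (count, rounded percent)} across all queries."""
    total = len(picks_by_query)
    if total == 0:
        raise ValueError("no queries supplied")
    counts = Counter(agreement_level(p) for p in picks_by_query.values())
    return {
        level: (counts[level], round(100 * counts[level] / total))
        for level in LEVELS.values()
    }


# Hypothetical picks arranged to match the reported counts.
picks = {}
for i in range(5):
    picks[f"q{i}"] = ("tool_a", "tool_a", "tool_a")
for i in range(5, 11):
    picks[f"q{i}"] = ("tool_a", "tool_a", "tool_b")
for i in range(11, 14):
    picks[f"q{i}"] = ("tool_a", "tool_b", "tool_c")

print(agreement_summary(picks))
```

With 14 queries the rounded percentages come out to 36, 43, and 21, matching the figures quoted above.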
For comparison, the controlled-benchmark literature points the same direction on what wins a citation. The 2023 Princeton GEO study (Aggarwal et al.) found, in a single-engine benchmark, that adding statistics, quotations, and citations to a source could raise its generative-engine visibility by up to roughly 40 percent. That is an external finding from a controlled setting, labelled as such; our contribution is the field measurement of which kinds of pages actually carry those signals in live answers.
Find out which sources are deciding your category
The free AI Visibility Audit runs the first step of the CONSENSUS Protocol: it returns your Engine-Consensus flag across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date, in minutes. No blended black-box score, just where you actually stand on the surfaces this study measured.
What should a brand do about how engines pick sources?
Stop optimising only the 20 percent you own and start earning the 80 percent you do not: structured third-party coverage, accurate comparison-page mentions, and genuine community presence.
The data converts cleanly into a priority order. Because the citation surface is roughly four-fifths off-vendor, the highest-leverage work is off your own domain. First, make sure the independent "best X 2026" roundups (the ~58 percent surface) name you, and name you accurately, because an engine quoting a roundup that omits you cannot cite you. Second, ensure the comparison and aggregator layer (G2, TrustRadius, dedicated versus pages) has current, correct entries for you. Third, treat the community layer (Reddit, YouTube) as earned, not bought. Fourth, and only fourth, harden your own pages to the exemplar shape so the ~20 percent of vendor citations that exist go to a page worth quoting.
Off-vendor weighting: the share of an engine's citations that point to third-party pages rather than the brand's own site. In our Perplexity capture, roughly four in five. A high off-vendor weight means you cannot self-publish your way into the answer.
Source-type Share of Voice: the rate at which a class of source (third-party review, vendor, forum, aggregator) appears in the citation set across a fixed query set, expressed as coverage rather than a blended score.
None of this is a one-time fix. AI answers are volatile: one query that showed a Google AI Overview on an earlier probe did not trigger one on a timed re-run, so any source-selection reading is a dated snapshot, not a permanent map. The honest discipline is to re-run on a cadence and treat every figure as decaying. For the full reproducible standard behind these readings, the parent guide is the CONSENSUS Protocol, and the productised version of running it for you is our monitoring and placement retainer. The starting-point definition lives in what is GEO, and the auditing method in our GEO audit methodology.
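One way to make "treat every figure as decaying" concrete is to compare the cited domains from two dated runs of the same query. The sketch below uses Jaccard overlap for that comparison. This is an illustrative choice rather than a method the study describes, and the function names and sample domain sets are hypothetical.

```python
def citation_overlap(earlier, later):
    """Jaccard overlap between two sets of cited domains (1.0 = identical)."""
    a, b = set(earlier), set(later)
    if not a and not b:
        return 1.0  # two empty citation sets are treated as unchanged
    return len(a & b) / len(a | b)


def drift_report(earlier, later):
    """Summarise what changed between two snapshots of one query."""
    a, b = set(earlier), set(later)
    return {
        "overlap": citation_overlap(a, b),
        "dropped": sorted(a - b),
        "added": sorted(b - a),
    }


# Hypothetical snapshots of the same query on two different dates.
run_1 = {"reddit.com", "zapier.com", "g2.com"}
run_2 = {"reddit.com", "zapier.com", "youtube.com"}
print(drift_report(run_1, run_2))
```

A low overlap between runs is a signal to re-measure before acting on an older reading.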
For the wider tool landscape that competes for these citations, our colleagues at Nesyona's AI SEO tools index track the platforms feeding these answers.
Bottom line
AI search engines choose their sources by reaching past the vendor and citing independent third parties, so the leverage to get cited is mostly off your own domain. In our June 2026 study, roughly 80 percent of the 162 logged Perplexity citations were off-vendor; the third-party review or listicle was the dominant page type at about 58 percent; Reddit led all domains with 75 percent coverage; and the page that gets cited looks like the coded Zapier exemplar: Article and BreadcrumbList schema, a comparison table, real pricing, a 2026 date, about 2,600 words. The source-type findings are grounded in Perplexity data only, because that is the one engine whose citations were captured at full fidelity; across all three engines we can report only the tool-agreement metric (36 percent full agreement, 21 percent full divergence). Run the standard on your brand with our free AI Visibility Audit, read the parent method in the CONSENSUS Protocol, or see the full evidence in the Who AI Recommends: GTM 2026 study.
- Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 logged Perplexity citations. Snapshot date 2026-06-07. lucreya.com/research/who-ai-recommends-gtm-2026/. CC BY 4.0. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). verified 2026-06-07
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. External controlled-benchmark finding: statistic, quotation, and citation additions can raise generative-engine visibility by up to ~40 percent on a single engine. Labelled external.
- Perplexity AI. perplexity.ai. The one engine in this study that exposes a native numbered source list; all source-type figures derive from its high-fidelity citation capture.
- Google. Generative AI in Search: Let Google do the searching for you. blog.google/products/search/generative-ai-search/. Reference for Google AI Overviews behaviour and why its citations collapse into organic results.
- Schema.org. Article and BreadcrumbList specifications. schema.org/Article. The markup the coded structural exemplar carried.
- Creative Commons. CC BY 4.0 License. creativecommons.org/licenses/by/4.0/. License for the Lucreya measurement dataset.