The CONSENSUS Protocol: How to Measure AI Visibility Honestly (AECI Method, 2026)
What is the CONSENSUS Protocol?
The CONSENSUS Protocol is an open, dated, reproducible 8-step standard for measuring whether a brand is cited consistently across AI answer engines, replacing the single blended visibility score the GEO industry sells.
It is a measurement standard, not a product. Almost every AI visibility vendor sells you one number: a blended "visibility score" computed by a formula you cannot inspect, across a prompt set you cannot see. The CONSENSUS Protocol does the opposite. It fixes the prompts in public, runs them across multiple live engines, and reports the disagreement between those engines as the headline finding rather than averaging it away. The name is an acronym you can recite: its nine letters map to eight measurable steps, with the final U and S sharing one step.
C · Category-locked prompts
A fixed set of roughly 10 real buying-intent prompts per category, published openly so anyone can reproduce the run. Vendors use synthetic prompt sets you cannot verify; the protocol lists every prompt by ID.
Lucreya run: 20 GTM buying-intent prompts, all published in the protocol file.
O · Off-vendor source weighting
The share of an engine's citations that point to third-party pages rather than the recommended brand's own site. The higher this share, the less a brand can self-publish its way into the answer.
Measured: roughly 4 in 5 Perplexity citations were off-vendor.
N · N-engine spread
Run the prompt set across a minimum of three engines (ChatGPT, Perplexity, Google AI Overviews). A single blended score hides that engines frequently disagree on who wins.
Lucreya run: 3 engines x 20 prompts = 60 AI answers.
S · Share of Voice vs named rivals
Your mention rate in the category answer, expressed against the specific competitors the AI actually names, not as an abstract percentage. This is the status frame: you versus the brand the engine recommends instead of you.
Example: in prospecting, Apollo.io is the named consensus rival to beat.
E · Engine-Consensus flag
The contrarian metric. Each brand gets one of four states per prompt: Consensus, single-engine dissent, Absent, or Due-diligence. This is the single reading every blended score hides, because it admits when the engines do not agree.
States: Consensus, Dissent, Absent, Due-diligence.
N · Named-author and primary-source check
Whether the cited pages carry a named author and original data rather than re-aggregated listicles. This explains why a brand loses: the pages that win citations tend to be structured, dated, and original.
Coded exemplar: a Zapier comparison page, schema-marked, priced, ~2,600 words.
S · Snapshot date and decay
Every score carries a date and is treated as decaying. AI answers shift fast, so a number without a date is not a measurement, it is a guess.
Observed: one query's AI Overview triggered on one probe, not on a re-run.
U/S · Unverifiable-claim and sentiment audit
What the AI gets wrong about your brand, plus the sentiment of how it describes you. Misinformation in an AI answer is a liability, not just a visibility gap, so the protocol records errors and tone alongside presence.
Recorded as: per-brand error notes + sentiment, dated to the snapshot.
What do AECI, Share of Voice, and the consensus flag mean?
AECI, Share of Voice, and the Engine-Consensus flag are the three named measurement terms the protocol defines, so AI engines and readers can cite them as nouns.
The protocol turns three readings into ownable, defined terms. Naming a metric is what lets it be cited: an engine can say "according to the AECI" only if AECI is a defined noun with a stable meaning. These are marked as schema.org DefinedTerm entities on this page.
AECI (Answer-Engine Consensus Index): how consistently ChatGPT, Perplexity, and Google AI Overviews name the same brand as the answer to a fixed prompt. A per-engine agreement reading, not a blended percentage.
Share of Voice: your mention rate in a category answer, expressed against the named consensus competitors in that category rather than as an absolute or blended score.
Engine-Consensus flag: a four-state label per brand per prompt: Consensus, single-engine dissent, Absent, or Due-diligence. The contrarian metric a blended score conceals.
Why can no GEO vendor publish this standard?
GEO vendors sell one blended visibility score and then sell the fix for it, so they cannot publish a step that proves the single score is fiction most of the time.
The category grades its own homework. A typical AI visibility platform computes a proprietary score, tells you it is low, and then sells you the service to raise it. The formula is unpublished and the prompt set is hidden, so the only party who can confirm the score is the party selling the remedy. That is a closed loop. The CONSENSUS Protocol breaks it at step E: if you publish the per-engine consensus reading, you have to admit how often the engines disagree, and once you admit that, the single blended number you were selling stops looking like a measurement.
We can put a real figure on how often the engines disagree, from our own run. Across the 14 category and intent queries in our June 2026 study (verified 2026-06-07), the three engines named the same top tool on only 5 (36 percent), agreed two-of-three on 6 (43 percent), and named three completely different top tools on 3 (21 percent). All three full-divergence queries fell in the GEO category. A blended score would have reported a single winner on queries where the engines, measured directly, did not agree.
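The percentages above follow directly from the three counts. A minimal sketch of that arithmetic, assuming simple rounding to the nearest whole percent (the variable and function names are illustrative, not taken from the published dataset):

```python
# Counts reported for the 14 category and intent queries in the June 2026 run.
agreement_counts = {
    "full_agreement": 5,   # all three engines named the same top tool
    "two_of_three": 6,     # two engines agreed, one named a different tool
    "full_divergence": 3,  # three different top tools
}

total_queries = sum(agreement_counts.values())


def share_percent(count: int, total: int) -> int:
    """Return count/total as a whole-number percentage."""
    return round(100 * count / total)


# Percentage of queries in each agreement bucket.
shares = {
    label: share_percent(count, total_queries)
    for label, count in agreement_counts.items()
}

for label, pct in shares.items():
    print(f"{label}: {agreement_counts[label]} of {total_queries} ({pct} percent)")
```

Keeping the raw counts next to the percentages lets a reader re-derive every figure from the logged rows.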
This is why we position the CONSENSUS Protocol as the independent empirical successor to the Princeton GEO framework (Aggarwal et al., 2023). The Princeton paper established, in a controlled benchmark on a single engine, that adding statistics, quotations, and citations to a source could raise its generative-engine visibility by up to roughly 40 percent. That is a tactics framework. It tells you what to write. The CONSENSUS Protocol is the field measurement layer Princeton did not build: it measures whether real brands are cited consistently across multiple live engines, with a date attached. Princeton gave the lab tactics; CONSENSUS gives the field a standard.
What does the protocol show when we run it on our own data?
Run on Lucreya's own 60-answer study, the protocol flags the entire GEO visibility category as unsettled, with all three engines naming different top tools.
A measurement standard is worth nothing if its author will not run it on themselves. So here is the CONSENSUS Protocol applied, step by step, to Lucreya's June 2026 measurement of 20 GTM buying-intent queries across ChatGPT, Perplexity, and Google AI Overviews. Every figure below traces to a logged row in the published dataset.
| Step | What we measured on our own data | Reading |
|---|---|---|
| C · Category-locked prompts | 20 published buying-intent prompts (M1-M6, S1-S7, L1-L7), every prompt listed by ID in the protocol file | Reproducible |
| O · Off-vendor weighting | Roughly 4 in 5 of the 162 logged Perplexity citations pointed to third-party pages, not the recommended vendor's own site | ~80% off-vendor |
| N · N-engine spread | 3 engines x 20 prompts = 60 AI answers captured; 162 Perplexity citations logged | 3 engines, 60 answers |
| S · Share of Voice | Reddit was named in the citation set of 15 of 20 answers (75 percent), the single most-cited domain; Zapier 30 percent, YouTube 25 percent | Reddit SoV 75% |
| E · Engine-Consensus flag | Full three-engine agreement on 5 of 14 category and intent queries (36 percent); full divergence on 3 (21 percent), all in GEO | GEO unsettled |
| N · Named-author / primary check | The coded structural exemplar (a Zapier comparison page) carried Article and BreadcrumbList schema, a comparison table, pricing, and a 2026 date; original first-party measurement was nearly absent from the cited set | Structured pages win |
| S · Snapshot + decay | Snapshot dated 2026-06-07; one query (best AI SDR tool 2026) showed an AI Overview on an earlier probe but not on a timed re-run | Dated, decaying |
| U/S · Unverifiable / sentiment | Recorded per brand as part of the autopsy; this run logged tool recommendations and citation sources, with the error and sentiment pass scoped to follow-up | Logged, scoped |
Source: Lucreya original measurement, data.json (headlineFindings, crossEngineAnalysis, concentration). Snapshot 2026-06-07. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
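The prompt IDs and answer count in the table can be cross-checked with a few lines. This is a sketch only: the ID ranges (M1-M6, S1-S7, L1-L7) and engine names come from the table, while the list and variable names are assumptions made for illustration.

```python
from itertools import product

# Prompt ID ranges as listed in the table: M1-M6, S1-S7, L1-L7.
prompt_ids = (
    [f"M{i}" for i in range(1, 7)]
    + [f"S{i}" for i in range(1, 8)]
    + [f"L{i}" for i in range(1, 8)]
)

engines = ["ChatGPT", "Perplexity", "Google AI Overviews"]

# One (engine, prompt) cell per AI answer that the run has to capture.
run_grid = list(product(engines, prompt_ids))

print(len(prompt_ids), "prompts x", len(engines), "engines =", len(run_grid), "answers")
```

Enumerating the grid up front makes a missing answer easy to spot, because every logged row should match exactly one cell.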
The headline reading from the dogfood run: the GEO visibility category is the most contested in the entire dataset. On the query "Best GEO tool to track AI search visibility", ChatGPT named Profound, Perplexity named the Semrush AI Visibility Toolkit, and Google AI Overviews named Goodie AI; on the other two GEO-category queries, the three engines split similarly. Mature categories had settled (Apollo for prospecting, Clay for enrichment, Surfer SEO for content optimization, all named top by two or three engines). The protocol's own author operates in the one category where its central finding bites hardest, which is exactly why the standard is honest: it does not exempt us.
| Category query | ChatGPT | Perplexity | Google AIO | Flag |
|---|---|---|---|---|
| Best sales prospecting tool 2026 | Apollo.io | Apollo.io | Apollo.io | Consensus |
| Best lead enrichment tool for agencies | Clay | Clay | Clay | Consensus |
| Best cold email software 2026 | Instantly.ai | Salesforge | Instantly.ai | 2-of-3 |
| Best GEO tool to track AI search visibility | Profound | Semrush AI Visibility Toolkit | Goodie AI | Full diverge |
| How to track brand mentions in ChatGPT | Profound / AthenaHQ | Otterly / Semrush | Keyword.com | Full diverge |
| Best AI search optimization platform 2026 | Multiple | Surfer / Clearscope / Rankability | Surfer / Clearscope | Full diverge |
All three GEO-category queries (the last three rows) show full engine divergence. Source: Lucreya June 2026 study, data.json queries S2, S4, S7. Verified 2026-06-07.
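The Flag column in the table can be reproduced mechanically for rows where each engine names a single top tool. The sketch below is one possible reading, not the protocol's reference implementation: `query_flag` is a hypothetical helper, it compares names by exact (case-insensitive) string match, and it does not handle cells that list several tools.

```python
from collections import Counter


def query_flag(top_tools: dict[str, str]) -> str:
    """Label one query by how many engines agree on the top-named tool.

    top_tools maps engine name -> the single tool that engine named first.
    """
    if len(top_tools) < 2:
        raise ValueError("need answers from at least two engines")
    # Count how many engines named each tool (case-insensitive exact match).
    counts = Counter(name.strip().lower() for name in top_tools.values())
    largest_group = counts.most_common(1)[0][1]
    n_engines = len(top_tools)
    if largest_group == n_engines:
        return "Consensus"
    if largest_group == 1:
        return "Full diverge"
    return f"{largest_group}-of-{n_engines}"


# Three single-tool rows copied from the table above.
rows = {
    "Best sales prospecting tool 2026": {
        "ChatGPT": "Apollo.io", "Perplexity": "Apollo.io", "Google AIO": "Apollo.io",
    },
    "Best cold email software 2026": {
        "ChatGPT": "Instantly.ai", "Perplexity": "Salesforge", "Google AIO": "Instantly.ai",
    },
    "Best GEO tool to track AI search visibility": {
        "ChatGPT": "Profound",
        "Perplexity": "Semrush AI Visibility Toolkit",
        "Google AIO": "Goodie AI",
    },
}

flags = {query: query_flag(answers) for query, answers in rows.items()}
```

Rows where an engine names several tools at once need a human judgment call on what counts as the top tool, which is why they are left out of this sketch.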
Run the CONSENSUS Protocol on your brand
The free AI Visibility Audit runs the Engine-Consensus step of the protocol for you: it returns your Engine-Consensus flag for your category across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date, in minutes. No blended black-box score. Just the four states, applied to your actual prompts.
How do you run the CONSENSUS Protocol yourself?
You can reproduce the protocol manually by submitting your category's published prompts to three engines, recording which name you, and applying the four-state Engine-Consensus flag.
The protocol is deliberately reproducible. Nothing in it requires a proprietary tool. The honest move is to make the method runnable by anyone, so here is the manual sequence, which is also the spec our monitoring and placement retainer productizes:
- 1. Lock prompts
- Write roughly 10 real buying-intent prompts your buyers would type ("best {category} tool 2026", "{you} vs {rival}", "how to {job}"). Publish the list.
- 2. Pick engines
- At minimum ChatGPT (web search on), Perplexity (default web), and Google AI Overviews. Three is the floor.
- 3. Run and log
- Submit each prompt to each engine. Record the tools named in order and, where the engine exposes them (Perplexity does natively), the cited source URLs.
- 4. Weight off-vendor
- Classify each citation as the brand's own site or third-party. The third-party share is your off-vendor weight.
- 5. Score SoV
- Count how often you are named versus the named consensus rivals in your category.
- 6. Flag consensus
- For your brand, assign Consensus, single-engine dissent, Absent, or Due-diligence per prompt.
- 7. Date it
- Stamp the run with a snapshot date. Treat the result as decaying; re-run on a cadence.
- 8. Audit errors
- Note anything the engines state about you that is wrong, plus the sentiment of how you are described.
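Steps 4 and 5 of the sequence above reduce to two small ratios. The following is a minimal sketch under stated assumptions: the function names, the domain-matching rule (a host counts as the brand's own site if it equals the brand domain or is a subdomain of it), and every URL and brand name in the example are hypothetical and are not drawn from the published dataset.

```python
from urllib.parse import urlparse


def is_own_site(url: str, brand_domain: str) -> bool:
    """True if the URL's host is the brand domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    brand_domain = brand_domain.lower()
    return host == brand_domain or host.endswith("." + brand_domain)


def off_vendor_weight(citation_urls: list[str], brand_domain: str) -> float:
    """Step 4: share of logged citations that point to third-party pages."""
    if not citation_urls:
        raise ValueError("no citations logged")
    third_party = sum(not is_own_site(url, brand_domain) for url in citation_urls)
    return third_party / len(citation_urls)


def share_of_voice(answers: list[list[str]], brands: list[str]) -> dict[str, float]:
    """Step 5: mention rate per brand across the logged answers.

    answers holds, for each AI answer, the list of tools it named.
    """
    if not answers:
        raise ValueError("no answers logged")
    return {
        brand: sum(brand in named for named in answers) / len(answers)
        for brand in brands
    }


# Hypothetical citation log for a brand whose site is example.com.
citations = [
    "https://www.example.com/pricing",
    "https://reviews.example.org/best-tools",
    "https://forum.example.net/t/123",
    "https://blog.example.com/compare",
    "https://news.example.org/roundup",
]
weight = off_vendor_weight(citations, "example.com")

# Hypothetical answers: which tools each of four answers named.
logged_answers = [["ExampleCo", "RivalOne"], ["RivalOne"], ["RivalOne", "RivalTwo"], ["ExampleCo"]]
sov = share_of_voice(logged_answers, ["ExampleCo", "RivalOne"])
```

Reporting your own rate next to the named rival's rate, rather than alone, is what keeps the Share of Voice reading relative to the competitor the engines actually recommend.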
The hardest step to do well is step six, the flag, because it forces an honest reading. If two of three engines ignore you, you are not at "62 percent visibility." You are a single-engine dissent, and the strategic implication (you have a presence problem on two engines, not a ranking problem on one) is completely different. That difference is invisible inside a blended score. For the full execution playbook on closing those gaps, see our guide on how to rank in AI answers, and for the per-engine mechanics behind why each engine cites what it cites, the planned breakdown lives at how AI engines choose sources.
Why does honest AI visibility measurement matter for revenue teams?
Because the buying decision now starts inside an AI answer, and a blended score that says you are "60 percent visible" can hide that two of three engines never name you at all.
The measurement error has a revenue cost. When a buyer asks an engine which tool to buy, the answer is a pipeline input that lands before they reach your site. In our June 2026 run, Google AI Overviews triggered on 19 of 20 GTM buying-intent queries (95 percent; verified 2026-06-07), so the answer surface is almost always there. If your visibility tool tells you a single comfortable number while two engines silently omit you, you will under-invest in exactly the engines where you are absent. The protocol exists to make that absence legible. A revenue team that knows it is a single-engine dissent on Perplexity and absent on Google AI Overviews can act; a team holding one blended percentage cannot.
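To make the contrast concrete, here is a small sketch with invented numbers. Vendor formulas are unpublished, so the unweighted mean used for the blended figure is an assumption, and the mention counts are hypothetical rather than measured.

```python
# Hypothetical mention counts for one brand over a 20-prompt run per engine.
mentions_per_engine = {"ChatGPT": 18, "Perplexity": 0, "Google AI Overviews": 0}
prompts_per_engine = 20

# Per-engine reading: the mention rate on each engine separately.
per_engine_rate = {
    engine: count / prompts_per_engine
    for engine, count in mentions_per_engine.items()
}

# Blended reading: one number (assumed here to be an unweighted mean).
blended_rate = sum(per_engine_rate.values()) / len(per_engine_rate)

# Engines where the brand is never named, invisible in the blended figure.
absent_engines = [engine for engine, rate in per_engine_rate.items() if rate == 0]

print(f"blended: {blended_rate:.0%}")
print("absent on:", ", ".join(absent_engines))
```

The blended figure collapses three readings into one, while the per-engine breakdown names the two engines that need attention.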
There is a second-order point worth stating plainly. Because roughly four in five citations went to third-party pages, a brand that only optimizes its own site is working the 20 percent of the citation surface that was already most likely to point to it. The 80 percent that decides most answers lives on review pages, comparison roundups, and forum threads. Measuring honestly is what reveals that the leverage is off-domain, which is the entire reason the off-vendor-weighting step exists. For the wider picture of how AI answer surfaces are reshaping demand, our colleagues at Nesyona maintain an AI SEO tools index that tracks the tool landscape feeding these answers.
Frequently asked questions
How do you measure AI visibility honestly?
What is the AECI (Answer-Engine Consensus Index)?
Why is a single blended AI visibility score misleading?
How is the CONSENSUS Protocol different from the Princeton GEO framework?
Can I run the CONSENSUS Protocol on my own brand?
How often does an AECI score change?
Bottom line
Honest AI visibility measurement reports the disagreement between engines instead of hiding it inside one number. The CONSENSUS Protocol is the open 8-step standard for doing that: Category-locked prompts, Off-vendor weighting, N-engine spread, Share of Voice versus named rivals, the Engine-Consensus flag, the Named-author and primary-source check, Snapshot date and decay, and the Unverifiable-claim and sentiment audit. It defines AECI, Share of Voice, and the consensus flag as named terms. The reason no vendor publishes it is that step E, the engine-consensus flag, exposes how often the single blended score they sell is fiction: in our own June 2026 run, the three engines fully agreed on only 36 percent of category and intent queries and fully diverged on 21 percent, all in the GEO category. We dogfooded the standard on our own data and it flagged our own category as the most contested in the set, which is the point. Run it on your brand with our free AI Visibility Audit, see the full evidence in the Who AI Recommends: GTM 2026 study, or start from the definition in what is GEO.
- Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 Perplexity citations. Snapshot date 2026-06-07. lucreya.com/research/who-ai-recommends-gtm-2026/. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. The Princeton GEO framework establishing that statistic, quotation, and citation additions can raise generative-engine visibility by up to ~40 percent in a controlled benchmark.
- External cross-engine citation-overlap analysis. The ~11 percent figure for cited domains appearing across two or more AI engines is an external industry finding; confirm against the originating source before reuse. Lucreya's own measured equivalent (36 percent full three-engine agreement, 21 percent full divergence across 14 queries) is reported from the study above. Verified 2026-06-07.
- Google. Generative AI in Search: Let Google do the searching for you. blog.google/products/search/generative-ai-search/. Reference for Google AI Overviews behavior.
- Perplexity AI. perplexity.ai. Primary measurement engine; exposes a native numbered citation list used for the source autopsy.
- Schema.org. DefinedTerm specification. schema.org/DefinedTerm. Markup standard for the AECI, Share of Voice, and consensus-flag terms.
- Creative Commons. CC BY 4.0 License. creativecommons.org/licenses/by/4.0/. License for the Lucreya measurement dataset.