Framework · GEO · Updated June 2026 · 14 min read · By Vincent Wesley Couey

Last reviewed: June 10, 2026 · Next review due: September 2026 · Snapshot data: June 7, 2026

The CONSENSUS Protocol: How to Measure AI Visibility Honestly (AECI Method, 2026)

Q: How do you measure AI visibility honestly?

You measure AI visibility honestly by running a fixed, published set of buying-intent prompts across at least three AI answer engines (ChatGPT, Perplexity, Google AI Overviews), recording which engines name your brand, and reporting per-engine agreement rather than a single blended score. The CONSENSUS Protocol is the open 8-step standard for doing this: it scores a brand on the Answer-Engine Consensus Index (AECI) and assigns a four-state Engine-Consensus flag (Consensus, single-engine dissent, absent, due-diligence). A single blended visibility number is not honest because, in Lucreya's June 2026 measurement, the three engines named the same top tool on only 5 of 14 category and intent queries.

Q: What is the AECI (Answer-Engine Consensus Index)?

The Answer-Engine Consensus Index (AECI) is a measure of how consistently ChatGPT, Perplexity, and Google AI Overviews name the same brand or source as the answer to a fixed buying-intent prompt. Instead of one blended visibility percentage, AECI reports a per-engine agreement reading and an Engine-Consensus flag, so a brand can see whether it is the consensus answer, a single-engine dissent, or absent entirely. It was coined by Lucreya as the metric behind the CONSENSUS Protocol.

Q: Why is a single blended AI visibility score misleading?

A single blended AI visibility score averages disagreeing engines into one number that hides whether the engines actually agree. In Lucreya's June 2026 study of 20 GTM buying-intent queries across three engines, the engines named the same top tool on only 5 of 14 category and intent queries (36 percent), and named three completely different top tools on 3 queries (21 percent), all in the GEO category. A blended score would report a single winner where the engines, in fact, disagreed. Separately, an external citation-overlap analysis has found that only about 11 percent of cited domains appear across two or more engines, reinforcing that cross-engine agreement is the exception, not the rule.

Q: How is the CONSENSUS Protocol different from the Princeton GEO framework?

The 2023 Princeton GEO paper (Aggarwal et al.) established that specific on-page tactics (adding statistics, quotations, and citations) can raise a source's visibility in a single generative engine by up to about 40 percent. It is a tactics framework measured on one engine in a controlled benchmark. The CONSENSUS Protocol is a field measurement standard: it measures whether a real brand is cited consistently across multiple live engines on published buying-intent prompts, and reports the disagreement between them. Princeton tells you what to write; CONSENSUS tells you, with a date attached, where you actually stand.

Q: Can I run the CONSENSUS Protocol on my own brand?

Yes. The protocol is deliberately reproducible: every step lists a published input. You can run it manually by submitting your category's buying-intent prompts to ChatGPT, Perplexity, and Google AI Overviews, recording which engines name you, and applying the Engine-Consensus flag. Lucreya also runs the first step of the protocol for free through the AI Visibility Audit, which returns your Engine-Consensus flag for your category in minutes.

What is the CONSENSUS Protocol?

The CONSENSUS Protocol is an open, dated, reproducible 8-step standard for measuring whether a brand is cited consistently across AI answer engines, replacing the single blended visibility score the GEO industry sells.

It is a measurement standard, not a product. Almost every AI visibility vendor sells you one number: a blended "visibility score" computed by a formula you cannot inspect, across a prompt set you cannot see. The CONSENSUS Protocol does the opposite. It fixes the prompts in public, runs them across multiple live engines, and reports the disagreement between those engines as the headline finding rather than averaging it away. The name is an acronym you can recite, and each of its eight letters is also a measurable step:

C · Category-locked prompts

A fixed set of roughly 10 real buying-intent prompts per category, published openly so anyone can reproduce the run. Vendors use synthetic prompt sets you cannot verify; the protocol lists every prompt by ID.

Lucreya run: 20 GTM buying-intent prompts, all published in the protocol file.

O · Off-vendor source weighting

The share of an engine's citations that point to third-party pages rather than the recommended brand's own site. The higher this share, the less a brand can self-publish its way into the answer.

Measured: roughly 4 in 5 Perplexity citations were off-vendor.
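As a rough illustration of the off-vendor reading, the sketch below classifies each logged citation URL as own-site or third-party and reports the third-party share. The function name, the subdomain rule, and the sample URLs are all assumptions made for illustration; none of them come from the published dataset.

```python
from urllib.parse import urlparse


def off_vendor_share(citation_urls, vendor_domain):
    """Return the fraction of citations that point away from the vendor's own domain."""
    if not citation_urls:
        raise ValueError("no citations logged")
    vendor = vendor_domain.lower().removeprefix("www.")

    def is_own(url):
        # Assumption: subdomains of the vendor domain count as the vendor's own site.
        host = (urlparse(url).hostname or "").lower().removeprefix("www.")
        return host == vendor or host.endswith("." + vendor)

    off_vendor = sum(1 for url in citation_urls if not is_own(url))
    return off_vendor / len(citation_urls)


# Hypothetical citation list, for illustration only (not rows from the study).
citations = [
    "https://www.example-vendor.com/pricing",
    "https://www.reddit.com/r/sales/comments/abc",
    "https://zapier.com/blog/best-tools",
    "https://www.youtube.com/watch?v=xyz",
    "https://blog.example-vendor.com/guide",
]
share = off_vendor_share(citations, "example-vendor.com")
print(f"off-vendor share: {share:.0%}")
```

Whether a vendor's subdomains, help center, or community forum should count as own-site is a judgment call; whichever rule you pick, publish it with the run.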

N · N-engine spread

Run the prompt set across a minimum of three engines (ChatGPT, Perplexity, Google AI Overviews). A single blended score hides that engines frequently disagree on who wins.

Lucreya run: 3 engines x 20 prompts = 60 AI answers.
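One way to keep the run honest is to build the full prompt-by-engine grid before collecting anything, so a missing answer shows up as an empty row. The sketch below uses the prompt IDs listed for the Lucreya run (M1-M6, S1-S7, L1-L7); the row fields `tools_named` and `cited_urls` are hypothetical placeholders, not the schema of the published data file.

```python
from itertools import product

ENGINES = ["ChatGPT", "Perplexity", "Google AI Overviews"]

# Prompt IDs as listed for the Lucreya run: M1-M6, S1-S7, L1-L7.
PROMPT_IDS = (
    [f"M{i}" for i in range(1, 7)]
    + [f"S{i}" for i in range(1, 8)]
    + [f"L{i}" for i in range(1, 8)]
)

# One row per (prompt, engine) pair. The two list fields are hypothetical
# placeholders to be filled in by hand as each answer is captured.
run_grid = [
    {"prompt_id": prompt_id, "engine": engine, "tools_named": [], "cited_urls": []}
    for prompt_id, engine in product(PROMPT_IDS, ENGINES)
]

print(len(PROMPT_IDS), "prompts x", len(ENGINES), "engines =", len(run_grid), "answers")
```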

S · Share of Voice vs named rivals

Your mention rate in the category answer, expressed against the specific competitors the AI actually names, not as an abstract percentage. This is the status frame: you versus the brand the engine recommends instead of you.

Example: in prospecting, Apollo.io is the named consensus rival to beat.
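A minimal sketch of the Share of Voice reading, assuming each logged answer is simply the list of tools an engine named. It reports your mention rate next to the named rival's rate instead of as a lone percentage. The function name, the brand "YourBrand", and the sample answers are hypothetical.

```python
from collections import Counter


def share_of_voice(answers, brands):
    """Mention rate per brand: answers naming the brand divided by all answers."""
    if not answers:
        raise ValueError("no answers logged")
    counts = Counter()
    for tools_named in answers:
        named = {tool.lower() for tool in tools_named}
        for brand in brands:
            if brand.lower() in named:
                counts[brand] += 1
    return {brand: counts[brand] / len(answers) for brand in brands}


# Hypothetical answers: each inner list is the tools one engine named for one prompt.
answers = [
    ["Apollo.io", "YourBrand"],
    ["Apollo.io"],
    ["Apollo.io", "OtherTool"],
    ["YourBrand"],
]
sov = share_of_voice(answers, ["YourBrand", "Apollo.io"])
for brand, rate in sov.items():
    print(f"{brand}: {rate:.0%}")
```

Exact name matching is a simplification; a real run would need to normalize variants such as "Apollo" and "Apollo.io" before counting.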

E · Engine-Consensus flag

The contrarian metric. Each brand gets one of four states per prompt: Consensus, single-engine dissent, Absent, or Due-diligence. This is the single reading every blended score hides, because it admits when the engines do not agree.

States: Consensus, Dissent, Absent, Due-diligence

N · Named-author and primary-source check

Whether the cited pages carry a named author and original data rather than re-aggregated listicles. This explains why a brand loses: the pages that win citations tend to be structured, dated, and original.

Coded exemplar: a Zapier comparison page, schema-marked, priced, ~2,600 words.

S · Snapshot date and decay

Every score carries a date and is treated as decaying. AI answers shift fast, so a number without a date is not a measurement, it is a guess.

Observed: one query's AI Overview triggered on one probe, not on a re-run.
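To make "dated and decaying" concrete, a score can carry its snapshot date and be checked against a maximum age before anyone quotes it. In the sketch below, the 90-day window is an assumption standing in for a roughly quarterly re-run cadence, and the function name is hypothetical.

```python
from datetime import date, timedelta


def is_stale(snapshot, today, max_age_days=90):
    """True when a dated score is older than the allowed window."""
    # Assumption: 90 days approximates a quarterly re-run cadence.
    return (today - snapshot) > timedelta(days=max_age_days)


snapshot = date(2026, 6, 7)  # snapshot date of the Lucreya run
fresh_check = is_stale(snapshot, date(2026, 7, 1))
late_check = is_stale(snapshot, date(2026, 10, 1))
print("stale on 2026-07-01:", fresh_check)
print("stale on 2026-10-01:", late_check)
```

A shorter window may suit fast-moving categories; the point is only that the date travels with the number.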

U/S · Unverifiable-claim and sentiment audit

What the AI gets wrong about your brand, plus the sentiment of how it describes you. Misinformation in an AI answer is a liability, not just a visibility gap, so the protocol records errors and tone alongside presence.

Recorded as: per-brand error notes + sentiment, dated to the snapshot.

Q: Why eight steps and not a single score?

A: Because a single score has to hide something to stay a single number. The moment you compress three disagreeing engines into one figure, you have made an editorial decision (which engine to trust) and buried it. The eight steps keep the disagreement visible. Step E, the Engine-Consensus flag, is the one a vendor cannot publish without undermining the score it sells.

What do AECI, Share of Voice, and the consensus flag mean?

AECI, Share of Voice, and the Engine-Consensus flag are the three named measurement terms the protocol defines, so AI engines and readers can cite them as nouns.

The protocol turns three readings into ownable, defined terms. Naming a metric is what lets it be cited: an engine can say "according to the AECI" only if AECI is a defined noun with a stable meaning. These are marked as schema.org DefinedTerm entities on this page.

AECI

Answer-Engine Consensus Index

How consistently ChatGPT, Perplexity, and Google AI Overviews name the same brand as the answer to a fixed prompt. A per-engine agreement reading, not a blended percentage.

SoV

Share of Voice

Your mention rate in a category answer, expressed against the named consensus competitors in that category rather than as an absolute or blended score.

Flag

Engine-Consensus flag

A four-state label per brand per prompt: Consensus, single-engine dissent, Absent, or Due-diligence. The contrarian metric a blended score conceals.

Why can no GEO vendor publish this standard?

GEO vendors sell one blended visibility score and then sell the fix for it, so they cannot publish a step that proves the single score is fiction most of the time.

The category grades its own homework. A typical AI visibility platform computes a proprietary score, tells you it is low, and then sells you the service to raise it. The formula is unpublished and the prompt set is hidden, so the only party who can confirm the score is the party selling the remedy. That is a closed loop. The CONSENSUS Protocol breaks it at step E: if you publish the per-engine consensus reading, you have to admit how often the engines disagree, and once you admit that, the single blended number you were selling stops looking like a measurement.

We can put a real figure on how often the engines disagree, from our own run. Across the 14 category and intent queries in our June 2026 study (verified 2026-06-07), the three engines named the same top tool on only 5 (36 percent), agreed two-of-three on 6 (43 percent), and named three completely different top tools on 3 (21 percent). All three full-divergence queries fell in the GEO category. A blended score would have reported a single winner on queries where the engines, measured directly, did not agree.
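The percentages above follow directly from the three counts. A short check, using only the figures reported for the 14 category and intent queries (the variable names are ours):

```python
total_queries = 14
buckets = {
    "full three-engine agreement": 5,
    "two-of-three agreement": 6,
    "full divergence": 3,
}

# The three buckets should account for every query exactly once.
assert sum(buckets.values()) == total_queries

percentages = {label: round(100 * count / total_queries) for label, count in buckets.items()}
for label, pct in percentages.items():
    print(f"{label}: {buckets[label]} of {total_queries} ({pct} percent)")
```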

The cross-engine agreement finding, attributed: Separately from our own study, an external citation-overlap analysis widely cited in the GEO field has found that only about 11 percent of cited domains appear across two or more AI engines, meaning roughly nine in ten cited sources are engine-specific. We do not claim that 11 percent as our own number. Our own measured equivalent is the tool-agreement reading above: full three-engine agreement on 36 percent of category and intent queries, and full divergence on 21 percent. Both point the same way: cross-engine agreement is the exception, so a single blended score that implies one answer is, most of the time, fiction.

Lucreya figures: data.json crossEngineAnalysis, snapshot 2026-06-07. The ~11 percent domain-overlap figure is external (see citations); verify against the source before reuse.

This is why we position the CONSENSUS Protocol as the independent empirical successor to the Princeton GEO framework (Aggarwal et al., 2023). The Princeton paper established, in a controlled benchmark on a single engine, that adding statistics, quotations, and citations to a source could raise its generative-engine visibility by up to roughly 40 percent. That is a tactics framework. It tells you what to write. The CONSENSUS Protocol is the field measurement layer Princeton did not build: it measures whether real brands are cited consistently across multiple live engines, with a date attached. Princeton gave the lab tactics; CONSENSUS gives the field a standard.

Q: Is this just an attack on visibility tools?

A: No. The tools are useful for tracking; many expose genuinely good per-engine data inside their dashboards. The objection is narrower and specific: the headline blended score most of them lead with compresses disagreement into a number that reads as more certain than the underlying data supports. The protocol asks for the disagreement to be shown, not hidden. A tool that reports per-engine consensus openly already passes most of it.

What does the protocol show when we run it on our own data?

Run on Lucreya's own 60-answer study, the protocol flags the entire GEO visibility category as unsettled, with all three engines naming different top tools.

A measurement standard is worth nothing if its author will not run it on themselves. So here is the CONSENSUS Protocol applied, step by step, to Lucreya's June 2026 measurement of 20 GTM buying-intent queries across ChatGPT, Perplexity, and Google AI Overviews. Every figure below traces to a logged row in the published dataset.

Step	What we measured on our own data	Reading
C · Category-locked prompts	20 published buying-intent prompts (M1-M6, S1-S7, L1-L7), every prompt listed by ID in the protocol file	Reproducible
O · Off-vendor weighting	Roughly 4 in 5 of the 162 logged Perplexity citations pointed to third-party pages, not the recommended vendor's own site	~80% off-vendor
N · N-engine spread	3 engines x 20 prompts = 60 AI answers captured; 162 Perplexity citations logged	3 engines, 60 answers
S · Share of Voice	Reddit was named in the citation set of 15 of 20 answers (75 percent), the single most-cited domain; Zapier 30 percent, YouTube 25 percent	Reddit SoV 75%
E · Engine-Consensus flag	Full three-engine agreement on 5 of 14 category and intent queries (36 percent); full divergence on 3 (21 percent), all in GEO	GEO unsettled
N · Named-author / primary check	The coded structural exemplar (a Zapier comparison page) carried Article and BreadcrumbList schema, a comparison table, pricing, and a 2026 date; original first-party measurement was nearly absent from the cited set	Structured pages win
S · Snapshot + decay	Snapshot dated 2026-06-07; one query (best AI SDR tool 2026) showed an AI Overview on an earlier probe but not on a timed re-run	Dated, decaying
U/S · Unverifiable / sentiment	Recorded per brand as part of the autopsy; this run logged tool recommendations and citation sources, with the error and sentiment pass scoped to follow-up	Logged, scoped

Source: Lucreya original measurement, data.json (headlineFindings, crossEngineAnalysis, concentration). Snapshot 2026-06-07. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.

The headline reading from the dogfood: the GEO visibility category is the most contested in the entire dataset. On the query "Best GEO tool to track AI search visibility," ChatGPT named Profound, Perplexity named the Semrush AI Visibility Toolkit, and Google AI Overviews named Goodie AI; on the other two GEO-category queries, the three engines split similarly. Mature categories had settled (Apollo for prospecting, Clay for enrichment, Surfer SEO for content optimization, all named top by two or three engines). The protocol's own author operates in the one category where its central finding bites hardest, which is exactly why the standard is honest: it does not exempt us.

Category query	ChatGPT	Perplexity	Google AIO	Flag
Best sales prospecting tool 2026	Apollo.io	Apollo.io	Apollo.io	Consensus
Best lead enrichment tool for agencies	Clay	Clay	Clay	Consensus
Best cold email software 2026	Instantly.ai	Salesforge	Instantly.ai	2-of-3
Best GEO tool to track AI search visibility	Profound	Semrush AI Visibility Toolkit	Goodie AI	Full diverge
How to track brand mentions in ChatGPT	Profound / AthenaHQ	Otterly / Semrush	Keyword.com	Full diverge
Best AI search optimization platform 2026	Multiple	Surfer / Clearscope / Rankability	Surfer / Clearscope	Full diverge

All three GEO-category queries (the last three rows) show full engine divergence. Source: Lucreya June 2026 study, data.json queries S2, S4, S7. Verified 2026-06-07.
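The Flag column in the table above can be reproduced mechanically from the three engine columns. The sketch below assumes each cell is compared as a whole string, which is our simplification; the function name is hypothetical, and the rows are copied from the table.

```python
from collections import Counter


def agreement_flag(top_tools):
    """Label one query from the top tool each of three engines named."""
    if len(top_tools) != 3:
        raise ValueError("expected one top tool per engine (3 engines)")
    # Assumption: cells are compared as whole strings, exactly as printed.
    distinct = len(set(top_tools))
    return {1: "Consensus", 2: "2-of-3", 3: "Full diverge"}[distinct]


# Rows copied from the table: (ChatGPT, Perplexity, Google AIO).
rows = {
    "Best sales prospecting tool 2026": ("Apollo.io", "Apollo.io", "Apollo.io"),
    "Best lead enrichment tool for agencies": ("Clay", "Clay", "Clay"),
    "Best cold email software 2026": ("Instantly.ai", "Salesforge", "Instantly.ai"),
    "Best GEO tool to track AI search visibility": (
        "Profound", "Semrush AI Visibility Toolkit", "Goodie AI"),
    "How to track brand mentions in ChatGPT": (
        "Profound / AthenaHQ", "Otterly / Semrush", "Keyword.com"),
    "Best AI search optimization platform 2026": (
        "Multiple", "Surfer / Clearscope / Rankability", "Surfer / Clearscope"),
}

flags = {query: agreement_flag(tools) for query, tools in rows.items()}
tally = Counter(flags.values())
for query, flag in flags.items():
    print(f"{flag:<12} {query}")
```

Cells that list several tools make the whole-string rule a blunt instrument; a run that logs one top tool per engine per prompt avoids the ambiguity.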

Run the CONSENSUS Protocol on your brand

The free AI Visibility Audit runs step one of the protocol for you: it returns your Engine-Consensus flag for your category across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date, in minutes. No blended black-box score. Just the four states, applied to your actual prompts.


How do you run the CONSENSUS Protocol yourself?

You can reproduce the protocol manually by submitting your category's published prompts to three engines, recording which name you, and applying the four-state Engine-Consensus flag.

The protocol is deliberately reproducible. Nothing in it requires a proprietary tool. The honest move is to make the method runnable by anyone, so here is the manual sequence, which is also the spec our monitoring and placement retainer productizes:

The 8-step run, in practice

1. Lock prompts: Write roughly 10 real buying-intent prompts your buyers would type ("best {category} tool 2026", "{you} vs {rival}", "how to {job}"). Publish the list.
2. Pick engines: At minimum ChatGPT (web search on), Perplexity (default web), and Google AI Overviews. Three is the floor.
3. Run and log: Submit each prompt to each engine. Record the tools named in order and, where the engine exposes them (Perplexity does natively), the cited source URLs.
4. Weight off-vendor: Classify each citation as the brand's own site or third-party. The third-party share is your off-vendor weight.
5. Score SoV: Count how often you are named versus the named consensus rivals in your category.
6. Flag consensus: For your brand, assign Consensus, single-engine dissent, Absent, or Due-diligence per prompt.
7. Date it: Stamp the run with a snapshot date. Treat the result as decaying; re-run on a cadence.
8. Audit errors: Note anything the engines state about you that is wrong, plus the sentiment of how you are described.

The hardest step to do well is step six, the flag, because it forces an honest reading. If two of three engines ignore you, you are not at "62 percent visibility." You are a single-engine dissent, and the strategic implication (you have a presence problem on two engines, not a ranking problem on one) is completely different. That difference is invisible inside a blended score. For the full execution playbook on closing those gaps, see our guide on how to rank in AI answers, and for the per-engine mechanics behind why each engine cites what it cites, the planned breakdown lives at how AI engines choose sources.
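A hedged sketch of the per-brand flag for one prompt follows. Three cases come from the text: named by every engine reads as Consensus, named by none reads as Absent, and named by exactly one reads as a single-engine dissent (the "two of three engines ignore you" case above). The text does not spell out the two-of-three case or the exact trigger for Due-diligence, so routing two-of-three to Due-diligence for manual review is purely this sketch's assumption, as is the function name.

```python
ENGINES = ("ChatGPT", "Perplexity", "Google AI Overviews")


def consensus_flag(named_by):
    """Flag one brand on one prompt, given the set of engines that named it."""
    unknown = set(named_by) - set(ENGINES)
    if unknown:
        raise ValueError(f"unknown engines: {sorted(unknown)}")
    hits = len(set(named_by))
    if hits == len(ENGINES):
        return "Consensus"
    if hits == 0:
        return "Absent"
    if hits == 1:
        return "single-engine dissent"
    # ASSUMPTION: the two-of-three case is not defined in the text; this
    # sketch parks it under Due-diligence so a person reviews it by hand.
    return "Due-diligence"


all_three = consensus_flag({"ChatGPT", "Perplexity", "Google AI Overviews"})
one_only = consensus_flag({"Perplexity"})
none_named = consensus_flag(set())
print(all_three, "|", one_only, "|", none_named)
```

Keeping the flag as a label, not a number, is the design point: it cannot be averaged into a percentage by accident.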


Why does honest AI visibility measurement matter for revenue teams?

Because the buying decision now starts inside an AI answer, and a blended score that says you are "60 percent visible" can hide that two of three engines never name you at all.

The measurement error has a revenue cost. When a buyer asks an engine which tool to buy, the answer is a pipeline input that lands before they reach your site. In our June 2026 run, Google AI Overviews triggered on 19 of 20 GTM buying-intent queries (95 percent, verified 2026-06-07), so the answer surface is almost always there. If your visibility tool tells you a single comfortable number while two engines silently omit you, you will under-invest in exactly the engines where you are absent. The protocol exists to make that absence legible. A revenue team that knows it is a single-engine dissent on Perplexity and absent on Google AI Overviews can act; a team holding one blended percentage cannot.

There is a second-order point worth stating plainly. Because roughly four in five citations went to third-party pages, a brand that only optimizes its own site is working the 20 percent of the citation surface that was already most likely to point to it. The 80 percent that decides most answers lives on review pages, comparison roundups, and forum threads. Measuring honestly is what reveals that the leverage is off-domain, which is the entire reason the off-vendor-weighting step exists. For the broader picture of how AI answer surfaces are reshaping demand, our colleagues at Nesyona maintain an AI SEO tools index that tracks the tool landscape feeding these answers.

Q: Does a low AECI mean my content is bad?

A: Not necessarily. A low AECI usually means one of three things: your category has not settled yet (engines disagree, so nobody has consensus), the citations that decide your category live on third-party pages you have not earned, or your own pages are not structured for extraction (no named author, no original data, no schema). The protocol's named-author and off-vendor steps are designed to tell those three causes apart, which a single score cannot do.

What this standard does not claim: The CONSENSUS Protocol does not claim AI answers are permanent, that any single run is definitive, or that Lucreya itself is cited by any engine. Every figure here is a dated snapshot from a volatile system, attributed to a logged row in the published dataset. The ~11 percent cross-engine domain-overlap figure is an external finding and is labelled as such; our own measured equivalent is the 36 percent full-agreement / 21 percent full-divergence reading from our 14 category and intent queries. The dataset is deposited at Zenodo under DOI 10.5281/zenodo.20632768 (CC BY 4.0). Until any Lucreya page is screenshotted being cited by an engine, we describe our content as crawled and indexed, never as cited.

Frequently asked questions

How do you measure AI visibility honestly?

You run a fixed, published set of buying-intent prompts across at least three AI engines (ChatGPT, Perplexity, Google AI Overviews), record which engines name your brand, and report per-engine agreement rather than a single blended score. The CONSENSUS Protocol is the open 8-step standard for this: it scores a brand on the Answer-Engine Consensus Index (AECI) and assigns a four-state Engine-Consensus flag. A blended score is not honest because, in our June 2026 measurement, the three engines named the same top tool on only 5 of 14 category and intent queries.

What is the AECI (Answer-Engine Consensus Index)?

The Answer-Engine Consensus Index is a measure of how consistently ChatGPT, Perplexity, and Google AI Overviews name the same brand or source as the answer to a fixed buying-intent prompt. Instead of one blended visibility percentage, AECI reports a per-engine agreement reading and an Engine-Consensus flag, so a brand can see whether it is the consensus answer, a single-engine dissent, or absent. Lucreya coined AECI as the metric behind the CONSENSUS Protocol.

Why is a single blended AI visibility score misleading?

A blended score averages disagreeing engines into one number that hides whether they agree. In our June 2026 study, the engines named the same top tool on only 5 of 14 category and intent queries (36 percent) and named three different top tools on 3 queries (21 percent), all in the GEO category. A blended score would report a single winner where the engines actually disagreed. An external citation-overlap analysis separately finds only about 11 percent of cited domains appear across two or more engines, reinforcing that cross-engine agreement is rare.

How is the CONSENSUS Protocol different from the Princeton GEO framework?

The 2023 Princeton GEO paper (Aggarwal et al.) showed that on-page tactics such as adding statistics, quotations, and citations can raise a source's visibility in a single generative engine by up to roughly 40 percent. It is a tactics framework measured on one engine in a controlled benchmark. The CONSENSUS Protocol is a field measurement standard: it measures whether real brands are cited consistently across multiple live engines on published prompts, and reports the disagreement. Princeton tells you what to write; CONSENSUS tells you, with a date attached, where you stand.

Can I run the CONSENSUS Protocol on my own brand?

Yes. Every step lists a published input, so the protocol is reproducible by hand: submit your category's buying-intent prompts to ChatGPT, Perplexity, and Google AI Overviews, record which engines name you, and apply the Engine-Consensus flag. Lucreya also runs step one for free through the AI Visibility Audit, which returns your Engine-Consensus flag for your category in minutes.

How often does an AECI score change?

AI answers are volatile, so every AECI score carries a snapshot date and is treated as decaying. In our data, one query that showed a Google AI Overview on an earlier probe did not trigger one on a timed re-run. The snapshot-and-decay step requires re-running on a cadence (we re-run quarterly) rather than treating any score as permanent.

Bottom line

Honest AI visibility measurement reports the disagreement between engines instead of hiding it inside one number. The CONSENSUS Protocol is the open 8-step standard for doing that: Category-locked prompts, Off-vendor weighting, N-engine spread, Share of Voice versus named rivals, the Engine-Consensus flag, the Named-author and primary-source check, Snapshot date and decay, and the Unverifiable-claim and sentiment audit. It defines AECI, Share of Voice, and the consensus flag as named terms. The reason no vendor publishes it is that step E, the engine-consensus flag, exposes how often the single blended score they sell is fiction: in our own June 2026 run, the three engines fully agreed on only 36 percent of category and intent queries and fully diverged on 21 percent, all in the GEO category. We dogfooded the standard on our own data and it flagged our own category as the most contested in the set, which is the point. Run it on your brand with our free AI Visibility Audit, see the full evidence in the Who AI Recommends: GTM 2026 study, or start from the definition in what is GEO.

Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 Perplexity citations. Snapshot date 2026-06-07. lucreya.com/research/who-ai-recommends-gtm-2026/. CC BY 4.0. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). verified 2026-06-07
Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. The Princeton GEO framework establishing that statistic, quotation, and citation additions can raise generative-engine visibility by up to ~40 percent in a controlled benchmark.
External cross-engine citation-overlap analysis. The ~11 percent figure for cited domains appearing across two or more AI engines is an external industry finding; confirm against the originating source before reuse. Lucreya's own measured equivalent (36 percent full three-engine agreement, 21 percent full divergence across 14 queries) is reported from the study above. verified 2026-06-07
Google. Generative AI in Search: Let Google do the searching for you. blog.google/products/search/generative-ai-search/. Reference for Google AI Overviews behavior.
Perplexity AI. perplexity.ai. Primary measurement engine; exposes a native numbered citation list used for the source autopsy.
Schema.org. DefinedTerm specification. schema.org/DefinedTerm. Markup standard for the AECI, Share of Voice, and consensus-flag terms.
Creative Commons. CC BY 4.0 License. creativecommons.org/licenses/by/4.0/. License for the Lucreya measurement dataset.