Updated June 2026 · 11 min read · By Vincent Wesley Couey
Last reviewed: June 10, 2026 · Next review due: September 2026 · Snapshot data: June 7, 2026

What a Real GEO Audit Tests (Methodology, Not a Sales Pitch)

What does a GEO audit actually test?

A real GEO audit tests four distinct things that no blended visibility score can surface on its own: brand presence, source structure, cross-engine agreement, and AI-stated errors.

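Most "AI visibility audits" on the market test exactly one thing: how often a brand is mentioned in AI answers to a proprietary prompt set, compressed into one percentage. That is the easiest figure to sell and the least actionable to receive. By the time a buyer sees "your AI visibility is 34 percent," three questions that matter for strategy are already buried: which third-party sources the engines cite when they do or do not name the brand, whether the engines agree or disagree on the category, and what errors or negative framing the engines state about the brand.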

A GEO audit that answers all four questions is an implementation of the CONSENSUS Protocol: the open 8-step measurement standard that defines the Answer-Engine Consensus Index (AECI), the four-state Engine-Consensus flag, and the per-engine Share of Voice metric. This page publishes the full methodology so that any team can evaluate whether an audit they receive, whether from Lucreya or anyone else, actually covers it.

Q: Why do vendors omit steps 2, 3, and 4?
A: Because showing the Engine-Consensus flag requires admitting when the engines disagree, which makes a single blended score look like what it is: an editorial choice about which engine to trust, not a measurement. Showing source-type classification reveals that most of the citation leverage is on third-party pages the vendor cannot directly sell. Showing AI-stated errors requires reading every answer for negative content, which is slower and harder to automate than counting mentions.

How many queries does a proper audit run, and across which engines?

A minimum-viable audit runs 10-15 real buying-intent queries per vertical across 3 engines; a full audit runs 50-100 queries across all category tiers and 4 engines.

The query set is the single most important audit decision. Queries cannot be synthetic or proprietary. They must match actual buyer search behavior: the same strings that buyers type into Perplexity or ChatGPT when evaluating tools. There are three taxonomic tiers that a complete audit covers:

Tier 1: Category buying-intent

"Best [category] tool 2026" queries. These are the most competitive, the highest volume, and the queries where consensus winner brands are established. Example: "best sales prospecting tool 2026."

Tier 2: Head-to-head comparisons

"[Your brand] vs [competitor]" queries. Buyers use these late in evaluation. The AI answer here shapes the last-mile decision. Example: "Profound vs Otterly."

Tier 3: Intent and workflow queries

"How to [job-to-be-done]" and "best tool for [specific workflow]" queries. These drive the early discovery that later becomes a category query. Example: "how to track brand mentions in ChatGPT."

The engine floor for any honest audit is three: ChatGPT with web search enabled, Perplexity on default web search, and Google AI Overviews via standard Google Search. A full audit adds Claude.ai with web access as a fourth engine. In our June 2026 measurement of 20 GTM buying-intent queries, Google AI Overviews triggered on 19 of 20 queries (95 percent; verified 2026-06-07). The one miss, "best AI SDR tool 2026," had shown an overview on an earlier probe, illustrating that AI Overview trigger volatility is a real measurement hazard: a single-run audit that misses a volatile trigger can report a false absence.
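The false-absence hazard can be made concrete with a small sketch. It assumes each query is probed more than once and that each probe records only whether the surface triggered; the function name, the status labels, and the two-probe minimum are illustrative assumptions, not part of the CONSENSUS Protocol.

```python
def trigger_status(probes: list[bool], min_probes: int = 2) -> str:
    """Label an answer surface from repeated probes of one query.

    probes: one boolean per probe, True if the surface (for example an
    AI Overview) appeared. Labels and min_probes are hypothetical.
    """
    if len(probes) < min_probes:
        # A single run cannot tell a real absence from a volatile trigger.
        return "unconfirmed"
    hits = sum(probes)
    if hits == len(probes):
        return "stable-present"
    if hits == 0:
        return "stable-absent"
    return "volatile"


# The "best AI SDR tool 2026" case: shown on an earlier probe, missing on the timed run.
status = trigger_status([True, False])
single_run = trigger_status([False])
```

Under these assumptions the query is labeled volatile instead of absent, and a single missed probe is reported as unconfirmed instead of being written into the audit as an absence.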

Why Perplexity is the citation-autopsy engine: Perplexity exposes a native numbered source list for every answer. ChatGPT renders citations as in-product chips and Google AI Overviews collapses its citation list into organic results, making source extraction unreliable. This is why citation-source autopsy in our own methodology rests on Perplexity's high-fidelity capture. In our June 2026 run, we logged 162 Perplexity citations across 20 queries, classified by domain type. The source-type mix: roughly 58 percent third-party review or listicle, 20 percent vendor first-party, 16 percent forum/UGC (Reddit, YouTube, LinkedIn), and 6 percent comparison aggregators (G2, TrustRadius). Source: Lucreya data.json, perplexityCitationsLogged=162, sourceTypeMix_perplexity_estimate. Snapshot 2026-06-07.
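As an illustration of how a source-type mix like the one above could be computed from a logged citation list, here is a minimal sketch. The domain sets come from the examples named in this article; the vendor domain, the sample URLs, and the rule that any unmatched domain counts as a review or listicle are assumptions made for the example, not the classifier used in the Lucreya run.

```python
from collections import Counter
from urllib.parse import urlparse

# Domain examples named in the article; a real audit would maintain longer lists.
FORUM_UGC = {"reddit.com", "youtube.com", "linkedin.com"}
AGGREGATORS = {"g2.com", "trustradius.com"}


def domain_of(url: str) -> str:
    """Return the hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify(url: str, vendor_domains: set[str]) -> str:
    domain = domain_of(url)
    if domain in vendor_domains:
        return "vendor first-party"
    if domain in FORUM_UGC:
        return "forum/UGC"
    if domain in AGGREGATORS:
        return "comparison aggregator"
    # Assumption: everything else is treated as third-party review or listicle.
    return "third-party review or listicle"


def source_mix(urls: list[str], vendor_domains: set[str]) -> dict[str, int]:
    """Percentage of citations per source type, rounded to whole percent."""
    counts = Counter(classify(u, vendor_domains) for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100 * n / total) for kind, n in counts.items()}


def off_vendor_share(urls: list[str], vendor_domains: set[str]) -> int:
    return 100 - source_mix(urls, vendor_domains).get("vendor first-party", 0)


# Illustrative citations only; vendor.example is a placeholder domain.
citations = [
    "https://www.reddit.com/",
    "https://zapier.com/blog/jasper-vs-copy-ai/",
    "https://www.g2.com/",
    "https://vendor.example/pricing",
]
mix = source_mix(citations, {"vendor.example"})
share = off_vendor_share(citations, {"vendor.example"})
```

Matching on exact domains keeps the sketch short; subdomains such as visible.seranking.com would need their own handling in a real classifier.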

What does the CONSENSUS Protocol scoring rubric look like?

The rubric scores each brand on eight dimensions, with the Engine-Consensus flag as the single most important output because it is the reading no blended score can produce.

The rubric below is the same one we apply on every audit we run. It is published here because making it public is what makes the methodology honest. An agency that runs a proprietary rubric you cannot see is asking you to trust a black box. The table is the centerpiece of every audit report we deliver, reproduced for the scored brand.

| Rubric dimension | What we measure | Why it matters | Output format |
| --- | --- | --- | --- |
| Category-locked prompts | 50-100 real buying-intent queries across all three tiers, drawn from the brand's actual category taxonomy. Published so the client can reproduce the run. | The query set IS the audit. Synthetic or undisclosed prompts make reproduction impossible. | Published prompt list with tier and vertical tags |
| Presence per engine | Is the brand named in the AI answer for each query, on each engine? Named in what position (first recommendation, secondary mention, comparison context)? | Presence is the prerequisite. A brand absent from all three engines on a Tier 1 query has zero pipeline contribution from that query. | Per-query per-engine presence grid (yes / secondary / absent) |
| Engine-Consensus flag | For each query where the brand is relevant, does it receive Consensus (all 3+ engines name it), Dissent (named by only one engine), Absent, or Due-diligence (named only in comparison context)? | This is the single reading a blended score cannot show. In our June 2026 GTM run, full engine agreement occurred on only 5 of the 14 category and intent queries (36%). A brand can look "visible" in aggregate while being absent on the two engines a buyer actually uses. | Four-state flag per query, rolled up to brand-level AECI reading |
| Off-vendor source share | Of the Perplexity citations for queries where the brand's category is named, what fraction point to third-party pages vs the brand's own site? | If 80 percent of citations are third-party, optimizing only your own site addresses 20 percent of the leverage surface. The audit maps which third-party domains are winning citations in your category. | Percentage off-vendor; top 10 cited third-party domains in your category |
| Source-type classification | Are the citations in your category review/listicle, vendor first-party, forum/UGC, or comparison aggregator? Which type is dominant? | Source-type distribution tells you where to invest in earned coverage. A category dominated by forum citations (like our GTM data showing Reddit cited in 75 percent of Perplexity answers) requires a different strategy than one dominated by G2 reviews. | Pie-slice breakdown of source types, with named domain examples |
| Share of Voice vs named rivals | How often is the brand named relative to the specific competitors the engines actually recommend in the category? | SoV against the named consensus winner (Apollo for prospecting, Clay for enrichment) tells you the real competitive gap, not an abstract percentage against "all possible answers." | Ranked mention table: brand vs top 3 named rivals per engine |
| Structural gap analysis | Does the brand's site carry the structural signals that the citation-winning pages in its category carry? Named author, original data, comparison table, schema markup, pricing, freshness date? | The structural exemplar we coded in our GTM run: zapier.com/blog/jasper-vs-copy-ai/. Article and BreadcrumbList schema, comparison table, pricing, a 2026 freshness date, approximately 2,600 words. Original first-party measurement is nearly absent from the cited set, which is exactly why it is differentiated. | Gap scorecard: 8 structural signals, pass/fail per signal, per page |
| AI-stated errors and sentiment | Does any engine state something factually wrong about the brand? What is the sentiment of how the brand is described when it is named? | A misstated price, a wrong feature claim, or a negative qualifier in an AI answer is a sales liability. Most visibility tools never read the answer for content, only for presence. | Error log per engine per query; sentiment label (positive / neutral / qualifying) |
| Snapshot date and decay rate | Every run carries a date. The report includes an estimate of re-run frequency needed to keep the flag current, based on observed category volatility. | AI answers are volatile. In our data, one query's AI Overview did not trigger on a timed re-run that had triggered on an earlier probe. A score without a date is not a measurement. | ISO 8601 snapshot date; recommended re-run cadence |
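To show how the Engine-Consensus flag could be derived from the per-engine presence grid, here is a minimal sketch. It assumes one presence label per engine for a given query; the label strings and engine keys are hypothetical. The rubric defines Consensus, Dissent, Absent, and Due-diligence but does not name the case where some but not all engines name the brand, so the "Partial" return value is an assumption borrowed from the "2-of-3" label used in the sample rows.

```python
NAMED = {"first", "secondary"}   # brand is recommended or mentioned in the answer
COMPARISON = "comparison"        # brand appears only in a comparison context
ABSENT = "absent"


def consensus_flag(presence: dict[str, str]) -> str:
    """Derive a per-query flag from {engine: presence label}."""
    if len(presence) < 3:
        raise ValueError("the rubric requires at least three engines")
    unknown = set(presence.values()) - NAMED - {COMPARISON, ABSENT}
    if unknown:
        raise ValueError(f"unknown presence labels: {sorted(unknown)}")

    named = [e for e, p in presence.items() if p in NAMED]
    compared = [e for e, p in presence.items() if p == COMPARISON]

    if not named and not compared:
        return "Absent"
    if not named:
        return "Due-diligence"
    if len(named) == len(presence):
        return "Consensus"
    if len(named) == 1:
        return "Dissent"
    # Assumption: the rubric does not name this state; label it as partial agreement.
    return f"Partial ({len(named)}-of-{len(presence)})"


flag = consensus_flag(
    {"chatgpt": "first", "perplexity": "absent", "google_aio": "secondary"}
)
```

The point of the sketch is that the flag is a function of per-engine rows, so it cannot be recovered from a single blended percentage.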

What does a sample audit report row look like?

A real audit report delivers per-query rows that show presence, consensus flag, and the top cited source in a single scannable table, not a dashboard percentage.

Below is the format of a redacted sample from our own June 2026 GTM run. Brand names and domain-specific citations for client brands are replaced here, but the structure is identical to what a paid audit delivers. This sample draws from our own measurement data, not a hypothetical.

Sample audit rows: GEO / AI visibility category (from Lucreya June 2026 measurement, snapshot 2026-06-07)
| Query | Engine-Consensus flag | Top cited 3rd-party domain | Client brand position |
| --- | --- | --- | --- |
| best GEO tool to track AI search visibility | Full diverge | reddit.com, visible.seranking.com | Brand X [REDACTED] |
| how to track brand mentions in ChatGPT | Full diverge | keyword.com, otterly.ai | Brand X [REDACTED] |
| best AI search optimization platform 2026 | Full diverge | rankability.com, yotpo.com | Brand X [REDACTED] |
| best AI visibility tool for agencies | 2-of-3 | tryprofound.com, reddit.com | Brand X [REDACTED] |
| best SEO content optimization tool 2026 | Consensus | zapier.com, onelittleweb.com | Brand X [REDACTED] |

Real data: queries S2, S4, S5, S7, S1 from Lucreya June 2026 GTM measurement, data.json. Brand position cells are redacted here; a paid audit fills these with your brand's actual placement. Snapshot and verification date: 2026-06-07.

Across the 14 category and intent queries in the run, full engine agreement appeared on 5 (36%), 2-of-3 engine agreement (partial consensus) on 6 (43%), and all-engine full divergence on 3 (21%); all three full-divergence queries fell in the GEO/AI-visibility category.
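The split reported for the 14 category and intent queries (5 full agreement, 6 partial, 3 full divergence) can be checked with a short roll-up. This is a sketch of the arithmetic only; the flag strings reuse the labels from the sample rows, and the list is rebuilt from the published counts, not from the underlying data.json.

```python
from collections import Counter


def rollup(flags: list[str]) -> dict[str, tuple[int, int]]:
    """Return {flag: (count, whole-percent share)} for a list of per-query flags."""
    counts = Counter(flags)
    total = len(flags)
    return {flag: (n, round(100 * n / total)) for flag, n in counts.items()}


# Per-query flags reconstructed from the published counts.
flags = ["Consensus"] * 5 + ["2-of-3"] * 6 + ["Full diverge"] * 3
summary = rollup(flags)
```

Keeping the denominator explicit (14 category and intent queries, not all 20) is what makes the percentages reproducible.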

The three full-divergence rows in this sample are not noise. All three fall in the GEO/AI-visibility category, and that pattern is the finding. Our data shows that in mature categories (prospecting, enrichment, content optimization) the engines have converged on consensus winners: Apollo.io for prospecting, Clay for enrichment, and Surfer SEO for SEO content optimization were each named top tool by two or three engines. The GEO category itself has not settled. This is the strategic opening: when engines disagree, no brand has locked the consensus flag, and the category is winnable.

What signals change, and when does pipeline impact show up?

Citation signals emerge in 3-4 weeks; pipeline impact is measurable at 60-90 days at citation rates of 20-30 percent, but individual AI answer volatility means any single snapshot can shift before that window closes.

The timeline is the thing most GEO vendors omit from their pitch decks. A buyer who receives an audit report wants to know: if we fix the structural gaps you identified, when will the Engine-Consensus flag change? The honest answer has two parts.

First, structural signals, meaning pages with the right schema, original data, named authors, and freshness dates, typically begin to appear in AI engine indexes in 3-4 weeks after publication, assuming they are crawlable and indexed by search engines. This is consistent with how the Princeton GEO research (Aggarwal et al., 2023) framed the indexing-to-citation pathway: the tactics that raise citation rates are structural, not just content-volume. (That study's ~40 percent visibility gain is an external controlled benchmark result; Lucreya's field measurement did not run a controlled intervention and does not replicate that figure.) The lag between publication and citation is roughly the time it takes the AI engine to re-crawl and re-weight the source.

Second, pipeline impact is a downstream measurement. An improvement in the Engine-Consensus flag means a brand shows up in AI answers that buyers are using to make decisions. At citation rates of 20-30 percent, where a brand appears in roughly one in five to one in three AI answers for high-intent queries, that pipeline contribution becomes measurable against demo-request and inbound attribution data at 60-90 days. Below 20 percent, the signal is present but statistically thin.

Q: What does a 20-30 percent citation rate actually mean in practice?
A: It means that for a given category buying-intent query, the brand appears in the AI answer on roughly 1-in-5 to 1-in-3 runs of that query. Because AI answers are non-deterministic, the same query run twice can produce different citations. A 20-30 percent rate suggests the brand is in the "consideration zone" for the engine but has not achieved a Consensus flag. The practical effect: a meaningful fraction of buyers researching the category will see the brand named, but not all of them.
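Because the same query can return different citations on each run, a citation rate is an estimate from a finite number of runs. The sketch below shows one way to attach an uncertainty range to that estimate, using the standard Wilson score interval. This is an illustration we are adding here, not a step defined in the CONSENSUS Protocol, and the run counts are hypothetical.

```python
import math


def citation_rate_interval(
    cited_runs: int, total_runs: int, z: float = 1.96
) -> tuple[float, float, float]:
    """Return (rate, low, high) using the Wilson score interval.

    z = 1.96 corresponds to an approximate 95 percent interval.
    """
    if total_runs <= 0 or not 0 <= cited_runs <= total_runs:
        raise ValueError("need 0 <= cited_runs <= total_runs and total_runs > 0")
    p = cited_runs / total_runs
    denom = 1 + z**2 / total_runs
    center = (p + z**2 / (2 * total_runs)) / denom
    half = (
        z
        * math.sqrt(p * (1 - p) / total_runs + z**2 / (4 * total_runs**2))
        / denom
    )
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: the brand is cited in 5 of 20 runs of one query.
rate, low, high = citation_rate_interval(5, 20)
```

With only 20 runs, an observed 25 percent rate is compatible with a true rate anywhere from about 11 to 47 percent, which is one reason a single snapshot should not be read as a settled position.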

One additional signal that accelerates pipeline impact: Reddit citation dominance. In our June 2026 measurement, Reddit was cited in 15 of 20 Perplexity answers (75 percent; verified 2026-06-07), far ahead of the next most-cited domain (Zapier at 30 percent, YouTube at 25 percent). A brand whose product generates genuine Reddit discussion, including honest comparisons and workflow threads, is building the citation substrate that Perplexity draws from most. An audit that does not map Reddit coverage is missing the single largest citation channel in the Perplexity source set. For a broader look at the tool landscape this audit methodology applies to, the Nesyona AI SEO tools index tracks the GEO tool category independently.

What distinguishes an honest GEO audit from a sales tool?

An honest audit publishes its query set, reports per-engine disagreement, attributes every figure to a dated snapshot, and does not claim improvements it has not yet produced.

The GEO audit market has a conflict-of-interest structure that is hard to miss. A vendor who sells you the audit also sells you the remediation. Their incentive is to find problems, then sell the fix. This does not make the audit wrong, but it does mean the audit's framing is oriented toward creating a remediation sale rather than giving you an independent read. The tells are predictable: an undisclosed query set, a blended score as the headline output, no third-party source map, no stated timeline or citation-rate assumption, and an audit that exists mainly to set up the upsell.

Our approach differs on each of these. The CONSENSUS Protocol query sets are published. The Engine-Consensus flag is the headline output, not the blended score. The third-party source map is a required deliverable. The timeline is stated as a practitioner estimate with the citation-rate assumption visible. And the GEO placement retainer is a separate, opt-in service, not the upsell that the audit exists to create. For how to read AI answers as a practitioner, our guide on how to rank in AI answers covers the structural content decisions that follow an audit.

What this methodology does not claim: The Lucreya GEO audit methodology does not guarantee Engine-Consensus flag changes on any timeline. Every figure in this article traces to our published data.json dataset, snapshot date 2026-06-07. AI answers are volatile; any figure may change before a follow-up run. We do not claim that Lucreya itself is cited by any engine. The 20-30 percent citation rate and 60-90 day pipeline impact estimates are practitioner estimates, not guarantees. DOI for the June 2026 dataset is pending Zenodo deposit; we will publish a real DOI when the deposit is live.

Run the CONSENSUS Protocol on your brand, free

The free AI Visibility Audit runs step one of the protocol: it returns your Engine-Consensus flag for your category across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date. No blended score, no black box. The full audit, with source-type classification and structural gap analysis, is the paid retainer deliverable.


Frequently asked questions

What does a GEO audit actually test?
A real GEO audit tests four things: (1) whether AI answer engines name your brand when buyers ask category buying-intent questions, (2) which third-party sources those engines cite when they do or do not name you, (3) whether the engines agree or disagree on your category (the Engine-Consensus flag), and (4) what errors or negative framing the engines state about you. A vendor audit that returns only a blended visibility score is skipping steps 2, 3, and 4, which are where the actionable data lives.
How many queries does a proper AI visibility audit run?
A minimum-viable audit runs 10-15 real buying-intent queries per vertical across 3-4 AI engines. A full audit runs 50-100 queries across a brand's complete category taxonomy. Lucreya's own GTM measurement ran 20 queries across 3 engines (60 total AI answers) as a focused vertical study; a full-brand audit for a multi-category vendor would expand that prompt set proportionally.
Which engines should a GEO audit cover?
The minimum floor is three engines: ChatGPT (web-search enabled), Perplexity (default web search), and Google AI Overviews. A full audit adds Claude.ai with web access as a fourth engine. In Lucreya's June 2026 measurement, Google AI Overviews triggered on 19 of 20 GTM buying-intent queries (95 percent), confirming the surface is nearly always present. A single-engine audit misses the divergence between them, which is the most important finding.
How long does it take to see pipeline impact from a GEO audit?
Signal on citation changes typically emerges in 3-4 weeks. Pipeline impact, meaning brand mentions showing up in buyer conversations or demo requests citing AI answers, typically becomes measurable at 60-90 days at citation rates of 20-30 percent. These are practitioner estimates; AI answer volatility means any individual snapshot can shift.
What is the difference between a GEO audit and ongoing GEO monitoring?
A GEO audit is a dated snapshot: it runs your query set once, classifies the sources, assigns the Engine-Consensus flag, and identifies the gaps. Monitoring re-runs the same query set on a cadence (monthly or quarterly) so that changes in the Engine-Consensus flag are visible over time. An audit answers "where do we stand"; monitoring answers "is our position improving or eroding." The Lucreya GEO placement retainer is the productized version of ongoing monitoring after an initial audit.
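The audit-versus-monitoring distinction can be sketched as a comparison of two dated snapshots of the same query set. The queries below come from the sample rows in this article; the second snapshot's date and flag values are invented for illustration and are not measured results.

```python
def flag_changes(
    previous: dict[str, str], current: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {query: (old_flag, new_flag)} for queries whose flag changed.

    Only queries present in both snapshots are compared, so the query set
    must stay fixed between runs for the diff to be meaningful.
    """
    return {
        query: (previous[query], current[query])
        for query in previous
        if query in current and previous[query] != current[query]
    }


# Audit snapshot (2026-06-07), flags as shown in the sample rows.
audit = {
    "best AI visibility tool for agencies": "2-of-3",
    "best SEO content optimization tool 2026": "Consensus",
    "how to track brand mentions in ChatGPT": "Full diverge",
}
# Hypothetical later monitoring run; these values are not real data.
rerun = {
    "best AI visibility tool for agencies": "Consensus",
    "best SEO content optimization tool 2026": "Consensus",
    "how to track brand mentions in ChatGPT": "Full diverge",
}
changes = flag_changes(audit, rerun)
```

A monitoring report under this sketch is the set of changed flags between two dated runs, which is why the audit's published query set and snapshot date have to be preserved.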

Bottom line

A real GEO audit is the CONSENSUS Protocol applied as a service. It runs 50-100 published buying-intent queries across at minimum three engines, classifies every Perplexity citation by source type, assigns the four-state Engine-Consensus flag per brand per query, maps the structural gaps between the brand's pages and the citation-winning pages in its category, and delivers a dated report reproducible by the client. The centerpiece finding is not a blended visibility score but the per-query consensus flag, because that is the reading that distinguishes "absent on two engines" from "visible but contested" from "consensus winner." In our June 2026 GTM measurement, the engines fully agreed on the top tool on only 5 of the 14 category and intent queries (36%; denominator is category and intent queries, not all 20 queries). The GEO category itself shows full divergence on all three of its category queries, meaning no brand has locked the consensus flag there yet. Run the first step free with our AI Visibility Audit, read the full CONSENSUS Protocol at lucreya.com/articles/the-consensus-protocol, and see the raw data behind every figure at our June 2026 GTM measurement.

  1. Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 Perplexity citations. Snapshot date 2026-06-07. lucreya.com/research/who-ai-recommends-gtm-2026/. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo, CC BY 4.0). Verified 2026-06-07.
  2. Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. Princeton GEO framework establishing that structural on-page signals (statistics, quotations, citations) can raise generative-engine visibility by up to roughly 40 percent in a controlled benchmark. (External controlled benchmark; this gain was not replicated or tested in Lucreya's field measurement, which did not run a controlled intervention.)
  3. Perplexity AI. perplexity.ai. Primary citation-source autopsy engine; exposes a native numbered citation list used for source-type classification across the 162 logged citations.
  4. Zapier. Jasper vs Copy.ai (2026 comparison). zapier.com/blog/jasper-vs-copy-ai/. Structural exemplar coded in the June 2026 measurement: Article and BreadcrumbList schema, comparison table, pricing, 2026 freshness, approximately 2,600 words. Representative of the citation-winning page archetype in the GTM category.
  5. Lucreya. The CONSENSUS Protocol: How to Measure AI Visibility Honestly (AECI Method, 2026). lucreya.com/articles/the-consensus-protocol. The open 8-step measurement standard that this audit methodology implements; defines AECI, Share of Voice, and the Engine-Consensus flag as named, schema-marked terms.
  6. OpenAI. ChatGPT. openai.com/chatgpt. Primary AI answer engine; web-search-enabled variant used for all query runs in the June 2026 measurement.
  7. Google. AI Overviews in Search. blog.google/products/search/generative-ai-search/. Third engine in the measurement; triggered on 19 of 20 GTM buying-intent queries in the June 2026 run.