What a Real GEO Audit Tests (Methodology, Not a Sales Pitch)
What does a GEO audit actually test?
A real GEO audit tests four distinct things that no blended visibility score can surface on its own: brand presence, source structure, cross-engine agreement, and AI-stated errors.
Most "AI visibility audits" on the market test exactly one thing: how often a brand is mentioned in AI answers to a proprietary prompt set, compressed into one percentage. That is the easiest figure to sell and the least actionable to receive. By the time a buyer sees "your AI visibility is 34 percent," three questions that matter for strategy are already buried:
- Which engines name you and which do not? (Absent on two out of three is not "34 percent visible"; it is a structural gap on the engines with 95 percent daily trigger rates.)
- What third-party sources do the engines cite when they recommend your category? (In our June 2026 GTM measurement, verified 2026-06-07, roughly 4 in 5 Perplexity citations pointed to third-party pages, not the vendor's own site. The leverage is off-domain.)
- Does the AI say anything factually wrong about you? (A misstated price or a wrong feature claim in an AI answer is a liability that a visibility score never measures.)
A GEO audit that answers all four questions, the presence question behind the headline score plus the three above, is an implementation of the CONSENSUS Protocol: the open 8-step measurement standard that defines the Answer-Engine Consensus Index (AECI), the four-state Engine-Consensus flag, and the per-engine Share of Voice metric. This page publishes the full methodology so that any team can evaluate whether an audit they receive, from Lucreya or anyone else, actually covers it.
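To make the gap between a blended score and a per-engine reading concrete, here is a minimal Python sketch. The presence grid, the query labels `q1` to `q3`, and both function names are invented for illustration; none of it is data from the June 2026 run.

```python
# Hypothetical presence grid: query -> engine -> was the brand named?
presence = {
    "q1": {"chatgpt": True, "perplexity": False, "google_aio": False},
    "q2": {"chatgpt": True, "perplexity": False, "google_aio": False},
    "q3": {"chatgpt": True, "perplexity": False, "google_aio": False},
}


def blended_score(grid):
    """Share of all (query, engine) answers that name the brand."""
    cells = [named for engines in grid.values() for named in engines.values()]
    return sum(cells) / len(cells)


def per_engine_presence(grid):
    """Share of queries on which each engine names the brand."""
    engines = sorted({engine for row in grid.values() for engine in row})
    return {e: sum(row[e] for row in grid.values()) / len(grid) for e in engines}


print(round(blended_score(presence), 2))  # 0.33
print(per_engine_presence(presence))
```

The blended figure reads as "33 percent visible", while the per-engine view shows the brand named on every query by one engine and on none by the other two.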
How many queries does a proper audit run, and across which engines?
A minimum-viable audit runs 10-15 real buying-intent queries per vertical across 3 engines; a full audit runs 50-100 queries across all category tiers and 4 engines.
The query set is the single most important audit decision. Queries cannot be synthetic or proprietary. They must match actual buyer search behavior: the same strings that buyers type into Perplexity or ChatGPT when evaluating tools. A complete audit covers three tiers:
- Tier 1, category queries: "Best [category] tool 2026" queries. These are the most competitive and highest-volume queries, and the ones where consensus winner brands are established. Example: "best sales prospecting tool 2026."
- Tier 2, comparison queries: "[Your brand] vs [competitor]" queries. Buyers use these late in evaluation, so the AI answer here shapes the last-mile decision. Example: "Profound vs Otterly."
- Tier 3, intent queries: "How to [job-to-be-done]" and "best tool for [specific workflow]" queries. These drive the early discovery that later becomes a category query. Example: "how to track brand mentions in ChatGPT."
The engine floor for any honest audit is three: ChatGPT with web search enabled, Perplexity on default web search, and Google AI Overviews via standard Google Search. A full audit adds Claude.ai with web access as a fourth engine. In our June 2026 measurement of 20 GTM buying-intent queries (verified 2026-06-07), Google AI Overviews triggered on 19 of 20 queries (95 percent). The one miss, "best AI SDR tool 2026," had shown an overview on an earlier probe, illustrating that AI Overview trigger volatility is a real measurement hazard: a single-run audit that misses a volatile trigger can report a false absence.
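One way to guard against a false absence is to record the same query on more than one dated run and label the result, rather than trusting a single probe. The sketch below is an illustration only; the function name and the three labels are assumptions, not part of the published protocol.

```python
def presence_over_runs(runs):
    """Label a query/engine pair from repeated runs.

    runs: list of booleans, one per dated run, True if the answer
    (for example an AI Overview) triggered on that run.
    """
    if not runs:
        raise ValueError("at least one run is required")
    hits = sum(runs)
    if hits == len(runs):
        return "stable-present"
    if hits == 0:
        return "stable-absent"
    return "volatile"  # report with both dates instead of as an absence


# An overview seen on an earlier probe but missing on the timed run:
print(presence_over_runs([True, False]))  # volatile
```

A "volatile" label tells the reader the absence was observed once, not that it is structural.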
What does the CONSENSUS Protocol scoring rubric look like?
The rubric scores each brand on the nine dimensions in the table below, with the Engine-Consensus flag as the single most important output because it is the reading no blended score can produce.
The rubric below is the same one we apply on every audit we run. It is published here because making it public is what makes the methodology honest. An agency that runs a proprietary rubric you cannot see is asking you to trust a black box. The table is the centerpiece of every audit report we deliver, reproduced for the scored brand.
| Rubric dimension | What we measure | Why it matters | Output format |
|---|---|---|---|
| Category-locked prompts | 50-100 real buying-intent queries across all three tiers, drawn from the brand's actual category taxonomy. Published so the client can reproduce the run. | The query set IS the audit. Synthetic or undisclosed prompts make reproduction impossible. | Published prompt list with tier and vertical tags |
| Presence per engine | Is the brand named in the AI answer for each query, on each engine? Named in what position (first recommendation, secondary mention, comparison context)? | Presence is the prerequisite. A brand absent from all three engines on a Tier 1 query has zero pipeline contribution from that query. | Per-query per-engine presence grid (yes / secondary / absent) |
| Engine-Consensus flag | For each query where the brand is relevant, does it receive Consensus (all 3+ engines name it), Dissent (named by only one engine), Absent, or Due-diligence (named only in comparison context)? | This is the single reading a blended score cannot show. In our June 2026 GTM run, full engine agreement occurred on only 5 of the 14 category and intent queries (36%). A brand can look "visible" in aggregate while being absent on the two engines a buyer actually uses. | Four-state flag per query, rolled up to brand-level AECI reading |
| Off-vendor source share | Of the Perplexity citations for queries where the brand's category is named, what fraction point to third-party pages vs the brand's own site? | If 80 percent of citations are third-party, optimizing only your own site addresses 20 percent of the leverage surface. The audit maps which third-party domains are winning citations in your category. | Percentage off-vendor; top 10 cited third-party domains in your category |
| Source-type classification | Are the citations in your category review/listicle, vendor first-party, forum/UGC, or comparison aggregator? Which type is dominant? | Source-type distribution tells you where to invest in earned coverage. A category dominated by forum citations (like our GTM data showing Reddit cited in 75 percent of Perplexity answers) requires a different strategy than one dominated by G2 reviews. | Pie-slice breakdown of source types, with named domain examples |
| Share of Voice vs named rivals | How often is the brand named relative to the specific competitors the engines actually recommend in the category? | SoV against the named consensus winner (Apollo for prospecting, Clay for enrichment) tells you the real competitive gap, not an abstract percentage against "all possible answers." | Ranked mention table: brand vs top 3 named rivals per engine |
| Structural gap analysis | Does the brand's site carry the structural signals that the citation-winning pages in its category carry? Named author, original data, comparison table, schema markup, pricing, freshness date? | The structural exemplar we coded in our GTM run: zapier.com/blog/jasper-vs-copy-ai/. Article and BreadcrumbList schema, comparison table, pricing, a 2026 freshness date, approximately 2,600 words. Original first-party measurement is nearly absent from the cited set, which is exactly why it is differentiated. | Gap scorecard: 8 structural signals, pass/fail per signal, per page |
| AI-stated errors and sentiment | Does any engine state something factually wrong about the brand? What is the sentiment of how the brand is described when it is named? | A misstated price, a wrong feature claim, or a negative qualifier in an AI answer is a sales liability. Most visibility tools never read the answer for content, only for presence. | Error log per engine per query; sentiment label (positive / neutral / qualifying) |
| Snapshot date and decay rate | Every run carries a date. The report includes an estimate of re-run frequency needed to keep the flag current, based on observed category volatility. | AI answers are volatile. In our data, one query's AI Overview did not trigger on a timed re-run that had triggered on an earlier probe. A score without a date is not a measurement. | ISO 8601 snapshot date; recommended re-run cadence |
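The four-state Engine-Consensus flag from the rubric can be sketched as a small classifier. This is a rough illustration under stated assumptions, not the protocol's reference implementation: the input encoding (`"recommended"`, `"comparison"`, `"absent"`) is invented here, Due-diligence is assumed to take precedence when no engine recommends the brand outright, and the rubric does not define the case where some but not all engines name the brand, so the sketch returns a separate `"Partial"` label for it.

```python
def consensus_flag(mentions):
    """mentions: engine name -> 'recommended' | 'comparison' | 'absent'."""
    named = [m for m in mentions.values() if m != "absent"]
    recommended = [m for m in named if m == "recommended"]
    if not named:
        return "Absent"
    if not recommended:
        return "Due-diligence"  # named only in comparison context
    if len(mentions) >= 3 and len(named) == len(mentions):
        return "Consensus"      # every engine (three or more) names the brand
    if len(named) == 1:
        return "Dissent"        # named by only one engine
    return "Partial"            # assumption: not one of the four published states


print(consensus_flag({"chatgpt": "recommended",
                      "perplexity": "absent",
                      "google_aio": "absent"}))  # Dissent
```

Keeping the undefined case as its own label avoids silently folding a two-of-three result into either Consensus or Dissent.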
What does a sample audit report row look like?
A real audit report delivers per-query rows that show presence, consensus flag, and the top cited source in a single scannable table, not a dashboard percentage.
Below is the format of a redacted sample from our own June 2026 GTM run. Brand names and domain-specific citations for client brands are replaced here, but the structure is identical to what a paid audit delivers. This sample draws from our own measurement data, not a hypothetical.
Real data: queries S2, S4, S5, S7, and S1 from the Lucreya June 2026 GTM measurement (data.json). Brand position cells are redacted here; a paid audit fills these with your brand's actual placement. Snapshot and verification date: 2026-06-07.
Across those 14 queries, 2-of-3 engine agreement (partial consensus) appeared on 6 (43%), and all-engine full divergence on 3 (21%); all three full-divergence queries fell in the GEO/AI-visibility category.
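The percentages can be checked directly from the counts reported for the 14 category and intent queries (5 full agreement, 6 partial, 3 full divergence). The dictionary keys below are labels chosen for this sketch.

```python
# Agreement counts reported for the 14 category and intent queries.
counts = {"full agreement": 5, "partial (2 of 3)": 6, "full divergence": 3}

total = sum(counts.values())
shares = {label: round(100 * n / total) for label, n in counts.items()}

print(total)   # 14
print(shares)  # {'full agreement': 36, 'partial (2 of 3)': 43, 'full divergence': 21}
```

The three states account for all 14 queries, so the denominator is the 14-query subset, not the full 20-query run.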
The three full-divergence rows in this sample are not noise. All three fall in the GEO/AI-visibility category, and that pattern is the finding. Our data shows that in mature categories (prospecting, enrichment, content optimization) the engines have converged on consensus winners: Apollo.io for prospecting, Clay for enrichment, and Surfer SEO for SEO content optimization were each named top tool by two or three engines. The GEO category itself has not settled. This is the strategic opening: when engines disagree, no brand has locked the consensus flag, and the category is winnable.
What signals change, and when does pipeline impact show up?
Citation signals emerge in 3-4 weeks; pipeline impact is measurable at 60-90 days at citation rates of 20-30 percent, but individual AI answer volatility means any single snapshot can shift before that window closes.
The timeline is the thing most GEO vendors omit from their pitch decks. A buyer who receives an audit report wants to know: if we fix the structural gaps you identified, when will the Engine-Consensus flag change? The honest answer has two parts.
First, structural signals, meaning pages with the right schema, original data, named authors, and freshness dates, typically begin to appear in AI engine indexes within 3-4 weeks of publication, assuming they are crawlable and indexed by search engines. This is consistent with how the Princeton GEO research (Aggarwal et al., 2023) framed the indexing-to-citation pathway: the tactics that raise citation rates are structural, not just a matter of content volume. (That study's ~40 percent visibility gain is an external controlled benchmark result; Lucreya's field measurement did not run a controlled intervention and does not replicate that figure.) The lag between publication and citation is roughly the time it takes the AI engine to re-crawl and re-weight the source.
Second, pipeline impact is a downstream measurement. An improvement in the Engine-Consensus flag means a brand shows up in AI answers that buyers are using to make decisions. At citation rates of 20-30 percent, where a brand appears in roughly one in five to one in three AI answers for high-intent queries, that pipeline contribution becomes measurable against demo-request and inbound attribution data at 60-90 days. Below 20 percent, the signal is present but statistically thin.
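The citation-rate assumption behind that estimate is simple arithmetic, sketched below. The function names are hypothetical, and treating rates above 30 percent the same as the 20-30 percent band is an assumption of this sketch; the text above only distinguishes the band from rates below 20 percent.

```python
def citation_rate(answers_naming_brand, total_answers):
    """Fraction of high-intent AI answers that name the brand."""
    if total_answers <= 0:
        raise ValueError("total_answers must be positive")
    return answers_naming_brand / total_answers


def signal_band(rate, floor=0.20):
    """Map a citation rate to the practitioner estimate described above."""
    if rate >= floor:
        return "measurable at 60-90 days"
    return "present but statistically thin"


print(signal_band(citation_rate(5, 20)))  # measurable at 60-90 days
print(signal_band(citation_rate(3, 20)))  # present but statistically thin
```

Stating the floor as an explicit parameter keeps the assumption visible to whoever reads the forecast.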
One additional signal that accelerates pipeline impact: Reddit citation dominance. In our June 2026 measurement (verified 2026-06-07), Reddit was cited in 15 of 20 Perplexity answers (75 percent), far ahead of the next most-cited domains (Zapier at 30 percent, YouTube at 25 percent). A brand whose product generates genuine Reddit discussion, including honest comparisons and workflow threads, is building the citation substrate that Perplexity draws from most. An audit that does not map Reddit coverage is missing the single largest citation channel in the Perplexity source set. For a broader look at the tool landscape this audit methodology applies to, the Nesyona AI SEO tools index tracks the GEO tool category independently.
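Figures like "cited in 15 of 20 answers" and the off-vendor source share can both be derived from a log of cited URLs per answer. The sketch below assumes such a log exists as a list of URL lists; the URLs, the vendor domain `vendor.example`, and the function names are invented for illustration, and subdomains other than `www.` are not normalized.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Lowercased host with a leading 'www.' removed."""
    return urlparse(url).netloc.lower().removeprefix("www.")


def domain_answer_share(answers):
    """Fraction of answers that cite each domain at least once."""
    counts = Counter()
    for urls in answers:
        counts.update({domain_of(u) for u in urls})  # once per answer
    return {d: n / len(answers) for d, n in counts.items()}


def off_vendor_share(answers, vendor_domain):
    """Fraction of all citations that point away from the vendor's own site."""
    cited = [domain_of(u) for urls in answers for u in urls]
    off = [d for d in cited
           if d != vendor_domain and not d.endswith("." + vendor_domain)]
    return len(off) / len(cited)


# Hypothetical citation log: one inner list per Perplexity answer.
answers = [
    ["https://www.reddit.com/r/sales/1", "https://zapier.com/blog/x"],
    ["https://reddit.com/r/sales/2", "https://vendor.example/pricing"],
    ["https://www.reddit.com/r/sales/3"],
    ["https://www.youtube.com/watch?v=abc"],
]
print(domain_answer_share(answers)["reddit.com"])  # 0.75
print(off_vendor_share(answers, "vendor.example"))
```

Counting a domain once per answer matches the "cited in N of M answers" framing, while the off-vendor share counts every citation.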
What distinguishes an honest GEO audit from a sales tool?
An honest audit publishes its query set, reports per-engine disagreement, attributes every figure to a dated snapshot, and does not claim improvements it has not yet produced.
The GEO audit market has a conflict-of-interest structure that is hard to miss. A vendor who sells you the audit also sells you the remediation. Their incentive is to find problems, then sell the fix. This does not make the audit wrong, but it does mean the audit's framing is oriented toward creating a remediation sale rather than giving you an independent read. The tells are predictable:
- The query set is proprietary and not shown to the client.
- The output is one blended score without per-engine breakdown.
- The report does not show which third-party domains are winning citations in the client's category.
- The improvement forecast is stated as a guaranteed outcome rather than a probabilistic estimate tied to a specific citation-rate assumption.
- The audit claims the vendor's own content placement as the primary lever without disclosing that the vendor is placing content that links to themselves.
Our approach differs on each of these. The CONSENSUS Protocol query sets are published. The Engine-Consensus flag is the headline output, not the blended score. The third-party source map is a required deliverable. The timeline is stated as a practitioner estimate with the citation-rate assumption visible. And the GEO placement retainer is a separate, opt-in service, not the upsell that the audit exists to create. For how to read AI answers as a practitioner, our guide on how to rank in AI answers covers the structural content decisions that follow an audit.
Run the CONSENSUS Protocol on your brand, free
The free AI Visibility Audit runs step one of the protocol: it returns your Engine-Consensus flag for your category across ChatGPT, Perplexity, and Google AI Overviews, with a snapshot date. No blended score, no black box. The full audit, with source-type classification and structural gap analysis, is the paid retainer deliverable.
Bottom line
A real GEO audit is the CONSENSUS Protocol applied as a service. It runs 50-100 published buying-intent queries across at least three engines, classifies every Perplexity citation by source type, assigns the four-state Engine-Consensus flag per brand per query, maps the structural gaps between the brand's pages and the citation-winning pages in its category, and delivers a dated report reproducible by the client. The centerpiece finding is not a blended visibility score but the per-query consensus flag, because that is the reading that distinguishes "absent on two engines" from "visible but contested" from "consensus winner." In our June 2026 GTM measurement, the engines fully agreed on the top tool on only 5 of the 14 category and intent queries (36%; the denominator is category and intent queries, not all 20 queries). The GEO category itself shows full divergence on all three of its category queries, meaning no brand has locked the consensus flag there yet. Run the first step free with our AI Visibility Audit, read the full CONSENSUS Protocol at lucreya.com/articles/the-consensus-protocol, and see the raw data behind every figure in our June 2026 GTM measurement at lucreya.com/research/who-ai-recommends-gtm-2026/.
- Lucreya original measurement. Who AI Recommends: GTM Tool and Source Citations Across ChatGPT, Perplexity, and Google AI Overviews (2026). 20 queries, 3 engines, 60 answers, 162 Perplexity citations. Snapshot date 2026-06-07 (verified 2026-06-07). lucreya.com/research/who-ai-recommends-gtm-2026/. Dataset DOI: 10.5281/zenodo.20632768 (Zenodo). CC BY 4.0.
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. GEO: Generative Engine Optimization. 2023. arxiv.org/abs/2311.09735. Princeton GEO framework establishing that structural on-page signals (statistics, quotations, citations) can raise generative-engine visibility by up to roughly 40 percent in a controlled benchmark. (External controlled benchmark; this gain was not replicated or tested in Lucreya's field measurement, which did not run a controlled intervention.)
- Perplexity AI. perplexity.ai. Primary citation-source autopsy engine; exposes a native numbered citation list used for source-type classification across the 162 logged citations.
- Zapier. Jasper vs Copy.ai (2026 comparison). zapier.com/blog/jasper-vs-copy-ai/. Structural exemplar coded in the June 2026 measurement: Article and BreadcrumbList schema, comparison table, pricing, 2026 freshness, approximately 2,600 words. Representative of the citation-winning page archetype in the GTM category.
- Lucreya. The CONSENSUS Protocol: How to Measure AI Visibility Honestly (AECI Method, 2026). lucreya.com/articles/the-consensus-protocol. The open 8-step measurement standard that this audit methodology implements; defines AECI, Share of Voice, and the Engine-Consensus flag as named, schema-marked terms.
- OpenAI. ChatGPT. openai.com/chatgpt. Primary AI answer engine; web-search-enabled variant used for all query runs in the June 2026 measurement.
- Google. AI Overviews in Search. blog.google/products/search/generative-ai-search/. Third engine in the measurement; triggered on 19 of 20 GTM buying-intent queries in the June 2026 run.