What AI Visibility Benchmarks Can and Cannot Prove

This benchmark measures page readiness. It does not guarantee AI citations.

That limitation is not fine print. It is central to using the benchmark honestly.

What It Can Support

Whether key pages are crawlable and indexable.
Whether pages are snippet-eligible.
Whether source content is visible in HTML.
Whether claims have evidence, methodology, or official citations.
Whether schema matches visible content.
Whether internal routes connect benchmark, support, service, proof, and contact pages.
Whether crawler access is being monitored.
Whether a page reads more like a reusable source than like generic service copy.
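
As an illustration of the crawlability item above, here is a minimal sketch of one narrow check: whether a robots.txt file allows a given crawler to fetch a page. It uses Python's standard-library robots.txt parser. The sample robots.txt, the example.com URL, and the function name crawler_access are assumptions made for this example and are not part of the benchmark's tooling. A robots.txt rule is only one input: indexability and snippet eligibility depend on other signals this sketch does not inspect.

```python
from urllib.robotparser import RobotFileParser


def crawler_access(robots_txt: str, page_url: str, user_agents: list[str]) -> dict[str, bool]:
    """Return, per user agent, whether robots.txt allows fetching page_url.

    Hypothetical helper for illustration; it parses robots.txt text that
    the caller has already downloaded, so it makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, page_url) for ua in user_agents}


# Made-up robots.txt: blocks one AI crawler, allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = crawler_access(
    SAMPLE_ROBOTS,
    "https://example.com/services",
    ["Googlebot", "GPTBot"],
)
print(result)
```

With this sample file, the search crawler is allowed and the AI crawler is blocked, which is the kind of mismatch a readiness review would flag.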

What It Cannot Prove

Stable LLM ranking position.
Guaranteed AI Overview, ChatGPT, Claude, Perplexity, Copilot, or other AI-answer citations.
Platform-wide visibility from one prompt test.
Causal attribution between one page edit and one AI citation.
Lead quality without analytics and CRM data.
Hidden platform trust, authority, or weighting systems.
Market-wide service-business averages.
Vertical-level performance where the vertical sample is too small.

2026.06 Sample Limits

Limitation	Why It Matters
50 reviewed records	Strong enough for a narrow benchmark, not enough for sweeping market claims.
Anonymized domains	Protects audited businesses, but prevents third-party URL-level rechecks from the public dataset.
Uneven vertical mix	Consulting, accounting, and agency sites make up 35 of 50 rows (70%), leaving 15 rows for all other verticals combined.
Public-page only	No private analytics, CRM, Search Console, server-log, or conversion data.
Snapshot timing	Crawler access, HTTP responses, and page content can change after collection.
Semantic review still has caveats	28 rows were approved with caveats, so review status should stay visible.
No actual citation tracking in this dataset	DUCR measures readiness, not confirmed AI answer citations.
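
The counts in the table translate into shares that are worth keeping next to any headline number. The sketch below only restates the arithmetic from the figures above (50 rows, 35 in the three dominant verticals, 28 approved with caveats). The variable and function names are illustrative assumptions.

```python
TOTAL_ROWS = 50            # reviewed records in the 2026.06 sample
DOMINANT_VERTICAL_ROWS = 35  # consulting, accounting, and agency sites
CAVEAT_ROWS = 28           # rows approved with caveats


def share(count: int, total: int = TOTAL_ROWS) -> float:
    """Fraction of the sample that a given count represents."""
    return count / total


dominant_share = share(DOMINANT_VERTICAL_ROWS)        # 0.70
caveat_share = share(CAVEAT_ROWS)                     # 0.56
other_vertical_rows = TOTAL_ROWS - DOMINANT_VERTICAL_ROWS  # 15

print(f"Dominant verticals: {dominant_share:.0%}")
print(f"Approved with caveats: {caveat_share:.0%}")
print(f"Rows left for all other verticals: {other_vertical_rows}")
```

With only 15 rows spread across every remaining vertical, any per-vertical figure outside the dominant three rests on very few records.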

Correct Claim Shape

Use this kind of language:

"In a reviewed anonymized sample of 50 service-business websites, pages were generally accessible to search and AI-retrieval crawlers, but median citable readiness was only 4/30."

Do not use this kind of language:

"Service-business websites cannot get cited by AI unless they follow this system."

The second claim is garbage. It asserts a causal requirement, while this benchmark measures page readiness in a 50-site sample, not citation outcomes.

Revision Policy

When official platform guidance changes, update the methodology and changelog. When the scoring model changes, preserve the old version and explain the change.

When the dataset expands beyond 50 records, update the sample-size language before changing any benchmark claims. Do not silently blend 2026.06 findings with later runs.
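
One way to make the "do not silently blend" rule hard to break is to have summary code refuse mixed-version input. The following is a minimal sketch under stated assumptions: each row is a dict with hypothetical "version" and "score" fields, and the scores shown are made-up placeholders, not benchmark data.

```python
from statistics import median


def median_readiness(rows: list[dict], version: str) -> float:
    """Median score for one benchmark version; rejects mixed-version input.

    Hypothetical helper: the "version" and "score" field names are
    assumptions for this example.
    """
    versions = {row["version"] for row in rows}
    if versions != {version}:
        raise ValueError(
            f"expected only version {version!r}, got {sorted(versions)}"
        )
    return median(row["score"] for row in rows)


# Placeholder rows for illustration only.
june_rows = [
    {"version": "2026.06", "score": 2},
    {"version": "2026.06", "score": 5},
    {"version": "2026.06", "score": 9},
]
later_rows = [{"version": "later-run", "score": 12}]

print(median_readiness(june_rows, "2026.06"))

try:
    median_readiness(june_rows + later_rows, "2026.06")
except ValueError as err:
    print(f"Refused to blend runs: {err}")
```

Raising an error instead of filtering quietly is deliberate: a blended median should fail loudly so the sample-size language gets updated first.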