# AI Visibility Benchmark 2026.06 Codebook

This codebook describes the anonymized, reviewed dataset for the SEO Informatica AI Visibility Benchmark.

## Dataset

- Rows: 50
- Unique anonymized domains: 50
- Public domains and URLs: withheld in the public dataset (`domain_public=false`).
- Collection period: June 3, 2026 (UTC), which is June 4, 2026 in IST.
- Review status: semantic review was applied before final scoring.
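
The dataset facts above can be checked mechanically after download. The following is a minimal sketch, assuming the public file is a CSV whose booleans are serialized as the text `false`; the inline two-row sample and the `check_dataset` helper are hypothetical and stand in for the real 50-row file.

```python
import csv
import io

# Hypothetical two-row excerpt; the values are invented for illustration only.
SAMPLE_CSV = """record_id,anonymized_site_id,domain_hash,domain_public
r001,site_001,ab12,false
r002,site_002,cd34,false
"""


def check_dataset(rows, expected_rows):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    hashes = [row["domain_hash"] for row in rows]
    if len(set(hashes)) != len(hashes):
        problems.append("duplicate domain_hash values")
    # Assumption: booleans are stored as the strings "true" / "false".
    if any(row["domain_public"].strip().lower() != "false" for row in rows):
        problems.append("domain_public is not false on every row")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_dataset(rows, expected_rows=2)
print(problems)
```

For the published file, the same helper would be called with `expected_rows=50`.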

## Field Groups

### Run/provenance

- `run_id`
- `record_id`
- `audit_date`
- `auditor_id`
- `collector_version`
- `source_html_hash`
- `crawl_timestamp`
- `crawl_user_agent`

### Sample metadata

- `sample_source`
- `anonymized_site_id`
- `domain_hash`
- `domain_public`
- `site_url`
- `vertical`
- `country`
- `region`
- `business_type`
- `lead_gen_type`

### Page URLs

- `homepage_url`
- `primary_service_url`
- `support_article_url`
- `about_url`
- `contact_url`
- `location_url`

### Crawl access

- `http_status_home`
- `http_status_service`
- `robots_url`
- `robots_fetch_status`
- `robots_allows_googlebot`
- `robots_allows_bingbot`
- `robots_allows_oai_searchbot`
- `robots_allows_gptbot`
- `robots_allows_chatgpt_user`
- `robots_allows_claudebot`
- `robots_allows_claude_searchbot`
- `robots_allows_claude_user`
- `robots_allows_perplexitybot`
- `robots_allows_perplexity_user`
- `waf_block_signal`
- `captcha_signal`
- `js_challenge_signal`
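
To illustrate how the `robots_allows_*` fields could be derived, here is a rough sketch using Python's standard `urllib.robotparser`. This is not necessarily how the benchmark collector works: the mapping from field names to user-agent tokens, the `robots_flags` helper, and the example robots.txt are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumed mapping from codebook fields to crawler user-agent tokens.
AGENT_FIELDS = {
    "robots_allows_googlebot": "Googlebot",
    "robots_allows_bingbot": "bingbot",
    "robots_allows_oai_searchbot": "OAI-SearchBot",
    "robots_allows_gptbot": "GPTBot",
    "robots_allows_chatgpt_user": "ChatGPT-User",
    "robots_allows_claudebot": "ClaudeBot",
    "robots_allows_claude_searchbot": "Claude-SearchBot",
    "robots_allows_claude_user": "Claude-User",
    "robots_allows_perplexitybot": "PerplexityBot",
    "robots_allows_perplexity_user": "Perplexity-User",
}


def robots_flags(robots_txt, url):
    """Evaluate one URL against robots.txt text for every tracked agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        field: parser.can_fetch(agent, url)
        for field, agent in AGENT_FIELDS.items()
    }


# Hypothetical robots.txt: blocks GPTBot, allows everyone else.
EXAMPLE_ROBOTS = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

flags = robots_flags(EXAMPLE_ROBOTS, "https://example.com/services/")
print(flags["robots_allows_gptbot"], flags["robots_allows_googlebot"])
```

Note that `urllib.robotparser` applies the first matching rule within a group rather than the longest match, so results can differ from a given crawler's own interpretation on files that mix overlapping `Allow` and `Disallow` rules.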

### Index/snippet

- `sitemap_found`
- `sitemap_url_count`
- `key_pages_in_sitemap`
- `meta_robots_service`
- `x_robots_tag_service`
- `noindex_present`
- `nosnippet_present`
- `max_snippet_value`
- `data_nosnippet_on_core_content`
- `canonical_url_service`
- `canonical_self_service`
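
As an illustration of how `noindex_present`, `nosnippet_present`, and `max_snippet_value` might relate to `meta_robots_service` and `x_robots_tag_service`, the sketch below parses comma-separated directive strings. The `parse_robots_directives` helper is hypothetical, and it deliberately ignores user-agent-scoped `X-Robots-Tag` values such as `googlebot: noindex`.

```python
def parse_robots_directives(*values):
    """Combine robots directive strings into the three index/snippet flags."""
    tokens = []
    for value in values:
        if value:
            tokens.extend(
                part.strip().lower() for part in value.split(",") if part.strip()
            )

    max_snippet = None
    for token in tokens:
        if token.startswith("max-snippet:"):
            try:
                max_snippet = int(token.split(":", 1)[1].strip())
            except ValueError:
                pass  # Malformed number: leave the value unset.

    return {
        # "none" is shorthand for "noindex, nofollow".
        "noindex_present": "noindex" in tokens or "none" in tokens,
        "nosnippet_present": "nosnippet" in tokens,
        "max_snippet_value": max_snippet,
    }


open_page = parse_robots_directives("index, follow, max-snippet:-1", None)
blocked_page = parse_robots_directives("NOINDEX", "nosnippet")
print(open_page)
print(blocked_page)
```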

### Page content

- `page_title_service`
- `meta_description_service`
- `h1_service`
- `visible_word_count_service`
- `h1_count_service`
- `h2_count_service`
- `h3_count_service`
- `answer_block_count`
- `table_count`
- `list_count`
- `faq_visible_count`
- `section_anchor_count`
- `hidden_core_content_flag`
- `pdf_core_info_flag`
- `image_only_core_info_flag`
- `rendered_text_ratio`

### Semantic/source review

- `org_name_present`
- `service_name_consistent`
- `location_info_present`
- `about_page_present`
- `contact_details_present`
- `author_name_present`
- `author_bio_url`
- `reviewer_name`
- `date_published`
- `date_modified`
- `version_history_present`
- `sameas_count`
- `external_citation_count`
- `official_source_citation_count`
- `original_data_present`
- `methodology_present`
- `limitations_present`
- `dataset_download_present`
- `proof_examples_count`
- `claim_support_score`
- `manual_review_status`
- `reviewer_notes`

### Schema

- `schema_types_detected`
- `organization_schema_present`
- `person_schema_present`
- `webpage_schema_present`
- `article_schema_present`
- `dataset_schema_present`
- `faqpage_schema_present`
- `breadcrumb_schema_present`
- `howto_schema_present`
- `schema_visible_content_match`
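
The `*_schema_present` flags could plausibly be derived from the `@type` values found in a page's JSON-LD. The sketch below shows one way to do that, assuming the JSON-LD has already been extracted as a string; the field-to-type mapping and the helper names are assumptions, and other syntaxes such as Microdata or RDFa are not covered.

```python
import json

# Assumed mapping from codebook fields to schema.org type names.
SCHEMA_FIELDS = {
    "organization_schema_present": "Organization",
    "person_schema_present": "Person",
    "webpage_schema_present": "WebPage",
    "article_schema_present": "Article",
    "dataset_schema_present": "Dataset",
    "faqpage_schema_present": "FAQPage",
    "breadcrumb_schema_present": "BreadcrumbList",
    "howto_schema_present": "HowTo",
}


def collect_types(node, found):
    """Walk nested JSON-LD and gather every @type string into `found`."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)


def schema_flags(jsonld_text):
    """Return the sorted detected types plus one boolean per tracked field."""
    found = set()
    collect_types(json.loads(jsonld_text), found)
    flags = {field: name in found for field, name in SCHEMA_FIELDS.items()}
    flags["schema_types_detected"] = sorted(found)
    return flags


# Hypothetical JSON-LD snippet for illustration.
EXAMPLE_JSONLD = """
{"@context": "https://schema.org",
 "@graph": [
   {"@type": "Organization", "name": "Example Co"},
   {"@type": ["WebPage", "FAQPage"],
    "breadcrumb": {"@type": "BreadcrumbList"}}
 ]}
"""

result = schema_flags(EXAMPLE_JSONLD)
print(result["schema_types_detected"])
```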

### DUCR scores

- `ducr_discoverable_score`
- `ducr_understandable_score`
- `ducr_citable_score`
- `ducr_routable_score`
- `ducr_total_score`
- `critical_blocker`

## Scoring Notes

DUCR is a 100-point score made up of four components: Discoverable (25 points), Understandable (25), Citable (30), and Routable (20). A critical blocker, recorded in `critical_blocker`, can cap a record's score or mark the record as unfit for publication.
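
The component maxima lend themselves to a simple consistency check. The sketch below assumes `ducr_total_score` is the plain sum of the four component scores; the codebook does not spell out how a critical blocker caps a score, so that rule is intentionally left out. The `ducr_total` helper and the example record are hypothetical.

```python
# Component maxima as stated in the scoring notes.
DUCR_MAX = {
    "ducr_discoverable_score": 25,
    "ducr_understandable_score": 25,
    "ducr_citable_score": 30,
    "ducr_routable_score": 20,
}


def ducr_total(record):
    """Sum the four components after checking each against its maximum."""
    total = 0.0
    for field, maximum in DUCR_MAX.items():
        score = float(record[field])
        if not 0 <= score <= maximum:
            raise ValueError(f"{field}={score} is outside 0..{maximum}")
        total += score
    return total


# Invented scores for illustration only.
example_record = {
    "ducr_discoverable_score": 20,
    "ducr_understandable_score": 18,
    "ducr_citable_score": 21,
    "ducr_routable_score": 15,
}

print(ducr_total(example_record))  # 74.0
```

Comparing this sum with the stored `ducr_total_score` is one way to spot records where a cap or a manual adjustment was applied.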

## Privacy Notes

The public dataset is anonymized. `domain_hash` permits deduplication without exposing the audited domain. Real URLs are retained only in internal working files.
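
The codebook does not state how `domain_hash` is computed. As a sketch of how a hash can support deduplication without revealing the domain, the example below uses a keyed SHA-256 (HMAC) over a normalized hostname. The normalization rules, the secret key, and the `domain_hash` function are assumptions, not the benchmark's actual procedure.

```python
import hashlib
import hmac
from urllib.parse import urlsplit


def domain_hash(url, secret_key):
    """Hash the normalized hostname of `url` with a secret key."""
    host = urlsplit(url).hostname or ""
    # Assumed normalization: treat "www." variants as the same site.
    if host.startswith("www."):
        host = host[4:]
    return hmac.new(secret_key, host.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; a real key would be kept out of the public dataset.
KEY = b"internal-only-secret"

first = domain_hash("https://www.example.com/services", KEY)
second = domain_hash("http://EXAMPLE.com/contact", KEY)
other = domain_hash("https://example.org/", KEY)

print(first == second)  # True: same site, different URLs
print(first == other)   # False: different domain
```

Using a secret key rather than a bare hash matters here: domain names are easy to enumerate, so an unkeyed hash could be reversed by hashing candidate domains and comparing.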
