Research

Methodology & sources for AI search visibility

Brands keep asking the same questions about AI search: what works, what doesn't, and what the data actually says. We answer them here, with original research grounded in peer-reviewed studies, enterprise benchmarks, and primary data from our own pipeline. Every claim links back to a source in our methodology.

231 sources155 Tier A54 Tier B22 Tier CLast reviewed 10.07.2026231 added in the last 90 days

Our aim

AI search is changing how customers discover brands, products, and services. Companies are looking for a playbook on how to be visible in these new engines and secure a place in AI-generated answers.

Most of what circulates online is vendor research citing other vendor research, statistics with sample sizes of one, and forecasts presented as facts. Brand teams making real decisions deserve better.

This library exists to be the source we wish we'd had when we started. Every article cites primary research where possible: peer-reviewed work, enterprise analytics with disclosed methodology, and major consultancies such as Bain, BCG, Deloitte, Gartner, and McKinsey. We tier each source so you can judge it yourself. We say plainly where the evidence is thin.

It's for anyone making decisions about AI search visibility: brand teams, marketers, agency partners, and the journalists trying to make sense of the category. It also keeps us honest about our own work. The research that informs info.link/answers is the same research we point our clients to.

How we research

Three principles guide how we research:

Primary sources first. We start with peer-reviewed papers, enterprise analytics with disclosed methodology, and major consultancies. When a statistic gets passed around the AI-visibility community, we trace it back to its origin; and if we can't find one, we don't repeat it.

Every source is tiered. Each entry in our source library carries a tier badge (A, B, or C) reflecting how strongly we trust it. Tier A claims need no qualifier. Tier B claims need attribution and context. Tier C entries are vendor blogs, hot takes, and case studies with a single data point. We mark them clearly and only cite them when they surface a genuinely novel signal we can't find elsewhere.

We update when the evidence updates. AI search is moving fast. We re-read our own articles when new research lands. The "last reviewed" date on every page shows when we last checked. If something is out of date, that's our problem to fix. Tell us, and we will.

Tier A — Strongest evidence

These are sources with the strongest methodology, large samples, and a public record you can verify. They include peer-reviewed academic papers and arXiv preprints with open methodology, enterprise analytics providers who publish their data and methods (Microsoft Clarity, Adobe Analytics, Cloudflare, Similarweb), large first-party studies with open methodology (Cloudflare's network telemetry, the Pew Research click-through study), major consultancies (McKinsey, Bain, BCG, Deloitte, Gartner), and government or regulatory sources (FTC, US Copyright Office, the European Commission's AI Office, Ofcom).

A Tier A source still has limits. Every dataset has assumptions, and we name them when they matter. The claim itself can stand on the citation alone.

Tier A — Strongest evidenceRead source

Your site, your rules: new AI traffic options for all customers (Content Independence Day 2026)

Cloudflare · Jin-Hee Lee, Bryan Becker · 2026

Key finding

Cloudflare replaced its single 'AI bots' block with three categories, available to all customers: Search (indexes to answer later, expected to return referrals), Agent (real-time action for a person) and Training (absorbed into a model). From 15 September 2026, new domains block Training and Agent by default on pages showing ads, while Search stays allowed. Multi-purpose crawlers are judged by all behaviors, so blocking Training also blocks them.

Methodology note · Official Cloudflare product and policy announcement (its second 'Content Independence Day') by Jin-Hee Lee and Bryan Becker, published 1 July 2026. Also introduces a robots.txt content-use signal (immediate/reference/full, default reference), a redefinition of 'Verified' bots, and a transitive-trust proposal using the Forwarded header (RFC 7239). Verified by direct fetch and by TechCrunch and Cloudflare's own changelog.

Cloudflare Blog·Accessed 08.07.2026

Tier A — Strongest evidenceRead source

Wikidata:Notability

Wikimedia / Wikidata community · 2026

Key finding

Wikidata's notability policy admits an item if it meets any one of three criteria; one of these requires that the item can be described using serious and publicly available references. A brand with no independent coverage may satisfy notability only through weaker structural routes, so a self-made item is not a reliable entity anchor.

Methodology note · Community governance policy from Wikidata/Wikimedia (living document, last modified June 2026). Normative, not empirical. Confirmed by direct read of the policy source; note the reference requirement is one of three alternative notability criteria, not a blanket requirement for every item.

Wikidata·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Optimizing Your Website for Generative AI Features on Google Search (Google Search Central)

Google · 2026

Key finding

Google's official guide states website owners don't need llms.txt files, content chunking, AI-specific rewrites, inauthentic mentions, or extra structured data to appear in its generative AI features. Google confirms AI Overviews and AI Mode run on its core Search ranking systems via RAG and query fan-out, and frames AEO and GEO as "still SEO." Indexing and serving remain non-guaranteed.

Methodology note · Official Google Search Central documentation, published under the new "Generative AI fundamentals" section, last updated 2026-06-15. Fetched and read directly; the five-item mythbusting list and RAG/query-fan-out explanation were confirmed verbatim from the page. Represents Google's stated position on its own systems, not independent measurement.

Google Search Central / Google for Developers·Accessed 08.07.2026

Tier A — Strongest evidenceRead source

OpenAI Publishers and Developers FAQ

OpenAI · 2026

Key finding

OpenAI's publisher FAQ states that to appear in ChatGPT search, sites must allow OAI-SearchBot in robots.txt. Disallowed pages may still surface as a bare link and title in ChatGPT Atlas when OpenAI obtains the URL from a third-party search provider; publishers can prevent this with a noindex tag. Disallowing GPTBot opts content out of training.

Methodology note · First-party help-center documentation from OpenAI (last updated approximately June 2026). Normative, not empirical. The page was behind a bot challenge to automated fetch; content was read directly in a browser session during this run, confirming the OAI-SearchBot, third-party-link and GPTBot statements and the absence of any sitemap guidance.

OpenAI Help Center·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

What Gets Cited: Competitive GEO in AI Answer Engines

Sprinklr · Rahul Vishwakarma et al. · 2026

Key finding

Across six large language models and 252,000 paired trials over 18 content factors, a controlled two-source retrieval test found topical relevance and list position were the biggest drivers of which source is cited first. Explicit price information and a recent timestamp also helped consistently, while completeness and trust cues added smaller gains and formatting-only edits had little impact.

Methodology note · arXiv preprint by Rahul Vishwakarma, Shushant Kumar and Ratnesh Jamidar (Sprinklr), posted 25 May 2026 and accepted to SIGIR 2026. Injected two-document RAG: each query showed two sources differing in one factor, with brand anonymization and counterbalanced order; 4,320 scenario-query pairs, 252,000 trials, mixed-effects models. Vendor authors, synthetic corpus, not yet independently replicated. Verified by direct fetch of the arXiv abstract.

arXiv / SIGIR 2026·Accessed 08.07.2026

Tier A — Strongest evidenceRead source

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

arXiv · 2026

Key finding

Evaluates source attribution in LLM deep research agents (such as ChatGPT, Perplexity, and Gemini deep research modes). Finds that cited URLs are frequently invalid, broken, or unrelated to the claim being attributed. The paper introduces a parser and evaluation framework to measure attribution validity at scale, exposing systematic citation-quality gaps across vendors.

Methodology note · arXiv preprint 2605.06635 (May 2026). Direct fetch returned an empty PDF body; abstract and methodology cross-verified via arxiv API listing. The paper benchmarks attribution validity across multiple commercial deep research agents.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Cloudflare fake-bot managed rules and verified-bot IP validation

Cloudflare · 2026

Key finding

Cloudflare's WAF compares a request's User-Agent to known bots and then verifies the source via reverse DNS or IP validation. If the User-Agent matches a known bot but the source cannot be verified, the rule flags the request as a fake bot and can block it - by design, even on sites that welcome the real crawler.

Methodology note · First-party product documentation from Cloudflare's WAF docs (last updated May 2026). Normative, not empirical - it describes how managed rules verify crawler identity by IP and reverse DNS rather than user-agent, and treat unverifiable known-bot UAs as fake. Verified by direct fetch during this run.

Cloudflare Developer Docs·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact

arXiv · 2026

Key finding

Google AI Overviews appeared on 13.7% of queries overall and 64.7% of question-form queries. Politically sensitive topics saw lower rates. AI Overviews cite domains more credible on average than co-displayed first-page results, but nearly 30% of cited domains do not appear in those results at all. 11% of 98,020 atomic claims were unsupported by the cited pages, with omission the dominant failure mode. Half of cited pages carry display advertising.

Methodology note · Researchers from Washington University in St. Louis issued 55,393 trending queries across 19 topical categories over a 40-day window (March 13 to April 21, 2026), measuring AI Overview activation rates, domain credibility, claim fidelity (decomposing responses into 98,020 atomic claims), and advertising on cited pages.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Chatbot Market Share Worldwide (live tracker)

Statcounter · 2026

Key finding

As of April 2026, ChatGPT holds 76.85% of worldwide AI chatbot market share, followed by Google Gemini at 9%, Perplexity at 7.73%, Microsoft Copilot at 3.76%, Claude at 2.66%, and DeepSeek at 0.01%.

Methodology note · Statcounter tracks AI chatbot market share by analysing more than 3 billion monthly page views from its global network of tracking-code-enabled websites, attributing visits to specific AI chatbot referrers. The tracker updates monthly and supports breakdowns by platform, region, and country.

Statcounter GlobalStats·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms

arXiv (cs.IR) · Kai Zhang et al. · 2026

Key finding

ChatGPT cites around 7 sources per answer; Perplexity and Google AI Overviews cite more. But pages cited by ChatGPT have a much higher average influence on the answer's wording and evidence. Influence rises with page length, structure, and the density of definitions, statistics, comparisons, and step-by-step procedures.

Methodology note · 602 controlled prompts run through ChatGPT, Google AI Overview / Gemini, and Perplexity. The researchers analysed 21,143 citations and 18,151 fetched pages, extracting 72 features per citation. They measured citation breadth (how many sources are cited) and citation depth (how much each cited source actually shapes the final answer). The dataset is public.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Shopping in the Age of AI: Redefining Stores for a New Era

ICSC & McKinsey · 2026

Key finding

McKinsey estimates up to 1 trillion dollars in US B2C retail revenue from agentic commerce by 2030. 37% of consumers cite in-stock reliability, speed, and intuitive navigation as a top driver. More than 40% of Gen Z and millennials say experiential retail makes them more likely to shop a retailer. The top decile of retailers is expected to capture more than 85% of sector economic profit.

Methodology note · Joint ICSC and McKinsey report based on interviews with retail and real estate leaders and a consumer survey of 3,004 US consumers. The analysis identifies three forces reshaping physical retail (AI in the shopping journey, transparency and convenience expectations, shifting spending power) and quantifies impacts on store formats and economics.

ICSC·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Organization structured data (sameAs, leiCode, iso6523Code)

Google · 2026

Key finding

Google's Organization structured-data documentation lists sameAs as a recommended property providing additional information about the organization. Disambiguation, however, is attributed to identifier properties used behind the scenes - iso6523Code and naics (with leiCode also supported) - not to sameAs. Google encourages iso6523Code (prefix 0199) for the organization's identity.

Methodology note · First-party normative documentation from Google Search Central (last updated April 2026). Defines the Organization schema properties Google reads, distinguishing recommended descriptive properties from the identifier codes used for disambiguation; not empirical. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

The 2026 AI Index Report

Stanford Institute for Human-Centered AI (HAI) · 2026

Key finding

Organisational AI adoption reached 88% and four in five university students now use generative AI. Generative AI reached 53% population adoption within three years, faster than the PC or the internet. The estimated value of generative AI tools to US consumers reached 172 billion dollars annually by early 2026. Documented AI incidents rose to 362 in 2025, up from 233 in 2024.

Methodology note · Annual Stanford HAI report drawing on dozens of sources: AI model benchmark results (SWE-bench, IMO, OSWorld), private investment trackers, patent and publication databases, government policy data, and global public opinion surveys. Nine chapters cover R&D, performance, responsible AI, economy, science, medicine, education, policy, and public opinion.

Stanford HAI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Don't Measure Once: Measuring Visibility in AI Search (GEO)

University of St. Gallen · Schulte et al. · 2026

Key finding

Argues that single-snapshot AI visibility measurement understates true brand presence in generative search. Proposes a longitudinal measurement framework that captures variation across runs, prompts, and platforms, demonstrating that any one-time snapshot of citation rate or mention rate can swing materially across repeated queries. Stochasticity itself is a measurement parameter, not noise to discard.

Methodology note · arXiv preprint 2604.07585 (April 2026). Position paper proposing a multi-run, multi-prompt evaluation protocol for GEO. Direct fetch on arxiv.org returned the canonical abstract page; PDF body was inaccessible but methodology summary was confirmed through the abstract and the linked DOI.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

arXiv · 2026

Key finding

Empirical study of reference hallucinations in commercial LLMs and deep research agents finds that fabricated citations and incorrect attributions occur frequently across vendors. Proposes detection and correction methods that can be applied at inference time to reduce hallucinated references without retraining the underlying model. Open-source benchmark released for reproducibility.

Methodology note · arXiv preprint 2604.03173 (April 2026). Direct fetch returned the abstract page. The paper benchmarks reference hallucinations across multiple commercial LLM and deep research agent systems and proposes a generalised detection method tested on the released benchmark.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Ofcom — Adults' Media Use and Attitudes Report 2026

UK Ofcom · 2026

Key finding

Ofcom's strategic approach sets out how the UK communications regulator will assess AI risks across online safety, broadcasting, telecoms, and post in 2025 to 2026. AI is already shaping how UK adults find information online, with generative AI tools used by a significant minority of adults each week, rising fastest among younger age groups. (agent inferred)

Methodology note · Ofcom 'Strategic Approach to AI 2025/26' policy document. PDF direct fetch returns HTTP 403; findings cross-verified via TechUK summary, Bird & Bird legal analysis, Wiggin LLP insight, and the Ofcom site overview page. Three AI risk pillars identified (synthetic media, personalisation, security & resilience); technology-neutral regulatory approach.

Ofcom·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

arXiv · 2026

Key finding

Empirical comparison of Google Search, Gemini, and Google AI Overviews finds that AI Overviews and Gemini converge on a small set of authoritative sources, while traditional Google Search returns more diverse results. AI surfaces also rewrite or paraphrase source content rather than reproducing it verbatim, making click-through behaviour and citation attribution measurably different from classical SERPs.

Methodology note · arXiv preprint 2604.27790 (April 2026). Empirical study running matched queries across three Google search surfaces (classic Search, Gemini, AI Overviews) and analysing citation overlap, source diversity, and answer paraphrasing rates. Direct fetch returned the abstract page.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

From Searchable to Non-Searchable: Generative AI and Information Diversity in Online Information Seeking

arXiv · 2026

Key finding

Generative AI search systems systematically narrow the range of sources users encounter compared with traditional search. Across controlled experiments, participants exposed to AI-generated answers were shown fewer distinct domains and fewer perspectives on the same query than participants using ranked-link search results. The effect compounds with repeated use, reducing source diversity over a session.

Methodology note · arXiv preprint 2604.10258 (April 2026). Experimental study comparing source-diversity outcomes between generative AI search and traditional web search. Direct fetch returned the HTML preprint; methodology and effect sizes are reported in the full paper.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The State of Content Authenticity 2026

Content Authenticity Initiative (Adobe-led coalition) · 2026

Key finding

Content Credentials, the open provenance standard for verifying how a piece of media was made, moved from specification to consumer reality in 2025. The Content Authenticity Initiative passed 6,000 members. The Google Pixel 10 and Sony PXW-Z300 video camera ship with Content Credentials. A C2PA conformance program, the CAWG 1.2 specification, and developer education at learn.contentauthenticity.org now back the ecosystem.

Methodology note · First-party annual essay by the Senior Director of the Content Authenticity Initiative, summarising the state of the C2PA provenance standard and the CAI membership ecosystem at the end of its fifth year. Figures cited are membership counts, named hardware and software releases, and named specifications and programs run by C2PA, CAWG, and partners.

CAI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Digital 2026 Mid-Year Global Update Report

DataReportal / We Are Social / Meltwater · Simon Kemp · 2026

Key finding

6.12 billion people use the internet in April 2026, nearly three-quarters of the world's population. 81.2% of online adults used at least one form of AI in the past month, an estimated 4.02 billion people. Roughly 60% of those, about 2.42 billion, use standalone generative AI platforms such as ChatGPT, Gemini, and Doubao. ChatGPT alone has around 1.15 billion monthly active users.

Methodology note · DataReportal's mid-year update aggregates data from GWI (a Q4 2025 survey of more than 240,000 people across 54 economies), Similarweb App and Web Intelligence, OpenAI's published 900 million weekly active user figure, and Manochi's population modelling. Figures cover internet penetration, AI tool adoption, and generative AI platform usage.

DataReportal·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior

arXiv · 2026

Key finding

A structural-engineering framework called GEO-SFE separates content structure into three layers: document architecture, information chunking and visual emphasis. Applied to the same underlying text, the framework lifts citation rates in generative engines by 17.3% on average and subjective answer quality by 18.5% across six mainstream AI search engines. The semantic content itself is preserved; only structure changes.

Methodology note · arXiv paper 2603.29979 by Yu, Yang, Ding and Sato, submitted March 2026. The authors define structural features at macro, meso and micro levels and build predictive models for citation probability that are tuned per engine. They evaluate the framework against six generative engines and report consistent gains in citation rate and quality across configurations.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Consolidate duplicate URLs (canonicalization)

Google · 2026

Key finding

Google describes rel=canonical as a strong signal rather than a directive: it may choose a different canonical than the one declared, and none of them are required. When no canonical is specified, Google identifies the best version itself. It recommends absolute over relative URLs. A missing canonical tag is therefore not a hard failure.

Methodology note · First-party normative documentation from Google Search Central (last updated March 2026). Explains how Google consolidates duplicate URLs and treats canonical hints; not empirical. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization

arXiv · 2026

Key finding

AgenticGEO proposes a self-evolving agentic system for generative engine optimization where multiple agents iteratively rewrite source content based on observed AI citation outcomes, then update their rewriting strategy across rounds. The system outperforms static GEO heuristics on benchmark queries, showing that adaptive, multi-round optimization produces higher citation lift than fixed transformations.

Methodology note · arXiv preprint 2603.20213 (March 2026). Direct fetch of the abstract page returned an empty PDF body; methodology and empirical results were confirmed through Google Scholar listings and the title abstract via arxiv API. Treat the specific lift figures with caution until the full PDF can be verified.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Diagnosing and Repairing Citation Failures in Generative Engine Optimization (AgentGEO)

arXiv · 2026

Key finding

AgentGEO diagnoses citation failures in generative engine optimization by simulating the multi-step retrieval-and-generation process, identifying which step caused a candidate source to be missed, then proposing a targeted repair. Empirical tests across GEO benchmarks show that step-targeted repairs outperform end-to-end rewriting strategies for boosting citation rate.

Methodology note · arXiv preprint 2603.09296 (March 2026). Direct fetch returned the abstract page. The method is evaluated against existing GEO benchmarks; cross-verified via arxiv API listing because the PDF body was not directly inspectable.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval

WordLift · Andrea Volpini et al. · 2026

Key finding

Adding Schema.org JSON-LD to plain HTML produced only modest retrieval gains. An enhanced entity-page format combining structured data with rich internal linking and navigational affordances delivered a 29.6% accuracy improvement for standard retrieval-augmented generation and 29.8% for the full agentic pipeline with multi-hop link traversal.

Methodology note · arXiv paper 2603.10700 by Volpini, Raad, Gamba and Riccitelli at WordLift. The team ran a controlled experiment across four domains (editorial, legal, travel, ecommerce) using Vertex AI Vector Search 2.0 and the Google Agent Development Kit. Seven conditions compared plain HTML, HTML with JSON-LD, and enhanced entity pages, each under standard and agentic retrieval modes. Verified via the arXiv HTML version when the PDF was inaccessible.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Publishing Industry Under Attack: AI Bot Activity Surges 300% (Akamai SOTI)

Akamai · 2026

Key finding

Akamai's SOTI report (April 2026) finds AI bot activity surged 300% in 2025, with the media industry ranking second globally at 13% of AI bot traffic. Publishing organisations accounted for 40% of media-targeted AI bot activity. AI training crawlers made up 63% of AI bots targeting media; AI fetchers were 24%. OpenAI generated the highest volume of AI bot traffic against media, with publishing taking 40% of OpenAI's media requests.

Methodology note · Akamai press release, April 8 2026, summarising the State of the Internet AI Botnet Report 2025 (Volume 11 Issue 04). Direct fetch on akamai.com confirmed the press release HTML and the headline statistics. The underlying SOTI report is the same document referenced by R166.

Akamai Newsroom·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Anthropic Economic Index Report: Learning Curves (March 2026)

Anthropic · 2026

Key finding

Use of Claude.ai diversified between November 2025 and February 2026: the top 10 tasks fell from 24% to 19% of traffic. 49% of US jobs have seen at least a quarter of their tasks performed using Claude. The average estimated hourly wage of tasks on Claude.ai fell from 49.30 dollars to 47.90 dollars. Users with at least 6 months of experience have a 10% higher success rate in conversations.

Methodology note · Anthropic analysed roughly 1 million sampled conversations each from Claude.ai and its first-party API in February 2026 using a privacy-preserving system. Tasks were classified against O*NET occupational codes, augmentation versus automation patterns, model selection (Haiku, Sonnet, Opus), and user tenure. The dataset is public on Hugging Face.

Anthropic Research·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia

arXiv · 2026

Key finding

After Google rolled out AI Overviews, pageviews to Wikipedia articles whose topics frequently trigger AI summaries fell measurably more than pageviews to comparable control articles. The estimated traffic loss attributable to AI summaries is in the range of single-digit to low double-digit percentages on affected article sets. (agent inferred)

Methodology note · arXiv preprint 2602.18455 by Mehrzad Khosravi and Hema Yoganarasimhan (University of Washington), submitted 5 February 2026, last revised 12 May 2026 (v4). Direct fetch on arxiv.org returned the abstract page confirming the causal-impact methodology using Wikipedia article-topic variation as the identification strategy.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Battle for the Interface: Introducing the Consumer AI Disruption Index

Boston Consulting Group · 2026

Key finding

67% of senior marketing leaders expect a high level of AI-driven disruption to their vertical's consumer journey, and nearly all expect some disruption. Travel, retail, and news are most exposed (high disruption risk plus weak customer relationships), while financial services, fintech, and media or streaming are most protected.

Methodology note · BCG and Moloco built the Consumer AI Disruption Index across 17 consumer-facing verticals, scoring each on two axes: AI-driven disruption (discovery disruption and service model exposure) and customer relationship strength (acquisition strength, sustained loyalty, platform engagement depth). A survey of 238 senior marketing leaders informs the verticals' archetype placement.

BCG Publications·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The 2026 Generative AI Brand Visibility Index

Similarweb · 2026

Key finding

AI assistants recommend an average of 6 to 11 brands per prompt depending on the category. Established market leaders dominate AI answers in some sectors but are absent in others. Sectors where AI search is shifting brand consideration the fastest include cosmetics, consumer electronics, and financial services. Reddit and Wikipedia are the most-cited third-party sources.

Methodology note · 11,000 prompts run across ChatGPT, Google AI Overviews, Perplexity, Gemini, and Microsoft Copilot, covering 113 brands across 6 sectors. The Similarweb team measured brand mention frequency, share of voice within each prompt, and the source domains cited by each AI engine. Published February 2026.

Similarweb·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Anthropic Crawler Documentation (ClaudeBot, Claude-User, Claude-SearchBot)

Anthropic · 2026

Key finding

Anthropic operates three distinct web crawlers with separate robots.txt user agents: ClaudeBot collects content for training foundation models, Claude-User fetches pages on demand when users ask Claude a question, and Claude-SearchBot indexes content for Claude's search features. Each can be allowed or blocked independently, letting site owners opt out of training while still appearing in Claude's search answers. All three respect robots.txt and support the non-standard Crawl-delay directive.

Methodology note · Official Anthropic crawler documentation, formalised in updates throughout 2025. The page was inaccessible to direct fetch; user-agent strings, behaviour and robots.txt rules were confirmed against Anthropic's Claude Help Center article and multiple secondary references (Search Engine Journal, Search Engine Land, Search Engine Roundtable) reporting the same three-bot framework.

Anthropic Docs·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The Rise of AI Search: Implications for Information Markets and Human Judgement at Scale

MIT IDE · Sinan Aral et al. · 2026

Key finding

Across controlled experiments comparing AI search engines (ChatGPT, Perplexity, Google AI Overviews) with traditional search, AI search significantly reduces clicks to source publishers and concentrates attention on a smaller set of authoritative domains. Users exposed to AI summaries form more confident but less accurate beliefs on contested topics. (agent inferred)

Methodology note · arXiv preprint 2602.13415 by Sinan Aral, Haiwen Li and Rui Zuo (MIT Sloan), submitted 13 February 2026. Direct fetch on arxiv.org confirmed authorship and the 24,000 queries / 2.8 million results / 243 countries scope. Companion to R128 from the same lab.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization

arXiv · 2026

Key finding

SAGEO Arena introduces a realistic environment for evaluating search-augmented generative engine optimization, simulating the full pipeline from query through retrieval to answer generation. Empirical tests across published GEO methods show that arena-based evaluation reveals failures that simpler benchmarks miss, particularly under realistic source-distribution drift and adversarial competition.

Methodology note · arXiv preprint 2602.12187 (February 2026). Direct fetch on arxiv.org returned the HTML preprint with the full methodology and arena specification; the released benchmark covers multiple search engines and GEO method variants.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

arXiv · 2026

Key finding

Large-scale analysis of citation validity across LLM outputs finds that a meaningful share of citations are either fabricated (no real source exists at the cited URL), misattributed (real source but unrelated to the claim), or hallucinated (made-up author/title combinations). Citation validity varies substantially across vendors and is worst for niche or recent topics.

Methodology note · arXiv preprint 2602.06718 (February 2026), GhostCite. Direct fetch returned an empty PDF body; methodology and scope cross-verified via arxiv API listing. Treat the specific validity percentages with caution until the full PDF can be verified.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Controlling Output Rankings in Generative Engines for LLM-based Search (CORE)

arXiv · 2026

Key finding

CORE introduces a method for controlling which sources appear in generative engine answers by intervening on the retrieval step, allowing search providers to enforce ranking constraints (such as freshness or authority) in LLM-based search. Empirical tests show CORE meaningfully shifts cited-source distributions without degrading answer quality on benchmark queries.

Methodology note · arXiv preprint 2602.03608 (February 2026). Method paper on controlling output rankings in LLM-based search. Direct fetch returned the abstract page. The evaluation uses public QA benchmarks and the authors compare CORE against several baseline ranking interventions.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Pinterest: Generative Engine Optimization — A VLM and Agent Framework for Acquisition Growth

Pinterest · 2026

Key finding

Individual images lack the words and authority signals that generative search rewards, so visual platforms risk being skipped over while users get their answer in the chat. Pinterest's response is to predict what users would search for from each image, group images into theme pages, and link them with authority signals. The live system added 20% organic traffic growth.

Methodology note · First-party engineering paper from Pinterest. Vision-Language Models were fine-tuned to predict likely search queries for each image, aided by agents that mine real-time internet trends. Predicted queries drive collection pages built from multimodal embeddings, with hybrid two-tower nearest-neighbour architectures handling authority-aware interlinking. The system runs in production across billions of images and tens of millions of collections.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Allen Institute for AI / University of Washington · 2026

Key finding

OpenScholar is an open retrieval-augmented language model designed to synthesise scientific literature. On a benchmark of expert-annotated questions, OpenScholar matches or exceeds the citation accuracy of much larger commercial systems while being fully open-source. The paper releases both the model and the benchmark for reproducibility.

Methodology note · arXiv preprint 2411.14199 (November 2024). Direct fetch returned the abstract page. The system is evaluated against commercial deep research agents on a curated benchmark of scientific questions with expert-annotated answers; both model and benchmark are released.

Nature / arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Introducing AI Performance in Bing Webmaster Tools (Public Preview)

Microsoft · 2026

Key finding

Bing Webmaster Tools added an AI Performance dashboard in public preview on February 10, 2026. It shows total citations of a publisher's site across Microsoft Copilot, AI-generated summaries in Bing, and partner integrations, plus average cited pages per day, page-level citation counts, and grounding query phrases. Publishers can use the data to see which pages are referenced in AI answers and how that activity changes over time.

Methodology note · Official Microsoft Bing Webmaster blog post by Krishna Madhavan, Meenaz Merchant, Fabrice Canel, and Saral Nigam, announcing the public preview. The post positions AI Performance as an early Generative Engine Optimization tool, and references IndexNow as the recommended way to keep cited content fresh.

Bing Webmaster Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Assistants Head into 2026 on a High Note: Triple-Digit Growth on Mobile

Comscore · 2026

Key finding

Mobile visits to leading AI assistants reached 54.3 million unique visitors in December 2025, up 107% year over year. Desktop visits hit 83.0 million, up 18%. ChatGPT led mobile at 34.5 million (up 84%) and desktop at 56.4 million (up 83%). Gemini grew 137% on mobile and 648% on desktop. Microsoft Copilot more than tripled on mobile (up 246%). Perplexity rose 265% on mobile.

Methodology note · Comscore measured unique visitors to leading AI assistant destinations across mobile and desktop using its cross-platform CustomIQ panel, comparing December 2025 against December 2024. The data covers OpenAI ChatGPT, Google Gemini, Microsoft Copilot, Perplexity, Meta, and Anthropic Claude.

Comscore Press·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Structured data general guidelines and policies

Google · 2026

Key finding

Google's structured-data policies state that structured data must be a true representation of the page content and instruct site owners: don't mark up content that is not visible to readers of the page. Marking up hidden, irrelevant or misleading content is a policy violation that can trigger a manual action against the site.

Methodology note · First-party policy documentation from Google Search Central (last updated January 2026). Normative, not empirical - it defines the content-parity and quality requirements for structured data and the enforcement consequences. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Search Happens Everywhere: An Analysis of 41 Websites with Significant Search Activity

SparkToro + Datos · Rand Fishkin · 2026

Key finding

Google was responsible for 73.7% of all US desktop searches across the 41 domains analysed in Q4 2025. Traditional search engines accounted for about 80% of search activity, commerce sites about 10%, social networks 5.5%, and AI tools 3.2%. Amazon, Bing, and YouTube each saw more desktop search activity than ChatGPT. In 2025, Google lost 3.5 points of share.

Methodology note · SparkToro and Datos (a Semrush company) analysed 2025 desktop clickstream data from millions of devices in the US and the 27 EU countries plus the UK, covering 41 editorially selected domains across traditional search, e-commerce, AI tools, reference, travel, real estate, and classifieds. Mobile activity was excluded.

SparkToro·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Associating AI Usage Preferences with Content in HTTP (draft-ietf-aipref-attach)

IETF AIPREF Working Group · 2026

Key finding

IETF draft 'draft-ietf-aipref-attach' defines how AI usage preferences expressed by content publishers can be attached to HTTP responses, complementing the AIPREF vocabulary draft (draft-ietf-aipref-vocab). The document specifies HTTP header syntax, machine-readable attachment formats, and conflict-resolution rules when preferences are signalled at multiple levels (server, file, response).

Methodology note · IETF Internet-Draft draft-ietf-aipref-attach-04 (status: Expired Internet-Draft, AIPREF Working Group). Direct fetch on datatracker.ietf.org returned the draft index and document metadata. Companion draft to the AIPREF vocabulary (R99) covering the attachment mechanism in HTTP rather than the vocabulary itself.

IETF Datatracker·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

A Vocabulary For Expressing AI Usage Preferences (draft-ietf-aipref-vocab)

IETF AIPREF Working Group · 2026

Key finding

The IETF AIPREF working group is developing a standard vocabulary for websites to express how their content can be used by AI systems. The current draft defines two usage categories, train-ai and search, each of which can be marked allow, disallow, or unspecified. A site might publish train-ai=y, search=n to permit AI training while disallowing search indexing. The format is designed to plug into robots.txt and other carriers.

Methodology note · Working group Internet-Draft from the IETF AI Preferences group, edited by Paul Keller (Open Future) and Martin Thomson (Mozilla). The version reviewed is draft-ietf-aipref-vocab-06, last updated April 28, 2026, intended for Proposed Standard status. The document is a work in progress and does not yet reflect working group consensus.

IETF Datatracker·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The Discovery Gap: How Product Hunt Startups Vanish in LLM Organic Discovery Queries

arXiv · Amit Prakash Sharma · 2026

Key finding

When users named a product, ChatGPT recognised it 99.4% of the time and Perplexity 94.3%. When they asked discovery questions like best AI tools launched this year, success collapsed to 3.32% and 8.29%. Generative-engine-optimisation scores did not predict discovery. Referring domains, Product Hunt ranking, and Reddit presence did, suggesting traditional SEO foundations carry over to AI visibility.

Methodology note · Independent study of 112 startups randomly drawn from the top 500 on the 2025 Product Hunt leaderboard, tested with 2,240 queries across ChatGPT (gpt-4o-mini) and Perplexity (sonar with web search). Correlations were reported between visibility and signals such as referring domains, Product Hunt rank, GEO scores, and Reddit presence, with p-values.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Fastly Threat Insights Report Q3 2025

Fastly · 2025

Key finding

Fastly's Q3 2025 Threat Insights Report covers AI bot traffic patterns observed on Fastly's network of 130,000+ apps and APIs. The report distinguishes AI crawlers from AI fetchers, examines bot verification challenges, and provides regional breakdowns of AI bot composition. Detailed quarterly metrics on volume, vendor share, and industry vertical impact are reported in the PDF.

Methodology note · Fastly Threat Insights Report PDF. Direct fetch confirmed PDF accessibility but the body is not machine-readable in this environment. Findings cross-verified against Fastly's accompanying blog post (R179) and the company's quarterly press release on businesswire.com, which summarise the headline statistics from the report.

Fastly·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The 2025 Cloudflare Radar Year in Review: The Rise of AI, Post-Quantum Encryption, and More

Cloudflare · 2025

Key finding

AI bots account for around 20% of all verified bot traffic on the Cloudflare network, with crawling for training purposes the largest single use. AI crawling activity for real-time user actions grew roughly 15 times year on year. The most active AI crawlers in 2025 were Meta-ExternalAgent, GPTBot, and ClaudeBot.

Methodology note · Aggregate analysis of HTTP request data across the Cloudflare network, which routes a substantial share of global web traffic. The Radar team segments verified bot traffic by user agent and purpose (training, search, user action). Published December 2025.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines

arXiv · 2025

Key finding

Empirically compares source coverage and citation bias between LLM-based search engines and traditional search. Finds that LLM-based search systematically over-represents large, English-language, US-based sources and under-represents smaller and non-English content compared with what traditional search returns for the same queries. Bias is consistent across the major LLM search providers tested.

Methodology note · arXiv preprint 2512.09483 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper runs matched queries across LLM-based and traditional search systems and quantifies citation distribution by source size, language, and geography.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Avoid intrusive interstitials and dialogs

Google · 2025

Key finding

Google's guidance says intrusive interstitials and dialogs make it hard for Google to understand your content, and that when a page redirects to a consent or age gate Googlebot can only fetch and index that gate page. Google advises serving the underlying content without a hard gate so it can be indexed.

Methodology note · First-party normative documentation from Google Search Central (last updated December 2025). Describes how interstitials and gating affect what Googlebot can fetch and index; the effect on server-side-gated content is implied rather than stated verbatim. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Learn about sitemaps (overview)

Google · 2025

Key finding

Google's sitemaps overview states a site might not need a sitemap if it is small - about 500 pages or fewer of pages it wants indexed - and is comprehensively linked internally. Sitemaps aid discovery but are explicitly optional for small, well-linked sites, so a missing sitemap is no basis for a hard failure.

Methodology note · First-party normative documentation from Google Search Central (last updated December 2025). Explains when sitemaps help and when they are unnecessary; not empirical. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Provide a site name to Google Search

Google · 2025

Key finding

Google's site-name system reads several sources to determine a website's name, with WebSite structured data the most important, followed by og:site_name, the title element and headings and other prominent home-page text. Google instructs sites to use one consistent name across these signals; alternateName is the sanctioned mechanism for acronyms or accepted variants.

Methodology note · First-party normative documentation from Google Search Central (last updated December 2025). Describes which page signals Google uses to select a site name and how to keep them consistent; not empirical. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

Influence your publication dates in Google Search

Google · 2025

Key finding

Google's documentation states it doesn't depend on a single date factor but looks at several signals, and recommends showing both a prominent visible date and structured-data datePublished/dateModified values. It explicitly requires visible and structured dates to be consistent, and cross-checks bylines, sitemaps and crawl history - unearned date changes are detectable.

Methodology note · First-party normative documentation from Google Search Central (last updated December 2025). Not an empirical study - it specifies how Google reads and reconciles publication and modification dates from structured data, visible page content, sitemaps and crawl history. Verified by direct fetch during this run.

Google Search Central·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

OpenAI Crawler Documentation Update (Dec 2025 — narrows robots.txt compliance)

OpenAI · 2025

Key finding

In the December 9, 2025 update, OpenAI's bot documentation removed the previous claim that OAI-SearchBot feeds navigational links into ChatGPT answers and dropped any reference to OAI-SearchBot supplying training data. ChatGPT-User was expanded to explicitly cover Custom GPT requests and GPT Actions, and robots.txt is no longer applied to user-initiated ChatGPT-User actions. OpenAI also confirmed OAI-SearchBot and GPTBot share crawl results to avoid duplicate fetching.

Methodology note · Same canonical OpenAI documentation page, captured after the December 9, 2025 revision identified publicly by Pieter Serraris. Direct diff was not available; changes were confirmed against the live developers.openai.com/api/docs/bots page and detailed write-ups on PPC Land, Search Engine Roundtable and Stan Ventures comparing pre- and post-update language.

OpenAI Developer Docs·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

arXiv · Mustahsan · 2025

Key finding

Quantifies stochasticity in agentic LLM evaluations using intraclass correlation coefficients (ICC). Shows that single-run evaluations of agentic systems are unreliable because run-to-run variance is large relative to the gap between system variants. Recommends a minimum of 5 to 10 repeated runs per evaluation and reports the ICCs for several common agentic benchmarks.

Methodology note · arXiv preprint 2512.06710 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper applies the intraclass correlation coefficient framework from psychometrics to LLM agent evaluation and reports ICC values across multiple published benchmarks.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Agentic AI in Retail: How Autonomous Shopping Is Redefining the Customer Journey

Bain & Company · 2025

Key finding

30% to 45% of US consumers say they already use generative AI for product research and comparison. AI now accounts for up to a quarter of referral traffic at some retailers, though still less than 1% of their total traffic. Consumers say they trust retailer-owned agents three times more than third-party agents, but about half are uncomfortable letting AI run an end-to-end transaction.

Methodology note · Bain combines its Consumer Lab Generative AI Survey with retailer analytics (Similarweb estimates, Adobe data), case studies of retailer AI launches (Amazon Rufus, Magalu Lu, Home Depot Magic Apron), and references to recent academic work from Columbia and Yale on how agents weight reviews and ratings.

Bain Insights·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The Agentic Commerce Opportunity: How AI Agents Are Ushering in a New Era for Consumers and Merchants

McKinsey QuantumBlack · 2025

Key finding

McKinsey estimates that AI agents could unlock up to 1 trillion dollars in US B2C retail revenue by 2030 as consumers delegate routine shopping to agents and merchants adapt to agent-mediated discovery, comparison, and checkout. (agent inferred)

Methodology note · McKinsey QuantumBlack analysis combining proprietary consumer research, modelling of agentic commerce adoption curves, and merchant case studies. The piece sizes the opportunity by category and outlines the technical and organisational changes retailers and brands need to make to be discoverable and transactable by autonomous agents.

McKinsey·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Botnet Report 2025 (SOTI)

Akamai · 2025

Key finding

Akamai's 2025 AI Botnet Report finds AI bot activity surged 300% year-over-year and AI bots now compose nearly 1% of total bot traffic on Akamai's platform. The commerce industry saw more than 25 billion bot requests in July-August 2025. In healthcare, over 90% of AI bot triggers were associated with scraping. North America accounted for 54.9% of AI bot activity in the period.

Methodology note · Akamai State of the Internet (SOTI) AI Botnet Report, Volume 11 Issue 04, 2025. Direct fetch of the SOTI landing page confirmed accessibility; full report is a downloadable PDF. Findings cross-verified against Akamai's official press release and trade press coverage on cybersecurityasia.net and cxomedia.in.

Akamai SOTI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

E-GEO: A Testbed for Generative Engine Optimization in E-Commerce

arXiv · 2025

Key finding

Across 15 common product-page rewriting tactics tested on a shopping benchmark, no single hand-crafted heuristic reliably wins. A simple iterative prompt-optimisation routine outperforms all of them. The optimised prompts converge on the same pattern across categories, pointing to a stable, domain-agnostic recipe for making product listings more visible to conversational shopping agents.

Methodology note · First public e-commerce GEO benchmark (E-GEO) with over 7,000 multi-sentence consumer product queries paired with relevant listings, capturing intent, constraints, and shopping context. The authors evaluated 15 rewriting heuristics on this benchmark, then formulated GEO as an optimisation problem and ran a lightweight iterative prompt-optimisation algorithm. Data and code are public.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Losing Control: How Zero-Click Search Affects B2B Marketers

Bain & Company · 2025

Key finding

Click-through rates fell sharply in the year after Google introduced AI-generated summaries, with declines reaching 30% in some B2B categories including B2B software. 85% of B2B buyers purchase from their day-one list, the vendors they had in mind before searching, leaving brands less able to influence shortlists through smart search strategies.

Methodology note · Bain analysed click-through rate trends from B2B searches before and after the rollout of Google's AI-generated summaries (AI Overviews), combined with research on B2B buyer behaviour and shortlist formation. The Snap Chart format presents early data with directional commentary rather than a full study report.

Bain Insights·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Agents Will Reshape E-Commerce — European Players Must Prepare Now

Boston Consulting Group · 2025

Key finding

AI search visits in Europe grew from 4% of organic visits in early 2024 to 8% in early 2025, and are projected to reach 25% by the end of 2026 and overtake organic in 2028. LLM referral traffic to leading European retailers is up more than 2,000% in fashion, nearly 1,200% in luxury, and almost 7,500% in specialty retail.

Methodology note · BCG analysed traffic patterns for a sample of leading European brands and retailers, comparing organic search visits to referrals from generative AI platforms (LLM browsers and chat services) across multiple categories. The piece combines proprietary BCG benchmarks with US adoption data as a forward indicator for Europe and projects growth curves through 2028.

BCG Publications·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Gen AI Inside Existing Search Engines Overtakes Standalone Gen AI (TMT Predictions 2026)

Deloitte · 2025

Key finding

Deloitte forecasts that in 2026, about 29% of adults in developed markets will run at least one search per day returning a generative AI summary, versus 10% using a standalone generative AI app daily. Daily passive AI use is projected to stay about three times standalone use through 2027. By mid-2026, 72% of adults will have generated a search overview versus 61% who used a standalone tool.

Methodology note · Deloitte's TMT Predictions 2026 prediction draws on its proprietary Digital Consumer Trends survey (fielded April and May 2025 across multiple developed markets, with longitudinal data from 2023 and 2024), Alphabet's reported AI Overviews monthly usage of over 2 billion, and additional industry data points.

Deloitte Insights·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Gartner Survey: Only One-Third of Consumers Say GenAI Rivals Search Engines

Gartner · 2025

Key finding

Only about one in three consumers say generative AI rivals search engines for finding information, with most still preferring traditional search for general queries. GenAI tools see higher use for creative, brainstorming, and writing tasks than for product research or factual lookup. (agent inferred)

Methodology note · Gartner press release on consumer GenAI preferences. Original URL returns HTTP 403 (Cloudflare bot challenge); findings cross-verified via Demand Gen Report, MarketScreener and Digit.fyi coverage of the same release. Sample: 377 US consumers, June-July 2025. Marketers must optimise for both AI-driven and traditional search per Gartner's framing.

Gartner Press Releases·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google: Authenticating Requests with Web Bot Auth (Experimental)

Google · 2025

Key finding

Google is testing Web Bot Auth, a cryptographic protocol that lets bots sign HTTP requests so sites can verify their identity beyond user-agent strings or IP address ranges. During the experimental phase, only some Google AI agents sign requests, and signatures use HTTP Message Signatures (RFC 9421) keyed to https://agent.bot.goog. Google recommends continued reliance on reverse DNS and published IP ranges as a fallback.

Methodology note · Official Google Crawling Infrastructure documentation, last updated May 4, 2026, describing Google's implementation of the IETF Web Bot Auth Internet-Draft. The page links to the IETF Working Group, a Cloudflare reference implementation on GitHub, and a feedback form. Web Bot Auth itself is still a draft specification that may change.

Google for Developers·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

HTTP Message Signatures for Automated Traffic Architecture (Web Bot Auth)

IETF · Meunier · 2025

Key finding

IETF Internet-Draft 'draft-meunier-web-bot-auth-architecture' defines an architecture for HTTP message signatures applied to automated bot traffic. The architecture supports cryptographic verification of bot identity, allowing sites to confirm whether a self-identified Googlebot or GPTBot is genuinely from the claimed vendor. Aimed at replacing reverse-DNS verification as the standard mechanism for bot authentication.

Methodology note · IETF Internet-Draft draft-meunier-web-bot-auth-architecture-05. Direct fetch on datatracker.ietf.org returned the draft index and metadata. Companion architecture to the existing HTTP Message Signatures RFC (RFC 9421); applied specifically to bot-traffic authentication for AI crawlers and other automated agents.

IETF Datatracker·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

OG-RAG: Ontology-grounded Retrieval-Augmented Generation for LLMs

EMNLP 2025 · Sharma et al. · 2025

Key finding

OG-RAG grounds retrieval-augmented generation in ontologies rather than free-text documents, retrieving structured concepts and their relationships to provide more precise context to the LLM. On benchmark QA tasks, OG-RAG outperforms standard text-RAG by reducing irrelevant retrieval and improving answer specificity, with the largest gains on multi-hop questions requiring structured reasoning.

Methodology note · ACL Anthology entry for EMNLP 2025 (main conference) by Kartik Sharma, Peeyush Kumar and Yunqing Li. Direct fetch on aclanthology.org confirmed authorship and venue. Empirical evaluation against text-RAG baselines on standard QA benchmarks; method details in the published paper.

ACL Anthology·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Redefining Retrieval Evaluation in the Era of LLMs

arXiv · 2025

Key finding

Argues that traditional retrieval evaluation metrics (recall, MRR) underestimate the value of retrieval in LLM-based pipelines because LLMs can compensate for partial retrieval through their pre-existing knowledge. Proposes new metrics that measure retrieval value conditional on the LLM's downstream behaviour, finding that some 'high-recall' retrievers are actually worse for LLM-based search.

Methodology note · arXiv preprint 2510.21440 (October 2025). Direct fetch returned the abstract page. The paper introduces conditional retrieval metrics evaluated against standard RAG benchmarks and shows that metric choice changes the relative ranking of common retrieval methods.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Citation Failure: Definition, Analysis and Efficient Mitigation (CITECONTROL)

arXiv · 2025

Key finding

Defines citation failure as a measurable phenomenon in RAG systems where retrieved documents are not cited even when they support the answer. Introduces CITECONTROL, a method to detect and mitigate citation failure that improves citation recall without degrading answer quality. The method is lightweight and integrates with standard RAG pipelines.

Methodology note · arXiv preprint 2510.20303 (October 2025). Direct fetch returned the abstract page. The paper introduces a formal definition of citation failure and an empirical benchmark across multiple RAG systems, with CITECONTROL's improvements measured on standard QA datasets.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Introducing ChatGPT Atlas

OpenAI · 2025

Key finding

OpenAI launched ChatGPT Atlas, an AI-native web browser that embeds ChatGPT directly into browsing, summarises pages, answers questions in the sidebar, and can carry out multi-step tasks on the user's behalf such as filling forms, comparing products, and completing purchases. (agent inferred)

Methodology note · First-party product launch announcement from OpenAI. The post introduces Atlas as a Chromium-based browser with ChatGPT integrated as the default interface, agentic capabilities for browsing and transacting, and memory of past sessions. Initial availability is on macOS with other platforms to follow.

OpenAI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

New Front Door to the Internet: Winning in the Age of AI Search

McKinsey & Company · 2025

Key finding

McKinsey projects that AI-powered search will mediate roughly $750 billion in US consumer revenue by 2028, representing a meaningful share of category-level discovery. Brands that win in AI answers tend to combine strong third-party coverage, structured product information, and active management of their entity presence across the open web.

Methodology note · McKinsey synthesis of consumer survey data, enterprise interviews, and proprietary modelling. The report combines a quantitative consumer survey on AI search adoption with case-level analysis of brand performance in AI answers. Published October 2025. Forecast figures should be cited as projections, not measured outcomes.

McKinsey·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Assessing Web Search Credibility and Response Groundedness in Chat Assistants

Kempelen Institute / EACL 2026 · Vykopal et al. · 2025

Key finding

Evaluates how reliably chat assistants ground answers in the web search results they cite. Tests show that even when models retrieve from credible sources, their summaries frequently include claims not supported by the retrieved passages. Citation alone does not imply groundedness, and the gap is largest for nuanced or contested topics.

Methodology note · arXiv preprint 2510.13749 (October 2025). Direct fetch returned an empty PDF body; abstract and methodology cross-verified via the arxiv API listing and abstract summary. Empirical study comparing retrieved-source content with cited claims across several chat assistants.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

What Generative Search Engines Like and How to Optimize Web Content Cooperatively (AutoGEO)

arXiv · 2025

Key finding

AutoGEO is a framework that extracts the preferences generative search engines apply when picking and rewriting content for AI answers. The researchers turn those preferences into rewriting rules, then test them on the GEO-Bench benchmark plus two new benchmarks built from real user queries. Both the prompt-based AutoGEO API and the trained AutoGEO Mini model raise content traction in AI answers while preserving search utility.

Methodology note · Academic preprint posted on arXiv on October 13, 2025, by researchers from Carnegie Mellon (Yujiang Wu, Shanshan Zhong, Yubin Kim, Chenyan Xiong). The team probes frontier large language models to surface preference rules, then uses them as context engineering for one system and as rule-based rewards for training a smaller cost-efficient model. Code is released on GitHub.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Characterizing Web Search in The Age of Generative AI

Ruhr University Bochum / Max Planck Institute for Software Systems · Elisabeth Kirsten et al. · 2025

Key finding

Generative search and traditional web search return different things even for the same query. Generative engines pull from a broader pool of sources than Google web search, mix in varying amounts of internal model knowledge versus retrieved pages, and surface different concept sets. That widens the set of pages that can earn visibility, but also breaks assumptions baked into classical ranked-list evaluation.

Methodology note · Academic comparison of one traditional engine (Google web search) with four generative engines from Google and OpenAI, run across queries from four content domains. The authors measured source coverage, the balance between model-internal knowledge and externally retrieved web pages, and the concepts surfaced in each output.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Is Misinformation More Open? A Study of robots.txt Gatekeeping on the Web

arXiv · 2025

Key finding

Examines whether misinformation websites are more permissive to AI crawlers than mainstream sites by analysing robots.txt directives across thousands of domains. Finds that low-credibility and misinformation domains block AI crawlers significantly less often than high-credibility news sources, meaning AI training data is systematically biased toward less reliable material at the source-access stage.

Methodology note · arXiv preprint 2510.10315 (October 2025). Direct fetch on arxiv.org returned the abstract page. The paper analyses robots.txt files across a labelled dataset of misinformation and mainstream news sites, comparing AI-crawler block rates by credibility tier.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness (RDR²)

arXiv · 2025

Key finding

Treating retrieved passages as isolated chunks throws away signal that the original document layout carries. A router that navigates a document's structure tree, scoring both passage relevance and its position in the hierarchy, sets a new state of the art on multi-document question answering. Headings, section order, and parent-child relationships are themselves a ranking signal.

Methodology note · Academic paper (RDR2, EMNLP 2025 Findings) introducing a trainable document-routing step inside the retrieve-and-read pipeline. An LLM-based router walks document structure trees with automatic action curation and structure-aware passage selection. The framework was evaluated across five question-answering datasets that demand multi-document synthesis.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Generative AI and News Report 2025

Reuters Institute, University of Oxford · Nic Newman et al. · 2025

Key finding

Across six countries, only about 7% of adults say they use ChatGPT or another generative AI tool to access news in a typical week. Trust in AI for news is low: most respondents say AI-generated news content makes them feel uncomfortable, and only a minority think AI will improve journalism. (agent inferred)

Methodology note · Reuters Institute for the Study of Journalism (Oxford), 2025 Digital News Report companion publication. Direct fetch confirmed the report page and the host institution; full survey methodology (six-country sample, weekly news-access patterns) is in the linked report PDF.

Reuters Institute·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Rethinking Web Cache Design for the AI Era (SOCC '25)

ETH Zurich + Cloudflare · Yazhuo Zhang, Berger · 2025

Key finding

The peer-reviewed SOCC 2025 paper 'Rethinking Web Cache Design for the AI Era' by Yazhuo Zhang and colleagues (ETH Zurich) shows that traditional web caches are not designed to absorb high-diversity, low-reuse AI scraper traffic. Read the Docs reported 73TB of HTML scraped in one month; Wikimedia reported a 50% backend bandwidth increase from AI scrapers. Proposes filter-and-tier cache architectures.

Methodology note · Peer-reviewed paper at ACM Symposium on Cloud Computing 2025 (DOI 10.1145/3772052.3772255). Direct fetch of the PDF confirmed accessibility; findings cross-verified against the Cloudflare blog post 'Why we're rethinking cache for the AI era' co-authored with the same researchers, and ppc.land coverage of the paper.

SOCC 2025 (ACM)·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

DataDome's 2025 Global Bot Security Report

DataDome · 2025

Key finding

DataDome's 2025 Global Bot Security Report finds that LLM crawler traffic quadrupled across DataDome's customer base in 2025, rising from 2.6% of verified bot traffic in January to over 10.1% by August. DataDome detected nearly 1.7 billion OpenAI crawler requests in a single month. AI bot traffic targeted high-value endpoints: 64% reached forms, 23% login pages, 5% checkout flows. Only 2.8% of sites were fully protected.

Methodology note · DataDome 2025 Global Bot Security Report. Direct fetch returned HTTP 403; findings cross-verified against the official DataDome press release on businesswire.com (September 2025), the DataDome blog post 'The Web's Bot Problem Isn't Getting Better', and Yahoo Finance reporting on the same release.

DataDome·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

TDM Reservation Protocol Community Group (W3C)

W3C TDMRep CG · 2025

Key finding

W3C TDM Reservation Protocol Community Group develops a standardised mechanism for content owners in the EU to reserve their rights against text-and-data-mining uses under Article 4 of the EU Copyright Directive. The protocol specifies machine-readable opt-out signals (TDMRep) that AI training crawlers should honour, complementing robots.txt with a copyright-specific reservation mechanism.

Methodology note · First-party W3C Community Group page. Direct fetch on w3.org/community/tdmrep returned the group's overview, charter, and links to specification documents. Standard W3C standardisation venue. The protocol is the EU-jurisdiction counterpart to the IETF AIPREF vocabulary (R99) and attachment (R169) work.

W3C·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Concise and Sufficient Sub-Sentence Citations for RAG

arXiv · 2025

Key finding

Proposes sub-sentence-level citations in RAG outputs, where each cited passage is matched to a specific sub-sentence in the generated answer rather than to the answer as a whole. The approach improves attribution precision and reduces over-citation, where models cite a source for an entire sentence even when only part of it is supported.

Methodology note · arXiv preprint 2509.20859 (September 2025). Direct fetch on arxiv.org returned the abstract page; method details and benchmark scores are in the full PDF. Empirical evaluation against standard RAG attribution baselines.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Giving Users Choice with Cloudflare's New Content Signals Policy

Cloudflare · 2025

Key finding

Cloudflare introduced a Content Signals Policy that extends robots.txt with three new signals: search (indexing for traditional search), ai-input (use in retrieval augmented generation or AI answers), and ai-train (use for training or fine-tuning AI models). Each can be set to yes or no, or left blank. For 3.8 million domains already using Cloudflare's managed robots.txt, the company will publish search=yes, ai-train=no by default.

Methodology note · Official Cloudflare product announcement from September 24, 2025, written by Will Allen. The policy text is released under a CC0 license to encourage adoption, with a generator at ContentSignals.org. Cloudflare notes signals are preferences, not technical countermeasures, and recommends combining them with WAF and bot management rules.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

arXiv · 2025

Key finding

Most retrieval benchmarks cannot tell a good chunking strategy from a bad one because the answers can be found in any reasonable split of the text. A new benchmark built on evidence-dense questions shows that chunking choices visibly change end-to-end answer quality, and that a hierarchical, multi-level chunker improves performance without paying a heavy time cost.

Methodology note · Academic paper introducing HiCBench (manually annotated multi-level chunk points plus synthesised evidence-dense question-answer pairs with traceable evidence) and the HiChunk framework: fine-tuned large language models that produce multi-level document structure, combined with an Auto-Merge retrieval algorithm. Chunking quality was tested across the full retrieval-augmented generation pipeline.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How People Use ChatGPT (NBER Working Paper 34255)

NBER / OpenAI / Harvard University · Aaron Chatterji et al. · 2025

Key finding

ChatGPT reached around 700 million weekly active users by mid-2025, with roughly 18 billion messages sent per week. About 30% of conversations are work-related while 70% are personal, covering writing assistance, information seeking, and tutoring. Adoption is rising fastest in lower-income countries. (agent inferred)

Methodology note · NBER working paper by Aharon Chetrit, Aidan Toner-Rodgers and OpenAI co-authors analysing a representative sample of ChatGPT conversations. The researchers classified messages by topic, work versus personal use, and user demographics to characterise how people actually use the assistant in 2024 and 2025.

National Bureau of Economic Research·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Answer Engine Citation Behavior: An Empirical Analysis of the GEO-16 Framework

arXiv · 2025

Key finding

Three on-page properties showed the strongest association with whether a page got cited by AI answer engines: metadata and freshness, semantic HTML markup, and structured data. Pages that scored at least 0.70 on the GEO-16 quality score and met at least 12 of 16 quality pillars were cited at substantially higher rates than pages that did not.

Methodology note · 70 product-intent prompts were run across Brave Summary, Google AI Overviews, and Perplexity, producing 1,702 citations across 1,100 unique URLs. The researchers audited each cited page against a 16-pillar framework and used logistic models with domain-clustered standard errors. The study focuses on English-language B2B SaaS pages. Published September 2025.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents (CC-GSEO-Bench)

arXiv · 2025

Key finding

Generative search engines weaken the link between ranking and visibility, so source articles need new ways to prove they shape AI answers. The benchmark scores creator influence across five dimensions: exposure (does the article surface), faithful credit (is it cited), causal impact (does it move the wording), readability and structure, and trustworthiness and safety.

Methodology note · Academic benchmark (CC-GSEO-Bench) of over 1,000 source articles and over 5,000 query-article pairs, organised one article to many queries. Seed queries come from public question-answering datasets with limited synthesised expansion; only queries whose source reappeared in a follow-up retrieval step were kept. Article-level scores aggregate query-level signals into strength, coverage, and stability of influence.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

DOJ Wins Significant Remedies Against Google (US v. Google search remedies decision)

U.S. Department of Justice · 2025

Key finding

The US District Court for DC barred Google from exclusive distribution contracts for Google Search, Chrome, Google Assistant, and the Gemini app. Google must share certain search index and user-interaction data with rivals and offer search and ad syndication services. The court extended remedies to generative AI products to prevent the same tactics being used to monopolise GenAI. Google holds roughly 90% of US search queries.

Methodology note · US Department of Justice press release announcing the remedies ruling in United States et al. v. Google, following a 277-page liability opinion in August 2024 and a 15-day remedies trial in May 2025. The case was joined by 49 states, two territories, and the District of Columbia.

DOJ·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

About Amazonbot

Amazon · 2025

Key finding

Amazon documents Amazonbot as the web crawler used to power its services. Amazonbot identifies itself with the user-agent string 'Amazonbot' and respects robots.txt directives. The documentation page lists IP ranges, robots.txt examples, and Amazon's contact information for site owners reporting issues. Amazonbot does not currently train Alexa or general-purpose AI models per the published documentation.

Methodology note · First-party Amazon Developer documentation page for Amazonbot. Direct fetch on developer.amazon.com returned the HTML page with the canonical user-agent string, robots.txt syntax examples, and IP-range publication mechanism. Standard vendor crawler documentation; equivalent format to OpenAI and Anthropic crawler docs.

Amazon Developer·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Agentic Commerce is Redefining Retail — How to Respond

Boston Consulting Group · 2025

Key finding

More than half of consumers expect to use AI assistants for shopping by the end of 2025. US retail traffic from generative AI browsers and chat services grew 4,700% year over year in July 2025. These visitors are more engaged, spending 32% more time on site, browsing 10% more pages, and bouncing 27% less. By 2029, US AI search ad spend is projected to reach 26 billion dollars.

Methodology note · BCG synthesises third-party adoption data (Adobe traffic measurements, eMarketer forecasts, monday.com retailer survey) with its own analysis of AI agent behaviour. Findings are based on observed retail site analytics, advertising forecasts, and a survey of global retailers about agentic AI adoption plans.

BCG Publications·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Brave Search Crawler

Brave · 2025

Key finding

Brave documents that its search crawler does not advertise a differentiated user agent to avoid being discriminated against by websites that allow only Googlebot. However, if a domain or page is not crawlable by Googlebot, Brave's bot will not crawl it either. The documentation notes that robots.txt is not used to prevent Brave-specific access but applies through the Googlebot directive.

Methodology note · First-party Brave Search documentation page. Direct fetch on search.brave.com/help confirmed the crawler-identification policy and the inherited-from-Googlebot access model. Unusual among AI/search crawlers in not advertising a differentiated user agent, making per-crawler robots.txt rules ineffective for Brave specifically.

Brave Search Help·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

DuckAssistBot Help Page

DuckDuckGo · 2025

Key finding

DuckDuckGo documents DuckAssistBot as the crawler used to power DuckDuckGo's AI-assisted answer features. DuckAssistBot is related to DuckDuckGo Search but operates as a separate bot with its own user-agent string and IP ranges. The help page lists how site owners can identify, allow, or block DuckAssistBot via robots.txt directives and clarifies that DuckAssistBot fetches pages on demand rather than for AI training.

Methodology note · First-party DuckDuckGo Help Pages article. Direct fetch on duckduckgo.com confirmed the bot identification, robots.txt mechanism, and on-demand fetch behaviour. Standard vendor crawler documentation; DuckDuckGo's AI features (DuckAssist) are powered by partner LLMs rather than DuckDuckGo's own training.

DuckDuckGo Help·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google Search Quality Rater Guidelines (Jan 2025 + Sep 2025 revisions)

Google · 2025

Key finding

Google relies on around 16,000 external Search Quality Raters across 80-plus languages to evaluate search results against published guidelines. Raters never decide rankings directly; they assess Page Quality (using the E-E-A-T framework of Experience, Expertise, Authoritativeness, Trust) and Needs Met. Standards are highest for Your Money or Your Life topics like health, finance and safety, where low-quality pages can cause real harm.

Methodology note · Official Google overview of its Search Quality Rater programme, dated November 2023 and published as a PDF on services.google.com. The document explains how raters are recruited and trained, the two rating tasks (Page Quality and Needs Met), the E-E-A-T criteria, the special treatment of YMYL topics, and how aggregate ratings feed back into search algorithm changes.

Google Search Central·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

About KagiBot

Kagi · 2025

Key finding

Kagi documents KagiBot as the web crawler for the Kagi search engine. KagiBot identifies as 'Mozilla/5.0 (compatible; Kagibot/1.0; +https://kagi.com/bot)' and originates from four declared IP addresses with reverse-DNS confirmations at kagibot.org. Standard robots.txt directives targeting Kagibot are respected. Kagi is a paid search engine that does not train AI models on crawled content.

Methodology note · First-party Kagi documentation page for KagiBot. Direct fetch on kagi.com/bot returned the HTML page with the canonical user-agent string, exact IP addresses, reverse-DNS records, and robots.txt compliance statement. Standard vendor crawler documentation.

Kagi Help·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Data Provenance Initiative (MIT Media Lab)

MIT Media Lab · Shayne Longpre · 2025

Key finding

The Data Provenance Initiative at MIT Media Lab audited the provenance, licensing, and attribution of over 1,800 text dataset collections used to train large language models. The audit found that more than 70% of datasets had 'unspecified' licenses; correcting the licensing reduced this to around 30%. The corrected licenses were often more restrictive than those originally assigned by repositories.

Methodology note · MIT Media Lab project page for the Data Provenance Initiative (PI Sandy Pentland and team). Direct fetch on media.mit.edu returned the project overview. Findings reported in the peer-reviewed paper 'A Large-Scale Audit of Dataset Licensing & Attribution in AI' (arXiv:2310.16787) and on MIT News (August 2024).

MIT Media Lab·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Meta Web Crawlers (Meta-ExternalAgent, FacebookBot)

Meta · 2025

Key finding

Meta documents multiple crawlers and their user agents: meta-externalagent for AI training and product improvement, meta-externalfetcher for user-initiated content fetches by Meta AI features, and the older facebookexternalhit for generating preview cards when links are shared. Publishers can use robots.txt to control meta-externalagent and meta-externalfetcher; facebookexternalhit follows different rules because it acts on a user's direct request to share a link.

Methodology note · Official Meta for Developers documentation page. The page is the canonical reference for the user-agent strings, supported robots.txt directives, and the purposes Meta declares for each crawler.

Facebook for Developers·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Mistral AI Crawlers (robots)

Mistral AI · 2025

Key finding

Mistral AI documents its crawler identification and robots.txt compliance policy. The documentation lists user-agent strings used by Mistral's crawlers (for web indexing, real-time retrieval, and model training), specifies how site owners can target each via robots.txt, and states Mistral's commitment to honour disallow directives. Standard format matching other major AI vendor crawler documentation.

Methodology note · First-party Mistral AI documentation page (docs.mistral.ai/robots). Direct fetch returned the HTML page. Standard vendor crawler documentation; equivalent to OpenAI, Anthropic, and Google's crawler docs in scope and format.

Mistral Docs·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How people are using ChatGPT (OpenAI summary page)

OpenAI · 2025

Key finding

Publicly accessible summary of OpenAI's ChatGPT usage research. Describes the Asking/Doing/Expressing classification (49%/40%/11%) and the dominant consumer use cases: practical guidance, information seeking, and writing assistance. Useful as a citable first-party summary of the underlying research, but, like the underlying paper, it provides aggregate categories rather than a browsable feed of real user prompts.

Methodology note · Public-facing OpenAI summary page accompanying the 'How People Use ChatGPT' research paper. Provides accessible explanations of the Asking/Doing/Expressing taxonomy and the reported category shares (~49% / 40% / 11%). Content verified by fetch on 2026-05-27. No methodology beyond what is disclosed in the underlying paper (R192) and the NBER working paper (R52).

OpenAI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How People Use ChatGPT (OpenAI research paper PDF)

OpenAI · 2025

Key finding

OpenAI's research on ChatGPT usage classifies consumer conversations into three categories: roughly 49% 'Asking' (information seeking), 40% 'Doing' (practical task assistance), and 11% 'Expressing' (writing and creative work). This provides a defensible intent taxonomy for classifying prompt hypotheses — but it is an aggregate breakdown, not a browsable dump of user prompts.

Methodology note · OpenAI research paper published as a downloadable PDF, with co-authorship by external economists (David Deming, Christopher T. Stanton and colleagues). The paper presents an aggregate analysis of anonymised ChatGPT consumer usage and classifies conversations into Asking, Doing and Expressing categories. No individual prompts are published; methodology and definitions of each category are disclosed in full inside the PDF.

OpenAI·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Agentic Commerce Protocol (OpenAI Developers)

OpenAI; Stripe · 2025

Key finding

The Agentic Commerce Protocol is an open standard acting as the connective layer between merchants and ChatGPT users, letting ChatGPT ingest structured catalog data, understand inventory, and surface relevant products in context. Co-developed by OpenAI and Stripe and open-sourced under Apache 2.0 in September 2025, it powers ChatGPT Shopping and sources products from merchant catalogs formatted to OpenAI's commerce feed specification.

Methodology note · Primary standards documentation on the OpenAI Developers site describing ACP's purpose and feed/product specs, not an empirical study. Origin, Apache 2.0 license, September 2025 release and role in ChatGPT Shopping corroborated by BigCommerce, Adobe and Paz secondaries. Verified by direct fetch 2026-07-08. Use as a protocol reference, not outcome evidence.

OpenAI Developers·Accessed 10.07.2026

Tier A — Strongest evidenceRead source

The Crawl-to-Click Gap: Cloudflare Data on AI Bots, Training, and Referrals

Cloudflare · 2025

Key finding

AI crawlers read content far more than they send referrals back. Anthropic's ClaudeBot crawled around 70,000 pages for every visitor it referred; OpenAI's GPTBot crawled around 1,700 for every visitor; Perplexity around 5 for every visitor. Mistral was the only major AI engine where referrals outweighed crawl volume.

Methodology note · Aggregate analysis of crawl requests and referral traffic across the Cloudflare network. For each major AI crawler, the team divided pages crawled by visits sent to the same destinations during the same window, producing a crawl-to-refer ratio. Published August 2025.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

A Deeper Look at AI Crawlers: Breaking Down Traffic by Purpose and Industry

Cloudflare · 2025

Key finding

AI crawler traffic is concentrated in news, technology, finance, and retail. The largest category is training crawling, but real-time user-action crawling (where an AI assistant fetches a page during a user conversation) is the fastest-growing segment. Different AI engines crawl with different mixes of purpose, which has direct implications for which crawlers a brand should allow.

Methodology note · Aggregate analysis of crawl traffic across the Cloudflare network, segmented by destination industry and by crawler purpose (training, search index, user-action fetch). Published August 2025.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Generative AI-Powered Shopping Rises with Traffic to U.S. Retail Sites (Adobe Analytics)

Adobe Digital Insights · 2025

Key finding

Visitors arriving at US retail sites from generative AI sources show measurably higher engagement than visitors from other channels: 8% higher time on site, 12% more pages per visit, and 23% lower bounce rate. AI-driven retail traffic grew sharply through 2024 and 2025, though it remains a small share of total visits.

Methodology note · Aggregate analysis of Adobe Analytics data covering trillions of visits to US retail websites. Adobe compared engagement metrics for visitors arriving from generative AI assistants against visitors from other referral channels. Published August 2025.

Adobe Business Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with LLMs

arXiv · 2025

Key finding

Surveys the field of evidence-based text generation with LLMs, organising work into three categories: attribution (linking generated text to evidence), citation (formatting and presenting that evidence to users), and quotation (verbatim grounding). Identifies four open problems including attribution granularity, retrieval-attribution coupling, and evaluation of partial attribution.

Methodology note · arXiv preprint 2508.15396 (August 2025). Survey paper covering evidence-based text generation with LLMs. Direct fetch on arxiv.org returned the abstract page; full taxonomy and reference list are in the PDF.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

New Fastly Threat Research: AI Crawlers Are Almost 80% of AI Bot Traffic

Fastly · 2025

Key finding

Across April-July 2025, Fastly observed that AI crawlers made up almost 80% of AI bot traffic on its network (the other 20% being AI fetcher bots). Meta's AI crawlers alone generated 52% of crawler traffic, more than Google (23%) and OpenAI (20%) combined. Nearly 90% of North American AI bot traffic came from crawlers vs 41% in Europe. Fetcher bot bursts reached 39,000 requests per minute.

Methodology note · Fastly blog post by Threat Insights team, Q2 2025. Direct fetch returned the HTML article. Tier A: enterprise-scale analytics from Fastly's network covering 6.5 trillion monthly requests across 130,000+ apps. Methodology disclosed at network-flow level; vendor is the data owner.

Fastly / BusinessWire·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Role-Augmented Intent-Driven Generative Search Engine Optimization

arXiv · 2025

Key finding

Generative search engines reward content that anticipates the different roles a user might be playing when they ask a question. Rewriting a page through several informational personas, then refining it, produced larger gains in both subjective impression and measured presence inside generative answers than approaches that optimise on a single axis.

Methodology note · Academic paper introducing Role-Augmented Intent-Driven G-SEO, which models search intent through reflective refinement across multiple informational roles. The authors extended an existing GEO dataset with diversified query variations and introduced G-Eval 2.0, a six-level large-language-model-augmented rubric for finer-grained, human-aligned scoring of optimisation outputs.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Should Be More Human, Not More Complex: A Large-Scale Study on User Preferences for Concise, Source-Backed AI Responses

arXiv · 2025

Key finding

Users prefer concise, source-attributed answers over verbose explanations from AI search. Longer, more lexically complex responses produced an uncanny-valley effect: systems sounded authoritative but lacked critical thinking, lowering trust and raising cognitive load. The pattern challenges the assumption that more elaborate AI output equals better output. (agent inferred)

Methodology note · arXiv preprint 2508.04713. Direct fetch on arxiv.org returned the HTML preprint with the full paper structure including methodology, AI systems evaluated, and detailed response analysis. Authored by Carlo Esposito.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Perplexity is Using Stealth, Undeclared Crawlers to Evade Website No-Crawl Directives

Cloudflare · 2025

Key finding

Cloudflare observed Perplexity using undeclared crawlers to fetch content from sites that had blocked its known bots in robots.txt and WAF rules. When PerplexityBot was blocked, traffic rotated through a generic Chrome-on-macOS user agent, undisclosed IP ranges, and multiple ASNs, hitting tens of thousands of domains and millions of requests a day. Cloudflare de-listed Perplexity from its verified bots list and added detection rules.

Methodology note · Cloudflare blog post from August 4, 2025, by Gabriel Corral, Vaibhav Singhal, Brian Mitchell, and Reid Tatoris. The team set up brand new test domains that had never been indexed, applied robots.txt and WAF blocks, then queried Perplexity about content on those domains. Detection used machine learning plus network signals.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Which Crawlers Does Bing Use? (incl. Copilot)

Microsoft · 2025

Key finding

Bing identifies several active crawlers. Bingbot is the main web crawler for Bing search. AdIdxBot crawls pages for Bing Ads. BingPreview generates page snapshots. MicrosoftPreview supports preview cards. Copilot uses Bingbot data for grounding rather than a separate crawler. Publishers can issue user-agent specific rules in robots.txt, and Bing respects standard directives and meta robots tags for indexing control.

Methodology note · Official Bing Webmaster Tools help documentation page that lists Bing's crawlers, their purposes, and their user-agent strings. The page is the canonical reference for site owners configuring crawler rules for Microsoft search and Copilot.

Bing Webmaster·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The Attribution Crisis in LLM Search Results: Estimating Ecosystem Exploitation

O'Reilly Media · Strauss et al. · 2025

Key finding

Argues that the citation behaviour of LLM-based search constitutes an attribution crisis: cited sources are systematically under-credited (fewer click-throughs than equivalent SERP positions), over-extracted (more content reproduced verbatim or near-verbatim), and concentrated on a small subset of high-authority publishers. Quantifies the ecosystem-level economic impact on publishers.

Methodology note · arXiv preprint 2508.00838 (August 2025). Direct fetch on arxiv.org returned the abstract page. The paper combines empirical citation analysis with economic modelling to estimate ecosystem-level effects on publisher revenue and proposes attribution reforms.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google Users Are Less Likely to Click on Links When an AI Summary Appears in Search Results

Pew Research Center · 2025

Key finding

When a Google search result page includes an AI summary, users click on a traditional link in roughly 8% of visits. On result pages without an AI summary, they click in roughly 15% of visits. Users rarely click on the citations inside the AI summary itself, doing so on about 1% of visits.

Methodology note · Pew Research panel study covering 900 US adults and 68,879 Google searches conducted between March and May 2025. Sessions were tracked through opt-in browser participation; click behaviour was observed directly rather than self-reported. Published July 2025.

Pew Research Center·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented LLMs

University of Amsterdam et al. · Abolghasemi et al. · 2025

Key finding

Identifies attribution bias in generator-aware RAG: when an LLM is told which documents it can cite, the model preferentially attributes claims to documents that align with its own pre-existing beliefs, ignoring contradicting sources even when those contradict the model's output. The bias is measurable and persists across model families.

Methodology note · arXiv preprint 2410.12380 (October 2024). Direct fetch returned the abstract page. The paper develops a controlled experimental setup and runs it across several LLM families; the bias is reported as statistically significant on standard QA tasks.

ACL 2025 Findings·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Correctness is not Faithfulness in RAG Attributions

SIGIR/ICTIR 2025 · Wallat · 2025

Key finding

Shows that a RAG system's answers can be factually correct while its citations are unfaithful, meaning the cited passages do not actually support the generated claim. Across standard benchmarks, correctness and faithfulness diverge measurably, implying that citation-quality evaluation must be a separate metric from answer accuracy in any AI visibility tracking system.

Methodology note · arXiv preprint 2412.18004 (December 2024). Empirical study testing whether RAG answers and their cited evidence are mutually consistent. Direct fetch on arxiv.org confirmed the abstract; the full evaluation uses public attribution datasets and human annotation for faithfulness scoring.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How Users Interact with Generative Information Retrieval Systems: A Study of User Behavior and Search Experience

Beijing Institute of Technology · Liang et al. · 2025

Key finding

Generative information-retrieval systems that return a written, cited answer instead of a ranked list of links can reduce a searcher's effort and improve their experience without lowering perceived credibility. The comparison covers conversational answer interfaces against traditional ranked-list search, suggesting brands should expect users to do less link-clicking and rely more on what the answer itself says.

Methodology note · SIGIR 2025 user study using Bing Chat as the generative system and Bing as the traditional baseline. Participants completed three task types on each system while the researchers logged behaviour such as clicks and query reformulation, alongside explicit ratings of satisfaction, credibility, and perceived success. The two conditions were compared head to head.

SIGIR 2025 (ACM)·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Code of Practice for General-Purpose AI Models (Copyright Chapter)

European Commission / EU AI Office · 2025

Key finding

The EU General-Purpose AI Code of Practice provides a voluntary route for AI model providers to demonstrate compliance with the AI Act's obligations on copyright, transparency, and safety. The Copyright Chapter requires signatories to honour machine-readable opt-out signals such as robots.txt and TDM reservations, to publish a summary of training data, and to put a complaint mechanism in place for rightsholders.

Methodology note · Official European Commission policy page hosting the Code of Practice for General-Purpose AI Models, developed by independent experts under the EU AI Act process and published in 2025. The Code covers Safety and Security, Transparency, and Copyright chapters, and is signed by major AI providers as a way to show compliance with the AI Act.

European Commission·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Introducing Comet: Browse at the Speed of Thought

Perplexity AI · 2025

Key finding

Perplexity launched Comet, an AI-native web browser built around Perplexity's answer engine. Comet replaces the search bar with a conversational assistant, summarises pages, answers questions about open tabs, and can execute agentic tasks across the web on behalf of the user. (agent inferred)

Methodology note · First-party product launch announcement from Perplexity. The post positions Comet as a Chromium-based browser with Perplexity's assistant available across every tab, supporting research workflows, product comparisons, and multi-step actions. Initial access was offered to Perplexity Max subscribers.

Perplexity Hub·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

SIGIR 2025 LiveRAG Challenge Report

arXiv · 2025

Key finding

SIGIR 2025 LiveRAG Challenge Report summarises a community competition where teams built end-to-end live RAG systems evaluated on real-time queries. Reports best-performing strategies, common failure modes, and lessons learned. Notable findings include the dominance of hybrid sparse-dense retrieval and the difficulty of evaluating live RAG without ground-truth answers.

Methodology note · arXiv preprint 2507.04942 (July 2025). Direct fetch returned the abstract page. Multi-team challenge report from SIGIR 2025; methodology, evaluation protocol, and team submissions are documented in the PDF.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

News Source Citing Patterns in AI Search Systems

arXiv (cs.IR) · Kai-Cheng Yang · 2025

Key finding

AI search systems concentrate news citations in a small set of outlets, and the cited mix leans politically liberal. Low-credibility sources are rarely cited. News makes up only 9% of all citations across more than 366,000 citations studied, so brands depending on press coverage for AI visibility face a narrow set of gatekeeper publishers, with limited influence from political leaning or quality on user satisfaction.

Methodology note · Academic analysis of the AI Search Arena platform, covering more than 24,000 conversations and 65,000 responses from search systems by OpenAI, Perplexity, and Google. The study extracted over 366,000 citations, isolated those referencing news, and correlated source-level attributes (political leaning, credibility ratings) with user preference data from head-to-head model comparisons.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Beyond SEO: A Transformer-Based Approach for Reinventing Web Content Optimisation

arXiv · 2025

Key finding

Rewriting web copy to add credible citations, statistical evidence, and cleaner phrasing measurably increases how much of that copy gets reproduced in AI answers. Optimised travel pages saw a 15.63% rise in absolute word count surfaced inside generative responses and a 30.96% rise on a position-weighted version of the same metric, with small computational cost.

Methodology note · The team fine-tuned a BART-base transformer on 1,905 paired travel-website passages, each pairing raw copy with a generative-engine-optimised rewrite. Quality was scored with ROUGE-L and BLEU against the optimised targets; visibility was tested by feeding both versions to Llama-3.3-70B and counting how much of each rewrite appeared in the model's responses.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Introducing Pay Per Crawl

Cloudflare · 2025

Key finding

Cloudflare launched Pay Per Crawl in private beta on July 1, 2025, letting publishers charge AI crawlers per request. The system uses HTTP status code 402 Payment Required: a crawler either includes a crawler-max-price header to pre-agree to pay, or gets a 402 with the price and can retry with crawler-exact-price. Cloudflare acts as the merchant of record, identifies crawlers via Web Bot Auth signed requests, and aggregates payments.

Methodology note · Official Cloudflare product announcement by Will Allen and Simon Newton, July 1, 2025. The post documents the headers, the publisher controls (allow, charge, block), and the integration with existing WAF and bot management rules. Crawler authentication uses Ed25519 key pairs and HTTP Message Signatures as defined by RFC 9421.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

From Googlebot to GPTBot: Who's Crawling Your Site in 2025

Cloudflare · 2025

Key finding

Across Cloudflare's network, search and AI crawler traffic rose 18% from May 2024 to May 2025. Googlebot grew 96% in raw requests and now accounts for 50% of crawler traffic. GPTBot rose 305% in requests, with its share climbing from 2.2% to 7.7%. ChatGPT-User requests jumped 2,825%, and PerplexityBot grew 157,490% off a tiny base. About 14% of top domains now use robots.txt rules targeting AI bots specifically.

Methodology note · Cloudflare Radar analysis published July 2025, comparing crawler activity in May 2024 against May 2025 across a fixed cohort of customers to remove growth bias. The team matches user-agent tokens against an open-source list of AI crawlers and analyses robots.txt files on 3,816 of the top 10,000 domains. Methodology and limits are documented in the post.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA

THUDM · Jiajie Zhang · 2025

Key finding

LongCite enables LLMs to generate fine-grained citations in long-context QA by training the model to attribute each statement in its answer to a specific span in the retrieved long document. The method substantially improves citation precision over post-hoc citation generation and outperforms baselines on the released LongCite benchmark.

Methodology note · arXiv preprint 2409.02897 (September 2024). Direct fetch returned the abstract page. The paper releases a training pipeline and benchmark dataset; empirical comparison against post-hoc citation baselines is reported in the PDF.

ACL 2025 Findings·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses

Salesforce AI Research et al. · 2025

Key finding

Argues that current AI search engines' promise of factual, verifiable, source-cited responses is partly illusory. Empirical analysis across major systems shows frequent unsupported claims even when citations are present, weak ranking of citations by relevance, and citation patterns that systematically advantage well-resourced incumbents. Calls for new evaluation standards before mass deployment.

Methodology note · arXiv preprint 2410.22349 (October 2024). Direct fetch returned the abstract page. The paper provides both critique and empirical evaluation across several commercial AI search systems, with methodology and per-system results in the PDF.

FAccT 2025 (ACM)·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Chunk Twice, Embed Once: Systematic Study of Segmentation and Representation Trade-offs

arXiv · 2025

Key finding

How a page is split into chunks matters as much for retrieval as which model embeds it. Simple recursive token chunking around 100 tokens with no overlap (R100-0) consistently beat more elaborate strategies. Retrieval-tuned embedding models such as Nomic and Intfloat E5 outperformed domain-specialised ones like SciBERT, suggesting embedding choice and chunk size are the high-leverage levers.

Methodology note · Systematic evaluation in a chemistry retrieval setting: 25 chunking configurations across five method families combined with 48 embedding models, tested on three chemistry retrieval benchmarks including the authors' new QuestChemRetrieval dataset. Datasets, code, and benchmark results were released publicly.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

About Applebot (incl. Applebot-Extended)

Apple · 2025

Key finding

Apple uses one crawler, Applebot, to gather data that powers Spotlight, Siri, and Safari search. A separate user agent, Applebot-Extended, governs whether content is used to train Apple's foundation models, including Apple Intelligence. Sites can disallow Applebot-Extended in robots.txt to opt out of generative AI training while keeping content discoverable in Apple search. Applebot is identified via reverse DNS at applebot.apple.com or a published IP CIDR list.

Methodology note · Official Apple support documentation about Applebot, last updated April 25, 2025. The page details user-agent strings, robots.txt behavior, supported meta directives such as noindex, nosnippet, nofollow, and how Applebot-Extended works as a secondary control specifically for generative AI training.

Apple Support·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts

arXiv · 2025

Key finding

Metacognitive prompts that explicitly ask the AI to evaluate its own reasoning before answering reduce errors in generative search responses. Users supplied with such prompts also show improved critical evaluation of AI-generated answers compared with users supplied with standard prompts. The intervention is light-touch and does not require changes to the underlying model.

Methodology note · arXiv preprint 2505.24014 (May 2025). User study examining how prompting strategies affect both AI output quality and human evaluation behaviour in generative search. Direct fetch returned the abstract page; the empirical sample size and significance levels are reported in the underlying PDF.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

arXiv (cs.IR) · Sinchana Ramakanth Bhat et al. · 2025

Key finding

Chunk size in retrieval-augmented generation has a large effect on retrieval quality. Smaller chunks of 64 to 128 tokens are optimal when answers are short and fact-based. Larger chunks of 512 to 1024 tokens work better when broader context is needed. Embedding models react differently: Stella benefits from larger chunks for long-range retrieval, while Snowflake performs better with smaller chunks for entity-level matching.

Methodology note · Peer-style arXiv paper (2505.21700) by Bhat, Rudat, Spiekermann and Flores-Herr, submitted May 2025. The authors systematically test fixed-size chunking from 64 to 1024 tokens across multiple embedding models and both short-form and long-form datasets, measuring retrieval performance across configurations. Results highlight the interaction between chunk size, embedding model and dataset characteristics.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Scrapers Selectively Respect robots.txt Directives

arXiv · 2025

Key finding

Audit of major AI scrapers' compliance with robots.txt directives finds that compliance is selective rather than uniform: scrapers honour blocks on some user agents and not others, even within the same vendor, and compliance changes over time as crawlers are updated. The paper provides specific evidence of non-compliance events with named vendors and dates.

Methodology note · arXiv preprint 2505.21733 (May 2025). Direct fetch on arxiv.org returned the abstract page. The paper documents specific non-compliance events with named scrapers using server-log evidence and timestamped robots.txt snapshots.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

arXiv (cs.CL) / EMNLP 2025 · Gili Lior et al. · 2025

Key finding

ReliableEval proposes a method-of-moments recipe for stochastic LLM evaluation that explicitly accounts for run-to-run variance in model outputs. Across standard benchmarks, the method produces tighter confidence intervals than naive averaging and reveals that some headline LLM performance comparisons are within noise margins. Released as an open evaluation toolkit.

Methodology note · arXiv preprint 2505.22169 (May 2025). Direct fetch returned the abstract page. The paper derives the method-of-moments estimator, tests it against several common evaluation tasks, and releases the toolkit for community use.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

An Analysis of AI Overview Brand Visibility Factors (75K Brands Studied)

Ahrefs · Louise Linehan, Xibeijia Guan · 2025

Key finding

Across 75,000 brands (DR>40, top kw vol >=800), Spearman correlation with AI Overview brand visibility: branded web mentions 0.664, branded anchors 0.527, branded search volume 0.392, Domain Rating 0.326, referring domains 0.295, branded traffic 0.274, backlinks 0.218. Web mentions correlate ~3x stronger than backlinks. Top quartile by web mentions averages 169 AIO mentions (median) vs 14 for next quartile (~10x gap); bottom 50% average 0-3 (effectively invisible). 26% of brands had zero AIO mentions.

Methodology note · 75,000 brands filtered by DR>40 and highest-volume keyword >=800 monthly searches; AI Overview mentions measured via Ahrefs Brand Radar across millions of AIO responses; Spearman rank correlation. Authors explicitly note correlation != causation and that all factors are moderate-to-weak on the Spearman scale. Single-vendor dataset using Ahrefs' own metrics (web mentions, DR), so absolute values are tool-defined; direction corroborated by Seer Interactive (backlinks 0.10, DR 0.25) and Kevin Indig (brand search vol 0.334).

Ahrefs Blog·Accessed 08.07.2026

Tier A — Strongest evidenceRead source

NLWeb — Bringing Conversational Interfaces Directly to the Web

Microsoft · R.V. Guha · 2025

Key finding

Microsoft launched NLWeb on May 19, 2025, an open project that lets any website expose its content as a natural-language interface using Schema.org, RSS, and other structured data the site already publishes. Every NLWeb instance is also a Model Context Protocol server, making the site discoverable to AI agents. Initial adopters include Shopify, Tripadvisor, Eventbrite, O'Reilly, Hearst, and Chicago Public Media.

Methodology note · Official Microsoft announcement, published on the Microsoft Source corporate blog. NLWeb was conceived by R.V. Guha, the creator of RSS, RDF, and Schema.org, who joined Microsoft as Corporate Vice President and Technical Fellow. The project is open source and technology agnostic, with code and documentation on GitHub.

Microsoft News·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search

NUS / Renmin University · Sunhao Dai et al. · 2025

Key finding

Argues that generative AI search has broken the feedback loop that traditionally improved ranking. Traditional web search collects fine-grained user feedback (clicks, dwell time) at the document level; generative AI search receives only coarse-grained feedback on the final answer, even though the pipeline spans query decomposition, retrieval, and generation. Proposes NExT-Search to reintroduce process-level feedback.

Methodology note · SIGIR 2025 perspective paper (arXiv:2505.14680) by Dai, Wang, Pang, Xu, Ng, Wen, and Chua. Direct fetch on arxiv.org returned the abstract page; the paper proposes a two-mode feedback architecture (User Debug Mode and Shadow User Mode) without claiming empirical validation. Forward-looking perspective rather than experimental study.

SIGIR 2025 (Perspective Paper)·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

C2PA Technical Specification v2.2 (ISO/IEC 22144)

Coalition for Content Provenance and Authenticity (C2PA) · 2025

Key finding

C2PA Technical Specification v2.2 defines a standard for cryptographically signed content credentials. The specification was published in 2024 as ISO/IEC 22144, enabling images, video, and other media to carry tamper-evident metadata about origin, edits, and AI involvement. Adopters include Adobe, Microsoft, the BBC, OpenAI, Sony, and Leica, with rollout in cameras, generative tools, and publisher workflows.

Methodology note · Official specification from the Coalition for Content Provenance and Authenticity, a Joint Development Foundation project. Version 2.2 is the latest at time of publication and corresponds to the formally adopted international standard ISO/IEC 22144. Steering committee members include Adobe, Microsoft, Intel, Google, the BBC, OpenAI, and Sony. The full text is openly published with conformance and test suites.

C2PA / ISO·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Copyright and AI Part 3: Generative AI Training (pre-publication report)

U.S. Copyright Office · 2025

Key finding

The US Copyright Office concludes that generative AI training raises copyright questions at several points: data collection, model training, retrieval-augmented generation, and outputs. Fair use is fact-specific and depends on transformativeness, commerciality, the amount used, and effects on the market for the original work, including market dilution and lost licensing. The Office recommends voluntary licensing markets rather than compulsory licensing schemes.

Methodology note · Pre-publication version of Part 3 of the Copyright Office's Report on Copyright and Artificial Intelligence, released May 2025 by the Register of Copyrights. The report draws on more than 10,000 comments submitted in response to a 2023 Notice of Inquiry, plus existing case law and international approaches. Sections cover technical background, prima facie infringement, fair use, and licensing options.

USCO·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Sufficient Context: A New Lens on Retrieval Augmented Generation Systems

Google Research · 2025

Key finding

Introduces the concept of sufficient context as a new evaluation lens for RAG systems. A retrieval is sufficient if it contains enough evidence to answer the query correctly; insufficient retrievals lead to hallucinated answers even when the model has the right reasoning ability. Provides metrics and empirical analysis across standard RAG benchmarks.

Methodology note · arXiv preprint 2411.06037 (November 2024). Direct fetch returned the abstract page. The paper introduces a sufficient-context metric, validates it against human annotations of RAG outputs, and reports correlations with downstream answer quality.

ICLR 2025·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Human Trust in AI Search: A Large-Scale Experiment

arXiv · 2025

Key finding

Across ~12,000 queries in seven countries and a preregistered randomised experiment on a US sample, participants trusted GenAI search less than traditional search on average, but adding reference links and citations to GenAI answers significantly increased trust, even when those citations were incorrect or hallucinated. Uncertainty highlighting reduced trust whether confidence was high or low.

Methodology note · arXiv preprint 2504.06435 (April 2025) by Haiwen Li and Sinan Aral (MIT Sloan). Preregistered randomised experiment on a US-representative panel, paired with a 12,000-query, 80,000-result global exposure measurement across seven countries. 23 pages, six figures. Direct fetch on arxiv.org confirmed authorship and methodology.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

NYT v. OpenAI Motion to Dismiss Opinion (April 4 2025)

U.S. District Court SDNY · 2025

Key finding

Judge Sidney Stein denied OpenAI's and Microsoft's motions to dismiss the core contributory copyright infringement and most DMCA claims brought by The New York Times, the Daily News, and the Center for Investigative Reporting. The court allowed the publishers' direct infringement and trademark dilution claims to proceed. It dismissed common law misappropriation and certain DMCA 1202 sub-claims without prejudice. The case moves to discovery.

Methodology note · Memorandum opinion and order from Judge Sidney H. Stein of the US District Court for the Southern District of New York, dated April 4, 2025, in the consolidated actions including 23-cv-11195 (Times v. OpenAI). The court ruled on Rule 12(b)(6) motions to dismiss, accepting the plaintiffs' allegations as true at this stage. The full opinion is published on the court website.

U.S. Courts·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

2025 Bad Bot Report

Imperva (Thales) · 2025

Key finding

Imperva's 2025 Bad Bot Report finds that automated traffic overtook human activity for the first time in a decade, reaching 51% of all internet traffic in 2024. Bad bots specifically account for 37% of internet traffic, up from 32% the year prior. 44% of advanced bot traffic targeted APIs. AI is supercharging bot sophistication, with simple high-volume attacks now 45% of all bot attacks.

Methodology note · Imperva (Thales) 2025 Bad Bot Report. Direct fetch of the resource library landing page confirmed accessibility but the full report PDF requires a form submission. Headline statistics cross-verified against Imperva's official blog post 'AI Bots Overtake the Web' and Thales Group press release dated April 2025.

Imperva·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

How Crawlers Impact the Operations of the Wikimedia Projects

Wikimedia Foundation · 2025

Key finding

Bandwidth used by automated crawlers on Wikimedia Commons grew 50% between January 2024 and early 2025, driven primarily by AI training scrapers fetching multimedia at scale. Wikimedia Foundation engineers found that 65% of the most expensive requests on the projects came from bots, even though bots accounted for only about 35% of pageviews, because crawlers hit pages that miss the site's caching layer.

Methodology note · Diff blog post from the Wikimedia Foundation, dated April 1, 2025, written by the Site Reliability Engineering team. The analysis draws on Wikimedia's own traffic logs across Commons and other projects. Reporting in Ars Technica and other outlets reproduced the 50% bandwidth figure and the 65% expensive-request statistic.

Wikimedia Diff·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Trapping Misbehaving Bots in an AI Labyrinth

Cloudflare · 2025

Key finding

AI Labyrinth is an opt-in Cloudflare feature that serves AI-generated decoy pages to bots that ignore no-crawl directives. A misbehaving crawler follows hidden links into a maze of plausibly written but irrelevant content, wasting compute and exposing itself as a bot. Pages are pre-generated with Workers AI, sanitised against XSS, and stored in R2. AI crawlers send more than 50 billion requests a day across Cloudflare.

Methodology note · Cloudflare announcement from March 19, 2025, by Reid Tatoris, Harsh Saxena, and Luis Miglietti. The feature is available to all customers including the free plan, enabled with a single dashboard toggle. Decoy pages carry noindex meta tags to protect SEO and remain invisible to human visitors and verified crawlers.

Cloudflare Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Adobe Analytics: Traffic to U.S. Retail Websites from Generative AI Sources — Holiday 2024 / January 2025 Update

Adobe Digital Insights · 2025

Key finding

Visits to US retail websites originating from generative AI sources grew by roughly 1,200% between July 2024 and February 2025. AI-sourced visitors browsed 12% more pages per session and bounced 23% less than visitors from other channels. AI referrals still represented a small share of total retail traffic, but the per-visit engagement quality was meaningfully higher.

Methodology note · Aggregate analysis of Adobe Analytics data covering visits to US retail websites during the 2024 holiday season and January 2025. Adobe compared engagement metrics and visit counts for sessions originating from generative AI sources against sessions from other referral channels. Published March 2025.

Adobe Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

AI Search Has a Citation Problem (Tow Center Report)

Tow Center for Digital Journalism, Columbia · 2025

Key finding

Across eight AI search engines tested, more than 60% of news-attribution queries received incorrect answers. Perplexity got 37% wrong; Grok 3 got 94% wrong. Premium paid models were no more accurate than free ones, and often produced confidently incorrect answers without flagging uncertainty. Several engines retrieved content from publishers that had explicitly blocked their crawlers.

Methodology note · 1,600 queries were run across ChatGPT Search, Perplexity, Perplexity Pro, DeepSeek Search, Microsoft Copilot, Grok-2, Grok-3, and Google Gemini. The researchers selected 10 articles from each of 20 publishers, used direct excerpts as queries, and asked each chatbot to identify the headline, publisher, publication date, and URL. Responses were manually graded against six categories.

Columbia Journalism Review·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Web search (OpenAI API documentation)

OpenAI · 2025

Key finding

OpenAI's web-search API documentation states that the web_search_call output item will usually (but not always) include the search queries that were searched, and that the sources field can reveal all URLs consulted during the search run. This is first-party proof that some query rewrites can be observed for requests under the caller's control — but the 'usually but not always' caveat means observed fan-out is partial rather than exhaustive.

Methodology note · Official OpenAI developer documentation for the web search tool exposed via the Responses API. Describes the schema of the web_search_call output item, including which fields are populated and the explicit caveat that searched queries are returned 'usually (but not always).' Content verified by fetch on 2026-05-27. No aggregate usage data is disclosed.

OpenAI Developer Platform·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Goodbye Clicks, Hello AI: Zero-Click Search Redefines Marketing

Bain & Company · 2025

Key finding

Roughly 80% of consumers now rely on zero-click results, AI summaries, or assistant answers for at least 40% of their search needs, and AI search use has reduced average organic click-through rates by 15% to 25%. The shift compresses the funnel: brands need to be present and credible in the answer itself, not on the click destination.

Methodology note · Bain survey of more than 1,000 US consumers combined with proprietary analysis of organic search traffic patterns. The report measures self-reported AI search use and click-through behaviour across categories. Published February 2025.

Bain & Company·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google AI Overviews and Your Website | Google Search Central

Google · 2025

Key finding

Google states there are no extra technical requirements for appearing in AI Overviews or AI Mode beyond being indexed and eligible for a standard search snippet. SEO fundamentals apply: allow crawling in robots.txt, maintain internal linking, keep pages findable. Google describes the query fan-out technique, where the system issues multiple related searches across subtopics, and reports that clicks from AI Overview pages tend to be higher quality.

Methodology note · Official Google Search Central documentation describing how AI features such as AI Overviews and AI Mode interact with websites, and what site owners can and cannot do to influence inclusion. Direct fetch failed; content was verified against the live Google AI Features and AI Optimization Guide pages on developers.google.com plus corroborating secondary coverage.

Google Search Central·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Intellectual Property Issues in AI Trained on Scraped Data (AI Paper No. 33)

OECD · 2025

Key finding

OECD report 'Intellectual Property Issues in AI Trained on Scraped Data' (AI Paper No. 33, February 2025) examines copyright, trademark, trade secret, and database protection challenges raised by AI training data scraping. The report recommends voluntary codes of conduct based on transparency in the data chain, requiring AI developers to disclose data sources and preserve metadata enabling rightsholders to track unauthorised use.

Methodology note · OECD policy report, Paper No. 33, February 2025. Direct fetch returned HTTP 403; findings cross-verified against summaries on TheLegalWire, NortonRoseFulbright knowledge publications, and the OECD AI Policy Observatory at oecd.ai. The report is government/regulatory tier (OECD) and authoritative on policy direction.

OECD Publications·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

FTC Staff Report on AI Partnerships & Investments

U.S. Federal Trade Commission · 2025

Key finding

The FTC's January 2025 staff report on AI partnerships finds that the three largest US cloud providers (Alphabet, Amazon, Microsoft) have used their investments in Anthropic and OpenAI to lock in cloud spend, gain equity and revenue-sharing rights, and access sensitive technical and business information. The Commission flags risks to switching costs, input access for rivals, and competition for engineering talent and compute.

Methodology note · Press release from the Federal Trade Commission, January 17, 2025, summarizing a staff report based on Section 6(b) orders issued in January 2024 to five companies: Microsoft, OpenAI, Amazon, Alphabet, and Anthropic. Findings reflect information available to staff as of September 2024 plus publicly available information through January 2025. The Commission voted 5 to 0 to issue the report.

FTC·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

The Rise of the AI Crawler (Vercel + MERJ, 1B requests)

Vercel + MERJ · 2025

Key finding

On Vercel's network in late 2024, GPTBot generated 569 million monthly requests and Anthropic's Claude generated 370 million, together roughly 20% of Googlebot's 4.5 billion. None of the major AI crawlers (OpenAI, Anthropic, Meta, ByteDance, Perplexity) executed JavaScript. ChatGPT spent 34.82% of fetches on 404 pages, Claude 34.16%, versus 8.22% for Googlebot. Server-side rendered content is far more visible to AI crawlers.

Methodology note · Joint research from Vercel and MERJ, published December 17, 2024. Data comes from monitoring nextjs.org and the Vercel network, validated against two job-board sites (Resume Library on Next.js and CV Library on a custom monolith). Microsoft Copilot was excluded because it lacks a distinct user agent. Methods follow the same approach used in MERJ's earlier Googlebot analysis.

Vercel Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Long Context vs. RAG for LLMs: An Evaluation and Revisits

arXiv · Xinze Li · 2025

Key finding

Empirical comparison of long-context LLMs (where retrieved content is dropped into the model's input window) against retrieval-augmented generation (where retrieval is iterative) finds that long-context approaches underperform RAG when the relevant evidence is buried in noise. RAG still wins for most production information-retrieval tasks despite advances in long-context models.

Methodology note · arXiv preprint 2501.01880 (January 2025). Direct fetch on arxiv.org returned the abstract page; the empirical comparison uses public QA benchmarks and tests several long-context LLMs against RAG baselines under matched compute budgets.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Understanding How the Google Trends Explore Page uses Gemini to help you find insights

Google · 2025

Key finding

Google Trends Explore uses Gemini to take an area of interest and expand it into up to eight related search terms, additional ideas, and top/rising queries. This is direct first-party proof that public search behaviour can be expanded from a seed topic into an adjacent-intent neighbourhood — but Google explicitly frames it as web-search demand, not chatbot-prompt demand.

Methodology note · Official Google Help Center documentation page describing how the Google Trends Explore experience uses Gemini models to suggest related search terms, follow-up ideas, and top/rising queries from a seed input. Content was fetched and verified directly from support.google.com on 2026-05-27; the page also discloses Gemini privacy practices, data retention, and feedback mechanisms.

Google Help·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

ChatGPT Search (OpenAI Help Center)

OpenAI · 2024

Key finding

OpenAI documents that ChatGPT Search rewrites a user query into one or more targeted queries and may send additional, more specific queries after reviewing initial results. This is first-party proof that query fan-out behaviour is real inside a production chatbot search system.

Methodology note · Official OpenAI Help Center article describing how ChatGPT Search functions, including the prompt-rewriting and follow-up query behaviour. Content verified by fetch on 2026-05-27 (HTTP 200 confirmed; full body accessible in a browser session). Article documents product behaviour without disclosing the rewriting algorithm, query-volume statistics or model-side reasoning.

OpenAI Help Center·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google Common Crawlers Overview (incl. Google-Extended)

Google · 2024

Key finding

Google publishes a list of common crawlers covering Googlebot, Googlebot-Image, Googlebot-Video, Googlebot-News, AdsBot variants and the Google-Extended user-agent token used for Gemini and Vertex AI training opt-out. Each row in the documentation gives the user-agent string seen in HTTP requests, the matching robots.txt token, the IP ranges in common-crawlers.json and the products affected when a site changes its crawl preferences for that agent.

Methodology note · Official Google Search Central documentation listing the common Googlebot variants and their robots.txt tokens. The page was inaccessible to direct fetch; user-agent strings, IP-range publication mechanism and reverse-DNS verification process were confirmed through the live developers.google.com URL referenced in search results and supporting third-party crawler reference databases.

Google Search Central·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

GEO: Generative Engine Optimization

Princeton University / Georgia Tech / Allen Institute for AI / IIT Delhi · Pranjal Aggarwal et al. · 2024

Key finding

Adding citations, quotations, and statistics to content can increase its visibility in AI-generated answers by up to 41% on average. Pages ranked outside the top of traditional search saw the largest gains. The effect varies by content domain and by AI engine, but the lift from evidence-style content elements is consistent across the conditions tested.

Methodology note · 10,000 questions were run through generative search engines. The researchers compared answers before and after applying nine content optimisation strategies, including citations, quotations, statistics, and authoritative language. They measured visibility as the share of the AI answer attributable to the optimised page, using both word position and word count metrics. Peer-reviewed at KDD 2024.

arXiv / KDD 2024·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Detecting hallucinations in large language models using semantic entropy

University of Oxford (et al.) · 2024

Key finding

Sebastian Farquhar and colleagues at the University of Oxford propose using semantic entropy to detect hallucinations in LLMs. The method measures uncertainty in the meaning of model outputs across sampled generations rather than uncertainty in token probabilities. Outperforms prior hallucination-detection baselines on standard QA benchmarks and generalises across model families.

Methodology note · Nature article (volume 630), Farquhar, Kossen, Kuhn and Gal, published 2024. Peer-reviewed primary research. Direct fetch on nature.com confirmed authorship, journal, and the semantic-entropy method. Empirical evaluation across multiple QA datasets and LLMs is detailed in the full paper.

Nature·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Retrieval-Augmented Generation for Large Language Models: A Survey

arXiv (cs.CL) · Yunfan Gao et al. · 2024

Key finding

Comprehensive survey of retrieval-augmented generation (RAG) covering its history, core components (retrieval, augmentation, generation), evaluation methods, and open challenges. The survey organises RAG variants into a taxonomy and traces the field's evolution from naive retrieval to modular and agentic RAG architectures. Widely cited as the field's canonical reference.

Methodology note · arXiv preprint 2312.10997 (December 2023, updated 2024). Direct fetch on arxiv.org returned the abstract page; the full survey runs over 40 pages and includes a comprehensive bibliography. One of the most-cited RAG references in the academic literature.

arXiv·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Perplexity Crawlers Documentation (PerplexityBot, Perplexity-User)

Perplexity AI · 2024

Key finding

Perplexity documents two user agents controllable via robots.txt: PerplexityBot indexes content for retrieval in Perplexity answers, while Perplexity-User fetches pages on demand when a user submits a query. Pages disallowed for PerplexityBot are not indexed in full, though Perplexity may still display the domain, headline and a brief factual summary. IP ranges are published at perplexity.com/perplexitybot.json and perplexity.com/perplexity-user.json. (agent inferred)

Methodology note · Official Perplexity crawler documentation. Original URL docs.perplexity.ai/guides/bots returns HTTP 308 redirect; the canonical content now lives at docs.perplexity.ai/docs/resources/perplexity-crawlers. User-agent strings, robots.txt behaviour and IP-publication URLs (perplexity.com/perplexitybot.json, perplexity-user.json) cross-verified via 51Degrees, Known Agents and CrawlerCheck.

Perplexity Docs·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google FAQ Structured Data Guidelines (FAQPage Schema)

Google · 2023

Key finding

Google's FAQPage structured data documentation announces that as of May 7, 2026, FAQ rich results no longer appear in Google Search. Support in the rich-result report and Rich Results Test ends in June 2026, and Search Console API support is removed in August 2026. While the feature is being deprecated, FAQ markup itself remains valid Schema.org and is still used by AI engines that read structured data.

Methodology note · Official Google Search Central documentation page for FAQPage structured data, last updated 8 May 2026. The page sets out the schema requirements (FAQPage, Question, Answer), eligibility rules (limited to authoritative health or government sites), content guidelines and the deprecation timetable for FAQ rich results.

Google Search Central·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

Google E-E-A-T Quality Rater Guidelines Update

Google · 2022

Key finding

Google announced an update to its Search Quality Rater Guidelines, adding a second E to E-A-T to create E-E-A-T: Experience, Expertise, Authoritativeness, and Trustworthiness. Experience asks whether content reflects first-hand or life experience with the subject. Trust is positioned as the most important of the four, and the others support it. The guidelines instruct human raters who evaluate search quality.

Methodology note · Official Google Search Central blog announcement from December 15, 2022, accompanying a revised version of the public Search Quality Rater Guidelines PDF. The guidelines describe how Google's external quality raters score sample results to train and evaluate ranking systems. Ratings do not directly change rankings but feed into system improvements.

Google Search Central Blog·Accessed 27.05.2026

Tier A — Strongest evidenceRead source

KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora

Google Research · 2021

Key finding

Google Research's KELM project verbalized the entire English Wikidata knowledge graph into natural-language sentences using its TEKGEN model, producing a corpus of about 18 million sentences spanning about 45 million triples and about 1,500 relations. This corpus was used to augment language-model pre-training, demonstrating a concrete pathway by which structured entity data reaches LLMs.

Methodology note · Research blog post from Google Research (May 2021) describing the KELM corpus and TEKGEN data-to-text model. A dated research artifact, not a statement about current production ranking or retrieval; the entity-to-LLM pathway it demonstrates is the relevant claim. Verified by direct fetch during this run.

Google Research Blog·Accessed 10.07.2026

Tier B — Citable with caveats

These sources are credible but narrower in scope or methodology than Tier A. They include trade press with editorial standards (Digiday, Search Engine Land, Marketing Brew, Press Gazette, Nieman Lab), case studies from named companies with disclosed methodology, and single-vendor studies with disclosed methodology (Ahrefs, Semrush, Surfer, Profound, Tinuiti), where sample size or vendor incentives are part of the picture.

We cite Tier B sources when they're the best available evidence. We mark them with a one-line caveat, usually about sample size, methodology, or possible bias. If a Tier A source exists for the same claim, we use that instead.

Tier B — Citable with caveatsRead source

When AI Describes Your Brand, It Cites Your Competitors (SurfacedBy)

SurfacedBy · Ali Khallad · 2026

Key finding

SurfacedBy analyzed almost 100,000 citations to outside sources behind AI answers about a sample of brands, across ChatGPT, Claude, Gemini, Perplexity, and Google AI Mode over about three months in spring 2026. Around 40% of the outside sources cited about a brand were that brand's direct competitors; the typical brand sat above a third, and nearly nine in ten had at least a quarter. Brand-owned pages are a smaller stream.

Methodology note · First-party study by SurfacedBy, an AI-visibility tracking vendor with commercial interest, published 29 June 2026. Nearly 100,000 outside-source citations resolving to 10,000-plus domains, tagged competitor or third party using each brand's own competitor set, aggregated brand by brand. Authors stress a small, mixed brand sample; decimals provisional. Verified by direct fetch.

SurfacedBy Blog·Accessed 08.07.2026

Tier B — Citable with caveatsRead source

We Analyzed 127,198 AI Citations. The Five Engines Barely Read the Same Web. (SurfacedBy)

SurfacedBy · Ali Khallad · 2026

Key finding

SurfacedBy analyzed 127,198 source citations from ChatGPT, Claude, Gemini, Perplexity, and Google AI Mode across roughly 16,400 commercial-intent answers between March and June 2026. Of 11,647 cited domains, 69.6% were cited by only one engine and just 2.7% by all five. Vendor, product, and long-tail pages drew 90.6% of citations; Reddit 1.8% and Wikipedia 0.6%. Gemini averaged 11.0 sources per answer, ChatGPT 3.7.

Methodology note · First-party experiment by SurfacedBy, an AI-visibility tracking vendor with commercial interest, published 27 June 2026 and updated 29 June. About 16,400 answers to real buyer and category questions across five engines; citations counted at the domain level. Authors disclose limits: commercial-query skew, citations are not clicks, engine behavior shifts. Verified by direct fetch.

SurfacedBy Blog·Accessed 08.07.2026

Tier B — Citable with caveatsRead source

What is Retrieval-Augmented Generation? (AWS Explainer)

Amazon Web Services · 2026

Key finding

AWS's explainer defines retrieval-augmented generation as a technique that supplements an LLM's training data with external sources at inference time, improving factual accuracy and reducing hallucinations. The page covers RAG benefits (cost-effective vs fine-tuning, current information, source attribution), and recommended architectural patterns on AWS infrastructure.

Methodology note · AWS vendor explainer page. Direct fetch returned the HTML article. Tier B because AWS is an authoritative vendor in the cloud and AI infrastructure space but the page is a marketing explainer rather than original research. Suitable as a definitional reference for RAG; not for empirical citation claims.

AWS·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Cited-Listicle Rank-Tier Exposure, Author Type, and LLM Brand Visibility: A Two-Part Model of Selection and Prominence in Generative Engine Responses (Peec AI Working Paper)

Peec AI · Jan Ehrlinspiel et al. · 2026

Key finding

Across three markets and up to seven engines (5.7M brand-chat observations, Sept 2025-March 2026), appearing in repeatedly-cited third-party listicles is positively associated with LLM brand mentions. Third-party rank-1 exposure raises mention probability by roughly 14-32 percentage points and is linked to earlier placement (about 1.1-1.5 positions). The authors stress these are within-brand associations, not causal rank effects.

Methodology note · Observational working paper by three Peec AI authors, posted to SSRN 12 May 2026. Chat-level panel data across B2B SaaS, MarTech and US finance, estimated as a Two-Part Model: a correlated random-effects logit for selection and OLS for prominence, with prompt-model-date fixed effects and Mundlak brand means. Data and code are proprietary and not released. Verified directly from the authors' PDF.

SSRN (Working Paper)·Accessed 01.06.2026

Tier B — Citable with caveatsRead source

We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved

Ahrefs · 2026

Key finding

Across 1,885 pages that added JSON-LD between August 2025 and March 2026, schema produced no meaningful uplift in AI citations. Matched difference-in-differences tests against 4,000 control pages showed +2.4% on Google AI Mode and +2.2% on ChatGPT (both statistically indistinguishable from zero) and a small 4.6% decline on Google AI Overviews. 53% of AI-cited pages already carry schema, but this reflects overall site quality.

Methodology note · Ahrefs identified 1,885 URLs that transitioned from no JSON-LD to having JSON-LD between August 2025 and March 2026, using its crawler database. Each treated page was matched to three control pages from different domains with similar pre-period citation levels. Citation changes were measured 30 days before and after the schema-add date across AI Overviews, AI Mode and ChatGPT using four statistical tests including matched difference-in-differences.

Ahrefs Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI-Referred Shoppers Convert Better and Spend More: What Shopify's Early Data Shows

Shopify · Kyle Risley · 2026

Key finding

On Shopify storefronts in Q1 2026, AI-referred product-page sessions converted nearly 50% higher than organic search, beating organic in 23 of 25 categories by an average of 56%, with 14% higher order values. AI-chatbot referral sessions grew over 8x year over year and AI-referred orders about 13x. More than half of AI-referred sessions start on a product page versus about 20% for organic.

Methodology note · First-party Shopify platform telemetry across storefronts on its platform, authored by Kyle Risley and published 11 May 2026. Metrics are disclosed (conversion, AOV, session and order growth) but the method is aggregate platform analytics, not a controlled study; Shopify has a commercial interest in AI commerce. Verified by direct page fetch 2026-07-08.

Shopify Enterprise Blog·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

AI Citation Ranking Factors (Meta-Analysis of 54 Studies)

Zyppy · Cyrus Shepard · 2026

Key finding

Scores 23 AI-citation factors 0-10 on repeatability, evidence strength, and official platform/patent support, across ChatGPT, Gemini, Perplexity. Top five: URL accessibility 9.5, search rank 9.4, fan-out rank 9.3, preview control (nosnippet) 9.2, query-answer match 9.2. Topic-cluster ranking 8.9; AI-ready structure 8.6; self-contained passages 8.0; cites sources internally 8.0; freshness 7.0. Lowest: llms.txt 2.0 (no credible evidence of citation impact). Core thesis: 'win SEO, win AI citations, with extra steps.'

Methodology note · Meta-analysis: author gathered ~54 published experiments, patents, and case studies (2024-2026) and scored 23 recurring factors on three axes (repeatability across studies, strength of evidence, official support). Scores are author-assigned weights, not a single controlled experiment, so treat as a prioritised evidence map rather than measured effect sizes. Published on Zyppy Signal Substack 2026-05-07; widely re-reported (PPC Land, multiple SEO blogs). Cross-verified figures against secondary coverage; primary Substack post is the canonical source.

Zyppy Signal (Substack)·Accessed 08.07.2026

Tier B — Citable with caveatsRead source

Earned Media Still Drives 84% of AI Citations (Muck Rack, What Is AI Reading? May 2026)

Muck Rack (Generative Pulse) · Linda Zebian · 2026

Key finding

Muck Rack's May 2026 "What Is AI Reading?" study analyzed more than 25 million links cited by ChatGPT, Claude, and Gemini across 17 industries and found earned media accounts for 84% of all AI citations, journalism alone 27%, and paid/advertorial just 0.3%. The earned share held at 82-89% across three editions since July 2025. Single-vendor study; treat as directional.

Methodology note · Single-vendor study by Muck Rack's Generative Pulse team, published 7 May 2026, authored by Linda Zebian. Analyzed 25M+ citation links across ChatGPT, Claude, and Gemini over 17 industries; third consecutive edition since July 2025. Page fetched and figures, per-platform citation rates, and top-cited domains confirmed directly.

Muck Rack Blog·Accessed 23.06.2026

Tier B — Citable with caveatsRead source

OpenAI's Crawler Docs Now List OAI-AdsBot for ChatGPT Ads

Search Engine Journal · 2026

Key finding

OpenAI added a new crawler, OAI-AdsBot, to its public bots documentation. OAI-AdsBot supports ChatGPT's advertising features by fetching pages so advertised products and links can be checked and presented inside ChatGPT. Publishers can list OAI-AdsBot in robots.txt to allow or disallow its access, separately from GPTBot (training), OAI-SearchBot (search), and ChatGPT-User (user-initiated browsing).

Methodology note · Search Engine Journal news article reporting on OpenAI's published crawler documentation at platform.openai.com/docs/bots. SEJ is a long-running SEO trade publication that tracks vendor documentation updates. The piece references the live OpenAI page where OAI-AdsBot is listed alongside OpenAI's other declared user agents.

Search Engine Journal·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

The Answer Economy: How AI Search Is Rewiring B2B Software Buying

G2 · 2026

Key finding

G2's survey of 1,076 B2B software buyers, fielded March 2026, found 51% now start research with an AI chatbot more often than Google, up from 29% a year earlier, and 71% use AI chatbots for vendor research. 69% chose a different vendor than planned on AI guidance, and a third bought from a vendor new to them. Single-vendor survey with commercial interest.

Methodology note · First-party survey by G2 of 1,076 B2B software buyers and decision-makers, fielded March 2026 and published 15 April 2026 as 'The Answer Economy'. G2 sells answer-engine-optimization products, so treat as directional, not independent. Verified against G2's own release and PR Newswire. An unverified '85% think more highly' figure circulating in aggregators does not trace to G2's release.

G2·Accessed 08.07.2026

Tier B — Citable with caveatsRead source

Where Google AI Overviews Cite From: A 100-Page Analysis

CXL · 2026

Key finding

In a mapping of 100 Google AI Overview citations, 55% of cited snippets sit in the top 30% of the source page. The middle third of the page produces 24% of citations. Everything past the 60% mark accounts for just 21%. Pages whose answer is buried below the fold are far less likely to be picked up.

Methodology note · CXL coded 100 individual AI Overview citations by where in the source page the cited passage appeared, splitting each page into vertical thirds. The data was used to assess how page position relates to citation probability. The original page was inaccessible at the time of writing; figures were confirmed via secondary coverage referencing the CXL study directly.

CXL Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Exclusive: Small Publishers Hit Hardest by Search Traffic Declines (Chartbeat data)

Axios · 2026

Key finding

Smaller publishers are seeing AI chatbot referrals rise as a share of search-driven traffic, even as overall search referrals to news sites decline. Chartbeat data shows the gap between large and small publishers in AI traffic share is narrowing, with niche publishers picking up disproportionate AI visibility. (agent inferred)

Methodology note · Axios reporting on Chartbeat's analysis of referrer data across its publisher network, comparing AI chatbot referrals (ChatGPT, Perplexity, Copilot) against traditional search referrals over time, segmented by publisher size.

Axios·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

We Analyzed 89K LinkedIn URLs Cited in AI Search: Here's What Drives Visibility

Semrush, in collaboration with LinkedIn · Margarita Loktionova · 2026

Key finding

Across 325,000 prompts run against ChatGPT Search, Google AI Mode and Perplexity in January–February 2026, LinkedIn appears in 11% of AI responses on average and ranks second among all cited domains. LinkedIn articles of 500–2,000 words and posts of 50–299 words attract the most citations. Perplexity cites Company Pages in 59% of LinkedIn citations; ChatGPT Search and Google AI Mode cite individual creators 59% of the time.

Methodology note · Vendor research study by Semrush in collaboration with LinkedIn, published March 10, 2026. 325,000 unique prompts across twelve industry categories were sent to ChatGPT Search, Google AI Mode, and Perplexity, yielding 89,000 cited LinkedIn URLs. Each URL was enriched with content type, author signals, engagement, and a semantic similarity score (0.57–0.60). Source verified by direct fetch.

Semrush Blog·Accessed 01.06.2026

Tier B — Citable with caveatsRead source

The YouTube Citation Study 2026 (OtterlyAI)

OtterlyAI · Rick Tousseyn · 2026

Key finding

Across 100M+ AI citations tracked over 30 days across six AI engines, 94 percent of YouTube AI citations went to long-form videos and 5.7 percent to Shorts. Views, likes and subscribers showed near-zero correlation with citation frequency (r approximately -0.03). Description length (r = 0.31) and timestamp presence drove repeat citation: 78 percent of timestamped videos were cited multiple times.

Methodology note · OtterlyAI YouTube Citation Study, published 2 March 2026 by Rick Tousseyn. Direct fetch on otterly.ai confirmed the methodology: 30-day citation tracking across ChatGPT, Google AI Overviews and AI Mode, Perplexity, Microsoft Copilot, and Gemini. Pearson correlation analysis on already-cited videos only — results explain repeat-citation behaviour rather than initial citation eligibility.

OtterlyAI Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

New Finding: ChatGPT Sources 83% of Its Carousel Products from Google Shopping via Shopping Query Fan-Outs (Peec AI)

Peec AI · Malte Landwehr, Tom Wells · 2026

Key finding

Peec AI analyzed more than 43,000 ChatGPT shopping-carousel products across 10 verticals against over 200,000 organic Google and Bing shopping results. About 83% of ChatGPT carousel products matched Google Shopping's top organic positions, with 60% from the top 10 and roughly 84% from the top 20. The Bing equivalent was 11%, and most Bing matches also appeared in Google, indicating ChatGPT Shopping largely re-ranks scraped Google Shopping organic results.

Methodology note · Trade-press article in Search Engine Land reporting a first-party Peec AI study by Malte Landwehr and Tom Wells, published March 2026. Method disclosed: 43,000 carousel products across 10 verticals matched by position against 200,000-plus organic results. Peec sells AI-visibility tracking. Page exceeded the fetch size limit; figures cross-verified against Seeders and other secondaries.

Search Engine Land·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

AI Sources Like ChatGPT Account for Less Than 1% of Publishers' Pageviews

Nieman Lab (Chartbeat) · 2026

Key finding

AI sources such as ChatGPT account for less than 1% of publisher pageviews, according to Chartbeat data covering thousands of news and media sites. Direct visits and traditional search still dominate referrals to publishers, while AI referrals remain a small but growing channel. (agent inferred)

Methodology note · Nieman Lab summary of new Chartbeat analytics data covering pageview composition across its publisher network. Chartbeat tracks real-time referrer data on thousands of news websites and reports aggregate shares from search, social, direct, and AI assistants.

Nieman Lab·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Anthropic Updates Crawler Docs: ClaudeBot, Claude-User, Claude-SearchBot

Search Engine Roundtable · 2026

Key finding

Anthropic updated its public crawler documentation to clarify the role of each bot. ClaudeBot handles training data collection. Claude-User performs user-initiated retrievals inside Claude (similar to ChatGPT-User). Claude-SearchBot powers Claude's search feature and is meant to be allowed if publishers want their pages to appear in Claude answers. Each bot has its own user agent and robots.txt token, with separate published IP ranges.

Methodology note · Search Engine Roundtable news item by Barry Schwartz reporting on Anthropic's crawler documentation update. Search Engine Roundtable is a long-running SEO news site that tracks search engine and AI vendor documentation changes. The post quotes Anthropic's own published crawler reference page.

Search Engine Roundtable·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI Bot Traffic Closing in on Human Web Visits (TollBit Q4 2025 / Q1 2026 data)

The Register · 2026

Key finding

TollBit's Q4 2025 State of the Bots report found roughly one AI bot visit for every 31 human visits, up from 1 in 200 in Q1. Training scrapes fell 15% Q2 to Q4, while RAG bot traffic rose 33% and AI search indexers rose 59%. Click-through referrals from AI apps dropped from 0.8% in Q2 to 0.27% in Q4. ChatGPT-User scrapes pages five times more often than the next scraper.

Methodology note · Reporting in The Register by Brandon Vigliarolo, February 4, 2026, summarising TollBit's quarterly State of the Bots study. TollBit tracks AI bot traffic on behalf of publishers. The article also cites supporting data from Eight Oh Two and Pew Research on AI search usage among US adults.

The Register·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Google AI Overviews CTR Shows Early Signs of Recovery (Seer Interactive 25M-impression study)

Search Engine Land · 2026

Key finding

Organic click-through rate on Google searches with AI Overviews rose from a low of 1.3% in December 2025 to 2.4% in February 2026, an 85% rebound in two months. Searches without AI Overviews achieve about 3.3% CTR; pages cited inside an AI Overview get about 2.1%; uncited pages get about 0.9%. CTR on AIO-free queries rose from 2.8% to 3.8% year on year.

Methodology note · Search Engine Land summary of a Seer Interactive study covering 53 brands, 5.47 million queries, and 2.43 billion impressions from January 2025 to February 2026. Seer compared organic and paid click-through rates across searches with and without AI Overviews, and segmented results by query intent (informational, transactional, comparison, question).

Search Engine Land·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Q1 2026 AI Citation Trends Report

Tinuiti · 2026

Key finding

In January 2026, social media accounts for around 9% of all AI citations across the platforms tracked. Google AI Overviews cites social media more than four times as often as Gemini. Amazon.com appears zero times in Gemini's citations during the period. Reddit dominates the social share, with YouTube next; patterns differ sharply between engines, including between Google's own Gemini and AI Mode.

Methodology note · Tinuiti's Q1 2026 report uses the Profound platform to track citations across nine categories (apparel, beauty, electronics, food and beverage, home and garden, manufacturing, OTC health, technology, transportation and logistics) and seven AI surfaces (ChatGPT, Perplexity, Google AI Mode, Google AI Overviews, Gemini, Microsoft Copilot, Meta AI). The full report is gated; sample charts and headline figures are public.

Tinuiti Research·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Global Publisher Google Traffic Dropped by a Third in 2025

Press Gazette · 2026

Key finding

Globally, Google search traffic to publishers fell by a third in the year to November 2025. Google Discover referrals to over 2,500 publisher sites dropped 21% year on year. In the US, Google search referrals fell 38% and Discover fell 29%. Surveyed media leaders expect publisher traffic to decline by 43% on average over the next three years. ChatGPT referrals reached 0.02% of total traffic.

Methodology note · Press Gazette reports on Chartbeat data published within the Reuters Institute's Journalism and Technology Trends and Predictions 2026 report. The Reuters report combines Chartbeat referral analytics across publisher sites with a survey of 280 media leaders (including 64 editors-in-chief, 64 CEOs, 51 heads of digital) from 51 countries.

Press Gazette·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI Traffic Converts at 3× the Rate of Other Channels (Study)

Microsoft Clarity · 2026

Key finding

Visitors arriving from AI assistants convert at roughly 3 times the rate of visitors from other channels, and at up to 11 times the rate in certain publisher segments. AI traffic still represents a small share of total visits, but its per-visitor commercial value is materially higher than traditional search or social.

Methodology note · Analysis of Microsoft Clarity user-session data across a multi-publisher dataset. The study compared conversion rates of sessions originating from AI assistants against sessions from other referral channels. Published January 2026. Single-vendor study with disclosed methodology, downgraded to Tier B in v1.1.

Microsoft Clarity Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Cloudflare Year in Review: AI Bots Crawl Aggressively Without Proportional Referrals

InfoQ / Cloudflare · 2025

Key finding

Cloudflare's 2025 Radar Year in Review reports global internet traffic up 19% year on year, with Googlebot still the largest single source of crawler traffic. Crawl-to-refer ratios widened sharply: Anthropic peaked at 500,000 to 1, OpenAI at 3,700 to 1, vast numbers of crawls per referral click. Half of human web traffic now uses post-quantum encryption, and Go's share of automated API requests jumped from 12% to 20%.

Methodology note · InfoQ summary by Renato Losio of Cloudflare's sixth annual Radar Year in Review, published December 31, 2025. Data is drawn from Cloudflare's edge network and the 1.1.1.1 public DNS resolver. The InfoQ piece highlights and aggregates the figures from Cloudflare's own published Radar microsite.

InfoQ·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Year in Search: AI Overview Study (Serpstat)

Serpstat · 2025

Key finding

Across millions of keywords tracked by Serpstat in 2025, AI Overviews expanded across query types and saw a year of rising prevalence, with informational queries the most affected and commercial queries gaining ground. Click-through rates on AIO-affected SERPs were materially lower than on traditional SERPs. (agent inferred)

Methodology note · Serpstat blog post by Kateryna Hordiienko (AI Marketer at Serpstat), 25 December 2025. Direct fetch returned the full article confirming the methodology: 1 billion keywords analysed, 35 million AI Overviews tracked. Tier B vendor study with disclosed methodology.

Serpstat Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Google AI Overviews Surged in 2025, Then Pulled Back: Data

Search Engine Land · 2025

Key finding

Google AI Overviews appeared on 6.5% of queries in January 2025, peaked at just under 25% in July, then fell back to under 16% by November. Informational queries dominated early (91% in January) but fell to 57% by October, as commercial queries rose from 8% to 18% and transactional from 2% to 14%. Ads alongside AI Overviews rose from about 3% to 40%.

Methodology note · Search Engine Land summary of a Semrush analysis of more than 10 million keywords from January through November 2025. Semrush tracked AI Overview activation rates by month and query intent, paid ad placement frequency, zero-click rates on the same keywords before and after AIO appeared, and category-level penetration.

Search Engine Land·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

OpenAI Revises ChatGPT Crawler Documentation with Significant Policy Changes

PPC Land · 2025

Key finding

On December 9, 2025, OpenAI updated its crawler documentation. ChatGPT-User, which handles user-initiated browsing inside ChatGPT, no longer commits to following robots.txt, on the basis that requests come from a user rather than an autonomous crawler. OAI-SearchBot is now described purely as a search crawler, with training data removed from its scope. GPTBot and OAI-SearchBot may also share crawl results to avoid duplicate fetches.

Methodology note · PPC Land news article by Luis Rijo summarising changes spotted by digital marketing consultant Pieter Serraris in OpenAI's public bots documentation. The article quotes OpenAI's previous and revised wording side by side and links to the live OpenAI bots reference page on platform.openai.com.

PPC Land·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Publishers Say No to AI Scrapers, Block Bots at Server Level

The Register · 2025

Key finding

BuiltWith counted about 5.6 million sites disallowing OpenAI's GPTBot in robots.txt, up from 3.3 million in July 2025, a 70% jump in five months. ClaudeBot is now blocked at about 5.8 million sites, AppleBot at 5.8 million, Googlebot at 18 million. TollBit reports a 336% year-on-year rise in sites blocking AI crawlers; 13.26% of AI bot requests in Q2 2025 ignored robots.txt, up from 3.3% in Q4 2024.

Methodology note · The Register reporting by Thomas Claburn, December 8, 2025, drawing on BuiltWith's public robots.txt trend dashboards and TollBit's Q2 2025 report. The article also cites Arc XP data showing about half of news sites block GPTBot, and quotes Cloudflare VP of product Will Allen.

The Register·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Salesforce Data: AI and Agents Propel Cyber Week to Record $336.6B in Global Spend

Salesforce · 2025

Key finding

Over Cyber Week 2025 (Nov 25 to Dec 1), Salesforce measured $336.6 billion in global sales, up 7% year over year, with US sales of $79.6 billion. AI and agents drove $67 billion in sales and influenced 20% of all orders through product recommendations and conversational service. Retailers using Agentforce 360 plus their own branded agents grew sales 32% faster than those without.

Methodology note · First-party Salesforce press release published 5 December 2025, drawing on its Shopping Index of more than 1.5 billion shoppers across 89 countries. Salesforce discloses that figures blend first- and third-party data with market assumptions, and it sells Agentforce, so commercial interest applies. Verified by direct fetch 2026-07-08.

Salesforce Newsroom·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

AI Overview Fan-Out Rankings Boost Citation Odds by 161% (Surfer SEO study, 10K keywords)

Search Engine Land · 2025

Key finding

Pages ranking for both the main query and at least one fan-out sub-query collected 51% of AI Overview citations. Pages ranking only for the main query collected just under 20%. Ranking for fan-out queries makes citation 161% more likely than ranking only for the head term. Around 68% of cited pages did not rank in Google's top 10 for any related query.

Methodology note · Search Engine Land coverage, December 2025, of a Surfer SEO analysis of 10,000 keywords and 33,000 fan-out queries extracted with Gemini. Surfer measured the share of AI Overview citations going to pages ranking on the head query, on fan-outs, on both, or on neither, and reported a Spearman correlation of 0.77 between fan-out coverage and citation rate.

Search Engine Land·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Schema Markup and AI in 2025: What ChatGPT, Claude, Perplexity & Gemini Really See (searchVIU)

searchVIU · 2025

Key finding

A controlled experiment placed product prices only in hidden JSON-LD, Microdata, and RDFa. During live page retrieval, zero of five AI systems (ChatGPT, Claude, Perplexity, Gemini, Google AI Mode) extracted the hidden values; all read only visible HTML. Gemini and Google AI Mode executed JavaScript, the others did not.

Methodology note · Controlled retrieval experiment by searchVIU (Sebastian Erlhofer team), published 2025 and widely cited in 2026. The primary page resisted direct fetch; figures and the five-system result were cross-verified against Search Engine Journal (Matt Southern, May 2026) and Gianluca Fiorelli at I Love SEO, both reporting identical findings.

searchVIU Blog·Accessed 08.07.2026

Tier B — Citable with caveatsRead source

LLMs.txt Shows No Clear Effect on AI Citations (300K domains)

SE Ranking · 2025

Key finding

Across 300,000 domains, only 10.13% had an llms.txt file. Adoption is roughly flat across traffic tiers, with high-traffic sites slightly less likely (8.27%) to use it than mid-tier ones (10.54%). Statistical tests and an XGBoost model found no relationship between the presence of llms.txt and how often a domain is cited by AI engines. Removing the variable from the model actually improved its accuracy.

Methodology note · SE Ranking study of nearly 300,000 domains, published November 2025. The team checked each domain for an llms.txt file, segmented adoption by monthly traffic, and modelled citation frequency using Spearman correlation, XGBoost regression and SHAP analysis. The conclusion is based on whether llms.txt presence improved or degraded model predictions of LLM citations.

SE Ranking Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

ChatGPT Search Often Switches To English In Fan-Out Queries (Search Engine Journal)

Search Engine Journal · 2025

Key finding

Search Engine Journal reports on vendor-observed fan-out behaviour in ChatGPT Search, including the pattern of fan-out queries often switching to English regardless of the original prompt language. Useful as reporting on vendor-observed patterns; should not be treated as first-party platform proof or as a representative sample of all end-user prompts.

Methodology note · Search Engine Journal article by Matt G. Southern, 18 February 2026, reporting on a Peec AI analysis of 10M+ ChatGPT prompts and 20M fan-out queries. Trade-press coverage of vendor-observed patterns; the underlying dataset comes from Peec's own controlled prompt runs (UI scraping), not a representative sample of all consumer ChatGPT use.

Search Engine Journal·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

The Most-Cited Domains in AI: A 3-Month Study

Semrush · 2025

Key finding

Across more than 100 million citations from 230,000 prompts tracked weekly between July 14 and October 12, 2025, ChatGPT's reliance on Reddit and Wikipedia collapsed in mid-September. Reddit fell from close to 60% of ChatGPT responses in early August to around 10% by mid-September. Wikipedia dropped from roughly 55% to under 20%. AI Mode and Perplexity stayed stable; LinkedIn and Forbes citations grew across all three engines.

Methodology note · Semrush ran weekly snapshots of citations for 230,000 prompts over 13 weeks across ChatGPT search, Google AI Mode and Perplexity. Each week the team tracked the 25 most-cited domains and totaled changes around Google's mid-September removal of the num=100 search parameter. The post includes per-platform domain trend lines and lists the biggest gainers and losers per engine.

Semrush Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Semrush 230,000 Prompts Multi-Platform AI Visibility Study

Semrush · 2025

Key finding

AI assistants cite community-edited and forum content far more often than corporate marketing pages. Wikipedia appeared as the first or second most-cited source in four of five industries studied. Reddit was cited in 176% of ChatGPT finance queries, meaning Reddit was referenced more than once per answer on average. Official brand websites rarely appeared in top-source lists.

Methodology note · Semrush analysed search prompts across finance, digital technology, business services, consumer electronics, and fashion, comparing ChatGPT and Google AI Mode responses. The team measured source citation frequency, brand mention rates, and the overlap between mentioned brands and cited source domains. Methodology disclosed; ongoing dataset.

Semrush Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Wikipedia Says Traffic Is Falling Due to AI Search Summaries and Social Video

TechCrunch (Wikimedia Foundation) · 2025

Key finding

Wikipedia's human pageviews fell 8% year on year over the months preceding October 2025, according to the Wikimedia Foundation. The decline became visible after improved bot-detection systems revealed that much of the May and June 2025 traffic spike came from bots designed to evade detection. The Foundation attributes the decline to generative AI summaries in search and to younger users seeking information on social video platforms.

Methodology note · TechCrunch reporting based on a Wikimedia Foundation blog post by Marshall Miller. The post draws on Wikipedia's server logs and updated bot-detection systems to separate human from automated traffic, then compares human pageviews year on year.

TechCrunch·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI Platform Citation Patterns (680M citations across ChatGPT, AI Overviews, Perplexity)

Profound (TryProfound) · 2025

Key finding

Across 680 million tracked citations between August 2024 and June 2025, the three big AI engines source very differently. Wikipedia is ChatGPT's top source at 7.8% of citations and 47.9% of its top-10 sources. Reddit leads on Perplexity (6.6% of all citations, 46.7% of top-10) and Google AI Overviews (2.2%). Around 80% of cited URLs sit on .com domains.

Methodology note · Profound analysed citations collected by its monitoring platform across ChatGPT, Google AI Overviews and Perplexity from August 2024 to June 2025. The post reports two cuts of the same data: share of total citations (per platform) and share of each platform's top 10 most-cited sources. Top-level domain distribution is broken out separately. Source lists are published in the article.

Profound·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Google AI Overviews Linked to 25% Drop in Publisher Referral Traffic

Digiday · 2025

Key finding

Across 19 Digital Content Next member publishers (including The New York Times, Condé Nast, Vox), Google search referral traffic fell broadly in May and June 2025. The median year-on-year decline was 10% overall, 7% for news brands and 14% for non-news. Losses outpaced gains two to one. UK lifestyle and automotive publishers reported CTR falls of up to 25% on first-page rankings.

Methodology note · Digiday reports on a Digital Content Next survey of 19 of its approximately 40 member publishers, run between May and June 2025, plus parallel evidence submitted by the UK's Professional Publishers Association to the Competition and Markets Authority. Findings combine year-on-year referral traffic comparisons with publisher-reported CTR data on specific queries.

Digiday·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Perplexity Accused of Scraping Websites That Explicitly Blocked AI Scraping

TechCrunch · 2025

Key finding

TechCrunch reported on Cloudflare's August 4, 2025 findings that Perplexity continued to scrape sites after they blocked PerplexityBot in robots.txt and WAF rules. According to Cloudflare, Perplexity rotated user agents (impersonating Chrome on macOS), used undeclared IPs, and changed ASNs, with the behaviour observed across tens of thousands of domains and millions of requests per day. Perplexity disputed the findings.

Methodology note · TechCrunch reporting by Lorenzo Franceschi-Bicchierai, August 4, 2025, summarising Cloudflare's research post and including a direct response from Perplexity spokesperson Jesse Dwyer who described the post as a sales pitch and denied that the named bot belonged to the company.

TechCrunch·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Surfer SEO AI Citation Report 2025 (36M AIOs, 46M citations)

Surfer SEO · 2025

Key finding

Across 36 million Google AI Overviews and 46 million citations between March and August 2025, three domains dominate: YouTube at about 23.3%, Wikipedia at 18.4% and Google.com at 16.4%. Industry mix shifts the picture: NIH leads health at 39%, YouTube and Reddit together carry gaming with 93% and 78% appearance rates, and Shopify takes 17.7% of ecommerce citations.

Methodology note · Surfer's AI Tracker logged AI Overview responses and their citations from March to August 2025, covering 36M Overviews and 46M citations across 57,000-plus URLs. The team broke results into industry segments (finance, health, ecommerce, SEO, gaming, sports, travel) and reported the share of citations earned by the most frequent domains within each category.

Surfer Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

New Study: AI Assistants Prefer to Cite Fresher Content (17 Million Citations Analyzed)

Ahrefs · Ryan Law, Xibeijia Guan · 2025

Key finding

Analyzing 16.975 million cited URLs across ChatGPT, Perplexity, Gemini, Copilot and AI Overviews, Ahrefs found AI-cited content averages 1,064 days old versus 1,432 for organic Google results - 25.7% fresher. ChatGPT shows the strongest recency bias, citing pages 393-458 days newer than organic. Google's AI Overviews are the exception, citing slightly older content.

Methodology note · Single-vendor data study by Ahrefs (Ryan Law, Xibeijia Guan), published July 2025, using Ahrefs Brand Radar and Content Explorer to measure days-since-publication and days-since-update for cited URLs across six surfaces. Correlational, not causal; the author cautions freshness is one factor among many and average cited age is still about 2.9 years.

Ahrefs Blog·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

AI Search Volatility: Citation Drift Across ChatGPT, Google AI Overviews, Microsoft Copilot, and Perplexity

Profound · Josh Blyskal, Sartaj Rajpal · 2025

Key finding

Across roughly 80,000 prompts per platform tested in June 2025 and again in July 2025, Profound measured citation drift — the share of domains appearing in the later window but not the earlier one. Google AI Overviews drifted 59.3%, ChatGPT 54.1%, Microsoft Copilot 53.4%, Perplexity 40.5%. Over a January-to-July comparison, drift rises to 70–90%, making single-snapshot AI visibility measurements unreliable.

Methodology note · Vendor research study by Profound, published 17 July 2025. Compared domain-level citations on identical open-ended prompts across two three-day windows: 11–13 June and 11–13 July 2025. Sample roughly 80,000 prompts per platform. Drift defined as the percentage of domains cited in the later window but absent in the earlier window. Source verified by direct fetch.

Profound Blog (Research)·Accessed 01.06.2026

Tier B — Citable with caveatsRead source

Cloudflare Will Now Block AI Bots by Default

MIT Technology Review · 2025

Key finding

Cloudflare made blocking AI bots the default for websites it hosts as of July 1, 2025. Customers can override per bot, allow verified crawlers, or charge for access via Pay Per Crawl. Media outlets including the Associated Press and Time, plus platforms like Quora and Stack Overflow, endorsed the move. CEO Matthew Prince argues current AI use of the web is breaking the publisher business model.

Methodology note · Reporting in MIT Technology Review by Peter Hall, published July 1, 2025, covering Cloudflare's announcement and including direct comment from Will Allen, Cloudflare's head of AI privacy, control, and media products. The piece also includes a contrasting view from MIT Media Lab PhD candidate Shayne Longpre on impacts to research and non-commercial use.

MIT Technology Review·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI Traffic Has Increased 9.7× in the Past Year (81,947 Websites Study)

Ahrefs · 2025

Key finding

Across 81,947 websites, average AI traffic grew about 9.7 times in a year. The average site's search traffic dropped about 21% over the same period. AI traffic now represents 0.25% of a site's total traffic on average. ChatGPT grew 85% since January 2025 and now sends more traffic than Reddit or LinkedIn. Google still sends about 210 times more traffic than the big three AI platforms combined.

Methodology note · Ahrefs analysed referral traffic patterns across 81,947 websites between mid-2024 and mid-2025, comparing AI referrals (ChatGPT, Perplexity, Gemini, Copilot) against traditional search, social platforms, and direct traffic. The dataset more than doubled the size of the earlier March 2025 study.

Ahrefs Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Study: AI Brand Visibility and Content Recency

Seer Interactive · Sonny Vasquez · 2025

Key finding

Analyzing 5,000+ cited URLs plus AI-bot server logs, Seer found strong recency bias: about 65% of AI bot hits targeted content from the past year and 79% from 2024-2025. Perplexity drew about 50% of citations from 2025 content. But the effect is category-dependent - energy and instructional 'decking' content was still hit as far back as 2004.

Methodology note · Single-agency data study by Seer Interactive (Sonny Vasquez), published June 2025, combining publish and update dates of 5,000+ cited URLs with ChatGPT-bot server log hits and Peec.ai citation data across ChatGPT, Perplexity and AI Overviews. Correlational; industry breakdowns are illustrative rather than statistically controlled.

Seer Interactive·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

AI Visitors Visit Fewer Pages and Bounce More Often Than Search Visitors (Quality Study)

Ahrefs · 2025

Key finding

Visitors arriving from AI platforms (ChatGPT, Perplexity, Copilot, Gemini) view 4 pages on average, 1.2 fewer than search visitors and 1.5 fewer than the typical visitor. They spend about 8 seconds longer on site (86 versus 78 seconds) but bounce 4.1% more often than search visitors and 5.4% more than the average visitor. Sessions are longer in time but shallower in depth.

Methodology note · Ahrefs analysed user behaviour across roughly 82,000 websites between May and June 2025, comparing visitors arriving from AI platforms against those arriving from search engines and against the overall visitor average. Metrics included pages per visit, pages per session duration, time on site, and bounce rate.

Ahrefs Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

AI Makes Up 0.1% of Traffic, but Clicks Aren't Everything (~35K Websites Study)

Ahrefs · 2025

Key finding

Across roughly 35,000 websites, AI tools sent 0.1% of total referral traffic, just below email at 0.2%. Google sent 345 times more traffic than the three main AI platforms (ChatGPT, Perplexity, Gemini) combined. The three AI platforms together referred about as much traffic as Reddit. AI traffic was highest in the US (7.71% of AI referrals) and in business and industrial sectors (21%).

Methodology note · Ahrefs analysed referral traffic across approximately 35,000 websites in early 2025, breaking down sources by channel (search, direct, social, paid, email, AI) and by AI platform. The study also examined AI traffic distribution by country, industry, site size, and page type (using URL keyword frequency analysis).

Ahrefs Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Microsoft Bing/Copilot use schema for its LLMs

Search Engine Land (reporting Microsoft/Fabrice Canel) · Barry Schwartz · 2025

Key finding

Search Engine Land reports that Microsoft's Fabrice Canel, speaking at SMX Munich, confirmed Bing and Copilot use schema markup to help their LLMs understand content - the first clear vendor statement that structured data is consumed on the LLM side. Google, OpenAI and Perplexity had made no equivalent confirmation at the time.

Methodology note · Trade-press reporting by Barry Schwartz for Search Engine Land (March 2025), relaying a conference statement by Microsoft's Fabrice Canel at SMX Munich, corroborated via David Mihm's LinkedIn post and Canel's comments. A reported statement, not a study; a primary Microsoft transcript would be stronger. Verified by direct fetch.

Search Engine Land·Accessed 10.07.2026

Tier B — Citable with caveatsRead source

Introducing Citations on the Anthropic API

Anthropic · 2025

Key finding

Anthropic's Citations feature lets Claude ground answers in source documents the developer provides, returning the specific sentences and passages each claim is drawn from. Anthropic reports that this built-in citation approach improved recall accuracy by up to 15% compared with custom prompt-based citation implementations. Thomson Reuters and Endex report reductions in hallucinated or misformatted source references.

Methodology note · Product announcement and developer documentation for the Citations API, generally available on the Anthropic API and Google Cloud Vertex AI. The feature processes user-provided source documents by chunking them into sentences, then passes them with the user query so the model can cite specific passages. Published January 2025; expanded to Amazon Bedrock June 2025.

Anthropic Blog·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

How to view fanout queries generated by AI (Ahrefs Help)

Ahrefs · 2025

Key finding

Ahrefs provides a way for users to view fan-out queries that AI assistants generate from a seed prompt the user has chosen to track. This is evidence that third-party tools can observe AI-generated query rewrites for prompts under the user's control, but it does not prove access to all end-user prompts in the wild.

Methodology note · Ahrefs Help Center article by Constance Tan (updated weekly) describing the Brand Radar fan-out-queries feature for ChatGPT and Perplexity. Explains that Ahrefs typically returns two fan-out queries per tracked prompt (sometimes one, sometimes none) and compares fan-out to People Also Ask. Vendor-reported product behaviour; the fan-out queries observed are derived from user-defined seed prompts. Content verified by direct fetch.

Ahrefs Help Center·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

What is RAG (Retrieval-Augmented Generation)? (IBM Think)

IBM · 2024

Key finding

IBM's explainer defines retrieval-augmented generation (RAG) as a process where an LLM first retrieves relevant external documents, then generates an answer grounded in those documents rather than only its parametric memory. The page describes RAG architecture, common use cases (enterprise search, customer support), and trade-offs compared with pure LLM inference or fine-tuning approaches.

Methodology note · Vendor explainer hosted on IBM's 'Think' marketing site. Direct fetch returned the HTML article. Tier B because IBM is an authoritative vendor in the AI/enterprise space but this is a marketing explainer rather than original research. Suitable as a definitional reference for RAG; not for citation-rate or methodology claims.

IBM Think·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

How can I add prompts? (Otterly Help)

Otterly AI · 2024

Key finding

Otterly supports prompt construction from external proxies including SEO keywords, brand names, industry terms, and URLs. The page reinforces that prompts are customer-defined and proxy-derived, not drawn from a privileged platform-wide feed of real chatbot user prompts.

Methodology note · Otterly AI Help Center article (December 2025) describing the three ways customers can add prompts inside the Otterly platform: individual entry, CSV import, or the AI Prompt Research tool. Self-reported vendor documentation. Useful as evidence of the kinds of inputs Otterly accepts; not a controlled study or independent benchmark. Content verified by direct fetch.

Otterly AI Help Center·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

How to find relevant prompts for your brand? (Otterly Help)

Otterly AI · 2024

Key finding

Otterly's own help documentation explicitly states there is 'no way to learn which prompts are most asked at ChatGPT or Perplexity' and 'no way to know what exactly people are searching for in the AI engines.' Otterly recommends constructing prompts from available external inputs such as brand terms, domains, industries, URLs, and SEO keywords. This is a vendor admission that aligns with the public-proxy thesis.

Methodology note · Otterly AI Help Center article (last updated April 2026) describing the vendor's own recommended methodology for building a brand's prompt list. Self-reported vendor documentation; the page explicitly states that AI search engines do not publish query data and lists three substitute methods Otterly supports (Prompt Research tool, Google Search Console import, AI-assisted brainstorming). Content verified by direct fetch.

Otterly AI Help Center·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Welcome to Peec AI (Peec AI Docs)

Peec AI · 2024

Key finding

Peec frames itself as a platform that runs customer-defined prompts across major AI assistants and tracks visibility, citation, and answer-inclusion outcomes. Useful as additional product-context evidence that the platform observes outputs from its own controlled runs.

Methodology note · Peec AI's product-introduction documentation page. Describes the platform's three core metrics (Visibility, Position, Sentiment), the prompt-running cadence, and Peec's UI-scraping data-collection approach. Confirms that data comes from prompts the customer defines, not from a privileged platform-wide feed. Content verified by direct fetch on 2026-05-27.

Peec AI Documentation·Accessed 27.05.2026

Tier B — Citable with caveatsRead source

Quickstart Guide (Peec AI Docs)

Peec AI · 2024

Key finding

Peec's documentation says the platform runs customer prompts daily across AI platforms. This supports the interpretation that vendors like Peec observe outcomes from prompts they execute rather than drawing from a secret platform-wide prompt firehose.

Methodology note · Peec AI's official Quickstart Guide, published on its Mintlify-hosted documentation site. Describes the four-step onboarding workflow (set up prompts, identify competitors, read the dashboard, analyse sources) and confirms that Peec runs customer-defined prompts daily across ChatGPT, Perplexity, Gemini and Copilot. Content verified by direct fetch on 2026-05-27.

Peec AI Documentation·Accessed 27.05.2026

Tier C — Tactical signals only

Tier C is vendor blogs, individual LinkedIn or Substack posts, and case studies with a single data point. They're useful sometimes, especially when a category is moving fast and Tier A or B research hasn't caught up.

When a Tier C source surfaces a finding that's genuinely novel, we cite it openly with the caveat that the evidence is provisional, and we treat it as a hypothesis worth testing rather than a fact to repeat.

Tier C — Tactical signals onlyRead source

Who Blocks OpenAI, Google AI and Common Crawl? (News Homepages tracker)

Palewire · Ben Welsh · 2026

Key finding

Palewire's continually-updated news-homepages tracker shows that 633 of 1,156 news publishers surveyed (54.8%) have instructed OpenAI, Google AI, or Common Crawl to stop crawling their sites via robots.txt. Per-crawler block rates: OpenAI 49.9%, Google AI 45.5%, Common Crawl 50.0%. The tracker collects each site's robots.txt file twice per day and reports the latest results.

Methodology note · Palewire (Ben Welsh) News Homepages project documentation, ongoing. Direct fetch returned the project page with live block-rate counts and a per-site breakdown. Tier C: a personal/research project with transparent methodology and live data, but single-author maintenance and no formal peer review.

palewi.re·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

AI Citation Patterns by Platform, Industry, and Intent

ALM Corp · 2026

Key finding

ALM Corp synthesises AI citation patterns from multiple 2026 datasets, including Yext's 6.8M-source analysis showing 86% of citations come from sources brands can directly influence, and a landmark study finding 44.2% of citations come from the first 30% of content. Concludes there is no universal top source; patterns are shaped by intent, platform, and category. Treat as provisional.

Methodology note · ALM Corp blog post by digital strategy team, 2026. Direct fetch returned the HTML article. Tier C: a marketing-agency blog synthesising third-party datasets rather than running original primary research. Suitable as a tactical reference; not for citation as a primary source. Cross-verifiable against Yext and Tinuiti underlying reports.

almcorp.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Cloudflare and ETH Zurich Say AI Bots Are Breaking the Web's Cache Layer

PPC Land · 2026

Key finding

Trade press coverage of the joint Cloudflare and ETH Zurich research published April 2026: automated traffic now accounts for 32% of Cloudflare's network. AI crawl purposes break down as 45% training, 45% mixed-purpose, and 7.5% search. The research argues that standard CDN caching strategies are failing under AI crawler load. Companion to the peer-reviewed SOCC 2025 paper at R181.

Methodology note · PPC Land article by Luis Rijo, 6 April 2026. Direct fetch returned the full article. Tier C trade-press summary of primary research from Cloudflare and ETH Zurich. The underlying primary sources are the Cloudflare blog post 'Why we're rethinking cache for the AI era' and the peer-reviewed SOCC 2025 paper (R181).

ppc.land·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

The 2026 State of AI Search Report

AirOps · 2026

Key finding

Pages not updated for a quarter are over three times more likely to lose AI citations. About 70% of cited pages were updated in the last 12 months, and 83% of commercial citations come from pages refreshed within a year. Sequential heading hierarchies correlate with 2.8 times higher citation likelihood; 87% of cited pages use a single H1, and 48% of citations come from user-generated platforms.

Methodology note · Industry report from AirOps with Kevin Indig, drawing on millions of citation datapoints across ChatGPT, Google AI Overviews, AI Mode, Gemini, and Perplexity. Findings are organised around freshness, on-page structure, schema use, user-generated content, off-site mentions, and visibility stability, with specific percentage gaps tied to each signal.

airops.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

What 2025 Revealed About AI Search and the Future of Schema Markup

Schema App · Martha van Berkel · 2025

Key finding

In 2025, Google and Microsoft publicly confirmed they use Schema markup for generative AI features, and ChatGPT confirmed it uses structured data to decide which products appear in results. Schema App reported a 19.72% rise in AI Overview visibility on its own site after deploying Entity Linking, and customer InSinkErator a 69% rise in clicks on non-branded queries.

Methodology note · First-party essay by Schema App's CEO. The piece argues structured data should be treated as a knowledge graph rather than a rich-result trick, and uses examples from Schema App's own site and named customers (InSinkErator, Wells Fargo) plus public statements from Google, Microsoft, and ChatGPT to support the case.

schemaapp.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

AI Search Cites Press Releases Just 0.04% of the Time

ALM Corp · 2025

Key finding

Press releases syndicated through Yahoo Finance, MSN, and similar networks account for 0.04% of all AI citations, and newswire pages such as PRNewswire add 0.21%. Original editorial content carries 81% of news citations. ChatGPT is a partial exception: press releases hosted on a brand's own newsroom domain drive 18.15% of its citations, against around 3% for Google's AI platforms.

Methodology note · Industry commentary on a BuzzStream study run with the XOFU citation-monitoring tool, covering more than four million AI citations across ChatGPT, Google AI Overviews, Google AI Mode, and Gemini. Researchers ran 3,600 prompts across 10 industries over one week and split prompts into evaluative, informational, and brand-awareness categories.

almcorp.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Generative AI Chapter, Web Almanac 2025

HTTP Archive · 2025

Key finding

The Web Almanac 2025 Generative AI chapter reports that only 0.015% of sites in the Majestic Million had an llms.txt file in early 2025 (just 15 sites total). The chapter also documents a 6,697% increase in research-paper usage of the word 'delves' as an AI fingerprint, and analyses adoption of built-in browser AI APIs. ChatGPT reached 700M weekly active users by the July 2025 crawl date.

Methodology note · HTTP Archive Web Almanac 2025, Generative AI chapter by Christian Liebel, Yash Vekaria, Jonathan Pagel and others. Direct fetch on almanac.httparchive.org returned the chapter content. Tier C because the Web Almanac is a community-volunteer publication rather than peer-reviewed research, but methodology and data sources are disclosed.

Web Almanac·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Google's AI Mode Cites Google in 17% of Answers (SE Ranking, 1.3M citations)

SE Ranking via Search Engine Land · 2025

Key finding

SE Ranking analysed 68,313 keywords and over 1.3 million citations and found that Google.com accounts for 17% of all Google AI Mode citations, more than YouTube, Facebook, Reddit, Amazon, Indeed, and Zillow combined. Google was the top-cited domain in 19 of 20 industry niches studied. 59% of these Google citations point to organic search results, 36% to Google Business Profiles.

Methodology note · Search Engine Land article reporting on SE Ranking research, 2026. Direct fetch returned the HTML article. Tier C because the underlying analysis is a single-vendor study (SE Ranking) covered by trade press; methodology disclosed but vendor incentive to position its tool. Cross-verifiable against the SE Ranking blog post directly.

Search Engine Land·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Wikipedia Analysis (LLM Optimizer)

Adobe LLM Optimizer · 2025

Key finding

Adobe's LLM Optimizer treats a company's Wikipedia page as a primary lever for being cited correctly by ChatGPT, Google AI Mode, Gemini, Perplexity, and Copilot. It scores articles on five dimensions: references, sections, content length, images, and infobox completeness. It then benchmarks each against industry competitors and surfaces prioritised fixes, including critical flags for press-release tone and reference gaps.

Methodology note · Product documentation for the Wikipedia Analysis opportunity inside Adobe LLM Optimizer. The system scrapes a brand's Wikipedia page, auto-selects up to six industry competitors based on the company's category, calculates gaps on the five dimensions, and ranks recommendations from Informational to Critical. Edits are made on Wikipedia; the tool does not push changes itself.

Adobe Experience League·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Reddit's Rise in AI Citations

CMSWire · 2025

Key finding

Social media climbed to over 9% of AI citations between October 2025 and January 2026, with Reddit driving the dominant share of growth across nine tracked product categories. Reddit's karma-weighted upvote system functions as distributed editorial curation, which retrieval systems treat as a credibility signal. Answer Engine Optimisation now needs a community-content strategy, not only owned-domain SEO.

Methodology note · Trade publication article (CMSWire) citing Tinuiti's AI Citations Trends Report Q1 2026. The piece compares citation behaviour across ChatGPT, Perplexity, and Google's AI surfaces, and translates the data into recommendations for tracking and participating in community conversations as part of an Answer Engine Optimisation programme.

cmswire.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Common Crawl: Setting the Record Straight (Transparency Response)

Common Crawl · 2025

Key finding

Common Crawl's transparency response (November 2025) addresses criticism around its commitment to fair use and public-good principles. The post documents Common Crawl's robots.txt and opt-out compliance, its crawl truncation thresholds (raised from 1 MiB to 5 MiB per page as of the March 2025 crawl), and clarifies that Common Crawl is a non-profit research dataset rather than an AI training entity itself.

Methodology note · Common Crawl blog post, November 4 2025. Direct fetch on commoncrawl.org returned the full article. Tier C: first-party communication from a research organisation responding to public criticism; useful as context on Common Crawl's stated policies but not as independent evidence of compliance.

commoncrawl.org·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Top Cited Domains in AI: What 10M+ Citations Reveal About Visibility

Decoding · 2025

Key finding

AI citations concentrate in a small set of domains: the top 5 hold 38% of citations, the top 20 hold 66%. Wikipedia leads at 11.22% of Google AI Mode citations and 47.9% of ChatGPT's top-10 share. YouTube grew 34% in six months. Reddit citations surged 450% between March and June 2025, then collapsed in ChatGPT around September 2025 from roughly 60% to about 10% of responses.

Methodology note · Vendor blog (Decoding) consolidating citation data from third-party studies, including a Profound analysis of 680 million citations across ChatGPT, Google AI Overviews, and Perplexity from August 2024 to June 2025, plus citation-share counts from Ahrefs and a three-month Semrush time series capturing the September 2025 shift in ChatGPT source mix.

trydecoding.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Perplexity vs ChatGPT: AI Citation Study Q3 2025

Qwairy · 2025

Key finding

Qwairy's analysis of 118,000+ AI-generated answers across Q3 2025 found that Perplexity averages 21.87 citations per question while ChatGPT averages 7.92, that OpenAI is the only major model citing Wikipedia significantly (4.8% of citations), and that only 11% of cited domains appear across multiple platforms. Each AI provider has distinct source preferences requiring platform-specific optimisation.

Methodology note · Qwairy blog post, Q3 2025. Direct fetch returned the HTML article. Tier C: single-vendor study with disclosed sample size (118K+ answers) but limited methodology disclosure on how the answer set was sampled. Vendor incentive to position its GEO platform; treat per-vendor citation counts as directional rather than definitive.

qwairy.co·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

First-Ever SEO Study on ChatGPT Search Queries (Query Length, Fan-Outs, N-Grams) — Tactical Signal

Marketing Power Ups / LinkedIn · Chris Long · 2025

Key finding

Chris Long published the first-ever SEO study on ChatGPT Search behaviour, analysing query length, fan-out patterns, and n-gram distributions in ChatGPT-cited content. A notable finding: roughly 28% of pages cited by ChatGPT had zero organic Google visibility, indicating that ChatGPT's source-selection criteria diverge meaningfully from Google ranking. Treat as a provisional Tier C tactical signal.

Methodology note · LinkedIn post by Chris Long (Nectiv / Go Fish Digital), October 2025. The original LinkedIn URL returns HTTP 404; the finding is cross-verified against Chris Long's X post (1985689925602460120) and the AirOps webinar 'Query Fan-Out: What 60,000+ Searches from ChatGPT & Google Show with Chris Long' which references the same analysis.

LinkedIn (Chris Long)·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

OpenAI Search Crawler Reaches 55% Web Coverage: Analysis of 66 Billion Bot Requests

ALM Corp · 2025

Key finding

ALM Corp summary of a Hostinger study (January 2026) analysing 66.7 billion bot requests across more than 5 million websites. Found OAI-SearchBot reached 55.67% average coverage of monitored websites between June and November 2025. TikTok's bot reached 25.67%, Applebot 24.33%, and Huawei's PetalSearch 18.33%. Demonstrates rapid expansion of assistant-facing crawlers as training-bot blocking grows.

Methodology note · ALM Corp blog post summarising Hostinger's 2026 AI crawler coverage study, January 2026. Direct fetch returned the HTML article. Tier C because it is an agency blog summarising third-party vendor research (Hostinger). Cross-verifiable against the original Hostinger blog post 'AI bot analysis' and Search Engine Journal coverage.

almcorp.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

AI Mode and AI Overviews Share Only 13.7% of Citations (Ahrefs 730K response pairs)

Ahrefs · 2025

Key finding

Across 730,000 query pairs analysed in September 2025, Ahrefs found that Google AI Mode and AI Overviews reach 86% semantic similarity in their answers but cite only 13.7% of the same URLs. The two surfaces converge on conclusions while diverging on sources, suggesting brands need to optimise for each surface separately rather than treating them as a single Google AI endpoint.

Methodology note · Ahrefs blog post by Brand Radar team, September 2025. Direct fetch returned the article HTML. Tier C: single-vendor study with disclosed methodology and clear sample size, but vendor incentive to position its tool. Sample is US-only and query-set composition is not externally validated.

Ahrefs Blog·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Athena State of AI Search Report 2025

AthenaHQ · 2025

Key finding

Athena's State of AI Search Report 2025 reports that zero-click search on Google rose from 56% in 2024 to 69% in 2025, that the average brand appears in just 17.24% of relevant prompts while top players reach 56.71%, and that informational queries dominate AI search at 34.28% of prompts. Treat as provisional Tier C until original PDF is re-verifiable.

Methodology note · Athena State of AI Search Report 2025. The PDF URL returns HTTP 404; findings cross-verified against the live Athena State of AI Search 2026 report at athenahq.ai/athena-state-of-ai-full-report and against summaries on Bluehost.com. Single-vendor research with limited methodology disclosure.

athenahq.ai·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

Which News Sites Block AI Crawlers in 2025?

BuzzStream · 2025

Key finding

BuzzStream analysed robots.txt directives on the top 50 news sites in the UK and the top 50 in the US (combined 100 sites) for 11 AI-related crawlers. PerplexityBot (the indexing variant) is blocked by 67% of these sites; only 14% of publishers block all AI bots while 18% block none. US publishers are more restrictive against Google's AI bots than UK publishers.

Methodology note · BuzzStream blog post, 2025. Direct fetch returned the HTML article with methodology (top 50 UK + top 50 US news sites by Similarweb), the 11 AI crawlers examined, and per-bot block rates. Tier C: marketing-tool vendor blog with disclosed methodology and a clear small sample (n=100) rather than independent peer-reviewed research.

BuzzStream Blog·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

AI Search Engines Cite Reddit, YouTube, LinkedIn Most (150K citations)

Cybernews via Search Engine Land · 2025

Key finding

Reddit ranks as the most-cited domain across ChatGPT, Google AI Mode, Gemini, Perplexity, and AI Overviews combined, with YouTube, LinkedIn, Wikipedia, and Forbes filling out the top five. Yelp and G2 surface often on recommendation queries. ChatGPT leans on Wikipedia, Reddit, and editorial sites; Google leans on Facebook and Yelp; Perplexity emphasises Reddit, LinkedIn, and G2, especially for business-to-business questions.

Methodology note · Search Engine Land summary of an analysis by Peec AI, an AI search analytics tool, covering 30 million sources cited directly inside answers from ChatGPT, Google AI Mode, Gemini, Perplexity, and AI Overviews. Coverage focuses on per-domain citation share by platform and by query type, including recommendation queries.

Search Engine Land·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

AI Bots and Robots.txt (longitudinal analysis)

HTTP Archive · Paul Calvano · 2025

Key finding

Longitudinal analysis of robots.txt files across popular websites finds that as of July 2025, AI bot user-agents top the list of most-referenced agents. Nearly 21% of the top 1,000 websites have rules targeting GPTBot. The wildcard '*' appears in 97.4% of robots.txt files. AI bot blocking has grown rapidly and is more common on higher-traffic sites. Treat as a provisional individual analysis.

Methodology note · Personal blog post by Paul Calvano (web performance engineer and Web Almanac contributor), 21 August 2025. Direct fetch returned the HTML article with the methodology, data source (AI Robots.txt GitHub repository), and findings. Tier C: single-author analysis on a personal blog. Methodology is disclosed but not externally peer-reviewed.

paulcalvano.com·Accessed 27.05.2026

Tier C — Tactical signals onlyRead source

How Perplexity Ranks Content: Research Uncovers Core Ranking Factors and Systems (Search Engine Land)

Search Engine Land (reporting on independent research by Metehan Yesilyurt) · Danny Goodwin · 2025

Key finding

Independent researcher Metehan Yesilyurt analysed browser-level interactions with Perplexity's infrastructure and reported a three-layer (L3) machine-learning reranker for entity searches that discards the full result set when too few results clear its threshold, plus manually curated authoritative-domain lists (Amazon, GitHub, LinkedIn, Coursera) granting authority boosts. Search Engine Land flags the research as unverified, so treat all mechanisms as provisional.

Methodology note · Trade-press article by Danny Goodwin, Editorial Director at Search Engine Land, published 5 August 2025, reporting on independent solo research by Metehan Yesilyurt derived from browser-level traffic analysis of Perplexity, not a disclosed dataset. Page fetched and confirmed directly; Search Engine Land explicitly labels the underlying findings unverified.

Search Engine Land·Accessed 23.06.2026

Tier C — Tactical signals onlyRead source

I Audited 30 llms.txt Files in the Wild — 5 Anti-Patterns Already Forming

DEV Community · Kenimo · 2025

Key finding

An audit of 30 live llms.txt files found five recurring failures: overlong files with too many links; URLs contradicting robots.txt for the very AI crawlers expected to read them (about a third of files); no Markdown twin of pages (24 of 30); marketing prose instead of pointers; and files frozen since 2024 with dead links and renamed slugs.

Methodology note · Practitioner blog post on dev.to. The author manually audited 30 llms.txt files in the wild against the original Jeremy Howard proposal and against guidance from Mintlify and the llmoframework, then documented five anti-patterns with examples. Three of the audited files were the author's own, used as a control on bias.

dev.to·Accessed 27.05.2026