Methodology & sources for AI search visibility
Brands keep asking the same questions about AI search: what works, what doesn't, and what the data actually says. We answer them here, with original research grounded in peer-reviewed studies, enterprise benchmarks, and primary data from our own pipeline. Every claim links back to a source in our methodology.
199 sources · 139 Tier A · 39 Tier B · 21 Tier C · Last reviewed · 199 added in the last 90 days
Our aim
AI search is changing how customers discover brands, products, and services. Companies are looking for a playbook on how to be visible in these new engines and secure a place in AI-generated answers.
Most of what circulates online is vendor research citing other vendor research, statistics with sample sizes of one, and forecasts presented as facts. Brand teams making real decisions deserve better.
This library exists to be the source we wish we'd had when we started. Every article cites primary research where possible: peer-reviewed work, enterprise analytics with disclosed methodology, and major consultancies such as Bain, BCG, Deloitte, Gartner, and McKinsey. We tier each source so you can judge it yourself. We say plainly where the evidence is thin.
It's for anyone making decisions about AI search visibility: brand teams, marketers, agency partners, and the journalists trying to make sense of the category. It also keeps us honest about our own work. The research that informs info.link/answers is the same research we point our clients to.
How we research
Three principles guide how we research:
Primary sources first. We start with peer-reviewed papers, enterprise analytics with disclosed methodology, and major consultancies. When a statistic gets passed around the AI-visibility community, we trace it back to its origin. If we can't find one, we don't repeat it.
Every source is tiered. Each entry in our source library carries a tier badge (A, B, or C) reflecting how strongly we trust it. Tier A claims need no qualifier. Tier B claims need attribution and context. Tier C entries are vendor blogs, hot takes, and case studies with a single data point. We mark them clearly and only cite them when they surface a genuinely novel signal we can't find elsewhere.
We update when the evidence updates. AI search is moving fast. We re-read our own articles when new research lands. The "last reviewed" date on every page shows when we last checked. If something is out of date, that's our problem to fix. Tell us, and we will.
Tier A — Strongest evidence
These are sources with the strongest methodology, large samples, and a public record you can verify. They include peer-reviewed academic papers and arXiv preprints with open methodology, enterprise analytics providers who publish their data and methods (Microsoft Clarity, Adobe Analytics, Cloudflare, Similarweb), large first-party studies with open methodology (Cloudflare's network telemetry, the Pew Research click-through study), major consultancies (McKinsey, Bain, BCG, Deloitte, Gartner), and government or regulatory sources (FTC, US Copyright Office, the European Commission's AI Office, Ofcom).
A Tier A source still has limits: every dataset has assumptions, and we name them when they matter. Even so, the claim itself can stand on the citation alone.
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
arXiv · 2026
Evaluates source attribution in LLM deep research agents (such as ChatGPT, Perplexity, and Gemini deep research modes). Finds that cited URLs are frequently invalid, broken, or unrelated to the claim being attributed. The paper introduces a parser and evaluation framework to measure attribution validity at scale, exposing systematic citation-quality gaps across vendors.
Methodology note · arXiv preprint 2605.06635 (May 2026). Direct fetch returned an empty PDF body; abstract and methodology cross-verified via the arXiv API listing. The paper benchmarks attribution validity across multiple commercial deep research agents.
Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact
arXiv · 2026
Google AI Overviews appeared on 13.7% of queries overall and 64.7% of question-form queries. Politically sensitive topics saw lower rates. AI Overviews cite domains more credible on average than co-displayed first-page results, but nearly 30% of cited domains do not appear in those results at all. 11% of 98,020 atomic claims were unsupported by the cited pages, with omission the dominant failure mode. Half of cited pages carry display advertising.
Methodology note · Researchers from Washington University in St. Louis issued 55,393 trending queries across 19 topical categories over a 40-day window (March 13 to April 21, 2026), measuring AI Overview activation rates, domain credibility, claim fidelity (decomposing responses into 98,020 atomic claims), and advertising on cited pages.
AI Chatbot Market Share Worldwide (live tracker)
Statcounter · 2026
As of April 2026, ChatGPT holds 76.85% of worldwide AI chatbot market share, followed by Google Gemini at 9%, Perplexity at 7.73%, Microsoft Copilot at 3.76%, Claude at 2.66%, and DeepSeek at 0.01%.
Methodology note · Statcounter tracks AI chatbot market share by analysing more than 3 billion monthly page views from its global network of tracking-code-enabled websites, attributing visits to specific AI chatbot referrers. The tracker updates monthly and supports breakdowns by platform, region, and country.
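To make the referrer-based approach concrete, here is a minimal sketch of how page views could be attributed to chatbots and turned into percentage shares. The hostname mapping and the sample referrers are our own assumptions for illustration; Statcounter's actual attribution rules are not described in this entry.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed hostname-to-chatbot mapping, for illustration only.
# Statcounter's real attribution rules are not documented in this entry.
CHATBOT_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "gemini.google.com": "Google Gemini",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "copilot.microsoft.com": "Microsoft Copilot",
    "claude.ai": "Claude",
}


def chatbot_share(referrers):
    """Return each chatbot's percentage share of chatbot-referred page views."""
    counts = Counter()
    for ref in referrers:
        host = urlparse(ref).hostname
        bot = CHATBOT_HOSTS.get(host)
        if bot:  # referrers from anything else are ignored
            counts[bot] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bot: round(100 * n / total, 2) for bot, n in counts.items()}


# Hypothetical referrer log: three ChatGPT views, one Perplexity, one web search.
sample = [
    "https://chatgpt.com/",
    "https://chatgpt.com/c/abc",
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=x",
    "https://www.google.com/search?q=x",
]
shares = chatbot_share(sample)
print(shares)
```

A share computed this way describes referred traffic to tracked sites, not usage inside the chatbots themselves, which is worth keeping in mind when reading the headline percentages.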
From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms
arXiv (cs.IR) · Kai Zhang et al. · 2026
ChatGPT cites around 7 sources per answer; Perplexity and Google AI Overviews cite more. But pages cited by ChatGPT have a much higher average influence on the answer's wording and evidence. Influence rises with page length, structure, and the density of definitions, statistics, comparisons, and step-by-step procedures.
Methodology note · 602 controlled prompts run through ChatGPT, Google AI Overview / Gemini, and Perplexity. The researchers analysed 21,143 citations and 18,151 fetched pages, extracting 72 features per citation. They measured citation breadth (how many sources are cited) and citation depth (how much each cited source actually shapes the final answer). The dataset is public.
Shopping in the Age of AI: Redefining Stores for a New Era
ICSC & McKinsey · 2026
McKinsey estimates up to 1 trillion dollars in US B2C retail revenue from agentic commerce by 2030. 37% of consumers cite in-stock reliability, speed, and intuitive navigation as a top driver. More than 40% of Gen Z and millennials say experiential retail makes them more likely to shop a retailer. The top decile of retailers is expected to capture more than 85% of sector economic profit.
Methodology note · Joint ICSC and McKinsey report based on interviews with retail and real estate leaders and a consumer survey of 3,004 US consumers. The analysis identifies three forces reshaping physical retail (AI in the shopping journey, transparency and convenience expectations, shifting spending power) and quantifies impacts on store formats and economics.
The 2026 AI Index Report
Stanford Institute for Human-Centered AI (HAI) · 2026
Organisational AI adoption reached 88% and four in five university students now use generative AI. Generative AI reached 53% population adoption within three years, faster than the PC or the internet. The estimated value of generative AI tools to US consumers reached 172 billion dollars annually by early 2026. Documented AI incidents rose to 362 in 2025, up from 233 in 2024.
Methodology note · Annual Stanford HAI report drawing on dozens of sources: AI model benchmark results (SWE-bench, IMO, OSWorld), private investment trackers, patent and publication databases, government policy data, and global public opinion surveys. Nine chapters cover R&D, performance, responsible AI, economy, science, medicine, education, policy, and public opinion.
Don't Measure Once: Measuring Visibility in AI Search (GEO)
University of St. Gallen · Schulte et al. · 2026
Argues that single-snapshot AI visibility measurement understates true brand presence in generative search. Proposes a longitudinal measurement framework that captures variation across runs, prompts, and platforms, demonstrating that any one-time snapshot of citation rate or mention rate can swing materially across repeated queries. Stochasticity itself is a measurement parameter, not noise to discard.
Methodology note · arXiv preprint 2604.07585 (April 2026). Position paper proposing a multi-run, multi-prompt evaluation protocol for GEO. Direct fetch on arxiv.org returned the canonical abstract page; PDF body was inaccessible but methodology summary was confirmed through the abstract and the linked DOI.
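The multi-run idea can be sketched in a few lines. This is not the paper's protocol; it is a simplified illustration, with hypothetical brand names and answers, of why one snapshot can mislead: the same prompt set is run several times, and the mention rate is reported with its spread instead of as a single number.

```python
import statistics


def mention_rate(answers, brand):
    """Share of answers in one run that mention the brand (case-insensitive)."""
    hits = sum(brand.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


def summarize_runs(runs, brand):
    """Summarise the mention rate across repeated runs of the same prompt set."""
    rates = [mention_rate(answers, brand) for answers in runs]
    return {
        "rates": rates,
        "mean": statistics.mean(rates),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(rates) if len(rates) > 1 else 0.0,
        "min": min(rates),
        "max": max(rates),
    }


# Hypothetical data: three runs of the same two prompts.
runs = [
    ["Acme and Globex are popular.", "Try Globex."],
    ["Acme is a solid choice.", "Acme or Initech."],
    ["Consider Initech.", "Globex leads here."],
]
summary = summarize_runs(runs, "Acme")
print(summary)
```

In this toy data a single snapshot would report 50%, 100%, or 0% depending on which run was captured, while the summary reports a mean of 50% with a wide spread.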
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
arXiv · 2026
Empirical study of reference hallucinations in commercial LLMs and deep research agents finds that fabricated citations and incorrect attributions occur frequently across vendors. Proposes detection and correction methods that can be applied at inference time to reduce hallucinated references without retraining the underlying model. Open-source benchmark released for reproducibility.
Methodology note · arXiv preprint 2604.03173 (April 2026). Direct fetch returned the abstract page. The paper benchmarks reference hallucinations across multiple commercial LLM and deep research agent systems and proposes a generalised detection method tested on the released benchmark.
Ofcom — Adults' Media Use and Attitudes Report 2026
UK Ofcom · 2026
Ofcom's strategic approach sets out how the UK communications regulator will assess AI risks across online safety, broadcasting, telecoms, and post in 2025 to 2026. AI is already shaping how UK adults find information online, with generative AI tools used by a significant minority of adults each week, rising fastest among younger age groups. (agent inferred)
Methodology note · Ofcom 'Strategic Approach to AI 2025/26' policy document. PDF direct fetch returns HTTP 403; findings cross-verified via TechUK summary, Bird & Bird legal analysis, Wiggin LLP insight, and the Ofcom site overview page. Three AI risk pillars identified (synthetic media, personalisation, security & resilience); technology-neutral regulatory approach.
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
arXiv · 2026
Empirical comparison of Google Search, Gemini, and Google AI Overviews finds that AI Overviews and Gemini converge on a small set of authoritative sources, while traditional Google Search returns more diverse results. AI surfaces also rewrite or paraphrase source content rather than reproducing it verbatim, making click-through behaviour and citation attribution measurably different from classical SERPs.
Methodology note · arXiv preprint 2604.27790 (April 2026). Empirical study running matched queries across three Google search surfaces (classic Search, Gemini, AI Overviews) and analysing citation overlap, source diversity, and answer paraphrasing rates. Direct fetch returned the abstract page.
From Searchable to Non-Searchable: Generative AI and Information Diversity in Online Information Seeking
arXiv · 2026
Generative AI search systems systematically narrow the range of sources users encounter compared with traditional search. Across controlled experiments, participants exposed to AI-generated answers were shown fewer distinct domains and fewer perspectives on the same query than participants using ranked-link search results. The effect compounds with repeated use, reducing source diversity over a session.
Methodology note · arXiv preprint 2604.10258 (April 2026). Experimental study comparing source-diversity outcomes between generative AI search and traditional web search. Direct fetch returned the HTML preprint; methodology and effect sizes are reported in the full paper.
The State of Content Authenticity 2026
Content Authenticity Initiative (Adobe-led coalition) · 2026
Content Credentials, the open provenance standard for verifying how a piece of media was made, moved from specification to consumer reality in 2025. The Content Authenticity Initiative passed 6,000 members. The Google Pixel 10 and Sony PXW-Z300 video camera ship with Content Credentials. A C2PA conformance program, the CAWG 1.2 specification, and developer education at learn.contentauthenticity.org now back the ecosystem.
Methodology note · First-party annual essay by the Senior Director of the Content Authenticity Initiative, summarising the state of the C2PA provenance standard and the CAI membership ecosystem at the end of its fifth year. Figures cited are membership counts, named hardware and software releases, and named specifications and programs run by C2PA, CAWG, and partners.
Digital 2026 Mid-Year Global Update Report
DataReportal / We Are Social / Meltwater · Simon Kemp · 2026
6.12 billion people were using the internet in April 2026, nearly three-quarters of the world's population. 81.2% of online adults used at least one form of AI in the past month, an estimated 4.02 billion people. Roughly 60% of those, about 2.42 billion, use standalone generative AI platforms such as ChatGPT, Gemini, and Doubao. ChatGPT alone has around 1.15 billion monthly active users.
Methodology note · DataReportal's mid-year update aggregates data from GWI (a Q4 2025 survey of more than 240,000 people across 54 economies), Similarweb App and Web Intelligence, OpenAI's published 900 million weekly active user figure, and Manochi's population modelling. Figures cover internet penetration, AI tool adoption, and generative AI platform usage.
Structural Feature Engineering for Generative Engine Optimization: How Content Structure Shapes Citation Behavior
arXiv · 2026
A structural-engineering framework called GEO-SFE separates content structure into three layers: document architecture, information chunking and visual emphasis. Applied to the same underlying text, the framework lifts citation rates in generative engines by 17.3% on average and subjective answer quality by 18.5% across six mainstream AI search engines. The semantic content itself is preserved; only structure changes.
Methodology note · arXiv paper 2603.29979 by Yu, Yang, Ding and Sato, submitted March 2026. The authors define structural features at macro, meso and micro levels and build predictive models for citation probability that are tuned per engine. They evaluate the framework against six generative engines and report consistent gains in citation rate and quality across configurations.
AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization
arXiv · 2026
AgenticGEO proposes a self-evolving agentic system for generative engine optimization where multiple agents iteratively rewrite source content based on observed AI citation outcomes, then update their rewriting strategy across rounds. The system outperforms static GEO heuristics on benchmark queries, showing that adaptive, multi-round optimization produces higher citation lift than fixed transformations.
Methodology note · arXiv preprint 2603.20213 (March 2026). Direct fetch of the abstract page returned an empty PDF body; methodology and empirical results were confirmed through Google Scholar listings and the title and abstract via the arXiv API. Treat the specific lift figures with caution until the full PDF can be verified.
Diagnosing and Repairing Citation Failures in Generative Engine Optimization (AgentGEO)
arXiv · 2026
AgentGEO diagnoses citation failures in generative engine optimization by simulating the multi-step retrieval-and-generation process, identifying which step caused a candidate source to be missed, then proposing a targeted repair. Empirical tests across GEO benchmarks show that step-targeted repairs outperform end-to-end rewriting strategies for boosting citation rate.
Methodology note · arXiv preprint 2603.09296 (March 2026). Direct fetch returned the abstract page. The method is evaluated against existing GEO benchmarks; cross-verified via the arXiv API listing because the PDF body was not directly inspectable.
Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval
WordLift · Andrea Volpini et al. · 2026
Adding Schema.org JSON-LD to plain HTML produced only modest retrieval gains. An enhanced entity-page format combining structured data with rich internal linking and navigational affordances delivered a 29.6% accuracy improvement for standard retrieval-augmented generation and 29.8% for the full agentic pipeline with multi-hop link traversal.
Methodology note · arXiv paper 2603.10700 by Volpini, Raad, Gamba and Riccitelli at WordLift. The team ran a controlled experiment across four domains (editorial, legal, travel, ecommerce) using Vertex AI Vector Search 2.0 and the Google Agent Development Kit. Seven conditions compared plain HTML, HTML with JSON-LD, and enhanced entity pages, each under standard and agentic retrieval modes. Verified via the arXiv HTML version when the PDF was inaccessible.
Publishing Industry Under Attack: AI Bot Activity Surges 300% (Akamai SOTI)
Akamai · 2026
Akamai's SOTI report (April 2026) finds AI bot activity surged 300% in 2025, with the media industry ranking second globally at 13% of AI bot traffic. Publishing organisations accounted for 40% of media-targeted AI bot activity. AI training crawlers made up 63% of AI bots targeting media; AI fetchers were 24%. OpenAI generated the highest volume of AI bot traffic against media, with publishing taking 40% of OpenAI's media requests.
Methodology note · Akamai press release, April 8 2026, summarising the State of the Internet AI Botnet Report 2025 (Volume 11 Issue 04). Direct fetch on akamai.com confirmed the press release HTML and the headline statistics. The underlying SOTI report is the same document referenced by R166.
Anthropic Economic Index Report: Learning Curves (March 2026)
Anthropic · 2026
Use of Claude.ai diversified between November 2025 and February 2026: the top 10 tasks fell from 24% to 19% of traffic. 49% of US jobs have seen at least a quarter of their tasks performed using Claude. The average estimated hourly wage of tasks on Claude.ai fell from 49.30 dollars to 47.90 dollars. Users with at least 6 months of experience have a 10% higher success rate in conversations.
Methodology note · Anthropic analysed roughly 1 million sampled conversations each from Claude.ai and its first-party API in February 2026 using a privacy-preserving system. Tasks were classified against O*NET occupational codes, augmentation versus automation patterns, model selection (Haiku, Sonnet, Opus), and user tenure. The dataset is public on Hugging Face.
Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia
arXiv · 2026
After Google rolled out AI Overviews, pageviews to Wikipedia articles whose topics frequently trigger AI summaries fell measurably more than pageviews to comparable control articles. The estimated traffic loss attributable to AI summaries is in the range of single-digit to low double-digit percentages on affected article sets. (agent inferred)
Methodology note · arXiv preprint 2602.18455 by Mehrzad Khosravi and Hema Yoganarasimhan (University of Washington), submitted 5 February 2026, last revised 12 May 2026 (v4). Direct fetch on arxiv.org returned the abstract page confirming the causal-impact methodology using Wikipedia article-topic variation as the identification strategy.
Battle for the Interface: Introducing the Consumer AI Disruption Index
Boston Consulting Group · 2026
67% of senior marketing leaders expect a high level of AI-driven disruption to their vertical's consumer journey, and nearly all expect some disruption. Travel, retail, and news are most exposed (high disruption risk plus weak customer relationships), while financial services, fintech, and media or streaming are most protected.
Methodology note · BCG and Moloco built the Consumer AI Disruption Index across 17 consumer-facing verticals, scoring each on two axes: AI-driven disruption (discovery disruption and service model exposure) and customer relationship strength (acquisition strength, sustained loyalty, platform engagement depth). A survey of 238 senior marketing leaders informs the verticals' archetype placement.
The 2026 Generative AI Brand Visibility Index
Similarweb · 2026
AI assistants recommend an average of 6 to 11 brands per prompt depending on the category. Established market leaders dominate AI answers in some sectors but are absent in others. Sectors where AI search is shifting brand consideration the fastest include cosmetics, consumer electronics, and financial services. Reddit and Wikipedia are the most-cited third-party sources.
Methodology note · 11,000 prompts run across ChatGPT, Google AI Overviews, Perplexity, Gemini, and Microsoft Copilot, covering 113 brands across 6 sectors. The Similarweb team measured brand mention frequency, share of voice within each prompt, and the source domains cited by each AI engine. Published February 2026.
Anthropic Crawler Documentation (ClaudeBot, Claude-User, Claude-SearchBot)
Anthropic · 2026
Anthropic operates three distinct web crawlers with separate robots.txt user agents: ClaudeBot collects content for training foundation models, Claude-User fetches pages on demand when users ask Claude a question, and Claude-SearchBot indexes content for Claude's search features. Each can be allowed or blocked independently, letting site owners opt out of training while still appearing in Claude's search answers. All three respect robots.txt and support the non-standard Crawl-delay directive.
Methodology note · Official Anthropic crawler documentation, formalised in updates throughout 2025. The page was inaccessible to direct fetch; user-agent strings, behaviour and robots.txt rules were confirmed against Anthropic's Claude Help Center article and multiple secondary references (Search Engine Journal, Search Engine Land, Search Engine Roundtable) reporting the same three-bot framework.
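As an illustration of the independent allow and block rules described above, the sketch below defines a hypothetical robots.txt that opts out of training while staying available to Claude's search and on-demand fetches, then checks it with Python's standard urllib.robotparser. The rules, the example.com URL, and the Crawl-delay value are our own assumptions, not a recommended configuration, and Python's matcher only approximates how a given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow the other two.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /
Crawl-delay: 2

User-agent: Claude-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"  # placeholder URL
for agent in ("ClaudeBot", "Claude-SearchBot", "Claude-User"):
    print(agent, "allowed:", parser.can_fetch(agent, url))

# Crawl-delay is non-standard, but the stdlib parser exposes it when present.
print("Claude-SearchBot crawl delay:", parser.crawl_delay("Claude-SearchBot"))
```

Checking the file locally like this catches typos in user-agent names before a rule silently fails to apply.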
The Rise of AI Search: Implications for Information Markets and Human Judgement at Scale
MIT IDE · Sinan Aral et al. · 2026
Across controlled experiments comparing AI search engines (ChatGPT, Perplexity, Google AI Overviews) with traditional search, AI search significantly reduces clicks to source publishers and concentrates attention on a smaller set of authoritative domains. Users exposed to AI summaries form more confident but less accurate beliefs on contested topics. (agent inferred)
Methodology note · arXiv preprint 2602.13415 by Sinan Aral, Haiwen Li and Rui Zuo (MIT Sloan), submitted 13 February 2026. Direct fetch on arxiv.org confirmed authorship and the 24,000 queries / 2.8 million results / 243 countries scope. Companion to R128 from the same lab.
SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization
arXiv · 2026
SAGEO Arena introduces a realistic environment for evaluating search-augmented generative engine optimization, simulating the full pipeline from query through retrieval to answer generation. Empirical tests across published GEO methods show that arena-based evaluation reveals failures that simpler benchmarks miss, particularly under realistic source-distribution drift and adversarial competition.
Methodology note · arXiv preprint 2602.12187 (February 2026). Direct fetch on arxiv.org returned the HTML preprint with the full methodology and arena specification; the released benchmark covers multiple search engines and GEO method variants.
GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
arXiv · 2026
Large-scale analysis of citation validity across LLM outputs finds that a meaningful share of citations are either fabricated (no real source exists at the cited URL), misattributed (real source but unrelated to the claim), or hallucinated (made-up author/title combinations). Citation validity varies substantially across vendors and is worst for niche or recent topics.
Methodology note · arXiv preprint 2602.06718 (February 2026), GhostCite. Direct fetch returned an empty PDF body; methodology and scope cross-verified via the arXiv API listing. Treat the specific validity percentages with caution until the full PDF can be verified.
Controlling Output Rankings in Generative Engines for LLM-based Search (CORE)
arXiv · 2026
CORE introduces a method for controlling which sources appear in generative engine answers by intervening on the retrieval step, allowing search providers to enforce ranking constraints (such as freshness or authority) in LLM-based search. Empirical tests show CORE meaningfully shifts cited-source distributions without degrading answer quality on benchmark queries.
Methodology note · arXiv preprint 2602.03608 (February 2026). Method paper on controlling output rankings in LLM-based search. Direct fetch returned the abstract page. The evaluation uses public QA benchmarks and the authors compare CORE against several baseline ranking interventions.
Pinterest: Generative Engine Optimization — A VLM and Agent Framework for Acquisition Growth
Pinterest · 2026
Individual images lack the words and authority signals that generative search rewards, so visual platforms risk being skipped over while users get their answer in the chat. Pinterest's response is to predict what users would search for from each image, group images into theme pages, and link them with authority signals. The live system added 20% organic traffic growth.
Methodology note · First-party engineering paper from Pinterest. Vision-Language Models were fine-tuned to predict likely search queries for each image, aided by agents that mine real-time internet trends. Predicted queries drive collection pages built from multimodal embeddings, with hybrid two-tower nearest-neighbour architectures handling authority-aware interlinking. The system runs in production across billions of images and tens of millions of collections.
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Allen Institute for AI / University of Washington · 2026
OpenScholar is an open retrieval-augmented language model designed to synthesise scientific literature. On a benchmark of expert-annotated questions, OpenScholar matches or exceeds the citation accuracy of much larger commercial systems while being fully open-source. The paper releases both the model and the benchmark for reproducibility.
Methodology note · arXiv preprint 2411.14199 (November 2024). Direct fetch returned the abstract page. The system is evaluated against commercial deep research agents on a curated benchmark of scientific questions with expert-annotated answers; both model and benchmark are released.
Introducing AI Performance in Bing Webmaster Tools (Public Preview)
Microsoft · 2026
Bing Webmaster Tools added an AI Performance dashboard in public preview on February 10, 2026. It shows total citations of a publisher's site across Microsoft Copilot, AI-generated summaries in Bing, and partner integrations, plus average cited pages per day, page-level citation counts, and grounding query phrases. Publishers can use the data to see which pages are referenced in AI answers and how that activity changes over time.
Methodology note · Official Microsoft Bing Webmaster blog post by Krishna Madhavan, Meenaz Merchant, Fabrice Canel, and Saral Nigam, announcing the public preview. The post positions AI Performance as an early Generative Engine Optimization tool, and references IndexNow as the recommended way to keep cited content fresh.
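For readers unfamiliar with IndexNow, a submission is a small JSON POST listing the URLs that changed. The sketch below builds such a request without sending it. The host, key, and URL are placeholders, and the key-file location assumes the key is hosted at the site root as a .txt file; check the IndexNow documentation before relying on any of this.

```python
import json
from urllib.request import Request


def build_indexnow_request(host, key, urls,
                           endpoint="https://api.indexnow.org/indexnow"):
    """Build (but do not send) an IndexNow URL-submission request."""
    payload = {
        "host": host,
        "key": key,
        # Assumes the key file sits at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Placeholder values for illustration only.
req = build_indexnow_request(
    "www.example.com",
    "hypothetical-key-123",
    ["https://www.example.com/pricing"],
)
body = json.loads(req.data)
print(req.get_method(), req.full_url)
print(body)
```

Sending the request would be a single urllib.request.urlopen(req) call; we leave it out so the sketch has no network side effects.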
AI Assistants Head into 2026 on a High Note: Triple-Digit Growth on Mobile
Comscore · 2026
Mobile unique visitors to leading AI assistants reached 54.3 million in December 2025, up 107% year over year. Desktop unique visitors hit 83.0 million, up 18%. ChatGPT led mobile at 34.5 million (up 84%) and desktop at 56.4 million (up 83%). Gemini grew 137% on mobile and 648% on desktop. Microsoft Copilot more than tripled on mobile (up 246%). Perplexity rose 265% on mobile.
Methodology note · Comscore measured unique visitors to leading AI assistant destinations across mobile and desktop using its cross-platform CustomIQ panel, comparing December 2025 against December 2024. The data covers OpenAI ChatGPT, Google Gemini, Microsoft Copilot, Perplexity, Meta, and Anthropic Claude.
Search Happens Everywhere: An Analysis of 41 Websites with Significant Search Activity
SparkToro + Datos · Rand Fishkin · 2026
Google was responsible for 73.7% of all US desktop searches across the 41 domains analysed in Q4 2025. Traditional search engines accounted for about 80% of search activity, commerce sites about 10%, social networks 5.5%, and AI tools 3.2%. Amazon, Bing, and YouTube each saw more desktop search activity than ChatGPT. In 2025, Google lost 3.5 points of share.
Methodology note · SparkToro and Datos (a Semrush company) analysed 2025 desktop clickstream data from millions of devices in the US and the 27 EU countries plus the UK, covering 41 editorially selected domains across traditional search, e-commerce, AI tools, reference, travel, real estate, and classifieds. Mobile activity was excluded.
Associating AI Usage Preferences with Content in HTTP (draft-ietf-aipref-attach)
IETF AIPREF Working Group · 2026
IETF draft 'draft-ietf-aipref-attach' defines how AI usage preferences expressed by content publishers can be attached to HTTP responses, complementing the AIPREF vocabulary draft (draft-ietf-aipref-vocab). The document specifies HTTP header syntax, machine-readable attachment formats, and conflict-resolution rules when preferences are signalled at multiple levels (server, file, response).
Methodology note · IETF Internet-Draft draft-ietf-aipref-attach-04 (status: Expired Internet-Draft, AIPREF Working Group). Direct fetch on datatracker.ietf.org returned the draft index and document metadata. Companion draft to the AIPREF vocabulary (R99) covering the attachment mechanism in HTTP rather than the vocabulary itself.
A Vocabulary For Expressing AI Usage Preferences (draft-ietf-aipref-vocab)
IETF AIPREF Working Group · 2026
The IETF AIPREF working group is developing a standard vocabulary for websites to express how their content can be used by AI systems. The current draft defines two usage categories, train-ai and search, each of which can be marked allow, disallow, or unspecified. A site might publish train-ai=y, search=n to permit AI training while disallowing search indexing. The format is designed to plug into robots.txt and other carriers.
Methodology note · Working group Internet-Draft from the IETF AI Preferences group, edited by Paul Keller (Open Future) and Martin Thomson (Mozilla). The version reviewed is draft-ietf-aipref-vocab-06, last updated April 28, 2026, intended for Proposed Standard status. The document is a work in progress and does not yet reflect working group consensus.
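To show how a consumer might read the example expression quoted above, here is a minimal parsing sketch. It handles only the simple "category=y/n" form from that example; the draft's actual grammar may be richer, so treat the function and its state names as our simplification, not an implementation of the specification.

```python
CATEGORIES = ("train-ai", "search")  # the two categories named in the entry above
VALUES = {"y": "allow", "n": "disallow"}


def parse_preferences(expr):
    """Parse e.g. 'train-ai=y, search=n' into a category -> state mapping.

    Categories that are missing or carry an unknown value stay 'unspecified'.
    Simplified for illustration; this is not the draft's full grammar.
    """
    prefs = {category: "unspecified" for category in CATEGORIES}
    for item in expr.split(","):
        key, sep, value = item.strip().partition("=")
        key, value = key.strip(), value.strip()
        if sep and key in prefs and value in VALUES:
            prefs[key] = VALUES[value]
    return prefs


prefs = parse_preferences("train-ai=y, search=n")
print(prefs)
print(parse_preferences("search=n"))
```

Defaulting to "unspecified" instead of guessing allow or disallow mirrors the three-state model described in the entry.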
The Discovery Gap: How Product Hunt Startups Vanish in LLM Organic Discovery Queries
arXiv · Amit Prakash Sharma · 2026
When users named a product, ChatGPT recognised it 99.4% of the time and Perplexity 94.3%. When they asked discovery questions such as "best AI tools launched this year", success collapsed to 3.32% and 8.29% respectively. Generative-engine-optimisation scores did not predict discovery. Referring domains, Product Hunt ranking, and Reddit presence did, suggesting traditional SEO foundations carry over to AI visibility.
Methodology note · Independent study of 112 startups randomly drawn from the top 500 on the 2025 Product Hunt leaderboard, tested with 2,240 queries across ChatGPT (gpt-4o-mini) and Perplexity (sonar with web search). Correlations were reported between visibility and signals such as referring domains, Product Hunt rank, GEO scores, and Reddit presence, with p-values.
Fastly's Q3 2025 Threat Insights Report covers AI bot traffic patterns observed on Fastly's network of 130,000+ apps and APIs. The report distinguishes AI crawlers from AI fetchers, examines bot verification challenges, and provides regional breakdowns of AI bot composition. Detailed quarterly metrics on volume, vendor share, and industry vertical impact are reported in the PDF.
Methodology note · Fastly Threat Insights Report PDF. Direct fetch confirmed PDF accessibility but the body is not machine-readable in this environment. Findings cross-verified against Fastly's accompanying blog post (R179) and the company's quarterly press release on businesswire.com, which summarise the headline statistics from the report.
The 2025 Cloudflare Radar Year in Review: The Rise of AI, Post-Quantum Encryption, and More
Cloudflare · 2025
AI bots account for around 20% of all verified bot traffic on the Cloudflare network, with crawling for training purposes the largest single use. AI crawling activity for real-time user actions grew roughly 15 times year on year. The most active AI crawlers in 2025 were Meta-ExternalAgent, GPTBot, and ClaudeBot.
Methodology note · Aggregate analysis of HTTP request data across the Cloudflare network, which routes a substantial share of global web traffic. The Radar team segments verified bot traffic by user agent and purpose (training, search, user action). Published December 2025.
Source Coverage and Citation Bias in LLM-based vs. Traditional Search Engines
arXiv · 2025
Empirically compares source coverage and citation bias between LLM-based search engines and traditional search. Finds that LLM-based search systematically over-represents large, English-language, US-based sources and under-represents smaller and non-English content compared with what traditional search returns for the same queries. Bias is consistent across the major LLM search providers tested.
Methodology note · arXiv preprint 2512.09483 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper runs matched queries across LLM-based and traditional search systems and quantifies citation distribution by source size, language, and geography.
OpenAI Crawler Documentation Update (Dec 2025 — narrows robots.txt compliance)
OpenAI · 2025
In the December 9, 2025 update, OpenAI's bot documentation removed the previous claim that OAI-SearchBot feeds navigational links into ChatGPT answers and dropped any reference to OAI-SearchBot supplying training data. ChatGPT-User was expanded to explicitly cover Custom GPT requests and GPT Actions, and robots.txt is no longer applied to user-initiated ChatGPT-User actions. OpenAI also confirmed OAI-SearchBot and GPTBot share crawl results to avoid duplicate fetching.
Methodology note · Same canonical OpenAI documentation page, captured after the December 9, 2025 revision identified publicly by Pieter Serraris. Direct diff was not available; changes were confirmed against the live developers.openai.com/api/docs/bots page and detailed write-ups on PPC Land, Search Engine Roundtable and Stan Ventures comparing pre- and post-update language.
Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation
arXiv · Mustahsan · 2025
Quantifies stochasticity in agentic LLM evaluations using intraclass correlation coefficients (ICC). Shows that single-run evaluations of agentic systems are unreliable because run-to-run variance is large relative to the gap between system variants. Recommends a minimum of 5 to 10 repeated runs per evaluation and reports the ICCs for several common agentic benchmarks.
Methodology note · arXiv preprint 2512.06710 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper applies the intraclass correlation coefficient framework from psychometrics to LLM agent evaluation and reports ICC values across multiple published benchmarks.
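To make the repeated-runs recommendation concrete, here is a minimal sketch of one common ICC variant, the one-way random-effects ICC(1,1), computed over a table of tasks by repeated runs. The entry does not say which ICC form the paper uses, so the choice of variant and the function name icc_oneway are assumptions on our part.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: one row per task, one column per repeated run of the same system.
    Values near 1 mean runs agree; values near 0 or below mean run-to-run
    noise dominates the differences between tasks.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 tasks and 2 runs, with equal run counts")

    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]

    # Between-task and within-task mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("all scores are identical; ICC is undefined")
    return (ms_between - ms_within) / denom


if __name__ == "__main__":
    stable = [[1, 1, 1], [0, 0, 0], [1, 1, 1]]  # every run gives the same result per task
    print(icc_oneway(stable))
```

A single run gives no way to estimate the within-task term at all, which is why the check above requires at least two runs per task.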
Agentic AI in Retail: How Autonomous Shopping Is Redefining the Customer Journey
Bain & Company · 2025
30% to 45% of US consumers say they already use generative AI for product research and comparison. AI now accounts for up to a quarter of referral traffic at some retailers, though still less than 1% of their total traffic. Consumers say they trust retailer-owned agents three times more than third-party agents, but about half are uncomfortable letting AI run an end-to-end transaction.
Methodology note · Bain combines its Consumer Lab Generative AI Survey with retailer analytics (Similarweb estimates, Adobe data), case studies of retailer AI launches (Amazon Rufus, Magalu Lu, Home Depot Magic Apron), and references to recent academic work from Columbia and Yale on how agents weight reviews and ratings.
The Agentic Commerce Opportunity: How AI Agents Are Ushering in a New Era for Consumers and Merchants
McKinsey QuantumBlack · 2025
McKinsey estimates that AI agents could unlock up to 1 trillion dollars in US B2C retail revenue by 2030 as consumers delegate routine shopping to agents and merchants adapt to agent-mediated discovery, comparison, and checkout. (agent inferred)
Methodology note · McKinsey QuantumBlack analysis combining proprietary consumer research, modelling of agentic commerce adoption curves, and merchant case studies. The piece sizes the opportunity by category and outlines the technical and organisational changes retailers and brands need to make to be discoverable and transactable by autonomous agents.
Akamai's 2025 AI Botnet Report finds AI bot activity surged 300% year-over-year and AI bots now make up nearly 1% of total bot traffic on Akamai's platform. The commerce industry saw more than 25 billion bot requests in July-August 2025. In healthcare, over 90% of AI bot triggers were associated with scraping. North America accounted for 54.9% of AI bot activity in the period.
Methodology note · Akamai State of the Internet (SOTI) AI Botnet Report, Volume 11 Issue 04, 2025. Direct fetch of the SOTI landing page confirmed accessibility; full report is a downloadable PDF. Findings cross-verified against Akamai's official press release and trade press coverage on cybersecurityasia.net and cxomedia.in.
E-GEO: A Testbed for Generative Engine Optimization in E-Commerce
arXiv · 2025
Across 15 common product-page rewriting tactics tested on a shopping benchmark, no single hand-crafted heuristic reliably wins. A simple iterative prompt-optimisation routine outperforms all of them. The optimised prompts converge on the same pattern across categories, pointing to a stable, domain-agnostic recipe for making product listings more visible to conversational shopping agents.
Methodology note · First public e-commerce GEO benchmark (E-GEO) with over 7,000 multi-sentence consumer product queries paired with relevant listings, capturing intent, constraints, and shopping context. The authors evaluated 15 rewriting heuristics on this benchmark, then formulated GEO as an optimisation problem and ran a lightweight iterative prompt-optimisation algorithm. Data and code are public.
Losing Control: How Zero-Click Search Affects B2B Marketers
Bain & Company · 2025
Click-through rates fell sharply in the year after Google introduced AI-generated summaries, with declines reaching 30% in some B2B categories including B2B software. 85% of B2B buyers purchase from their day-one list, the vendors they had in mind before searching, leaving brands less able to influence shortlists through smart search strategies.
Methodology note · Bain analysed click-through rate trends from B2B searches before and after the rollout of Google's AI-generated summaries (AI Overviews), combined with research on B2B buyer behaviour and shortlist formation. The Snap Chart format presents early data with directional commentary rather than a full study report.
AI Agents Will Reshape E-Commerce — European Players Must Prepare Now
Boston Consulting Group · 2025
AI search visits in Europe grew from 4% of organic visits in early 2024 to 8% in early 2025, and are projected to reach 25% by the end of 2026 and overtake organic in 2028. LLM referral traffic to leading European retailers is up more than 2,000% in fashion, nearly 1,200% in luxury, and almost 7,500% in specialty retail.
Methodology note · BCG analysed traffic patterns for a sample of leading European brands and retailers, comparing organic search visits to referrals from generative AI platforms (LLM browsers and chat services) across multiple categories. The piece combines proprietary BCG benchmarks with US adoption data as a forward indicator for Europe and projects growth curves through 2028.
Gen AI Inside Existing Search Engines Overtakes Standalone Gen AI (TMT Predictions 2026)
Deloitte · 2025
Deloitte forecasts that in 2026, about 29% of adults in developed markets will run at least one search per day returning a generative AI summary, versus 10% using a standalone generative AI app daily. Daily passive AI use is projected to stay about three times standalone use through 2027. By mid-2026, 72% of adults will have generated a search overview versus 61% who used a standalone tool.
Methodology note · Deloitte's TMT Predictions 2026 forecast draws on its proprietary Digital Consumer Trends survey (fielded in April and May 2025 across multiple developed markets, with longitudinal data from 2023 and 2024), Alphabet's reported AI Overviews monthly usage of over 2 billion, and additional industry data points.
Gartner Survey: Only One-Third of Consumers Say GenAI Rivals Search Engines
Gartner · 2025
Only about one in three consumers say generative AI rivals search engines for finding information, with most still preferring traditional search for general queries. GenAI tools see higher use for creative, brainstorming, and writing tasks than for product research or factual lookup. (agent inferred)
Methodology note · Gartner press release on consumer GenAI preferences. Original URL returns HTTP 403 (Cloudflare bot challenge); findings cross-verified via Demand Gen Report, MarketScreener and Digit.fyi coverage of the same release. Sample: 377 US consumers, June-July 2025. Gartner frames the result as a reason for marketers to optimise for both AI-driven and traditional search.
Google: Authenticating Requests with Web Bot Auth (Experimental)
Google · 2025
Google is testing Web Bot Auth, a cryptographic protocol that lets bots sign HTTP requests so sites can verify their identity beyond user-agent strings or IP address ranges. During the experimental phase, only some Google AI agents sign requests, and signatures use HTTP Message Signatures (RFC 9421) keyed to https://agent.bot.goog. Google recommends continued reliance on reverse DNS and published IP ranges as a fallback.
Methodology note · Official Google Crawling Infrastructure documentation, last updated May 4, 2026, describing Google's implementation of the IETF Web Bot Auth Internet-Draft. The page links to the IETF Working Group, a Cloudflare reference implementation on GitHub, and a feedback form. Web Bot Auth itself is still a draft specification that may change.
HTTP Message Signatures for Automated Traffic Architecture (Web Bot Auth)
IETF · Meunier · 2025
IETF Internet-Draft 'draft-meunier-web-bot-auth-architecture' defines an architecture for HTTP message signatures applied to automated bot traffic. The architecture supports cryptographic verification of bot identity, allowing sites to confirm whether a self-identified Googlebot or GPTBot is genuinely from the claimed vendor. The draft aims to replace reverse-DNS verification as the standard mechanism for bot authentication.
Methodology note · IETF Internet-Draft draft-meunier-web-bot-auth-architecture-05. Direct fetch on datatracker.ietf.org returned the draft index and metadata. Companion architecture to the existing HTTP Message Signatures RFC (RFC 9421); applied specifically to bot-traffic authentication for AI crawlers and other automated agents.
OG-RAG: Ontology-grounded Retrieval-Augmented Generation for LLMs
EMNLP 2025 · Sharma et al. · 2025
OG-RAG grounds retrieval-augmented generation in ontologies rather than free-text documents, retrieving structured concepts and their relationships to provide more precise context to the LLM. On benchmark QA tasks, OG-RAG outperforms standard text-RAG by reducing irrelevant retrieval and improving answer specificity, with the largest gains on multi-hop questions requiring structured reasoning.
Methodology note · ACL Anthology entry for EMNLP 2025 (main conference) by Kartik Sharma, Peeyush Kumar and Yunqing Li. Direct fetch on aclanthology.org confirmed authorship and venue. Empirical evaluation against text-RAG baselines on standard QA benchmarks; method details in the published paper.
Redefining Retrieval Evaluation in the Era of LLMs
arXiv · 2025
Argues that traditional retrieval evaluation metrics (recall, MRR) underestimate the value of retrieval in LLM-based pipelines because LLMs can compensate for partial retrieval through their pre-existing knowledge. Proposes new metrics that measure retrieval value conditional on the LLM's downstream behaviour, finding that some 'high-recall' retrievers are actually worse for LLM-based search.
Methodology note · arXiv preprint 2510.21440 (October 2025). Direct fetch returned the abstract page. The paper introduces conditional retrieval metrics evaluated against standard RAG benchmarks and shows that metric choice changes the relative ranking of common retrieval methods.
Citation Failure: Definition, Analysis and Efficient Mitigation (CITECONTROL)
arXiv · 2025
Defines citation failure as a measurable phenomenon in RAG systems where retrieved documents are not cited even when they support the answer. Introduces CITECONTROL, a method to detect and mitigate citation failure that improves citation recall without degrading answer quality. The method is lightweight and integrates with standard RAG pipelines.
Methodology note · arXiv preprint 2510.20303 (October 2025). Direct fetch returned the abstract page. The paper introduces a formal definition of citation failure and an empirical benchmark across multiple RAG systems, with CITECONTROL's improvements measured on standard QA datasets.
OpenAI launched ChatGPT Atlas, an AI-native web browser that embeds ChatGPT directly into browsing, summarises pages, answers questions in the sidebar, and can carry out multi-step tasks on the user's behalf such as filling forms, comparing products, and completing purchases. (agent inferred)
Methodology note · First-party product launch announcement from OpenAI. The post introduces Atlas as a Chromium-based browser with ChatGPT integrated as the default interface, agentic capabilities for browsing and transacting, and memory of past sessions. Initial availability is on macOS with other platforms to follow.
New Front Door to the Internet: Winning in the Age of AI Search
McKinsey & Company · 2025
McKinsey projects that AI-powered search will mediate roughly $750 billion in US consumer revenue by 2028, representing a meaningful share of category-level discovery. Brands that win in AI answers tend to combine strong third-party coverage, structured product information, and active management of their entity presence across the open web.
Methodology note · McKinsey synthesis of consumer survey data, enterprise interviews, and proprietary modelling. The report combines a quantitative consumer survey on AI search adoption with case-level analysis of brand performance in AI answers. Published October 2025. Forecast figures should be cited as projections, not measured outcomes.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Kempelen Institute / EACL 2026 · Vykopal et al. · 2025
Evaluates how reliably chat assistants ground answers in the web search results they cite. Tests show that even when models retrieve from credible sources, their summaries frequently include claims not supported by the retrieved passages. Citation alone does not imply groundedness, and the gap is largest for nuanced or contested topics.
Methodology note · arXiv preprint 2510.13749 (October 2025). Direct fetch returned an empty PDF body; abstract and methodology cross-verified via the arxiv API listing and abstract summary. Empirical study comparing retrieved-source content with cited claims across several chat assistants.
What Generative Search Engines Like and How to Optimize Web Content Cooperatively (AutoGEO)
arXiv · 2025
AutoGEO is a framework that extracts the preferences generative search engines apply when picking and rewriting content for AI answers. The researchers turn those preferences into rewriting rules, then test them on the GEO-Bench benchmark plus two new benchmarks built from real user queries. Both the prompt-based AutoGEO API and the trained AutoGEO Mini model raise content traction in AI answers while preserving search utility.
Methodology note · Academic preprint posted on arXiv on October 13, 2025, by researchers from Carnegie Mellon (Yujiang Wu, Shanshan Zhong, Yubin Kim, Chenyan Xiong). The team probes frontier large language models to surface preference rules, then uses them as context engineering for one system and as rule-based rewards for training a smaller cost-efficient model. Code is released on GitHub.
Characterizing Web Search in The Age of Generative AI
Ruhr University Bochum / Max Planck Institute for Software Systems · Elisabeth Kirsten et al. · 2025
Generative search and traditional web search return different things even for the same query. Generative engines pull from a broader pool of sources than Google web search, mix in varying amounts of internal model knowledge versus retrieved pages, and surface different concept sets. That widens the set of pages that can earn visibility, but also breaks assumptions baked into classical ranked-list evaluation.
Methodology note · Academic comparison of one traditional engine (Google web search) with four generative engines from Google and OpenAI, run across queries from four content domains. The authors measured source coverage, the balance between model-internal knowledge and externally retrieved web pages, and the concepts surfaced in each output.
Is Misinformation More Open? A Study of robots.txt Gatekeeping on the Web
arXiv · 2025
Examines whether misinformation websites are more permissive to AI crawlers than mainstream sites by analysing robots.txt directives across thousands of domains. Finds that low-credibility and misinformation domains block AI crawlers significantly less often than high-credibility news sources, meaning AI training data is systematically biased toward less reliable material at the source-access stage.
Methodology note · arXiv preprint 2510.10315 (October 2025). Direct fetch on arxiv.org returned the abstract page. The paper analyses robots.txt files across a labelled dataset of misinformation and mainstream news sites, comparing AI-crawler block rates by credibility tier.
Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness (RDR²)
arXiv · 2025
Treating retrieved passages as isolated chunks throws away signal that the original document layout carries. A router that navigates a document's structure tree, scoring both passage relevance and its position in the hierarchy, sets a new state of the art on multi-document question answering. Headings, section order, and parent-child relationships are themselves a ranking signal.
Methodology note · Academic paper (RDR², EMNLP 2025 Findings) introducing a trainable document-routing step inside the retrieve-and-read pipeline. An LLM-based router walks document structure trees with automatic action curation and structure-aware passage selection. The framework was evaluated across five question-answering datasets that demand multi-document synthesis.
Generative AI and News Report 2025
Reuters Institute, University of Oxford · Nic Newman et al. · 2025
Across six countries, only about 7% of adults say they use ChatGPT or another generative AI tool to access news in a typical week. Trust in AI for news is low: most respondents say AI-generated news content makes them feel uncomfortable, and only a minority think AI will improve journalism. (agent inferred)
Methodology note · Reuters Institute for the Study of Journalism (Oxford), 2025 Digital News Report companion publication. Direct fetch confirmed the report page and the host institution; full survey methodology (six-country sample, weekly news-access patterns) is in the linked report PDF.
Rethinking Web Cache Design for the AI Era (SOCC '25)
ETH Zurich + Cloudflare · Yazhuo Zhang, Berger · 2025
The peer-reviewed SOCC 2025 paper 'Rethinking Web Cache Design for the AI Era' by Yazhuo Zhang and colleagues (ETH Zurich) shows that traditional web caches are not designed to absorb high-diversity, low-reuse AI scraper traffic. Read the Docs reported 73TB of HTML scraped in one month; Wikimedia reported a 50% backend bandwidth increase from AI scrapers. Proposes filter-and-tier cache architectures.
Methodology note · Peer-reviewed paper at ACM Symposium on Cloud Computing 2025 (DOI 10.1145/3772052.3772255). Direct fetch of the PDF confirmed accessibility; findings cross-verified against the Cloudflare blog post 'Why we're rethinking cache for the AI era' co-authored with the same researchers, and ppc.land coverage of the paper.
DataDome's 2025 Global Bot Security Report finds that LLM crawler traffic quadrupled across DataDome's customer base in 2025, rising from 2.6% of verified bot traffic in January to over 10.1% by August. DataDome detected nearly 1.7 billion OpenAI crawler requests in a single month. AI bot traffic targeted high-value endpoints: 64% reached forms, 23% login pages, 5% checkout flows. Only 2.8% of sites were fully protected.
Methodology note · DataDome 2025 Global Bot Security Report. Direct fetch returned HTTP 403; findings cross-verified against the official DataDome press release on businesswire.com (September 2025), the DataDome blog post 'The Web's Bot Problem Isn't Getting Better', and Yahoo Finance reporting on the same release.
TDM Reservation Protocol Community Group (W3C)
W3C TDMRep CG · 2025
The W3C TDM Reservation Protocol Community Group develops a standardised mechanism for content owners in the EU to reserve their rights against text-and-data-mining uses under Article 4 of the EU Copyright Directive. The protocol specifies machine-readable opt-out signals (TDMRep) that AI training crawlers should honour, complementing robots.txt with a copyright-specific reservation mechanism.
Methodology note · First-party W3C Community Group page. Direct fetch on w3.org/community/tdmrep returned the group's overview, charter, and links to specification documents. Standard W3C standardisation venue. The protocol is the EU-jurisdiction counterpart to the IETF AIPREF vocabulary (R99) and attachment (R169) work.
Concise and Sufficient Sub-Sentence Citations for RAG
arXiv · 2025
Proposes sub-sentence-level citations in RAG outputs, where each cited passage is matched to a specific sub-sentence in the generated answer rather than to the answer as a whole. The approach improves attribution precision and reduces over-citation, where models cite a source for an entire sentence even when only part of it is supported.
Methodology note · arXiv preprint 2509.20859 (September 2025). Direct fetch on arxiv.org returned the abstract page; method details and benchmark scores are in the full PDF. Empirical evaluation against standard RAG attribution baselines.
Giving Users Choice with Cloudflare's New Content Signals Policy
Cloudflare · 2025
Cloudflare introduced a Content Signals Policy that extends robots.txt with three new signals: search (indexing for traditional search), ai-input (use in retrieval augmented generation or AI answers), and ai-train (use for training or fine-tuning AI models). Each can be set to yes or no, or left blank. For 3.8 million domains already using Cloudflare's managed robots.txt, the company will publish search=yes, ai-train=no by default.
Methodology note · Official Cloudflare product announcement from September 24, 2025, written by Will Allen. The policy text is released under a CC0 license to encourage adoption, with a generator at ContentSignals.org. Cloudflare notes signals are preferences, not technical countermeasures, and recommends combining them with WAF and bot management rules.
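The three signals and their yes / no / blank states can be sketched as a small helper that renders one robots.txt line. The Content-Signal: field name and comma-separated layout reflect our reading of the announcement and should be checked against the policy text; the function name and the True/False/None encoding are hypothetical.

```python
from typing import Dict, Optional

# The three signals named in the entry above, in a fixed output order.
SIGNALS = ("search", "ai-input", "ai-train")


def content_signal_line(prefs: Dict[str, Optional[bool]]) -> str:
    """Render preferences as a single robots.txt line.

    True -> yes, False -> no, None or missing -> left blank (omitted).
    The "Content-Signal:" field name is an assumption, not verified here.
    """
    unknown = set(prefs) - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown signal(s): {sorted(unknown)}")

    parts = []
    for name in SIGNALS:
        value = prefs.get(name)
        if value is None:
            continue  # a blank signal expresses no preference
        parts.append(f"{name}={'yes' if value else 'no'}")
    return "Content-Signal: " + ", ".join(parts)


if __name__ == "__main__":
    # The default described for Cloudflare's managed robots.txt.
    print(content_signal_line({"search": True, "ai-train": False}))
```

As the methodology note says, a line like this only states a preference; it does not block a crawler that ignores it.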
HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
arXiv · 2025
Most retrieval benchmarks cannot tell a good chunking strategy from a bad one because the answers can be found in any reasonable split of the text. A new benchmark built on evidence-dense questions shows that chunking choices visibly change end-to-end answer quality, and that a hierarchical, multi-level chunker improves performance without paying a heavy time cost.
Methodology note · Academic paper introducing HiCBench (manually annotated multi-level chunk points plus synthesised evidence-dense question-answer pairs with traceable evidence) and the HiChunk framework: fine-tuned large language models that produce multi-level document structure, combined with an Auto-Merge retrieval algorithm. Chunking quality was tested across the full retrieval-augmented generation pipeline.
How People Use ChatGPT (NBER Working Paper 34255)
NBER / OpenAI / Harvard University · Aaron Chatterji et al. · 2025
ChatGPT reached around 700 million weekly active users by mid-2025, with roughly 18 billion messages sent per week. About 30% of conversations are work-related while 70% are personal, covering writing assistance, information seeking, and tutoring. Adoption is rising fastest in lower-income countries. (agent inferred)
Methodology note · NBER working paper by Aaron Chatterji and co-authors from OpenAI and Harvard, analysing a representative sample of ChatGPT conversations. The researchers classified messages by topic, work versus personal use, and user demographics to characterise how people actually use the assistant in 2024 and 2025.
AI Answer Engine Citation Behavior: An Empirical Analysis of the GEO-16 Framework
arXiv · 2025
Three on-page properties showed the strongest association with whether a page got cited by AI answer engines: metadata and freshness, semantic HTML markup, and structured data. Pages that scored at least 0.70 on the GEO-16 quality score and met at least 12 of 16 quality pillars were cited at substantially higher rates than pages that did not.
Methodology note · 70 product-intent prompts were run across Brave Summary, Google AI Overviews, and Perplexity, producing 1,702 citations across 1,100 unique URLs. The researchers audited each cited page against a 16-pillar framework and used logistic models with domain-clustered standard errors. The study focuses on English-language B2B SaaS pages. Published September 2025.
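The two-part threshold reported in this entry (a score of at least 0.70 and at least 12 of 16 pillars met) can be expressed as a simple check. The PageAudit record and its field names are hypothetical; the entry does not describe how the paper computes the score itself, so both values are taken as inputs here.

```python
from dataclasses import dataclass


@dataclass
class PageAudit:
    """Hypothetical audit record for one page; field names are ours."""
    url: str
    geo_score: float   # overall GEO-16 quality score, 0.0 to 1.0
    pillars_met: int   # how many of the 16 pillars the page satisfies


def clears_geo16_threshold(audit: PageAudit,
                           min_score: float = 0.70,
                           min_pillars: int = 12) -> bool:
    """True only when BOTH conditions from the study's threshold hold."""
    if not 0 <= audit.pillars_met <= 16:
        raise ValueError("pillars_met must be between 0 and 16")
    return audit.geo_score >= min_score and audit.pillars_met >= min_pillars


if __name__ == "__main__":
    page = PageAudit("https://example.com/pricing", geo_score=0.74, pillars_met=13)
    print(clears_geo16_threshold(page))
```

Clearing the threshold is an association reported for English-language B2B SaaS pages, not a guarantee of citation.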
Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents (CC-GSEO-Bench)
arXiv · 2025
Generative search engines weaken the link between ranking and visibility, so source articles need new ways to prove they shape AI answers. The benchmark scores creator influence across five dimensions: exposure (does the article surface), faithful credit (is it cited), causal impact (does it move the wording), readability and structure, and trustworthiness and safety.
Methodology note · Academic benchmark (CC-GSEO-Bench) of over 1,000 source articles and over 5,000 query-article pairs, organised one article to many queries. Seed queries come from public question-answering datasets with limited synthesised expansion; only queries whose source reappeared in a follow-up retrieval step were kept. Article-level scores aggregate query-level signals into strength, coverage, and stability of influence.
DOJ Wins Significant Remedies Against Google (US v. Google search remedies decision)
U.S. Department of Justice · 2025
The US District Court for DC barred Google from exclusive distribution contracts for Google Search, Chrome, Google Assistant, and the Gemini app. Google must share certain search index and user-interaction data with rivals and offer search and ad syndication services. The court extended remedies to generative AI products to prevent the same tactics being used to monopolise GenAI. Google holds roughly 90% of US search queries.
Methodology note · US Department of Justice press release announcing the remedies ruling in United States et al. v. Google, following a 277-page liability opinion in August 2024 and a 15-day remedies trial in May 2025. The case was joined by 49 states, two territories, and the District of Columbia.
Amazon documents Amazonbot as the web crawler used to power its services. Amazonbot identifies itself with the user-agent string 'Amazonbot' and respects robots.txt directives. The documentation page lists IP ranges, robots.txt examples, and Amazon's contact information for site owners reporting issues. Amazonbot does not currently train Alexa or general-purpose AI models per the published documentation.
Methodology note · First-party Amazon Developer documentation page for Amazonbot. Direct fetch on developer.amazon.com returned the HTML page with the canonical user-agent string, robots.txt syntax examples, and IP-range publication mechanism. Standard vendor crawler documentation; equivalent format to OpenAI and Anthropic crawler docs.
Agentic Commerce is Redefining Retail — How to Respond
Boston Consulting Group · 2025
More than half of consumers expect to use AI assistants for shopping by the end of 2025. US retail traffic from generative AI browsers and chat services grew 4,700% year over year in July 2025. These visitors are more engaged, spending 32% more time on site, browsing 10% more pages, and bouncing 27% less. By 2029, US AI search ad spend is projected to reach 26 billion dollars.
Methodology note · BCG synthesises third-party adoption data (Adobe traffic measurements, eMarketer forecasts, monday.com retailer survey) with its own analysis of AI agent behaviour. Findings are based on observed retail site analytics, advertising forecasts, and a survey of global retailers about agentic AI adoption plans.
Brave documents that its search crawler does not advertise a differentiated user agent, so that it is not discriminated against by websites that allow only Googlebot. However, if a domain or page is not crawlable by Googlebot, Brave's bot will not crawl it either. Because there is no Brave-specific user agent, robots.txt rules cannot target Brave alone; its access follows whatever a site's Googlebot directives allow.
Methodology note · First-party Brave Search documentation page. Direct fetch on search.brave.com/help confirmed the crawler-identification policy and the inherited-from-Googlebot access model. Unusual among AI/search crawlers in not advertising a differentiated user agent, making per-crawler robots.txt rules ineffective for Brave specifically.
DuckDuckGo documents DuckAssistBot as the crawler used to power DuckDuckGo's AI-assisted answer features. DuckAssistBot is related to DuckDuckGo Search but operates as a separate bot with its own user-agent string and IP ranges. The help page lists how site owners can identify, allow, or block DuckAssistBot via robots.txt directives and clarifies that DuckAssistBot fetches pages on demand rather than for AI training.
Methodology note · First-party DuckDuckGo Help Pages article. Direct fetch on duckduckgo.com confirmed the bot identification, robots.txt mechanism, and on-demand fetch behaviour. Standard vendor crawler documentation; DuckDuckGo's AI features (DuckAssist) are powered by partner LLMs rather than DuckDuckGo's own training.
Google Search Quality Rater Guidelines (Jan 2025 + Sep 2025 revisions)
Google · 2025
Google relies on around 16,000 external Search Quality Raters across 80-plus languages to evaluate search results against published guidelines. Raters never decide rankings directly; they assess Page Quality (using the E-E-A-T framework of Experience, Expertise, Authoritativeness, Trust) and Needs Met. Standards are highest for Your Money or Your Life topics like health, finance and safety, where low-quality pages can cause real harm.
Methodology note · Official Google overview of its Search Quality Rater programme, dated November 2023 and published as a PDF on services.google.com. The document explains how raters are recruited and trained, the two rating tasks (Page Quality and Needs Met), the E-E-A-T criteria, the special treatment of YMYL topics, and how aggregate ratings feed back into search algorithm changes.
Kagi documents KagiBot as the web crawler for the Kagi search engine. KagiBot identifies as 'Mozilla/5.0 (compatible; Kagibot/1.0; +https://kagi.com/bot)' and originates from four declared IP addresses with reverse-DNS confirmations at kagibot.org. Standard robots.txt directives targeting Kagibot are respected. Kagi is a paid search engine that does not train AI models on crawled content.
Methodology note · First-party Kagi documentation page for KagiBot. Direct fetch on kagi.com/bot returned the HTML page with the canonical user-agent string, exact IP addresses, reverse-DNS records, and robots.txt compliance statement. Standard vendor crawler documentation.
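As a minimal sketch of how a site owner might screen log entries against the identifiers described above: the helper below checks for the documented Kagibot user-agent token and, when a reverse-DNS hostname is available, that it falls under kagibot.org. The function name and the example hostnames are hypothetical. A fuller check would also compare the source IP with Kagi's declared addresses, which are not reproduced here.

```python
# Hypothetical helper: screens a request against KagiBot's documented identifiers.
KAGIBOT_UA = "Mozilla/5.0 (compatible; Kagibot/1.0; +https://kagi.com/bot)"
RDNS_SUFFIX = "kagibot.org"


def looks_like_kagibot(user_agent, rdns_host=None):
    """Return True if the UA carries the Kagibot token and, when a reverse-DNS
    hostname is supplied, that hostname is kagibot.org or a subdomain of it."""
    if "kagibot/" not in user_agent.lower():
        return False
    if rdns_host is None:
        # UA-only match: weak evidence, since user agents are easy to spoof.
        return True
    host = rdns_host.rstrip(".").lower()
    return host == RDNS_SUFFIX or host.endswith("." + RDNS_SUFFIX)


print(looks_like_kagibot(KAGIBOT_UA, "crawler-1.kagibot.org"))  # hostname is made up
```

The suffix test requires a leading dot so that a look-alike domain merely ending in "kagibot.org" does not pass.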
Data Provenance Initiative (MIT Media Lab)
MIT Media Lab · Shayne Longpre · 2025
The Data Provenance Initiative at MIT Media Lab audited the provenance, licensing, and attribution of over 1,800 text dataset collections used to train large language models. The audit found that more than 70% of datasets had 'unspecified' licenses; correcting the licensing reduced this to around 30%. The corrected licenses were often more restrictive than those originally assigned by repositories.
Methodology note · MIT Media Lab project page for the Data Provenance Initiative (PI Sandy Pentland and team). Direct fetch on media.mit.edu returned the project overview. Findings reported in the peer-reviewed paper 'A Large-Scale Audit of Dataset Licensing & Attribution in AI' (arXiv:2310.16787) and on MIT News (August 2024).
Meta Web Crawlers (Meta-ExternalAgent, FacebookBot)
Meta · 2025
Meta documents multiple crawlers and their user agents: meta-externalagent for AI training and product improvement, meta-externalfetcher for user-initiated content fetches by Meta AI features, and the older facebookexternalhit for generating preview cards when links are shared. Publishers can use robots.txt to control meta-externalagent and meta-externalfetcher; facebookexternalhit follows different rules because it acts on a user's direct request to share a link.
Methodology note · Official Meta for Developers documentation page. The page is the canonical reference for the user-agent strings, supported robots.txt directives, and the purposes Meta declares for each crawler.
Mistral AI documents its crawler identification and robots.txt compliance policy. The documentation lists user-agent strings used by Mistral's crawlers (for web indexing, real-time retrieval, and model training), specifies how site owners can target each via robots.txt, and states Mistral's commitment to honour disallow directives. Standard format matching other major AI vendor crawler documentation.
Methodology note · First-party Mistral AI documentation page (docs.mistral.ai/robots). Direct fetch returned the HTML page. Standard vendor crawler documentation; equivalent to OpenAI, Anthropic, and Google's crawler docs in scope and format.
How people are using ChatGPT (OpenAI summary page)
OpenAI · 2025
Publicly accessible summary of OpenAI's ChatGPT usage research. Describes the Asking/Doing/Expressing classification (49%/40%/11%) and the dominant consumer use cases: practical guidance, information seeking, and writing assistance. Useful as a citable first-party summary of the underlying research, but, like the underlying paper, it provides aggregate categories rather than a browsable feed of real user prompts.
Methodology note · Public-facing OpenAI summary page accompanying the 'How People Use ChatGPT' research paper. Provides accessible explanations of the Asking/Doing/Expressing taxonomy and the reported category shares (~49% / 40% / 11%). Content verified by fetch on 2026-05-27. No methodology beyond what is disclosed in the underlying paper (R192) and the NBER working paper (R52).
How People Use ChatGPT (OpenAI research paper PDF)
OpenAI · 2025
OpenAI's research on ChatGPT usage classifies consumer conversations into three categories: roughly 49% 'Asking' (information seeking), 40% 'Doing' (practical task assistance), and 11% 'Expressing' (writing and creative work). This provides a defensible intent taxonomy for classifying prompt hypotheses — but it is an aggregate breakdown, not a browsable dump of user prompts.
Methodology note · OpenAI research paper published as a downloadable PDF, with co-authorship by external economists (David Deming, Christopher T. Stanton and colleagues). The paper presents an aggregate analysis of anonymised ChatGPT consumer usage and classifies conversations into Asking, Doing and Expressing categories. No individual prompts are published; methodology and definitions of each category are disclosed in full inside the PDF.
The Crawl-to-Click Gap: Cloudflare Data on AI Bots, Training, and Referrals
Cloudflare · 2025
AI crawlers read content far more than they send referrals back. Anthropic's ClaudeBot crawled around 70,000 pages for every visitor it referred; OpenAI's GPTBot crawled around 1,700 for every visitor; Perplexity around 5 for every visitor. Mistral was the only major AI engine where referrals outweighed crawl volume.
Methodology note · Aggregate analysis of crawl requests and referral traffic across the Cloudflare network. For each major AI crawler, the team divided pages crawled by visits sent to the same destinations during the same window, producing a crawl-to-refer ratio. Published August 2025.
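The crawl-to-refer ratio described in this note is a simple division, which a brand team could reproduce on its own server logs. The sketch below assumes you already have two counts per crawler for the same time window; the function name and the sample counts are illustrative, not Cloudflare's data or code.

```python
def crawl_to_refer_ratio(pages_crawled, referrals):
    """Pages crawled per referred visit over the same window."""
    if referrals <= 0:
        raise ValueError("ratio is undefined when no visits were referred")
    return pages_crawled / referrals


# Illustrative counts only (hypothetical log totals for one site).
sample_counts = {
    "crawler-a": (140_000, 2),
    "crawler-b": (3_400, 2),
}

for name, (crawled, referred) in sample_counts.items():
    print(name, round(crawl_to_refer_ratio(crawled, referred)))
```

A crawler with zero referrals has no finite ratio, so the sketch raises an error rather than reporting a misleading number.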
A Deeper Look at AI Crawlers: Breaking Down Traffic by Purpose and Industry
Cloudflare · 2025
AI crawler traffic is concentrated in news, technology, finance, and retail. The largest category is training crawling, but real-time user-action crawling (where an AI assistant fetches a page during a user conversation) is the fastest-growing segment. Different AI engines crawl with different mixes of purpose, which has direct implications for which crawlers a brand should allow.
Methodology note · Aggregate analysis of crawl traffic across the Cloudflare network, segmented by destination industry and by crawler purpose (training, search index, user-action fetch). Published August 2025.
Generative AI-Powered Shopping Rises with Traffic to U.S. Retail Sites (Adobe Analytics)
Adobe Digital Insights · 2025
Visitors arriving at US retail sites from generative AI sources show measurably higher engagement than visitors from other channels: 8% higher time on site, 12% more pages per visit, and 23% lower bounce rate. AI-driven retail traffic grew sharply through 2024 and 2025, though it remains a small share of total visits.
Methodology note · Aggregate analysis of Adobe Analytics data covering trillions of visits to US retail websites. Adobe compared engagement metrics for visitors arriving from generative AI assistants against visitors from other referral channels. Published August 2025.
Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with LLMs
arXiv · 2025
Surveys the field of evidence-based text generation with LLMs, organising work into three categories: attribution (linking generated text to evidence), citation (formatting and presenting that evidence to users), and quotation (verbatim grounding). Identifies four open problems including attribution granularity, retrieval-attribution coupling, and evaluation of partial attribution.
Methodology note · arXiv preprint 2508.15396 (August 2025). Survey paper covering evidence-based text generation with LLMs. Direct fetch on arxiv.org returned the abstract page; full taxonomy and reference list are in the PDF.
New Fastly Threat Research: AI Crawlers Are Almost 80% of AI Bot Traffic
Fastly · 2025
From April to July 2025, Fastly observed that AI crawlers made up almost 80% of AI bot traffic on its network (the other 20% being AI fetcher bots). Meta's AI crawlers alone generated 52% of crawler traffic, more than Google (23%) and OpenAI (20%) combined. Nearly 90% of North American AI bot traffic came from crawlers, versus 41% in Europe. Fetcher bot bursts reached 39,000 requests per minute.
Methodology note · Fastly blog post by Threat Insights team, Q2 2025. Direct fetch returned the HTML article. Tier A: enterprise-scale analytics from Fastly's network covering 6.5 trillion monthly requests across 130,000+ apps. Methodology disclosed at network-flow level; vendor is the data owner.
Role-Augmented Intent-Driven Generative Search Engine Optimization
arXiv · 2025
Generative search engines reward content that anticipates the different roles a user might be playing when they ask a question. Rewriting a page through several informational personas, then refining it, produced larger gains in both subjective impression and measured presence inside generative answers than approaches that optimise on a single axis.
Methodology note · Academic paper introducing Role-Augmented Intent-Driven G-SEO, which models search intent through reflective refinement across multiple informational roles. The authors extended an existing GEO dataset with diversified query variations and introduced G-Eval 2.0, a six-level large-language-model-augmented rubric for finer-grained, human-aligned scoring of optimisation outputs.
AI Should Be More Human, Not More Complex: A Large-Scale Study on User Preferences for Concise, Source-Backed AI Responses
arXiv · 2025
Users prefer concise, source-attributed answers over verbose explanations from AI search. Longer, more lexically complex responses produced an uncanny-valley effect: systems sounded authoritative but lacked critical thinking, lowering trust and raising cognitive load. The pattern challenges the assumption that more elaborate AI output equals better output.
Methodology note · arXiv preprint 2508.04713. Direct fetch on arxiv.org returned the HTML preprint with the full paper structure including methodology, AI systems evaluated, and detailed response analysis. Authored by Carlo Esposito.
Perplexity is Using Stealth, Undeclared Crawlers to Evade Website No-Crawl Directives
Cloudflare · 2025
Cloudflare observed Perplexity using undeclared crawlers to fetch content from sites that had blocked its known bots in robots.txt and WAF rules. When PerplexityBot was blocked, traffic rotated through a generic Chrome-on-macOS user agent, undisclosed IP ranges, and multiple ASNs, reaching tens of thousands of domains with millions of requests a day. Cloudflare de-listed Perplexity from its verified bots list and added detection rules.
Methodology note · Cloudflare blog post from August 4, 2025, by Gabriel Corral, Vaibhav Singhal, Brian Mitchell, and Reid Tatoris. The team set up brand new test domains that had never been indexed, applied robots.txt and WAF blocks, then queried Perplexity about content on those domains. Detection used machine learning plus network signals.
Bing identifies several active crawlers. Bingbot is the main web crawler for Bing search. AdIdxBot crawls pages for Bing Ads. BingPreview generates page snapshots. MicrosoftPreview supports preview cards. Copilot uses Bingbot data for grounding rather than a separate crawler. Publishers can issue user-agent specific rules in robots.txt, and Bing respects standard directives and meta robots tags for indexing control.
Methodology note · Official Bing Webmaster Tools help documentation page that lists Bing's crawlers, their purposes, and their user-agent strings. The page is the canonical reference for site owners configuring crawler rules for Microsoft search and Copilot.
The Attribution Crisis in LLM Search Results: Estimating Ecosystem Exploitation
O'Reilly Media · Strauss et al. · 2025
Argues that the citation behaviour of LLM-based search constitutes an attribution crisis: cited sources are systematically under-credited (fewer click-throughs than equivalent SERP positions), over-extracted (more content reproduced verbatim or near-verbatim), and concentrated on a small subset of high-authority publishers. Quantifies the ecosystem-level economic impact on publishers.
Methodology note · arXiv preprint 2508.00838 (August 2025). Direct fetch on arxiv.org returned the abstract page. The paper combines empirical citation analysis with economic modelling to estimate ecosystem-level effects on publisher revenue and proposes attribution reforms.
Google Users Are Less Likely to Click on Links When an AI Summary Appears in Search Results
Pew Research Center · 2025
When a Google search result page includes an AI summary, users click on a traditional link in roughly 8% of visits. On result pages without an AI summary, they click in roughly 15% of visits. Users rarely click on the citations inside the AI summary itself, doing so on about 1% of visits.
Methodology note · Pew Research panel study covering 900 US adults and 68,879 Google searches conducted between March and May 2025. Sessions were tracked through opt-in browser participation; click behaviour was observed directly rather than self-reported. Published July 2025.
Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented LLMs
University of Amsterdam et al. · Abolghasemi et al. · 2025
Identifies attribution bias in generator-aware RAG: when an LLM is told which documents it can cite, the model preferentially attributes claims to documents that align with its own pre-existing beliefs and ignores sources that contradict its output. The bias is measurable and persists across model families.
Methodology note · arXiv preprint 2410.12380 (October 2024). Direct fetch returned the abstract page. The paper develops a controlled experimental setup and runs it across several LLM families; the bias is reported as statistically significant on standard QA tasks.
Correctness is not Faithfulness in RAG Attributions
SIGIR/ICTIR 2025 · Wallat · 2025
Shows that a RAG system's answers can be factually correct while its citations are unfaithful, meaning the cited passages do not actually support the generated claim. Across standard benchmarks, correctness and faithfulness diverge measurably, implying that citation-quality evaluation must be a separate metric from answer accuracy in any AI visibility tracking system.
Methodology note · arXiv preprint 2412.18004 (December 2024). Empirical study testing whether RAG answers and their cited evidence are mutually consistent. Direct fetch on arxiv.org confirmed the abstract; the full evaluation uses public attribution datasets and human annotation for faithfulness scoring.
How Users Interact with Generative Information Retrieval Systems: A Study of User Behavior and Search Experience
Beijing Institute of Technology · Liang et al. · 2025
Generative information-retrieval systems that return a written, cited answer instead of a ranked list of links can reduce a searcher's effort and improve their experience without lowering perceived credibility. The comparison covers conversational answer interfaces against traditional ranked-list search, suggesting brands should expect users to do less link-clicking and rely more on what the answer itself says.
Methodology note · SIGIR 2025 user study using Bing Chat as the generative system and Bing as the traditional baseline. Participants completed three task types on each system while the researchers logged behaviour such as clicks and query reformulation, alongside explicit ratings of satisfaction, credibility, and perceived success. The two conditions were compared head to head.
Code of Practice for General-Purpose AI Models (Copyright Chapter)
European Commission / EU AI Office · 2025
The EU General-Purpose AI Code of Practice provides a voluntary route for AI model providers to demonstrate compliance with the AI Act's obligations on copyright, transparency, and safety. The Copyright Chapter requires signatories to honour machine-readable opt-out signals such as robots.txt and TDM reservations, to publish a summary of training data, and to put a complaint mechanism in place for rightsholders.
Methodology note · Official European Commission policy page hosting the Code of Practice for General-Purpose AI Models, developed by independent experts under the EU AI Act process and published in 2025. The Code covers Safety and Security, Transparency, and Copyright chapters, and is signed by major AI providers as a way to show compliance with the AI Act.
Introducing Comet: Browse at the Speed of Thought
Perplexity AI · 2025
Perplexity launched Comet, an AI-native web browser built around Perplexity's answer engine. Comet replaces the search bar with a conversational assistant, summarises pages, answers questions about open tabs, and can execute agentic tasks across the web on behalf of the user.
Methodology note · First-party product launch announcement from Perplexity. The post positions Comet as a Chromium-based browser with Perplexity's assistant available across every tab, supporting research workflows, product comparisons, and multi-step actions. Initial access was offered to Perplexity Max subscribers.
SIGIR 2025 LiveRAG Challenge Report summarises a community competition where teams built end-to-end live RAG systems evaluated on real-time queries. Reports best-performing strategies, common failure modes, and lessons learned. Notable findings include the dominance of hybrid sparse-dense retrieval and the difficulty of evaluating live RAG without ground-truth answers.
Methodology note · arXiv preprint 2507.04942 (July 2025). Direct fetch returned the abstract page. Multi-team challenge report from SIGIR 2025; methodology, evaluation protocol, and team submissions are documented in the PDF.
News Source Citing Patterns in AI Search Systems
arXiv (cs.IR) · Kai-Cheng Yang · 2025
AI search systems concentrate news citations in a small set of outlets, and the cited mix leans politically liberal. Low-credibility sources are rarely cited. News makes up only 9% of the more than 366,000 citations studied, so brands depending on press coverage for AI visibility face a narrow set of gatekeeper publishers. The political leaning and quality of cited sources had limited influence on user satisfaction.
Methodology note · Academic analysis of the AI Search Arena platform, covering more than 24,000 conversations and 65,000 responses from search systems by OpenAI, Perplexity, and Google. The study extracted over 366,000 citations, isolated those referencing news, and correlated source-level attributes (political leaning, credibility ratings) with user preference data from head-to-head model comparisons.
Beyond SEO: A Transformer-Based Approach for Reinventing Web Content Optimisation
arXiv · 2025
Rewriting web copy to add credible citations, statistical evidence, and cleaner phrasing measurably increases how much of that copy gets reproduced in AI answers. Optimised travel pages saw a 15.63% rise in absolute word count surfaced inside generative responses and a 30.96% rise on a position-weighted version of the same metric, at small computational cost.
Methodology note · The team fine-tuned a BART-base transformer on 1,905 paired travel-website passages, each pairing raw copy with a generative-engine-optimised rewrite. Quality was scored with ROUGE-L and BLEU against the optimised targets; visibility was tested by feeding both versions to Llama-3.3-70B and counting how much of each rewrite appeared in the model's responses.
Cloudflare launched Pay Per Crawl in private beta on July 1, 2025, letting publishers charge AI crawlers per request. The system uses HTTP status code 402 Payment Required: a crawler either includes a crawler-max-price header to pre-agree to pay, or gets a 402 with the price and can retry with crawler-exact-price. Cloudflare acts as the merchant of record, identifies crawlers via Web Bot Auth signed requests, and aggregates payments.
Methodology note · Official Cloudflare product announcement by Will Allen and Simon Newton, July 1, 2025. The post documents the headers, the publisher controls (allow, charge, block), and the integration with existing WAF and bot management rules. Crawler authentication uses Ed25519 key pairs and HTTP Message Signatures as defined by RFC 9421.
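The request flow described above can be sketched as a small decision function. This is an illustration under stated assumptions, not Cloudflare's implementation: the request header names crawler-max-price and crawler-exact-price come from the entry above, while the response header names (crawler-price, crawler-charged), the plain-decimal value format, and the function name are assumptions made for the example. Signature verification via Web Bot Auth is omitted.

```python
from decimal import Decimal


def pay_per_crawl_response(request_headers, price):
    """Return (status_code, response_headers) for one crawler request.

    price is the publisher's per-request price as a Decimal.
    Header values are assumed to be plain decimal strings.
    """
    headers = {k.lower(): v for k, v in request_headers.items()}
    exact = headers.get("crawler-exact-price")
    max_price = headers.get("crawler-max-price")

    # Retry path: the crawler restates the exact price it was quoted.
    if exact is not None and Decimal(exact) == price:
        return 200, {"crawler-charged": str(price)}
    # Pre-agreed path: the crawler's ceiling covers the publisher's price.
    if max_price is not None and Decimal(max_price) >= price:
        return 200, {"crawler-charged": str(price)}
    # Otherwise quote the price with 402 Payment Required.
    return 402, {"crawler-price": str(price)}


print(pay_per_crawl_response({}, Decimal("0.01")))
```

Decimal is used instead of float so that price comparisons are exact for small currency amounts.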
From Googlebot to GPTBot: Who's Crawling Your Site in 2025
Cloudflare · 2025
Across Cloudflare's network, search and AI crawler traffic rose 18% from May 2024 to May 2025. Googlebot grew 96% in raw requests and now accounts for 50% of crawler traffic. GPTBot rose 305% in requests, with its share climbing from 2.2% to 7.7%. ChatGPT-User requests jumped 2,825%, and PerplexityBot grew 157,490% off a tiny base. About 14% of top domains now use robots.txt rules targeting AI bots specifically.
Methodology note · Cloudflare Radar analysis published July 2025, comparing crawler activity in May 2024 against May 2025 across a fixed cohort of customers to remove growth bias. The team matches user-agent tokens against an open-source list of AI crawlers and analyses robots.txt files on 3,816 of the top 10,000 domains. Methodology and limits are documented in the post.
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA
THUDM · Jiajie Zhang · 2025
LongCite enables LLMs to generate fine-grained citations in long-context QA by training the model to attribute each statement in its answer to a specific span in the retrieved long document. The method substantially improves citation precision over post-hoc citation generation and outperforms baselines on the released LongCite benchmark.
Methodology note · arXiv preprint 2409.02897 (September 2024). Direct fetch returned the abstract page. The paper releases a training pipeline and benchmark dataset; empirical comparison against post-hoc citation baselines is reported in the PDF.
Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
Salesforce AI Research et al. · 2025
Argues that current AI search engines' promise of factual, verifiable, source-cited responses is partly illusory. Empirical analysis across major systems shows frequent unsupported claims even when citations are present, weak ranking of citations by relevance, and citation patterns that systematically advantage well-resourced incumbents. Calls for new evaluation standards before mass deployment.
Methodology note · arXiv preprint 2410.22349 (October 2024). Direct fetch returned the abstract page. The paper provides both critique and empirical evaluation across several commercial AI search systems, with methodology and per-system results in the PDF.
Chunk Twice, Embed Once: Systematic Study of Segmentation and Representation Trade-offs
arXiv · 2025
How a page is split into chunks matters as much for retrieval as which model embeds it. Simple recursive token chunking around 100 tokens with no overlap (R100-0) consistently beat more elaborate strategies. Retrieval-tuned embedding models such as Nomic and Intfloat E5 outperformed domain-specialised ones like SciBERT, suggesting embedding choice and chunk size are the high-leverage levers.
Methodology note · Systematic evaluation in a chemistry retrieval setting: 25 chunking configurations across five method families combined with 48 embedding models, tested on three chemistry retrieval benchmarks including the authors' new QuestChemRetrieval dataset. Datasets, code, and benchmark results were released publicly.
Apple uses one crawler, Applebot, to gather data that powers Spotlight, Siri, and Safari search. A separate user agent, Applebot-Extended, governs whether content is used to train Apple's foundation models, including Apple Intelligence. Sites can disallow Applebot-Extended in robots.txt to opt out of generative AI training while keeping content discoverable in Apple search. Applebot is identified via reverse DNS at applebot.apple.com or a published IP CIDR list.
Methodology note · Official Apple support documentation about Applebot, last updated April 25, 2025. The page details user-agent strings, robots.txt behavior, supported meta directives such as noindex, nosnippet, nofollow, and how Applebot-Extended works as a secondary control specifically for generative AI training.
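A minimal sketch of the opt-out described above, using Python's standard-library robots.txt parser as a stand-in for Apple's own rule matching (which may differ in detail). The robots.txt content and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of generative AI training only.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"
training_allowed = parser.can_fetch("Applebot-Extended", url)
search_allowed = parser.can_fetch("Applebot", url)

print("training use allowed:", training_allowed)
print("search crawling allowed:", search_allowed)
```

Because no rule group names Applebot itself, the parser falls back to allowing it, which mirrors the documented behaviour of keeping content discoverable in Apple search while withholding it from training.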
Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts
arXiv · 2025
Metacognitive prompts that explicitly ask the AI to evaluate its own reasoning before answering reduce errors in generative search responses. Users supplied with such prompts also show improved critical evaluation of AI-generated answers compared with users supplied with standard prompts. The intervention is light-touch and does not require changes to the underlying model.
Methodology note · arXiv preprint 2505.24014 (May 2025). User study examining how prompting strategies affect both AI output quality and human evaluation behaviour in generative search. Direct fetch returned the abstract page; the empirical sample size and significance levels are reported in the underlying PDF.
Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis
arXiv (cs.IR) · Sinchana Ramakanth Bhat et al. · 2025
Chunk size in retrieval-augmented generation has a large effect on retrieval quality. Smaller chunks of 64 to 128 tokens are optimal when answers are short and fact-based. Larger chunks of 512 to 1024 tokens work better when broader context is needed. Embedding models react differently: Stella benefits from larger chunks for long-range retrieval, while Snowflake performs better with smaller chunks for entity-level matching.
Methodology note · arXiv preprint 2505.21700 (May 2025) by Bhat, Rudat, Spiekermann and Flores-Herr. The authors systematically test fixed-size chunking from 64 to 1024 tokens across multiple embedding models and both short-form and long-form datasets, measuring retrieval performance across configurations. Results highlight the interaction between chunk size, embedding model and dataset characteristics.
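For readers unfamiliar with fixed-size chunking, the sketch below shows the basic idea. It is not the authors' code: whitespace-separated words stand in for model tokens (a real pipeline would use the embedding model's tokenizer), chunks do not overlap, and the function name is hypothetical.

```python
def chunk_tokens(text, size=128):
    """Split text into consecutive, non-overlapping chunks of at most
    `size` whitespace-separated tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + size])
        for start in range(0, len(tokens), size)
    ]


# Example: a 300-word page split at two of the sizes in the tested range.
page = " ".join(f"w{i}" for i in range(300))
small_chunks = chunk_tokens(page, size=64)
large_chunks = chunk_tokens(page, size=512)
print(len(small_chunks), len(large_chunks))
```

Smaller sizes produce more, narrower chunks suited to short factual lookups; larger sizes keep more surrounding context in each chunk, which is the trade-off the paper measures.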
Scrapers Selectively Respect robots.txt Directives
arXiv · 2025
Audit of major AI scrapers' compliance with robots.txt directives finds that compliance is selective rather than uniform: scrapers honour blocks on some user agents and not others, even within the same vendor, and compliance changes over time as crawlers are updated. The paper provides specific evidence of non-compliance events with named vendors and dates.
Methodology note · arXiv preprint 2505.21733 (May 2025). Direct fetch on arxiv.org returned the abstract page. The paper documents specific non-compliance events with named scrapers using server-log evidence and timestamped robots.txt snapshots.
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
arXiv (cs.CL) / EMNLP 2025 · Gili Lior et al. · 2025
ReliableEval proposes a method-of-moments recipe for stochastic LLM evaluation that explicitly accounts for run-to-run variance in model outputs. Across standard benchmarks, the method produces tighter confidence intervals than naive averaging and reveals that some headline LLM performance comparisons are within noise margins. Released as an open evaluation toolkit.
Methodology note · arXiv preprint 2505.22169 (May 2025). Direct fetch returned the abstract page. The paper derives the method-of-moments estimator, tests it against several common evaluation tasks, and releases the toolkit for community use.
NLWeb — Bringing Conversational Interfaces Directly to the Web
Microsoft · R.V. Guha · 2025
Microsoft launched NLWeb on May 19, 2025, an open project that lets any website expose its content as a natural-language interface using Schema.org, RSS, and other structured data the site already publishes. Every NLWeb instance is also a Model Context Protocol server, making the site discoverable to AI agents. Initial adopters include Shopify, Tripadvisor, Eventbrite, O'Reilly, Hearst, and Chicago Public Media.
Methodology note · Official Microsoft announcement, published on the Microsoft Source corporate blog. NLWeb was conceived by R.V. Guha, the creator of RSS, RDF, and Schema.org, who joined Microsoft as Corporate Vice President and Technical Fellow. The project is open source and technology agnostic, with code and documentation on GitHub.
NExT-Search: Rebuilding User Feedback Ecosystem for Generative AI Search
NUS / Renmin University · Sunhao Dai et al. · 2025
Argues that generative AI search has broken the feedback loop that traditionally improved ranking. Traditional web search collects fine-grained user feedback (clicks, dwell time) at the document level; generative AI search receives only coarse-grained feedback on the final answer, even though the pipeline spans query decomposition, retrieval, and generation. Proposes NExT-Search to reintroduce process-level feedback.
Methodology note · SIGIR 2025 perspective paper (arXiv:2505.14680) by Dai, Wang, Pang, Xu, Ng, Wen, and Chua. Direct fetch on arxiv.org returned the abstract page; the paper proposes a two-mode feedback architecture (User Debug Mode and Shadow User Mode) without claiming empirical validation. Forward-looking perspective rather than experimental study.
C2PA Technical Specification v2.2 (ISO/IEC 22144)
Coalition for Content Provenance and Authenticity (C2PA) · 2025
C2PA Technical Specification v2.2 defines a standard for cryptographically signed content credentials. The specification was published in 2024 as ISO/IEC 22144, enabling images, video, and other media to carry tamper-evident metadata about origin, edits, and AI involvement. Adopters include Adobe, Microsoft, the BBC, OpenAI, Sony, and Leica, with rollout in cameras, generative tools, and publisher workflows.
Methodology note · Official specification from the Coalition for Content Provenance and Authenticity, a Joint Development Foundation project. Version 2.2 is the latest at time of publication and corresponds to the formally adopted international standard ISO/IEC 22144. Steering committee members include Adobe, Microsoft, Intel, Google, the BBC, OpenAI, and Sony. The full text is openly published with conformance and test suites.
Copyright and AI Part 3: Generative AI Training (pre-publication report)
U.S. Copyright Office · 2025
The US Copyright Office concludes that generative AI training raises copyright questions at several points: data collection, model training, retrieval-augmented generation, and outputs. Fair use is fact-specific and depends on transformativeness, commerciality, the amount used, and effects on the market for the original work, including market dilution and lost licensing. The Office recommends voluntary licensing markets rather than compulsory licensing schemes.
Methodology note · Pre-publication version of Part 3 of the Copyright Office's Report on Copyright and Artificial Intelligence, released May 2025 by the Register of Copyrights. The report draws on more than 10,000 comments submitted in response to a 2023 Notice of Inquiry, plus existing case law and international approaches. Sections cover technical background, prima facie infringement, fair use, and licensing options.
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
Google Research · 2025
Introduces the concept of sufficient context as a new evaluation lens for RAG systems. A retrieval is sufficient if it contains enough evidence to answer the query correctly; insufficient retrievals lead to hallucinated answers even when the model has the right reasoning ability. Provides metrics and empirical analysis across standard RAG benchmarks.
Methodology note · arXiv preprint 2411.06037 (November 2024). Direct fetch returned the abstract page. The paper introduces a sufficient-context metric, validates it against human annotations of RAG outputs, and reports correlations with downstream answer quality.
Human Trust in AI Search: A Large-Scale Experiment
arXiv · 2025
Across ~12,000 queries in seven countries and a preregistered randomised experiment on a US sample, participants trusted GenAI search less than traditional search on average, but adding reference links and citations to GenAI answers significantly increased trust, even when those citations were incorrect or hallucinated. Uncertainty highlighting reduced trust whether confidence was high or low.
Methodology note · arXiv preprint 2504.06435 (April 2025) by Haiwen Li and Sinan Aral (MIT Sloan). Preregistered randomised experiment on a US-representative panel, paired with a 12,000-query, 80,000-result global exposure measurement across seven countries. 23 pages, six figures. Direct fetch on arxiv.org confirmed authorship and methodology.
NYT v. OpenAI Motion to Dismiss Opinion (April 4 2025)
U.S. District Court SDNY · 2025
Judge Sidney Stein denied OpenAI's and Microsoft's motions to dismiss the core contributory copyright infringement and most DMCA claims brought by The New York Times, the Daily News, and the Center for Investigative Reporting. The court allowed the publishers' direct infringement and trademark dilution claims to proceed. It dismissed common law misappropriation and certain DMCA 1202 sub-claims without prejudice. The case moves to discovery.
Methodology note · Memorandum opinion and order from Judge Sidney H. Stein of the US District Court for the Southern District of New York, dated April 4, 2025, in the consolidated actions including 23-cv-11195 (Times v. OpenAI). The court ruled on Rule 12(b)(6) motions to dismiss, accepting the plaintiffs' allegations as true at this stage. The full opinion is published on the court website.
Imperva's 2025 Bad Bot Report finds that automated traffic overtook human activity for the first time in a decade, reaching 51% of all internet traffic in 2024. Bad bots specifically account for 37% of internet traffic, up from 32% the year prior. 44% of advanced bot traffic targeted APIs. AI tools are lowering the barrier to entry for attackers: simple, high-volume attacks now make up 45% of all bot attacks.
Methodology note · Imperva (Thales) 2025 Bad Bot Report. Direct fetch of the resource library landing page confirmed accessibility but the full report PDF requires a form submission. Headline statistics cross-verified against Imperva's official blog post 'AI Bots Overtake the Web' and Thales Group press release dated April 2025.
How Crawlers Impact the Operations of the Wikimedia Projects
Wikimedia Foundation · 2025
Bandwidth used by automated crawlers on Wikimedia Commons grew 50% between January 2024 and early 2025, driven primarily by AI training scrapers fetching multimedia at scale. Wikimedia Foundation engineers found that 65% of the most expensive requests on the projects came from bots, even though bots accounted for only about 35% of pageviews, because crawlers hit pages that miss the site's caching layer.
Methodology note · Diff blog post from the Wikimedia Foundation, dated April 1, 2025, written by the Site Reliability Engineering team. The analysis draws on Wikimedia's own traffic logs across Commons and other projects. Reporting in Ars Technica and other outlets reproduced the 50% bandwidth figure and the 65% expensive-request statistic.
AI Labyrinth is an opt-in Cloudflare feature that serves AI-generated decoy pages to bots that ignore no-crawl directives. A misbehaving crawler follows hidden links into a maze of plausibly written but irrelevant content, wasting compute and exposing itself as a bot. Pages are pre-generated with Workers AI, sanitised against XSS, and stored in R2. AI crawlers send more than 50 billion requests a day across Cloudflare.
Methodology note · Cloudflare announcement from March 19, 2025, by Reid Tatoris, Harsh Saxena, and Luis Miglietti. The feature is available to all customers including the free plan, enabled with a single dashboard toggle. Decoy pages carry noindex meta tags to protect SEO and remain invisible to human visitors and verified crawlers.
Adobe Analytics: Traffic to U.S. Retail Websites from Generative AI Sources — Holiday 2024 / January 2025 Update
Adobe Digital Insights · 2025
Visits to US retail websites originating from generative AI sources grew by roughly 1,200% between July 2024 and February 2025. AI-sourced visitors browsed 12% more pages per session and bounced 23% less than visitors from other channels. AI referrals still represented a small share of total retail traffic, but the per-visit engagement quality was meaningfully higher.
Methodology note · Aggregate analysis of Adobe Analytics data covering visits to US retail websites during the 2024 holiday season and January 2025. Adobe compared engagement metrics and visit counts for sessions originating from generative AI sources against sessions from other referral channels. Published March 2025.
AI Search Has a Citation Problem (Tow Center Report)
Tow Center for Digital Journalism, Columbia · 2025
Across eight AI search engines tested, more than 60% of news-attribution queries received incorrect answers. Perplexity got 37% wrong; Grok 3 got 94% wrong. Premium paid models were no more accurate than free ones, and often produced confidently incorrect answers without flagging uncertainty. Several engines retrieved content from publishers that had explicitly blocked their crawlers.
Methodology note · 1,600 queries were run across ChatGPT Search, Perplexity, Perplexity Pro, DeepSeek Search, Microsoft Copilot, Grok-2, Grok-3, and Google Gemini. The researchers selected 10 articles from each of 20 publishers, used direct excerpts as queries, and asked each chatbot to identify the headline, publisher, publication date, and URL. Responses were manually graded against six categories.
OpenAI's web-search API documentation states that the web_search_call output item will usually (but not always) include the queries that were searched, and that the sources field can reveal all URLs consulted during the search run. This is first-party proof that some query rewrites can be observed for requests under the caller's control, but the 'usually but not always' caveat means observed fan-out is partial rather than exhaustive.
Methodology note · Official OpenAI developer documentation for the web search tool exposed via the Responses API. Describes the schema of the web_search_call output item, including which fields are populated and the explicit caveat that searched queries are returned 'usually (but not always).' Content verified by fetch on 2026-05-27. No aggregate usage data is disclosed.
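Because queries are returned only "usually", any code that records observed fan-out has to tolerate items that carry no query. The following is a minimal sketch under stated assumptions: the `action`, `query`, and `queries` field names and the sample payload are illustrative guesses at the item shape, not a verified copy of OpenAI's schema, and no API call is made.

```python
from typing import Any


def observed_queries(output_items: list[dict[str, Any]]) -> list[str]:
    """Collect search queries from web_search_call items, in first-seen order.

    Items that expose no query are skipped, so the result is a lower bound
    on the fan-out that actually happened.
    """
    queries: list[str] = []
    for item in output_items:
        if item.get("type") != "web_search_call":
            continue
        # ASSUMPTION: queries live under an "action" object, either as a
        # single "query" string or as a "queries" list.
        action = item.get("action") or {}
        candidates = list(action.get("queries") or [])
        if action.get("query"):
            candidates.append(action["query"])
        for query in candidates:
            if query not in queries:
                queries.append(query)
    return queries


# Hand-written sample payload (hypothetical), not real API output.
sample_output = [
    {"type": "web_search_call", "action": {"type": "search", "query": "ai crawler robots.txt"}},
    {"type": "web_search_call", "action": {"type": "open_page"}},  # no query exposed
    {"type": "message", "content": []},
]

print(observed_queries(sample_output))
```

Counting the web_search_call items that yield no query alongside the ones that do gives a rough sense of how partial the observation is for a given request.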
Goodbye Clicks, Hello AI: Zero-Click Search Redefines Marketing
Bain & Company · 2025
Roughly 80% of consumers now rely on zero-click results, AI summaries, or assistant answers for at least 40% of their search needs, and AI search use has reduced average organic click-through rates by 15% to 25%. The shift compresses the funnel: brands need to be present and credible in the answer itself, not on the click destination.
Methodology note · Bain survey of more than 1,000 US consumers combined with proprietary analysis of organic search traffic patterns. The report measures self-reported AI search use and click-through behaviour across categories. Published February 2025.
Google AI Overviews and Your Website | Google Search Central
Google · 2025
Google states there are no extra technical requirements for appearing in AI Overviews or AI Mode beyond being indexed and eligible for a standard search snippet. SEO fundamentals apply: allow crawling in robots.txt, maintain internal linking, keep pages findable. Google describes the query fan-out technique, where the system issues multiple related searches across subtopics, and reports that clicks from AI Overview pages tend to be higher quality.
Methodology note · Official Google Search Central documentation describing how AI features such as AI Overviews and AI Mode interact with websites, and what site owners can and cannot do to influence inclusion. Direct fetch failed; content was verified against the live Google AI Features and AI Optimization Guide pages on developers.google.com plus corroborating secondary coverage.
Intellectual Property Issues in AI Trained on Scraped Data (AI Paper No. 33)
OECD · 2025
OECD report 'Intellectual Property Issues in AI Trained on Scraped Data' (AI Paper No. 33, February 2025) examines copyright, trademark, trade secret, and database protection challenges raised by AI training data scraping. The report recommends voluntary codes of conduct based on transparency in the data chain, requiring AI developers to disclose data sources and preserve metadata enabling rightsholders to track unauthorised use.
Methodology note · OECD policy report, Paper No. 33, February 2025. Direct fetch returned HTTP 403; findings cross-verified against summaries on TheLegalWire, NortonRoseFulbright knowledge publications, and the OECD AI Policy Observatory at oecd.ai. The report is government/regulatory tier (OECD) and authoritative on policy direction.
FTC Staff Report on AI Partnerships & Investments
U.S. Federal Trade Commission · 2025
The FTC's January 2025 staff report on AI partnerships finds that the three largest US cloud providers (Alphabet, Amazon, Microsoft) have used their investments in Anthropic and OpenAI to lock in cloud spend, gain equity and revenue-sharing rights, and access sensitive technical and business information. The report flags the risk of higher switching costs for AI developers, restricted input access for rivals, and tighter competition for engineering talent and compute.
Methodology note · Press release from the Federal Trade Commission, January 17, 2025, summarizing a staff report based on Section 6(b) orders issued in January 2024 to five companies: Microsoft, OpenAI, Amazon, Alphabet, and Anthropic. Findings reflect information available to staff as of September 2024 plus publicly available information through January 2025. The Commission voted 5 to 0 to issue the report.
The Rise of the AI Crawler (Vercel + MERJ, 1B requests)
Vercel + MERJ · 2025
On Vercel's network in late 2024, GPTBot generated 569 million monthly requests and Anthropic's Claude generated 370 million, together roughly 20% of Googlebot's 4.5 billion. None of the major AI crawlers (OpenAI, Anthropic, Meta, ByteDance, Perplexity) executed JavaScript. ChatGPT spent 34.82% of fetches on 404 pages, Claude 34.16%, versus 8.22% for Googlebot. Server-side rendered content is far more visible to AI crawlers.
Methodology note · Joint research from Vercel and MERJ, published December 17, 2024. Data comes from monitoring nextjs.org and the Vercel network, validated against two job-board sites (Resume Library on Next.js and CV Library on a custom monolith). Microsoft Copilot was excluded because it lacks a distinct user agent. Methods follow the same approach used in MERJ's earlier Googlebot analysis.
Long Context vs. RAG for LLMs: An Evaluation and Revisits
arXiv · Xinze Li · 2025
Empirical comparison of long-context LLMs (where retrieved content is dropped into the model's input window) against retrieval-augmented generation (where retrieval is iterative) finds that long-context approaches underperform RAG when the relevant evidence is buried in noise. RAG still wins for most production information-retrieval tasks despite advances in long-context models.
Methodology note · arXiv preprint 2501.01880 (January 2025). Direct fetch on arxiv.org returned the abstract page; the empirical comparison uses public QA benchmarks and tests several long-context LLMs against RAG baselines under matched compute budgets.
Understanding How the Google Trends Explore Page uses Gemini to help you find insights
Google · 2025
Google Trends Explore uses Gemini to take an area of interest and expand it into up to eight related search terms, additional ideas, and top/rising queries. This is direct first-party proof that public search behaviour can be expanded from a seed topic into an adjacent-intent neighbourhood — but Google explicitly frames it as web-search demand, not chatbot-prompt demand.
Methodology note · Official Google Help Center documentation page describing how the Google Trends Explore experience uses Gemini models to suggest related search terms, follow-up ideas, and top/rising queries from a seed input. Content was fetched and verified directly from support.google.com on 2026-05-27; the page also discloses Gemini privacy practices, data retention, and feedback mechanisms.
OpenAI documents that ChatGPT Search rewrites a user query into one or more targeted queries and may send additional, more specific queries after reviewing initial results. This is first-party proof that query fan-out behaviour is real inside a production chatbot search system.
Methodology note · Official OpenAI Help Center article describing how ChatGPT Search functions, including the prompt-rewriting and follow-up query behaviour. Content verified by fetch on 2026-05-27 (HTTP 200 confirmed; full body accessible in a browser session). Article documents product behaviour without disclosing the rewriting algorithm, query-volume statistics or model-side reasoning.
Google Common Crawlers Overview (incl. Google-Extended)
Google · 2024
Google publishes a list of common crawlers covering Googlebot, Googlebot-Image, Googlebot-Video, Googlebot-News, AdsBot variants and the Google-Extended user-agent token used for Gemini and Vertex AI training opt-out. Each row in the documentation gives the user-agent string seen in HTTP requests, the matching robots.txt token, the IP ranges in common-crawlers.json and the products affected when a site changes its crawl preferences for that agent.
Methodology note · Official Google Search Central documentation listing the common Googlebot variants and their robots.txt tokens. The page was inaccessible to direct fetch; user-agent strings, IP-range publication mechanism and reverse-DNS verification process were confirmed through the live developers.google.com URL referenced in search results and supporting third-party crawler reference databases.
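To illustrate how a per-token opt-out behaves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are illustrative assumptions. Python's parser applies its own matching rules, so treat this as a local sanity check and not as a statement of how Google's crawlers interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of the Google-Extended token only,
# leave every other agent (including Googlebot) allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(token: str, url: str = "https://example.com/article") -> bool:
    """True if the parsed rules let the given robots.txt token fetch the URL."""
    return parser.can_fetch(token, url)


for token in ("Google-Extended", "Googlebot"):
    print(token, allowed(token))
```

Running the same check against a site's real robots.txt before and after a change is a cheap way to confirm that blocking one token has not accidentally blocked another.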
GEO: Generative Engine Optimization
Princeton University / Georgia Tech / Allen Institute for AI / IIT Delhi · Pranjal Aggarwal et al. · 2024
Adding citations, quotations, and statistics to content can increase its visibility in AI-generated answers by up to 41% on average. Pages ranked outside the top of traditional search saw the largest gains. The effect varies by content domain and by AI engine, but the lift from evidence-style content elements is consistent across the conditions tested.
Methodology note · 10,000 questions were run through generative search engines. The researchers compared answers before and after applying nine content optimisation strategies, including citations, quotations, statistics, and authoritative language. They measured visibility as the share of the AI answer attributable to the optimised page, using both word position and word count metrics. Peer-reviewed at KDD 2024.
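A word-count visibility share of the kind described above can be sketched as follows. This is a simplified illustration and not the paper's exact metric: the equal split between co-cited sources is an assumption, the position weighting is omitted, and the sentences and domains are made up.

```python
def word_count_share(answer_sentences: list[tuple[str, list[str]]], source: str) -> float:
    """Fraction of the answer's words attributable to one cited source.

    Each sentence is a (text, cited_sources) pair. Words in a sentence that
    cites several sources are split equally between them (an assumption).
    """
    total_words = 0
    attributed = 0.0
    for text, cited in answer_sentences:
        words = len(text.split())
        total_words += words
        if source in cited:
            attributed += words / len(cited)
    return attributed / total_words if total_words else 0.0


# Hypothetical generated answer with per-sentence citations.
answer = [
    ("Alpha beta gamma delta", ["a.example"]),
    ("One two", ["a.example", "b.example"]),
    ("Three four five six", ["b.example"]),
]

print(word_count_share(answer, "a.example"))
```

Comparing this share for the same page before and after an edit mirrors the before-and-after comparison the researchers describe.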
Detecting hallucinations in large language models using semantic entropy
University of Oxford (et al.) · 2024
Sebastian Farquhar and colleagues at the University of Oxford propose using semantic entropy to detect hallucinations in LLMs. The method measures uncertainty in the meaning of model outputs across sampled generations rather than uncertainty in token probabilities. Outperforms prior hallucination-detection baselines on standard QA benchmarks and generalises across model families.
Methodology note · Nature article (volume 630), Farquhar, Kossen, Kuhn and Gal, published 2024. Peer-reviewed primary research. Direct fetch on nature.com confirmed authorship, journal, and the semantic-entropy method. Empirical evaluation across multiple QA datasets and LLMs is detailed in the full paper.
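The core idea, entropy over meanings instead of tokens, can be sketched in a few lines. This is a simplified, count-based illustration and not the authors' implementation: it assumes the sampled answers have already been grouped into meaning clusters by some other step, and the cluster labels below are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids: list[str]) -> float:
    """Entropy (in nats) of the distribution of sampled answers over meaning clusters.

    cluster_ids holds one label per sampled generation; answers that mean the
    same thing share a label. Higher values mean the model's answers disagree
    in meaning more often.
    """
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical cluster labels for five sampled answers to two questions.
consistent = ["paris", "paris", "paris", "paris", "paris"]
scattered = ["paris", "lyon", "nice", "paris", "rome"]

print(semantic_entropy(consistent), semantic_entropy(scattered))
```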
Retrieval-Augmented Generation for Large Language Models: A Survey
arXiv (cs.CL) · Yunfan Gao et al. · 2024
Comprehensive survey of retrieval-augmented generation (RAG) covering its history, core components (retrieval, augmentation, generation), evaluation methods, and open challenges. The survey organises RAG variants into a taxonomy and traces the field's evolution from naive retrieval to modular and agentic RAG architectures. Widely cited as the field's canonical reference.
Methodology note · arXiv preprint 2312.10997 (December 2023, updated 2024). Direct fetch on arxiv.org returned the abstract page; the full survey runs over 40 pages and includes a comprehensive bibliography. One of the most-cited RAG references in the academic literature.
Perplexity Crawlers Documentation (PerplexityBot, Perplexity-User)
Perplexity AI · 2024
Perplexity documents two user agents controllable via robots.txt: PerplexityBot indexes content for retrieval in Perplexity answers, while Perplexity-User fetches pages on demand when a user submits a query. Pages disallowed for PerplexityBot are not indexed in full, though Perplexity may still display the domain, headline and a brief factual summary. IP ranges are published at perplexity.com/perplexitybot.json and perplexity.com/perplexity-user.json. (agent inferred)
Methodology note · Official Perplexity crawler documentation. Original URL docs.perplexity.ai/guides/bots returns HTTP 308 redirect; the canonical content now lives at docs.perplexity.ai/docs/resources/perplexity-crawlers. User-agent strings, robots.txt behaviour and IP-publication URLs (perplexity.com/perplexitybot.json, perplexity-user.json) cross-verified via 51Degrees, Known Agents and CrawlerCheck.
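Published IP ranges let a site confirm that a request claiming to be a given crawler really comes from that operator. The following is a minimal sketch with Python's standard `ipaddress` module. The `prefixes`, `ipv4Prefix`, and `ipv6Prefix` field names are an assumed payload shape, and the addresses are reserved documentation ranges, not Perplexity's real ones; in practice the JSON would be fetched from the published URL.

```python
import ipaddress
import json

# Hypothetical payload using reserved documentation ranges.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(raw_json: str) -> list:
    """Parse the published JSON into ip_network objects (assumed field names)."""
    data = json.loads(raw_json)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if the request IP falls inside any published network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
print(ip_in_ranges("192.0.2.44", networks))
print(ip_in_ranges("198.51.100.7", networks))
```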
Google FAQ Structured Data Guidelines (FAQPage Schema)
Google · 2023
Google's FAQPage structured data documentation announces that as of May 7, 2026, FAQ rich results no longer appear in Google Search. Support in the rich-result report and Rich Results Test ends in June 2026, and Search Console API support is removed in August 2026. While the feature is being deprecated, FAQ markup itself remains valid Schema.org and is still used by AI engines that read structured data.
Methodology note · Official Google Search Central documentation page for FAQPage structured data, last updated 8 May 2026. The page sets out the schema requirements (FAQPage, Question, Answer), eligibility rules (limited to authoritative health or government sites), content guidelines and the deprecation timetable for FAQ rich results.
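For reference, the three types named above fit together as shown in this minimal sketch, which builds FAQPage JSON-LD from question-and-answer pairs. The question and answer text are placeholders; the `mainEntity`, `name`, `acceptedAnswer`, and `text` properties are the standard Schema.org ones.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Placeholder content for illustration only.
markup = faq_jsonld([("What is a robots.txt token?", "A name a crawler matches in robots.txt rules.")])

# The serialised object goes inside a <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```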
Google announced an update to its Search Quality Rater Guidelines, adding a second E to E-A-T to create E-E-A-T: Experience, Expertise, Authoritativeness, and Trustworthiness. Experience asks whether content reflects first-hand or life experience with the subject. Trust is positioned as the most important of the four, and the others support it. The guidelines instruct human raters who evaluate search quality.
Methodology note · Official Google Search Central blog announcement from December 15, 2022, accompanying a revised version of the public Search Quality Rater Guidelines PDF. The guidelines describe how Google's external quality raters score sample results to train and evaluate ranking systems. Ratings do not directly change rankings but feed into system improvements.
Tier B — Citable with caveats
These sources are credible but narrower in scope or methodology than Tier A. They include trade press with editorial standards (Digiday, Search Engine Land, Marketing Brew, Press Gazette, Nieman Lab), case studies from named companies with disclosed methodology, and single-vendor studies with disclosed methodology (Ahrefs, Semrush, Surfer, Profound, Tinuiti), where sample size or vendor incentives are part of the picture.
We cite Tier B sources when they're the best available evidence. We mark them with a one-line caveat, usually about sample size, methodology, or possible bias. If a Tier A source exists for the same claim, we use that instead.
What is Retrieval-Augmented Generation? (AWS Explainer)
Amazon Web Services · 2026
AWS's explainer defines retrieval-augmented generation as a technique that supplements an LLM's training data with external sources at inference time, improving factual accuracy and reducing hallucinations. The page covers RAG benefits (cost-effective vs fine-tuning, current information, source attribution), and recommended architectural patterns on AWS infrastructure.
Methodology note · AWS vendor explainer page. Direct fetch returned the HTML article. Tier B because AWS is an authoritative vendor in the cloud and AI infrastructure space but the page is a marketing explainer rather than original research. Suitable as a definitional reference for RAG; not for empirical citation claims.
We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved
Ahrefs · 2026
Across 1,885 pages that added JSON-LD between August 2025 and March 2026, schema produced no meaningful uplift in AI citations. Matched difference-in-differences tests against 4,000 control pages showed +2.4% on Google AI Mode and +2.2% on ChatGPT (both statistically indistinguishable from zero) and a small 4.6% decline on Google AI Overviews. 53% of AI-cited pages already carry schema, but this likely reflects overall site quality rather than an effect of the markup itself.
Methodology note · Ahrefs identified 1,885 URLs that transitioned from no JSON-LD to having JSON-LD between August 2025 and March 2026, using its crawler database. Each treated page was matched to three control pages from different domains with similar pre-period citation levels. Citation changes were measured 30 days before and after the schema-add date across AI Overviews, AI Mode and ChatGPT using four statistical tests including matched difference-in-differences.
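The difference-in-differences logic can be shown with a toy calculation. This is a minimal sketch of the basic estimator only, not Ahrefs' matched procedure or its significance tests, and the citation counts are made up.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post) -> float:
    """Change in the treated group minus the change in the control group.

    Subtracting the control change removes trends that affected every page
    over the same window, such as an engine citing more sources overall.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical citation counts per page, 30 days before and after the schema-add date.
treated_pre, treated_post = [10, 12, 8], [12, 14, 10]
control_pre, control_post = [10, 11, 9], [11, 12, 10]

print(diff_in_diff(treated_pre, treated_post, control_pre, control_post))
```

In this toy data the treated pages gain two citations on average, but the controls gain one without any schema change, so only one citation of the uplift is attributable to the treatment.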
OpenAI's Crawler Docs Now List OAI-AdsBot for ChatGPT Ads
Search Engine Journal · 2026
OpenAI added a new crawler, OAI-AdsBot, to its public bots documentation. OAI-AdsBot supports ChatGPT's advertising features by fetching pages so advertised products and links can be checked and presented inside ChatGPT. Publishers can list OAI-AdsBot in robots.txt to allow or disallow its access, separately from GPTBot (training), OAI-SearchBot (search), and ChatGPT-User (user-initiated browsing).
Methodology note · Search Engine Journal news article reporting on OpenAI's published crawler documentation at platform.openai.com/docs/bots. SEJ is a long-running SEO trade publication that tracks vendor documentation updates. The piece references the live OpenAI page where OAI-AdsBot is listed alongside OpenAI's other declared user agents.
Where Google AI Overviews Cite From: A 100-Page Analysis
CXL · 2026
In a mapping of 100 Google AI Overview citations, 55% of cited snippets sit in the top 30% of the source page. The middle third of the page produces 24% of citations. Everything past the 60% mark accounts for just 21%. Pages whose answer is buried deep in the page are far less likely to be picked up.
Methodology note · CXL coded 100 individual AI Overview citations by where in the source page the cited passage appeared, splitting each page into vertical thirds. The data was used to assess how page position relates to citation probability. The original page was inaccessible at the time of writing; figures were confirmed via secondary coverage referencing the CXL study directly.
Exclusive: Small Publishers Hit Hardest by Search Traffic Declines (Chartbeat data)
Axios · 2026
Smaller publishers are seeing AI chatbot referrals rise as a share of search-driven traffic, even as overall search referrals to news sites decline. Chartbeat data shows the gap between large and small publishers in AI traffic share is narrowing, with niche publishers picking up disproportionate AI visibility. (agent inferred)
Methodology note · Axios reporting on Chartbeat's analysis of referrer data across its publisher network, comparing AI chatbot referrals (ChatGPT, Perplexity, Copilot) against traditional search referrals over time, segmented by publisher size.
The YouTube Citation Study 2026 (OtterlyAI)
OtterlyAI · Rick Tousseyn · 2026
Across 100M+ AI citations tracked over 30 days across six AI engines, 94% of YouTube AI citations went to long-form videos and 5.7% to Shorts. Views, likes and subscribers showed near-zero correlation with citation frequency (r approximately -0.03). Description length (r = 0.31) and timestamp presence drove repeat citation: 78% of timestamped videos were cited multiple times.
Methodology note · OtterlyAI YouTube Citation Study, published 2 March 2026 by Rick Tousseyn. Direct fetch on otterly.ai confirmed the methodology: 30-day citation tracking across ChatGPT, Google AI Overviews and AI Mode, Perplexity, Microsoft Copilot, and Gemini. Pearson correlation analysis on already-cited videos only — results explain repeat-citation behaviour rather than initial citation eligibility.
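For readers who want to run the same kind of check on their own data, here is a minimal sketch of the Pearson correlation coefficient. The video figures are invented for illustration and are not taken from the study.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-video data: description length in words and citation count.
description_words = [50, 120, 300, 80, 450]
citations = [1, 2, 4, 1, 3]

print(round(pearson_r(description_words, citations), 2))
```

As the methodology note says, a correlation computed only on videos that were already cited describes repeat citation, not whether a video gets cited in the first place.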
AI Sources Like ChatGPT Account for Less Than 1% of Publishers' Pageviews
Nieman Lab (Chartbeat) · 2026
AI sources such as ChatGPT account for less than 1% of publisher pageviews, according to Chartbeat data covering thousands of news and media sites. Direct visits and traditional search still dominate referrals to publishers, while AI referrals remain a small but growing channel. (agent inferred)
Methodology note · Nieman Lab summary of new Chartbeat analytics data covering pageview composition across its publisher network. Chartbeat tracks real-time referrer data on thousands of news websites and reports aggregate shares from search, social, direct, and AI assistants.
Anthropic Updates Crawler Docs: ClaudeBot, Claude-User, Claude-SearchBot
Search Engine Roundtable · 2026
Anthropic updated its public crawler documentation to clarify the role of each bot. ClaudeBot handles training data collection. Claude-User performs user-initiated retrievals inside Claude (similar to ChatGPT-User). Claude-SearchBot powers Claude's search feature and is meant to be allowed if publishers want their pages to appear in Claude answers. Each bot has its own user agent and robots.txt token, with separate published IP ranges.
Methodology note · Search Engine Roundtable news item by Barry Schwartz reporting on Anthropic's crawler documentation update. Search Engine Roundtable is a long-running SEO news site that tracks search engine and AI vendor documentation changes. The post quotes Anthropic's own published crawler reference page.
AI Bot Traffic Closing in on Human Web Visits (TollBit Q4 2025 / Q1 2026 data)
The Register · 2026
TollBit's Q4 2025 State of the Bots report found roughly one AI bot visit for every 31 human visits, up from 1 in 200 in Q1. Training scrapes fell 15% from Q2 to Q4, while RAG bot traffic rose 33% and AI search indexers rose 59%. Click-through referrals from AI apps dropped from 0.8% in Q2 to 0.27% in Q4. ChatGPT-User scrapes pages five times more often than the next scraper.
Methodology note · Reporting in The Register by Brandon Vigliarolo, February 4, 2026, summarising TollBit's quarterly State of the Bots study. TollBit tracks AI bot traffic on behalf of publishers. The article also cites supporting data from Eight Oh Two and Pew Research on AI search usage among US adults.
Google AI Overviews CTR Shows Early Signs of Recovery (Seer Interactive 25M-impression study)
Search Engine Land · 2026
Organic click-through rate on Google searches with AI Overviews rose from a low of 1.3% in December 2025 to 2.4% in February 2026, an 85% rebound in two months. Searches without AI Overviews achieve about 3.3% CTR; pages cited inside an AI Overview get about 2.1%; uncited pages get about 0.9%. CTR on AIO-free queries rose from 2.8% to 3.8% year on year.
Methodology note · Search Engine Land summary of a Seer Interactive study covering 53 brands, 5.47 million queries, and 2.43 billion impressions from January 2025 to February 2026. Seer compared organic and paid click-through rates across searches with and without AI Overviews, and segmented results by query intent (informational, transactional, comparison, question).
In January 2026, social media accounted for around 9% of all AI citations across the platforms tracked. Google AI Overviews cited social media more than four times as often as Gemini. Amazon.com appeared zero times in Gemini's citations during the period. Reddit dominated the social share, with YouTube next; patterns differed sharply between engines, including between Google's own Gemini and AI Mode.
Methodology note · Tinuiti's Q1 2026 report uses the Profound platform to track citations across nine categories (apparel, beauty, electronics, food and beverage, home and garden, manufacturing, OTC health, technology, transportation and logistics) and seven AI surfaces (ChatGPT, Perplexity, Google AI Mode, Google AI Overviews, Gemini, Microsoft Copilot, Meta AI). The full report is gated; sample charts and headline figures are public.
Global Publisher Google Traffic Dropped by a Third in 2025
Press Gazette · 2026
Globally, Google search traffic to publishers fell by a third in the year to November 2025. Google Discover referrals to over 2,500 publisher sites dropped 21% year on year. In the US, Google search referrals fell 38% and Discover fell 29%. Surveyed media leaders expect publisher traffic to decline by 43% on average over the next three years. ChatGPT referrals reached 0.02% of total traffic.
Methodology note · Press Gazette reports on Chartbeat data published within the Reuters Institute's Journalism and Technology Trends and Predictions 2026 report. The Reuters report combines Chartbeat referral analytics across publisher sites with a survey of 280 media leaders (including 64 editors-in-chief, 64 CEOs, 51 heads of digital) from 51 countries.
AI Traffic Converts at 3× the Rate of Other Channels (Study)
Microsoft Clarity · 2026
Visitors arriving from AI assistants convert at roughly 3 times the rate of visitors from other channels, and at up to 11 times the rate in certain publisher segments. AI traffic still represents a small share of total visits, but its per-visitor commercial value is materially higher than traditional search or social.
Methodology note · Analysis of Microsoft Clarity user-session data across a multi-publisher dataset. The study compared conversion rates of sessions originating from AI assistants against sessions from other referral channels. Published January 2026. Single-vendor study with disclosed methodology, downgraded to Tier B in v1.1.
Cloudflare Year in Review: AI Bots Crawl Aggressively Without Proportional Referrals
InfoQ / Cloudflare · 2025
Cloudflare's 2025 Radar Year in Review reports global internet traffic up 19% year on year, with Googlebot still the largest single source of crawler traffic. Crawl-to-refer ratios widened sharply: Anthropic peaked at 500,000 crawls per referral click and OpenAI at 3,700. Half of human web traffic now uses post-quantum encryption, and Go's share of automated API requests jumped from 12% to 20%.
Methodology note · InfoQ summary by Renato Losio of Cloudflare's sixth annual Radar Year in Review, published December 31, 2025. Data is drawn from Cloudflare's edge network and the 1.1.1.1 public DNS resolver. The InfoQ piece highlights and aggregates the figures from Cloudflare's own published Radar microsite.
Across millions of keywords tracked by Serpstat in 2025, AI Overviews became more prevalent across query types, with informational queries the most affected and commercial queries gaining ground. Click-through rates on AIO-affected SERPs were materially lower than on traditional SERPs. (agent inferred)
Methodology note · Serpstat blog post by Kateryna Hordiienko (AI Marketer at Serpstat), 25 December 2025. Direct fetch returned the full article confirming the methodology: 1 billion keywords analysed, 35 million AI Overviews tracked. Tier B vendor study with disclosed methodology.
Google AI Overviews Surged in 2025, Then Pulled Back: Data
Search Engine Land · 2025
Google AI Overviews appeared on 6.5% of queries in January 2025, peaked at just under 25% in July, then fell back to under 16% by November. Informational queries dominated early (91% in January) but fell to 57% by October, as commercial queries rose from 8% to 18% and transactional from 2% to 14%. Ads alongside AI Overviews rose from about 3% to 40%.
Methodology note · Search Engine Land summary of a Semrush analysis of more than 10 million keywords from January through November 2025. Semrush tracked AI Overview activation rates by month and query intent, paid ad placement frequency, zero-click rates on the same keywords before and after AIO appeared, and category-level penetration.
OpenAI Revises ChatGPT Crawler Documentation with Significant Policy Changes
PPC Land · 2025
On December 9, 2025, OpenAI updated its crawler documentation. ChatGPT-User, which handles user-initiated browsing inside ChatGPT, no longer commits to following robots.txt, on the basis that requests come from a user rather than an autonomous crawler. OAI-SearchBot is now described purely as a search crawler, with training data removed from its scope. GPTBot and OAI-SearchBot may also share crawl results to avoid duplicate fetches.
Methodology note · PPC Land news article by Luis Rijo summarising changes spotted by digital marketing consultant Pieter Serraris in OpenAI's public bots documentation. The article quotes OpenAI's previous and revised wording side by side and links to the live OpenAI bots reference page on platform.openai.com.
Publishers Say No to AI Scrapers, Block Bots at Server Level
The Register · 2025
BuiltWith counted about 5.6 million sites disallowing OpenAI's GPTBot in robots.txt, up from 3.3 million in July 2025, a 70% jump in five months. ClaudeBot is now blocked at about 5.8 million sites, AppleBot at 5.8 million, Googlebot at 18 million. TollBit reports a 336% year-on-year rise in sites blocking AI crawlers; 13.26% of AI bot requests in Q2 2025 ignored robots.txt, up from 3.3% in Q4 2024.
Methodology note · The Register reporting by Thomas Claburn, December 8, 2025, drawing on BuiltWith's public robots.txt trend dashboards and TollBit's Q2 2025 report. The article also cites Arc XP data showing about half of news sites block GPTBot, and quotes Cloudflare VP of product Will Allen.
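As a rough illustration of what "disallowing GPTBot in robots.txt" means in practice, the sketch below parses a robots.txt body with Python's standard-library parser and lists which crawler user agents are blocked for a given URL. The robots.txt content, the `blocked_agents` helper, and the example URL are hypothetical and are not taken from the BuiltWith or TollBit data.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one crawler outright, allows another,
# and applies a narrower default rule to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the user agents that this robots.txt disallows for the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


blocked = blocked_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot"],
    "https://example.com/article",
)
print(blocked)
```

A check like this only shows what a site asks for. robots.txt is advisory, which is why the share of bot requests that ignore it is reported as a separate figure.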
AI Overview Fan-Out Rankings Boost Citation Odds by 161% (Surfer SEO study, 10K keywords)
Search Engine Land · 2025
Pages ranking for both the main query and at least one fan-out sub-query collected 51% of AI Overview citations. Pages ranking only for the main query collected just under 20%. Ranking for fan-out queries makes citation 161% more likely than ranking only for the head term. Around 68% of cited pages did not rank in Google's top 10 for any related query.
Methodology note · Search Engine Land coverage, December 2025, of a Surfer SEO analysis of 10,000 keywords and 33,000 fan-out queries extracted with Gemini. Surfer measured the share of AI Overview citations going to pages ranking on the head query, on fan-outs, on both, or on neither, and reported a Spearman correlation of 0.77 between fan-out coverage and citation rate.
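The 161% figure can be sanity-checked against the two citation shares quoted above. The sketch below is our own arithmetic, not Surfer's code: it treats "161% more likely" as a relative lift of one share over the other, and backs out the baseline share that the lift implies. The function names are illustrative.

```python
def relative_lift(rate: float, baseline: float) -> float:
    """Percentage by which `rate` exceeds `baseline`."""
    return (rate / baseline - 1.0) * 100.0


def implied_baseline(rate: float, lift_pct: float) -> float:
    """Baseline that would produce `lift_pct` given the observed `rate`."""
    return rate / (1.0 + lift_pct / 100.0)


# 51% share for pages ranking on both head and fan-out queries, 161% lift.
baseline_share = implied_baseline(51.0, 161.0)
print(round(baseline_share, 2))  # about 19.5, consistent with "just under 20%"
```

Read this way, the headline lift and the two reported shares are internally consistent.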
LLMs.txt Shows No Clear Effect on AI Citations (300K domains)
SE Ranking · 2025
Across 300,000 domains, only 10.13% had an llms.txt file. Adoption is roughly flat across traffic tiers, with high-traffic sites slightly less likely (8.27%) to use it than mid-tier ones (10.54%). Statistical tests and an XGBoost model found no relationship between the presence of llms.txt and how often a domain is cited by AI engines. Removing the variable from the model actually improved its accuracy.
Methodology note · SE Ranking study of nearly 300,000 domains, published November 2025. The team checked each domain for an llms.txt file, segmented adoption by monthly traffic, and modelled citation frequency using Spearman correlation, XGBoost regression and SHAP analysis. The conclusion is based on whether llms.txt presence improved or degraded model predictions of LLM citations.
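For readers unfamiliar with the first of those tests, here is a minimal pure-Python sketch of Spearman rank correlation, applied to a toy dataset in which llms.txt presence and citation counts are unrelated. The data is invented for illustration and the code is not SE Ranking's. Ties receive average ranks, which matters because a presence flag is binary.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): 1 = domain has llms.txt, paired with citation counts.
has_llms_txt = [1, 0, 1, 0]
citations = [10, 10, 20, 20]
rho = spearman(has_llms_txt, citations)
print(rho)
```

A rho near zero, as in this toy case, is the kind of result the study describes as "no relationship". It does not show that llms.txt is harmful, only that presence does not track citations.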
ChatGPT Search Often Switches To English In Fan-Out Queries (Search Engine Journal)
Search Engine Journal · 2025
Search Engine Journal reports on vendor-observed fan-out behaviour in ChatGPT Search, including the pattern of fan-out queries often switching to English regardless of the original prompt language. Useful as reporting on vendor-observed patterns; should not be treated as first-party platform proof or as a representative sample of all end-user prompts.
Methodology note · Search Engine Journal article by Matt G. Southern, 18 February 2026, reporting on a Peec AI analysis of 10M+ ChatGPT prompts and 20M fan-out queries. Trade-press coverage of vendor-observed patterns; the underlying dataset comes from Peec's own controlled prompt runs (UI scraping), not a representative sample of all consumer ChatGPT use.
Across more than 100 million citations from 230,000 prompts tracked weekly between July 14 and October 12, 2025, ChatGPT's reliance on Reddit and Wikipedia collapsed in mid-September. Reddit fell from close to 60% of ChatGPT responses in early August to around 10% by mid-September. Wikipedia dropped from roughly 55% to under 20%. AI Mode and Perplexity stayed stable; LinkedIn and Forbes citations grew across all three engines.
Methodology note · Semrush ran weekly snapshots of citations for 230,000 prompts over 13 weeks across ChatGPT search, Google AI Mode and Perplexity. Each week the team tracked the 25 most-cited domains and totaled changes around Google's mid-September removal of the num=100 search parameter. The post includes per-platform domain trend lines and lists the biggest gainers and losers per engine.
Semrush 230,000 Prompts Multi-Platform AI Visibility Study
Semrush · 2025
AI assistants cite community-edited and forum content far more often than corporate marketing pages. Wikipedia appeared as the first or second most-cited source in four of five industries studied. Reddit's citation frequency on ChatGPT finance queries was 176%, meaning Reddit was referenced more than once per answer on average. Official brand websites rarely appeared in top-source lists.
Methodology note · Semrush analysed search prompts across finance, digital technology, business services, consumer electronics, and fashion, comparing ChatGPT and Google AI Mode responses. The team measured source citation frequency, brand mention rates, and the overlap between mentioned brands and cited source domains. Methodology disclosed; ongoing dataset.
Wikipedia Says Traffic Is Falling Due to AI Search Summaries and Social Video
TechCrunch (Wikimedia Foundation) · 2025
Wikipedia's human pageviews fell 8% year on year over the months preceding October 2025, according to the Wikimedia Foundation. The decline became visible after improved bot-detection systems revealed that much of the May and June 2025 traffic spike came from bots designed to evade detection. The Foundation attributes the decline to generative AI summaries in search and to younger users seeking information on social video platforms.
Methodology note · TechCrunch reporting based on a Wikimedia Foundation blog post by Marshall Miller. The post draws on Wikipedia's server logs and updated bot-detection systems to separate human from automated traffic, then compares human pageviews year on year.
AI Platform Citation Patterns (680M citations across ChatGPT, AI Overviews, Perplexity)
Profound (TryProfound) · 2025
Across 680 million tracked citations between August 2024 and June 2025, the three big AI engines source very differently. Wikipedia is ChatGPT's top source at 7.8% of citations and 47.9% of its top-10 sources. Reddit leads on Perplexity (6.6% of all citations, 46.7% of top-10) and Google AI Overviews (2.2%). Around 80% of cited URLs sit on .com domains.
Methodology note · Profound analysed citations collected by its monitoring platform across ChatGPT, Google AI Overviews and Perplexity from August 2024 to June 2025. The post reports two cuts of the same data: share of total citations (per platform) and share of each platform's top 10 most-cited sources. Top-level domain distribution is broken out separately. Source lists are published in the article.
Google AI Overviews Linked to 25% Drop in Publisher Referral Traffic
Digiday · 2025
Across 19 Digital Content Next member publishers (including The New York Times, Condé Nast, Vox), Google search referral traffic fell broadly in May and June 2025. The median year-on-year decline was 10% overall, 7% for news brands and 14% for non-news. Losses outpaced gains two to one. UK lifestyle and automotive publishers reported CTR falls of up to 25% on first-page rankings.
Methodology note · Digiday reports on a Digital Content Next survey of 19 of its approximately 40 member publishers, run between May and June 2025, plus parallel evidence submitted by the UK's Professional Publishers Association to the Competition and Markets Authority. Findings combine year-on-year referral traffic comparisons with publisher-reported CTR data on specific queries.
Perplexity Accused of Scraping Websites That Explicitly Blocked AI Scraping
TechCrunch · 2025
TechCrunch reported on Cloudflare's August 4, 2025 findings that Perplexity continued to scrape sites after they blocked PerplexityBot in robots.txt and WAF rules. According to Cloudflare, Perplexity rotated user agents (impersonating Chrome on macOS), used undeclared IPs, and changed ASNs, with the behaviour observed across tens of thousands of domains and millions of requests per day. Perplexity disputed the findings.
Methodology note · TechCrunch reporting by Lorenzo Franceschi-Bicchierai, August 4, 2025, summarising Cloudflare's research post and including a direct response from Perplexity spokesperson Jesse Dwyer, who described the post as a sales pitch and denied that the named bot belonged to the company.
Surfer SEO AI Citation Report 2025 (36M AIOs, 46M citations)
Surfer SEO · 2025
Across 36 million Google AI Overviews and 46 million citations between March and August 2025, three domains dominate: YouTube at about 23.3%, Wikipedia at 18.4% and Google.com at 16.4%. Industry mix shifts the picture: NIH leads health at 39%, YouTube and Reddit together carry gaming with 93% and 78% appearance rates, and Shopify takes 17.7% of ecommerce citations.
Methodology note · Surfer's AI Tracker logged AI Overview responses and their citations from March to August 2025, covering 36M Overviews and 46M citations across 57,000-plus URLs. The team broke results into industry segments (finance, health, ecommerce, SEO, gaming, sports, travel) and reported the share of citations earned by the most frequent domains within each category.
Cloudflare Will Now Block AI Bots by Default
MIT Technology Review · 2025
Cloudflare made blocking AI bots the default for websites it hosts as of July 1, 2025. Customers can override per bot, allow verified crawlers, or charge for access via Pay Per Crawl. Media outlets including the Associated Press and Time, plus platforms like Quora and Stack Overflow, endorsed the move. CEO Matthew Prince argues current AI use of the web is breaking the publisher business model.
Methodology note · Reporting in MIT Technology Review by Peter Hall, published July 1, 2025, covering Cloudflare's announcement and including direct comment from Will Allen, Cloudflare's head of AI privacy, control, and media products. The piece also includes a contrasting view from MIT Media Lab PhD candidate Shayne Longpre on impacts to research and non-commercial use.
AI Traffic Has Increased 9.7× in the Past Year (81,947 Websites Study)
Ahrefs · 2025
Across 81,947 websites, average AI traffic grew about 9.7 times in a year. The average site's search traffic dropped about 21% over the same period. AI traffic now represents 0.25% of a site's total traffic on average. ChatGPT has grown 85% since January 2025 and now sends more traffic than Reddit or LinkedIn. Google still sends about 210 times more traffic than the big three AI platforms combined.
Methodology note · Ahrefs analysed referral traffic patterns across 81,947 websites between mid-2024 and mid-2025, comparing AI referrals (ChatGPT, Perplexity, Gemini, Copilot) against traditional search, social platforms, and direct traffic. The dataset more than doubled the size of the earlier March 2025 study.
AI Visitors Visit Fewer Pages and Bounce More Often Than Search Visitors (Quality Study)
Ahrefs · 2025
Visitors arriving from AI platforms (ChatGPT, Perplexity, Copilot, Gemini) view 4 pages on average, 1.2 fewer than search visitors and 1.5 fewer than the typical visitor. They spend about 8 seconds longer on site (86 versus 78 seconds) but bounce 4.1% more often than search visitors and 5.4% more than the average visitor. Sessions are longer in time but shallower in depth.
Methodology note · Ahrefs analysed user behaviour across roughly 82,000 websites between May and June 2025, comparing visitors arriving from AI platforms against those arriving from search engines and against the overall visitor average. Metrics included pages per visit, session duration, time on site, and bounce rate.
AI Makes Up 0.1% of Traffic, but Clicks Aren't Everything (~35K Websites Study)
Ahrefs · 2025
Across roughly 35,000 websites, AI tools sent 0.1% of total referral traffic, just below email at 0.2%. Google sent 345 times more traffic than the three main AI platforms (ChatGPT, Perplexity, Gemini) combined. The three AI platforms together referred about as much traffic as Reddit. AI traffic was highest in the US (7.71% of AI referrals) and in business and industrial sectors (21%).
Methodology note · Ahrefs analysed referral traffic across approximately 35,000 websites in early 2025, breaking down sources by channel (search, direct, social, paid, email, AI) and by AI platform. The study also examined AI traffic distribution by country, industry, site size, and page type (using URL keyword frequency analysis).
Anthropic's Citations feature lets Claude ground answers in source documents the developer provides, returning the specific sentences and passages each claim is drawn from. Anthropic reports that this built-in citation approach improved recall accuracy by up to 15% compared with custom prompt-based citation implementations. Thomson Reuters and Endex report reductions in hallucinated or misformatted source references.
Methodology note · Product announcement and developer documentation for the Citations API, generally available on the Anthropic API and Google Cloud Vertex AI. The feature processes user-provided source documents by chunking them into sentences, then passes them with the user query so the model can cite specific passages. Published January 2025; expanded to Amazon Bedrock June 2025.
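To make the sentence-chunking idea concrete, the sketch below splits a document into sentences and records character offsets, so that a cited claim can point back to an exact span. This is a simplified illustration under our own assumptions, not Anthropic's implementation: the `Chunk` type, the regular expression, and the `chunk_sentences` helper are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int   # position of the sentence in the document
    start: int   # character offset where the sentence begins
    end: int     # character offset just past the sentence
    text: str


# Naive splitter: a sentence ends at ., ! or ?. Abbreviations such as
# "e.g." will be split incorrectly; a real system needs something sturdier.
_SENTENCE = re.compile(r"[^.!?]+[.!?]+|[^.!?]+$")


def chunk_sentences(document: str) -> list[Chunk]:
    """Split a document into sentence chunks with character offsets."""
    chunks: list[Chunk] = []
    for match in _SENTENCE.finditer(document):
        raw = match.group()
        text = raw.strip()
        if not text:
            continue
        start = match.start() + (len(raw) - len(raw.lstrip()))
        chunks.append(Chunk(len(chunks), start, start + len(text), text))
    return chunks


doc = "Grass is green. The sky is blue! Why?"
chunks = chunk_sentences(doc)
for chunk in chunks:
    print(chunk.index, chunk.start, chunk.end, chunk.text)
```

Keeping offsets alongside the text is the design choice that lets each citation be checked against the source document rather than taken on trust.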
How to view fanout queries generated by AI (Ahrefs Help)
Ahrefs · 2025
Ahrefs provides a way for users to view fan-out queries that AI assistants generate from a seed prompt the user has chosen to track. This is evidence that third-party tools can observe AI-generated query rewrites for prompts under the user's control, but it does not prove access to all end-user prompts in the wild.
Methodology note · Ahrefs Help Center article by Constance Tan (updated weekly) describing the Brand Radar fan-out-queries feature for ChatGPT and Perplexity. Explains that Ahrefs typically returns two fan-out queries per tracked prompt (sometimes one, sometimes none) and compares fan-out to People Also Ask. Vendor-reported product behaviour; the fan-out queries observed are derived from user-defined seed prompts. Content verified by direct fetch.
What is RAG (Retrieval-Augmented Generation)? (IBM Think)
IBM · 2024
IBM's explainer defines retrieval-augmented generation (RAG) as a process where an LLM first retrieves relevant external documents, then generates an answer grounded in those documents rather than only its parametric memory. The page describes RAG architecture, common use cases (enterprise search, customer support), and trade-offs compared with pure LLM inference or fine-tuning approaches.
Methodology note · Vendor explainer hosted on IBM's 'Think' marketing site. Direct fetch returned the HTML article. Tier B because IBM is an authoritative vendor in the AI/enterprise space but this is a marketing explainer rather than original research. Suitable as a definitional reference for RAG; not for citation-rate or methodology claims.
Otterly supports prompt construction from external proxies including SEO keywords, brand names, industry terms, and URLs. The page reinforces that prompts are customer-defined and proxy-derived, not drawn from a privileged platform-wide feed of real chatbot user prompts.
Methodology note · Otterly AI Help Center article (December 2025) describing the three ways customers can add prompts inside the Otterly platform: individual entry, CSV import, or the AI Prompt Research tool. Self-reported vendor documentation. Useful as evidence of the kinds of inputs Otterly accepts; not a controlled study or independent benchmark. Content verified by direct fetch.
How to find relevant prompts for your brand? (Otterly Help)
Otterly AI · 2024
Otterly's own help documentation explicitly states there is 'no way to learn which prompts are most asked at ChatGPT or Perplexity' and 'no way to know what exactly people are searching for in the AI engines.' Otterly recommends constructing prompts from available external inputs such as brand terms, domains, industries, URLs, and SEO keywords. This is a vendor admission that aligns with the public-proxy thesis.
Methodology note · Otterly AI Help Center article (last updated April 2026) describing the vendor's own recommended methodology for building a brand's prompt list. Self-reported vendor documentation; the page explicitly states that AI search engines do not publish query data and lists three substitute methods Otterly supports (Prompt Research tool, Google Search Console import, AI-assisted brainstorming). Content verified by direct fetch.
Peec frames itself as a platform that runs customer-defined prompts across major AI assistants and tracks visibility, citation, and answer-inclusion outcomes. Useful as additional product-context evidence that the platform observes outputs from its own controlled runs.
Methodology note · Peec AI's product-introduction documentation page. Describes the platform's three core metrics (Visibility, Position, Sentiment), the prompt-running cadence, and Peec's UI-scraping data-collection approach. Confirms that data comes from prompts the customer defines, not from a privileged platform-wide feed. Content verified by direct fetch on 2026-05-27.
Peec's documentation says the platform runs customer prompts daily across AI platforms. This supports the interpretation that vendors like Peec observe outcomes from prompts they execute rather than drawing from a secret platform-wide prompt firehose.
Methodology note · Peec AI's official Quickstart Guide, published on its Mintlify-hosted documentation site. Describes the four-step onboarding workflow (set up prompts, identify competitors, read the dashboard, analyse sources) and confirms that Peec runs customer-defined prompts daily across ChatGPT, Perplexity, Gemini and Copilot. Content verified by direct fetch on 2026-05-27.
Tier C — Tactical signals only
Tier C is vendor blogs, individual LinkedIn or Substack posts, and case studies with a single data point. They're useful sometimes, especially when a category is moving fast and Tier A or B research hasn't caught up.
When a Tier C source surfaces a finding that's genuinely novel, we cite it openly with the caveat that the evidence is provisional, and we treat it as a hypothesis worth testing rather than a fact to repeat.
Who Blocks OpenAI, Google AI and Common Crawl? (News Homepages tracker)
Palewire · Ben Welsh · 2026
Palewire's continually-updated news-homepages tracker shows that 633 of 1,156 news publishers surveyed (54.8%) have instructed OpenAI, Google AI, or Common Crawl to stop crawling their sites via robots.txt. Per-crawler block rates: OpenAI 49.9%, Google AI 45.5%, Common Crawl 50.0%. The tracker collects each site's robots.txt file twice per day and reports the latest results.
Methodology note · Palewire (Ben Welsh) News Homepages project documentation, ongoing. Direct fetch returned the project page with live block-rate counts and a per-site breakdown. Tier C: a personal/research project with transparent methodology and live data, but single-author maintenance and no formal peer review.
AI Citation Patterns by Platform, Industry, and Intent
ALM Corp · 2026
ALM Corp synthesises AI citation patterns from multiple 2026 datasets, including Yext's 6.8M-source analysis showing 86% of citations come from sources brands can directly influence, and a landmark study finding 44.2% of citations come from the first 30% of content. Concludes there is no universal top source; patterns are shaped by intent, platform, and category. Treat as provisional.
Methodology note · ALM Corp blog post by digital strategy team, 2026. Direct fetch returned the HTML article. Tier C: a marketing-agency blog synthesising third-party datasets rather than running original primary research. Suitable as a tactical reference; not for citation as a primary source. Cross-verifiable against Yext and Tinuiti underlying reports.
Cloudflare and ETH Zurich Say AI Bots Are Breaking the Web's Cache Layer
PPC Land · 2026
Trade press coverage of the joint Cloudflare and ETH Zurich research published April 2026: automated traffic now accounts for 32% of Cloudflare's network. AI crawl purposes break down as 45% training, 45% mixed-purpose, and 7.5% search. The research argues that standard CDN caching strategies are failing under AI crawler load. Companion to the peer-reviewed SOCC 2025 paper at R181.
Methodology note · PPC Land article by Luis Rijo, 6 April 2026. Direct fetch returned the full article. Tier C trade-press summary of primary research from Cloudflare and ETH Zurich. The underlying primary sources are the Cloudflare blog post 'Why we're rethinking cache for the AI era' and the peer-reviewed SOCC 2025 paper (R181).
Pages not updated for a quarter are over three times more likely to lose AI citations. About 70% of cited pages were updated in the last 12 months, and 83% of commercial citations come from pages refreshed within a year. Sequential heading hierarchies correlate with 2.8 times higher citation likelihood; 87% of cited pages use a single H1, and 48% of citations come from user-generated platforms.
Methodology note · Industry report from AirOps with Kevin Indig, drawing on millions of citation datapoints across ChatGPT, Google AI Overviews, AI Mode, Gemini, and Perplexity. Findings are organised around freshness, on-page structure, schema use, user-generated content, off-site mentions, and visibility stability, with specific percentage gaps tied to each signal.
What 2025 Revealed About AI Search and the Future of Schema Markup
Schema App · Martha van Berkel · 2025
In 2025, Google and Microsoft publicly confirmed they use Schema markup for generative AI features, and ChatGPT confirmed it uses structured data to decide which products appear in results. Schema App reported a 19.72% rise in AI Overview visibility on its own site after deploying Entity Linking, and customer InSinkErator a 69% rise in clicks on non-branded queries.
Methodology note · First-party essay by Schema App's CEO. The piece argues structured data should be treated as a knowledge graph rather than a rich-result trick, and uses examples from Schema App's own site and named customers (InSinkErator, Wells Fargo) plus public statements from Google, Microsoft, and ChatGPT to support the case.
AI Search Cites Press Releases Just 0.04% of the Time
ALM Corp · 2025
Press releases syndicated through Yahoo Finance, MSN, and similar networks account for 0.04% of all AI citations, and newswire pages such as PRNewswire add 0.21%. Original editorial content carries 81% of news citations. ChatGPT is a partial exception: press releases hosted on a brand's own newsroom domain drive 18.15% of its citations, against around 3% for Google's AI platforms.
Methodology note · Industry commentary on a BuzzStream study run with the XOFU citation-monitoring tool, covering more than four million AI citations across ChatGPT, Google AI Overviews, Google AI Mode, and Gemini. Researchers ran 3,600 prompts across 10 industries over one week and split prompts into evaluative, informational, and brand-awareness categories.
The Web Almanac 2025 Generative AI chapter reports that only 0.0015% of sites in the Majestic Million had an llms.txt file in early 2025 (just 15 sites total). The chapter also documents a 6,697% increase in research-paper usage of the word 'delves' as an AI fingerprint, and analyses adoption of built-in browser AI APIs. ChatGPT reached 700M weekly active users by the July 2025 crawl date.
Methodology note · HTTP Archive Web Almanac 2025, Generative AI chapter by Christian Liebel, Yash Vekaria, Jonathan Pagel and others. Direct fetch on almanac.httparchive.org returned the chapter content. Tier C because the Web Almanac is a community-volunteer publication rather than peer-reviewed research, but methodology and data sources are disclosed.
Google's AI Mode Cites Google in 17% of Answers (SE Ranking, 1.3M citations)
SE Ranking via Search Engine Land · 2025
SE Ranking analysed 68,313 keywords and over 1.3 million citations and found that Google.com accounts for 17% of all Google AI Mode citations, more than YouTube, Facebook, Reddit, Amazon, Indeed, and Zillow combined. Google was the top-cited domain in 19 of 20 industry niches studied. 59% of these Google citations point to organic search results, 36% to Google Business Profiles.
Methodology note · Search Engine Land article reporting on SE Ranking research, 2026. Direct fetch returned the HTML article. Tier C because the underlying analysis is a single-vendor study (SE Ranking) covered by trade press; methodology disclosed but vendor incentive to position its tool. Cross-verifiable against the SE Ranking blog post directly.
Wikipedia Analysis (LLM Optimizer)
Adobe LLM Optimizer · 2025
Adobe's LLM Optimizer treats a company's Wikipedia page as a primary lever for being cited correctly by ChatGPT, Google AI Mode, Gemini, Perplexity, and Copilot. It scores articles on five dimensions: references, sections, content length, images, and infobox completeness. It then benchmarks each against industry competitors and surfaces prioritised fixes, including critical flags for press-release tone and reference gaps.
Methodology note · Product documentation for the Wikipedia Analysis opportunity inside Adobe LLM Optimizer. The system scrapes a brand's Wikipedia page, auto-selects up to six industry competitors based on the company's category, calculates gaps on the five dimensions, and ranks recommendations from Informational to Critical. Edits are made on Wikipedia; the tool does not push changes itself.
Social media climbed to over 9% of AI citations between October 2025 and January 2026, with Reddit driving the dominant share of growth across nine tracked product categories. Reddit's karma-weighted upvote system functions as distributed editorial curation, which retrieval systems treat as a credibility signal. Answer Engine Optimisation now needs a community-content strategy, not only owned-domain SEO.
Methodology note · Trade publication article (CMSWire) citing Tinuiti's AI Citations Trends Report Q1 2026. The piece compares citation behaviour across ChatGPT, Perplexity, and Google's AI surfaces, and translates the data into recommendations for tracking and participating in community conversations as part of an Answer Engine Optimisation programme.
Common Crawl: Setting the Record Straight (Transparency Response)
Common Crawl · 2025
Common Crawl's transparency response (November 2025) addresses criticism around its commitment to fair use and public-good principles. The post documents Common Crawl's robots.txt and opt-out compliance, its crawl truncation thresholds (raised from 1 MiB to 5 MiB per page as of the March 2025 crawl), and clarifies that Common Crawl is a non-profit research dataset rather than an AI training entity itself.
Methodology note · Common Crawl blog post, November 4, 2025. Direct fetch on commoncrawl.org returned the full article. Tier C: first-party communication from a research organisation responding to public criticism; useful as context on Common Crawl's stated policies but not as independent evidence of compliance.
Top Cited Domains in AI: What 10M+ Citations Reveal About Visibility
Decoding · 2025
AI citations concentrate in a small set of domains: the top 5 hold 38% of citations, the top 20 hold 66%. Wikipedia leads at 11.22% of Google AI Mode citations and 47.9% of ChatGPT's top-10 share. YouTube grew 34% in six months. Reddit citations surged 450% between March and June 2025, then collapsed in ChatGPT around September 2025 from roughly 60% to about 10% of responses.
Methodology note · Vendor blog (Decoding) consolidating citation data from third-party studies, including a Profound analysis of 680 million citations across ChatGPT, Google AI Overviews, and Perplexity from August 2024 to June 2025, plus citation-share counts from Ahrefs and a three-month Semrush time series capturing the September 2025 shift in ChatGPT source mix.
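Concentration figures such as "the top 5 hold 38%" are top-k shares of a citation count table. A minimal sketch of that calculation follows; the domain names and counts are made up and the `top_k_share` helper is illustrative, not the vendor's method.

```python
def top_k_share(citations_by_domain: dict[str, int], k: int) -> float:
    """Share of all citations held by the k most-cited domains."""
    total = sum(citations_by_domain.values())
    if total == 0:
        return 0.0
    top = sorted(citations_by_domain.values(), reverse=True)[:k]
    return sum(top) / total


# Hypothetical citation counts per domain.
counts = {"a.example": 50, "b.example": 30, "c.example": 15, "d.example": 5}
share = top_k_share(counts, 2)
print(f"{share:.0%}")
```

Because the blog consolidates several third-party datasets, shares computed on different underlying tables are not directly comparable with one another.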
Perplexity vs ChatGPT: AI Citation Study Q3 2025
Qwairy · 2025
Qwairy's analysis of 118,000+ AI-generated answers across Q3 2025 found that Perplexity averages 21.87 citations per question while ChatGPT averages 7.92, that OpenAI is the only major model citing Wikipedia significantly (4.8% of citations), and that only 11% of cited domains appear across multiple platforms. Each AI provider has distinct source preferences requiring platform-specific optimisation.
Methodology note · Qwairy blog post, Q3 2025. Direct fetch returned the HTML article. Tier C: single-vendor study with disclosed sample size (118K+ answers) but limited methodology disclosure on how the answer set was sampled. Vendor incentive to position its GEO platform; treat per-vendor citation counts as directional rather than definitive.
First-Ever SEO Study on ChatGPT Search Queries (Query Length, Fan-Outs, N-Grams) — Tactical Signal
Marketing Power Ups / LinkedIn · Chris Long · 2025
Chris Long published the first-ever SEO study on ChatGPT Search behaviour, analysing query length, fan-out patterns, and n-gram distributions in ChatGPT-cited content. A notable finding: roughly 28% of pages cited by ChatGPT had zero organic Google visibility, indicating that ChatGPT's source-selection criteria diverge meaningfully from Google ranking. Treat as a provisional Tier C tactical signal.
Methodology note · LinkedIn post by Chris Long (Nectiv / Go Fish Digital), October 2025. The original LinkedIn URL returns HTTP 404; the finding is cross-verified against Chris Long's X post (1985689925602460120) and the AirOps webinar 'Query Fan-Out: What 60,000+ Searches from ChatGPT & Google Show with Chris Long' which references the same analysis.
OpenAI Search Crawler Reaches 55% Web Coverage: Analysis of 66 Billion Bot Requests
ALM Corp · 2025
ALM Corp summary of a Hostinger study (January 2026) analysing 66.7 billion bot requests across more than 5 million websites. The study found that OAI-SearchBot reached 55.67% average coverage of monitored websites between June and November 2025, ahead of TikTok's bot at 25.67%, Applebot at 24.33%, and Huawei's PetalSearch at 18.33%. The figures point to rapid expansion of assistant-facing crawlers as training-bot blocking grows.
Methodology note · ALM Corp blog post summarising Hostinger's 2026 AI crawler coverage study, January 2026. Direct fetch returned the HTML article. Tier C because it is an agency blog summarising third-party vendor research (Hostinger). Cross-verifiable against the original Hostinger blog post 'AI bot analysis' and Search Engine Journal coverage.
AI Mode and AI Overviews Share Only 13.7% of Citations (Ahrefs 730K response pairs)
Ahrefs · 2025
Across 730,000 query pairs analysed in September 2025, Ahrefs found that Google AI Mode and AI Overviews reach 86% semantic similarity in their answers but cite only 13.7% of the same URLs. The two surfaces converge on conclusions while diverging on sources, suggesting brands need to optimise for each surface separately rather than treating them as a single Google AI endpoint.
Methodology note · Ahrefs blog post by Brand Radar team, September 2025. Direct fetch returned the article HTML. Tier C: single-vendor study with disclosed methodology and clear sample size, but vendor incentive to position its tool. Sample is US-only and query-set composition is not externally validated.
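A citation-overlap figure like 13.7% depends on how overlap is defined. The sketch below assumes a Jaccard-style measure (shared URLs divided by all distinct URLs cited by either surface), which is our assumption and not a definition confirmed by the source. The URLs and the `citation_overlap` helper are hypothetical.

```python
def citation_overlap(urls_a: set[str], urls_b: set[str]) -> float:
    """Jaccard overlap: shared URLs as a share of all distinct URLs cited."""
    union = urls_a | urls_b
    if not union:
        return 0.0
    return len(urls_a & urls_b) / len(union)


# Hypothetical citations for one query on two surfaces.
ai_mode = {"https://a.example/1", "https://b.example/2", "https://c.example/3"}
ai_overview = {"https://c.example/3", "https://d.example/4"}
overlap = citation_overlap(ai_mode, ai_overview)
print(overlap)
```

Other definitions, for example shared URLs divided by the smaller set, would yield a higher number from the same data, so the metric is worth checking before comparing studies.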
Athena's State of AI Search Report 2025 reports that zero-click search on Google rose from 56% in 2024 to 69% in 2025, that the average brand appears in just 17.24% of relevant prompts while top players reach 56.71%, and that informational queries dominate AI search at 34.28% of prompts. Treat as provisional Tier C until the original PDF is re-verifiable.
Methodology note · Athena State of AI Search Report 2025. The PDF URL returns HTTP 404; findings cross-verified against the live Athena State of AI Search 2026 report at athenahq.ai/athena-state-of-ai-full-report and against summaries on Bluehost.com. Single-vendor research with limited methodology disclosure.
Which News Sites Block AI Crawlers in 2025?
BuzzStream · 2025
BuzzStream analysed robots.txt directives on the top 50 news sites in the UK and the top 50 in the US (100 sites combined), checking 11 AI-related crawlers. PerplexityBot (the indexing variant) is blocked by 67% of these sites; only 14% of publishers block all AI bots, while 18% block none. US publishers are more restrictive against Google's AI bots than UK publishers.
Methodology note · BuzzStream blog post, 2025. Direct fetch returned the HTML article with methodology (top 50 UK + top 50 US news sites by Similarweb), the 11 AI crawlers examined, and per-bot block rates. Tier C: marketing-tool vendor blog with disclosed methodology and a clearly stated but small sample (n=100), not independent peer-reviewed research.
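A robots.txt check of this kind can be approximated with Python's standard library. The sketch below is not BuzzStream's method: the crawler list is an illustrative subset we chose, not the 11 crawlers in the study, and the sample file is hypothetical. Note that `urllib.robotparser` applies the first matching rule in a group, which can differ from how individual crawlers interpret the same file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of AI crawler tokens (an assumption, not the study's list).
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "Google-Extended"]


def blocked_crawlers(robots_txt, crawlers=AI_CRAWLERS, path="/"):
    """Return the crawlers that may not fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in crawlers if not parser.can_fetch(ua, path)]


# Hypothetical robots.txt that blocks two AI crawlers and allows everything else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_crawlers(SAMPLE)
print(blocked)
```

Running the same function over a list of sites and counting how often each crawler appears in the result gives per-bot block rates comparable in shape to those reported above.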
AI Search Engines Cite Reddit, YouTube, LinkedIn Most (150K citations)
Cybernews via Search Engine Land · 2025
Reddit ranks as the most-cited domain across ChatGPT, Google AI Mode, Gemini, Perplexity, and AI Overviews combined, with YouTube, LinkedIn, Wikipedia, and Forbes filling out the top five. Yelp and G2 surface often on recommendation queries. ChatGPT leans on Wikipedia, Reddit, and editorial sites; Google leans on Facebook and Yelp; Perplexity emphasises Reddit, LinkedIn, and G2, especially for business-to-business questions.
Methodology note · Search Engine Land summary of an analysis by Peec AI, an AI search analytics tool, covering 30 million sources cited directly inside answers from ChatGPT, Google AI Mode, Gemini, Perplexity, and AI Overviews. Coverage focuses on per-domain citation share by platform and by query type, including recommendation queries.
AI Bots and Robots.txt (longitudinal analysis)
HTTP Archive · Paul Calvano · 2025
Longitudinal analysis of robots.txt files across popular websites finds that as of July 2025, AI bot user-agents top the list of most-referenced agents. Nearly 21% of the top 1,000 websites have rules targeting GPTBot. The wildcard '*' appears in 97.4% of robots.txt files. AI bot blocking has grown rapidly and is more common on higher-traffic sites. Treat as a provisional, single-author analysis.
Methodology note · Personal blog post by Paul Calvano (web performance engineer and Web Almanac contributor), 21 August 2025. Direct fetch returned the HTML article with the methodology, data source (AI Robots.txt GitHub repository), and findings. Tier C: single-author analysis on a personal blog. Methodology is disclosed but not externally peer-reviewed.
I Audited 30 llms.txt Files in the Wild — 5 Anti-Patterns Already Forming
DEV Community · Kenimo · 2025
An audit of 30 live llms.txt files found five recurring failures: overlong files with too many links; URLs contradicting robots.txt for the very AI crawlers expected to read them (about a third of files); no Markdown twin of pages (24 of 30); marketing prose instead of pointers; and files frozen since 2024 with dead links and renamed slugs.
Methodology note · Practitioner blog post on dev.to. The author manually audited 30 llms.txt files in the wild against the original Jeremy Howard proposal and against guidance from Mintlify and the llmoframework, then documented five anti-patterns with examples. Three of the audited files were the author's own, used as a control on bias.
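The second anti-pattern, llms.txt links that robots.txt forbids the same crawler from fetching, is straightforward to test for. The sketch below is our own illustration rather than the author's audit script. The function name `contradicted_links`, the default `GPTBot` user agent, and both sample files are assumptions, and the regular expression only picks up standard Markdown links with absolute URLs.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches Markdown links of the form [label](https://...).
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def contradicted_links(llms_txt, robots_txt, user_agent="GPTBot"):
    """Return llms.txt URLs that robots.txt disallows for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    urls = LINK_RE.findall(llms_txt)
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical files for a single site.
ROBOTS = """\
User-agent: GPTBot
Disallow: /docs/

User-agent: *
Allow: /
"""

LLMS = """\
# Example Site
- [Guide](https://example.com/docs/guide)
- [Pricing](https://example.com/pricing)
"""

conflicts = contradicted_links(LLMS, ROBOTS)
print(conflicts)
```

An empty result means the two files agree for that crawler. Repeating the check for each AI user agent a site cares about covers the case the audit describes.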