---
title: "LLM-Driven SEO: The Shift to Topical Coverage and Information Retrieval Cost"
canonical: "https://www.rankinghacks.com/llm-driven-seo/"
pubDate: "2025-11-16T02:12:17.000Z"
updatedDate: "2026-03-27T13:25:26.000Z"
author: Andreas De Rosi
description: "Michal Suski mined 100B+ OpenAI tokens to find what LLMs actually rank. Why 22% of customers now discover via AI, and what that means for content scoring."
tags: [cmseo-2025]
categories: [ai-search]
---

Michał Suski, co-founder and Head of Innovation at Surfer, presented a deep dive into advanced SEO strategies, focusing on adapting content for both Google and Large Language Models (LLMs). His company has processed over **100 billion tokens** through OpenAI, providing a large dataset for correlation studies.

---

## **Key Metrics & Business Rationale**

The drive to integrate LLMs into content strategy is financially motivated:

- **Customer Acquisition:** **22%** of Surfer's customers discovered the product through **ChatGPT or other AI models**.
- **Conversion:** While Google still drives roughly **30 times more traffic** (and proportionally more conversions), the conversion *rates* of Google traffic and AI-driven traffic are **comparable**. This justifies incorporating LLM signals.
- **Data Scale:** The correlation study used a sample of roughly one million keywords, although Suski noted that similar results could be achieved with **10,000** or **100,000** keywords.

---

## **SEO Ranking Factor Correlation Findings**

The analysis of a million-keyword sample (de-duplicated via clustering) revealed significant changes in ranking factors:

| **Factor** | **Key Finding** | **Correlation Strength** |
| --- | --- | --- |
| **Topical Coverage** | The **strongest factor** for success; covering comprehensive facts/entities. | **Highest** |
| **Domain Traffic/Authority** | Remains a **strong correlating factor**, though slightly declining. | Strong |
| **Keyword Density (Exact Match)** | Correlates around **zero**; move away from obsession with exact matches. | Zero |
| **Keyword Variations** | **Outperform exact keyword density** in paragraphs. | Stronger than Exact |
| **Headings (H2/H3/H4)** | **Variations** outperform exact matches; avoid forcing keywords in every subheading. | Stronger than Exact |
| **Content Length** | **Non-causal correlation**; an *outcome* of covering enough entities/facts, not a goal. | Contextual |
| **Bolding** | Bolding keywords and variations shows a modest positive correlation (**~5%**). | Modest Positive |
| **URL/Domain** | **Exact Match Domains (EMD)** and **Partial Match Domains (PMD)** show strong correlation. | Strong |
| **Page Performance** | **HTML size** and **server responsiveness** are more critical than overall page size. | Strong |
| **Schema Markup** | **Negative correlation** with the number of different schema types (avoid excessive, abusive schema). | Negative |
| **AI Content Detection** | **No correlation** between AI-detected content and rankings. | None |
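The correlation strengths in the table come from rank-correlation analysis across keyword samples. As a minimal sketch of how such a study works (this is not Surfer's pipeline; the factor scores and SERP positions below are toy data), Spearman correlation compares a factor's value against ranking position. Since lower position numbers are better, a strong positive factor shows a *negative* correlation with position:

```python
def rank(values):
    """Assign average 1-based ranks to values, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two equal-length samples."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data: topical-coverage scores vs. SERP positions (lower = better).
coverage = [0.74, 0.61, 0.55, 0.48, 0.40, 0.33]
position = [1, 2, 4, 3, 6, 5]
print(round(spearman(coverage, position), 2))  # prints -0.89
```

At the scale of a million keywords, a real study would aggregate this per-SERP, but the core statistic is the same.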

---

## **Optimizing for LLMs: The Core Source & IRC**

The second part of the analysis involved reverse-engineering LLMs by examining **50 million AI Overview answers** and **400,000 model prompts**.

### **The “Core Source” Concept**

- **Definition:** URLs that are **consistently cited** in AI responses to the same question. Only **10-30%** of sources remain stable.
- **Key Differentiators for Core Sources:**
  - **Topical Coverage:** Core sources cover **42%** of facts vs. **34%** for volatile sources and **23%** for non-cited pages. This is the **most significant factor**.
  - **Google Rank:** Core sources typically rank higher (average position **8**) than volatile/non-core sources (average position **20**).
  - **Semantic Proximity:** Close alignment between **URL slug/title** and the prompt intent significantly increases the chance of inclusion.
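The talk doesn't specify how semantic proximity is computed; a crude stand-in (my assumption, not Suski's method) is token overlap between the URL slug or title and the prompt, e.g. Jaccard similarity:

```python
import re

def tokens(text):
    """Lowercase word tokens from a slug, title, or prompt."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def proximity(slug_or_title, prompt):
    """Jaccard overlap between page-identifier tokens and prompt tokens."""
    a, b = tokens(slug_or_title), tokens(prompt)
    return len(a & b) / len(a | b) if a | b else 0.0

score = proximity("best-mirrorless-cameras-2025",
                  "what are the best mirrorless cameras")
print(round(score, 2))
```

A production version would use embeddings rather than raw token overlap, but even this toy metric captures why a slug that restates the question outperforms a generic one.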

### **Information Retrieval Cost (IRC)**

**IRC** is defined as the **token cost for an LLM to extract the necessary information** from content. Suski introduced this as “**the new crawl budget**” for AI optimization.

- **Goal:** Provide **maximum information with minimal tokens**.
- **Model-Specific Priorities:** Different LLMs prioritize content structure differently:
  - **ChatGPT:** Values **early query confirmation** but largely **ignores author credibility** (E-E-A-T).
  - **Google’s AI Mode:** Prioritizes **early answers**, **progression from general to specific**, and **author credibility** (E-E-A-T).
  - **Perplexity & AI Overview:** Identified as most susceptible to content manipulation due to broad sensitivity to content-driven signals.
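The talk defines IRC conceptually rather than as a formula. One simple proxy (my construction, using whitespace splitting as a stand-in for a real BPE tokenizer) is the number of tokens a model must read before reaching the answer, alongside total token cost:

```python
def irc_proxy(text, answer_phrase):
    """Rough IRC proxy: tokens a reader must consume before the answer
    appears, plus total token cost. Whitespace splitting stands in for
    a real tokenizer here."""
    toks = text.split()
    joined = " ".join(toks).lower()
    pos = joined.find(answer_phrase.lower())
    tokens_before = len(joined[:pos].split()) if pos >= 0 else len(toks)
    return {"total_tokens": len(toks), "tokens_before_answer": tokens_before}

padded = ("In this article we will explore many exciting things before "
          "finally noting that the aperture controls depth of field.")
direct = "The aperture controls depth of field."

print(irc_proxy(padded, "aperture controls depth of field"))
print(irc_proxy(direct, "aperture controls depth of field"))
```

The padded version forces the model through 14 tokens of filler before the answer; the direct version delivers it after one. Lower on both counts means cheaper retrieval.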

---

## **Action Items**

Based on the findings, Michał Suski recommends the following actionable steps:

- **Prioritize Topical Coverage:** Focus on covering contextual facts and entities to hit **~74%** coverage on top pages.
- **Optimize for LLMs (IRC):** Create content with a low **information retrieval cost** by being **direct** and **structured**, minimizing engaging filler (jokes, sidetracks). Provide maximum information with minimal tokens.
- **Evolve Keyword Strategy:** Move away from exact keyword density toward using **variations** in content, especially in **H3/H4 subheadings**.
- **Implement Bolding:** **Bold keywords and variations** as a low-effort optimization with positive correlation.
- **Domain Strategy:** Use **Exact Match Domains (EMD)** or **Partial Match Domains (PMD)** for new SEO projects.
- **Audit Schema:** Avoid going “crazy with schema” by limiting the use of different, irrelevant schema types.
- **Core Source Strategy:** Identify stable **core sources** before pursuing mentions or backlinks, as these are the most valuable citation targets.

---

## My Take: What This Means for Solo Publishers

I attended Michał Suski’s talk at Chiang Mai SEO 2025, and this was one of the presentations that actually changed how I think about content strategy. Not because it introduced completely new concepts—topical authority has been discussed for years—but because Surfer’s data finally quantifies what many of us suspected.

### Who Should Care About This

**Solo publishers and small teams:** This is directly applicable. You don’t need enterprise tools or a team of 20 to implement topical coverage strategies. In fact, smaller sites can pivot faster than large organizations.

**Affiliate marketers:** The shift from exact-match keywords to entity coverage explains why thin “best X” posts are dying. Your roundups need to comprehensively cover the product category, not just list items with affiliate links.

**NOT primarily for agencies:** While agencies can use this data, the real advantage goes to operators who can execute quickly without client approval cycles.

### What I’m Actually Implementing

**1. Topical Coverage Audits**

The 74% entity coverage benchmark is now my target for pillar content. You can measure this with NLP entity extraction tools, competitor content analysis, and “People Also Ask” clustering.

For my photography sites, this means every camera roundup needs to cover: sensor specs, autofocus systems, video capabilities, ergonomics, price positioning, and competitor comparisons—not just “top 5 cameras.”
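A coverage audit like this reduces to set arithmetic over extracted entities. A minimal sketch (the entity lists are illustrative; in practice you'd build the benchmark from an NLP entity extractor run over top-ranking competitors and PAA clusters):

```python
def coverage(page_entities, benchmark_entities):
    """Share of benchmark entities a page covers, plus what's missing."""
    page = {e.lower() for e in page_entities}
    bench = {e.lower() for e in benchmark_entities}
    return len(page & bench) / len(bench), sorted(bench - page)

# Illustrative benchmark for a camera-roundup topic.
benchmark = ["sensor size", "autofocus", "video specs", "ergonomics",
             "price", "battery life", "lens ecosystem", "weather sealing"]
page = ["sensor size", "autofocus", "price", "video specs",
        "ergonomics", "battery life"]

score, missing = coverage(page, benchmark)
print(f"{score:.0%} covered; missing: {missing}")
```

Here the page covers 6 of 8 benchmark entities (75%), just above the 74% target, and the missing list becomes the content brief for the next revision.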

**2. Information Retrieval Cost (IRC) Optimization**

This is the actionable insight I’m most excited about. Michał’s concept of IRC as “the new crawl budget” for AI is brilliant. Here’s how I’m applying it:

- **Front-load answers:** The first 100 words should directly address the query. No “In this article, we’ll explore…” preambles.
- **Structure for extraction:** Use clear H2/H3 hierarchy that mirrors how someone would ask follow-up questions.
- **Kill the filler:** Those “engaging” introductions and jokes? They increase token cost for LLMs without adding information value.
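The front-loading rule can be linted automatically before publishing. A toy check (the 100-word window is the heuristic above; the substring scoring is my own simplification):

```python
def front_loaded(article_text, query, window=100):
    """True if every query term appears within the first `window` words."""
    head = " ".join(article_text.lower().split()[:window])
    return all(term in head for term in query.lower().split())

intro = ("A mirrorless camera's autofocus system is the single most "
         "important spec for wildlife photography.")

print(front_loaded(intro, "mirrorless autofocus wildlife"))  # True
print(front_loaded(intro, "tripod"))                         # False
```

Running this against target queries flags intros that bury the answer past the opening window.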

For a deeper technical framework on structuring content for LLM extraction, see [The Technical Framework for LLM Content Optimization](/llm-content-optimization/).

**3. The EMD/PMD Signal**

The strong correlation for Exact Match and Partial Match Domains is interesting but requires context. This works for *new projects*—I wouldn’t recommend rebranding an established site just for this signal. But for future microsites or programmatic plays, domain naming matters more than current wisdom suggests.

### What I’m Ignoring

**Schema proliferation:** Michał’s finding that excessive schema types correlate *negatively* with rankings validates my minimalist approach. I use Article, Product, and FAQ schema where genuinely applicable—not every schema type just because a plugin suggests it.

**AI content detection concerns:** The zero correlation between AI detection and rankings is reassuring but not surprising. Google has said they care about quality, not provenance. I still blend AI drafts with human expertise, but I’m not paranoid about detection tools.

### The Bigger Picture

What Suski is really describing is the transition from “ranking documents” to “ranking information.” Google’s AI systems (and ChatGPT, Perplexity, etc.) don’t care about your beautiful website design or clever copywriting. They’re extracting facts and entities to synthesize answers.

This is good news for publishers who focus on genuine expertise. It’s bad news for those who’ve built businesses on keyword-stuffed content with minimal actual information.

To audit and measure E-E-A-T signals systematically, [Tom Winter’s methodology](/tom-winter-auditing-measuring-eeat-with-llms/) provides a practical LLM-based approach. And for building the topical architecture that supports comprehensive coverage, see [Koray Tugberk Gubur’s topical map strategies](/koray-tugberk-guburs-topical-maps-in-seo/).
