Training data vs RAG: why AI tools behave differently

What is training data?

Training data is the corpus of text used to build an AI model. When you ask ChatGPT or Claude a question in default mode and it answers without citing a URL, it is drawing on patterns learned from that corpus. The model was trained on a snapshot of the public web (plus books, papers and licensed sources) up to a cut-off date; the weights are then frozen until the next training run.

Two consequences matter for GEO. First, the model only "knows" what was true at the cut-off, so a new pricing page won't be reflected. Second, the model does not reliably retain which URL a claim came from; it remembers claims, not sources. Citations from training-data paths are usually impressionistic - "Stripe is a leader in payments" - rather than linked.

What is retrieval (RAG)?

Retrieval-Augmented Generation, or RAG, is an architecture in which the AI tool first runs a search, fetches a handful of live web pages, then asks the language model to synthesise an answer using those pages as context. Perplexity, ChatGPT search mode, Google AI Overviews and Gemini all use this pattern.

RAG answers come with citations because the model is reading documents at query time and can point back to them. If your page isn't in the top 5-10 retrieved documents, you won't be cited. This makes classical SEO ranking a strong, though not perfect, predictor of RAG visibility.
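The retrieve-then-synthesise flow can be sketched in a few lines of Python. This is a toy illustration only, not how any named tool works: the `Page` class, the word-overlap `score` function, the example URLs and the prompt wording are all assumptions made for the sketch, and the final call to a language model is left out.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def score(query: str, page: Page) -> int:
    """Toy relevance score: number of distinct query words found in the page."""
    query_words = set(query.lower().split())
    page_words = set(page.text.lower().split())
    return len(query_words & page_words)


def retrieve(query: str, index: list[Page], k: int = 2) -> list[Page]:
    """Return the k highest-scoring pages; pages below the cut are never read."""
    ranked = sorted(index, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pages: list[Page]) -> str:
    """Number each retrieved page so the model can cite it as [1], [2], ..."""
    sources = "\n".join(
        f"[{i}] {p.url}\n{p.text}" for i, p in enumerate(pages, start=1)
    )
    return (
        "Answer using only these sources and cite them.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )


# Hypothetical three-page index standing in for a live search engine.
index = [
    Page("https://example.com/pricing", "acme pricing plans start at 10 dollars per month"),
    Page("https://example.com/blog", "our team went hiking last week"),
    Page("https://example.org/review", "acme pricing review compared with rivals"),
]

top = retrieve("acme pricing plans", index, k=2)
prompt = build_prompt("acme pricing plans", top)
```

The point of the sketch is the cut at `k`: the blog page never reaches the prompt, so the model cannot cite it however good the page is.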

Path A: training data

Prompt → model weights (frozen) → answer.

The model already "knows" about you because it read pages mentioning your product during training. Updates to your site today are invisible to this path until a later training run includes them.

Path B: retrieval (RAG)

Prompt → live search → top documents → model reads them → answer with citations.

The model fetches fresh content at query time. Your live site is fair game. SEO ranking, crawlability and schema directly determine whether you make the document set.

Every AI answer is built on one of these two paths - or a hybrid of both.

Why do AI tools respond to GEO tactics differently?

Because each tool sits in a different place on the training-vs-retrieval spectrum. A tactic aimed at one path can be invisible to the other.

FAQ schema mostly helps retrieval-based tools (Perplexity, AI Overviews) because schema is parsed at fetch time. A training-data path doesn't see it.
A new comparison page can rank in Perplexity within days, but won't appear in default ChatGPT until a later training run includes it.
Earning a Reddit mention is high-value for training-data tools - those models were trained on Reddit - and of modest value for retrieval tools, which only see whatever Reddit thread ranks today.
Updating an old high-traffic page helps retrieval immediately and trains the next model generation. It's a two-for-one.
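As an illustration of the FAQ schema mentioned in the first point above, here is a minimal Python sketch that builds schema.org `FAQPage` markup as JSON-LD. The `FAQPage`, `Question` and `Answer` types come from schema.org; the `faq_jsonld` helper and the sample question are hypothetical, and whether a given AI tool actually reads this markup is not something the sketch can guarantee.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage markup from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


# Hypothetical question and answer, for illustration only.
markup = faq_jsonld([("What does Acme cost?", "Plans start at 10 dollars per month.")])

# JSON-LD is normally embedded in the page inside a script tag of this type.
script_tag = f'<script type="application/ld+json">\n{markup}\n</script>'
```

The markup only exists in the served page, which is why it can matter to a tool that fetches the page at query time.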

| Tool | Where answers come from | GEO move that works |
| --- | --- | --- |
| ChatGPT (default) | Mostly training data, with web tool calls on demand | Earn mentions on sites that training corpora draw on. Update old high-authority pages. |
| ChatGPT (search mode) | Real-time retrieval, then synthesis | Win the SERP for the underlying query. Crawlable, citable pages. |
| Perplexity | Real-time retrieval first, training data second | Get into the top-10 results for the prompt. Fast, schema-rich pages. |
| Claude (default) | Training data, no live web by default | Long-form, well-cited content on authoritative domains. |
| Google AI Overviews | Live index + LLM synthesis | Classic SEO ranking is the prerequisite for being summarised. |
| Gemini | Real-time Google retrieval + training data | Same as AI Overviews, plus YouTube and Google ecosystem signals. |

How should this change your GEO strategy?

Split your roadmap into two streams. The retrieval stream is the fast-feedback half - crawlability, schema, SEO ranking, FAQ pages, fresh content. The training stream is the slow-burn half - building durable authority on the sites and platforms that the next round of training will read.

For most B2B SaaS teams the right ratio is 70% retrieval, 30% training in the first year. The retrieval work moves measurable AI citations within weeks; the training work moves how the default-mode chatbots talk about you twelve months from now.

Our overview on off-site authority for GEO and the guide to getting listed on the sites AI trusts cover the training-data side. The technical foundation and on-site optimisation pillars cover the retrieval side.

How do you tell which path a tool used?

Look at the answer. If there are inline citations to live URLs, it's retrieval. If the answer confidently asserts something with no sources, it's training data (or training-data inference on top of weak retrieval). When you run prompts as part of a free GEO audit, label each answer with its path. The path tells you which lever to pull next.
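The labelling rule above can be approximated with a simple heuristic. The sketch below assumes you have each answer as plain text with any cited URLs written out in full; tools that show citations only as numbered footnotes or side panels would need their export format handled separately. The `label_path` name and the sample answers are hypothetical.

```python
import re

# Matches http/https URLs written out in the answer text.
URL_PATTERN = re.compile(r"https?://\S+")


def label_path(answer: str) -> str:
    """Label an answer 'retrieval' if it cites at least one URL, else 'training'."""
    return "retrieval" if URL_PATTERN.search(answer) else "training"


# Hypothetical answers collected during an audit.
uncited = "Acme is a leader in payments."
cited = "Acme plans start at 10 dollars per month (https://example.com/pricing)."

labels = [label_path(uncited), label_path(cited)]
```

A heuristic like this cannot detect the mixed case (training-data inference on top of weak retrieval), so borderline answers are still worth a manual look.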

What mistakes does this concept fix?

Expecting an overnight ChatGPT lift from a new page (training data doesn't refresh that fast).
Ignoring Reddit and G2 because they "aren't your site" (they're prime training data).
Writing off Perplexity tactics as irrelevant to ChatGPT (ChatGPT search mode uses the same retrieve-then-synthesise architecture).
Treating all AI traffic as one channel in reporting. Split by path; the diagnoses are different.
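Splitting a report by path can be as simple as grouping audit observations before computing a rate. The sketch below uses made-up data and a hypothetical `mention_rate_by_path` helper; the tool names and path labels follow the article, while the true/false results are invented purely for illustration.

```python
from collections import Counter

# Hypothetical audit log: (tool, path label, was the brand mentioned?)
observations = [
    ("Perplexity", "retrieval", True),
    ("ChatGPT (search mode)", "retrieval", False),
    ("ChatGPT (default)", "training", True),
    ("Claude (default)", "training", False),
    ("Google AI Overviews", "retrieval", True),
]


def mention_rate_by_path(rows):
    """Return the share of answers mentioning the brand, keyed by path."""
    totals = Counter()
    mentions = Counter()
    for _tool, path, mentioned in rows:
        totals[path] += 1
        mentions[path] += int(mentioned)
    return {path: mentions[path] / totals[path] for path in totals}


rates = mention_rate_by_path(observations)
```

Reporting the two rates side by side keeps a weak training-path number from being hidden by a strong retrieval-path number, or the reverse.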