Information Retrieval Part 2: How to Get Into Model Training Data
This is the complete guide to training data. How you should think about it, how it works and how to become a known entity in a model's 'memory.'
There has never been a more important time in your career to spend time learning and understanding how this stuff works. Not because AI search differs drastically from traditional search. But because everyone else thinks it does.
Every C-suite in the country is desperate to get this right. Decision makers need to feel confident that you and I are the right people to lead us into the new frontier.
We need to learn the fundamentals of information retrieval. Even if your business shouldn’t be doing anything differently.
Here, that starts with understanding the basics of model training data. What is it, how does it work and - crucially - how do I get in it.
TL;DR
AI is the product of its training data. The quality (and quantity) of the data a model trains on is key to its success.
The web-sourced AI data commons is rapidly becoming more restricted. This will skew data representativeness and freshness, and undermine the scaling laws models rely on.
The more consistent, accurate brand mentions you have that appear in training data, the less ambiguous you are.
Quality SEO, paired with better product and traditional marketing, will improve your appearance in the training data and, eventually, in real-time RAG/retrieval.
What is training data?
Training data is the foundational dataset used in training LLMs to predict the most appropriate next word, sentence and answer. The data can be labelled, where models are taught the right answer, or unlabelled, where they have to figure it out for themselves.
Without high-quality training data, models are completely fucking useless.
From semi-libellous tweets, to videos of cats and great works of art and literature that stand the test of time, nothing is off limits. Nothing. It’s not just words either. Speech-to-text models need to be trained to respond to different speech patterns and accents. Emotions even.
How does it work?
The models don’t memorise, they compress. LLMs process billions of datapoints, adjusting internal weights through a mechanism known as backpropagation.
If the next word predicted in a string of training examples is correct, it moves on. If not, it gets the machine equivalent of Pavlovian conditioning.
Bopped on the head with a stick or a ‘good boy.’
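Backpropagation itself won't fit in a few lines, but the core objective — look at what came before, guess the next word, score the guess — can be sketched with a toy counting model. To be clear, this is just frequency counting, not a neural network, and the corpus is made up for illustration:

```python
from collections import defaultdict, Counter

# A toy corpus standing in for 'training data' (illustrative only).
corpus = "the dog sat on the mat the dog chased the cat".split()

# Count which word follows which: a crude stand-in for learned weights.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed next word."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))  # 'dog' -- it follows 'the' most often in the corpus
print(predict_next("cat"))  # '<unknown>' -- 'cat' never precedes anything here
```

A real LLM swaps the counting for billions of weights nudged by backpropagation, but the job is the same: make the observed next token more likely next time.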
The model is then able to vectorise. Creating a map of associations by term, phrase and sentence.
Converting text into numerical vectors, the simplest version being a Bag of Words count (see the sketch below)
Capturing semantic meaning of words and sentences, preserving wider context and meaning (word and sentence embeddings)
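To make the Bag of Words idea concrete, here's a minimal sketch (pure Python, no ML library, sentences invented for illustration) of turning text into count vectors over a shared vocabulary:

```python
from collections import Counter

sentences = [
    "quality training data builds quality models",
    "synthetic data cannot replace quality data",
]

# Build a shared vocabulary across all sentences.
vocab = sorted({word for sentence in sentences for word in sentence.split()})

def bag_of_words(sentence: str) -> list[int]:
    """Count how often each vocabulary word appears in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

for sentence in sentences:
    print(bag_of_words(sentence))
# Each sentence is now a numerical vector. Embeddings do the same job,
# but capture meaning and context rather than raw counts.
```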
Rules and nuances are encoded as a set of semantic relationships - this is known as parametric memory. ‘Knowledge’ baked directly into the architecture. The more refined a model’s knowledge on a topic, the less it has to use a form of grounding to verify its twaddle.
Worth noting that models with a high parametric memory are faster at retrieving accurate information (if available), but have a static knowledge base that goes stale past the training cutoff.
RAG and live web search are examples of a model using non-parametric memory. Infinite scale, but slower. Much better for news and when results require grounding.
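A very rough sketch of that trade-off, if it helps. The functions `generate_from_weights` and `search_the_web` are hypothetical stand-ins for the model's forward pass and a retrieval step, not real APIs:

```python
def generate_from_weights(question: str, context: str = "") -> str:
    # Hypothetical stand-in for the model's forward pass (parametric memory).
    return f"answer to {question!r} (grounded: {bool(context)})"

def search_the_web(question: str) -> str:
    # Hypothetical stand-in for retrieval (search index, vector store, live web).
    return "retrieved documents about " + question

def answer(question: str, knows_topic_well: bool) -> str:
    if knows_topic_well:
        # Parametric memory: fast, baked into the weights, frozen at the cutoff.
        return generate_from_weights(question)
    # Non-parametric memory (RAG / live search): slower, but fresh and grounded.
    return generate_from_weights(question, context=search_the_web(question))

print(answer("who founded the company?", knows_topic_well=True))
print(answer("what happened this morning?", knows_topic_well=False))
```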
Crafting better quality algorithms
When it comes to the training data, crafting better quality algorithms relies on three elements:
Quality
Quantity
Removal of bias
Quality of data matters for obvious reasons. If you train a model on poorly labelled, solely synthetic data, you can't expect its performance to reflect real-world problems or complexities.
Quantity of data is a problem too. Mainly because these companies have eaten everything in sight and done a runner on the bill.
Leveraging synthetic data to solve issues of scale isn’t necessarily a problem. The days of accessing high-quality, free-to-air content on the internet are, for these guys, largely gone. For two main reasons:
Unless you want diabolical racism, mean comments, conspiracy theories and plagiarised bullshit, I’m not sure the internet is your guy anymore.
If they respect companies’ robots.txt directives, at least. Eight in ten of the world’s biggest news websites now block AI training bots. I don’t know how effective their CDN-level blocking is, but this makes quality training data harder to come by.
Bias and diversity (or lack of it) is a huge problem too. People have their own inherent biases. Even the ones building these models.
Shocking I know…
If models are fed data unfairly weighted towards certain characteristics or brands, it can reinforce societal issues. It can further discrimination.
Remember, LLMs are neither intelligent nor databases of facts. They analyse patterns from ingested data. Billions or trillions of numerical weights that determine the next word (token) following another in any given context.
How is training data collected?
Like every good SEO, it depends.
If you’re an idiot and you built an AI model explicitly to identify pictures of dogs, you need pictures of dogs in every conceivable position. Every type of dog. Every emotion the pooch shows. You need to create or procure a dataset of millions, maybe billions, of canine images.
Then it must be cleaned. Think of it as structuring data into a consistent format. In said dog scenario, maybe a feline friend nefariously added pictures of cats dressed up as dogs to fuck you around. Those must be identified.
Then labelled (for supervised learning). Data labelling (with some human annotation) ensures we have a sentient being somewhere in the loop. Hopefully an expert to add relevant labels to a tiny portion of the data so that a model can learn. For example: ‘A dachshund sitting on a box looking melancholic.’
Pre-processing. Responding to issues like cats masquerading as dogs. Ensuring you minimise potential biases in the dataset like specific dog breeds being mentioned far more frequently than others.
Partitioned. A portion of the data is kept back so the model can’t simply memorise the outputs. This is the final validation stage. Kind of like a blind test.
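A toy version of that clean → partition flow. The filenames, labels and flagged list are entirely made up for illustration:

```python
import random

# Hypothetical raw records straight from collection. Two are mislabelled
# on purpose (the cats dressed up as dogs).
raw_data = [
    {"image": "dachshund_on_box.jpg", "label": "dog"},
    {"image": "tabby_in_dog_costume.jpg", "label": "dog"},   # actually a cat
    {"image": "golden_retriever.jpg", "label": "dog"},
    {"image": "border_collie.jpg", "label": "dog"},
    {"image": "sphynx_cat.jpg", "label": "dog"},             # actually a cat
]

# Cleaning: drop records a reviewer has flagged as wrong.
flagged = {"tabby_in_dog_costume.jpg", "sphynx_cat.jpg"}
clean_data = [record for record in raw_data if record["image"] not in flagged]

# Partitioning: hold back a slice the model never trains on, so you can
# check it generalises rather than memorises.
random.shuffle(clean_data)
split = int(len(clean_data) * 0.8)
train_set, validation_set = clean_data[:split], clean_data[split:]

print(len(train_set), "training examples,", len(validation_set), "held back")
```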
This is, obviously, expensive and time consuming. It’s not feasible to take up hundreds of thousands of hours of expertise from real people in fields that matter.
Think of this. You’ve just broken your arm and you’re waiting in A&E for six hours. You finally get seen, only to be told you had to wait because all the doctors have been processing images for OpenAI’s new model.
“Yes sir, I know you’re in excruciating pain, but I’ve got a hell of a lot of sad looking dogs to label.”
Data labelling is a time-consuming and tedious process. To combat this, many businesses hire large teams of human data annotators (AKA humans in the loop, you know, actual experts), assisted by automated weak labelling models. In supervised learning, those weak models handle the initial labelling pass.
For perspective, one hour of video data can take humans up to 800 hours to annotate.
Micro models
So companies build micro-models. Models that don’t require as much training or data to run. The humans in the loop (I’m sure they have names) can start training micro-models after annotating a few examples.
The models learn. They train themselves.
So over time, human input decreases and we’re only needed to validate the outputs. And to make sure the models aren’t trying to undress children, celebrities and your coworkers on the internet.
But who cares about that in the face of ‘progress.’
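The human-in-the-loop pattern looks roughly like this: a handful of human labels, a small model pre-labels the rest, and only the low-confidence calls come back to a person. The keyword rules and confidence threshold below are made-up placeholders, not how a real micro-model works:

```python
# A few human-annotated examples (expensive, slow, accurate).
human_labelled = [
    ("sad looking dachshund on a box", "dog"),
    ("tabby cat asleep on a keyboard", "cat"),
]

# Plenty of unlabelled examples (cheap, plentiful).
unlabelled = [
    "melancholic dachshund in the rain",
    "kitten chasing a laser pointer",
    "blurry photo of something furry",
]

def micro_model(text: str) -> tuple[str, float]:
    """Stand-in for a small model trained on the human labels.
    Returns (predicted_label, confidence)."""
    if "dachshund" in text:
        return "dog", 0.95
    if "kitten" in text or "cat" in text:
        return "cat", 0.90
    return "dog", 0.40  # not at all sure

auto_labelled, needs_human_review = [], []
for text in unlabelled:
    label, confidence = micro_model(text)
    if confidence >= 0.8:
        auto_labelled.append((text, label))   # the model handles it
    else:
        needs_human_review.append(text)       # a human validates the edge case

print(auto_labelled)
print(needs_human_review)  # only the genuinely ambiguous stuff reaches a person
```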
Types of training data
Training data is usually categorised by how much guidance is provided or required (supervision) and the role it plays in the model’s lifecycle (function).
Ideally a model is largely trained on real data.
Once a model is ready, it can be trained and fine-tuned on synthetic data. But synthetic data alone is unlikely to create high-quality models.
Supervised (or labelled): where every input is annotated with the ‘right’ answer.
Unsupervised (or unlabelled): work it out yourself robots, I’m off for a beer.
Semi-supervised: where a small amount of the data is properly labelled and the model ‘understands’ the rules. More ‘I’ll have a beer in the office.’
RLHF (Reinforcement Learning from Human Feedback): humans are shown two options and asked to pick the ‘right’ one (preference data). Or a person demonstrates the task at hand for the model to imitate (demonstration data). See the example record below.
Pre-training and fine-tuning data: Massive datasets allow for broad information acquisition and fine-tuning is used to turn the model into a category expert.
Multi-modal: images, videos, text etc.
Then there’s what’s known as edge case data. Data designed to ‘trick’ the model to make it more robust.
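To make the preference and demonstration data mentioned above concrete, a single training record might look something like this (both examples entirely invented):

```python
# Preference data: a human picked which of two candidate answers they preferred.
preference_example = {
    "prompt": "Summarise why training data quality matters.",
    "chosen": "Models learn patterns from their data, so noisy or biased data "
              "produces noisy or biased predictions.",
    "rejected": "Training data doesn't really matter, models just know things.",
}

# Demonstration data: a human shows the task being done for the model to imitate.
demonstration_example = {
    "prompt": "Label this image caption as 'dog' or 'cat'.",
    "input": "A dachshund sitting on a box looking melancholic.",
    "completion": "dog",
}
```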
In light of the, let’s call it ‘burgeoning’, market for AI training data, there are obvious issues of ‘fair use’ surrounding it.
“We find that 23% of supervised training datasets are published under research or non-commercial licenses.”
So fucking pay people.
The spectrum of supervision
In supervised learning, the AI algorithm is given labelled data. These labels define the outputs and are fundamental to the algorithm being able to improve over time on its own.
Let’s say you’re training a model to identify colours. There are dozens of shades of each colour. Hundreds even. So whilst this is an easy example, it requires accurate labelling. The problem with accurate labelling is that it’s time-consuming and potentially costly.
In unsupervised learning, the AI model is given unlabeled data. You chuck millions of rows, images or videos at a machine, sit down for a coffee and then kick it when it hasn’t worked out what to do.
It allows for more exploratory ‘pattern recognition.’ Not learning.
While this approach has obvious drawbacks, it’s incredibly useful at identifying patterns a human might miss. The model can essentially define its own labels and pathway.
Models can and do train themselves and they will find things a human never could. They’ll also miss things. It’s like a driverless car. Driverless cars may have fewer accidents than when a human is in the loop. But when they do crash, we find it far more unpalatable.
It’s the technology that scares us. And rightly so.
Combatting bias
Bias in training data is very real and potentially very damaging. There are three phases:
Origin bias
Development bias
Deployment bias
Origin bias references the validity and fairness of the dataset. Is the data all-encompassing? Is there any obvious systemic, implicit or confirmation bias present?
Development bias includes the features or tenets of the data the model is being trained on. Does algorithmic bias occur because of the training data?
Then we have deployment bias. Where the evaluation and processing of the data leads to flawed outputs and automated/feedback loop bias.
You can really see why we need a human in the loop. And why AI models training on synthetic or inappropriately chosen data would be a fucking disaster.
In healthcare, data collection activities influenced by human bias can lead to the training of algorithms that replicate historical inequalities. Yikes.
Leading to a pretty bleak cycle of reinforcement.
The most frequently used training data sources
Training data sources are wide ranging in both quality and structure. You’ve got the open web, which is obviously a bit mental. X if you want to train something to be racist. Reddit if you’re looking for the Incel Bot 5000.
Or highly structured academic and literary repositories if you want to build something, you know, good… Obviously then you have to pay something.
Common Crawl
Common Crawl is a public web repository - a free, open-source storehouse of historical and current web crawl data available to pretty much anyone on the internet.
The full Common Crawl Web Graph currently contains around 607 million domain records across all datasets, with each monthly release covering 94 to 163 million domains.
In the Mozilla Foundation's 2024 report Training Data for the Price of a Sandwich, 64% of the 47 LLMs analysed used at least one filtered version of Common Crawl data.
If you aren’t in the training data, you’re very unlikely to be cited and referenced. The Common Crawl Index Server lets you search any URL pattern against their crawl archives, and Metehan’s Web Graph helps you see how ‘centred’ you are.
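If you want to check it yourself, the Index Server exposes a queryable API. A minimal sketch — the crawl label and domain are placeholders, so swap in a current crawl ID (listed at index.commoncrawl.org) and your own site:

```python
import json
import requests

# Placeholders: substitute a current crawl ID and your own domain.
CRAWL_ID = "CC-MAIN-2024-33"
DOMAIN = "example.com"

response = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": f"{DOMAIN}/*", "output": "json"},
    timeout=30,
)

# The index returns one JSON record per line for every captured URL
# (or a single error object if there are no captures).
records = [json.loads(line) for line in response.text.splitlines() if line.strip()]
print(f"{len(records)} records for {DOMAIN} in {CRAWL_ID}")
for record in records[:5]:
    print(record.get("url"), record.get("status"))
```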
Wikipedia (and Wikidata)
The default English Wikipedia dataset contains 19.88 GB of complete articles that help with language modeling tasks. And Wikidata is an enormous, incredibly comprehensive knowledge graph. Immensely structured data.
While representing only a small percentage of the total tokens, Wikipedia is perhaps the most influential source for entity resolution and factual consensus. It is one of the most factually accurate, up-to-date and well structured repositories of content in existence.
Some of the biggest, shittiest guys have just signed deals with Wikipedia.
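Checking whether your brand exists as a Wikidata entity is scriptable too. A sketch against the public SPARQL endpoint, with ‘Example Brand’ as a placeholder label:

```python
import requests

BRAND_LABEL = "Example Brand"  # placeholder: swap in your own brand name

query = """
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item rdfs:label "%s"@en .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
""" % BRAND_LABEL

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "entity-check-sketch/0.1 (you@example.com)"},
    timeout=30,
)

for result in response.json()["results"]["bindings"]:
    description = result.get("itemDescription", {}).get("value", "no description")
    print(result["item"]["value"], "-", description)

# No results? That's a gap worth closing before worrying about anything fancier.
```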
Publishers
OpenAI, Google (Gemini) and the rest have multi-million-dollar licensing deals with a number of publishers.
News Corp (WSJ, New York Post) signed a $250M+ deal in 2024
The list goes on, but only for a bit… And not recently. I’ve heard things have clammed shut. Which given the state of their finances may not be surprising.
Media & libraries
This is mainly for multi-modal content training. Shutterstock (images/video), Getty Images (which has a deal with Perplexity) and Disney (a 2026 partner for the Sora video platform) provide the visual grounding for multi-modal models.
As part of this three-year licensing agreement with Disney, Sora will be able to generate short, user-prompted social videos based on Disney characters.
As part of the agreement, Disney will make a $1 billion equity investment in OpenAI, and receive warrants to purchase additional equity.
Books
BookCorpus turned scraped data of 11,000 unpublished books into a 985 million-word dataset.
We cannot write books fast enough for models to continually learn on. It’s part of the soon to happen model collapse.
Code repositories
Coding has become one of the most influential and valuable features of LLMs. Tools like Cursor or Claude Code are incredible. GitHub and Stack Overflow data have built these models.
They’ve built the vibe-engineering revolution.
Public web data
Diverse (but relevant) web data results in faster convergence during training, which in turn reduces computational requirements. It’s dynamic. Ever-changing. But, unfortunately, a bit fucking nuts and messy.
But if you need vast swathes of data, maybe in real-time, then public web data is the way forward. Ditto for real opinions and reviews of products and services. Public web data, review platforms, UGC and social media sites are great.
Why models aren’t getting (much) better
While there’s no shortage of data in the world, most of it is unlabelled and thus can’t actually be used in supervised machine learning models. Every incorrect label has a negative impact on a model’s performance.
According to most estimates, we’re only a few years away from running out of quality data. Inevitably, this will lead to a time when those genAI tools start consuming their own shit.
This is a known problem that will cause model collapse. If I was eating my own shit, I’d probably want out too.
They are being blocked by companies who do not want their data used pro bono to train the models.
Robots.txt directives (a request, not something directly enforceable), CDN-level blocking and terms of service pages have been updated to tell these guys to get to fuck.
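You can check how any site treats the training bots with the standard library’s robots.txt parser. A quick sketch — the domain is a placeholder, and GPTBot, CCBot, Google-Extended and ClaudeBot are the commonly blocked training/crawling user agents:

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # placeholder: any domain you want to inspect

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

# Commonly blocked AI training / crawling user agents.
for bot in ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]:
    allowed = parser.can_fetch(bot, f"{SITE}/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Remember this only reads the stated directive; it tells you nothing about CDN-level enforcement.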
They consume data quicker than we can produce it.
Frankly, as more publishers and websites are forced into paywalling (a smart business decision), the quality of these models only gets worse.
So, how do you get in the training data?
There are two obvious approaches I can think of.
To identify the seed data sets of models that matter and find ways into them
To forgo the specifics and just do great SEO and wider marketing. Make a tangible impact in your industry
I can see pros and cons to both. Finding ways into specific models is probably highly unnecessary for most brands. To me this smells more like grey hat SEO. Most brands will be better off just doing some really fucking good marketing and getting shared, cited and you know, talked about.
These models are not trained on directly up-to-date data. This is important because you cannot retroactively get into a specific model’s training data. You have to plan ahead.
If you’re an individual, you should be:
Creating and sharing content
Going on podcasts
Attending industry events
Sharing other people’s content
Doing webinars
Getting yourself in front of relevant publishers, publications and people
There are some pretty obvious sources of highly structured data that AI companies have paid for in recent times. I know, they’ve actually paid for it. I don’t know what the guys at Reddit and Wikipedia had to do to get money from these guys and maybe I don’t want to.
How can I tell what datasets models use?
Everyone has become a lot more closed off about what they do and don’t use for training data. I suspect this is both legally and financially motivated. So you’ll need to do some digging.
And there are some massive ‘open source’ datasets I suspect they all use:
Common Crawl
Wikipedia
Wikidata
Coding repositories
Fortunately most deals are public and it’s safe to assume that models use data from these platforms.
Google has a partnership with Reddit and access to an insane amount of transcripts from YouTube. They almost certainly have more valuable, well structured data at their fingertips than any other company.
Grok trained almost exclusively on real-time data from X. Hence why it acts like a pre-pubescent school shooter and undresses everyone.
Worth noting that AI companies use third party vendors. Factories where data is scraped, cleaned and structured to create supervised datasets. Scale AI is the data engine that the big players use. Bright Data specialise in web data collection.
A checklist
OK, so we’re trying to feature in parametric memory. To appear in an LLM’s training data so the model recognises you and you’re more likely to be used for RAG/retrieval. That means we need to:
Manage the multi-bot ecosystem of training, indexing and browsing bots
Entity optimisation. Well structured, well connected content, consistent NAPs, sameAs schema properties and Knowledge Graph presence. In Google and Wikidata. (See the schema sketch after this checklist.)
Make sure your content is rendered on the server side. Google has become very adept at rendering client-side content; bots like GPTBot only see the raw HTML response. JS is still clunky.
Well structured, machine readable content in relevant formats. Tables, lists, properly structured semantic HTML.
Get. Yourself. Out. There. Share your stuff. Make noise.
Be ultra, ultra clear on your website about who you are. Answer the relevant questions. Own your entities.
You have to balance direct associations (what you say) with semantic association (what others say about you). Make your brand the obvious next word.
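On the entity optimisation point, here’s a minimal sketch of the kind of Organization markup with sameAs links that makes you less ambiguous. Every name, URL and ID below is a placeholder to swap for your own:

```python
import json

# Placeholder organisation details: swap in your own.
organisation_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Brand Ltd",
    "url": "https://www.example.com",
    "logo": "https://www.example.com/logo.png",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",      # your Wikidata entity
        "https://en.wikipedia.org/wiki/Example_Brand",  # if you have one
        "https://www.linkedin.com/company/example-brand",
    ],
}

# Drop this into a <script type="application/ld+json"> tag, rendered
# server-side, so every bot sees it in the raw HTML response.
print(json.dumps(organisation_schema, indent=2))
```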
Modern SEO, with better marketing.