WHAT TRAINING AN AI ON 5,000 IMAGES ACTUALLY TEACHES YOU

And Why the Data Was Never Really About the Data

A meditation on teaching a machine to see — and what it reveals about how we see ourselves


Right now, somewhere in your dataset, there is an image you labelled wrong — and the model has already learned it.

It doesn't matter how carefully you curated your collection, how many hours you spent in a labelling interface clicking boxes and assigning classes. At some point, your attention flickered. Your definition of a category quietly shifted between image 412 and image 1,847. The lighting fooled you, or the angle, or the simple accumulated fatigue of looking at five thousand of anything.

This is not a confession of failure. It is the central, inescapable lesson of training an image classifier from scratch — not fine-tuning a foundation model, not running someone else's benchmark, but actually building the dataset yourself, image by image, decision by decision. What the process teaches you has almost nothing to do with gradient descent or learning rates or network architecture. It teaches you something far stranger: that the act of labelling the world is the act of deciding what the world means.

And a machine, with perfect fidelity, will learn exactly what you decided.

— ✦ —

I. The Question That Precedes Every Model

Both Answers Are Impossible. Both Must Be True.

Before you write a single line of training code, you must answer a question that sounds trivially simple: what are the categories? How many classes does your problem have, and where exactly do the boundaries between them fall?

This question is a trap. It seems like a preliminary matter — something to resolve in an afternoon before the real work begins. In practice, it is the real work. Everything else is arithmetic.

My first taxonomy had nine classes. By image 800, I had collapsed it to four — not because I had grown lazy or impatient, but because I had discovered, the hard way, that my original categories reflected the way I had been taught to think about these objects, not any pattern that existed in the visual data itself. Two of my categories, which I had confidently distinguished in theory, turned out to be visually indistinguishable in practice. The model, with no access to theory, would have learned only the confusion.

"A taxonomy is a theory of the world written in the language of boxes and labels. When you train a model on your taxonomy, you are not teaching it to see. You are teaching it to reproduce your theory. Those are very different things."

The paradox is this: you cannot begin labelling without a schema, but you cannot build a correct schema without labelling. The categories you choose at the start will be wrong in ways you cannot predict until you have already made hundreds of decisions based on them. The only solution is to treat the schema as a living document — which means accepting that some portion of your early work will always need to be redone.

And yet this is not idle methodology. It is the central open problem of every supervised learning project that touches the real world. The model optimises for whatever signal is in your labels. If your labels are wrong, the optimisation is precise and the result is precisely wrong.
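If I were starting again, I would write the schema down as code from day one: an explicit, versioned document, so that every label can be traced to the definition that was in force when it was assigned. A minimal sketch of what I mean, in Python; the class names, definitions, and boundary rulings below are illustrative placeholders, not my actual taxonomy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabelSchema:
    """A versioned labelling schema: a theory of the world, written down."""
    version: str
    classes: dict[str, str]          # class name -> one-sentence definition
    boundary_notes: list[str] = field(default_factory=list)  # hard-won edge-case rulings


# Illustrative placeholders, not my actual categories.
SCHEMA_V2 = LabelSchema(
    version="2.0",
    classes={
        "subject_clear": "Subject fills most of the frame and is unoccluded.",
        "subject_partial": "Subject present but occluded or partially out of frame.",
        "no_subject": "No subject, or subject unidentifiable at full resolution.",
    },
    boundary_notes=[
        "Reflections of the subject count as subject_partial, not subject_clear.",
        "If two definitions apply, prefer the more specific class.",
    ],
)


def record_label(image_path: str, label: str, schema: LabelSchema) -> dict:
    """Attach the schema version to every label, so later audits know which
    definition was in force when the decision was made."""
    if label not in schema.classes:
        raise ValueError(f"{label!r} is not a class in schema {schema.version}")
    return {"image": image_path, "label": label, "schema_version": schema.version}
```

Writing the boundary rulings down does not stop the schema from changing. It only ensures that when it changes, you know precisely which earlier labels are now suspect.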

II. What the Data Actually Contains

The Observable Training Set Is a Cage, Not a Map

When practitioners talk about dataset size, they usually mean the total number of examples. What they rarely discuss is the effective number — the count of genuinely informative, correctly labelled, distribution-representative examples that the model can actually learn from.

My 5,000 images contained, by my own reckoning, approximately 3,400 examples I was fully confident about. The remaining 1,600 occupied a territory I can only describe as epistemically uncomfortable: images where the correct label depended on assumptions I had not made explicit, on context the image did not contain, on a definition I had not yet stabilised. I labelled them anyway. I had to. But I labelled them with the quiet knowledge that I was making decisions rather than discovering truths.

This is the gap between the dataset and the map. The dataset is the collection you have. The map is the underlying distribution of the phenomenon you are trying to model. The two are never identical, and the distance between them — in coverage, in balance, in representative difficulty — determines the ceiling of your model's real-world performance.

Category | What It Measures | Why It Is Insufficient
Total examples | Raw dataset size | Includes ambiguous, mislabelled, redundant images
Class balance | Distribution across categories | Ignores within-class difficulty variation
Annotation confidence | Labeller certainty per example | Rarely collected; enormously informative
Source diversity | Variety of image origins | Critical for generalisation; rarely tracked
Edge case density | Proportion of hard examples | The factor most predictive of failure modes

What dataset quality metrics typically capture — and what they miss

The uncomfortable truth is that a smaller, better-curated dataset will almost always outperform a larger, carelessly assembled one for any task that involves real-world variation. Five thousand thoughtfully selected and consistently labelled images will teach a model more than fifty thousand gathered from a single source with a single lighting condition and a single angle of view.
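Measuring that difference does not require exotic tooling. If you record a confidence score alongside every label, something I did only informally and too late, a few lines of pandas will tell you how much of each class you actually trust. The file name and column layout below are assumptions about how such a record might look, not a standard format.

```python
import pandas as pd

# Hypothetical layout: one row per labelled image, with a confidence score
# recorded at annotation time (1.0 = certain, 0.5 = a coin flip).
labels = pd.read_csv("labels.csv")   # columns: image_path, label, confidence

CONFIDENCE_FLOOR = 0.8   # below this, a label was a decision, not a discovery

summary = (
    labels.assign(confident=labels["confidence"] >= CONFIDENCE_FLOOR)
          .groupby("label")
          .agg(total=("image_path", "count"), effective=("confident", "sum"))
)
summary["effective_fraction"] = summary["effective"] / summary["total"]
print(summary.sort_values("effective_fraction"))
```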

"We have counted the images. We have not counted what is in them. The difference between those two things may be the entire project."

III. The Number That Should Humble You

The Arithmetic of Inconsistency

Partway through the project — around image 2,300 — I discovered something that forced a reckoning. I had been applying one of my labels inconsistently. My intuitive definition of a borderline category had quietly shifted between week one and week three. The newer labels were, arguably, more defensible. They reflected a more considered understanding of what the category should mean.

But that is not how machine learning works. The model does not care which of your definitions was more considered. It sees the input, it sees the label, and it attempts to learn a function that maps one to the other. If identical or near-identical inputs have received different labels at different points in time, the model learns that identical inputs can produce different outputs. Which is to say: it learns noise.

I relabelled approximately 600 images. It took two days. And in those two days, I understood something about consistency that no textbook had made visceral for me: it is often better to be systematically wrong than randomly right. A model trained on a consistently applied incorrect definition will at least learn a coherent theory. A model trained on inconsistent labels learns that the world is arbitrary — which is the one thing you most need it not to believe.
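What I should have done before relabelling was audit the damage systematically rather than from memory. One cheap way to surface the most literal form of inconsistency is to group visually near-identical images by perceptual hash and flag any group that carries more than one label. A sketch, assuming the same kind of labels file as above and the third-party imagehash library:

```python
from collections import defaultdict

import imagehash                  # third-party: pip install imagehash
import pandas as pd
from PIL import Image

labels = pd.read_csv("labels.csv")    # hypothetical layout: image_path, label

# A perceptual hash collapses near-identical images (resized, recompressed,
# lightly cropped) onto the same short fingerprint.
labels_by_hash = defaultdict(set)
for row in labels.itertuples():
    fingerprint = str(imagehash.phash(Image.open(row.image_path)))
    labels_by_hash[fingerprint].add(row.label)

conflicts = {h: names for h, names in labels_by_hash.items() if len(names) > 1}
print(f"{len(conflicts)} near-duplicate groups carry more than one label")
```

This only catches near-duplicates. Drift across visually distinct images still needs a human pass with the current schema definitions in hand.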

Labelling regime | Model outcome | Real-world consequence
Consistent and correct | Learns true signal | Generalises well to new data
Consistent but wrong | Learns your error | Fails predictably; correctable
Inconsistent, mostly right | Learns signal + noise | Inconsistent performance at edges
Inconsistent, often wrong | Learns noise | Unpredictable failures everywhere
Deliberately ambiguous | Cannot converge | High loss, unstable training

The arithmetic of label quality and its consequences for model behaviour

"The model did not learn to see. It learned to replicate my taxonomy — including every place where my taxonomy was incoherent. That is a completely different thing, and it took two days of relabelling to feel the difference in my hands."

IV. The Finite Terror of Collection Bias

If Your Data Has Walls, Your Model Lives Inside Them

I collected images from three sources: my own camera, a stock photography site, and web scraping. This felt like diversity. It was, in fact, three varieties of the same problem.

When I eventually separated my validation set by source and ran the confusion matrix independently for each, the results were startling. The model achieved 94% accuracy on stock photographs and 61% accuracy on scraped web images. The same model. The same weights. The same architecture. The difference was not in the model. It was in the distribution.

Stock photographs have consistent lighting, professional composition, high resolution, and central subject placement. The real world does not. The web images had motion blur, unusual angles, cluttered backgrounds, and objects partially occluded by other objects. The model had learned to recognise a certain kind of image of a thing — not the thing itself.
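The diagnosis itself is cheap once you think to run it. Splitting the evaluation by source takes a few lines; the sketch below uses scikit-learn and assumes a validation table that records each image's origin alongside its true and predicted class.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical layout: one row per validation image, with its collection source.
val = pd.read_csv("val_predictions.csv")   # columns: source, y_true, y_pred

for source, group in val.groupby("source"):
    acc = accuracy_score(group["y_true"], group["y_pred"])
    print(f"\n{source}: accuracy = {acc:.1%}  (n = {len(group)})")
    print(confusion_matrix(group["y_true"], group["y_pred"]))
```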

"My model had not learned to identify the object. It had learned to identify a well-lit, professionally composed, centre-framed photograph of the object. When the world declined to be a stock photograph, the model declined to understand it."

This is the collection bias problem made viscerally concrete. You cannot gather data without choosing where to look. Where you look determines what you see. What you see determines what the model learns. The model's understanding of the world is bounded, precisely and invisibly, by the boundaries of your collection strategy.

A finite dataset does not merely limit your model's knowledge. It shapes the topology of its blindness — determining not just what it does not know, but in what directions and under what conditions it will confidently be wrong.

— ✦ —

V. The Multiverse of Edge Cases

Your Easy Examples Trained the Model. Your Hard Examples Defined It.

Of my 5,000 images, approximately 70% were straightforward. Clean lighting, unambiguous subject, clear category membership, label assignable in under two seconds. These images were the comfortable majority. They were also, I now believe, the least important part of the dataset.

The remaining 30% — badly lit, occluded, unusual angles, motion-blurred, compositionally ambiguous — consumed roughly 80% of my total labelling time. They were the images I returned to, reconsidered, occasionally relabelled. And they were, without exception, exactly where the model failed in production.

This runs counter to how we tend to think about datasets. We focus on class balance, on total count, on overall distribution. We rarely focus on difficulty balance — on whether the hard cases, the edge cases, the failure-adjacent cases, are present in sufficient quantity and variety to actually teach the model what to do when the world gets difficult.

The easy cases train the model. The hard cases are the model. The difference between a classifier that works in a lab and one that works in the world is almost entirely located in those 30% of images that nobody wanted to label.

Image type | Share of dataset | Share of labelling time | Share of production failures
Clean, unambiguous | ~70% | ~20% | Minimal
Borderline / ambiguous | ~20% | ~45% | Moderate
Edge cases / degraded | ~10% | ~35% | Dominant

The asymmetry between dataset composition and real-world failure distribution
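Difficulty is also measurable after the fact. Ranking validation images by per-example loss surfaces the cases the model finds hardest, which is where the production failures described above actually lived. A sketch, assuming you have saved the model's predicted probability for each image's true class at evaluation time:

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per validation image, with the model's predicted
# probability for that image's *true* class saved at evaluation time.
val = pd.read_csv("val_probs.csv")   # columns: image_path, label, p_true

# Per-example cross-entropy: largest where the model is confidently wrong.
val["loss"] = -np.log(np.clip(val["p_true"], 1e-12, 1.0))

hardest = val.sort_values("loss", ascending=False).head(50)
print(hardest[["image_path", "label", "loss"]])
```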

VI. Identity at the Edge of the Training Loop

What Does "The Model" Even Mean If the Data Was Always You?

This is where the project stops being technical and starts being personal.

If the model learns your categories, your consistency, your collection biases, your implicit decisions about what counts as an edge case — then in what sense is the model separate from you? It has no theory of its own. It has inherited yours, compressed and generalised into weights and activations, but fundamentally derived from every judgment call you made over six weeks of labelling.

The philosopher Derek Parfit spent decades on questions of personal identity — on what it means for a self to persist through time, what survives change, what constitutes the continuity that makes you you. His answer, broadly, was that there is no deep metaphysical fact about personal identity beyond the physical and psychological continuity that constitutes your history.

A trained model has a kind of continuity with the labeller who built its dataset. Not identity, exactly, but something like inheritance. The model is not you. But it could not be what it is without you having been what you were, at 11pm on a Tuesday in week four, deciding that that image goes in that class.

"You are not your atoms. The model is not its weights. Both of you are patterns — patterns that persist, generalise, and occasionally fail in ways that reveal the decisions that built you."

Perspective | What the model is
Engineering view | A function approximator optimised on labelled data
Information view | A compressed representation of the training distribution
Philosophical view | An inheritance of every labelling decision made upstream
Practical view | Exactly as good as the data — no better, no worse, never different
Humbling view | A mirror that reflects your categories back at the world

Frameworks for understanding what a trained model actually contains

VII. Living Inside the Training Loop

The Answer You Are Already Enacting

Here is what the project taught me. The actual training — the gradient descent, the loss curves, the hyperparameter tuning, the architecture comparisons — took approximately four hours across all experiments. The data took six weeks.

This ratio is not unusual. It is, in fact, the norm in any machine learning project that touches the real world, and it is almost never honestly represented in tutorials, papers, or public benchmarks. Benchmarks are created once and used indefinitely. They conceal the labour that made them. They present the finished dataset as the natural state of the world, rather than as the product of thousands of hours of human judgment.

What building a dataset from scratch teaches you — what no amount of model architecture reading can replicate — is that supervised learning is not about algorithms. It is about epistemology. It is about the question of how you know what you know, and whether you can articulate that knowledge precisely enough to be consistent about it, at scale, across thousands of examples, even as your understanding evolves and your attention fluctuates and the images keep coming.

The machine will learn whatever signal is in your labels. It will learn it with perfect patience and zero fatigue. It will generalise it to cases you have never seen and apply it in contexts you did not imagine. The question is never whether the model will learn. The question is always: what, exactly, are you teaching it?

"Train a model on 5,000 images and you will improve your validation accuracy. But you will also develop a permanently changed intuition for what it means to teach something to see — which is to say, a permanently changed intuition for how precarious and constructed your own seeing has always been."

The infinite dataset does not await you. There is only the data you collected, from the places you thought to look, labelled according to the categories you believed were real. In that sense, every model is a self-portrait — not of what you look like, but of how you have decided to see.

Perhaps that is the gift the process conceals. Not a better model, though you may get that too. But a clearer understanding of the decisions that were always already present in the act of perception — the categorisation that precedes every label, every image, every moment of recognition. The model does not create that categorisation. It simply makes it visible, by taking it seriously enough to inherit it.

"Open your training pipeline to a world that refuses to be cleanly labelled. Everything it contains — including your own uncertainty — was always going to be there. The model did not introduce the ambiguity. It simply refused to pretend the ambiguity was not there."

 

— END —

 

Mystic Quill  |  Research & Writing by Selva  |  2026

— ✦ —

Read more at

mysticquill.blogspot.com
