WHAT TRAINING AN AI ON 5,000 IMAGES ACTUALLY TEACHES YOU
And Why the Data Was Never Really About the Data
A meditation on teaching a machine to
see — and what it reveals about how we see ourselves
It doesn't
matter how carefully you curated your collection, how many hours you spent in a
labelling interface clicking boxes and assigning classes. At some point, your
attention flickered. Your definition of a category quietly shifted between
image 412 and image 1,847. The lighting fooled you, or the angle, or the simple
accumulated fatigue of looking at five thousand of anything.
This is not a
confession of failure. It is the central, inescapable lesson of training an
image classifier from scratch — not fine-tuning a foundation model, not running
someone else's benchmark, but actually building the dataset yourself, image by
image, decision by decision. What the process teaches you has almost nothing to
do with gradient descent or learning rates or network architecture. It teaches
you something far stranger: that the act of labelling the world is the act of
deciding what the world means.
And a machine,
with perfect fidelity, will learn exactly what you decided.
— ✦ —
I. The Question That Precedes Every Model
Both Answers Are Impossible. Both Must Be True.
Before you
write a single line of training code, you must answer a question that sounds
trivially simple: what are the categories? How many classes does your problem
have, and where exactly do the boundaries between them fall?
This
question is a trap. It seems like a preliminary matter — something to resolve
in an afternoon before the real work begins. In practice, it is the real work.
Everything else is arithmetic.
My first
taxonomy had nine classes. By image 800, I had collapsed it to four — not
because I had grown lazy or impatient, but because I had discovered, the hard
way, that my original categories reflected the way I had been taught to think
about these objects, not any pattern that existed in the visual data itself.
Two of my categories, which I had confidently distinguished in theory, turned
out to be visually indistinguishable in practice. The model, with no access to
theory, would have learned only the confusion.
"A taxonomy is a theory of the world written in the
language of boxes and labels. When you train a model on your taxonomy, you are
not teaching it to see. You are teaching it to reproduce your theory. Those are
very different things."
The paradox
is this: you cannot begin labelling without a schema, but you cannot build a
correct schema without labelling. The categories you choose at the start will
be wrong in ways you cannot predict until you have already made hundreds of
decisions based on them. The only solution is to treat the schema as a living
document — which means accepting that some portion of your early work will
always need to be redone.
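What "living document" can mean in practice is mundane: record which version of the schema each label was made under. A minimal sketch, with hypothetical file names and fields, not the tooling the project actually used:

```python
# Record the schema version alongside every labelling decision, so that when a
# category definition changes you can find exactly which annotations predate it.
# File name and column names are hypothetical.
import csv
import datetime

SCHEMA_VERSION = "v3"  # bump this whenever a category definition changes

FIELDS = ["path", "label", "schema_version", "labelled_at"]

def record_label(writer, path, label):
    """Append one labelling decision, stamped with the schema it was made under."""
    writer.writerow({
        "path": path,
        "label": label,
        "schema_version": SCHEMA_VERSION,
        "labelled_at": datetime.datetime.now().isoformat(timespec="seconds"),
    })

with open("annotations.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    record_label(writer, "images/0412.jpg", "category_b")
```

When the schema changes, a single filter on schema_version tells you which of your early decisions need to be revisited, rather than leaving you to guess.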
And yet this
is not idle methodological fussiness. It is the central open problem of every supervised
learning project that touches the real world. The model optimises for whatever
signal is in your labels. If your labels are wrong, the optimisation is precise
and the result is precisely wrong.
II. What the Data Actually Contains
The Observable Training Set Is a Cage, Not a Map
When
practitioners talk about dataset size, they usually mean the total number of
examples. What they rarely discuss is the effective number — the count of
genuinely informative, correctly labelled, distribution-representative examples
that the model can actually learn from.
My 5,000
images contained, by my own reckoning, approximately 3,400 examples I was fully
confident about. The remaining 1,600 occupied a territory I can only describe
as epistemically uncomfortable: images where the correct label depended on
assumptions I had not made explicit, on context the image did not contain, on a
definition I had not yet stabilised. I labelled them anyway. I had to. But I
labelled them with the quiet knowledge that I was making decisions rather than
discovering truths.
This is the
gap between the dataset and the map. The dataset is the collection you have.
The map is the underlying distribution of the phenomenon you are trying to
model. The two are never identical, and the distance between them — in
coverage, in balance, in representative difficulty — determines the ceiling of
your model's real-world performance.
| Category              | What It Measures               | Why It Is Insufficient                            |
|------------------------|--------------------------------|---------------------------------------------------|
| Total examples         | Raw dataset size               | Includes ambiguous, mislabelled, redundant images |
| Class balance          | Distribution across categories | Ignores within-class difficulty variation         |
| Annotation confidence  | Labeller certainty per example | Rarely collected; enormously informative          |
| Source diversity       | Variety of image origins       | Critical for generalisation; rarely tracked       |
| Edge case density      | Proportion of hard examples    | The factor most predictive of failure modes       |

What dataset quality metrics typically capture — and what they miss
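None of these measures requires sophisticated tooling to track. A minimal sketch of the audit, assuming a hypothetical annotations.csv with per-example columns for label, confidence, source, and difficulty:

```python
# Sketch of a dataset audit that counts more than raw size.
# Assumes a hypothetical annotations.csv with per-example columns:
# label, confidence ("high"/"low"), source, difficulty ("clean"/"borderline"/"edge").
import csv
from collections import Counter

with open("annotations.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print("total examples:   ", len(rows))
print("high-confidence:  ", sum(r["confidence"] == "high" for r in rows))
print("class balance:    ", Counter(r["label"] for r in rows))
print("source diversity: ", Counter(r["source"] for r in rows))
print("edge-case density:", sum(r["difficulty"] == "edge" for r in rows) / len(rows))
```

The point is not the code. The point is that most projects never collect the confidence, source, and difficulty columns in the first place.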
The
uncomfortable truth is that a smaller, better-curated dataset will almost
always outperform a larger, carelessly assembled one for any task that involves
real-world variation. Five thousand thoughtfully selected and consistently
labelled images will teach a model more than fifty thousand gathered from a
single source with a single lighting condition and a single angle of view.
"We have counted the images. We have not counted what is in
them. The difference between those two things may be the entire project."
III. The Number That Should Humble You
The Arithmetic of Inconsistency
Partway
through the project — around image 2,300 — I discovered something that forced a
reckoning. I had been applying one of my labels inconsistently. My intuitive
definition of a borderline category had quietly shifted between week one and
week three. The newer labels were, arguably, more defensible. They reflected a
more considered understanding of what the category should mean.
But that is
not how machine learning works. The model does not care which of your
definitions was more considered. It sees the input, it sees the label, and it
attempts to learn a function that maps one to the other. If identical or
near-identical inputs have received different labels at different points in
time, the model learns that identical inputs can produce different outputs.
Which is to say: it learns noise.
I relabelled
approximately 600 images. It took two days. And in those two days, I understood
something about consistency that no textbook had made visceral for me: it is
often better to be systematically wrong than randomly right. A model trained on
a consistently-applied incorrect definition will at least learn a
coherent theory. A model trained on inconsistent labels learns that the world
is arbitrary — which is the one thing you most need it not to believe.
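Catching this kind of drift before training does not require anything exotic. Here is a minimal sketch, assuming a hypothetical annotations.csv of path/label pairs and the third-party Pillow and imagehash packages, that flags pairs of near-identical images carrying different labels:

```python
# Flag visually near-identical images that have received different labels.
# Assumes annotations.csv with hypothetical columns "path" and "label",
# plus the third-party Pillow and imagehash packages.
import csv
from PIL import Image
import imagehash

MAX_DISTANCE = 4  # Hamming distance at or below which two hashes count as near-identical

with open("annotations.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Perceptual hash of every annotated image (robust to small crops and recompression).
for row in rows:
    row["hash"] = imagehash.phash(Image.open(row["path"]))

# Pairwise comparison is O(n^2), which is tolerable at 5,000 images.
for i, a in enumerate(rows):
    for b in rows[i + 1:]:
        if a["hash"] - b["hash"] <= MAX_DISTANCE and a["label"] != b["label"]:
            print(f'{a["path"]} ({a["label"]})  vs  {b["path"]} ({b["label"]})')
```

Every pair this prints is a place where the model will be asked to believe two contradictory things at once.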
| Labelling regime           | Model outcome         | Real-world consequence            |
|----------------------------|-----------------------|-----------------------------------|
| Consistent and correct     | Learns true signal    | Generalises well to new data      |
| Consistent but wrong       | Learns your error     | Fails predictably; correctable    |
| Inconsistent, mostly right | Learns signal + noise | Inconsistent performance at edges |
| Inconsistent, often wrong  | Learns noise          | Unpredictable failures everywhere |
| Deliberately ambiguous     | Cannot converge       | High loss, unstable training      |

The arithmetic of label quality and its consequences for model behaviour
"The model did not learn to see. It learned to replicate my
taxonomy — including every place where my taxonomy was incoherent. That is a
completely different thing, and it took two days of relabelling to feel the
difference in my hands."
IV. The Finite Terror of Collection Bias
If Your Data Has Walls, Your Model Lives Inside Them
I collected
images from three sources: my own camera, a stock photography site, and web
scraping. This felt like diversity. It was, in fact, three varieties of the
same problem.
When I
eventually split my validation set by source and computed the confusion matrix
separately for each, the results were startling. The model achieved 94%
accuracy on stock photographs and 61% accuracy on scraped web images. The same
model. The same weights. The same architecture. The difference was not in the
model. It was in the distribution.
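The breakdown itself takes only a few lines. A sketch, assuming a hypothetical val.csv with path, label, and source columns, and a predict() function standing in for the trained model's inference call:

```python
# Report validation accuracy per collection source instead of one pooled number.
# Assumes a hypothetical val.csv (columns: path, label, source); predict() is a
# placeholder for whatever inference call wraps the trained model.
import csv
from collections import defaultdict

def predict(path):
    """Placeholder: return the trained model's predicted label for one image."""
    raise NotImplementedError

correct = defaultdict(int)
total = defaultdict(int)

with open("val.csv", newline="") as f:
    for row in csv.DictReader(f):
        total[row["source"]] += 1
        correct[row["source"]] += int(predict(row["path"]) == row["label"])

for source in sorted(total):
    print(f"{source:>12}: {correct[source] / total[source]:.1%}  (n={total[source]})")
```

A single pooled accuracy number would have averaged the 94% and the 61% into something reassuring and meaningless.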
Stock
photographs have consistent lighting, professional composition, high
resolution, and central subject placement. The real world does not. The web
images had motion blur, unusual angles, cluttered backgrounds, and objects
partially occluded by other objects. The model had learned to recognise a
certain kind of image of a thing — not the thing itself.
"My model had not learned to identify the object. It had
learned to identify a well-lit, professionally composed, centre-framed
photograph of the object. When the world declined to be a stock photograph, the
model declined to understand it."
This is the
collection bias problem made viscerally concrete. You cannot gather data
without choosing where to look. Where you look determines what you see. What
you see determines what the model learns. The model's understanding of the
world is bounded, precisely and invisibly, by the boundaries of your collection
strategy.
A finite
dataset does not merely limit your model's knowledge. It shapes the topology of
its blindness — determining not just what it does not know, but in what
directions and under what conditions it will confidently be wrong.
— ✦ —
V. The Multiverse of Edge Cases
Your Easy Examples Trained the Model. Your Hard Examples Defined It.
Of my 5,000
images, approximately 70% were straightforward. Clean lighting, unambiguous
subject, clear category membership, label assignable in under two seconds.
These images were the comfortable majority. They were also, I now believe, the
least important part of the dataset.
The
remaining 30% — badly lit, occluded, unusual angles, motion-blurred,
compositionally ambiguous — consumed roughly 80% of my total labelling time.
They were the images I returned to, reconsidered, occasionally relabelled. And
they were, without exception, exactly where the model failed in production.
This asymmetry runs counter to how we tend to think about datasets. We focus on class
balance, on total count, on overall distribution. We rarely focus on difficulty
balance — on whether the hard cases, the edge cases, the failure-adjacent
cases, are present in sufficient quantity and variety to actually teach the
model what to do when the world gets difficult.
The easy
cases train the model. The hard cases are the model. The difference between a
classifier that works in a lab and one that works in the world is almost
entirely located in those 30% of images that nobody wanted to label.
| Image type             | Share of dataset | Share of labelling time | Share of production failures |
|------------------------|------------------|-------------------------|------------------------------|
| Clean, unambiguous     | ~70%             | ~20%                    | Minimal                      |
| Borderline / ambiguous | ~20%             | ~45%                    | Moderate                     |
| Edge cases / degraded  | ~10%             | ~35%                    | Dominant                     |

The asymmetry between dataset composition and real-world failure distribution
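One possible way to act on this asymmetry, sketched below under the assumption of a PyTorch training pipeline and a hypothetical per-example difficulty tag (not something the original project necessarily did), is to oversample the hard cases so the model sees them more often than their raw share of the dataset would suggest:

```python
# Oversample tagged hard examples during training so they appear more often
# than their raw share of the dataset. Assumes PyTorch and a hypothetical
# per-example difficulty tag aligned with the training dataset's indexing.
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical tags for a 5,000-image dataset (70% clean, 20% borderline, 10% edge).
difficulty = ["clean"] * 3500 + ["borderline"] * 1000 + ["edge"] * 500

# Relative sampling weight per tier; edge cases are drawn four times as often as clean ones.
tier_weight = {"clean": 1.0, "borderline": 2.0, "edge": 4.0}
weights = torch.tensor([tier_weight[t] for t in difficulty], dtype=torch.double)

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```

Oversampling adds no information the dataset does not already contain; it only changes how often the model is reminded of the information that is there.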
VI. Identity at the Edge of the Training Loop
What Does "The Model" Even Mean If the Data Was Always You?
This is
where the project stops being technical and starts being personal.
If the model
learns your categories, your consistency, your collection biases, your implicit
decisions about what counts as an edge case — then in what sense is the model
separate from you? It has no theory of its own. It has inherited yours,
compressed and generalised into weights and activations, but fundamentally
derived from every judgment call you made over six weeks of labelling.
The
philosopher Derek Parfit spent decades on questions of personal
identity — on what it means for a self to persist through time, what survives
change, what constitutes the continuity that makes you you. His answer,
broadly, was that there is no deep metaphysical fact about personal identity
beyond the physical and psychological continuity that constitutes your history.
A trained
model has a kind of continuity with the labeller who built its dataset. Not
identity, exactly, but something like inheritance. The model is not you. But it
could not be what it is without you having been what you were, at 11pm on a
Tuesday in week four, deciding that that image goes in that
class.
"You are not your atoms. The model is not its weights. Both
of you are patterns — patterns that persist, generalise, and occasionally fail
in ways that reveal the decisions that built you."
| Perspective        | What the model is                                                   |
|--------------------|---------------------------------------------------------------------|
| Engineering view   | A function approximator optimised on labelled data                  |
| Information view   | A compressed representation of the training distribution           |
| Philosophical view | An inheritance of every labelling decision made upstream            |
| Practical view     | Exactly as good as the data — no better, no worse, never different  |
| Humbling view      | A mirror that reflects your categories back at the world            |

Frameworks for understanding what a trained model actually contains
VII. Living Inside the Training Loop
The Answer You Are Already Enacting
Here is what
the project taught me. The actual training — the gradient descent, the loss
curves, the hyperparameter tuning, the architecture comparisons — took
approximately four hours across all experiments. The data took six weeks.
This ratio
is not unusual. It is, in fact, the norm in any machine learning project that
touches the real world, and it is almost never honestly represented in
tutorials, papers, or public benchmarks. Benchmarks are created once and used
indefinitely. They conceal the labour that made them. They present the finished
dataset as the natural state of the world, rather than as the product of
thousands of hours of human judgment.
What
building a dataset from scratch teaches you — what no amount of model
architecture reading can replicate — is that supervised learning is not about
algorithms. It is about epistemology. It is about the question of how you know
what you know, and whether you can articulate that knowledge precisely enough
to be consistent about it, at scale, across thousands of examples, even as your
understanding evolves and your attention fluctuates and the images keep coming.
The machine
will learn whatever signal is in your labels. It will learn it with perfect
patience and zero fatigue. It will generalise it to cases you have never seen
and apply it in contexts you did not imagine. The question is never whether the
model will learn. The question is always: what, exactly, are you teaching it?
"Train a model on 5,000 images and you will improve your
validation accuracy. But you will also develop a permanently changed intuition
for what it means to teach something to see — which is to say, a permanently
changed intuition for how precarious and constructed your own seeing has always
been."
The infinite
dataset does not await you. There is only the data you collected, from the
places you thought to look, labelled according to the categories you believed
were real. In that sense, every model is a self-portrait — not of what you look
like, but of how you have decided to see.
Perhaps that
is the gift the process conceals. Not a better model, though you may get that
too. But a clearer understanding of the decisions that were always already
present in the act of perception — the categorisation that precedes every
label, every image, every moment of recognition. The model does not create that
categorisation. It simply makes it visible, by taking it seriously enough to
inherit it.
"Open your training pipeline to a world that refuses to be
cleanly labelled. Everything it contains — including your own uncertainty — was
always going to be there. The model did not introduce the ambiguity. It simply
refused to pretend the ambiguity was not there."
— END —
Mystic Quill | Research & Writing by Selva | 2026
— ✦ —
Read more at
mysticquill.blogspot.com