You might think computer vision models would be easy to put into production. There are whole companies built on that promise: label a few images, click train, click deploy, done. In practice, it’s messier. Most of us working with these models aren’t ML experts, and moving fast to keep up with the industry has a real cost: the shortcuts you take to ship quickly turn into assumptions that are expensive to undo once the model is live.
Here are some key lessons I’ve learned from running classifiers in production. Before diving in, I’ll cover the basics to make sure you and I are on the same page.
What is a Computer Vision Model?
Computer vision (CV) models are a type of machine learning model, but they’re not LLMs. Where an LLM ingests tokens, a CV model ingests pixels and learns to recognize patterns like textures, shapes, and objects.
At a high level, a CV model can do a few things:
- Classification — “What is this?”
- Detection — “What’s in this image, and where?”
- Segmentation — “Which pixels belong to which object?”
For my most recent project, I deployed classification models, so I’ll focus on those.
What is a Classifier?
A classifier takes an image and assigns it to one of a fixed set of categories, called classes. You decide the classes up front, label example images, and train the model to map new images to those same classes.
Two common flavors:
- Binary classifier — picks between two classes (e.g. “yes” / “no”)
- Multi-class classifier — picks one class from three or more options (e.g. “blue” / “green” / “red”)
There are others (multi-label, hierarchical, etc.), but these two cover most practical cases.
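To make that concrete, here’s a minimal sketch of what “assigning a class” looks like in code. I’m assuming PyTorch and torchvision here, with made-up class names; the point is just that the model outputs a score per class and you take the highest one.

```python
import torch
from torchvision import models

# Hypothetical classes for a multi-class damage classifier.
CLASSES = ["hole", "tear", "none"]

# Standard backbone with its final layer resized to our classes.
# In practice you'd load your own fine-tuned weights here.
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, len(CLASSES))
model.eval()

# One fake 224x224 RGB image; in production this comes from your preprocessing.
image = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    logits = model(image)                      # shape: (1, num_classes)
    probs = torch.softmax(logits, dim=1)       # probabilities over the classes
    prediction = CLASSES[int(probs.argmax())]  # the single winning class
```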
Lesson 1: Designing the Classifier Is Harder Than Training It
Before you touch a training script, you have to answer two questions:
- What should this classifier detect?
- What classes should it use?
If you’ve ever played Catan, you know your first settlement placements largely determine how you’ll rank at the end of the game. Designing a classifier is similar: your early choices constrain everything that comes after.
Start by writing down a clear question describing what the classifier should look for. Then test it with humans. Show several people the kinds of images you plan to send the model and ask them to label them using your draft classes. If they disagree a lot, your model will too.
In ML this is called inter-annotator agreement, and it effectively sets an upper bound on your model quality. Low agreement is a design problem, not a training problem. The model’s answers will only ever be as good as its training labels.
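You can even put a number on that agreement before any training happens. Here’s a minimal sketch using scikit-learn’s Cohen’s kappa, with made-up labels from two annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two people on the same ten images (hypothetical data).
annotator_a = ["hole", "tear", "hole", "none", "tear", "hole", "none", "tear", "hole", "none"]
annotator_b = ["hole", "hole", "hole", "none", "tear", "tear", "none", "tear", "hole", "tear"]

# Cohen's kappa: 1.0 is perfect agreement, 0.0 is what you'd expect by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
```

Cohen’s kappa only compares two annotators at a time; with more labelers, averaging the pairwise scores is a rough but serviceable check.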
Bad questions and classes
Question: Does this product look acceptable?
Classes: Yes, No
Question: How does this product make a customer feel?
Classes: Pretty, Cool, Awesome
Both are subjective. Ten labelers will give you ten different answers, and the model learns noise instead of a pattern. In production, that shows up as customers seeing results they don’t understand or trust.
Better questions and classes
Question: Is this a dog?
Classes: Yes, No
Question: What type of damage does this shirt have?
Classes: Hole, Tear
These look much better, but the second one hides a trap:
- What if the shirt has both a hole and a tear?
- What if it has neither — no damage at all?
Your classes should be mutually exclusive and cover every case you’ll actually see. If two classes can apply at once, you need a multi-label setup. If “none of the above” is possible, you need a None or Other class.
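The difference shows up directly in how you store labels. A quick sketch with hypothetical classes: multi-class data records exactly one class per image (including an explicit “none”), while multi-label data records an independent yes/no for each class.

```python
# Multi-class: exactly one class per image, with an explicit "none"
# to cover undamaged shirts.
CLASSES = ["hole", "tear", "none"]
label = "hole"
target_index = CLASSES.index(label)   # -> 0, one winner per image

# Multi-label: each class is an independent yes/no, so a shirt can have
# both a hole and a tear, or neither.
DAMAGE_TYPES = ["hole", "tear"]
both_damaged = [1, 1]    # hole AND tear present
undamaged = [0, 0]       # "none of the above" falls out naturally
```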
Once the question is crisp and the classes are complete and disjoint, picking binary vs. multi-class is usually straightforward.
Lesson 2: Labeling Is the Work
In practice, you’ll usually go through two training phases:
- Initial fine-tuning before you ship
- Continuous fine-tuning after you ship
Both require human-labeled data.
As a rough starting point, around 100 images per class is often enough for the initial fine-tune, especially if your classes are visually distinct (e.g. distinguishing a hole from a small tear). That initial fine-tune gets you from “random answer” to “roughly reasonable answer.”
Once the model is live, labeling doesn’t stop. To keep performance from drifting and to improve it as new kinds of images show up, you need a steady stream of fresh labels and periodic retraining.
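For context, the fine-tune itself is usually the easy part once the labels exist. Here’s a rough sketch of a typical setup, assuming PyTorch/torchvision and a hypothetical data/train folder with one subdirectory per class; only the final layer is trained, which is why a small labeled set can be enough to start.

```python
import torch
from torch import nn
from torchvision import datasets, models, transforms

# ImageFolder expects data/train/<class_name>/*.jpg, roughly 100 images per class.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_data = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_data, batch_size=16, shuffle=True)

# Pretrained backbone, new head sized to our classes; freeze everything else
# so the small dataset only has to train the final layer.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```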
In production we ran into one big problem — inconsistent results — and traced it back to two main causes:
- Inconsistent human labeling. Ambiguous questions and fuzzy class boundaries meant our labelers disagreed with each other and with their past selves. The model dutifully learned the noise.
- Not enough labeling capacity. At peak we had 11 models in production and two people squeezing labeling in around their actual jobs.
I’ll admit, labeling is tedious, and few people want it as their full-time job. But in production it’s the single biggest lever you have on model quality. Under-invest in it and “inconsistent results” becomes a permanent feature of your product.
One more thing worth naming: class imbalance. If 95% of your shirts are undamaged, a model that always predicts “undamaged” is 95% accurate and completely useless. Watch per-class metrics, not just overall accuracy, or you’ll miss that failure mode entirely.
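Here’s a small sketch of that failure mode, using scikit-learn and made-up predictions: overall accuracy looks great while recall on the rare class is zero.

```python
from sklearn.metrics import classification_report

# 20 shirts: 19 undamaged, 1 damaged. The model just says "undamaged" every time.
y_true = ["undamaged"] * 19 + ["damaged"]
y_pred = ["undamaged"] * 20

# Overall accuracy is 95%, but recall for "damaged" is 0.00; the failure
# only shows up in the per-class rows.
print(classification_report(y_true, y_pred, zero_division=0))
```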
Lesson 3: The Biggest Model Isn’t Always the Best
When results are disappointing, it’s tempting to assume the model just isn’t smart enough and to reach for a bigger one.
Sometimes that’s right. Often it isn’t. And in production, upsizing comes with costs people underestimate:
- More memory
- Higher latency on every prediction
- Less GPU headroom to run other models
- Higher inference cost at scale
Deploying on a local device forces these constraints on you up front, because the hardware is fixed. Cloud deployment hides them behind a credit card, which is not the same as making them disappear. At scale, a 10× larger model is roughly a 10× larger bill, plus real latency that customers will feel.
Before jumping to a bigger model, I now force myself to ask:
- Do we actually need all of these models, or can one do the job of three?
- Are we making the product more complex by adding another model?
- What matters most right now: accuracy, latency, cost, or consistency?
- What’s the smallest model that clears our accuracy bar for customers?
That last one is the key flip: start from “what’s the smallest model that works?” instead of “what’s the biggest model we can afford?” You end up with a leaner system that’s cheaper to run and easier to debug.
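One habit that helps: measure the trade-off instead of guessing. A rough sketch, assuming torchvision models and CPU inference, that compares parameter count and per-image latency for a small and a large backbone (untrained weights are fine here, since we only care about size and speed):

```python
import time
import torch
from torchvision import models

def profile(model: torch.nn.Module, name: str, runs: int = 20) -> None:
    """Print parameter count and rough per-image CPU latency."""
    model.eval()
    image = torch.rand(1, 3, 224, 224)
    params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        model(image)  # warm-up pass before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(image)
    latency_ms = (time.perf_counter() - start) / runs * 1000
    print(f"{name}: {params / 1e6:.1f}M params, ~{latency_ms:.0f} ms/image")

profile(models.mobilenet_v3_small(weights=None), "mobilenet_v3_small")
profile(models.resnet50(weights=None), "resnet50")
```

Numbers like these make the “what’s the smallest model that clears the bar?” conversation concrete instead of hypothetical.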
Conclusion
AI has pushed the industry toward “ship fast, fix later.” Being first to market matters, and I don’t think the answer is to slow down.
But shipping models without clear questions, without enough labeling capacity, and without thinking about size and cost is how you build a product that’s fragile in ways customers feel and you can’t easily unwind.
If you need to go fast, go fast. Just think twice about these three things while you do:
- The question and classes — design them like you’ll have to defend every label.
- Labeling — treat it as ongoing work, not a one-time setup.
- Model size — aim for the smallest model that clears the bar, not the biggest you can run.