Active learning: how to let your model choose what to label next
Labeling everything is expensive. Active learning lets your model identify which unlabeled examples would be most useful to annotate — often cutting labeled data requirements by 30–60% without sacrificing performance.
The Caudals Team
Dataset operations
Why labeling everything is a trap
When a team builds a labeled dataset from scratch, the instinct is to label as much as possible. More data means better models, right?
The relationship is more complicated. Not all examples in an unlabeled pool are equally informative. A dataset with 10,000 examples might have 2,000 that are genuinely novel and 8,000 that are minor variations of things the model already handles well. Labeling all 10,000 teaches the model roughly the same thing as labeling the 2,000 — at five times the cost.
The problem is that you cannot know in advance which 2,000 those are. That is the problem active learning solves.
How the active learning loop works
Active learning is a data collection strategy built around a cycle: you train a model, use it to identify the unlabeled examples it handles least confidently, label those specifically, and then retrain. Repeat until performance is good enough.
The key insight is that a model's uncertainty is a map of where more labeled data would help most. An uncertain prediction means the model has not seen enough examples of that type to form a reliable decision. Labeling that example and retraining directly addresses the gap.
The loop runs in five phases:

1. Train on labeled data
2. Score the unlabeled pool
3. Select the most uncertain
4. Label the query batch
5. Add to training set and repeat

In the first phase, the model trains on your current labeled dataset. In the first round, this is typically a small seed set — a few hundred examples chosen for coverage rather than volume. The model is functional but uncertain across much of the input space.
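The loop above can be sketched in plain Python. Everything here is illustrative: the "model" is a toy one-dimensional classifier (per-class means) standing in for whatever you actually train, and `oracle` stands in for a human annotator.

```python
import math
import random

def train(labeled):
    """Toy 'model': the per-class mean of 1-D features.
    Stands in for whatever model you actually train."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in by_class.items()}

def predict_proba(model, x):
    """Softmax over negative distance to each class mean."""
    scores = {y: math.exp(-abs(x - m)) for y, m in model.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def least_confident(model, pool, k):
    """Uncertainty sampling: the k examples whose top probability is lowest."""
    return sorted(pool, key=lambda x: max(predict_proba(model, x).values()))[:k]

def oracle(x):
    """Stands in for a human annotator."""
    return int(x > 0)

random.seed(0)
pool = [random.uniform(-1, 1) for _ in range(200)]   # unlabeled pool
labeled = [(-0.9, 0), (0.9, 1)]                      # tiny seed set

for _ in range(5):                                   # five query rounds
    model = train(labeled)                           # 1. train
    batch = least_confident(model, pool, k=10)       # 2-3. score and select
    labeled += [(x, oracle(x)) for x in batch]       # 4. label the batch
    pool = [x for x in pool if x not in batch]       # 5. grow the set, repeat

print(len(labeled))  # 2 seed + 5 rounds x 10 queries = 52
```

The structure is the point: each pass through the loop concentrates annotation effort on the region the current model handles worst.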
This cycle typically runs for 5–20 rounds before performance plateaus. In practice, teams often see that performance after 5 rounds of active learning matches what random collection would have achieved in 15 or 20 rounds, at a fraction of the annotation cost.
The three main query strategies
How you select which examples to label next is the core design decision in any active learning system. There are three dominant approaches.
Uncertainty sampling is the simplest and most commonly used. The model ranks all unlabeled examples by confidence and flags the ones it is least sure about. For classification tasks, these are usually the examples where the predicted probability is closest to 0.5 (for binary) or most evenly spread across classes (for multiclass). Uncertainty sampling is cheap to compute and works well when the model's confidence estimates are reliable.
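The three standard uncertainty scores (least confidence, margin, entropy) can be sketched directly from a predicted class distribution, here assumed to be a plain list of probabilities:

```python
import math

def least_confidence(probs):
    """1 minus the top probability: high when the best guess is weak."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two highest probabilities: a small margin means
    the top two classes are contested."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def entropy(probs):
    """Shannon entropy of the distribution: high when mass is spread out."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.95, 0.03, 0.02]
uncertain = [0.40, 0.35, 0.25]
print(least_confidence(confident), least_confidence(uncertain))
print(margin(confident) > margin(uncertain))      # confident case has the wider margin
print(entropy(confident) < entropy(uncertain))    # uncertain case has higher entropy
```

All three agree on clear-cut cases; they diverge mainly on multiclass distributions where a few classes compete, which is why the choice among them is usually an empirical one.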
Query by committee runs multiple models (or multiple versions of the same model with different training subsets) over the unlabeled pool and selects examples where the models disagree most. Disagreement is a more robust signal than any single model's uncertainty — it captures cases where the decision boundary is genuinely contested rather than just where one model happens to be poorly calibrated. The tradeoff is that you need to maintain and run multiple models, which increases compute cost.
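Disagreement is commonly measured with vote entropy over the committee's hard predictions. A minimal sketch (example IDs and labels are hypothetical):

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Disagreement among committee members for one example.
    votes: one predicted label per committee member."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Three committee members vote on four unlabeled examples
committee_votes = {
    "ex_a": ["cat", "cat", "cat"],   # unanimous: entropy 0, skip it
    "ex_b": ["cat", "dog", "bird"],  # maximal disagreement: query it
    "ex_c": ["cat", "cat", "dog"],
    "ex_d": ["dog", "dog", "dog"],
}
ranked = sorted(committee_votes,
                key=lambda k: vote_entropy(committee_votes[k]),
                reverse=True)
print(ranked[0])  # ex_b is the most contested example
```

Because it only needs each member's predicted label, vote entropy works even when the committee members produce poorly calibrated probabilities.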
Expected model change selects the examples that, if labeled and added to the training set, would cause the largest update to the model's parameters. This is theoretically the most principled approach — you are directly maximizing the learning value of each unit of annotation budget — but it is also the most expensive to compute. It is typically used in research settings or for small unlabeled pools where the computation is tractable.
For most production dataset programs, uncertainty sampling is the right starting point. It is fast, interpretable, and requires no infrastructure beyond the model you are already training.
When active learning pays off
Active learning works best under a specific set of conditions.
Labeling is expensive relative to inference. If annotating each example takes meaningful time from a specialist — a radiologist reviewing scans, a linguist tagging code-switching in speech, a legal expert categorizing documents — the math shifts heavily in favor of selective labeling. The compute cost of scoring an unlabeled pool is negligible compared to the human cost of unnecessary annotation.
The unlabeled pool is large. Active learning requires a pool to query from. If you only have a few hundred unlabeled examples, the benefit of selection over random sampling is modest. The larger the pool, the more variation it contains, and the more value the model's uncertainty signal can extract from it.
You can iterate. Active learning is a loop, not a one-shot decision. If your annotation pipeline requires months of lead time to spin up and cannot accept new batches midway, active learning's incremental structure will not fit well. Programs that can label a batch of a few hundred examples every week and retrain on a rolling basis extract the most from the approach.
You have a seed dataset. You cannot start from zero labels — the model needs enough labeled data to produce meaningful uncertainty scores. A useful seed set is typically 200–500 examples with decent class coverage. The quality of the seed set matters more than its size.
When it does not pay off
Active learning is not a universal win.
If your labeling is cheap (crowdsourced binary tasks at a few cents per annotation, or automated post-processing pipelines), the overhead of running an active learning loop — scoring the pool, batching, retraining — may outweigh the annotation savings. Simple volume collection at low cost is often the right call.
If your model's confidence calibration is poor, uncertainty scores will be noisy and the query batches will not reliably target the most informative examples. This is more common than expected: models trained on small seed sets are often over- or underconfident in ways that are task-specific. Running a calibration check before committing to active learning is worth the time.
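One quick calibration check is a binned expected calibration error: group predictions by confidence and compare each bin's average confidence to its accuracy. This is a toy sketch, not a full reliability analysis, and the data below is fabricated for illustration:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Bin predictions by confidence and compare each bin's average
    confidence to its accuracy. A value near 0 means well calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Calibrated toy case: 80% confidence, 4 of 5 correct -> ECE near 0
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits  = [True, True, True, True, False]
print(round(expected_calibration_error(confs, hits), 3))  # 0.0
```

If the score is large on a held-out set, fixing calibration (or switching to a disagreement-based query strategy) is usually worth doing before trusting uncertainty sampling.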
The cold-start problem is also real. Before you have any labeled data, you cannot score an unlabeled pool. The first batch always needs to be collected without guidance — usually via stratified random sampling to ensure class coverage. Active learning becomes useful only after that seed set is in place.
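A stratified seed draw for that first unguided batch can be sketched as follows. The `source` field and the pool shape are hypothetical; the stratum function would be whatever metadata your pool actually carries (source, domain, length bucket).

```python
import random
from collections import defaultdict

def stratified_seed(pool, stratum_fn, per_stratum, seed=0):
    """Draw an equal number of examples from each stratum of the pool,
    so the seed set covers every segment before any model exists."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in pool:
        groups[stratum_fn(ex)].append(ex)
    batch = []
    for members in groups.values():
        batch += rng.sample(members, min(per_stratum, len(members)))
    return batch

# Hypothetical pool: documents tagged with a source field
pool = [{"id": i, "source": s}
        for i, s in enumerate(["web"] * 50 + ["email"] * 30 + ["chat"] * 20)]
seed_set = stratified_seed(pool, lambda ex: ex["source"], per_stratum=10)
print(len(seed_set))  # 10 from each of 3 sources = 30
```

Equal-per-stratum sampling deliberately overweights rare segments relative to their pool frequency, which is usually what you want in a seed set: coverage beats proportionality when the goal is meaningful uncertainty scores everywhere.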
What this means for a dataset collection program
For teams running structured collection programs, active learning changes the shape of how you think about batch planning.
Instead of a single large collection sprint, active learning programs tend to run as a series of smaller targeted batches. The first batch is seed collection: broad, stratified, designed to give the model enough signal to calibrate. Subsequent batches are query-driven: smaller, targeted at the pool segments the model finds hardest.
This has practical implications for how you scope and price a collection program. The gross collection volume stays the same (you still have a large unlabeled pool to score), but the annotated volume can be substantially smaller than a naive random-labeling approach would require. Reviewer time concentrates on the hardest examples, which also tends to align naturally with the examples that most benefit from expert review.
The flip side is that the annotation task itself gets slightly harder in each round. Because the model is selecting for uncertainty, later batches tend to contain more edge cases, ambiguous examples, and underrepresented inputs — exactly the kinds of examples that require more careful annotation. Rubrics that were clear for the seed set sometimes need refinement as the query batches expose boundary cases the original guidelines did not anticipate.
The operational posture it requires
Active learning is a strategy that rewards operational discipline. It runs best when you have a clean separation between your unlabeled pool, your labeled training set, and your validation set — and when your annotation pipeline can ingest targeted batches rather than processing an undifferentiated stream.
The validation set deserves special attention. Because active learning deliberately biases the labeled training set toward uncertain, hard examples, held-out evaluation must be drawn separately from the pool before any query cycles begin. A validation set contaminated by query-driven selection will not accurately measure generalization.
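The mechanics of that separation are simple; the discipline is in doing it once, up front, before the first query cycle. A minimal sketch (function name is ours):

```python
import random

def carve_validation(pool, frac=0.1, seed=0):
    """Set aside a held-out validation split BEFORE any query cycle runs,
    so evaluation data is never biased by uncertainty-driven selection."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]   # (validation, queryable pool)

examples = list(range(1000))
val, queryable = carve_validation(examples)
print(len(val), len(queryable))  # 100 900
```

Only `queryable` is ever scored or selected from; `val` is labeled by the same annotation process but never enters a query batch.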
Teams that treat active learning as a drop-in replacement for passive collection — running query cycles without updating their annotation rubrics, without maintaining proper validation splits, without budgeting for harder examples in later rounds — often see disappointing results. The strategy works when the collection infrastructure around it is designed to support it.
Where this fits in a broader data strategy
Active learning is one tool in a data strategy, not a complete answer. It works best alongside a clear task definition, a well-designed annotation rubric, and a realistic model of what quality level you are trying to reach.
For teams building out dataset programs for fine-tuning or evaluation, the question worth asking is not "should we use active learning" but "at what stage does active learning become the right call." For most programs, the answer is after the seed set is solid, after the annotation rubric has been stress-tested on a first review pass, and after you have enough unlabeled pool to make selection meaningful.
That is the point where selecting smartly beats collecting broadly.