How to Spot Hallucinations Before You Ship a Fine-Tuned Model

by Optimus AI Labs, 7 min read

A legal tech startup fine-tunes a model on case law. It performs beautifully in testing: cites real statutes, references actual rulings, structures arguments correctly.

They ship it. Within a week, a lawyer catches the model citing a case that doesn't exist. The statute numbers are plausible, the case name sounds real, but the ruling it describes never happened.

That's the particular problem with LLM hallucinations in fine-tuned models: they don't look like errors. They look like answers because the model extrapolated from patterns in your training data with full syntactic confidence, producing fabrications that pass a casual read.

Catching this before production requires a structured evaluation process, not a quick sanity check.

Here's what that process actually looks like.

Build the Golden Evaluation Dataset First

The fastest way to evaluate a fine-tuned model for factual drift is to prepare a blind test before training even starts. Most teams don't do this. They fine-tune, then scramble to build evaluation data afterwards, often using examples that overlap with or resemble their training set. That produces inflated scores and false confidence.

A golden evaluation dataset is a set of questions and expected answers the model never sees during fine-tuning.
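One way to enforce the "never sees" rule is a leakage check that runs before every training job. The sketch below is a minimal illustration in Python: the function names are hypothetical, and it only catches questions that match a training prompt after lowercasing and stripping punctuation. Examples that merely resemble training data would need a fuzzier comparison.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def leaked_questions(golden_questions, training_prompts):
    """Return golden questions whose normalized form also appears in training."""
    seen = {normalize(prompt) for prompt in training_prompts}
    return [q for q in golden_questions if normalize(q) in seen]


# Hypothetical data, for illustration only.
golden = ["What is the limitation period?", "Who must sign the deed?"]
training = ["what is the limitation period"]
print(leaked_questions(golden, training))
```

Any question this returns should be removed from either the golden set or the training set before fine-tuning begins.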

The questions should cover three categories: standard queries that represent the typical use case, edge cases that sit at the boundary of the model's domain, and adversarial questions designed to tempt the model into guessing.

The adversarial category is where most teams underinvest. These are questions just outside the model's knowledge boundary, phrased confidently enough that a poorly calibrated model will answer rather than decline.

A legal model trained on contract law might handle corporate litigation questions well but start fabricating when asked about specialized maritime disputes it wasn't trained on. If your golden dataset doesn't probe those edges, you won't find the drift until a user does.

Beyond just measuring accuracy on known-good examples, the goal is to create a consistent, repeatable scorecard that you can run against every version of the model, so regressions are visible before deployment, not after.
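A repeatable scorecard can be as simple as per-category accuracy plus a comparison against the previous model version. The following is a sketch under assumptions: the category names follow the three categories described above, the grading itself (correct or not) happens elsewhere, and the `tolerance` parameter is an illustrative choice rather than a recommended value.

```python
from collections import defaultdict


def scorecard(graded):
    """graded: iterable of (category, is_correct). Returns accuracy per category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, is_correct in graded:
        totals[category] += 1
        hits[category] += int(is_correct)
    return {category: hits[category] / totals[category] for category in totals}


def regressions(previous, current, tolerance=0.01):
    """Return categories whose accuracy dropped by more than `tolerance`."""
    drops = {}
    for category, old_score in previous.items():
        delta = current.get(category, 0.0) - old_score
        if delta < -tolerance:
            drops[category] = round(delta, 4)
    return drops


# Hypothetical graded results for one model version.
graded = [
    ("standard", True),
    ("standard", True),
    ("edge", True),
    ("adversarial", True),
    ("adversarial", False),
]
print(scorecard(graded))
```

Storing each version's scorecard alongside the model artifact makes the comparison in `regressions` a one-line check at release time.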

LLM-as-a-Judge vs. Human Review: How to Decide

Once you have a golden dataset, you need a grading mechanism. The choice between automated evaluation and human review isn't either/or, but knowing when each earns its place changes what you catch.

LLM-as-a-judge means using a frontier model to evaluate your fine-tuned model's outputs against the expected answers.

You give the judge a rubric: check factual consistency (does the answer match what we know to be true?), groundedness (does it introduce facts not present in the source material?), and completeness (does it answer what was actually asked?). The judge returns a score per example, and you aggregate across your full golden dataset.

You can run it overnight across thousands of examples and get structured feedback on exactly which question categories are drifting. For LLMOps testing strategies at any reasonable volume, automated grading is the baseline layer you need.
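The aggregation step might look like the sketch below. The judge is passed in as a plain callable, because the actual call to a frontier model depends on your provider and prompt. The `exact_match_judge` here is only a toy stand-in so the example runs, and the row field names are assumptions.

```python
from collections import defaultdict
from statistics import mean

RUBRIC = ("factual_consistency", "groundedness", "completeness")


def aggregate_judgments(rows, judge):
    """Average each rubric dimension per question category.

    rows: dicts with 'category', 'question', 'expected', and 'answer' keys.
    judge: callable(question, expected, answer) -> {dimension: score in [0, 1]}.
    """
    by_category = defaultdict(lambda: defaultdict(list))
    for row in rows:
        scores = judge(row["question"], row["expected"], row["answer"])
        for dimension in RUBRIC:
            by_category[row["category"]][dimension].append(scores[dimension])
    return {
        category: {dim: mean(values) for dim, values in dims.items()}
        for category, dims in by_category.items()
    }


def exact_match_judge(question, expected, answer):
    """Toy stand-in for a real judge model: full marks only on an exact match."""
    score = 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0
    return {dimension: score for dimension in RUBRIC}


# Hypothetical rows, for illustration only.
rows = [
    {"category": "standard", "question": "q1", "expected": "30 days", "answer": "30 days"},
    {"category": "standard", "question": "q2", "expected": "two years", "answer": "five years"},
]
print(aggregate_judgments(rows, exact_match_judge))
```

Keeping the three dimensions separate in the output, instead of collapsing them into one number, is what lets you see which kind of failure is growing in which category.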

Human review earns its place for subtle fabrications that automated evaluators miss. A judge model evaluates factual consistency based on its own training data.

If your model hallucinates something plausible but wrong in a domain where the judge also has incomplete knowledge, the fabrication gets a passing score.

The practical approach is to use LLM-as-a-judge for broad coverage and human review for a curated sample of high-stakes outputs, particularly anything in regulated domains like medicine, law, or finance.

One signal worth tracking separately in automated evaluation: refusal calibration. A well-calibrated model knows when it doesn't know something and says so. A model that answers every question confidently, including the ones it shouldn't know, is more likely to hallucinate than one that correctly declines out-of-scope queries.

Your judge should score appropriate refusals as correct, not as failures.
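A minimal sketch of that scoring rule, assuming each golden example is labeled as answerable or not and that something upstream (the judge or a classifier) has already decided whether the model's response was a refusal. All names here are hypothetical.

```python
def grade_response(answerable: bool, refused: bool, answer_correct: bool) -> bool:
    """Score a single response, treating appropriate refusals as correct."""
    if not answerable:
        # Out-of-scope question: declining is the right behavior.
        return refused
    # In-scope question: the model must answer, and answer correctly.
    return (not refused) and answer_correct


def refusal_rates(records):
    """records: iterable of (answerable, refused) pairs.

    Returns (appropriate_refusal_rate, over_refusal_rate); either value is
    None when there are no examples of that kind.
    """
    records = list(records)
    out_of_scope = [refused for answerable, refused in records if not answerable]
    in_scope = [refused for answerable, refused in records if answerable]
    appropriate = sum(out_of_scope) / len(out_of_scope) if out_of_scope else None
    over = sum(in_scope) / len(in_scope) if in_scope else None
    return appropriate, over


# Hypothetical run: two out-of-scope and two in-scope questions.
print(refusal_rates([(False, True), (False, False), (True, False), (True, True)]))
```

Tracking both rates matters: a low appropriate-refusal rate signals a model that guesses, while a high over-refusal rate signals one that declines questions it should be able to answer.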

Telltale Signs Your Training Data Caused the Hallucination

When a fine-tuned model starts hallucinating, the training data is almost always the origin. The hallucination patterns tell you specifically where to look.

If the model hallucinates consistently in one narrow topic area, the training data for that area was thin, inconsistent, or contained errors. The model learned from a noisy signal and extrapolated badly. Check the examples covering that topic for contradictions, formatting inconsistencies, or over-reliance on a single source phrasing.

If the model fabricates plausible-sounding specifics, like fake statistics, invented names, or fictional dates, it learned that specificity reads as authority in your training data.

This happens when the training set is heavy on confident, declarative examples and light on examples that express uncertainty or appropriate caveats. The model learns that hedging is stylistically wrong for your domain and overcorrects toward false confidence.

If the model's hallucinations resemble distorted versions of real training examples, near-duplicate contamination is likely.

When a model sees the same concept in 40 slightly different phrasings, it sometimes recombines them into outputs that blend elements from multiple examples into something that never existed.

Deduplication before training catches most of this, but near-duplicates that differ only in named entities or numbers are particularly prone to this failure mode.
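One rough way to surface near-duplicates that differ only in entities or numbers is to mask those elements and group examples by the remaining template. The sketch below uses a deliberately crude heuristic as an assumption: any capitalized word after the first token is treated as a named entity. A real pipeline would likely swap in a proper named-entity recognizer.

```python
import re
from collections import defaultdict


def template_key(text: str) -> str:
    """Mask numbers and (crudely detected) entities, leaving a comparable template."""
    masked = re.sub(r"\d+(?:[.,]\d+)*", "<num>", text)
    collapsed = []
    for index, token in enumerate(masked.split()):
        # Heuristic: a capitalized token after the first word counts as an entity.
        token = "<ent>" if index > 0 and token[0].isupper() else token.lower()
        # Collapse runs like "Acme Corp" into a single entity slot.
        if token == "<ent>" and collapsed and collapsed[-1] == "<ent>":
            continue
        collapsed.append(token)
    return " ".join(collapsed)


def near_duplicate_groups(examples):
    """Return groups of examples that share the same masked template."""
    groups = defaultdict(list)
    for example in examples:
        groups[template_key(example)].append(example)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical training examples, for illustration only.
examples = [
    "In 2019 the court fined Acme Corp 40000 dollars",
    "In 2021 the court fined Globex 12500 dollars",
    "The statute of limitations is six years",
]
print(near_duplicate_groups(examples))
```

Groups with many members are candidates for thinning or manual review, since they are the examples a model is most likely to blend together.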

The diagnostic question to ask when you find a hallucination: does this look like something that could have been in the training data, or does it look like something the model invented from scratch? The former points to a data cleaning problem. The latter often points to domain boundary issues, where the model is trying to answer questions it was never actually trained to handle.

The Rephrasing Stress Test

Accuracy on a held-out test set is a necessary condition for shipping, not a sufficient one. A model that scores 96% on your golden dataset using the exact phrasing of your evaluation questions might score 70% on the same questions asked differently. That gap is the hallucination risk that standard evaluation frameworks for fine-tuned LLMs routinely miss.

The rephrasing stress test works like this: take 50 core questions from your golden dataset and write three or four alternative phrasings for each. Keep the meaning identical; vary the vocabulary, sentence structure, and framing. Run both the original and rephrased versions through your model and compare outputs.

A generalized model produces consistent core answers regardless of surface form. An overfitted model that's memorizing patterns from training data will handle the familiar phrasing correctly and drift on the unfamiliar one. When you see this, the model hasn't learned the concept behind the question. It's pattern-matching the words.
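The comparison step can be sketched as follows. Both `model` and `same_answer` are supplied by the caller: the first wraps your fine-tuned model, and the second decides whether two answers agree on the core facts (which in practice may itself be a judge model). The stand-ins below are toy assumptions so the example runs.

```python
def consistency_report(question_variants, model, same_answer):
    """For each original question, return the share of paraphrases whose
    answer agrees with the answer to the original phrasing.

    question_variants: dict mapping an original question to its paraphrases.
    """
    report = {}
    for original, paraphrases in question_variants.items():
        if not paraphrases:
            continue  # Nothing to compare against.
        reference = model(original)
        agreeing = sum(same_answer(reference, model(p)) for p in paraphrases)
        report[original] = agreeing / len(paraphrases)
    return report


def unstable_questions(report, min_agreement=1.0):
    """Return original questions whose paraphrases fall below the agreement floor."""
    return [question for question, score in report.items() if score < min_agreement]


# Toy stand-ins: a lookup table as the "model" and case-insensitive equality.
canned = {
    "What is the notice period?": "30 days",
    "How long is the notice period?": "30 days",
    "How much notice must be given?": "14 days",
}
variants = {
    "What is the notice period?": [
        "How long is the notice period?",
        "How much notice must be given?",
    ]
}
report = consistency_report(
    variants, canned.get, lambda a, b: a.strip().lower() == b.strip().lower()
)
print(report, unstable_questions(report))
```

Note that this measures agreement with the original phrasing, not correctness. It is meant to run alongside the accuracy scoring, not replace it.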

This test catches a category of hallucination that's particularly dangerous in production: the model that performs well in demos, where questions are often carefully phrased, and fails under real user inputs, which are messier, more varied, and often phrased in ways your training data didn't anticipate. Users don't ask questions the way your training examples were written.

Setting the Ship/No-Ship Threshold in Your Pipeline

Every fine-tuned model needs a hallucination threshold before it gets anywhere near a production environment. Without a defined line, deployment decisions rely on subjective team judgment, which varies by deadline pressure and how tired everyone is.

The threshold is an accuracy floor on your golden evaluation dataset, expressed as a minimum factual accuracy percentage across all question categories. A reasonable starting point for most applications is 95% factual accuracy overall, with stricter per-category floors for high-stakes domains.

A customer service model might tolerate 94% overall, but a medical information model probably needs 98% in clinical categories before you consider shipping.

In a CI/CD pipeline, this threshold acts as a hard gate. The evaluation suite runs automatically against every new model version. If the model falls below the threshold on factual accuracy, groundedness, or consistency scores, the deployment fails. The team gets a report showing which categories drifted and by how much. No manual override, no shipping with a note to fix it later.
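A gate like this can be a small script that the pipeline runs after the evaluation suite. The sketch below covers only the accuracy check. The input shape, a mapping of category to (correct, total) counts, and the function name are assumptions, and the floors shown are the illustrative figures from this article rather than recommendations for your domain.

```python
def gate(results, overall_floor=0.95, category_floors=None):
    """Return a list of failure messages; an empty list means the model may ship.

    results: dict mapping category -> (correct, total) counts.
    category_floors: optional dict of stricter per-category minimums.
    """
    failures = []
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    overall = correct / total
    if overall < overall_floor:
        failures.append(f"overall: {overall:.3f} is below floor {overall_floor:.3f}")
    for category, floor in (category_floors or {}).items():
        cat_correct, cat_total = results[category]
        accuracy = cat_correct / cat_total
        if accuracy < floor:
            failures.append(f"{category}: {accuracy:.3f} is below floor {floor:.3f}")
    return failures


# Hypothetical evaluation results for a medical information model.
results = {"standard": (97, 100), "clinical": (95, 100)}
for message in gate(results, overall_floor=0.95, category_floors={"clinical": 0.98}):
    print("GATE FAILED:", message)
```

In a pipeline, the calling script would exit with a non-zero status whenever the returned list is non-empty, which is what makes the deployment step fail instead of merely logging a warning. The same pattern extends to groundedness and consistency scores by running the gate once per metric.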

Treating hallucination rates as a hard compliance metric rather than a quality guideline changes how teams respond to failures.

When a model fails the gate, the question becomes "where in the training data did this confusion originate" rather than "can we rationalize shipping anyway." That's the right question to be asking.

The threshold also needs a review cadence. As your golden dataset grows and your user base expands, the distribution of real queries will drift from what you anticipated.

Audit your evaluation dataset every few months, add new adversarial examples based on failures you've seen in production, and recalibrate the threshold if the baseline model capability has shifted. A static threshold on a growing system eventually becomes meaningless.

What the Pre-Ship Process Actually Costs You

Teams that skip structured hallucination detection often justify it as a shortcut for speed. They argue that building golden datasets, setting up LLM-as-a-judge evaluations, and establishing deployment gates takes precious engineering time they simply don't have.

Consider the legal tech startup that had to urgently pull a hallucinating model from production: they spent weeks manually auditing flawed outputs, scrambling to rebuild user trust, and re-evaluating software they had already launched.

That reactive crisis management consumes vastly more time, money, and reputational capital than a proactive pre-ship evaluation pipeline ever would.

At Optimus AI Labs, we eliminate this false tradeoff. Through our expert Model Fine-tuning services, we ensure that safety, accuracy, and reliability are baked into the deployment process from day one.

We know that the fastest way to production is doing the work upfront. Let our Model Fine-tuning service provide the discipline and safety infrastructure your enterprise needs to ship with absolute confidence.