Why a Frontier Model Isn’t Enough: Engineering 99.9% Accuracy in Healthcare Voice AI

How a 30+ model constellation, two-level verifiers, and millions of real production calls let us cross the accuracy threshold that healthcare actually requires.

A common assumption in 2026 is that the next frontier model release will solve healthcare. Unfortunately, it’s not that simple. Even the most capable LLMs still fall well short of the accuracy that clinical patient conversations require. In healthcare voice, that bar is north of 99% on every reasoning subtask, every turn, every call. A single hallucinated medication dose or missed escalation marker is not an evaluation curiosity; it is a safety event.

Our recently published paper, Perfecting Human–AI Interaction at Clinical Scale: Turning Production Signals into Safer, More Human Conversations, lays out the framework we use to get there. It is grounded in real signals from 200M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians, 500K+ test calls). The headline number, a 99.9% clinical safety score, does not come from one model. It comes from a system. This post walks through a few parts of that system that we think are genuinely unique.

1. A constellation, not a single model

Our production system is a constellation of 30+ specialized models orchestrated around a core conversation model. Each satellite has a narrow, verifiable job: medication identification, overdose detection, escalation decisions, labs and vitals reasoning, policy and benefits lookup, scheduling, IVR navigation, and a long tail of online and offline verifiers. The conversational core stays flexible and human; the satellites pin down the parts that have to be exactly right.

The reason we do this rather than scale a single monolith comes down to how LLM error rates compound. If a single model is 98% accurate on each of ten reasoning steps in a call, and the errors are independent, end-to-end accuracy is 0.98^10 ≈ 81.7%, already below 82%. The only way to climb the accuracy curve is to decompose the problem into pieces where each piece can be made verifiably correct.
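The compounding arithmetic is easy to check. The following is a minimal sketch, assuming each step fails independently (a simplification, since errors within a real call may be correlated). The helper names `end_to_end_accuracy` and `required_step_accuracy` are illustrative and not part of any production code.

```python
def end_to_end_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent errors."""
    return step_accuracy ** steps


def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to hit an end-to-end target over `steps` steps."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    for acc in (0.98, 0.99, 0.999):
        total = end_to_end_accuracy(acc, 10)
        print(f"{acc:.3f} per step over 10 steps -> {total:.4f} end-to-end")
    needed = required_step_accuracy(0.99, 10)
    print(f"per-step accuracy needed for 99% over 10 steps: {needed:.5f}")
```

Under this assumption, 98% per step gives about 81.7% over ten steps, 99% per step still only gives about 90.4%, and a 99% end-to-end target needs roughly 99.9% on each step.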

2. A worked example: the blood sugar model

To make the constellation concrete, consider how we handle a single domain: blood sugar. When a patient mentions a glucose reading, a dedicated satellite first identifies whether the conversation is actually about blood sugar at all — distinguishing it from blood pressure, lab panels, or unrelated context. Once confirmed, the model drives the conversation through the questions that determine what the number actually means: when the sample was taken, and fasting status. It then supplies the appropriate clinical guidelines for that specific context — fasting versus post-prandial ranges are not interchangeable, and a value that is normal in one context is concerning in the other.

If the value falls into a concerning range, a separate set of models takes over to evaluate whether this is an emergency condition requiring immediate escalation, or a non-urgent clinical signal to route through standard channels. That hand-off — from “identify and contextualize” to “triage severity” — is exactly the kind of decomposition the constellation is designed for. No single model is doing all of this; each satellite owns a narrow, verifiable slice, and the orchestration glues them into a safe answer.
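The two-stage hand-off can be sketched in a few lines. This is an illustration only: in the system described here each stage is a model, not a lookup table, and every name below (`Context`, `Disposition`, `contextualize`, `triage`) is hypothetical. The numeric thresholds are placeholders chosen to show the mechanics and are not clinical guidance.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Context(Enum):
    FASTING = "fasting"
    POST_PRANDIAL = "post_prandial"


class Disposition(Enum):
    NEEDS_CONTEXT = "ask when the sample was taken and fasting status"
    IN_RANGE = "in range for this context"
    ROUTE_STANDARD = "non-urgent: route through standard channels"
    ESCALATE_NOW = "possible emergency: escalate immediately"


@dataclass(frozen=True)
class Range:
    low: float          # below this, the value is concerning
    high: float         # above this, the value is concerning
    urgent_low: float   # below this, treat as urgent
    urgent_high: float  # above this, treat as urgent


# PLACEHOLDER thresholds in mg/dL, for illustration only. NOT clinical guidance.
PLACEHOLDER_RANGES = {
    Context.FASTING: Range(low=70, high=100, urgent_low=54, urgent_high=300),
    Context.POST_PRANDIAL: Range(low=70, high=140, urgent_low=54, urgent_high=300),
}


def triage(value_mg_dl: float, r: Range) -> Disposition:
    """Stage 2: a separate component decides urgency for concerning values."""
    if value_mg_dl < r.urgent_low or value_mg_dl > r.urgent_high:
        return Disposition.ESCALATE_NOW
    return Disposition.ROUTE_STANDARD


def contextualize(value_mg_dl: float, context: Optional[Context]) -> Disposition:
    """Stage 1: interpret the reading only once its context is known."""
    if context is None:
        return Disposition.NEEDS_CONTEXT
    r = PLACEHOLDER_RANGES[context]
    if r.low <= value_mg_dl <= r.high:
        return Disposition.IN_RANGE
    return triage(value_mg_dl, r)


if __name__ == "__main__":
    # The same number lands differently depending on context.
    print(contextualize(120, Context.FASTING))
    print(contextualize(120, Context.POST_PRANDIAL))
    print(contextualize(120, None))
```

With these placeholder ranges, a reading of 120 is routed as a non-urgent signal when fasting but is in range after a meal, and no interpretation is attempted until the context questions have been answered.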

3. Tool calls get two-level verifiers — but only the verifiable kind

Scheduling looks, on paper, like a solved problem: book an appointment, confirm a time. In voice, it isn’t. Patients change their minds mid-sentence. They layer restrictions (“not Tuesdays, except the first Tuesday of the month”), express preferences for specific providers, juggle multiple appointments, and quietly assume the agent retained context from three turns ago. Every one of these is an opportunity for a hallucinated tool call.

There is also a more consequential failure mode hiding inside scheduling: the reason a patient is calling to book an appointment is often the most clinically important signal in the call. We had a patient request a neurology appointment because they had just been struck by lightning. The surface task is scheduling; the actual situation is a potential emergency that needs to be escalated immediately. A scheduling agent that competently books the appointment and ends the call has failed in the worst possible way. Detecting these signals — and routing them out of the scheduling flow and into urgent escalation — is one of the jobs we explicitly assign to dedicated satellite models in the constellation. Never missing one of these is non-negotiable.

So for every tool call our system makes, we also build a two-level verifier:

Online (real-time): An LLM-as-judge runs in-line and catches verifiable issues fast enough to address them conversationally — before the patient is told something wrong.
Offline: A deep-thinking model does multiple passes over the call. Slower, but it surfaces the long-tail issues a real-time judge can’t afford to catch.

The critical design constraint here is one we learned the hard way: a generic “did we make a mistake?” judge does not work. It has the same problem space as the original query, so it has the same error rate. A successful judge requires either (a) a reduction in problem space — turning open-ended reasoning into a yes/no on a specific verifiable fact — or (b) a smarter, usually slower model that can afford deeper computation. Without one of those two properties, you are just adding latency and false confidence.
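Option (a), reducing the problem space, can be illustrated with the scheduling restriction quoted earlier. The sketch below swaps the LLM-as-judge for a deterministic check, purely to show what "a yes/no on a specific verifiable fact" looks like. The names `NoTuesdaysExceptFirst` and `verify_booking` are hypothetical, and the sketch assumes the patient's restriction has already been parsed into a structured constraint.

```python
from datetime import date
from typing import Iterable


class NoTuesdaysExceptFirst:
    """Patient constraint: 'not Tuesdays, except the first Tuesday of the month'."""

    def allows(self, day: date) -> bool:
        if day.weekday() != 1:   # date.weekday(): Monday is 0, so Tuesday is 1
            return True
        return day.day <= 7      # the first Tuesday always falls on days 1 to 7


def verify_booking(proposed: date, constraints: Iterable) -> bool:
    """Narrow yes/no check: does the proposed slot satisfy every stated constraint?"""
    return all(c.allows(proposed) for c in constraints)


if __name__ == "__main__":
    rules = [NoTuesdaysExceptFirst()]
    for slot in (date(2026, 3, 3), date(2026, 3, 10), date(2026, 3, 11)):
        verdict = "ok" if verify_booking(slot, rules) else "reject before confirming"
        print(slot.isoformat(), verdict)
```

The verifier never re-answers the open question "what appointment should we book?". It only checks whether the proposed tool call violates a constraint the patient already stated, which is a much smaller problem than the original query.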

4. IVR accuracy is experiential, not theoretical

The other major source of accuracy gains is, frankly, experience. You cannot anticipate what goes wrong in production from a whiteboard. Voice agents talking to other automated systems — pharmacy IVRs, lab phone trees, insurance verification lines — produce a class of failures that does not exist in human-to-human training data:

Not knowing what to do with hold music.
Failing to identify confirmation beeps as system signals rather than noise.
Treating IVR pacing like human pacing — pressing keys before the menu has finished, or interrupting prompts.
Speech recognition errors that look benign in text but are catastrophic when the next step is a tool call.
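The first three failure modes above are essentially pacing and signal-classification problems. As a rough sketch, assuming a hypothetical upstream audio classifier that labels each segment with an `AudioEvent`, the decision logic might look like the following. The event and action names are illustrative and not taken from the production system.

```python
from enum import Enum, auto
from typing import Optional


class AudioEvent(Enum):
    PROMPT_SPEECH = auto()      # the IVR is still reading a menu
    SILENCE = auto()            # the prompt has finished and the IVR is waiting
    HOLD_MUSIC = auto()
    CONFIRMATION_BEEP = auto()  # a system signal, not noise


class Action(Enum):
    WAIT = auto()
    SEND_DTMF = auto()
    MARK_CONFIRMED = auto()


def next_action(event: AudioEvent, pending_key: Optional[str]) -> Action:
    """Decide what to do with one classified audio event."""
    if event is AudioEvent.CONFIRMATION_BEEP:
        return Action.MARK_CONFIRMED
    if event is AudioEvent.SILENCE and pending_key is not None:
        return Action.SEND_DTMF   # press keys only after the prompt has finished
    return Action.WAIT            # never interrupt prompts; sit through hold music


if __name__ == "__main__":
    stream = [AudioEvent.HOLD_MUSIC, AudioEvent.PROMPT_SPEECH, AudioEvent.SILENCE]
    for event in stream:
        print(event.name, "->", next_action(event, pending_key="2").name)
```

The point of the sketch is that hold music and menu speech both map to waiting, while a beep is treated as a meaningful signal instead of being discarded as noise.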

Our paper digs into the role of these in-the-wild cues — paralinguistics, turn-taking dynamics, clarification triggers, escalation markers — and shows that they expose failure modes curated datasets simply do not contain. Every one of these failures, once observed, feeds back into the constellation design: a new satellite, a new verifier, or a training signal for the supervisor models. Notably, this work has driven a ~50% reduction in ASR errors over off-the-shelf enterprise ASR in our production system.

“It’s worth being explicit about why we treat IVR navigation as a first-class problem. The IVR layer is not itself clinical — there is no patient on the other end of a pharmacy phone tree. But navigating it accurately is what lets us deliver clinical information back to the patient and, more critically, ensure delivery of safety-critical medications. A failed IVR navigation can mean a prescription that doesn’t get refilled, a lab result that doesn’t get communicated, or an authorization that doesn't get confirmed. That puts IVR squarely in the critical path for safety, even though the IVR itself never says anything clinical.”

Vishal Parikh, Co-founder & Chief Product Officer

Why experiential learning beats benchmark optimization. The thesis of the paper is that production-grade clinical intelligence is achieved by learning from real-world interaction signals and embedding them into system-level design — not by optimizing isolated model accuracy alone. Benchmarks are necessary; they are not sufficient.

5. Fine-tuning is where the last percent lives

The constellation gets us most of the way, but fine-tuning is what takes us across the line. Our pipeline:

Start with our healthcare base models. These are built on open-source foundations — deliberately, so that we own the training pipeline — and we continuously train them on our proprietary healthcare dataset. Every new wave of production signals and clinician-labeled cases flows back into these base models. We zero-shot and few-shot prototype on top of them, and on many tasks this alone pushes accuracy above 99%. That is still not good enough.
Use our human evaluation platform to label failure cases. Clinicians review production and simulated calls, flag accuracy failures, and the labeled cases become high-signal training data.
Apply post-training techniques to incorporate that data. Concretely, this means a mix of supervised fine-tuning (SFT) on the labeled failure cases, direct preference optimization (DPO) and RLHF to align the model with clinician preferences on nuanced calls, rejection-sampling fine-tuning to amplify high-quality trajectories, and distillation from larger reasoning models into the smaller, faster satellites. We apply these techniques not just to the conversational model but to each of the specific support models in the constellation — the medication checker, the scheduler verifier, the escalation classifier, and so on.
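Of the techniques listed above, rejection-sampling fine-tuning has the simplest data-collection loop: sample several candidate responses, score them, and keep only the high scorers as training examples. The sketch below shows that filtering step only. The `generate` and `score` callables are hypothetical stand-ins for a model and a verifier or clinician-derived scorer, and the default sample count and threshold are arbitrary.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n_samples: int = 8,
    threshold: float = 0.9,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw n candidates and keep (prompt, response) pairs scoring >= threshold."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n_samples)]
    return [(prompt, c) for c in candidates if score(prompt, c) >= threshold]


if __name__ == "__main__":
    # Toy stand-ins: a "model" that answers at random and a scorer that knows the answer.
    def toy_generate(prompt: str, rng: random.Random) -> str:
        return rng.choice(["good answer", "bad answer"])

    def toy_score(prompt: str, response: str) -> float:
        return 1.0 if response == "good answer" else 0.0

    kept = rejection_sample("example prompt", toy_generate, toy_score)
    print(f"kept {len(kept)} of 8 candidates for fine-tuning")
```

The surviving pairs would then feed a supervised fine-tuning step, which is outside the scope of this sketch.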

The result of this loop, run over 30+ specialized models with their own failure modes and their own training data, is the production system described in the paper.

6. What’s next: doing this affordably

Running 30+ LLM queries per conversation against frontier models would cost over $100 per hour of patient conversation. That is not a viable architecture for the scale healthcare needs. The next post in this series will get into how we actually run this system in production cost-effectively — model distillation, routing, caching strategies, and where the constellation lets us substitute small, specialized models for large generalists without losing accuracy. Stay tuned.