
May 28, 2026 | 6 min read
By Vishal Parikh,
Chief Product Officer, Co-founder, Hippocratic AI
How a 30+ model constellation, two-level verifiers, and millions of real production calls let us cross the accuracy threshold that healthcare actually requires.
A common assumption in 2026 is that the next frontier model release will solve healthcare. Unfortunately, it’s not that simple. Even the most capable LLMs still hover well below the accuracy required for clinical patient conversations. In healthcare voice, the accuracy required is somewhere north of 99% on every reasoning subtask, every turn, every call. A single hallucinated medication dose or missed escalation marker is not an evaluation curiosity; it is a safety event.
Our recently published paper, Perfecting Human–AI Interaction at Clinical Scale: Turning Production Signals into Safer, More Human Conversations, lays out the framework we use to get there. It is grounded in real signals from 180M+ live patient-AI interactions and clinician-led testing (7K+ licensed clinicians, 500K+ test calls). The headline number — a 99.9% clinical safety score — does not come from one model. It comes from a system. This post walks through a few of the things that system does that we think are genuinely unique.
Our production system is a constellation of 30+ specialized models orchestrated around a core conversation model. Each satellite has a narrow, verifiable job: medication identification, overdose detection, escalation decisions, labs and vitals reasoning, policy and benefits lookup, scheduling, IVR navigation, and a long tail of online and offline verifiers. The conversational core stays flexible and human; the satellites pin down the parts that have to be exactly right.
The reason we do this rather than scale a single monolith follows from how LLM errors compound. If a single model is 98% accurate on each of ten reasoning steps in a call, and the errors are independent, end-to-end accuracy has already dropped below 82% (0.98^10 ≈ 0.817). The only way to climb the accuracy curve is to decompose the problem into pieces where each piece can be made verifiably correct.
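The compounding arithmetic is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the function names are ours for illustration only.

```python
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target end-to-end accuracy."""
    return target ** (1.0 / steps)


print(f"{end_to_end_accuracy(0.98, 10):.4f}")  # 0.8171
print(f"{required_per_step(0.99, 10):.4f}")    # 0.9990
```

Read in the other direction, a 99% end-to-end target over ten steps needs roughly 99.9% on each step under the same independence assumption.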
To make the constellation concrete, consider how we handle a single domain: blood sugar. When a patient mentions a glucose reading, a dedicated satellite first identifies whether the conversation is actually about blood sugar at all, distinguishing it from blood pressure, lab panels, or unrelated context. Once that is confirmed, the model drives the conversation through the questions that determine what the number actually means: when the sample was taken and whether the patient was fasting. It then supplies the appropriate clinical guidelines for that specific context. Fasting and post-prandial ranges are not interchangeable, and a value that is normal in one context is concerning in the other.
If the value falls into a concerning range, a separate set of models takes over to evaluate whether this is an emergency condition requiring immediate escalation, or a non-urgent clinical signal to route through standard channels. That hand-off — from “identify and contextualize” to “triage severity” — is exactly the kind of decomposition the constellation is designed for. No single model is doing all of this; each satellite owns a narrow, verifiable slice, and the orchestration glues them into a safe answer.
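The hand-off described above can be sketched as a small staged router. This is a toy illustration, not the production design: the `Reading` type, the `Route` labels, and every numeric threshold are assumptions introduced here, and the thresholds are placeholders rather than clinical guidance. In the real system each stage is a model. Here each stage is a plain function so the control flow is visible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    NOT_GLUCOSE = "not_glucose"    # hand back to the core conversation model
    ASK_CONTEXT = "ask_context"    # fasting status still unknown
    NORMAL = "normal"
    NON_URGENT = "non_urgent"      # route through standard channels
    EMERGENCY = "emergency"        # escalate immediately


@dataclass
class Reading:
    topic: str                 # label from a hypothetical topic-identification satellite
    value_mg_dl: float
    fasting: Optional[bool]    # None until the patient has answered


# PLACEHOLDER numbers for illustration only. They are not clinical guidance
# and are not taken from the post or the paper.
LOW = 70.0
UPPER_NORMAL = {True: 100.0, False: 140.0}   # keyed by fasting status
EMERGENCY_BELOW = 54.0
EMERGENCY_ABOVE = 400.0


def is_concerning(value: float, fasting: bool) -> bool:
    """Stage 2: interpret the value against the range for its context."""
    return value < LOW or value > UPPER_NORMAL[fasting]


def triage(value: float) -> Route:
    """Stage 3: a separate step decides emergency versus non-urgent."""
    if value < EMERGENCY_BELOW or value > EMERGENCY_ABOVE:
        return Route.EMERGENCY
    return Route.NON_URGENT


def route(reading: Reading) -> Route:
    if reading.topic != "blood_sugar":       # Stage 1: is this about glucose at all?
        return Route.NOT_GLUCOSE
    if reading.fasting is None:
        return Route.ASK_CONTEXT
    if not is_concerning(reading.value_mg_dl, reading.fasting):
        return Route.NORMAL
    return triage(reading.value_mg_dl)


print(route(Reading("blood_sugar", 120.0, fasting=True)))    # Route.NON_URGENT
print(route(Reading("blood_sugar", 120.0, fasting=False)))   # Route.NORMAL
```

The two printed cases show the same number landing on different routes depending on fasting status, which is why the context questions come before any interpretation.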
Scheduling looks, on paper, like a solved problem: book an appointment, confirm a time. In voice, it isn’t. Patients change their minds mid-sentence. They layer restrictions (“not Tuesdays, except the first Tuesday of the month”), express preferences for specific providers, juggle multiple appointments, and quietly assume the agent retained context from three turns ago. Every one of these is an opportunity for a hallucinated tool call.
There is also a more consequential failure mode hiding inside scheduling: the reason a patient is calling to book an appointment is often the most clinically important signal in the call. We had a patient request a neurology appointment because they had just been struck by lightning. The surface task is scheduling; the actual situation is a potential emergency that needs to be escalated immediately. A scheduling agent that competently books the appointment and ends the call has failed in the worst possible way. Detecting these signals — and routing them out of the scheduling flow and into urgent escalation — is one of the jobs we explicitly assign to dedicated satellite models in the constellation. Never missing one of these is non-negotiable.
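One way to picture that routing is a guard that inspects the stated reason before any booking tool call is allowed to run. The sketch below is an assumption-laden illustration: `handle_scheduling_request` and `toy_urgency_detector` are hypothetical names, and the keyword match is only a stand-in for a dedicated escalation model, not something a keyword list could replace.

```python
from typing import Callable


def handle_scheduling_request(
    reason: str,
    is_urgent: Callable[[str], bool],
    book: Callable[[str], str],
) -> str:
    """Check the stated reason for urgency before calling the booking tool."""
    if is_urgent(reason):
        return "escalate"      # leave the scheduling flow without booking
    return book(reason)


def toy_urgency_detector(reason: str) -> bool:
    # Toy stand-in for an escalation model; the phrases are illustrative only.
    phrases = ("struck by lightning", "chest pain")
    return any(p in reason.lower() for p in phrases)


def fake_book(reason: str) -> str:
    # Hypothetical booking tool used only for this example.
    return "booked"


print(handle_scheduling_request(
    "Neurology visit, I was just struck by lightning",
    toy_urgency_detector,
    fake_book,
))  # escalate
```

The ordering is the point of the example: the booking function is never invoked when the urgency check fires.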
So for every tool call our system makes, we also build a two-level verifier:
The critical design constraint here is one we learned the hard way: a generic “did we make a mistake?” judge does not work. It has the same problem space as the original query, so it has the same error rate. A successful judge requires either (a) a reduction in problem space — turning open-ended reasoning into a yes/no on a specific verifiable fact — or (b) a smarter, usually slower model that can afford deeper computation. Without one of those two properties, you are just adding latency and false confidence.
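As an illustration of option (a), take the scheduling restriction quoted earlier, "not Tuesdays, except the first Tuesday of the month". Instead of asking a judge whether the booking looks right, a narrow verifier can answer one yes/no question about the proposed date. The function names below are hypothetical, and the sketch assumes the patient's restriction has already been parsed into this rule.

```python
from datetime import date

TUESDAY = 1  # date.weekday() numbers Monday as 0


def is_first_tuesday(d: date) -> bool:
    """The first Tuesday of a month always falls on day 1 through 7."""
    return d.weekday() == TUESDAY and d.day <= 7


def verify_booking(proposed: date) -> bool:
    """Yes/no check: does the proposed date respect the Tuesday rule?"""
    if proposed.weekday() != TUESDAY:
        return True
    return is_first_tuesday(proposed)


print(verify_booking(date(2026, 6, 2)))   # True: first Tuesday of June 2026
print(verify_booking(date(2026, 6, 9)))   # False: a later Tuesday
print(verify_booking(date(2026, 6, 10)))  # True: a Wednesday
```

Because the check is deterministic and covers a single fact, its error rate does not track the error rate of the model that proposed the tool call.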
The other major source of accuracy gains is, frankly, experience. You cannot anticipate what goes wrong in production from a whiteboard. Voice agents talking to other automated systems — pharmacy IVRs, lab phone trees, insurance verification lines — produce a class of failures that does not exist in human-to-human training data:
Our paper digs into the role of these in-the-wild cues — paralinguistics, turn-taking dynamics, clarification triggers, escalation markers — and shows that they expose failure modes curated datasets simply do not contain. Every one of these failures, once observed, feeds back into the constellation design: a new satellite, a new verifier, or a training signal for the supervisor models. Notably, this work has driven a ~50% reduction in ASR errors over off-the-shelf enterprise ASR in our production system.

“It’s worth being explicit about why we treat IVR navigation as a first-class problem. The IVR layer is not itself clinical — there is no patient on the other end of a pharmacy phone tree. But navigating it accurately is what lets us deliver clinical information back to the patient and, more critically, ensure delivery of safety-critical medications. A failed IVR navigation can mean a prescription that doesn’t get refilled, a lab result that doesn’t get communicated, or an authorization that doesn't get confirmed. That puts IVR squarely in the critical path for safety, even though the IVR itself never says anything clinical.”
Vishal Parikh, Co-founder & Chief Product Officer
Why experiential learning beats benchmark optimization
The thesis of the paper is that production-grade clinical intelligence is achieved by learning from real-world interaction signals and embedding them into system-level design, not by optimizing isolated model accuracy alone. Benchmarks are necessary; they are not sufficient.
The constellation gets us most of the way, but fine-tuning is what takes us across the line. Our pipeline:
The result of this loop, run over 30+ specialized models with their own failure modes and their own training data, is the production system described in the paper.
Running 30+ LLM queries per conversation against frontier models would cost over $100 per hour of patient conversation. That is not a viable architecture for the scale healthcare needs. The next post in this series will get into how we actually run this system in production cost-effectively — model distillation, routing, caching strategies, and where the constellation lets us substitute small, specialized models for large generalists without losing accuracy. Stay tuned.