Bias, Drift, and the Failure Modes Nobody Is Measuring

RADIOLOGY IN THE AGE OF AI & VLMS  |  ARTICLE 8 OF 14

Some radiology practices right now are looking at whether vision-language models belong in their workflow at all. That is a reasonable place to be. The technology is moving fast, the evidence is uneven, and the organizational lift is real. Articles 1 through 7 in this series have tried to give that decision an honest foundation.

This article is for the practices that have already deployed AI, or are close to it, and are starting to wonder about a different question: what happens after go-live? Because the honest answer is that most practices have very little infrastructure in place to answer it, and the consequences of that gap are more concrete than the field tends to acknowledge.

The Model You Validated Is Not the Model You Are Running

This can be surprising, particularly if you validated your AI tool carefully. You reviewed the sensitivity and specificity data. You ran a pilot. You went live. Eighteen months later, something has shifted, and nobody noticed because nobody was looking.

Performance degradation in deployed AI is not random. It follows recognizable patterns. The patient population you are scanning today may not match the population the model trained on: demographics shift, referral patterns change, payer mix moves. The model does not know any of this has happened. It keeps producing output, and that output gets gradually less calibrated to the patients in front of it.

Disease patterns also change over time, independent of the imaging data itself. A model trained on pre-treatment-era imaging learns features that may not exist in the same form in a treated population. And when a vendor updates or retrains a model, the version you validated before go-live is no longer the version running in your PACS. In current US practice, there is no systematic requirement that you be notified when that happens, let alone that you re-validate.
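None of the sources cited here prescribe a particular drift metric, but the mechanics of watching for this kind of shift are not exotic. As one minimal sketch, assuming you kept the model's output scores from your validation cohort, a population stability index (PSI) comparison against a recent production window flags when the score distribution has moved. The function, thresholds, and sample data below are illustrative, not a vendor API.

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score distributions.

    Bin edges come from the baseline (validation-era) scores, so the same
    cut points are reused for every production window you compare.
    """
    cuts = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf                # catch out-of-range scores
    p = np.histogram(baseline, cuts)[0] / len(baseline)
    q = np.histogram(current, cuts)[0] / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)   # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))

# Illustrative only: validation-era scores vs. a recent production window.
rng = np.random.default_rng(0)
validation_scores = rng.beta(2, 5, 5_000)    # stand-in for the validated cohort
production_scores = rng.beta(2, 4, 2_000)    # stand-in for a drifted cohort

value = psi(validation_scores, production_scores)
print(f"PSI = {value:.3f}")   # rule of thumb: <0.1 stable, 0.1-0.25 watch, >0.25 act
```

The same comparison can be run on input-side variables such as patient age or referral source when the vendor does not expose the model's scores.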

Quinn and Lee captured this plainly in a 2025 review in the Journal of the American College of Radiology: drift is not a question of whether; it is a question of when, and of whether anyone is watching when it arrives.¹

What Scanner Hardware Is Doing to Your AI Output

One of the clearest illustrations of how quietly this can happen came from ECR 2026, where Merel Huisman, MD, PhD, of Radboud University Medical Center presented data on what scanner vendor differences do to AI output at the patient level.²

The finding was straightforward and worth sitting with: batch effects from scanner hardware differences alone can shift roughly 7 percent of patients into a different risk classification category, before any pathology detection is even applied. The model is doing exactly what it was trained to do. The problem is that training happened in a different hardware environment from the one your institution runs. Neither the model nor the radiologist reading its output has any signal that this is occurring.

Seven percent sounds small. In a high-volume practice or a screening program, it represents a systematic misclassification event running continuously in the background.
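Huisman's figure is, in effect, a paired reclassification rate: the share of patients whose risk category changes when nothing about the patient changes, only the acquisition pipeline. A practice can estimate its own version of that number on any sample of studies it can run through both conditions. A minimal sketch, with invented labels standing in for per-patient model output:

```python
from collections import Counter

# Hypothetical: risk categories assigned by the same model to the same 1,000
# studies under two different scanner configurations (paired, same patients).
labels_scanner_a = ["low"] * 700 + ["intermediate"] * 200 + ["high"] * 100
labels_scanner_b = ["low"] * 680 + ["intermediate"] * 230 + ["high"] * 90

changed = sum(a != b for a, b in zip(labels_scanner_a, labels_scanner_b))
print(f"Reclassification rate: {changed / len(labels_scanner_a):.1%}")

# The direction of the shifts matters clinically as much as the overall rate.
transitions = Counter(
    (a, b) for a, b in zip(labels_scanner_a, labels_scanner_b) if a != b
)
for (src, dst), n in transitions.most_common():
    print(f"  {src} -> {dst}: {n} patients")
```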

At the same session, Akshay Chaudhari, PhD, of Stanford University presented an encouraging counterpoint: a foundation model trained on 20,000 abdominal CT scans held its performance at an external site where the scanners had zero overlap with the training data.² That finding is real and matters for how we think about architectural choices in next-generation AI. But it does not change the monitoring obligation.

Robustness has limits, and those limits tend to appear at the edges of the training distribution, quietly, without any alarm.


The Bias Gap the Literature Has Now Documented

A scoping review from Mayo Clinic, published in early 2026, examined seven published studies deploying fine-tuned and retrieval-augmented AI architectures in radiology settings.

The finding was straightforward: not one of the seven studies included any evaluation for bias.³

Retrieval-augmented systems are particularly susceptible to amplifying existing biases in the documents they retrieve from. Whether that amplification is occurring in deployed systems today is, based on the published literature, entirely uncharacterized. This is not a theoretical gap. It is a documented gap in systems that are running on real patients.
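The absence the Mayo review documents is not a missing exotic method. The minimum version of a bias evaluation is the same metrics a practice already tracks, stratified by the subgroups it cares about. A sketch of that stratification, with hypothetical per-case records and an invented review threshold:

```python
from collections import defaultdict

# Hypothetical per-case records: ground truth, model call, subgroup label.
cases = [
    {"truth": 1, "pred": 1, "group": "female"},
    {"truth": 1, "pred": 0, "group": "male"},
    {"truth": 0, "pred": 0, "group": "female"},
    {"truth": 0, "pred": 1, "group": "male"},
    # ...in practice, every adjudicated case since go-live
]

counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
for c in cases:
    m = counts[c["group"]]
    if c["truth"] and c["pred"]:
        m["tp"] += 1
    elif c["truth"]:
        m["fn"] += 1
    elif c["pred"]:
        m["fp"] += 1
    else:
        m["tn"] += 1

for group, m in counts.items():
    sens = m["tp"] / (m["tp"] + m["fn"]) if (m["tp"] + m["fn"]) else float("nan")
    spec = m["tn"] / (m["tn"] + m["fp"]) if (m["tn"] + m["fp"]) else float("nan")
    print(f"{group}: sensitivity {sens:.2f}, specificity {spec:.2f}")
# An invented review trigger: flag any subgroup more than 5 points below the rest.
```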

What ECR 2026 Surfaced About the Monitoring Problem

In March 2026, ECR hosted two sessions that together painted a clear picture of where the field stands on postmarket surveillance. Hugh Harvey of Hardian Health; Kicky van Leeuwen, PhD, of the Netherlands Cancer Institute; and Susan Shelmerdine, PhD, of Great Ormond Street Hospital were among those presenting. The AuntMinnie Europe coverage of those sessions is worth reading in full.⁴

Several themes emerged that translate directly to US practice, even though the regulatory framing at ECR was primarily European.

The first is that postmarket surveillance is not optional infrastructure. Van Leeuwen made the case that most AI evaluations still measure performance at the deployment gate, while the environment keeps shifting after go-live: imaging equipment evolves, software updates, patient populations change.

Her framing of continuous monitoring encompasses uptime, output drift, reporting times, workflow fit, and user adoption together, not any single metric in isolation.

The level of scrutiny a system requires, she argued, should scale with its clinical risk level.
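Neither van Leeuwen nor the ECR coverage specifies a data structure for this, but her framing maps naturally onto a simple record a practice could maintain per tool. The fields, risk tiers, and review intervals below are placeholders, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class MonitoringPlan:
    """One record per deployed tool: dimensions tracked together, scrutiny
    scaled to clinical risk. Fields and intervals are placeholders."""
    tool_name: str
    clinical_risk: str                                  # "low" | "medium" | "high"
    metrics: list[str] = field(default_factory=lambda: [
        "uptime",
        "output_drift",
        "report_turnaround_time",
        "workflow_exceptions",
        "user_adoption",
    ])

    @property
    def review_interval_days(self) -> int:
        # Higher clinical risk -> shorter review cycle (numbers are invented).
        return {"low": 180, "medium": 90, "high": 30}[self.clinical_risk]

plan = MonitoringPlan(tool_name="chest-xray-triage", clinical_risk="high")
print(plan.review_interval_days)    # 30
```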

The second theme concerns what happens to discrepancy data. Harvey noted that in practices using AI, radiologists encounter disagreements with AI output at a meaningful rate, somewhere in the range of 5 to 10 percent of cases by some estimates. The problem is that almost none of those disagreements make it back to vendors in any structured way. Published postmarket data consequently makes deployed AI look far safer than the working experience of radiologists suggests, not because the tools are performing well, but because the feedback infrastructure to capture when they are not was never built. The trend across a department over time, Harvey observed, tells you something different than any individual disagreement: consistent disagreement across radiologists points to a model problem, while a single outlier disagreeing persistently may point to something else entirely. That distinction requires data. Data requires a system to collect it.
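That system does not need to be elaborate to be useful. A sketch of the aggregation Harvey is describing, with an invented log format and invented thresholds, assuming the reporting workflow can record when a radiologist overrides the AI output:

```python
# Hypothetical disagreement log: one entry per case where the reading
# radiologist overrode or rejected the AI output.
disagreements = [
    {"radiologist": "A", "study": "CXR-1041"},
    {"radiologist": "B", "study": "CXR-1102"},
    {"radiologist": "C", "study": "CXR-1177"},
    {"radiologist": "A", "study": "CXR-1203"},
    # ...appended automatically from the reporting workflow
]
reads_per_radiologist = {"A": 400, "B": 380, "C": 420}   # case volume in the window

rates = {
    r: sum(d["radiologist"] == r for d in disagreements) / n
    for r, n in reads_per_radiologist.items()
}
dept_rate = sum(rates.values()) / len(rates)
print({r: f"{v:.1%}" for r, v in rates.items()}, f"department average {dept_rate:.1%}")

# Harvey's distinction, crudely: broad disagreement points at the model,
# a single persistent outlier points elsewhere. Thresholds are invented.
if dept_rate > 0.10:
    print("Department-wide disagreement elevated: escalate to the vendor")
for r, v in rates.items():
    if v > 3 * dept_rate:
        print(f"Radiologist {r} is an outlier: review locally before blaming the model")
```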

The third theme is the one that tends to get left out of technical monitoring conversations: what the people inside the system actually experience. Shelmerdine presented a case study from a southwest London district hospital where an AI triage tool for chest X-rays produced measurable improvements: CT completion within the national target climbed from roughly 20 percent to nearly 50 percent, and average time from an abnormal X-ray to CT fell from approximately six days to around three and a half. By any governance metric, the tool was working.

Then the funding ran out. The tool was withdrawn, not because it failed clinically, but because of a budget decision. What surfaced afterward was instructive: some radiologists found they could no longer report as confidently without the triage output. Dependency had developed faster than anyone had tracked. Skills had quietly atrophied while the performance metrics looked fine. Shelmerdine applied a concept from human factors research to describe what had happened: the gap between work as imagined and work as done. The governance documents described one workflow. What was actually happening in the department was something different, and nobody had a way to see it.

One remark from Harvey at that session has stayed with me:

"Sometimes a radiologist says no. An AI system never does."⁴

A Failure Mode No Current Tool Can Detect

A preprint posted in March 2026 by a Stanford group introduced what the authors call the mirage failure mode, and it deserves attention even in its pre-peer-review state.⁵

The researchers tested frontier vision-language models by presenting clinical queries about medical images and, in some conditions, providing no images at all. Across multiple models, the systems produced confident, clinically detailed descriptions of images that were never provided, at rates exceeding 60 percent on average. Medical imaging benchmarks were among the most susceptible categories tested.

The monitoring implication is the part that matters most: mirage-mode accuracy retained 70 to 80 percent of fully image-enabled benchmark accuracy.

A model whose performance does not meaningfully decline when images are removed is not detectable as failing by any current accuracy-based monitoring tool. The numbers look fine. The reasoning traces look fine. The visual evidence was never used.

This finding, if it holds through peer review, means that an observability layer needs to operate independently of the model’s own reported output. You cannot ask the model whether it looked at the image. In mirage mode, it believes it did.

(Note: this paper is a preprint as of this writing and has not yet completed peer review.)
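What an independent check might look like is straightforward to sketch, even though no off-the-shelf tool currently does it: re-run a small fixed probe set with the image withheld and measure how often the answer survives intact. The `query_model` stub below is a toy stand-in for however your deployment is actually called; it simulates mirage behavior rather than wrapping any real API.

```python
# `query_model` is a toy stand-in for however your deployment is actually
# called (vendor API, local VLM). This stub ignores the image entirely,
# i.e. it simulates mirage behavior; it wraps no real library.
def query_model(prompt: str, image_bytes: bytes | None) -> str:
    return "No acute cardiopulmonary abnormality."

def mirage_probe(probe_set, ask) -> float:
    """Fraction of probe cases whose answer is unchanged when the image is
    withheld. A high fraction is the warning sign, not the reassurance."""
    unchanged = sum(
        ask(case["prompt"], case["image"]).strip()
        == ask(case["prompt"], None).strip()
        for case in probe_set
    )
    return unchanged / len(probe_set)

probe_set = [
    {"prompt": "Describe the chest radiograph.", "image": b"<png bytes>"},
    {"prompt": "Is there a pneumothorax?", "image": b"<png bytes>"},
]
print(f"Answers unchanged without the image: {mirage_probe(probe_set, query_model):.0%}")
# With the stub above this prints 100%: confident output, no image in the loop.
# Exact-match comparison is deliberately crude; graded similarity is the next step.
```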

Some Lessons the UK Learned Early

The Royal College of Radiologists published guidance in March 2026 on post-deployment monitoring and safety reporting of AI medical imaging devices.⁶ The regulatory framework it draws on is UK-specific, so some of the statutory detail does not translate directly to US practice. But two case studies in the guidance carry clinical weight for any jurisdiction and are worth knowing.

In one, a routine mammography software upgrade tripled AI recall rates, flagging roughly half of all studies until the model was recalibrated.

This failure did not surface through any standard validation protocol because it occurred only after the update was live in production.

In the second, a lung nodule volumetry software update produced roughly 80 percent disagreement on nodule count between software versions, with management decisions affected in approximately 20 percent of patients. Both events were caught because someone was watching the output in real time after go-live.
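The mammography event illustrates how little machinery the catch actually required: a known baseline flag rate and someone comparing a recent window against it. A minimal sketch of that comparison, with a baseline, tolerance, and window that are illustrative rather than drawn from the RCR guidance:

```python
# Baseline, tolerance, and window are illustrative, not from the RCR guidance.
BASELINE_FLAG_RATE = 0.12    # AI recall/flag rate measured at validation
TOLERANCE = 2.0              # alert if the production rate more than doubles

def check_flag_rate(recent_outputs: list[bool]) -> None:
    """recent_outputs: one boolean per study in the window, True = AI flagged."""
    rate = sum(recent_outputs) / len(recent_outputs)
    if rate > TOLERANCE * BASELINE_FLAG_RATE:
        print(f"ALERT: flag rate {rate:.0%} vs baseline {BASELINE_FLAG_RATE:.0%} "
              "-- hold the tool and open a formal review")
    else:
        print(f"Flag rate {rate:.0%} within tolerance")

# e.g. yesterday's 200 studies, roughly half flagged after a bad update:
check_flag_rate([True] * 98 + [False] * 102)
```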

The US does not have an equivalent mandatory requirement for postmarket monitoring yet. The RCR guidance is useful less as a compliance document for US practices and more as a concrete illustration of what the failure modes look like in production, and what it takes to catch them before patients are affected.

What This Means in Practice

Getting started on this does not require enterprise-scale infrastructure. The foundation is simpler than the technical literature sometimes makes it seem: someone in the practice needs to be responsible for monitoring deployed AI after go-live, there needs to be a defined performance threshold that triggers a formal review, and there needs to be a documented process for what happens when that threshold is crossed.
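Written down, those three elements fit in a record small enough to live in a governance document or a shared spreadsheet. Every value in this sketch is a placeholder to be replaced with your own names, metrics, and thresholds:

```python
# Every value here is a placeholder to replace with your own names and numbers.
monitoring_policy = {
    "tool": "lung-nodule-cad-v3",
    "owner": "named individual, not a committee",
    "review_trigger": {
        "metric": "radiologist_disagreement_rate",
        "threshold": 0.10,              # crossing this opens a formal review
        "window": "rolling 30 days",
    },
    "on_trigger": [
        "pause protocol expansion",
        "pull the discrepancy log for the window",
        "notify the vendor in writing",
        "document the outcome and the decision",
    ],
}
```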

Most vendor contracts are silent on all three. That is worth knowing before you sign one.

Continuous monitoring of AI in clinical deployment is not a luxury feature or a future-state aspiration. It is the minimum viable operating environment for a practice that wants to know whether its deployed tools are still performing the way they did at validation. The gap between that standard and where most US practices currently sit is the honest starting point for this conversation.


If your practice is evaluating VLM deployment and you want to talk through what safe implementation looks like in practice, please reach out at ty@orainformatics.com

Most clinical AI systems are evaluated before deployment and assumed to perform the same in production. In reality, performance shifts across sites, scanners, populations, and workflows, and those shifts are rarely measured systematically.

If you are building or deploying AI in radiology, including VLM-based reporting or multi-model orchestration systems, you need a way to monitor real-world behavior continuously. This includes tracking disagreement with clinicians, identifying drift, and understanding failure modes over time.

Veriloop provides a vendor-agnostic observability layer for clinical AI. We sit downstream of your model and workflow, measuring performance where it matters: in production, across real cases, with real users.

This is not model evaluation. This is system monitoring. Happy to help: ty@orainformatics.com


References

1. Quinn E, Lee CI. Postdeployment Monitoring of Artificial Intelligence in Radiology: Stop the Drift. Journal of the American College of Radiology. 2025. https://www.jacr.org/article/S1546-1440(25)00451-X/abstract

2. Tschabuschnig C. Making Data Speak the Same Language: Harmonization and Health Data. AuntMinnie Europe. March 19, 2026. https://www.auntminnieeurope.com/imaging-informatics/article/15819751/making-data-speak-the-same-language-harmonization-and-health-data

3. Collaco JM, Erickson BJ, et al. Bias in Fine-Tuned and Retrieval-Augmented-Generation Healthcare AI: A Scoping Review. Bioengineering / Mayo Clinic. February 2026. https://pmc.ncbi.nlm.nih.gov/articles/PMC12938813/

4. Tschabuschnig C. ECR: Ethical AI in Radiology: Why Safety Begins After Deployment. AuntMinnie Europe. March 7, 2026. https://www.auntminnieeurope.com/resources/conferences/ecr/2026/article/15818943/ecr-ethical-ai-in-radiology-why-safety-begins-after-deployment; Carey L. ECR: What’s the Follow-Up When Radiologists Reject AI Findings? AuntMinnie Europe. March 6, 2026. https://www.auntminnieeurope.com/resources/conferences/ecr/2026/article/15818895/ecr-whats-the-followup-when-radiologists-reject-ai-findings

5. Asadi N, O’Sullivan S, et al. Mirage: The Illusion of Visual Understanding. Stanford / arXiv. March 26, 2026 [preprint; peer review pending]. https://arxiv.org/abs/2603.21687

6. Royal College of Radiologists. Post-deployment monitoring and safety reporting of AI medical imaging devices in clinical practice. March 2026. https://www.rcr.ac.uk/our-services/all-our-publications/clinical-radiology-publications/post-deployment-monitoring-and-safety-reporting-of-ai-medical-imaging-devices-in-clinical-practice
