Deployment Is Where Validation Ends and Observability Begins.
RADIOLOGY IN THE AGE OF AI & VLMS | ARTICLE 6 OF 14
The last two articles in this series made a practical case. Generative AI is already producing radiology reports at scale, with measured efficiency gains and no accuracy loss in prospective deployment. And the radiologist workforce is not growing fast enough to absorb the volume that is coming. Those two facts together do not create a problem; they create a mission. The question Article 6 addresses is what it takes to be ready for it.
Readiness Is Not the Same as Caution
Early in my career I was a Navy flight surgeon assigned to a Marine Corps squadron. The flight surgeon’s job is not primarily about grounding pilots or cataloging what can go wrong. The job is readiness: ensuring the human systems, the physiological and psychological conditions that enable performance, are squared away before the mission launches. The squadron achieves its mission because the conditions for safe, effective operations were built before anyone climbed into the cockpit, not assembled in response to a mishap after the fact.
This is the posture radiology needs toward AI deployment right now, and it is almost entirely absent from how the field is currently approaching the problem. The conversation has been dominated by procurement checklists, FDA clearance thresholds, and accuracy benchmarks measured on validation datasets that bear no particular relationship to any given institution’s scanner fleet, patient population, or clinical workflow. Those are useful inputs, but they are not readiness.
Readiness, in the context of AI-assisted radiology at scale, is the structured, continuous capacity to know whether your deployed AI is performing as expected: across patient subgroups, across imaging protocols, across time. It is the observability layer. And building it before the volume arrives is not excessive caution. It is mission preparation.
What Degrades After Deployment
The reasons a validated AI system can drift after go-live are mundane and constant. A scanner software upgrade shifts pixel intensity distributions. A new referring population changes the prevalence of findings the model was trained to detect. A protocol change alters how a finding appears at the slice thickness the system expects. None of these events represent vendor failure. They represent clinical operations. A 2025 JACR review documented what most practicing radiologists already sense: the governance infrastructure for detecting these shifts in commercial clinical AI is largely absent.¹ A companion analysis found that 64 percent of commercially available radiology AI products carried no peer-reviewed evidence supporting their performance claims, and most had no active monitoring protocol once deployed.²
The field has formal terms for what accumulates in that absence. Data drift is the divergence between the input distribution a model trained on and the distribution it encounters in production. Concept drift is the shift in the relationship between imaging features and clinical diagnoses over time, independent of the imaging data itself. Collision risk is the compounding failure mode that emerges when multiple AI tools operate on the same study, producing outputs that interact in ways no single radiologist is positioned to reconcile. When agents enter the picture, the exposure deepens further: a system that can order a follow-up CT, page an oncologist, or update an EHR without explicit physician authorization at each step compounds errors across handoffs rather than containing them within a single study.
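What detection looks like in practice can be surprisingly simple. The sketch below monitors a single summary feature, per-study mean pixel intensity, by comparing a validation-era reference sample against a recent production window with a two-sample Kolmogorov–Smirnov test. The feature choice, window sizes, and significance threshold are illustrative assumptions, not a prescription for any particular deployment.

```python
# Minimal data drift check: compare a pre-deployment reference sample
# against a rolling production window on one summary feature.
# All thresholds and sample sizes here are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drift_check(reference: np.ndarray, production: np.ndarray,
                alpha: float = 0.01) -> dict:
    """Two-sample Kolmogorov-Smirnov test on a per-study feature,
    e.g. mean pixel intensity after preprocessing."""
    stat, p_value = ks_2samp(reference, production)
    return {
        "ks_statistic": round(float(stat), 4),
        "p_value": float(p_value),
        "drift_flagged": bool(p_value < alpha),  # small p: distributions differ
    }

# Simulate a scanner software upgrade shifting intensity distributions.
rng = np.random.default_rng(seed=42)
reference = rng.normal(loc=0.48, scale=0.10, size=5000)   # validation era
production = rng.normal(loc=0.53, scale=0.10, size=800)   # post-upgrade week
print(drift_check(reference, production))
# Flags drift long before any accuracy metric has a chance to move.
```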
None of these failure modes are detectable through a one-time pre-deployment validation. All of them are detectable through continuous observability infrastructure.
That distinction matters because the alternative is the documented state of the field today. The RAISE end-to-end radiology AI safety framework described this gap in 2023 and the underlying conditions have not materially changed.³ The radiologist signing reports generated with AI assistance is, without that infrastructure, flying without instruments.
No Model Is Static
There is a persistent assumption in vendor conversations and institutional procurement committees that a cleared, validated AI tool is a stable instrument. It is not. Validation establishes performance at a point in time, on a specific data distribution, under specific conditions. None of those conditions are permanent.
A concrete illustration: GPT-4’s performance on the ACR in-training examination was tracked over several months after its initial release. Performance degraded measurably over time, with no model update and no change in the test itself. The model that scored at one level in month one was scoring lower by month six. This was a closed benchmark with fixed questions, not a dynamic clinical environment with scanner variability, protocol changes, and shifting patient populations.
If performance degrades on a static test, the reasonable expectation for a deployed clinical AI operating in a genuinely variable environment is that the problem is at least as large, and more consequential.
This point has been made authoritatively at the highest level of the radiology literature. Paschali, Langlotz, and colleagues at the Stanford AIDE Lab published a comprehensive peer-reviewed review of foundation models in the journal Radiology in February 2025 that dedicates an entire section to what they call the ‘Why Not’ dimension of responsible deployment.⁴ The review maps four categories of responsible deployment failures: evaluation gaps where benchmarks fail to predict clinical performance, generalizability failures of the kind Article 3 in this series documented, bias and fairness failures, and accountability gaps between what vendors claim and what post-deployment data shows. Langlotz’s senior authorship in a top-tier Radiological Society of North America (RSNA) journal means this is the field’s own authoritative framing, not an external critique. When a vendor pushes back on monitoring requirements, this is the citation that closes the argument.
The Human Factors Dimension
Mishap investigation in naval aviation has a human factors component for a reason: the system that failed was not purely mechanical. The same is true in AI-assisted radiology. A 2026 Nature Health study by Bernstein, Sheppard, Bruno, Baird, and colleagues tested what happens to legal liability when radiologists interact with AI in different workflow configurations.⁵ In the single-read condition, where the radiologist interpreted a CT only after seeing the AI flag it as abnormal, 74.7 percent of mock jurors found the radiologist liable when a finding was missed. In the double-read condition, where the radiologist completed an independent interpretation before receiving the AI feedback, the share of jurors siding with the plaintiff dropped to 52.9 percent. A workflow design change alone, with no change in the AI and no change in the clinical outcome, moved the liability needle by nearly 22 percentage points.
This is the human factors finding for AI-assisted radiology.
The problem is not the algorithm in isolation. It is the interaction between the algorithm and the workflow it is embedded in.
Automation bias, the tendency to defer to AI output even when independent clinical judgment would point elsewhere, is not a character flaw in individual radiologists. It is a predictable property of any decision-support system that is usually right. The observability layer needs to account for it, which means the workflow design needs to be part of what gets monitored, not just the model output.
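One observable proxy, sketched below with invented numbers, is the radiologist override rate: the fraction of AI-flagged cases where the signing radiologist disagrees. A sustained slide toward zero has two possible explanations, a better model or a reviewer who has stopped looking, and the observability layer cannot distinguish them without a deliberate check.

```python
# Sketch: track the monthly radiologist override rate as a workflow
# signal. A falling rate is ambiguous (better model, or automation
# bias) and should trigger review either way. Numbers are invented.
import numpy as np

monthly_override_rate = np.array([0.081, 0.074, 0.069, 0.055, 0.048, 0.039])
months = np.arange(len(monthly_override_rate))

slope = np.polyfit(months, monthly_override_rate, deg=1)[0]  # linear trend
if slope < -0.005:  # illustrative threshold, set per institution
    print(f"Override rate falling {abs(slope):.4f}/month: "
          "audit a sample of agreed cases before concluding the model improved.")
```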
At ECR 2026, Prof. Annemiek Snoeckx of Antwerp articulated the educational dimension of the same problem with precision. Trainees learning in an AI-saturated environment risk developing the habit of confirming the algorithm rather than building independent diagnostic pattern recognition. She was direct about the implication: which future arrives is not a property of the AI. It is a property of how we implement it and teach with it. That is a readiness question, not a technology question.⁶
What the Observability Layer Actually Measures
Building observability infrastructure for deployed AI requires confronting an uncomfortable finding about the metrics most institutions would reach for first. A 2025 study from Oxford, Glasgow, and HOPPR evaluated every major radiology report evaluation metric against board-certified radiologist judgment across 208 studies and more than 450 labeled clinical errors.⁷ CheXbert, the metric most practices would default to, is statistically misaligned: higher CheXbert scores correlate with more clinician-identified errors, not fewer. GREEN is the most reliable single metric for clinically significant errors, but even GREEN fails to track every error category reliably. Omissions, severity errors, grammatical failures, and temporal changes each require distinct signal sources. A monitoring infrastructure built on a single automated score may be watching the wrong instrument entirely.
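The practical implication is that a practice should audit any candidate metric against its own radiologists’ judgments before wiring it into the observability layer. A minimal version of that audit, with fabricated numbers standing in for local review data, looks like this:

```python
# Sketch: check whether an automated report metric agrees with local
# clinician review before trusting it for monitoring. Data fabricated.
from scipy.stats import kendalltau

# Automated metric scores per report (higher supposedly means better)
metric_scores = [0.91, 0.85, 0.78, 0.74, 0.69, 0.62, 0.55, 0.41]
# Clinician-identified error counts for the same reports
clinician_errors = [3, 2, 2, 1, 1, 0, 0, 0]

tau, p = kendalltau(metric_scores, clinician_errors)
print(f"Kendall's tau = {tau:.2f} (p = {p:.3f})")
# An aligned metric should show a strongly negative tau: higher score,
# fewer errors. A positive tau, the pattern reported for CheXbert,
# means the instrument is pointing the wrong way.
```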
The CRIMSON framework from the Rajpurkar Lab at Harvard addresses the clinical stakes dimension of the same problem.⁸ Standard metrics treat all errors as equivalent in weight. Missing a pulmonary embolism on a scan ordered for dyspnea is scored identically to missing an incidental finding on a trauma chest X-ray. CRIMSON incorporates patient age, clinical indication, and guideline-based decision rules before assigning weight to any finding error. Its alignment with radiologist judgment, at Kendall’s tau of 0.84, substantially exceeds any existing single metric.
An observability layer calibrated only to aggregate accuracy is not sufficient for clinical governance. The metric has to reflect clinical stakes.
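A toy sketch makes the weighting principle visible; this is the spirit of CRIMSON, not its actual rule set, and every weight below is invented. Weight each error by clinical context before aggregating, and the missed pulmonary embolism stops being arithmetically equal to the missed incidental.

```python
# Toy stakes-weighted error score: clinical context multiplies weight.
# Weights are illustrative, not CRIMSON's published rules.
from dataclasses import dataclass

@dataclass
class FindingError:
    finding: str
    missed: bool
    indication_match: bool  # error touches the reason for the exam
    actionable: bool        # would change management per guidelines

def stakes_weight(err: FindingError) -> float:
    if not err.missed:
        return 0.0
    weight = 1.0
    if err.indication_match:
        weight *= 4.0  # missing the indicated finding weighs heavily
    if err.actionable:
        weight *= 2.0
    return weight

errors = [
    FindingError("pulmonary embolism", True, True, True),   # exam for dyspnea
    FindingError("old rib fracture", True, False, False),   # incidental
]
print(sum(stakes_weight(e) for e in errors))  # 8.0 + 1.0, not a flat 1 + 1
```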
The Safety-Aware ROC framework from Kim and colleagues at MGH and Harvard offers a complementary structural instrument.⁹ SA-ROC partitions AI predictions into three zones: high-confidence positives, high-confidence negatives, and a Gray Zone where the model cannot classify with sufficient certainty. The Gray Zone is a structural mandate for radiologist review, built into the operating policy rather than left to post-hoc discretion. When the Gray Zone Area grows over time, that is a leading indicator of drift, visible in the observability layer before it surfaces as a clinical error.
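A minimal sketch of that partition shows how the Gray Zone fraction becomes a quantity the observability layer can track week over week. The thresholds below are placeholders; in a real deployment they would be derived from the validation ROC to guarantee target sensitivity and specificity in the confident zones.

```python
# Sketch of a three-zone operating policy in the spirit of SA-ROC.
# Thresholds are placeholders, not values from the published framework.
import numpy as np

def partition(scores: np.ndarray, t_low: float = 0.15,
              t_high: float = 0.85) -> dict:
    confident_neg = scores < t_low
    confident_pos = scores > t_high
    gray = ~(confident_neg | confident_pos)  # mandatory radiologist review
    return {
        "confident_negative": float(confident_neg.mean()),
        "confident_positive": float(confident_pos.mean()),
        "gray_zone_fraction": float(gray.mean()),
    }

# Track the Gray Zone fraction per reporting period. A sustained rise
# means the model is less certain about the cases it now sees: a
# leading indicator of drift, visible before aggregate accuracy moves.
weekly_scores = np.random.default_rng(0).beta(a=0.5, b=0.5, size=1000)
print(partition(weekly_scores))
```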
Interpretability is a related component that is sometimes framed as a luxury. The evidence does not support that framing. NV-Reason-CXR-3B, a chain-of-thought reasoning VLM developed through a collaboration of NVIDIA, NIH, and Yale, produces explicit reasoning traces alongside its outputs: differential diagnoses, uncertainty estimates, step-by-step justifications for each conclusion.¹⁰ A reader study showed that access to these reasoning traces increased radiologist confidence and reduced report finalization time. The argument that safety and efficiency are fundamentally in tension is not sustained by the data. When an AI system shows its work, radiologists can review it faster, calibrate their agreement or disagreement more precisely, and generate an auditable record of the clinical reasoning behind each signed report.
RAG architecture, the retrieval-augmented generation approach covered in Article 2, also contributes to the safety layer. Models that retrieve from verified institutional knowledge before generating output are more stable and more auditable than models generating purely from their training weights. They are not immune to drift, but the retrieval layer creates a traceable chain of evidence that pure parametric generation does not.
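A sketch of why retrieval helps the audit trail, with stand-in retriever and generator functions (nothing here is a real vendor API): every generated draft is logged alongside the documents that informed it, so later review can trace a claim to a retrievable source rather than an opaque weight.

```python
# Sketch: retrieval-augmented generation leaves a provenance trail.
# `retrieve` and `generate` are stand-ins, not real APIs.
import json, time

def retrieve(query: str, k: int = 3) -> list[dict]:
    # Stand-in: a real system queries a verified institutional index.
    return [{"doc_id": f"protocol-{i}", "text": "..."} for i in range(k)]

def generate(query: str, context: list[dict]) -> str:
    # Stand-in for the deployed model call.
    return f"Draft grounded in {len(context)} retrieved sources."

def answer_with_provenance(query: str) -> dict:
    passages = retrieve(query)
    record = {
        "timestamp": time.time(),
        "query": query,
        "sources": [p["doc_id"] for p in passages],  # traceable evidence
        "output": generate(query, passages),
    }
    with open("rag_audit.jsonl", "a") as f:  # append-only audit log
        f.write(json.dumps(record) + "\n")
    return record

print(answer_with_provenance("incidental pulmonary nodule follow-up interval"))
```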
The Four Questions That Define Governance Readiness
Before any AI tool goes live in a clinical workflow, someone in the practice needs to be able to answer four questions, and the answers need to be documented rather than assumed.
Who is responsible for monitoring this system after go-live? Not the vendor, not the contract, but a named human being with a defined review schedule. What specific performance threshold triggers a formal review, and is that threshold written into the deployment agreement before any cases are read? Who receives the notification when performance falls below threshold, and what authority does that person have to act? And what are the retirement criteria, meaning the conditions under which the tool is suspended, not merely flagged for additional observation?
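Documented rather than assumed means the answers can live in a record the practice can inspect, not a recollection. A minimal sketch, with every name and threshold invented for illustration:

```python
# Sketch: the four governance answers as a written record. All fields
# and values are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class DeploymentGovernance:
    tool_name: str
    monitoring_owner: str           # a named human, not "the vendor"
    review_cadence_days: int
    review_trigger: dict            # threshold written before go-live
    escalation_contact: str         # notified below threshold, with authority to act
    retirement_criteria: list[str]  # conditions that suspend, not merely flag

governance = DeploymentGovernance(
    tool_name="cxr-triage-v3",
    monitoring_owner="Dr. A. Example, QA lead",
    review_cadence_days=30,
    review_trigger={"metric": "gray_zone_fraction", "threshold": 0.25},
    escalation_contact="Chief of radiology informatics",
    retirement_criteria=[
        "clinically significant error rate above documented baseline for two consecutive reviews",
        "vendor model update deployed without local revalidation",
    ],
)
```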
Most vendor contracts are silent on all four. The Flamingo-CXR study established a useful reference point for what the error landscape looks like in a well-studied AI deployment: 22.8 percent of AI-generated reports contained clinically significant errors, compared to 14.0 percent for human-written reports in the same dataset.¹¹
The goal of an observability layer is not to eliminate that gap overnight. It is to ensure the gap is known, stable, and proportionate to the clinical context the system is operating in.
A flight surgeon does not expect a zero-mishap deployment. The job is to make sure the squadron knows its actual readiness state before the mission launches, and has the infrastructure to detect when that state changes.
Up Next in Article 7:
The observability requirements described in this article apply to single-model AI tools operating within defined tasks. When AI moves into agentic workflows, where a system can order a follow-up study, page a referring clinician, or update an EHR entry without direct physician authorization at each handoff, the stakes for observability infrastructure scale proportionally. Errors in agentic systems do not stay contained: they compound across handoffs in ways no downstream reviewer is positioned to reconstruct after the fact. Article 7 examines that architecture and what governance over it actually requires.
AI can increase output per radiologist if it behaves like a well-trained fellow.
If it behaves like a first-year, supervision friction is too high.
If you want to deploy AI in a way that expands effective capacity, protects revenue, and surfaces risk early, I can help.
I am identifying a small number of forward-leaning partner sites to build and pilot independent AI performance evaluation software in real clinical workflows.
Feel free to send me an email at: ty@orainformatics.com
References
1. Quinn and Lee. “Postdeployment Monitoring of Artificial Intelligence in Radiology: Stop the Drift.” Journal of the American College of Radiology, 2025. https://www.jacr.org/article/S1546-1440(25)00451-X/abstract
2. “Real-World Monitoring of AI in Radiology: Challenges and Best Practices.” PMC, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12568762/
3. “RAISE: Radiology AI Safety, an End-to-End Approach.” arXiv preprint, 2023. https://arxiv.org/pdf/2311.14570
4. Paschali, Chen, Blankemeier, Varma, Youssef, Bluethgen, Langlotz, Gatidis, Chaudhari (Stanford AIDE Lab). “Foundation Models in Radiology: What, How, Why, and Why Not.” Radiology, February 2025. https://pubs.rsna.org/doi/10.1148/radiol.240597
5. Bernstein, Sheppard, Bruno, Baird et al. “The Radiologist-AI Workflow and the Risk of Medical Malpractice Claims.” Nature Health, March 10, 2026. https://www.nature.com/articles/s44360-026-00085-2
6. Rylands-Monk. “ECR: Does AI Toll the Beginning of the End for Chest X-Ray Reporting?” AuntMinnie Europe, March 6, 2026 (ECR 2026 session coverage: Snoeckx on deskilling/automation bias). https://www.auntminnieeurope.com/resources/conferences/ecr/2026/article/15819005/ecr-does-ai-toll-the-beginning-of-the-end-for-chest-xray-reporting
7. Xu, Zhang, Abderezaei et al. “RadEval: A Framework for Radiology Text Evaluation.” EMNLP 2025 System Demonstrations. https://aclanthology.org/2025.emnlp-demos.40.pdf
8. Baharoon, Heintz, Raissi et al. “CRIMSON: Clinically-Relevant and Interpretable Medical Score for Radiology Report Evaluation.” arXiv, March 2026 [preprint]. https://arxiv.org/abs/2603.06183
9. Kim et al. “Defining Operational Safety in Clinical Artificial Intelligence Systems (SA-ROC Framework).” npj Digital Medicine, 2026. https://doi.org/10.1038/s41746-026-02450-7
10. Myronenko et al. (NVIDIA/NIH/Yale). “Reasoning Visual Language Model for Chest X-Ray Analysis (NV-Reason-CXR-3B).” arXiv, 2025. https://arxiv.org/abs/2510.23968
11. Tanno, Barrett, Karthikesalingam et al. (Google DeepMind). “Collaboration Between Clinicians and Vision-Language Models in Radiology Report Generation.” Nature Medicine, 2024. https://www.nature.com/articles/s41591-024-03302-1

