AI Vendors Know the AUC. Does Anyone Know If the Outcomes Are Right?

RADIOLOGY IN THE AGE OF AI & VLMS  |  ARTICLE 12 OF 14

A vendor using VLMs to generate reports sends our group a validation study. An AUC of 0.93, an external cohort, peer review. It is genuinely good work.

Six months after go-live, if the model quietly degrades, shifts its behavior, or starts producing reports that are internally consistent but clinically wrong, how would we know?

This is not a procurement problem but a measurement problem. That is the distinction doctors working inside the VLM evaluation space are beginning to draw.

Four Archetypes

The market has settled into four recognizable archetypes. At one end is the full-stack institutional build: Cognita CXR, developed by Mosaic Clinical Technologies in partnership with Radiology Partners, became the first generative radiology AI to receive FDA Breakthrough Device Designation in March 2026.¹ That is the ceiling of what an enterprise build effort looks like, and it is resource intensive.

At the other end is acquisition as AI strategy: RadNet has deployed more than $340 million in acquisitions targeting Gleamer, iCAD, See-Mode, and CIMAR, building toward a stated goal of approximately $140 million in AI annual recurring revenue.² That is a network play with a fundamentally different logic. Between those poles, most practices are buying from vendors or running locally hosted models, a space where the open-source gap with proprietary systems is narrowing faster than most radiologists realize.³

Each archetype carries a distinct risk profile and a distinct observability requirement. The archetype decision is the first question. The monitoring infrastructure question is the second, and it does not go away regardless of which option you choose.

Measurement Challenge

Vendors know their model’s AUC. Press them and they can usually provide sensitivity, specificity, and, for the more sophisticated vendors, a Gray Zone Area from the Safety-Aware ROC framework that characterizes the cases where AI performance is genuinely uncertain.⁴

What they cannot tell you is whether the reports their model produced led to correct clinical decisions, and whether those decisions produced correct patient outcomes.

That gap is not a vendor failure. It is a structural absence in the field, and physicians need to understand it before they sign anything.

Four Stages

The distinction that is beginning to emerge among doctors evaluating AI at the system level is what I call the four stages of truth.

Stage 1 is Agreement: AI output compared against the radiologist’s edited version. Fast, scalable, directly tied to workflow. The limitation is fundamental: you are measuring consistency, not accuracy. A confident but wrong report can score well if the radiologist does not catch the error. This is where most vendor contracts define success.

Stage 2 is Correctness: structured comparison against pathology or imaging follow-up, failure mode taxonomy, multi-reader disagreement analysis. This moves from text similarity toward clinical structure alignment. It is still a proxy, but a better one.

Stage 3 is Decision-level validation: did this report lead to the correct clinical management? Was the biopsy triggered? Was the surgery ordered? Was the follow-up scheduled at the right interval? This is where report quality and decision quality begin to diverge, and that divergence is clinically meaningful.

Stage 4 is Outcome-level truth: biopsy-proven diagnoses, surgical confirmation, longitudinal follow-up imaging. The only real ground truth. Essentially absent from every vendor contract in the field today. Most contracts define success at Stage 1. Radiologists should be demanding Stage 2 at minimum, asking vendors about Stage 3, and understanding that Stage 4 is where the field is heading.
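
To make the Stage 1 versus Stage 2 gap concrete, here is a minimal sketch, under the assumption of a simplified case representation in which each case carries the AI draft’s findings, the radiologist’s signed findings, and, where available, a pathology-confirmed label. The field names and toy data are illustrative, not any vendor’s actual schema.

```python
# Illustrative sketch: Stage 1 (agreement) vs Stage 2 (correctness) metrics.
# All field names and the toy data are hypothetical; a real pipeline would pull
# from the RIS/PACS and pathology systems.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Case:
    ai_findings: set[str]           # findings asserted in the AI-drafted report
    final_findings: set[str]        # findings in the radiologist's signed report
    pathology_findings: Optional[set[str]] = None  # biopsy/surgical truth, if any

def stage1_agreement(case: Case) -> float:
    """Stage 1: overlap between AI draft and signed report (consistency, not accuracy)."""
    union = case.ai_findings | case.final_findings
    if not union:
        return 1.0  # both empty: trivially in agreement
    return len(case.ai_findings & case.final_findings) / len(union)

def stage2_correctness(case: Case) -> Optional[float]:
    """Stage 2: overlap between AI draft and pathology-confirmed findings, when available."""
    if case.pathology_findings is None:
        return None  # no ground truth yet; Stage 2 cannot be scored
    union = case.ai_findings | case.pathology_findings
    if not union:
        return 1.0
    return len(case.ai_findings & case.pathology_findings) / len(union)

# A confidently wrong report the radiologist did not correct scores perfectly
# at Stage 1 and poorly at Stage 2.
case = Case(
    ai_findings={"spiculated nodule RUL"},
    final_findings={"spiculated nodule RUL"},   # error not caught at signoff
    pathology_findings={"granuloma RUL"},       # biopsy says otherwise
)
print(stage1_agreement(case))    # 1.0
print(stage2_correctness(case))  # 0.0
```

The point of the toy case is the divergence: a report the radiologist signs unedited is a Stage 1 success by definition, and only an outcome source can say whether it was also correct.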

There are tools that already exist to help with the earlier stages. The SA-ROC framework from Kim and colleagues at MGH and Harvard allows any practice to ask vendors a simple, decisive question: what is your model’s Gray Zone Area at alpha equals one hundred percent? If they cannot produce it, that is an answer.⁴
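
The precise Gray Zone Area definition lives in the SA-ROC paper. As a rough, hedged illustration of the underlying idea only, and explicitly my simplification rather than Kim and colleagues’ formula, one can think of it as the fraction of cases whose scores fall between a threshold chosen for very high sensitivity and a threshold chosen for very high specificity:

```python
# Hedged illustration only: a crude "gray zone fraction" between two operating
# thresholds. This is NOT the SA-ROC definition from Kim et al.; see the paper
# for the actual framework and its alpha parameterization.

import numpy as np

def gray_zone_fraction(scores, labels, target_sensitivity=0.99, target_specificity=0.99):
    """Fraction of cases scoring between a high-sensitivity cutoff (below which the
    model may safely rule out) and a high-specificity cutoff (above which it may
    safely rule in). Cases in between are where the model is genuinely uncertain."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    pos, neg = scores[labels == 1], scores[labels == 0]
    # Lowest threshold that still catches target_sensitivity of positives.
    rule_out_cutoff = np.quantile(pos, 1.0 - target_sensitivity)
    # Highest threshold that still excludes target_specificity of negatives.
    rule_in_cutoff = np.quantile(neg, target_specificity)

    in_gray_zone = (scores >= rule_out_cutoff) & (scores <= rule_in_cutoff)
    return float(in_gray_zone.mean())

# The wider the overlap between positive and negative score distributions,
# the larger the gray zone a vendor should be reporting.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.15, 500), rng.normal(0.4, 0.15, 500)])
labels = np.array([1] * 500 + [0] * 500)
print(round(gray_zone_fraction(scores, labels), 3))
```

A vendor quoting only AUC is summarizing the whole score distribution; the gray zone question asks how much of the caseload sits where that distribution cannot safely drive a decision.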

The predeployment portfolio evaluation methodology from Larson, Poff, and colleagues at the Stanford AIDE Lab provides an institutional due diligence standard with real-world teeth: across approximately 89,000 exams and 12 clinical tasks, predeployment performance predictions matched real-world perceived value in 10 of the 12 tasks.⁵

And any vendor evaluation should include a direct question about the MIRAGE failure mode: what fraction of the model’s benchmark accuracy is attributable to non-visual inference, meaning textual patterns, positioning artifacts, or demographic signals rather than image interpretation?⁶
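
One way to approximate an answer in-house, sketched below under the assumption that you can re-run the vendor model with the image withheld: if accuracy barely drops without the pixels, much of the benchmark score is coming from non-visual signal. The `run_model` call is a placeholder for whatever inference interface the vendor actually exposes, not a real API.

```python
# Hypothetical sketch: estimate how much of a model's benchmark accuracy survives
# when the image is withheld. A large "non-visual share" suggests the model is
# leaning on text, positioning artifacts, or demographic priors rather than pixels.
# `run_model` is a placeholder for the vendor's actual inference interface.

def run_model(image, clinical_context) -> str:
    """Placeholder: returns the model's predicted label for one case."""
    raise NotImplementedError

def accuracy(cases, use_image=True):
    correct = 0
    for case in cases:
        image = case["image"] if use_image else None  # withhold pixels in the ablation
        prediction = run_model(image, case["clinical_context"])
        correct += int(prediction == case["label"])
    return correct / len(cases)

def non_visual_share(cases, chance_level=0.5):
    """Rough estimate of the fraction of above-chance accuracy that does not
    require the image at all."""
    full = accuracy(cases, use_image=True)
    blind = accuracy(cases, use_image=False)
    if full <= chance_level:
        return 0.0
    return max(0.0, (blind - chance_level) / (full - chance_level))
```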

A vendor that cannot answer that question has not characterized a failure mode that is invisible to standard accuracy reporting.

The Royal College of Radiologists’ post-deployment monitoring guidance provides the professional-body benchmark for what responsible postmarket surveillance looks like.⁷

There is also a failure mode that standard accuracy metrics do not capture at all. The NOHARM benchmark, cited in the 2026 Stanford-Harvard AI Index, found that leading large language models produced between 11.8 and 14.6 severely harmful recommendations per 100 clinical cases.

The striking detail: 76.6 percent of those errors were errors of omission rather than commission.⁸ AI misses findings more than it fabricates them.
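
Taken at face value, the arithmetic is roughly nine to eleven omitted-finding errors per 100 cases: 0.766 × 11.8 ≈ 9 and 0.766 × 14.6 ≈ 11.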

This is clinically worse than hallucination: a fabricated finding is visible in the report, disputable on review, and correctable. An omitted finding is simply not there, and you cannot dispute what was never written. Standard metrics (AUC, sensitivity, specificity) all measure what the model produces. They do not measure what it does not produce.

Human-AI disagreement monitoring, and specifically the pattern of radiologist addenda after AI-assisted reads, is the only workflow-embedded signal that can track omission-rate drift over time. The practice that is not tracking addenda patterns is not monitoring the failure mode most likely to produce a missed diagnosis.
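
A minimal sketch of what that tracking could look like, assuming the practice can export, per week, the number of AI-assisted reads and the number that later received an addendum adding a missed finding. The window sizes and the simple control-chart rule are illustrative choices, not a validated alarm policy.

```python
# Illustrative sketch: track the weekly rate of addenda that add a missed finding
# after AI-assisted reads, and flag drift against a baseline period.
# Window sizes and the 3-sigma rule are illustrative choices, not a validated policy.

import math

def drift_alerts(weekly_counts, baseline_weeks=12, z_threshold=3.0):
    """weekly_counts: list of (ai_assisted_reads, addenda_adding_findings) per week.
    Flags any post-baseline week whose addendum rate exceeds the baseline rate by
    more than z_threshold standard errors (a simple p-chart style rule)."""
    baseline = weekly_counts[:baseline_weeks]
    base_reads = sum(r for r, _ in baseline)
    base_added = sum(a for _, a in baseline)
    p0 = base_added / base_reads

    alerts = []
    for week_index, (reads, added) in enumerate(weekly_counts[baseline_weeks:], start=baseline_weeks):
        if reads == 0:
            continue
        rate = added / reads
        stderr = math.sqrt(p0 * (1 - p0) / reads)
        if stderr > 0 and (rate - p0) / stderr > z_threshold:
            alerts.append((week_index, rate))
    return p0, alerts

# Example: a stable baseline around 2 percent, then a jump to 5 percent.
counts = [(400, 8)] * 12 + [(400, 9), (400, 20), (410, 22)]
baseline_rate, alerts = drift_alerts(counts)
print(baseline_rate)   # 0.02
print(alerts)          # weeks where the omission-proxy rate drifted upward
```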

Which brings us to what observability infrastructure can eventually become. Post-deployment monitoring, implemented consistently, can evolve into something more valuable than a compliance record. By linking imaging interpretations to downstream clinical events, including pathology reports, surgical findings, follow-up imaging, and clinical course, a practice builds the only framework capable of answering Stage 4 questions at scale.

The practical starting points are high-signal, controlled workflows: spine MRI correlated with subsequent surgical or interventional management; lung nodule reports correlated with biopsy confirmation or progression on follow-up CT. An important caveat applies: downstream clinical events are a signal, not a gold standard. Surgeons operate on ambiguous findings. Outcomes require contextual interpretation. But over time, this infrastructure creates a longitudinal clinical truth engine that no vendor product and no static monitoring tool currently provides.
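
What that linkage could look like in practice, as a minimal sketch: join report-level findings to later pathology, surgical, or follow-up imaging events by patient and anatomic site. The tables and field names below are hypothetical, not a real RIS/LIS schema, and as the caveat above notes, the downstream event is a signal to be interpreted, not an automatic ground truth.

```python
# Hypothetical sketch: link report findings to downstream clinical events
# (pathology, surgery, follow-up imaging) so Stage 3/4 questions can be asked at scale.
# Table and field names are illustrative, not a real RIS/LIS schema.

import sqlite3

schema = """
CREATE TABLE report_findings (
    report_id      TEXT,
    patient_id     TEXT,
    report_date    TEXT,
    anatomic_site  TEXT,
    finding        TEXT,
    ai_assisted    INTEGER   -- 1 if the report was drafted with AI assistance
);
CREATE TABLE downstream_events (
    patient_id     TEXT,
    event_date     TEXT,
    anatomic_site  TEXT,
    event_type     TEXT,     -- e.g. 'pathology', 'surgery', 'follow_up_ct'
    result         TEXT      -- e.g. 'malignant', 'benign', 'progressed', 'stable'
);
"""

# For each AI-assisted finding, pull downstream events at the same site within a year.
# The join is the raw material for Stage 3 (was the right action taken?) and
# Stage 4 (was the finding confirmed?) review; it does not answer those questions by itself.
linkage_query = """
SELECT f.report_id, f.finding, e.event_type, e.result,
       julianday(e.event_date) - julianday(f.report_date) AS days_to_event
FROM report_findings f
JOIN downstream_events e
  ON e.patient_id = f.patient_id
 AND e.anatomic_site = f.anatomic_site
 AND e.event_date > f.report_date
 AND julianday(e.event_date) - julianday(f.report_date) <= 365
WHERE f.ai_assisted = 1
ORDER BY f.report_id, days_to_event;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
for row in conn.execute(linkage_query):
    print(row)
```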

The contract checklist is table stakes: data ownership and training use rights, performance guarantees and failure criteria, postmarket monitoring obligations, indemnification terms. But the deeper question, which most contracts do not yet address, is what happens when the model is right by agreement metrics and wrong by clinical outcome. That question is coming. The practice that has already built the measurement infrastructure to answer it is in a structurally different position than the practice that has not.

The doctor who can walk into a vendor meeting and ask the Stage 3 and Stage 4 questions that no one else is asking is doing something that goes beyond procurement due diligence. That physician is beginning to define what clinical accountability for AI actually means in practice, and in doing so, is reasserting something the specialty has been quietly ceding for thirty years: the radiologist as a genuine clinical partner, not a report factory.

For three decades, workflow pressure pushed radiology toward higher throughput and less consultation. AI is reversing that pressure, but only for radiologists who choose to meet it.

What Is Coming Next – Article 13

Article 13 takes up that argument directly: what it looks like to become the Doctor's Doctor again, how AI changes the comparative advantage equation when the routine is handled and the complex remains, and what the specialty needs to build, in training, in workflow, and in clinical relationships, to occupy that space before someone else does.


Most clinical AI systems are evaluated before deployment and assumed to perform the same in production. In reality, performance shifts across sites, scanners, populations, and workflows, and those shifts are rarely measured systematically.

If you are building or deploying AI in radiology, including VLM-based reporting or multi-model orchestration systems, you need a way to monitor real-world behavior continuously. This includes tracking disagreement with clinicians, identifying drift, and understanding failure modes over time.

Veriloop provides a vendor-agnostic observability layer for clinical AI. We sit downstream of your model and workflow, measuring performance where it matters: in production, across real cases, with real users.

This is not model evaluation. This is system monitoring.

Contact:

ty@orainformatics.com

References

1. Cognita CXR FDA Breakthrough Device Designation. Mosaic Clinical Technologies / Radiology Partners. Business Wire, March 5, 2026. https://www.businesswire.com/news/home/20260304633206/en/Mosaic-Clinical-Technologies-Announces-FDA-Breakthrough-Device-Designation-for-Cognitas-Generative-AI-Model-for-Radiology

2. RadNet M&A Strategy: $340M+ in AI Acquisitions. Radiology Business, March 4, 2026. https://radiologybusiness.com/topics/healthcare-management/mergers-and-acquisitions/radnet-has-allocated-over-340m-acquisitions-already-2026-leaders-discuss-why-and-whats-next

3. Kim SH, Schramm S, Adams LC, et al. Benchmarking the diagnostic performance of open source LLMs in 1,933 Eurorad case reports. npj Digital Medicine, February 12, 2025. https://www.nature.com/articles/s41746-025-01488-3

4. Kim Y-T, Kim H, Bahl M, et al. Defining operational safety in clinical artificial intelligence systems [SA-ROC framework]. npj Digital Medicine, February 2026. https://www.nature.com/articles/s41746-026-02450-7

5. Larson DB, Poff JA, Krishnan S, et al. Predicting the Value of Radiology Artificial Intelligence Applications: Large-Scale Predeployment Evaluation of a Portfolio of Models. AJR, 2026. https://www.ajronline.org/doi/10.2214/AJR.25.34340

6. Asadi M, O’Sullivan JW, Cao F, et al. MIRAGE: The Illusion of Visual Understanding in Medical AI. arXiv:2603.21687v2, March 26, 2026. [Preprint; not yet peer reviewed.] https://arxiv.org/abs/2603.21687

7. Post-deployment monitoring and safety reporting of AI medical imaging devices in clinical practice. Royal College of Radiologists, March 2026. [UK NHS context; applicable in principle internationally.] https://www.rcr.ac.uk/our-services/all-our-publications/clinical-radiology-publications/post-deployment-monitoring-and-safety-reporting-of-ai-medical-imaging-devices-in-clinical-practice

8. NOHARM benchmark. AI Index 2026, Chapter 6: Medicine. Stanford HAI / Harvard. https://aiindex.stanford.edu/report/
