94% Sensitivity: At What Threshold? On Whose Patients? Reading an AI Validation Study
RADIOLOGY IN THE AGE OF AI & VLMS | ARTICLE 3 OF 14
A colleague forwarded you a paper this morning. Subject line: “thoughts?” The AUC is 0.94. The methods section says “multicenter validation.” The conclusion says the tool performs at or above experienced radiologists.
You have three minutes before the tech grabs you for a procedure. This article is for that moment: for the radiologist who wants to read AI literature independently and know whether a tool actually holds up before anyone asks for a vote.
Start With the Data, Not the Number
The first question is not what the model scored. It is where the data came from and whether it looks anything like your practice. A 2025 systematic review found that 76% of FDA-cleared radiology AI devices had no prospective testing: they were cleared via the 510(k) pathway by demonstrating equivalence to a predicate device, not by being tested on real patients moving through a real workflow.¹
Retrospective validation on curated cases inflates performance numbers. The AUC in the abstract is often the ceiling, not the floor. Ask whether the test set was held out prospectively, whether training data came from one academic center or multiple sites, and whether the case mix matches your population.
What AUC Tells You and What It Does Not
AUC is a threshold-independent summary, useful but insufficient. A model with 95% sensitivity sounds reassuring until you compute the PPV at your local prevalence. At 5% prevalence, even with 80% specificity, PPV is only 20%, and it falls further as specificity drops: most of what the model flags will be negative. This arithmetic is routine for any screening test, yet it rarely gets applied to AI papers.
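The calculation is one line of Bayes' rule. Here is a minimal sketch in Python, assuming the paper's 95% sensitivity and a hypothetical 80% specificity; substitute your own operating point and prevalence:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence               # flagged and diseased
    false_pos = (1 - specificity) * (1 - prevalence)  # flagged but healthy
    return true_pos / (true_pos + false_pos)

# 95% sensitivity, a hypothetical 80% specificity, 5% disease prevalence:
print(f"PPV = {ppv(0.95, 0.80, 0.05):.1%}")  # PPV = 20.0%
```

Drop prevalence to 1%, as in some screening populations, and PPV falls below 5%.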
External validation AUCs consistently fall 0.03 to 0.10 or more below internal numbers, and specificity can drop 24 or more percentage points when the model moves to a new institution.² Red flags in any study: single-site data, no reader study, no radiologist baseline, case-control design, and thresholds selected after the fact to optimize the reported metric.
What Are SA-ROC and the Gray Zone?
Kim et al. (MGH/Harvard, npj Digital Medicine, 2026) built a framework that addresses what AUC cannot capture.⁴ They evaluated two FDA-cleared mammography AI tools. The higher-AUC model (0.928) was operationally inferior to the lower-AUC model (0.882) for high-volume screening; the lower-AUC tool safely automated more cases at the strictest safety threshold. AUC alone would have led to the wrong purchasing decision.
SA-ROC partitions predictions into three zones: Rule-In Safe Zone, Rule-Out Safe Zone, and Gray Zone. The Gray Zone holds every case the model cannot confidently classify and must hand to a radiologist. The Gray Zone Area (Gamma-Area) quantifies that operational cost as a fraction of total volume. It also allows clinical teams to set safety thresholds (alpha+ and alpha-) before deployment, shifting the conversation from “what did this model score” to “what can it safely automate at our standards.”
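The partitioning is simple to express. Below is a minimal sketch of the idea, not the authors' implementation: the simulated scores and threshold values are illustrative, and in practice the thresholds are derived from labeled validation data so the safe zones meet the preset alpha+ and alpha- targets.

```python
import numpy as np

def gray_zone_area(scores: np.ndarray, rule_out_thr: float, rule_in_thr: float) -> float:
    """Fraction of total volume the model cannot confidently classify.

    Scores below rule_out_thr fall in the Rule-Out Safe Zone, scores
    above rule_in_thr fall in the Rule-In Safe Zone, and everything in
    between is the Gray Zone a radiologist still has to read.
    """
    in_gray = (scores >= rule_out_thr) & (scores <= rule_in_thr)
    return float(in_gray.mean())

# Illustrative only: simulated scores for 10,000 screening exams, skewed
# toward benign as a screening population would be. These thresholds are
# made up; real ones come from the alpha+ / alpha- safety targets.
rng = np.random.default_rng(0)
scores = rng.beta(0.5, 4.0, size=10_000)
print(f"Gamma-Area = {gray_zone_area(scores, 0.05, 0.60):.1%}")
```

Run the same calculation for two candidate tools at the same safety targets and you get a direct, apples-to-apples answer to how much work each one actually removes.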
The Gray Zone Area is the number vendors don’t put in their sales materials. It tells you how many cases you will still be reading yourself, and that is what determines whether a tool relieves your workload or just adds a second reader to disagree with.
Predicting Value Before You Buy: A New Framework
A highly accurate tool for a task that is neither time-consuming nor error-prone delivers limited value regardless of its AUC. Larson, Poff et al. (Stanford AIDE Lab / Radiology Partners / Aidoc, AJR, 2026) addressed this directly by prospectively evaluating 13 AI models across 12 clinical tasks and approximately 89,000 exams.⁵
Before deployment, a workgroup rated each tool on three attributes: the tediousness of the task, the likelihood a radiologist would miss the finding, and the clinical impact if missed. Those predeployment predictions matched real-world radiologist-reported value in 10 of 12 tasks. The study also provides clean empirical grounding for the complementarity argument: across all models, AI achieved higher sensitivity while radiologists maintained higher PPV. Neither party is redundant. The radiologist’s comparative advantage is precision; AI’s is recall.
Three questions before any AI purchase: Is this task genuinely tedious? Is the miss rate non-trivial under normal clinical conditions? Is a miss consequential? If all three are yes, the tool has a reasonable prior probability of earning its place.
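As a back-of-napkin screen, those three questions reduce to a few lines. The sketch below is my own illustration, not the study's instrument: the attribute names follow the framework, but the 1-to-5 scale and the cutoff are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Predeployment ratings on the three Larson-framework attributes.

    The 1-to-5 scale and the cutoff below are illustrative assumptions,
    not values from the study.
    """
    tediousness: int      # how tedious is the task for a radiologist?
    miss_likelihood: int  # how likely is a miss under normal conditions?
    miss_impact: int      # how consequential is a missed finding?

    def worth_piloting(self, cutoff: int = 3) -> bool:
        # All three attributes must clear the bar, mirroring the
        # "all three yes" heuristic above.
        return min(self.tediousness, self.miss_likelihood, self.miss_impact) >= cutoff

print(TaskProfile(tediousness=4, miss_likelihood=4, miss_impact=5).worth_piloting())  # True
```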
Five Questions to Ask Any Vendor
1. Where was the model trained and tested? Specific institutions, scanner manufacturers, patient demographics. If training data came entirely from one academic center abroad, transferability is a legitimate question.
2. What is performance at the threshold you will use? Not the best-case AUC. Ask for sensitivity and specificity at the clinical operating point, then run the PPV arithmetic above at your local prevalence.
3. What is the Gray Zone Area at your safety threshold? And how does this tool score on tediousness, miss likelihood, and clinical impact? If the vendor cannot engage with either question, they are not thinking in operational terms.
4. Was there a prospective external validation? Retrospective validation on curated data is a starting point, not a substitute for real patients in a real workflow at a site other than the development institution.
5. What monitoring is built in? How will you know if performance degrades after deployment? This separates tools designed for long-term clinical use from those designed to get past a purchasing committee.
The Skill Worth Developing Now
Two peer-reviewed frameworks now give you something more actionable than AUC. SA-ROC tells you what the model can safely automate at your clinical standards, and Gamma-Area quantifies the cost of its uncertainty. The Larson framework tells you, before a pilot, whether the problem the tool solves is one where AI is actually likely to matter. Together, they move the conversation from benchmark performance to operational reality. That is the only conversation worth having.
Up Next in Article 4:
What Happens If We Double Productivity? The Northwestern Medicine JAMA Network Open data, the Swedish MASAI trial, and the scenario planning every practice leader should consider now.
AI can increase output per radiologist if it behaves like a well-trained fellow. If it behaves like a first-year, supervision friction is too high.
If you want to deploy AI in a way that expands effective capacity, protects revenue, and surfaces risk early, I can help. I am identifying a small number of forward-leaning partner sites to build and pilot independent AI performance evaluation software in real clinical workflows.
More information:
https://orainformatics.com/aiconsulting/
References
1. Sivakumar et al. FDA Approval of AI and Machine Learning Devices in Radiology: A Systematic Review. JAMA Network Open, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12595527/
2. Assessing the Generalizability of AI in Radiology: A Systematic Review. PMC, 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12689012/
3. Allen, Dreyer et al. ACR/ESR/RSNA Multi-Society Statement on Developing, Purchasing, Implementing and Monitoring AI Tools in Radiology. Radiology: AI, 2024. https://pubs.rsna.org/doi/full/10.1148/ryai.230513
4. Kim et al. (MGH/Harvard). Defining Operational Safety in Clinical AI Systems (SA-ROC Framework). npj Digital Medicine, 2026. https://doi.org/10.1038/s41746-026-02450-7
5. Larson, Poff et al. (Stanford AIDE Lab / Radiology Partners / Aidoc). Predicting the Value of Radiology AI Applications: Large-Scale Predeployment Evaluation. AJR, 2026. https://www.ajronline.org/doi/10.2214/AJR.25.34340

