The Findings Checklist Workflow: A Practical Protocol for Human-AI Collaboration

RADIOLOGY IN THE AGE OF AI & VLMS

Much of aviation's safety record over the last fifty years traces back to the checklist. Radiology has the tool. It does not yet have the protocol.

Over the past fourteen articles in the series “Radiology in the Age of AI & VLMs,” I have covered what these systems can do, where they fail, who bears liability when they do, how to evaluate them before deployment, how to monitor them after, and what a radiologist’s judgment is worth in an era when AI can generate the first draft. None of that is theoretical anymore. This piece builds on that foundation and addresses a question the series raised but did not fully resolve: when the AI hands you something to sign, what is the right cognitive sequence for a radiologist who takes that responsibility seriously?

The answer is a checklist. Not a new form. Not a compliance module. A structured cognitive protocol, embedded in the reading workflow, that preserves independent judgment while capturing the real value AI offers. Radiology has resisted this longer than it should have.

The Checklist Problem in Medicine

Atul Gawande made the case for surgical checklists in 2009, and the evidence was compelling.¹ The WHO Surgical Safety Checklist was associated with reduced complications and mortality in a multicenter study spanning high-income and low-income settings. The concept spread into intensive care, pharmacy, and emergency medicine. It did not spread as cleanly into diagnostic radiology, in part because the radiology workflow was already structured around a different cognitive model: the radiologist as autonomous expert, reviewing images and dictating findings, with minimal procedural scaffolding around the act of interpretation itself.

That model made sense when the radiologist was the only reader. It becomes fragile when the radiologist is reviewing and certifying the output of a system that has already made pattern-recognition decisions before the study opens on the workstation.

The Editing Problem

Radiologists who have used AI report generation tools in clinical practice know this frustration well:

Editing an AI-generated report often takes longer than dictating from scratch.

This is not a criticism of the technology in isolation. It reflects a mismatch between the cognitive task and the interface designed to support it. Editing is a different cognitive mode than reading. When a radiologist dictates from a cold study, the sequence is: scan, perceive, formulate, articulate. When the radiologist edits an AI report, the sequence becomes: read the AI output, compare it to the images, identify discrepancies, rewrite. The second sequence is not faster for most readers. It is slower, and it introduces automation bias at the step where discrepancy identification should be most active.

The problem is not the draft. The problem is the cognitive ordering.

What the Discrepancy Data Shows

At ECR 2026, Dr. Hugh Harvey presented discrepancy rate data that should recalibrate how radiologists think about AI-assisted review. The rate of meaningful discrepancies between AI outputs and final radiologist interpretations runs at approximately 5 to 10 percent.² In a high-volume practice reading hundreds of studies per day, that figure translates into roughly 5 to 10 cases out of every 100 read where the AI and the radiologist do not agree, and where the radiologist’s independent judgment is the safeguard that prevents a clinically consequential error from reaching the patient record.

Mildenberger, also at ECR 2026, argued for a PACS-embedded protocol as the institutional standard for AI integration: not a separate review interface, not a post-report audit layer, but a structured interaction embedded in the primary reading tool itself.³ That framing points toward the right architecture. The question is what that structure should look like from the radiologist’s side.

Bernstein and colleagues provided the other data point. Their mock-juror experiment, published in Nature Health in March 2026, tested whether the sequence of AI feedback relative to radiologist interpretation affects liability exposure.⁴ The answer was clear.

A double-read structure, where the radiologist completes an independent read before receiving AI feedback, reduced plaintiff-siding from 74.7 percent to 52.9 percent, an odds ratio of 2.6 with a p-value of 0.0002.

The liability literature now has a workflow prescription with a number attached. Independent read first. AI feedback second. The checklist enforces that sequence.
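
For readers who want to see how the two percentages relate to the odds ratio, here is a minimal sketch. It computes a simple unadjusted odds ratio from the two proportions quoted above; the published estimate may have been derived from a statistical model rather than this direct calculation, and the variable names are mine, not the paper's.

```python
def odds(proportion: float) -> float:
    """Convert a proportion (0 to 1) into odds."""
    return proportion / (1.0 - proportion)


# Proportions of mock jurors siding with the plaintiff, as quoted in the text.
baseline = 0.747      # comparison condition (no independent read first)
double_read = 0.529   # independent read completed before AI feedback

# Unadjusted odds ratio: odds of plaintiff-siding at baseline vs. double read.
odds_ratio = odds(baseline) / odds(double_read)

# Absolute difference, in percentage points.
absolute_drop = (baseline - double_read) * 100

print(f"odds ratio: {odds_ratio:.1f}")
print(f"absolute reduction: {absolute_drop:.1f} percentage points")
```

The unadjusted figure rounds to the 2.6 reported above, alongside an absolute drop of about 22 percentage points.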

The Findings Checklist Workflow

The Findings Checklist Workflow is a structured protocol for integrating AI flags into the radiologist’s review sequence without triggering the automation bias that unstructured review invites. The core design principle is sequence preservation: the radiologist’s independent scan of the study comes before engagement with the AI output, not after.

The workflow operates in four steps. First, the radiologist reviews the study independently, forming an initial impression before the AI overlay is active. This step takes no additional time in a well-designed interface; it is a deliberate hold on the AI panel until the radiologist has scrolled the key sequences. Second, the AI flags are surfaced. The radiologist reviews the flags against the independent impression and marks each as concordant, discordant, or requiring further evaluation. Third, discordant flags trigger a structured reconsideration step, not a simple override. The radiologist is not being asked to defer to the AI; the AI flag is a prompt to look again with a specific question active. Fourth, the interaction is logged. The AI suggested this finding; the radiologist agreed or disagreed and documented why. That log is the audit trail.
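
The four steps can be sketched in code. What follows is an illustration under my own assumptions, not a description of any vendor's PACS interface: the class and method names (ChecklistSession, record_impression, reveal_flags, review_flag) are hypothetical, and the study and finding text are made up. The point is the ordering. The AI flags stay held until an independent impression exists, a discordant verdict cannot be logged without documented reasoning, and every review lands in the log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    CONCORDANT = "concordant"
    DISCORDANT = "discordant"
    FURTHER_EVALUATION = "requires further evaluation"


@dataclass
class LogEntry:
    study_id: str
    ai_finding: str
    verdict: Verdict
    rationale: str
    logged_at: str


@dataclass
class ChecklistSession:
    study_id: str
    held_flags: List[str]
    independent_impression: Optional[str] = None
    log: List[LogEntry] = field(default_factory=list)

    def record_impression(self, text: str) -> None:
        """Step 1: capture the independent read before any AI output is shown."""
        if not text.strip():
            raise ValueError("independent impression cannot be empty")
        self.independent_impression = text

    def reveal_flags(self) -> List[str]:
        """Step 2: surface AI flags only after the independent read exists."""
        if self.independent_impression is None:
            raise PermissionError("AI flags are held until the independent read is recorded")
        return list(self.held_flags)

    def review_flag(self, ai_finding: str, verdict: Verdict, rationale: str = "") -> LogEntry:
        """Steps 3 and 4: classify a flag, require reasoning on discordance, log it."""
        if ai_finding not in self.reveal_flags():
            raise KeyError(f"unknown AI flag: {ai_finding}")
        if verdict is Verdict.DISCORDANT and not rationale.strip():
            raise ValueError("a discordant flag needs a documented reconsideration")
        entry = LogEntry(self.study_id, ai_finding, verdict, rationale,
                         datetime.now(timezone.utc).isoformat())
        self.log.append(entry)
        return entry


# Hypothetical usage with made-up study and finding text.
session = ChecklistSession("study-001", held_flags=["possible right lower lobe nodule"])
session.record_impression("No focal abnormality identified on independent review.")
for flag in session.reveal_flags():
    session.review_flag(flag, Verdict.DISCORDANT,
                        "Looked again at the flagged region; finding not confirmed.")
print(len(session.log), session.log[0].verdict.value)
```

Raising an error when flags are requested too early is one design choice among several. A real interface would more likely keep the AI panel collapsed than throw an exception, but the constraint it enforces is the same.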

The trust band concept developed in the companion series maps directly onto step three. A system the practice has validated and monitored, one whose behavior is well characterized within the current patient population, operating in a modality and task category where its performance record is established, warrants a different default confidence level than a recently deployed system working at the edge of its training distribution. The checklist does not make that calibration for the radiologist. It creates the cognitive space where the radiologist can make it deliberately.

Why This Is Not Optional

The liability data from Bernstein et al. is the most useful framing for practice leaders who need to justify building this infrastructure.⁴ The workflow prescription is now peer-reviewed and quantified. A practice that trains its radiologists on structured independent review before AI engagement can document that training, demonstrate it in audit logs, and present it as evidence of a systematic governance approach if a case reaches litigation. A practice that does not have that documentation cannot.

The post-deployment monitoring literature adds the second reason. AI systems drift. A system that performed within acceptable parameters at go-live may not be performing the same way twelve months later, and the discrepancy log that the Findings Checklist Workflow generates is precisely the data source that enables drift detection at the case level. The 5 to 10 percent discrepancy rate Harvey described at ECR 2026 is not a static number.² It changes as the model changes, as the patient population shifts, and as the radiologist population reading with AI accumulates habits, including the habit of overriding AI flags quickly and without documentation. The checklist interrupts that habit before it becomes invisible.
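
To make the drift point concrete, here is a rough sketch of how a discrepancy log could be summarized month by month. Everything in it is an assumption for illustration: the log is reduced to (month, verdict) pairs, the data is synthetic, and the 5 to 10 percent band is borrowed from the figure cited above as a reference range only, not as a validated alert threshold. A production monitor would need confidence intervals and stratification by modality and case mix.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def monthly_discordance(log: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the share of discordant reviews per month ('YYYY-MM')."""
    totals: Counter = Counter()
    discordant: Counter = Counter()
    for month, verdict in log:
        totals[month] += 1
        if verdict == "discordant":
            discordant[month] += 1
    return {month: discordant[month] / totals[month] for month in sorted(totals)}


def months_outside_band(rates: Dict[str, float],
                        low: float = 0.05, high: float = 0.10) -> List[str]:
    """List months whose discordance rate falls outside the reference band."""
    return [month for month, rate in rates.items() if not low <= rate <= high]


def synthetic_month(month: str, n_cases: int, n_discordant: int) -> List[Tuple[str, str]]:
    """Build made-up log entries for one month (illustration only)."""
    return ([(month, "discordant")] * n_discordant
            + [(month, "concordant")] * (n_cases - n_discordant))


# Synthetic log: two months inside the band, one month with an unusually low rate.
log = (synthetic_month("2026-01", 100, 7)
       + synthetic_month("2026-02", 100, 8)
       + synthetic_month("2026-03", 100, 2))

rates = monthly_discordance(log)
flagged = months_outside_band(rates)
print(rates)
print(flagged)
```

A month that falls outside the band in either direction is a prompt to investigate, not a verdict. The shift may come from the model, the patient population, or the readers' own habits.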

The Series in One Paragraph

The fourteen-article series that precedes this piece covered a single connected argument. The next version of AI in radiology is arriving fast, is deployed inconsistently, is evaluated against metrics that often do not measure what clinicians need to know, and is governed by contracts that leave liability exposure unaddressed. The radiologist who has worked through that argument is not in a defensive crouch about their career.

We are in the best position of anyone in healthcare to capture what this technology can actually deliver, because we are the only people in the system with the clinical domain knowledge, the interpretive authority, and, with the right infrastructure, the documented performance record to price and defend it. The checklist is where the argument becomes a practice.

References

1. Gawande A. The Checklist Manifesto: How to Get Things Right. Metropolitan Books, 2009.

2. Harvey H, Frauenfelder T, Mildenberger P. ECR 2026 Rejected Studies Session. AuntMinnie Europe, March 11, 2026. https://www.auntminnieeurope.com/imaging-informatics/artificial-intelligence/article/15819191/when-ai-and-radiologists-miss-the-same-thing

3. Mildenberger P. PACS-embedded AI integration protocol. ECR 2026 Rejected Studies Session. AuntMinnie Europe, March 11, 2026 (same source as ref 2).

4. Bernstein MA, Sheppard JP, Bruno MA, Lay PS, Baird GL. Brown / Penn State / Seton Hall. Nature Health, March 10, 2026. DOI: 10.1038/s44360-026-00085-2. https://doi.org/10.1038/s44360-026-00085-2

Image credit: The Day the Earth Smiled https://science.nasa.gov/photojournal/the-day-the-earth-smiled/ Easter egg for readers of my first AI articles

This piece is a companion to the 14-article LinkedIn series “Radiology in the Age of AI & VLMs.” The series, this article, and the piece that follows have been expanded into a book, A Radiologist’s Introduction to Foundation Models: The Best Time in History to Be a Radiologist, available now on Amazon: https://a.co/d/0fjYa3Nz

