From Algorithms to Vision: How VLMs Are Generating Preliminary Reports (And What That Means)
RADIOLOGY IN THE AGE OF AI & VLMS | ARTICLE 2 OF 14
A Vision-Language Model just drafted a chest X-ray report. A radiologist reviewed it, edited it, and signed it. That already happened at scale, across an 11-hospital health system, on roughly 24,000 real studies. This isn’t a pilot or a press release. This is the practice of radiology today.
How does it work?
What Makes a VLM Different
Last week I introduced the shift from narrow algorithms to foundation models. Today I want to go one level deeper, because understanding the architecture changes how we think about the tool.
A traditional CNN (convolutional neural network), the backbone of the last decade of radiology AI, sees pixels and outputs a probability. It’s a pattern-matcher. Very good at one thing and blind to everything else.
You already know what a large language model can do with text; you’ve used Claude or ChatGPT. A Vision-Language Model is similar, but it ingests images. A vision encoder reads the pixels, a language decoder generates the report, and the two are trained together on millions of image-report pairs, so the model learns not just what’s in the scan, but how radiologists describe what’s in the scan.
The practical implication is this: you can now ask the model a question in plain language, and it will answer using its understanding of both the image and clinical radiology vocabulary. That is categorically different from anything previous radiology AI could do.
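For readers who want to see the moving parts, the encoder-plus-decoder idea can be caricatured in a few lines. This is a deliberately toy sketch, not a real model: the dimensions, vocabulary, weight matrices, and the greedy "decoding" are all invented for illustration, and an actual VLM uses transformer attention, learned training, and billions of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and vocabulary (assumed for illustration only).
PATCH_DIM, VISION_DIM, TEXT_DIM = 64, 32, 16
VOCAB = ["no", "acute", "findings", "consolidation", "right", "lower", "lobe"]

# "Vision encoder": projects flattened image patches into embeddings.
W_vision = rng.standard_normal((PATCH_DIM, VISION_DIM))
# Projection layer: maps vision embeddings into the language model's space.
W_proj = rng.standard_normal((VISION_DIM, TEXT_DIM))
# Token embedding table for the toy "language decoder".
token_emb = rng.standard_normal((len(VOCAB), TEXT_DIM))

def encode_image(patches: np.ndarray) -> np.ndarray:
    """Embed each image patch, then project into the text embedding space."""
    return patches @ W_vision @ W_proj  # shape: (n_patches, TEXT_DIM)

def generate_report(patches: np.ndarray, max_tokens: int = 4) -> list[str]:
    """Greedy stand-in for decoding: pick the vocabulary tokens whose
    embeddings best match the pooled image representation. A real decoder
    attends over image tokens and previously generated text instead."""
    image_context = encode_image(patches).mean(axis=0)
    scores = token_emb @ image_context        # similarity to each token
    order = np.argsort(-scores)[:max_tokens]  # take the top-scoring tokens
    return [VOCAB[i] for i in order]

# A fake "chest X-ray" as ten random patch vectors.
fake_study = rng.standard_normal((10, PATCH_DIM))
print(generate_report(fake_study))
```

The point of the sketch is the joint training target: because image embeddings and report tokens live in one shared space, the model learns the mapping from scan appearance to radiology language, which is exactly what a CNN classifier never had.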
What VLMs Can Do Right Now
A multidisciplinary team of radiologists, clinicians, and AI researchers at Microsoft Research published a framework in 2024 that is worth knowing. Working through an iterative co-design process with 13 radiologists and clinicians, they identified four VLM use cases that practitioners assessed as genuinely valuable. [Jain et al., CHI Conference, 2024]
- Draft report generation. The model reads the images and produces a structured preliminary report. You review it, verify it, correct it, and sign it. This is the use case getting the most real-world deployment, supported by both the Google DeepMind Flamingo-CXR study and the Northwestern Medicine deployment at clinical scale. [Tanno et al., Nature Medicine, 2024; Huang et al., JAMA Network Open, 2025]
- Augmented report review. Rather than generating a report from scratch, the VLM surfaces findings as you work, flagging what it sees against what you’re describing or highlighting areas of the image you haven’t yet addressed. The Jain et al. radiologist cohort specifically identified this as a workflow model they found more acceptable than full draft handoff, because it preserves the interpretive sequence they already use. [Jain et al., CHI Conference, 2024]
- Visual search and querying. Ask the model to locate all cases in a dataset with a specific imaging pattern, or to answer a direct visual question about a study (“Where is the consolidation on this chest X-ray?”). This remains largely research-stage in radiology, but the direction is clear and the foundational capability is already demonstrated in the literature. [Jain et al., CHI Conference, 2024; Survey on Multimodal LLMs in Radiology, Information MDPI, 2025]
- Patient imaging history synthesis. Rather than pulling up five prior studies yourself, the model synthesizes the relevant longitudinal comparison across your patient’s imaging record. The architectural mechanism that makes this possible is RAG, Retrieval-Augmented Generation. Think of it as the difference between a resident who reads the chart before walking in versus one who freestyles from the current image alone. RAG grounds the model in real data: your patient’s prior studies, your institution’s protocols, relevant clinical context. According to a 2025 Radiology: Artificial Intelligence review, it is also the primary architectural defense against hallucination currently available for radiology AI deployments. [Tejani et al., Radiology: Artificial Intelligence, RSNA; npj Digital Medicine, 2025]
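To make the RAG idea in that last use case concrete, here is a minimal sketch of the retrieve-then-ground loop. Everything here is hypothetical: the study IDs, the report text, and the bag-of-words "embedding" are stand-ins (real deployments use learned embeddings and a vector database), but the shape of the pipeline, retrieve the relevant priors and put them in front of the model, is the same.

```python
import math
from collections import Counter

# Hypothetical prior-study archive; identifiers and text are invented.
PRIOR_REPORTS = {
    "CXR-2023-04-11": "Stable cardiomegaly. No focal consolidation.",
    "CT-2022-09-02": "Unchanged 4 mm right upper lobe nodule.",
    "CXR-2021-01-15": "Left lower lobe pneumonia, resolved on follow-up.",
}

def embed(text: str) -> Counter:
    """Stand-in for a real text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank the prior reports by similarity to the current question."""
    q = embed(query)
    ranked = sorted(PRIOR_REPORTS.items(),
                    key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return [f"{sid}: {text}" for sid, text in ranked[:k]]

def build_prompt(question: str) -> str:
    """Ground the model: retrieved priors go into the prompt so the answer
    cites the patient's actual history instead of hallucinating it."""
    context = "\n".join(retrieve(question))
    return f"Prior studies:\n{context}\n\nQuestion: {question}"

print(build_prompt("Is the right upper lobe nodule stable?"))
```

The design choice that matters is the last function: the model is never asked to recall the patient's history from its weights. It is handed the relevant priors as context, which is why retrieval is currently the main architectural defense against hallucination.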
Where VLMs Currently Fail – Balancing Marketing and Medicine
VLMs hallucinate. They generate findings that aren’t there and they omit findings that are. They use uncertain language, “possible,” “cannot exclude,” in ways that can be clinically ambiguous. They struggle significantly with 3D volumes, with CT and MRI remaining far harder than chest X-ray. Multi-modality integration, combining findings across a PET, a CT, and a prior MRI in a single reasoning step, is still largely unsolved at clinical scale.
The Flamingo-CXR study from Google DeepMind, published in Nature Medicine in 2024, is probably the cleanest proof-of-concept of success in the literature. In it, 77.7% of VLM-generated reports were rated preferable to or equivalent to radiologist-written reports, rising to 94% for normal cases. Those are genuinely impressive numbers.
But the same study reported that 22.8% of AI reports contained clinically significant errors, compared to 14.0% in radiologist-written reports. A Harvard/Brigham and Women’s 2025 analysis reinforced this point with a taxonomy of 12 radiology-specific error types, noting that current evaluation methods consistently underestimate clinical risk because they rely on language metrics rather than clinical consequence. [Guan et al., Harvard/MedRxiv, 2025]
AI helps in the majority of cases. It also fails more often than radiologists do. Both of those things are true at the same time. I’ll come back to what that means for safety and liability in Articles 6 and 9.
The trainee analogy that keeps appearing in the literature is apt: treat the VLM output like a resident’s preliminary report. Read it, verify it, and do not skip your own interpretation. The model is not doing the job; we are.
The Northwestern Medicine Data
I mentioned a real-world deployment at the top. The Northwestern Medicine study, published in JAMA Network Open in 2025, is worth noting here as a preview. I’ll go deep on the productivity data in Article 4.
In summary: a custom-built generative AI model, trained on institutional clinical data with no third-party LLM, was deployed live across 11 hospitals, processing approximately 24,000 reports over five months. The average efficiency gain in report completion was 15.5%, and the highest-performing individual saw a 40% reduction in report completion time.
This is among the first generative AI radiology tools integrated into a live multi-site clinical workflow at this scale, globally. It matters not just because of the numbers, but because of what it demonstrates about how a VLM can be built and deployed responsibly. We’ll discuss more in Article 4.
The Interface Problem Nobody Is Talking About
There’s a candid objection that keeps coming up among radiologists who have actually used these tools, and it deserves an honest answer: editing an AI-generated report often takes longer than just dictating from scratch.
Any radiologist who has trained residents knows what this feels like. Reviewing a first-year’s preliminary report is a net time cost. You’re correcting structure, correcting language, correcting reasoning. It’s only when the trainee reaches a senior level that the prelim becomes a genuine time-saver, because the baseline quality is high enough that your job shifts from rewriting to confirming.
Current VLMs are somewhere on that continuum, and depending on the study type and the model, many are still closer to the first-year (or earlier) end than the fellow end. The Northwestern Medicine efficiency gains are real, but they represent a deployment that was purpose-built on institutional data. Off-the-shelf tools at most sites are not there yet.
The honest answer is that as models improve, the editing burden should decrease. The trajectory is clearly toward higher baseline quality, and the studies support that direction. But there’s a second, underappreciated question here: even when the model is good enough, is the current report-editing interface actually the right design?
Dictation workflows, PACS interfaces, and report editors were all built for a world where the radiologist generates the content. None of them were designed for a world where the AI generates a draft and the radiologist’s job is rapid, high-confidence verification. That’s a different cognitive task, and it probably deserves a different interface.
This is an open design problem, and it’s one where PACS vendors, RIS developers, and frankly anyone who understands both the clinical workflow and the underlying model architecture could build something genuinely better than what exists today.
I’ll come back to what that interface might look like in the final article of this series.
Where We Are
VLMs are generating preliminary radiology reports in live clinical practice right now. They are more capable than any prior generation of radiology AI. They also fail in ways that demand radiologist oversight, not just radiologist sign-off.
The trainees are writing the prelims, and we are still the attendings.
Next Tuesday, Article 3: 94% Sensitivity – At What Threshold? On Whose Patients? Tips on Evaluating AI Models
References
Jain et al. Multimodal Healthcare AI: Identifying and Designing Clinically Relevant VLM Applications for Radiology. CHI Conference, 2024. https://dl.acm.org/doi/10.1145/3613904.3642013
Tanno, Barrett, Karthikesalingam et al. Collaboration Between Clinicians and Vision-Language Models in Radiology Report Generation (Flamingo-CXR). Nature Medicine, 2024. https://www.nature.com/articles/s41591-024-03302-1
Huang, Etemadi et al. Generative AI Boosts Radiology Productivity Up to 40% in Large Multi-Site Clinical Deployment. JAMA Network Open, 2025. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2834943
Survey on Multimodal Large Language Models in Radiology. Information, MDPI, 2025. https://www.mdpi.com/2078-2489/16/2/136
Tejani et al. Retrieval-Augmented Generation in Radiology AI Deployments. Radiology: Artificial Intelligence, RSNA. https://pubs.rsna.org/doi/10.1148/ryai.240790
RAG Reduces Hallucinations in Radiology from 8% to 0%. npj Digital Medicine, 2025. https://www.nature.com/articles/s41746-025-01802-z
Guan et al. A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation. Harvard/Brigham and Women’s, MedRxiv, 2025. https://www.medrxiv.org/content/10.1101/2025.07.13.25331222v1.full.pdf

