Insight · human oversight of AI

Human Oversight of Clinical AI: When Judgment Overrides the Model

When clinicians should override AI, how clinical trust is earned, and why human judgment stays the final authority on care — from a Signal Room conversation.

Featuring Dr. Mark Gendreau on The Signal Room

Dr. Mark Gendreau draws a distinction that belongs at the forefront of any discussion of healthcare AI: "we have lived, and AI has not." Dr. Gendreau is an emergency medicine physician and senior health system executive. On The Signal Room, he argued that clinicians should always be the final decision-makers because, in the course of providing care, a clinician's judgment is shaped by experience that AI does not have. A radiologist's years of reviewing images and a physician's experience at the bedside are things an AI model simply cannot replicate, and oversight is there to ensure that AI does not make these calls.

This matters because the need for clinical AI is real: healthcare faces an aging physician and nursing workforce, a shrinking workforce overall, and rising demand. Dr. Gendreau's solution is not to eliminate clinicians but to give the clinicians currently working tools that improve access to care and raise its quality and safety. The hardest question then follows: if AI is going to be in healthcare, who has the power to say no, and on what grounds?

Amplifying the Clinician, Not Automating the Decision

Gendreau named three examples, among many possible ones, of where AI can improve the industry's quality, safety, and efficiency, and he argues that in none of the three does the AI make the decision.

The first example is digital radiology. AI tools read and analyze images, then communicate findings to the radiologist. For example, the AI may suggest that the radiologist look more closely at image 63 of a chest scan for a probable pulmonary embolus. The AI does not make diagnostic decisions. Its main purpose is to speed the reading of scans and reduce the radiologist's workload. Gendreau notes that the primary benefit comes during off hours, when an increasingly small group of radiologists performs the reading and reading many studies in succession diminishes diagnostic accuracy. He also describes a case in which a trauma center exceeded the read-time guidelines of its trauma designation. There, the AI application was used to expedite a head CT and bring it to the surgeon. It is an excellent example of AI use in a time-critical situation: a fully practical solution that still requires human input to complete.

The second example is ambient documentation. AI used during a patient visit transcribes and summarizes the visit at an impressive 97% accuracy. Physicians are freed from the burden of documenting and can give more attention to the patient. Gendreau links ambient documentation to a reduction in 'pajama time,' the hours clinicians spend documenting in the electronic record late at night. This documentation burden is unsustainable, and ambient documentation is the solution he is most excited about.

The third example is an emergency department that uses an ambient note-taking system. After the note is completed, the system coaches the clinician with feedback such as 'you forgot to introduce yourself,' 'the patient looked tense when you said that,' and 'here's another way you could have answered that.' The technology helps clinicians become better at empathy during the patient encounter, and it augments clinicians rather than removing them from the encounter.

When to Be Skeptical of the Model

The override case is not hypothetical. Gendreau stated that imaging algorithms still sometimes flag an issue incorrectly, and the radiologist who evaluates it finds only a benign calcification that needs no report and no worry. Human oversight exists for instances like that. In healthcare and other high-reliability domains, you always keep a human in the loop, because the judgment that recognizes a false alarm comes from experience that the model does not have.

Chris Hutchins offered a similar example from the patient side. Someone with well-controlled high blood pressure can have any elevated reading flagged by a system that does not know the reading is normal for them and already treated. Without that context, the alert is technically correct and clinically useless, so the clinician has to disregard the output. The authority to override is not a courtesy extended to clinicians. It is the safeguard that keeps a context-blind model from driving the decision.
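To make the blood pressure example concrete, here is a minimal Python sketch of how an alerting rule might take patient context into account. Everything in it is an assumption made for illustration: the `PatientContext` fields, the `triage_bp_alert` function, and the numeric thresholds are hypothetical, are not clinical guidance, and do not describe any system discussed in the conversation. The sketch only labels a reading for the clinician; it never hides a reading and never makes the call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientContext:
    """Hypothetical per-patient context that a generic alerting rule lacks."""
    on_treatment: bool                 # the condition is already being treated
    baseline_systolic: Optional[int]   # typical reading for this patient, mmHg


def triage_bp_alert(systolic: int,
                    ctx: PatientContext,
                    population_threshold: int = 140,  # illustrative value only
                    baseline_margin: int = 15) -> str:  # illustrative value only
    """Label a reading for the clinician; the clinician still decides."""
    if systolic < population_threshold:
        return "no_alert"
    # An elevated reading that sits near a known, treated baseline is
    # downgraded to informational rather than interrupting the clinician.
    known_pattern = (
        ctx.on_treatment
        and ctx.baseline_systolic is not None
        and systolic <= ctx.baseline_systolic + baseline_margin
    )
    return "informational" if known_pattern else "review"
```

A context-blind rule would return the same label for every patient above the population threshold. Here the same reading yields "informational" for a treated patient near their baseline and "review" for a patient with no such history, which leaves the interpretation with the clinician in both cases.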

What Trust Means Today

Asked what allows clinicians to trust an AI insight enough to act on it, Gendreau referred to Stephen M. R. Covey's version of the trust equation: essentially credibility, judgment, and psychological safety, balanced against self-interest, together with what the AI tool is being used for and whether that purpose is worth it. Because healthcare is built on relationships, quality, and safety, an AI tool must meet those standards too. Its capabilities must be good. Its reliability must be strong. Trust is not bestowed upon a system that produces dubious output half of the time.

What he pointed out next is the part that has no shortcuts. When ambient documentation was first introduced, clinicians were skeptical. Trust was not immediate; it came only after clinicians were willing and able to correct the tool and the tool learned their individual preferences: "You wrote this, but the patient actually said that." After that, clinicians found the tool trustworthy, and word of mouth did the rest. Early adopters helped win over the laggards, and the laggards became some of its loudest supporters. Trust was built after the rollout, through use and correction. That is why a clinician's doubts during a tool's introduction deserve respect and should not be treated as an obstacle to its use.

Keep Empathy in Human Hands

There is a boundary Gendreau draws firmly. He embraces the idea that AI can enhance care through pattern recognition and the automation of repetitive tasks. But emotional intelligence, relationship building, and shared decision-making are not the model's territory; they belong to the human. He paraphrased Jeff Woods: "You are the leader; do not give up your leadership to AI." For this context, it can be modified to: "You are the human; do not give up your humanness to AI."

He also pointed out the opposite risk. As systems become capable enough that you can no longer tell whether you are conversing with a human or a machine, we risk humanizing AI by naming it and treating it as a colleague. He argued that we should keep the two workforces, human and AI, carefully separated, or we risk damaging the human characteristics that provide care. Hutchins approached the same issue from the design perspective: how much empathy do we really want technology to have when many people are likely to trust it too easily and too early? Neither Hutchins nor Gendreau had the answer; both viewed the question as an ongoing dilemma.

An Alert Needs to Justify Its Disruption

Oversight is not only about individual overrides. It is also about not overwhelming the people who do the overriding. Gendreau tied alert fatigue specifically to hospital alerts and alarms: they sound until staff treat them as background noise and no one hears them anymore. His principle is that an alert must justify the disruption it causes the clinician. To keep the system from accumulating noise, he adds a second rule: if you add something, you must remove something. He supported the creation of alert fatigue committees that analyze what triggers specific alerts, what those alerts are worth, what they cost, and whether an alert adds burden to the system without improving patient care. This is human oversight not only of the tooling but also of what the tooling is designed to accomplish.
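One way a review committee might begin that analysis is to measure how often each alert fires and how often anyone acts on it. The Python sketch below is illustrative only: the `retirement_candidates` function, the event format, and the cutoff values are assumptions, not a method described in the conversation. It produces a list for human review and retires nothing on its own.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def retirement_candidates(events: Iterable[Tuple[str, bool]],
                          min_fires: int = 20,            # illustrative cutoff
                          max_action_rate: float = 0.05   # illustrative cutoff
                          ) -> List[str]:
    """Return names of alerts that fire often but are rarely acted on.

    Each event is a (alert_name, acted_on) pair taken from an alert log.
    """
    fires: Counter = Counter()
    acted: Counter = Counter()
    for name, acted_on in events:
        fires[name] += 1
        if acted_on:
            acted[name] += 1
    # Only alerts with enough volume to judge are considered.
    return sorted(
        name for name, count in fires.items()
        if count >= min_fires and acted[name] / count <= max_action_rate
    )
```

An alert that fired 30 times and was never acted on would appear in the list, while one that was acted on in 40% of cases would not. Whether a listed alert is actually noise remains a judgment for the committee, which is the point of the principle above.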

Scaling and Preserving Discretion

Moving beyond the pilot phase creates new challenges. Gendreau highlighted three: interoperability, governance, and discipline in how models are trained. It is now common practice for organizations to place a large language model in an internal system and train it on the organization's own goals and activities. Of the three challenges, Gendreau gives interoperability the greatest weight. Without suitable APIs and adequate system integration, the underlying data remains largely useless and, more often than not, wasted. Gendreau noted that the 2022 report on artificial intelligence in healthcare released by the National Academy of Medicine outlines a number of the same challenges and is worth reading.

The cultural signal Gendreau uses to judge whether an organization is truly ready is a leadership team that owns the change and implements it themselves rather than pushing it down to the IT department. He describes such leaders as creatively driven, with empathy and unyielding trust. Asked what metrics would show him that clinicians and AI are improving the healthcare system, he listed improved healthcare outcomes, reduced healthcare inequities, and more time overall.

How Hutchins Approaches Human Oversight of AI

At Hutchins Data Strategy Consultants, we treat the override question as a design requirement, not an afterthought. That means defining where a clinician's judgment is the final authority, building the human-in-the-loop checkpoints that high-reliability care demands, and giving teams the literacy to know when an AI output deserves a second look and when it deserves to be discounted. It is the same discipline behind explainable clinical AI and behind designing clinical AI for the conditions it will actually meet — because oversight only works when clinicians can see why a model said what it said and trust that the system was built for their reality. These themes run throughout The Signal Room podcast, where clinical leaders describe what it takes to keep human judgment at the center of AI-assisted care.

Frequently asked questions

When should a clinician override an AI recommendation?

Whenever the clinician's experience contradicts the output. AI imaging tools flag findings that turn out to be benign, and only a radiologist who has read thousands of studies can discern that. The model surfaces something worth a second look; the human decides what it means.

What earns clinician trust in AI?

Demonstrated reliability over time, not a launch announcement. Trust grows when the tool's capabilities are sound, its output is consistent, and clinicians can correct it — telling it what the patient actually said until it learns their preferences. Early skeptics often become the strongest advocates once that happens.

What does keeping a human in the loop mean in practice?

In high-reliability settings like healthcare, a person reviews the AI's output every time before it affects care. The model can read an image instantly and point to where attention is needed, but interpretation and judgment stay with the clinician.

How should leaders prevent alert fatigue from automation?

Treat an interruption as something an alert has to earn. Add a new alert only when you retire one, and stand up review committees that measure each alert's value and impact, retiring those that add noise without improving care.