Why Healthcare AI Fails at the Data Layer
Healthcare AI stalls when data leaves the EHR boundary and its protections fall away. Why privacy, governance, and trust at the data layer decide what ships.
Featuring Sid Dutta on The Signal Room
Ask why a healthcare AI initiative is stuck in perpetual pilot, and the answer almost always centers on the model: it is not accurate enough, the vendor over-promised, or clinicians distrust the output. In a recent Signal Room discussion, Sid Dutta offered a different perspective. He argued that the bottleneck sits earlier and lower, at the data layer, and that it is fundamentally not the model. Dutta is a veteran cybersecurity professional with extensive experience leading data protection efforts across large enterprises, and he now runs a startup focused on protecting highly sensitive data in an AI context. He was clear on this point: most organizations are not facing an AI development problem. They are facing the problem of using their own data safely and responsibly.
This shift in perspective matters because it relocates the obstacle. If the problem is sub-par model performance, better modeling can fix it. If the problem is that data cannot move safely, no amount of modeling will help, and attention has to move to the layer beneath.
Why Healthcare Data Is Uniquely Hard to Use
Dutta gives three reasons healthcare data is harder to handle than data in most other sectors. The first is sensitivity. Losing this data can damage a patient's trust and safety, so organizations handle it with great caution. The second is fragmentation. Unlike data in many other industries, healthcare data is spread across electronic platforms, imaging systems, and billing systems that often cannot talk to one another. The third is that the controls safeguarding healthcare data were designed for a different era.
The most interesting point in this segment was that controls do not move with the data. Epic and Cerner do a fine job of protecting data, but only while it remains inside the platform. The instant data exits the platform (to a model, to a copilot that reads the data, to an agent that accesses the EHR as a tool, or to a third-party analytics platform), those protections no longer apply. Dutta describes data at this point as being in the wild, which is when it is most vulnerable. This is exactly where security and compliance teams lose oversight and control of the data.
The Intent Problem With AI Identities
The second layer of eroding control boundaries is tied more directly to how AI works. Traditional security asks who is accessing data and what they are permitted to touch. Dutta indicated that this model breaks down once agents and copilots access data on behalf of users. You may know which system is making the request and what it is permitted to access, but you cannot know, or even easily determine, its intent: what the agent is going to do with the data, and why.
This is where data access and data protection meet. Traditional security was built for static, deterministic systems. An application accepted a defined input, processed it, and stored the result at a known location. An agent does not work that way. It is not pre-wired to a fixed scope. It makes open-ended decisions and can recombine or reinterpret data in ways that were never defined or documented. Guardrails designed for a fixed scope do not hold for a system whose scope keeps expanding.
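To make the gap between "who and what" and "why" concrete, here is a minimal sketch in Python. Everything in it is hypothetical and not taken from the discussion: the agent name, the resources, the purpose labels, and both check functions. It only shows the shape of the problem. A static grant answers whether an agent may touch a resource, while a contextual check also asks what the data is for. A declared purpose is self-reported, so this does not solve the intent problem Dutta describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    agent_id: str          # which system is asking
    resource: str          # what it wants to touch
    declared_purpose: str  # why it says it needs the data (self-reported)


# Hypothetical static grants: agent -> resources it may access.
ROLE_GRANTS = {
    "scheduling-agent": {"appointments", "demographics"},
}

# Hypothetical purpose rules: resource -> purposes considered acceptable.
PURPOSE_RULES = {
    "appointments": {"appointment_reminder", "capacity_planning"},
    "demographics": {"appointment_reminder"},
}


def static_check(req: AccessRequest) -> bool:
    """Traditional check: is this identity permitted to touch this resource?"""
    return req.resource in ROLE_GRANTS.get(req.agent_id, set())


def contextual_check(req: AccessRequest) -> bool:
    """Static check plus a test of the declared purpose for this resource."""
    allowed_purposes = PURPOSE_RULES.get(req.resource, set())
    return static_check(req) and req.declared_purpose in allowed_purposes


reminder = AccessRequest("scheduling-agent", "demographics", "appointment_reminder")
export = AccessRequest("scheduling-agent", "demographics", "marketing_export")

# Both requests pass the static check, because identity and resource are identical.
# Only the contextual check tells them apart.
results = {
    "reminder": (static_check(reminder), contextual_check(reminder)),
    "export": (static_check(export), contextual_check(export)),
}
```

The two requests are indistinguishable to a role-based control, which is the limitation the section describes. Any real runtime enforcement would need evidence of purpose beyond what the agent declares about itself.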
Privacy-Preserving AI, Without the Marketing Gloss
Dutta describes privacy-preserving AI as a way of keeping the value of the data without revealing it, which is a much simpler idea than the terminology suggests. He was frank that many of the best-known methods remain stuck in research limbo. Homomorphic encryption, differential privacy, federated learning, and trusted execution environments (secure enclaves that keep data shielded from the rest of the system while it is processed) all promise to let you compute on data you never expose. From his perspective, many of these approaches have never really been practical at enterprise scale, and even in niche use cases they often fit poorly, mainly because of their cost in performance and usability.
De-identification and data desensitization have proven more reliable. These protections, in the form of tokenization, format-preserving encryption, and deterministic encryption, are applied to the data itself, so they travel with it and stay on it wherever processing takes place. They matter because of referential integrity. Crude masking, such as replacing a value with a row of x's or deleting it outright, destroys the ability to join tables and search a structured dataset. More advanced methods replace a real value with a token or surrogate that behaves like the original, so records stay joinable and searchable while the sensitive value is never exposed.
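The referential-integrity point can be shown with a small Python sketch. It uses a keyed hash (HMAC-SHA256 from the standard library) as a stand-in for deterministic tokenization. That choice is an assumption made for illustration: it is not format-preserving encryption, it is not reversible, and it is not presented as the method Dutta's company uses. The key, the table contents, and the helper names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; a real deployment would fetch this from a key management system.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic surrogate: the same input and key always give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; truncation raises collision risk in real datasets.
    return "tok_" + digest[:16]


def mask(value: str) -> str:
    """Crude masking: the value is replaced by a row of x's."""
    return "x" * len(value)


# Two hypothetical tables that share a medical record number (MRN).
encounters = [
    {"mrn": "MRN-1001", "dept": "cardiology"},
    {"mrn": "MRN-1002", "dept": "oncology"},
]
billing = [
    {"mrn": "MRN-1001", "amount": 420.0},
    {"mrn": "MRN-1002", "amount": 1310.5},
]


def protect(rows, fn):
    """Apply a protection function to the MRN column of every row."""
    return [{**row, "mrn": fn(row["mrn"])} for row in rows]


def join_on_mrn(left, right):
    """Inner join of two lists of dicts on the 'mrn' field."""
    index = {}
    for row in right:
        index.setdefault(row["mrn"], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l["mrn"], [])]


tokenized_join = join_on_mrn(protect(encounters, tokenize), protect(billing, tokenize))
masked_join = join_on_mrn(protect(encounters, mask), protect(billing, mask))
```

With tokens, each encounter still matches exactly one billing row and no raw MRN appears in the result. With masking, every MRN collapses to the same string, so the join pairs every encounter with every billing row and the relationship between records is lost.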
Dutta described this as a balancing act. Protect too little and you leak. Protect too much, or the wrong things, and the model receives nonsense it cannot reason over, which produces poor output. In most cases, he explained, the model does not need the PHI itself. It needs the surrounding structure. The hard work is determining the minimum that needs protecting, masking or tokenizing it before the data is transported, deciding what must be revealed for the task at hand, and protecting that as well. Because most hospitals do not run their own models, all of this happens as data is sent to a hyperscaler's foundation model outside the hospital's boundaries, which raises both the cost and the risk.
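One way to picture this balancing act is a per-field policy applied before a record leaves the hospital boundary. The Python sketch below is illustrative only. The field names, the policy table, the key, and the `minimize` and `surrogate` helpers are assumptions, and the choice of which fields to keep is not a compliance recommendation. It keeps the clinical structure a model might reason over, replaces the identifier with a surrogate, and drops everything the policy does not name.

```python
import hashlib
import hmac

# Hypothetical key; a real deployment would fetch this from a key management system.
KEY = b"demo-key-not-for-production"

# Hypothetical per-field policy: "keep", "tokenize", or "drop".
POLICY = {
    "patient_name": "drop",
    "mrn": "tokenize",
    "age": "keep",
    "diagnosis_code": "keep",
    "lab_result": "keep",
}


def surrogate(value: str) -> str:
    """Deterministic stand-in for a sensitive value (keyed hash, not reversible)."""
    return "tok_" + hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def minimize(record: dict, policy: dict) -> dict:
    """Build the payload that is allowed to leave the boundary."""
    payload = {}
    for field, value in record.items():
        # Fields the policy does not mention are dropped by default.
        action = policy.get(field, "drop")
        if action == "keep":
            payload[field] = value
        elif action == "tokenize":
            payload[field] = surrogate(str(value))
    return payload


record = {
    "patient_name": "Jane Example",
    "mrn": "MRN-1001",
    "ssn": "000-00-0000",  # not in the policy, so it never leaves
    "age": 57,
    "diagnosis_code": "I10",
    "lab_result": "LDL 162 mg/dL",
}

payload = minimize(record, POLICY)
```

Dropping unknown fields by default errs toward protecting too much. The opposite default would err toward leaking, which is the trade-off the paragraph above describes.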
Infrastructure versus Models
Dutta noted that too much emphasis is placed on models and not enough on the data foundation they are built on. You can have the best model, but if the data layer is not robust, secure, managed, and observable, you will not be able to run that model at the industrial scale healthcare requires. It is the familiar garbage-in, garbage-out constraint: a model fed garbage data is not a model operating at industrial scale.
The harder point to grasp is that current infrastructure was not designed for dynamic systems like these. It was built for static applications with known parameters and destinations. With AI, data is perpetually altered, recombined, interpreted in dynamic ways, and moved along by systems making autonomous decisions. The infrastructure this data moves through has not evolved at the same pace. As Dutta put it, organizations can design extremely powerful models and process enormous datasets in seconds. What the static infrastructure does not give them is a way to build these models and processing systems while retaining control. What is lacking is neither creativity nor innovation. It is confidence and control at the data layer.
This gap also helps explain shadow AI. When privacy and security feel like a cost the innovator must bear on top of the real work, people circumvent the system and unsanctioned tools spread. Dutta's point is that protection must be built into the system by default and enforced at runtime, so that it simply works and the builder does not have to bear the cost. If it works and creates no friction, people will stop bypassing it, and the enterprise will carry less hidden risk.
Reading the Signals of an Unready Organization
Dutta described several signals that tell leaders an organization is not prepared to partner safely on sensitive data. The first is over-reliance on static controls, typically an organization pointing to data being "encrypted at rest" and "encrypted in transit" as its answer. Those controls do nothing about what can be done with data once it is in use. The second is the absence of data flow visibility. Tools that scan the contents of data stores provide a snapshot of data at rest, but they do not show where data goes after it leaves the store. If the team does not know where sensitive data moves and what it is doing inside the organization's AI workflows, the organization is not ready. The third is the absence of risk ownership, which is common in large, complex organizations with data stores shared by multiple teams. Access is approved by the person who manages the server rather than by a person who owns the data and the consequences of a breach. The final signal is a block-first reflex. If the governance model is to restrict everything simply because that is easier to implement, the organization is not ready.
How Hutchins Approaches the Data Layer
Our work tends to begin where the model conversation skips ahead — at whether your data can move, be shared, and be used without losing control of it. That means looking at what happens past the EHR boundary, how protection travels with data into AI and third-party workflows, who actually owns each data asset and its risk, and whether access decisions can be made on context and intent rather than a one-time role grant. We treat data access and data protection as one problem rather than two teams pulling against each other, because the friction between them is where shadow AI and avoidable exposure grow. This connects directly to data governance and the security practices that decide whether a model is safe to deploy, and it is the same foundation that data readiness depends on. These themes run throughout The Signal Room podcast, where practitioners working under the covers of healthcare AI describe what protecting data in motion actually takes.
Frequently asked questions
Why does healthcare AI fail at the data layer rather than the model?
Most stalled initiatives are not blocked by model accuracy. They are blocked because the data cannot be used safely once it leaves its source system, where the native protections that guarded it no longer apply.
What happens when patient data leaves the EHR?
Systems like Epic and Cerner protect data well while it sits inside them, but those controls do not travel with the data. The moment it moves to an AI model, a copilot, or a third party, the native safeguards are gone and security teams lose visibility and control.
What is privacy-preserving AI in practice?
It means getting value from data without exposing the data itself — through techniques like de-identification, tokenization, and format-preserving encryption that keep records usable for joins and searches, plus runtime decisions about what to reveal based on context and intent.
What signals suggest an organization is not ready to collaborate on data safely?
Over-reliance on static controls like encryption at rest and in transit, no visibility into where sensitive data flows once it leaves a system, unclear ownership of data and its risk, and a block-first default that restricts everything because that is easier than enabling it safely.