What the Model Inherits

Healthcare leaders often judge clinical models by how they perform after deployment. A more important question is what the model is built on. Every model depends on data, and that data was usually captured for billing or documentation long before anyone meant to use it to train an algorithm. What does it cost to overlook that history? In 2019, researchers reported in Science that a widely used population health algorithm, applied to millions of patients in a given year, underestimated the health care needs of Black patients. The algorithm had been trained on health care spending, and spending reflects access to care rather than need for it. The model did the task it was designed to do, but the data it was given badly misrepresented health care needs, and at no point in the workflow did anyone question the assumptions the model would draw from that data.

Models as a Final Output

A model is the last step, not the first. Decisions were made about the data long before modeling began: how a diagnosis would be coded, how often the data would be refreshed. The model learns from those decisions and treats them as truth. If they were made with sound judgment and consistent logic, the model stands a fair chance. If they were made arbitrarily, the model learns the inconsistencies and presents them to the clinician as clinical signal.

Healthcare data is often inconsistent. Records are spread across many systems, and a single patient can have multiple records and multiple identifiers. The Office of the National Coordinator for Health Information Technology (ONC) established the United States Core Data for Interoperability, among other efforts, to close the gaps created by siloed records. Even after the most sophisticated interoperability efforts, models may fail when exposed to disorganized data from disparate systems.
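To make the multiple-identifier problem concrete, here is a minimal sketch that flags records likely to belong to the same patient. The field names (`system`, `patient_id`, `name`, `dob`) and the sample records are hypothetical, and matching on normalized name plus birth date is a deliberately crude assumption; real patient matching is considerably more involved.

```python
from collections import defaultdict

def possible_duplicates(records):
    """Group records by a crude key (normalized name + birth date) and
    return the keys that are tied to more than one identifier."""
    ids_by_key = defaultdict(set)
    for rec in records:
        key = (rec["name"].strip().lower(), rec["dob"])
        ids_by_key[key].add((rec["system"], rec["patient_id"]))
    # Keep only keys that map to two or more distinct identifiers.
    return {k: sorted(v) for k, v in ids_by_key.items() if len(v) > 1}

# Hypothetical records from three source systems.
records = [
    {"system": "ehr", "patient_id": "E-1001", "name": "Ana Ruiz", "dob": "1980-02-14"},
    {"system": "lab", "patient_id": "L-77", "name": " ana ruiz", "dob": "1980-02-14"},
    {"system": "billing", "patient_id": "B-9", "name": "Sam Lee", "dob": "1975-09-30"},
]

dupes = possible_duplicates(records)
for key, ids in dupes.items():
    print(key, ids)
```

A check like this does not resolve identities. It only shows a data owner where one patient may be counted more than once before a model is trained on the combined data.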

What the Evidence Says About the Data Under AI

The FDA has incorporated some of these concepts into its Good Machine Learning Practice principles, where data lineage and data integrity support safe model development. Developers need to understand where their training data comes from and what will happen to that data after the model is deployed. A health system that cannot answer these questions about its data is building models on unknown territory.

Research from the Agency for Healthcare Research and Quality (AHRQ) finds that conclusions drawn from incomplete or biased data appear authoritative but in fact lack authority. The same holds for any model trained on data collected primarily for billing rather than for forecasting. The Coalition for Health AI (CHAI), which includes health care providers and health care technology organizations, treats data quality as essential to developing trustworthy AI.

In its work on the learning health system (LHS), the National Academy of Medicine describes data as a shared institutional resource. An LHS therefore relies on data that can be trusted across many contexts; data that is collected for one model and then left to age cannot serve that purpose. When data is shared in name but owned by no one, blind spots multiply and undermine any model built on that data.

Demographic data makes the discrepancies clear. If race and ethnicity are recorded one way in the emergency department and another way in the clinic, any model that uses both sources to assess equitable performance will be badly flawed. The ONC has developed data standards in response to this issue. Standards alone, however, will not eliminate the inequities caused by inconsistent data collection.
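A minimal sketch of how such a discrepancy could be surfaced before a model sees the data. The vocabulary in `STANDARD_CODES` and the sample values are hypothetical placeholders, not an actual ONC value set; a real check would use whatever standard the organization has adopted.

```python
from collections import Counter

# Hypothetical shared vocabulary, for illustration only.
STANDARD_CODES = {"asian", "black", "white", "other", "unknown"}

def coding_report(values):
    """Summarize how one source records race: the values it uses and
    the share of entries that fall outside the shared vocabulary."""
    normalized = [v.strip().lower() for v in values]
    counts = Counter(normalized)
    nonstandard = {v: n for v, n in counts.items() if v not in STANDARD_CODES}
    share = sum(nonstandard.values()) / len(normalized) if normalized else 0.0
    return {"counts": counts, "nonstandard": nonstandard, "nonstandard_share": share}

# Hypothetical extracts from two care settings.
ed_values = ["Black", "white", "B", "W", "Unknown", "unk"]
clinic_values = ["Black", "White", "Asian", "Other", "Unknown", "White"]

ed_report = coding_report(ed_values)
clinic_report = coding_report(clinic_values)
print("ED nonstandard share:", ed_report["nonstandard_share"])
print("Clinic nonstandard share:", clinic_report["nonstandard_share"])
```

A large gap between the two shares suggests the sources are not coding the same field the same way, which is worth resolving before any equity analysis relies on it.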

Someone Must Own the Data

New technology alone cannot solve the problem. Leading health systems do not simply hand their data over to clinical models. They have begun to assign data stewards. A data steward is the person who, when the laboratory information system (LIS) is upgraded, has the authority to decide whether the clinical model that depends on the LIS stays in use.

Data stewardship has a simple definition: assigning ownership of, and accountability for, data. Most health systems have a written policy stating that data stewardship is to be practiced. Very few have given a named data steward the authority to manage the data that underpins a model.

Data stewardship must accompany data ownership. A data steward routinely inspects the data behind a clinical model and records feedback from clinicians. ECRI, formerly the Emergency Care Research Institute, has described data integrity as a prevalent concern once a system is operational and its data changes after deployment.

Maturity models can help identify unexamined gaps. The Healthcare Information and Management Systems Society (HIMSS) ran a study on analytics maturity that describes a stepwise progression, from isolated departmental data through increasing standardization to consolidated enterprise data that the organization trusts for business decision-making. Most systems sit at a lower level of the maturity model than their AI ambitions assume, and that gap is where systems actually fail. The model can, and should, show an organization exactly which steps it is attempting to skip.

What Health Systems Can Expect

Over the next couple of years, health systems will acquire many more clinical models. Each will carry the same implicit assumption: that the data it uses is stable and understood. The more valuable question is not which model to buy but whether the organization knows the health of its data, and whether that data is good enough to support any model. Most leaders can name the model they plan to acquire. Very few can describe the data underneath it.

Meaningful, unglamorous work has to be done. Health systems that do this well assign a data owner to each clinical model, and that owner has the right to stop the model if the data changes meaningfully. Data quality is reviewed and improved proactively to prevent harm to patients. A health system with this kind of strategy will get more out of a large number of basic models than competitors that bought the most advanced models and let their data quality deteriorate.
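One way a data owner might decide whether the data has changed meaningfully is to compare the current distribution of a model input against a baseline. The sketch below uses the population stability index (PSI) for that comparison. The bin proportions are hypothetical, and the 0.2 threshold is a common rule of thumb assumed here for illustration, not a clinical standard.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two lists of bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Guard against empty bins, which would make the log undefined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def steward_decision(baseline, current, threshold=0.2):
    """Return ("pause" or "continue", score). The threshold is an assumption."""
    score = psi(baseline, current)
    return ("pause" if score > threshold else "continue", score)

# Hypothetical share of lab results in low / normal / high bins.
baseline = [0.25, 0.50, 0.25]
after_upgrade = [0.60, 0.30, 0.10]

decision, score = steward_decision(baseline, after_upgrade)
print(decision, round(score, 3))
```

The point of the sketch is the gate, not the metric: whichever measure is used, someone with authority has to own the threshold and act on it.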

Context

This edition refers to the 2019 article in the journal Science about racial bias in a population health algorithm; the Good Machine Learning Practice principles from the Food and Drug Administration; the interoperability standards from the Office of the National Coordinator for Health Information Technology; the data quality work of the Agency for Healthcare Research and Quality; the data quality assurance position of the Coalition for Health AI; the learning health system work of the National Academy of Medicine; the data integrity hazards listed by ECRI; and the analytics maturity study by the Healthcare Information and Management Systems Society. Related editions are The Hidden Infrastructure of Trust and Still Chasing Integration.

Christopher Hutchins, Founder and CEO, Hutchins Data Strategy Consultants

Tags: AI Health Pulse newsletter · healthcare AI · AI in healthcare · training data quality · data stewardship · model bias · AI governance
