To read the medical literature, you might think AI is taking over medicine. It can detect cancers on images earlier, find heart issues invisible to cardiologists, and predict organ dysfunction hours before it becomes dangerous to hospitalized patients.
But most of the AI models described in journals — and lionized in press releases — never make it into clinical use. And the rare exceptions have fallen well short of their revolutionary expectations.
On Wednesday, a group of academic hospitals, government agencies, and private companies unveiled a plan to change that. The group, billing itself as the Coalition for Health AI, called for the creation of independent testing bodies and a national registry of clinical algorithms to allow physicians and patients to assess their suitability and performance, and root out bias that so often skews their results.
“We don’t have the tools today to understand whether machine learning algorithms and these new technologies being deployed are good or bad for patients,” said John Halamka, president of Mayo Clinic Platform. The only way to change that, he said, is to more rigorously study their impacts and make the results transparent, so that users can understand the benefits and risks.
Like many documents of its kind, the coalition’s blueprint is merely a proclamation — a set of principles and recommendations that are eloquently articulated but easily ignored. The group is hoping that its broad membership will help stir a national conversation and spur concrete steps to start governing the use of AI in medicine. Its blueprint was built with input from Microsoft and Google, MITRE Corp., universities such as Stanford, Duke and Johns Hopkins, and government agencies including the Office of the National Coordinator for Health Information Technology, the Food and Drug Administration, the National Institutes of Health, and the Centers for Medicare & Medicaid Services.
Even with some level of buy-in from those organizations, the hardest part of the work remains to be done. The coalition must build consensus around ways to measure an AI tool’s usability, reliability, safety, and fairness. It will also need to establish the testing laboratories and registry, figure out which parties will host and maintain them, and convince AI developers to cooperate with new oversight and added transparency that may conflict with their business interests.
As it stands today, there are few guideposts hospitals can use to help test algorithms or understand how well they will work on their patients. Health systems have largely been left on their own to sort through the complicated legal and ethical questions AI systems pose and determine how to implement and monitor them.
“Ultimately, every device should ideally be calibrated and tested locally at every new site,” said Suchi Saria, a professor of machine learning and health care at Johns Hopkins University who helped create the blueprint. “And there should be a way to monitor and tune performance over time. This is essential for truly assessing safety and quality.”
The ability of hospitals to carry out those tasks should not be determined by the size of their budgets or access to data science teams typically only found at the largest academic centers, experts said. The coalition is calling for the creation of multiple laboratories around the country to allow developers to test their algorithms on more diverse sets of data and audit them for bias. That would ensure an algorithm built on data from California could be tested on patients from Ohio, New York, and Louisiana, for example. Currently, many algorithm developers — especially those situated in academic institutions — are building AI tools on their own data, which limits their applicability to other regions and populations of patients.
“It’s only in creating these communities that you can do the kind of training and tuning needed to get where we need to be, which is AI that serves all of us,” said Brian Anderson, chief digital health physician at MITRE. “If all we have are researchers training their AI on Bay Area patients or upper Midwest patients, and not doing the cross-training, I think that would be a very sorry state.”
The coalition is also discussing the idea of creating an accrediting organization that would certify an algorithm’s suitability for use on a given task or set of tasks. That would help to provide some level of quality assurance, so the proper uses and potential side effects of an algorithm could be understood and disclosed.
“We have to establish that AI-guided decision making is useful,” said Nigam Shah, a professor of biomedical informatics at Stanford. That requires going beyond assessments of an algorithm’s mathematical performance to studying whether it is actually improving outcomes for patients and clinical users.
“We need a mind shift from admiring the algorithm’s output and its beauty to saying ‘All right, let’s put in the elbow grease to get this into our work system and see what happens,’” Shah said. “We have to quantify usefulness as opposed to just performance.”
This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.