Millions of Americans rely on the internet to answer questions about their own health. The public release of powerful artificial intelligence models like ChatGPT has only accelerated this trend.
In a large survey, more than half of American adults reported putting their own health information into a large language model (LLM). And there’s reason to believe these models can bring real value. Consider the case of a mother who, after seeing 17 physicians and receiving no diagnosis for her son’s chronic pain, put his MRI reports and additional history into ChatGPT. It returned a diagnosis of tethered cord syndrome, which was later confirmed — and operated on — by a neurosurgeon.
This story is not unique. Missed or delayed diagnoses harm patients every day. Each year, an estimated 795,000 Americans die or become permanently disabled from misdiagnoses. And these misdiagnoses are not exclusively rare “zebras” like tethered cord syndrome. Just 15 or so diseases, many of them common, like heart disease and breast cancer, account for half of serious harms. The sicker an individual, the higher the stakes — and the more common these errors become. In a recent study of hospitalized patients who were transferred to an intensive care unit because their conditions worsened, 23% had experienced a diagnostic error; 17% of those errors caused severe harm or death.
While numerous factors — many of them outside the control of physicians — are at play in diagnostic errors, human cognition plays a major role. These problems have long been recognized by the medical community — the Institute of Medicine released its landmark report “To Err Is Human” in 1999, with comprehensive recommendations to reduce medical errors. But 25 years later, diagnostic errors remain stubbornly persistent.
While many people might imagine that a physician approaches a diagnosis much like Sherlock Holmes — or Dr. House — diligently collecting facts to compare against his or her encyclopedic knowledge of disease, the reality is far more prosaic. Decades of psychological study, influenced by the pioneering work of Daniel Kahneman and Amos Tversky, have shown that diagnosis is subject to the same predictable biases and heuristics as other domains of knowledge. For example, emergency room doctors were less likely to test for a pulmonary embolism (a blood clot in the lungs) when the triage information mentioned heart failure, even when objective data and documented symptoms pointed to a pulmonary embolism. The physicians got stuck on the first information given to them, a problem called anchoring bias.
Doctors do a poor job of estimating how likely patients are to have a given disease and how test results change those probabilities — and on such tasks they are readily outperformed by general-purpose language models. Decades of research have similarly shown that other cognitive biases, such as availability bias, confirmation bias, and premature closure, pervade the diagnostic process.
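To make the probability problem concrete, here is a minimal sketch, in Python, of the Bayesian updating that underlies test interpretation. The disease prevalence and test characteristics below are illustrative numbers we chose for the example, not figures from the studies cited in this essay.

```python
def post_test_probability(pretest_prob: float, sensitivity: float,
                          specificity: float, positive_result: bool) -> float:
    """Update a pretest probability given one test result using Bayes' rule."""
    if positive_result:
        true_pos = sensitivity * pretest_prob
        false_pos = (1 - specificity) * (1 - pretest_prob)
        return true_pos / (true_pos + false_pos)
    else:
        false_neg = (1 - sensitivity) * pretest_prob
        true_neg = specificity * (1 - pretest_prob)
        return false_neg / (false_neg + true_neg)

# Illustrative example: a condition with a 10% pretest probability and a test
# that is 90% sensitive and 80% specific. A positive result raises the
# probability only to about 33%, not to 90% as intuition often suggests.
print(post_test_probability(0.10, 0.90, 0.80, positive_result=True))  # ~0.33
```

It is exactly this kind of counterintuitive arithmetic that clinicians routinely get wrong and that a well-designed computational aid handles without effort.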
Since ChatGPT was released to the public in late 2022, there have been hundreds of demonstrations of the diagnostic reasoning capabilities of general-purpose large language models and other AI models across a broad array of diagnostic tasks, some of them performed by us with various collaborators. We believe there is compelling evidence that AI, safely integrated into the clinical workflow, could help address some of the limitations of human cognition in medical diagnosis today. In particular, AI could be made available as a “second opinion” service in the hospital to assist physicians and other medical professionals with challenging medical cases and to check for blind spots in diagnostic reasoning. Second opinion services staffed by human physicians — admittedly on a much smaller scale — have already shown that they can provide real value to patients.
What would this look like in practice?
Building a second opinion system powered by a large language model is no longer in the realm of science fiction. As a physician treating patients (A.R.) and a medical AI researcher (A.M.), we envision a system that allows a treating physician, using the electronic medical record, to place an “order.” But instead of selecting a diagnostic test, the physician would summarize the clinical question about a patient the same way they would pose it to a colleague. Once the order is submitted, the question, along with the entire chart, would go to a secure computing environment where an LLM would process it and return a recommendation covering possible diagnoses, blind spots, and therapeutic options.
Just as in the opening case, where the diagnosis of tethered cord syndrome was confirmed by a neurosurgeon, recommendations emerging from the model would first be reviewed by a physician serving as a human in the loop to catch obvious errors and hallucinations (instances in which an AI model confidently states factual inaccuracies). After this review, the second opinion would be sent back to the ordering physician to be placed in the medical record and weighed alongside their own judgment.
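For readers who want a more concrete picture, here is a minimal sketch, in Python, of the workflow we describe. Every name in it (SecondOpinionOrder, query_llm, generate_second_opinion) is a hypothetical placeholder of our own, not a real electronic medical record integration or LLM vendor API.

```python
from dataclasses import dataclass

@dataclass
class SecondOpinionOrder:
    patient_chart: str       # full chart text exported from the medical record
    clinical_question: str   # the ordering physician's free-text question

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM running in a secure computing environment."""
    raise NotImplementedError("Wire this to an approved, privacy-compliant model endpoint.")

def generate_second_opinion(order: SecondOpinionOrder) -> str:
    """Assemble the prompt and return a draft second opinion for human review."""
    prompt = (
        "You are assisting with a diagnostic second opinion.\n"
        f"Clinical question: {order.clinical_question}\n"
        f"Chart: {order.patient_chart}\n"
        "List possible diagnoses, potential blind spots, and therapeutic options."
    )
    draft = query_llm(prompt)
    # The draft is not sent to the ordering physician directly; a reviewing
    # physician (the human in the loop) screens it for errors and hallucinations first.
    return draft
```

The essential design choice is that the model’s draft never reaches the ordering physician until a reviewing physician has screened it.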
As with human second opinions, the requesting physician would not be obligated to follow the recommendations emerging from the LLM. But the mere process of considering other options can help reduce diagnostic errors. And unlike human second opinion services, the cost of running the model can be measured in cents, and the model can serve scores of clinicians and their patients in parallel.
To be sure, there are obvious risks that would need to be mitigated in early studies with close human involvement. LLMs carry the ethnic, racial, and gender biases of the data they were trained on, which could influence second opinions in unpredictable and harmful ways. LLMs can also hallucinate; while humans make errors too, AI hallucinations may be more egregious and more likely to cause harm. Having a human expert in the loop would be absolutely essential, especially in early studies.
However, the stakes of the current rate of diagnostic errors are so high, and other attempts to reduce errors have made so little headway, that we feel now is the time to start studying these technologies. To riff off the old saying, to err is human, so AI must opine.
Adam Rodman is a practicing internist at the Beth Israel Deaconess Medical Center and an assistant professor of medicine at Harvard Medical School. Arjun K. Manrai is an assistant professor of biomedical informatics at Harvard Medical School and a founding deputy editor of NEJM AI.