A study by researchers at the Icahn School of Medicine at Mount Sinai has identified strategies for using large language models (LLMs), a type of artificial intelligence (AI), in health systems while maintaining cost efficiency and performance.
The findings, published in the November 18 online issue of npj Digital Medicine, provide insights into how health systems can leverage advanced AI tools to automate tasks efficiently, saving time and reducing operational costs while ensuring these models remain reliable even under high task loads.
"Our findings provide a road map for health care systems to integrate advanced AI tools to automate tasks efficiently, potentially cutting costs for application programming interface (API) calls for LLMs up to 17-fold and ensuring stable performance under heavy workloads," says co-senior author Girish N. Nadkarni, MD, MPH, Irene and Dr. Arthur M. Fishberg Professor of Medicine at Icahn Mount Sinai, Director of The Charles Bronfman Institute of Personalized Medicine, and Chief of the Division of Data-Driven and Digital Medicine (D3M) at the Mount Sinai Health System.
Hospitals and health systems generate massive volumes of data every day. LLMs, such as OpenAI's GPT-4, offer encouraging ways to automate and streamline workflows by assisting with various tasks. However, continuously running these AI models is costly, creating a financial barrier to widespread use, say the investigators.
"Our study was motivated by the need to find practical ways to reduce costs while maintaining performance so health systems can confidently use LLMs at scale. We set out to 'stress test' these models, assessing how well they handle multiple tasks simultaneously, and to pinpoint strategies that keep both performance high and costs manageable," says first author Eyal Klang, MD, Director of the Generative AI Research Program in the D3M at Icahn Mount Sinai.
The study involved testing 10 LLMs with real patient data, examining how each model responded to various types of clinical questions. The team ran more than 300,000 experiments, incrementally increasing task loads to evaluate how the models managed rising demands.
Along with measuring accuracy, the team evaluated the models' adherence to clinical instructions. An economic analysis followed, revealing that grouping tasks could help hospitals cut AI-related costs while keeping model performance intact.
The study showed that LLMs can handle up to 50 clinical tasks grouped together in a single request without a significant drop in accuracy. Such tasks include matching patients for clinical trials, structuring research cohorts, extracting data for epidemiological studies, reviewing medication safety, and identifying patients eligible for preventive health screenings. This task-grouping approach suggests that hospitals could optimize workflows and reduce API costs as much as 17-fold, savings that could amount to millions of dollars per year for larger health systems, making advanced AI tools more financially viable.
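The cost logic behind task grouping can be illustrated with a rough sketch. It is not the study's method or economic model. It assumes a simple per-token pricing scheme in which the shared patient note is sent once per API call, so grouping several questions into one call avoids resending the note. All names (`Pricing`, `ungrouped_cost`, `grouped_cost`, `build_grouped_prompt`) and all token counts and prices are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pricing:
    # Hypothetical per-token prices; not taken from the study or any vendor.
    input_per_token: float
    output_per_token: float


def call_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one API call under the assumed per-token pricing."""
    return (input_tokens * pricing.input_per_token
            + output_tokens * pricing.output_per_token)


def ungrouped_cost(note_tokens: int, question_tokens: int, answer_tokens: int,
                   n_tasks: int, pricing: Pricing) -> float:
    """One call per task: the patient note is resent with every question."""
    per_call = call_cost(note_tokens + question_tokens, answer_tokens, pricing)
    return n_tasks * per_call


def grouped_cost(note_tokens: int, question_tokens: int, answer_tokens: int,
                 n_tasks: int, group_size: int, pricing: Pricing) -> float:
    """Up to `group_size` tasks per call: the note is sent once per group."""
    total = 0.0
    remaining = n_tasks
    while remaining > 0:
        k = min(group_size, remaining)
        total += call_cost(note_tokens + k * question_tokens,
                           k * answer_tokens, pricing)
        remaining -= k
    return total


def build_grouped_prompt(note: str, questions: List[str]) -> str:
    """Assemble one prompt holding the note and a numbered list of questions."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (f"Patient note:\n{note}\n\n"
            f"Answer each question separately:\n{numbered}")


if __name__ == "__main__":
    # Illustrative values only: a 1,000-token note, 20-token questions,
    # 10-token answers, and a unit price per token.
    pricing = Pricing(input_per_token=1.0, output_per_token=1.0)
    separate = ungrouped_cost(1000, 20, 10, n_tasks=50, pricing=pricing)
    together = grouped_cost(1000, 20, 10, n_tasks=50, group_size=50,
                            pricing=pricing)
    print(f"separate calls: {separate}, grouped call: {together}")
```

Under these assumptions, the saving grows with the size of the shared note relative to each question and answer. The ratio produced by these invented numbers is not the study's 17-fold figure, and the sketch says nothing about accuracy, which the investigators measured separately as task loads increased.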
"Recognizing the point at which these models begin to struggle under heavy cognitive loads is essential for maintaining reliability and operational stability. Our findings highlight a practical path for integrating generative AI in hospitals and open the door for further investigation of LLMs' capabilities within real-world limitations," says Dr. Nadkarni.
One unexpected finding, say the investigators, was that even advanced models like GPT-4 showed signs of strain when pushed to their cognitive limits. Rather than making only minor errors, the models would periodically and unpredictably drop in performance under pressure.
"This research has significant implications for how AI can be integrated into health care systems. Grouping tasks for LLMs not only reduces costs but also conserves resources that can be better directed toward patient care," says co-author David L. Reich, MD, Chief Clinical Officer of the Mount Sinai Health System; President of The Mount Sinai Hospital and Mount Sinai Queens; Horace W. Goldsmith Professor of Anesthesiology; and Professor of Artificial Intelligence and Human Health, and Pathology, Molecular and Cell-Based Medicine, at Icahn Mount Sinai. "And by recognizing the cognitive limits of these models, health care providers can maximize AI utility while mitigating risks, ensuring that these tools remain a reliable support in critical health care settings."
Next, the research team plans to explore how these models perform in real-time clinical environments, managing real patient workloads and interacting directly with health care teams. Additionally, the team aims to test emerging models to see if cognitive thresholds shift as technology advances, working toward a reliable framework for health care AI integration. Ultimately, they say, their goal is to equip health care systems with tools that balance efficiency, accuracy, and cost-effectiveness, enhancing patient care without introducing new risks.