Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, copilot, and ChatGPT-4.0
| Author | Yigit, Yavuz |
| Author | Ozbek, Asım Enes |
| Author | Dogru, Betul |
| Author | Gunay, Serkan |
| Author | Al Kahlout, Baha Hamdi |
| Available date | 2026-01-13T10:48:59Z |
| Publication Date | 2025 |
| Publication Name | International Journal of Emergency Medicine |
| Resource | Scopus |
| Identifier | http://dx.doi.org/10.1186/s12245-025-00895-3 |
| Citation | Yigit, Y., Ozbek, A.E., Dogru, B. et al. Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, copilot, and ChatGPT-4.0. Int J Emerg Med 18, 123 (2025). https://doi.org/10.1186/s12245-025-00895-3 |
| ISSN | 1865-1372 |
| Abstract | Background: The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges. Objectives: This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain. Methods: A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs (Gemini, Copilot, and ChatGPT-4.0) between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members. Results: Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries. Conclusion: Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems. |
| Language | en |
| Publisher | BioMed Central Ltd |
| Subject | ChatGPT-4.0; Copilot; Gemini; Imaging recommendations; Large language models (LLMs); Natural language processing (NLP); Renal colic |
| Type | Article |
| Issue Number | 1 |
| Volume Number | 18 |


