Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, copilot, and ChatGPT-4.0
| Author | Yigit, Yavuz |
| Author | Ozbek, Asim Enes |
| Author | Dogru, Betul |
| Author | Gunay, Serkan |
| Author | Al Kahlout, Baha Hamdi |
| Date available | 2026-01-13T10:48:59Z |
| Publication date | 2025 |
| Publication name | International Journal of Emergency Medicine |
| Source | Scopus |
| Identifier | http://dx.doi.org/10.1186/s12245-025-00895-3 |
| Citation | Yigit, Y., Ozbek, A.E., Dogru, B. et al. Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, copilot, and ChatGPT-4.0. Int J Emerg Med 18, 123 (2025). https://doi.org/10.1186/s12245-025-00895-3 |
| ISSN | 1865-1372 |
| Abstract | Background: The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges. Objectives: This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain. Methods: A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs (Gemini, Copilot, and ChatGPT-4.0) between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members. Results: Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries. Conclusion: Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems. |
| Language | en |
| Publisher | BioMed Central Ltd |
| Subject | ChatGPT-4.0; Copilot; Gemini; Imaging recommendations; Large language models (LLMs); Natural language processing (NLP); Renal colic |
| Type | Article |
| Issue number | 1 |
| Volume number | 18 |
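
The primary outcome described in the abstract, the rate of LLM responses matching the panel's majority consensus, reduces to simple proportions over the 29 vignettes. The sketch below illustrates how such agreement rates and a pairwise comparison could be computed; the per-model agreement counts are hypothetical values chosen only to reproduce the percentages quoted in the abstract, and the use of Fisher's exact test is an assumption, since the abstract does not state the authors' statistical method.

```python
# Minimal sketch (not the authors' code): agreement rates with the panel's
# majority consensus and a pairwise model comparison. All counts below are
# hypothetical, chosen only to match the percentages quoted in the abstract.
from scipy.stats import fisher_exact

N_VIGNETTES = 29  # clinical vignettes posed to each model

# Hypothetical agreement counts: 19/29 = 65.5%, 12/29 = 41.4%
agreement = {"Gemini": 19, "Copilot": 12, "ChatGPT-4.0": 12}

for model, n_agree in agreement.items():
    print(f"{model}: {n_agree}/{N_VIGNETTES} = {n_agree / N_VIGNETTES:.1%} "
          "agreement with majority consensus")

# Pairwise 2x2 comparison (rows = model, columns = [agree, disagree]);
# Fisher's exact test is an assumed choice, suited to the small sample size.
table = [
    [agreement["Gemini"], N_VIGNETTES - agreement["Gemini"]],
    [agreement["ChatGPT-4.0"], N_VIGNETTES - agreement["ChatGPT-4.0"]],
]
odds_ratio, p_value = fisher_exact(table)
print(f"Gemini vs ChatGPT-4.0: odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```

Note that this sketch covers only the overall agreement comparison; the p-values quoted in the abstract (0.045 and < 0.001) refer to the subgroup of scenarios rated perfect or excellent by the consensus, so they are not expected to match the output above.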