- ICH GCP
- US Clinical Trials Registry
- Klinisk forsøg NCT07632859
Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes (LLM-ED-DX-TR)
Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists
Studieoversigt
Status
Detaljeret beskrivelse
STUDY DESIGN: Retrospective diagnostic accuracy study, STARD-AI 2025 reporting, single center, cohort design.
AI INDEX TESTS: (1) GPT-4o (model version gpt-4o-2024-11-20; OpenAI API). (2) Claude 4.6 Sonnet (model version claude-sonnet-4-6; Anthropic API). Both accessed via Python (Google Colab). Temperature=0 for reproducibility. Zero-shot, stateless sessions - no cross-case context. No task-specific fine-tuning or additional training applied; models used as-is via API.
MODEL INTERPRETABILITY: Model interpretability analyses (such as SHAP, Grad-CAM, or layer-attribute visualizations) are not applicable to this study. Because GPT-4o and Claude 4.6 Sonnet are accessed as black-box models through proprietary, closed-source commercial APIs, internal model weights, gradients, and attention architectures are structurally inaccessible for post-hoc interpretability computations.
REFERENCE STANDARD: Three board-certified emergency medicine specialists independently evaluate each anonymized note, blinded to the original physician diagnosis and to each other. Primary diagnosis assigned by at least 2/3 specialists (majority vote) constitutes the gold standard. A 5-case calibration session precedes the main evaluation.
DATA PRIVACY: All anamnesis notes are fully de-identified (name, ID number, date of birth, physician name removed) prior to processing. De-identified notes are stored in a password-protected encrypted database. Only de-identified text is transmitted to LLM APIs - no personal health data. Compliant with Turkish Personal Data Protection Law (KVKK No. 6698).
PATIENT AND PUBLIC INVOLVEMENT: Not applicable. This retrospective study uses fully anonymized existing records; no patient or public involvement in design or conduct.
DATA SHARING: Anonymized dataset will be shared via Zenodo upon article acceptance. Statistical analysis code (Jamovi project files and Python prompt scripts) will be available on GitHub.
Undersøgelsestype
Tilmelding (Anslået)
Kontakter og lokationer
Studiekontakt
- Navn: Emir Ünal, Assistant Professor
- Telefonnummer: +905327766010
- E-mail: emirunal@gmail.com
Undersøgelse Kontakt Backup
- Navn: Emir Unal, Assistant Professor
- E-mail: emirunal@gmail.com
Studiesteder
-
-
Istanbul
-
Istanbul, Istanbul, Tyrkiet (Türkiye), 34899
- Marmara University Pendik Training and Research Hospital
-
Kontakt:
- Emir ünal
- E-mail: emirunal@gmail.com
-
-
Deltagelseskriterier
Berettigelseskriterier
Aldre berettiget til at studere
- Voksen
- Ældre voksen
Tager imod sunde frivillige
Prøveudtagningsmetode
Studiebefolkning
Beskrivelse
INCLUSION CRITERIA:
- Adult patients (aged 18 years and older) presenting to the emergency department.
- Complete electronic health record available in the hospital information system (HBYS) containing a detailed anamnesis note with chief complaint, symptom duration, associated symptoms, and relevant medical history.
- A definitive primary diagnosis recorded by the treating emergency physician using ICD-10 codes at the time of patient file closure.
EXCLUSION CRITERIA:
- Emergency department anamnesis notes containing fewer than 50 words or completely lacking substantive clinical content[cite: 1].
- Pediatric cases (age under 18 years)[cite: 1].
- Patients critically ill and triaged to high-acuity resuscitation areas (Emergency Severity Index [ESI] level 1)[cite: 1].
- Clinical notes containing residual identifying information that cannot be fully de-identified, preventing compliance with data privacy regulations[cite: 1].
- Non-independent clinical notes consisting solely of a brief cross-reference to a prior hospital visit without a new history entry[cite: 1].
Studieplan
Hvordan er undersøgelsen tilrettelagt?
Design detaljer
Kohorter og interventioner
Gruppe / kohorte |
|---|
|
Emergency Department Patient Cohort
Consecutive adult patients presenting to the emergency department with a fully documented electronic anamnesis note and a definitive primary ICD-10 diagnosis
|
Hvad måler undersøgelsen?
Primære resultatmål
Resultatmål |
Foranstaltningsbeskrivelse |
Tidsramme |
|---|---|---|
|
Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis
Tidsramme: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
|
Proportion of cases in which GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories).
Range: 0 to 1.00.
|
At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
|
|
Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis
Tidsramme: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
|
Proportion of cases in which Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories).
Range: 0 to 1.00.
|
At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
|
Sekundære resultatmål
Resultatmål |
Foranstaltningsbeskrivelse |
Tidsramme |
|---|---|---|
|
Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard
Tidsramme: At the time of algorithmic evaluation (June-July 2026)
|
Kappa coefficient measuring agreement between GPT-4o rank-1 ICD-10 chapter and the 3-specialist gold standard .
Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40
fair; 0.41-0.60
moderate; 0.61-0.80
substantial; >0.80 almost perfect .
Range: -1.00 to 1.00 .
|
At the time of algorithmic evaluation (June-July 2026)
|
|
Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard
Tidsramme: At the time of algorithmic evaluation (June-July 2026)
|
appa coefficient measuring agreement between Claude 4.6 Sonnet rank-1 ICD-10 chapter and the 3-specialist gold standard .
Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40
fair; 0.41-0.60
moderate; 0.61-0.80
substantial; >0.80 almost perfect .
Range: -1.00 to 1.00
|
At the time of algorithmic evaluation (June-July 2026)
|
|
Top-3 Diagnostic Accuracy of GPT-4o
Tidsramme: At the time of algorithmic evaluation (June-July 2026)
|
Proportion of cases in which the 3-specialist gold standard diagnosis appears within GPT-4o's ranked list of three differential diagnoses .
Range: 0 to 1.00
|
At the time of algorithmic evaluation (June-July 2026)
|
|
Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet
Tidsramme: At the time of algorithmic evaluation (June-July 2026)
|
Proportion of cases in which the 3-specialist gold standard diagnosis appears within Claude 4.6 Sonnet's ranked list of three differential diagnoses[cite: 1].
Range: 0 to 1.00
|
At the time of algorithmic evaluation (June-July 2026)
|
|
Treating Physician Diagnostic Accuracy Against Gold Standard
Tidsramme: At the time of the original clinical encounter (retrospective data spanning August-December 2025)
|
Proportion of cases in which the ICD-10 code entered by the treating emergency physician at file closure matches the 3-specialist majority-vote gold standard at the chapter level[cite: 1].
Range: 0 to 1.00
|
At the time of the original clinical encounter (retrospective data spanning August-December 2025)
|
Samarbejdspartnere og efterforskere
Efterforskere
- Ledende efterforsker: Emir Ünal, Marmara University
Publikationer og nyttige links
Generelle publikationer
- Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC, Kressel HY, Rifai N, Golub RM, Altman DG, Hooft L, Korevaar DA, Cohen JF; STARD Group. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015 Oct 28;351:h5527. doi: 10.1136/bmj.h5527.
- Newman-Toker DE, Peterson SM, Badihian S, Hassoon A, Nassery N, Parizadeh D, Wilson LM, Jia Y, Omron R, Tharmarajah S, Guerin L, Bastani PB, Fracica EA, Kotwal S, Robinson KA. Diagnostic Errors in the Emergency Department: A Systematic Review [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2022 Dec. Report No.: 22(23)-EHC043. Available from http://www.ncbi.nlm.nih.gov/books/NBK588118/
- Wei J et al. Chain-of-thought prompting elicits reasoning in LLMs. NeurIPS. 2022;35:24824-24837.
- Sounderajah V, Guni A, Liu X, Collins GS, Karthikesalingam A, Markar SR, Golub RM, Denniston AK, Shetty S, Moher D, Bossuyt PM, Darzi A, Ashrafian H; STARD-AI Steering Committee. The STARD-AI reporting guideline for diagnostic accuracy studies using artificial intelligence. Nat Med. 2025 Oct;31(10):3283-3289. doi: 10.1038/s41591-025-03953-8. Epub 2025 Sep 15.
- Niset A, Melot I, Pireau M, Englebert A, Scius N, Flament J, El Hadwe S, Al Barajraji M, Thonon H, Barrit S. Grounded large language models for diagnostic prediction in real-world emergency department settings. JAMIA Open. 2025 Oct 21;8(5):ooaf119. doi: 10.1093/jamiaopen/ooaf119. eCollection 2025 Oct.
- Williams CYK, Miao BY, Kornblith AE, Butte AJ. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun. 2024 Oct 8;15(1):8236. doi: 10.1038/s41467-024-52415-1.
- Hoppe JM, Auer MK, Struven A, Massberg S, Stremmel C. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res. 2024 Jul 8;26:e56110. doi: 10.2196/56110.
- Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA. 2023 Jul 3;330(1):78-80. doi: 10.1001/jama.2023.8288.
- Takita H, Kabata D, Walston SL, Tatekawa H, Saito K, Tsujimoto Y, Miki Y, Ueda D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med. 2025 Mar 22;8(1):175. doi: 10.1038/s41746-025-01543-z.
- Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med Inform. 2025 Apr 25;13:e64963. doi: 10.2196/64963.
- Taylor RA, Sangal RB, Smith ME, Haimovich AD, Rodman A, Iscoe MS, Pavuluri SK, Rose C, Janke AT, Wright DS, Socrates V, Declan A. Leveraging artificial intelligence to reduce diagnostic errors in emergency medicine: Challenges, opportunities, and future directions. Acad Emerg Med. 2025 Mar;32(3):327-339. doi: 10.1111/acem.15066. Epub 2024 Dec 15.
Datoer for undersøgelser
Studer store datoer
Studiestart (Anslået)
Primær færdiggørelse (Anslået)
Studieafslutning (Anslået)
Datoer for studieregistrering
Først indsendt
Først indsendt, der opfyldte QC-kriterier
Først opslået (Faktiske)
Opdateringer af undersøgelsesjournaler
Sidste opdatering sendt (Faktiske)
Sidste opdatering indsendt, der opfyldte kvalitetskontrolkriterier
Sidst verificeret
Mere information
Begreber relateret til denne undersøgelse
Yderligere relevante MeSH-vilkår
Andre undersøgelses-id-numre
- 09.2026.26-0514
Lægemiddel- og udstyrsoplysninger, undersøgelsesdokumenter
Studerer et amerikansk FDA-reguleret lægemiddelprodukt
Studerer et amerikansk FDA-reguleret enhedsprodukt
Disse oplysninger blev hentet direkte fra webstedet clinicaltrials.gov uden ændringer. Hvis du har nogen anmodninger om at ændre, fjerne eller opdatere dine undersøgelsesoplysninger, bedes du kontakte register@clinicaltrials.gov. Så snart en ændring er implementeret på clinicaltrials.gov, vil denne også blive opdateret automatisk på vores hjemmeside .
Kliniske forsøg med Akut medicin
-
Akdeniz University HospitalAfsluttetEmergency Airway Management | Gastric Inflation Risk During Bag-Valve-Mask Ventilation | Breathing EmergencyTyrkiet (Türkiye)
-
Hospital Israelita Albert EinsteinAfsluttetPatienter Emergency On-site Care ved Mobile Emergency UnitBrasilien
-
Insel Gruppe AG, University Hospital BernGaslini Children's HospitalTilmelding efter invitationAnæstesi | Trakeostomi komplikation | Emergency Front of Neck Airway hos børnSchweiz
-
RWTH Aachen UniversityAfsluttetBrug af telemedicin | Brug af telekonsultation | Emergency Medical Service Missions
-
University of Lausanne HospitalsUkendtHyppige brugere af Emergency Department (FUED'er)Schweiz
-
University Children's Hospital, ZurichAfsluttetEmergency Front of Neck Airway hos børnSchweiz
-
Nantes University HospitalIMT Mines Albi - France (https://www.imt-mines-albi.fr/)UkendtEmergency Medical Service Communication Systems, Health Care
-
Central Denmark RegionAfsluttetEmergency Medical Dispatch Center | Videostreaming | Nødopkald | Præhospital akutmedicinDanmark
-
Vanderbilt University Medical CenterAfsluttetOverholdelse, Medicin | Manglende overholdelse, medicinForenede Stater
-
Isfahan University of Medical SciencesIkke rekrutterer endnuUddannelse | Uddannelse, Medicin | Uddannelse, Medicin, Bachelor