Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes (LLM-ED-DX-TR)

June 22, 2026 updated by: Emir Ünal, Marmara University Pendik Training and Research Hospital

Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists

This retrospective diagnostic accuracy study evaluates the ability of two large language models (LLMs) - GPT-4o (gpt-4o-2024-11-20; OpenAI) and Claude 4.6 Sonnet (claude-sonnet-4-6; Anthropic) - to generate correct diagnoses from anonymized Turkish-language emergency department (ED) anamnesis notes, and compares their performance with the diagnosis entered by the treating emergency physician. A consensus gold standard is established by three independent board-certified emergency medicine specialists who blindly review each note and vote on the primary diagnosis using ICD-10 three-character codes; the majority vote (at least 2 of 3 specialists agreeing) constitutes the reference standard. Both LLMs are evaluated using a standardized zero-shot direct prompting strategy (temperature=0, stateless API sessions). The primary outcome is diagnostic accuracy (proportion of ICD-10 chapter-level matches) and Cohen's kappa for each LLM against the gold standard. Secondary outcomes include top-3 accuracy, treating physician accuracy, inter-model agreement, and subgroup analyses by ESI triage level and ICD-10 chapter. Inter-rater reliability among the three specialists is quantified using Fleiss' kappa. Analyses are performed in Jamovi. This study represents the first evaluation of LLM diagnostic accuracy using Turkish-language clinical notes and the first to benchmark LLM performance against an independent three-specialist majority-vote gold standard rather than against the treating physician's own diagnosis.

Study Overview

Status

Recruiting

Conditions

Detailed Description

STUDY DESIGN: Retrospective diagnostic accuracy study, STARD-AI 2025 reporting, single center, cohort design.

AI INDEX TESTS: (1) GPT-4o (model version gpt-4o-2024-11-20; OpenAI API). (2) Claude 4.6 Sonnet (model version claude-sonnet-4-6; Anthropic API). Both accessed via Python (Google Colab). Temperature=0 for reproducibility. Zero-shot, stateless sessions - no cross-case context. No task-specific fine-tuning or additional training applied; models used as-is via API.

MODEL INTERPRETABILITY: Model interpretability analyses (such as SHAP, Grad-CAM, or layer-attribute visualizations) are not applicable to this study. Because GPT-4o and Claude 4.6 Sonnet are accessed as black-box models through proprietary, closed-source commercial APIs, internal model weights, gradients, and attention architectures are structurally inaccessible for post-hoc interpretability computations.

REFERENCE STANDARD: Three board-certified emergency medicine specialists independently evaluate each anonymized note, blinded to the original physician diagnosis and to each other. Primary diagnosis assigned by at least 2/3 specialists (majority vote) constitutes the gold standard. A 5-case calibration session precedes the main evaluation.

DATA PRIVACY: All anamnesis notes are fully de-identified (name, ID number, date of birth, physician name removed) prior to processing. De-identified notes are stored in a password-protected encrypted database. Only de-identified text is transmitted to LLM APIs - no personal health data. Compliant with Turkish Personal Data Protection Law (KVKK No. 6698).

PATIENT AND PUBLIC INVOLVEMENT: Not applicable. This retrospective study uses fully anonymized existing records; no patient or public involvement in design or conduct.

DATA SHARING: Anonymized dataset will be shared via Zenodo upon article acceptance. Statistical analysis code (Jamovi project files and Python prompt scripts) will be available on GitHub.

Study Type

Observational

Enrollment (Estimated)

600

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Name: Emir Ünal, Assistant Professor
Phone Number: +905327766010
Email: emirunal@gmail.com

Study Contact Backup

Name: Emir Unal, Assistant Professor
Email: emirunal@gmail.com

Study Locations

Turkey (Türkiye)
- Istanbul
  - Istanbul, Istanbul, Turkey (Türkiye), 34899
    - Recruiting
    - Marmara University Pendik Training and Research Hospital
    - Contact:
      
      Emir ünal
      
      Email: emirunal@gmail.com

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Sampling Method

Non-Probability Sample

Study Population

The study population comprises consecutive adult patients (aged 18 years and older) who presented to the emergency department of a tertiary care training and research hospital and had their encounters fully documented in the hospital information system (HBYS). Eligible individuals must have a complete electronic anamnesis note containing the chief complaint, history of present illness, and clinical presentation, alongside a definitive primary ICD-10 diagnosis finalized by the treating emergency physician at file closure. The population excludes pediatric cases, patients triaged to high-acuity resuscitation areas (ESI level 1), and clinical notes with fewer than 50 words or insufficient clinical content.

Description

INCLUSION CRITERIA:

Adult patients (aged 18 years and older) presenting to the emergency department.
Complete electronic health record available in the hospital information system (HBYS) containing a detailed anamnesis note with chief complaint, symptom duration, associated symptoms, and relevant medical history.
A definitive primary diagnosis recorded by the treating emergency physician using ICD-10 codes at the time of patient file closure.

EXCLUSION CRITERIA:

Emergency department anamnesis notes containing fewer than 50 words or completely lacking substantive clinical content[cite: 1].
Pediatric cases (age under 18 years)[cite: 1].
Patients critically ill and triaged to high-acuity resuscitation areas (Emergency Severity Index [ESI] level 1)[cite: 1].
Clinical notes containing residual identifying information that cannot be fully de-identified, preventing compliance with data privacy regulations[cite: 1].
Non-independent clinical notes consisting solely of a brief cross-reference to a prior hospital visit without a new history entry[cite: 1].

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort
Emergency Department Patient Cohort Consecutive adult patients presenting to the emergency department with a fully documented electronic anamnesis note and a definitive primary ICD-10 diagnosis

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis Time Frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).	Proportion of cases in which GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.	At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis Time Frame: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).	Proportion of cases in which Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.	At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard Time Frame: At the time of algorithmic evaluation (June-July 2026)	Kappa coefficient measuring agreement between GPT-4o rank-1 ICD-10 chapter and the 3-specialist gold standard . Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect . Range: -1.00 to 1.00 .	At the time of algorithmic evaluation (June-July 2026)
Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard Time Frame: At the time of algorithmic evaluation (June-July 2026)	appa coefficient measuring agreement between Claude 4.6 Sonnet rank-1 ICD-10 chapter and the 3-specialist gold standard . Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect . Range: -1.00 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of GPT-4o Time Frame: At the time of algorithmic evaluation (June-July 2026)	Proportion of cases in which the 3-specialist gold standard diagnosis appears within GPT-4o's ranked list of three differential diagnoses . Range: 0 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet Time Frame: At the time of algorithmic evaluation (June-July 2026)	Proportion of cases in which the 3-specialist gold standard diagnosis appears within Claude 4.6 Sonnet's ranked list of three differential diagnoses[cite: 1]. Range: 0 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Treating Physician Diagnostic Accuracy Against Gold Standard Time Frame: At the time of the original clinical encounter (retrospective data spanning August-December 2025)	Proportion of cases in which the ICD-10 code entered by the treating emergency physician at file closure matches the 3-specialist majority-vote gold standard at the chapter level[cite: 1]. Range: 0 to 1.00	At the time of the original clinical encounter (retrospective data spanning August-December 2025)

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Marmara University Pendik Training and Research Hospital

Investigators

Principal Investigator: Emir Ünal, Marmara University

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Estimated)

June 1, 2026

Primary Completion (Estimated)

July 1, 2026

Study Completion (Estimated)

October 1, 2026

Study Registration Dates

First Submitted

June 3, 2026

First Submitted That Met QC Criteria

June 3, 2026

First Posted (Actual)

June 8, 2026

Study Record Updates

Last Update Posted (Actual)

June 25, 2026

Last Update Submitted That Met QC Criteria

June 22, 2026

Last Verified

June 1, 2026

More Information

Terms related to this study

Keywords

Large Language Model; GPT-4o; Claude 4.6 Sonnet; ICD-10; Clinical Coding; Turkish; Emergency Department; Diagnostic Accuracy; STARD; STARD-AI

Additional Relevant MeSH Terms

Other Study ID Numbers

09.2026.26-0514

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.

Clinical Trials on Emergency Medicine

Samsung Medical Center

Recruiting

Mobile Chat Service for Parents of Children in Pediatric Emergency Room

Emergency Medicine | Medical Informatics | Pediatric Emergency Medicine

Korea, Republic of
RWTH Aachen University

Completed

Feasibility of Telemedicine Under Ambulance Station Conditions

Telemedicine | Disaster Medicine | Emergency Medicine

Germany
Universiti Sains Malaysia

Enrolling by invitation

Comparative Evaluation of Ultrasound-Guided Femoral Nerve Block Training With Low-Fidelity Low-Cost Simulation Model Among Junior Doctors in Department of Emergency Medicine USM.

Emergency Medicine | Regional Anaesthesia

Malaysia
Washington University School of Medicine
Epharmix, Inc.

Completed

Improving Follow-Up for Discharged Emergency Care Patients

Emergency Medicine | Mobile Health | General Medicine

United States
Mario Negri Institute for Pharmacological Research
Centre Hospitalier Universitaire Vaudois; Fondazione Bruno Kessler; Astir s.r.l. and other collaborators

Recruiting

Propensity to Hospitalize Patients From the ED in European Centers. (eCREAM-UC1)

Emergency Medicine

Italy
Guangdong Provincial People's Hospital

Recruiting

International Big Data Centre in Emergency Medicine

Emergency Medicine

China
Massachusetts General Hospital

Terminated

Text-Enabled Ascertainment and Community Linkage for Health (TEACH)

Emergency Medicine

United States
Ataturk University

Completed

The Charlson Comorbidity Index: Predicting Severity in Emergency Departments (Charlson)

Emergency Medicine

Turkey
Mario Negri Institute for Pharmacological Research
Fondazione Bruno Kessler; Astir s.r.l.; Orobix Life S.r.l.

Recruiting

Development of a Multipurpose Dashboard to Monitor the Situation of Emergency Departments (eCREAM-UC2)

Emergency Medicine

Italy
University of Aarhus
Aarhus University Hospital; Herning Hospital; Horsens Hospital; Randers Regional... and other collaborators

Completed

Development and Evaluation of a Patient Safety Model

Emergency Medicine

Denmark

Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes (LLM-ED-DX-TR)

Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists

Study Overview

Status

Conditions

Detailed Description

Study Type

Enrollment (Estimated)

Contacts and Locations

Study Contact

Study Contact Backup

Study Locations

Participation Criteria

Eligibility Criteria

Ages Eligible for Study

Accepts Healthy Volunteers

Sampling Method

Study Population

Description

Study Plan

How is the study designed?

Design Details

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort

What is the study measuring?

Primary Outcome Measures

Outcome Measure

Measure Description

Time Frame

Secondary Outcome Measures

Outcome Measure

Measure Description

Time Frame

Collaborators and Investigators

Sponsor

Investigators

Publications and helpful links

General Publications

Study record dates

Study Major Dates

Study Start (Estimated)

Primary Completion (Estimated)

Study Completion (Estimated)

Study Registration Dates

First Submitted

First Submitted That Met QC Criteria

First Posted (Actual)

Study Record Updates

Last Update Posted (Actual)

Last Update Submitted That Met QC Criteria

Last Verified

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

Clinical Trials on Emergency Medicine

Search Similar Trials

Sponsors and Collaborators

Medical Conditions

Drug Interventions

CROs by country

CROs in Venezuela

Conditions

Rare Diseases

Drug Interventions

Dietary Supplements

Sponsor/Collaborators

Locations