- Clinical Trial NCT07626060
Diagnostic Accuracy of GPT-4o and Claude for HEART Score Calculation in Chest Pain (LLM-HEART)
Diagnostic Accuracy of Large Language Models (GPT-4o and Claude) in HEART Score Calculation and 30-Day MACE Prediction in Emergency Department Chest Pain Patients: A Prospective Observational Validation Study Against Three-Expert Consensus
This prospective observational diagnostic accuracy study evaluates whether large language models (LLMs) - GPT-4o (OpenAI, gpt-4o-2024-11-20) and Claude (Anthropic, claude-sonnet-4-6) - can accurately calculate HEART scores from unstructured Turkish clinical notes and predict 30-day major adverse cardiac events (MACE) in emergency department patients presenting with non-traumatic chest pain.
The study will enroll 600 consecutive adult patients. For each patient, the same anonymized data (free-text anamnesis, ECG report text, troponin value, and age) will be independently processed by both LLMs via separate API calls with deterministic settings (temperature=0, JSON format). A three-expert consensus HEART score - derived through blinded independent scoring by three emergency medicine physicians with majority-vote adjudication - serves as the reference standard for agreement analysis. Actual 30-day MACE (all-cause death, AMI Type 1/2/4b, unplanned revascularization) determined via national health database and telephone follow-up serves as the outcome for diagnostic accuracy analysis.
A secondary documentation-quality sub-study will quantify how spontaneously Turkish emergency anamnesis notes capture HEART score parameters.
Study Overview
Status
Conditions
Detailed Description
AI SYSTEM SPECIFICATIONS AND PROMPT PROTOCOL
Two distinct large language models (LLMs) will be evaluated as index tests: OpenAI GPT-4o (model string: gpt-4o-2024-11-20) and Anthropic Claude (model string: claude-sonnet-4-6). To ensure reproducibility and eliminate stochastic variation, both models will be accessed via standardized API calls using deterministic parameters (temperature = 0, max_tokens = 500, and strict JSON response format). The exact system prompt layout will be locked prior to initialization, and its integrity will be verified using a SHA-256 cryptographic hash. The models will evaluate each patient record independently in zero-shot isolation, with no cross-contamination or conversational history retention between runs.
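As a rough illustration of the prompt lock described above, the sketch below hashes a system prompt with SHA-256 and refuses to build a request if the prompt has changed. The prompt text, the `build_request` helper, and the provider-neutral dictionary layout are assumptions made for illustration; the study's actual prompt and the vendor-specific request shapes are not given in this record.

```python
import hashlib

# Hypothetical placeholder: the study's locked system prompt is not published here.
SYSTEM_PROMPT = "Return the five HEART score components as a JSON object."


def prompt_sha256(prompt: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# Recorded once, before the first patient record is processed.
LOCKED_HASH = prompt_sha256(SYSTEM_PROMPT)


def build_request(model: str, patient_text: str, prompt: str = SYSTEM_PROMPT) -> dict:
    """Build one stateless, zero-shot request for a single patient record.

    The dictionary is provider-neutral (an assumption); each vendor SDK
    expects its own request shape, which is not reproduced here.
    """
    if prompt_sha256(prompt) != LOCKED_HASH:
        raise ValueError("system prompt differs from the locked version")
    return {
        "model": model,
        "temperature": 0,    # deterministic setting named in the protocol
        "max_tokens": 500,   # output cap named in the protocol
        "system": prompt,
        "input": patient_text,  # no conversation history is carried over
    }


payload = build_request("gpt-4o-2024-11-20", "anonymized note text")
print(payload["model"], LOCKED_HASH[:12])
```

Building a fresh dictionary per record, with no shared message history, is one simple way to mirror the "zero-shot isolation" requirement.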
REFERENCE STANDARD CONSENSUS PROTOCOL
The reference standard consists of a structured consensus HEART score established by three independent emergency medicine physicians (each possessing >=3 years of clinical experience and specific training on HEART score criteria). The physicians will review the anonymized clinical charts while remaining strictly blinded to the LLM outputs and the final 30-day MACE outcomes. For each of the 5 HEART components (scored 0, 1, or 2), a majority vote (2/3 agreement) will determine the final component score. In the event of complete disagreement across all three reviewers on a specific component, a fourth independent adjudicator will resolve the tie.
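The majority-vote rule can be sketched as follows. This is a minimal illustration, not the study's adjudication software; the function names and the dictionary layout of the ratings are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# The five HEART components: History, ECG, Age, Risk factors, Troponin.
COMPONENTS = ("history", "ecg", "age", "risk_factors", "troponin")


def consensus_component(votes: Sequence[int], adjudicator: Optional[int] = None) -> int:
    """Return the score at least 2 of 3 raters agree on, else the adjudicator's score."""
    if len(votes) != 3 or any(v not in (0, 1, 2) for v in votes):
        raise ValueError("expected three votes, each 0, 1 or 2")
    score, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return score
    # All three raters disagree (0, 1 and 2): a fourth reviewer decides.
    if adjudicator not in (0, 1, 2):
        raise ValueError("complete disagreement: adjudicator score required")
    return adjudicator


def consensus_total(ratings: Dict[str, Sequence[int]],
                    adjudication: Optional[Dict[str, int]] = None) -> int:
    """Sum the five consensus component scores into a total HEART score (0-10)."""
    adjudication = adjudication or {}
    return sum(consensus_component(ratings[c], adjudication.get(c)) for c in COMPONENTS)


ratings = {
    "history": (2, 2, 1),
    "ecg": (0, 1, 2),          # complete disagreement
    "age": (1, 1, 1),
    "risk_factors": (1, 2, 2),
    "troponin": (0, 0, 0),
}
print(consensus_total(ratings, adjudication={"ecg": 1}))  # 2 + 1 + 1 + 2 + 0 = 6
```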
INDETERMINATE RESULTS MANAGEMENT
In strict compliance with STARD-AI 2025 guidelines, cases with missing or uninterpretable parameters within the free-text clinical notes will be classified into predefined indeterminate tiers:
- Complete Cases: 0 indeterminate components (eligible for primary diagnostic accuracy analysis).
- Partial Indeterminate: Exactly 1 missing component preventing definitive automatic calculation.
- Full Indeterminate: >=2 missing components.
The proportion of indeterminate classifications will be quantified for both LLMs and evaluated alongside the routine documentation quality of the charts.
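The three tiers reduce to a count of missing components. The sketch below assumes, purely for illustration, that a component the model could not score is represented as `None`; the record does not specify the actual output encoding.

```python
from typing import Mapping, Optional


def indeterminate_tier(components: Mapping[str, Optional[int]]) -> str:
    """Classify one case by how many HEART components are missing (None)."""
    missing = sum(1 for value in components.values() if value is None)
    if missing == 0:
        return "complete"
    if missing == 1:
        return "partial_indeterminate"
    return "full_indeterminate"


case = {"history": 1, "ecg": 0, "age": 2, "risk_factors": None, "troponin": 0}
print(indeterminate_tier(case))  # partial_indeterminate
```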
STATISTICAL ANALYSIS AND AGREEMENT WEIGHTING
The power and sample size calculations are based on the Hanley-McNeil method for the area under the ROC curve (AUC). Assuming an expected AUC of 0.85, a non-inferiority margin of 0.05, 80% power, and a two-sided alpha of 0.05, the primary complete-case analysis requires 600 evaluable patients. Allowing for an anticipated 15% indeterminate rate, the total enrollment target is set at 690 patients. Inter-rater agreement between each LLM and the expert consensus will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (0-10) and linear weighted Kappa for individual components (0-2). Diagnostic performance metrics (sensitivity, specificity, PPV, NPV) will be calculated at prespecified binary (>=4) and trimodal thresholds with 95% Wilson confidence intervals. Pairwise comparison of AUC values between GPT-4o and Claude will be performed using the DeLong test.
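For readers unfamiliar with weighted kappa, the following is a small pure-Python sketch of the statistic with linear or quadratic disagreement weights. It is an illustration only; the study's analysis code is not part of this record, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def weighted_kappa(a: Sequence[int], b: Sequence[int],
                   n_categories: int, weights: str = "quadratic") -> float:
    """Weighted Cohen's kappa for two raters using categories 0..n_categories-1."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    power = {"linear": 1, "quadratic": 2}[weights]
    n, k = len(a), n_categories
    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            num += w * observed[i][j]            # observed disagreement
            den += w * row[i] * col[j]           # chance-expected disagreement
    if den == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - num / den


llm = [3, 5, 7, 2, 4, 6]        # toy total HEART scores from a model
experts = [3, 4, 7, 2, 5, 6]    # toy consensus scores
print(round(weighted_kappa(llm, experts, n_categories=11), 3))
```

Quadratic weights penalize large score differences more heavily than linear weights, which is why they are the usual choice for an 11-level total score.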
DATA ANONYMIZATION AND PRIVACY
To ensure full compliance with local personal data protection legislation (KVKK), all free-text emergency department notes will undergo strict de-identification. Patient names, institutional ID numbers, precise dates, and specific demographic identifiers will be stripped entirely before formatting the data payload for API transmission.
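A very rough sketch of such a stripping step is shown below. The regular expressions, the placeholder tokens, and the idea of passing in a list of known names are all assumptions for illustration; the record does not describe the study's de-identification pipeline, and pattern matching alone would not be sufficient for real clinical text.

```python
import re
from typing import Iterable

# Assumed patterns: numeric dates such as 03.02.2026 and digit runs of 6+ characters.
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")
ID_RE = re.compile(r"\b\d{6,}\b")


def deidentify(note: str, known_names: Iterable[str] = ()) -> str:
    """Replace dates, long numeric identifiers and supplied names with placeholders."""
    text = DATE_RE.sub("[DATE]", note)
    text = ID_RE.sub("[ID]", text)
    for name in known_names:
        if name:
            text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


raw = "Ayse Yilmaz, protocol 12345678, seen 03.02.2026, troponin 14 ng/L"
print(deidentify(raw, known_names=["Ayse Yilmaz"]))
```

Clinical values such as the troponin result are left untouched because they match neither pattern.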
PATIENT AND PUBLIC INVOLVEMENT STATEMENT
Patient and public involvement was not applicable to this study, as it involves the analysis of routinely collected clinical data.
Study Type
Enrollment (Estimated)
Contacts and Locations
Study Contact
- Name: Emir Unal, Assistant Professor
- Phone Number: +905327766010
- Email: emirunal@gmail.com
Study Contact Backup
- Name: Emre Kudu, Associate Professor
- Email: dr.emre.kudu@gmail.com
Study Locations
Istanbul, İstanbul, Turkey (Türkiye), 34870
- Marmara University Pendik Training and Research Hospital
Contact:
- Emir Ünal
- Phone Number: 05327766010
- Email: emirunal@gmail.com
Sub-Investigator:
- Emre Kudu
Sub-Investigator:
- Erhan Altunbas
Sub-Investigator:
- Sinan Karacabey
Participation Criteria
Eligibility Criteria
Ages Eligible for Study
- Adult
- Older Adult
Accepts Healthy Volunteers
Sampling Method
Study Population
Description
INCLUSION CRITERIA:
- Age >=18 years
- Chief complaint of non-traumatic chest pain at the emergency department
- Written informed consent obtained from the patient or legally authorized representative
- Availability for 30-day follow-up (reachable by telephone and/or actively registered in the e-Nabiz national health database)
EXCLUSION CRITERIA:
- Traumatic chest pain etiology
- ST-elevation myocardial infarction (STEMI) at presentation requiring immediate reperfusion protocol
- Refusal or subsequent withdrawal of informed consent
- Inability to complete the mandatory 30-day follow-up period
WITHDRAWAL CRITERIA:
- Patient or representative requests data withdrawal after initial consent
- Administrative identification of retrospective data entry after enrollment
Study Plan
How is the study designed?
Design Details
What is the study measuring?
Primary Outcome Measures
| Outcome Measure | Measure Description | Time Frame |
|---|---|---|
| Area Under the ROC Curve (AUC) of GPT-4o and Claude HEART Score for 30-Day MACE Prediction | AUC calculated separately for GPT-4o and Claude using the Hanley-McNeil method. MACE is defined as a composite of all-cause death, acute myocardial infarction (Type 1/2/4b), and unplanned revascularization within 30 days. HEART score range is 0-10; a higher score indicates a higher risk of MACE. Analysis will be performed on complete cases only (0 indeterminate components). | 30 days after index emergency department visit |
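As a hedged illustration of the primary outcome, the sketch below estimates the AUC as the Mann-Whitney probability that a MACE patient scores higher than a non-MACE patient, and attaches the Hanley-McNeil (1982) standard error. The function name and the toy scores are hypothetical, and the sketch is not the study's analysis code.

```python
import math
from typing import Sequence, Tuple


def auc_hanley_mcneil(scores_pos: Sequence[float],
                      scores_neg: Sequence[float]) -> Tuple[float, float]:
    """Return (AUC, standard error) for scores of MACE and non-MACE patients."""
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome groups must be non-empty")
    # Mann-Whitney estimate: a win counts 1, a tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in scores_pos for q in scores_neg)
    auc = wins / (n_pos * n_neg)
    # Hanley-McNeil approximation of the variance.
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return auc, math.sqrt(max(var, 0.0))


mace_scores = [7, 8, 5]         # toy HEART totals, patients with 30-day MACE
no_mace_scores = [2, 3, 5, 1]   # toy HEART totals, patients without MACE
auc, se = auc_hanley_mcneil(mace_scores, no_mace_scores)
print(round(auc, 3), round(se, 3))
```

Because the HEART score takes only 11 integer values, ties between groups are common, so the half-credit for ties matters in practice.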
Secondary Outcome Measures
| Outcome Measure | Measure Description | Time Frame |
|---|---|---|
| Sensitivity and Specificity of GPT-4o and Claude HEART Score at Prespecified Thresholds | Diagnostic sensitivity and specificity calculated at two threshold types: (a) total score >=4 (binary high-risk cutoff) and (b) trimodal cutoffs (0-3 low risk, 4-6 intermediate risk, 7-10 high risk). Metrics will be reported with 95% Wilson confidence intervals separately for each LLM. | 30 days after index emergency department visit |
| Component-Level and Total-Score Agreement (Cohen's Kappa) Between LLMs and Expert Consensus | Inter-rater agreement will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (range 0-10) and linear weighted Kappa for the individual components (range 0-2). Calculated separately for GPT-4o vs. expert consensus and Claude vs. expert consensus. Values will be interpreted using the Landis & Koch scale (<0.20 poor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good, >0.80 excellent). | Baseline (At index emergency department visit) |
| Comparative AUC Difference Between GPT-4o and Claude (DeLong Test) | Statistical comparison of paired ROC curves between GPT-4o and Claude using the DeLong et al. (1988) method. The formal hypothesis is non-inferiority with an expected delta AUC <= 0.05. The correlation coefficient between the paired LLM measurements is estimated as rho >= 0.70. | 30 days after index emergency department visit |
| Proportion of Indeterminate Results for GPT-4o and Claude | The proportion of cases classified into predefined missing data tiers: Complete (0 indeterminate components), Partial indeterminate (exactly 1 missing component preventing definitive score calculation), and Full indeterminate (>=2 missing components). Reported separately for each LLM and statistically compared between the two models. | Baseline (At index emergency department visit) |
| HEART Parameter Documentation Rate in Routine Turkish Anamnesis Notes | For each of the 5 individual HEART components, the proportion of emergency department free-text anamnesis notes that spontaneously contain sufficient objective clinical information for scoring. Rates will be categorized as: Present and scorable, Partiall | Baseline (At index emergency department visit) |
| Subgroup AUC by Age Group and Sex (Algorithmic Bias Assessment) | AUC values for 30-day MACE prediction will be calculated separately across demographic strata: age groups (<45, 45-64, >=65 years) and biological sex (male vs. female). This analysis serves as the formal algorithmic bias assessment required by the STARD-AI 2025 guidelines. | 30 days after the index emergency department visit |
Collaborators and Investigators
Publications and helpful links
General Publications
- Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC, Kressel HY, Rifai N, Golub RM, Altman DG, Hooft L, Korevaar DA, Cohen JF; STARD Group. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015 Oct 28;351:h5527. doi: 10.1136/bmj.h5527.
- Backus BE, Six AJ, Kelder JC, Bosschaert MA, Mast EG, Mosterd A, Veldkamp RF, Wardeh AJ, Tio R, Braam R, Monnink SH, van Tooren R, Mast TP, van den Akker F, Cramer MJ, Poldervaart JM, Hoes AW, Doevendans PA. A prospective validation of the HEART score for chest pain patients at the emergency department. Int J Cardiol. 2013 Oct 3;168(3):2153-8. doi: 10.1016/j.ijcard.2013.01.255. Epub 2013 Mar 7.
- Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, Ghassemi M, Liu X, Reitsma JB, van Smeden M, Boulesteix AL, Camaradou JC, Celi LA, Denaxas S, Denniston AK, Glocker B, Golub RM, Harvey H, Heinze G, Hoffman MM, Kengne AP, Lam E, Lee N, Loder EW, Maier-Hein L, Mateen BA, McCradden MD, Oakden-Rayner L, Ordish J, Parnell R, Rose S, Singh K, Wynants L, Logullo P. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024 Apr 16;385:e078378. doi: 10.1136/bmj-2023-078378.
- Mahler SA, Riley RF, Hiestand BC, Russell GB, Hoekstra JW, Lefebvre CW, Nicks BA, Cline DM, Askew KL, Elliott SB, Herrington DM, Burke GL, Miller CD. The HEART Pathway randomized trial: identifying emergency department patients with acute chest pain for early discharge. Circ Cardiovasc Qual Outcomes. 2015 Mar;8(2):195-203. doi: 10.1161/CIRCOUTCOMES.114.001384. Epub 2015 Mar 3.
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Scharli N, Chowdhery A, Mansfield P, Demner-Fushman D, Aguera Y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.
- Albrecht M. C4-bound imidazolylidenes: from curiosities to high-impact carbene ligands. Chem Commun (Camb). 2008 Aug 21;(31):3601-10. doi: 10.1039/b806924g. Epub 2008 Jul 8.
Study record dates
Study Major Dates
Study Start (Estimated)
Primary Completion (Estimated)
Study Completion (Estimated)
Study Registration Dates
First Submitted
First Submitted That Met QC Criteria
First Posted (Actual)
Study Record Updates
Last Update Posted (Actual)
Last Update Submitted That Met QC Criteria
Last Verified
More Information
Terms related to this study
Keywords
Additional Relevant MeSH Terms
Other Study ID Numbers
- 09.2026.26-0150
Plan for Individual participant data (IPD)
Plan to Share Individual Participant Data (IPD)?
IPD Plan Description
IPD Sharing Time Frame
IPD Sharing Access Criteria
IPD Sharing Supporting Information Type
- STUDY_PROTOCOL
- SAP
- ANALYTIC_CODE
Drug and device information, study documents
Studies a U.S. FDA-regulated drug product
Studies a U.S. FDA-regulated device product