Diagnostic Accuracy of GPT-4o and Claude for HEART Score Calculation in Chest Pain (LLM-HEART)

June 22, 2026 updated by: Emir Ünal, Marmara University Pendik Training and Research Hospital

Diagnostic Accuracy of Large Language Models (GPT-4o and Claude) in HEART Score Calculation and 30-Day MACE Prediction in Emergency Department Chest Pain Patients: A Prospective Observational Validation Study Against Three-Expert Consensus

This prospective observational diagnostic accuracy study evaluates whether large language models (LLMs) - GPT-4o (OpenAI, gpt-4o-2024-11-20) and Claude (Anthropic, claude-sonnet-4-6) - can accurately calculate HEART scores from unstructured Turkish clinical notes and predict 30-day major adverse cardiac events (MACE) in emergency department patients presenting with non-traumatic chest pain.

The study will enroll 600 consecutive adult patients. For each patient, the same anonymized data (free-text anamnesis, ECG report text, troponin value, and age) will be independently processed by both LLMs via separate API calls with deterministic settings (temperature=0, JSON format). A three-expert consensus HEART score - derived through blinded independent scoring by three emergency medicine physicians with majority-vote adjudication - serves as the reference standard for agreement analysis. Actual 30-day MACE (all-cause death, AMI Type 1/2/4b, unplanned revascularization) determined via national health database and telephone follow-up serves as the outcome for diagnostic accuracy analysis.

A secondary documentation-quality sub-study will quantify how spontaneously Turkish emergency anamnesis notes capture HEART score parameters.

Study Overview

Status

Recruiting

Conditions

Intervention / Treatment

Detailed Description

AI SYSTEM SPECIFICATIONS AND PROMPT PROTOCOL Two distinct large language models (LLMs) will be evaluated as index tests: OpenAI GPT-4o (model string: gpt-4o-2024-11-20) and Anthropic Claude (model string: claude-sonnet-4-6). To ensure reproducibility and eliminate stochastic variation, both models will be accessed via standardized API calls using deterministic parameters (temperature = 0, max_tokens = 500, and strict JSON response format). The exact system prompt layout will be locked prior to initialization, and its integrity will be verified using a SHA-256 cryptographic hash. The models will evaluate each patient record independently in zero-shot isolation, with no cross-contamination or conversational history retention between runs.

REFERENCE STANDARD CONSENSUS PROTOCOL The reference standard consists of a structured consensus HEART score established by three independent emergency medicine physicians (each possessing >=3 years of clinical experience and specific training on HEART score criteria). The physicians will review the anonymized clinical charts while remaining strictly blinded to the LLM outputs and the final 30-day MACE outcomes. For each of the 5 HEART components (scored 0, 1, or 2), a majority vote (2/3 agreement) will determine the final component score. In the event of complete disagreement across all three reviewers on a specific component, a fourth independent adjudicator will resolve the tie.

INDETERMINATE RESULTS MANAGEMENT

In strict compliance with STARD-AI 2025 guidelines, cases with missing or uninterpretable parameters within the free-text clinical notes will be classified into predefined indeterminate tiers:

Complete Cases: 0 indeterminate components (eligible for primary diagnostic accuracy analysis).
Partial Indeterminate: Exactly 1 missing component preventing definitive automatic calculation.
Full Indeterminate: >=2 missing components. The proportion of indeterminate classifications will be quantified for both LLMs and evaluated alongside the routine documentation quality of the charts.

STATISTICAL ANALYSIS AND AGREEMENT WEIGHTING Statistical power and sample size calculation are based on the Hanley-McNeil methodology for the Area Under the ROC Curve (AUC). To achieve an expected AUC of 0.85 with a non-inferiority margin of 0.05, a power of 80%, and a two-sided alpha of 0.05, the primary complete-case analysis requires 600 evaluable patients. Accounting for an anticipated 15% indeterminate rate, a total enrollment target of 690 patients is set. Inter-rater agreement between each LLM and the expert consensus will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (0-10) and linear weighted Kappa for individual components (0-2). Diagnostic performance metrics (sensitivity, specificity, PPV, NPV) will be calculated at prespecified binary (>=4) and trimodal thresholds with 95% Wilson confidence intervals. Pairwise comparison of AUC values between GPT-4o and Claude will be executed using the DeLong test.

DATA ANONYMIZATION AND PRIVACY To ensure full compliance with local personal data protection legislation (KVKK), all free-text emergency department notes will undergo strict de-identification. Patient names, institutional ID numbers, precise dates, and specific demographic identifiers will be stripped entirely before formatting the data payload for API transmission.

PATIENT AND PUBLIC INVOLVEMENT BEYANI Patient and public involvement was not applicable to this study as it involves the analysis of routinely collected clinical data.

Study Type

Observational

Enrollment (Estimated)

690

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Name: Emir Ünal, Assistant Professor
Phone Number: +905327766010
Email: emirunal@gmail.com

Study Contact Backup

Name: Emre Kudu, associate professor
Email: dr.emre.kudu@gmail.com

Study Locations

Turkey (Türkiye)
- Istanbul
  - Istanbul, Istanbul, Turkey (Türkiye), 34870
    - Recruiting
    - Marmara University Pendik Training and Research Hospital
    - Contact:
      
      Emir ünal
      
      Phone Number: 05327766010
      
      Email: emirunal@gmail.com
    - Sub-Investigator:
      
      Emre Kudu
    - Sub-Investigator:
      
      Erhan Altunbas
    - Sub-Investigator:
      
      Sinan Karacabey

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Sampling Method

Non-Probability Sample

Study Population

The study population consists of consecutive adult patients presenting with a chief complaint of non-traumatic chest pain to the emergency department of Marmara University Pendik Training and Research Hospital, a tertiary care academic medical center in Istanbul, Turkey. This target population comprises real-world emergency medicine admissions that require acute coronary syndrome risk stratification and evaluation with the HEART score. It excludes individuals presenting with traumatic pain etiologies or acute ST-elevation myocardial infarction (STEMI) requiring immediate, time-critical reperfusion pathways.

Description

INCLUSION CRITERIA:

Age >=18 years
Chief complaint of non-traumatic chest pain at the emergency department
Written informed consent obtained from the patient or legally authorized representative
Availability for 30-day follow-up (reachable by telephone and/or actively registered in the e-Nabiz national health database)

EXCLUSION CRITERIA:

Traumatic chest pain etiology
ST-elevation myocardial infarction (STEMI) at presentation requiring immediate reperfusion protocol
Refusal or subsequent withdrawal of informed consent
Inability to complete the mandatory 30-day follow-up period

WITHDRAWAL CRITERIA:

Patient or representative requests data withdrawal after initial consent
Administrative identification of retrospective data entry after enrollment

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Area Under the ROC Curve (AUC) of GPT-4o and Claude HEART Score for 30-Day MACE Prediction Time Frame: 30 days after index emergency department visit	AUC calculated separately for GPT-4o and Claude using the Hanley-McNeil method. MACE is defined as a composite of all-cause death, acute myocardial infarction (Type 1/2/4b), and unplanned revascularization within 30 days. HEART score range is 0-10; a higher score indicates a higher risk of MACE. Analysis will be performed on complete cases only (0 indeterminate components).	30 days after index emergency department visit

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Sensitivity and Specificity of GPT-4o and Claude HEART Score at Prespecified Thresholds Time Frame: 30 days after index emergency department visit	Diagnostic sensitivity and specificity calculated at two threshold types: (a) total score >=4 (binary high-risk cutoff) and (b) trimodal cutoffs (0-3 low risk, 4-6 intermediate risk, 7-10 high risk). Metrics will be reported with 95% Wilson confidence intervals separately for each LLM.	30 days after index emergency department visit
Component-Level and Total-Score Agreement (Cohen's Kappa) Between LLMs and Expert Consensus Time Frame: Baseline (At index emergency department visit)	Inter-rater agreement will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (range 0-10) and linear weighted Kappa for the individual components (range 0-2). Calculated separately for GPT-4o vs. expert consensus and Claude vs. expert consensus. Values will be interpreted using the Landis & Koch scale (<0.20 poor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good, >0.80 excellent).	Baseline (At index emergency department visit)
Comparative AUC Difference Between GPT-4o and Claude (DeLong Test) Time Frame: 30 days after index emergency department visit	Statistical comparison of paired ROC curves between GPT-4o and Claude using the DeLong et al. (1988) method. The formal hypothesis is non-inferiority with an expected delta AUC <= 0.05. The correlation coefficient between the paired LLM measurements is estimated as rho >= 0.70.	30 days after index emergency department visit
Proportion of Indeterminate Results for GPT-4o and Claude Time Frame: Baseline (At index emergency department visit)	The proportion of cases classified into predefined missing data tiers: Complete (0 indeterminate components), Partial indeterminate (exactly 1 missing component preventing definitive score calculation), and Full indeterminate (>=2 missing components). Reported separately for each LLM and statistically compared between the two models.	Baseline (At index emergency department visit)
HEART Parameter Documentation Rate in Routine Turkish Anamnesis Notes Time Frame: Baseline (At index emergency department visit)	For each of the 5 individual HEART components, the proportion of emergency department free-text anamnesis notes that spontaneously contain sufficient objective clinical information for scoring. Rates will be categorized as: Present and scorable, Partiall	Baseline (At index emergency department visit)
Subgroup AUC by Age Group and Sex (Algorithmic Bias Assessment) Time Frame: 30 days after the index emergency department visit	AUC values for 30-day MACE prediction were calculated separately across demographic strata: age groups (<45, 45-64, >=65 years) and biological sex (male vs. female). This analysis serves as the formal algorithmic bias assessment required by the STARD-AI 2025 guidelines.	30 days after the index emergency department visit

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Marmara University Pendik Training and Research Hospital

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Estimated)

June 1, 2026

Primary Completion (Estimated)

March 1, 2027

Study Completion (Estimated)

June 1, 2027

Study Registration Dates

First Submitted

May 27, 2026

First Submitted That Met QC Criteria

June 3, 2026

First Posted (Actual)

June 4, 2026

Study Record Updates

Last Update Posted (Actual)

June 23, 2026

Last Update Submitted That Met QC Criteria

June 22, 2026

Last Verified

June 1, 2026

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

09.2026.26-0150

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

YES

IPD Plan Description

Anonymized individual participant data (including de-identified baseline demographics, clinical presentation characteristics, index test outputs from GPT-4o and Claude, and the reference standard expert consensus HEART scores) will be made publicly available to support academic transparency and replication. Additionally, the complete deterministic system prompt texts (verified with SHA-256 cryptographic hashes) and the complete statistical analysis code will be included as supplementary material.

IPD Sharing Time Frame

The anonymized dataset, protocol documents, and analytic code will be made available immediately upon formal publication of the study results.

IPD Sharing Access Criteria

Data and code will be accessible via an open-access repository on the Open Science Framework (OSF) for researchers and clinicians interested in replication or meta-analysis.

IPD Sharing Supporting Information Type

STUDY_PROTOCOL
SAP
ANALYTIC_CODE

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.

Clinical Trials on Emergency Medicine

Samsung Medical Center

Recruiting

Mobile Chat Service for Parents of Children in Pediatric Emergency Room

Emergency Medicine | Medical Informatics | Pediatric Emergency Medicine

Korea, Republic of
Universiti Sains Malaysia

Enrolling by invitation

Comparative Evaluation of Ultrasound-Guided Femoral Nerve Block Training With Low-Fidelity Low-Cost Simulation Model Among Junior Doctors in Department of Emergency Medicine USM.

Emergency Medicine | Regional Anaesthesia

Malaysia
RWTH Aachen University

Completed

Feasibility of Telemedicine Under Ambulance Station Conditions

Telemedicine | Disaster Medicine | Emergency Medicine

Germany
Washington University School of Medicine
Epharmix, Inc.

Completed

Improving Follow-Up for Discharged Emergency Care Patients

Emergency Medicine | Mobile Health | General Medicine

United States
Nura Medical

Not yet recruiting

Testing of Medication Dosing Software

Pediatric Emergency Medicine
Mario Negri Institute for Pharmacological Research
Centre Hospitalier Universitaire Vaudois; Fondazione Bruno Kessler; Astir s.r.l. and other collaborators

Recruiting

Propensity to Hospitalize Patients From the ED in European Centers. (eCREAM-UC1)

Emergency Medicine

Italy
Guangdong Provincial People's Hospital

Recruiting

International Big Data Centre in Emergency Medicine

Emergency Medicine

China
Massachusetts General Hospital

Terminated

Text-Enabled Ascertainment and Community Linkage for Health (TEACH)

Emergency Medicine

United States
Ataturk University

Completed

The Charlson Comorbidity Index: Predicting Severity in Emergency Departments (Charlson)

Emergency Medicine

Turkey
Mario Negri Institute for Pharmacological Research
Fondazione Bruno Kessler; Astir s.r.l.; Orobix Life S.r.l.

Recruiting

Development of a Multipurpose Dashboard to Monitor the Situation of Emergency Departments (eCREAM-UC2)

Emergency Medicine

Italy

Clinical Trials on GPT-4o HEART Score Calculator

Maastricht University
Aga Khan University; University of Indonesia, Jakarta, Indonesia

Completed

The Big Unknown: A Journey Into Generative AI's Transformative Effect on Medical Professions

Diagnosis | Vignette of Fictional Patients

Netherlands, Indonesia, Kenya
North Sichuan Medical College
Peking University; Peking University First Hospital; Monash University; Case Western... and other collaborators

Not yet recruiting

Multi-Disciplinary Treatment on the Anthropomorphism of Large Language Models (MDTALLM)

Heart Diseases | Infections | Pneumonia | Disease | Cancer | Respiratory Failure

China
North Sichuan Medical College
Afﬁliated Hospital of North Sichuan Medical College

Completed

Ophthalmic Diseases and AI: an RCT Study

Eye Diseases

China
University College, London

Enrolling by invitation

Evaluating the Effectiveness and Acceptability of a GPT-4o and RAG-Based Voice Chatbot for Depression Screening Using PHQ-9 (GPT4-RAG-PHQ)

Depression Anxiety Disorder | Depression - Major Depressive Disorder

United Kingdom
Ohio University
OhioHealth

Unknown

Healthy Heart Score Intervention In the Primary Care Setting

Cardiovascular Diseases

United States
University of Florida
Roche Diagnostics

Completed

High-Sensitivity Cardiac Troponin T to OPtimize Chest Pain Risk Stratification (STOP CP)

Chest Pain | Acute Coronary Syndrome

United States
VieCuri Medical Centre

Terminated

Improving the Referral of Patients With Chest Pain (Urgent)

Myocardial Infarction | Chest Pain | Acute Coronary Syndrome

Netherlands
UMC Utrecht
ZonMw: The Netherlands Organisation for Health Research and Development

Completed

Impact on Management of the HEART Risk Score in Chest Pain Patients (HEART-Impact)

Chest Pain

Netherlands
Nanyang Technological University
National University Health System, Singapore; National University of Singapore

Recruiting

Healthy Living With Online suPport & Education for Cardiovascular Disease in the Primary Care Setting (HOPE-CVD-GP)

Cardiovascular Diseases (CVD)

Singapore
Kaiser Permanente

Completed

Standardizing Emergency Work-ups Around Risk Data (STEWARD)

Chest Pain | Acute Coronary Syndrome | Risk Reduction

United States

Diagnostic Accuracy of GPT-4o and Claude for HEART Score Calculation in Chest Pain (LLM-HEART)

Diagnostic Accuracy of Large Language Models (GPT-4o and Claude) in HEART Score Calculation and 30-Day MACE Prediction in Emergency Department Chest Pain Patients: A Prospective Observational Validation Study Against Three-Expert Consensus

Study Overview

Status

Conditions

Intervention / Treatment

Detailed Description

Study Type

Enrollment (Estimated)

Contacts and Locations

Study Contact

Study Contact Backup

Study Locations

Participation Criteria

Eligibility Criteria

Ages Eligible for Study

Accepts Healthy Volunteers

Sampling Method

Study Population

Description

Study Plan

How is the study designed?

Design Details

What is the study measuring?

Primary Outcome Measures

Outcome Measure

Measure Description

Time Frame

Secondary Outcome Measures

Outcome Measure

Measure Description

Time Frame

Collaborators and Investigators

Sponsor

Publications and helpful links

General Publications

Study record dates

Study Major Dates

Study Start (Estimated)

Primary Completion (Estimated)

Study Completion (Estimated)

Study Registration Dates

First Submitted

First Submitted That Met QC Criteria

First Posted (Actual)

Study Record Updates

Last Update Posted (Actual)

Last Update Submitted That Met QC Criteria

Last Verified

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

IPD Plan Description

IPD Sharing Time Frame

IPD Sharing Access Criteria

IPD Sharing Supporting Information Type

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

Clinical Trials on Emergency Medicine

Clinical Trials on GPT-4o HEART Score Calculator

Search Similar Trials

Sponsors and Collaborators

Medical Conditions

Drug Interventions

CROs by country

CROs in Estonia

Conditions

Rare Diseases

Drug Interventions

Dietary Supplements

Sponsor/Collaborators

Locations