Diagnostic Accuracy of GPT-4o and Claude for HEART Score Calculation in Chest Pain (LLM-HEART)

Updated June 22, 2026 by: Emir Ünal, Marmara University Pendik Training and Research Hospital

Diagnostic Accuracy of Large Language Models (GPT-4o and Claude) in HEART Score Calculation and 30-Day MACE Prediction in Emergency Department Chest Pain Patients: A Prospective Observational Validation Study Against Three-Expert Consensus

This prospective observational diagnostic accuracy study evaluates whether large language models (LLMs) - GPT-4o (OpenAI, gpt-4o-2024-11-20) and Claude (Anthropic, claude-sonnet-4-6) - can accurately calculate HEART scores from unstructured Turkish clinical notes and predict 30-day major adverse cardiac events (MACE) in emergency department patients presenting with non-traumatic chest pain.

The study will enroll 690 consecutive adult patients to yield 600 evaluable complete cases. For each patient, the same anonymized data (free-text anamnesis, ECG report text, troponin value, and age) will be independently processed by both LLMs via separate API calls with deterministic settings (temperature=0, JSON format). A three-expert consensus HEART score, derived through blinded independent scoring by three emergency medicine physicians with majority-vote adjudication, serves as the reference standard for agreement analysis. Actual 30-day MACE (all-cause death, AMI Type 1/2/4b, unplanned revascularization), determined via the national health database and telephone follow-up, serves as the outcome for diagnostic accuracy analysis.

A secondary documentation-quality sub-study will quantify how spontaneously Turkish emergency anamnesis notes capture HEART score parameters.

Study Overview

Status

Recruiting

Conditions

Intervention / Treatment

Detailed Description

AI SYSTEM SPECIFICATIONS AND PROMPT PROTOCOL

Two distinct large language models (LLMs) will be evaluated as index tests: OpenAI GPT-4o (model string: gpt-4o-2024-11-20) and Anthropic Claude (model string: claude-sonnet-4-6). To ensure reproducibility and eliminate stochastic variation, both models will be accessed via standardized API calls using deterministic parameters (temperature = 0, max_tokens = 500, and strict JSON response format). The exact system prompt layout will be locked prior to initialization, and its integrity will be verified using a SHA-256 cryptographic hash. The models will evaluate each patient record independently in zero-shot isolation, with no cross-contamination or conversational history retention between runs.
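The prompt-locking step described above could be checked along the following lines. This is a minimal sketch, not the study's code: the prompt text, the `build_request` helper, and the request dictionary layout are hypothetical, and no vendor API is called. Only the model strings, `temperature = 0`, and `max_tokens = 500` come from the protocol.

```python
import hashlib

# Hypothetical locked prompt text; the study's real prompt is not reproduced here.
SYSTEM_PROMPT = "Score each HEART component (0, 1, or 2) and reply in JSON."


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Digest recorded once, at the moment the prompt is locked.
LOCKED_DIGEST = sha256_hex(SYSTEM_PROMPT)

# Parameter values are from the protocol; the dictionary layout is an assumption.
RUN_SETTINGS = {
    "gpt-4o-2024-11-20": {"temperature": 0, "max_tokens": 500},
    "claude-sonnet-4-6": {"temperature": 0, "max_tokens": 500},
}


def verify_prompt(prompt: str, expected_digest: str) -> bool:
    """True only if the prompt is byte-for-byte identical to the locked version."""
    return sha256_hex(prompt) == expected_digest


def build_request(model: str, prompt: str, patient_record: str) -> dict:
    """Assemble one stateless, zero-shot request with no prior conversation turns."""
    if not verify_prompt(prompt, LOCKED_DIGEST):
        raise ValueError("system prompt differs from the locked version")
    return {
        "model": model,
        "system": prompt,
        "input": patient_record,
        **RUN_SETTINGS[model],
    }
```

Because each request is built from scratch, nothing from a previous patient can leak into the next call, and any edit to the prompt, even a single trailing space, changes the digest and is rejected.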

REFERENCE STANDARD CONSENSUS PROTOCOL

The reference standard consists of a structured consensus HEART score established by three independent emergency medicine physicians (each possessing >=3 years of clinical experience and specific training on HEART score criteria). The physicians will review the anonymized clinical charts while remaining strictly blinded to the LLM outputs and the final 30-day MACE outcomes. For each of the 5 HEART components (scored 0, 1, or 2), a majority vote (2/3 agreement) will determine the final component score. In the event of complete disagreement across all three reviewers on a specific component, a fourth independent adjudicator will resolve the tie.
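The majority-vote rule can be illustrated with a short sketch. The function names and the dictionary-based input format are assumptions made for this example; the component names are the standard HEART items (History, ECG, Age, Risk factors, Troponin).

```python
from collections import Counter
from typing import Mapping, Optional, Sequence

COMPONENTS = ("history", "ecg", "age", "risk_factors", "troponin")


def consensus_component(votes: Sequence[int], adjudicator: Optional[int] = None) -> int:
    """Return the score chosen by at least 2 of 3 raters, else the adjudicator's score."""
    if len(votes) != 3 or any(v not in (0, 1, 2) for v in votes):
        raise ValueError("expected three votes, each 0, 1, or 2")
    score, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return score
    # All three raters disagree (0, 1, 2 in some order): a fourth rater decides.
    if adjudicator not in (0, 1, 2):
        raise ValueError("three-way disagreement needs a fourth adjudicator score")
    return adjudicator


def consensus_total(
    ratings: Mapping[str, Sequence[int]],
    adjudications: Optional[Mapping[str, int]] = None,
) -> int:
    """Sum the five consensus component scores into a total HEART score (0-10)."""
    adjudications = adjudications or {}
    return sum(
        consensus_component(ratings[name], adjudications.get(name))
        for name in COMPONENTS
    )
```

With three raters and three possible scores, a tie can only take the form of one vote for each of 0, 1, and 2, which is exactly the case the protocol routes to the fourth adjudicator.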

INDETERMINATE RESULTS MANAGEMENT

In strict compliance with STARD-AI 2025 guidelines, cases with missing or uninterpretable parameters within the free-text clinical notes will be classified into predefined indeterminate tiers:

Complete Cases: 0 indeterminate components (eligible for primary diagnostic accuracy analysis).
Partial Indeterminate: Exactly 1 missing component preventing definitive automatic calculation.
Full Indeterminate: >=2 missing components.

The proportion of indeterminate classifications will be quantified for both LLMs and evaluated alongside the routine documentation quality of the charts.
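The three tiers reduce to a count of missing components. The sketch below assumes, purely for illustration, that an unscorable component is represented as `None` in a per-patient mapping; the tier labels and function name are likewise hypothetical.

```python
from typing import Mapping, Optional


def indeterminate_tier(components: Mapping[str, Optional[int]]) -> str:
    """Classify one case by how many HEART components could not be scored."""
    missing = sum(1 for value in components.values() if value is None)
    if missing == 0:
        return "complete"  # eligible for the primary accuracy analysis
    if missing == 1:
        return "partial_indeterminate"
    return "full_indeterminate"  # two or more components missing


def tier_proportions(cases: list) -> dict:
    """Proportion of cases in each tier, for reporting per LLM."""
    counts = {"complete": 0, "partial_indeterminate": 0, "full_indeterminate": 0}
    for case in cases:
        counts[indeterminate_tier(case)] += 1
    total = len(cases)
    return {tier: (n / total if total else 0.0) for tier, n in counts.items()}
```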

STATISTICAL ANALYSIS AND AGREEMENT WEIGHTING

Statistical power and sample size calculation are based on the Hanley-McNeil methodology for the Area Under the ROC Curve (AUC). To achieve an expected AUC of 0.85 with a non-inferiority margin of 0.05, a power of 80%, and a two-sided alpha of 0.05, the primary complete-case analysis requires 600 evaluable patients. Accounting for an anticipated 15% indeterminate rate, a total enrollment target of 690 patients is set. Inter-rater agreement between each LLM and the expert consensus will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (0-10) and linear weighted Kappa for individual components (0-2). Diagnostic performance metrics (sensitivity, specificity, PPV, NPV) will be calculated at prespecified binary (>=4) and trimodal thresholds with 95% Wilson confidence intervals. Pairwise comparison of AUC values between GPT-4o and Claude will be executed using the DeLong test.
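As an illustration of the agreement statistic, weighted Cohen's Kappa can be computed from disagreement weights: |i - j| / (k - 1) for linear weighting and its square for quadratic weighting, where k is the number of categories. The sketch below is a plain-Python version written for this example; the function name and argument layout are assumptions, and the study's own analysis code may differ.

```python
from typing import Sequence


def weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    n_categories: int,
    weighting: str = "quadratic",
) -> float:
    """Weighted Cohen's kappa for integer ratings in 0 .. n_categories - 1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    power = {"linear": 1, "quadratic": 2}[weighting]
    n, k = len(rater_a), n_categories

    # Observed joint proportions of (rater_a, rater_b) ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected
```

For the total HEART score the call would use `n_categories=11` with quadratic weighting, and for a single component `n_categories=3` with linear weighting.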

DATA ANONYMIZATION AND PRIVACY

To ensure full compliance with local personal data protection legislation (KVKK), all free-text emergency department notes will undergo strict de-identification. Patient names, institutional ID numbers, precise dates, and specific demographic identifiers will be stripped entirely before formatting the data payload for API transmission.
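A very rough sketch of pattern-based scrubbing is shown below. It is an illustration only and not the study's de-identification pipeline: the two patterns (numeric dates and digit runs of six or more characters standing in for ID numbers) are assumptions, and patient names cannot be removed reliably with regular expressions, so they are not handled here.

```python
import re

# Assumed formats: dd.mm.yyyy, dd/mm/yyyy, or dd-mm-yyyy style numeric dates.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")
# Assumption: any run of 6 or more digits is treated as an identifier.
ID_PATTERN = re.compile(r"\b\d{6,}\b")


def scrub(note: str) -> str:
    """Replace date-like and ID-like tokens with fixed placeholders."""
    note = DATE_PATTERN.sub("[DATE]", note)
    return ID_PATTERN.sub("[ID]", note)
```

Values such as a blood pressure written as 120/80 contain only one separator and fewer than six consecutive digits, so neither pattern touches them.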

PATIENT AND PUBLIC INVOLVEMENT STATEMENT

Patient and public involvement was not applicable to this study as it involves the analysis of routinely collected clinical data.

Study Type

Observational

Enrollment (Estimated)

690

Contacts and Locations

This section provides the contact details of those conducting the study and information on where the study is being conducted.

Study Contact

Name: Emir Ünal, Assistant Professor
Phone Number: +905327766010
Email: emirunal@gmail.com

Study Contact Backup

Name: Emre Kudu, Associate Professor
Email: dr.emre.kudu@gmail.com

Study Locations

Turkey (Türkiye)
- Istanbul
  - Istanbul, Istanbul, Turkey (Türkiye), 34870
    - Recruiting
    - Marmara University Pendik Training and Research Hospital
    - Contact: Emir Ünal
      Phone Number: 05327766010
      Email: emirunal@gmail.com
    - Sub-Investigator: Emre Kudu
    - Sub-Investigator: Erhan Altunbas
    - Sub-Investigator: Sinan Karacabey

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Sampling Method

Non-Probability Sample

Study Population

The study population consists of consecutive adult patients presenting with a chief complaint of non-traumatic chest pain to the emergency department of Marmara University Pendik Training and Research Hospital, a tertiary care academic medical center in Istanbul, Turkey. This target population comprises real-world emergency medicine admissions that require acute coronary syndrome risk stratification and evaluation with the HEART score. It excludes individuals presenting with traumatic pain etiologies or acute ST-elevation myocardial infarction (STEMI) requiring immediate, time-critical reperfusion pathways.

Description

INCLUSION CRITERIA:

Age >=18 years
Chief complaint of non-traumatic chest pain at the emergency department
Written informed consent obtained from the patient or legally authorized representative
Availability for 30-day follow-up (reachable by telephone and/or actively registered in the e-Nabiz national health database)

EXCLUSION CRITERIA:

Traumatic chest pain etiology
ST-elevation myocardial infarction (STEMI) at presentation requiring immediate reperfusion protocol
Refusal or subsequent withdrawal of informed consent
Inability to complete the mandatory 30-day follow-up period

WITHDRAWAL CRITERIA:

Patient or representative requests data withdrawal after initial consent
Administrative identification of retrospective data entry after enrollment

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Area Under the ROC Curve (AUC) of GPT-4o and Claude HEART Score for 30-Day MACE Prediction	AUC calculated separately for GPT-4o and Claude using the Hanley-McNeil method. MACE is defined as a composite of all-cause death, acute myocardial infarction (Type 1/2/4b), and unplanned revascularization within 30 days. HEART score range is 0-10; a higher score indicates a higher risk of MACE. Analysis will be performed on complete cases only (0 indeterminate components).	30 days after index emergency department visit
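For orientation, the AUC can be obtained as the Mann-Whitney probability that a patient with MACE has a higher HEART score than a patient without MACE, and the Hanley-McNeil (1982) formula then gives its standard error. The sketch below is a minimal version under those definitions; the function names are assumptions, and ties are counted as one half.

```python
import math
from typing import Sequence


def auc_mann_whitney(scores_mace: Sequence[float], scores_no_mace: Sequence[float]) -> float:
    """AUC as the share of (MACE, no-MACE) pairs ranked correctly; ties count 0.5."""
    if not scores_mace or not scores_no_mace:
        raise ValueError("both outcome groups must be non-empty")
    wins = 0.0
    for pos in scores_mace:
        for neg in scores_no_mace:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_mace) * len(scores_no_mace))


def hanley_mcneil_se(auc: float, n_mace: int, n_no_mace: int) -> float:
    """Standard error of the AUC from the Hanley-McNeil approximation."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_mace - 1) * (q1 - auc ** 2)
        + (n_no_mace - 1) * (q2 - auc ** 2)
    ) / (n_mace * n_no_mace)
    return math.sqrt(variance)
```

The pairwise loop is quadratic in the number of patients, which is acceptable at the planned sample size; a rank-based computation would be the usual choice for much larger datasets.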

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Sensitivity and Specificity of GPT-4o and Claude HEART Score at Prespecified Thresholds	Diagnostic sensitivity and specificity calculated at two threshold types: (a) total score >=4 (binary high-risk cutoff) and (b) trimodal cutoffs (0-3 low risk, 4-6 intermediate risk, 7-10 high risk). Metrics will be reported with 95% Wilson confidence intervals separately for each LLM.	30 days after index emergency department visit
Component-Level and Total-Score Agreement (Cohen's Kappa) Between LLMs and Expert Consensus	Inter-rater agreement will be computed using quadratic weighted Cohen's Kappa for the ordinal total HEART score (range 0-10) and linear weighted Kappa for the individual components (range 0-2). Calculated separately for GPT-4o vs. expert consensus and Claude vs. expert consensus. Values will be interpreted using the Landis & Koch scale (<0.20 poor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 good, >0.80 excellent).	Baseline (at index emergency department visit)
Comparative AUC Difference Between GPT-4o and Claude (DeLong Test)	Statistical comparison of paired ROC curves between GPT-4o and Claude using the DeLong et al. (1988) method. The formal hypothesis is non-inferiority with an expected delta AUC <= 0.05. The correlation coefficient between the paired LLM measurements is estimated as rho >= 0.70.	30 days after index emergency department visit
Proportion of Indeterminate Results for GPT-4o and Claude	The proportion of cases classified into predefined missing data tiers: Complete (0 indeterminate components), Partial indeterminate (exactly 1 missing component preventing definitive score calculation), and Full indeterminate (>=2 missing components). Reported separately for each LLM and statistically compared between the two models.	Baseline (at index emergency department visit)
HEART Parameter Documentation Rate in Routine Turkish Anamnesis Notes	For each of the 5 individual HEART components, the proportion of emergency department free-text anamnesis notes that spontaneously contain sufficient objective clinical information for scoring. Rates will be categorized as: Present and scorable, Partiall	Baseline (at index emergency department visit)
Subgroup AUC by Age Group and Sex (Algorithmic Bias Assessment)	AUC values for 30-day MACE prediction will be calculated separately across demographic strata: age groups (<45, 45-64, >=65 years) and biological sex (male vs. female). This analysis serves as the formal algorithmic bias assessment required by the STARD-AI 2025 guidelines.	30 days after the index emergency department visit
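Sensitivity and specificity at the binary cutoff (total score >=4), together with 95% Wilson score intervals, can be sketched as follows. The function names and the parallel-list input format are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wilson_ci(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sens_spec(scores: Sequence[int], mace: Sequence[bool], threshold: int = 4) -> dict:
    """Confusion counts, sensitivity, and specificity for 'score >= threshold'."""
    tp = sum(1 for s, m in zip(scores, mace) if m and s >= threshold)
    fn = sum(1 for s, m in zip(scores, mace) if m and s < threshold)
    tn = sum(1 for s, m in zip(scores, mace) if not m and s < threshold)
    fp = sum(1 for s, m in zip(scores, mace) if not m and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case with and one without MACE")
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity_ci": wilson_ci(tn, tn + fp),
    }
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 and 1 and remains informative when a proportion is at or near 0% or 100%, which is common for sensitivity in low-event samples.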

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Marmara University Pendik Training and Research Hospital

Publications and Helpful Links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study Record Dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Estimated)

June 1, 2026

Primary Completion (Estimated)

March 1, 2027

Study Completion (Estimated)

June 1, 2027

Study Registration Dates

First Submitted

May 27, 2026

First Submitted That Met QC Criteria

June 3, 2026

First Posted (Actual)

June 4, 2026

Study Record Updates

Last Update Posted (Actual)

June 23, 2026

Last Update Submitted That Met QC Criteria

June 22, 2026

Last Verified

June 1, 2026

More Information

Terms Related to This Study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

09.2026.26-0150

Plan for Individual Participant Data (IPD)

Plan to Share Individual Participant Data (IPD)?

Yes

IPD Plan Description

Anonymized individual participant data (including de-identified baseline demographics, clinical presentation characteristics, index test outputs from GPT-4o and Claude, and the reference standard expert consensus HEART scores) will be made publicly available to support academic transparency and replication. Additionally, the complete deterministic system prompt texts (verified with SHA-256 cryptographic hashes) and the complete statistical analysis code will be included as supplementary material.

IPD Sharing Time Frame

The anonymized dataset, protocol documents, and analytic code will be made available immediately upon formal publication of the study results.

IPD Sharing Access Criteria

Data and code will be accessible via an open-access repository on the Open Science Framework (OSF) for researchers and clinicians interested in replication or meta-analysis.

IPD Sharing Supporting Information Type

STUDY_PROTOCOL
SAP
ANALYTIC_CODE

Drug and Device Information, Study Documents

Studies a U.S. FDA-Regulated Drug Product

Studies a U.S. FDA-Regulated Device Product
