이 페이지는 자동 번역되었으며 번역의 정확성을 보장하지 않습니다. 참조하십시오 영문판 원본 텍스트의 경우.

Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes (LLM-ED-DX-TR)

2026년 6월 3일 업데이트: Emir Ünal, Marmara University Pendik Training and Research Hospital

Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists

This retrospective diagnostic accuracy study evaluates the ability of two large language models (LLMs) - GPT-4o (gpt-4o-2024-11-20; OpenAI) and Claude 4.6 Sonnet (claude-sonnet-4-6; Anthropic) - to generate correct diagnoses from anonymized Turkish-language emergency department (ED) anamnesis notes, and compares their performance with the diagnosis entered by the treating emergency physician. A consensus gold standard is established by three independent board-certified emergency medicine specialists who blindly review each note and vote on the primary diagnosis using ICD-10 three-character codes; the majority vote (at least 2 of 3 specialists agreeing) constitutes the reference standard. Both LLMs are evaluated using a standardized zero-shot direct prompting strategy (temperature=0, stateless API sessions). The primary outcome is diagnostic accuracy (proportion of ICD-10 chapter-level matches) and Cohen's kappa for each LLM against the gold standard. Secondary outcomes include top-3 accuracy, treating physician accuracy, inter-model agreement, and subgroup analyses by ESI triage level and ICD-10 chapter. Inter-rater reliability among the three specialists is quantified using Fleiss' kappa. Analyses are performed in Jamovi. This study represents the first evaluation of LLM diagnostic accuracy using Turkish-language clinical notes and the first to benchmark LLM performance against an independent three-specialist majority-vote gold standard rather than against the treating physician's own diagnosis.

연구 개요

상태

아직 모집하지 않음

정황

상세 설명

STUDY DESIGN: Retrospective diagnostic accuracy study, STARD-AI 2025 reporting, single center, cohort design.

AI INDEX TESTS: (1) GPT-4o (model version gpt-4o-2024-11-20; OpenAI API). (2) Claude 4.6 Sonnet (model version claude-sonnet-4-6; Anthropic API). Both accessed via Python (Google Colab). Temperature=0 for reproducibility. Zero-shot, stateless sessions - no cross-case context. No task-specific fine-tuning or additional training applied; models used as-is via API.

MODEL INTERPRETABILITY: Model interpretability analyses (such as SHAP, Grad-CAM, or layer-attribute visualizations) are not applicable to this study. Because GPT-4o and Claude 4.6 Sonnet are accessed as black-box models through proprietary, closed-source commercial APIs, internal model weights, gradients, and attention architectures are structurally inaccessible for post-hoc interpretability computations.

REFERENCE STANDARD: Three board-certified emergency medicine specialists independently evaluate each anonymized note, blinded to the original physician diagnosis and to each other. Primary diagnosis assigned by at least 2/3 specialists (majority vote) constitutes the gold standard. A 5-case calibration session precedes the main evaluation.

DATA PRIVACY: All anamnesis notes are fully de-identified (name, ID number, date of birth, physician name removed) prior to processing. De-identified notes are stored in a password-protected encrypted database. Only de-identified text is transmitted to LLM APIs - no personal health data. Compliant with Turkish Personal Data Protection Law (KVKK No. 6698).

PATIENT AND PUBLIC INVOLVEMENT: Not applicable. This retrospective study uses fully anonymized existing records; no patient or public involvement in design or conduct.

DATA SHARING: Anonymized dataset will be shared via Zenodo upon article acceptance. Statistical analysis code (Jamovi project files and Python prompt scripts) will be available on GitHub.

연구 유형

관찰

등록 (추정된)

600

연락처 및 위치

이 섹션에서는 연구를 수행하는 사람들의 연락처 정보와 이 연구가 수행되는 장소에 대한 정보를 제공합니다.

연구 연락처

이름: Emir Ünal, Assistant Professor
전화번호: +905327766010
이메일: emirunal@gmail.com

연구 연락처 백업

이름: Emir Unal, Assistant Professor
이메일: emirunal@gmail.com

연구 장소

터키 (Türkiye)
- Istanbul
  - Istanbul, Istanbul, 터키 (Türkiye), 34899
    - Marmara University Pendik Training and Research Hospital
    - 연락하다:
      
      Emir ünal
      
      이메일: emirunal@gmail.com

참여기준

연구원은 적격성 기준이라는 특정 설명에 맞는 사람을 찾습니다. 이러한 기준의 몇 가지 예는 개인의 일반적인 건강 상태 또는 이전 치료입니다.

자격 기준

공부할 수 있는 나이

성인
고령자

건강한 자원 봉사자를 받아들입니다

아니

샘플링 방법

비확률 샘플

연구 인구

The study population comprises consecutive adult patients (aged 18 years and older) who presented to the emergency department of a tertiary care training and research hospital and had their encounters fully documented in the hospital information system (HBYS). Eligible individuals must have a complete electronic anamnesis note containing the chief complaint, history of present illness, and clinical presentation, alongside a definitive primary ICD-10 diagnosis finalized by the treating emergency physician at file closure. The population excludes pediatric cases, patients triaged to high-acuity resuscitation areas (ESI level 1), and clinical notes with fewer than 50 words or insufficient clinical content.

설명

INCLUSION CRITERIA:

Adult patients (aged 18 years and older) presenting to the emergency department.
Complete electronic health record available in the hospital information system (HBYS) containing a detailed anamnesis note with chief complaint, symptom duration, associated symptoms, and relevant medical history.
A definitive primary diagnosis recorded by the treating emergency physician using ICD-10 codes at the time of patient file closure.

EXCLUSION CRITERIA:

Emergency department anamnesis notes containing fewer than 50 words or completely lacking substantive clinical content[cite: 1].
Pediatric cases (age under 18 years)[cite: 1].
Patients critically ill and triaged to high-acuity resuscitation areas (Emergency Severity Index [ESI] level 1)[cite: 1].
Clinical notes containing residual identifying information that cannot be fully de-identified, preventing compliance with data privacy regulations[cite: 1].
Non-independent clinical notes consisting solely of a brief cross-reference to a prior hospital visit without a new history entry[cite: 1].

공부 계획

이 섹션에서는 연구 설계 방법과 연구가 측정하는 내용을 포함하여 연구 계획에 대한 세부 정보를 제공합니다.

연구는 어떻게 설계됩니까?

디자인 세부사항

그룹/코호트 수

코호트 및 개입

그룹/코호트
Emergency Department Patient Cohort Consecutive adult patients presenting to the emergency department with a fully documented electronic anamnesis note and a definitive primary ICD-10 diagnosis

연구는 무엇을 측정합니까?

주요 결과 측정

결과 측정	측정값 설명	기간
Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis 기간: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).	Proportion of cases in which GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.	At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis 기간: At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).	Proportion of cases in which Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.	At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).

2차 결과 측정

결과 측정	측정값 설명	기간
Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard 기간: At the time of algorithmic evaluation (June-July 2026)	Kappa coefficient measuring agreement between GPT-4o rank-1 ICD-10 chapter and the 3-specialist gold standard . Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect . Range: -1.00 to 1.00 .	At the time of algorithmic evaluation (June-July 2026)
Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard 기간: At the time of algorithmic evaluation (June-July 2026)	appa coefficient measuring agreement between Claude 4.6 Sonnet rank-1 ICD-10 chapter and the 3-specialist gold standard . Interpreted per Landis & Koch (1977): <=0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; >0.80 almost perfect . Range: -1.00 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of GPT-4o 기간: At the time of algorithmic evaluation (June-July 2026)	Proportion of cases in which the 3-specialist gold standard diagnosis appears within GPT-4o's ranked list of three differential diagnoses . Range: 0 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet 기간: At the time of algorithmic evaluation (June-July 2026)	Proportion of cases in which the 3-specialist gold standard diagnosis appears within Claude 4.6 Sonnet's ranked list of three differential diagnoses[cite: 1]. Range: 0 to 1.00	At the time of algorithmic evaluation (June-July 2026)
Treating Physician Diagnostic Accuracy Against Gold Standard 기간: At the time of the original clinical encounter (retrospective data spanning August-December 2025)	Proportion of cases in which the ICD-10 code entered by the treating emergency physician at file closure matches the 3-specialist majority-vote gold standard at the chapter level[cite: 1]. Range: 0 to 1.00	At the time of the original clinical encounter (retrospective data spanning August-December 2025)

공동 작업자 및 조사자

여기에서 이 연구와 관련된 사람과 조직을 찾을 수 있습니다.

스폰서

Marmara University Pendik Training and Research Hospital

수사관

수석 연구원: Emir Ünal, Marmara University

간행물 및 유용한 링크

연구에 대한 정보 입력을 담당하는 사람이 자발적으로 이러한 간행물을 제공합니다. 이것은 연구와 관련된 모든 것에 관한 것일 수 있습니다.

일반 간행물

연구 기록 날짜

이 날짜는 ClinicalTrials.gov에 대한 연구 기록 및 요약 결과 제출의 진행 상황을 추적합니다. 연구 기록 및 보고된 결과는 공개 웹사이트에 게시되기 전에 특정 품질 관리 기준을 충족하는지 확인하기 위해 국립 의학 도서관(NLM)에서 검토합니다.

연구 주요 날짜

연구 시작 (추정된)

2026년 6월 1일

기본 완료 (추정된)

2026년 7월 1일

연구 완료 (추정된)

2026년 10월 1일

연구 등록 날짜

최초 제출

2026년 6월 3일

QC 기준을 충족하는 최초 제출

2026년 6월 3일

처음 게시됨 (실제)

2026년 6월 8일

연구 기록 업데이트

마지막 업데이트 게시됨 (실제)

2026년 6월 8일

QC 기준을 충족하는 마지막 업데이트 제출

2026년 6월 3일

마지막으로 확인됨

2026년 6월 1일

추가 정보

이 연구와 관련된 용어

키워드

Large Language Model; GPT-4o; Claude 4.6 Sonnet; ICD-10; Clinical Coding; Turkish; Emergency Department; Diagnostic Accuracy; STARD; STARD-AI

추가 관련 MeSH 약관

기타 연구 ID 번호

09.2026.26-0514

약물 및 장치 정보, 연구 문서

미국 FDA 규제 의약품 연구

아니

미국 FDA 규제 기기 제품 연구

아니

이 정보는 변경 없이 clinicaltrials.gov 웹사이트에서 직접 가져온 것입니다. 귀하의 연구 세부 정보를 변경, 제거 또는 업데이트하도록 요청하는 경우 register@clinicaltrials.gov. 문의하십시오. 변경 사항이 clinicaltrials.gov에 구현되는 즉시 저희 웹사이트에도 자동으로 업데이트됩니다. .

Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes (LLM-ED-DX-TR)

Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists

연구 개요

상태

정황

상세 설명

연구 유형

등록 (추정된)

연락처 및 위치

연구 연락처

연구 연락처 백업

연구 장소

참여기준

자격 기준

공부할 수 있는 나이

건강한 자원 봉사자를 받아들입니다

샘플링 방법

연구 인구

설명

공부 계획

연구는 어떻게 설계됩니까?

디자인 세부사항

그룹/코호트 수

코호트 및 개입

그룹/코호트

연구는 무엇을 측정합니까?

주요 결과 측정

결과 측정

측정값 설명

기간

2차 결과 측정

결과 측정

측정값 설명

기간

공동 작업자 및 조사자

스폰서

수사관

간행물 및 유용한 링크

일반 간행물

연구 기록 날짜

연구 주요 날짜

연구 시작 (추정된)

기본 완료 (추정된)

연구 완료 (추정된)

연구 등록 날짜

최초 제출

QC 기준을 충족하는 최초 제출

처음 게시됨 (실제)

연구 기록 업데이트

마지막 업데이트 게시됨 (실제)

QC 기준을 충족하는 마지막 업데이트 제출

마지막으로 확인됨

추가 정보

이 연구와 관련된 용어

키워드

추가 관련 MeSH 약관

기타 연구 ID 번호

약물 및 장치 정보, 연구 문서

미국 FDA 규제 의약품 연구

미국 FDA 규제 기기 제품 연구

유사한 임상시험 검색

스폰서 및 공동 작업자

건강 상태

약물 개입

CROs by country

CROs in South Korea

정황

희귀 질병

약물 개입

식이 보충제

스폰서 / 협력자

위치