Diagnostic Reasoning With Customized GPT-4 Model

March 27, 2025 updated by: Jonathan Chen, Stanford University

Evaluating the Performance of LLMs and Clinicians in Complex Diagnostic Cases: A Randomized Controlled Trial

This study will assess the impact of immediate access to a customized version of GPT-4, a large language model, on performance in case-based diagnostic reasoning tasks. Specifically, it will compare this approach to a two-step process where participants first use traditional diagnostic decision support tools to support their diagnostic reasoning before gaining access to the customized GPT-4 model.

Study Overview

Detailed Description

Artificial intelligence (AI) technologies, particularly advanced large language models like OpenAI's ChatGPT, have the potential to enhance medical decision-making. While ChatGPT-4 was not specifically designed for medical applications, it has demonstrated promise in various healthcare contexts, including medical note-writing, addressing patient inquiries, and facilitating medical consultations. However, its impact on clinicians' diagnostic reasoning remains largely unknown.

Clinical reasoning is a complex process that involves pattern recognition, knowledge application, and probabilistic reasoning. Integrating AI tools like ChatGPT-4 into physician workflows could help reduce clinician workload and decrease the likelihood of missed diagnoses. However, ChatGPT-4 was neither developed nor validated for diagnostic reasoning, and it may produce misleading information, including plausible but incorrect conclusions that could misguide clinicians. If not used appropriately, it may fail to improve, and could even hinder, clinical decision-making. Therefore, it is essential to study how clinicians use large language models to support clinical reasoning before integrating them into routine patient care.

This study will examine how immediate access to a customized version of ChatGPT-4 impacts performance on case-based diagnostic reasoning tasks, compared to a stepwise approach. In the stepwise approach, participants will first use traditional diagnostic decision support tools to support their case reasoning before interacting with a customized ChatGPT-4 model, at which point they will have the opportunity to revise their initial answers.

Participants will be randomized into different study arms and will respond to diagnostic cases by providing three differential diagnoses, along with supporting and opposing findings for each. They will also identify their top diagnosis and propose next diagnostic steps. Independent reviewers, blinded to treatment assignment, will evaluate their responses.

Study Type

Interventional

Enrollment (Actual)

70

Phase

  • Not Applicable

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

    • California
      • Palo Alto, California, United States, 94305
        • Stanford University

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

  • Child
  • Adult
  • Older Adult

Accepts Healthy Volunteers

Yes

Description

Inclusion Criteria:

  • Participants must be licensed physicians and have completed at least post-graduate year 1 (PGY1) of medical training.
  • Training in internal medicine, family medicine, or emergency medicine.

Exclusion Criteria:

  • Not currently practicing clinically.
  • Participated in one of our previous studies that used the same six diagnostic cases.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

  • Primary Purpose: Diagnostic
  • Allocation: Randomized
  • Interventional Model: Parallel Assignment
  • Masking: Single

Arms and Interventions

Participant Group / Arm
Intervention / Treatment
Active Comparator: Immediate access to customized version of GPT-4
Group will be encouraged to immediately use a customized version of GPT-4.
Group is given immediate access to a customized version of GPT-4 to support their diagnostic reasoning for each case.
Active Comparator: Conventional resources first, then granted access to customized version of GPT-4.
Group will be encouraged to first use any resources they wish besides large language models (UpToDate, PubMed, Google, etc.) and then will be granted access to a customized version of GPT-4.
Group is first encouraged to reason through diagnostic cases with the support of conventional resources. After they submit a case's answers, they are given access to a customized version of GPT-4 and have the opportunity to change their initial answers.

What is the study measuring?

Primary Outcome Measures

Outcome Measure
Measure Description
Time Frame
Diagnostic reasoning
Time Frame: Through study completion, an average of 6 months
The primary outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, participants will be asked to provide their top three differential diagnoses, along with supporting and opposing findings for each. They will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded based on correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Participants will then select their top diagnosis, earning 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, they will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The primary outcome will be analyzed at the case level, comparing performance between the randomized study groups.
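The scoring rubric described for the primary outcome can be illustrated with a minimal sketch. It is not the investigators' grading code. The record does not state the maximum score, so the sketch assumes that supporting and opposing findings are each graded 0 to 2 for every one of the three differential diagnoses, and that each of up to three next steps is graded 0 to 2. The names `DifferentialItem`, `CaseResponse`, `max_points`, and `case_score_percent` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_DIFFERENTIALS = 3  # top three differential diagnoses per case
MAX_NEXT_STEPS = 3     # up to three next steps per case


@dataclass
class DifferentialItem:
    plausible: bool  # 1 point if the diagnosis is plausible
    supporting: int  # 0, 1 (partially correct), or 2 (completely correct)
    opposing: int    # same 0-2 scale (assumed to be graded per diagnosis)


@dataclass
class CaseResponse:
    differentials: List[DifferentialItem]
    top_diagnosis: int     # 0, 1 (reasonable choice), or 2 (most accurate)
    next_steps: List[int]  # each 0-2 (assumed to be graded per step)


def _check_grade(value: int) -> int:
    """Reject grades outside the 0-2 scale used by the rubric."""
    if value not in (0, 1, 2):
        raise ValueError(f"grade must be 0, 1, or 2, got {value!r}")
    return value


def max_points() -> int:
    """Maximum achievable points under the assumptions stated above."""
    per_differential = 1 + 2 + 2  # plausibility + supporting + opposing
    return MAX_DIFFERENTIALS * per_differential + 2 + MAX_NEXT_STEPS * 2


def case_score_percent(resp: CaseResponse) -> float:
    """Return the case score as a percentage in the range 0 to 100."""
    if len(resp.differentials) > MAX_DIFFERENTIALS:
        raise ValueError("at most three differential diagnoses are scored")
    if len(resp.next_steps) > MAX_NEXT_STEPS:
        raise ValueError("at most three next steps are scored")
    earned = 0
    for item in resp.differentials:
        earned += 1 if item.plausible else 0
        earned += _check_grade(item.supporting) + _check_grade(item.opposing)
    earned += _check_grade(resp.top_diagnosis)
    earned += sum(_check_grade(step) for step in resp.next_steps)
    return 100.0 * earned / max_points()


if __name__ == "__main__":
    example = CaseResponse(
        differentials=[
            DifferentialItem(plausible=True, supporting=2, opposing=1),
            DifferentialItem(plausible=True, supporting=1, opposing=0),
            DifferentialItem(plausible=False, supporting=0, opposing=0),
        ],
        top_diagnosis=2,
        next_steps=[2, 1],
    )
    print(f"{case_score_percent(example):.1f}% of {max_points()} points")
```

Under a different reading of the rubric (for example, a single 0 to 2 grade for all next steps combined), only `max_points` and the summation in `case_score_percent` would need to change.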

Secondary Outcome Measures

Outcome Measure
Measure Description
Time Frame
Time Spent Per Case
Time Frame: Through study completion, an average of 6 months
The investigators will compare the average time (in minutes) participants spend on each case across the two study arms.
Prompt frequency
Time Frame: Through study completion, an average of 6 months
The investigators will compare the frequency of participant prompts to the customized GPT-4 model between the two study groups.
Sentiment
Time Frame: Through study completion, an average of 6 months
The investigators will compare the tone and sentiment of participant prompts to the customized GPT-4 model across the two study groups. The investigators will create a qualitative coding system to categorize the nature of the participants' prompts.
Participant Perceptions of AI in Clinical Reasoning
Time Frame: Through study completion, an average of 6 months
This outcome will be assessed in both study arms and will encompass changes in attitudes, confidence, and willingness to use AI diagnostic tools before and after exposure to the customized tool. The investigators will assess the number of participants who were open to using AI to help with complex clinical reasoning (pre- and post-quiz), whether they enjoyed working with the AI diagnostic tool, whether they felt the tool provided a valuable collaborative experience for clinical reasoning, whether seeing the AI diagnostic tool's recommendations increased their confidence in their differential diagnoses, and whether they would use an AI diagnostic tool like the one in this study in their daily work. These items will be evaluated on a Likert scale ranging from strongly disagree to strongly agree.
Customized GPT-4's diagnostic reasoning
Time Frame: Through study completion, an average of 6 months
The customized GPT-4's 'independent' diagnoses will be assessed for accuracy. The outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, the meta-prompt directs the customized GPT-4 to provide its top three differential diagnoses, along with supporting and opposing findings for each, a final diagnosis, and next steps. The customized GPT-4 will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded based on correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Its top diagnosis will earn 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, it will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The outcome will be analyzed at the case level, comparing performance with the randomized study groups' scores.

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Investigators

  • Principal Investigator: Jonathan H Chen, MD, PhD, Stanford University

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

December 16, 2024

Primary Completion (Actual)

January 24, 2025

Study Completion (Actual)

January 24, 2025

Study Registration Dates

First Submitted

February 11, 2025

First Submitted That Met QC Criteria

March 27, 2025

First Posted (Actual)

April 4, 2025

Study Record Updates

Last Update Posted (Actual)

April 4, 2025

Last Update Submitted That Met QC Criteria

March 27, 2025

Last Verified

March 1, 2025

More Information

Terms related to this study

Additional Relevant MeSH Terms

Other Study ID Numbers

  • 71319c

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

NO

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No

Product manufactured in and exported from the U.S.

No

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.
