Mitigating Automation Bias in Physician-LLM Diagnostic Reasoning Using Behavioral Nudges

June 24, 2026 updated by: Ihsan Ayyub Qazi, PhD, Lahore University of Management Sciences

The goal of this randomized controlled trial is to evaluate whether behavioral nudges can reduce automation bias, the uncritical acceptance of automated output, in physicians using large language models (LLMs) such as ChatGPT-5.1 for clinical decision-making.

The main question it aims to answer is: Does a dual-mechanism behavioral nudge intervention (baseline accuracy anchoring plus case-specific color-coded confidence signals) reduce physicians' uncritical acceptance of incorrect LLM recommendations?

Researchers will compare physicians who receive LLM recommendations along with a behavioral nudge to those who receive LLM recommendations without the nudge to assess if the nudge reduces automation bias.

Participants will:

Evaluate six clinical vignettes accompanied by LLM-generated recommendations (half containing deliberate, clinically significant errors).
Control group: View LLM recommendations in a standard format without the nudge.
Treatment group: View ChatGPT's diagnostic accuracy on standard medical datasets as an initial anchor, then receive color-coded confidence signals alongside each recommendation (e.g., red for low confidence).
Have their responses evaluated by blinded reviewers using an expert-developed assessment rubric to detect uncritical acceptance of erroneous information.

Study Overview

Status

Completed

Conditions

Diagnosis

Intervention / Treatment

Other: Behavioral Nudge Intervention

Detailed Description

Automation bias represents a critical challenge in modern clinical practice, particularly as artificial intelligence (AI) tools become increasingly embedded in healthcare workflows. This cognitive phenomenon describes the tendency of clinicians to favor suggestions from automated decision-making systems, even when those suggestions are incorrect. As large language models (LLMs) such as ChatGPT-5.1 gain traction in medical settings, their potential to reduce errors and improve efficiency must be weighed against a significant concern: these models lack rigorous medical validation and may amplify existing cognitive biases through incorrect or misleading recommendations.

The emergence of automation bias in medical contexts reflects a complex interplay of environmental and psychological factors. Time constraints in high-volume clinical settings create pressure to accept AI-generated recommendations without adequate scrutiny. Financial incentives that prioritize efficiency over thoroughness may further discourage critical evaluation necessary for sound clinical judgment. Cognitive fatigue during extended shifts diminishes physicians' capacity for sustained analytical thinking. These pressures interact with psychological mechanisms including diffusion of responsibility, overconfidence in technological solutions, and cognitive offloading, collectively creating conditions where uncritical acceptance of AI-generated recommendations becomes more likely.

This randomized controlled trial evaluates the effectiveness of a behavioral nudge intervention designed to mitigate automation bias among medical doctors using LLM-generated diagnostic recommendations. The primary objective is to determine whether this intervention improves diagnostic reasoning performance scores when evaluating clinical vignettes that include deliberately flawed LLM recommendations. Secondary objectives include assessing whether physician experience level, gender, and prior LLM experience moderate the intervention's effectiveness, and determining whether effectiveness differs across vignettes shown with different confidence signals.

This study employs a single-blind, randomized controlled trial with two parallel arms. Participants will be randomly assigned 1:1 to either the intervention or control arm. To eliminate variability from differences in prompting skills, participants will not interact directly with a live LLM interface. Instead, all participants will use a custom-built web platform displaying clinical vignettes with pre-generated LLM recommendations, ensuring identical LLM-generated content for each vignette.

All participants will evaluate six clinical vignettes during a single, proctored session lasting approximately 75 minutes. Three vignettes will contain deliberately introduced clinical reasoning flaws in the LLM recommendations, while three will contain correct recommendations. Vignettes will be presented in randomized order to prevent pattern detection.
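The allocation and ordering described above can be illustrated with a minimal sketch. The record states only that assignment is 1:1 and that vignette order is randomized; the shuffle-and-split method, the seeds, and the names `assign_arms` and `vignette_order` are assumptions made for illustration, not the study's actual procedure.

```python
import random


def assign_arms(participant_ids, seed=0):
    """Shuffle participants and split them 1:1 into two arms.

    Hypothetical helper: the record does not say how the 1:1
    allocation was generated (simple, blocked, stratified, ...).
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        pid: ("intervention" if i < half else "control")
        for i, pid in enumerate(shuffled)
    }


def vignette_order(vignette_ids, rng):
    """Return a per-participant random presentation order."""
    order = list(vignette_ids)
    rng.shuffle(order)
    return order


arms = assign_arms(range(10), seed=1)
order = vignette_order(["V1", "V2", "V3", "V4", "V5", "V6"], random.Random(42))
```

Shuffling a copy and splitting at the midpoint gives equal arm sizes for an even number of participants; with an odd number, the control arm receives the extra participant in this sketch.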

Control arm participants will evaluate clinical vignettes with LLM diagnostic recommendations generated by ChatGPT presented in standard, neutral text format without additional contextual information. Intervention arm participants will evaluate the same vignettes alongside a behavioral nudge. This intervention consists of two synchronized cognitive cues: (1) an anchoring cue displaying ChatGPT's baseline diagnostic accuracy on standard medical datasets at the top of the interface panel, explicitly anchoring expectations to the model's fallibility, and (2) a selective attention cue displaying the LLM recommendation alongside a color-coded confidence signal generated through an ensemble assessment: three independent state-of-the-art LLMs (Claude Sonnet 4.5, Gemini 2.5 Pro Thinking, and GPT-5.1) each provide confidence ratings for the recommendation, and the mean confidence determines the signal color to mitigate single-model miscalibration.

The color-coded confidence signals are categorized into three distinct levels based on the ensemble's mean confidence relative to baseline diagnostic accuracy. Red signals are triggered when the mean confidence falls below ChatGPT's established baseline accuracy, explicitly flagging high-uncertainty cases that demand heightened critical scrutiny. Orange signals indicate that while the mean confidence exceeds the baseline average, it remains below 100%, signaling the need for continued clinical vigilance and the avoidance of complacency. Finally, green signals are reserved for instances of 100% ensemble consensus; however, even at this level of confidence, standard AI safety warnings remain present to guard against over-reliance on the system's output.
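The three-level rule above can be sketched as a small function. This is an illustration under stated assumptions, not the study's implementation: confidence ratings are assumed to be on a 0-100 scale, the baseline value of 80.0 used in the example is hypothetical (the record does not report it), and a mean exactly equal to the baseline is treated as orange, following the "meets or exceeds" wording in the arm description.

```python
from statistics import mean


def confidence_signal(ratings, baseline_accuracy):
    """Map ensemble confidence ratings (0-100) to a signal color.

    ratings: one confidence rating per model in the ensemble.
    baseline_accuracy: the model's baseline diagnostic accuracy (0-100).
    """
    if not ratings:
        raise ValueError("at least one confidence rating is required")
    avg = mean(ratings)
    if avg < baseline_accuracy:
        return "red"      # below baseline: high-uncertainty case
    if avg < 100:
        return "orange"   # at or above baseline, but not unanimous 100%
    return "green"        # every rating is 100%


# Hypothetical baseline of 80.0, used only for illustration.
print(confidence_signal([70, 75, 80], 80.0))
print(confidence_signal([90, 95, 100], 80.0))
print(confidence_signal([100, 100, 100], 80.0))
```

Because each rating is capped at 100, the mean reaches 100 only when all three models report 100%, which is why the green branch corresponds to full ensemble consensus.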

Participants will be presented with six clinical vignettes specifically designed to measure automation bias, sourced and modified from real cases representing a range of diagnostic difficulty and common medical specialties. Each vignette follows a standardized format including chief complaint, history of present illness, relevant past medical/social/family history, physical examination findings, and initial laboratory results.

The primary outcome is the Diagnostic Reasoning Performance Score, a composite percentage score based on a structured rubric evaluating: quality of differential diagnoses, supporting findings, opposing findings, final diagnosis accuracy, and appropriateness of next steps. Secondary outcomes include top-choice diagnosis accuracy (incorrect, partially correct, or correct). All responses will be evaluated by blinded reviewers using the assessment rubric.

Study Type

Interventional

Enrollment (Actual)

Phase

Not Applicable

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

Pakistan
- Punjab Province
  - Lahore, Punjab Province, Pakistan, 54792
    - Lahore University of Management Sciences

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Child
Adult
Older Adult

Accepts Healthy Volunteers

Yes

Description

Inclusion Criteria:

Full or Provisionally Registered Medical Practitioners with the Pakistan Medical and Dental Council (PMDC).
Completed the Bachelor of Medicine, Bachelor of Surgery (MBBS) examination. The equivalent of the MBBS degree in the US and Canada is the Doctor of Medicine (MD).
Completed a structured training program on the use of ChatGPT (or a comparable large language model), totaling at least 10 hours of instruction. The program must include hands-on practice covering key aspects of LLM use, specifically prompt engineering and content evaluation.

Exclusion Criteria:

Any other Registered Medical Practitioners (Full or Provisional) with PMDC (e.g., professionals with Bachelor of Dental Surgery or BDS).

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Primary Purpose: Diagnostic
Allocation: Randomized
Interventional Model: Parallel Assignment
Masking: Single

Number of Arms

Arms and Interventions

Participant Group / Arm

Active Comparator: ChatGPT Recommendations alongside a Behavioral Nudge

Participants will evaluate six clinical vignettes. During the trial, they will have access to clinical recommendations from a specific, commercially available LLM (ChatGPT) in addition to conventional diagnostic resources. LLM recommendations will contain deliberately flawed diagnostic information for three vignettes and accurate recommendations for the other three. The cases will be presented in random order. Participants in this arm will receive a behavioral nudge embedded in the LLM recommendations interface that presents two synchronized cognitive cues when the LLM panel is expanded: (1) an anchoring cue displaying ChatGPT's baseline diagnostic accuracy on standard medical datasets at the top of the panel to set realistic expectations before viewing the specific recommendation, and (2) a selective attention cue located immediately below, which shows the LLM recommendation alongside a case-specific color-coded confidence signal.

Intervention / Treatment

Other: Behavioral Nudge Intervention

Participants in the treatment group will receive a behavioral nudge intervention embedded in the LLM recommendations interface that presents two synchronized cognitive cues when the LLM panel is expanded: (1) an anchoring cue displaying ChatGPT's baseline diagnostic accuracy on standard medical datasets at the top of the panel to set realistic expectations before viewing the specific recommendation, and (2) a selective attention cue located immediately below, which shows the LLM recommendation alongside a case-specific, color-coded confidence signal. This signal is categorized as red when the mean ensemble confidence falls below the established baseline accuracy, flagging high-uncertainty cases that demand critical evaluation; orange when confidence meets or exceeds the baseline but remains below 100%, intended to prevent complacency and maintain active clinical scrutiny; and green for a 100% ensemble consensus, though standard cautionary warnings still apply to guard against over-reliance on the system's output.

Participant Group / Arm

No Intervention: ChatGPT Recommendations without a Behavioral Nudge

Participants will evaluate six clinical vignettes. During the trial, they will have access to clinical recommendations from a specific, commercially available LLM (ChatGPT) in addition to conventional diagnostic resources. LLM recommendations for three vignettes will contain deliberately flawed diagnostic information. The cases will be presented in random order. Participants in this arm will not receive any behavioral nudge.

What is the study measuring?

Primary Outcome Measures

Outcome Measure

Diagnostic reasoning accuracy score

Measure Description

The primary outcome will be the percent correct for each case, ranging from 0 to 100%, where higher scores indicate better diagnostic performance. For each case, participants will be asked for their three leading diagnoses, findings that support each diagnosis, and findings that oppose each diagnosis. For each plausible diagnosis, participants will receive 1 point. Findings supporting the diagnosis and findings opposing the diagnosis will also be graded based on correctness, with 1 point for each correct response. Participants will then be asked to name the diagnosis they believe is most likely, earning 9 points for a reasonable response and 18 points for the most accurate response. Finally, participants will be asked to name up to 3 next steps to further evaluate the patient, with 0.5 points awarded for a partially correct response and 1 point for a completely correct response. The primary outcome will be compared at the case level between the randomized groups.

Time Frame

Assessed at a single time point for each case, during the scheduled diagnostic reasoning evaluation session, which takes place between 0-5 days after participant enrollment.
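The rubric arithmetic can be illustrated with a short sketch. The point values come from the measure description; the maximum number of gradable supporting and opposing findings is not stated in the record, so the parameter `max_findings_per_side` (defaulting to 3, one per diagnosis) is an assumption, as are the names `CaseResponse` and `case_score_percent`.

```python
from dataclasses import dataclass, field

# Point values taken from the measure description.
TOP_DX_POINTS = {"incorrect": 0, "reasonable": 9, "most_accurate": 18}
NEXT_STEP_POINTS = {"incorrect": 0.0, "partial": 0.5, "correct": 1.0}


@dataclass
class CaseResponse:
    plausible_diagnoses: int          # 0-3 plausible diagnoses, 1 point each
    supporting_correct: int           # correct supporting findings, 1 point each
    opposing_correct: int             # correct opposing findings, 1 point each
    top_diagnosis: str                # key of TOP_DX_POINTS
    next_steps: list = field(default_factory=list)  # up to 3 keys of NEXT_STEP_POINTS


def case_score_percent(r, max_findings_per_side=3):
    """Return the case score as a percentage of the maximum points.

    max_findings_per_side is an assumed cap; the record does not
    state how many findings are graded per diagnosis.
    """
    if len(r.next_steps) > 3:
        raise ValueError("at most 3 next steps are graded")
    max_points = 3 + 2 * max_findings_per_side + 18 + 3
    earned = (
        r.plausible_diagnoses
        + r.supporting_correct
        + r.opposing_correct
        + TOP_DX_POINTS[r.top_diagnosis]
        + sum(NEXT_STEP_POINTS[s] for s in r.next_steps)
    )
    return 100 * earned / max_points


example = CaseResponse(2, 2, 1, "reasonable", ["correct", "partial"])
print(round(case_score_percent(example), 1))
```

Under the assumed cap, the maximum is 30 points, so the example response earns 15.5 of 30 points. Note that the top diagnosis alone accounts for 18 of those 30 points.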

Secondary Outcome Measures

Outcome Measure

Top choice diagnosis accuracy score

Measure Description

The secondary outcome will measure participants' performance in identifying the most likely diagnosis for each clinical vignette. After evaluating each case, participants will select their single most likely diagnosis, which will be scored on a pre-specified Three-Tier Diagnostic Accuracy Scale: 18 points for the most accurate diagnosis, 9 points for a clinically reasonable alternative, and 0 points for an incorrect diagnosis. For each participant, a Top Choice Diagnosis Accuracy Score is calculated as (total points earned ÷ maximum possible points) × 100, yielding a 0-100% range in which higher scores indicate greater diagnostic accuracy. This percentage score will be compared at the case level between randomized groups to quantify the impact of automation bias on diagnostic decision-making.

Time Frame

Assessed at a single time point for each case, during the scheduled diagnostic reasoning evaluation session, which takes place between 0-5 days after participant enrollment.
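The formula (total points earned ÷ maximum possible points) × 100 can be sketched as follows. The function name `top_choice_accuracy` and the example point values are hypothetical; the sketch assumes the maximum possible points equal 18 times the number of cases scored.

```python
ALLOWED_POINTS = {0, 9, 18}  # Three-Tier Diagnostic Accuracy Scale
MAX_POINTS_PER_CASE = 18


def top_choice_accuracy(points_per_case):
    """Return (total points / maximum possible points) x 100."""
    if not points_per_case:
        raise ValueError("at least one scored case is required")
    for p in points_per_case:
        if p not in ALLOWED_POINTS:
            raise ValueError(f"invalid tier score: {p}")
    total = sum(points_per_case)
    maximum = MAX_POINTS_PER_CASE * len(points_per_case)
    return 100 * total / maximum


# Hypothetical participant: six vignettes scored on the three tiers.
print(round(top_choice_accuracy([18, 9, 0, 18, 9, 18]), 1))
```

In this example, the participant earns 72 of 108 possible points across six vignettes.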

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Lahore University of Management Sciences

Investigators

Principal Investigator: Muhammad Asadullah Khawaja, MBBS, King Edward Medical University
Principal Investigator: Ihsan Ayyub Qazi, PhD, Lahore University of Management Sciences (LUMS)
Principal Investigator: Ali Zafar Sheikh, MBBS, Lahore General Hospital
Principal Investigator: Muhammad Junaid Akhtar, MBBS, Children's Hospital, Lahore
Principal Investigator: Muhammad Hamad Alizai, PhD, Lahore University of Management Sciences (LUMS)

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

January 17, 2026

Primary Completion (Actual)

May 25, 2026

Study Completion (Actual)

May 26, 2026

Study Registration Dates

First Submitted

December 26, 2025

First Submitted That Met QC Criteria

December 26, 2025

First Posted (Actual)

January 9, 2026

Study Record Updates

Last Update Posted (Actual)

June 25, 2026

Last Update Submitted That Met QC Criteria

June 24, 2026

Last Verified

June 1, 2026

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

LUMS-IRB-0412/12192025/IAQ-FWA

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.
