Reasoning Enrichment With Feedback From IA in NEphrology Trial (REFINe)

January 12, 2026 updated by: Aghiles.HAMROUN, University Hospital, Lille

Reasoning Enhancement With Feedback From a Generative AI in Nephrology (REFINe): A Randomized Evaluation of Generative AI Support in Nephrology Diagnosis

The goal of this clinical trial is to learn how artificial intelligence (AI) may help doctors make diagnoses in kidney medicine. The researchers want to know whether an AI tool called a large language model (LLM) can help doctors choose the correct diagnosis more often and feel more confident in their answers.

Before starting the study, the research team tested several AI models and chose one of the best performers, a GPT-5-class model set to use high reasoning effort.

The main questions this study aims to answer are:

Do doctors make more correct diagnoses when they can see AI suggestions?
Does seeing AI suggestions change how confident doctors feel about their diagnosis?

Researchers will compare doctors who receive AI suggestions with doctors who do not receive AI suggestions to see how the AI affects accuracy, confidence, and decision-making.

Participants will complete up to 10 online clinical cases. For each case, they will:

Read a short medical scenario
Suggest up to three possible diagnoses

(If in the AI group) Review the AI's suggestions and decide whether to change their answer

The study will also look at how long participants take to answer each case and how the AI's performance compares to the human answers.

Study Overview

Status

Recruiting

Conditions

Intervention / Treatment

Other: AI suggestion

Detailed Description

This study evaluates whether providing clinicians with real-time diagnostic suggestions from a high-reasoning large language model (GPT-5) improves diagnostic accuracy, confidence, and efficiency when solving nephrology clinical vignettes. Prior to selecting the model for the trial, the research team benchmarked several state-of-the-art models across a pilot set of nephrology cases, including: GPT-5, GPT-5-mini, O3, GPT-4o, Llama-4 Maverick-17B, Gemini-2.5-Pro, Qwen-3 VL-235B Thinking, DeepSeek-V3.2-Exp, MedGEMMA-27B, Claude Sonnet-4.5, and Magistral-Medium-2509. GPT-5 (high-reasoning) demonstrated the highest diagnostic performance, stability, and interpretability, and was selected as the AI system used in the intervention arm.

Participants include medical students, residents, fellows, and practicing physicians. After creating an account, participants complete a demographic questionnaire (specialty, years of experience, practice type, age category, AI familiarity) and must explicitly agree to the use of these data for research purposes before accessing the vignettes. No directly identifying information is collected.

Participants are randomized (with stratification by professional status) to either the AI-supported arm or the control arm. Each participant is assigned 10 nephrology vignettes in French or English and may complete them over multiple sessions. Once a vignette is submitted, it cannot be revisited ("no backtracking"). Completion time per vignette is automatically recorded.
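
The record states only that randomization is stratified by professional status; it does not describe the allocation algorithm. As a minimal sketch, assuming permuted blocks within each stratum (an assumption, not the trial's documented method), allocation could look like the following. The class name `StratifiedRandomizer`, the method `assign`, the block size, and the stratum labels are all hypothetical.

```python
import random
from collections import defaultdict

ARMS = ("control", "ai_supported")


class StratifiedRandomizer:
    """Permuted-block randomization within strata (illustrative only)."""

    def __init__(self, block_size=2, seed=None):
        if block_size % len(ARMS) != 0:
            raise ValueError("block_size must be a multiple of the number of arms")
        self.block_size = block_size
        self.rng = random.Random(seed)
        # One queue of pending assignments per stratum (e.g. professional status).
        self.pending = defaultdict(list)

    def assign(self, stratum):
        """Return the arm for the next participant in the given stratum."""
        queue = self.pending[stratum]
        if not queue:
            # Refill with a balanced block, then shuffle its order.
            block = list(ARMS) * (self.block_size // len(ARMS))
            self.rng.shuffle(block)
            queue.extend(block)
        return queue.pop()


# Hypothetical usage: ten residents are split evenly between the two arms.
randomizer = StratifiedRandomizer(block_size=2, seed=42)
resident_arms = [randomizer.assign("resident") for _ in range(10)]
```

Blocking within each stratum keeps the two arms balanced inside every professional category, which is the usual motivation for stratified allocation.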

Control Arm

Participants view each vignette and provide up to three diagnoses ("Top-3"), followed by a confidence rating (0-10).

AI-Supported Arm

Participants first enter an initial Top-3 diagnosis and confidence rating without AI assistance. The system then displays GPT-5's diagnostic suggestions, after which participants may revise their diagnoses once. The vignette is locked after submission.

The study collects:

initial and final diagnoses,
confidence ratings before and (if applicable) after AI suggestions,
completion times,
participant demographic variables,
and the AI model's own diagnostic outputs.

Partial completion is permitted; all completed vignettes contribute to the analysis.

Primary and secondary outcomes include diagnostic accuracy (Top-3 and Top-1), accuracy improvement before vs. after AI, changes in diagnostic confidence, AI-induced diagnostic errors, human-versus-AI benchmarking, completion-time efficiency metrics, and the proportion of assigned vignettes completed.

The primary analysis will compare diagnostic accuracy between the control arm (physicians alone) and the experimental arm (physicians assisted by the AI model). Accuracy is analyzed as a binary outcome (correct vs incorrect diagnosis). Because each participant evaluates multiple clinical vignettes, accuracy will be modeled using a mixed-effects logistic regression with a fixed effect for study arm and random intercepts for both participant and vignette. This approach accounts for clustering and varying difficulty across cases. The primary hypothesis test uses a two-sided α = 0.05. Effect sizes will be reported as odds ratios with 95% confidence intervals. Secondary analyses will explore whether accuracy varies by demographic factors (e.g., experience level, specialty) using interaction terms.
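
The model described above can be written out as follows. The notation is introduced here for illustration and does not come from the registry record: Y_ij equals 1 when participant i answers vignette j correctly, and A_i equals 1 when participant i is in the AI-supported arm. The Wald form of the confidence interval is likewise an assumption; the record says only that odds ratios with 95% confidence intervals will be reported.

```latex
\operatorname{logit}\,\Pr(Y_{ij} = 1) = \beta_0 + \beta_1 A_i + u_i + v_j,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\quad v_j \sim \mathcal{N}(0, \sigma_v^2)

\mathrm{OR} = \exp(\beta_1),
\qquad 95\%\ \mathrm{CI} = \exp\!\bigl(\hat\beta_1 \pm 1.96\,\mathrm{SE}(\hat\beta_1)\bigr)
```

Here u_i is the participant random intercept (clustering of answers within a participant) and v_j is the vignette random intercept (varying case difficulty).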

Because each participant evaluates multiple vignettes, the team also performed simulation-based power analyses using mixed-effects logistic regression models with random intercepts for both participant and vignette, assuming an intra-participant ICC of 0.10. Under these assumptions, a total sample of 100 participants (50 per arm) with 10 vignettes per participant provides >99% power to detect a clinically meaningful improvement in diagnostic accuracy. The investigators therefore plan to enroll approximately 100 participants overall.
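
A simulation of this kind can be sketched as below. This is not the investigators' code. Data are generated from a logistic model with participant and vignette random intercepts, but for brevity the test applied to each simulated data set is a z-test on participant-level accuracy rather than the mixed-effects logistic regression named in the record. The accuracies `p_control` and `p_ai`, the vignette standard deviation `sd_vignette`, and the reading of the ICC as a latent-scale quantity are all assumptions; the record does not state the effect size it used.

```python
import math
import random
import statistics


def simulate_power(n_per_arm=50, n_vignettes=10, p_control=0.5, p_ai=0.7,
                   icc=0.10, sd_vignette=0.5, n_sims=200, seed=0):
    """Estimate power by simulation (simplified participant-level z-test).

    p_control and p_ai are accuracies for an average participant on an
    average vignette (random effects equal to zero), not marginal accuracies.
    """
    rng = random.Random(seed)
    # Latent-scale ICC = var_u / (var_u + pi^2 / 3); solve for var_u.
    sd_participant = math.sqrt(icc * (math.pi ** 2 / 3) / (1 - icc))

    def logit(p):
        return math.log(p / (1 - p))

    rejections = 0
    for _ in range(n_sims):
        # Vignette difficulty is shared by both arms within one simulated trial.
        vignette_fx = [rng.gauss(0, sd_vignette) for _ in range(n_vignettes)]
        arm_scores = []
        for p in (p_control, p_ai):
            scores = []
            for _ in range(n_per_arm):
                u = rng.gauss(0, sd_participant)
                correct = 0
                for v in vignette_fx:
                    prob = 1 / (1 + math.exp(-(logit(p) + u + v)))
                    correct += rng.random() < prob
                scores.append(correct / n_vignettes)
            arm_scores.append(scores)
        control, ai = arm_scores
        se = math.sqrt(statistics.variance(control) / n_per_arm
                       + statistics.variance(ai) / n_per_arm)
        if se == 0:
            continue  # degenerate data set, counted as a non-rejection
        z = (statistics.mean(ai) - statistics.mean(control)) / se
        rejections += abs(z) > 1.96  # two-sided alpha = 0.05
    return rejections / n_sims
```

Calling `simulate_power()` with different values of `p_ai` shows how the estimated power depends on the assumed effect size; setting `p_ai` equal to `p_control` gives a rough check of the type I error rate.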

This study aims to quantify whether AI-augmented reasoning meaningfully improves diagnostic performance and decision-making when clinicians evaluate complex nephrology cases.

Study Type

Interventional

Enrollment (Estimated)

100

Phase

Not Applicable

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Name: Raphaël BENTEGEAC, MD, MPH
Phone Number: +33651204000
Email: raphael.bentegeac@univ-lille.fr

Study Locations

France
- Lille, France, 59000
    - Recruiting
    - Lille University Hospital (online study)
    - Contact: Raphaël BENTEGEAC, MD, MPH
      Phone Number: +33651204000
      Email: raphael.bentegeac@chu-lille.fr
    - Contact: Aghiles HAMROUN, MD, PhD
      Email: aghiles.hamroun@univ-lille.fr

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Yes

Description

Inclusion Criteria:

Adults aged 18 years or older.

Able to read and answer clinical vignettes in English or French.

Access to a computer or smartphone with an internet connection.

Provides informed consent online.

Participants are expected to have at least basic medical training (e.g., medical students, residents, fellows, or practicing clinicians), although no formal verification is required.

Exclusion Criteria:

Individuals under 18 years of age.

Inability to complete online study procedures.

Prior involvement in the design, development, or evaluation of the AI system used in this study.

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Primary Purpose: Diagnostic
Allocation: Randomized
Interventional Model: Parallel Assignment
Masking: None (Open Label)

Number of Arms

Arms and Interventions

Participant Group / Arm

Intervention / Treatment

Experimental: Group with AI suggestions

Participants in this arm will complete the same clinical case vignettes as the control group. For each case, they will receive a suggested diagnosis generated by a large language model (GPT-5, high-reasoning configuration), which was selected after internal benchmarking. Participants can review the AI suggestion before entering their own final diagnostic answer. No additional information, prompts, or coaching is provided. The intervention consists solely of displaying the AI-generated diagnostic suggestion during the case-solving task.

Other: AI suggestion

This intervention consists of displaying an AI-generated diagnostic suggestion during the clinical case-solving task. After reading each vignette, participants see the top diagnostic proposal produced by a large language model (GPT-5, high-reasoning configuration), selected after internal benchmarking. The AI suggestion appears once per vignette and cannot be requested again or modified. Participants may revise their diagnostic answer after viewing the suggestion, but they cannot return to the vignette later. No additional guidance, coaching, or interactive features are provided.

No Intervention: Group without AI suggestions

Participants in this arm will complete the clinical case vignettes independently, without any AI-generated diagnostic suggestions. They will read each vignette and provide their own diagnostic answer based solely on the information presented. No external decision support or additional materials are provided.

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Final diagnostic accuracy (top-3) with vs without AI support	For each participant, the proportion of vignettes where the correct main diagnosis is included in the participant's final top-3 diagnoses. Final top-3 accuracy is compared between the AI arm (after AI suggestions) and the control arm (no AI). Unit: percentage of correctly diagnosed cases (top-3).	From first vignette answered until the end of the study (up to 12 months).
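
The top-3 accuracy defined above can be sketched as follows. The record does not say how a participant's answer is matched to the reference diagnosis, so the case-insensitive exact string match used here is a simplifying assumption, and the function names and example diagnoses are hypothetical.

```python
def is_top_k_correct(answers, correct_diagnosis, k=3):
    """True if the correct diagnosis appears among the first k answers.

    Matching is a case-insensitive exact comparison (an assumption).
    """
    target = correct_diagnosis.strip().lower()
    return any(a.strip().lower() == target for a in answers[:k])


def top_k_accuracy(responses, k=3):
    """Proportion of completed vignettes answered correctly within the top k.

    `responses` is a list of (answers, correct_diagnosis) pairs, one per
    completed vignette.
    """
    if not responses:
        raise ValueError("no completed vignettes")
    hits = sum(is_top_k_correct(ans, truth, k) for ans, truth in responses)
    return hits / len(responses)


# Made-up example answers for one participant (not study data).
responses = [
    (["IgA nephropathy", "Lupus nephritis"], "IgA nephropathy"),
    (["Minimal change disease", "FSGS", "Membranous nephropathy"],
     "Membranous nephropathy"),
    (["Acute tubular necrosis"], "Acute interstitial nephritis"),
]
top3 = top_k_accuracy(responses, k=3)  # 2 of 3 vignettes correct
top1 = top_k_accuracy(responses, k=1)  # 1 of 3 vignettes correct
```

The same function with `k=1` gives the top-1 accuracy used in the secondary outcomes.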

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Final diagnostic accuracy (top-1) with vs without AI support	For each participant, the proportion of vignettes where the participant's final top-1 diagnosis is the correct main diagnosis. Final top-1 accuracy is compared between the AI arm (after AI suggestions) and the control arm (no AI). Unit: percentage of correctly diagnosed cases (top-1).	From first vignette answered until the end of the study (up to 12 months).
Change in top-3 diagnostic accuracy before vs after AI suggestions (AI arm only)	In the AI-supported arm, participants first provide an initial answer (up to three diagnoses) without AI suggestions, then see AI-generated suggestions and may revise their answer once; they cannot return to that vignette later. For each participant, the investigators compute the difference in top-3 accuracy between initial and final answers across all completed vignettes. Unit: percentage-point change in top-3 diagnostic accuracy.	From first vignette answered until the end of the study (up to 12 months).
Change in top-1 diagnostic accuracy before vs after AI suggestions (AI arm only)	In the AI-supported arm, participants first provide an initial answer (up to three diagnoses) without AI suggestions, then see AI-generated suggestions and may revise their answer once; they cannot return to that vignette later. For each participant, the investigators compute the difference in top-1 accuracy between initial and final answers across all completed vignettes. Unit: percentage-point change in top-1 diagnostic accuracy.	From first vignette answered until the end of the study (up to 12 months).
Diagnostic confidence (0-10) before AI suggestions: control vs AI arm	Participants in both arms rate their confidence (0-10 scale) in their top-3 diagnostic proposal before any AI suggestions. In the AI arm, this is the "pre-AI" rating. In the control arm, this is the single confidence rating (since no AI is shown). The investigators compare the pre-AI confidence between arms, aggregated across all completed vignettes per participant.	From first vignette answered until the end of the study (up to 12 months).
Final diagnostic confidence (0-10) after AI suggestions: control vs AI arm	Final diagnostic confidence (0-10 scale) in the top-3 diagnostic proposal across all completed vignettes, compared between arms. In the AI arm, this is the post-AI confidence rating. In the control arm, this is the same single confidence rating, since participants do not receive AI suggestions.	From first vignette answered until the end of the study (up to 12 months).
Change in diagnostic confidence (0-10) before vs after AI suggestions (AI arm only)	In the AI arm, participants provide confidence ratings (0-10 scale) for their top-3 diagnoses both before and after seeing AI suggestions. For each participant, the investigators compute the within-participant change (post-AI minus pre-AI) across all completed vignettes. Unit: change in confidence score (0-10 scale).	From first vignette answered until the end of the study (up to 12 months).
AI-induced diagnostic error (AI arm only)	Among completed vignettes where the participant's initial top-1 diagnosis is correct, the proportion for which the final top-1 diagnosis becomes incorrect after AI suggestions.	From first vignette answered until the end of the study (up to 12 months).
Change in top-3 diagnosis after AI suggestions (AI arm only)	Among completed vignettes in the AI arm, the proportion where the top-3 diagnosis differs between pre-AI and post-AI answers.	From first vignette answered until the end of the study (up to 12 months).
Top-3 diagnostic accuracy: all human answers before AI vs AI accuracy	For each vignette, the top-3 diagnostic accuracy of human participants before any AI suggestions (combining participants from both study arms at their pre-AI stage) is compared with the top-3 diagnostic accuracy of the AI model for the same vignette. The reported outcome is the accuracy difference, defined as AI top-3 accuracy minus human pre-AI top-3 accuracy, expressed in percentage points and computed at the vignette level across all completed vignettes. Unit: percentage-point difference in top-3 diagnostic accuracy.	From first vignette answered until the end of the study (up to 12 months).
Top-3 diagnostic accuracy: human final answers after AI vs AI accuracy (AI arm only)	For each vignette completed in the AI-supported arm, the top-3 diagnostic accuracy of human participants after viewing AI suggestions is compared with the top-3 diagnostic accuracy of the AI model (top-3 accuracy is a single measure). The reported outcome is the accuracy difference, defined as AI top-3 accuracy minus human post-AI top-3 accuracy, expressed in percentage points and computed at the vignette level across all completed vignettes in the AI arm. Unit: percentage-point difference in top-3 diagnostic accuracy between AI and human.	From first vignette answered until the end of the study (up to 12 months).
Completion time per vignette with and without AI support	For each vignette, the platform records the time from vignette opening to answer submission. In the control arm, a single completion time is recorded for each vignette. In the AI-supported arm, completion time is recorded before viewing AI suggestions and again after viewing AI suggestions. The reported outcome is the difference in completion time between study arms, calculated across all completed vignettes. Unit: seconds (difference in completion time).	From first vignette answered until the end of the study (up to 12 months).
Proportion of assigned vignettes completed	For each participant, the proportion of the 10 vignettes completed within the study period, compared between arms.	From first vignette answered until the end of the study (up to 12 months).
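
Two of the AI-arm measures above, the before-versus-after change in accuracy and the AI-induced diagnostic error, reduce to simple arithmetic on pairs of correctness flags. The sketch below assumes each completed vignette has already been scored as correct or incorrect at the initial and final stages; the function names and the example records are hypothetical and are not study data.

```python
def accuracy_change_points(records):
    """Percentage-point change in accuracy from initial to final answers.

    `records` is a list of (initial_correct, final_correct) booleans,
    one pair per completed vignette.
    """
    if not records:
        raise ValueError("no completed vignettes")
    n = len(records)
    initial = sum(i for i, _ in records) / n
    final = sum(f for _, f in records) / n
    return 100 * (final - initial)


def ai_induced_error_rate(records):
    """Among initially correct vignettes, the proportion that end up incorrect.

    Returns None when no initial answer was correct (rate undefined).
    """
    finals_when_initially_correct = [f for i, f in records if i]
    if not finals_when_initially_correct:
        return None
    wrong_after = sum(not f for f in finals_when_initially_correct)
    return wrong_after / len(finals_when_initially_correct)


# Made-up flags for five vignettes: (initial correct, final correct).
records = [(True, True), (True, False), (False, True),
           (False, True), (False, False)]
change = accuracy_change_points(records)     # 2/5 -> 3/5, about +20 points
error_rate = ai_induced_error_rate(records)  # 1 of 2 initially correct flips
```

For the AI-induced error measure, the flags would be computed on top-1 answers, as in the definition above; for the accuracy-change measures, they would be computed on top-3 or top-1 answers depending on the outcome.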

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

University Hospital, Lille

Collaborators

Institut Pasteur de Lille

Lille University

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

November 20, 2025

Primary Completion (Estimated)

October 31, 2026

Study Completion (Estimated)

December 31, 2026

Study Registration Dates

First Submitted

November 19, 2025

First Submitted That Met QC Criteria

January 12, 2026

First Posted (Actual)

January 20, 2026

Study Record Updates

Last Update Posted (Actual)

January 20, 2026

Last Update Submitted That Met QC Criteria

January 12, 2026

Last Verified

January 1, 2026

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

CHUL-191125

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product
