Evaluation of One-Shot Vision Differential Diagnosis (OSVDE) and Multi-Step Conversational Non-Inferiority (MSCNE) in AI Medical Interviewing (OSVDE-MSCNE)

March 20, 2026 updated by: Magic Health Inc. (d.b.a. Nolla Health)

AI Medical Interviewing and Diagnostic System Performance Evaluation: One-Shot Vision Differential Diagnosis (OSVDE) and Multi-Step Conversational Non-Inferiority (MSCNE) Evaluation.

This study evaluates the diagnostic performance of a multimodal artificial intelligence (AI) system (AIMD.1) using de-identified medical images and semi-synthetic patient simulations. The study combines retrospective analysis of existing publicly available image datasets with prospective data collection from licensed clinicians who complete diagnostic evaluation tasks.

In the One-Shot Vision Differential Evaluation (OSVDE) stage, clinicians review individual de-identified medical images and generate a ranked list of potential diagnoses based solely on visual features. In the Multi-Step Conversational Non-Inferiority Evaluation (MSCNE) stage, clinicians complete diagnostic assessments using semi-synthetic patient simulations derived from de-identified medical images. Clinician performance will be compared with the AI system on the same diagnostic tasks.

Human participants consist solely of licensed clinicians who provide diagnostic responses. Medical images and simulated cases are study materials and are not considered study participants. No identifiable patient data are used, and the AI system is evaluated in an offline research environment and is not used for clinical decision-making or patient care.

Study Overview

Status

Recruiting

Conditions

Intervention / Treatment

Diagnostic test: AI Diagnostic System (AIMD.1)

Detailed Description

Artificial intelligence (AI) systems have demonstrated promising capabilities in medical diagnosis; however, rigorous benchmark evaluation is necessary prior to clinical deployment. AIMD.1 is a multimodal AI diagnostic system designed to assist with clinical reasoning through analysis of medical images and conversational diagnostic interactions.

This study is a benchmark performance evaluation of AIMD.1, developed by Nolla Health, conducted before any prospective validation involving real patients. The evaluation takes place entirely in an offline research environment; the AI system is not used to guide real-world clinical care or patient management during this study.

The global healthcare system faces a significant workforce shortage, with projections suggesting a deficit of up to 11 million practitioners by 2030. AI systems for medical diagnosis have shown promise in addressing this gap in controlled research settings, but rigorous benchmark validation against clinician-level performance is needed before clinical deployment. This study addresses that need by evaluating AIMD.1 against both established AI benchmark systems and directly against licensed clinicians completing the same diagnostic tasks.

The study employs two complementary evaluation stages designed to assess distinct aspects of diagnostic capability:

Stage 1 - One-Shot Vision Differential Evaluation (OSVDE): The AI system and clinician participants independently review individual de-identified medical images and generate ranked top-5 differential diagnoses based solely on visual features. The image corpus comprises approximately 11,500-15,000 de-identified images spanning at least 12 medical specialties (Dermatology, Internal Medicine, Otolaryngology, Gynecology, Orthopedics, Pediatrics, Geriatrics, Emergency Medicine, Ophthalmology, Endocrinology, Family Medicine, and others) and 48 disease clusters.

Image sources include approximately 14,000 images retrieved from standard search engines (Google and Bing) and open access repositories such as the PMC Open Access Dataset, filtered for Creative Commons and public domain licensing, as well as approximately 1,000 de-identified clinical images provided by Nolla Health under terms of service permitting de-identified use for research purposes. All images are de-identified per HIPAA Safe Harbor standards. Images are located using disease-name keywords defined in the study's disease ontology and downloaded in standard formats (JPEG, PNG) with associated metadata, including ground-truth diagnostic labels and disease categories. Where available, additional metadata such as Fitzpatrick skin type (I-VI) and patient age range (pediatric, adult, geriatric) are recorded to enable subgroup analyses of diagnostic performance across demographic categories.

All source images undergo random combinations of affine and non-affine transformations (blurring, sharpening, contrast adjustment, color adjustment, pixel shifting, rotation, stretching, and Gaussian noise, among others) to produce images that are fundamentally distinct from the originals while preserving, on average, clinically relevant visual features. Each transformed image is verified by at least one licensed dermatologist or primary care clinician and relabeled as necessary; ambiguous or non-clinically relevant images are removed from the corpus. This preprocessing pipeline also provides additional de-identification through cropping or masking of potentially identifiable regions, including facial features.

Stage 2 - Multi-Step Conversational Non-Inferiority Evaluation (MSCNE): The AI system and clinicians complete diagnostic tasks using semi-synthetic patient simulations grounded in de-identified medical images. These simulations deliver structured clinical information through a conversational interface across multiple interaction steps, allowing assessment of multi-step diagnostic reasoning that more closely mirrors real-world clinical encounters than single-image evaluation alone. Approximately 380-500 simulated cases are evaluated. Each clinician completes a subset (30-50%) of the simulation cases, and their performance serves as the human benchmark for a formal non-inferiority comparison with the AI system.

Approximately 10-30 licensed clinicians will participate in the study. Clinicians must hold an active license in at least one of the target medical specialties and must be 18 years of age or older. Clinicians will be recruited via professional networks, institutional contacts, and relevant medical associations. Participation is voluntary. Clinicians will complete diagnostic evaluation sessions remotely using a computer or tablet with a reliable internet connection. Sessions are expected to last approximately 60-90 minutes in aggregate. Clinicians will provide differential diagnoses for subsets of the image and simulation cases. For OSVDE, each clinician reviews approximately 10-30% of the image dataset. Human participants consist solely of clinicians providing diagnostic responses; the image datasets and synthetic cases serve as study materials and are not considered participants. Clinicians are compensated $1.00 per case for the OSVDE one-shot visual evaluation and $10.00 per case for the MSCNE multi-step conversational evaluation. Compensation is for time and effort and is not contingent on diagnostic accuracy. Performance is compared using paired statistical designs in which both the AI system and clinicians evaluate overlapping case sets. The AI system is additionally benchmarked against established AI diagnostic systems on the same datasets.

Clinician participants will receive a Clinician Information Sheet describing the study purpose, procedures, voluntary nature of participation, data handling practices, and contact information for the research team and IRB prior to participation. Clinicians will indicate their acknowledgment before beginning the evaluation.

All images used in the study are de-identified and originate from publicly available sources or datasets that meet de-identification standards. No electronic health record (EHR) data is accessed at any point during the study. No stored or processed image remains fundamentally the same as any source image due to the transformation pipeline, thereby strengthening de-identification safe harbors. Additional preprocessing steps ensure removal of any potentially identifiable information before inclusion in the research dataset. Images are assigned sequential study identifiers (e.g., IMG_00001) with no linkage to original sources. No code key or crosswalk exists that could enable re-identification. Data is stored on HIPAA-compliant workstations or cloud services (GCP or AWS) with AES-256 encryption at rest, multi-factor authentication, role-based access controls, and all access logged and audited. Data transfers use TLS 1.3 or stronger encryption in transit.

Primary outcome measures include Top-1 diagnostic accuracy, defined as the proportion of cases in which the AI system's primary diagnosis matches the reference diagnosis. Secondary outcomes include Top-5 diagnostic accuracy, expected calibration error (ECE), area under the ROC curve (AUC) per disease cluster, per-class sensitivity and specificity across disease categories, and time-to-diagnosis measured in conversational turns for the MSCNE simulated cases.
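As an illustration of the accuracy and calibration measures named above, the following is a minimal sketch, not the study's actual scoring code. It assumes each case yields a ranked list of diagnosis labels plus a confidence score for the top-ranked diagnosis, and it uses equal-width confidence bins for ECE; the function names, the exact-string label matching, and the 10-bin default are assumptions made for the example.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis appears in the first k ranked labels."""
    hits = sum(1 for preds, ref in zip(ranked, truth) if ref in preds[:k])
    return hits / len(truth)


def expected_calibration_error(conf: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    ranked = [["psoriasis", "eczema"], ["tinea", "eczema"], ["acne", "rosacea"]]
    truth = ["psoriasis", "eczema", "melanoma"]
    print(top_k_accuracy(ranked, truth, k=1))
    print(top_k_accuracy(ranked, truth, k=2))
```

With k=1 this corresponds to the Top-1 measure and with k=5 to the Top-5 measure; in practice, matching a free-text diagnosis to the reference label would likely require mapping to the disease ontology rather than exact string comparison.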

Statistical analysis employs bootstrap resampling (1,000 iterations) for confidence interval estimation, McNemar's test for paired accuracy comparisons, and non-inferiority testing with a pre-specified margin of δ = 5%. With 11,500-15,000 images (approximately 100-500 samples for each of the 48 disease clusters) and an assumed true accuracy of 70%, the design achieves a 95% confidence interval half-width of approximately ±0.8% for overall accuracy, providing sufficient precision to detect meaningful differences from benchmark performance. No interim analyses are planned; all analyses are conducted after complete data collection. Results will be reported in accordance with TRIPOD guidelines for prediction model studies.
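The precision figure and the two resampling and paired-test procedures mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's analysis code: the precision check uses the normal-approximation (Wald) formula, the bootstrap uses the simple percentile method, and McNemar's test is shown in its exact binomial form; the record does not say which variants the investigators use. All function names are hypothetical, and per-case outcomes are assumed to be coded as 1 (correct) or 0 (incorrect).

```python
import math
import random


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a 95% CI for a proportion p with n cases."""
    return z * math.sqrt(p * (1 - p) / n)


def bootstrap_accuracy_ci(correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy; `correct` is a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample cases with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b = cases only the first rater got right, c = cases only the second got right.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Precision at the two ends of the planned corpus size, assuming 70% accuracy.
    for n_images in (11_500, 15_000):
        print(n_images, round(100 * wald_half_width(0.70, n_images), 2), "%")
```

At an assumed accuracy of 70%, the half-width works out to roughly 0.84 percentage points for 11,500 images and 0.73 for 15,000, which is consistent with the approximately ±0.8% stated above. Per-cluster estimates would be considerably less precise, since each cluster holds only a few hundred images.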

Clinician responses are recorded using anonymous study identifiers (e.g., CLIN_001, CLIN_002) with no link to the clinician's name, institution, or other identifying information. Only aggregate performance results (e.g., group accuracy rates, mean time-to-diagnosis) will be reported. No individual clinician results will be published or shared outside the research team. Demographic data collected from clinicians is limited to specialty, years of experience (in ranges), and practice setting (academic vs. community), recorded in a manner that prevents identification of individual clinicians.

The study duration is expected to be approximately six months: dataset cleaning, quality verification, and preprocessing (Month 1); OSVDE evaluation and analysis (Months 2-4); MSCNE evaluation and analysis (Months 2-5); and final analysis, reporting, and manuscript preparation (Month 6). The entire study is to be executed in 2026. Publication will include only aggregate and summary-level data; no individual person-level data will be published or deposited in external repositories. This protocol (ID 1026) was verified as Exempt under 45 CFR 46.104(d) on 03/10/2026 by Solutions IRB, (855) 226-4472 (www.solutionsirb.com).

Study Type

Observational

Enrollment (Estimated)

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Name: Luis R Soenksen, MSE, PhD
Phone Number: (617) 936-9293
Email: soenksen@nollahealth.com

Study Contact Backup

Name: Sean Geiger, B.S.
Phone Number: (412) 412-8786
Email: sean@nollahealth.com

Study Locations

United States
- New York
  - New York, New York, United States, 10003
    - Recruiting
    - Nolla Health (Magic Health Inc.)
    - Contact:
      
      Sean Geiger, B.S.
      
      Phone Number: (412) 412-8786
      
      Email: sean@nollahealth.com
    - Contact:
      
      Luis R Soenksen, MSE, PhD
      
      Phone Number: (617) 936-9293
      
      Email: soenksen@nollahealth.com
    - Principal Investigator:
      
      Luis R Soenksen, MSE, PhD
    - Sub-Investigator:
      
      Sean Geiger, B.S.
    - Sub-Investigator:
      
      Luis Wenus

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

The study population consists of licensed physicians across multiple medical specialties who participate in diagnostic evaluation tasks using de-identified medical images and semi-synthetic patient simulations. Participants are recruited through professional networks and medical associations. No patients are enrolled in this study.

Description

Inclusion Criteria:

- Active license in Dermatology, Internal Medicine, Otolaryngology, Gynecology, Orthopedics, Pediatrics, Geriatrics, Emergency Medicine, Ophthalmology, Psychiatry, Endocrinology, Family Medicine, or a closely related specialty
- Age 18 years or older
- Ability to complete diagnostic evaluation sessions remotely using a computer or tablet with reliable internet access

Exclusion Criteria:

- Loss of active license in an eligible specialty
- Inability to complete the evaluation session remotely

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort

Intervention / Treatment

Clinician Participants

Licensed clinicians who participate in diagnostic evaluation tasks using de-identified medical images and semi-synthetic patient simulations to assess diagnostic accuracy. Clinicians provide differential diagnoses for benchmark comparison with an AI diagnostic system.

Diagnostic test: AI Diagnostic System (AIMD.1)

AIMD.1 (also known as NollaMD agent) is a multimodal artificial intelligence (AI) diagnostic system designed to generate differential diagnoses based on analysis of medical images and structured clinical information. In this study, the system is evaluated using de-identified medical images and semi-synthetic patient simulations under controlled research conditions. The AI system generates ranked diagnostic outputs and associated confidence scores, which are compared with reference diagnoses and clinician performance metrics. The system is evaluated in an offline research environment. AI outputs are not used for clinical decision-making, patient management, or real-world medical care.

Other Names:

Artificial Intelligence Differential Diagnostic System
AI Clinical Decision Support System
AI Conversational Clinical System
NollaMD

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Top-1 Diagnostic Accuracy	Proportion of evaluated cases in which the primary diagnosis generated by the AI Diagnostic System matches the reference (ground truth) diagnosis. Accuracy will be calculated across de-identified medical image cases and semi-synthetic patient simulation cases and compared with clinician performance.	At completion of diagnostic evaluations (up to 6 months)

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Top-5 Diagnostic Accuracy	Proportion of evaluated cases in which the correct reference diagnosis appears within the top five ranked diagnoses generated by the AI Diagnostic System.	At completion of diagnostic evaluations (up to 6 months)
Diagnostic Accuracy of Clinician Participants	Proportion of evaluated cases in which the clinician participant's primary diagnosis matches the reference diagnosis, calculated across assigned image and simulation cases.	At completion of diagnostic evaluations (up to 6 months)
Non-Inferiority of AI Diagnostic Accuracy Compared to Clinicians	Difference in Top-1 diagnostic accuracy between the AI Diagnostic System and clinician participants. Non-inferiority will be assessed using a predefined margin of 5%.	At completion of diagnostic evaluations (up to 6 months)
Calibration of AI Diagnostic Confidence	Calibration performance of AI-generated diagnostic confidence scores assessed using Expected Calibration Error (ECE).	At completion of diagnostic evaluations (up to 6 months)
Area Under the Receiver Operating Characteristic Curve (AUC) and Precision Recall Curve (PRC)	Area under the receiver operating characteristic curve and the Precision Recall Curve for AI diagnostic classification across disease categories.	At completion of diagnostic evaluations (up to 6 months)
Time-to-Diagnosis in Conversational Simulations	Number of conversational turns required by the AI system and clinician participants to reach a final diagnosis in semi-synthetic patient simulation cases.	At completion of simulation evaluations (up to 6 months)
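
The non-inferiority outcome listed above compares Top-1 accuracy between the AI system and clinicians against a 5% margin. One way such a check could be carried out on paired cases is sketched below. The record does not specify the interval method, so this example assumes a Wald interval for the paired difference in proportions and declares non-inferiority when the lower 95% bound lies above -5 percentage points; the function name and inputs are hypothetical.

```python
import math


def paired_noninferiority(ai_correct, clin_correct, margin=0.05, z=1.96):
    """Return (difference, lower bound, non-inferior?) for paired 0/1 outcomes.

    ai_correct[i] and clin_correct[i] must refer to the same case.
    """
    n = len(ai_correct)
    # Discordant pairs: b = AI right and clinician wrong, c = the reverse.
    b = sum(1 for a, h in zip(ai_correct, clin_correct) if a and not h)
    c = sum(1 for a, h in zip(ai_correct, clin_correct) if h and not a)
    diff = (b - c) / n  # AI accuracy minus clinician accuracy
    se = math.sqrt((b + c) - (b - c) ** 2 / n) / n  # Wald SE for a paired difference
    lower = diff - z * se
    return diff, lower, lower > -margin


if __name__ == "__main__":
    # Hypothetical toy data: 200 shared cases with 10 discordant pairs each way.
    ai = [1] * 150 + [0] * 50
    clin = [1] * 140 + [0] * 10 + [1] * 10 + [0] * 40
    print(paired_noninferiority(ai, clin))
```

Because each clinician reviews only a subset of cases, a real analysis would also need to account for several clinicians rating the same case, which this sketch does not address.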

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Magic Health Inc. (d.b.a. Nolla Health)

Investigators

Principal Investigator: Luis R Soenksen, MSE, PhD, Nolla Health

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

March 19, 2026

Primary Completion (Estimated)

September 19, 2026

Study Completion (Estimated)

September 19, 2026

Study Registration Dates

First Submitted

March 10, 2026

First Submitted That Met QC Criteria

March 10, 2026

First Posted (Actual)

March 13, 2026

Study Record Updates

Last Update Posted (Actual)

March 25, 2026

Last Update Submitted That Met QC Criteria

March 20, 2026

Last Verified

March 1, 2026

More Information

Terms related to this study

Keywords

Other Study ID Numbers

NH-OSVDE-MSCNE-1026

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

IPD Plan Description

Individual participant data will not be shared. The study involves de-identified medical images and anonymous clinician diagnostic responses. Only aggregate summary results (e.g., diagnostic accuracy metrics and statistical analyses) will be reported in publications and presentations.

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.

