Evaluation of One-Shot Vision Differential Diagnosis (OSVDE) and Multi-Step Conversational Non-Inferiority (MSCNE) in AI Medical Interviewing (OSVDE-MSCNE)

March 20, 2026 updated by: Magic Health Inc. (d.b.a. Nolla Health)

AI Medical Interviewing and Diagnostic System Performance Evaluation: One-Shot Vision Differential Diagnosis (OSVDE) and Multi-Step Conversational Non-Inferiority (MSCNE) Evaluation.

This study evaluates the diagnostic performance of a multimodal artificial intelligence (AI) system (AIMD.1) using de-identified medical images and semi-synthetic patient simulations. The study combines retrospective analysis of existing publicly available image datasets with prospective data collection from licensed clinicians who complete diagnostic evaluation tasks.

In the One-Shot Vision Differential Evaluation (OSVDE) stage, clinicians review individual de-identified medical images and generate a ranked list of potential diagnoses based solely on visual features. In the Multi-Step Conversational Non-Inferiority Evaluation (MSCNE) stage, clinicians complete diagnostic assessments using semi-synthetic patient simulations derived from de-identified medical images. Clinician performance will be compared with the AI system on the same diagnostic tasks.

Human participants consist solely of licensed clinicians who provide diagnostic responses. Medical images and simulated cases are study materials and are not considered study participants. No identifiable patient data are used, and the AI system is evaluated in an offline research environment and is not used for clinical decision-making or patient care.

Study Overview

Detailed Description

Artificial intelligence (AI) systems have demonstrated promising capabilities in medical diagnosis; however, rigorous benchmark evaluation is necessary prior to clinical deployment. AIMD.1 is a multimodal AI diagnostic system designed to assist with clinical reasoning through analysis of medical images and conversational diagnostic interactions.

This study is a benchmark performance evaluation of AIMD.1, developed by Nolla Health, and precedes any prospective validation involving real patients. The evaluation is conducted entirely in an offline research environment; the AI system is not used to guide real-world clinical care or patient management during this study.

The global healthcare system faces a significant workforce shortage, with projections suggesting a deficit of up to 11 million practitioners by 2030. AI systems for medical diagnosis have shown promise in addressing this gap in controlled research settings, but rigorous benchmark validation against clinician-level performance is needed before clinical deployment. This study addresses that need by evaluating AIMD.1 against both established AI benchmark systems and directly against licensed clinicians completing the same diagnostic tasks.

The study employs two complementary evaluation stages designed to assess distinct aspects of diagnostic capability:

Stage 1 - One-Shot Vision Differential Evaluation (OSVDE): The AI system and clinician participants independently review individual de-identified medical images and generate ranked top-5 differential diagnoses based solely on visual features. The image corpus comprises approximately 11,500-15,000 de-identified images spanning at least 12 medical specialties (Dermatology, Internal Medicine, Otolaryngology, Gynecology, Orthopedics, Pediatrics, Geriatrics, Emergency Medicine, Ophthalmology, Endocrinology, Family Medicine, and others) and 48 disease clusters.

Image sources include approximately 14,000 images retrieved from standard search engines (Google and Bing) and open access repositories such as the PMC Open Access Dataset, filtered for Creative Commons and public domain licensing, as well as approximately 1,000 de-identified clinical images provided by Nolla Health under terms of service permitting de-identified use for research purposes. All images are de-identified per HIPAA Safe Harbor standards. Images are located using disease-name keywords defined in the study's disease ontology and downloaded in standard formats (JPEG, PNG) with associated metadata, including ground-truth diagnostic labels and disease categories. Where available, additional metadata such as Fitzpatrick skin type (I-VI) and patient age range (pediatric, adult, geriatric) are recorded to enable subgroup analyses of diagnostic performance across demographic categories.

All source images undergo random combinations of affine and non-affine transformations (blurring, sharpening, contrast adjustment, color adjustment, pixel shifting, rotation, stretching, Gaussian noise, among others) to produce images that are fundamentally distinct from the originals while preserving clinically relevant visual features on average. Each transformed image is verified by at least one licensed dermatologist or primary care clinician and relabeled as necessary, with ambiguous or non-clinically relevant images removed from the corpus. This preprocessing pipeline also provides additional de-identification through cropping or masking of potentially identifiable regions, including facial features.

Stage 2 - Multi-Step Conversational Non-Inferiority Evaluation (MSCNE): The AI system and clinicians complete diagnostic tasks using semi-synthetic patient simulations grounded in de-identified medical images. These simulations deliver structured clinical information through a conversational interface across multiple interaction steps, allowing assessment of multi-step diagnostic reasoning that more closely mirrors real-world clinical encounters than single-image evaluation alone. Approximately 380-500 simulated cases are evaluated. Each clinician completes a subset (30-50%) of the simulation cases, and their performance serves as the human benchmark for a formal non-inferiority comparison with the AI system.

Approximately 10-30 licensed clinicians will participate in the study. Clinicians must hold an active license in at least one of the target medical specialties and must be 18 years of age or older. Clinicians will be recruited via professional networks, institutional contacts, and relevant medical associations. Participation is voluntary. Clinicians will complete diagnostic evaluation sessions remotely using a computer or tablet with a reliable internet connection. Sessions are expected to last approximately 60-90 minutes in aggregate. Clinicians will provide differential diagnoses for subsets of the image and simulation cases. For OSVDE, each clinician reviews approximately 10-30% of the image dataset. Human participants consist solely of clinicians providing diagnostic responses; the image datasets and synthetic cases serve as study materials and are not considered participants. Clinicians are compensated $1.00 per case for the OSVDE one-shot visual evaluation and $10.00 per case for the MSCNE multi-step conversational evaluation. Compensation is for time and effort and is not contingent on diagnostic accuracy. Performance is compared using paired statistical designs in which both the AI system and clinicians evaluate overlapping case sets. The AI system is additionally benchmarked against established AI diagnostic systems on the same datasets.

Clinician participants will receive a Clinician Information Sheet describing the study purpose, procedures, voluntary nature of participation, data handling practices, and contact information for the research team and IRB prior to participation. Clinicians will indicate their acknowledgment before beginning the evaluation.

All images used in the study are de-identified and originate from publicly available sources or datasets that meet de-identification standards. No electronic health record (EHR) data is accessed at any point during the study. Because of the transformation pipeline, no stored or processed image remains fundamentally the same as any source image, which further strengthens de-identification. Additional preprocessing steps remove any potentially identifiable information before inclusion in the research dataset. Images are assigned sequential study identifiers (e.g., IMG_00001) with no linkage to original sources, and no code key or crosswalk exists that could enable re-identification. Data is stored on HIPAA-compliant workstations or cloud services (GCP or AWS) with AES-256 encryption at rest, multi-factor authentication, and role-based access controls; all access is logged and audited. Data transfers use TLS 1.3 or stronger encryption in transit.

The primary outcome measure is Top-1 diagnostic accuracy, defined as the proportion of cases in which the AI system's primary diagnosis matches the reference diagnosis. Secondary outcomes include Top-5 diagnostic accuracy, expected calibration error (ECE), area under the ROC curve (AUC) per disease cluster, per-class sensitivity and specificity across disease categories, and time-to-diagnosis measured in conversational turns for the MSCNE simulated cases.
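To make the accuracy and calibration definitions concrete, here is a minimal sketch of Top-k accuracy and an equal-width-bin ECE. The function names, the ten-bin default, and the example labels are assumptions for illustration; the record does not state which binning scheme the study uses.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int
) -> float:
    """Fraction of cases whose reference label is among the first k ranked diagnoses."""
    if len(ranked) != len(truth) or not truth:
        raise ValueError("ranked and truth must be non-empty and of equal length")
    hits = sum(1 for preds, ref in zip(ranked, truth) if ref in preds[:k])
    return hits / len(truth)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE: sum over bins of (bin share) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("confidences and correct must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            accuracy = sum(ok for _, ok in members) / len(members)
            mean_conf = sum(conf for conf, _ in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    ranked = [
        ["psoriasis", "eczema", "tinea"],
        ["melanoma", "nevus"],
        ["acne", "rosacea"],
        ["eczema", "psoriasis"],
    ]
    truth = ["psoriasis", "nevus", "rosacea", "scabies"]
    print(top_k_accuracy(ranked, truth, k=1))  # 0.25
    print(top_k_accuracy(ranked, truth, k=5))  # 0.75
```

In the example, only the first case is a Top-1 hit, while the first three cases contain the reference label somewhere in the ranked list.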

Statistical analysis employs bootstrap resampling (1,000 iterations) for confidence interval estimation, McNemar's test for paired accuracy comparisons, and non-inferiority testing with a pre-specified margin of δ = 5%. With 11,500-15,000 images (approximately 100-500 samples for each of the 48 disease clusters) and an assumed true accuracy of 70%, the design achieves a 95% confidence interval half-width of approximately ±0.8%, providing sufficient precision to detect meaningful differences from benchmark performance. No interim analyses are planned; all analyses are conducted after complete data collection. Results will be reported in accordance with TRIPOD guidelines for prediction model studies.
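The quoted precision and the paired tests can be checked with a short sketch. Several choices here are assumptions rather than details from the record: the ±0.8% figure is reproduced with a normal-approximation (Wald) interval, McNemar's test is shown in its exact binomial form, the bootstrap uses a two-sided 95% percentile interval, and the 5% margin is read as five absolute percentage points. All function names are hypothetical.

```python
import math
import random
from typing import Sequence


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases the AI got right and the clinician got wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


def bootstrap_diff_ci(
    ai_correct: Sequence[int],
    clin_correct: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile CI for (AI accuracy - clinician accuracy), resampling paired cases."""
    rng = random.Random(seed)
    n = len(ai_correct)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(ai_correct[i] - clin_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def non_inferior(ci_lower: float, margin: float = 0.05) -> bool:
    """Non-inferiority holds if the CI's lower bound stays above -margin."""
    return ci_lower > -margin


if __name__ == "__main__":
    print(f"{wald_half_width(0.70, 11_500):.4f}")  # 0.0084
    print(f"{wald_half_width(0.70, 15_000):.4f}")  # 0.0073
```

Both ends of the stated image range give a half-width between 0.7 and 0.9 percentage points, in line with the approximate ±0.8% quoted above. Per-cluster estimates, with 100-500 images each, would be considerably less precise.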

Clinician responses are recorded using anonymous study identifiers (e.g., CLIN_001, CLIN_002) with no link to the clinician's name, institution, or other identifying information. Only aggregate performance results (e.g., group accuracy rates, mean time-to-diagnosis) will be reported. No individual clinician results will be published or shared outside the research team. Demographic data collected from clinicians is limited to specialty, years of experience (in ranges), and practice setting (academic vs. community), recorded in a manner that prevents identification of individual clinicians.

The study duration is expected to be approximately six months: dataset cleaning, quality verification, and preprocessing (Month 1); OSVDE evaluation and analysis (Months 2-4); MSCNE evaluation and analysis (Months 2-5); and final analysis, reporting, and manuscript preparation (Month 6). The entire study is to be conducted in 2026. Publication will include only aggregate and summary-level data; no individual person-level data will be published or deposited in external repositories. This protocol (ID 1026) was verified as exempt under 45 CFR 46.104(d) on 03/10/2026 by Solutions IRB, (855) 226-4472 (www.solutionsirb.com).

Study Type

Observational

Enrollment (Estimated)

30

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

    • New York
      • New York, New York, United States, 10003
        • Recruiting
        • Nolla Health (Magic Health Inc.)
        • Principal Investigator:
          • Luis R Soenksen, MSE, PhD
        • Sub-Investigator:
          • Sean Geiger, B.S.
        • Sub-Investigator:
          • Luis Wenus

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

  • Adult
  • Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

The study population consists of licensed physicians across multiple medical specialties who participate in diagnostic evaluation tasks using de-identified medical images and semi-synthetic patient simulations. Participants are recruited through professional networks and medical associations. No patients are enrolled in this study.

Description

Inclusion Criteria:

  • Active license in Dermatology, Internal Medicine, Otolaryngology, Gynecology, Orthopedics, Pediatrics, Geriatrics, Emergency Medicine, Ophthalmology, Psychiatry, Endocrinology, Family Medicine, or a closely related specialty
  • Age 18 years or older
  • Ability to complete diagnostic evaluation sessions remotely using a computer or tablet with reliable internet access

Exclusion Criteria:

  • Loss of active license in an eligible specialty
  • Inability to complete the evaluation session remotely

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Cohorts and Interventions

Group / Cohort
Clinician Participants: Licensed clinicians who participate in diagnostic evaluation tasks using de-identified medical images and semi-synthetic patient simulations to assess diagnostic accuracy. Clinicians provide differential diagnoses for benchmark comparison with an AI diagnostic system.

Intervention / Treatment
AIMD.1 (also known as NollaMD agent) is a multimodal artificial intelligence (AI) diagnostic system designed to generate differential diagnoses based on analysis of medical images and structured clinical information. In this study, the system is evaluated using de-identified medical images and semi-synthetic patient simulations under controlled research conditions. The AI system generates ranked diagnostic outputs and associated confidence scores, which are compared with reference diagnoses and clinician performance metrics. The system is evaluated in an offline research environment. AI outputs are not used for clinical decision-making, patient management, or real-world medical care.
Other Names:
  • Artificial Intelligence Differential Diagnostic System
  • AI Clinical Decision Support System
  • AI Conversational Clinical System
  • NollaMD

What is the study measuring?

Primary Outcome Measures

Outcome Measure: Top-1 Diagnostic Accuracy
Measure Description: Proportion of evaluated cases in which the primary diagnosis generated by the AI Diagnostic System matches the reference (ground truth) diagnosis. Accuracy will be calculated across de-identified medical image cases and semi-synthetic patient simulation cases and compared with clinician performance.
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Secondary Outcome Measures

Outcome Measure: Top-5 Diagnostic Accuracy
Measure Description: Proportion of evaluated cases in which the correct reference diagnosis appears within the top five ranked diagnoses generated by the AI Diagnostic System.
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Outcome Measure: Diagnostic Accuracy of Clinician Participants
Measure Description: Proportion of evaluated cases in which the clinician participant's primary diagnosis matches the reference diagnosis, calculated across assigned image and simulation cases.
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Outcome Measure: Non-Inferiority of AI Diagnostic Accuracy Compared to Clinicians
Measure Description: Difference in Top-1 diagnostic accuracy between the AI Diagnostic System and clinician participants. Non-inferiority will be assessed using a predefined margin of 5%.
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Outcome Measure: Calibration of AI Diagnostic Confidence
Measure Description: Calibration performance of AI-generated diagnostic confidence scores assessed using Expected Calibration Error (ECE).
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Outcome Measure: Area Under the Receiver Operating Characteristic Curve (AUC) and Precision Recall Curve (PRC)
Measure Description: Area under the receiver operating characteristic curve and the Precision Recall Curve for AI diagnostic classification across disease categories.
Time Frame: At completion of diagnostic evaluations (up to 6 months)

Outcome Measure: Time-to-Diagnosis in Conversational Simulations
Measure Description: Number of conversational turns required by the AI system and clinician participants to reach a final diagnosis in semi-synthetic patient simulation cases.
Time Frame: At completion of simulation evaluations (up to 6 months)
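For the AUC measure listed above, one common way to obtain a per-category value is a one-vs-rest comparison: treat one disease category as positive, all others as negative, and count how often a positive case receives a higher confidence score than a negative one. The sketch below uses that pairwise (Mann-Whitney) formulation. The one-vs-rest framing, the function name, and the example labels are assumptions; the record does not specify how multi-class AUC is computed.

```python
from typing import Sequence


def one_vs_rest_auc(
    scores: Sequence[float], labels: Sequence[str], positive: str
) -> float:
    """AUC for one category: P(score of a positive case > score of a negative case).

    scores[i] is the model's confidence that case i belongs to `positive`.
    Ties count as half a win. Runs in O(P * N) time for P positives and N negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.6]
    labels = ["psoriasis", "eczema", "eczema", "psoriasis"]
    print(one_vs_rest_auc(scores, labels, positive="psoriasis"))  # 0.75
```

In the example, three of the four positive-negative pairs are ordered correctly, giving 0.75. For corpus-scale data a rank-based implementation would be faster, but the pairwise form keeps the definition visible.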

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Investigators

  • Principal Investigator: Luis R Soenksen, MSE, PhD, Nolla Health

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

March 19, 2026

Primary Completion (Estimated)

September 19, 2026

Study Completion (Estimated)

September 19, 2026

Study Registration Dates

First Submitted

March 10, 2026

First Submitted That Met QC Criteria

March 10, 2026

First Posted (Actual)

March 13, 2026

Study Record Updates

Last Update Posted (Actual)

March 25, 2026

Last Update Submitted That Met QC Criteria

March 20, 2026

Last Verified

March 1, 2026

More Information

Terms related to this study

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

NO

IPD Plan Description

Individual participant data will not be shared. The study involves de-identified medical images and anonymous clinician diagnostic responses. Only aggregate summary results (e.g., diagnostic accuracy metrics and statistical analyses) will be reported in publications and presentations.

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.
