Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models (BUST-AI Bench)

March 24, 2026 updated by: Qingli Zhu, Peking Union Medical College Hospital

Construction of a Standardized Benchmark Evaluation System for Intelligent Breast Ultrasound Image Interpretation and Systematic Performance Assessment of Multimodal Artificial Intelligence Models Based on ACR BI-RADS v2025 Criteria

This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.

De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.

Two baseline deep learning models (the CNN-based ResNet-50 and the Transformer-based USFM) will be trained to provide reference performance and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts, with the sampling temperature set to 0 for reproducibility.

Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.

Study Overview

Detailed Description

Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.

Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years of experience) will independently annotate the images per ACR BI-RADS v2025; a fifth expert will arbitrate discordant cases.

Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
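
The three-tier rule above can be illustrated with a minimal sketch. It assumes each case carries a reference label plus one prediction from each baseline model; the function name `assign_tier`, the label strings, and the toy cases are hypothetical and not taken from the study protocol. The senior expert validation required for Tier 3 is a manual step and is not modeled here.

```python
from collections import Counter

def assign_tier(reference, resnet_pred, usfm_pred):
    """Map cross-architecture agreement to a difficulty tier.

    Tier 1: both baseline models correct.
    Tier 2: exactly one model correct.
    Tier 3: both models incorrect (expert validation happens outside this code).
    """
    n_correct = int(resnet_pred == reference) + int(usfm_pred == reference)
    return {2: 1, 1: 2, 0: 3}[n_correct]

# Hypothetical toy cases: (reference label, ResNet-50 prediction, USFM prediction)
cases = [
    ("malignant", "malignant", "malignant"),
    ("benign", "benign", "malignant"),
    ("malignant", "benign", "benign"),
    ("normal", "normal", "normal"),
]

tiers = [assign_tier(*case) for case in cases]
tier_counts = Counter(tiers)
print(tiers)
print(dict(tier_counts))
```

Reporting the tier counts alongside per-tier MLLM accuracy would show whether the language models struggle on the same cases as the baselines.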

Safety Assessment: (1) an out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) a temperature-stability pre-experiment across sampling-temperature settings; (3) a thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
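
As a rough illustration of the out-of-distribution rejection test, the sketch below computes the proportion of non-diagnostic images a model refused. The Wilson score interval is an assumption added for illustration; the record does not state how uncertainty for this rate will be reported. The refusal counts are made up.

```python
import math

def rejection_rate(refused_flags):
    """Proportion of out-of-distribution images the model refused to interpret."""
    return sum(refused_flags) / len(refused_flags)

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed choice of CI)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical outcome: 120 of the 150 non-diagnostic images were refused.
refused = [True] * 120 + [False] * 30
rate = rejection_rate(refused)
low, high = wilson_interval(sum(refused), len(refused))
print(f"rejection rate = {rate:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```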

Study Type

Observational

Enrollment (Estimated)

1380

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Locations

      • Beijing, China, 100730
        • Recruiting
        • Peking Union Medical College Hospital

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

  • Adult
  • Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

De-identified breast ultrasound images from adult patients who underwent breast ultrasound examination at Peking Union Medical College Hospital between 2018 and 2025 with subsequent pathological confirmation, supplemented by images from published, ethics-approved, open-access breast ultrasound datasets (e.g., BUSI, BrEaST).

Description

Inclusion Criteria:

  • B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
  • Image quality adequate for clinical diagnosis with clear visualization of the region of interest
  • Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
  • Complete de-identification with removal of all personally identifiable information

Exclusion Criteria:

  • Severely degraded image quality precluding meaningful BI-RADS assessment
  • Duplicate images from the same patient (only the most representative image retained per lesion)
  • Images with residual personally identifiable information after de-identification processing
  • Cases with ambiguous, disputed, or unavailable pathological results
  • Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Cohorts and Interventions

Group / Cohort

  • Normal Breast: Breast ultrasound images showing normal glandular tissue across different tissue composition types, with no focal lesions identified. Confirmed by senior radiologist review.
  • Benign Lesion: Breast ultrasound images containing pathologically confirmed benign lesions (BI-RADS 2-4B), including fibroadenoma, cyst, lipoma, sclerosing adenosis, intraductal papilloma, and selected non-mass lesions (NML).
  • Malignant Lesion: Breast ultrasound images containing pathologically confirmed malignant lesions (BI-RADS 3-5), including invasive ductal carcinoma, invasive lobular carcinoma, mucinous carcinoma, and selected non-mass lesions (NML).

Intervention / Treatment (identical for all three cohorts)

Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved.

What is the study measuring?

Primary Outcome Measures

Outcome Measure: Diagnostic Accuracy for Pathological Diagnosis
Measure Description: Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score of AI models for benign-malignant classification, with histopathological diagnosis as the gold standard.
Time Frame: At study completion, approximately 12 months

Outcome Measure: BI-RADS Classification Accuracy
Measure Description: Overall accuracy of AI models in assigning BI-RADS categories (2, 3, 4A, 4B, 4C, 5) to breast ultrasound images, compared with expert consensus annotation as the reference standard.
Time Frame: At study completion, approximately 12 months
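
The benign-malignant metrics named in the primary outcome follow from a 2x2 confusion matrix. The sketch below is a minimal illustration with pathology as the reference; the function name `diagnostic_metrics`, the 0/1 encoding (1 = malignant), and the toy labels are assumptions for this example, not study data.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and F1 for binary labels (1 = malignant)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical toy data: 4 malignant and 6 benign cases.
pathology = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_out = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = diagnostic_metrics(pathology, model_out)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```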

Secondary Outcome Measures

Outcome Measure: Agreement with Expert Consensus (Cohen's Kappa)
Measure Description: Cohen's kappa coefficient measuring agreement between each AI model's BI-RADS classification and the expert consensus annotation, reported with 95% confidence intervals.
Time Frame: At study completion, approximately 12 months

Outcome Measure: Out-of-Distribution Rejection Rate
Measure Description: Proportion of non-diagnostic images (degraded quality, non-breast ultrasound, other imaging modalities) correctly identified and refused by AI models, evaluating domain safety.
Time Frame: At study completion, approximately 12 months

Outcome Measure: Sensitivity, Specificity, PPV, NPV, and F1 Score
Measure Description: Standard diagnostic performance metrics for benign-malignant classification, reported for each AI model individually.
Time Frame: At study completion, approximately 12 months
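
For the agreement outcome, a minimal sketch of unweighted Cohen's kappa between a model's BI-RADS categories and the expert consensus is given below. The record does not say whether an unweighted or weighted kappa will be used, so the unweighted form is an assumption here, and the category lists are invented. The 95% confidence interval mentioned in the measure description (for example via bootstrap resampling) is not shown.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    # Chance agreement from the product of the two marginal distributions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # degenerate case: both raters used a single shared category
    return (observed - expected) / (1 - expected)

# Hypothetical BI-RADS assignments for six images.
expert_consensus = ["3", "4A", "4A", "5", "2", "4B"]
model_output = ["3", "4A", "4B", "5", "2", "4B"]
kappa = cohens_kappa(expert_consensus, model_output)
print(f"kappa = {kappa:.3f}")
```

Because BI-RADS categories are ordered, a weighted kappa may be more informative than the unweighted form sketched here; which variant applies is for the study's statistical analysis plan to specify.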

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Investigators

  • Principal Investigator: Qingli Zhu, MD, Peking Union Medical College Hospital

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

March 12, 2026

Primary Completion (Estimated)

December 1, 2026

Study Completion (Estimated)

March 1, 2027

Study Registration Dates

First Submitted

March 24, 2026

First Submitted That Met QC Criteria

March 24, 2026

First Posted (Actual)

March 30, 2026

Study Record Updates

Last Update Posted (Actual)

March 30, 2026

Last Update Submitted That Met QC Criteria

March 24, 2026

Last Verified

March 1, 2026

More Information

Terms related to this study

Other Study ID Numbers

  • K10349
  • 2024-I2M-CT-B-035 (Other Grant/Funding Number: CAMS Innovation Fund for Medical Sciences)
  • I-26PJ0568 (Other Identifier: Ethics Committee, Peking Union Medical College Hospital)

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

YES

IPD Plan Description

The de-identified benchmark evaluation dataset, including expert-annotated breast ultrasound images with paired BI-RADS reading reports, is planned for public release to promote academic reproducibility and collaborative research.

IPD Sharing Time Frame

Within 6 months of primary publication, available indefinitely

IPD Sharing Access Criteria

Open access via a recognized data repository (to be determined)

IPD Sharing Supporting Information Type

  • STUDY_PROTOCOL
  • SAP
  • ANALYTIC_CODE

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

No

Studies a U.S. FDA-regulated device product

No
