Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models (BUST-AI Bench)

March 24, 2026 updated by: Qingli Zhu, Peking Union Medical College Hospital

Construction of a Standardized Benchmark Evaluation System for Intelligent Breast Ultrasound Image Interpretation and Systematic Performance Assessment of Multimodal Artificial Intelligence Models Based on ACR BI-RADS v2025 Criteria

This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.

De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.

Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility.

Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.

Study Overview

Status

Recruiting

Conditions

Intervention / Treatment

Diagnostic test: Multimodal AI Model Diagnostic Evaluation

Detailed Description

Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.

Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years) will independently annotate images per ACR BI-RADS v2025 with arbitration by a fifth expert for discordant cases.

Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
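
The three-tier rule above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name `difficulty_tier` and the string labels are assumptions, and the senior expert validation required for Tier 3 is not modeled.

```python
def difficulty_tier(label: str, cnn_pred: str, transformer_pred: str) -> int:
    """Map baseline-model agreement with the reference label to a tier.

    Tier 1: both baseline models correct.
    Tier 2: exactly one model correct.
    Tier 3: both incorrect (the record also requires senior expert
            validation for this tier, which is outside this sketch).
    """
    n_correct = int(cnn_pred == label) + int(transformer_pred == label)
    return {2: 1, 1: 2, 0: 3}[n_correct]


# Hypothetical cases: (reference label, ResNet-50 prediction, USFM prediction)
cases = [
    ("malignant", "malignant", "malignant"),
    ("benign", "malignant", "benign"),
    ("malignant", "benign", "benign"),
]
tiers = [difficulty_tier(*case) for case in cases]
print(tiers)  # [1, 2, 3]
```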

Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) Temperature-stability pre-experiment across parameter settings; (3) Thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
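
As a rough illustration of item (1), the rejection rate is the share of out-of-distribution images that a model declines to interpret. The sketch below assumes each model response has already been reduced to a boolean "refused" flag; how a refusal is detected in the raw output is not specified in this record, and the counts shown are made up.

```python
from typing import Sequence


def rejection_rate(refused: Sequence[bool]) -> float:
    """Fraction of out-of-distribution images the model refused to interpret."""
    if not refused:
        raise ValueError("no out-of-distribution images supplied")
    return sum(refused) / len(refused)


# Hypothetical outcome: 120 of the 150 safety-test images are refused.
flags = [True] * 120 + [False] * 30
rate = rejection_rate(flags)
print(rate)  # 0.8
```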

Study Type

Observational

Enrollment (Estimated)

1380

Contacts and Locations

This section provides the contact details for those conducting the study, and information on where this study is being conducted.

Study Contact

Name: Qingli Zhu, MD
Phone Number: +86 13621376699
Email: zqlpumch@126.com

Study Contact Backup

Name: Yinglan Wu, MD
Phone Number: +86 15626121076
Email: wuylan7@gmail.com

Study Locations

China
- Beijing, China, 100730
    - Recruiting
    - Peking Union Medical College Hospital
    - Contact: Qingli Zhu, MD
    - Phone Number: +86 13621376699
    - Email: zqlpumch@126.com

Participation Criteria

Researchers look for people who fit a certain description, called eligibility criteria. Some examples of these criteria are a person's general health condition or prior treatments.

Eligibility Criteria

Ages Eligible for Study

Adult
Older Adult

Accepts Healthy Volunteers

Yes

Sampling Method

Non-Probability Sample

Study Population

De-identified breast ultrasound images from adult patients who underwent breast ultrasound examination at Peking Union Medical College Hospital between 2018 and 2025 with subsequent pathological confirmation, supplemented by images from published, ethics-approved, open-access breast ultrasound datasets (e.g., BUSI, BrEaST).

Description

Inclusion Criteria:

- B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
- Image quality adequate for clinical diagnosis with clear visualization of the region of interest
- Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
- Complete de-identification with removal of all personally identifiable information

Exclusion Criteria:

- Severely degraded image quality precluding meaningful BI-RADS assessment
- Duplicate images from the same patient (only the most representative image retained per lesion)
- Images with residual personally identifiable information after de-identification processing
- Cases with ambiguous, disputed, or unavailable pathological results
- Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging

Study Plan

This section provides details of the study plan, including how the study is designed and what the study is measuring.

How is the study designed?

Design Details

Number of groups / cohorts

Cohorts and Interventions

Group / Cohort	Intervention / Treatment
Normal Breast: Breast ultrasound images showing normal glandular tissue across different tissue composition types, with no focal lesions identified. Confirmed by senior radiologist review.	Diagnostic test: Multimodal AI Model Diagnostic Evaluation. Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved.
Benign Lesion: Breast ultrasound images containing pathologically confirmed benign lesions (BI-RADS 2-4B), including fibroadenoma, cyst, lipoma, sclerosing adenosis, intraductal papilloma, and selected non-mass lesions (NML).	Diagnostic test: Multimodal AI Model Diagnostic Evaluation. Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved.
Malignant Lesion: Breast ultrasound images containing pathologically confirmed malignant lesions (BI-RADS 3-5), including invasive ductal carcinoma, invasive lobular carcinoma, mucinous carcinoma, and selected non-mass lesions (NML).	Diagnostic test: Multimodal AI Model Diagnostic Evaluation. Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved.

What is the study measuring?

Primary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Diagnostic Accuracy for Pathological Diagnosis	Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score of AI models for benign-malignant classification, with histopathological diagnosis as the gold standard.	At study completion, approximately 12 months
BI-RADS Classification Accuracy	Overall accuracy of AI models in assigning BI-RADS categories (2, 3, 4A, 4B, 4C, 5) to breast ultrasound images, compared with expert consensus annotation as the reference standard.	At study completion, approximately 12 months
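
For reference, the benign-malignant metrics named in the first primary outcome all follow from the four cells of a 2x2 confusion matrix, with malignant as the positive class. The sketch below uses the standard definitions; the function name and the example counts are illustrative assumptions, not study data.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary diagnostic metrics; malignant is the positive class."""

    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # malignant cases called malignant
        "specificity": ratio(tn, tn + fp),  # benign cases called benign
        "ppv": ratio(tp, tp + fp),          # malignant calls that are correct
        "npv": ratio(tn, tn + fn),          # benign calls that are correct
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```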

Secondary Outcome Measures

Outcome Measure	Measure Description	Time Frame
Agreement with Expert Consensus (Cohen's Kappa)	Cohen's kappa coefficient measuring agreement between each AI model's BI-RADS classification and the expert consensus annotation, reported with 95% confidence intervals.	At study completion, approximately 12 months
Out-of-Distribution Rejection Rate	Proportion of non-diagnostic images (degraded quality, non-breast ultrasound, other imaging modalities) correctly identified and refused by AI models, evaluating domain safety.	At study completion, approximately 12 months
Sensitivity, Specificity, PPV, NPV, and F1 Score	Standard diagnostic performance metrics for benign-malignant classification, reported for each AI model individually.	At study completion, approximately 12 months
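
The kappa outcome above could be computed along the following lines. This is a sketch under stated assumptions: it uses unweighted Cohen's kappa and a percentile bootstrap for the 95% interval, neither of which is specified in this record, and the BI-RADS labels in the example are invented.

```python
import random
from collections import Counter


def cohens_kappa(a, b) -> float:
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined: both raters used a single category
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohens_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k == k:  # drop undefined (NaN) resamples
            stats.append(k)
    stats.sort()
    low = stats[int((alpha / 2) * len(stats))]
    high = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return low, high


# Invented labels: expert consensus vs. one model's BI-RADS output.
expert = ["3", "4A", "4A", "5", "2", "4B", "3", "5", "4C", "2"]
model = ["3", "4A", "4B", "5", "2", "4B", "4A", "5", "4C", "2"]
kappa = cohens_kappa(expert, model)
low, high = bootstrap_ci(expert, model)
print(f"kappa = {kappa:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Because BI-RADS categories are ordered, a weighted kappa is a possible alternative; the unweighted form is shown only because the record does not name a weighting scheme.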

Collaborators and Investigators

This is where you will find people and organizations involved with this study.

Sponsor

Peking Union Medical College Hospital

Collaborators

Chinese Academy of Medical Sciences

Investigators

Principal Investigator: Qingli Zhu, MD, Peking Union Medical College Hospital

Publications and helpful links

The person responsible for entering information about the study voluntarily provides these publications. These may be about anything related to the study.

General Publications

Study record dates

These dates track the progress of study record and summary results submissions to ClinicalTrials.gov. Study records and reported results are reviewed by the National Library of Medicine (NLM) to make sure they meet specific quality control standards before being posted on the public website.

Study Major Dates

Study Start (Actual)

March 12, 2026

Primary Completion (Estimated)

December 1, 2026

Study Completion (Estimated)

March 1, 2027

Study Registration Dates

First Submitted

March 24, 2026

First Submitted That Met QC Criteria

March 24, 2026

First Posted (Actual)

March 30, 2026

Study Record Updates

Last Update Posted (Actual)

March 30, 2026

Last Update Submitted That Met QC Criteria

March 24, 2026

Last Verified

March 1, 2026

More Information

Terms related to this study

Keywords

Additional Relevant MeSH Terms

Other Study ID Numbers

K10349
2024-I2M-CT-B-035 (Other Grant/Funding Number: CAMS Innovation Fund for Medical Sciences)
I-26PJ0568 (Other Identifier: Ethics Committee, Peking Union Medical College Hospital)

Plan for Individual participant data (IPD)

Plan to Share Individual Participant Data (IPD)?

YES

IPD Plan Description

The de-identified benchmark evaluation dataset, including expert-annotated breast ultrasound images with paired BI-RADS reading reports, is planned for public release to promote academic reproducibility and collaborative research.

IPD Sharing Time Frame

Within 6 months of primary publication, available indefinitely

IPD Sharing Access Criteria

Open access via a recognized data repository (to be determined)

IPD Sharing Supporting Information Type

STUDY_PROTOCOL
SAP
ANALYTIC_CODE

Drug and device information, study documents

Studies a U.S. FDA-regulated drug product

Studies a U.S. FDA-regulated device product

This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.

Clinical Trials on Breast Neoplasms

Emory University
Eisai Inc.

Terminated

Trial of Eribulin Followed by Doxorubicin & Cyclophosphamide for Her2-negative, Locally Advanced Breast Cancer

Breast Cancer | Breast Neoplasms | Breast Tumors | Neoplasms, Breast | Cancer of the Breast | Tumors, Breast

United States
Innocrin Pharmaceutical

Completed

CYP17 Lyase and Androgen Receptor Inhibitor Treatment With Seviteronel Trial (INO-VT-464-006; NCT02580448) (CLARITY-01)

Breast Cancer | Advanced Breast Cancer | Metastatic Breast Cancer | Triple Negative Breast Cancer | Male Breast Cancer | ER+ Breast Cancer | Cancer of the Breast

United States
G1 Therapeutics, Inc.

Terminated

Trilaciclib (G1T28), a CDK 4/6 Inhibitor, in Combination With Gemcitabine and Carboplatin in Metastatic Triple Negative Breast Cancer (mTNBC)

Breast Cancer | Breast Neoplasm | Triple-Negative Breast Cancer | Triple-Negative Breast Neoplasms

United States, Bulgaria, Croatia, Slovenia, Serbia, Belgium, North Macedonia, Slovakia
National Cancer Institute (NCI)

Not yet recruiting

Collection of CSF Samples From Participants With Metastatic Triple Negative Breast Cancer (TNBC) and HER2+ Breast Cancer With no Prior History Nor Active Radiographically Detectable Brain Metastases

Breast Cancer | Breast Carcinoma | Malignant Neoplasm of Breast | Cancer of the Breast

United States
University of Washington
National Cancer Institute (NCI)

Completed

Sunitinib Malate, Paclitaxel, Doxorubicin Hydrochloride, and Cyclophosphamide Before Surgery in Treating Patients With Stage IIB-IIIC Breast Cancer

Inflammatory Breast Cancer | Male Breast Cancer | Stage II Breast Cancer | Stage IIIA Breast Cancer | Stage IIIB Breast Cancer | Stage IIIC Breast Cancer

United States
Dana-Farber Cancer Institute
Incyte Corporation

Active, not recruiting

Study Of Ruxolitinib (INCB018424) With Preoperative Chemotherapy For Triple Negative Inflammatory Breast Cancer

Inflammatory Breast Cancer (IBC)

United States
Providence Health & Services
Brooklyn ImmunoTherapeutics, LLC

Completed

Pre-operative IRX-2 in Early Stage Breast Cancer (ESBC)

Breast Neoplasm | Triple Negative Breast Cancer | Breast Neoplasm, Male

United States
Massachusetts General Hospital
Massachusetts Institute of Technology

Not yet recruiting

Wearable Ultrasound Patch for Breast Imaging

Breast Cancer | Breast Asymmetry | Breast Abnormalities | Breast Lesion

United States
Joseph Baar, MD, PhD

Completed

MUC1 Vaccine for Triple-negative Breast Cancer

Breast Cancer | Stage I Breast Cancer | Inflammatory Breast Cancer | Stage II Breast Cancer | Stage IIIA Breast Cancer | Stage IIIB Breast Cancer | Triple-negative Breast Cancer | Stage IIIC Breast Cancer

United States
Xijing Hospital

Active, not recruiting

Exemption of SLNB After Neoadjuvant Therapy for Triple-negative and Her2-positive Breast Cancer

Breast Cancer | Breast Cancer (Triple Negative Breast Cancer (TNBC))

China

Clinical Trials on Multimodal AI Model Diagnostic Evaluation

Qun Zhao

Completed

Multimodal Model Predicts Recurrence (FUTURE12)

Gastric Adenocarcinoma

China
Qun Zhao

Completed

Development of a Multimodal AI System for GIST Management

Gastrointestinal Stromal Tumors | Gastric Subepithelial Tumors | Artificial Intelligence (AI) | Gastric Leiomyoma | Multimodal Imaging

China
Huazhong University of Science and Technology

Recruiting

Multicenter Observational Study of Multimodal AI for Upper GI Mesenchymal Tumor Diagnosis

Leiomyoma | Schwannoma | Gastrointestinal Stromal Tumor (GIST) | Submucosal Tumor

China
Valentina Cerrone
Federico II University; University of Salerno, Italy

Recruiting

Refining mUltiple Artificial intelliGence strateGies for Automatic Pain Assessment Investigations: RUGGI Study (RUGGI)

Chronic Pain | Neuropathic Pain | Cancer Pain | Pain Assessment

Italy
The Eye Hospital of Wenzhou Medical University

Recruiting

AI-Driven Cancer Diagnosis and Prediction With EHR

Tumor

China
The Eye Hospital of Wenzhou Medical University

Recruiting

Early Diagnosis and Prediction of Maternal and Neonatal Diseases: (EDPMND)

Pregnancy-Related and Neonatal Disorders

China
The Eye Hospital of Wenzhou Medical University

Recruiting

AI-Driven Prediction of Hospital-Acquired Infections With EHR

Hospital-acquired Infections

China
University of Illinois at Chicago

Recruiting

The SENTINL-1 Study: Evaluating Patient-Reported Outcomes of AI-Inferred Lung Cancer Risk

Lung Cancer

United States
The Eye Hospital of Wenzhou Medical University

Recruiting

Ophthalmic Multimodal AI-Assisted Medical Decision-Making

Ocular Diseases

China, Macau
Peking University Third Hospital
Qingdao Municipal Hospital; Tianjin Medical University General Hospital; The... and other collaborators

Recruiting

Application of Multimodal Large Language Model in HFpEF (MeG-HFpEF)

Heart Failure With Preserved Ejection Fraction

China
