- Clinical Trial NCT07500428
Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models (BUST-AI Bench)
Construction of a Standardized Benchmark Evaluation System for Intelligent Breast Ultrasound Image Interpretation and Systematic Performance Assessment of Multimodal Artificial Intelligence Models Based on ACR BI-RADS v2025 Criteria
This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models.
De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification.
Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility.
Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
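The AUC endpoint for benign-malignant differentiation can be illustrated with a minimal sketch. It assumes each model yields one malignancy score per image; the record does not state how such scores are obtained from the MLLMs, and the function name `auc` is illustrative only. The sketch uses the pairwise (Mann-Whitney) formulation, in which ties count as half.

```python
from typing import Sequence


def auc(malignant_scores: Sequence[float], benign_scores: Sequence[float]) -> float:
    """Probability that a random malignant image outscores a random benign one.

    Hypothetical helper: scores are assumed to be per-image malignancy scores.
    Ties contribute 0.5, as in the Mann-Whitney U statistic.
    """
    if not malignant_scores or not benign_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for m in malignant_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_scores) * len(benign_scores))
```

The quadratic loop is acceptable for an evaluation set of about 1,200 images; a rank-based implementation would be preferable for much larger sets.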
Study Overview
Status
Conditions
Intervention / Treatment
Detailed Description
Background: Breast cancer is the most prevalent malignancy among women worldwide. Ultrasound is a first-line screening modality, particularly in Asian populations with dense breast tissue where mammographic sensitivity is limited. However, ultrasound interpretation is highly operator-dependent, with substantial inter-observer variability in BI-RADS classification, especially for category 4A-4B lesions. Multimodal large language models (MLLMs) have emerged as a promising tool for medical image analysis due to their zero-shot diagnostic capability, interpretable chain-of-thought reasoning, and structured report generation. Nevertheless, there is currently no standardized benchmark for evaluating AI performance in breast ultrasound interpretation.
Study Design: Approximately 1,380 breast ultrasound images will be curated (1,200 evaluation set + 150 out-of-distribution safety test set + 30 prompt development set), encompassing three diagnostic categories: normal breast, benign lesions (BI-RADS 2-4B), and malignant lesions (BI-RADS 3-5). Two junior radiologists (<5 years of experience) and two senior radiologists (>15 years) will independently annotate images per ACR BI-RADS v2025 with arbitration by a fifth expert for discordant cases.
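The arbitration step can be sketched as follows. This is an illustration under a stated assumption: a case counts as discordant whenever the four independent readers do not all assign the same label. The record does not define discordance, and `reference_label` is a hypothetical name.

```python
from typing import Optional, Sequence


def reference_label(reader_labels: Sequence[str],
                    arbiter_label: Optional[str] = None) -> str:
    """Return the reference annotation for one image.

    Assumption (not stated in the record): any disagreement among the
    independent readers makes the case discordant and sends it to the
    fifth expert, whose label is then final.
    """
    if not reader_labels:
        raise ValueError("at least one reader label is required")
    if len(set(reader_labels)) == 1:
        return reader_labels[0]  # unanimous, no arbitration needed
    if arbiter_label is None:
        raise ValueError("discordant case requires an arbiter label")
    return arbiter_label
```

A majority-vote rule before arbitration would be an equally plausible reading of the protocol; the sketch uses the stricter unanimous rule only for simplicity.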
Diagnostic difficulty will be stratified into three tiers using cross-architecture deep learning consensus: Tier 1 (straightforward, both models correct), Tier 2 (equivocal, one correct/one incorrect), and Tier 3 (difficult, both incorrect, with senior expert validation). MLLMs will be evaluated across multiple dimensions: classification accuracy, sensitivity, specificity, F1 score, AUC, Cohen's kappa agreement with expert consensus, expected calibration error (ECE), morphological feature description accuracy, and chain-of-thought reasoning quality.
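The three-tier rule above can be sketched as a small function, assuming each baseline model emits a single predicted class per image that is compared with the reference label. The names `Tier` and `difficulty_tier` are illustrative, and the senior expert validation required for Tier 3 is a manual step that this sketch does not model.

```python
from enum import Enum


class Tier(Enum):
    STRAIGHTFORWARD = 1  # both baseline models correct
    EQUIVOCAL = 2        # exactly one baseline model correct
    DIFFICULT = 3        # both incorrect (still needs senior expert validation)


def difficulty_tier(truth: str, cnn_pred: str, transformer_pred: str) -> Tier:
    """Assign a difficulty tier from the agreement of two baseline models."""
    n_correct = int(cnn_pred == truth) + int(transformer_pred == truth)
    return {2: Tier.STRAIGHTFORWARD, 1: Tier.EQUIVOCAL, 0: Tier.DIFFICULT}[n_correct]
```

For example, `difficulty_tier("malignant", "benign", "malignant")` falls in Tier 2 because only the second model matches the reference.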
Safety Assessment: (1) Out-of-distribution rejection test using 150 non-diagnostic images (degraded images, non-breast ultrasound, other imaging modalities); (2) Temperature-stability pre-experiment across parameter settings; (3) Thinking-mode ablation comparing standard vs. chain-of-thought reasoning modes. All experiments use fixed model snapshots, system fingerprint monitoring, and complete logging for reproducibility.
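The out-of-distribution rejection rate in item (1) reduces to a simple proportion. The sketch below assumes that each model response on a non-diagnostic image has already been judged as a refusal or not; how that judgment is made is not specified in the record, and `rejection_rate` is a hypothetical name.

```python
from typing import Iterable


def rejection_rate(refused: Iterable[bool]) -> float:
    """Fraction of non-diagnostic images that the model declined to interpret.

    `refused` holds one flag per out-of-distribution image: True if the
    model refused, False if it produced a BI-RADS assessment anyway.
    """
    flags = list(refused)
    if not flags:
        raise ValueError("no out-of-distribution images supplied")
    return sum(flags) / len(flags)
```

With the planned 150-image safety set, a model that refuses 120 images would score 120 / 150 = 0.8.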
Study Type
Enrollment (Estimated)
Contacts and Locations
Study Contact
- Name: Qingli Zhu, MD
- Phone Number: +86 13621376699
- Email: zqlpumch@126.com
Study Contact Backup
- Name: Yinglan Wu, MD
- Phone Number: +86 15626121076
- Email: wuylan7@gmail.com
Study Locations
Beijing, China, 100730
- Recruiting
- Peking Union Medical College Hospital
Contact:
- Qingli Zhu, MD
- Phone Number: +86 13621376699
- Email: zqlpumch@126.com
Participation Criteria
Eligibility Criteria
Ages Eligible for Study
- Adult
- Older Adult
Accepts Healthy Volunteers
Sampling Method
Study Population
Description
Inclusion Criteria:
- B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
- Image quality adequate for clinical diagnosis with clear visualization of the region of interest
- Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
- Complete de-identification with removal of all personally identifiable information
Exclusion Criteria:
- Severely degraded image quality precluding meaningful BI-RADS assessment
- Duplicate images from the same patient (only the most representative image retained per lesion)
- Images with residual personally identifiable information after de-identification processing
- Cases with ambiguous, disputed, or unavailable pathological results
- Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging
Study Plan
How is the study designed?
Design Details
Cohorts and Interventions
| Group / Cohort | Intervention / Treatment |
|---|---|
| Normal Breast: Breast ultrasound images showing normal glandular tissue across different tissue composition types, with no focal lesions identified. Confirmed by senior radiologist review. | Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved. |
| Benign Lesion: Breast ultrasound images containing pathologically confirmed benign lesions (BI-RADS 2-4B), including fibroadenoma, cyst, lipoma, sclerosing adenosis, intraductal papilloma, and selected non-mass lesions (NML). | Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved. |
| Malignant Lesion: Breast ultrasound images containing pathologically confirmed malignant lesions (BI-RADS 3-5), including invasive ductal carcinoma, invasive lobular carcinoma, mucinous carcinoma, and selected non-mass lesions (NML). | Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved. |
What is the study measuring?
Primary Outcome Measures
| Outcome Measure | Measure Description | Time Frame |
|---|---|---|
| Diagnostic Accuracy for Pathological Diagnosis | Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score of AI models for benign-malignant classification, with histopathological diagnosis as the gold standard. | At study completion, approximately 12 months |
| BI-RADS Classification Accuracy | Overall accuracy of AI models in assigning BI-RADS categories (2, 3, 4A, 4B, 4C, 5) to breast ultrasound images, compared with expert consensus annotation as the reference standard. | At study completion, approximately 12 months |
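The metrics named in the first primary outcome follow directly from a 2x2 confusion table. The sketch below is illustrative: it treats malignant as the positive class with histopathology as the reference, and the names `Confusion` and `diagnostic_metrics` are assumptions rather than anything defined in the record.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    tp: int  # malignant called malignant
    fp: int  # benign called malignant
    tn: int  # benign called benign
    fn: int  # malignant called benign


def _ratio(num: int, den: int) -> float:
    # Return NaN rather than raising when a denominator is empty.
    return num / den if den else float("nan")


def diagnostic_metrics(c: Confusion) -> Dict[str, float]:
    """Standard benign-malignant metrics from confusion counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }
```

Note that PPV and NPV depend on the malignancy prevalence of the curated set, so values from a benchmark are not directly transferable to a screening population.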
Secondary Outcome Measures
| Outcome Measure | Measure Description | Time Frame |
|---|---|---|
| Agreement with Expert Consensus (Cohen's Kappa) | Cohen's kappa coefficient measuring agreement between each AI model's BI-RADS classification and the expert consensus annotation, reported with 95% confidence intervals. | At study completion, approximately 12 months |
| Out-of-Distribution Rejection Rate | Proportion of non-diagnostic images (degraded quality, non-breast ultrasound, other imaging modalities) correctly identified and refused by AI models, evaluating domain safety. | At study completion, approximately 12 months |
| Sensitivity, Specificity, PPV, NPV, and F1 Score | Standard diagnostic performance metrics for benign-malignant classification, reported for each AI model individually. | At study completion, approximately 12 months |
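As an illustration of the agreement measure, the following sketch computes unweighted Cohen's kappa between a model's BI-RADS categories and the expert consensus. Using the unweighted form is an assumption; the record does not say whether a weighted kappa will be used for the ordinal BI-RADS scale. The 95% confidence interval mentioned in the outcome is not computed here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(model: Sequence[str], consensus: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(model) != len(consensus) or not model:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(model)
    # Observed agreement: fraction of images with identical categories.
    observed = sum(a == b for a, b in zip(model, consensus)) / n
    # Chance agreement from the marginal category frequencies.
    freq_model, freq_cons = Counter(model), Counter(consensus)
    expected = sum(freq_model[k] * freq_cons[k] for k in freq_model) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

For instance, with model labels `["3", "4A", "4A", "5"]` and consensus labels `["3", "4A", "4B", "5"]`, observed agreement is 0.75, chance agreement is 0.25, and kappa is 0.5 / 0.75, or about 0.67.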
Collaborators and Investigators
Collaborators
Investigators
- Principal Investigator: Qingli Zhu, MD, Peking Union Medical College Hospital
Publications and helpful links
General Publications
- Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021 May;71(3):209-249. doi: 10.3322/caac.21660. Epub 2021 Feb 4.
- Bi WL, Hosny A, Schabath MB, Giger ML, Birkbak NJ, Mehrtash A, Allison T, Arnaout O, Abbosh C, Dunn IF, Mak RH, Tamimi RM, Tempany CM, Swanton C, Hoffmann U, Schwartz LH, Gillies RJ, Huang RY, Aerts HJWL. Artificial intelligence in cancer imaging: Clinical challenges and applications. CA Cancer J Clin. 2019 Mar;69(2):127-157. doi: 10.3322/caac.21552. Epub 2019 Feb 5.
- Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, Ghassemi M, Liu X, Reitsma JB, van Smeden M, Boulesteix AL, Camaradou JC, Celi LA, Denaxas S, Denniston AK, Glocker B, Golub RM, Harvey H, Heinze G, Hoffman MM, Kengne AP, Lam E, Lee N, Loder EW, Maier-Hein L, Mateen BA, McCradden MD, Oakden-Rayner L, Ordish J, Parnell R, Rose S, Singh K, Wynants L, Logullo P. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024 Apr 16;385:e078378. doi: 10.1136/bmj-2023-078378.
- Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, Sigler C, Knodler M, Keller U, Beule D, Keilholz U, Leser U, Rieke DT. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open. 2023 Nov 1;6(11):e2343689. doi: 10.1001/jamanetworkopen.2023.43689.
- Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology. 2023 Jun;307(5):e230582. doi: 10.1148/radiol.230582. Epub 2023 May 16.
- Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, Loffler CML, Schwarzkopf SC, Unger M, Veldhuizen GP, Wagner SJ, Kather JN. The future landscape of large language models in medicine. Commun Med (Lond). 2023 Oct 10;3(1):141. doi: 10.1038/s43856-023-00370-1.
- Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021 Dec;27(12):2176-2182. doi: 10.1038/s41591-021-01595-0. Epub 2021 Dec 10.
- Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, Rajpurkar P. Foundation models for generalist medical artificial intelligence. Nature. 2023 Apr;616(7956):259-265. doi: 10.1038/s41586-023-05881-4. Epub 2023 Apr 12.
- Miaojiao S, Xia L, Xian Tao Z, Zhi Liang H, Sheng C, Songsong W. Using a Large Language Model for Breast Imaging Reporting and Data System Classification and Malignancy Prediction to Enhance Breast Ultrasound Diagnosis: Retrospective Study. JMIR Med Inform. 2025 Jun 11;13:e70924. doi: 10.2196/70924.
- Jiao J, Zhou J, Li X, Xia M, Huang Y, Huang L, Wang N, Zhang X, Zhou S, Wang Y, Guo Y. USFM: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Med Image Anal. 2024 Aug;96:103202. doi: 10.1016/j.media.2024.103202. Epub 2024 May 15.
- Xiang H, Wang X, Xu M, Zhang Y, Zeng S, Li C, Liu L, Deng T, Tang G, Yan C, Ou J, Lin Q, He J, Sun P, Li A, Chen H, Heng PA, Lin X. Deep Learning-assisted Diagnosis of Breast Lesions on US Images: A Multivendor, Multicenter Study. Radiol Artif Intell. 2023 Jul 12;5(5):e220185. doi: 10.1148/ryai.220185. eCollection 2023 Sep.
- Kottlors J, Iuga AI, Bluethgen C, Bressem K, Kather JN, Moy L, Wald C, Wang W, Liu T, Ranschaert E, Dratsch T, Kleesiek J, Gertz RJ, Rajpurkar P, Bedayat A, Fink MA, Zeeck A, Chaudhari A, Alkasab T, Wu H, Nensa F, Wang B, Grosse Hokamp N, Laukamp KR, Persigehl T, Maintz D, Truhn D, Lennartz S. Guidelines for Reporting Studies on Large Language Models in Radiology: An International Delphi Expert Survey. Radiology. 2026 Feb;318(2):e250913. doi: 10.1148/radiol.250913.
Study record dates
Study Major Dates
Study Start (Actual)
Primary Completion (Estimated)
Study Completion (Estimated)
Study Registration Dates
First Submitted
First Submitted That Met QC Criteria
First Posted (Actual)
Study Record Updates
Last Update Posted (Actual)
Last Update Submitted That Met QC Criteria
Last Verified
More Information
Terms related to this study
Keywords
Additional Relevant MeSH Terms
Other Study ID Numbers
- K10349
- 2024-I2M-CT-B-035 (Other Grant/Funding Number: CAMS Innovation Fund for Medical Sciences)
- I-26PJ0568 (Other Identifier: Ethics Committee, Peking Union Medical College Hospital)
Plan for Individual participant data (IPD)
Plan to Share Individual Participant Data (IPD)?
IPD Plan Description
IPD Sharing Time Frame
IPD Sharing Access Criteria
IPD Sharing Supporting Information Type
- STUDY_PROTOCOL
- SAP
- ANALYTIC_CODE
Drug and device information, study documents
Studies a U.S. FDA-regulated drug product
Studies a U.S. FDA-regulated device product
This information was retrieved directly from the website clinicaltrials.gov without any changes. If you have any requests to change, remove or update your study details, please contact register@clinicaltrials.gov. As soon as a change is implemented on clinicaltrials.gov, this will be updated automatically on our website as well.