Using deep learning to assist readers during the arbitration process: a lesion-based retrospective evaluation of breast cancer screening performance
Laura Kerschke, Stefanie Weigel, Alejandro Rodriguez-Ruiz, Nico Karssemeijer, Walter Heindel, Laura Kerschke, Stefanie Weigel, Alejandro Rodriguez-Ruiz, Nico Karssemeijer, Walter Heindel
Abstract
Objectives: To evaluate if artificial intelligence (AI) can discriminate recalled benign from recalled malignant mammographic screening abnormalities to improve screening performance.
Methods: A total of 2257 full-field digital mammography screening examinations, obtained 2011-2013, of women aged 50-69 years which were recalled for further assessment of 295 malignant out of 305 truly malignant lesions and 2289 benign lesions after independent double-reading with arbitration, were included in this retrospective study. A deep learning AI system was used to obtain a score (0-95) for each recalled lesion, representing the likelihood of breast cancer. The sensitivity on the lesion level and the proportion of women without false-positive ratings (non-FPR) resulting under AI were estimated as a function of the classification cutoff and compared to that of human readers.
Results: Using a cutoff of 1, AI decreased the proportion of women with false-positives from 89.9 to 62.0%, non-FPR 11.1% vs. 38.0% (difference 26.9%, 95% confidence interval 25.1-28.8%; p < .001), preventing 30.1% of reader-induced false-positive recalls, while reducing sensitivity from 96.7 to 91.1% (5.6%, 3.1-8.0%) as compared to human reading. The positive predictive value of recall (PPV-1) increased from 12.8 to 16.5% (3.7%, 3.5-4.0%). In women with mass-related lesions (n = 900), the non-FPR was 14.2% for humans vs. 36.7% for AI (22.4%, 19.8-25.3%) at a sensitivity of 98.5% vs. 97.1% (1.5%, 0-3.5%).
Conclusion: The application of AI during consensus conference might especially help readers to reduce false-positive recalls of masses at the expense of a small sensitivity reduction. Prospective studies are needed to further evaluate the screening benefit of AI in practice.
Key points: • Integrating the use of artificial intelligence in the arbitration process reduces benign recalls and increases the positive predictive value of recall at the expense of some sensitivity loss. • Application of the artificial intelligence system to aid the decision to recall a woman seems particularly beneficial for masses, where the system reaches comparable sensitivity to that of the readers, but with considerably reduced false-positives. • About one-fourth of all recalled malignant lesions are not automatically marked by the system such that their evaluation (AI score) must be retrieved manually by the reader. A thorough reading of screening mammograms by readers to identify suspicious lesions therefore remains mandatory.
Keywords: Artificial intelligence; Breast cancer; Mammography; Screening.
Conflict of interest statement
The authors of this manuscript declare relationships as project partners of the EU-funded INTERREG V A-Projekt with the following companies: ScreenPoint Medical BV, Nijmegen, The Netherlands (AR and NK are employees).
© 2021. The Author(s).
Figures
References
- Lee CI, Houssami N, Elmore JG, Buist DSM. Pathways to breast cancer screening artificial intelligence algorithm validation. Breast. 2019;52:146–149. doi: 10.1016/j.breast.2019.09.005.
- Yala A, Schuster T, Miles R, Barzilay R, Lehman C. A deep learning model to triage screening mammograms: a simulation study. Radiology. 2019;293:38–46. doi: 10.1148/radiol.2019182908.
- Rodríguez-Ruiz A, Krupinski E, Mordang JJ, et al. Detection of breast cancer with mammography: effect of an artificial intelligence support system. Radiology. 2019;290:305–314. doi: 10.1148/radiol.2018181371.
- Rodriguez-Ruiz A, Lång K, Gubern-Merida A, et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol. 2019;29:4825–4832. doi: 10.1007/s00330-019-06186-9.
- Schaffter T, Buist DSM, Lee CI, et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3:e200265. doi: 10.1001/jamanetworkopen.2020.0265.
- McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577:89–94. doi: 10.1038/s41586-019-1799-6.
- Geras KJ, Mann RM, Moy L. Artificial intelligence for mammography and digital breast tomosynthesis: current concepts and future perspectives. Radiology. 2019;293:246–259. doi: 10.1148/radiol.2019182627.
- Houssami N, Kirkpatrick-Jones G, Noguchi N, Lee CI. Artificial Intelligence (AI) for the early detection of breast cancer: a scoping review to assess AI’s potential in breast screening practice. Expert Rev Med Device. 2019;16:351–362. doi: 10.1080/17434440.2019.1610387.
- Le EPV, Wang Y, Huang Y, Hickman S, Gilbert FJ. Artificial intelligence in breast imaging. Clin Radiol. 2019;74:357–366. doi: 10.1016/j.crad.2019.02.006.
- Kim HE, Kim HH, Han BK, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. 2020;2:e138–e148. doi: 10.1016/S2589-7500(20)30003-0.
- Marmot MG, Altman DG, Cameron DA, Dewar JA, Thompson SG, Wilcox M. The benefits and harms of breast cancer screening: an independent review. Br J Cancer. 2013;108:2205–2240. doi: 10.1038/bjc.2013.177.
- Tosteson AN, Fryback DG, Hammond CS, et al. Consequences of false-positive screening mammograms. JAMA Intern Med. 2014;174:954–961. doi: 10.1001/jamainternmed.2014.981.
- Perry N, Broeders M, de Wolf C, Törnberg S, Holland R, van Karsa L. European guidelines for quality assurance in breast cancer screening and diagnosis. 4. Luxembourg: Office for Official Publications of the European Communities; 2006.
- Kooperationsgemeinschaft Mammographie (2020) Jahresbericht Evaluation 2018 Deutsches Mammographie-Screening-Programm. Available via . Accessed 19 Jan 2021.
- National Evaluation Team for Breast Cancer Screening (2014) National evaluation of breast cancer screening in the Netherlands 1990 - 2011/2012 NETB XIII. Available via . Accessed 19 Jan 2021.
- Weigel S, Heindel W, Heidinger O, Berkemeyer S, Hense HW. Digital mammography screening: association between detection rate and nuclear grade of ductal carcinoma in situ. Radiology. 2014;271:38–44. doi: 10.1148/radiol.13131498.
- Weigel S, Khil L, Hense HW, et al. Detection rates of ductal carcinoma in situ with biennial digital mammography screening: radiologic findings support pathologic model of tumor progression. Radiology. 2018;286:424–432. doi: 10.1148/radiol.2017170673.
- Rodriguez-Ruiz A, Lång K, Gubern-Merida A. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst. 2019;111:916–922. doi: 10.1093/jnci/djy222.
- Chakraborty DP. A brief history of free-response receiver operating characteristic paradigm data analysis. Acad Radiol. 2013;20:915–919. doi: 10.1016/j.acra.2013.03.001.
- Tango T. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Stat Med. 1998;17:891–908. doi: 10.1002/(SICI)1097-0258(19980430)17:8<891::AID-SIM780>;2-B.
- Kosinski AS. A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Stat Med. 2013;32:964–977. doi: 10.1002/sim.5587.
- Domingo L, Hofvind S, Hubbard RA, et al. Cross-national comparison of screening mammography accuracy measures in US, Norway, and Spain. Eur Radiol. 2016;26:2520–2528. doi: 10.1007/s00330-015-4074-8.
- Aboutalib SS, Mohamed AA, Berg WA, Zuley ML, Sumkin JH, Wu S. Deep learning to distinguish recalled but benign mammography images in breast cancer screening. Clin Cancer Res. 2018;24:5902–5909. doi: 10.1158/1078-0432.CCR-18-1115.
- Zhang QS, Zhu SC. Visual interpretability for deep learning: a survey. Front Inf Technol Electronic Eng. 2018;19:27–39. doi: 10.1631/FITEE.1700808.
Source: PubMed