Using deep learning to assist readers during the arbitration process: a lesion-based retrospective evaluation of breast cancer screening performance

Laura Kerschke, Stefanie Weigel, Alejandro Rodriguez-Ruiz, Nico Karssemeijer, Walter Heindel

Abstract

Objectives: To evaluate if artificial intelligence (AI) can discriminate recalled benign from recalled malignant mammographic screening abnormalities to improve screening performance.

Methods: This retrospective study included 2257 full-field digital mammography screening examinations, obtained in 2011-2013, of women aged 50-69 years who were recalled after independent double-reading with arbitration for further assessment of 2289 benign lesions and 295 malignant lesions (out of 305 truly malignant lesions). A deep learning AI system was used to obtain a score (0-95) for each recalled lesion, representing the likelihood of breast cancer. The lesion-level sensitivity and the proportion of women without false-positive ratings (non-FPR) resulting under AI were estimated as a function of the classification cutoff and compared with those of human readers.
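The two metrics named in the methods can be illustrated with a minimal sketch. The data layout (tuples of woman ID, malignancy flag, and AI score), the function name `performance_at_cutoff`, and the toy values are assumptions made for illustration and are not the authors' code. A recalled lesion is treated as AI-positive when its score is at or above the cutoff, as described in the caption of Fig. 4.

```python
from typing import Iterable, Tuple

# Hypothetical layout for one recalled lesion: (woman_id, is_malignant, ai_score)
Lesion = Tuple[int, bool, int]


def performance_at_cutoff(
    lesions: Iterable[Lesion], n_true_malignant: int, cutoff: int
) -> Tuple[float, float]:
    """Return (lesion-level sensitivity, non-FPR) for one classification cutoff."""
    lesions = list(lesions)
    women = {woman for woman, _, _ in lesions}

    # True positives: recalled malignant lesions that the AI score keeps.
    true_positives = sum(
        1 for _, malignant, score in lesions if malignant and score >= cutoff
    )
    # Women with at least one recalled benign lesion that the AI score keeps.
    women_with_fp = {
        woman for woman, malignant, score in lesions
        if not malignant and score >= cutoff
    }

    # The denominator counts all truly malignant lesions, including those
    # the readers did not recall.
    sensitivity = true_positives / n_true_malignant
    non_fpr = 1 - len(women_with_fp) / len(women)
    return sensitivity, non_fpr


# Toy data: four recalled women, three truly malignant lesions, one of which
# was never recalled and therefore does not appear in the list.
toy_lesions = [
    (1, True, 80),
    (2, True, 0),
    (2, False, 40),
    (3, False, 0),
    (4, False, 30),
]

reader_like = performance_at_cutoff(toy_lesions, n_true_malignant=3, cutoff=0)
ai_filtered = performance_at_cutoff(toy_lesions, n_true_malignant=3, cutoff=1)
```

In this sketch a cutoff of 0 keeps every recalled lesion and so reproduces the readers' decisions, whereas a cutoff of 1 drops the lesions scored 0, which raises the non-FPR and lowers the sensitivity.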

Results: Using a cutoff of 1, AI decreased the proportion of women with false-positives from 89.9 to 62.0%, non-FPR 11.1% vs. 38.0% (difference 26.9%, 95% confidence interval 25.1-28.8%; p < .001), preventing 30.1% of reader-induced false-positive recalls, while reducing sensitivity from 96.7 to 91.1% (5.6%, 3.1-8.0%) as compared to human reading. The positive predictive value of recall (PPV-1) increased from 12.8 to 16.5% (3.7%, 3.5-4.0%). In women with mass-related lesions (n = 900), the non-FPR was 14.2% for humans vs. 36.7% for AI (22.4%, 19.8-25.3%) at a sensitivity of 98.5% vs. 97.1% (1.5%, 0-3.5%).
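The reported differences can be recomputed from the percentages quoted in the results. The short sketch below uses only figures stated in the abstract; the variable names are illustrative, and the differences are absolute differences in percentage points.

```python
# Percentages as reported for human reading vs. AI at a cutoff of 1.
non_fpr_human, non_fpr_ai = 11.1, 38.0
sens_human, sens_ai = 96.7, 91.1
ppv1_human, ppv1_ai = 12.8, 16.5

# Absolute differences in percentage points, rounded to one decimal place.
non_fpr_gain = round(non_fpr_ai - non_fpr_human, 1)
sensitivity_loss = round(sens_human - sens_ai, 1)
ppv1_gain = round(ppv1_ai - ppv1_human, 1)

# Reader sensitivity from the lesion counts given in the methods:
# 295 recalled malignant lesions out of 305 truly malignant lesions.
reader_sensitivity = round(100 * 295 / 305, 1)

print(non_fpr_gain, sensitivity_loss, ppv1_gain, reader_sensitivity)
```

The recomputed values agree with the reported differences of 26.9, 5.6, and 3.7 percentage points and with the reader sensitivity of 96.7%.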

Conclusion: The application of AI during consensus conference might especially help readers to reduce false-positive recalls of masses at the expense of a small sensitivity reduction. Prospective studies are needed to further evaluate the screening benefit of AI in practice.

Key points:
• Integrating the use of artificial intelligence in the arbitration process reduces benign recalls and increases the positive predictive value of recall at the expense of some sensitivity loss.
• Application of the artificial intelligence system to aid the decision to recall a woman seems particularly beneficial for masses, where the system reaches comparable sensitivity to that of the readers, but with considerably reduced false-positives.
• About one-fourth of all recalled malignant lesions are not automatically marked by the system such that their evaluation (AI score) must be retrieved manually by the reader. A thorough reading of screening mammograms by readers to identify suspicious lesions therefore remains mandatory.

Keywords: Artificial intelligence; Breast cancer; Mammography; Screening.

Conflict of interest statement

The authors of this manuscript declare relationships as project partners of the EU-funded INTERREG V A-Projekt with the following companies: ScreenPoint Medical BV, Nijmegen, The Netherlands (AR and NK are employees).

© 2021. The Author(s).

Figures

Fig. 1
Flow chart of screening examinations selected for the study. The ground truth in terms of cancer presence was determined based on histopathology and/or 24-month follow-up. Recalled malignant: malignant lesion detected by independent double reading and arbitration with recall recommendation (i.e., reader true-positive); recalled benign: benign lesion suspicious for malignancy after independent double reading with arbitration (i.e., reader false-positive); additional malignant: malignant lesion detected during assessment or 24-month interval after negative assessment, not marked for recall during consensus conference
Fig. 2
Full-field digital screening mammographic views of two breast cancer–negative women (a, b) and a breast cancer–positive woman (c, d) from the study sample. a Recalled density depicted by the right medio-lateral-oblique view of the screening mammogram. Assessment confirmed a benign focal asymmetry (reader false-positive). The software did not mark the lesion and did not display a lesion-specific score. The score was therefore evaluated as 0. b Recalled round mass, indistinct margin, located in the medial quadrants of the left breast shown in the cranio-caudal view. Assessment including minimally invasive biopsy confirmed a fibroadenoma (reader false-positive). The software did not mark the lesion and did not display a lesion-specific score (evaluated as 0). c Recalled architectural distortion located in the lateral quadrants of the right breast shown in the cranio-caudal view. Assessment confirmed an invasive breast cancer (no special type, pT1c, pN1a, cM0, G1) (reader true-positive). The software missed the invasive cancer (lesion-specific score evaluated as 0), but instead marked amorphous calcifications (not recalled by readers and therefore not included in the evaluation) related to benign changes (d). The lesion-specific score of the calcification was 42, resulting in a high overall score of 9.
Fig. 3
Distribution of recalled malignant (a) and recalled benign (b) lesions as a function of the AI score, representing the likelihood of breast cancer (0, 27, 28, …, 95).
Fig. 4
Diagnostic performance of the AI system as a function of the classification cutoff for the AI score (0, 27, 28, …, 95). For each cutoff, the x-axis displays the proportion of women with at least one false-positive (i.e., recalled benign lesion with a score ≥ cutoff) out of all (2,257) women, whereas the y-axis shows the corresponding true-positive rate (i.e., proportion of recalled malignant lesions with a score ≥ cutoff out of all (305) malignant lesions). Point coordinates corresponding to the cutoff that yields the lowest decrease in sensitivity are shown in parentheses. Since 10 malignant lesions were not detected by double-reading with arbitration, the end point (1, 1) cannot be reached.
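The curve described in this caption can be sketched by sweeping the cutoff and recording one point per value. The function name `operating_points`, the input layout, and the toy scores below are assumptions for illustration only; they do not reproduce the study data or the authors' implementation.

```python
from typing import List, Sequence, Tuple


def operating_points(
    benign: Sequence[Tuple[int, int]],   # (woman_id, ai_score) per recalled benign lesion
    malignant_scores: Sequence[int],     # ai_score per recalled malignant lesion
    n_women: int,                        # all women in the sample
    n_malignant: int,                    # all truly malignant lesions
    cutoffs: Sequence[int],
) -> List[Tuple[int, float, float]]:
    """Return (cutoff, proportion of women with a false-positive, true-positive rate)."""
    points = []
    for cutoff in cutoffs:
        # x-axis: women with at least one benign lesion scored at or above the cutoff.
        women_with_fp = {woman for woman, score in benign if score >= cutoff}
        # y-axis: recalled malignant lesions scored at or above the cutoff.
        detected = sum(1 for score in malignant_scores if score >= cutoff)
        points.append((cutoff, len(women_with_fp) / n_women, detected / n_malignant))
    return points


# Toy data: six women, four truly malignant lesions, one of which was not recalled.
toy_benign = [(1, 0), (2, 30), (3, 45), (3, 0), (4, 0)]
toy_malignant_scores = [0, 60, 90]

curve = operating_points(
    toy_benign, toy_malignant_scores, n_women=6, n_malignant=4, cutoffs=[0, 27, 40, 95]
)
for cutoff, fp_proportion, tpr in curve:
    print(cutoff, round(fp_proportion, 3), round(tpr, 3))
```

Even at a cutoff of 0, the true-positive rate in the toy data stays at 3/4 because one malignant lesion was never recalled. In the study, the same effect caps the true-positive rate at 295/305, or about 0.967.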

Source: PubMed
