Comparison of Chest Radiograph Interpretations by Artificial Intelligence Algorithm vs Radiology Residents

Joy T Wu, Ken C L Wong, Yaniv Gur, Nadeem Ansari, Alexandros Karargyris, Arjun Sharma, Michael Morris, Babak Saboury, Hassan Ahmad, Orest Boyko, Ali Syed, Ashutosh Jadhav, Hongzhi Wang, Anup Pillai, Satyananda Kashyap, Mehdi Moradi, Tanveer Syeda-Mahmood

Abstract

Importance: Chest radiography is the most common diagnostic imaging examination performed in emergency departments (EDs). Augmenting clinicians with automated preliminary read assistants could help expedite their workflows, improve accuracy, and reduce the cost of care.

Objective: To assess the performance of artificial intelligence (AI) algorithms in realistic radiology workflows by performing an objective comparative evaluation of the preliminary reads of anteroposterior (AP) frontal chest radiographs performed by an AI algorithm and radiology residents.

Design, setting, and participants: This diagnostic study included a set of 72 findings assembled by clinical experts to constitute a full-fledged preliminary read of AP frontal chest radiographs. A novel deep learning architecture was designed for an AI algorithm to estimate the findings per image. The AI algorithm was trained using a multihospital training data set of 342 126 frontal chest radiographs captured in ED and urgent care settings. The training data were labeled from their associated reports. The image-based F1 score was chosen to optimize the operating point on the receiver operating characteristic (ROC) curve so as to minimize the number of missed findings and overcalls per image read. The performance of the model was compared with that of 5 radiology residents recruited from multiple institutions in the US in an objective study in which a separate data set of 1998 AP frontal chest radiographs, representative of realistic preliminary reads in inpatient and ED settings, was drawn from a hospital source. A triple consensus with adjudication process was used to derive the ground truth labels for the study data set. The performance of the AI algorithm and the radiology residents was assessed by comparing their reads with the ground truth findings. All studies were conducted through a web-based clinical study application system. The triple consensus data set was collected between February and October 2018. The comparison study was performed between January and October 2019. Data were analyzed from October 2019 to February 2020. After the first round of reviews, further analysis of the data was performed from March to July 2020.
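The F1-based operating-point selection can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: the array shapes, the single global threshold (per-finding thresholds may instead have been tuned), the threshold grid, and the zero-F1 convention for images with empty finding sets are all assumptions.

```python
import numpy as np

def mean_image_f1(y_true, y_pred):
    """Mean F1 over images between predicted and true finding sets.

    y_true, y_pred: (n_images, n_findings) boolean arrays. An image with
    no true and no predicted findings scores 0 here (assumed convention).
    """
    tp = (y_true & y_pred).sum(axis=1)
    fp = (~y_true & y_pred).sum(axis=1)
    fn = (y_true & ~y_pred).sum(axis=1)
    # F1 = 2TP / (2TP + FP + FN); the max() guards against empty reads.
    return np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1))

def pick_operating_threshold(y_true, probs, grid=np.linspace(0.05, 0.95, 91)):
    """Sweep thresholds on validation scores and keep the one maximizing
    mean image-based F1, jointly penalizing misses and overcalls per read."""
    scores = [mean_image_f1(y_true, probs >= t) for t in grid]
    return grid[int(np.argmax(scores))]
```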

Main outcomes and measures: The learning performance of the AI algorithm was judged using the conventional ROC curve and the area under the curve (AUC) during training and field testing on the study data set. For the AI algorithm and the radiology residents, individual finding label performance was measured using the conventional measures of label-based sensitivity, specificity, and positive predictive value (PPV). In addition, agreement with the ground truth on the assignment of findings to images was measured using the pooled κ statistic. Preliminary read performance was recorded for the AI algorithm and the radiology residents using new measures of mean image-based sensitivity, specificity, and PPV, designed to record the fraction of misses and overcalls on a per-image basis. The 1-way analysis of variance F test was used to compare the means of the 2 groups (AI algorithm vs radiology residents) using the F distribution, with the null hypothesis that the groups have equal means.
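A sketch may help make the per-image measures and the pooled κ precise. The array shapes and the convention of crediting 1.0 when an image has no positive findings (sensitivity) or no calls (PPV) are assumptions; the pooled κ averages observed and chance agreement across the 72 finding labels before combining them into a single Cohen-style coefficient.

```python
import numpy as np

def image_based_metrics(y_true, y_pred):
    """Mean per-image sensitivity, PPV, and specificity.

    y_true, y_pred: (n_images, n_findings) boolean arrays. Counting per
    image records the fraction of misses and overcalls per image read."""
    tp = (y_true & y_pred).sum(axis=1).astype(float)
    fp = (~y_true & y_pred).sum(axis=1)
    fn = (y_true & ~y_pred).sum(axis=1)
    tn = (~y_true & ~y_pred).sum(axis=1)
    sens = np.where(tp + fn > 0, tp / np.maximum(tp + fn, 1), 1.0)  # assumed credit when no true findings
    ppv = np.where(tp + fp > 0, tp / np.maximum(tp + fp, 1), 1.0)   # assumed credit when no calls made
    spec = tn / np.maximum(tn + fp, 1)
    return sens.mean(), ppv.mean(), spec.mean()

def pooled_kappa(y_true, y_pred):
    """Pooled Cohen κ: average observed and chance agreement over items,
    then combine, rather than averaging per-item κ values."""
    po, pe = [], []
    for j in range(y_true.shape[1]):
        a, b = y_true[:, j], y_pred[:, j]
        po.append(np.mean(a == b))                      # observed agreement for finding j
        pa, pb = a.mean(), b.mean()
        pe.append(pa * pb + (1 - pa) * (1 - pb))        # chance agreement for finding j
    return (np.mean(po) - np.mean(pe)) / (1 - np.mean(pe))
```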

Results: The trained AI algorithm achieved a mean AUC across labels of 0.807 (weighted mean AUC, 0.841) after training. On the study data set, which had a different prevalence distribution, the mean AUC achieved was 0.772 (weighted mean AUC, 0.865). Interrater agreement with the ground truth finding labels yielded a pooled κ of 0.544 for the AI algorithm's predictions and 0.585 for the radiology residents. For preliminary read performance, the analysis of variance test was used to compare the distributions of the AI algorithm's and the radiology residents' mean image-based sensitivity, PPV, and specificity. The mean image-based sensitivity was 0.716 (95% CI, 0.704-0.729) for the AI algorithm and 0.720 (95% CI, 0.709-0.732) for the radiology residents (P = .66); the mean image-based PPV was 0.730 (95% CI, 0.718-0.742) for the AI algorithm and 0.682 (95% CI, 0.670-0.694) for the radiology residents (P < .001); and the mean image-based specificity was 0.980 (95% CI, 0.980-0.981) for the AI algorithm and 0.973 (95% CI, 0.971-0.974) for the radiology residents (P < .001).
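Mechanically, each P value above comes from comparing two groups of 1998 per-image values with an analysis of variance F test (with two groups, F equals the square of the two-sample t statistic). The sketch below uses synthetic stand-in data, not the study's reads, purely to show the mechanics.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-image sensitivities on the 1998 study images.
ai_sens = np.clip(rng.normal(0.716, 0.25, 1998), 0, 1)
res_sens = np.clip(rng.normal(0.720, 0.25, 1998), 0, 1)

f_stat, p_value = f_oneway(ai_sens, res_sens)    # one-way ANOVA on the F distribution
print(f"F = {f_stat:.2f}, P = {p_value:.2f}")    # near-equal means give a large P, as reported
```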

Conclusions and relevance: These findings suggest that it is possible to build AI algorithms that reach and exceed the mean level of performance of third-year radiology residents for a full-fledged preliminary read of AP frontal chest radiographs. This diagnostic study also found that, while the more complex findings would still benefit from expert overreads, the performance of the AI algorithm was associated with the amount of data available for training rather than with the difficulty of interpreting the finding. Integrating such AI systems into radiology workflows for preliminary interpretations has the potential to expedite existing workflows and address resource scarcity while improving overall accuracy and reducing the cost of care.

Conflict of interest statement

Conflict of Interest Disclosures: Dr Syeda-Mahmood reported having a patent pending for the AI algorithm used in this study. No other disclosures were reported.

Figures

Figure 1. Sampling of Data Distributions for Artificial Intelligence Algorithm Training and Evaluation
Two images were excluded from the comparison study data set owing to missing radiology resident annotations. The prevalence distributions of the training and study data sets differ owing to differences in the sampling process.
Figure 2. Deep Learning Network Architecture for Anteroposterior Chest Radiographs
Figure 3. Receiver Operating Characteristic Curves of Artificial Intelligence Algorithm on Study Data Set and Relative Performance
The findings shown are among the most prevalent in the modeling data set. The light blue square indicates the mean sensitivity and 1 − specificity of the radiology residents on the comparison study data set; the dark blue circle, the operating point of the artificial intelligence algorithm based on the F1 score–based threshold derived from the training data.
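For orientation only: the paper's network (Figure 2) is a custom design described in the full text, but the input/output contract it satisfies, one probability for each of the 72 findings per AP radiograph, can be shown with a generic multi-label baseline. The ResNet-18 backbone and image size below are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class FindingClassifier(nn.Module):
    """Generic multi-label baseline: one independent score per finding."""

    def __init__(self, n_findings: int = 72):
        super().__init__()
        self.backbone = models.resnet18(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_findings)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, H, W) preprocessed radiographs -> (batch, 72) logits
        return self.backbone(x)

model = FindingClassifier().eval()
with torch.no_grad():
    probs = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))  # per-finding probabilities
```

Training such a head would typically use a binary cross-entropy loss on the logits; the F1-based threshold sketched earlier then converts the resulting probabilities into the binary finding calls that the image-based measures consume.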

