Video-based AI for beat-to-beat assessment of cardiac function

David Ouyang, Bryan He, Amirata Ghorbani, Neal Yuan, Joseph Ebinger, Curtis P Langlotz, Paul A Heidenreich, Robert A Harrington, David H Liang, Euan A Ashley, James Y Zou

Abstract

Accurate assessment of cardiac function is crucial for the diagnosis of cardiovascular disease [1], screening for cardiotoxicity [2] and decisions regarding the clinical management of patients with a critical illness [3]. However, human assessment of cardiac function focuses on a limited sampling of cardiac cycles and has considerable inter-observer variability despite years of training [4,5]. Here, to overcome this challenge, we present a video-based deep learning algorithm, EchoNet-Dynamic, that surpasses the performance of human experts in the critical tasks of segmenting the left ventricle, estimating ejection fraction and assessing cardiomyopathy. Trained on echocardiogram videos, our model accurately segments the left ventricle with a Dice similarity coefficient of 0.92, predicts ejection fraction with a mean absolute error of 4.1% and reliably classifies heart failure with reduced ejection fraction (area under the curve of 0.97). In an external dataset from another healthcare system, EchoNet-Dynamic predicts the ejection fraction with a mean absolute error of 6.0% and classifies heart failure with reduced ejection fraction with an area under the curve of 0.96. Prospective evaluation with repeated human measurements confirms that the model has variance that is comparable to or less than that of human experts. By leveraging information across multiple cardiac cycles, our model can rapidly identify subtle changes in ejection fraction, is more reproducible than human evaluation and lays the foundation for precise diagnosis of cardiovascular disease in real time. As a resource to promote further innovation, we also make publicly available a large dataset of 10,030 annotated echocardiogram videos.
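The headline metrics above (Dice similarity coefficient for segmentation, mean absolute error for EF prediction, and area under the ROC curve for classification) can all be computed with standard tooling. The following is a minimal sketch assuming NumPy arrays of per-video predictions and human labels; the variable names and the 50% EF threshold used to define heart failure with reduced ejection fraction are illustrative, not taken from the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two binary segmentation masks."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return 2.0 * intersection / (pred_mask.sum() + true_mask.sum())

# Illustrative per-video EF values (percent); real values come from model
# output and the clinical report.
pred_ef = np.array([55.0, 38.2, 62.1])
true_ef = np.array([57.0, 35.0, 60.0])

mae = np.mean(np.abs(pred_ef - true_ef))  # mean absolute error in EF points

# Heart failure with reduced EF is defined here by an EF threshold (50% for
# illustration); AUC scores the continuous prediction against that label.
labels = (true_ef < 50).astype(int)
auc = roc_auc_score(labels, -pred_ef)  # lower predicted EF => higher risk
print(f"MAE = {mae:.1f}, AUC = {auc:.2f}")
```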

Figures

Extended Data Figure 1: Hyperparameter search for spatiotemporal convolutions on the video dataset to predict ejection fraction.
Model architecture (R2+1D, the architecture selected by EchoNet-Dynamic for EF prediction; R3D; and MC3), initialization (Kinetics-400 pretrained weights, solid lines; random initial weights, dotted lines), clip length (1, 8, 16, 32, 64, 96 and all frames) and sampling period (1, 2, 4, 6 and 8) were considered. (a) When varying clip length, performance is best at 64 frames (corresponding to 1.28 seconds), and starting from pretrained weights slightly improves performance across all models. (b) When varying the sampling period, clip length was chosen to correspond to approximately 64 frames before subsampling. Performance is best at a sampling period of 2.
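To make the clip length and sampling period hyperparameters concrete, the sketch below subsamples a fixed-length clip from a video array with a given stride. The function name, shapes and padding convention are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sample_clip(video, length=64, period=2, start=0):
    """Subsample a clip from a video.

    video: array of shape (frames, height, width, channels)
    length: number of frames kept after subsampling
    period: stride between sampled frames (period=2 keeps every other frame)
    """
    needed = (length - 1) * period + 1
    if video.shape[0] < needed:
        # Pad short videos by repeating the last frame (one simple convention).
        pad = np.repeat(video[-1:], needed - video.shape[0], axis=0)
        video = np.concatenate([video, pad], axis=0)
    idx = start + period * np.arange(length)
    return video[idx]

# A 32-frame clip with period 2 spans roughly 64 raw frames before
# subsampling, matching the setting described in panel (b).
video = np.zeros((200, 112, 112, 3), dtype=np.uint8)
clip = sample_clip(video, length=32, period=2)
print(clip.shape)  # (32, 112, 112, 3)
```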
Extended Data Figure 2: Individual beat assessment of ejection fraction for each clip in the test dataset.
The left panel shows patients with low variance across beats (SD < 2.5, n = 717). Each patient video is represented by multiple points, one per beat estimate, and a line spanning 1.96 standard deviations from the mean. A greater proportion of beats fall within 5% ejection fraction of the human estimate (the shaded regions) in low-variance videos than in videos from patients with high variance.
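A minimal sketch of the split described above, assuming a mapping from each video to its per-beat EF estimates; all names and values are illustrative.

```python
import numpy as np

# beat_efs maps a video ID to the model's per-beat EF estimates (illustrative).
beat_efs = {
    "video_a": np.array([61.0, 59.5, 62.3, 60.8]),
    "video_b": np.array([44.0, 52.5, 38.9, 49.1]),
}
human_ef = {"video_a": 60.0, "video_b": 45.0}

for vid, efs in beat_efs.items():
    sd = efs.std(ddof=1)                   # beat-to-beat variability
    group = "low" if sd < 2.5 else "high"  # split used in the figure
    within_5 = np.mean(np.abs(efs - human_ef[vid]) <= 5.0)
    ci_half_width = 1.96 * sd              # the plotted line spans +/- 1.96 SD
    print(f"{vid}: SD={sd:.1f} ({group} variance), "
          f"{within_5:.0%} of beats within 5 EF points")
```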
Extended Data Figure 3: Model performance during training.
Mean squared error (MSE) loss for left ventricular ejection fraction prediction during training, on the training dataset (a) and validation dataset (b). Pixel-level cross-entropy loss for left ventricle semantic segmentation during training, on the training dataset (c) and validation dataset (d).
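Both losses named in this caption are standard. A minimal PyTorch sketch follows; the tensor shapes and batch size are assumptions, and the binary form of cross entropy is used here since the segmentation target is a single left-ventricle mask.

```python
import torch
import torch.nn.functional as F

# EF regression: one scalar prediction per clip, MSE against the reported EF.
pred_ef = torch.randn(8)   # model output for a batch of 8 clips
true_ef = torch.randn(8)   # human-reported EF for the same clips
ef_loss = F.mse_loss(pred_ef, true_ef)

# Segmentation: per-pixel logits scored against a binary left-ventricle
# tracing with pixel-level cross entropy.
logits = torch.randn(8, 1, 112, 112)                   # per-pixel logits
mask = torch.randint(0, 2, (8, 1, 112, 112)).float()   # human tracing
seg_loss = F.binary_cross_entropy_with_logits(logits, mask)

# The paper trains the EF and segmentation models separately; summing the
# losses here is only to show both in one snippet.
total = ef_loss + seg_loss
```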
Extended Data Figure 4: Relationship between clip length, speed and memory.
Hyperparameter search over model architecture (R2+1D, which is used by EchoNet-Dynamic for EF prediction; R3D; and MC3) and input video clip length (1, 8, 16, 32, 64 and 96 frames), and the impact on model processing time and memory usage.
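One way to reproduce this kind of speed/memory measurement, as a sketch assuming a CUDA-capable machine and a recent torchvision (the batch size and 112x112 input size are assumptions):

```python
import time
import torch
from torchvision.models.video import r2plus1d_18

model = r2plus1d_18().cuda().eval()

for clip_len in (1, 8, 16, 32, 64, 96):
    x = torch.randn(4, 3, clip_len, 112, 112, device="cuda")  # (N, C, T, H, W)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    mem_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"{clip_len:3d} frames: {elapsed * 1e3:6.1f} ms, {mem_gb:.2f} GB peak")
```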
Extended Data Figure 5: Variation in echocardiogram video quality and its relationship with EchoNet-Dynamic model performance (n = 1,277).
Representative video frames from each quintile of (a) mean pixel intensity and (b) standard deviation of pixel intensity are shown, with the mean absolute error of EchoNet-Dynamic's ejection fraction prediction and the Dice similarity coefficient for segmentation of the left ventricle. Boxplots represent the median as a thick line, the 25th and 75th percentiles as the lower and upper bounds of the box, and whiskers extending up to 1.5 times the interquartile range from the median.
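As a sketch of the quintile binning described here, using per-video mean pixel intensity (the values and names are illustrative):

```python
import numpy as np

# Per-video mean pixel intensity (illustrative values for 10 videos).
mean_intensity = np.random.default_rng(0).uniform(20, 80, size=10)

# Assign each video to an intensity quintile (0 = darkest fifth).
edges = np.quantile(mean_intensity, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(mean_intensity, edges)

for q in range(5):
    members = np.flatnonzero(quintile == q)
    print(f"quintile {q + 1}: videos {members}")
```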
Extended Data Figure 6: Impact of degraded image quality on model performance.
Random pixels were removed and replaced with pure black pixels to simulate ultrasound dropout. Representative video frames are shown across a range of dropout proportions. The proportion of dropout was compared with model performance, in terms of the R² of the ejection fraction prediction and the Dice similarity coefficient relative to human segmentation of the left ventricle.
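A minimal sketch of this degradation, assuming frames stored as uint8 NumPy arrays (the function name is illustrative):

```python
import numpy as np

def simulate_dropout(frame, proportion, rng=None):
    """Blacken a random fraction of pixels to mimic ultrasound dropout."""
    rng = rng or np.random.default_rng()
    degraded = frame.copy()
    mask = rng.random(frame.shape[:2]) < proportion  # per-pixel dropout mask
    degraded[mask] = 0                               # pure black pixels
    return degraded

frame = np.full((112, 112, 3), 128, dtype=np.uint8)
for p in (0.1, 0.3, 0.5):
    out = simulate_dropout(frame, p, rng=np.random.default_rng(0))
    black = np.mean(out.sum(axis=2) == 0)
    print(f"dropout {p:.0%}: {black:.0%} of pixels black")
```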
Figure 1. EchoNet-Dynamic workflow.
For each patient, EchoNet-Dynamic uses a standard apical-4-chamber view echocardiogram video as input. The model first predicts the ejection fraction for each cardiac cycle using spatiotemporal convolutions with residual connections, and generates frame-level semantic segmentations of the left ventricle using weak supervision from expert human tracings. These outputs are combined to create beat-by-beat predictions of ejection fraction and to predict the presence of heart failure with reduced ejection fraction.
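The EF half of this workflow is built on spatiotemporal convolutions with residual connections. torchvision ships an R(2+1)D backbone that can be repurposed for scalar regression, as in this sketch; swapping the classifier head for a one-output layer is our illustration of the idea, not necessarily the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

# R(2+1)D-18: 3D ResNet blocks with factorized spatial/temporal convolutions,
# here initialized from Kinetics-400 pretraining (requires torchvision >= 0.13).
model = r2plus1d_18(weights="KINETICS400_V1")

# Replace the 400-way Kinetics classifier with a single EF regression output.
model.fc = nn.Linear(model.fc.in_features, 1)

clip = torch.randn(1, 3, 32, 112, 112)  # (batch, channels, frames, H, W)
ef = model(clip)                        # one EF estimate per clip
print(ef.shape)                         # torch.Size([1, 1])
```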
Figure 2. Model performance.
(a) EchoNet-Dynamic's predicted EF versus reported EF on the internal test dataset from Stanford (blue, n = 1,277) and the external test dataset from Cedars-Sinai (red, n = 2,895). The blue and red lines indicate the least-squares regression between model prediction and human-calculated EF. (b) Receiver operating characteristic curves for the diagnosis of heart failure with reduced ejection fraction on the internal test dataset (blue, n = 1,277) and external test dataset (red, n = 2,895). (c) Variance of metrics of cardiac function on repeat measurement. The first four boxplots highlight clinician variation using different techniques (n = 55), and the last two boxplots show EchoNet-Dynamic's variance on input images from standard ultrasound machines (n = 55) and an ultrasound machine not previously seen by the model (n = 49). Boxplots represent the median as a thick line, the 25th and 75th percentiles as the lower and upper bounds of the box, and individual points for instances greater than 1.5 times the interquartile range from the median. (d) Weak supervision with human expert tracings of the left ventricle at end systole (ESV) and end diastole (EDV) is used to train a semantic segmentation model with input video frames throughout the cardiac cycle. (e) The Dice similarity coefficient (DSC) was calculated for each ESV/EDV frame (n = 1,277). (f) The area of the left ventricle segmentation was used to identify heart rate and bin clips for beat-to-beat evaluation of EF.
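Panel (f) turns the frame-by-frame segmentation into a beat locator: the segmented left-ventricle area oscillates with the cardiac cycle, so its local maxima mark end diastole and their spacing gives the heart rate. A sketch using SciPy follows; the area signal here is synthetic and the frame rate is assumed.

```python
import numpy as np
from scipy.signal import find_peaks

fps = 50  # frames per second of the echocardiogram video (assumed)

# Synthetic LV area signal: area peaks at end diastole, dips at end systole.
t = np.arange(0, 6, 1 / fps)
area = 40 + 10 * np.sin(2 * np.pi * 1.2 * t)  # ~72 beats per minute

# End-diastolic frames are local maxima of the segmented area; the distance
# argument enforces a minimum gap between detected beats.
peaks, _ = find_peaks(area, distance=fps // 3)
rr_frames = np.diff(peaks)                    # beat-to-beat intervals
heart_rate = 60 * fps / rr_frames.mean()
print(f"{len(peaks)} beats detected, heart rate ~{heart_rate:.0f} bpm")
```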
Figure 3. Beat-to-beat evaluation of ejection fraction.
(a) Atrial fibrillation and other arrhythmias can be identified by significant variation in the intervals between ventricular contractions. (b) Significant variation in left ventricle segmentation area was associated with higher variance in EF prediction. (c) Histogram of the standard deviation of beat-to-beat EF evaluation across the internal test videos (n = 1,277). (d) Effect of the number of sampled beats averaged for prediction. Each boxplot represents 100 random samples of a given number of beats, compared with the reported ejection fraction. Boxplots represent the median as a thick line, the 25th and 75th percentiles as the lower and upper bounds of the box, and whiskers extending up to 1.5 times the interquartile range from the median.
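Panel (d) can be mimicked by repeatedly sampling k beats from a video and averaging their EF estimates, then comparing against the reported value. A sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
beat_ef = rng.normal(55, 3, size=20)  # per-beat EF estimates, one video (illustrative)
reported_ef = 55.0

for k in (1, 2, 3, 5, 10):
    # 100 random draws of k beats, as in the figure; average each draw.
    draws = np.array([rng.choice(beat_ef, size=k, replace=False).mean()
                      for _ in range(100)])
    errors = np.abs(draws - reported_ef)
    print(f"{k:2d} beats: median |error| = {np.median(errors):.2f} EF points")
```

Averaging more beats shrinks the spread of the estimate, which is the reproducibility argument the figure makes.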
