Generalizable spelling using a speech neuroprosthesis in an individual with severe limb and vocal paralysis

Sean L Metzger, Jessie R Liu, David A Moses, Maximilian E Dougherty, Margaret P Seaton, Kaylo T Littlejohn, Josh Chartier, Gopala K Anumanchipalli, Adelyn Tu-Chan, Karunesh Ganguly, Edward F Chang

Abstract

Neuroprostheses have the potential to restore communication to people who cannot speak or type due to paralysis. However, it is unclear if silent attempts to speak can be used to control a communication neuroprosthesis. Here, we translated direct cortical signals in a clinical-trial participant (ClinicalTrials.gov; NCT03698149) with severe limb and vocal-tract paralysis into single letters to spell out full sentences in real time. We used deep-learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. "alpha" for "a"). We leveraged broad electrode coverage beyond speech-motor cortex to include supplemental control signals from hand cortex and complementary information from low- and high-frequency signal components to improve decoding accuracy. We decoded sentences using words from a 1,152-word vocabulary at a median character error rate of 6.13% and speed of 29.4 characters per minute. In offline simulations, we showed that our approach generalized to large vocabularies containing over 9,000 words (median character error rate of 8.23%). These results illustrate the clinical viability of a silently controlled speech neuroprosthesis to generate sentences from a large vocabulary through a spelling-based approach, complementing previous demonstrations of direct full-word decoding.
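For readers unfamiliar with the reported metrics, the sketch below shows how a character error rate and a characters-per-minute rate are conventionally computed from a decoded sentence and its reference. It is an illustrative example only, not the evaluation code used in the study; the function names and the plain Levenshtein implementation are assumptions.

```python
# Illustrative sketch (not the authors' code): character error rate (CER)
# as Levenshtein edit distance normalized by reference length, plus
# characters per minute from the trial duration.

def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ref: str, hyp: str) -> float:
    return 100.0 * levenshtein(ref, hyp) / len(ref)

def chars_per_minute(hyp: str, trial_seconds: float) -> float:
    return len(hyp) / (trial_seconds / 60.0)

# Hypothetical example
ref = "i am thirsty"
hyp = "i am thirty"
print(character_error_rate(ref, hyp))  # one deleted character -> ~8.3%
print(chars_per_minute(hyp, 25.0))     # ~26.4 characters per minute
```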

Conflict of interest statement

S.L.M., J.R.L., D.A.M., and E.F.C. are inventors on a pending provisional patent application that is directly relevant to the neural-decoding approach used in this work. G.K.A. and E.F.C. are inventors on patent application PCT/US2020/028926; D.A.M. and E.F.C. are inventors on patent application PCT/US2020/043706; and E.F.C. is an inventor on patent US9905239B2, all of which are broadly relevant to the neural-decoding approach in this work. The remaining authors declare no competing interests.

© 2022. The Author(s).

Figures

Fig. 1. Schematic depiction of the spelling pipeline.
a At the start of a sentence-spelling trial, the participant attempts to silently say a word to volitionally activate the speller. b Neural features (high-gamma activity and low-frequency signals) are extracted in real time from the recorded cortical data throughout the task. The features from a single electrode (electrode 0, Fig. 5a) are depicted. For visualization, the traces were smoothed with a Gaussian kernel with a standard deviation of 150 milliseconds. The microphone signal shows that there is no vocal output during the task. c The speech-detection model, consisting of a recurrent neural network (RNN) and thresholding operations, processes the neural features to detect a silent-speech attempt. Once an attempt is detected, the spelling procedure begins. d During the spelling procedure, the participant spells out the intended message through letter-decoding cycles that occur every 2.5 s. In each cycle, the participant is visually presented with a countdown and eventually a go cue. At the go cue, the participant attempts to silently say the code word representing the desired letter. e High-gamma activity and low-frequency signals are computed throughout the spelling procedure for all electrode channels and parceled into 2.5-s non-overlapping time windows. f An RNN-based letter-classification model processes each of these neural time windows to predict the probability that the participant was attempting to silently say each of the 26 possible code words or attempting to perform a hand-motor command (g). Prediction of the hand-motor command with at least 80% probability ends the spelling procedure (i). Otherwise, the predicted letter probabilities are processed by a beam-search algorithm in real time and the most likely sentence is displayed to the participant. g After the participant spells out his intended message, he attempts to squeeze his right hand to end the spelling procedure and finalize the sentence. h The neural time window associated with the hand-motor command is passed to the classification model. i If the classifier confirms that the participant attempted the hand-motor command, a neural network-based language model (DistilGPT-2) rescores valid sentences. The most likely sentence after rescoring is used as the final prediction.
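The per-cycle decoding logic described in panels d–i can be illustrated with a minimal sketch. It assumes a classifier that returns a 27-dimensional probability vector (26 NATO code words plus the attempted hand squeeze) every 2.5 s, and it greatly simplifies the vocabulary-constrained beam search: automatic space insertion and the DistilGPT-2 rescoring step are omitted, and all names are illustrative rather than the authors' implementation.

```python
# Simplified sketch of one letter-decoding cycle (illustrative only).
import numpy as np

ALPHABET = [chr(c) for c in range(ord("a"), ord("z") + 1)]
HAND_IDX = 26          # index of the attempted hand-squeeze command
HAND_THRESHOLD = 0.80  # confidence required to finalize the sentence

def valid_prefix(prefix: str, vocab: set[str]) -> bool:
    """True if some vocabulary word starts with this partial word."""
    return any(w.startswith(prefix) for w in vocab)

def update_beams(beams, letter_probs, vocab, beam_width=8):
    """Extend each beam hypothesis by one letter, keeping only
    vocabulary-consistent hypotheses with the highest cumulative
    log probability."""
    candidates = []
    for text, score in beams:
        for i, letter in enumerate(ALPHABET):
            new_text = text + letter
            last_word = new_text.split(" ")[-1]
            if not valid_prefix(last_word, vocab):
                continue
            candidates.append((new_text, score + np.log(letter_probs[i] + 1e-12)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width] or beams

def decoding_cycle(probs, beams, vocab):
    """One 2.5-s cycle: finalize on a confident hand command, otherwise
    fold the letter probabilities into the beam search."""
    if probs[HAND_IDX] >= HAND_THRESHOLD:
        return beams, True                      # sentence finished
    return update_beams(beams, probs[:26], vocab), False
```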
Fig. 2. Performance summary of the spelling system during the copy-typing task.
a Character error rates (CERs) observed during real-time sentence spelling with a language model (LM), denoted as ‘+LM (Real-time results)’, and offline simulations in which portions of the system were omitted. In the ‘Chance’ condition, sentences were created by replacing the outputs from the neural classifier with randomly generated letter probabilities without altering the remainder of the pipeline. In the ‘Only neural decoding’ condition, sentences were created by concatenating together the most likely character from each of the classifier’s predictions during a sentence trial (no whitespace characters were included). In the ‘+Vocab. constraints’ condition, the predicted letter probabilities from the neural classifier were used with a beam search that constrained the predicted character sequences to form words within the 1,152-word vocabulary. The final condition, ‘+LM (Real-time results)’, incorporates language modeling. The sentences decoded with the full system in real time exhibited lower CERs than sentences decoded in the other conditions (***P < 0.0001, P-values provided in Table S2, two-sided Wilcoxon Rank-Sum test with 6-way Holm-Bonferroni correction). b Word error rates (WERs) for real-time results and corresponding offline omission simulations from a (***P < 0.0001, P-values provided in Table S3, two-sided Wilcoxon Rank-Sum test with 6-way Holm-Bonferroni correction). c The decoded characters per minute during real-time testing. d The decoded words per minute during real-time testing. In a–d, the distribution depicted in each boxplot was computed across n = 34 real-time blocks (in each block, the participant attempted to spell between 2 and 5 sentences), and each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points that are 1.5 times the interquartile range, which are individually plotted). e Number of excess characters in each decoded sentence. f Example sentence-spelling trials with decoded sentences from each non-chance condition. Incorrect letters are colored red. Superscripts 1 and 2 denote the correct target sentence for the two decoded sentences with errors. All other example sentences did not contain any errors. Data to recreate panels a–e are provided as a Source Data file.
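Word error rate (panel b) is the word-level analogue of character error rate: the edit distance is computed over word tokens and normalized by the number of words in the reference sentence. A minimal sketch, again illustrative rather than the study's evaluation code:

```python
# Illustrative sketch: word error rate (WER) as word-level edit distance
# normalized by the number of words in the reference sentence.

def word_error_rate(ref: str, hyp: str) -> float:
    ref_words, hyp_words = ref.split(), hyp.split()
    # Dynamic-programming edit distance over word tokens.
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 100.0 * prev[-1] / len(ref_words)

# Hypothetical example: one substituted word out of four -> 25% WER.
print(word_error_rate("i am very hungry", "i am very angry"))
```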
Fig. 3. Characterization of high-gamma activity (HGA) and low-frequency signals (LFS) during silent-speech attempts.
a 10-fold cross-validated classification accuracy on silently attempted NATO code words when using HGA alone, LFS alone, and both HGA+LFS simultaneously. Classification accuracy using only LFS is significantly higher than using only HGA, and using both HGA+LFS results in significantly higher accuracy than either feature type alone (**P = 4.71 × 10−4, z = 3.78 for each comparison, two-sided Wilcoxon Rank-Sum test with 3-way Holm-Bonferroni correction). Chance accuracy is 3.7%. Each boxplot corresponds to n = 10 cross-validation folds (which are also plotted as dots) and depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points that are 1.5 times the interquartile range). b–e Electrode contributions. Electrodes that appear larger and more opaque provide more important features to the classification model. b, c show contributions associated with HGA features using a model trained on HGA alone (b) vs. using the combined LFS + HGA feature set (c). d, e show contributions associated with LFS features using a model trained on LFS alone (d) vs. the combined LFS + HGA feature set (e). f Histogram of the minimum number of principal components (PCs) required to explain more than 80% of the total variance, denoted as σ², in the spatial dimension for each feature set over 100 bootstrap iterations. The number of PCs required was significantly different for each feature set (***P < 0.0001, P-values provided in Table S5, two-sided Wilcoxon Rank-Sum test with 3-way Holm-Bonferroni correction). g Histogram of the minimum number of PCs required to explain more than 80% of the variance in the temporal dimension for each feature set over 100 bootstrap iterations (***P < 0.0001, *P < 0.01; P-values provided in Table S6; two-sided Wilcoxon Rank-Sum tests with 3-way Holm-Bonferroni correction). h Effect of temporal smoothing on classification accuracy. Each point represents the median, and error bars represent the 99% confidence interval around bootstrapped estimations of the median. Data to recreate all panels are provided as a Source Data file.
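Panels f and g summarize dimensionality as the minimum number of principal components needed to explain more than 80% of the variance, estimated over bootstrap resamples. A minimal sketch of that computation, with array shapes and the use of scikit-learn assumed for illustration:

```python
# Sketch (assumed, not the authors' code): minimum number of principal
# components needed to explain more than 80% of the variance, with a
# bootstrap over resampled trials.
import numpy as np
from sklearn.decomposition import PCA

def min_components_for_variance(X: np.ndarray, threshold: float = 0.80) -> int:
    """X: (samples, features) matrix, e.g. trials x electrodes for the
    spatial dimension or trials x time points for the temporal dimension."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

def bootstrap_min_components(X, n_iters=100, threshold=0.80, seed=0):
    """Repeat the estimate over bootstrap resamples of the trials."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_iters):
        idx = rng.integers(0, len(X), size=len(X))
        counts.append(min_components_for_variance(X[idx], threshold))
    return np.array(counts)
```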
Fig. 4. Comparison of neural signals during attempts to silently say English letters and NATO code words.
a Classification accuracy (across n = 10 cross-validation folds) using models trained with HGA+LFS features is significantly higher for NATO code words than for English letters (**P = 1.57 × 10−4, z = 3.78, two-sided Wilcoxon Rank-Sum test). The dotted horizontal line represents chance accuracy. b Nearest-class distance is significantly larger for NATO code words than for letters (boxplots show values across the n = 26 code words or letters; *P = 2.85 × 10−3, z = 2.98, two-sided Wilcoxon Rank-Sum test). In a, b, each data point is plotted as a dot, and each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points that are 1.5 times the interquartile range). c The nearest-class distance is greater for the majority of code words than for the corresponding letters. In b and c, nearest-class distances are computed as the Frobenius norm between trial-averaged HGA+LFS features. Data to recreate all panels are provided as a Source Data file.
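A minimal sketch of the nearest-class-distance measure used in panels b and c: the Frobenius norm between trial-averaged feature matrices, with each class's nearest neighbor taken as its distance. The dictionary layout and array shapes are illustrative assumptions.

```python
# Sketch (assumed layout): nearest-class distance from trial-averaged
# HGA+LFS feature matrices using the Frobenius norm.
import numpy as np

def nearest_class_distances(class_trials: dict[str, np.ndarray]) -> dict[str, float]:
    """class_trials maps each code word (or letter) to an array of shape
    (n_trials, n_electrodes, n_timepoints) of HGA+LFS features."""
    means = {c: trials.mean(axis=0) for c, trials in class_trials.items()}
    distances = {}
    for c, mean_c in means.items():
        others = [np.linalg.norm(mean_c - mean_o)   # Frobenius norm for 2D arrays
                  for o, mean_o in means.items() if o != c]
        distances[c] = min(others)
    return distances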
Fig. 5. Differences in neural signals and classification performance between overt- and silent-speech attempts.
a MRI reconstruction of the participant’s brain overlaid with implanted electrode locations. The locations of the electrodes used in b and c are bolded and numbered in the overlay. b Evoked high-gamma activity (HGA) during silent (orange) and overt (green) attempts to say the NATO code word kilo. c Evoked high-gamma activity (HGA) during silent (orange) and overt (green) attempts to say the NATO code word tango. Evoked responses in b and c are aligned to the go cue, which is marked as a vertical dashed line at time 0. Each curve depicts the mean ± standard error across n = 100 speech attempts. d Code-word classification accuracy for silent- and overt-speech attempts with various model-training schemes. All comparisons revealed significant differences between the result pairs (P < 0.01, two-sided Wilcoxon Rank-Sum test with 28-way Holm-Bonferroni correction) except for those marked as ‘ns’. Each boxplot corresponds to n = 10 cross-validation folds (which are also plotted as dots) and depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points that are 1.5 times the interquartile range). Chance accuracy is 3.84%. Data to recreate all panels are provided as a Source Data file.
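The evoked responses in panels b and c are summarized as the mean ± standard error of high-gamma activity across speech attempts, aligned to the go cue. A minimal sketch, with array shapes assumed for illustration:

```python
# Sketch (assumed shapes): evoked-response mean and standard error of the
# mean (SEM) across speech attempts for one electrode.
import numpy as np

def evoked_response(hga: np.ndarray):
    """hga: (n_attempts, n_timepoints) high-gamma activity from one
    electrode, with each row aligned so that time 0 is the go cue."""
    mean = hga.mean(axis=0)
    sem = hga.std(axis=0, ddof=1) / np.sqrt(hga.shape[0])
    return mean, sem
```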
Fig. 6. The spelling approach can generalize to larger vocabularies and conversational settings.
a Simulated character error rates from the copy-typing task with different vocabularies, including the original vocabulary used during real-time decoding. b Word error rates from the corresponding simulations in a. In a and b, each boxplot corresponds to n = 34 blocks (in each of these blocks, the participant attempted to spell between two and five sentences). c Character and word error rates across the volitionally chosen responses and messages decoded in real time during the conversational task condition. Each boxplot corresponds to n = 9 blocks (in each of these blocks, the participant attempted to spell between two and four conversational responses; each dot corresponds to a single block). In a–c, each boxplot depicts the median as a center line, quartiles as bottom and top box edges, and the minimum and maximum values as whiskers (except for data points that are 1.5 times the interquartile range, which are individually plotted). d Examples of presented questions from trials of the conversational task condition (left) along with corresponding responses decoded from the participant’s brain activity (right). In the final example, the participant spelled out his intended message without being prompted with a question. Data to recreate panels a–c are provided as a Source Data file.
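Scaling the vocabulary constraint from roughly 1,000 to over 9,000 words makes a naive prefix scan of the word list expensive inside a real-time beam search. One way such a constraint could be implemented efficiently is with a prefix trie; this is an assumed optimization shown for illustration, not necessarily how the authors implemented it.

```python
# Sketch (assumed optimization): a prefix trie makes the "is this partial
# word still valid?" check proportional to the prefix length rather than
# the vocabulary size.

class PrefixTrie:
    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True          # marks a complete word

    def valid_prefix(self, prefix: str) -> bool:
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

    def is_word(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

# Hypothetical example
trie = PrefixTrie(["hello", "help", "world"])
print(trie.valid_prefix("hel"), trie.is_word("help"))  # True True
```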


Source: PubMed
