Generalized EmbedSOM on quadtree-structured self-organizing maps

Miroslav Kratochvíl, Abhishek Koladiya, Jiří Vondrášek

Abstract

EmbedSOM is a simple and fast dimensionality reduction algorithm, originally developed for single-cell cytometry data analysis. We present an updated version of EmbedSOM, viewed as an algorithm for landmark-directed embedding enrichment, and demonstrate that it works well even with manifold-learning techniques other than self-organizing maps. Using this generalization, we introduce an inwards-growing variant of self-organizing maps that is designed to mitigate some previously identified deficiencies of EmbedSOM output. Finally, we measure the performance of the generalized EmbedSOM, compare several variants of the algorithm that use different landmark-generating functions, and showcase the functionality on single-cell cytometry datasets from recent studies.

Keywords: dimensionality reduction; self-organizing maps; single-cell cytometry.

Conflict of interest statement

No competing interests were disclosed.

Copyright: © 2020 Kratochvíl M et al.

Figures

Figure 1. Overview of EmbedSOM interaction with landmarks on a toy dataset.
The embedding process starts by reducing the input dataset (data flow is shown as orange arrows) to landmarks (black arrows and dots) in high-dimensional (top row) and low-dimensional space (middle row). EmbedSOM then quickly places the relatively large number of individual input points into the matching neighborhoods of the low-dimensional landmarks. Landmark-generating methods, from left: a simple grid from the SOM algorithm, a random selection of input points with 2-D topology reconstructed by t-SNE, and a GQTSOM-based grid. GQTSOM landmarks are labeled by their level in the quadtree. Visualizations from other methods (bottom row) are presented with computation time (t) for comparison. R code that produces the plots is available in the Supplementary material.
Figure 2. Performance of EmbedSOM variants compared with other dimensionality reduction methods.
Speed is reported in cells per second. EmbedSOM-based algorithms show almost perfectly linear scaling with growing dataset size, and even minor speed improvements once enough data is available to saturate the parallel computation. As expected from their asymptotic complexities, the performance of UMAP, TriMap and t-SNE decreased with additional data. t-SNE was not executed on datasets larger than 50 thousand cells because of time constraints.
Figure 3. Comparison of EmbedSOM visualizations of the Wong dataset using different landmarks.
Top row: cells embedded using 3 different landmark-generating methods, colored by the tissue of sample origin. Middle row: the same embeddings colored by major cell types. The colors used for annotation are purposefully reproduced from the article of Becht et al. to simplify comparison. Bottom row: visualizations of the low-dimensional landmark images, colored by their corresponding marker expressions.
Figure 4. Display of clusters of rare cell types in GQTSOM-based embedding.
Top left: overview of the cleaned and embedded Unen dataset, colored by expression of the main cell lineage markers. A contour based on Gaussian difference is added for easier identification of changes in cell density. Labels mark the rare cell types identified by van Unen et al. and Belkina et al.: (a) CD4+CD28–CCR7+, (b) CD4+CD28–CCR7–CD56–, (c) CD4+CD28–CCR7–CD56+, and (d) CD7+CD3–CD127–CD45RA–CD56 partial. Top right: expressions of the separate markers used for the identification. Bottom: cells color-coded by sample origin (left) and separated by disease status of the patient (right).
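
The landmark-directed placement described in the Figure 1 caption can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the EmbedSOM algorithm itself: it substitutes plain inverse-distance weighting over the k nearest landmarks for EmbedSOM's actual placement step, and the function name `embed_with_landmarks`, the parameter `k`, and the random toy data are all hypothetical. It only shows the general idea that each input point is positioned in 2-D according to the low-dimensional images of the landmarks closest to it in the high-dimensional space.

```python
import math
import random


def embed_with_landmarks(points, landmarks_hi, landmarks_lo, k=3):
    """Place each point among the 2-D images of its k nearest landmarks.

    ASSUMPTION: inverse-distance weighting is a simplified stand-in for
    the placement step of EmbedSOM; it is used here for illustration only.
    """
    embedded = []
    for p in points:
        # Distance from this point to every high-dimensional landmark.
        dists = [math.dist(p, lm) for lm in landmarks_hi]
        # Indices of the k closest landmarks.
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        # Closer landmarks get larger weights; epsilon avoids division by zero.
        weights = [1.0 / (dists[i] + 1e-9) for i in nearest]
        total = sum(weights)
        # Weighted average of the landmarks' low-dimensional positions.
        x = sum(w * landmarks_lo[i][0] for w, i in zip(weights, nearest)) / total
        y = sum(w * landmarks_lo[i][1] for w, i in zip(weights, nearest)) / total
        embedded.append((x, y))
    return embedded


# Toy data (hypothetical): 16 landmarks in 5-D, paired with a 4x4 grid in 2-D.
random.seed(0)
landmarks_hi = [[random.gauss(0, 1) for _ in range(5)] for _ in range(16)]
landmarks_lo = [(float(i), float(j)) for i in range(4) for j in range(4)]
points = [[random.gauss(0, 1) for _ in range(5)] for _ in range(1000)]

embedding = embed_with_landmarks(points, landmarks_hi, landmarks_lo)
print(len(embedding), len(embedding[0]))  # 1000 2
```

Because the landmarks are computed once and every point is then placed independently of the others, the cost of this step grows linearly with the number of points and the loop is easy to parallelize, which is consistent with the scaling behavior reported in the Figure 2 caption.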

References

    1. Kratochvíl M, Koladiya A, Balounova J, et al.: SOM-based embedding improves efficiency of high-dimensional cytometry data analysis. bioRxiv. 2019. 10.1101/496869
    2. Van Gassen S, Callebaut B, Van Helden MJ, et al.: FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry A. 2015;87(7):636–645. 10.1002/cyto.a.22625
    3. Weber LM, Robinson MD: Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry A. 2016;89(12):1084–1096. 10.1002/cyto.a.23030
    4. Rauber A, Merkl D, Dittenbach M: The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data. IEEE Trans Neural Netw. 2002;13(6):1331–1341. 10.1109/TNN.2002.804221
    5. Van Der Maaten L: Accelerating t-SNE using tree-based algorithms. J Mach Learn Res. 2014;15(1):3221–3245.
    6. Becht E, McInnes L, Healy J, et al.: Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37(1):38. 10.1038/nbt.4314
    7. Amid E, Warmuth MK: TriMap: Large-scale dimensionality reduction using triplets. 2019.
    8. Moon KR, van Dijk D, Wang Z, et al.: Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol. 2019;37(12):1482–1492. 10.1038/s41587-019-0336-3
    9. Borodin PA: Linearity of metric projections on Chebyshev subspaces in L1 and C. Mathematical Notes. 1998;63(6):717–723. 10.1007/BF02312764
    10. Ding J, Condon A, Shah SP: Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 2018;9(1):2002. 10.1038/s41467-018-04368-5
    11. Dittenbach M, Merkl D, Rauber A: The growing hierarchical self-organizing map. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, IEEE, 2000;6:15–19. 10.1109/IJCNN.2000.859366
    12. Samet H: The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR). 1984;16(2):187–260. 10.1145/356924.356930
    13. Wong MT, Ong DE, Lim FS, et al.: A High-Dimensional Atlas of Human T Cell Diversity Reveals Tissue-Specific Trafficking and Cytokine Signatures. Immunity. 2016;45(2):442–456. 10.1016/j.immuni.2016.07.007
    14. van Unen V, Li N, Molendijk I, et al.: Mass Cytometry of the Human Mucosal Immune System Identifies Tissue- and Disease-Associated Immune Subsets. Immunity. 2016;44(5):1227–1239. 10.1016/j.immuni.2016.04.014
    15. van Unen V, Höllt T, Pezzotti N, et al.: Visual analysis of mass cytometry data by hierarchical stochastic neighbour embedding reveals rare cell types. Nat Commun. 2017;8(1):1740. 10.1038/s41467-017-01689-9
    16. Belkina AC, Ciccolella CO, Anno R, et al.: Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nat Commun. 2019;10(1):5415. 10.1038/s41467-019-13055-y

Source: PubMed
