Generative Method to Discover Genetically Driven Image Biomarkers

Nematollah K Batmanghelich, Ardavan Saeedi, Michael Cho, Raul San Jose Estepar, Polina Golland

Abstract

We present a generative probabilistic approach to discovering disease subtypes determined by genetic variants. In many diseases, multiple types of pathology may present simultaneously in a patient, making quantification of the disease challenging. Our method seeks common co-occurring image and genetic patterns in a population as a way to model these two different data types jointly. We assume that each patient is a mixture of multiple disease subtypes and use the joint generative model of image and genetic markers to identify disease subtypes guided by known genetic influences. Our model is based on a variant of topic models, which uncover the latent structure in a collection of data. We derive an efficient variational inference algorithm to extract patterns of co-occurrence and to quantify the presence of heterogeneous disease processes in each patient. We evaluate the method on simulated data and illustrate its use in the context of Chronic Obstructive Pulmonary Disease (COPD) to characterize the relationship between image and genetic signatures of COPD subtypes in a large patient cohort.

Trial registration: ClinicalTrials.gov NCT00608764.

Figures

Fig. 1
Subject $s$ draws a subset of $T$ topics from $K$ population-level topics. Indices of the subject-level topics are stored in $c_{s1}, \ldots, c_{sT}$, drawn from a categorical distribution. At the subject level, indices of the supervoxels $\{z^I_{sn}\}$ and locations of minor alleles $\{z^G_{sm}\}$ are drawn from the subject-specific categorical distribution. Vector $c_s$ acts as a map from subject-specific topics to the population-level topics (i.e., $c_s(z^G_{sm})$ or $c_s(z^I_{sn})$).
Fig. 2
Simulated data results. Left: variational lower bound F(q*) for different values of (α, ω). Middle: the number of topics discovered by the model as a function of ω, averaged over α. Right: normalized mutual information between the true and the discovered topics for our method and for k-means clustering (K–M) applied to pooled data. The number of discovered topics is reported in brackets under the corresponding value of ω, with and without genetic data (w gene, w/o gene). The two variants of our method are denoted by THDP.
Fig. 3
Example simulated image data using 2D features. (a) Features from all subjects pooled into one set. Colors correspond to true topics, unavailable to the algorithm. (b) Image features for a single subject in a set. (c) Topics recovered by our algorithm (with genetic data) for the same subject based on the whole data set. (d) Topics recovered by k-means clustering applied to the pooled data in (a) (Colour figure online).
Fig. 4
The first four topics, ranked according to their proportions. Each histogram density is one topic. The values in brackets are the overall proportions computed from the posterior. The tables on the right report the top six SNPs for each topic with their estimated relative weights. We observe that the genetic signatures vary across topics.
Fig. 5
Spatial average distribution of three topics. The color indicates the posterior probability. The higher the intensity of the color, the higher the probability (Colour figure online).
Fig. 6
Left: Graphical model that represents the joint distribution. The open gray and white circles correspond to the observed and the latent random variables, respectively. The full circles represent fixed hyper-parameters. Superscript I and G denote image and genetic parts of the model respectively. Right: Update rules for the variational parameters.
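
The sampling steps described in the caption of Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the Dirichlet draws for the population-level and subject-level weights, and the choice to share one subject-specific distribution between image and genetic indices are all assumptions made for illustration, and the sketch covers only the topic indices, not the observed image features or genotypes.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw a symmetric Dirichlet sample by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_subject(rng, beta, T, n_supervoxels, n_alleles, alpha=1.0):
    """Sample topic indices for one subject (hypothetical helper).

    beta: weights over the K population-level topics (assumed given).
    T: number of subject-level topics.
    """
    K = len(beta)
    # c_s[t] is the population-level topic used as subject-level topic t.
    c_s = rng.choices(range(K), weights=beta, k=T)
    # Subject-specific weights over the T subject-level topics
    # (assumption: symmetric Dirichlet with concentration alpha).
    pi_s = sample_dirichlet(rng, alpha, T)
    # Subject-level topic index for each supervoxel and each minor allele location.
    z_img = rng.choices(range(T), weights=pi_s, k=n_supervoxels)
    z_gen = rng.choices(range(T), weights=pi_s, k=n_alleles)
    # c_s maps subject-level indices to population-level topics.
    topics_img = [c_s[z] for z in z_img]
    topics_gen = [c_s[z] for z in z_gen]
    return c_s, z_img, z_gen, topics_img, topics_gen

rng = random.Random(0)
beta = sample_dirichlet(rng, 1.0, 5)  # K = 5 population-level topics
c_s, z_img, z_gen, topics_img, topics_gen = sample_subject(
    rng, beta, T=3, n_supervoxels=10, n_alleles=4
)
print(c_s, topics_img, topics_gen)
```

Because every supervoxel and every minor allele location is routed through the same vector c_s, image and genetic observations within a subject can only be assigned to the population-level topics that the subject selected.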

Source: PubMed
