Cluster analysis and display of genome-wide expression patterns

M B Eisen, P T Spellman, P O Brown, D Botstein

Abstract

A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data efficiently groups together genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not currently available.

Figures

Figure 1
Clustered display of data from time course of serum stimulation of primary human fibroblasts. Experimental details are described elsewhere (11). Briefly, foreskin fibroblasts were grown in culture and were deprived of serum for 48 hr. Serum was added back and samples taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr. The final datapoint was from a separate unsynchronized sample. Data were measured by using a cDNA microarray with elements representing approximately 8,600 distinct human genes. All measurements are relative to time 0. Genes were selected for this analysis if their expression level deviated from time 0 by at least a factor of 3.0 in at least 2 time points. The dendrogram and colored image were produced as described in the text; the color scale ranges from saturated green for log ratios −3.0 and below to saturated red for log ratios 3.0 and above. Each gene is represented by a single row of colored boxes; each time point is represented by a single column. Five separate clusters are indicated by colored bars and by identical coloring of the corresponding region of the dendrogram. As described in detail in ref. , the sequence-verified named genes in these clusters contain multiple genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate–early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes. A larger version of this image, with gene names, is available at http://rana.stanford.edu/clustering/serum.html.
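The selection rule and color scale in this caption can be sketched in a few lines of Python. This is an illustration only, not the authors' software: the function names `deviates` and `log_ratio_to_rgb` are hypothetical, and the linear ramp through black at a log ratio of 0 is an assumption, since the caption states only the two saturation points.

```python
import math

def deviates(ratios, fold=3.0, min_points=2):
    """Return True if at least `min_points` ratios (relative to time 0)
    differ from 1.0 by a factor of `fold` or more, in either direction."""
    threshold = math.log(fold)
    hits = sum(1 for r in ratios if abs(math.log(r)) >= threshold)
    return hits >= min_points

def log_ratio_to_rgb(log_ratio, saturation=3.0):
    """Map a log ratio to an (r, g, b) tuple.

    Values at or beyond +/- `saturation` are fully red or fully green.
    Assumption: intensity scales linearly in between, with black at 0."""
    scaled = max(-1.0, min(1.0, log_ratio / saturation))
    level = round(255 * abs(scaled))
    return (level, 0, 0) if scaled > 0 else (0, level, 0)
```

For example, `deviates([1.0, 4.0, 0.2, 1.1])` is true because two time points differ from time 0 by more than threefold, one up and one down.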
Figure 2
Cluster analysis of combined yeast data sets. Data from separate time courses of gene expression in the yeast S. cerevisiae were combined and clustered. Data were drawn from time courses during the following processes: the cell division cycle (9) after synchronization by alpha factor arrest (ALPH; 18 time points); centrifugal elutriation (ELU; 14 time points), and with a temperature-sensitive cdc15 mutant (CDC15; 15 time points); sporulation (10) (SPO, 7 time points plus four additional samples); shock by high temperature (HT, 6 time points); reducing agents (D, 4 time points) and low temperature (C; 4 time points) (P. T. S., J. Cuoczo, C. Kaiser, P.O. B., and D. B., unpublished work); and the diauxic shift (8) (DX, 7 time points). All data were collected by using DNA microarrays with elements representing nearly all of the ORFs from the fully sequenced S. cerevisiae genome (8); all measurements were made against a time 0 reference sample except for the cell-cycle experiments, where an unsynchronized sample was used. All genes (2,467) for which functional annotation was available in the Saccharomyces Genome Database were included (12). The contribution to the gene similarity score of each sample from a given process was weighted by the inverse of the square root of the number of samples analyzed from that process. The entire clustered image is shown in A; a larger version of this image, along with dendrogram and gene names, is available at http://rana.stanford.edu/clustering/yeastall.html. Full gene names are shown for representative clusters containing functionally related genes involved in (B) spindle pole body assembly and function, (C) the proteasome, (D) mRNA splicing, (E) glycolysis, (F) the mitochondrial ribosome, (G) ATP synthesis, (H) chromatin structure, (I) the ribosome and translation, (J) DNA replication, and (K) the tricarboxylic acid cycle and respiration. 
The full-color range represents log ratios of −1.2 to 1.2 for the cell-cycle experiments, −1.5 to 1.5 for the shock experiments, −2.0 to 2.0 for the diauxic shift, and −3.0 to 3.0 for sporulation. Gene name, functional category, and specific function are from the Saccharomyces Genome Database (13). Cluster I contains 112 ribosomal protein genes, seven translation initiation or elongation factors, three tRNA synthetases, and three genes of apparently unrelated function.
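The per-process weighting described in this caption can be illustrated as follows. The caption gives the weight (the inverse of the square root of the number of samples from a process) but not how it enters the similarity score, so the weighted uncentered correlation below is an assumed form, and the names `process_weights` and `weighted_similarity` are hypothetical.

```python
import math

def process_weights(process_labels):
    """Return one weight per sample: 1 / sqrt(n), where n is the number
    of samples that share the sample's process label."""
    counts = {}
    for label in process_labels:
        counts[label] = counts.get(label, 0) + 1
    return [1.0 / math.sqrt(counts[label]) for label in process_labels]

def weighted_similarity(x, y, weights):
    """Weighted uncentered correlation of two log-ratio profiles.

    Assumed form: sum(w*x*y) / sqrt(sum(w*x*x) * sum(w*y*y))."""
    sxy = sum(w * a * b for w, a, b in zip(weights, x, y))
    sxx = sum(w * a * a for w, a in zip(weights, x))
    syy = sum(w * b * b for w, b in zip(weights, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return sxy / math.sqrt(sxx * syy)
```

With weights of this kind, a process sampled at 18 time points does not dominate one sampled at 4 simply because it contributes more columns.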
Figure 3
To demonstrate the biological origins of patterns seen in Figs. 1 and 2, data from Fig. 1 were clustered by using methods described here before and after random permutation within rows (random 1), within columns (random 2), and both (random 3).
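The three randomization controls can be sketched as below, assuming the data are held as a list of rows (genes) with one value per column (time point). The function names are hypothetical, and applying the row shuffle before the column shuffle in the combined case is an assumption; the caption does not specify an order.

```python
import random

def permute_within_rows(matrix, rng):
    """Shuffle each gene's values across time points, independently per row."""
    return [rng.sample(row, len(row)) for row in matrix]

def permute_within_columns(matrix, rng):
    """Shuffle each time point's values across genes, independently per column."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def permute_both(matrix, rng):
    """Apply the row shuffle, then the column shuffle (order is an assumption)."""
    return permute_within_columns(permute_within_rows(matrix, rng), rng)

# Usage: a fixed seed keeps the control reproducible.
rng = random.Random(0)
data = [[0.1, 1.2, 2.3], [-0.4, -1.5, -2.6], [0.7, 0.8, 0.9]]
random1 = permute_within_rows(data, rng)
random2 = permute_within_columns(data, rng)
random3 = permute_both(data, rng)
```

Each control preserves the set of values in every row (random 1), in every column (random 2), or only in the matrix as a whole (random 3), so any structure that survives clustering of the original data but not of these matrices is attributable to the arrangement of the measurements.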

Source: PubMed
