SocalBSI 2008: Clustering Microarray Datasets
Sagar Damle, Ph.D. Candidate, Caltech

• Distance Metrics: measuring similarity using the Euclidean and correlation distance metrics
• Principal Components Analysis: reducing the dimensionality of microarray data
• Clustering Algorithms:
  • K-means
  • Self-Organizing Maps (SOM)
  • Hierarchical Clustering

Expression dataset
MATRIX(genes, conditions) = expression dataset
Rows: genes. Columns: conditions (timepoints or tissues).

$$
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\
x_{21} & & & & \\
x_{31} & & & & \\
\vdots & & & & \vdots \\
x_{m1} & & & \dots & x_{mn}
\end{pmatrix}
$$

The first gene vector = $(x_{11}, x_{12}, x_{13}, x_{14}, \dots, x_{1n})$
The leftmost condition vector = $(x_{11}, x_{21}, x_{31}, \dots, x_{m1})$

Similarity measures
Clustering identifies groups of genes with "similar" expression profiles.
How is similarity measured?
• Euclidean distance
• Correlation coefficient
• Others: Manhattan, Chebyshev, squared Euclidean

In an experiment with 10 conditions, the gene expression profiles for two genes X and Y would have this form:
$X = (x_1, x_2, x_3, \dots, x_{10})$
$Y = (y_1, y_2, y_3, \dots, y_{10})$

Similarity measure: Euclidean distance
For two genes measured in two conditions, $G_b = (x_1, x_2)$ and $G_a = (y_1, y_2)$:
$d(G_a, G_b) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
In general, if there are M experiments, with $X = (x_1, x_2, x_3, \dots, x_m)$ and $Y = (y_1, y_2, y_3, \dots, y_m)$:
$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$

Similarity measure: Pearson correlation coefficient
$X = (x_1, x_2, x_3, \dots, x_m)$, $Y = (y_1, y_2, y_3, \dots, y_m)$
$D = 1 - r$
$r = Z(X) \cdot Z(Y)$ (dot product of the z-scores of vectors X and Y, taken as unit vectors)
$r = |Z(X)|\,|Z(Y)| \cos\theta$
• When two unit vectors are completely correlated, r = 1 and D = 0
• When two unit vectors are uncorrelated, r = 0 and D = 1
Dot product review: http://mathworld.wolfram.com/DotProduct.html

Euclidean vs Pearson correlation
[Figure: expression profiles of Gene X and Gene Y]
• Euclidean distance: takes into account the magnitude of the expression.
• Correlation distance: insensitive to the amplitude of expression; takes into account the trend of the change.
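To make the contrast between the two measures concrete, here is a minimal Python sketch. It is an illustration only: the profiles gene_x and gene_y are invented, the z-scores use the population standard deviation, and the dot product is divided by the number of conditions, which is equivalent to scaling the z-score vectors to unit length before taking their dot product.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences across conditions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def z_scores(v):
    """Center a profile on its mean and scale by its population standard deviation."""
    m = len(v)
    mean = sum(v) / m
    sd = math.sqrt(sum((a - mean) ** 2 for a in v) / m)
    return [(a - mean) / sd for a in v]

def correlation_distance(x, y):
    """D = 1 - r, where r is the mean product of the two z-score vectors."""
    zx, zy = z_scores(x), z_scores(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(x)
    return 1.0 - r

# Hypothetical profiles: gene_y follows the same trend as gene_x at 10x the amplitude.
gene_x = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_y = [10.0, 20.0, 30.0, 20.0, 10.0]

d_euclid = euclidean_distance(gene_x, gene_y)
d_corr = correlation_distance(gene_x, gene_y)
print("Euclidean distance:", d_euclid)
print("Correlation distance:", d_corr)
```

In this sketch the Euclidean distance is large while the correlation distance is essentially zero, because the two invented profiles differ only in amplitude.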
Common trends are considered biologically relevant; the magnitude is considered less important.
[Figure panels: what Euclidean distance sees; what correlation distance sees]

Principal Components Analysis (PCA)
A method for projecting microarray data onto a reduced (2- or 3-dimensional), easily visualized space.
Definition: principal components are a set of variables, each defining a projection that captures the maximum remaining variation in a dataset while being orthogonal to (and therefore uncorrelated with) the previous principal components of the same dataset.

Example dataset: thousands of genes probed in 5 conditions.
• The expression profile of each gene is represented by the vector of its expression levels: X = (X1, X2, X3, X4, X5)
• Imagine each gene X as a point in a 5-dimensional space.
• Each direction/axis corresponds to a specific condition.
• Genes with similar profiles are close to each other in this space.
• PCA: project this dataset onto 2 dimensions, preserving as much information as possible.

PCA transformation of a microarray dataset
[Figure: visual estimation of the number of clusters in the data]
1-page tutorial on singular value decomposition (PCA): http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

Cluster analysis
Function: places genes with similar expression patterns in groups. Sometimes genes of unknown function are grouped with genes of known function. The known functions allow the investigator to form hypotheses about the functions of genes not yet characterized.
Examples:
• Identify genes important in cell cycle regulation
• Identify genes that participate in a biosynthetic pathway
• Identify genes involved in a drug response
• Identify genes involved in a disease response

Clustering the yeast cell cycle dataset
[Figure: clustering vs gene tree ordering]

How to choose the number of clusters needed to informatively partition the data
• Trial and error: try clustering with different numbers of clusters and compare the results.
• Criteria for comparison: homogeneity vs separation.
• Use PCA (Principal Component Analysis) to check visually how well the algorithm grouped the genes.
• Calculate the mean distance between all genes within a cluster (it should be small) and compare it to the distance between clusters (which should be large).

Mathematical evaluation of a clustering solution
Merits of a 'good' clustering solution:
• Homogeneity: genes inside a cluster are highly similar to each other. Measured as the average similarity between a gene and the center (average profile) of its cluster.
• Separation: genes from different clusters have low similarity to each other. Measured as the weighted average similarity between the centers of clusters.
These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.

Performance on yeast cell cycle data
698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a "blind" test.
[Figure: separation vs homogeneity for the "True" solution, CAST*, CLICK, GeneCluster and K-means]
*Ben-Dor, Shamir, Yakhini 1999

Clustering Algorithms
• K-means
• SOMs
• Hierarchical clustering

K-MEANS
1. The user sets the number of clusters, k.
2. Initialization: each gene is randomly assigned to one of the k clusters.
3. The average expression vector is calculated for each cluster (the cluster's profile).
4. Iterate over the genes:
   • For each gene, compute its similarity to the cluster profiles.
   • Move the gene to the cluster it is most similar to.
   • Recalculate the cluster profiles.
5. Score the current partition: the sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution).
6. Stop criterion: further shuffling of genes yields only minor improvement in the clustering score.

Example dataset (experiments in columns, genes in rows):

genes     0hr    1hr    2hr    3hr    4hr
gene A    0.12   1.68   0.99   1.05   1.44
gene B    0.47   1.37   1.06   0.91   1.96
gene C    1.97   0.87   1.84   0.30   1.17
gene D    1.21   1.22   1.71   1.45   1.68
gene E    0.25   0.70   0.66   0.83   1.38
gene F    0.81   0.34   1.18   1.85   1.18
gene G    1.64   0.08   1.03   0.36   1.64
gene H    1.78   1.64   1.71   1.49   0.97
gene I    0.14   0.68   0.88   1.54   0.49
gene J    1.01   0.84   0.06   1.87   1.11
gene K    0.91   1.57   1.49   0.81   1.32
gene L    1.71   1.33   0.27   1.59   0.87
gene M    1.46   0.12   1.60   0.44   0.73
gene N    0.88   1.21   1.44   1.46   1.90
..        1.15   1.30   1.16   1.07   0.23

K-means example: 4 clusters (too many?)
[Figure: mean profile and standard deviation in each condition]

Evaluating K-means
[Figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4, with misclassified genes marked]

K-means example: 3 clusters (looks right)

K-means clustering: K = 2 (too few)

SOMs (Self-Organizing Maps)
Less clustering and more data organizing.
• The user sets the number of clusters in the form of a rectangular grid (e.g., 3x2) of 'map nodes'.
• Imagine genes as points in (M-dimensional) space.
• Initialization: map nodes are randomly placed in the data space.
[Figure: genes are data points; clusters are map nodes]

SOM scheme
• Randomly choose a data point (gene).
• Find its closest map node.
• Move this map node towards the data point.
• Move the neighboring map nodes towards this point, but to a lesser extent (thinner arrows show a weaker shift).
• Iterate over the data points.
• Each successive gene profile (black dot) has less influence on the displacement of the nodes.
• Iterate through all profiles several times (10-100).
• When the positions of the cluster nodes have stabilized, assign each gene to its closest map node (cluster).

Hierarchical Clustering
Goal #1: Organize the genes in a hierarchical tree structure.
1) Initial step: each gene is regarded as a cluster with one item.
2) Find the 2 most similar clusters and merge them into a common node (red dot).
3) Merge successive nodes until all genes are contained in a single cluster.
[Tree figure: leaves g1, g2, g3, g4, g5; internal nodes {1,2}, {1,2,3} and {4,5}; root {1,2,3,4,5}]
Goal #2: Collapse branches to group genes into distinct clusters.

Which genes to cluster?
• Apply filtering prior to clustering: focus the analysis on the 'responding genes'.
• Controlled statistical tests for 'responding genes' usually yield too few genes to allow a global characterization of the response.
• Variance: filter out genes that do not vary greatly among the conditions of the experiment. Non-varying genes skew clustering results, especially when using a correlation coefficient.
• Fold change: choose genes that change by at least M-fold in at least L conditions.

Clustering: Tools
• Cluster (Eisen): hierarchical clustering. http://rana.lbl.gov/EisenSoftware.htm
• GeneCluster (Tamayo): SOM. http://bioinfo.cnio.es/wwwsomtree/
• TIGR MeV: K-means, SOM, hierarchical, QTC, CAST. http://www.tm4.org/mev.html
• Expander: CLICK, SOM, K-means, hierarchical. http://www.cs.tau.ac.il/~rshamir/expander/expander.html
• Many others (e.g. GeneSpring). http://www.agilent.com/chem/genespring

Analysis Strategy
1) Transform the dataset using PCA.
2) Cluster. Parameters to test:
   • Distance metric
   • Number of clusters
   • Separation & homogeneity
3) Assign biological meaning to clusters.

Original presentation created by Rani Elkon and posted at: http://www.tau.ac.il/lifesci/bioinfo/teaching/20022003/DNA_microarray_winter_2003.html
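As a closing illustration of the clustering step in the analysis strategy, the following is a minimal Python sketch of K-means using Euclidean distance. It rests on several assumptions that are not in the slides: the data in demo_genes are invented, an empty cluster is re-seeded with a randomly chosen gene, and the loop stops when no gene changes cluster rather than when the score improves only marginally. The function and variable names are hypothetical.

```python
import math
import random

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_profiles(genes, assign, k, rng):
    """Mean expression vector of each cluster (the cluster's profile)."""
    profiles = []
    for c in range(k):
        members = [g for g, a in zip(genes, assign) if a == c]
        if members:
            profiles.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Assumption: an empty cluster is re-seeded with a randomly chosen gene.
            profiles.append(list(rng.choice(genes)))
    return profiles

def kmeans(genes, k, max_iter=100, seed=0):
    """Return (assignment, cluster profiles, score) for a list of profiles."""
    rng = random.Random(seed)
    # Initialization: each gene is randomly assigned to one of the k clusters.
    assign = [rng.randrange(k) for _ in genes]
    for _ in range(max_iter):
        profiles = cluster_profiles(genes, assign, k, rng)
        # Move every gene to the cluster whose profile is closest to it.
        new_assign = [
            min(range(k), key=lambda c: euclidean(g, profiles[c])) for g in genes
        ]
        if new_assign == assign:  # simplified stop rule: no gene changed cluster
            break
        assign = new_assign
    # Score: sum of distances between genes and the profile of their cluster,
    # using the profiles from the last update.
    score = sum(euclidean(g, profiles[a]) for g, a in zip(genes, assign))
    return assign, profiles, score

# Hypothetical dataset: three low-expression and four high-expression genes.
demo_genes = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 4.9], [4.9, 5.1, 5.0], [5.0, 4.9, 5.1],
]
assign, profiles, score = kmeans(demo_genes, k=2)
print("assignment:", assign)
print("score:", round(score, 3))
```

Running the sketch with several values of k and comparing the scores is one way to carry out the trial-and-error comparison described earlier. On real data the result generally depends on the random starting assignment, so trying several seeds is a reasonable precaution.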