Introduction to Machine Learning for Microarray Analysis (for the non-computer scientist)
Jennifer Listgarten
October 2001

Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)
Perhaps another time:
• 'Supervised' vs. 'Unsupervised' Learning
• Neural Networks and Support Vector Machines

What is Machine Learning?
Definition: the ability of a computer to recognize patterns that occur repeatedly and to improve its performance based on past experience.

Questions for Machine Learning
• Which genes are co-regulated?
• Which genes have similar functional roles?
• Do certain gene profiles correlate with diseased patients?
• (Which genes are upregulated/downregulated?)

The Data: How to Think About It
In machine learning, each data point is a vector.
Example: Patient_X = (gene_1, gene_2, gene_3, …, gene_N), where gene_3 is the expression ratio for gene 3.
Each vector 'lives' in a high-dimensional space.
N is normally larger than 2, so we cannot visualize the data.

Our Goal
Tease out the structure of our data from the high-dimensional space in which it lives.
(Figure: patients plotted in gene-expression space, with breast cancer patients and healthy patients forming separate groups.)

Two Ways to Think of Data for Microarray Experiments
1. All genes for one patient make a vector:
   Patient_X = (gene_1, gene_2, gene_3, …, gene_N)
2. All experiments for one gene make a vector:
   Gene_X = (experiment_1, experiment_2, …, experiment_N)

Outline
• Introduction to Machine Learning
• Clustering
  – Hierarchical Clustering
  – K-means Clustering
  – Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps

Clustering Data
We want 'similar' data to group together.
Problems:
• We don't know which definition of 'similar' will extract useful information.
• Without external validation, it is difficult to tell whether a clustering is meaningful.
• How many clusters?

Similarity
• Metric: the formal name for a 'measure of similarity' between two points.
• Every clustering algorithm (hierarchical, k-means, etc.) needs to decide on a metric.
• One can argue in favour of various metrics, but there is no single correct answer.

Some Metrics
• Euclidean distance: d(X, Y) = sqrt( Σ_i (X_i − Y_i)² )
• Correlation-based: r(X, Y) = (1/N) Σ_i [ (X_i − X̄) / σ_X ] [ (Y_i − Ȳ) / σ_Y ]
• Mutual-information-based: I(X; Y) = Σ p(X, Y) log [ p(X, Y) / ( p(X) p(Y) ) ]
X and Y here are two vectors (e.g. two patients).
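These metrics can be written out directly for two expression vectors. Below is a minimal sketch in Python/NumPy (not part of the original slides); the vectors, the bin count, and the function names are illustrative, and the mutual-information estimate uses a simple histogram binning of the continuous values:

import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def correlation(x, y):
    # Pearson correlation: standardize each vector, then average the products
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return np.mean(zx * zy)

def mutual_information(x, y, bins=10):
    # Estimate p(X), p(Y), p(X, Y) by binning the continuous expression values
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero]))

# Two hypothetical patients, each a vector of log expression ratios
patient_a = np.random.randn(500)
patient_b = np.random.randn(500)
print(euclidean(patient_a, patient_b),
      correlation(patient_a, patient_b),
      mutual_information(patient_a, patient_b))

Note that the Euclidean distance is a dissimilarity (smaller is more similar), while correlation and mutual information are similarities (larger is more similar); whichever is chosen, the clustering algorithm only needs a consistent way to compare pairs of points.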
Outline
• Introduction to Machine Learning
• Clustering
  – Hierarchical Clustering
  – K-means Clustering
  – Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps

Outline of Hierarchical Clustering
1. Start by making every data point its own cluster.
2. Repeatedly combine the 'closest' clusters until only one is left.
(example to follow)

Simple Example - Step 1
Initially each datum (a high-dimensional point) is its own cluster.
(Figure: seven data points A–G, each its own cluster. Legend: data point (patient); one cluster.)

Simple Example - Step 2
Combine the two closest clusters.
(Figure: E and F merged; the dendrogram joins E and F.)

Simple Example - Step 3
Again: combine the next two closest clusters.
(Figure: B and C merged; the dendrogram now joins E–F and B–C.)

Simple Example - Steps 4 to 7
And again, until only one cluster remains.
(Figures: D joins E–F; A joins B–C; G joins D–E–F; finally the last two clusters merge. The finished dendrogram over A–G is drawn against a metric scale, so the height of each join shows how far apart the merged clusters were.)

Digression: 'distance' between clusters
Three common choices:
1. Single-linkage
2. Complete-linkage
3. Average-linkage

What we get out of HC
• A hierarchical set of clusters.
• A dendrogram showing which data points are most closely related, as defined by the metric.

What we get out of HC
• Can we tell how data points are related by looking at their horizontal positions in the dendrogram?
• We must be careful about interpreting the dendrogram - example to follow.

Notice that we can swap branches while maintaining the tree structure.
(Figures: the same dendrogram redrawn several times with branches flipped, e.g. leaf order G D E F B C A, then G D F E B C A, then B C A G D F E.)

How many ways are there to swap branches if there are N data points?
2^(N−1). For N = 100 that is roughly 10^30.

Two data points that were close together in one tree may be far apart in another.
(Figures: in one ordering G and A are far apart; in another they are adjacent.)
There is a way to help overcome the arbitrariness of the branch order: Self-Organizing Maps (SOMs), discussed later.

Lesson Learned
Be careful not to overinterpret the results of hierarchical clustering (along the horizontal axis).

What is HC used for?
• Typically, grouping genes that are co-regulated. (It could be used for grouping patients too.)
• While useful, it is a relatively simple, unsophisticated tool.
• It is more of a visualization tool than a mathematical model.
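Assuming the data are held in a NumPy array with one row per gene (or patient), hierarchical clustering and its dendrogram can be sketched with SciPy. This is illustrative only: the placeholder data, the average linkage, the correlation metric, and the cut at 4 clusters are choices of this sketch, not prescriptions from the slides:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

data = np.random.randn(50, 12)   # 50 genes, 12 experiments (placeholder data)

# Agglomerative clustering: repeatedly merge the closest pair of clusters.
Z = linkage(data, method='average', metric='correlation')

# The dendrogram records the merge order and heights; its leaf order is only
# one of the 2^(N-1) equivalent orderings, so do not overinterpret it.
tree = dendrogram(Z, no_plot=True)
print(tree['leaves'])            # one possible left-to-right leaf ordering

# Cutting the tree at a chosen level yields flat clusters.
labels = fcluster(Z, t=4, criterion='maxclust')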
Outline
• Introduction to Machine Learning
• Clustering
  – Hierarchical Clustering
  – K-means Clustering
• Principal Components Analysis
• Self-Organizing Maps

K-Means Clustering
Goal: given a desired number of clusters,
• find the cluster centres
• find out which cluster each data point belongs to
We must specify the number of clusters we want (say, 3).
(Figure legend: data point (patient); cluster centre.)

Outline of K-Means Clustering
• Step 1 - Decide how many clusters (let's say 3).
• Step 2 - Randomly choose 3 cluster centres.
• Step 3 - Choose a metric.
• Step 4 - Assign each point to the cluster it is 'closest' to (according to the metric).
• Step 5 - Recalculate the cluster centres as the means of the points belonging to each cluster.
• Step 6 - Repeat until convergence (or for a fixed number of steps, etc.).
(Figures: newly calculated cluster centres; points reassigned to clusters; centres repositioned; points reassigned again.)

Variations of K-Means
• K-Median clustering: uses the median instead of the mean to reposition each cluster centre.

Related to K-Means Clustering
• Mixture of Gaussians: clusters now have a width as well; a Gaussian probability distribution replaces the metric (there are other differences too):
  p(X | cluster k) = 1 / (2πσ_k²)^(d/2) · exp( −‖X − μ_k‖² / (2σ_k²) )
• This gives a soft partition (vs. the hard partition of K-Means). (A sketch appears after the K-Means comments below.)

Comments on K-Means Clustering
• It may not converge nicely; multiple random restarts are needed.
• The results are straightforward to read (unlike hierarchical clustering).
• It is still a relatively simple tool - not much mathematical modelling.

Comments on K-Means Clustering
• Earlier we showed random initialization.
• Hierarchical clustering can be run first to initialize the K-Means algorithm.
• This can help with convergence problems as well as speed up the algorithm.
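A minimal K-Means sketch using scikit-learn (not part of the original slides); the data array, the choice of 3 clusters, and the number of restarts are placeholders. The n_init parameter performs the multiple random restarts recommended above:

import numpy as np
from sklearn.cluster import KMeans

data = np.random.randn(100, 20)          # 100 patients, 20 genes (placeholder)

# n_init random restarts guard against bad local optima from a poor initialization.
km = KMeans(n_clusters=3, n_init=20, max_iter=300).fit(data)

print(km.cluster_centers_.shape)         # (3, 20): one centre per cluster
print(km.labels_[:10])                   # hard assignment of each point to a cluster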
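For the mixture-of-Gaussians variant, the sketch below (again scikit-learn, with placeholder data) fits spherical Gaussians - one width σ_k per cluster, as in the density above - and returns soft membership probabilities rather than hard labels:

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.randn(100, 20)          # placeholder data, as before

# 'spherical' covariance: a single variance sigma_k^2 per component,
# matching the density written above.
gmm = GaussianMixture(n_components=3, covariance_type='spherical').fit(data)

probs = gmm.predict_proba(data)          # soft partition: P(cluster k | X) per point
print(probs[0].round(2))                 # membership probabilities sum to 1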
Outline
• Introduction to Machine Learning
• Clustering
  – Hierarchical Clustering
  – K-means Clustering
  – Stability of Clusters
• Principal Components Analysis
• Self-Organizing Maps

Stability of Clusters
• Ideally, we want different (good) clustering techniques to give similar results.
• Otherwise the clustering is likely arbitrary, not modelling any true structure in the data set.

Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)

Principal Components Analysis (PCA)
• A mathematical technique to reduce the dimensionality of the data.

PCA - Dimension Reduction
(Figure: a 3D object projected to 2D in two different ways.)
• Some projections are more informative than others.
• While the projections differ, the object remains unchanged in the original space.

Why?
• Instead of clustering 20,000-dimensional data, cluster 100-dimensional data.
• Typically some dimensions are redundant.
• We might eliminate noise and get more meaningful clusters (?).
• Fewer dimensions also make it easier to initialize ML algorithms and get good results.

Simple 2D Example
One could cluster these 2D points in two dimensions (figure: two groups of points in the X-Y plane). But what if we 'squashed' them into 1D by geometric projection onto the X axis?
• The Y dimension was redundant.
• Only the X variable was needed to cluster nicely.

Another 2D Example
It is not obvious which dimension to keep now.
There is no single axis (X or Y) onto which we can project to get good separation of the data.
But if we project the data onto a combination (a linear combination) of the two dimensions, it works out nicely.

That was the Intuition Behind PCA
Outline of PCA:
• Step 1 - Find the direction that accounts for the largest amount of variation in the data set; call this E1 (the first principal component).
• Step 2 - Find the direction that is perpendicular (orthogonal/uncorrelated) to E1 and accounts for the next largest amount of variation; call this E2.
• Step 3 - Find the third-best direction, orthogonal to the first two; call this E3.
• ...
• Step N - Find the Nth such direction (if there were N dimensions to begin with).

PCA - Some Linear Algebra
Singular Value Decomposition (SVD):
  C = U Σ Vᵀ
where C is the N×N covariance matrix of the original data, the columns of U are the principal components (E1 is the first principal component, E2 the second, and so on), and Σ = diag(λ1, λ2, …, λN) holds the variance in each component.

Principal Components Analysis
• A typical dimensional reduction might be 1,000,000 → 200,000, which might retain 95% of the original information.
• The achievable reduction depends on the data set.

PCA - Useful for clustering?
• It turns out that PCA can lead to worse clustering than simply using the original data (though not necessarily).
• PCA is often used in conjunction with other techniques, such as artificial neural networks, support vector machines, etc.
• (A short code sketch of the reduce-then-cluster workflow appears after the Self-Organizing Maps section below.)

PCA - Interesting Side Note
• PCA is the basis for most of the face recognition systems currently in use (e.g. security in airports).
(Figure: principal 'face' directions.)

Outline
• Introduction to Machine Learning
• Clustering (Thoroughly)
• Principal Components Analysis (Briefly)
• Self-Organizing Maps (Briefly)

Self-Organizing Maps (SOMs)
• A way to visualize high-dimensional data in a low-dimensional space.
• With SOMs, data are commonly viewed in 1 or 2 dimensions.

Self-Organizing Maps (SOMs)
Can think of SOMs as a cross between K-Means Clustering and PCA:
• K-Means: find cluster centres.
• PCA: reduce the dimensionality of the cluster centres (i.e. impose a 'structure' on the relationship between the cluster centres).
• (Both at the same time!)

Self-Organizing Maps
Example: impose a 1D structure on the cluster centres (e.g. 5000 dimensions → 1 dimension).
(Figure: cluster centres 1–4 in the high-dimensional space mapped onto a line.)
• This imposes an ordering on the cluster centres.
• Data points from cluster 1 come before points in cluster 2, etc.; within a cluster, points are then ordered by proximity to the neighbouring clusters.
• This imposes an ordering on the data points.
What is important for our immediate interest: a SOM imposes a unique ordering on the data points.

SOMs and Hierarchical Clustering
Recall the problem of the arbitrary order of the branches in hierarchical clustering (the dendrogram)? SOMs can be used to help.

SOMs, Eisen, Cluster
Eisen's Cluster program can use the ordering from a SOM to do hierarchical clustering:
1) Run a 1D SOM on the data set.
2) Build the dendrogram using the ordering from the SOM.

Self-Organizing Maps
• This (hierarchical clustering) is not what SOMs are normally used for.
• They are mainly used for visualization, and as a first step before further analysis.
• 2D and 3D SOMs can also be used.
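As promised in the PCA section, a minimal sketch of the reduce-then-cluster workflow using scikit-learn (not part of the original slides); the data array, the 10 retained components, and the 3 clusters are illustrative choices:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

data = np.random.randn(100, 500)             # 100 patients, 500 genes (placeholder)

# Keep the first 10 principal components E1..E10.
pca = PCA(n_components=10)
reduced = pca.fit_transform(data)            # each patient is now (a1, ..., a10)

# Fraction of the original variation retained by the kept components.
print(pca.explained_variance_ratio_.sum())

# Cluster in the reduced space (may or may not beat clustering the raw data).
labels = KMeans(n_clusters=3, n_init=20).fit_predict(reduced)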
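And to make the 'K-Means plus an imposed 1D ordering' picture concrete, here is a toy 1-D SOM written directly in NumPy rather than with any particular SOM package; the sizes, learning rate, and neighbourhood width are all illustrative assumptions of this sketch:

import numpy as np

def train_1d_som(data, n_centres=4, n_iters=2000, lr=0.5, sigma=1.0):
    """Fit a 1-D chain of cluster centres to the data (toy SOM)."""
    rng = np.random.default_rng(0)
    centres = data[rng.choice(len(data), n_centres, replace=False)].copy()
    grid = np.arange(n_centres)                 # positions on the 1-D chain
    for t in range(n_iters):
        x = data[rng.integers(len(data))]
        winner = np.argmin(np.sum((centres - x) ** 2, axis=1))
        # Neighbouring centres on the chain are pulled toward x as well,
        # which is what imposes the ordering on the centres.
        neighbourhood = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))
        decay = 1.0 - t / n_iters
        centres += (lr * decay) * neighbourhood[:, None] * (x - centres)
    return centres

data = np.random.randn(200, 50)                 # placeholder expression data
centres = train_1d_som(data)

# Order the data points by their winning centre along the chain
# (a unique ordering that can seed a dendrogram's leaf order).
winners = np.argmin(((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2), axis=1)
order = np.argsort(winners, kind='stable')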
Concluding Remarks
I hope that I managed to:
1) Give some idea of what machine learning is about ('structure' in high-dimensional spaces).
2) Convey the intuition behind several techniques, and familiarity with their names.

Main Ideas of PCA
• For N-dimensional data, find N directions.
• Any one of our original data points can be represented as a linear combination of these N directions (E_i is the i-th direction):
  Patient_X = a1·E1 + a2·E2 + a3·E3 + … + aN·EN
  i.e. Patient_X ↔ (a1, a2, a3, …, aN)

Main Ideas of PCA
• Key idea: any one of our original data points can be represented extremely well using fewer than N directions:
  Patient_X ≈ a1·E1 + a2·E2 + a3·E3 + … + a(N−d)·E(N−d)
  i.e. Patient_X ↔ (a1, a2, a3, …, a(N−d))

Concluding Remarks
Understand the algorithms behind your analysis. Blind application can lead to erroneous conclusions.