Machine Learning CPBS7711 Oct 8, 2015 Sonia Leach, PhD Assistant Professor Center for Genes, Environment, and Health National Jewish Health sonia.leach@gmail.com Center for Genes, Environment, and Health Someone once said “Artificial Intelligence = Search” so Machine Learning = ?Induction of New Knowledge from experience and ability to improve? Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?” The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E. - Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf Also interesting discussion of differences among AI, ML, Data Mining, Stats : http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai Center for Genes, Environment, and Health 2 Machine Learning • From Wikipedia: – – – – – – – – – – – 7.1 Decision tree learning 7.2 Association rule learning 7.3 Artificial neural networks 7.4 Inductive logic programming 7.5 Support vector machines 7.6 Clustering 7.7 Bayesian networks 7.8 Reinforcement learning 7.9 Representation learning 7.10 Similarity and metric learning 7.11 Sparse Dictionary Learning • From Alppaydin Intro to Mach Learn: – – – – – – – – – – – – – – – – – Supervised Learning Bayesian Decision Theory Parametric Methods Multivariate Methods Dimensionality Reduction Clustering Nonparametric Methods Decision Trees Linear Discrimination Multilayer Perceptrons Local Models Kernel Machines Bayesian Estimation Hidden Markov Models Graphical Models Combining Multiple Learners Reinforcement Learning http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf Center for Genes, Environment, and Health 3 Machine Learning (what I will cover) • Unsupervised • Supervised – – – – – – Dimensionality Reduction • PCA – Clustering • k-Means, SOM, Hierarchical – Association Set Mining – Probabilistic Graphical Models k-Nearest Neighbor Neural Nets Decision Trees/Random Forests SVMs Naïve Bayes • Issues • HMMs, Bayes Nets Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (KnowledgeBased Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….] 
R: http://cran.r-project.org/web/views/MachineLearning.html Center for Genes, Environment, and Health – Regression/Classification – Feature selection/reduction – Missing data – Boosting/bagging/jackknife – Cross validation, generalization – Model selection 4 Machine Learning • Supervised Learning – training set = both inputs and correct answers • Example: classification into predefined classes for which labeled examples are known – Amounts to optimizing an error function that measures the difference between the true answers and the answers given by the learner • Unsupervised Learning – training set = input data only • Example: grouping data into categories based on similarities among them – Relies on statistical properties of the data to extract a model of the data – Uses no error concept, but rather a model-quality criterion that should be maximized Center for Genes, Environment, and Health http://slideplayer.com/slide/4040706/ 5 Unsupervised Learning Center for Genes, Environment, and Health 6 Dimensionality Reduction: Principal Components Analysis (PCA) • Motivation: Instead of considering all variables, use a small number of linear combinations of those variables with minimum information lost http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/ • 2D data: What if we could only choose 1 of the original variables to represent the data? Choose the y-axis, since it explains more variance in the data. But the amount of variance explained by the first principal component P1 > the amount explained by Y, the best single variable. Center for Genes, Environment, and Health 7 Principal Components Analysis (PCA) • Let X=(x1,x2,…,xn) be a random vector (mean vector μ, covariance matrix Σ) – Example: X=(height, weight, GPA, credit score) • Each element of the vector has a distribution over the population (i.e. xi is a random variable) • A dataset is a set of samples from the joint distribution of X • Example: μ=(66, 179, 3.0, 687) and

Σ =  σ²ht      σht,wt    σht,gpa    σht,fica
     σht,wt    σ²wt      σwt,gpa    σwt,fica
     σht,gpa   σwt,gpa   σ²gpa      σgpa,fica
     σht,fica  σwt,fica  σgpa,fica  σ²fica

Note: Ht & Wt usually strongly correlated

Sample    Ht  Wt   GPA   FICA
Bob       73  185  3.3   610
Anna      62  105  3.7   730
Therese   69  137  2.89  717
Jacob     76  210  4.0   780

Center for Genes, Environment, and Health 8 Principal Components Analysis (PCA) • If X=(x1,x2,…,xn) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0 – Linear orthogonal transform of the original data to a new coordinate system – each component is a linear combination of the original variables • coefficients of the variables in the linear combination = Loadings • data transformed to the new coordinates = Scores – components ordered by percentage of variance explained along each new axis – number of components = minimum dimension of the input data matrix – the set of orthogonal vectors is not unique and not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above & R princomp) or singular value decomposition (SVD) (R prcomp) Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics 9 Principal Components Analysis (PCA) • If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
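Not part of the original slides: a minimal R sketch of the transformation just defined, computing Γ and Λ from the eigendecomposition of the sample covariance matrix and checking that the scores agree with princomp (up to per-component sign flips). The random matrix here is only a stand-in for a data matrix such as the grade table in the example that follows.

# PCA by eigendecomposition of the covariance matrix (sketch; X is stand-in data)
set.seed(1)
X <- matrix(rnorm(50), nrow=10, ncol=5)
S <- cov(X)                                   # sample estimate of Sigma
eig <- eigen(S)
Gamma <- eig$vectors                          # loadings (orthogonal matrix Gamma)
lambda <- eig$values                          # lambda_1 >= lambda_2 >= ... >= lambda_p
Xc <- scale(X, center=TRUE, scale=FALSE)      # X - mu
Y <- Xc %*% Gamma                             # scores: Y = (X - mu) Gamma
lambda / sum(lambda)                          # proportion of variance per component
pc <- princomp(X)
head(Y); head(pc$scores)                      # same columns up to a sign flip per component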
1 2 3 4 5 6 7 8 9 10 diffgeom complex algebra reals stats 36 58 43 36 37 62 54 50 46 52 31 42 41 40 29 76 78 69 66 81 46 56 52 56 40 12 42 38 38 28 39 46 51 54 41 30 51 54 52 32 22 32 43 28 22 9 40 47 30 24 X What if we could only choose two dimensions? Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics 10 Principal Components Analysis (PCA) • If X is a random vector (mean , covariance matrix ), then principal component transformation X Y=(X-) s.t. is orthogonal, T = is diagonal, 1 2 … p 0. 1 2 3 4 5 6 7 8 9 10 diffgeom complex algebra reals stats 36 58 43 36 37 62 54 50 46 52 31 42 41 40 29 76 78 69 66 81 46 56 52 56 40 12 42 38 38 28 39 46 51 54 41 30 51 54 52 32 22 32 43 28 22 9 40 47 30 24 X ~i EXAMPLE IN R X = read.table('pca.input',sep=" ", header=TRUE) pc = princomp(X) mu = pc$center Gamma = pc$loadings Y = pc$scores XminusMu=sweep(X,MARGIN=2,mu,FUN="-") propOfVar= pc$sdev^2/sum(pc$sdev^2) eigenVals= pc$sdev^2 Component Importance: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 30.142 7.179 5.786 4.098 3.084 Proportion of Variance 0.890 0.050 0.032 0.016 0.009 Cumulative Proportion 0.890 0.941 0.974 0.990 1.000 Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 diffgeom 0.638 0.599 -0.407 -0.112 -0.237 complex 0.372 -0.230 0.593 -0.595 -0.320 algebra 0.240 -0.371 0.645 -0.624 (loadings) reals 0.333 -0.671 -0.557 -0.234 0.271 statistics 0.535 0.414 0.404 0.615 Y(scores) [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,] Comp.1 Comp.2 Comp.3 -2.292745 5.827588 8.966977 25.846460 13.457048 -3.257987 -14.856875 4.337867 -4.057297 70.434116 -3.286077 6.423473 13.768664 -4.392701 -6.058773 -28.899236 -4.611347 4.338621 5.216449 -4.536616 -7.625423 -3.432334 -11.115805 -3.553422 -31.579207 8.354892 -2.497369 -34.205292 -4.034848 7.321199 Center for Genes, Environment, and Health Comp.4 -7.1630488 0.5344066 -2.5308172 3.9571310 -4.7551497 -2.2710490 2.2093319 -0.9908949 5.6986938 5.3113963 Comp.5 -2.2195936 0.4777994 1.4998247 0.8815369 -2.2951908 6.7118075 3.2618335 -4.1604420 -1.9742069 -2.1833687 Adapted from S-plus Guide to Statistics 11 Principal Components Analysis (PCA) 1 2 3 4 5 6 7 8 9 10 diffgeom complex algebra reals stats 36 58 43 36 37 62 54 50 46 52 31 42 41 40 29 76 78 69 66 81 46 56 52 56 40 12 42 38 38 28 39 46 51 54 41 30 51 54 52 32 22 32 43 28 22 9 40 47 30 24 X ~i Component Importance: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 30.142 7.179 5.786 4.098 3.084 Proportion of Variance 0.890 0.050 0.032 0.016 0.009 Cumulative Proportion 0.890 0.941 0.974 0.990 1.000 Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 diffgeom 0.638 0.599 -0.407 -0.112 -0.237 complex 0.372 -0.230 0.593 -0.595 -0.320 algebra 0.240 -0.371 0.645 -0.624 (loadings) reals 0.333 -0.671 -0.557 -0.234 0.271 statistics 0.535 0.414 0.404 0.615 Y(scores) [1,] [2,] [3,] [4,] [5,] [6,] [7,] [8,] [9,] [10,] Arrows for original variables: Length=PropVarExplained in 2 comps Direction=relative loadings in 2 comps ex) diffgeom largest(++,++) algebra smallest (+,-) Comp.1 Comp.2 Comp.3 -2.292745 5.827588 8.966977 25.846460 13.457048 -3.257987 -14.856875 4.337867 -4.057297 70.434116 -3.286077 6.423473 13.768664 -4.392701 -6.058773 -28.899236 -4.611347 4.338621 5.216449 -4.536616 -7.625423 -3.432334 -11.115805 -3.553422 -31.579207 8.354892 -2.497369 -34.205292 -4.034848 7.321199 Comp.4 -7.1630488 0.5344066 -2.5308172 3.9571310 -4.7551497 -2.2710490 2.2093319 -0.9908949 5.6986938 5.3113963 Comp.5 -2.2195936 0.4777994 1.4998247 0.8815369 -2.2951908 6.7118075 3.2618335 -4.1604420 
-1.9742069 -2.1833687

X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2/sum(pc$sdev^2)
eigenVals = pc$sdev^2
## Verify Y = (X-mu)*Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)
## Verify X represented by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s"); biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)

Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics 12 Principal Components Analysis (PCA) • If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

    diffgeom  complex  algebra  reals  stats
1         36       58       43     36     37
2         62       54       50     46     52
3         31       42       41     40     29
4         76       78       69     66     81
5         46       56       52     56     40
6         12       42       38     38     28
7         39       46       51     54     41
8         30       51       54     52     32
9         22       32       43     28     22
10         9       40       47     30     24

X (the data matrix). What if we could only choose two dimensions? Center for Genes, Environment, and Health Adapted from S-plus Guide to Statistics 13 Clustering • Partitioning – Must specify number of clusters – k-Means, Self-Organizing Maps (SOM/Kohonen Net) • Hierarchical Clustering – Do not need to specify number of clusters – Need to specify a distance metric and a linkage method • Other approaches – Fuzzy clustering (probabilistic membership) – Spectral clustering (using eigenvalue decomposition) Center for Genes, Environment, and Health 14 Clustering http://apandre.wordpress.com/visible-data/cluster-analysis/ Center for Genes, Environment, and Health 15 R package: mlbench: Machine Learning Benchmark Problems http://stackoverflow.com/questions/4722290/generating-synthetic-datasets Center for Genes, Environment, and Health 16 k-Means • Initialize: Select the initial k Centroids – REPEAT • Form k clusters by assigning all points to the ‘closest’ Centroid • Recompute the Centroid for each cluster – UNTIL “The Centroids don’t change or all changes are below a predefined threshold” • Initial Centroids are random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random first assignment • ‘closest’ typically defined by Euclidean distance (Voronoi diagram): distE(x,y) = distE(y,x) = sqrt( Σi=1..n (xi − yi)² ) • Prone to local optima, so typically do N random restarts and take the best solution (minimum sum of distE² to the centroids) • In practice, favors well-separated spherical clusters (a short R sketch follows the iteration figure below) Center for Genes, Environment, and Health Images from wikipedia 17 k-Means Iteration 0 Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Center for Genes, Environment, and Health Images from wikipedia http://en.wikipedia.org/wiki/K-means_clustering 18
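Not from the original slides: a minimal k-means sketch in R using the built-in kmeans() function on a synthetic two-cluster dataset from the mlbench package referenced above (the dataset, k=2, and 25 restarts are arbitrary illustrative choices). nstart plays the role of the N random restarts, and tot.withinss is the sum of squared distances to the centroids being minimized.

# k-means sketch (illustration only)
library(mlbench)
set.seed(1)
d <- mlbench.2dnormals(n=200, cl=2)          # two Gaussian clusters in 2D
km <- kmeans(d$x, centers=2, nstart=25)      # 25 random restarts, keep the best
km$tot.withinss                              # minimized sum of squared distances to centroids
plot(d$x, col=km$cluster, pch=19)
points(km$centers, pch=8, cex=2)             # final centroids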
Self-Organizing Maps (SOM) • Similar to k-Means: the goal is to assign data to the map node (analogous to a Centroid in k-Means) whose weight vector is ‘closest’ to the data-space vector (minimize distE(x,w)) • Difference: map nodes are constrained by neighborhood relationships, whereas k-Means Centroids move freely • Must input an initial topology; the map ‘stretches’ to cover the nD data in 2D, and similar data are assigned to neighboring map nodes Image from wikipedia Center for Genes, Environment, and Health 19 Self-Organizing Maps (SOM) 1. Initialization – Choose random values for the initial weight vectors wj. 2. Sampling – Draw a sample training input vector x from the input space. 3. Matching – Find the winning neuron I(x) with weight vector closest to the input vector (i.e., min distE). 4. Updating – Apply the weight update equation Δwji = η(t) Tj,I(x)(t) (xi − wji), where η(t) = learning rate at time t* and Tj,I(x)(t) = neighborhood function around the winner at time t. 5. Continuation – Keep returning to step 2 until the feature map stops changing. * Informal intro to simulated annealing, gradient descent… http://www.sciencedirect.com/science/article/pii/S0014579399005244 Center for Genes, Environment, and Health 20 Self-Organizing Maps (SOM) http://www.sciencedirect.com/science/article/pii/S0014579399005244 Center for Genes, Environment, and Health 21 Self-Organizing Maps (SOM) • Initial grids – Wrt size: 1-dimensional, 2-dimensional, 3-dimensional – Wrt structure: Rectangular, Hexagonal, Arbitrary planar http://slideplayer.com/slide/4040706/ Center for Genes, Environment, and Health http://www.cis.hut.fi/somtoolbox/documentation/grids.gif 22 Self-Organizing Maps (SOM) Example: Clustering Gene Expression Profiles http://physiolgenomics.physiology.org/content/physiolgenomics/10/2/103/F2.large.jpg Center for Genes, Environment, and Health 23 Hierarchical Clustering • Divisive – (top down) start with all points in 1 cluster, successively subdivide the ‘farthest’ points until the full tree is built • Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the ‘closest’ pair of clusters at each step until reaching the root – Requires a metric to define ‘closest’ – distance is no longer between points, but between clusters – The linkage strategy for choosing which merge to make is often based on pairwise point comparisons • Dendrogram shows the order of splits Center for Genes, Environment, and Health 24 Hierarchical Clustering http://images.slideplayer.com/11/3289326/slides/slide_7.jpg Center for Genes, Environment, and Health 25 Distance Metrics • Euclidean Distance – distance in Euclidean space: distE(x,y) = sqrt( Σi=1..n (xi − yi)² ) • Pearson Correlation – linear relationships: distP(x,y) = 1 − Σi (xi − mean(x))(yi − mean(y)) / sqrt( Σi (xi − mean(x))² · Σi (yi − mean(y))² ) • Spearman Correlation – monotonic relationships: distS(rx,ry) = 1 − Σi (rxi − mean(rx))(ryi − mean(ry)) / sqrt( Σi (rxi − mean(rx))² · Σi (ryi − mean(ry))² ), i.e. Pearson applied to the ranks rz = rank(z) • Mutual Information – non-linear relationships: distMI(x,y) = H(x,y) − MI(x,y), where MI(x,y) = H(x) + H(y) − H(x,y), H(x) = −Σx p(x) log p(x) and H(x,y) = −Σx,y p(x,y) log p(x,y) • Polyserial Correlation – correlation of a continuous vs an ordinal variable (polychoric if ordinal vs ordinal) • Hamming Distance, Jaccard, Dice (binary variables; Mab = number of positions where x=a and y=b): distH = M01 + M10; distJ = (M01 + M10) / (M01 + M10 + M11), good when a 0 gives no information; distD = (M01 + M10) / (M01 + M10 + 2·M11), like Jaccard but the matches |X∩Y| are counted twice (2*Matches) Center for Genes, Environment, and Health 26 Distance Metrics • Euclidean vs Pearson (linear) vs Spearman (monotonic) [Figure: scatterplot panels comparing pairs of profiles, annotated with their Pearson, Spearman, and Euclidean-distance values; the numbers shown are Pearson correlations] Note Pearson is invariant to slope; Pearson = 0 if the relationship is non-linear Center for Genes, Environment, and Health 27 Linkage Methods • Single Linkage argmin S,T min s∈S,t∈T dist(s,t) • Complete Linkage argmin S,T max s∈S,t∈T dist(s,t) • Average Linkage (a.k.a. group average) argmin S,T average s∈S,t∈T dist(s,t) • Centroid Linkage (people erroneously equate it with Average Linkage after the Eisen et al 1998 TreeView paper!)
– min dist(centroid(S), centroid(T)) • Ward’s Linkage (optimizes same criterion as kMeans) • UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from Lozupone lecture – assumes constant rate of evolution, average linkage, Euclidean distance Center for Genes, Environment, and Health 28 R package: mlbench: Machine Learning Benchmark Problems http://stackoverflow.com/questions/4722290/generating-synthetic-datasets Center for Genes, Environment, and Health 29 Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19.5 California 9.0 276 91 40.6 Colorado 7.9 204 78 38.7 Center for Genes, Environment, and Health Murder Assault UrbanPop Rape Comp.1 Comp.2 -0.53 0.41 -0.58 0.18 -0.27 -0.87 -0.54 -0.16 30 Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19.5 California 9.0 276 91 40.6 Colorado 7.9 204 78 38.7 Center for Genes, Environment, and Health Murder Assault UrbanPop Rape Comp.1 Comp.2 -0.53 0.41 -0.58 0.18 -0.27 -0.87 -0.54 -0.16 31 Center for Genes, Environment, and Health 32 Choosing the Number of Clusters • Rule of thumb: k= n/2 • Elbow or Knee method (bend in plot of metric) • K-means likes spherical so minimize within-cluster variation (SSE, sum dist of all points to cluster mean) or maximize between-cluster variation (dist between clusts) or both CH(K)=[B(K)/K-1]/[W(K)/(n-K)] *Calinski & Harabasz 1974 • Gap Statistic B(K) W(K) CH(K) *Tibshirani, Walther, Hasties 2001 – Calculate SSE, randomize dataset, calculate SSE rand, n times, gap= log(mean SSErand/ SSE) • Hierarchical – plot dist chosen at each merge (okay for single, complete) See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for long list of indices, NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf Center for Genes, Environment, and Health Gap(K) 33 Association Set Mining • Also known as Market Basket Analysis {milk, eggs} {butter} • Support of itemset X supp(X) = # transactions with itemset X • Confidence of rule conf(X Y) = supp(X &Y)/ supp(X) • Lift of rule (perf over assuming independent) lift(X Y) = supp(X &Y)/ (supp(X)*supp(Y)) • Want rules with max supp, conf, lift • Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html Center for Genes, Environment, and Health 34 Association Set Mining • Tables of data converted to transactions by creating binary variables for all categories for all variables (must discretize continuous, missing data okay) ID Gender Age Height (inches) Race Diagnosis CC245 Male 6 25 Caucasian Depression CC346 Male 75 60 African COPD 30 54 Asian Obesity 15 54 African CC978 CC125 Female { {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y}, {gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y}, {age_adult=Y, height_50-59=Y,race_AS=Y, diag_obes=Y}, {gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} } Center for Genes, Environment, and Health 35 Association Set Mining Example in R: arules pkg, apriori algorithm lhs {Class=2nd, Age=Child} {Class=2nd, Sex=Female, Age=Child} {Class=1st, Sex=Female} {Class=1st, Sex=Female, Age=Adult} 1 2 3 4 rhs support confidence lift => {Survived=Yes} 0.011 1.000 3.097 => {Survived=Yes} 0.006 1.000 3.096 => {Survived=Yes} 0.064 0.972 3.010 Note that rule 2 subsumed by rule 1, which has better lift (and support) – can remove redundants => 
{Survived=Yes} 0.064 0.972 3.010
…
12 {Sex=Female, Survived=Yes} => {Age=Adult} 0.143 0.918 0.966
27 {Class=2nd} => {Age=Adult} 0.118 0.915 0.963
Center for Genes, Environment, and Health 36 Probabilistic Graphical Models [Figure: the Markov model family arranged by Observability and Utility: Markov Process (MP) with state chain Xt−1 → Xt; Hidden Markov Model (HMM) adds Observability via emitted observations Ot−1, Ot; Markov Decision Process (MDP) adds Utility via actions At−1, At and utilities Ut−1, Ut; Partially Observable MDP (POMDP) has both Observability and Utility] Center for Genes, Environment, and Health 37 Hidden Markov Model • Finite set of N states X • Finite set of M observations O • Parameter set λ = (A, B, π) – Initial state distribution πi = Pr(X1 = i) – Transition probability aij = Pr(Xt=j | Xt-1 = i) – Emission probability bik = Pr(Ot=k | Xt = i) Example: N=3, M=2, π = (0.25, 0.55, 0.2)

A =      st1   st2   st3          B =      obs1   obs2
   st1   0     0.2   0.8             st1   0.1    0.9
   st2   0     0.9   0.1             st2   0.75   0.25
   st3   1.0   0     0               st3   0.5    0.5

• Given observation sequence O=O1,O2,…,On, how compute Pr(O| λ)? Center for Genes, Environment, and Health 38 Example (same model): πi = Pr(X1 = i), aij = Pr(Xt=j | Xt-1 = i), bik = Pr(Ot=k | Xt = i); N=3, M=2, π = (0.25, 0.55, 0.2), A and B as above • Probability of O is the sum over all state sequences: Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ) = ∑all X πx1 bx1o1 ax1x2 bx2o2 . . . axT-1xT bxToT • At each t there are N states to reach, so N^T possible state sequences and about 2T multiplications per sequence, meaning O(2T·N^T) operations • So 3 states, length-10 sequence = 1,180,980 operations and length 20 ≈ 1e11! • Efficient dynamic programming algo: Forward algorithm (Baum & Welch), O(N²T) Center for Genes, Environment, and Health 39 Applications in Bioinformatics • DNA – motif matching, gene matching, multiple sequence alignment • Amino Acids – domain matching, fold recognition • Microarrays/Whole Genome Sequencing – assign copy number • ChIP-chip/seq – distinct chromatin states Center for Genes, Environment, and Health 40 Bayesian Networks • Given a set of random variables, the joint probability distribution can be represented by: – Structure: Directed Acyclic Graph (DAG) • variables are nodes, absence of arcs captures conditional independencies – Parameters: Local Conditional Probability Distributions (CPDs) • conditional probability of each variable given the values of its parents in the graph • Joint Probability factors into the product of the local CPDs: Pr(X1, X2, …, Xn) = ∏i=1 to n Pr(Xi | Parents(Xi)) Center for Genes, Environment, and Health 41 Bayesian Networks • Generally can think of directed arcs as ‘causal’ (be careful!)
– If the sprinkler is on OR it is raining, then the grass will be wet: Pr(W|S,R) • If observe wet grass, can determine whether because of sprinkler or rain – Pr(R|W) and Pr(S|W) – Bayes rule = Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) • Note S and R compete to explain W: this model says sprinkler usage is (conditionally) independent of rain, but if know the grass is wet, and it is raining, then it is less likely that the sprinkler being on is the explanation for W – Pr(S|W,R) < Pr(S|W) • “explaining away” Note only need 9 parameters instead of 24=16 Center for Genes, Environment, and Health http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 42 16873470 Applications in Bioinformatics Gene regulatory networks (Friedman et al, 2000, PMID: 11108481) Determining Regulators with PRMS (Segal et al, 2002, RECOMB) Predicting clinical outcomes using expression data (Gevaert et al, 2006, PMID: 16873470) Gene Function Prediction (Troyanskaya et al, 2003, PMID: 12826619 ) Center for Genes, Environment, and Health Hanalyzer – edge scores (Leach et al, 2009, PMID: 19325874) 43 Supervised Learning Center for Genes, Environment, and Health 44 Supervised Learning • Given examples (x,y) of input features x and output variable y, learn function f(x)=y – Regression (continuous response) vs Classification (discrete response) – Dimensionality Reduction (Feature selection vs extraction) – Cross validation (Leave-One-Out vs N-Fold) – Generalization (Training set error vs Test set error) – Missing data and Imputation – Model Selection (AIC, BIC) – Boosting/bagging/jackknife – Curse of dimensionality Center for Genes, Environment, and Health 45 Supervised Learning • Boosting (weak learners on different subsets) – Train H1 on random data split, sample among H1’s predictions so next data set to train H2 has half wrong, half right in H1. Train H3 where both H1 and H2 wrong. 
Return a majority vote of H1, H2, H3 (Adaboost instead weights the examples and takes a weighted vote) • Bagging (bootstrap aggregation) – Train multiple models on random with-replacement (bootstrap) samples of the input data, average the predictions • Jackknife (vs bootstrap) – disjoint subsets of the data • Model Selection: balance goodness of fit (likelihood L) against complexity of the model (number of parameters k) for n samples – Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L) (stronger complexity penalty) – Akaike information criterion (AIC): minimize 2k − 2 ln(L) (less theory behind it than BIC) • Curse of dimensionality – the greater the dimension D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly Center for Genes, Environment, and Health 46 Decision Boundaries https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks Center for Genes, Environment, and Health 47 k-Nearest Neighbors • Store a database of (x,y) pairs, classify a new example by majority vote of its k nearest neighbors (regression if you assign the (weighted) mean y in the neighborhood) • No training needed, non-parametric, sensitive to local structure in the data, the most frequent class tends to dominate [Figure: the green query point is assigned to the red class if k=3 but to the blue class if k=5] • Curse of dimensionality: with many variables, any query is roughly equidistant to all points – reduce features first, e.g. by PCA • Allows complicated boundaries between classes Center for Genes, Environment, and Health 48 Neural Network: Linear Perceptron • Step activation function • Learning (Backpropagation): – Initialize wt, choose learning rate η – 1) Calculate prediction y*j,t = f[wt · xj] – 2) Update weights wt+1 = wt + η (yj − y*j,t) xj – Repeat 1 & 2 until (yj − y*j,t) < threshold – Can be generalized to multi-class – Optimal only if the data are linearly separable Center for Genes, Environment, and Health 49 Neural Network: Multi-Layer Perceptron • Smooth activation function instead (sigmoid, tanh) • Can also have multiple hidden layers (input layer, hidden layer(s), output layer) • Can learn when the data are not linearly separable • Learn as before, but with backpropagation of errors from the output layer Center for Genes, Environment, and Health 50 Decision Tree • A node is the attribute tested, a branch is an outcome of the test, a leaf is the (majority) class (or class probability) • Discrete: X=xi?, Real-valued: X<value? • Greedy algorithm chooses the best attribute to split upon: – pi = fraction of items labeled i in the set – Gini impurity: IG(p) = Σi≠j pi pj (prob. an item labeled i is chosen * prob. i is mistakenly assigned class j) – Information gain: IE(p) = −Σi pi log2 pi – Real-valued response: SSE [Figure: example tree for BIOPSY+ splitting on Rx SIDE EFFECT and BREATH >90% / <30%, with Died/Alive counts at each leaf] • EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles (a small rpart sketch follows below) Center for Genes, Environment, and Health 51
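Not from the original slides: a minimal rpart sketch (rpart is the decision-tree package listed on the "Examples in R" slide later in the deck; the iris dataset and settings here are arbitrary illustrative choices) showing a greedy, Gini-based tree fit and its splits.

# decision tree sketch (illustration only)
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "gini"))    # greedy splits chosen by Gini impurity
print(fit)                                    # nodes = attribute tests, leaves = majority class
predict(fit, head(iris), type = "class")
# plot(fit); text(fit)                        # draw the tree; prune(fit, cp=...) to limit overfitting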
Random Forest • A classifier consisting of an ensemble of decision trees {h(x, Θk)} where the Θk are i.i.d. random vectors, and each tree casts a vote for the class of x (Breiman 2001) 1. Bagging – Θk is a random selection of N samples (with replacement) used to grow the tree 2. Dietterich 98: Θk is a random split among the n best splits 3. Ho 98: Θk is a random subset of features used to grow the tree (√k of them) 4. Adaboost-like: Θk is random weights on the examples – 4 is better than {2,3}, which are better than 1, on generalization error • Out-of-bag estimates: internal estimates of generalization error, classifier strength and correlation between trees Center for Genes, Environment, and Health 52 Random Forest • Most popular implementation of {h(x, Θk)}: bagging (random subset of samples w/ replacement) + random subset of features – If the set of features is small, trees are more correlated, so one can make new features as random linear combinations of the original features • Out-of-bag (OOB) classifier for a specific {x,y} = aggregate over the trees that didn’t use {x,y} as training data (removes the need for setting aside test data) • Out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation) • Can estimate variable importance from OOB estimates – For the m-th variable, permute its values and compare the misclassification rate of the OOB classifiers on the ‘noised-up’ data with the OOB rate on the real data; a large increase implies the m-th variable is important Center for Genes, Environment, and Health 53 Support Vector Machine (SVM) • Support vectors are the points that lie closest to the decision surface and determine the maximum ‘margin’ of the hyperplane separating the examples (the solution changes if SVs are removed) • Kernel function – maps data that are not linearly separable to a transformed space where the transformed data are linearly separable • Advantages: non-probabilistic, optimization rather than greedy search, not affected by local minima, theoretical guarantees of performance, escapes the curse of dimensionality Center for Genes, Environment, and Health 54 Support Vector Machine (SVM) • The distance between H and H1 is 1/||w||, so to maximize the margin we need to minimize ||w|| (where ||w|| = sqrt(w·w)) subject to there being no points between H1 and H2: xi·w + b ≥ +1 when yi = +1, xi·w + b ≤ −1 when yi = −1, i.e. yi(xi·w + b) ≥ 1 for all i • Quadratic program (constrained optimization, solved via the (dual of the) Lagrangian): Max L = Σi αi − ½ Σi Σj αi αj yi yj xi·xj s.t. w = Σi αi yi xi and Σi αi yi = 0 • If not linearly separable, use a transformation to a space where the data is linearly separable, via kernels, i.e. use Φ
(xi) not xi • If use L1-norm (not L2 above), weights = variable importance http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf Center for Genes, Environment, and Health http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf 55 Support Vector Machine (SVM) Not separated by linear function, but can by quadratic one 2 x x' p Polynomial (p=1 linear) K ( x, x' ) exp K ( x, x' ) ( x x'1) 2 2 ~sigmoid (like Neural Net) K ( x, x' ) tanh( x x' ) Radial basis function (Gaussians) Center for Genes, Environment, and Health 56 Other Useful Kernel Functions • Use of kernels allows complex data types to be used in SVMs w/o having to translate into realvalued, fixed length vectors K: D x D R • String kernel: compare two sequences • Graph kernel: compare two nodes in graph or two graphs • Image kernels: compare two images • and so on … (any symmetric, positive semidefinite matrix is a kernel) Center for Genes, Environment, and Health 57 Center for Genes, Environment, and Health 58 Naïve Bayes C • Recall Bayes rule Pr(X|Y) = Pr(Y|X)Pr(X) / Pr(Y) • Classifier: F1 F2 F3 … Fn Pr(C|F1,…,Fn ) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn) – Note denominator does not depend on C (effectively constant Z) – “Naïve” assumption because assume Fi, Fj independent – Simplifies calculation: Pr(C|F1,…,Fn ) = 1/Z * Pr(C) i Pr(Fi|C) • Learn parameters Pr(C) & each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …) – Can learn each Pr(Fi|C) independently, escape curse of dimensionality, not need dataset to scale with # Fi Center for Genes, Environment, and Health 59 Center for Genes, Environment, and Health 60 Examples in R • Making 2D datasets – Install libraries: mlbench • Clustering (Hierarchical, K-Means, SOM) – Install libraries: kohonen • Classification (kNN, NN, DT, SVM, NB) – Install libraries: class (if R>3.0, o/w knn), neuralnet, rpart, e1071 Center for Genes, Environment, and Health 61 R package: mlbench: Machine Learning Benchmark Problems http://stackoverflow.com/questions/4722290/generating-synthetic-datasets Center for Genes, Environment, and Health 62 Additional References • • • • • Logit regression example: http://www.ats.ucla.edu/stat/r/dae/logit.htm PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf Statistical Pattern Recognition Toolbox for Demos: http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html KMeans: https://onlinecourses.science.psu.edu/stat857/node/125 SOMS: – – – • Distance metrics: – – – • http://www.statmethods.net/stats/correlations.html http://people.revoledu.com/kardi/tutorial/Similarity/index.html – nice discussion of differences http://www.datavis.ca/papers/corrgram.pdf - make visual panel (like heatmap) of correlation between variables Choosing number of clusters: – – – – – – • http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf http://www.loria.fr/~rougier/coding/article/article.html http://www.sciencedirect.com/science/article/pii/S0014579399005244 Nice one: http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set http://psycnet.apa.org/journals/met/16/3/285/ http://blog.echen.me/2011/03/19/counting-clusters/ http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf Neural Networks: – – – – Good one: http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-4.pdf 
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks Nice for MLP: http://users.ics.aalto.fi/ahonkela/dippa/node41.html • • • Boosting vs Bagging: http://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class23.pdf Random Forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm SVMs:Idiots’ guide to SVMs: http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf • Kernel Methods: http://www.kernel-methods.net/tutorials/KMtalk.pdf Center for Genes, Environment, and Health 63 The End Center for Genes, Environment, and Health 64 Not used Center for Genes, Environment, and Health 65 Hidden Markov Model • Finite set of N states X • Finite set of M observations O • Parameter set λ = (A, B, π) Hidden Markov Model (HMM) – Initial state distribution πi = Pr(X1 = i) – Transition probability aij = Pr(Xt=j | Xt-1 = i) – Emission probability bik = Pr(Ot=k | Xt = i) Example: 1 2 3 N=3, M=2 π=(0.25, 0.55, 0.2) obs1 st1 st2 st3 A =st1 0 0.2 0.8 B = st1 0.1 st2 0 0 . 9 0 . 1 0 0 st3 1.0 X t-1 Xt Ot-1 Ot obs2 0.9 st2 0.75 0.25 st3 0.5 0.5 • Given observation sequence O=O1,O2,…,On, how compute Pr(O| λ)? Center for Genes, Environment, and Health 66 Example: 1 πi = Pr(X1 = i) aij = Pr(Xt=j | Xt-1 = i) bik = Pr(Ot=k | Xt = i) 2 3 N=3, M=2 π=(0.25, 0.55, 0.2) A = 0 0.2 0.8 B = 0.1 0 0.9 0.1 1.0 0 0 0.9 0.75 0.25 0.5 0.5 • Probability of O is sum over all state sequences Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ) = ∑all X πx1 bx1o1 ax1x2 bx2o2 . . . axT-1xT bxToT • What is computational complexity of this sum? Center for Genes, Environment, and Health 67 Example: 1 πi = Pr(X1 = i) aij = Pr(Xt=j | Xt-1 = i) bik = Pr(Ot=k | Xt = i) 2 3 N=3, M=2 π=(0.25, 0.55, 0.2) A = 0 0.2 0.8 B = 0.1 0 0.9 0.1 1.0 0 0 0.9 0.75 0.25 0.5 0.5 • Probability of O is sum over all state sequences Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ) = ∑all X πx1 bx1o1 ax1x2 bx2o2 . . . axT-1xT bxToT • At each t, are N states to reach, so NT possible state sequences and 2T multiplications per seq, means O(2T*NT) operations • So 3 states, length 10 seq = 1,180,980 operations and len 20 = 1e11! 68 Center for Genes, Environment, and Health Example: 1 πi = Pr(X1 = i) aij = Pr(Xt=j | Xt-1 = i) bik = Pr(Ot=k | Xt = i) 2 3 N=3, M=2 π=(0.25, 0.55, 0.2) A = 0 0.2 0.8 B = 0.1 0 0.9 0.1 1.0 0 0 0.9 0.75 0.25 0.5 0.5 • Probability of O is sum over all state sequences Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ) = ∑all X πx1 bx1o1 ax1x2 bx2o2 . . . 
axT-1xT bxToT • Efficient dynamic programming algorithm to do this: the Forward algorithm (Baum and Welch), O(N²T) Center for Genes, Environment, and Health 69 The Forward Algorithm: Probability of a Sequence is the Sum of All Paths that Can Produce It [Figure, from David Pollock’s lecture: the forward algorithm run on a two-state CpG / Non-CpG model for a short DNA sequence; each cell sums the previous column’s values weighted by the transition probabilities (0.8/0.2 out of CpG, 0.1/0.9 out of Non-CpG) and multiplies by the emission probability, e.g. .3*(.3*.8 + .1*.1) = .075 and .1*(.3*.2 + .1*.9) = .015 in the second column, ending at .0005 and .0011 in the last column] Non-CpG Center for Genes, Environment, and Health 70 Parameter estimation by Baum-Welch (Forward-Backward) Algorithm – Forward variable αt(i) = Pr(O1..t, Xt=i | λ) – Backward variable βt(i) = Pr(Ot+1..T | Xt=i, λ) DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdf and erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf Forward Algorithm • Dynamic programming method to compute the forward variable αt(i) = Pr(O1..t, Xt=i | λ) • Base condition: for 1 ≤ i ≤ N, α1(i) = πi biO1 • Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1, αt+1(j) = [ ∑i=1 to N αt(i) aij ] bjOt+1 • Then the probability of the sequence is Pr(O | λ) = ∑i=1 to N αT(i) • The Backward algorithm for βt(i) is analogous Center for Genes, Environment, and Health 72 Center for Genes, Environment, and Health 73 Center for Genes, Environment, and Health 74 [Figure (duplicate of slide 27): Euclidean vs Pearson vs Spearman comparison panels] Center for Genes, Environment, and Health 75
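Not part of the slides: a short R sketch of the forward recursion above, using the three-state, two-observation example from the earlier HMM slides (the observation sequence here is an arbitrary choice, and the function name forward is mine).

# Forward algorithm sketch (illustration only)
pi0 <- c(0.25, 0.55, 0.20)                      # initial state distribution pi
A   <- matrix(c(0,   0.2, 0.8,
                0,   0.9, 0.1,
                1.0, 0,   0), nrow=3, byrow=TRUE)   # a_ij = Pr(X_t=j | X_{t-1}=i)
B   <- matrix(c(0.10, 0.90,
                0.75, 0.25,
                0.50, 0.50), nrow=3, byrow=TRUE)    # b_ik = Pr(O_t=k | X_t=i)
O   <- c(1, 2, 2, 1)                            # observation sequence (obs1, obs2, obs2, obs1)

forward <- function(pi0, A, B, O) {
  N <- length(pi0); Tlen <- length(O)
  alpha <- matrix(0, nrow=Tlen, ncol=N)
  alpha[1, ] <- pi0 * B[, O[1]]                 # base case: alpha_1(i) = pi_i * b_i(O_1)
  for (t in 1:(Tlen - 1)) {                     # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    alpha[t + 1, ] <- (alpha[t, ] %*% A) * B[, O[t + 1]]
  }
  sum(alpha[Tlen, ])                            # Pr(O | lambda) = sum_i alpha_T(i)
}
forward(pi0, A, B, O)                           # O(N^2 T) work instead of O(2T * N^T)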