Lecture 1: Overview

Bios 760R, Lecture 1
Overview
Overview of the course
Classification
The “curse of dimensionality”
 Reminder of some background knowledge
Tianwei Yu
Room 334
Tianwei.yu@emory.edu
Overview
Lecture 1: Overview
Lecture 2: Bayesian decision theory
Lecture 3: Density estimation
Lecture 4: Linear machine
Lecture 5: Neural network
Lecture 6: Support vector machine
Lecture 7: Tree and forest
Lecture 8: Bump hunting
Lecture 9: Boosting
Lecture 10: Model generalization
Lecture 11: Similarity measures, hierarchical clustering, subspace clustering
Lecture 12: Center-based & model-based clustering
Lecture 13: PCA, SIR
Lecture 14: ICA, PLSDA

Focus of the course: classification, clustering, dimension reduction.
Overview
References:
Textbook: The Elements of Statistical Learning. Hastie, Tibshirani & Friedman.
Other references:
Pattern Classification. Duda, Hart & Stork.
Data Clustering: Theory, Algorithms, and Applications. Gan, Ma & Wu.
Applied Multivariate Statistical Analysis. Johnson & Wichern.
Evaluation:
Three projects (30% each)
Class participation (10%)
Overview
Machine learning / data mining:
Supervised learning (“direct data mining”): classification, estimation, prediction.
Unsupervised learning (“indirect data mining”): clustering, association rules, description, dimension reduction and visualization.
Semi-supervised learning.
(Modified from Figure 1.1 of Data Clustering by Gan, Ma and Wu.)
Overview
In supervised learning, the problem is well-defined:
Given a set of observations {(x_i, y_i)}, estimate the density Pr(Y, X).
Usually the goal is to find the model/parameters that minimize a loss. A common loss is the expected prediction error:
EPE(f) = E[(Y - f(X))^2]
It is minimized at f(x) = E(Y | X = x), the conditional expectation.
Objective criteria exist to measure the success of a supervised learning mechanism.
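As a concrete illustration (not from the slides; the data-generating function, window width, and seed below are arbitrary choices), the following numpy sketch checks numerically that the squared-error loss near a point x0 is minimized by the conditional mean E(Y | X ≈ x0):

import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: Y = f(X) + noise, with f(x) = sin(2*pi*x).
n = 50000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Average squared loss at x0 for a grid of candidate predictions c.
x0 = 0.25
near = np.abs(x - x0) < 0.01                   # points with X close to x0
candidates = np.linspace(-2.0, 2.0, 401)
losses = [np.mean((y[near] - c) ** 2) for c in candidates]

print("loss-minimizing prediction:", candidates[np.argmin(losses)])
print("conditional mean of Y near x0:", y[near].mean())
print("true f(x0):", np.sin(2 * np.pi * x0))   # = 1.0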
Overview
In unsupervised learning, there is no output variable; all we observe is a set {x_i}.
The goal is to infer Pr(X) and/or some of its properties.
When the dimension is low, nonparametric density estimation is possible.
When the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density.
There is no objective criterion to evaluate the outcome; heuristic arguments are used to motivate the methods, and a reasonable explanation of the outcome is expected from the subject-matter field of study.
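As a low-dimensional illustration of nonparametric density estimation (a sketch, not from the slides; the sample size and evaluation grid are arbitrary, and scipy's default Scott bandwidth is used):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(size=200)                       # a 1-D sample: dense enough for KDE

kde = gaussian_kde(x)                          # Gaussian kernel density estimate
for g in np.linspace(-3.0, 3.0, 7):
    print(f"x = {g:+.1f}   estimated density = {kde(g)[0]:.3f}")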
Classification
The general scheme.
An example.
Classification
In most cases, a single feature is not enough to generate a good classifier.
Classification
Two extremes: overly rigid and overly flexible classifiers.
Classification
Goal: an optimal trade-off between model simplicity and training set performance.
This is similar to AIC / BIC / … model selection in regression.
Classification
An example of the overall scheme involving classification:
Classification
A classification project: a systematic view.
Curse of Dimensionality
Bellman, R.E., 1961.
In p dimensions, a sub-hypercube of the unit cube with volume r has edge length r^{1/p}.
In 10 dimensions, to capture 1% of (uniformly distributed) data to get a local average, we need 0.01^{1/10} ≈ 0.63, i.e. 63% of the range of each input variable.
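A quick numeric check of the r^{1/p} claim (a sketch; it assumes the data are spread uniformly over the unit hypercube):

p = 10
r = 0.01                    # capture 1% of the data
edge = r ** (1.0 / p)       # required edge length in each dimension
print(edge)                 # about 0.63, i.e. 63% of each variable's range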
Curse of Dimensionality
In other words:
To get a “dense” sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions.
In high dimensions, the data are always sparse and do not support density estimation.
Most data points are closer to the boundary of the sample space than to the nearest other data point, so prediction is much harder near the edge of the training sample.
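The boundary effect shows up in a small simulation (a sketch; N, the dimensions, and the uniform cube are arbitrary choices): the distance from a query point at the origin to its nearest training point grows rapidly with the dimension, so "local" neighborhoods stop being local.

import numpy as np

rng = np.random.default_rng(2)
N = 500                                        # training points in each setting

for p in (1, 2, 5, 10, 20):
    X = rng.uniform(-1.0, 1.0, size=(N, p))    # N points uniform in [-1, 1]^p
    nearest = np.min(np.linalg.norm(X, axis=1))
    print(f"p = {p:2d}   distance from origin to nearest point = {nearest:.3f}")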
Curse of Dimensionality
Estimating a 1D density with 40 data points (standard normal distribution).
Estimating a 2D density with 40 data points (2D normal distribution, zero mean, identity covariance matrix).
Curse of Dimensionality
Another example: the EPE of the nearest-neighbor predictor.
To estimate E(Y | X = x), take the average of the data points close to a given x, i.e. its k nearest neighbors:
\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))
This assumes f(x) is well approximated by a locally constant function. When N is large, the neighborhood is small and the prediction is accurate.
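A minimal k-nearest-neighbor regression sketch in one dimension (the target function, k, and sample size are arbitrary illustrative choices, not from the slides):

import numpy as np

rng = np.random.default_rng(3)

# Training data: Y = f(X) + noise, with f(x) = x^2.
n, k = 200, 15
x_train = rng.uniform(-1.0, 1.0, size=n)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=n)

def knn_predict(x0, x_train, y_train, k):
    """Average the responses of the k training points closest to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

x0 = 0.5
print("k-NN estimate at x0:", knn_predict(x0, x_train, y_train, k))
print("true f(x0):", x0 ** 2)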
Curse of Dimensionality
Just a reminder: the expected prediction error contains variance and bias components.
Under the model Y = f(X) + \varepsilon, with \mathrm{Var}(\varepsilon) = \sigma^2:
EPE(x_0) = E[(Y - \hat f(x_0))^2]
= E[\varepsilon^2 + 2\varepsilon (f(x_0) - \hat f(x_0)) + (f(x_0) - \hat f(x_0))^2]
= \sigma^2 + E[(f(x_0) - \hat f(x_0))^2]
= \sigma^2 + E[\hat f(x_0) - E(\hat f(x_0))]^2 + [E(\hat f(x_0)) - f(x_0)]^2
= \sigma^2 + \mathrm{Var}(\hat f(x_0)) + \mathrm{Bias}^2(\hat f(x_0))
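The decomposition can be checked by simulation (a sketch under the stated model; the deliberately biased estimator below, its shift, and the sample sizes are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(4)

f_x0, sigma = 2.0, 1.0                 # true f(x0) and noise standard deviation
m, shift = 10, 0.5                     # estimator: mean of m noisy obs, shifted by 0.5
reps = 200000

train = rng.normal(f_x0, sigma, size=(reps, m))
f_hat = train.mean(axis=1) + shift               # a biased estimator of f(x0)
y_new = rng.normal(f_x0, sigma, size=reps)       # independent test responses

print("EPE                   :", np.mean((y_new - f_hat) ** 2))
print("sigma^2 + Var + Bias^2:", sigma**2 + f_hat.var() + (f_hat.mean() - f_x0) ** 2)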
Curse of Dimensionality
We have talked about the curse of dimensionality in the
sense of density estimation.
In a classification problem, we do not necessarily need
density estimation.
Generative model: cares about the class density functions.
Discriminative model: cares about the decision boundary.
Example: classifying belt fish and carp. Looking at the length/width ratio is enough; why should we care how many teeth each kind of fish has, or what shape fins they have?
High-throughput biological data
We talk about the “curse of dimensionality” when N is not much larger than p.
In bioinformatics, usually N < 100 and p > 1000.
How to deal with this N << p issue? It is assumed that most predictors do not contain information for predicting Y, so we dramatically reduce p before model building.
Filter genes based on:
variation (a filtering sketch follows below),
a normal/disease test statistic,
projection,
functional groups/network modules,
……
For the most part of this course, we will pretend the N << p issue doesn’t exist. Some methods are robust against large numbers of nuisance variables.
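A minimal sketch of the variance-based filtering mentioned above (the matrix dimensions and the cutoff of 100 genes are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(5)

# Toy expression matrix: N = 50 samples, p = 2000 genes (N << p).
X = rng.normal(size=(50, 2000))

gene_var = X.var(axis=0)                      # variance of each gene across samples
keep = np.argsort(gene_var)[::-1][:100]       # indices of the 100 most variable genes
X_reduced = X[:, keep]
print(X_reduced.shape)                        # (50, 100): p reduced before modeling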
Reminder of some results for random vectors
The multivariate Gaussian distribution.
The covariance matrix (not limited to the Gaussian case).
* The Gaussian is fully defined by its mean vector and covariance matrix (the first and second moments).
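For reference, the standard form of the p-dimensional Gaussian density with mean vector \mu and covariance matrix \Sigma, together with the definition of the covariance matrix itself:

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\Big( -\tfrac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \Big),
\qquad \Sigma = E\big[ (X-\mu)(X-\mu)' \big].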
Reminder of some results for random vectors
The correlation matrix, and its relationship with the covariance matrix:
V^{1/2} \rho V^{1/2} = \Sigma, \quad where V^{1/2} = \mathrm{diag}(\sigma_{11}^{1/2}, \ldots, \sigma_{kk}^{1/2}).
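A numpy sketch (with an arbitrary example covariance matrix) converting between the covariance matrix and the correlation matrix via this relationship:

import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 9.0]])                 # an example covariance matrix

V_half = np.diag(np.sqrt(np.diag(Sigma)))      # V^{1/2} = diag of standard deviations
V_half_inv = np.linalg.inv(V_half)

rho = V_half_inv @ Sigma @ V_half_inv          # correlation matrix
print(rho)
print(V_half @ rho @ V_half)                   # recovers Sigma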
Reminder of some results for random vectors
Quadratic form:
x'Ax = \sum_{i=1}^{k} \sum_{j=1}^{k} a_{ij} x_i x_j

A linear combination of the elements of a random vector. 2-D example, with c = (a, b)':
\mathrm{Var}(aX_1 + bX_2) = E\big[(aX_1 + bX_2) - (a\mu_1 + b\mu_2)\big]^2
= E\big[a(X_1 - \mu_1) + b(X_2 - \mu_2)\big]^2
= a^2\sigma_{11} + b^2\sigma_{22} + 2ab\sigma_{12}
= \begin{pmatrix} a & b \end{pmatrix}
  \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
  \begin{pmatrix} a \\ b \end{pmatrix}
= c'\Sigma c
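A quick numeric check of the 2-D example (the coefficients and covariance matrix are arbitrary):

import numpy as np

rng = np.random.default_rng(6)

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])
a, b = 1.0, -2.0
c = np.array([a, b])

print("c' Sigma c:", c @ Sigma @ c)            # theoretical variance via the quadratic form

X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=500_000)
print("empirical Var(a*X1 + b*X2):", (a * X[:, 0] + b * X[:, 1]).var())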
Reminder of some results for random vectors
A “new” random vector generated from linear combinations of the elements of a random vector: if Z = CX for a fixed matrix C, then E(Z) = C\mu and Cov(Z) = C\Sigma C'.
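A sketch verifying this empirically (C, \mu, and \Sigma below are arbitrary examples):

import numpy as np

rng = np.random.default_rng(7)

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
C = np.array([[1.0, 1.0,  0.0],
              [0.0, 2.0, -1.0]])               # maps 3-D X to 2-D Z

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Z = X @ C.T                                    # each row is z_i = C x_i

print("C mu       :", C @ mu)
print("mean of Z  :", Z.mean(axis=0))
print("C Sigma C' :", C @ Sigma @ C.T)
print("cov of Z   :", np.cov(Z, rowvar=False))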
Reminder of some results for random vectors
Let A be a k×k symmetric matrix. Then A has k pairs of eigenvalues and eigenvectors (\lambda_1, e_1), \ldots, (\lambda_k, e_k), and can be decomposed as
A = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \cdots + \lambda_k e_k e_k' = P \Lambda P'
Positive-definite matrix: x'Ax > 0 for all x \neq 0, which is equivalent to \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0.
Note: x'Ax = \lambda_1 (x'e_1)^2 + \cdots + \lambda_k (x'e_k)^2.
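A numpy sketch of the spectral decomposition for a small symmetric (positive-definite) example matrix:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                     # symmetric and positive-definite

eigvals, P = np.linalg.eigh(A)                 # columns of P are orthonormal eigenvectors
print(P @ np.diag(eigvals) @ P.T)              # P Lambda P' reconstructs A

# The same reconstruction as a sum of rank-one terms lambda_i * e_i e_i'.
print(sum(lam * np.outer(e, e) for lam, e in zip(eigvals, P.T)))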
Reminder of some results for random vectors
Inverse of a positive-definite matrix:
A^{-1} = P \Lambda^{-1} P' = \sum_{i=1}^{k} \frac{1}{\lambda_i} e_i e_i'
Square root matrix of a positive-definite matrix:
A^{1/2} = P \Lambda^{1/2} P' = \sum_{i=1}^{k} \lambda_i^{1/2} e_i e_i'
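Continuing with the same kind of example matrix, the inverse and square root follow directly from the eigendecomposition (again an illustrative sketch, not from the slides):

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                     # positive-definite

eigvals, P = np.linalg.eigh(A)

A_inv  = P @ np.diag(1.0 / eigvals)    @ P.T   # A^{-1}  = P Lambda^{-1} P'
A_half = P @ np.diag(np.sqrt(eigvals)) @ P.T   # A^{1/2} = P Lambda^{1/2} P'

print(np.allclose(A_inv, np.linalg.inv(A)))    # True
print(np.allclose(A_half @ A_half, A))         # True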
Reminder of some results for random vectors
With a sample of the random vector:
Reminder of some results for random vectors
To estimate the mean vector and covariance matrix:
\hat\mu = \bar X, \qquad \hat\Sigma = S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)'
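A numpy sketch (with a simulated sample and arbitrary true parameters) computing these estimates; note that np.cov uses the same n - 1 divisor by default:

import numpy as np

rng = np.random.default_rng(8)
mu = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=1000)     # rows are observations

mu_hat = X.mean(axis=0)                               # sample mean vector
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / (len(X) - 1)

print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # True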