Lecture Slides for INTRODUCTION TO MACHINE LEARNING
Ethem Alpaydın © The MIT Press, 2004
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml

CHAPTER 5: Multivariate Methods

Multivariate Data

Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$

Multivariate Parameters

Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \operatorname{Cov}(X_i, X_j)$

Correlation: $\operatorname{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\boldsymbol{\Sigma} \equiv \operatorname{Cov}(\mathbf{x}) = E\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\big] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$

Parameter Estimation

Sample mean $\mathbf{m}$: $\quad m_i = \dfrac{\sum_{t=1}^N x_i^t}{N}, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{\sum_{t=1}^N (x_i^t - m_i)(x_j^t - m_j)}{N}$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$

Estimation of Missing Values

What to do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small.
Use 'missing' as an attribute: may give information.
Imputation: fill in the missing value.
Mean imputation: use the most likely value (e.g., the mean).
Imputation by regression: predict the missing value from the other attributes.

Multivariate Normal Distribution

$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$

Mahalanobis distance: $(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$ measures the distance from $\mathbf{x}$ to $\boldsymbol{\mu}$ in terms of $\boldsymbol{\Sigma}$ (it normalizes for differences in variances and for correlations).

Bivariate case ($d = 2$):

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\big(z_1^2 - 2\rho z_1 z_2 + z_2^2\big)\right], \qquad z_i = \frac{x_i - \mu_i}{\sigma_i}$$

[Figures: bivariate normal density and its iso-probability contours]

Independent Inputs: Naive Bayes

If the $x_i$ are independent, the off-diagonals of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^d p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^d \sigma_i} \exp\left[-\frac{1}{2}\sum_{i=1}^d \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$

If the variances are also equal, this reduces to the Euclidean distance.

Parametric Classification

If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

The discriminant functions are

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i)$$

Estimation of Parameters

With $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise,

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad \mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t} \qquad \mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

Dropping the constant $-\frac{d}{2}\log 2\pi$,

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i)$$

Different S_i: Quadratic Discriminant

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\big(\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{x} - 2\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i\big) + \log \hat{P}(C_i) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1} \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log \hat{P}(C_i)$$

[Figure: class likelihoods, the posterior for C1, and the discriminant where P(C1 | x) = 0.5]
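To make the last two slides concrete, here is a minimal NumPy sketch (my own illustration, not code from the book) of estimating $\hat{P}(C_i)$, $\mathbf{m}_i$, $\mathbf{S}_i$ and evaluating the quadratic discriminant. The function names, and the assumption that `y` holds class indices $0, \ldots, K-1$ (so that $r_i^t = 1$ iff $y^t = i$), are mine.

```python
import numpy as np

def fit_gaussian_classes(X, y, K):
    """Per-class MLEs: priors P_hat(C_i), means m_i, covariances S_i."""
    N = X.shape[0]
    priors, means, covs = [], [], []
    for i in range(K):
        Xi = X[y == i]                     # instances with r_i^t = 1
        priors.append(len(Xi) / N)         # P_hat(C_i) = sum_t r_i^t / N
        m = Xi.mean(axis=0)                # m_i
        Z = Xi - m
        covs.append(Z.T @ Z / len(Xi))     # S_i
        means.append(m)
    return priors, means, covs

def g(x, prior, m, S):
    """Quadratic discriminant g_i(x), constant term dropped."""
    diff = x - m
    _, logdet = np.linalg.slogdet(S)       # numerically stable log|S_i|
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(S, diff) + np.log(prior)

def predict(x, priors, means, covs):
    """Choose the class with the highest discriminant value."""
    return int(np.argmax([g(x, p, m, S) for p, m, S in zip(priors, means, covs)]))
```

Note that `np.linalg.solve(S, diff)` computes $\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i)$ without forming the inverse explicitly, which is the usual numerically preferred route.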
Common Covariance Matrix S

Shared common sample covariance:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i)$$

which is a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} \qquad \text{where} \quad \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i \quad\text{and}\quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1}\mathbf{m}_i + \log \hat{P}(C_i)$$

[Figure: classes with a common covariance matrix S]

Diagonal S

When the $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal:

$$p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i) \quad \text{(Naive Bayes' assumption)}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^d \left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$$

Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean.

[Figure: diagonal S; the variances may be different]

Diagonal S, Equal Variances

Nearest mean classifier: classify based on Euclidean distance to the nearest mean:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\mathbf{m}_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^d (x_j - m_{ij})^2 + \log \hat{P}(C_i)$$

Each mean can be considered a prototype or template, and this is template matching.

[Figure: diagonal S, equal variances]

Model Selection

Assumption                     Covariance matrix          No. of parameters
Shared, hyperspheric           S_i = S = s^2 I            1
Shared, axis-aligned           S_i = S, with s_ij = 0     d
Shared, hyperellipsoidal       S_i = S                    d(d+1)/2
Different, hyperellipsoidal    S_i                        K d(d+1)/2

As we increase complexity (a less restricted S), bias decreases and variance increases. Assume simple models (allow some bias) to control variance (regularization).

Discrete Features

Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$. If the $x_j$ are independent (Naive Bayes'),

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^d p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j}$$

and the discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = \sum_j \big[x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij})\big] + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$

Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$, with

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

If the $x_j$ are independent,

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^d \prod_{k=1}^{n_j} p_{ijk}^{z_{jk}} \qquad g_i(\mathbf{x}) = \sum_j \sum_k z_{jk} \log p_{ijk} + \log P(C_i) \qquad \hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$

Multivariate Regression

$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon$$

Multivariate linear model:

$$g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \big[r^t - (w_0 + w_1 x_1^t + \cdots + w_d x_d^t)\big]^2$$

Multivariate polynomial model: define new higher-order variables

$$z_1 = x_1 \quad z_2 = x_2 \quad z_3 = x_1^2 \quad z_4 = x_2^2 \quad z_5 = x_1 x_2$$

and use the linear model in this new z space (basis functions, kernel trick, SVM: Chapter 10).
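As a companion to the regression slides, here is a minimal NumPy sketch (my own illustration, not code from the book) of fitting the multivariate linear model by least squares and of mapping two inputs into the higher-order z space; the helper names and the synthetic data are assumptions.

```python
import numpy as np

def fit_linear(X, r):
    """Least-squares fit of r ~ w0 + w1*x1 + ... + wd*xd.

    Minimizing E(w | X) = 1/2 sum_t (r^t - w0 - w^T x^t)^2 leads to the
    normal equations, solved here via a least-squares routine.
    """
    A = np.column_stack([np.ones(len(X)), X])   # prepend a bias column for w0
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    return w                                     # w[0] = w0, w[1:] = w1..wd

def z_space(X):
    """Map (x1, x2) to z = (x1, x2, x1^2, x2^2, x1*x2) as on the slide."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# A quadratic target is linear in z, so the same solver applies unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
r = 3 + 2 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
w = fit_linear(z_space(X), r)                    # recovers the coefficients
```

This is the "basis function" idea the slide points to: the model stays linear in the parameters, so the same closed-form least-squares solution works unchanged in the transformed space.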