CHAPTER 4: Parametric Methods
(Based on: Lecture Notes for E. Alpaydın, 2004, Introduction to Machine Learning, © The MIT Press, V1.1)

Parametric Estimation
- Given a sample X = {x^t}, the goal is to infer the probability distribution p(x).
- Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X; e.g., N(μ, σ²) where θ = {μ, σ²}.
- Problem: how can we obtain θ from X?
- Assumption: X contains samples of a one-dimensional random variable. Multivariate estimation, where each instance contains multiple measurements rather than a single one, comes later.
- Example: Gaussian distribution, http://en.wikipedia.org/wiki/Normal_distribution

Maximum Likelihood Estimation
- A density function p with parameters θ is given, and x^t ~ p(x|θ).
- Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏t p(x^t|θ)
- We look for the θ that "maximizes the likelihood of the sample"!
- Log likelihood: L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)
- Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)
- Homework: given the sample 0, 3, 3, 4, 5 and x ~ N(μ, σ²), use MLE to find (μ, σ²)!

Examples: Bernoulli/Multinomial
- Bernoulli: two states, failure/success, x ∈ {0, 1}
  P(x) = p₀^x (1 − p₀)^(1−x)
  L(p₀|X) = log ∏t p₀^(x^t) (1 − p₀)^(1−x^t)
  MLE: p₀ = ∑t x^t / N
- Multinomial: K > 2 states, x_i ∈ {0, 1}
  P(x₁, x₂, ..., x_K) = ∏i p_i^(x_i)
  L(p₁, p₂, ..., p_K|X) = log ∏t ∏i p_i^(x_i^t)
  MLE: p_i = ∑t x_i^t / N

Gaussian (Normal) Distribution
- p(x) = N(μ, σ²):
  p(x) = 1/(√(2π) σ) exp[−(x − μ)² / (2σ²)]
- MLE for μ and σ²:
  m = ∑t x^t / N
  s² = ∑t (x^t − m)² / N
- http://en.wikipedia.org/wiki/Probability_density_function
- (A code sketch of these estimators appears at the end of this part.)

Bias and Variance
- Unknown parameter θ; estimator d_i = d(X_i) on sample X_i
- Bias: b_θ(d) = E[d] − θ  (error in the model itself)
- Variance: E[(d − E[d])²]  (variation/randomness of the model)
- Mean square error of the estimator d:
  r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance

Bayes' Estimator
- Treat θ as a random variable with prior p(θ).
- Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
- Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
- Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
- Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
- Comments: ML just takes the value of θ that maximizes the likelihood; compared with ML, MAP additionally considers the prior; the Bayes' estimator averages over all possible values of θ, each weighted by its posterior probability p(θ|X).
- For MAP see: http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
- For a comparison see: http://metaoptimize.com/qa/questions/7885/what-is-the-relationship-between-mle-map-em-point-estimation

Bayes' Estimator: Example (skip today)
- x^t ~ N(θ, σ₀²) and θ ~ N(μ, σ²)
- θ_ML = m
- θ_MAP = θ_Bayes = E[θ|X] = [N/σ₀² / (N/σ₀² + 1/σ²)] m + [1/σ² / (N/σ₀² + 1/σ²)] μ
- As N grows (or as the prior variance σ² grows), the estimate converges to m.
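The MLE formulas above have one-line closed forms. Below is a minimal sketch of the Bernoulli and Gaussian estimators, assuming NumPy is available; the function names and sample arrays are illustrative choices, not part of the lecture notes (passing the homework sample to gaussian_mle would answer the homework question).

```python
# Minimal sketch of the MLE formulas above, assuming NumPy.
# The samples below are arbitrary illustrations, not the homework data.
import numpy as np

def bernoulli_mle(x):
    """MLE of p0 for x^t in {0, 1}: p0 = sum_t x^t / N."""
    x = np.asarray(x, dtype=float)
    return x.mean()

def gaussian_mle(x):
    """MLE of (mu, sigma^2): m = sum_t x^t / N, s^2 = sum_t (x^t - m)^2 / N."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    s2 = ((x - m) ** 2).mean()   # note: divides by N, not N - 1
    return m, s2

if __name__ == "__main__":
    coin = [1, 0, 1, 1, 0, 1]           # illustrative Bernoulli sample
    print("p0  =", bernoulli_mle(coin))
    sample = [2.1, 3.4, 2.9, 4.0, 3.2]  # illustrative 1-D sample
    m, s2 = gaussian_mle(sample)
    print("m   =", m, " s^2 =", s2)
```

Note that s² divides by N, which is the maximum likelihood estimate discussed on the slides, not the unbiased N − 1 version.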
Parametric Classification
- Discriminant: g_i(x) = p(x|C_i) P(C_i) (a kind of unnormalized p(C_i|x)), or equivalently g_i(x) = log p(x|C_i) + log P(C_i)
- With a Gaussian class-conditional density,
  p(x|C_i) = 1/(√(2π) σ_i) exp[−(x − μ_i)² / (2σ_i²)]
  g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
- Pipeline: Data → ML/MAP → P(x|C_i)
- Using Bayes' theorem: P(C₁|x) = P(C₁) P(x|C₁) / P(x) and P(C₂|x) = P(C₂) P(x|C₂) / P(x). As P(x) is the same in both formulas, we can drop it!
- Given the sample X = {x^t, r^t}_{t=1..N} with r_i^t = 1 if x^t ∈ C_i and r_i^t = 0 if x^t ∈ C_j, j ≠ i, the ML estimates are
  P̂(C_i) = ∑t r_i^t / N
  m_i = ∑t x^t r_i^t / ∑t r_i^t
  s_i² = ∑t (x^t − m_i)² r_i^t / ∑t r_i^t
- The discriminant becomes
  g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
- (A code sketch of this classifier appears at the end of this part.)

Equal variances
- Single boundary at halfway between the means. (Figure)

Variances are different
- Two boundaries. (Figure) Homework!

Regression
- r = f(x) + ε, with estimator g(x|θ) for f and ε ~ N(0, σ²), so p(r|x) ~ N(g(x|θ), σ²)
- Maximizing the probability of the sample again:
  L(θ|X) = log ∏_{t=1..N} p(x^t, r^t) = ∑_{t=1..N} log p(r^t|x^t) + ∑_{t=1..N} log p(x^t)

Regression: From LogL to Error (skip ahead to Estimating Bias and Variance)
- L(θ|X) = log ∏_{t=1..N} (1/(√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]
         = −N log(√(2π) σ) − (1/(2σ²)) ∑_{t=1..N} [r^t − g(x^t|θ)]²
- Maximizing L is therefore equivalent to minimizing the sum-of-squares error
  E(θ|X) = (1/2) ∑_{t=1..N} [r^t − g(x^t|θ)]²

Linear Regression
- g(x^t | w₁, w₀) = w₁ x^t + w₀
- Setting the derivatives of E to zero gives two equations in two unknowns:
  ∑t r^t = N w₀ + w₁ ∑t x^t
  ∑t r^t x^t = w₀ ∑t x^t + w₁ ∑t (x^t)²
- In matrix form, A w = y with
  A = [ N        ∑t x^t    ]     w = [ w₀ ]     y = [ ∑t r^t     ]
      [ ∑t x^t   ∑t (x^t)² ]         [ w₁ ]         [ ∑t r^t x^t ]
  so w = A⁻¹ y.
- Relationship to what we discussed in Topic 2?

Polynomial Regression
- g(x^t | w_k, ..., w₂, w₁, w₀) = w_k (x^t)^k + ... + w₂ (x^t)² + w₁ x^t + w₀
- Here we get k+1 equations with k+1 unknowns!
- With the design matrix
  D = [ 1   x¹    (x¹)²   ...  (x¹)^k  ]
      [ 1   x²    (x²)²   ...  (x²)^k  ]
      [ ...                            ]
      [ 1   x^N   (x^N)²  ...  (x^N)^k ]
  and r = [r¹, r², ..., r^N]ᵀ, the solution is w = (Dᵀ D)⁻¹ Dᵀ r.
- (A code sketch of linear and polynomial regression appears at the end of this part.)

Other Error Measures
- Square error: E(θ|X) = (1/2) ∑_{t=1..N} [r^t − g(x^t|θ)]²
- Relative square error: E(θ|X) = ∑_{t=1..N} [r^t − g(x^t|θ)]² / ∑_{t=1..N} (r^t − r̄)²
- Absolute error: E(θ|X) = ∑t |r^t − g(x^t|θ)|
- ε-sensitive error: E(θ|X) = ∑t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)

Bias and Variance
- The expected squared error at x decomposes as
  E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                     = noise + squared error
- Taking the expectation over samples X,
  E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                            = bias² + variance
- To be revisited next week!
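As a companion to the parametric-classification slides above, here is a minimal sketch of the 1-D Gaussian classifier with the ML estimates P̂(C_i), m_i, s_i², assuming NumPy. It takes integer class labels 0..K−1 instead of the r_i^t indicator vectors, and all function names and toy data are illustrative.

```python
# Sketch of the 1-D Gaussian classifier above, assuming NumPy.
import numpy as np

def fit(x, y, K):
    """ML estimates P_hat(C_i), m_i, s_i^2 for each class i."""
    x, y = np.asarray(x, float), np.asarray(y)
    priors = np.array([(y == i).mean() for i in range(K)])
    means  = np.array([x[y == i].mean() for i in range(K)])
    vars_  = np.array([((x[y == i] - means[i]) ** 2).mean() for i in range(K)])
    return priors, means, vars_

def discriminant(x, priors, means, vars_):
    """g_i(x) = -0.5 log(2 pi) - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)."""
    x = np.asarray(x, float)[:, None]
    return (-0.5 * np.log(2 * np.pi)
            - 0.5 * np.log(vars_)
            - (x - means) ** 2 / (2 * vars_)
            + np.log(priors))

# choose the class with the largest g_i(x)
priors, means, vars_ = fit([1.0, 1.2, 0.8, 3.1, 2.9, 3.3], [0, 0, 0, 1, 1, 1], K=2)
print(discriminant([1.1, 3.0], priors, means, vars_).argmax(axis=1))   # -> [0 1]
```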
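The linear- and polynomial-regression solutions above both reduce to solving a linear system. The sketch below, assuming NumPy, builds the design matrix D and solves the least-squares problem; np.linalg.lstsq is used instead of forming (DᵀD)⁻¹ explicitly, which is numerically safer but gives the same w. Function names and data are illustrative.

```python
# Sketch of polynomial regression via w = (D^T D)^{-1} D^T r, assuming NumPy;
# degree k = 1 recovers the linear-regression normal equations A w = y above.
import numpy as np

def fit_polynomial(x, r, k):
    """Least-squares fit of g(x|w) = w_0 + w_1 x + ... + w_k x^k."""
    x, r = np.asarray(x, float), np.asarray(r, float)
    D = np.vander(x, k + 1, increasing=True)   # rows: [1, x^t, (x^t)^2, ..., (x^t)^k]
    # lstsq solves the same least-squares problem without inverting D^T D
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w

def predict(x, w):
    return np.vander(np.asarray(x, float), len(w), increasing=True) @ w

x = [0.0, 1.0, 2.0, 3.0]
r = [1.0, 3.1, 4.9, 7.2]        # roughly r = 2x + 1 plus noise
w = fit_polynomial(x, r, k=1)   # w[0] ~ intercept w_0, w[1] ~ slope w_1
print(w, predict([4.0], w))
```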
Estimating Bias and Variance
- M samples X_i = {x^t_i, r^t_i}, i = 1,...,M, are used to fit g_i(x), i = 1,...,M
- Average estimator: ḡ(x) = (1/M) ∑_i g_i(x)
- Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²
- Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²

Bias/Variance Dilemma
- Example: g_i(x) = 2 has no variance and high bias; g_i(x) = ∑t r^t_i / N has lower bias, with variance.
- As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data).
- Bias/variance dilemma (Geman et al., 1992)
- (Figure: f, the g_i and their average ḡ, bias, variance.)

Polynomial Regression
- (Figures: fits of different order; best fit = "min error".)

Model Selection
- Remark: will be discussed in more depth later (Topic 11).
- Cross-validation: measure generalization accuracy by testing on data unused during training.
- Regularization: penalize complex models, E' = error on data + λ · model complexity.
- Akaike's information criterion (AIC), Bayesian information criterion (BIC)
- Minimum description length (MDL): Kolmogorov complexity, shortest description of data
- Structural risk minimization (SRM)

Bayesian Model Selection
- Prior on models, p(model):
  p(model|data) = p(data|model) p(model) / p(data)
- Regularization, when the prior favors simpler models
- Bayes: MAP of the posterior, p(model|data)
- Average over a number of models with high posterior (voting, ensembles: Chapter 15)

CHAPTER 5: Multivariate Methods
- Normal distribution: http://en.wikipedia.org/wiki/Normal_distribution
- Z-score: see http://en.wikipedia.org/wiki/Standard_score

Multivariate Data
- Multiple measurements (sensors)
- d inputs/features/attributes: d-variate
- N instances/observations/examples
- Data matrix (N × d):
  X = [ X₁¹   X₂¹   ...  X_d¹  ]
      [ X₁²   X₂²   ...  X_d²  ]
      [ ...                    ]
      [ X₁^N  X₂^N  ...  X_d^N ]

Multivariate Parameters
- Mean: E[x] = μ = (μ₁, ..., μ_d)ᵀ
- Covariance: σ_ij = Cov(X_i, X_j)
- Correlation: Corr(X_i, X_j) = ρ_ij = σ_ij / (σ_i σ_j)
- Covariance matrix:
  Σ = Cov(X) = E[(X − μ)(X − μ)ᵀ] = [ σ₁²   σ₁₂   ...  σ_1d ]
                                    [ σ₂₁   σ₂²   ...  σ_2d ]
                                    [ ...                   ]
                                    [ σ_d1  σ_d2  ...  σ_d² ]
- Example:
  [ 16   0   0 ]
  [  0  16  −3 ]
  [  0  −3   1 ]
- Correlation: http://en.wikipedia.org/wiki/Correlation

Parameter Estimation
- Sample mean m: m_i = ∑_{t=1..N} x_i^t / N, i = 1,...,d
- Covariance matrix S: s_ij = ∑_{t=1..N} (x_i^t − m_i)(x_j^t − m_j) / N,
  or S = ∑_{t=1..N} (x^t − m)(x^t − m)ᵀ / N with m = (m₁, ..., m_d)ᵀ
- Correlation matrix R: r_ij = s_ij / (s_i s_j)
- http://en.wikipedia.org/wiki/Multivariate_normal_distribution
- http://webscripts.softpedia.com/script/Scientific-Engineering-Ruby/Statistics-and-Probability/Multivariate-Gaussian-Distribution-35454.html

Multivariate Normal Distribution
- x ~ N_d(μ, Σ):
  p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) exp[−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)]   (5.9)
- The quadratic form in the exponent is the (squared) Mahalanobis distance between x and μ.
- http://en.wikipedia.org/wiki/Mahalanobis_distance

Mahalanobis Distance
- The Mahalanobis distance is based on correlations between variables, by which different patterns can be identified and analyzed. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant.
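Below is a minimal sketch of the sample estimates m, S, R and of the Mahalanobis distance and N_d(μ, Σ) density from Eq. (5.9), assuming NumPy; the function names and the toy data matrix are illustrative choices, not part of the lecture notes.

```python
# Sketch of the estimators above (sample mean m, covariance S, correlation R)
# and of the Mahalanobis distance / N_d(mu, Sigma) density, assuming NumPy.
import numpy as np

def estimate_parameters(X):
    """X is the N x d data matrix; returns m, S (ML estimate, divides by N), R."""
    X = np.asarray(X, float)
    N = X.shape[0]
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m) / N
    s = np.sqrt(np.diag(S))
    R = S / np.outer(s, s)
    return m, S, R

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = np.asarray(x, float) - np.asarray(mu, float)
    return d @ np.linalg.solve(Sigma, d)

def mvn_pdf(x, mu, Sigma):
    """Density of N_d(mu, Sigma), Eq. (5.9)."""
    d = len(mu)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * mahalanobis2(x, mu, Sigma))

X = np.array([[1.0, 2.0], [2.0, 3.1], [3.0, 5.3], [4.0, 6.2]])
m, S, R = estimate_parameters(X)
print(m, R[0, 1])                 # correlated toy data
print(mvn_pdf(X[0], m, S))
```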
Multivariate Normal Distribution (continued)
- x ~ N_d(μ, Σ), p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) exp[−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)]
- Matrix inversion: http://www.analyzemath.com/Calculators/inverse_matrix_3by3.html
- Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ) measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and correlations).
- Bivariate case (d = 2):
  Σ = [ σ₁²     ρσ₁σ₂ ]
      [ ρσ₁σ₂   σ₂²   ]
  where ρ is the correlation between the two variables, and
  p(x₁, x₂) = 1/(2π σ₁ σ₂ √(1 − ρ²)) exp[−(1/(2(1 − ρ²))) (z₁² − 2ρ z₁ z₂ + z₂²)]
  with z_i = (x_i − μ_i)/σ_i, called the z-score for x_i.
- Z-score: see http://en.wikipedia.org/wiki/Standard_score

Bivariate Normal
- (Figures.)

Independent Inputs: Naive Bayes
- If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:
  p(x) = ∏_{i=1..d} p_i(x_i) = 1/((2π)^(d/2) ∏_i σ_i) exp[−(1/2) ∑_{i=1..d} ((x_i − μ_i)/σ_i)²]
- If the variances are also equal, it reduces to the Euclidean distance.

Parametric Classification
- If p(x|C_i) ~ N(μ_i, Σ_i):
  p(x|C_i) = 1/((2π)^(d/2) |Σ_i|^(1/2)) exp[−(1/2)(x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i)]
- Discriminant functions:
  g_i(x) = log p(x|C_i) + log P(C_i)
         = −(d/2) log 2π − (1/2) log |Σ_i| − (1/2)(x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) + log P(C_i)

Estimation of Parameters
- P̂(C_i) = ∑t r_i^t / N
- m_i = ∑t r_i^t x^t / ∑t r_i^t
- S_i = ∑t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / ∑t r_i^t
- g_i(x) = −(1/2) log |S_i| − (1/2)(x − m_i)ᵀ S_i⁻¹ (x − m_i) + log P̂(C_i)
- (A code sketch of this classifier appears at the end of this part.)

Different S_i (skip)
- Quadratic discriminant:
  g_i(x) = −(1/2) log |S_i| − (1/2)(xᵀ S_i⁻¹ x − 2 xᵀ S_i⁻¹ m_i + m_iᵀ S_i⁻¹ m_i) + log P̂(C_i)
         = xᵀ W_i x + w_iᵀ x + w_i0
  where
  W_i  = −(1/2) S_i⁻¹
  w_i  = S_i⁻¹ m_i
  w_i0 = −(1/2) m_iᵀ S_i⁻¹ m_i − (1/2) log |S_i| + log P̂(C_i)
- (Figure: likelihoods, the discriminant P(C₁|x) = 0.5, and the posterior for C₁.)

Common Covariance Matrix S (initially skip)
- Shared common sample covariance S:
  S = ∑_i P̂(C_i) S_i
- The discriminant reduces to
  g_i(x) = −(1/2)(x − m_i)ᵀ S⁻¹ (x − m_i) + log P̂(C_i)
  which is a linear discriminant
  g_i(x) = w_iᵀ x + w_i0 where w_i = S⁻¹ m_i, w_i0 = −(1/2) m_iᵀ S⁻¹ m_i + log P̂(C_i)
- (Figure.)

Diagonal S (likely covered in April)
- When the x_j, j = 1,...,d, are independent, Σ is diagonal:
  p(x|C_i) = ∏_j p(x_j|C_i)  (Naive Bayes' assumption)
  g_i(x) = −(1/2) ∑_{j=1..d} ((x_j − m_ij)/s_j)² + log P̂(C_i)
- Classify based on weighted Euclidean distance (in s_j units) to the nearest mean.
- (Figure: variances may be different.)

Diagonal S, equal variances
- Nearest mean classifier: classify based on Euclidean distance to the nearest mean:
  g_i(x) = −‖x − m_i‖² / (2s²) + log P̂(C_i) = −(1/(2s²)) ∑_{j=1..d} (x_j − m_ij)² + log P̂(C_i)
- Each mean can be considered a prototype or template, and this is template matching.
- (Figure.)
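Here is a minimal sketch of the multivariate Gaussian classifier above (per-class estimates P̂(C_i), m_i, S_i and the quadratic discriminant g_i(x)), assuming NumPy; integer labels 0..K−1 replace the r_i^t indicators, and the function names and toy data are illustrative.

```python
# Sketch of the multivariate Gaussian classifier above, assuming NumPy;
# X is the N x d data matrix, labels y are integers 0..K-1.
import numpy as np

def fit(X, y, K):
    """Per-class ML estimates P_hat(C_i), m_i, S_i."""
    X, y = np.asarray(X, float), np.asarray(y)
    priors, means, covs = [], [], []
    for i in range(K):
        Xi = X[y == i]
        priors.append(len(Xi) / len(X))
        means.append(Xi.mean(axis=0))
        diff = Xi - means[-1]
        covs.append(diff.T @ diff / len(Xi))
    return np.array(priors), np.array(means), np.array(covs)

def g(x, priors, means, covs):
    """g_i(x) = -0.5 log|S_i| - 0.5 (x-m_i)^T S_i^{-1} (x-m_i) + log P_hat(C_i)."""
    x = np.asarray(x, float)
    scores = []
    for P, m, S in zip(priors, means, covs):
        d = x - m
        scores.append(-0.5 * np.log(np.linalg.det(S))
                      - 0.5 * d @ np.linalg.solve(S, d)
                      + np.log(P))
    return np.array(scores)      # predict argmax_i g_i(x)

X = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.0], [2.1, 2.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
priors, means, covs = fit(X, y, K=2)
print(g([0.1, 0.0], priors, means, covs).argmax())   # -> 0
```

Using a shared S (average the S_i weighted by P̂(C_i)) or keeping only its diagonal would give the linear and naive-Bayes variants described on the slides.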
Model Selection

  Assumption                    Covariance matrix         No. of parameters
  Shared, hyperspheric          S_i = S = s²I             1
  Shared, axis-aligned          S_i = S, with s_ij = 0    d
  Shared, hyperellipsoidal      S_i = S                   d(d+1)/2
  Different, hyperellipsoidal   S_i                       K · d(d+1)/2

- As we increase complexity (less restricted S), bias decreases and variance increases.
- Assume simple models (allow some bias) to control variance (regularization).

Discrete Features (skip)
- Binary features: p_ij = p(x_j = 1 | C_i)
- If the x_j are independent (Naive Bayes'),
  p(x|C_i) = ∏_{j=1..d} p_ij^(x_j) (1 − p_ij)^(1−x_j)
- The discriminant is linear:
  g_i(x) = log p(x|C_i) + log P(C_i) = ∑_j [x_j log p_ij + (1 − x_j) log(1 − p_ij)] + log P(C_i)
- Estimated parameters:
  p̂_ij = ∑t x_j^t r_i^t / ∑t r_i^t
  (A code sketch appears at the end of this part.)

Discrete Features (skip)
- Multinomial (1-of-n_j) features: x_j ∈ {v₁, v₂, ..., v_nj}
  p_ijk = p(z_jk = 1 | C_i) = p(x_j = v_k | C_i)
- If the x_j are independent,
  p(x|C_i) = ∏_{j=1..d} ∏_{k=1..nj} p_ijk^(z_jk)
  g_i(x) = ∑_j ∑_k z_jk log p_ijk + log P(C_i)
- Estimated parameters:
  p̂_ijk = ∑t z_jk^t r_i^t / ∑t r_i^t

Multivariate Regression (skip)
- r^t = g(x^t | w₀, w₁, ..., w_d) + ε
- Multivariate linear model:
  g(x^t) = w₀ + w₁ x₁^t + w₂ x₂^t + ... + w_d x_d^t
  E(w₀, w₁, ..., w_d | X) = (1/2) ∑t [r^t − w₀ − w₁ x₁^t − ... − w_d x_d^t]²
- Multivariate polynomial model: define new higher-order variables z₁ = x₁, z₂ = x₂, z₃ = x₁², z₄ = x₂², z₅ = x₁x₂ and use the linear model in this new z space (basis functions, kernel trick, SVM: Chapter 10). (See the sketch after this section.)
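A minimal sketch of the binary-feature naive Bayes discriminant above, assuming NumPy; integer labels 0..K−1 replace the r_i^t indicators, and the clipping of p̂_ij away from 0 and 1 is an extra safeguard against log 0 that is not in the slides.

```python
# Sketch of the binary-feature naive Bayes discriminant above, assuming NumPy;
# X is an N x d 0/1 matrix, labels y are integers 0..K-1.
import numpy as np

def fit(X, y, K, eps=1e-9):
    """p_hat_ij = sum_t x_j^t r_i^t / sum_t r_i^t (clipped away from 0 and 1)."""
    X, y = np.asarray(X, float), np.asarray(y)
    priors = np.array([(y == i).mean() for i in range(K)])
    p = np.array([X[y == i].mean(axis=0) for i in range(K)])
    return priors, np.clip(p, eps, 1 - eps)

def g(x, priors, p):
    """g_i(x) = sum_j [x_j log p_ij + (1-x_j) log(1-p_ij)] + log P_hat(C_i)."""
    x = np.asarray(x, float)
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)

X = [[1, 1, 0], [1, 0, 0], [1, 1, 1],
     [0, 0, 1], [0, 1, 1], [0, 0, 1]]
y = [0, 0, 0, 1, 1, 1]
priors, p = fit(X, y, K=2)
print(g([1, 0, 0], priors, p).argmax())   # -> 0
```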
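Finally, a minimal sketch of the multivariate polynomial model above: map (x₁, x₂) to the z variables listed on the slide and fit the linear model in z space by least squares. NumPy is assumed, and the function names and toy data are illustrative.

```python
# Sketch of the "define new variables z and use the linear model" idea above,
# assuming NumPy; the quadratic basis (x1, x2, x1^2, x2^2, x1*x2) is the one
# listed on the slide, with a constant term added for the intercept w_0.
import numpy as np

def quadratic_basis(X):
    """Map (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2)."""
    X = np.asarray(X, float)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

def fit_linear(Z, r):
    """Least-squares solution of the (multivariate) linear model in z space."""
    w, *_ = np.linalg.lstsq(Z, np.asarray(r, float), rcond=None)
    return w

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
r = [1.0, 2.0, 2.0, 4.1, 7.0, 7.1]     # roughly 1 + x1^2 + x2^2 + x1*x2
w = fit_linear(quadratic_basis(X), r)
print(w)
print(quadratic_basis(np.array([[2.0, 2.0]])) @ w)
```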