Parametric and Multivariate Methods
Christopher Simpkins
chris.simpkins@gatech.edu
CS 4641 Machine Learning

- Maximum Likelihood Estimation
- Bias and Variance
- Parametric Classification
- Model Selection
- Multivariate Data

Parametric Estimation
- Given X = {x^t}_t where x^t ~ p(x)
- Assume a distribution for p(x|θ) and estimate the sufficient statistics, θ, using X (the data)
- Example: if X ~ N(µ, σ²), we estimate µ and σ², the sufficient statistics of the Gaussian distribution, using X

Maximum Likelihood Estimation
- Say we have X = {x^t}_t where x^t ~ p(x|θ). Maximum likelihood estimation finds the parameter(s) θ that make X most likely.
- Because the x^t are i.i.d., the likelihood of θ given a sample X is the product of the individual x^t likelihoods:
  l(θ|X) ≡ p(X|θ) = ∏_{t=1}^N p(x^t|θ)
- In MLE, we find the parameters θ that make X most likely to be drawn:
  θ* = argmax_θ L(θ|X)
- For mathematical convenience and computational expedience we use the log likelihood, which turns the product into a sum:
  L(θ|X) ≡ log l(θ|X) = Σ_{t=1}^N log p(x^t|θ)

Example: Bernoulli Density
- The Bernoulli distribution: P(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}
- Its expectation and variance are:
  E[X] = Σ_x x p(x) = 1·p + 0·(1 − p) = p
  Var(X) = Σ_x (x − E[X])² p(x) = p(1 − p)
- The log likelihood of a given Bernoulli sample X = {x^t}_t is
  L(p|X) = log ∏_{t=1}^N p^{x^t} (1 − p)^(1−x^t)
         = Σ_t x^t log p + Σ_t (1 − x^t) log(1 − p)
         = Σ_t x^t log p + (N − Σ_t x^t) log(1 − p)

Example: Bernoulli Density (continued)
- If we set the derivative dL/dp = 0 and solve for p, we get the MLE
  p̂ = (Σ_t x^t) / N
- Note that p̂ is itself a random variable, taking a value p̂_i for a given sample X_i
- As N increases, the variance of the estimator decreases and the p̂_i become more similar

Example: Multinomial Density
- The multinomial distribution is a generalization of the Bernoulli
- Given {x^t}_{t=1}^N where
  x_i^t = 1 if experiment t chose state i, and 0 otherwise
- We can follow the same procedure as for the Bernoulli and get:
  p̂_i = (Σ_t x_i^t) / N

Example: Gaussian Density
- The Gaussian (normal) density is denoted N(µ, σ²), where E[X] ≡ µ and Var(X) ≡ σ²
- The Gaussian density function is:
  p(x) = (1 / (√(2π) σ)) exp[ −(x − µ)² / (2σ²) ]
- This leads to the log likelihood function
  L(µ, σ|X) = −(N/2) log(2π) − N log σ − Σ_t (x^t − µ)² / (2σ²)
- Setting the derivatives of the log likelihood equal to 0, we get the MLE estimators for µ (m) and σ² (s²):
  m = (Σ_t x^t) / N
  s² = (Σ_t (x^t − m)²) / N
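As a quick check of these closed-form estimators, the following minimal sketch draws Bernoulli and Gaussian samples and computes p̂, m, and s² directly from the formulas above. It assumes NumPy is available; the parameter values and sample size are arbitrary illustrative choices, not from the slides.

```python
# Minimal sketch: the closed-form MLEs from the slides, checked numerically.
# The distribution parameters and sample size are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: p-hat = (sum_t x^t) / N
x_bern = rng.binomial(n=1, p=0.3, size=1000)
p_hat = x_bern.sum() / len(x_bern)

# Gaussian: m = sample mean, s^2 = average squared deviation from m
x_gauss = rng.normal(loc=2.0, scale=1.5, size=1000)
m = x_gauss.sum() / len(x_gauss)
s2 = ((x_gauss - m) ** 2).sum() / len(x_gauss)   # note: divide by N, not N-1

print(f"Bernoulli: true p = 0.30, MLE p-hat = {p_hat:.3f}")
print(f"Gaussian:  true mu = 2.00, MLE m = {m:.3f}")
print(f"Gaussian:  true sigma^2 = 2.25, MLE s^2 = {s2:.3f}")
```

With a large N the estimates land close to the true parameters, which is the behavior the next slides quantify in terms of bias and variance.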
Evaluating an Estimator: Bias
- Let d(X) be an estimator for θ. The bias of d is:
  b_θ(d) = E[d(X)] − θ
- If b_θ(d) = 0 for all θ, then d is an unbiased estimator
- For example, m is an unbiased estimator of µ:
  E[m] − µ = E[(Σ_t x^t)/N] − µ = (1/N) Σ_t E[x^t] − µ = (Nµ)/N − µ = µ − µ = 0
- This means that, though m may differ from µ for a particular sample, as we take more samples the average of their sample means approaches the population mean

Evaluating an Estimator: Variance
- The variance of an estimator tells us how much it varies from sample to sample
- m is a consistent estimator because Var(m) → 0 as N → ∞
- s² is biased (details in the book)
- The mean square error of an estimator d (after a bunch of tedious algebra; look in the book if you're interested) is:
  r(d, θ) = E[(d − θ)²] = Var(d) + (b_θ(d))²

The Bayes' Estimator
- The estimators we've seen so far are frequentist: they're based only on sample data
- We can also take a Bayesian approach, which takes advantage of expert knowledge and helps deal with small data sets
- We use Bayes' rule to combine a prior density p(θ) with the evidence (the sample data) to get a posterior estimate of θ:
  p(θ|X) = p(X|θ) p(θ) / p(X)

Parametric Classification
- Remember Bayesian decision theory:
  P(C_i|x) = p(x|C_i) P(C_i) / p(x)
  and the discriminant g_i(x) = p(x|C_i) P(C_i), or g_i(x) = log p(x|C_i) + log P(C_i)
- If we assume p(x|C_i) ~ N(µ_i, σ_i²), then
  g_i(x) = −(1/2) log 2π − log σ_i − (x − µ_i)² / (2σ_i²) + log P(C_i)
- Plugging in the estimators m_i, s_i², and P̂(C_i), and doing some algebra (assuming equal priors and a shared variance), we get
  g_i(x) = −(x − m_i)²
  and choose C_i if |x − m_i| = min_k |x − m_k|
- In other words, we assign a query point to the class whose mean is nearest (a small sketch appears after the model-selection example below)

Tuning Model Complexity: Bias/Variance Dilemma
- Bias/variance dilemma: as model complexity increases, small changes in the data set cause big changes in the learned hypotheses (variance), but complex models fit the data better
- Bias means the model class does not contain the true hypothesis, which we call underfitting
- Variance means the model class is too general (complex) and learns the noise, which we call overfitting
- So how do we find the right model complexity?

Model Selection
- Regularization: augment the error function with a parameter, λ, that penalizes model complexity:
  E′ = error on data + λ · model complexity
- Akaike or Bayesian Information Criterion (AIC, BIC): estimate the discrepancy between test error and training error
- Structural risk minimization: choose the simplest model that gives us good empirical error
- Minimum description length (MDL): information-theoretic
- Kolmogorov complexity: the shortest description of the data
- Bayesian model selection:
  p(model|data) = p(data|model) p(model) / p(data)
- The way we really do it: cross-validation. As complexity increases, validation error decreases to a point, then increases. This "elbow" is the optimal model complexity.

Example: Model Selection
[Figure: (a) data and fitted polynomials; (b) training and validation error versus polynomial order, for orders 1 through 8]
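To make the cross-validation recipe and the figure concrete, here is a minimal sketch that fits polynomials of increasing order and compares training error to validation error. It assumes NumPy; the synthetic data, split sizes, and noise level are made-up choices for illustration, not taken from the slides.

```python
# Minimal sketch of picking polynomial order by validation error, in the spirit
# of the figure above. The data is synthetic; the "true" function and the noise
# level are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=60)
y = np.sin(1.5 * x) + rng.normal(scale=0.3, size=x.shape)   # assumed ground truth + noise

# Split into training and validation sets
idx = rng.permutation(len(x))
train, val = idx[:40], idx[40:]

def errors(order):
    """Fit a polynomial of the given order on the training set and return
    (training MSE, validation MSE)."""
    coeffs = np.polyfit(x[train], y[train], deg=order)
    predict = lambda xs: np.polyval(coeffs, xs)
    return (np.mean((predict(x[train]) - y[train]) ** 2),
            np.mean((predict(x[val]) - y[val]) ** 2))

for order in range(1, 9):
    tr, va = errors(order)
    print(f"order {order}: training error {tr:.3f}, validation error {va:.3f}")
# Training error keeps dropping as order grows; validation error bottoms out at
# the "elbow", which is the complexity we would pick.
```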
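Going back to the Parametric Classification slide, the nearest-mean rule is also easy to sketch. The toy example below (assuming NumPy, made-up one-dimensional class-conditional Gaussians with a shared variance, and equal priors) estimates the class means and classifies a few query points by the closest mean.

```python
# Minimal sketch of the simplified parametric classifier: estimate each class
# mean m_i from labeled data, then assign a query point x to the class whose
# mean is nearest. Equal priors and a shared variance are assumed, matching the
# simplified discriminant g_i(x) = -(x - m_i)^2 on the slide.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D training data from two classes (means and variance are made up)
x0 = rng.normal(loc=-1.0, scale=1.0, size=100)   # class 0
x1 = rng.normal(loc=+2.0, scale=1.0, size=100)   # class 1
X = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

# MLE class means
means = np.array([X[y == c].mean() for c in (0, 1)])

def classify(x):
    """Choose C_i minimizing |x - m_i| (equivalently, maximizing -(x - m_i)^2)."""
    return int(np.argmin(np.abs(x - means)))

for query in (-2.0, 0.3, 1.5):
    print(f"x = {query:+.1f} -> class {classify(query)}")
```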
Multivariate Data
- A data set can be viewed as an N × d matrix, with one row per observation and one column per attribute:
  X = [ X_1^1  X_2^1  ···  X_d^1
        X_1^2  X_2^2  ···  X_d^2
          ⋮      ⋮     ⋱     ⋮
        X_1^N  X_2^N  ···  X_d^N ]
- We can calculate various statistics from the data matrix, like the mean vector, E[x] = µ = [µ_1, ..., µ_d]^T, or attribute covariances

Covariance
- The variance of a single variable X is given by:
  σ² = Σ_{i=1}^n (x_i − X̄)² / n
- The covariance of two variables, X and Y, is given by:
  cov(X, Y) = Σ_{i=1}^n (x_i − X̄)(y_i − Ȳ) / n
- Covariance tells you how two variables vary together:
  - If the covariance between two variables is positive, then as one variable increases the other tends to increase.
  - If the covariance between two variables is negative, then as one variable increases the other tends to decrease.
  - If the covariance between two variables is zero, then the two variables are linearly uncorrelated (which does not by itself imply independence).

Covariance Matrix
- For a vector of variables <X_1, ..., X_n>, such as the features of a data set, we can construct a matrix representing the covariance between each pair of variables X_i and X_j, where i and j index the feature vector:
  cov(X) = [ var(X_1)       cov(X_1, X_2)  ···  cov(X_1, X_n)
             cov(X_2, X_1)  var(X_2)       ···  cov(X_2, X_n)
                ⋮               ⋮          ⋱        ⋮
             cov(X_n, X_1)  cov(X_n, X_2)  ···  var(X_n)     ]
- Notice that along the diagonal we have simply the variance of each individual variable, and the matrix is symmetric, that is, cov(X_i, X_j) = cov(X_j, X_i).

Using a Covariance Matrix
- Consider the following data set:¹ age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country
- Are any of these attributes related? (A small computational sketch follows the last slide.)

¹ http://archive.ics.uci.edu/ml/datasets/Adult

A Look Ahead
- Clustering: finding groups within unlabeled data
- Feature selection: selecting (or deriving) features to improve learning
- Dimensionality reduction: dealing with high-dimensional data by transforming the data into a lower-dimensional space
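The covariance-matrix slides translate directly into a few lines of NumPy, shown in the sketch below. The data here is synthetic: the column names mimic some of the Adult data set's numeric attributes but the values (and the relationship between age and hours) are fabricated for illustration; to actually answer the question on the "Using a Covariance Matrix" slide you would load the real numeric columns in place of X.

```python
# Minimal sketch: building a covariance matrix from a data matrix X
# (rows = observations, columns = attributes). All values are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Fake N x d data matrix with two deliberately related columns
N = 500
age = rng.uniform(18, 65, size=N)
hours = 20 + 0.4 * age + rng.normal(scale=5, size=N)    # loosely tied to age
capital_gain = rng.exponential(scale=1000, size=N)      # unrelated to both
X = np.column_stack([age, hours, capital_gain])

# Covariance matrix: entry (i, j) is cov(X_i, X_j); the diagonal holds variances
C = np.cov(X, rowvar=False)          # rowvar=False: columns are the variables
assert np.allclose(C, C.T)           # symmetric, as noted on the slide

# The correlation matrix rescales covariances to [-1, 1], which is easier to read
R = np.corrcoef(X, rowvar=False)

print("covariance matrix:\n", np.round(C, 1))
print("correlation matrix:\n", np.round(R, 2))
```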