Lecture 6: Statistics: Learning models from data
DS GA 1002 Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15
Carlos Fernandez-Granda
10/19/2015

Learning a model

Of interest in itself: a characterization of the data
Useful to perform inference
Models may be parametric (fit using a frequentist or a Bayesian framework) or nonparametric

Parametric models

Assumption: the data are sampled from a known distribution with a small number of unknown parameters
Justification: theoretical (Central Limit Theorem), empirical, . . .

Parametric models: Frequentist framework

Likelihood

Aim: find the parameters that fit the data best
Criterion: what value of the parameters makes the data most likely?
For a fixed vector of iid data x the likelihood is defined as

  L_x(\theta) := \prod_{i=1}^{n} p_{X_i}(x_i, \theta)

if the distribution of the data is discrete, and

  L_x(\theta) := \prod_{i=1}^{n} f_{X_i}(x_i, \theta)

if the distribution is continuous

Maximum-likelihood estimator

The likelihood is a function of the parameters \theta
For fixed \theta the likelihood is the pmf/pdf evaluated at the data
It quantifies how likely the data are according to the model
Maximum-likelihood (ML) estimator:

  \hat{\theta}_{ML}(x) := \arg\max_{\theta} L_x(\theta) = \arg\max_{\theta} \log L_x(\theta)

Maximizing the log-likelihood is equivalent, and often more convenient

Examples

Parameter of a Bernoulli:

  \hat{p}_{ML} = \frac{n_1}{n_0 + n_1}

Parameters of a Gaussian:

  \hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{ML})^2

Example: Fitting a Gaussian

[Figure: Gaussian pdfs fit by ML to height data (inches) for n = 20 and n = 1000]

Parametric models: Bayesian framework

Bayesian framework

Frequentist statistics: the parameters are deterministic
Bayesian statistics: the parameters are random with a known distribution
Greater modeling flexibility, but stronger assumptions

Two modeling choices:
1. Prior distribution of the parameters
2. Conditional distribution of the data given the parameters (this is the likelihood!)
Aim: determine the posterior distribution of the parameters given the data

Posterior distribution

By Bayes' rule, if the data are discrete,

  p_{\Theta|X}(\theta|x) = \frac{p_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{\sum_u p_\Theta(u)\, p_{X|\Theta}(x|u)}

for discrete parameters, and

  f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{\int f_\Theta(u)\, p_{X|\Theta}(x|u)\, du}

for continuous parameters

If the data are continuous,

  p_{\Theta|X}(\theta|x) = \frac{p_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\sum_u p_\Theta(u)\, f_{X|\Theta}(x|u)}

for discrete parameters, and

  f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\int f_\Theta(u)\, f_{X|\Theta}(x|u)\, du}

for continuous parameters

Credible intervals

Once we know the posterior, we can compute 1 - \alpha credible intervals
Now we can talk about the probability of the parameter belonging to a fixed interval
Unlike frequentist confidence intervals, credible intervals depend on the choice of prior

Conditional mean

We often need a point estimate of the parameter
The mean of the posterior distribution is optimal if our criterion is mean squared error:

  E[\Theta | X] = \arg\min_{\hat{\theta}(X)} E[(\hat{\theta}(X) - \Theta)^2]
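The ML estimates on the earlier slides have closed forms, so computing them is one line each. Here is a minimal sketch in Python; the data are simulated (the lecture's height dataset is not included here, so the distribution parameters are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical height sample; simulated from a Gaussian for illustration
heights = rng.normal(loc=67.0, scale=3.0, size=1000)  # inches

# ML estimates for a Gaussian: sample mean and (biased) sample variance
mu_ml = heights.mean()
sigma2_ml = np.mean((heights - mu_ml) ** 2)

# ML estimate of a Bernoulli parameter: fraction of ones in the sample
tosses = rng.binomial(1, p=0.3, size=100)
n1 = tosses.sum()
n0 = tosses.size - n1
p_ml = n1 / (n0 + n1)

print(f"Gaussian ML fit: mu = {mu_ml:.2f}, sigma^2 = {sigma2_ml:.2f}")
print(f"Bernoulli ML fit: p = {p_ml:.2f}")
```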
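To make the posterior concrete before the Bernoulli example that follows, here is a sketch that computes the posterior of a Bernoulli parameter. The slides do not name a prior family; the Beta prior used below is an assumption, chosen because it is conjugate to the Bernoulli likelihood (the uniform prior is the special case Beta(1, 1)), so the posterior, its mean, and credible intervals come out in closed form.

```python
import numpy as np
from scipy.stats import beta

# Coin-flip counts as in the example on the next slide (n0 zeros, n1 ones)
n0, n1 = 91, 9

# Assumed Beta(a, b) prior; Beta(1, 1) is the uniform prior on [0, 1].
# Conjugacy gives the posterior directly: Beta(a + n1, b + n0).
a, b = 1.0, 1.0
posterior = beta(a + n1, b + n0)

# Posterior mean: the optimal point estimate under mean squared error
p_mean = (a + n1) / (a + b + n0 + n1)

# 95% credible interval: the parameter lies in it with probability 0.95
lo, hi = posterior.ppf([0.025, 0.975])

# ML estimator for comparison
p_ml = n1 / (n0 + n1)

print(f"posterior mean = {p_mean:.3f}, ML = {p_ml:.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```

With the uniform prior, the mode of this posterior coincides with the ML estimator, matching the Bayesian interpretation noted below.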
Example: Parameter of a Bernoulli

[Figures: two prior distributions on [0, 1] (uniform and skewed) and the resulting posteriors for n_0 = 91, n_1 = 9; n_0 = 1, n_1 = 3; and n_0 = 3, n_1 = 1, marking the posterior mean under each prior and the ML estimator]

Bayesian interpretation of the ML estimator

The ML estimator is the mode of the posterior if the prior is uniform
Uniform priors are only possible if the parameter space is bounded

Nonparametric models

Nonparametric methods

A parametric model may not be available, or may not fit the data well
Alternative: estimate the distribution directly from the data
Very challenging: infinitely many different distributions could have generated the data

Empirical cdf

Let X_1, X_2, . . . be an iid sequence with cdf F_X
The empirical cdf at x ∈ R is

  \hat{F}_n(x) := \frac{1}{n} \sum_{i=1}^{n} 1_{X_i \le x}

\hat{F}_n is an unbiased and consistent estimator of F_X:
\hat{F}_n(x) → F_X(x) in mean square as n → ∞

Example: Heights

[Figures: true cdf and empirical cdf of height data (inches) for n = 10 and for n = 100]

Estimating the pdf

Let X_1, X_2, . . . be an iid sequence with pdf f_X
The kernel density estimator of f_X at x ∈ R is

  \hat{f}_{h,n}(x) := \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - X_i}{h}\right)

where k is a kernel with a maximum at 0 which decreases away from 0 and satisfies

  k(x) ≥ 0 for all x ∈ R, \qquad \int_R k(x)\, dx = 1

Example: Abalone weights

[Figure: kernel density estimates of abalone weight (grams) with bandwidths 0.05, 0.25, and 0.5, compared to the true pdf]
[Figure: kernel density estimate of abalone weight (grams)]
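The empirical cdf defined above is just a counting operation. A minimal sketch, again on simulated heights since the lecture's data are not included:

```python
import numpy as np

def empirical_cdf(sample, x):
    """Empirical cdf: F_n(x) = (1/n) * #{i : X_i <= x}, evaluated on a grid."""
    sample = np.asarray(sample)
    x = np.atleast_1d(x)
    # Broadcast to an (len(x), n) matrix of indicators, then average rows
    return np.mean(sample[None, :] <= x[:, None], axis=1)

# Hypothetical height sample (parameters assumed for illustration)
rng = np.random.default_rng(1)
heights = rng.normal(loc=67.0, scale=3.0, size=100)

grid = np.linspace(60, 76, 5)
print(np.column_stack([grid, empirical_cdf(heights, grid)]))
```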
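Similarly, here is a sketch of the kernel density estimator. The Gaussian kernel is an assumption (the slides do not fix a particular kernel, only the two conditions above, which it satisfies), and the bimodal simulated sample stands in for the abalone weights; the bandwidths mirror the ones compared on the slide.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel: nonnegative, peaks at 0, integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(sample, x, h):
    """Kernel density estimate: f_{h,n}(x) = (1/(n h)) * sum_i k((x - X_i)/h)."""
    sample = np.asarray(sample)
    x = np.atleast_1d(x)
    u = (x[:, None] - sample[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(sample) * h)

# Hypothetical bimodal weight sample (the abalone data are not included here)
rng = np.random.default_rng(2)
weights = np.concatenate([rng.normal(0.5, 0.2, 300), rng.normal(1.5, 0.4, 200)])

grid = np.linspace(0.0, 3.0, 7)
for h in (0.05, 0.25, 0.5):  # the bandwidths compared on the slide
    print(h, np.round(kde(weights, grid, h), 3))
```

Running this makes the bandwidth trade-off on the slide visible: small h produces a spiky, noisy estimate, while large h oversmooths and blurs the two modes.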