Regression
Eric Feigelson

Classical regression model

"The expectation (mean) of the dependent (response) variable Y for a given value of the independent variable X (or vector of variables X) is equal to a specified mathematical function f, which depends on both X and a vector of parameters θ, plus a random error (scatter)."

All of the randomness in the model resides in this error term; the functional relationship itself is deterministic, with a known mathematical form. The 'error' ε is commonly assumed to be a normal (Gaussian) i.i.d. random variable with zero mean, ε ~ N(μ, σ²) = N(0, σ²).

Parameter estimation & model selection

Once a mathematical model is chosen and a dataset is provided, the 'best fit' parameters θ are estimated by one (or more) of the following estimation techniques:
• Method of moments
• Least squares (L2)
• Least absolute deviation (L1)
• Maximum likelihood estimation
• Bayesian inference

Choices must be made regarding model complexity and parsimony (Occam's Razor):
• Does the ΛCDM model have a ẇ term?
• Are three or four planets orbiting the star?

Model selection methods are often needed: χ²_ν, BIC, AIC, ...

The final model should be validated against the dataset (or other datasets) using goodness-of-fit tests (e.g. the Anderson-Darling test), with bootstrap resamples to set significance levels.

Robust regression

In the 1960s and 1970s, a major effort was made to design regression methods that are 'robust' against the effects of non-Gaussianity and outliers. Today, a common approach is based on M estimation, which reduces the influence of data points with large residuals. Some common methods:

1. Huber's estimator. Minimize a function of the residuals rᵢ with less dependence on outliers than the sum of squared residuals: Σᵢ ρ(rᵢ), where ρ(r) = r²/2 for |r| ≤ c and ρ(r) = c|r| − c²/2 for |r| > c, so large residuals contribute linearly rather than quadratically.

2. Rousseeuw's least trimmed squares. Minimize the sum of the h smallest squared residuals, Σ_{i=1}^{h} (r²)_{(i)} with h < n, so the largest residuals are ignored entirely.

3. MM estimators. More complicated versions based on scaled residuals.

Quantile regression

When residuals are non-Gaussian and the sample is large, regressions can be performed for different quantiles of the residuals, not just the mean of the residuals. The methodology is complex but often effective.

[Figure: scatterplot with regression fits at the 90%, 50%, and 10% quantiles]

Nonlinear astrophysical models

The radial profile of brightness as a function of distance r from the center of an elliptical galaxy (or spiral galaxy bulge) has a long history of study. Models include Hubble's isothermal sphere (1930), de Vaucouleurs' r^(1/4) law (1948), and Sersic's law (1967):
I(r) = I_e exp{ −b_n [ (r/r_e)^(1/n) − 1 ] },
of which de Vaucouleurs' law is the n = 4 special case.

Nonlinear regression: count and categorical data

Some nonlinear regressions treat problems where the values of the response variable Y are not real numbers.

Poisson regression: Y is integer valued. This occurs often in astronomy (and other fields) when one counts objects, photons, or other events as a function of some other variables. This includes 'luminosity functions' and 'mass functions' of galaxies, stars, planets, etc. However, Poisson regression has never been used by astronomers! The simplest Poisson regression model is loglinear:
Y ~ Poisson(μ(x)), with log μ(x) = β₀ + β₁x.
(A code sketch of the robust, quantile, and Poisson regressions above follows at the end of this section.)

Approaches to regression with measurement error

• 'Minimum chi-squared' regression
• York's (1966) analytical solution for bivariate linear regression
• Modified weighted least squares (Akritas & Bershady 1996)
• 'Errors-in-variables' or 'latent variable' hierarchical regression models (volumes by Fuller 1987; Carroll et al. 2006)
• SIMEX: Simulation-Extrapolation Algorithm (Carroll et al.)
• Normal mixture model: MLE and Bayesian (Kelly 2007)

Unfortunately, no consensus has emerged. I think Kelly's likelihood approach is most promising.
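To ground the methods above, here is a minimal Python sketch, using statsmodels on simulated data, of three of the regressions just described: Huber M-estimation, quantile regression, and loglinear Poisson regression. The dataset, coefficients, and seed are invented for illustration, and the Huber tuning constant is left at the library default; this is a sketch of the techniques, not of any particular astronomical analysis.

```python
# A sketch (not a prescription): Huber M-estimation, quantile regression,
# and loglinear Poisson regression on simulated data. All numbers are
# illustrative assumptions, not values from the text.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0.0, 5.0, n)
X = sm.add_constant(x)  # design matrix with an intercept column

# Gaussian scatter around a true line, plus a few gross outliers
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, n)
y[:5] += 10.0  # contaminate five points

ols = sm.OLS(y, X).fit()                                # least squares (L2)
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # Huber M-estimator
print("OLS slope:  ", ols.params[1])    # biased by the outliers
print("Huber slope:", huber.params[1])  # stays near the true 0.5

# Quantile regression: separate fits for the 10%, 50%, and 90% quantiles
for q in (0.1, 0.5, 0.9):
    print(f"q={q} slope:", sm.QuantReg(y, X).fit(q=q).params[1])

# Loglinear Poisson regression for count data: log E[Y|x] = b0 + b1*x
counts = rng.poisson(np.exp(1.0 + 0.4 * x))
pois = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print("Poisson (b0, b1):", pois.params)  # recovers roughly (1.0, 0.4)
```

The printed slopes show the point of the comparison: the L2 fit is pulled by the contaminated points, while the M-estimator and the median (q = 0.5) fit stay near the true value.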
Problems with 'minimum chi-square' regression, a very common astronomical regression procedure

Dataset of the form (xᵢ, yᵢ, σᵢ), i = 1, ..., n: bivariate data with heteroscedastic measurement errors with known variances.

Linear (or other) regression model: y = f(x; θ), e.g. f(x; α, β) = α + βx.

The best-fit parameters are found by minimizing the function
χ²(θ) = Σᵢ [yᵢ − f(xᵢ; θ)]² / σᵢ²
(see the code sketch at the end of this section). The distributions of the parameters are estimated from tables of the χ² distribution (e.g. Δχ² = 2.71 around the best-fit parameter values gives a 90% confidence interval for 1 degree of freedom; Numerical Recipes, Fig. 15.6.4).

This procedure is often called 'minimum chi-square regression' or 'chi-square fitting' because a random variable that is the sum of squared normal random variables follows a χ² distribution. If all of the variance in Y is attributable to the known measurement errors, and these errors are normally distributed, then the model is valid.

However, from a statistical viewpoint ... this is a non-standard procedure!

Pearson's (1900) chi-square statistic was developed for a very specific problem: hypothesis testing for the multinomial experiment, a contingency table of counts in a predetermined number of categories:
X²_P = Σ_{i=1}^{k} (Oᵢ − Mᵢ)² / Mᵢ,
where the Oᵢ are the observed counts and the Mᵢ are the model counts, which depend on the p model parameters θ. The weights (the denominator) are completely different from those in the astronomers' procedure. R. A. Fisher showed that X²_P is asymptotically (for large n) distributed as χ² with k − p − 1 degrees of freedom.

When the astronomers' measurement errors are estimated from the square root of the counts in the multinomial experiment, their procedure is equivalent to Neyman's version of the X² statistic, which has the observed counts in the denominator.

Why the astronomers' procedure can be very uncertain!

• When the experiment does not clearly specify the number of 'bins', the degrees of freedom of the resulting χ² distribution are ill-determined (Chernoff & Lehmann 1954). Astronomers often take real-valued (X, Y) data, 'bin' them into arbitrary ordered classes in X (e.g. a histogram), average the Y values in each bin, estimate the measurement error as √n_bin, and apply χ² fitting. This violates the Chernoff-Lehmann condition.

• The astronomer often derives the denominator of the χ²-like statistic in complicated and arbitrary ways. There is no guarantee that the resulting 'measurement errors' account for the scatter in Y about the fitted function. The values may be estimated from the noise in the astronomical image or spectrum after complicated data reduction. Sometimes the astronomer adds an estimated systematic error to the noise term, or adjusts the errors in some other fashion. The result is arbitrary changes in the minimization procedure, with no guarantee that the statistic is χ²-distributed.
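As a concrete version of the procedure described at the start of this section, here is a minimal sketch of 'minimum chi-square' line fitting in the benign case where its assumptions hold: Gaussian errors with known variances. The simulated data and all numerical values are assumptions for illustration; scipy's curve_fit performs exactly this weighted minimization, and with absolute_sigma=True it treats the σᵢ as known rather than rescaling them.

```python
# A sketch of 'minimum chi-square' straight-line fitting on simulated data
# with known heteroscedastic Gaussian errors (illustrative values only).
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a + b * x

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 25)
sigma = rng.uniform(0.2, 1.0, x.size)        # known measurement errors
y = 2.0 + 0.5 * x + rng.normal(0.0, sigma)   # true line plus Gaussian scatter

# Minimize chi^2(theta) = sum_i [y_i - f(x_i; theta)]^2 / sigma_i^2
popt, pcov = curve_fit(line, x, y, sigma=sigma, absolute_sigma=True)
chi2_min = np.sum(((y - line(x, *popt)) / sigma) ** 2)

# 90% interval on the slope from Delta chi^2 = 2.71 for 1 degree of
# freedom, i.e. +/- sqrt(2.71) standard errors. This coverage holds
# only because the errors here really are Gaussian with known variances.
b, b_err90 = popt[1], np.sqrt(2.71 * pcov[1, 1])
print(f"chi2_min = {chi2_min:.1f} for {x.size - 2} degrees of freedom")
print(f"slope = {b:.3f} +/- {b_err90:.3f} (90% confidence)")
```

If the σᵢ were instead derived in the arbitrary ways criticized above, χ²_min would no longer follow a χ² distribution with n − 2 degrees of freedom, and the Δχ² = 2.71 interval would lose its nominal 90% coverage.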