MAXIMUM LIKELIHOOD
ALISON BOWLING

GENERAL LINEAR MODEL
• Yi = B0 + B1X1i + B2X2i + B3X3i + … + BkXki + ei
• ei ~ i.i.d. N(0, σ²)
• Residuals are:
• Independent and identically distributed
• Normally distributed
• Mean 0, variance σ²
• What to do when the normality assumption does not hold?
• We can fit an alternative distribution.
• This requires maximum likelihood methods.

ALTERNATIVE DISTRIBUTIONS
• Binomial (proportions)
• P (event occurring), 1 − P (event not occurring)
• Poisson (count data)

MAXIMUM LIKELIHOOD
• Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90–100.
• The standard approach to parameter estimation and inference in statistics.
• Many of the inference methods in statistics are based on MLE:
• Chi-square tests
• Bayesian methods
• Modelling of random effects

PROBABILITY DISTRIBUTIONS
• Imagine a biased coin, with probability of heads w = 0.7, tossed 10 times.
• The following probability distribution can be computed using the binomial theorem.
[Figure: binomial probability distribution for 10 tosses of a coin with w = .7; x-axis: Number of Heads (0–10), y-axis: Probability of result, f(y)]
• This is a probability distribution:
• the probability of obtaining each particular outcome for 10 tosses of a coin with w = .7
• 7 heads is more likely to occur than any other outcome.

LIKELIHOOD FUNCTION
• Suppose we don't know w, but have tossed the coin 10 times and obtained y = 7 heads.
• What is the most likely value of w?
• This may be obtained from the likelihood function.
• The likelihood is a function of the parameter, w, given the data, y.
• The most likely value of w is at the peak of this function.

MAXIMUM LIKELIHOOD ESTIMATION
• We are interested in finding the probability distribution that underlies the data that have been collected.
• We are consequently interested in finding the parameter value(s) that correspond to the desired probability distribution.
• The MLE estimate is the maximum (peak) of the likelihood function.
• This may be obtained from the first derivative of the likelihood function.
• To make sure this is a peak (and not a valley), the second derivative is also checked.

ITERATIVE METHOD
• For very simple scenarios, the maximum can be obtained using calculus, as in the example.
• This is usually not possible, especially when the model involves many parameters.
• Instead, the maximum is found by an iterative series of trial-and-error steps:
• Start with a value of the parameter, w, and compute the likelihood of the data given that value.
• Then try another value, and see if the likelihood is higher.
• If so, keep going.
• Stop when the maximum is found (the solution converges).

MLE ALGORITHMS
• Different algorithms are used to obtain the result:
• EM: expectation-maximisation algorithm
• Newton-Raphson
• Fisher scoring
• SPSS uses both the Newton-Raphson and the Fisher scoring methods.

LOG LIKELIHOOD
• The computation of the likelihood involves multiplying the probabilities of the individual outcomes.
• This can be computationally intensive.
• For this reason, the log of the likelihood is computed instead.
• Instead of multiplying probabilities, the log probabilities are added:
• log(A × B) = log A + log B
• We maximise the log of the likelihood rather than the likelihood itself, for computational convenience.

-2LL
• The log likelihood is the sum of the log probabilities associated with the predicted and actual outcomes.
• This is analogous to the residual sum of squares in OLS regression: the smaller the log likelihood, the greater the unexplained variance.
• The log likelihood is usually negative, and can be made positive by changing the sign.
• We also multiply by 2 to enable us to obtain p values to compare models.
• This value is the -2LL.
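To make the coin example concrete, the following is a minimal sketch (assuming Python with scipy, which the lecture itself does not use). It evaluates the binomial log likelihood of y = 7 heads in n = 10 tosses and locates its peak numerically, in the spirit of the iterative method described above; for this simple case the peak is known analytically to be y/n = 0.7.

```python
from scipy.stats import binom
from scipy.optimize import minimize_scalar

y, n = 7, 10  # 7 heads observed in 10 tosses

# Log likelihood of the parameter w given the data:
# the log of the binomial probability of y heads in n tosses.
def log_lik(w):
    return binom.logpmf(y, n, w)

# Search iteratively for the peak by minimising the negative log likelihood.
result = minimize_scalar(lambda w: -log_lik(w), bounds=(0.001, 0.999), method="bounded")

print(f"MLE of w = {result.x:.3f}")                      # peak at y/n = 0.7
print(f"-2LL at the MLE = {-2 * log_lik(result.x):.3f}")
```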
EVALUATING MODELS
• Using OLS, we use R² to evaluate models:
• i.e. does the addition of a predictor produce a significant increase in R²?
• R² is based on sums of squares, which we do not have when using ML.
• With ML, we use the -2LL, the deviance, and information criteria to evaluate models.
• Unlike R², the -2LL is not meaningful in its own right.
• It is used to compare one model with another.

DEVIANCE
• Deviance is a measure of lack of fit.
• It measures how much worse the model is than a perfectly fitting (saturated) model.
• Deviance can be used to obtain a measure of pseudo-R²:
• R² = 1 − Deviance(fitted model) / Deviance(intercept only)

LIKELIHOOD RATIO STATISTIC
• LR = (-2LLR) − (-2LLF)
• LLR = log likelihood of the reduced model (without the parameters of interest)
• LLF = log likelihood of the full model (with the parameters)
• LR ~ χ²(r), where r = df(full) − df(reduced)
• G² compares the fitted model with the intercept-only model.

MAXIMUM LIKELIHOOD IN SPSS
• Logistic regression
• Used with a binomial outcome variable
• e.g. yes/no; correct/incorrect; married/not married
• Generalized Linear Models
• Allows a range of non-linear models to be fitted.

BAR-TAILED GODWIT DATA
• Dependent variable is a count:
• Maximum number of birds observed at each estuary in each year
• Independent variables:
• Estuary: Richmond, Hastings, Clarence, Hunter, Tweed (categorical)
• Year: 1981–2014, continuous (centred to 0 at 1981)
• Research question:
• Does the number of Bar-tailed Godwits in the Richmond estuary remain stable, or improve, compared with the other estuaries?

STEP 1: GRAPH THE DATA
It is obvious that these data have problems. Counts in the Hunter estuary are much higher than in the other estuaries, and have much greater variance.

STEP 2: DUMMY CODE THE ESTUARY DATA
Use Richmond as the comparison category. Each of the other estuaries may then be compared in turn with Richmond.

                 Richmond  Clarence  Hunter  Hastings  Tweed
Clarence dummy      0         1        0        0        0
Hunter dummy        0         0        1        0        0
Hastings dummy      0         0        0        1        0
Tweed dummy         0         0        0        0        1

STEP 3: RUN OLS ANALYSIS OF THE DATA
• I will include just Hunter in this analysis, to illustrate.
• Model:
• Godwiti = B0 + B1 Hunteri + B2 Year0i + B3 Year0i × Hunteri + ei
• Including just Year0:
• There is a non-significant change in Godwit numbers over the years.

OLS DATA ANALYSIS
• Including the estuary and the estuary × Year0 interaction:
• There is a significant increase in R² when Hunter and the Hunter × Year0 interaction are included in the model.

INTERPRETATION OF THE FULL MODEL
• At Year0 = 0, the predicted Godwit count for Richmond = 292 birds.
• Change in numbers over the years for Richmond = -4.4.
• At Year0 = 0, the difference between the numbers in the Hunter and Richmond = 1449.7 (p < .001).
• Over 24 years, the difference in the rate of change for Hunter, compared with Richmond, is -15.2 (p = .031).
• i.e. there is a steeper decline in bird numbers in the Hunter estuary than in the Richmond estuary.

CHECKING RESIDUALS…
• The residuals are not normally distributed.
• The assumptions for a linear model are not met!!

WHAT TO DO?
• We could try a transformation of the DV:
• A square-root transformation is better, but not perfect.
• We could use a non-linear model:
• The data are counts, so we could use either a Poisson or a negative binomial distribution.
• We will use a negative binomial (for reasons that will be explained later).
• Use Generalized Linear Models for the analysis (a sketch follows below).
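The lecture runs this analysis in SPSS; as a point of comparison, here is a minimal sketch assuming Python with pandas and statsmodels, and hypothetical file and column names ("godwits.csv" with count, estuary, and year0). It fits the negative binomial GLM with Richmond as the dummy-coded reference category and the estuary-by-year interaction.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical file: one row per estuary-year, with 'count' (maximum birds
# observed), 'estuary' (category name), and 'year0' (year minus 1981).
godwits = pd.read_csv("godwits.csv")

# Negative binomial GLM with Richmond as the reference (dummy-coded) category
# and the estuary-by-year interaction, mirroring the OLS model above.
model = smf.glm(
    "count ~ C(estuary, Treatment(reference='Richmond')) * year0",
    data=godwits,
    family=sm.families.NegativeBinomial(),
).fit()

print(model.summary())
print(f"-2LL = {-2 * model.llf:.1f}")
```

Treatment coding with a named reference level reproduces the dummy coding of Step 2 without building the indicator columns by hand.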
INTERCEPT ONLY MODEL
• No predictors are included; the model simply tests whether the overall number of BT Godwits is different from zero.
• Log likelihood = -827.26
• -2LL = 1654.53

MODEL WITH THREE PARAMETERS
• Running the model including Year0, Hunter, and Hunter × Year0 gives the following goodness-of-fit measures:
• Log likelihood = -781.3
• -2LL = 1562.6

COMPARING THE TWO MODELS
• -2LL for the intercept-only model = 1654.53
• -2LL for the full model (with parameters) = 1562.6
• Likelihood ratio (G²) = 1654.5 − 1562.6 = 91.9
• df = 3, p < .001
• Therefore the model including the three parameters is a better fit to the data than the intercept-only model.
• Limitations:
1. The models must be nested (one model must be contained within the other).
2. The data sets must be identical.

INFORMATION CRITERIA
• Akaike's Information Criterion: AIC = -2LL + 2k
• Schwarz's Bayesian Criterion: BIC = -2LL + k ln(N)
• k = number of parameters
• N = number of participants
• Can be used with non-nested models.
• These criteria play a role similar to adjusted R²:
• The more parameters you have, the better a model is likely to fit the data.
• The IC take this into account by penalising for additional parameters and/or participants.
• Better-fitting models have lower values of the IC.

ANALYSIS OF COUNT DATA
• Coxe, S., West, S. G., & Aiken, L. (2009). The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment, 91, 121–136.
• Poisson regression
• Overdispersed Poisson regression models
• Negative binomial regression models
• Models which address problems with zeros

ANALYSIS OF COUNT DATA
• Count data are discrete numbers:
• Usually not normally distributed
• e.g. number of drinks on a Saturday night
• Modelled by a Poisson distribution, which has one parameter, μ:
• P(y) = μ^y e^(-μ) / y!

POISSON MODEL
• ln(μ) = B0 + B1X1 + B2X2 + B3X3 + … + BkXk
• Assumptions: (Y|X) ~ Poisson(μ), Var(Y|X) = φμ, with φ = 1
• i.e. the outcome, conditional on the predictors, has a Poisson distribution.

EXAMPLE: DRINKS DATA
• Coxe et al. Poisson dataset in SPSS format.
• Sensation: mean score on a sensation-seeking scale (1–7)
• Gender (0 = female, 1 = male)
• Y: number of drinks on a Saturday night

OLS REGRESSION
• Intercept < 0
• When sensation = 0, the predicted number of drinks is negative!!
• The residuals are not normally distributed.
• OLS has problems!!

POISSON REGRESSION: PARAMETERS
• Sensation only:
• When sensation = 0, drinks = e^(-.14) = .86
• For every 1-unit change in sensation, the number of drinks is multiplied by e^(.231) = 1.26.

POISSON REGRESSION: MODEL FIT
• Sensation only:
• G² = 35.07: the model fits better than the intercept-only model
• Deviance = 1151
• -2LL = -2 × (-1037.5) = 2075
• BIC = 2087
• Deviance for the intercept-only model = 1186
• Pseudo-R² = 1 − Deviance(sensation) / Deviance(intercept only) = 1 − 1151/1186 = .03

POISSON REGRESSION: PARAMETERS
• Sensation and Gender as predictors:
• What is the effect of gender on the number of drinks consumed (holding sensation constant)?

EFFECT OF GENDER
• Intercept = -.789 (for gender = 0: female)
• exp(-.789) = .45
• Females are predicted to drink .45 drinks on a Saturday night (when sensation seeking = 0).
• B = .839 (gender = 1: male)
• exp(.839) = 2.3
• Males drink 2.3 times as many drinks as females, holding sensation seeking constant.
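The drinks analysis can likewise be sketched outside SPSS. Assuming hypothetical column names (drinks, sensation, gender) in a file "drinks.csv", the following fits the Poisson regression with both predictors, exponentiates the coefficients to obtain the multiplicative effects discussed above, and computes the pseudo-R² against the intercept-only model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical column names for the Coxe et al. drinks data:
# 'drinks' (count), 'sensation' (1-7), 'gender' (0 = female, 1 = male).
drinks = pd.read_csv("drinks.csv")

model = smf.glm("drinks ~ sensation + gender", data=drinks,
                family=sm.families.Poisson()).fit()

# Coefficients are on the log scale; exponentiating gives the
# multiplicative effect of each predictor on the expected count.
print(np.exp(model.params))

# Pseudo-R2 relative to the intercept-only model.
null = smf.glm("drinks ~ 1", data=drinks, family=sm.families.Poisson()).fit()
print(f"Pseudo-R2 = {1 - model.deviance / null.deviance:.2f}")
```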
POISSON REGRESSION: MODEL FIT
• -2LL = -2 × (-914.1) = 1828.2
• BIC = 1900.77
• The model including gender is a substantially better fit than the sensation-only model (BIC 1900 vs 2087).
• Pseudo-R² = 1 − Deviance(sensation & gender) / Deviance(intercept only) = 1 − 959.5/1186 = .19

MODEL ADEQUACY
• Save the deviance residuals and predicted values, and plot the residuals against the predicted values.

OVERDISPERSION
• A Poisson distribution has only one parameter, μ, which is both the mean and the variance of the distribution.
• Often the variance of a set of data is greater than the mean:
• The data are overdispersed.

OVERDISPERSED POISSON REGRESSION MODELS
• A second parameter, φ, is estimated to scale the variance.
• The parameter estimates from the overdispersed model are the same as those from the simple model, but the standard errors are larger.
• Use information criteria to compare the models.

NEGATIVE BINOMIAL MODELS
• Negative binomial models are based on the Poisson distribution, but allow the underlying rate to vary from individual to individual.

HOMEWORK
• Use PGSI Data.sav (Leigh's Honours data).
• DV = PGSI (score on the Problem Gambling Severity Index)
• Predictors = GABS, FreqCoded
• Run a Poisson regression to predict PGSI from GABS:
• Does GABS significantly predict the PGSI score?
• Look at the likelihood ratio (G²).
• Interpret the coefficients for the intercept and GABS.
• Run a second regression including FreqCoded (as a continuous variable) in the model:
• Does this second predictor improve the model fit?
• (Hint: look at the BIC for the two models. A sketch of this comparison follows below.)
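For anyone who wants to cross-check the homework comparison outside SPSS, here is a sketch under the assumption that the file's variables are named PGSI, GABS, and FreqCoded (reading the .sav file with pandas requires the pyreadstat package).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed column names from the slide: 'PGSI', 'GABS', 'FreqCoded'.
pgsi = pd.read_spss("PGSI Data.sav")

m1 = smf.glm("PGSI ~ GABS", data=pgsi, family=sm.families.Poisson()).fit()
m2 = smf.glm("PGSI ~ GABS + FreqCoded", data=pgsi, family=sm.families.Poisson()).fit()

# G2 for adding FreqCoded, and the BIC for each model:
# the model with the lower BIC is the better fit after the parameter penalty.
print(f"G2 = {-2 * m1.llf - (-2 * m2.llf):.1f}")
print(f"BIC: GABS only = {m1.bic:.1f}, GABS + FreqCoded = {m2.bic:.1f}")
```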