Stats 330: Lecture 32
© Department of Statistics 2012

The Exam!
• 25 multiple choice questions, similar to term test (60% for 330, 50% for 762)
• 3 “long answer” questions, similar to past exams
• (STATS 330) You have to do all the multiple choice questions, and 2 out of 3 “long answer” questions
• (STATS 762) You have to do a compulsory extra question
• Held on the morning of Wed 31st October 2012

Help Schedule
• I will be available in Rm 265 from 10:30 am to 12:00 on Monday, Tuesday and Wednesday in the week before the exam, my schedule permitting
  – Tutors: as per David Smith’s email

STATS 330: Course Summary
The course was about
• Graphics for data analysis
• Regression models for data analysis

Graphics
Important ideas:
• Visualizing multivariate data
  – Pairs plots
  – 3d plots
  – Coplots
  – Trellis plots
    • Same scales
    • Plots in rows and columns
• Diagnostic plots for model criticism

Regression models
We studied 3 types of regression:
  – “Ordinary” (normal, least squares) regression for continuous responses
  – Logistic regression for binomial responses
  – Poisson regression for count responses (log-linear models)

Normal regression
• Response is assumed to be $N(\mu, \sigma^2)$
• Mean is a linear function of the covariates
  $$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
• Covariates can be either continuous or categorical
• Observations independent, same variance

Logistic regression
• Response ($s$ “successes” out of $n$) is assumed to be Binomial, $\mathrm{Bin}(n, p)$
• Logit of the probability, $\log(p/(1-p))$, is a linear function of the covariates
  $$\log\bigl(p/(1-p)\bigr) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
• Covariates can be either continuous or categorical
• Observations independent

Poisson regression
• Response is assumed to be $\mathrm{Poisson}(\mu)$
• Log of the mean, $\log(\mu)$, is a linear function of the covariates (log-linear models)
  $$\log(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
  or, equivalently,
  $$\mu = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)$$
• Covariates can be either continuous or categorical
• Observations independent

Interpretation of $\beta$ coefficients
• For continuous covariates:
  – In normal regression, $\beta$ is the increase in mean response associated with a unit increase in $x$
  – In logistic regression, $\beta$ is the increase in log odds associated with a unit increase in $x$
  – In Poisson regression, $\beta$ is the increase in log mean associated with a unit increase in $x$
  – In logistic regression, if $x$ is increased by 1, the odds are multiplied by a factor of $\exp(\beta)$
  – In Poisson regression, if $x$ is increased by 1, the mean is multiplied by a factor of $\exp(\beta)$

Interpretation of $\beta$ coefficients
• For categorical covariates (main effects only):
  – In normal regression, $\beta$ is the increase in mean response relative to the baseline
  – In logistic regression, $\beta$ is the increase in log odds relative to the baseline
  – In logistic regression, if we change from baseline to some level, the odds are multiplied by a factor of exp(parameter for that level) relative to the baseline
  – In Poisson regression, if we change from baseline to some level, the mean is multiplied by a factor of exp(parameter for that level) relative to the baseline

Measures of Fit
• $R^2$ (for normal regression)
• Residual deviance (for logistic and Poisson regression)
  – But not for ungrouped data in logistic regression, or for Poisson regression with very small means (cell counts)
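
The multiplicative reading of the coefficients for continuous covariates can be checked numerically. The following is a minimal Python sketch, assuming made-up coefficient values `b0` and `b1` (they are not taken from any fitted model in the course). It compares the odds (logistic) and the mean (Poisson) at `x` and `x + 1` and prints the ratio alongside `exp(b1)`.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not from a fitted model).
b0, b1 = -1.0, 0.5


def logistic_prob(x):
    """Success probability p when log(p/(1-p)) = b0 + b1*x."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))


def odds(x):
    """Odds p/(1-p) at covariate value x."""
    p = logistic_prob(x)
    return p / (1.0 - p)


def poisson_mean(x):
    """Poisson mean when log(mean) = b0 + b1*x."""
    return math.exp(b0 + b1 * x)


x = 2.0
# Multiplicative change in the odds and in the mean for a unit increase in x.
odds_factor = odds(x + 1) / odds(x)
mean_factor = poisson_mean(x + 1) / poisson_mean(x)

print(f"exp(b1)     = {math.exp(b1):.6f}")
print(f"odds factor = {odds_factor:.6f}")
print(f"mean factor = {mean_factor:.6f}")
```

The same factor appears whatever starting value of `x` is used, which is why a single number, exp(b), summarises the effect of a unit increase.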
Prediction
• For normal regression,
  – Predict response at covariates $x_1, \dots, x_k$
  – Estimate mean response at covariates $x_1, \dots, x_k$
• For logistic regression,
  – Estimate log-odds at covariates $x_1, \dots, x_k$
  – Estimate probability of “success” at covariates $x_1, \dots, x_k$
• For Poisson regression,
  – Estimate mean at covariates $x_1, \dots, x_k$

Inference
• Summary table
  – Estimates of regression coefficients
  – Standard errors
  – Test statistics for coefficient = 0
  – $R^2$ etc (normal regression)
  – F-test for null model
  – Null and residual deviances (logistic/Poisson)

Testing model vs sub-model
• Use and interpretation of both forms of anova
  – Comparing model with a sub-model
  – Adding successive terms to a model

Topics specific to normal regression
• Collinearity
  – VIFs
  – Correlation
  – Added variable plots
• Model selection
  – Stepwise procedures: forward selection (FS), backward elimination (BE), stepwise
  – All possible regressions approach
    • AIC, BIC, $C_p$, adjusted $R^2$, cross-validation (CV)

Factors (categorical explanatory variables)
• Factors
  – Baselines
  – Levels
  – Factor level combinations
  – Interactions
  – Dummy variables
  – Know how to express interactions in terms of means, means in terms of interactions
  – Know how to interpret zero interactions

Fitting and Choosing models
• Fit a separate plane (mean if no continuous covariates) to each combination of factor levels
• Search for a simpler submodel (with some interactions zero) using stepwise and anova

Diagnostics
• For non-planar data
  – Plot residuals vs fitted values, residuals vs x’s, partial residual plots, gam plots, Box-Cox plot
  – Transform either x’s or response, fit polynomial terms
• For unequal variance
  – Plot residuals vs fitted values, look for funnel effect
  – Weighted least squares
  – Transform response

Diagnostics (2)
• For outliers and high-leverage points
  – Hat matrix diagonals
  – Standardised residuals
  – Leave-one-out diagnostics
• Independent observations
  – Acf plots
  – Residual/previous residual
  – Time series plot of residuals
  – Durbin-Watson test

Diagnostics (3)
• Normality
  – Normal plot
  – Weisberg-Bingham test
  – Box-Cox (select power)

Specifics for Logistic Regression
Log-likelihood is
$$\ell(\beta_0, \dots, \beta_k) = \sum_{i=1}^{n} \bigl[ s_i \log(p_i) + (n_i - s_i) \log(1 - p_i) \bigr]$$
where
$$p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}$$
or, equivalently,
$$\ell(\beta_0, \dots, \beta_k) = \sum_{i=1}^{n} \bigl[ s_i (\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) - n_i \log\bigl(1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})\bigr) \bigr]$$

Deviance
$$\text{Deviance} = 2(\log L_{\mathrm{MAX}} - \log L_{\mathrm{MOD}})$$
  – $\log L_{\mathrm{MAX}}$: replace p’s with frequencies $s_i/n_i$
  – $\log L_{\mathrm{MOD}}$: replace p’s with estimated p’s from the logistic model, i.e.
$$\hat p_i = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_k x_{ik})}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_k x_{ik})}$$
Can’t use as goodness of fit measure in ungrouped case

Odds and log-odds
• Probabilities $p = \Pr(Y=1)$:
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}$$
• Odds:
$$\frac{p}{1-p} = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)$$
• Log-odds:
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

Residuals
Here $r_i$ is the number of successes in observation $i$ (written $s_i$ in the log-likelihood).
• Pearson:
$$\frac{r_i - n_i \hat p_i}{\sqrt{n_i \hat p_i (1 - \hat p_i)}}$$
• Deviance:
$$d_i = \operatorname{sign}(r_i - n_i \hat p_i)\,\bigl|\,2(\ell_i^{\mathrm{MAX}} - \ell_i^{\mathrm{MOD}})\,\bigr|^{1/2}$$
where
$$\ell_i^{\mathrm{MOD}} = r_i(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_k x_{ik}) - n_i \log\bigl(1 + \exp(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_k x_{ik})\bigr)$$
is the contribution of observation $i$ to $\log L_{\mathrm{MOD}}$, and $\ell_i^{\mathrm{MAX}}$ is the corresponding contribution to $\log L_{\mathrm{MAX}}$ (with $\hat p_i$ replaced by $r_i/n_i$), so that
$$\text{Deviance} = \sum_{i=1}^{n} d_i^2$$

Topics specific to Poisson regression
• Offsets
• Interpretation of regression coefficients
  – (same as for odds in logistic regression)
• Correspondence between Poisson regression (log-linear models) and the multinomial model for contingency tables
  – The “Poisson trick”

Contingency tables
• Cells $1, 2, \dots, m$
• Cell probabilities $p_1, \dots, p_m$
• Counts $y_1, \dots, y_m$
• Log-likelihood is
$$\sum_{i=1}^{m} y_i \log p_i$$

Contingency tables (2)
• A “model” for the table is anything that specifies the form of the probabilities, possibly up to $k$ unknown parameters
• Test if the model is OK by
  – Calculating Deviance $= 2(\log L_{\mathrm{MAX}} - \log L_{\mathrm{MOD}})$
    • $\log L_{\mathrm{MAX}}$: replace p’s with table frequencies
    • $\log L_{\mathrm{MOD}}$: replace p’s with estimated p’s from the model
  – Model OK if deviance is small (p-value > 0.05)
  – Degrees of freedom $m - 1 - k$
  – $k$ = number of parameters in the model

Independence models
• Correspond to interactions being zero
• Fit a “saturated” model using Poisson regression
• Use anova, stepwise to see which interactions are zero
• Identify the appropriate model
• Models can be represented by graphs

Odds Ratios
• Definition and interpretation
• Connection to independence
• Connection with interactions
• Relationship between conditional ORs and interactions
• Homogeneous association model

Association graphs
• Each node is a factor
• Factors joined by lines if there is an interaction between them
• Interpretation in terms of conditional independence
• Interpretation in terms of collapsibility

Contingency tables: final topics
• Association reversal
  – Simpson’s paradox
  – When can you collapse
• Product multinomial
  – Comparing populations
  – Populations the same if certain interactions are zero
• Goodness of fit to a distribution
  – Special case of 1-dimensional table
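
The deviance test for a contingency-table model, Deviance = 2(log LMAX - log LMOD) on m - 1 - k degrees of freedom, can be sketched for the one-dimensional goodness-of-fit case. This is a minimal Python illustration only: the counts are invented, the model (all cells equally likely, so k = 0) is an assumption made for the example, and the function names `multinomial_loglik` and `deviance` are hypothetical helpers rather than anything from the course software.

```python
import math


def multinomial_loglik(counts, probs):
    """Log-likelihood sum(y_i * log p_i); cells with y_i = 0 contribute 0."""
    return sum(y * math.log(p) for y, p in zip(counts, probs) if y > 0)


def deviance(counts, model_probs, n_params):
    """Return (deviance, degrees of freedom) for a table with m cells.

    n_params is k, the number of parameters estimated in the model.
    """
    n = sum(counts)
    table_freqs = [y / n for y in counts]  # observed table frequencies
    log_l_max = multinomial_loglik(counts, table_freqs)
    log_l_mod = multinomial_loglik(counts, model_probs)
    df = len(counts) - 1 - n_params
    return 2.0 * (log_l_max - log_l_mod), df


# Hypothetical counts for a 4-cell table (made up for illustration).
counts = [18, 22, 27, 33]
# Assumed model: all cells equally likely, with no estimated parameters (k = 0).
model_probs = [0.25, 0.25, 0.25, 0.25]

dev, df = deviance(counts, model_probs, n_params=0)
print(f"deviance = {dev:.4f} on {df} degrees of freedom")
```

The p-value would come from comparing the deviance with a chi-squared distribution on `df` degrees of freedom; that step is left out to keep the sketch within the Python standard library.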