330.Lect32 - Department of Statistics

advertisement
Stats 330: Lecture 32
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 1
The Exam!
• 25 multiple choice questions, similar to
term test (60% for 330, 50% for 762)
• 3 “long answer” questions, similar to past
exams (STATS 330) You have to do all
the multiple choice questions, and 2 out of
3 “long answer” questions
• (STATS 762) You have to do a
compulsory extra question
• Held on am of Wed 31th October 2012
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 2
Help Schedule
• I will be available in Rm 265 from 10:30
am to 12:00 on Monday, Tuesday and
Wednesday in the week before the exam,
my schedule permitting
– Tutors: as per David Smiths’s email
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 3
STATS 330: Course
Summary
The course was about
• Graphics for data analysis
• Regression models for data analysis
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 4
Graphics
Important ideas:
• Visualizing multivariate data
– Pairs plots
– 3d plots
– Coplots
– Trellis plots
• Same scales
• Plots in rows and columns
• Diagnostic plots for model criticism
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 5
Regression models
We studied 3 types of regression:
– “Ordinary” (normal, least squares) regression
for continuous responses
– Logistic regression for binomial responses
– Poisson regression for count responses
(log-linear models)
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 6
Normal regression
• Response is assumed to be N(m,s2)
• Mean is a linear function of the covariates
m = b 0 + b1 x 1 + . . . + b k x k
• Covariates can be either continuous or
categorical
• Observations independent, same variance
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 7
Logistic regression
• Response ( s “successes” out of n) is
assumed to be Binomial Bin(n,p)
• Logit of Probability log(p/(1-p))is a linear
function of the covariates
log(p/(1-p)) = b0 + b1x1 + . . . + bkxk
• Covariates can be either continuous or
categorical
• Observations independent
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 8
Poisson regression
• Response is assumed to be Poisson(m)
• Log of mean log(m) is a linear function of
the covariates (log-linear models)
log(m) = b0 + b1x1 + . . . + bkxk
(Or, equivalently
m = exp(b0 + b1x1 + . . . + bkxk)
• Covariates can be either continuous or
categorical
• Observations independent
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 9
Interpretation of b coefficients
• For continuous covariates:
– In normal regression, b is the increase in mean
response associated with a unit increase in x
– In logistic regression, b is the increase in log odds
associated with a unit increase in x
– In Poisson regression, b is the increase in log mean
associated with a unit increase in x
– In logistic regression, if x is increased by 1, the odds
are increased by a factor of exp(b)
– In Poisson regression, if x is increased by 1, the
mean is increased by a factor of exp(b)
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 10
Interpretation of b coefficients
• For categorical covariates (main effects only):
– In normal regression, b is the increase in mean
response relative to the baseline
– In logistic regression, b is the increase in log odds
relative to the baseline
– In logistic regression, if we change from baseline to
some level, the odds are increased by a factor of
exp(parameter for that level) relative to the baseline
– In Poisson regression, if we change from baseline to
some level, the mean is increased by a factor of
exp(parameter for that level) relative to the baseline
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 11
Measures of Fit
• R2 (for normal regression)
• Residual Deviance (for Logistic and
Poisson regression)
– But not for ungrouped data in logistic, or
Poisson with very small means (cell counts)
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 12
Prediction
• For normal regression,
– Predict response at covariates x1, . . . ,xk
– Estimate mean response at covariates x1, . . . ,xk
• For logistic regression,
– estimate log-odds at covariates x1, . . . ,xk
– Estimate probability of “success” at covariates x1, . .
. ,xk
• For Poisson regression,
– Estimate mean at covariates x1, . . . ,xk
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 13
Inference
• Summary table
– Estimates of regression coefs
– Standard errors
– Test stats for coef = 0
– R2 etc (normal regression)
– F-test for null model
– Null and residual deviances (logistic/Poisson)
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 14
Testing model vs sub-model
• Use and interpretation of both forms of
anova
– Comparing model with a sub-model
– Adding successive terms to a model
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 15
Topics specific to normal
regression
• Collinearity
– VIF’s
– Correlation
– Added variable plots
• Model selection
– Stepwise procedures: FS, BE, stepwise
– All possible regressions approach
• AIC, BIC, CP, adjusted R2, CV
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 16
Factors (categorical
explanatory variables)
• Factors
– Baselines
– Levels
– Factor level combinations
– Interactions
– Dummy variables
– Know how to express interactions in terms of
means, means in terms of interactions
– Know how to interpret zero interactions
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 17
Fitting and Choosing models
• Fit a separate plane (mean if no
continuous covariates) to each
combination of factor levels
• Search for a simpler submodel (with some
interactions zero) using stepwise and
anova
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 18
Diagnostics
• For non-planar data
– Plot res/fitted, res/x’s, partial residual plots,
gam plots, box-cox plot
– Transform either x’s or response, fit
polynomial terms
• For unequal variance
– Plot res/ fitted, look for funnel effect
– Weighted least squares
– Transform response
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 19
Diagnostics (2)
• For outliers and high-leverage points
– Hat matrix diagonals
– Standardised residuals,
– Leave-one-out diagnostics
• Independent observations
– Acf plots
– Residual/previous residual
– Time series plot of residuals
– Durbin-Watson test
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 20
Diagnostics (3)
• Normality
– Normal plot
– Weisberg-Bingham test
– Box Cox (select power)
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 21
Specifics for Logistic
Regression
Log-likelihood is
n
l ( b 0 ,...,b k ) =  si log(p i ) + (ni - si ) log(1 - p i )
i =1
where
exp(b 0 + b1 xi1 + ... + b k xik )
pi =
1 + exp(b 0 + b1 xi1 + ... + b k xik )
or, equivalently
n
l ( b 0 ,...,b k ) =   si ( b 0 + b1 xi1 + ... + b k xik )
i =1
- ni log(1 + exp(b 0 + b1 xi1 + ... + b k xik ))
© Department of Statistics 2012

STATS 330 Lecture 32: Slide 22
Deviance
Deviance = 2(log LMAX - log LMOD)
– log LMAX: replace p’s with frequencies si/ni
– log LMOD: replace p’s with estimated p’s from
logistic model i.e.
exp(bˆ0 + bˆ1 xi1 + ... + bˆk xik )
pˆi =
(1 + exp(bˆ0 + bˆ1 xi1 + ... + bˆk xik ))
Can’t use as goodness of fit measure in
ungrouped case
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 23
Odds and log-odds
• Probabilities p: Pr(Y=1)
• Odds p /(1- p)
exp(b0 + b1x1 + ...+ b k xk )
• Log-odds: log p/(1- p)
© Department of Statistics 2012
exp(b 0 + b1 x1 + ... + b k xk )
(1 + exp(b 0 + b1 x1 + ... + b k xk ))
b0 + b1x1 + ...+ b k xk
STATS 330 Lecture 32: Slide 24
Residuals
• Pearson
• Deviance
(ri - nip i )
nip i (1 - p i )
d i = sign (ri - nip i ) 
| r (bˆ
i

ˆ x + ... + bˆ x ) - n log(1 + exp( bˆ + bˆ x + ... + bˆ x )) | 1 / 2
+
b
0
1 i1
k ik
i
0
1 i1
k ik
n
Deviance =  d i2
i =1
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 25
Topics specific to Poisson
regression
• Offsets
• Interpretation of regression coefficients
– (same as for odds in logistic regression)
• Correspondence between Poisson
regression (Log-linear models) and the
multinomial model for contingency tables
– The “Poisson trick”
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 26
Contingency tables
• Cells 1,2,…, m
• Cell probabilities p1, . . . , pm
• Counts y1, . . . , ym
• Log-likelihood is
m
 y logp
i =1
© Department of Statistics 2012
i
i
STATS 330 Lecture 32: Slide 27
Contingency tables (2)
• A “model” for the table is anything that specifies the form
of the probabilities, possibly up to k unknown parameters
• Test if the model is OK by
– Calculate Deviance = 2(log LMAX - log LMOD)
log LMAX: replace p’s with table frequencies
log LMOD: replace p’s with estimated p’s from
the model
– Model OK if deviance is small,(p-value > 0.05)
– Degrees of freedom m - 1 - k
– k = number of parameters in the model
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 28
Independence models
• Correspond to interactions being zero
• Fit a “saturated” model using Poisson
regression
• Use anova, stepwise to see which
interactions are zero
• Identify the appropriate model
• Models can be represented by graphs
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 29
Odds Ratios
•
•
•
•
Definition and interpretation
Connection to independence
Connection with interactions
Relationship between conditional OR’s
and interactions
• Homogeneous association model
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 30
Association graphs
• Each node is a factor
• Factors joined by lines if an interaction
between them
• Interpretation in terms of conditional
independence
• Interpretation in terms of collapsibility
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 31
Contingency tables: final topics
• Association reversal
– Simpson’s paradox
– When can you collapse
• Product multinomial
– comparing populations
– populations the same if certain interactions are zero
• Goodness of fit to a distribution
– Special case of 1-dimensional table
© Department of Statistics 2012
STATS 330 Lecture 32: Slide 32
Download