Generalized Linear Models - LISA (Laboratory for Interdisciplinary Statistical Analysis)

Generalized Linear Models
GLMs
What is GLM?
• Wikipedia defines the generalized linear
model (GLM) as “a flexible generalization of
ordinary linear regression that allows for
response variables that have other than a
normal distribution.”*
*http://en.wikipedia.org/wiki/Generalized_linear_model
Ordinary Linear Regression
• Ordinary Linear Regression is a statistical technique
for investigating and modeling the linear relationship
between explanatory variables and response
variables that are continuous.
• The simplest case is simple linear regression, which models the linear
relationship between a single explanatory variable and a single response
variable.
• Simple Linear Regression Model:
    y = β0 + β1x + ε
  where y is the dependent variable, β0 the intercept, β1 the slope,
  x the independent variable, and ε the random error.
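As a rough sketch, this model can be fit in R with lm(); the data below are simulated purely for illustration and are not from the slides:

    set.seed(1)                      # simulated, illustrative data
    x <- runif(50, 0, 10)            # single explanatory variable
    y <- 2 + 0.5 * x + rnorm(50)     # beta0 = 2, beta1 = 0.5, plus random error
    fit <- lm(y ~ x)                 # ordinary least squares fit
    summary(fit)                     # estimates of the intercept and slope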
Assumptions in Ordinary Linear
Regression
• The errors πœ– are assumed to have mean zero
and unknown variance 𝜎 2 .
• Therefore, we have the following
assumptions:
– The true relationship between X and Y must be
linear
– Errors are uncorrelated
– Errors are normally distributed
– Errors have constant variance
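A minimal R sketch for checking these assumptions with residual diagnostics; the data and variable names are simulated for illustration only:

    set.seed(1)                      # simulated, illustrative data
    x <- runif(50, 0, 10)
    y <- 2 + 0.5 * x + rnorm(50)
    fit <- lm(y ~ x)
    par(mfrow = c(2, 2))
    plot(fit)   # residuals vs fitted (linearity, constant variance),
                # normal Q-Q (normality), scale-location, residuals vs leverage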
When assumptions are violated…
• The following are alternative approaches
when the assumptions of a normally
distributed response variable with constant
variance are violated:
– Data transformations
– Weighted least squares
– Generalized linear models (GLM)
GLM Model
• Generalized linear models (GLM) extend ordinary regression to non-normal response distributions.
• Generalized Linear Model
𝑔 πœ‡ = 𝛽0 + 𝛽1 π‘₯
– 𝑔 function is called the link function because it connects the
mean πœ‡ and the linear predictor π‘₯.
• Response distribution must come from the Exponential Family of
Distributions
– Includes Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
• 3 Components
– Random – Identifies response Y and its probability distribution
– Systematic – Explanatory variables in a linear predictor function
– Link function – Invertible function (g(.)) that links the mean of
the response to the systematic component.
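In R these components map onto the glm() call roughly as sketched below; the family/link pairings shown are standard choices, listed here only for illustration:

    # The family object bundles the random component (response distribution)
    # with its link function; the model formula supplies the systematic component.
    gaussian(link = "identity")   # Normal response, identity link
    binomial(link = "logit")      # Bernoulli/Binomial response, logit link
    poisson(link = "log")         # Poisson response, log link
    # A GLM is then fit as, e.g., glm(y ~ x, family = binomial(link = "logit"))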
Random Component
• Normal: continuous, symmetric, mean μ and variance σ²
• Bernoulli: 0 or 1, mean p and variance p(1 − p); a special case of the Binomial
• Poisson: non-negative integers 0, 1, 2, …, mean λ and variance λ; the number of events in a fixed time interval
Types of GLMs for Statistical Analysis

Distribution of Response (y) | Link Function | Explanatory Variables | Model
Normal                       | Identity      | Continuous            | Regression
Normal                       | Identity      | Categorical           | Analysis of Variance
Normal                       | Identity      | Mixed                 | Analysis of Covariance
Binomial                     | Logit         | Mixed                 | Logistic Regression
Poisson                      | Log           | Mixed                 | Loglinear (Poisson Regression)
Recall: The link function relates the mean of the response variable to the linear predictor.
GLM and Ordinary Regression
• Ordinary linear regression is a special case of the
GLM
• Response is normally distributed with variance σ²
• The ordinary regression model can be
represented as a GLM where the link function is
the identity function.
• Identity Link: g(μ) = μ
GLM for ordinary regression:
    μ = β0 + β1x
Model Evaluation: Deviance
• Deviance: Measure of Goodness of Fit for the
GLM.
– Definition: Deviance = −2 times the difference in log-likelihood
between the current model and the saturated model.
• The likelihood is the product of the probability density (or mass)
functions of the observations
• −2 log-likelihood is used because of its distributional
properties: it is asymptotically chi-square
• Saturated model is a model that fits the data perfectly
• Deviance in GLM is similar to residual variance in
ANOVA.
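As an illustrative sketch in R (simulated data), the residual deviance and the null (intercept-only) deviance can be read off a fitted glm object:

    set.seed(1)                                   # simulated, illustrative data
    x <- runif(100)
    y <- rbinom(100, size = 1, prob = plogis(-1 + 2 * x))
    fit <- glm(y ~ x, family = binomial)
    deviance(fit)        # residual deviance of the current model
    fit$null.deviance    # deviance of the intercept-only model, for comparison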
Inference in GLM
• Lack of Fit test
– Likelihood ratio statistic for testing the null hypothesis that the
current model fits the data as well as the saturated model.
– Has an asymptotic chi-squared distribution.
• Likelihood Ratio Test
– Also allows for the comparison of one model to another model
by looking at the difference in deviance of the two models.
• Null Hypothesis: The predictor variables in Model 1 that are not found in
Model 2 are not significant to the model fit.
• Alternate Hypothesis: The predictor variables in Model 1 that are not found in
Model 2 are significant to the model fit.
– LRT is distributed as a Chi-Square distribution
– Later, the Likelihood Ratio Test will be used to test the
significance of variables in Logistic and Poisson regression
models.
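A minimal illustrative sketch of the likelihood ratio test in R, comparing two nested GLMs with anova(); the data are simulated for illustration:

    set.seed(1)                                   # simulated, illustrative data
    x1 <- runif(100); x2 <- runif(100)
    y  <- rbinom(100, 1, plogis(-1 + 2 * x1))     # x2 has no real effect here
    small <- glm(y ~ x1,      family = binomial)
    large <- glm(y ~ x1 + x2, family = binomial)
    anova(small, large, test = "Chisq")           # difference in deviance ~ chi-square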
Model Comparison
– Two additional measures for assessing model fit are:
• Akaike Information Criterion (AIC)
– Penalizes the model for having many parameters
– AIC = −2 log L + 2p, where p is the number of parameters in the
model; smaller is better
• Bayesian Information Criterion (BIC)
– BIC = −2 log L + ln(n)·p, where p is the number of parameters
in the model and n is the number of observations
– Usually penalizes additional parameters more strongly than
AIC
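A short illustrative sketch in R: AIC() and BIC() applied to two candidate models (data simulated for illustration):

    set.seed(1)                                   # simulated, illustrative data
    x1 <- runif(100); x2 <- runif(100)
    y  <- rbinom(100, 1, plogis(-1 + 2 * x1))
    fit1 <- glm(y ~ x1,      family = binomial)
    fit2 <- glm(y ~ x1 + x2, family = binomial)
    AIC(fit1, fit2)   # smaller is better
    BIC(fit1, fit2)   # penalizes the extra parameter more heavily for large n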
Summary
– Setup of the Generalized Linear Model
– Deviance and Likelihood Ratio Test
• Test lack of fit of the model
• Test the significance of a predictor variable or set of predictor
variables in the model.
– Model Comparison
Logistic Regression with Binary
Responses
• A regression technique used for predicting the
outcome for a response, where the response
variable (𝑦) takes on only two possibilities.
– Example: y = 1 for success, y = 0 for failure
• The response variable 𝑦 is a Bernoulli random
variable.
Binary Response Variables
• A common situation that occurs is when the
measured response has two possibilities.
– Examples:
• success/failure
• pass/fail
• employed/unemployed
• male/female
• In many situations the binary response is
recorded as a “0” or “1”.
– Example:
• Success=“1”
• Failure=“0”
Distribution of the Response
• Probability of success: 𝑝
• Probability of failure: (1 − 𝑝)
• The probability of obtaining y = 1
(a success) or y = 0 (a failure) is given by the
Bernoulli Distribution:
    P(y) = p^y (1 − p)^(1−y)
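As a quick illustrative check in R, these Bernoulli probabilities agree with the Binomial distribution with a single trial:

    p <- 0.3                              # illustrative success probability
    p^1 * (1 - p)^(1 - 1)                 # P(y = 1) = p     = 0.3
    p^0 * (1 - p)^(1 - 0)                 # P(y = 0) = 1 - p = 0.7
    dbinom(c(1, 0), size = 1, prob = p)   # Binomial with 1 trial gives the same values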
Logistic Regression Model
• The probabilities of the outcomes are modeled using
the logistic function.
• The GLM for Logistic Regression uses the logit link
function.
• In some cases it is preferable to use the log-log link.
• Logistic Regression Model with Logit Link:
    ln(p / (1 − p)) = β0 + β1x
• This model ensures that values for 𝑝 are between 0
and 1!
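The reason p stays between 0 and 1 is that inverting the logit gives p = e^(β0 + β1x) / (1 + e^(β0 + β1x)). A small illustrative sketch in R, with made-up coefficient values:

    beta0 <- -2; beta1 <- 0.8            # illustrative coefficients, not from the slides
    x   <- seq(-10, 10, by = 1)
    eta <- beta0 + beta1 * x             # linear predictor, can be any real number
    p   <- exp(eta) / (1 + exp(eta))     # inverse logit; same as plogis(eta)
    range(p)                             # always strictly between 0 and 1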
Interpretation of Parameters in Logistic
Regression
• Logistic Regression Model with Logit Link:
    ln(p / (1 − p)) = β̂0 + β̂1x
• When interpreting 𝛽̂1 it is usually easiest to
take the odds ratio approach.
– The Odds Ratio in logistic regression is the estimated
multiplicative change in the odds of success for a one-unit
increase in x.
– Odds Ratio = e^β̂1
– For example, if β̂1 = 0.7, each one-unit increase in x multiplies the
estimated odds of success by e^0.7 ≈ 2.0.
Logistic Regression with a Binary
Response in R
1. Create a single vector of 0’s and 1’s for the
response variable.
2. Use the glm() function with family=binomial to fit
the model.
3. Test for goodness of fit and significance.
4. Interpret coefficients.
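A minimal sketch of these four steps in R, using simulated data; all names and values are illustrative:

    # 1. Binary response coded as a vector of 0's and 1's
    set.seed(1)                                      # simulated, illustrative data
    x <- runif(100, 0, 10)
    y <- rbinom(100, size = 1, prob = plogis(-2 + 0.5 * x))

    # 2. Fit the model with the binomial family (logit link is the default)
    fit <- glm(y ~ x, family = binomial)

    # 3. Goodness of fit and significance
    summary(fit)                                     # Wald tests for the coefficients
    anova(glm(y ~ 1, family = binomial), fit,
          test = "Chisq")                            # likelihood ratio test for x

    # 4. Interpret coefficients on the odds scale
    exp(coef(fit))                                   # odds ratio for a one-unit change in x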
Poisson Regression with Count
Responses
• A regression technique used for predicting the
outcome for a response, where the response
variable (𝑦) takes on counts.
– Consider a count response variable.
• Response variable is the number of occurrences in a
given time frame.
• Outcomes equal to 0, 1, 2, ….
• Examples:
– Number of penalties during a football game.
– Number of customers shopping at a store on a given day.
– Number of car accidents at an intersection.
Poisson Regression Model
• Consider the GLM
𝑔(πœ‡) = 𝛽0 + 𝛽1 π‘₯
• Problems if ordinary regression was used and
𝑔 πœ‡ = πœ‡:
– Predicted values for the response might be negative.
– The variance of the response variable is likely to
increase with the mean not remain constant.
– Zero is a likely value, and zero is hard to handle in the
transformations (such as the log) used with ordinary regression.
Poisson Regression Model
• Poisson Model
lπ‘œπ‘œ πœ‡ = 𝛽0 + 𝛽1 π‘₯
– The log link function ensures that all the
predicted/fitted values are positive.
– The Poisson error structure takes into account that the data
are counts and that the variances are equal to the means.
• In R this is handled by family=poisson.
– Sets errors=Poisson and link=log.
– Can be easily coded as glm(y~x, poisson).
Poisson Regression in R
1. Input data where y is a column of counts.
2. Use the glm() function with family=poisson to fit
the model.
3. Test for goodness of fit and significance.
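A minimal sketch of these steps in R, using simulated data; all names and values are illustrative:

    # 1. Response y is a column of counts
    set.seed(1)                                    # simulated, illustrative data
    x <- runif(100, 0, 2)
    y <- rpois(100, lambda = exp(0.5 + 1 * x))

    # 2. Fit the model with the Poisson family (log link is the default)
    fit <- glm(y ~ x, family = poisson)

    # 3. Goodness of fit and significance
    summary(fit)                                   # Wald tests for the coefficients
    pchisq(deviance(fit), df.residual(fit),
           lower.tail = FALSE)                     # deviance goodness-of-fit p-value
    exp(coef(fit))                                 # multiplicative effect on the mean count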