Generalized Linear Models (GLMs)

What is a GLM?
• Wikipedia defines the generalized linear model (GLM) as "a flexible generalization of ordinary linear regression that allows for response variables that have other than a normal distribution."*
  *http://en.wikipedia.org/wiki/Generalized_linear_model

Ordinary Linear Regression
• Ordinary linear regression is a statistical technique for investigating and modeling the linear relationship between explanatory variables and a continuous response variable.
• The simplest case is simple linear regression, which models the linear relationship between a single explanatory variable and a single response.
• Simple linear regression model:
  y = β₀ + β₁x + ε
  where y is the dependent variable, β₀ the intercept, β₁ the slope, x the independent variable, and ε the random error.

Assumptions in Ordinary Linear Regression
• The errors ε are assumed to have mean zero and unknown variance σ².
• Therefore, we have the following assumptions:
  – The true relationship between X and Y is linear.
  – Errors are uncorrelated.
  – Errors are normally distributed.
  – Errors have constant variance.

When Assumptions Are Violated…
• Alternative approaches when the assumptions of a normally distributed response variable with constant variance are violated:
  – Data transformations
  – Weighted least squares
  – Generalized linear models (GLMs)

GLM Model
• Generalized linear models (GLMs) extend ordinary regression to non-normal response distributions.
• Generalized linear model: g(μ) = β₀ + β₁x
  – g is called the link function because it connects the mean μ to the linear predictor β₀ + β₁x.
• The response distribution must come from the exponential family of distributions.
  – Includes the Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
• Three components:
  – Random – identifies the response Y and its probability distribution.
  – Systematic – the explanatory variables in a linear predictor function.
  – Link function – an invertible function g(·) that links the mean of the response to the systematic component.

Random Component
• Normal: continuous, symmetric, mean μ and variance σ².
• Bernoulli: 0 or 1, mean p and variance p(1 − p); a special case of the Binomial.
• Poisson: non-negative integer (0, 1, 2, …), mean λ and variance λ; the number of events in a fixed time interval.

Types of GLMs for Statistical Analysis
  Distribution of Response (Y) | Link Function | Explanatory Variables | Model
  Normal   | Identity | Continuous  | Regression
  Normal   | Identity | Categorical | Analysis of Variance
  Normal   | Identity | Mixed       | Analysis of Covariance
  Binomial | Logit    | Mixed       | Logistic Regression
  Poisson  | Log      | Mixed       | Loglinear (Poisson Regression)
• Recall: the link function relates the mean of the response variable to the linear predictor.

GLM and Ordinary Regression
• Ordinary linear regression is a special case of the GLM in which the response is normally distributed with constant variance σ².
• The ordinary regression model is the GLM whose link function is the identity function.
• Identity link: g(μ) = μ, so the GLM for ordinary regression is μ = β₀ + β₁x.

Model Evaluation: Deviance
• Deviance: a measure of goodness of fit for the GLM.
  – Definition: deviance = −2 times the difference in log-likelihood between the current model and the saturated model.
• The likelihood is the product of the probability distribution functions of the observations.
• −2 log-likelihood is used because of its distributional properties (chi-square).
• The saturated model is a model that fits the data perfectly.
• Deviance in a GLM plays a role similar to the residual variance in ANOVA.
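To make the identity-link case and the deviance concrete, here is a minimal R sketch. The data are simulated and the object names (x, y, fit_lm, fit_glm) are made up for the example; the slides do not supply a dataset.

```r
# Ordinary regression as a GLM: normal response, identity link.
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50)               # y = beta0 + beta1*x + random error

fit_lm  <- lm(y ~ x)                      # ordinary least squares
fit_glm <- glm(y ~ x, family = gaussian)  # same model fit as a GLM (identity link)

coef(fit_lm)                              # the two fits give the same estimates
coef(fit_glm)
deviance(fit_glm)                         # residual deviance (here: residual sum of squares)
summary(fit_glm)                          # also reports null/residual deviance and AIC
```

For a Gaussian GLM the residual deviance is simply the residual sum of squares, which is why the lm() and glm() fits agree.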
Inference in GLM
• Lack-of-fit test
  – The likelihood ratio statistic tests the null hypothesis that the current model is an adequate alternative to the saturated model.
  – It has an asymptotic chi-squared distribution.
• Likelihood Ratio Test (LRT)
  – Also allows one model to be compared to another by looking at the difference in deviance between the two models.
    • Null hypothesis: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
    • Alternative hypothesis: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
  – The LRT statistic follows a chi-squared distribution.
  – Later, the likelihood ratio test will be used to test the significance of variables in logistic and Poisson regression models.

Model Comparison
• Two additional measures for determining model fit are:
  – Akaike Information Criterion (AIC)
    • Penalizes a model for having many parameters.
    • AIC = −2 log L + 2p, where p is the number of parameters in the model; smaller is better.
  – Bayesian Information Criterion (BIC)
    • BIC = −2 log L + ln(n)·p, where p is the number of parameters in the model and n is the number of observations.
    • Usually penalizes additional parameters more strongly than AIC.

Summary
• Setup of the generalized linear model.
• Deviance and the likelihood ratio test:
  – Test lack of fit of the model.
  – Test the significance of a predictor variable, or set of predictor variables, in the model.
• Model comparison.

Logistic Regression with Binary Responses
• A regression technique used for predicting the outcome of a response variable (y) that takes on only two possible values.
  – Example: y = 1 (success), y = 0 (failure).
• The response variable y is a Bernoulli random variable.

Binary Response Variables
• A common situation is a measured response with only two possibilities.
  – Examples:
    • success/failure
    • pass/fail
    • employed/unemployed
    • male/female
• In many situations the binary response is recorded as a "0" or a "1".
  – Example: success = "1", failure = "0".

Distribution of the Response
• Probability of success: π
• Probability of failure: 1 − π
• The probability of obtaining y = 1 (a success) or y = 0 (a failure) is given by the Bernoulli distribution:
  P(y) = π^y (1 − π)^(1−y)

Logistic Regression Model
• The probabilities of the outcomes are modeled using the logistic function.
• The GLM for logistic regression uses the logit link function.
  – In some cases it is preferable to use the log-log link.
• Logistic regression model with logit link:
  ln(π / (1 − π)) = β₀ + β₁x
• This model ensures that the values of π are between 0 and 1.

Interpretation of Parameters in Logistic Regression
• Fitted logistic regression model with logit link:
  ln(π / (1 − π)) = β̂₀ + β̂₁x
• When interpreting β̂₁, it is usually easiest to take the odds ratio approach.
  – The odds ratio in logistic regression is the estimated multiplicative change in the odds of success for a one-unit increase in x.
  – Odds Ratio = exp(β̂₁)

Logistic Regression with a Binary Response in R
1. Create a single vector of 0's and 1's for the response variable.
2. Use the function glm() with family = binomial to fit the model.
3. Test for goodness of fit and significance.
4. Interpret the coefficients.
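The R sketch below walks through those four steps on simulated data; the variable names (x1, x2, y, fit1, fit2) and the true coefficient values are invented for illustration. It also ties in the likelihood ratio test, AIC/BIC comparison, and odds-ratio interpretation discussed above.

```r
# Logistic regression with a binary (0/1) response.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
p  <- 1 / (1 + exp(-(-0.5 + 1.2 * x1)))   # true model uses x1 only
y  <- rbinom(200, size = 1, prob = p)     # step 1: vector of 0's and 1's

fit1 <- glm(y ~ x1 + x2, family = binomial)  # step 2: larger model
fit2 <- glm(y ~ x1,      family = binomial)  #         reduced model

anova(fit2, fit1, test = "Chisq")   # step 3: likelihood ratio test -- is x2 needed?
AIC(fit2); AIC(fit1)                # smaller is better
BIC(fit2); BIC(fit1)

exp(coef(fit2)["x1"])               # step 4: estimated odds ratio for a one-unit change in x1
```

The anova() call reports the difference in deviance between the two nested models together with the chi-square p-value, which is exactly the likelihood ratio test described in the inference slides.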
Poisson Regression with Count Responses
• A regression technique used for predicting the outcome of a response variable (y) that takes on counts.
  – The response variable is the number of occurrences in a given time frame.
  – Outcomes are equal to 0, 1, 2, ….
• Examples:
  – Number of penalties during a football game.
  – Number of customers shopping at a store on a given day.
  – Number of car accidents at an intersection.

Poisson Regression Model
• Consider the GLM g(μ) = β₀ + β₁x.
• Problems arise if ordinary regression is used, i.e., g(μ) = μ:
  – Predicted values for the response might be negative.
  – The variance of the response variable is likely to increase with the mean rather than remain constant.
  – Zero is a likely value, and in ordinary regression 0 is hard to handle in transformations.

Poisson Regression Model
• Poisson model: log(μ) = β₀ + β₁x
  – The log link function ensures that all of the predicted/fitted values are positive.
  – The Poisson error structure takes into account that the data are counts and that the variances are equal to the means.
• In R this is handled by family = poisson.
  – Sets errors = Poisson and link = log.
  – Can be coded simply as glm(y ~ x, poisson).

Poisson Regression in R
1. Input the data, where y is a column of counts.
2. Use the function glm() with family = poisson to fit the model.
3. Test for goodness of fit and significance.
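A minimal R sketch of those three steps, again on simulated data with made-up names (x, y, fit, fit0): the residual deviance compared to a chi-square distribution serves as the lack-of-fit check, and anova() with test = "Chisq" gives the likelihood ratio test for the predictor.

```r
# Poisson regression with a count response.
set.seed(1)
x <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.3 + 0.8 * x))  # step 1: column of counts, log link

fit  <- glm(y ~ x, family = poisson)          # step 2: Poisson GLM (log link)
fit0 <- glm(y ~ 1, family = poisson)          # intercept-only model for comparison

summary(fit)                                  # coefficients are on the log scale
# Step 3a: lack-of-fit check -- residual deviance vs. chi-square on its df
# (a large p-value suggests no evidence of lack of fit)
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
# Step 3b: likelihood ratio test for the significance of x
anova(fit0, fit, test = "Chisq")

exp(coef(fit)["x"])                           # multiplicative effect of x on the mean count
```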