Count Data Regression
Poisson and Negative-binomial Regression models using R
Asma Alfadhel Timothy Monday

Outline
Theoretical Background
◦ Count Data
◦ Count Regression Models
◦ Over Dispersion
Implementation in R
◦ Simulated Data
◦ Real Data

Part I
THEORETICAL BACKGROUND

Count Data
The observations can take only nonnegative integer values: 0, 1, 2, 3, ...
These integers arise from counting rather than ranking.
It is different from:
– Binary data: y ∈ {0, 1}
– Ordinal data: ranking is important.
(logistic/probit regression)
http://en.wikipedia.org/wiki/Ordinal_data

Count Data
Examples:
– The number of DVD purchases in a store, observed for several months.
– The number of doctor visits, observed for several days.

Count Regression Models
Generalized Linear Models
– Poisson regression model (equal dispersion)
– Negative binomial regression model (NB) (over dispersion)
Two-Part Models (excess zeroes)
– Hurdle Poisson model
– Hurdle negative binomial model
– Zero-inflated Poisson model
– Zero-inflated negative binomial model

Poisson Distribution
The Poisson distribution has one parameter μ:
$P(Y = y) = \dfrac{e^{-\mu}\,\mu^{y}}{y!}$, y = 0, 1, 2, ...
E(Y) = μ = Var(Y)
e = 2.71828…
y! = 1 × 2 × 3 × … × y

[Figure: two Poisson distributions. Small mean: small count numbers, many zeroes (Poisson regression). Large mean: large count numbers, few or no zeroes, approximately normal distribution (OLS regression).]

Poisson Regression Model
It allows you to model the relationship between a Poisson distributed response variable and one or more explanatory variables.
The explanatory variables can be either numeric or categorical.
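As a minimal sketch of the Poisson regression model described above, the following R code simulates a count response from one numeric and one categorical explanatory variable and fits it with `glm`. The sample size, the variable names (`x_num`, `x_cat`), and the true coefficients (0.5, 0.3, 0.4) are illustrative assumptions and do not come from the slides.

```r
# Simulated example: Poisson response with a numeric and a categorical predictor.
# All names and coefficient values here are hypothetical, chosen for illustration.
set.seed(42)
n <- 2000

x_num <- rnorm(n)                                        # numeric predictor
x_cat <- factor(sample(c("a", "b"), n, replace = TRUE))  # categorical predictor

# The mean is the exponential of the linear predictor (log link)
mu <- exp(0.5 + 0.3 * x_num + 0.4 * (x_cat == "b"))
y  <- rpois(n, lambda = mu)                              # nonnegative integer counts

# Fit the Poisson regression model
fit_pois  <- glm(y ~ x_num + x_cat, family = poisson)
coef_pois <- coef(fit_pois)

print(round(coef_pois, 3))
```

With a sample this large, the estimated coefficients should land close to the values used in the simulation. The factor `x_cat` is expanded automatically into a dummy variable (`x_catb`).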
Source (Poisson regression model description): http://www.instantr.com/category/statistical-models/

Poisson Regression Model
$y_i \sim \mathrm{Poisson}(\exp(\beta_0 + \beta_1 X_i))$

Poisson Regression Coefficient Interpretation
Example 1: $y_i \sim \mathrm{Poisson}(\exp(2.5 + 0.18 X_i))$
$e^{0.18} \approx 1.197$; 1.197 – 1 = 0.197
A one unit increase in X will increase the average number of y by about 20%.
Example 2: $y_i \sim \mathrm{Poisson}(\exp(2.5 - 0.18 X_i))$
$e^{-0.18} \approx 0.835$; 1 – 0.835 = 0.165
A one unit increase in X will decrease the average number of y by about 16.5%.

Over Dispersion
Observed variance > theoretical variance
The variation in the data is beyond what the Poisson model predicts.
Var(Y) = μ + α · f(μ) (α: dispersion parameter)
α = 0 indicates standard dispersion (Poisson model)
α > 0 indicates over-dispersion (reality, negative-binomial)
α < 0 indicates under-dispersion (not common)

Negative-Binomial vs. Poisson Distribution
[Figure: Many zeroes, small mean, small count numbers: Poisson regression. Many zeroes, small mean, more variability in count numbers: NB regression.]

Negative-Binomial vs. Poisson Distribution
[Figure: Many zeroes, large mean: NB regression. Few or no zeroes, large mean: OLS regression.]

Negative-Binomial Distribution
One formulation of the negative binomial distribution can be used to model count data with over-dispersion.
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

Negative-Binomial Regression Model

Estimation Method

Parameters Estimation
For Poisson regression: β0, β1, …, βn
For negative-binomial regression: β0, β1, …, βn, and α

Goodness of Fit
LL(NB) > LL(Poisson)
AIC = 2k – 2 ln(L)
◦ k: number of parameters
◦ L: maximized value of the likelihood
http://en.wikipedia.org/wiki/Akaike_information_criterion

Two-Part Models
Used to handle the excess zeroes issue.
Instead of assuming that the count outcome comes from a single data generating process, two-part models consider the count outcome to be generated by two systematically different statistical processes.
Excess zeroes: when there are more zeroes in the data than a Poisson or negative-binomial model predicts.
[Diagram: a logit model combined with a Poisson (P) or negative-binomial (NB) model gives either hurdle regression (zero-truncated) or zero-inflated regression.]
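The coefficient interpretation and the Poisson versus negative-binomial goodness-of-fit comparison from the slides above can be sketched in R as follows. This is an illustration under stated assumptions: the simulated data, the sample size, the true coefficients (1, 0.5), and the dispersion setting (`size = 2`) are hypothetical. `glm.nb` comes from the MASS package, whose parameter `theta` corresponds to 1/α under the variance form Var(Y) = μ + αμ².

```r
library(MASS)  # provides glm.nb

# Coefficient interpretation: percent change in the mean count
# for a one unit increase in X, using the slide's coefficients.
pct_up   <- (exp(0.18) - 1) * 100    # about 19.7
pct_down <- (1 - exp(-0.18)) * 100   # about 16.5

# Simulate over-dispersed counts (hypothetical settings).
# With rnbinom(mu, size): Var(Y) = mu + mu^2 / size, so alpha = 1 / size.
set.seed(1)
n  <- 1000
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, mu = mu, size = 2)

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# Log-likelihoods: the NB model should fit better on over-dispersed data
ll_pois <- as.numeric(logLik(fit_pois))
ll_nb   <- as.numeric(logLik(fit_nb))

# AIC = 2k - 2 ln(L); the NB model has one extra parameter (dispersion)
k_pois   <- length(coef(fit_pois))
k_nb     <- length(coef(fit_nb)) + 1
aic_pois <- 2 * k_pois - 2 * ll_pois
aic_nb   <- 2 * k_nb - 2 * ll_nb

alpha_hat <- 1 / fit_nb$theta  # estimated dispersion parameter

print(c(ll_pois = ll_pois, ll_nb = ll_nb))
print(c(aic_pois = aic_pois, aic_nb = aic_nb, alpha_hat = alpha_hat))
```

The hand-computed AIC values agree with what `AIC(fit_pois)` and `AIC(fit_nb)` return. The lower AIC for the negative-binomial fit reflects the over-dispersion built into the simulated data.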
Source (two-part models): http://www2.sas.com/proceedings/forum2008/371-2008.pdf

Part II
IMPLEMENTATION IN R

Models in R
Poisson Model
    glm(Y ~ X, family = poisson)
Negative Binomial Model (MASS package)
    glm.nb(Y ~ X)
Hurdle Model (pscl package)
    hurdle(Y ~ X | X1, link = "logit", dist = "poisson")
    hurdle(Y ~ X | X1, link = "logit", dist = "negbin")
Zero-Inflated Model (pscl package)
    zeroinfl(Y ~ X | X1, link = "logit", dist = "poisson")
    zeroinfl(Y ~ X | X1, link = "logit", dist = "negbin")

Thank you for listening
ANY QUESTIONS?

References
Liu, W., & Cela, J. Count Data Models in SAS. http://www2.sas.com/proceedings/forum2008/371-2008.pdf
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
http://en.wikipedia.org/wiki/Ordinal_data
http://en.wikipedia.org/wiki/Poisson_distribution
http://www.instantr.com/category/statistical-models/
http://en.wikipedia.org/wiki/Akaike_information_criterion

Generalized Linear Models
A generalized linear model involves:
1. Data vector y = (y1, y2, …, yn)
2. Predictors X and coefficients β, forming a linear predictor Xβ
3. A link function g, yielding a vector of transformed data ŷ = g⁻¹(Xβ)
4. A data distribution, p(y | ŷ)
5. Possible other parameters (variances, overdispersions, and cutpoints) involved in the predictors, link functions, and data distribution
Gelman, A., & Hill, J. (2007)
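
The generalized linear model components listed above can be traced through a fitted Poisson model in R. The sketch below is illustrative only: the simulated data and the true coefficients (0.2, 1.0) are assumptions, and the numbered comments map each object to the corresponding item in the list.

```r
# Hypothetical simulated data for illustration
set.seed(7)
n <- 300
x <- runif(n)
y <- rpois(n, lambda = exp(0.2 + 1.0 * x))   # 1. data vector y

fit <- glm(y ~ x, family = poisson(link = "log"))

X        <- model.matrix(fit)                # 2. predictors X
beta_hat <- coef(fit)                        #    coefficients beta
eta      <- as.vector(X %*% beta_hat)        #    linear predictor X beta

fam   <- family(fit)
y_hat <- fam$linkinv(eta)                    # 3. inverse link: exp(eta) for the log link

# 4. data distribution p(y | y_hat): Poisson log-likelihood at the fitted means
loglik_manual <- sum(dpois(y, lambda = y_hat, log = TRUE))

print(c(manual = loglik_manual, glm = as.numeric(logLik(fit))))
```

The manually rebuilt fitted values and log-likelihood match those reported by `glm`. The Poisson model has no extra parameters (item 5); a negative-binomial model would add the dispersion parameter there.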