Count Data Regression
Poisson and Negative-binomial
Regression models using R
Asma Alfadhel
Timothy Monday
Outline

Theoretical Background
◦ Count Data
◦ Count Regression Models
◦ Over Dispersion

Implementation in R
◦ Simulated Data
◦ Real Data
Part I
THEORETICAL
BACKGROUND
Count Data

The observations can take only nonnegative integer values: 0, 1, 2, 3, …

These integers arise from counting rather
than ranking.

It is different from:
– Binary data
y ∈ {0, 1}
– Ordinal data
ranking is important.
(logistic/ probit regression)
http://en.wikipedia.org/wiki/Ordinal_data
Count Data

Examples:
– The number of DVD purchases in a store,
observed for several months.
– The number of doctor visits, observed for
several days.
Count Regression Models

Generalized Linear Models
−Poisson regression model
(equal dispersion)
−Negative binomial regression model (NB)
(over dispersion)
Two-Part Models
−Hurdle Poisson model
−Hurdle negative binomial model
−Zero-inflated Poisson model
−Zero-inflated negative binomial model
(excess zeroes)
Poisson Distribution

The Poisson distribution has one parameter μ

P(Y = y) = e^(−μ) μ^y / y!,  y = 0, 1, 2, …

E(Y) = μ = Var(Y)

e = 2.71828…

y! = 1 × 2 × 3 × … × y
Small mean: small count numbers, many zeroes → Poisson regression
Large mean: large count numbers, few or no zeroes → distribution ≈ Normal, OLS regression
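A minimal sketch of the Poisson distribution in R, assuming the standard probability mass function P(Y = y) = e^(−μ) μ^y / y!. The mean values used here (2, 0.5, and 20) are illustrative choices, not values from the slides. It compares a hand-computed probability with R's built-in `dpois()`, shows how the share of zeroes shrinks as the mean grows, and checks by simulation that the mean and variance are both close to μ.

```r
# Poisson probabilities computed by hand and with R's built-in dpois().
mu <- 2          # illustrative mean, chosen arbitrarily
y  <- 0:10

pmf_manual  <- exp(-mu) * mu^y / factorial(y)
pmf_builtin <- dpois(y, lambda = mu)
all.equal(pmf_manual, pmf_builtin)   # the two computations should agree

# Probability of a zero count for a small and a large mean
p0_small <- dpois(0, lambda = 0.5)   # equals exp(-0.5): zeroes are common
p0_large <- dpois(0, lambda = 20)    # equals exp(-20): zeroes are essentially absent
c(small_mean = p0_small, large_mean = p0_large)

# Mean and variance of a large simulated sample should both be close to mu
set.seed(1)
draws <- rpois(1e5, lambda = mu)
c(mean = mean(draws), variance = var(draws))
```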
Poisson Regression Model

It allows you to model the relationship
between a Poisson distributed response
variable and one or more explanatory
variables.

The explanatory variables can be either
numeric or categorical.
http://www.instantr.com/category/statistical-models/
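As a rough illustration, the sketch below fits a Poisson regression with one numeric and one categorical explanatory variable. The data are simulated, and the variable names (`x_num`, `x_cat`) and the true coefficients (0.5, 0.8, 0.3) are assumptions made only for this example.

```r
set.seed(42)
n <- 500
x_num <- runif(n, 0, 2)                                   # numeric predictor
x_cat <- factor(sample(c("a", "b"), n, replace = TRUE))   # categorical predictor

# Assumed data-generating model for this sketch:
# log(mu) = 0.5 + 0.8 * x_num + 0.3 * [x_cat == "b"]
mu <- exp(0.5 + 0.8 * x_num + 0.3 * (x_cat == "b"))
y  <- rpois(n, lambda = mu)

# Poisson regression with the default log link
fit <- glm(y ~ x_num + x_cat, family = poisson)
summary(fit)$coefficients   # estimates should be near the assumed values
```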
Poisson Regression Model

Poisson Regression Coefficient Interpretation
Example 1:
yi ~ Poisson(exp(2.5 + 0.18 Xi))
e^0.18 ≈ 1.197
1.197 − 1 = 0.197
A one-unit increase in X increases the expected value of y by about 19.7%.
Example 2:
yi ~ Poisson(exp(2.5 − 0.18 Xi))
e^(−0.18) ≈ 0.835
1 − 0.835 = 0.165
A one-unit increase in X decreases the expected value of y by about 16.5%.
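The arithmetic behind this interpretation can be checked directly in R. The sketch below exponentiates the two example coefficients (0.18 and −0.18) to get the multiplicative effect on the expected count, and confirms that the ratio of expected counts at X + 1 and X is the same for any X. The variable names and the X values used for the check are illustrative.

```r
b_pos <-  0.18   # coefficient from Example 1
b_neg <- -0.18   # coefficient from Example 2

# Multiplicative change in the expected count for a one-unit increase in X
rate_ratio_pos <- exp(b_pos)
rate_ratio_neg <- exp(b_neg)

# The same change expressed in percent
pct_change_pos <- 100 * (rate_ratio_pos - 1)
pct_change_neg <- 100 * (rate_ratio_neg - 1)

round(c(ratio_pos = rate_ratio_pos, pct_pos = pct_change_pos,
        ratio_neg = rate_ratio_neg, pct_neg = pct_change_neg), 3)

# The ratio of expected counts at X + 1 versus X does not depend on X
x <- c(0, 1, 5)   # arbitrary X values for the check
ratios <- exp(2.5 + b_pos * (x + 1)) / exp(2.5 + b_pos * x)
ratios
```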
Over Dispersion

Observed variance > Theoretical variance

The data vary more than the Poisson model predicts
Var(Y) = μ + α · f(μ),
(α: dispersion parameter)

α = 0 indicates equidispersion (Poisson model)

α > 0 indicates over-dispersion
(common in real data; negative-binomial model)

α < 0 indicates under-dispersion
(not common)
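One common way to check for over-dispersion is to fit a Poisson model and divide the sum of squared Pearson residuals by the residual degrees of freedom; values well above 1 suggest over-dispersion. The sketch below applies this to simulated data. The helper `pearson_dispersion()` is a hypothetical name defined here, and the simulation settings are assumptions for illustration only.

```r
set.seed(7)
n  <- 2000
x  <- runif(n)
mu <- exp(1 + 0.5 * x)

y_pois <- rpois(n, lambda = mu)              # variance equals the mean
y_over <- rnbinom(n, size = 1.5, mu = mu)    # variance = mu + mu^2 / size

# Hypothetical helper: Pearson chi-square divided by residual degrees of freedom
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

disp_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson))
disp_over <- pearson_dispersion(glm(y_over ~ x, family = poisson))

# Expect a value near 1 for the Poisson data and clearly above 1 otherwise
c(poisson_data = disp_pois, overdispersed_data = disp_over)
```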
Negative-Binomial vs. Poisson Distribution
Many zeroes, small mean, small count numbers → Poisson regression
Many zeroes, small mean, more variability in count numbers → NB regression
Many zeroes, large mean → NB regression
Few or no zeroes, large mean → OLS regression
Negative-Binomial Distribution
One formulation of the negative binomial distribution can
be used to model count data with over-dispersion
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
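A small sketch of one such formulation, using R's mean/size parameterization of the negative binomial, for which Var(Y) = μ + μ²/size (so 1/size plays the role of a dispersion parameter). The values μ = 2 and size = 1 are arbitrary choices for illustration. With the same mean, the negative binomial puts more probability on zero and has a larger variance than the Poisson.

```r
mu   <- 2   # illustrative mean
size <- 1   # R's "size" parameter; smaller values mean more over-dispersion

# Probability of a zero count under each distribution, same mean
p0_pois <- dpois(0, lambda = mu)
p0_nb   <- dnbinom(0, size = size, mu = mu)
c(poisson = p0_pois, negative_binomial = p0_nb)

# Theoretical variance under R's mean/size parameterization
var_nb_theory <- mu + mu^2 / size

# Simulated check of the mean and variance
set.seed(3)
draws <- rnbinom(1e5, size = size, mu = mu)
c(mean = mean(draws), variance = var(draws), theory = var_nb_theory)
```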
Negative-Binomial Regression Model

Estimation Method

Parameter Estimation

For Poisson Regression
β0, β1, …, βn

For Negative-Binomial Regression
β0, β1, …, βn, and α (the dispersion parameter)
Goodness of Fit

LL(NB) > LL(Poisson)

AIC = 2k − 2 ln(L)
◦ k: number of estimated parameters
◦ L: maximized value of the likelihood function
http://en.wikipedia.org/wiki/Akaike_information_criterion
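The comparison can be sketched in R by fitting both models to the same over-dispersed data and computing AIC = 2k − 2 ln(L) by hand. The example assumes the MASS package (which provides `glm.nb()`) is installed; the simulated data and the helper name `aic_manual()` are illustrative, not part of the original slides.

```r
library(MASS)   # provides glm.nb(); assumed to be installed

set.seed(11)
n <- 1000
x <- runif(n)
y <- rnbinom(n, size = 1, mu = exp(1 + 0.7 * x))   # over-dispersed counts

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

ll_pois <- logLik(fit_pois)
ll_nb   <- logLik(fit_nb)

# Hypothetical helper: AIC = 2k - 2 ln(L), where k is the "df" attribute
# of the log-likelihood (the NB model counts its dispersion parameter too)
aic_manual <- function(ll) 2 * attr(ll, "df") - 2 * as.numeric(ll)

data.frame(
  model  = c("Poisson", "NB"),
  k      = c(attr(ll_pois, "df"), attr(ll_nb, "df")),
  logLik = c(as.numeric(ll_pois), as.numeric(ll_nb)),
  AIC    = c(aic_manual(ll_pois), aic_manual(ll_nb))
)
```

On data like these, the NB model would be expected to show the larger log-likelihood and the smaller AIC.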
Two-Part Models

Used to handle the excess zeroes issue

Excess zeroes: there are more zeroes in the data than a Poisson or negative-binomial (NB) model predicts

Instead of assuming that the count outcome comes from a single data-generating process, two-part models consider the count outcome to be generated by two systematically different statistical processes:
– Logit model
– Poisson or negative-binomial model

Hurdle Regression (Zero-truncated)

Zero-inflated Regression
http://www2.sas.com/proceedings/forum2008/371-2008.pdf
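To make the excess-zeroes idea concrete, the sketch below simulates counts from two processes: a binary process that produces structural zeroes and a Poisson process that produces the remaining counts (a zero-inflated mechanism). The settings `pi_zero = 0.3` and `mu = 3` are assumptions for illustration. It then compares the observed share of zeroes with the share a single Poisson distribution with the same overall mean would predict.

```r
set.seed(5)
n       <- 5000
pi_zero <- 0.3   # assumed probability of a structural zero
mu      <- 3     # assumed mean of the Poisson count process

# Process 1: decide whether an observation is a structural zero
structural_zero <- rbinom(n, size = 1, prob = pi_zero) == 1
# Process 2: draw a Poisson count for the remaining observations
y <- ifelse(structural_zero, 0L, rpois(n, lambda = mu))

observed_zero_share <- mean(y == 0)
# Zero share predicted by a single Poisson with the same overall mean
poisson_zero_share  <- dpois(0, lambda = mean(y))
# Zero share implied by the two-process mixture
mixture_zero_share  <- pi_zero + (1 - pi_zero) * exp(-mu)

c(observed = observed_zero_share,
  single_poisson = poisson_zero_share,
  mixture = mixture_zero_share)
```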
Part II
IMPLEMENTATION IN R
Models in R
Poisson Model
glm(Y ~ X, family = poisson)
Negative Binomial Model (MASS package)
glm.nb(Y ~ X)
Hurdle Models (pscl package)
hurdle(Y ~ X | X1, link = "logit", dist = "poisson")
hurdle(Y ~ X | X1, link = "logit", dist = "negbin")
Zero-Inflated Models (pscl package)
zeroinfl(Y ~ X | X1, link = "logit", dist = "poisson")
zeroinfl(Y ~ X | X1, link = "logit", dist = "negbin")
Thank you For Listening
ANY QUESTIONS?
References
Liu, W., & Cela, J. Count Data Models in SAS. http://www2.sas.com/proceedings/forum2008/371-2008.pdf
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
http://en.wikipedia.org/wiki/Ordinal_data
http://en.wikipedia.org/wiki/Poisson_distribution
http://www.instantr.com/category/statistical-models/
http://en.wikipedia.org/wiki/Akaike_information_criterion
Generalized Linear Models
A generalized linear model involves:
1. Data vector y = (y1, y2, …, yn)
2. Predictors X and coefficients β, forming a linear predictor Xβ
3. A link function g, yielding a vector of transformed data ŷ = g⁻¹(Xβ)
4. A data distribution, p(y | ŷ)
5. Possibly other parameters (variances, overdispersions, and cutpoints) involved in the predictors, link functions, and data distribution
Gelman, A., & Hill, J. (2007)
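These components can be traced in a fitted Poisson model in R. The sketch below, on simulated data with assumed coefficients (0.3 and 1.2), extracts the predictor matrix X and the coefficients β, forms the linear predictor Xβ, and recovers the fitted means by applying the inverse of the log link (the exponential function), since R's convention is that the link maps the mean to the linear predictor.

```r
set.seed(9)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.3 + 1.2 * x))   # 1. data vector (simulated)

fit <- glm(y ~ x, family = poisson(link = "log"))

X     <- model.matrix(fit)         # 2. predictors ...
beta  <- coef(fit)                 #    ... and coefficients
eta   <- as.vector(X %*% beta)     #    linear predictor X beta
y_hat <- exp(eta)                  # 3. inverse of the log link gives fitted means

# Both should match what glm() reports
all.equal(eta,   unname(predict(fit, type = "link")))
all.equal(y_hat, unname(fitted(fit)))

# 4. data distribution: Poisson log-likelihood evaluated at the fitted means
sum(dpois(y, lambda = y_hat, log = TRUE))
```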