Modelling Count Data

Handout #16: Modeling Count Data
Section 16.1: Modeling Count Data using Linear Regression Models
Example 16.1 For this example, we will consider the relationship between concentration of a
chemical compound and the count of Ceriodaphnia dubia, a particular species of a water flea.
Ceriodaphnia dubia is commonly used in testing the toxicity of water leaving a wastewater
treatment facility.
Wiki Entry for Ceriodaphnia dubia: http://en.wikipedia.org/wiki/Ceriodaphnia_dubia
Picture of Ceriodaphnia dubia
Snippet of data
The following is a scatterplot of the relationship between the Concentration and Count. We can
see that a model that assumes a constant rate of decrease in counts as a function of
concentration will not be sufficient.
Simple Linear Regression Model Setup – assuming a constant rate of decrease
• Response Variable: Count
• Predictor Variable: Concentration
• The standard mean and variance functions
  o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
  o $\mathrm{Var}(\text{Count} \mid \text{Concentration}) = \sigma^2$
Output from the model
Residual Plot
Issues:
1. A mean function that assumes a constant rate of decrease is inadequate.
2. The constant variance function appears to be inadequate as well.
Plot: Residuals vs. Concentration (Problem: mean function inadequate)
Plot: |Residual| vs. Concentration (Problem: constant variance inadequate)
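The first issue can be illustrated with a small sketch. It does not use the handout's data: the counts below are hypothetical values placed on an exponentially decreasing curve, borrowing the Poisson estimates reported later in this handout (4.33 and -1.54) only as a convenient shape. The helper name `fit_line` is likewise an illustration, not anything from JMP.

```python
import math

# Hypothetical mean counts on an exponentially decreasing curve; these are
# illustrative values, not the handout's data.
concentration = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
count = [math.exp(4.33 - 1.54 * c) for c in concentration]

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(concentration, count)
fitted = [b0 + b1 * c for c in concentration]
residuals = [y - f for y, f in zip(count, fitted)]

for c, f, r in zip(concentration, fitted, residuals):
    print(f"Concentration = {c:4.2f}  fitted = {f:7.2f}  residual = {r:7.2f}")
```

With these illustrative values the residuals are positive at both ends of the Concentration range and negative in the middle, and the fitted line drops below zero at the largest Concentration. Both are symptoms of a straight-line mean function applied to a curved relationship. The sketch says nothing about the second issue, since no random variation is included.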
Updated Model Setup
• Response Variable: Count
• Predictor Variable: Concentration
• Terms: Intercept, Concentration, $\text{Concentration}^2$
• Consider the following mean and variance functions
  o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \beta_2 \cdot \text{Concentration}^2$
  o $\mathrm{Var}(\text{Count} \mid \text{Concentration}) = \sigma^2$
Regression Output: fitting a quadratic mean function
Using a quadratic form of the mean function appears to fit the data fairly well. However, the
use of a constant variance function remains inadequate.
One simple approach to fixing non-constant variance is to transform the response.
This change implies that a multiplicative form exists in the response instead of the usual
additive form.
Multiplicative versus Additive Structure
An additive form for the response has been considered up to this point in this course.
$\text{Response} \mid x = \beta_0 + \beta_1 \cdot x + \varepsilon$
$E(\text{Response} \mid x) = \beta_0 + \beta_1 \cdot x$
For an additive mean function, the rate of increase per unit
change in $x$ is given by $\beta_1$. Thus, if $\beta_0 = 10$ and $\beta_1 = 3$, then
successive differences in the expected response are constant.
A multiplicative form of the response is written as follows.
$\text{Response} \mid x = 10^{\beta_0} \cdot 10^{\beta_1 \cdot x} \cdot 10^{\varepsilon}$
$= 10^{(\beta_0 + \beta_1 \cdot x)} \cdot 10^{\varepsilon}$
$= 10^{(\beta_0 + \beta_1 \cdot x + \varepsilon)}$
Note: The mean and variability in this conditional distribution are not additive, but
multiplicative. Thus, a large (or small) value of x may exhibit more (or less) variability in
the response.
Consider the following tables in an effort to understand a multiplicative form for the
response. Suppose $\beta_0 = 0.01$, $\beta_1 = 0.5$, and $\varepsilon = 0$; then, $\text{Response} \mid x$ when $x = 1$ would be
calculated as follows: $10^{(0.01 + 0.5 \cdot 1 + 0)} = 3.24$.
Successive differences for a multiplicative function are not constant.
Successive ratios are constant.
A simple transformation of the response can change a multiplicative form of the
response to an additive form. For example, when a log base 10 transformation of the
response is taken, the successive differences are indeed constant and given by $\beta_1 = 0.5$.
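The tables described above can be reproduced with a few lines of arithmetic. The sketch below uses the handout's illustrative values ($\beta_0 = 0.01$, $\beta_1 = 0.5$, $\varepsilon = 0$); the grid of $x$ values and the function name `response` are assumptions made only for this example.

```python
import math

# Values from the handout's illustration: beta0 = 0.01, beta1 = 0.5, epsilon = 0.
beta0, beta1 = 0.01, 0.5

def response(x, eps=0.0):
    """Multiplicative form: 10^(beta0 + beta1 * x + eps)."""
    return 10 ** (beta0 + beta1 * x + eps)

xs = [1, 2, 3, 4, 5]          # assumed grid of x values
values = [response(x) for x in xs]

# Successive differences grow with x; successive ratios stay fixed at 10^beta1.
differences = [b - a for a, b in zip(values, values[1:])]
ratios = [b / a for a, b in zip(values, values[1:])]

# On the log10 scale the successive differences are constant and equal to beta1.
log_differences = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

for x, v in zip(xs, values):
    print(f"x = {x}: response = {v:.2f}")
print("differences:", [round(d, 2) for d in differences])
print("ratios:     ", [round(r, 2) for r in ratios])
print("log10 diffs:", [round(d, 2) for d in log_differences])
```

The ratio column is the constant $10^{\beta_1}$, which is why taking log base 10 turns the multiplicative structure into an additive one with slope $\beta_1$.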
In an effort to fix the non-constant variance issue discovered above, a log transformation of the
response will be considered.
Regression Setup – Using a log10() transformation of the response
• Response Variable: log10(Count)
• Predictor Variable: Concentration
• Multiplicative form of the response
  $\text{Count} \mid \text{Concentration} = 10^{(\beta_0 + \beta_1 \cdot \text{Concentration} + \varepsilon)}$
• Using a log10 transformation to convert to an additive form
  $\log_{10}(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \varepsilon$
• Usual mean and variance functions on the transformed scale
  o $E(\log_{10}(\text{Count}) \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
  o $\mathrm{Var}(\log_{10}(\text{Count}) \mid \text{Concentration}) = \sigma^2$
Comments:
1. A linear mean function appears appropriate when the response is transformed.
2. Transforming the response does not appear to fix the non-constant variance problem. Interestingly, the inflated variance in the response now occurs when counts are small (not large as in the original scale).
• The log10 transformation of the response did not appropriately address the non-constant
variance issue. A log2 transformation does not appear to address this problem either.
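One possible explanation for why the inflated variance moves to the small counts is sketched below. It is not part of the handout's argument: it assumes the counts are roughly Poisson with mean $\lambda$, so that $\mathrm{Var}(\text{Count}) = \lambda$ (an assumption the handout only introduces in Section 16.2), and it uses a first-order (delta method) approximation.

```latex
\mathrm{Var}\big(\log_{10}(\text{Count})\big)
  \approx \left( \frac{d}{d\lambda} \log_{10} \lambda \right)^{2} \mathrm{Var}(\text{Count})
  = \left( \frac{1}{\lambda \ln 10} \right)^{2} \lambda
  = \frac{1}{\lambda \, (\ln 10)^{2}}
```

Under that assumption the variance on the log scale shrinks as the mean grows, so the largest spread would appear where counts are small. The same argument applies to log2 with $\ln 2$ in place of $\ln 10$. The approximation is rough for very small means, and the logarithm is undefined for a count of zero.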
Plot: Using a log2 transformation of the response
Plot: Using a log10 transformation of the response
• Consider the following regarding the interpretation of $\hat{\beta}_1$.
Concentration | $\log_{10}(\text{Count} \mid \text{Concentration}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Concentration}$ | $\text{Count} \mid \text{Concentration}$ | Rate of Increase
2 | $1.89 + (-0.71) \cdot 2 = 0.47$ | $10^{0.47} = 2.95$ |
1 | $1.89 + (-0.71) \cdot 1 = 1.18$ | $10^{1.18} = 15.14$ | $15.14 / 2.95 = 5.13$
0 | $1.89 + (-0.71) \cdot 0 = 1.89$ | $10^{1.89} = 77.62$ | $77.62 / 15.14 = 5.13$
Thus, we'd expect to see about a 5-fold increase in Count for a one-unit decrease in
Concentration. This can be seen directly by considering the following ratio. In this ratio
$(x - 1)$ is being used instead of $(x + 1)$ because a one-unit decrease in Concentration is
being used instead of a one-unit increase.
$\frac{10^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot (x - 1))}}{10^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}} = \frac{10^{1.89} \cdot 10^{-0.71 \cdot (x - 1)}}{10^{1.89} \cdot 10^{-0.71 \cdot x}} = \frac{10^{-0.71 \cdot x} \cdot 10^{+0.71}}{10^{-0.71 \cdot x}} = 10^{+0.71} = 5.13$
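The back-transformation above can be checked numerically. This is a minimal sketch that plugs in the estimates reported in the handout (1.89 and -0.71); the function name `predicted_count` is an assumption made for illustration.

```python
# Estimates reported in the handout for the log10(Count) model.
b0_hat, b1_hat = 1.89, -0.71

def predicted_count(concentration):
    """Back-transform the fitted log10 mean to the count scale."""
    return 10 ** (b0_hat + b1_hat * concentration)

# Fold increase in Count for a one-unit decrease in Concentration.
fold_change = 10 ** (-b1_hat)

for c in (2, 1, 0):
    print(f"Concentration = {c}: predicted count = {predicted_count(c):.2f}")
print(f"fold increase per one-unit decrease: {fold_change:.2f}")
```

Because the $10^{\hat{\beta}_0}$ factor cancels in the ratio, the fold change does not depend on the starting Concentration.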
• The $R^2$ value from this model is somewhat smaller than that of the quadratic model fit above.
• Even after the log10 transformation of the response, non-constant variance is still a problem.
Plot: Non-constant variance remains even after the log10 transformation
Plot: Residuals lack normality as well
Section 16.2: Modeling Count Data using Generalized Linear Model
A more precise and succinct method of modeling counts can be achieved through the use of a
generalized linear model. The generalized linear modeling framework extends what we’ve
learned in this class to a wider class of modeling situations.
Poisson ↔ Modeling Counts
The Poisson distribution is often used to model counts. One characteristic of the Poisson
distribution is that the mean and variance function are the same. Thus, as the mean increases
so does the variance. This is often the case when modeling counts. The wiki entry for the
Poisson distribution is provided here.
Source: http://en.wikipedia.org/wiki/Poisson_distribution
Assume the mean of a response variable, say $\lambda$, whose distribution is Poisson,
depends on a single predictor variable through the following form. This, by
default, is the form of the variance function as well.
$\lambda = e^{(\beta_0 + \beta_1 \cdot x)}$
The link function is a function that relates the mean of the response to the
predictor variable in a linear way. A natural log link is appropriate for the mean
function specified above.
$\ln(\lambda) = \beta_0 + \beta_1 \cdot x$
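The claim that the Poisson mean and variance coincide, and that both follow from the log link, can be verified by summing the probability mass function directly. The sketch below is illustrative only: the coefficients 4.0 and -1.5 are hypothetical, and the helper names `poisson_pmf` and `mean_and_variance` are not from JMP or the handout.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam (computed on the log scale)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def mean_and_variance(lam, k_max=500):
    """Numerically sum the pmf over 0..k_max to get the mean and the variance."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = 4.0, -1.5
for x in (0.0, 1.0, 2.0):
    lam = math.exp(beta0 + beta1 * x)   # inverse of the log link
    mean, var = mean_and_variance(lam)
    print(f"x = {x}: lambda = {lam:.2f}, mean = {mean:.2f}, variance = {var:.2f}")
```

For each $x$ the printed mean and variance both agree with $\lambda$, so specifying the mean function through the log link also fixes the variance function.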
Fitting a Generalized Linear Model with Distribution = Poisson and Link = Log in JMP
Model Setup
• Response: Count
• Predictor: Concentration
• Personality: Generalized Linear Model
• Distribution: Poisson
• Link: Log
Fit Model window in JMP
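For readers without JMP, the same kind of fit can be sketched in plain Python. This is not JMP's implementation: it is a minimal Newton-Raphson (equivalently, iteratively reweighted least squares) routine for one predictor. The data are simulated, not the handout's, using the estimates reported later in this handout (4.33 and -1.54) as the assumed true values; the Concentration grid, the seed, and all function names are hypothetical.

```python
import math
import random

def simulate_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def fit_poisson_log_link(x, y, iterations=25):
    """Newton-Raphson for the Poisson model ln(lambda) = b0 + b1 * x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from a flat mean function
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (the weights are the fitted means).
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data: 10 replicates at each of 8 assumed Concentration levels.
rng = random.Random(16)
true_b0, true_b1 = 4.33, -1.54
levels = (0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75)
concentration = [c for c in levels for _ in range(10)]
count = [simulate_poisson(math.exp(true_b0 + true_b1 * c), rng) for c in concentration]

b0_hat, b1_hat = fit_poisson_log_link(concentration, count)
print(f"intercept estimate: {b0_hat:.3f}")
print(f"slope estimate:     {b1_hat:.3f}")
```

The estimates should land near the values used to simulate the data, though not exactly on them, since the counts are random. A statistics package would also report standard errors and deviance, which this sketch omits.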
JMP provides the following output for the generalized linear model.
Sketch of Estimated Mean Function
Modeling Output
• Some of the usual output, e.g. $R^2$ and RMSE, is missing in this generalized linear model
output. Such simple summaries of the model do not necessarily generalize to these
more complicated models.
• Understanding model coefficients, e.g. $\hat{\beta}_1$, can be achieved in a manner similar to what was done
above for the log10 transformation of the response.
Concentration | $\text{Count} \mid \text{Concentration}$ | Rate of Increase
2 | $e^{(4.33 + (-1.54 \cdot 2))} = 3.49$ |
1 | $e^{(4.33 + (-1.54 \cdot 1))} = 16.28$ | $16.28 / 3.49 = 4.66$
0 | $e^{(4.33 + (-1.54 \cdot 0))} = 75.94$ | $75.94 / 16.28 = 4.66$
Thus, we'd expect to see about a 4.66-fold increase in Counts for a one-unit decrease in
Concentration. This can be seen directly by considering the following ratio. Again, in
this ratio $(x - 1)$ is being used instead of $(x + 1)$ because a one-unit decrease in
Concentration is being used instead of a one-unit increase.
$\frac{e^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot (x - 1))}}{e^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}} = \frac{e^{4.33} \cdot e^{-1.54 \cdot (x - 1)}}{e^{4.33} \cdot e^{-1.54 \cdot x}} = \frac{e^{-1.54 \cdot x} \cdot e^{+1.54}}{e^{-1.54 \cdot x}} = e^{+1.54} = 4.66$
• A plot of the residuals (or the Studentized Deviance Residual) is provided here.
These modified residuals should be centered at 0, which supports the form of the
mean function being correct. The predicted counts near the right side of this graph *should*
exhibit more variability because, for the Poisson distribution, as the mean increases
the variance should also increase.
Poisson Distribution: If Mean ↑, then Variance ↑
• In JMP, you are able to save the predicted counts. When this is done, you can
create the usual Predicted Counts vs. Actual Counts plot to understand the quality
of your model.
• The following plot displays the predictions from each of the models considered thus
far: i) Poisson model, ii) log10(Count) model, and iii) Quadratic model. These three
models are very similar.
Model | $R^2$ | AIC
Poisson | 88.8% | 446.7
Quadratic | 88.1% | 504.5
Response = log10(Count) | 88.6% | Different scale
Note: The $R^2$ quantity for the Poisson model was calculated using the brute force
approach of "hand-calculating" the sum of squared residuals.
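The hand calculation mentioned in the note might look like the following sketch, assuming the saved predicted counts are compared with the actual counts on the original scale and $R^2$ is taken as one minus the ratio of the residual sum of squares to the total sum of squares. The handout does not spell out its exact formula, and the numbers below are placeholders, not the handout's data.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST, computed on the original count scale."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # residual sum of squares
    sst = sum((a - mean_actual) ** 2 for a in actual)            # total sum of squares
    return 1.0 - sse / sst

# Placeholder numbers for illustration only; they are not the handout's data.
actual_counts = [10, 20, 30]
predicted_counts = [12.0, 18.0, 30.0]
print(f"R^2 = {r_squared(actual_counts, predicted_counts):.2f}")
```

Computing $R^2$ this way for every model puts the fits on the same footing, since each is judged by its predictions of Count rather than of a transformed response.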