Handout #16: Modeling Count Data

Section 16.1: Modeling Count Data using Linear Regression Models

Example 16.1

For this example, we will consider the relationship between the concentration of a chemical compound and the count of Ceriodaphnia dubia, a particular species of water flea. Ceriodaphnia dubia is commonly used in testing the toxicity of water leaving a wastewater treatment facility.

Wiki entry for Ceriodaphnia dubia: http://en.wikipedia.org/wiki/Ceriodaphnia_dubia

[Figure: Picture of Ceriodaphnia dubia]

[Figure: Snippet of data]

The following is a scatterplot of the relationship between Concentration and Count. We can see that a model that assumes a constant rate of decrease in counts as a function of concentration will not be sufficient.

Simple Linear Regression Model Setup: assuming a constant rate of decrease

• Response Variable: Count
• Predictor Variable: Concentration
• The standard mean and variance functions
  o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
  o $Var(\text{Count} \mid \text{Concentration}) = \sigma^2$

[Output from the model]

[Residual plot]

Issues:
1. A mean function that assumes a constant rate of decrease is inadequate.
2. The constant variance function appears to be inadequate as well.

[Plot: Residuals vs. Concentration. Problem: Mean function inadequate]

[Plot: |Residual| vs. Concentration. Problem: Constant variance inadequate]

Updated Model Setup

• Response Variable: Count
• Predictor Variable: Concentration
• Terms: Intercept, Concentration, Concentration^2
• Consider the following mean and variance functions
  o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \beta_2 \cdot \text{Concentration}^2$
  o $Var(\text{Count} \mid \text{Concentration}) = \sigma^2$

[Regression output: Fitting a quadratic mean function]

Using a quadratic form of the mean function appears to fit the data fairly well. However, the use of a constant variance function remains inadequate. One simple approach to fixing non-constant variance is to transform the response.
Such a transformation implies that the response has a multiplicative form rather than the usual additive form.

Multiplicative versus Additive Structure

An additive form for the response has been considered up to this point in this course.

$\text{Response} \mid x = \beta_0 + \beta_1 \cdot x + \epsilon$

$E(\text{Response} \mid x) = \beta_0 + \beta_1 \cdot x$

For an additive mean function, the rate of increase per unit change in $x$ is given by $\beta_1$. Thus, if $\beta_0 = 10$ and $\beta_1 = 3$, then the expected responses at $x = 1, 2, 3$ are 13, 16, 19, and the successive differences are constant (each equal to 3).

A multiplicative form of the response is as follows.

$\text{Response} \mid x = 10^{\beta_0} \cdot 10^{\beta_1 \cdot x} \cdot 10^{\epsilon} = 10^{(\beta_0 + \beta_1 \cdot x)} \cdot 10^{\epsilon} = 10^{(\beta_0 + \beta_1 \cdot x + \epsilon)}$

Note: The mean and variability in this conditional distribution are not additive, but multiplicative. Thus, a large (or small) value of $x$ may exhibit more (or less) variability in the response.

Consider the following tables in an effort to understand a multiplicative form for the response. Suppose $\beta_0 = 0.01$, $\beta_1 = 0.5$, and $\epsilon = 0$; then $\text{Response} \mid x$ when $x = 1$ would be calculated as $10^{(0.01 + 0.5 \cdot 1 + 0)} = 3.24$.

[Table: Successive differences for a multiplicative function are not constant]

[Table: Successive ratios are constant]

A simple transformation of the response can change a multiplicative form of the response to an additive form. For example, when a log base 10 transformation of the response is taken, the successive differences are indeed constant and given by $\beta_1 = 0.5$.

In an effort to fix the non-constant variance issue discovered above, a log transformation of the response will be considered.
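The contrast between successive differences and successive ratios can be checked numerically. The following is a minimal Python sketch (Python is not used in this handout) that takes the values $\beta_0 = 0.01$, $\beta_1 = 0.5$, and $\epsilon = 0$ from the example above. The grid of x values and the function name are assumptions made for illustration.

```python
import math

# Coefficients from the handout's example; the grid of x values is an assumption.
BETA0, BETA1, EPS = 0.01, 0.5, 0.0
xs = [1, 2, 3, 4, 5]

def multiplicative_response(x, beta0=BETA0, beta1=BETA1, eps=EPS):
    """Response under the multiplicative form 10**(beta0 + beta1*x + eps)."""
    return 10 ** (beta0 + beta1 * x + eps)

responses = [multiplicative_response(x) for x in xs]

# Successive differences grow with x, so they are not constant.
differences = [b - a for a, b in zip(responses, responses[1:])]

# Successive ratios are all equal to 10**beta1.
ratios = [b / a for a, b in zip(responses, responses[1:])]

# After a log10 transformation, successive differences are all equal to beta1.
log_differences = [math.log10(b) - math.log10(a)
                   for a, b in zip(responses, responses[1:])]

for x, r in zip(xs, responses):
    print(f"x = {x}: response = {r:.2f}")
print("differences:      ", [round(d, 2) for d in differences])
print("ratios:           ", [round(q, 2) for q in ratios])
print("log10 differences:", [round(d, 2) for d in log_differences])
```

The first printed response, at x = 1, is 3.24, matching the hand calculation above.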
Regression Setup: Using a log10() transformation of the response

• Response Variable: log10(Count)
• Predictor Variable: Concentration
• Multiplicative form of the response
  $\text{Count} \mid \text{Concentration} = 10^{(\beta_0 + \beta_1 \cdot \text{Concentration} + \epsilon)}$
• Using a log10 transformation to convert to an additive form
  $\log_{10}(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \epsilon$
• Usual mean and variance functions on the transformed scale
  o $E(\log_{10}(\text{Count}) \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
  o $Var(\log_{10}(\text{Count}) \mid \text{Concentration}) = \sigma^2$

Comments:
1. A linear mean function appears appropriate when the response is transformed.
2. Transforming the response does not appear to fix the non-constant variance problem. Interestingly, the inflated variance in the response now occurs when counts are small (not large as in the original scale).

• The log10 transformation of the response did not appropriately address the non-constant variance issue. A log2 transformation does not appear to address this problem either.

[Plot: Using a log2 transformation of the response]

[Plot: Using a log10 transformation of the response]

• Consider the following regarding the interpretation of $\hat{\beta}_1$. The fitted equation is $\log_{10}(\text{Count} \mid \text{Concentration}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Concentration}$, with $\hat{\beta}_0 = 1.89$ and $\hat{\beta}_1 = -0.71$.

Concentration   Predicted log10(Count)       Predicted Count     Rate of Increase
2               1.89 + (-0.71)(2) = 0.47     10^0.47 = 2.95
1               1.89 + (-0.71)(1) = 1.18     10^1.18 = 15.14     15.14 / 2.95 = 5.13
0               1.89 + (-0.71)(0) = 1.89     10^1.89 = 77.62     77.62 / 15.14 = 5.13

Thus, we'd expect to see about a 5-fold increase in Count for a one-unit decrease in Concentration. This can be seen directly by considering the following ratio. In this ratio, $(x - 1)$ is used instead of $(x + 1)$ because a one-unit decrease in Concentration is being considered instead of a one-unit increase.
$\dfrac{10^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot (x-1))}}{10^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}} = \dfrac{10^{1.89} \cdot 10^{-0.71 \cdot (x-1)}}{10^{1.89} \cdot 10^{-0.71 \cdot x}} = \dfrac{10^{-0.71 \cdot x} \cdot 10^{+0.71}}{10^{-0.71 \cdot x}} = 10^{+0.71} = 5.13$

• The $R^2$ value from this model is somewhat smaller than that of the quadratic model fit above.
• Even after the log10 transformation of the response, non-constant variance remains a problem.

[Plot: Non-constant variance remains even after the log10 transformation]

[Plot: Residuals lack normality as well]

Section 16.2: Modeling Count Data using a Generalized Linear Model

A more precise and succinct method of modeling counts can be achieved through the use of a generalized linear model. The generalized linear modeling framework extends what we've learned in this class to a wider class of modeling situations.

Poisson ↔ Modeling Counts

The Poisson distribution is often used to model counts. One characteristic of the Poisson distribution is that the mean and variance are the same. Thus, as the mean increases, so does the variance. This is often the case when modeling counts. The wiki entry for the Poisson distribution is provided here.

Source: http://en.wikipedia.org/wiki/Poisson_distribution

Assume the mean, say $\mu$, of a response variable whose distribution is Poisson depends on a single predictor variable through the following form. This, by default, is the form of the variance function as well.

$\mu = e^{(\beta_0 + \beta_1 \cdot x)}$

The link function is a function that relates the mean of the response to the predictor variable in a linear way. A natural log link is appropriate for the mean function specified above.

$\ln(\mu) = \beta_0 + \beta_1 \cdot x$

Fitting a Generalized Linear Model with Distribution = Poisson and Link = Log in JMP

Model Setup

• Response: Count
• Predictor: Concentration
• Personality: Generalized Linear Model
• Distribution: Poisson
• Link: Log

[Figure: Fit Model window in JMP]

JMP provides the following output for the generalized linear model.

[Figure: Sketch of Estimated Mean Function]

[Modeling Output]

• Some of the usual output, e.g. $R^2$ and RMSE, is missing in this generalized linear model output. Such simple summaries of the model do not necessarily generalize to these more complicated models.
• Model coefficients, e.g. $\hat{\beta}_1$, can be interpreted in much the same way as was done above for the log10 transformation of the response.

Concentration   Predicted Count                      Rate of Increase
2               e^(4.33 + (-1.54)(2)) = 3.49
1               e^(4.33 + (-1.54)(1)) = 16.28        16.28 / 3.49 = 4.66
0               e^(4.33 + (-1.54)(0)) = 75.94        75.94 / 16.28 = 4.66

Thus, we'd expect to see about a 4.66-fold increase in Count for a one-unit decrease in Concentration. This can be seen directly by considering the following ratio. Again, in this ratio, $(x - 1)$ is used instead of $(x + 1)$ because a one-unit decrease in Concentration is being considered instead of a one-unit increase.

$\dfrac{e^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot (x-1))}}{e^{(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}} = \dfrac{e^{4.33} \cdot e^{-1.54 \cdot (x-1)}}{e^{4.33} \cdot e^{-1.54 \cdot x}} = \dfrac{e^{-1.54 \cdot x} \cdot e^{+1.54}}{e^{-1.54 \cdot x}} = e^{+1.54} = 4.66$

• A plot of the residuals (the Studentized Deviance Residuals) is provided here. These modified residuals should be centered at 0, which indicates that the form of the mean function is correct. The points at the larger predicted counts, near the right side of this graph, *should* exhibit more variability because, for the Poisson distribution, the variance increases as the mean increases.

Poisson Distribution: If Mean ↑, then Variance ↑

• In JMP, you are able to save the predicted counts. When this is done, you can create the usual Predicted Counts vs. Actual Counts plot to understand the quality of your model.
• The following plot displays the predictions from each of the models considered thus far: i) Poisson model, ii) log10(Count) model, and iii) quadratic model. The predictions from these three models are very similar.
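The fold-change interpretations of the slope estimates for the log10(Count) model and the Poisson model can be checked with a few lines of arithmetic. The following is a minimal Python sketch (the handout itself uses JMP, not Python) that uses only the coefficient estimates reported in the handout. The function and variable names are assumptions made for illustration.

```python
import math

# Coefficient estimates as reported in the handout's output.
LOG10_B0, LOG10_B1 = 1.89, -0.71      # linear model for log10(Count)
POISSON_B0, POISSON_B1 = 4.33, -1.54  # Poisson model with log link

def predicted_count(b0, b1, concentration, base):
    """Back-transform the linear predictor to the count scale."""
    return base ** (b0 + b1 * concentration)

def fold_change_per_unit_decrease(b1, base):
    """Ratio of predicted counts at (x - 1) versus x; it does not depend on x."""
    return base ** (-b1)

for label, b0, b1, base in [
    ("log10 model", LOG10_B0, LOG10_B1, 10.0),
    ("Poisson model", POISSON_B0, POISSON_B1, math.e),
]:
    counts = [predicted_count(b0, b1, c, base) for c in (0, 1, 2)]
    fold = fold_change_per_unit_decrease(b1, base)
    print(label, "predicted counts at 0, 1, 2:", [round(c, 2) for c in counts])
    print(label, "fold change per one-unit decrease:", round(fold, 2))
```

Because the ratio cancels the intercept and the $x$ term, the fold change depends only on the slope and the base of the exponent, which is why the same value appears for every pair of adjacent concentrations.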
Model                      R^2      AIC
Poisson                    88.8%    446.7
Quadratic                  88.1%    504.5
Response = log10(Count)    88.6%    Different scale

Note: The $R^2$ quantity for the Poisson model was calculated using the brute-force approach of "hand-calculating" the sum of squared residuals.
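To make the "hand-calculated" $R^2$ concrete, the following is a minimal Python sketch, not the JMP procedure used in this handout. It assumes that $R^2$ is computed as 1 - SSE/SST on the count scale, which the handout does not state explicitly. The data are hypothetical values invented for illustration and are NOT the Ceriodaphnia data, so the results will not match the table above. The fitting routine is a plain Newton-Raphson iteration for a Poisson model with a log link and a single predictor.

```python
import math

# Hypothetical data for illustration only; these are NOT the handout's data.
concentration = [0, 0, 1, 1, 2, 2, 3, 3]
count = [80, 72, 17, 15, 4, 3, 1, 0]

def fit_poisson_log_link(x, y, iterations=50, tol=1e-10):
    """Newton-Raphson fit of ln(mu) = b0 + b1*x under a Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix; variance equals the mean.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information matrix times the score vector.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def r_squared(actual, predicted):
    """Assumed definition: 1 - SSE/SST, computed on the original count scale."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

b0_hat, b1_hat = fit_poisson_log_link(concentration, count)
predicted = [math.exp(b0_hat + b1_hat * x) for x in concentration]
r2 = r_squared(count, predicted)

print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
print(f"fold change per one-unit decrease = {math.exp(-b1_hat):.2f}")
print(f"hand-calculated R^2 = {r2:.3f}")
```

The `predicted` list plays the role of the saved predicted counts, so the same `r_squared` helper could be applied to predictions from any of the three models once they are expressed on the count scale.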