Modelling Count Data

Handout #16: Modeling Count Data

Section 16.1: Modeling Count Data using Linear Regression Models

Example 16.1

For this example, we will consider the relationship between the concentration of a chemical compound and the count of Ceriodaphnia dubia, a species of water flea. Ceriodaphnia dubia is commonly used in testing the toxicity of water leaving a wastewater treatment facility.

Wiki Entry for Ceriodaphnia dubia: http://en.wikipedia.org/wiki/Ceriodaphnia_dubia

[Figure: Picture of Ceriodaphnia dubia]
[Figure: Snippet of data]

The following is a scatterplot of the relationship between Concentration and Count. We can see that a model that assumes a constant rate of decrease in counts as a function of concentration will not be sufficient.

[Figure: Scatterplot of Count vs. Concentration]

Simple Linear Regression Model Setup: assuming a constant rate of decrease

• Response Variable: Count
• Predictor Variable: Concentration
• The standard mean and variance functions
    o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
    o $Var(\text{Count} \mid \text{Concentration}) = \sigma^2$

[Figure: Output from the model]
[Figure: Residual Plot]

Issues:
1. A mean function that assumes a constant rate of decrease is inadequate.
2. The constant variance function appears to be inadequate as well.

[Plot: Residuals vs. Concentration. Problem: mean function inadequate]
[Plot: |Residual| vs. Concentration. Problem: constant variance inadequate]

Updated Model Setup

• Response Variable: Count
• Predictor Variable: Concentration
• Terms: Intercept, Concentration, Concentration^2
• Consider the following mean and variance functions
    o $E(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \beta_2 \cdot \text{Concentration}^2$
    o $Var(\text{Count} \mid \text{Concentration}) = \sigma^2$

[Figure: Regression Output]
[Plot: Fitting a quadratic mean function]

A quadratic mean function appears to fit the data fairly well. However, the constant variance function remains inadequate. One simple approach to addressing non-constant variance is to transform the response.
Transforming the response implies that the response has a multiplicative form instead of the usual additive form.

Multiplicative versus Additive Structure

An additive form for the response has been considered up to this point in this course.

    $\text{Response} \mid x = \beta_0 + \beta_1 \cdot x + \varepsilon$
    $E(\text{Response} \mid x) = \beta_0 + \beta_1 \cdot x$

For an additive mean function, the rate of increase per unit change in $x$ is given by $\beta_1$. Thus, if $\beta_0 = 10$ and $\beta_1 = 3$, then successive differences in the expected response are constant: the expected response is 10, 13, 16 at $x = 0, 1, 2$, a difference of 3 each time.

A multiplicative form of the response has the following form.

    $\text{Response} \mid x = 10^{\beta_0} \cdot 10^{\beta_1 \cdot x} \cdot 10^{\varepsilon} = 10^{(\beta_0 + \beta_1 \cdot x)} \cdot 10^{\varepsilon} = 10^{(\beta_0 + \beta_1 \cdot x + \varepsilon)}$

Note: The mean and variability in this conditional distribution are not additive, but multiplicative. Thus, a large (or small) value of $x$ may exhibit more (or less) variability in the response.

Consider the following tables in an effort to understand a multiplicative form for the response. Suppose $\beta_0 = 0.01$, $\beta_1 = 0.5$, and $\varepsilon = 0$; then $\text{Response} \mid x$ when $x = 1$ would be calculated as follows: $10^{(0.01 + 0.5 \cdot 1 + 0)} = 3.24$.

[Table: Successive differences for a multiplicative function are not constant]
[Table: Successive ratios are constant]

A simple transformation of the response can change a multiplicative form of the response to an additive form. For example, when a log base 10 transformation of the response is taken, the successive differences are indeed constant and given by $\beta_1 = 0.5$.

In an effort to fix the non-constant variance issue discovered above, a log transformation of the response will be considered.
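The contrast between the additive and multiplicative forms described above can be checked numerically. The following is a minimal Python sketch, not part of the original handout. It uses the example values $\beta_0 = 0.01$, $\beta_1 = 0.5$, and $\varepsilon = 0$; the grid of x values (0 through 4) is an assumption chosen only for illustration.

```python
import math

# Parameter values from the worked example in the text (epsilon = 0).
beta0, beta1 = 0.01, 0.5

# Illustrative grid of x values; this grid is an assumption.
xs = [0, 1, 2, 3, 4]

# Multiplicative form: Response | x = 10 ** (beta0 + beta1 * x)
response = [10 ** (beta0 + beta1 * x) for x in xs]

# Successive differences on the original scale (these grow with x).
differences = [b - a for a, b in zip(response, response[1:])]

# Successive ratios on the original scale (all equal to 10 ** beta1).
ratios = [b / a for a, b in zip(response, response[1:])]

# Successive differences after a log10 transformation (all equal to beta1).
log_differences = [math.log10(b) - math.log10(a)
                   for a, b in zip(response, response[1:])]

for x, r in zip(xs, response):
    print(f"x = {x}: response = {r:.2f}")
print("differences:    ", [round(d, 2) for d in differences])
print("ratios:         ", [round(r, 2) for r in ratios])
print("log10 diffs:    ", [round(d, 2) for d in log_differences])
```

On the original scale only the ratios are constant, while on the log10 scale the differences are constant, which is why a log transformation turns the multiplicative form into an additive one.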
Regression Setup: Using a log10() transformation of the response

• Response Variable: log10(Count)
• Predictor Variable: Concentration
• Multiplicative form of the response
    $\text{Count} \mid \text{Concentration} = 10^{(\beta_0 + \beta_1 \cdot \text{Concentration} + \varepsilon)}$
• Using a log10 transformation to convert to an additive form
    $\log_{10}(\text{Count} \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration} + \varepsilon$
• Usual mean and variance functions on the transformed scale
    o $E(\log_{10}(\text{Count}) \mid \text{Concentration}) = \beta_0 + \beta_1 \cdot \text{Concentration}$
    o $Var(\log_{10}(\text{Count}) \mid \text{Concentration}) = \sigma^2$

Comments:
1. A linear mean function appears appropriate when the response is transformed.
2. Transforming the response does not appear to fix the non-constant variance problem. Interestingly, the inflated variance in the response now occurs when counts are small (not large, as on the original scale).

• The log10 transformation of the response did not appropriately address the non-constant variance issue. A log2 transformation does not appear to address this problem either.

[Plot: Using a log2 transformation of the response]
[Plot: Using a log10 transformation of the response]

Consider the following regarding the interpretation of $\hat\beta_1$, where $\log_{10}(\text{Count} \mid \text{Concentration}) = \hat\beta_0 + \hat\beta_1 \cdot \text{Concentration}$.

    Concentration   log10(Count | Concentration)    Count | Concentration   Rate of Increase
    2               1.89 + (-0.71 * 2) = 0.47       10^0.47 = 2.95
    1               1.89 + (-0.71 * 1) = 1.18       10^1.18 = 15.14         15.14 / 2.95 = 5.13
    0               1.89 + (-0.71 * 0) = 1.89       10^1.89 = 77.62         77.62 / 15.14 = 5.13

Thus, we'd expect to see about a 5-fold increase in Count for a one-unit decrease in Concentration. This can be seen directly by considering the following ratio. In this ratio $(x - 1)$ is used instead of $(x + 1)$ because a one-unit decrease in Concentration is being considered instead of a one-unit increase.

    $\dfrac{10^{(\hat\beta_0 + \hat\beta_1 \cdot (x-1))}}{10^{(\hat\beta_0 + \hat\beta_1 \cdot x)}} = \dfrac{10^{1.89} \cdot 10^{-0.71 \cdot (x-1)}}{10^{1.89} \cdot 10^{-0.71 \cdot x}} = \dfrac{10^{-0.71 \cdot x} \cdot 10^{+0.71}}{10^{-0.71 \cdot x}} = 10^{+0.71} = 5.13$

• The $R^2$ value from this model is somewhat smaller than that of the quadratic model fit above.
• Even after the log10 transformation of the response, the non-constant variance problem remains.

[Plot: Non-constant variance remains even after the log10 transformation]
[Plot: Residuals lack normality as well]

Section 16.2: Modeling Count Data using a Generalized Linear Model

A more precise and succinct method of modeling counts can be achieved through the use of a generalized linear model. The generalized linear modeling framework extends what we've learned in this class to a wider class of modeling situations.

    Poisson ↔ Modeling Counts

The Poisson distribution is often used to model counts. One characteristic of the Poisson distribution is that its mean and variance are equal. Thus, as the mean increases, so does the variance. This is often the case when modeling counts. The wiki entry for the Poisson distribution is provided here.

Source: http://en.wikipedia.org/wiki/Poisson_distribution

Assume the mean of a response variable, say $\lambda$, whose distribution is Poisson, depends on a single predictor variable through the following form. This, by default, is the form of the variance function as well.

    $\lambda = e^{(\beta_0 + \beta_1 \cdot x)}$

The link function is the transformation of the mean of the response that is linear in the predictor variable. A natural log link is appropriate for the mean function specified above.

    $\ln(\lambda) = \beta_0 + \beta_1 \cdot x$

Fitting a Generalized Linear Model with Distribution = Poisson and Link = Log in JMP

Model Setup
• Response: Count
• Predictor: Concentration
• Personality: Generalized Linear Model
• Distribution: Poisson
• Link: Log

[Figure: Fit Model window in JMP]

JMP provides the following output for the generalized linear model.

[Figure: Sketch of Estimated Mean Function]

• Modeling Output

Some of the usual output, e.g.
$R^2$ and RMSE, is missing in this generalized linear model output. Such simple summaries of the model do not necessarily generalize to these more complicated models.

• Understanding model coefficients, e.g. $\hat\beta_1$, can be achieved in a manner similar to what was done above for the log10 transformation of the response.

    Concentration   Count | Concentration               Rate of Increase
    2               e^(4.33 + (-1.54 * 2)) = 3.49
    1               e^(4.33 + (-1.54 * 1)) = 16.28      16.28 / 3.49 = 4.66
    0               e^(4.33 + (-1.54 * 0)) = 75.94      75.94 / 16.28 = 4.66

Thus, we'd expect to see about a 4.66-fold increase in Count for a one-unit decrease in Concentration. This can be seen directly by considering the following ratio. Again, in this ratio $(x - 1)$ is used instead of $(x + 1)$ because a one-unit decrease in Concentration is being considered instead of a one-unit increase.

    $\dfrac{e^{(\hat\beta_0 + \hat\beta_1 \cdot (x-1))}}{e^{(\hat\beta_0 + \hat\beta_1 \cdot x)}} = \dfrac{e^{4.33} \cdot e^{-1.54 \cdot (x-1)}}{e^{4.33} \cdot e^{-1.54 \cdot x}} = \dfrac{e^{-1.54 \cdot x} \cdot e^{+1.54}}{e^{-1.54 \cdot x}} = e^{+1.54} = 4.66$

• A plot of the residuals (the Studentized Deviance Residuals) is provided here. These modified residuals should be centered at 0, which supports the chosen form of the mean function. The predicted counts near the right side of this graph *should* exhibit more variability because, for the Poisson distribution, the variance increases as the mean increases.

    Poisson Distribution: If Mean ↑, then Variance ↑

• In JMP, you are able to save the predicted counts. When this is done, you can create the usual Predicted Counts vs. Actual Counts plot to understand the quality of your model.

• The following plot displays the predictions from each of the models considered thus far: i) Poisson model, ii) log10(Count) model, and iii) Quadratic model. These three models give very similar predictions.

    Model                       R^2      AIC
    Poisson                     88.8%    446.7
    Quadratic                   88.1%    504.5
    Response = log10(Count)     88.6%    Different scale

Note: The $R^2$ quantity for the Poisson model was calculated using the brute-force approach of "hand-calculating" the sum of squared residuals.
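The Poisson fit with a log link, the fold-increase interpretation of the slope, and the "hand-calculated" $R^2$ can be sketched together outside of JMP. The following is a minimal Python sketch under stated assumptions, not the handout's analysis: the five data points are hypothetical (they are not the Ceriodaphnia data), the function name `fit_poisson_log_link` is made up for this example, and the fitting routine is a plain Newton-Raphson (Fisher scoring) iteration rather than whatever algorithm JMP uses internally.

```python
import math

# Hypothetical data for illustration only; NOT the handout's data set.
conc = [0.0, 0.5, 1.0, 1.5, 2.0]
count = [76, 35, 16, 8, 3]


def fit_poisson_log_link(x, y, iterations=25):
    """Newton-Raphson (Fisher scoring) for ln(lambda) = b0 + b1 * x."""
    # Start from the intercept-only fit: lambda = mean of y, slope = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the Poisson log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information; the weights are the fitted means because
        # the Poisson variance equals the mean.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + inverse(information) * score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


b0, b1 = fit_poisson_log_link(conc, count)
predicted = [math.exp(b0 + b1 * xi) for xi in conc]

# Fold increase in the expected count for a one-unit DECREASE in x.
fold_increase = math.exp(-b1)

# "Hand-calculated" R^2 on the count scale: 1 - SSE / SST.
mean_count = sum(count) / len(count)
sse = sum((yi - pi) ** 2 for yi, pi in zip(count, predicted))
sst = sum((yi - mean_count) ** 2 for yi in count)
r_squared = 1 - sse / sst

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"fold increase per one-unit decrease = {fold_increase:.2f}")
print(f"R^2 on the count scale = {r_squared:.3f}")
```

Because $R^2$ here is computed from predicted and actual counts on the original scale, it is comparable across the Poisson and quadratic models, whereas AIC values from a model fit to log10(Count) are on a different scale.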
