Workshop on Binary Regression
Logistic Regression + Classification Trees + Regression Trees + Graphics + Multinomial Regression
Hyderabad, December 26-29, 2012
MB RAO

Module 1: An Introduction to Logistic Regression + Fitting the model with R + Goodness-of-fit test

BINARY RESPONSE VARIABLE AND LOGISTIC REGRESSION

A binary variable is a variable with only two possible values. There are many, many examples of binary variables in statistical work.

Example. A patient is admitted with abdominal sepsis (blood poisoning). The case is severe enough to warrant surgery. The patient is wheeled into the operating theatre. Let us speculate on what will happen after the surgery. Let

Y = 1 if death follows surgery,
  = 0 if the patient survives.

The outcome Y is random. Since it is random, we want to know its distribution. Pr(Y = 1) = π, say, and Pr(Y = 0) = 1 - π. In simple terms, we want to know the chances (π) that a patient dies after surgery. Equivalently, what are the chances (1 - π) of survival after surgery?

Are there any prognostic variables or factors which could influence the outcome Y? Surgeons list the following variables which could have some bearing on Y.

1. X1: Shock. Is the patient in a state of shock before the surgery? X1 = 1 if yes, = 0 if no.
2. X2: Malnutrition. Is the patient undernourished? X2 = 1 if yes, = 0 if no.
3. X3: Alcoholism. Is the patient alcoholic? X3 = 1 if yes, = 0 if no.
4. X4: Age.
5. X5: Bowel infarction. Does the patient have bowel infarction? X5 = 1 if yes, = 0 if no.

The variables X1, X2, X3, and X5 are categorical covariates and X4 is a continuous covariate. The categorical variables are binary with only two possible values. It is felt that the outcome Y depends on these covariates.

Response variable: Y
Covariates, predictors, or independent variables: X1, X2, X3, X4, X5

More precisely, the probability π depends on X1 through X5. In order to indicate the dependence, we write π = π(X1, X2, X3, X4, X5). We want to model π as a function of the covariates.

Why does one want to build a model? If a model is in place, one could use it to assess the chances of survival after surgery for a patient before he is wheeled into the operating theatre. How? The surgeon could get information on X1, X2, X3, X4, X5 for the patient, calculate π = π(X1, X2, X3, X4, X5) from the postulated model, and then the chances of survival (1 - π) after surgery.

A possible model?

π(X1, X2, X3, X4, X5) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5

This is like a multiple regression model. This model is not acceptable. The left-hand side, π, is a probability. The right-hand side could be any real number.

Why not model Y directly as a function of the covariates? For example,

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5?

This is not acceptable either. The left-hand side Y takes only two values, 0 and 1, but the right-hand side could be any real number.

Logistic regression model

π(X1, X2, X3, X4, X5) = exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)].

This model looks very formidable. The left-hand side of the model is a probability, and hence its value should always be between zero and one. The right-hand side is always a number between zero and one. Why? (For any real number u, the ratio exp(u)/(1 + exp(u)) lies strictly between 0 and 1.) This model has 6 unknown parameters: β0, β1, β2, β3, β4, and β5. We need to know the values of these parameters before the model can be used. We can estimate the parameters of the model if we have data on a sample of patients. We do have data.
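To make the form of the model concrete, here is a minimal R sketch of how π would be computed once parameter values are in hand. The coefficient values in beta are made up for illustration only.

# Inverse logit: turns the linear form beta0 + beta1*X1 + ... + beta5*X5
# into a probability strictly between 0 and 1
pi_model <- function(x1, x2, x3, x4, x5, beta) {
  lp <- beta[1] + beta[2]*x1 + beta[3]*x2 + beta[4]*x3 + beta[5]*x4 + beta[6]*x5
  exp(lp) / (1 + exp(lp))
}
beta <- c(-9, 3.5, 1.2, 3.3, 0.09, 2.8)  # hypothetical values of beta0, ..., beta5
pi_model(1, 0, 1, 45, 1, beta)           # chance of death for one such patient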
Structure of the data:

Patient  Y  X1  X2  X3  X4  X5
1        1  1   1   1   47  1
2        1  1   0   0   53  1
3        0  0   1   0   32  0
etc.

We have data on 106 patients. Present the data. Discuss the data.

A digression: The approach I presented is purely statistical. Identify the variables of interest, designate the response variable or dependent variable, designate the covariates or independent variables, postulate a model, fit the model, and check its goodness-of-fit. Engineers, physicists, and computer scientists will look at the problem from a different angle. They work directly with the dependent or response variable. I will talk about their approach later.

Problems (Back to our problem)

1. How does one estimate the parameters of the model using the data? There are two standard methods available.
   1. Method of maximum likelihood. Write the likelihood of the data. Maximize the likelihood with respect to the parameters.
   2. Method of weighted least squares. The least squares principle is used to minimize a certain sum of squares. This method is much simpler than the method of maximum likelihood. Asymptotically, both methods are equivalent. If the sample is large, the estimates will be more or less the same.
2. Once the model is estimated, we need to check whether or not the model adequately summarizes the data. We need to assess how well the model fits the data. We may have to use some goodness-of-fit tests to make the assessment.
3. If the model fits the data well, we need to examine the impact of each and every covariate on the response variable. This is tantamount to identifying risk factors. We need to test the significance of each and every covariate in the model. If a covariate is not significant, we could remove it from the model and then fit a leaner and tighter model to the data.
4. If a particular model does not fit the data well, explore other models which can do a better job.
5. If an adequate model is fitted, explain how the model can be used in practice. Spend time on interpreting the model.

Before we pursue all these objectives, let us look at the model from another angle.

π(X1, X2, X3, X4, X5) = Probability of death after surgery for a patient with covariate values X1, X2, X3, X4, X5
= exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)]

1 - π(X1, X2, X3, X4, X5) = Probability of survival after surgery for a patient with covariate values X1, X2, X3, X4, X5
= 1 / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)]

π/(1 - π) = Odds of Death versus Survival after surgery = exp{β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5}

ln(π/(1 - π)) = log odds = logit = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5

This is like a multiple regression model. The log odds are a linear function of the covariates! This form of the model is very useful for interpretation. The parameter β0 is called the intercept of the model. The parameter β1 is called the regression coefficient associated with the variable 'Shock.' The parameter β2 is called the regression coefficient associated with the variable 'Malnutrition,' etc. These regression coefficients indicate how much impact the corresponding covariates have on the response variable.

The logistic regression model can be spelled out either in the form

π(X1, X2, X3, X4, X5) = exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)],

or in the form

ln(π/(1 - π)) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5.

Both are equivalent.
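Before looking at the actual fit, here is a minimal sketch of how such a model is fitted in R with glm. The data frame name 'sepsis' and its column names are assumptions made for illustration; the estimates reported below were obtained in this way.

# Sketch: fitting the logistic regression model, assuming a data frame
# 'sepsis' with columns Y, Shock, Malnutrition, Alcoholism, Age, Infarction
fit <- glm(Y ~ Shock + Malnutrition + Alcoholism + Age + Infarction,
           family = binomial, data = sepsis)
summary(fit)  # coefficients, standard errors, z-values, p-values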
Using R and the data, I fitted the model. The following are the estimates of the parameters along with their standard errors, z-values, and p-values.

Variable       Regression Coefficient   Standard Error   z-value   p-value
Intercept      -9.754                   2.534
Shock          3.674                    1.162            3.16      0.0016
Malnutrition   1.217                    0.7274           1.67      0.095
Alcoholism     3.355                    0.9797           3.43      0.0006
Age            0.09215                  0.03025          3.04      0.0023
Infarction     2.798                    1.161            2.41      0.016

Let us proceed in a systematic way with an analysis of the model.

1. Adequacy of the model. The model fits the data very well. We can say that the model is a good summarization of the data. I will talk about this aspect when I present and discuss the relevant program.

2. Estimated model.

ln(π/(1 - π)) = -9.754 + 3.674*X1 + 1.217*X2 + 3.355*X3 + 0.09215*X4 + 2.798*X5

3. Impact of the covariates on the response variable. Let us look at the covariate X1 (Shock). We want to test the null hypothesis H0: β1 = 0 (the covariate has no impact on the response variable, i.e., X1 is not significant) against the alternative H1: β1 ≠ 0 (the covariate has some impact on Y, i.e., X1 is significant). An estimate of β1 is 3.674. Is this value significant? We look at the corresponding z-value (Estimate/(Standard Error)). If the z-value exceeds 1.96 in absolute value, we reject the null hypothesis at the 5% level of significance. In our case, it does indeed exceed 1.96. The variable X1 is significant.

There is another way to check the significance of a variable. Look at the corresponding p-value.
a. If p ≤ 0.001, the covariate is very, very significant.
b. If 0.001 < p ≤ 0.01, the covariate is very significant.
c. If 0.01 < p ≤ 0.05, the covariate is significant.
d. If p > 0.05, the covariate is not significant.

In our case, p = 0.0016. The variable X1 is very significant. Further, the estimate 3.674 is positive. This means that X1 has a positive impact on the response variable: if the value of X1 goes up, the probability π goes up. In our example, the variable X1 takes only two values, 1 and 0. The probability π will be higher for a person with X1 = 1 than for a person with X1 = 0, other factors remaining the same. I will talk about 'how much higher' later.

4. Let us look at the other covariates.
Malnutrition: Not significant
Alcoholism: Very, very significant (positive impact)
Age: Very significant (the higher the age, the higher the probability of death)
Infarction: Significant (positive impact)

5. Let us make the model a little tighter. Chuck out 'Malnutrition' from the model. Refitting gives the following estimates.

Variable     Regression Coefficient   Standard Error   z-value
Intercept    -8.895
Shock        3.701                    1.103            3.355
Alcoholism   3.186                    0.9163           3.477
Age          0.08983                  0.02918          3.078
Infarction   2.386                    1.071            2.228

6. The fit is good. Every covariate is significant. The estimated model is

ln(π/(1 - π)) = -8.895 + 3.701*X1 + 3.186*X3 + 0.08983*X4 + 2.386*X5
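To see how the tighter model is used in practice, here is a small sketch evaluating it for one hypothetical patient (in shock, alcoholic, aged 60, no infarction); the covariate values are made up.

# Predicted probability of death from the refitted model for a hypothetical
# patient with Shock = 1, Alcoholism = 1, Age = 60, Infarction = 0
lp <- -8.895 + 3.701*1 + 3.186*1 + 0.08983*60 + 2.386*0
exp(lp)/(1 + exp(lp))  # roughly 0.97: very high risk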
Data on 'abdominal sepsis'

I have the data in Excel format. The columns are ID, Y, X1 (Shock), X2 (Malnutrition), X3 (Alcoholism), X4 (Age), and X5 (Infarction). The first 26 patients:

ID    Y  X1  X2  X3  X4  X5
1     0  0   0   0   56  0
2     0  0   0   0   80  0
3     0  0   0   0   61  0
4     0  0   0   0   26  0
5     0  0   0   0   53  0
6     1  0   1   0   87  0
7     0  0   0   0   21  0
8     1  0   0   1   69  0
9     0  0   0   0   57  0
10    0  0   1   0   76  0
11    1  0   0   1   66  1
12    0  0   0   0   48  0
13    0  0   0   0   18  0
14    0  0   0   0   46  0
15    0  0   1   0   22  0
16    0  0   1   0   33  0
17    0  0   0   0   38  0
19    0  0   0   0   27  0
20    1  1   1   0   60  1
22    0  0   0   0   31  0
102   0  0   0   0   59  1
103   0  0   0   0   29  0
104   0  1   0   0   60  0
105   1  1   0   0   63  1
106   0  0   0   0   80  0
107   0  0   0   0   23  0
...   (and so on for IDs 108 through 550; 106 patients in all)

I am working on a number of projects. In about fifty percent of these projects the response variable is binary.

A specific example. A child gets cancer. It could be any one of the following: Bone Cancer; Kidney (Wilms); Hodgkin's; Leukemia; Neuroblastoma; Non-Hodgkin's; Soft tissue sarcoma; CNS. Treatment begins. The child recovers and survives for five years. The child is placed on the Pediatric Cancer Registry and is followed lifelong. Periodically, the child is examined and a number of measurements are recorded. Some children get BCC (Basal Cell Carcinoma). Others don't. How can one explain this? What are the risk factors? Data are collected. 320 children got BCC at least once during the follow-up years. 723 children never got BCC.

Response variable: Occurrence of BCC (Yes or No)
Covariates:
Type of Cancer (categorical with 8 levels)
Age at diagnosis of cancer (numeric)
Follow-up Years (how many years the child was followed up after he/she entered the cancer registry)
Gender
Race
Radiation (Yes or No)
SMN (Did the child get a different cancer during the follow-up years?)

Model Fitting Using R

I want to illustrate how to use R to fit a logistic regression model for given data. The data I use here are different from those I presented earlier, and are easy to understand. A particular treatment is being evaluated to cure a particular medical condition. Introduce the response variable. Does the patient get relief when he/she has the treatment?

Y = 1 if yes
  = 0 if no

The response variable is binary. There are two prognostic variables: Age and Gender. A sample of 20 male and 20 female patients is chosen to try the treatment. Input the data.
> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65, 68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55, 60, 61, 61)

Gender is a categorical variable. Enter data on gender as a factor. Gender 0 means Female and 1 means Male.

> Gender <- factor(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

An alternative way of inputting the data on Gender:

> Gender <- rep(0:1, c(20, 20))

Response is a categorical variable. Enter data on Response as a factor. Response 1 means Yes and 0 means No.

> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))

Create a data frame containing data on all these variables. There is no need to do this, but it is good to have everything in a single folder.

> MB <- data.frame(Age, Gender, Response)

Look at the data.

> MB
   Age Gender Response
1   37      0        0
2   39      0        0
3   39      0        0
4   42      0        0
5   47      0        0
6   48      0        0
7   48      0        1
8   52      0        0
9   53      0        0
10  55      0        0
11  56      0        0
12  57      0        0
13  58      0        0
14  58      0        1
15  60      0        0
16  64      0        0
17  65      0        1
18  68      0        1
19  68      0        1
20  70      0        1
21  34      1        1
22  38      1        1
23  40      1        0
24  40      1        0
25  41      1        0
26  43      1        1
27  43      1        1
28  43      1        1
29  44      1        0
30  46      1        0
31  47      1        1
32  48      1        1
33  48      1        1
34  50      1        0
35  50      1        1
36  52      1        1
37  55      1        1
38  60      1        1
39  61      1        1
40  61      1        1

Reflect on the data. The first 20 patients are female and the last 20 are male. The ages are reported in increasing order of magnitude. At the lower ages (roughly the first 12 females) the treatment does not seem to be working. For males the treatment seems to be working at all ages with a good probability. We need to quantify our first impressions. Model building will help us.

Let us fit the logistic regression model. Create a folder which will store the output. The basic command is glm (generalized linear model). The command 'glm' is available in base R.

> MB1 <- glm(Response ~ Age + Gender, family = binomial, data = MB)

Let us look at the output.

> summary(MB1)

Call:
glm(formula = Response ~ Age + Gender, family = binomial, data = MB)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.86671  -0.80814   0.03983   0.78066   2.17061

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -9.84294    3.67576  -2.678  0.00741 **
Age           0.15806    0.06164   2.564  0.01034 *
Gender1       3.48983    1.19917   2.910  0.00361 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452 on 39 degrees of freedom
Residual deviance: 38.917 on 37 degrees of freedom
AIC: 44.917

Number of Fisher Scoring iterations: 5

The estimated model is:

ln(Pr(Y = 1)/Pr(Y = 0)) = -9.84294 + 0.15806*Age + 3.48983*Gender

Age is a significant covariate. How can you tell? Look at its p-value, 0.01034. Gender is a very significant factor. How can you tell? Look at its p-value, 0.00361.

General conclusions.
1. Look at the regression coefficients. Both are positive. The higher the age, the higher the chances of getting relief on the treatment.
2. Males and females react to the treatment significantly differently. The treatment is more beneficial to males than to females.

Justification. Two patients on the treatment:

Patient 1: Age = 50, Gender = Male
Patient 2: Age = 55, Gender = Male

Patient 1: Pr(Y = 1) = Probability of getting relief = 4.71/(1 + 4.71) = 0.82
Patient 2: Pr(Y = 1) = Probability of getting relief = 10.38/(1 + 10.38) = 0.91
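These hand calculations can be verified in R with the predict function applied to the fitted object MB1; type = "response" asks for probabilities rather than log-odds. (The data frame 'new' is constructed here just for this check.)

new <- data.frame(Age = c(50, 55), Gender = factor(c(1, 1), levels = 0:1))
predict(MB1, newdata = new, type = "response")  # approximately 0.82 and 0.91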
Another example. Two patients on the treatment:

Patient 1: Age = 50, Gender = Female
Patient 2: Age = 55, Gender = Female

Patient 1: Pr(Y = 1) = Probability of getting relief = 0.14/(1 + 0.14) = 0.12
Patient 2: Pr(Y = 1) = Probability of getting relief = 0.32/(1 + 0.32) = 0.24

Compare the probabilities of getting relief for a male and a female of the same age, 50:

Male:   Probability of relief = 0.82
Female: Probability of relief = 0.12

It is now time to focus on a goodness-of-fit test of the logistic regression model. We will work with the data presented above, with response variable 'Response' and covariates 'Gender' and 'Age.' We want to test the hypothesis that the logistic regression model is an adequate summary of the data. In other words, we want to test that the response probability Pr(Response = 1) follows the stipulated logistic regression pattern. The null hypothesis is

H0: Pr(Response = 1) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]

for some parameters β0, β1, and β2. The alternative is

H1: H0 is not true.

Hosmer and Lemeshow devised a test of the validity of the null hypothesis. I will not go through the rationale behind the test. We will use R to conduct the test on our data. The test is available in a package called 'rms.' First, download the package and put it into service.

> library(rms)

The basic command is 'lrm' (logistic regression model). Create a new folder for the execution of 'lrm.' The package 'rms' also fits the logistic regression model; only the command is different.

> MB2 <- lrm(Response ~ Age + Gender, data = MB, x = TRUE, y = TRUE)

You need to ask for the output. The output is given later. The following command gives the results of the Hosmer-Lemeshow test.

> residuals.lrm(MB2, type = 'gof')

Sum of squared errors   Expected value|H0          SD
            6.4736338           6.4612280   0.2246174
                    Z                   P
            0.0552307           0.9559547

You look at the p-value in the output. It is a large number. The chances of getting the type of data we have gotten when the null hypothesis is true are 0.96. Recall what the null hypothesis is here. One cannot reject the null hypothesis. The logistic regression model adequately summarizes the data.

Let us ask for the output of the 'lrm' command.

> MB2

Logistic Regression Model

lrm(formula = Response ~ Age + Gender, data = MB, x = TRUE, y = TRUE)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.      P      C    Dxy
 40      7e-08       16.54     2  3e-04  0.849  0.698
Gamma  Tau-a     R2  Brier
0.703  0.358  0.451  0.162

           Coef     S.E.     Wald Z  P
Intercept  -9.8429  3.67577  -2.68   0.0074
Age         0.1581  0.06164   2.56   0.0103
Gender      3.4898  1.19917   2.91   0.0036

β̂0 = -9.8429; β̂1 = 0.1581; β̂2 = 3.4898

Age is significant. Gender is very significant.

Why does one build a model? If the model is an adequate summary of the data, the data can be thrown away. The model can be used to answer any questions that may be raised about the experiment and the outcomes. A model lets us assess whether or not a particular covariate is a significant risk factor, and lets us evaluate the extent of its influence on the outcome variable.

Module 2: Null Hypotheses + p-values + Standard Errors + Critical Values + LOGISTIC REGRESSION + INTERACTIONS + GRAPHS

Null hypotheses and their ilk

Let us go back to the abdominal sepsis problem. The response variable (Dead or Alive after surgery, Y) is binary. There are five covariates (X1 = Shock; X2 = Undernourishment; X3 = Alcoholism; X4 = Age; X5 = Infarction). We postulated the following regression model.

π(X1, X2, X3, X4, X5) = Probability of Death After Surgery = Pr(Y = 1)
= exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5) / [1 + exp(β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5)].
This is a population model. The population consists of all those who have had abdominal sepsis and for whom surgery is contemplated. We believe that the response probability has the pattern spelled out above. This belief can be tested.

We want to test the impact of Shock on the Response. The null hypothesis is one of skepticism: the patient being in shock has no bearing on the outcome of the surgery. The null hypothesis is H0: β1 = 0. We have to have an alternative: H1: β1 ≠ 0. The null hypothesis can be interpreted as saying that Shock has no impact on the Response. Another interpretation is that 'Shock' has no significance. Yet another interpretation is that 'Shock' is not a risk factor. Yet another interpretation is that there is no association between Shock and the outcome of surgery. The alternative is interpreted as saying that 'Shock' has an impact on the Response; equivalently, Shock and Outcome of Surgery are associated.

In order to test the hypotheses we need data. We want to check whether the data are consistent with the null hypothesis. Using the data, we estimate the unknown parameter β1. If the null hypothesis is true, we would expect the estimate β̂1 to be close to the null value β1 = 0. Any large value of the estimate β̂1 would make us doubt the validity of the null hypothesis. In practice, we got β̂1 = 3.674. Is this large enough to cast doubt on the validity of H0? If β1 = 0, how plausible is it to get an estimate of β1 as high as 3.674? Mathematical statisticians are able to calculate the probability of getting an estimate at least as large as 3.674 in absolute value if the null hypothesis is true. Formally, the p-value is defined by

p = Pr(|β̂1| ≥ 3.674 | H0 is true) = 0.0016.

If the null hypothesis is true, the chances of observing a value at least as large as 3.674 for the estimate of β1 in a sample are very, very small. Under the null hypothesis, I would not expect to see such an estimate. However, we did indeed get it. How did one calculate the probability? The probability was calculated under the assumption that the null hypothesis is true. Maybe the assumption is not valid. Reject the null hypothesis!

In short, if the p-value is small, reject the null hypothesis. Typically, the p-value is compared with the industry standard 0.05. Any event with probability of occurrence 0.05 or less is unlikely to occur. When is a p-value deemed small?

1. One-in-twenty principle (5%): If an event has probability of occurrence ≤ 5%, the event is not expected to occur. (Level of significance is 5%.)
2. One-in-hundred principle (1%): If an event has probability of occurrence ≤ 1%, the event is not expected to occur. (Level of significance is 1%.)
3. One-in-ten principle (10%): If an event has probability of occurrence ≤ 10%, the event is not expected to occur. (Level of significance is 10%.)
4. Non-judgmental: Just report the p-value. Let the reader make up his/her mind.

Some misconceptions!
1. Can I say that the chances that the null hypothesis is true are 0.0016? No. Remember that the p-value is a conditional probability.
2. Is the null hypothesis true? I don't know.
3. Is the null hypothesis false? I don't know.
4. Are the data consistent with the null hypothesis? No, in this example.
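The p-value quoted above can be reproduced in R from the estimate and its standard error; the theory behind this calculation is spelled out next, but the calculation itself is short.

est <- 3.674                         # estimate of beta1
se  <- 1.162                         # its standard error
z   <- est/se                        # about 3.16
2*pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value, about 0.0016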
Some theory behind the calculation of the p-value

We have a model. The model is believed to be true. It has some parameters; one of them is β1. We take a random sample of subjects and collect data on the variables. Using the data, we estimate β1. Let us denote the estimate by β̂1. The value of the estimate would vary from sample to sample. If the null hypothesis is true, it has been shown that

β̂1/SE ~ N(0, 1).

Here SE is the standard error of the estimate. Using the standard normal distribution, we need to calculate

Pr(|β̂1/SE| ≥ 3.674/1.162) = Pr(|β̂1/SE| ≥ 3.1618) = 2*pnorm(3.1618, lower.tail = F) = 0.0016.

I used R to calculate the p-value using the 'pnorm' command.

What is a standard error? A medical doctor collected data on 106 patients. Using the data, we estimated β1. The estimate is 3.674. If another researcher collects data on the same theme, the estimate may not come out to be 3.674. There is bound to be variation from one estimate to another. Mathematical statisticians are able to estimate this variation, as measured by the standard deviation of the estimate. This is the standard error of the estimate. (It is closely related to the margin of error, which is a multiple of the standard error.) In this example, the standard error is 1.162.

One can use the standard error to provide a 95% confidence interval for the unknown parameter β1 of the population model. It is 3.674 ± 1.96*1.162. This interval misses β1 = 0. Check! From this, one can conclude that the null hypothesis can be rejected.

Logistic Regression and interactions

Let us go back to the example presented in the last lecture. A treatment is being tested on patients suffering from a medical condition, to see whether or not they get relief.

Response variable: For any randomly chosen patient on treatment, let
Y = 1 if the patient gets relief
  = 0 if not.

There are two prognostic variables: Age and Gender.

Main effects model:

Pr(Y = 1) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]
and
Pr(Y = 0) = 1 / [1 + exp(β0 + β1*Age + β2*Gender)]

for some unknown parameters β0, β1, and β2.

Interaction model:

Pr(Y = 1) = exp(β0 + β1*Age + β2*Gender + β3*Age*Gender) / [1 + exp(β0 + β1*Age + β2*Gender + β3*Age*Gender)]
and
Pr(Y = 0) = 1 / [1 + exp(β0 + β1*Age + β2*Gender + β3*Age*Gender)]

for some unknown parameters β0, β1, β2, and β3.

Typically, one should entertain an interaction model before gravitating towards the main effects model. What does an interaction model mean? In a multiple regression set-up, an interaction model is easy to explain. In the context of a logistic regression model, one can provide a good explanation in terms of the log-odds model:

ln(Pr(Y = 1)/Pr(Y = 0)) = β0 + β1*Age + β2*Gender + β3*Age*Gender

The log-odds is a linear function of the covariates!

Logistic Regression Model for Males (Gender = 1):

ln(Pr(Y = 1)/Pr(Y = 0)) = (β0 + β2) + (β1 + β3)*Age

The log-odds is a linear function of age with intercept β0 + β2 and slope β1 + β3.

Logistic Regression Model for Females (Gender = 0):

ln(Pr(Y = 1)/Pr(Y = 0)) = β0 + β1*Age

The log-odds is a linear function of age with intercept β0 and slope β1.

If interaction is present, i.e., β3 ≠ 0, the slopes are different. It is of paramount importance to test the significance of the interaction to begin with, i.e., to test the null hypothesis H0: β3 = 0. This is what we will do. Load R with data.
> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65, 68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55, 60, 61, 61)
> length(Age)
[1] 40
> Gender <- rep(c("female", "male"), c(20, 20))
> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))
> length(Response)
[1] 40
> MB <- data.frame(Age, Gender, Response)
> MB
   Age Gender Response
1   37 female        0
2   39 female        0
3   39 female        0
4   42 female        0
5   47 female        0
6   48 female        0
7   48 female        1
8   52 female        0
9   53 female        0
10  55 female        0
11  56 female        0
12  57 female        0
13  58 female        0
14  58 female        1
15  60 female        0
16  64 female        0
17  65 female        1
18  68 female        1
19  68 female        1
20  70 female        1
21  34   male        1
22  38   male        1
23  40   male        0
24  40   male        0
25  41   male        0
26  43   male        1
27  43   male        1
28  43   male        1
29  44   male        0
30  46   male        0
31  47   male        1
32  48   male        1
33  48   male        1
34  50   male        0
35  50   male        1
36  52   male        1
37  55   male        1
38  60   male        1
39  61   male        1
40  61   male        1

> library(rms)
Attaching package: 'rms'

Fit the main effects model.

> MB1 <- lrm(Response ~ Age + Gender, data = MB, x = T, y = T)
> MB1

Logistic Regression Model

lrm(formula = Response ~ Age + Gender, data = MB, x = T, y = T)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.      P      C    Dxy
 40      7e-08       16.54     2  3e-04  0.849  0.698
Gamma  Tau-a     R2  Brier
0.703  0.358  0.451  0.162

             Coef     S.E.     Wald Z  P
Intercept    -9.8429  3.67577  -2.68   0.0074
Age           0.1581  0.06164   2.56   0.0103
Gender=male   3.4898  1.19917   2.91   0.0036

Let us do a goodness-of-fit test.

> residuals.lrm(MB1, type = 'gof')

Sum of squared errors   Expected value|H0          SD
            6.4736338           6.4612280   0.2246174
                    Z                   P
            0.0552307           0.9559547

The fit is excellent. Let us fit the interaction model.

> MB2 <- lrm(Response ~ Age + Gender + Age*Gender, data = MB, x = T, y = T)
> MB2

Logistic Regression Model

lrm(formula = Response ~ Age + Gender + Age * Gender, data = MB, x = T, y = T)

Frequencies of Responses
 0  1
20 20

Obs  Max Deriv  Model L.R.  d.f.      P      C    Dxy
 40      2e-05       16.97     3  7e-04  0.864  0.728
Gamma  Tau-a     R2  Brier
0.733  0.373  0.461  0.158

                   Coef      S.E.    Wald Z  P
Intercept          -12.1462  5.5816  -2.18   0.0295
Age                  0.1970  0.0935   2.11   0.0351
Gender=male          7.7047  6.7598   1.14   0.2544
Age * Gender=male   -0.0819  0.1259  -0.65   0.5153

The interaction is not significant. Let us do a goodness-of-fit test.

> residuals.lrm(MB2, type = 'gof')

Sum of squared errors   Expected value|H0          SD
            6.3100002           6.3857317   0.2161884
                    Z                   P
           -0.3503031           0.7261113

The fit is excellent. We had better stick to the main effects model. What is its interpretation? It is easy to explain in terms of log-odds.

Logistic Regression model for Males:
ln(Pr(Y = 1)/Pr(Y = 0)) = β0 + β2 + β1*Age

Logistic Regression model for Females:
ln(Pr(Y = 1)/Pr(Y = 0)) = β0 + β1*Age

The lines are parallel. The only difference is in the intercepts.

Let us do some plotting. Plot the logistic regression model for males and females separately.

> curve(exp(-9.8429 + 3.4898 + 0.1581*x)/(1 + exp(-9.8429 + 3.4898 + 0.1581*x)),
+ from = 30, to = 75, xlab = "Age", ylab = "Probability", col = "red", sub =
+ "Logistic Regression Model", main = "Probability of Relief From Treatment")
> curve(exp(-9.8429 + 0.1581*x)/(1 + exp(-9.8429 + 0.1581*x)), col = "blue",
+ add = T)
> text(40, 0.6, "Males", col = "red")
> text(60, 0.6, "Females", col = "blue")

The output is at the end.
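As a cross-check on the Wald test for the interaction, the two models can also be compared with a likelihood-ratio test; here is a minimal sketch using glm (the objects m1 and m2 are created just for this comparison). The drop in the model likelihood ratio, 16.97 - 16.54 = 0.43 on 1 d.f., is nowhere near significant, agreeing with the Wald test.

m1 <- glm(Response ~ Age + Gender, family = binomial, data = MB)
m2 <- glm(Response ~ Age * Gender, family = binomial, data = MB)
anova(m1, m2, test = "Chisq")  # likelihood-ratio test of H0: beta3 = 0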
What else can we do? Some prediction. Prediction is useful in pattern recognition problems. What does the output folder MB1 contain?

> names(MB1)
 [1] "freq"              "sumwty"            "stats"
 [4] "fail"              "coefficients"      "var"
 [7] "u"                 "deviance"          "est"
[10] "non.slopes"        "linear.predictors" "penalty.matrix"
[13] "info.matrix"       "weights"           "x"
[16] "y"                 "call"              "Design"
[19] "scale.pred"        "terms"             "assign"
[22] "na.action"         "fail"              "nstrata"
[25] "fitFunction"

> MB1$linear.predictors
          1           2           3           4           5           6
-3.99478499 -3.67866837 -3.67866837 -3.20449343 -2.41420186 -2.25614355
          7           8           9          10          11          12
-2.25614355 -1.62391030 -1.46585199 -1.14973536 -0.99167705 -0.83361874
         13          14          15          16          17          18
-0.67556042 -0.67556042 -0.35944380  0.27278945  0.43084777  0.90502270
         19          20          21          22          23          24
 0.90502270  1.22113933 -0.97913195 -0.34689870 -0.03078207 -0.03078207
         25          26          27          28          29          30
 0.12727624  0.44339287  0.44339287  0.44339287  0.60145118  0.91756780
         31          32          33          34          35          36
 1.07562612  1.23368443  1.23368443  1.54980106  1.54980106  1.86591768
         37          38          39          40
 2.34009262  3.13038418  3.28844250  3.28844250

Linear predictors are the numbers calculated from the linear form of the model for every individual in the sample. We can calculate the predicted probability of relief as per the model for everyone in the sample, using each subject's covariate values.

> Predict <- round(exp(MB1$linear.predictors)/(1 + exp(MB1$linear.predictors)), 3)
> Predict
    1     2     3     4     5     6     7     8     9    10    11    12    13
0.018 0.025 0.025 0.039 0.082 0.095 0.095 0.165 0.188 0.241 0.271 0.303 0.337
   14    15    16    17    18    19    20    21    22    23    24    25    26
0.337 0.411 0.568 0.606 0.712 0.712 0.772 0.273 0.414 0.492 0.492 0.532 0.609
   27    28    29    30    31    32    33    34    35    36    37    38    39
0.609 0.609 0.646 0.715 0.746 0.774 0.774 0.825 0.825 0.866 0.912 0.958 0.964
   40
0.964

Put everything together.

> MB3 <- data.frame(MB, Predict)
> MB3
   Age Gender Response Predict
1   37 female        0   0.018
2   39 female        0   0.025
3   39 female        0   0.025
4   42 female        0   0.039
5   47 female        0   0.082
6   48 female        0   0.095
7   48 female        1   0.095
8   52 female        0   0.165
9   53 female        0   0.188
10  55 female        0   0.241
11  56 female        0   0.271
12  57 female        0   0.303
13  58 female        0   0.337
14  58 female        1   0.337
15  60 female        0   0.411
16  64 female        0   0.568
17  65 female        1   0.606
18  68 female        1   0.712
19  68 female        1   0.712
20  70 female        1   0.772
21  34   male        1   0.273
22  38   male        1   0.414
23  40   male        0   0.492
24  40   male        0   0.492
25  41   male        0   0.532
26  43   male        1   0.609
27  43   male        1   0.609
28  43   male        1   0.609
29  44   male        0   0.646
30  46   male        0   0.715
31  47   male        1   0.746
32  48   male        1   0.774
33  48   male        1   0.774
34  50   male        0   0.825
35  50   male        1   0.825
36  52   male        1   0.866
37  55   male        1   0.912
38  60   male        1   0.958
39  61   male        1   0.964
40  61   male        1   0.964

Are there any graphics available to check on interaction?

[Figure: "Probability of Relief From Treatment" — the two fitted logistic curves of Probability versus Age (30 to 75), Males in red and Females in blue, under the caption "Logistic Regression Model."]

Module 3: Odds, Odds ratio, and their ilk

Odds and odds ratio

A Bernoulli trial is a random experiment which, when performed, results in one and only one of two possible outcomes. For example, tossing a coin is a Bernoulli trial: there are only two possible outcomes, Heads or Tails. A medical researcher devised a new medicine for curing a specific malady. If the medicine is administered to a patient suffering from the malady, only one of two possible outcomes results: the patient is cured or the patient is not cured. It is customary to denote the outcomes as Success and Failure.

Consider a Bernoulli trial with probability of Success p. Consequently, the probability of Failure is 1 - p. We define

Odds of Success versus Failure = Odds of Success = p/(1 - p).

The odds of Success provide a way to assess how likely Success is to occur in comparison with Failure.
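The two directions of the probability-odds conversion are worth keeping as one-liners; a small sketch (the function names are mine):

odds_from_p <- function(p) p/(1 - p)  # probability -> odds
p_from_odds <- function(o) o/(1 + o)  # odds -> probability
odds_from_p(2/3)  # 2
p_from_odds(4.1)  # 0.8039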
In the following table, for a given probability of Success, we calculate the odds of Success versus Failure.

p:     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Odds:  1/9   1/4   3/7   2/3   1     1.5   7/3   4     9

Suppose the odds of Success are 1. It means that Success and Failure are equally likely. Suppose the odds of Success are 2. This means that Success is twice as likely to occur as Failure. Solve the equation p/(1 - p) = 2 for p. Solution: p = 2/3 and 1 - p = 1/3. Suppose the odds of Success are 8. This means that Success is eight times as likely to occur as Failure. Suppose the odds of Success are 1/4. This means that Failure is four times as likely to occur as Success.

In medical research, scientists usually talk about the odds of a treatment being successful. If we know the probability of Success, we can work out the odds of Success. Conversely, if we know the odds of Success (versus Failure), we can work out the probability of Success. Suppose the odds of Success are 4.1. Set p/(1 - p) = 4.1 and solve for p. As a matter of fact, p = 4.1/(1 + 4.1) = 0.8039.

Now we come to the concept of the Odds Ratio. As the name indicates, it is indeed the ratio of two sets of odds. Another name for the Odds Ratio is Cross Ratio. The odds ratio can be defined for independent Bernoulli experiments. In prospective studies, we generally compare the performance of two treatments.

Prospective Studies

Suppose we have two Bernoulli trials. In one Bernoulli trial the probability of Success is p1 and in the other it is p2. In Bernoulli Trial 1, the odds of Success versus Failure are p1/(1 - p1). In Bernoulli Trial 2, the odds of Success versus Failure are p2/(1 - p2). The Odds Ratio is defined by

OR = [p1/(1 - p1)] / [p2/(1 - p2)].

Interpretation. Suppose OR = 2. This means the odds of Success versus Failure in Trial 1 are two times the odds of Success versus Failure in Trial 2. It also implies that the probability of Success in Trial 1 is greater than the probability of Success in Trial 2. How much larger? It depends on what the probability of Success is in Trial 2.

1. Suppose the odds of Success in Trial 2 are one. This means that in Trial 2, Success and Failure occur with equal probability 1/2. The odds of Success in Trial 1 are then two. This implies that the probability of Success in Trial 1 is 2/3 and the probability of Failure is 1/3. Successes are two times as likely as Failures.
2. Suppose the odds of Success in Trial 2 are two. In Trial 2, the probability of Success is 2/3; Successes are two times as likely as Failures. In Trial 1, the odds of Success are 4. The probability of Success in Trial 1, therefore, is 4/5 and the probability of Failure is 1/5. Thus in Trial 1, Successes are four times as likely as Failures.

Estimation of the Odds Ratio

The population odds ratio OR is unknown. We collect data in order to estimate and build confidence intervals for OR. Typically, the data are collected by adopting a prospective design. In medical research, Trial 1 corresponds to an experimental drug and Trial 2 corresponds to a standard drug (control). The drugs are designed to cure a specific malady.

Sampling. Select m patients randomly and put them all on the experimental drug. Observe them for a certain length of time. Let m1 be the number of patients for whom the drug is successful, and m2 the number of patients for whom the drug is not successful. Select n patients randomly and put them all on the standard drug. Observe them for the same length of time. Let n1 be the number of patients for whom the drug is successful,
and n2 the number of patients for whom the drug is not successful. The sampling protocol described above is called a prospective design. The data can be put in the form of a 2x2 contingency table.

                 Successful
Drug             Yes    No     Sample size
Experimental     m1     m2     m
Standard         n1     n2     n

The null hypothesis is that the population OR = 1 (hypothesis of skepticism). The alternative hypothesis is that OR ≠ 1.

H0: OR = 1
H1: OR ≠ 1

The null hypothesis is equivalent to the statement that the experimental and standard drugs are equally effective. Let p1 be the probability of success on the experimental drug and p2 the probability of success on the standard drug. The population odds ratio is defined by

OR = [p1/(1 - p1)] / [p2/(1 - p2)] = p1(1 - p2) / [(1 - p1)p2].

Claim: OR = 1 if and only if p1 = p2. Prove this. It is easier to handle OR than to work with p1 and p2 directly.

Estimate of OR: ÔR = m1n2/(m2n1). (Derive this estimate.)

Standard error of ln(ÔR): SE(ln(ÔR)) = sqrt(1/m1 + 1/m2 + 1/n1 + 1/n2)

Test statistic: Z = [ln(ÔR) - ln(1)] / SE(ln(ÔR)) = ln(ÔR) / SE(ln(ÔR))

Test: Reject the null hypothesis at the 5% level of significance if |Z| > 1.96.

Note: Testing OR = 1 is equivalent to testing p1 = p2. For testing p1 = p2, one could use the two-sample proportion test if the alternative is directional, or a chi-squared test if the alternative is non-directional. The test based on OR has better sampling properties than the one based on the proportions.

Retrospective Studies

The odds ratio can also be defined in the context of a retrospective study. Let us consider the problem of examining the association between the smoking status of mothers and perinatal mortality. We select at random a maternity record of a mother from a group of hospitals. We observe two categorical variables: smoking status of the mother and perinatal mortality of the baby. It is natural to expect that these response variables are correlated. Their joint distribution can be summarized in the following table.

                        Perinatal Mortality (Y)
Mother Smoked (X)       Yes (0)   No (1)   Marginals
Yes (0)                 a         b        a+b
No (1)                  c         d        c+d
Marginals               a+c       b+d      1

Odds of Death versus Life if the mother smoked during pregnancy
= Pr(Baby Died | Mother Smoked) / Pr(Baby Alive | Mother Smoked)
= [a/(a+b)] / [b/(a+b)] = a/b.

Odds of Death versus Life if the mother did not smoke during pregnancy
= Pr(Baby Died | Mother did not smoke) / Pr(Baby Alive | Mother did not smoke)
= [c/(c+d)] / [d/(c+d)] = c/d.

The odds ratio is the ratio of these two sets of odds:

OR = ad/(bc).

Equivalently, Odds of Death versus Life if the mother smoked = OR * (Odds of Death versus Life if the mother did not smoke). OR = 1 means it does not matter whether the mother smokes or not: the smoking status of the mother has no impact on mortality. This also means that Smoking Status of Mother and Perinatal Mortality are statistically independent. OR > 1 implies that the death probability if the mother smokes is greater than the death probability if she does not.

So far, what we have discussed is the population's odds ratio. We do not know the population odds ratio. We need to estimate it and test hypotheses about it.

Inference for the Odds Ratio

The data come in the form of a 2x2 contingency table.

        Y
X       0     1
0       n00   n01
1       n10   n11

A point estimate of OR: ÔR = n00*n11/(n10*n01).

In order to build a confidence interval for OR, we need the large-sample standard error of the estimate. It is easy to obtain a large-sample standard error of ln(ÔR) = ln(n00) + ln(n11) - ln(n01) - ln(n10) using asymptotic theory. As a matter of fact, the estimated standard error is given by

SE(ln(ÔR)) = sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11).
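A small helper putting these formulas into R (the function name or_inference is mine). It returns the estimate, the Z statistic for H0: OR = 1, and a large-sample confidence interval of the kind constructed next.

# Sketch: inference for the odds ratio from the counts of a 2x2 table
or_inference <- function(n00, n01, n10, n11, z = 1.96) {
  or <- (n00*n11)/(n10*n01)
  se <- sqrt(1/n00 + 1/n01 + 1/n10 + 1/n11)
  c(OR = or, Z = log(or)/se,
    lower = exp(log(or) - z*se), upper = exp(log(or) + z*se))
}
or_inference(619, 20443, 634, 26682)  # the perinatal mortality data below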
A large-sample 95% confidence interval for ln(OR) is given by

ln(ÔR) ± 1.96*SE(ln(ÔR)).

In order to get a large-sample 95% confidence interval for OR itself, take antilogarithms. It is given by

exp{ln(ÔR) - 1.96*SE(ln(ÔR))} ≤ OR ≤ exp{ln(ÔR) + 1.96*SE(ln(ÔR))}.

Whether the study is prospective or retrospective, the concept of the odds ratio is the same. The underlying distributions, however, are characteristically different.

Example. Back to the perinatal mortality problem and smoking mothers. A retrospective study of 48,378 mothers yielded the following data.

                 Perinatal Mortality
Mother Smoked    Yes     No       Marginals
Yes              619     20,443   21,062
No               634     26,682   27,316
Marginals        1,253   47,125   48,378

ÔR = (619*26682)/(634*20443) = 1.27
ln(ÔR) = 0.2390
SE[ln(ÔR)] = sqrt(1/619 + 1/20443 + 1/634 + 1/26682) = 0.057

A large-sample 99% confidence interval for ln(OR) is given by

ln(ÔR) ± 2.576*SE[ln(ÔR)]
0.2390 ± 0.1475
0.0915 ≤ ln(OR) ≤ 0.3865

A large-sample 99% confidence interval for OR is given by

exp{0.0915} ≤ OR ≤ exp{0.3865}
1.10 ≤ OR ≤ 1.47

Conclusion. The number 1 is not in the interval. The smoking status of a mother does indeed influence the perinatal mortality of the baby. The odds ratio is at least 1.10 and at most 1.47. The odds of death versus life if the mother smokes are at best 1.10 times, and at worst 1.47 times, the odds of death versus life if the mother does not smoke.

R can do all these calculations. Download the package 'vcd' and activate it. Enter the data into a matrix.

> MB <- matrix(c(619, 20443, 634, 26682), nrow = 2, byrow = T)
> MB
     [,1]  [,2]
[1,]  619 20443
[2,]  634 26682

Name the rows and columns.

> rownames(MB) <- c("Smoked", "No")
> colnames(MB) <- c("Died", "Alive")
> MB
       Died Alive
Smoked  619 20443
No      634 26682

The command for the odds ratio is 'oddsratio.'

> oddsratio(MB)
[1] 0.2424050

This is ln(oddsratio), i.e., the log of the odds ratio. We can get a confidence interval for ln(OR). The default level is 95%.

> confint(MB)
           lwr       upr
[1,] 0.1302128 0.3545972
> confint(MB, level = 0.95)
           lwr       upr
[1,] 0.1302128 0.3545972
> confint(MB, level = 0.90)
           lwr       upr
[1,] 0.1482503 0.3365596
> confint(MB, level = 0.99)
            lwr       upr
[1,] 0.09495945 0.3898505

How do we get confidence intervals for OR itself?

> oddsratio(MB, log = F)
[1] 1.274310
> MB3 <- oddsratio(MB, log = F)
> confint(MB3)
          lwr      upr
[1,] 1.139071 1.425606

Odds ratios in the context of Logistic Regression

In a multiple regression model, it is easy to examine the impact of a covariate on the response variable. Suppose we have one response variable y and three covariates X1, X2, and X3, and suppose we have the following estimated multiple regression equation:

ŷ = 3 + 2X1 + 3X2 - 4X3.

What is the impact of X1 on the response variable? Suppose we increase the value of X1 by one unit and keep the values of X2 and X3 the same. What happens to the value of y?

Scenario 1: X1 = 1, X2 = 3, X3 = 1; ŷ = 10
Scenario 2: X1 = 2, X2 = 3, X3 = 1; ŷ = 12

What is the difference between Scenarios 1 and 2? The value of X1 has gone up by one unit and the values of X2 and X3 have remained the same. The value of y has gone up by two units. The number 2 is precisely the coefficient of X1 in the estimated multiple regression equation. Thus the impact of X1 on the response variable is positive and is measured by the coefficient of X1 in the equation. What is the impact of X3 on the response variable y?
If the value of X3 goes up by one unit and the values of X1 and X2 remain the same, then the value of y goes down by four units.

Scenario 1: X1 = 1, X2 = 3, X3 = 1; ŷ = 10
Scenario 2: X1 = 1, X2 = 3, X3 = 2; ŷ = 6

What is the difference between Scenarios 1 and 2? We would like to initiate a similar study in the environment of logistic regression models. Suppose we have two covariates X1 and X2 in a logistic regression model given by

Pr(Y = 1 | X1, X2) = exp{β0 + β1X1 + β2X2} / [1 + exp{β0 + β1X1 + β2X2}]
and
Pr(Y = 0 | X1, X2) = 1 / [1 + exp{β0 + β1X1 + β2X2}].

What is the impact of X1 on the response variable? A multiple-regression type of interpretation is not possible here. We work with odds ratios instead. Let Y = 1 stand for Success and Y = 0 for Failure.

Odds of Success versus Failure = Pr(Y = 1 | X1, X2) / Pr(Y = 0 | X1, X2) = exp(β0 + β1X1 + β2X2).

Look at the following scenario: X1 = 1 and X2 = 2.

Odds of Success versus Failure = exp{β0 + β1 + 2β2} (Check this.)

Let us increase the value of X1 by one unit and keep the value of X2 the same, i.e., X1 = 2 and X2 = 2.

Odds of Success versus Failure = exp{β0 + 2β1 + 2β2}

Odds Ratio = [Odds of Success versus Failure when X1 = 2 and X2 = 2] / [Odds of Success versus Failure when X1 = 1 and X2 = 2] = exp{β1}.

The numbers given to X1 and X2 are not special. Give any values to X1 and X2 such that the value of X1 goes up by one unit and the value of X2 remains the same: the odds ratio will remain the same. Equivalently,

[Odds of Success versus Failure when X1 = 2 and X2 = 2] = (Odds Ratio) * [Odds of Success versus Failure when X1 = 1 and X2 = 2]

Value of β1   Odds Ratio   Impact of X1 on Y
0             1            No impact
> 0           > 1          If X1 goes up, so do the odds.
< 0           < 1          If X1 goes up, the odds go down.

Example. Framingham Study: Homework

Module 4: Odds ratio from the logistic regression model vis-à-vis odds ratio from the contingency table + Biplots + How to download EXCEL data onto R

Odds ratio

Let us look at the data on Response to a particular treatment with prognostic variables Age and Gender.

> Age <- c(37, 39, 39, 42, 47, 48, 48, 52, 53, 55, 56, 57, 58, 58, 60, 64, 65, 68, 68, 70, 34, 38, 40, 40, 41, 43, 43, 43, 44, 46, 47, 48, 48, 50, 50, 52, 55, 60, 61, 61)
> Gender <- factor(rep(c("F", "M"), c(20, 20)))
> Response <- factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1))
> MB <- data.frame(Age, Gender, Response)
> MB
   Age Gender Response
1   37      F        0
2   39      F        0
3   39      F        0
4   42      F        0
5   47      F        0
6   48      F        0
7   48      F        1
8   52      F        0
9   53      F        0
10  55      F        0
11  56      F        0
12  57      F        0
13  58      F        0
14  58      F        1
15  60      F        0
16  64      F        0
17  65      F        1
18  68      F        1
19  68      F        1
20  70      F        1
21  34      M        1
22  38      M        1
23  40      M        0
24  40      M        0
25  41      M        0
26  43      M        1
27  43      M        1
28  43      M        1
29  44      M        0
30  46      M        0
31  47      M        1
32  48      M        1
33  48      M        1
34  50      M        0
35  50      M        1
36  52      M        1
37  55      M        1
38  60      M        1
39  61      M        1
40  61      M        1

Response and Gender are binary variables. We can cross-tabulate these variables to get a 2x2 contingency table, calculate the odds ratio for the table to measure the degree of association between Response and Gender, and build a 95% confidence interval for the population odds ratio. Let us activate the package 'vcd.'

> MB1 <- table(Response, Gender)
> MB1
        Gender
Response  F  M
       0 14  6
       1  6 14

> MB2 <- oddsratio(MB1, log = F)
> MB2
[1] 5.444444

Interpretation: Odds of Cure versus No Cure if the patient is male = 5.44 * Odds of Cure versus No Cure if the patient is female. The odds of Cure are much better for males than for females.
Suppose the odds of Cure versus No Cure for females are 1/5, i.e., No Cure is 5 times more likely than Cure. Then the odds of Cure versus No Cure for males are 5.44*(1/5) = 1.088, better than evens. More precisely, Pr(Cure | Male) = 1.088/(1 + 1.088) = 0.52. In this analysis, age is not factored in.

> confint(MB2)
          lwr      upr
[1,] 1.471410 20.14528

More precisely, a 95% confidence interval for the population odds ratio is given by 1.47 ≤ OR ≤ 20.15.

Why is this confidence interval so wide? The sample is small. Recall that the standard error of ln(ÔR) is sqrt(1/14 + 1/6 + 1/14 + 1/6). The length of the confidence interval depends on the standard error: the smaller the standard error, the shorter the confidence interval. The numbers in the four cells of the contingency table are small; the bigger these numbers, the smaller the standard error. Recall how the confidence interval is built.

We want to test the null hypothesis that the population odds ratio is equal to one, i.e., that there is no association between Response and Gender.

H0: OR = 1
H1: OR ≠ 1

The observed 95% confidence interval is 1.47 ≤ OR ≤ 20.15. The interval does not contain OR = 1. We reject the null hypothesis at the 5% level of significance.

We can also calculate the p-value. Under the null hypothesis, ln(OR) = 0 and, theoretically,

Z = [ln(ÔR) - 0] / SE(ln(ÔR))

has a standard normal distribution. The observed value of the z-statistic can be computed using R.

> SE <- sqrt(1/14 + 1/14 + 1/6 + 1/6)
> SE
[1] 0.6900656
> Z <- log(MB2)/SE
> Z
[1] 2.455703
> pvalue <- 2*pnorm(2.455703, lower.tail = F)
> pvalue
[1] 0.01406093

Based on this p-value, we can reject H0: ln(OR) = 0, or equivalently H0: OR = 1.

We can fit a logistic regression model to the data:

Pr(Response = 1 | Age, Gender) = exp(β0 + β1*Age + β2*Gender) / [1 + exp(β0 + β1*Age + β2*Gender)]

Let us fit this model.

> MB3 <- glm(Response ~ Age + Gender, data = MB, family = binomial)
> summary(MB3)

Call:
glm(formula = Response ~ Age + Gender, family = binomial, data = MB)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.86671  -0.80814   0.03983   0.78066   2.17061

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.84294    3.67576  -2.678  0.00741 **
Age          0.15806    0.06164   2.564  0.01034 *
GenderM      3.48983    1.19917   2.910  0.00361 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452 on 39 degrees of freedom
Residual deviance: 38.917 on 37 degrees of freedom
AIC: 44.917

Number of Fisher Scoring iterations: 5

> OddsRatioGender <- exp(3.4898)
> OddsRatioGender
[1] 32.77939

If Age is fixed, no matter what it is, the odds of Cure versus No Cure for a male = 32.78 * Odds of Cure versus No Cure for a female of the same age. Suppose the odds of Cure versus No Cure for females are 1/5, i.e., No Cure is 5 times more likely than Cure for females. Then the odds of Cure versus No Cure for males are 32.78*(1/5) = 6.556. This means Pr(Cure | Male) = 6.556/(1 + 6.556) = 0.87. This odds ratio takes age into account. One can say that this is the odds ratio of Response and Gender adjusted for Age. This is indeed a true summary of the relationship between Response to the treatment and Gender.

Another great advantage of the logistic regression model, if it is a good fit, is that we can measure the association between the response variable (binary) and a numeric covariate after adjusting for the presence of the other covariates. In our example, the continuous variable is Age.
The odds ratio of Response to Treatment and Age is exp(0.15806) = 1.17. If the gender is fixed,

Odds of Cure versus No Cure if (Age = x+1) = 1.17 * Odds of Cure versus No Cure if (Age = x),

where x is any number. This odds ratio is the odds ratio of Response to the Treatment and Age, adjusted for Gender. If the gender is fixed,

Odds of Cure versus No Cure if (Age = x+2) = (1.17)^2 * Odds of Cure versus No Cure if (Age = x).

Derive this result. There is no way we can measure the association between Response to Treatment (binary) and Age (continuous) using the contingency table approach.

Build a 95% confidence interval for the odds ratio of Gender. The coefficient of Gender in the logistic regression model is β2. The population odds ratio is exp(β2). A 95% confidence interval for β2 is β̂2 ± 1.96*SE:

3.49 ± 1.96*1.20
1.14 ≤ β2 ≤ 5.84

A 95% confidence interval for the odds ratio exp(β2) is obtained by exponentiating this interval:

3.13 ≤ OR ≤ 343.78

Why is this interval so wide?

Biplots

Goal: I have 4 quantitative variables: X1, X2, X3, and X4. Make a graphical presentation of the data on these four variables in a single frame.

Solution: Get a scatter plot of X1 and X2. Get the scatter plot of X3 and X4 on the same graph. How? Let us look at an example. The data set 'iris' is available in R. Data were collected on Petal Length, Petal Width, Sepal Length, and Sepal Width for three different species of iris flowers (setosa, versicolor, virginica). Download the data.

> data(iris)
> dim(iris)
[1] 150 5
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

How many flowers are in each species?

> table(iris$Species)
    setosa versicolor  virginica
        50         50         50

Focus on the setosa flowers only. The four measurements are Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. Get the scatter plot of Petal Length and Sepal Length. Superimpose on this graph the scatter plot of Petal Width and Sepal Width.

> setosa <- subset(iris, iris$Species == "setosa")

Using the 'par' command, create four lines of space at the bottom, four lines on the left, seven lines at the top, and seven lines on the right. I need the space at the top and on the right for the second set of axis labels. (mar = margins on the sides)

> par(mar = c(4, 4, 7, 7))
> plot(setosa$Petal.Length, setosa$Sepal.Length, pch = 16, col = "red",
+ xlab = "Petal Length", ylab = "Sepal Length")

I have been harping on the fact that a plot command will not accept another plot command in any superimposition. We can overcome that with par(new = T).

> par(new = T)
> plot(setosa$Petal.Width, setosa$Sepal.Width, pch = 17, col = "blue", ann = F,
+ axes = F)
> range(setosa$Sepal.Width)
[1] 2.3 4.4
> range(setosa$Petal.Width)
[1] 0.1 0.6
> axis(side = 3, at = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
> axis(side = 4, at = c(2, 3, 4, 5))
> mtext("Petal Width", side = 3, line = 2)
> mtext("Sepal Width", side = 4, line = 3)
> mtext("Setosa Flowers", side = 3, line = 5)

mtext = text on the margins. Here is the biplot.

[Figure: "Setosa Flowers" — Sepal Length versus Petal Length (bottom/left axes, red points) overlaid with Sepal Width versus Petal Width (top/right axes, blue points).]

This method of plotting can be used to plot X versus Y and X versus U, where X, Y, and U are three quantitative variables. In that case, side = 3 is not needed.

How to download EXCEL data into R for MAC users?
Courtesy: Gail Pyne-Geithman, Associate Professor, Neurosurgery

1. Save the data as a comma-separated .csv file.
2. Find the precise address of this file. If you can find it, that is good. If you cannot, R can find it for you. For example, suppose the data file is "sepsis.csv". In the R console, type

> rawdata <- file.choose()

and pick the file in the dialog that opens. Typing

> rawdata

will then give the address in double quotes.

3. Type

> Name <- read.csv("Address/sepsis.csv", header = T)

(or simply Name <- read.csv(rawdata, header = T)).

4. The folder Name contains the data.

Module 5: NON-PARAMETRIC REGRESSION; BINARY RESPONSE VARIABLE; CLASSIFICATION TREES

We have been working on how to model a binary response variable in terms of covariates or independent variables. Our approach was probabilistic in nature: we proposed a logistic regression model. However, there are a number of other approaches. One approach, popular with engineers and physicists, is to treat the problem as a pattern recognition or classification problem.

Let us go back to the abdominal sepsis problem.

Response variable
Y = 1 if the patient dies after surgery
  = 0 if the patient survives after surgery

Independent variables
X1: Is the patient in a state of shock?
X2: Is the patient suffering from malnutrition?
X3: Is the patient alcoholic?
X4: Age
X5: Does the patient have bowel infarction?

In logistic regression, the probability distribution of Y is modeled in terms of the covariates. If we view this problem as a pattern recognition problem, we need to identify what the patterns are. The situation Y = 1 is regarded as one pattern and Y = 0 as the other. Once we have information on the independent variables for a patient, we need to classify him/her into one of the two patterns. We have to come up with a protocol which will classify the patient as falling into one of the patterns. In other words, we have to say whether he will die or survive after surgery. We will not make a probability statement. No classification protocol one comes up with can be expected to be free of errors. A classification protocol is judged on its misclassification error rate. We will make this concept precise later.

Core idea: Look at the space of predictors. We want to break up the predictor space into boxes (5-dimensional parallelepipeds) so that each box is identified with one pattern. For example, Shock = 1, Malnourishment = 0, Alcoholism = 1, Age > 45, Infarction = 1 is one such box. Can we say that most of the patients who fall into this box die? We want to divide the predictor space into mutually exclusive and exhaustive boxes so that the patients falling into each box have predominantly one pattern. The creation of such boxes is the main objective of this lecture.

One popular method in classification or pattern recognition is the so-called 'classification tree methodology,' which is a data mining method. The methodology was first proposed by Breiman, Friedman, Olshen, and Stone in their monograph published in 1984. This goes by the acronym CART (Classification and Regression Trees). A commercial program called CART can be purchased from Salford Systems. Other standard statistical software packages such as S-PLUS, SPSS, and R also provide tree construction procedures with user-friendly graphical interfaces. In R, the packages 'rpart' and 'tree' build classification trees.

Some of the material I am presenting in this lecture is culled from the following two books.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone - Classification and Regression Trees, Wadsworth International Group, 1984.
Heping Zhang and Burton Singer - Recursive Partitioning in the Health Sciences, Second Edition, Springer, 2008.

Various computer programs related to this methodology can be downloaded freely from Heping Zhang's web site: http://peace.med.yale.edu/pub

Let me illustrate the basic ideas of tree construction in the context of a specific example of binary classification. In the construction of a tree, for evaluation purposes, we need the concept of the ENTROPY of a probability distribution and/or Gini's measure of uncertainty. Suppose we have a random variable X taking finitely many values with some probability distribution.

X:    1    2    ...  m
Pr.:  p1   p2   ...  pm

We want to measure the degree of uncertainty in the distribution (p1, p2, ..., pm). For example, suppose m = 2. Look at the distributions (1/2, 1/2) and (0.99, 0.01). There is more uncertainty in the first distribution than in the second. Suppose someone is about to crank out X. I am more comfortable betting on the outcome of X if the underlying distribution is (0.99, 0.01) than when it is (1/2, 1/2). We want to assign a numerical quantity to measure the degree of uncertainty. The entropy of a distribution is introduced as a measure of uncertainty.

Entropy(p1, p2, ..., pm) = -(p1 ln p1 + p2 ln p2 + ... + pm ln pm) = entropy impurity = measure of chaos,

with the convention that 0 ln 0 = 0.

Properties
1. 0 ≤ Entropy ≤ ln m.
2. The minimum 0 is attained at each of the distributions (1, 0, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1). For each of these distributions there is no uncertainty; the entropy is zero.
3. The maximum ln m is attained at the distribution (1/m, 1/m, ..., 1/m). The uniform distribution is the most chaotic. Under the uniform distribution, uncertainty is maximum.

There are other measures of uncertainty available in the literature. Gini's measure of uncertainty for the distribution (p1, p2, ..., pm) is the sum of pi*pj over all pairs i ≠ j.

Properties
1. 0 ≤ Gini's measure ≤ (m-1)/m.
2. The minimum 0 is attained at each of the distributions (1, 0, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, 0, ..., 0, 1). For each of these distributions there is no uncertainty; Gini's measure is zero.
3. The maximum (m-1)/m is attained at the most chaotic distribution (1/m, 1/m, ..., 1/m). Under the uniform distribution, the uncertainty is maximum.

Another measure of uncertainty is defined by min{p1, p2, ..., pm}.
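A quick R sketch of the two main impurity measures just defined (the function names are mine):

entropy <- function(p) -sum(ifelse(p > 0, p*log(p), 0))  # convention: 0*ln 0 = 0
gini    <- function(p) 1 - sum(p^2)  # sum of pi*pj over i != j equals 1 - sum(pi^2)
entropy(c(0.5, 0.5))   # ln 2 = 0.69, the maximum for m = 2
gini(c(0.99, 0.01))    # close to 0: almost no uncertainty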
Various computer programs related to this methodology can be downloaded freely from Heping Zhang's web site: http://peace.med.yale.edu/pub

Let me illustrate the basic ideas of tree construction in the context of a specific example of binary classification. In the construction of a tree, for evaluation purposes, we need the concept of ENTROPY of a probability distribution and/or Gini's measure of uncertainty. Suppose we have a random variable X taking finitely many values with some probability distribution.

X:    1    2    …    m
Pr.:  p1   p2   …    pm

We want to measure the degree of uncertainty in the distribution (p1, p2, … , pm). For example, suppose m = 2. Look at the distributions (1/2, 1/2) and (0.99, 0.01). There is more uncertainty in the first distribution than in the second. Suppose someone is about to crank out X. I am more comfortable betting on the outcome of X if the underlying distribution is (0.99, 0.01) than when the distribution is (1/2, 1/2). We want to assign a numerical quantity to measure the degree of uncertainty. Entropy of a distribution is introduced as a measure of uncertainty.

Entropy(p1, p2, … , pm) = -Σ_{i=1}^{m} pi ln pi = Entropy impurity = Measure of chaos,

with the convention that 0 ln 0 = 0.

Properties
1. 0 ≤ Entropy ≤ ln m.
2. The minimum 0 is attained for each of the distributions (1, 0, 0, … , 0), (0, 1, 0, … , 0), … , (0, 0, … , 0, 1). For each of these distributions, there is no uncertainty. The entropy is zero.
3. The maximum ln m is attained at the distribution (1/m, 1/m, … , 1/m). The uniform distribution is the most chaotic. Under this uniform distribution, uncertainty is maximum.

There are other measures of uncertainty available in the literature.

Gini's measure of uncertainty for the distribution (p1, p2, … , pm) = Σ_{i ≠ j} pi pj.

Properties
1. 0 ≤ Gini's measure ≤ (m-1)/m.
2. The minimum 0 is attained for each of the distributions (1, 0, 0, … , 0), (0, 1, 0, … , 0), … , (0, 0, … , 0, 1). For each of these distributions, there is no uncertainty. The Gini's measure is zero.
3. The maximum (m-1)/m is attained at the most chaotic distribution (1/m, 1/m, … , 1/m). Under this uniform distribution, the uncertainty is maximum.

Another measure of uncertainty is defined by min {p1, p2, … , pm}.

Basic ideas in the development of a classification tree

Let me work with an artificial example.

ID   Y   X1   X2
 1   0    1    2
 2   1    6    5
 3   1    5    7
 4   0   10    9
 5   0    5    5
 6   1    4    8
 7   1   10    2
 8   0    4    3
 9   1    8    4
10   0    9    7
11   1    3    9
12   0    8    8
13   1    9    2
14   0    3    1
15   0    7    7
16   1    2   10
17   0    6   10
18   1    7    5
19   1    1    6
20   0    2    4

Goal: I know someone with given X1 and X2 values. I need to classify him as having the pattern Y = 0 or Y = 1. We have the training data given above to develop a classification protocol. (I could have done a logistic regression here.) Another viewpoint: What ranges of X1 and X2 values identify the pattern {Y = 0}, and what ranges the pattern {Y = 1}?

Step 1: Put all the subjects into the root node. There are ten subjects with the pattern Y = 0 and ten with Y = 1. How impure is the root (mother) node? Calculate the entropy of the distribution:

{Y = 0}  {Y = 1}
  0.5      0.5

Impurity of the mother = - 0.5 ln(0.5) - 0.5 ln(0.5) = ln 2 = 0.69.

Step 2: Let us split the mother node into two daughter nodes. We need to choose one of the covariates. Let us choose X1. We need to choose one of the numbers taken by X1. The possible values of X1 are 1, 2, … , 10. Let us choose 5. All those subjects with X1 ≤ 5 go into the left daughter node. All those subjects with X1 > 5 go into the right daughter node.
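Before tallying the daughters, here is a small R sketch of the two impurity measures just defined. (The function names entropy and gini are mine, not from any package; the outputs are exact arithmetic.)

> entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }   # convention: 0 ln 0 = 0
> gini <- function(p) 1 - sum(p^2)    # equals the double sum of pi*pj over i != j
> entropy(c(0.5, 0.5))     # ln 2, the maximum for m = 2
[1] 0.6931472
> entropy(c(0.99, 0.01))   # nearly pure, so close to 0
[1] 0.05600153
> gini(c(0.5, 0.5))        # (m - 1)/m = 0.5, the maximum for m = 2
[1] 0.5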
Members of the left daughter node: ID 1, 3, 5, 6, 8, 11, 14, 16, 19, 20. Five of these subjects have the pattern {Y = 0} and the rest {Y = 1}.

Impurity of this daughter = - 0.5 ln(0.5) - 0.5 ln(0.5) = 0.69.

Members of the right daughter node: ID 2, 4, 7, 9, 10, 12, 13, 15, 17, 18. Five of these subjects have the pattern {Y = 0} and the rest {Y = 1}.

Impurity of this daughter = - 0.5 ln(0.5) - 0.5 ln(0.5) = 0.69.

This is a disappointment. We expected the daughters to be less chaotic. Maybe the choice of the cut-point X1 = 5 is not helpful. We need to compare the mother node with the daughter nodes. We need to calculate the impurity of the daughters combined. The right and left daughters have the same number of subjects. The weights of these nodes are 50:50, or 0.5:0.5. The weights come from the proportions of subjects from the root node that land in the daughter nodes.

Overall impurity of the daughters = weighted sum of individual impurities = 0.5*0.69 + 0.5*0.69 = 0.69

Our goal is to seek daughters purer than their mothers.

Improvement in purity achieved by the split = Goodness of the split = Impurity of the mother - Overall impurity of the daughters = 0.69 - 0.69 = 0.

There is no improvement by splitting the mother node this way. We could have chosen another number such as 4 instead of 5 for X1. Our goal is to maximize the goodness of the split. Let us persist with this split.

Step 3. Let us split the left daughter node. Choose one of the covariates. Let us now choose X2. Let us choose one of the numbers taken by X2. Let us choose 5. Shepherd all those subjects with X2 ≤ 5 into the left granddaughter node and those with X2 > 5 into the right granddaughter node.

Composition of the subjects in the left granddaughter node: ID 1, 5, 8, 14, 20. All these subjects have the pattern {Y = 0}. Its impurity is zero. This granddaughter is the purest. Further splitting is useless. This is a terminal node. Declare this node as the {Y = 0} node.

Composition of the subjects in the right granddaughter node: ID 3, 6, 11, 16, 19. All these subjects have the pattern {Y = 1}. Its impurity is zero. This granddaughter is the purest. No further split is possible. This is a terminal node. Declare this node as the {Y = 1} node.

Step 4. Let us split the right daughter node. Choose one of the covariates. Let us choose X2. Let us choose one of the numbers taken by X2. Let us choose 5. Shepherd all those subjects with X2 ≤ 5 into the left granddaughter node and those with X2 > 5 into the right granddaughter node.

Composition of the subjects in the left granddaughter node: ID 2, 7, 9, 13, 18. All these subjects have the pattern {Y = 1}. Its impurity is zero. Why? This granddaughter is the purest. Further splitting is pointless. This is a terminal node. Declare this node as the {Y = 1} node.

Composition of the subjects in the right granddaughter node: ID 4, 10, 12, 15, 17. All these subjects have the pattern {Y = 0}. Its impurity is zero. This granddaughter is the purest. Further splitting is not worthwhile. This is a terminal node. Declare this node as the {Y = 0} node.

The task of building a tree is complete. Look at the tree that results. Let us now calculate the misclassification error rate. Pour all the subjects into the mother node. We know the pattern each subject has. Check which terminal node each falls into. Check whether its true pattern matches the pattern of the terminal node. The percentage of mismatches is the misclassification rate.

Misclassification rate = 0%.

The impurity bookkeeping in these steps can be verified with the sketch below.
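Here is the check (a sketch; it assumes the entropy function defined above, and the impurity function name is mine):

> impurity <- function(y) entropy(table(y)/length(y))
> y  <- c(0,1,1,0,0,1,1,0,1,0,1,0,1,0,0,1,0,1,1,0)
> x1 <- c(1,6,5,10,5,4,10,4,8,9,3,8,9,3,7,2,6,7,1,2)
> x2 <- c(2,5,7,9,5,8,2,3,4,7,9,8,2,1,7,10,10,5,6,4)
> impurity(y)                     # mother node: ln 2
[1] 0.6931472
> impurity(y[x1 <= 5])            # left daughter: still 0.69, no improvement
[1] 0.6931472
> impurity(y[x1 <= 5 & x2 <= 5])  # left granddaughter: pure, so 0
[1] 0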
How does one use this classification protocol in practice? Take a subject whose pattern is unknown. We have its covariate values. Pour this subject into the mother node. See where he lands. Note the identity of the terminal node. That is the pattern he is classified into.

Let me give a realistic example. The illustrative example comes from the Yale Pregnancy Outcome Study, a project funded by the National Institutes of Health. The basic question they want to address is what factors influence pre-term deliveries. The study subjects were women who made a first prenatal visit to a private obstetrics or midwife practice, health maintenance organization, or hospital clinic in the Greater New Haven, Connecticut, area between March 12, 1980 and March 12, 1982. From these women, select those whose pregnancies ended in a singleton live birth.

The outcome or response variable: For a woman whose pregnancy ended in a singleton live birth, is this a preterm delivery or a normal-term delivery?

At the time of the prenatal visit, measurements on 15 variables (predictors or independent variables) were collected.

Variable name                 Label  Type        Range/levels
Maternal age                  X1     Continuous  13-46
Marital status                X2     Nominal     Currently married, divorced, separated, widowed, never married
Race                          X3     Nominal     White, Black, Hispanic, Asian, Other
Marijuana use                 X4     Nominal     Yes, No
Times of using marijuana      X5     Ordinal     ≥ 5, 3-4, 2, 1 (daily); 4-6, 1-3 (weekly); 2-3, 1, < 1 (monthly)
Years of education            X6     Continuous  4-27
Employment                    X7     Nominal     Yes, No
Smoker                        X8     Nominal     Yes, No
Cigarettes smoked             X9     Continuous  0-66
Passive smoking               X10    Nominal     Yes, No
Gravidity                     X11    Ordinal     1-10
Hormones/DES used by mother   X12    Nominal     None, hormones, DES, both, uncertain
Alcohol (oz/day)              X13    Ordinal     0-3
Caffeine (mg)                 X14    Continuous  12.6-1273
Parity                        X15    Ordinal     0-7

The training sample consisted of 3,861 pregnant women. The objective of the study is to predict whether or not the delivery will be preterm based on the measurements collected at the time of the prenatal visit. This is viewed as a binary classification problem. We could solve this problem by following the logistic regression methodology. How? The classification tree methodology was pursued for this problem. What are the key steps?

A tree will consist of a root node, internal (circle) nodes, and terminal (box) nodes. Identify each woman in the sample who had a preterm delivery with 0 and each who had a normal-term delivery with 1.

Step 1. Stuff the root node with all these women.

Step 2. We will create two daughter nodes (Left Daughter Node and Right Daughter Node) out of the root node. Every woman in the root node has to go either to the Left Daughter Node or the Right Daughter Node. In other words, we will split the women in the root node into two groups. The splitting is done using one of the predictors. Suppose we start with X1. In the sample, we have women representing every age from 13 to 43. We may decide to split the root node according to the following criterion. Put a woman in the Left Daughter Node if her age X1 ≤ 13 years. The number 13 is the cut-point chosen. Otherwise, put the woman in the Right Daughter Node. According to this criterion, some women in the root node go into the Left Daughter Node and the rest go into the Right Daughter Node. We could split the root node using a different criterion. For example, put a woman in the Left Daughter Node if her age X1 ≤ 35 years. Otherwise, put the woman in the Right Daughter Node. Here the chosen cut-point is 35.

Important idea: In order to split the root node, we need to select one of the covariates and a cut-point.
There are 31 different ways of splitting the root node on age! We want to choose a good split. The objective is to channel as many women with label 1 (normal delivery) as possible into one node and as many women with label 0 (pre-term delivery) as possible into the other node. Let us assess the situation when the split was done based on the age 35 years. The composition of the daughter nodes can be summarized by the following 2x2 contingency table.

          Left Node   Right Node   Total
Term        3521         135        3656
Preterm      198           7         205
Total       3719         142        3861

The left node has a proportion of 3521/3719 1's and 198/3719 0's. The entropy impurity of the distribution (3521/3719, 198/3719) can be calculated:

- (3521/3719) ln (3521/3719) - (198/3719) ln (198/3719) = 0.2079

The impurity 0 is the ideal value we are seeking. Similarly, the entropy impurity of the right node is 0.1964. The goodness of the split is defined by

Impurity of the mother node - P(Left node)*impurity of the left node - P(Right node)*impurity of the right node,

where P(Left node) and P(Right node) are the probabilities that a subject falls into the left node and right node, respectively. (You can inject some Bayesian philosophy into these probabilities.) For the time being, we can take these probabilities to be 3719/3861 and 142/3861, respectively. Therefore, the goodness of the split

= 0.20753 - (3719/3861)*0.2079 - (142/3861)*0.1964 = 0.00001.

More intuitively, we are computing

Impurity of the mother - Impurity of the daughters

to judge how good the chosen cut-point is. The larger the difference, the better the daughters! Talk about this more! If each daughter is pure, this is the best split. We are shooting for a high value of the goodness of split. Thus for every possible choice of age, we can measure the goodness of split. Choose that age for which the goodness of split is maximum.

The goodness of allowable Age splits

Split Value   Impurity (Left node)   Impurity (Right node)   1000*Goodness of the split
    13              0.00000                0.20757                    0.01
    14              0.00000                0.20793                    0.14
    15              0.31969                0.20615                    0.17
     .                 …                      …                        …
    24              0.25747                0.18195                    1.50
     .                 …                      …                        …
    43              0.20757                0.00000                    0.01

At the age of 24 years, we have the optimal split. Here, we started with age to split the root node. Why age? It could have been any other predictor.

Strategy. Find the best split and the corresponding impurity reduction for every predictor. Choose that predictor for which the impurity reduction is the largest. This is the variable we start with to split the root node.

After splitting the root node, look at the left and right daughter nodes. We now split the Left Daughter Node into two nodes: Left Grand Daughter Node and Right Grand Daughter Node. We choose one of the predictors, including the one already used for the split.
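As a numerical check, the goodness computation for the Age ≤ 35 split can be reproduced in R. A sketch (the binary entropy function imp is written out by hand; my own name for it):

> imp <- function(p) -p*log(p) - (1-p)*log(1-p)
> imp(205/3861)   # impurity of the mother node, about 0.20753
> imp(198/3719)   # left node, about 0.2079
> imp(7/142)      # right node, about 0.1964
> imp(205/3861) - (3719/3861)*imp(198/3719) - (142/3861)*imp(7/142)
# of the order 10^-5, in line with the tiny goodness 0.00001 quoted above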
How do we split a node based on a nominal or categorical variable? Suppose we choose Race for the split. Note that Race is a nominal variable with five possible values. There are 2^4 - 1 = 15 ways we can split. The possibilities are listed below.

Left Grand Daughter Node     Right Grand Daughter Node
White                        Black, Hispanic, Asian, Others
Black                        White, Hispanic, Asian, Others
Hispanic                     White, Black, Asian, Others
Asian                        White, Black, Hispanic, Others
Others                       White, Black, Hispanic, Asian
White, Black                 Hispanic, Asian, Others
White, Hispanic              Black, Asian, Others
White, Asian                 Black, Hispanic, Others
White, Others                Black, Hispanic, Asian
Black, Hispanic              White, Asian, Others
Black, Asian                 White, Hispanic, Others
Black, Others                White, Hispanic, Asian
Hispanic, Asian              White, Black, Others
Hispanic, Others             White, Black, Asian
Asian, Others                White, Black, Hispanic

Let us look at the split based on White on one hand and Black, Hispanic, Asian, Others on the other. Channel each woman in the Left Daughter Node into the Left Grand Daughter Node if she is White. Otherwise, she goes into the Right Grand Daughter Node. We can assess how good the split is in just the same way as we did earlier.

Thus the splitting goes on, using all the predictors one by one. When do we create a terminal node? We stop splitting when a node is smaller than a prescribed minimum size. This is called pruning. The choice of the minimum size depends on the sample size. If the size of a node is less than 1% of the total size, one could stop splitting. Or, if a node contains fewer than 5 subjects, stop splitting.

There are a number of packages available to build a classification tree. We will look at two of them: tree and rpart. Let us download these packages and look at some examples.
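Where do these stopping rules live in rpart? In the function rpart.control. A reference sketch of its default values (taken from my reading of ?rpart.control; treat it as a sketch, not gospel):

> rpart.control(minsplit = 20,   # don't split a node with 20 or fewer subjects
+               minbucket = 7,   # round(minsplit/3): minimum terminal node size
+               cp = 0.01,       # minimum relative improvement required for a split
+               maxdepth = 30)   # maximum depth of the tree

Calling rpart without a control argument behaves as above; we will override minsplit shortly.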
Module 6: Classification Trees + rpart package + Creation of Polygonal Plots

Creation of Polygonal Plots

Recall the artificial data presented in Module 5. The data had one binary outcome variable Y (0 or 1) and two predictors X1 and X2. Each of the predictors takes integer values from 1 through 10. I built a tree with my bare hands. The tree is equivalent to the following classification protocol.

If X1 ≤ 5 and X2 ≤ 5, classify the subject to have the pattern {Y = 0}.
If X1 ≤ 5 and X2 ≥ 6, classify the subject to have the pattern {Y = 1}.
If X1 ≥ 6 and X2 ≤ 5, classify the subject to have the pattern {Y = 1}.
If X1 ≥ 6 and X2 ≥ 6, classify the subject to have the pattern {Y = 0}.

There is another way to present the classification protocol graphically. The statement "X1 ≤ 5 and X2 ≤ 5" corresponds, graphically, to the rectangle with vertices (1, 1), (1, 5), (5, 5), (5, 1) in the X1-X2 plane. The command 'polygon' draws the rectangle. First, we need to create a blank plot setting up the X1- and X2-axes. The input type = "n" tells the plot that no points should be imprinted on the graph.

> plot(c(1, 10), c(1, 10), type = "n", xlab = "X1", ylab = "X2", main = "Classification Protocol")

The 'polygon' command has essentially two major inputs. The x-input should have all the x-coordinates of the vertices. The y-input should have all the corresponding y-coordinates. The polygon thus created latches onto the existing plot.

> polygon(c(1, 1, 5, 5), c(1, 5, 5, 1), col = "gray", border = "blue", lwd = 2)

The statement "X1 ≤ 5 and X2 ≥ 6" corresponds, graphically, to the rectangle with vertices (1, 6), (1, 10), (5, 10), (5, 6) in the X1-X2 plane.

> polygon(c(1, 1, 5, 5), c(6, 10, 10, 6), col = "yellow", border = "blue", lwd = 2)

The other polygons are created in the same way.

> polygon(c(6, 6, 10, 10), c(6, 10, 10, 6), col = "mistyrose", border = "blue", lwd = 2)
> polygon(c(6, 6, 10, 10), c(1, 5, 5, 1), col = "cyan", border = "blue", lwd = 2)

We need to identify each rectangle with a pattern.

> text(3, 3, "{Y = 0}", col = "red")
> text(3, 8, "{Y = 1}", col = "blue")
> text(8, 8, "{Y = 0}", col = "red")
> text(8, 3, "{Y = 1}", col = "blue")

[Figure: the "Classification Protocol" plot, with the four colored rectangles labeled {Y = 0}, {Y = 1}, {Y = 1}, and {Y = 0}.]

rpart package

'rpart' is an acronym for recursive partitioning. Terry Therneau and Elizabeth Atkinson (Mayo Foundation) have developed the 'rpart' package to implement classification trees and regression trees in all their glory. The method depends on what kind of response variable we have.

Categorical → method = "class"
Continuous → method = "anova"
Count → method = "poisson"
Survival → method = "exp"

They have two monographs on their package available on the internet.

An Introduction to Recursive Partitioning Using the RPART Routines, February 2000
Same title, September 1997

Both are very informative. Let me illustrate the 'rpart' command in the context of a binary classification problem. Four data sets are available in the package. Download 'rpart.'

> data(package = "rpart")
Data sets in package 'rpart':
car.test.frame   Automobile Data from 'Consumer Reports' 1990
cu.summary       Automobile Data from 'Consumer Reports' 1990
kyphosis         Data on Children who have had Corrective Spinal Surgery
solder           Soldering of Components on Printed-Circuit Boards

Let us look at the 'kyphosis' data.

> data(kyphosis)
> dim(kyphosis)
[1] 81 4
> head(kyphosis)
  Kyphosis Age Number Start
1   absent  71      3     5
2   absent 158      3    14
3  present 128      4     5
4   absent   2      5     1
5   absent   1      4    15
6   absent   1      2    16

Understand the data. Look at the documentation on the data. Look at the documentation on 'rpart.'

If we let the partitioning continue unchecked, we will end up with a saturated tree. Every terminal node is pure. It is quite possible that some terminal nodes contain only one data point. One has to declare each terminal node as one of the two types: present or absent. Majority rule. Discuss.

We need to arrest the growth of the tree. One possibility is to demand that if a node contains 20 observations or fewer, no more splitting is done at this node. This is the default setting in 'rpart.' Let us check.

> MB <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

To get a tree, follow the commands.

> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)
> title(main = "Classification Tree for Kyphosis Data")

[Figure: the classification tree. Root: absent 64/17, split Start >= 8.5. Left branch: absent 56/6, split Start >= 14.5 into absent 29/0 (terminal) and absent 27/6, which splits on Age < 55 into absent 12/0 (terminal) and absent 15/6, which splits on Age >= 111 into absent 12/2 (terminal) and present 3/4 (terminal). Right branch: present 8/11 (terminal).]

Comments and interpretation

1. The root node has 81 subjects; for 64 of them kyphosis is absent and for 17 present. The root node identifies the majority. As a matter of fact, each node identifies the majority pattern.
2. All those subjects with Start ≥ 8.5 go into the left node. The total number of subjects in the left node is 62, and 56 of them have kyphosis absent. The majority have kyphosis absent, which is duly recorded inside the node.
3. All those subjects with Start < 8.5 go into the right node. The total number of subjects in the right node is 19, and 8 of them have kyphosis absent. The majority have kyphosis present, which is duly recorded inside the node.
4. This node is a terminal node. No further split is envisaged because the total number of observations is 19 ≤ 20. The command stops splitting a node if the size of the node is 20 or less (default). This is a pruning strategy. This terminal node is declared 'present' as per the 'majority rule' paradigm.
5. The node on the left is split again. The best covariate, as per the entropy purity calculations, is 'Start' again. All those subjects with Start ≥ 14.5 go into the left node. This node is pure. No split is possible. This node has 29 subjects, for all of whom kyphosis is absent. Obviously, we declare this terminal node 'absent.' All those subjects with Start < 14.5 go into the right node, which has 33 patients. And so on.
6. The other terminal nodes are self-explanatory.

The classification protocol as per this tree is given by:

1. If a child has Start < 8.5, predict that kyphosis will be present.
2. If a child has 14.5 ≤ Start, predict that kyphosis will be absent.
3. If a child has 8.5 ≤ Start < 14.5 and Age < 55 months, predict that kyphosis will be absent.
4. If a child has 8.5 ≤ Start < 14.5 and Age ≥ 111 months, predict that kyphosis will be absent.
5. If a child has 8.5 ≤ Start < 14.5 and 55 ≤ Age < 111 months, predict that kyphosis will be present.
6. The covariate 'Number' has no role in the classification.

Since only two predictors play a role here, we can build a rectangles graph using the polygon command. Here are the commands.

> plot(c(1, 18), c(1, 206), type = "n", xlab = "Start", ylab = "Age", ann = F,
+ main = "Classification Protocol - Kyphosis Data", axes = F)

ann = F means annotation false. The title and labels are not imprinted anymore. We can add these later. Have the landmarks of 'Start' and 'Age' recorded on the graph.

> axis(side = 1, at = c(1, 8, 9, 14, 15, 18))
> axis(side = 2, at = c(1, 54, 55, 110, 111, 206))

Build 5 polygons as per the protocol.

> polygon(c(1, 1, 8, 8), c(1, 206, 206, 1), col = "gray", border = "red", lwd = 1.5)
> polygon(c(15, 15, 18, 18), c(1, 206, 206, 1), col = "mistyrose", border = "green", lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(1, 54, 54, 1), col = "mistyrose", border = "green", lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(55, 110, 110, 55), col = "gray", border = "red", lwd = 1.5)
> polygon(c(9, 9, 14, 14), c(111, 206, 206, 111), col = "mistyrose", border = "green", lwd = 1.5)

Print the texts accordingly.

> text(4, 100, "present", col = "red")
> text(12, 25, "absent", col = "green")
> text(12, 75, "present", col = "red")
> text(12, 150, "absent", col = "green")
> text(16.5, 100, "absent", col = "green")
> title(main = "Classification Protocol - Kyphosis Data", sub = "81 Children",
+ xlab = "Start", ylab = "Age")

[Figure: "Classification Protocol - Kyphosis Data" for the 81 children, with Start on the x-axis, Age on the y-axis, and the five rectangles labeled present or absent.]

How reliable is the judgment of this tree? We have 81 children in our study. We know for each child whether kyphosis is present or absent. Pour the data on the covariates of a child into the root node. See which terminal node the child settles in. Classify the child accordingly. We know the true status of the child. Note down whether or not a mismatch occurred. Find the total number of mismatches.

Misclassification rate = re-substitution error = 100*(8 + 0 + 2 + 3)/81 = 16%.

Add up all the minority numbers of the terminal nodes.
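The same 16% can be had from rpart's own predictions. A sketch (the folder name pred is mine; predict with type = "class" returns the majority-rule label for each child):

> pred <- predict(MB, type = "class")     # fitted pattern for each of the 81 children
> table(kyphosis$Kyphosis, pred)          # rows: true status; columns: tree's verdict
> mean(pred != kyphosis$Kyphosis) * 100   # re-substitution error, about 16%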
We have other choices when graphing a tree. Let us try some of these.

> plot(MB, branch = 0, margin = 0.1, col = "red")
> text(MB, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")

[Figure: the same tree redrawn with branch = 0.]

Another choice:

> plot(MB, branch = 0.4, margin = 0.1, col = "red")
> text(MB, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")

[Figure: the same tree redrawn with branch = 0.4.]

We can increase the size of the tree by reducing the threshold number 20. Let us do it. If the size of a node is 5 or less, don't split. The following is the R command.
> MB1 <- rpart(Kyphosis ~ ., data = kyphosis, control = rpart.control(minsplit = 5))
> plot(MB1, branch = 0.4, margin = 0.1, col = "red")
> text(MB1, use.n = T, all = T, col = "red")
> title(main = "Classification Tree for Kyphosis data")

[Figure: the larger classification tree grown with minsplit = 5. In addition to the splits of the default tree, there are further splits on Age < 11.5, Age >= 98, Start < 5.5, Age >= 130.5, Age < 93, and Number < 4.5.]

Module 7: Logistic Regression for Grouped Data

A medical researcher wants to explore the connection between hypertension and the predictors smoking, obesity, and snoring. 'Hypertension' is taken to be the response variable. He collected data on a sample of 433 subjects. For each subject, he assessed whether or not the subject suffers from hypertension, whether or not the subject smokes, whether or not the subject is obese, and whether or not the subject snores. A typical record looks like:

Hypertension  Smoking  Obesity  Snoring
    Yes         No       Yes      Yes

He has 433 such records. Note that all the variables are binary. We can entertain a logistic regression model for the response variable.

Pr(Hypertension = Yes)/Pr(Hypertension = No) = exp{β0 + β1*Smoking + β2*Obesity + β3*Snoring}

We need to score each binary variable as 1 or 0. The R program will do it for you. The data consist of 433 pieces of information. Since each predictor is binary, we can summarize the entire data set into 8 pieces of information as follows.

Smoking  Obesity  Snoring  Total  # hypertension
  No       No       No       60         5
  Yes      No       No       17         2
  No       Yes      No        8         1
  Yes      Yes      No        2         0
  No       No       Yes     187        35
  Yes      No       Yes      85        13
  No       Yes      Yes      51        15
  Yes      Yes      Yes      23         8

Digest the data. Each covariate column is structured.

We need to enter the data in R format. It can be done a couple of different ways. The way the data are arranged makes it possible to proceed in the following way. First, create a folder storing the words "No" and "Yes."

> no.yes <- c("No", "Yes")

Create a folder for each of the predictors.

> smoking <- gl(2, 1, 8, no.yes)

gl = generate levels. Look at the documentation of 'gl': ?gl

no.yes: the levels come from the folder no.yes
8: the folder should consist of 8 entries
2: each entry should be one of the 2 levels coming from no.yes
1: the entries should alternate between "No" and "Yes"

> obesity <- gl(2, 2, 8, no.yes)

2: the entries should consist of 2 "No"s followed by 2 "Yes"s, repeated.

> snoring <- gl(2, 4, 8, no.yes)

4: the entries should consist of 4 "No"s followed by 4 "Yes"s.
The data under smoking can be entered in another way:

> Smoking1 <- c("No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes")

The folders n.tot and n.hyp are self-explanatory.

> n.tot <- c(60, 17, 8, 2, 187, 85, 51, 23)
> n.hyp <- c(5, 2, 1, 0, 35, 13, 15, 8)

Let us put all the data folders into a single frame.

> hyp <- data.frame(smoking, obesity, snoring, n.tot, n.hyp)
> hyp
  smoking obesity snoring n.tot n.hyp
1      No      No      No    60     5
2     Yes      No      No    17     2
3      No     Yes      No     8     1
4     Yes     Yes      No     2     0
5      No      No     Yes   187    35
6     Yes      No     Yes    85    13
7      No     Yes     Yes    51    15
8     Yes     Yes     Yes    23     8

We need to fit a logistic regression model using these grouped data. It can be done in two different ways. In one of the ways, we need to create a matrix consisting of two columns. The first column should consist of the numbers of people suffering from hypertension and the second column those who don't. The matrix operation is characterized by the command 'cbind,' where 'c' stands, as usual, for 'column.'

> hyp1 <- cbind(n.hyp, n.tot - n.hyp)
> hyp1
     n.hyp
[1,]     5  55
[2,]     2  15
[3,]     1   7
[4,]     0   2
[5,]    35 152
[6,]    13  72
[7,]    15  36
[8,]     8  15

We are ready to fit a logistic regression model.

> hyp2 <- glm(hyp1 ~ smoking + obesity + snoring, family = binomial)
> hyp2

Call: glm(formula = hyp1 ~ smoking + obesity + snoring, family = binomial)

Coefficients:
(Intercept)  smokingYes  obesityYes  snoringYes
   -2.37766    -0.06777     0.69531     0.87194

Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 14.13
Residual Deviance: 1.618   AIC: 34.54

> summary(hyp2)

Call: glm(formula = hyp1 ~ smoking + obesity + snoring, family = binomial)

Deviance Residuals:
       1        2        3        4        5        6        7        8
-0.04344  0.54145 -0.25476 -0.80051  0.19759 -0.46602 -0.21262  0.56231

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingYes  -0.06777    0.27812  -0.244   0.8075
obesityYes   0.69531    0.28509   2.439   0.0147 *
snoringYes   0.87194    0.39757   2.193   0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance: 1.6184 on 4 degrees of freedom
AIC: 34.537

Residual deviance is the sum of squares of the deviance residuals.

Number of Fisher Scoring iterations: 4

There is another way to go about fitting the model. We need the proportions of those who suffer from hypertension to the total for each configuration of the predictor variables.

> hyp3 <- n.hyp/n.tot
> hyp3
[1] 0.08333333 0.11764706 0.12500000 0.00000000 0.18716578 0.15294118 0.29411765
[8] 0.34782609

These proportions are the response variable in the model. Talk a little bit about this.

> hyp4 <- glm(hyp3 ~ smoking + obesity + snoring, binomial, weights = n.tot)

R would not know what the total number of subjects is from which each proportion is calculated. It needs this information in writing the likelihood of the data.

> hyp4

Call: glm(formula = hyp3 ~ smoking + obesity + snoring, family = binomial, weights = n.tot)

Coefficients:
(Intercept)  smokingYes  obesityYes  snoringYes
   -2.37766    -0.06777     0.69531     0.87194

Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 14.13
Residual Deviance: 1.618   AIC: 34.54

> summary(hyp4)

Call: glm(formula = hyp3 ~ smoking + obesity + snoring, family = binomial, weights = n.tot)

Deviance Residuals:
       1        2        3        4        5        6        7        8
-0.04344  0.54145 -0.25476 -0.80051  0.19759 -0.46602 -0.21262  0.56231
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingYes  -0.06777    0.27812  -0.244   0.8075
obesityYes   0.69531    0.28509   2.439   0.0147 *
snoringYes   0.87194    0.39757   2.193   0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance: 1.6184 on 4 degrees of freedom
AIC: 34.537

Number of Fisher Scoring iterations: 4

It is interesting that 'Smoking' is not a significant factor. The other two predictors are significant. The fitted model is:

ln(Odds) = -2.37766 - 0.06777*smoking + 0.69531*obesity + 0.87194*snoring.

The R program coded each "Yes" as 1 and "No" as 0. Why? It is following the alpha-numeric principle.

Using the model, we can predict ln(Odds) for each scenario of the predictors.

> predict(hyp4)
         1          2          3          4          5          6          7          8
-2.3776615 -2.4454364 -1.6823519 -1.7501268 -1.5057221 -1.5734970 -0.8104126 -0.8781874

We can predict the response probabilities using the model for each scenario of the predictors.

> predict(hyp4, type = "response")
         1          2          3          4          5          6          7          8
0.08489206 0.07977292 0.15678429 0.14803121 0.18157364 0.17171843 0.30780259 0.29355353

There is another R command we can use to get the predicted probabilities.

> fitted(hyp4)
         1          2          3          4          5          6          7          8
0.08489206 0.07977292 0.15678429 0.14803121 0.18157364 0.17171843 0.30780259 0.29355353

We can compare the observed proportions with the predicted probabilities for each scenario of the predictors to see how close they are. Note that the observed proportions are in the folder 'hyp3.'

> hyp3
[1] 0.08333333 0.11764706 0.12500000 0.00000000 0.18716578 0.15294118 0.29411765
[8] 0.34782609

It is difficult to compare proportions directly. We can instead compare observed frequencies with expected frequencies. The expected frequency, as per the model, is the product of the predicted probability and the total. We can calculate the expected frequency for each scenario of the predictors.

> fitted(hyp4)*n.tot
         1          2          3          4          5          6          7          8
 5.0935236  1.3561397  1.2542744  0.2960624 33.9542700 14.5960668 15.6979321  6.7517311

We can place the observed and expected frequencies side by side.

> data.frame(fit = fitted(hyp4)*n.tot, n.hyp, n.tot)
         fit n.hyp n.tot
1  5.0935236     5    60
2  1.3561397     2    17
3  1.2542744     1     8
4  0.2960624     0     2
5 33.9542700    35   187
6 14.5960668    13    85
7 15.6979321    15    51
8  6.7517311     8    23

Can we do some testing about model adequacy? One could use the Hosmer-Lemeshow test. Another test can be built on the residual deviance or the null deviance. R is reluctant to associate a p-value with the deviance. Just as well, because no exact p-value can be found, only an approximation that is valid for large expected counts. In the present example, some expected counts are below 5. If you insist on doing a goodness-of-fit test, the asymptotic result is that each stated deviance has a chi-squared distribution with the stated degrees of freedom under the null hypothesis that the response probability follows a logistic model pattern.
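If you do want those approximate p-values, they are one-line chi-squared tail computations. A sketch (and remember the caution above about small expected counts):

> pchisq(1.6184, df = 4, lower.tail = FALSE)    # residual deviance: adequacy of the fitted model
> pchisq(14.1259, df = 7, lower.tail = FALSE)   # null deviance: adequacy of the intercept-only model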
Module 8: Prediction in Classification Trees + Regression Trees

Make sure the response variable is a 'factor' when you want to build a classification tree.

We build classification trees when the response variable is binary. If you use the rpart package, make sure your response variable is a 'factor.' If the response variable is descriptive, such as absent and present, the response variable is indeed a 'factor.' If the response variable is coded as 0 and 1, make sure the codes are factors. If they are not, one can always convert them into factors using the command 'as.factor.' Suppose the folder is called MB. Then type

> MB <- as.factor(MB)

Prediction in classification trees

Let us work with the kyphosis data. Activate the 'rpart' package.

> data(kyphosis)

Build a classification tree.

> MB <- rpart(Kyphosis ~ ., data = kyphosis)

The 'predict' command predicts the status of each kid in the data as per the classification tree.

> MB1 <- predict(MB, newdata = kyphosis)
> head(MB1)
     absent   present
1 0.4210526 0.5789474
2 0.8571429 0.1428571
3 0.4210526 0.5789474
4 0.4210526 0.5789474
5 1.0000000 0.0000000
6 1.0000000 0.0000000

What is going on? Look at the data.

> head(kyphosis)
  Kyphosis Age Number Start
1   absent  71      3     5
2   absent 158      3    14
3  present 128      4     5
4   absent   2      5     1
5   absent   1      4    15
6   absent   1      2    16

Look at the first kid. Feed his data into the tree. He falls into the last terminal node. The prediction is 'Kyphosis present.' Look at the data in the last terminal node. Nineteen of our kids fall into this node. Eight of them have Kyphosis absent and eleven have Kyphosis present. As per the classification protocol (majority rule), every one of these kids will be classified as Kyphosis present. Using the data in the terminal node, R calculates the probability of Kyphosis present and also of Kyphosis absent. These are the probabilities reported in the output of the 'predict' command.

> 11/19
[1] 0.5789474

Let us codify the probabilities into present and absent using the threshold probability 0.50.

> MB2 <- ifelse(MB1$present >= 0.50, "present", "absent")
Error in MB1$present : $ operator is invalid for atomic vectors
> class(MB1)
[1] "matrix"

The '$' operator does not work on matrices. Convert the folder into a data.frame.

> MB2 <- as.data.frame(MB1)
> MB3 <- ifelse(MB2$present >= 0.50, "present", "absent")
> head(MB3)
[1] "present" "absent"  "present" "present" "absent"  "absent"

Let us add MB3 to the mother folder 'kyphosis.'

> kyphosis$Prediction <- MB3
> head(kyphosis)
  Kyphosis Age Number Start Prediction
1   absent  71      3     5    present
2   absent 158      3    14     absent
3  present 128      4     5    present
4   absent   2      5     1    present
5   absent   1      4    15     absent
6   absent   1      2    16     absent

We want to identify the kids for whom the actual status of Kyphosis and the Prediction disagree.

> kyphosis$Disagree <- ifelse(kyphosis$Kyphosis == "absent" & kyphosis$Prediction == "present", 1,
+ ifelse(kyphosis$Kyphosis == "present" & kyphosis$Prediction == "absent", 1, 0))
> head(kyphosis)
  Kyphosis Age Number Start Prediction Disagree
1   absent  71      3     5    present        1
2   absent 158      3    14     absent        0
3  present 128      4     5    present        0
4   absent   2      5     1    present        1
5   absent   1      4    15     absent        0
6   absent   1      2    16     absent        0

How many kids are misclassified?

> sum(kyphosis$Disagree)
[1] 13

What is the misclassification rate?

> (13/81)*100
[1] 16.04938

We have two new kids with the following information.

Kid 1: Age = 12; Number = 4; Start = 7
Kid 2: Age = 121; Number = 5; Start = 9

How does the tree classify these kids?

> MB4 <- data.frame(Age = c(12, 121), Number = c(4, 5), Start = c(7, 9))
> MB4
  Age Number Start
1  12      4     7
2 121      5     9
> MB5 <- predict(MB, newdata = MB4)
> MB5
        absent   present
[1,] 0.4210526 0.5789474
[2,] 0.8571429 0.1428571

The first kid will be classified as Kyphosis present and the second as Kyphosis absent.

Regression trees

We now focus on developing a regression tree when the response variable is quantitative. Let me work out the build-up of a tree using an example.
The data set 'bodyfat' is available in the package 'mboost.' Download the package and the data. The data set has 71 observations on 10 variables. Body fat was measured on 71 healthy German women using Dual Energy X-ray Absorptiometry (DXA). This reference method is very accurate in measuring body fat. However, setting up this instrument requires a lot of effort and is of high cost. Researchers are looking for ways to estimate body fat using anthropometric measurements such as waist circumference, hip circumference, elbow breadth, and knee breadth. The data gives these anthropometric measurements on the women, in addition to their age. Here is the data.

> data(bodyfat)
> dim(bodyfat)
[1] 71 10
> head(bodyfat)
   age DEXfat waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b
47  57  41.68     100.0   112.0          7.1         9.4     4.42     4.95
48  65  43.29      99.5   116.5          6.5         8.9     4.63     5.01
49  59  35.41      96.0   108.5          6.2         8.9     4.12     4.74
50  58  22.79      72.0    96.5          6.1         9.2     4.03     4.48
51  60  36.42      89.5   100.5          7.1        10.0     4.24     4.68
52  61  24.13      83.5    97.0          6.5         8.8     3.55     4.06
   anthro3c anthro4
47     4.50    6.13
48     4.48    6.37
49     4.60    5.82
50     3.91    5.66
51     4.15    5.91
52     3.64    5.14

Ignore the last four measurements. Each one is a sum of logarithms of three of the four anthropometric measurements.

We now want to create a regression tree. All data points go into the root node to begin with. We need to select one of the covariates and a cut-point to split the root node. Let us start with the covariate 'waistcirc' and the cut-point 88.4, say. All women with waistcirc < 88.4 are shepherded into the left node and the rest into the right node. We need to judge how good the split is. We use variance as the criterion. Calculate the variance of 'DEXfat' for all women in the root node. It is 121.9426. Calculate the variance of 'DEXfat' for all women in the left node. It is 33.72712. Calculate the variance of 'DEXfat' for all women in the right node. It is 52.07205.

Goodness of the split = Mother's variance - Weighted sum of daughters' variances
= 121.9426 - [(40/71)*33.72712 + (31/71)*52.07205] = 80.21.

The goal is to find that cut-point for which the goodness of the split is maximum. For each covariate, find the best cut-point. Select the best covariate to start the tree. Follow the same principle at every stage.

> var(bodyfat$DEXfat)
[1] 121.9426
> MB1 <- subset(bodyfat, bodyfat$waistcirc < 88.4)
> var(MB1$DEXfat)
[1] 33.72712
> mean(bodyfat$DEXfat)
[1] 30.78282
> MB2 <- subset(bodyfat, bodyfat$waistcirc >= 88.4)
> dim(MB1)
[1] 40 10
> dim(MB2)
[1] 31 10
> var(MB2$DEXfat)
[1] 52.07205

Let us use rpart to build a regression tree. I need to prune the tree. If the size of a node is 10 or less, don't split the node.

> MB <- rpart(DEXfat ~ waistcirc + hipcirc + elbowbreadth + kneebreadth, data =
+ bodyfat, control = rpart.control(minsplit = 10))
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)

[Figure: the regression tree. Root: mean 30.78, n = 71, split waistcirc < 88.4. Left: mean 22.92, n = 40, split hipcirc < 96.25, leading to a node with mean 18.21, n = 17 (split waistcirc < 70.35 into means 15.11, n = 7 and 20.38, n = 10) and a node with mean 26.41, n = 23 (split waistcirc < 80.75 into means 24.13, n = 13 and 29.37, n = 10). Right: mean 40.92, n = 31, split kneebreadth < 11.15, leading to a node with mean 39.26, n = 28 (split hipcirc < 109.9 into means 35.28, n = 13 and 42.71, n = 15) and a terminal node with mean 56.45, n = 3.]

Interpretation of the tree
1. At each node, the mean of DEXfat is reported.
2. At each node, the size of the node is reported.
3. The tree has 7 terminal nodes.
4. The variable elbowbreadth has no role in the tree.
The 78 mean of the DEXfat reported in the terminal node is the predicted DEXfat for the woman. Let us pour the data on the covariates of all individuals in our sample. The body fat is predicted by the tree. Let us record the predicted body fat and observed body fat side by side. > MB3 <- predict(MB, newdata = bodyfat) > MB4 <- data.frame(bodyfat$DEXfat, PredictedValues = MB3) > MB4 bodyfat.DEXfat PredictedValues 47 41.68 42.71133 48 43.29 42.71133 49 35.41 35.27846 50 22.79 24.13077 51 36.42 35.27846 52 24.13 29.37200 53 29.83 29.37200 54 35.96 35.27846 55 23.69 24.13077 56 22.71 20.37700 57 23.42 24.13077 58 23.24 20.37700 59 26.25 20.37700 60 21.94 15.10857 61 30.13 24.13077 62 36.31 35.27846 63 27.72 24.13077 64 46.99 42.71133 65 42.01 42.71133 66 18.63 20.37700 67 38.65 35.27846 68 21.20 20.37700 69 35.40 35.27846 70 29.63 35.27846 71 25.16 24.13077 72 31.75 29.37200 73 40.58 42.71133 74 21.69 24.13077 75 46.60 56.44667 76 27.62 29.37200 77 41.30 42.71133 78 42.76 42.71133 79 28.84 29.37200 80 36.88 29.37200 79 81 25.09 24.13077 82 29.73 29.37200 83 28.92 29.37200 84 43.80 42.71133 85 26.74 24.13077 86 33.79 35.27846 87 62.02 56.44667 88 40.01 42.71133 89 42.72 35.27846 90 32.49 35.27846 91 45.92 42.71133 92 42.23 42.71133 93 47.48 42.71133 94 60.72 56.44667 95 32.74 35.27846 96 27.04 29.37200 97 21.07 24.13077 98 37.49 35.27846 99 38.08 42.71133 100 40.83 42.71133 101 18.51 20.37700 102 26.36 24.13077 103 20.08 20.37700 104 43.71 42.71133 105 31.61 35.27846 106 28.98 29.37200 107 18.62 20.37700 108 18.64 15.10857 109 13.70 15.10857 110 14.88 15.10857 111 16.46 20.37700 112 11.21 15.10857 113 11.21 15.10857 114 14.18 15.10857 115 20.84 24.13077 116 19.00 24.13077 117 18.07 20.37700 Here is the graph of the observed and predicted values. 80 40 20 30 Predicted Fat 50 Regression Tree Output on the bodyfat data 10 20 30 40 50 60 Observed Fat Here is the R code. > plot(bodyfat$DEXfat, MB3, pch = 16, col = "red", xlab = "Observed Fat", + ylab = "Predicted Fat", main = "Regression Tree Output on the bodyfat data") > abline(a=0, b=1, col = "blue") 81 Module 9: Polygonal Plots for Three Predictors in Classification Tree Let us go back to the data ‘infert’ in the package ‘datasets.’ In the classification tree, three variables (spontaneous, age, and parity) made an impact. We want to present an illuminating polygonal plot. Spontaneous abortions have range 0 to 2 and the other variables have wider range. We will create three polygonal plots one for each value of spontaneous with xaxis age and y-axis parity. We will create a single frame accommodating all three plots using the ‘par’ command. Here is the plot. 5 3 1 Parity Spontaneous Abortions = 0 25 30 35 40 Age 0 2 4 6 Parity Spontaneous Abortions = 1 F F I 21 F 28 I 30 44 Age 5 3 F I 1 Parity Spontaneous Abortions = 2 21 F 28 I 30 44 Age Here are the R commands. 
> par(mfrow = c(3, 1), oma = c(3, 2, 4, 1))
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 0", type = "n")
> polygon(c(21, 21, 44, 44), c(1, 6, 6, 1), col = "mistyrose")
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 1", type = "n", ylim = c(0, 6), axes = F)
> axis(side = 1, at = c(21, 28, 29, 30, 31, 44))
> axis(side = 2, at = c(0:6))
> polygon(c(21, 21, 44, 44), c(4, 6, 6, 4), col = "mistyrose")
> text(33, 5, "F", col = "green")
> polygon(c(21, 21, 28, 28), c(2, 3, 3, 2), col = "mistyrose")
> text(25, 2.5, "F", col = "green")
> polygon(c(21, 21, 28, 28), c(0.5, 1.5, 1.5, 0.5), col = "lightcyan")
> text(25, 1, "I", col = "red")
> polygon(c(29, 29, 30, 30), c(1, 3, 3, 1), col = "mistyrose")
> text(29.5, 2, "F", col = "green")
> polygon(c(31, 31, 44, 44), c(1, 3, 3, 1), col = "lightcyan")
> text(38, 2, "I", col = "red")
> plot(infert$age, infert$parity, xlab = "Age", ylab = "Parity", main =
+ "Spontaneous Abortions = 2", type = "n", axes = F)
> axis(side = 1, at = c(21, 28, 29, 30, 31, 44))
> axis(side = 2, at = c(1:6))
> polygon(c(21, 21, 28, 28), c(1, 3, 3, 1), col = "lightcyan")
> text(25, 2, "I", col = "red")
> polygon(c(21, 21, 44, 44), c(4, 6, 6, 4), col = "mistyrose")
> text(33, 5, "F", col = "green")
> polygon(c(29, 29, 30, 30), c(1, 3, 3, 1), col = "mistyrose")
> text(29.5, 2, "F", col = "green")
> polygon(c(31, 31, 44, 44), c(1, 3, 3, 1), col = "lightcyan")
> text(38, 2, "I", col = "red")

Writing the classification tree out verbally makes it easy to build the polygonal plots. The tree has seven terminal nodes. Trace each terminal node back to the root node. Make a tentative polygonal plot by hand using the verbal description. Let us begin the whole exercise with the tree.

[Figure: "Classification Tree for the Infertility Data" (0 = not infertile; 1 = infertile).
Root (0, 165/83): split spontaneous < 0.5.
  spontaneous = 0: terminal node (0, 113/28).
  spontaneous >= 1 (1, 52/55): split parity >= 3.5.
    parity >= 4: terminal node (0, 16/7).
    parity <= 3 (1, 36/48): split age < 30.5.
      age <= 30 (0, 27/21): split age >= 28.5.
        age = 29 or 30: terminal node (0, 8/2).
        age <= 28 (0, 19/19): split spontaneous < 1.5.
          spontaneous = 1 (0, 16/12): split parity >= 1.5.
            parity = 2 or 3: terminal node (0, 10/3).
            parity = 1: terminal node (1, 6/9).
          spontaneous = 2: terminal node (1, 3/7).
      age >= 31: terminal node (1, 9/27).]

Here is the code for the tree.

> data(infert)
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum
1    0-5yrs  26      6       1    1           2       1              3
2    0-5yrs  42      1       1    1           0       2              1
3    0-5yrs  39      6       2    1           0       3              4
4    0-5yrs  34      4       2    1           0       4              2
5   6-11yrs  35      3       1    1           1       5             32
6   6-11yrs  36      4       2    1           1       6             36
> infert$education1 <- ifelse(infert$education == "0-5yrs", 0,
+ ifelse(infert$education == "6-11yrs", 1, 2))
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum education1
1    0-5yrs  26      6       1    1           2       1              3          0
2    0-5yrs  42      1       1    1           0       2              1          0
3    0-5yrs  39      6       2    1           0       3              4          0
4    0-5yrs  34      4       2    1           0       4              2          0
5   6-11yrs  35      3       1    1           1       5             32          1
6   6-11yrs  36      4       2    1           1       6             36          1
> infert$case <- as.factor(infert$case)
> MB <- rpart(case ~ age + parity + induced + spontaneous + education1,
+ data = infert)
> plot(MB, uniform = T, margin = 0.1)
> text(MB, use.n = T, all = T)
> title(main = "Classification Tree for the Infertility Data", sub =
+ "0 = Not infertile; 1 = Infertile")

Verbal description

Read the terminal nodes from left to right.

Terminal node No. 1
If spontaneous = 0, then fertile.

Terminal node No. 2
If spontaneous = 1 or 2 and parity = 4, 5, or 6, then fertile.

Terminal node No. 3
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, and age ≥ 29, then fertile. Clean this up: If spontaneous = 1 or 2, parity = 1, 2, or 3, and age = 29 or 30, then fertile.

Terminal node No. 4
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, spontaneous = 0 or 1, and parity = 2, 3, 4, 5, or 6, then fertile. Clean this up: If spontaneous = 1, parity = 2 or 3, and age ≤ 28, then fertile.

Terminal node No. 5
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, spontaneous = 0 or 1, and parity = 1, then infertile. Clean this up: If spontaneous = 1, parity = 1, and age ≤ 28, then infertile.

Terminal node No. 6
If spontaneous = 1 or 2, parity = 1, 2, or 3, age ≤ 30, age ≤ 28, and spontaneous = 2, then infertile. Clean this up: If spontaneous = 2, parity = 1, 2, or 3, and age ≤ 28, then infertile.

Terminal node No. 7
If spontaneous = 1 or 2, parity = 1, 2, or 3, and age ≥ 31, then infertile.

Draw the polygonal plots by hand. Every inch of the space in each plot should be covered.

Module 10: MULTINOMIAL LOGISTIC REGRESSION

In traditional logistic regression, the response variable is binary. We now move on to response variables which are polytomous, i.e., have more than two possible response categories. The models we develop here encompass the traditional binary response variables.

Polytomous variables are of two kinds: ordinal or nominal. Suppose a subject is given a certain medication for arthritis pain. The response variable is how much improvement the subject perceives. The subject is asked to check one of the following items:

Y: Marked improvement, Some improvement, or None at all.

The response variable takes three possible values. The responses have some sense of ordering. We then say that the response variable is ordinal.

Suppose an opinion pollster is interested in conducting a survey in a borough in order to ascertain the political leanings of its denizens. Each subject in the borough is classified into one of the categories:

Y: Republican, Democrat, or No particular affiliation.

The response variable is again polytomous. No sense of ordering is perceivable in the responses. We say that the response variable is nominal.

We now embark on building models for polytomous response variables. We will develop special models if the polytomous response variable is ordinal.

Example. A medical researcher is entrusted with the job of evaluating an active treatment in alleviating a certain type of arthritis pain. As a control, he takes a placebo treatment. Each subject is female or male. For each subject selected at random, let

X1 = 1 if the subject is female
   = 0 if the subject is male;
X2 = 1 if the treatment given is Active
   = 0 if the treatment given is Placebo;
Y = 1 (Marked improvement)
  = 2 (Some improvement)
  = 3 (None at all)

Data will be collected on the variables Y, X1, and X2 for a random sample of arthritis patients. In this problem, we have two covariates X1 and X2. The response variable is Y, which is polytomous. The objective is to explore how the response variable Y depends on the covariates X1 and X2. The research questions are:

1. Are there significant gender differences in the responses?
2. Are there significant differences between the active treatment and the placebo in the responses?

We will answer these questions via a model-building endeavor. We want to build a model connecting Y with X1 and X2. After building the model, we want to ascertain the impact of each covariate on the response variable. More appropriately, for given values of X1 and X2, we want to model Pr(Y = 1 | X1, X2), Pr(Y = 2 | X1, X2), and Pr(Y = 3 | X1, X2) as functions of the covariates X1 and X2. What is the interpretation?
Suppose we have a subject whose covariate values X1 and X2 are known. What are the chances that the subject exhibits marked improvement in his/her condition? What are the chances that the subject exhibits some improvement? What are the chances that the subject exhibits no improvement at all? The responses are mutually exclusive and exhaustive. Therefore, the sum of these three probabilities is one.

The multinomial logistic regression model is given by

Pr(Y = 1 | X1, X2) = exp(α1 + β1X1 + β2X2)/D
Pr(Y = 2 | X1, X2) = exp(α2 + β3X1 + β4X2)/D
Pr(Y = 3 | X1, X2) = 1/D,

where D = 1 + exp(α1 + β1X1 + β2X2) + exp(α2 + β3X1 + β4X2).

On the left side, each quantity is a probability. Each probability is a number between 0 and 1. Each of the right sides is also a number between 0 and 1. The probabilities on the left side add up to unity. So do the expressions on the right side. There is no incongruity in the model. This model has 6 parameters.

Is there any special reason that the model for the response {Y = 3} is the one given? You could swap the expressions for {Y = 2} and {Y = 3}. The ultimate conclusions will remain the same. You have to take one of the specific responses as the baseline.

We can rewrite the model in a different way. Compare the responses {Y = 1} and {Y = 3}:

ln[Pr(Y = 1 | X1, X2)/Pr(Y = 3 | X1, X2)] = α1 + β1X1 + β2X2.

Compare the responses {Y = 2} and {Y = 3}:

ln[Pr(Y = 2 | X1, X2)/Pr(Y = 3 | X1, X2)] = α2 + β3X1 + β4X2.

If the response variable is ordinal, there is another model commonly entertained. This is called the Proportional Odds Model. In our example, the response variable is indeed ordinal.

PROPORTIONAL ODDS MODEL

Pr(Y = 1 | X1, X2) = exp{α1 + β1X1 + β2X2}/(1 + exp{α1 + β1X1 + β2X2})

and

Pr(Y = 1 | X1, X2) + Pr(Y = 2 | X1, X2) = exp{α2 + β1X1 + β2X2}/(1 + exp{α2 + β1X1 + β2X2}),

with α1 < α2. Under this model, each individual probability can be ascertained:

Pr(Y = 2 | X1, X2) = exp{α2 + β1X1 + β2X2}/(1 + exp{α2 + β1X1 + β2X2}) - exp{α1 + β1X1 + β2X2}/(1 + exp{α1 + β1X1 + β2X2})

Pr(Y = 3 | X1, X2) = 1 - exp{α2 + β1X1 + β2X2}/(1 + exp{α2 + β1X1 + β2X2})

In order to make sure that these are all non-negative numbers, we ought to have α2 ≥ α1. Discuss why. This model has 4 parameters.

What is the difference between the multinomial logistic regression model and the proportional odds model?

1. The proportional odds model has fewer parameters.
2. The proportional odds model is applicable only when we have an ordinal response variable.
3. If both models fit the data well, we choose the one with fewer parameters. Simplicity is a desirable trait.

The proportional odds model can be rewritten in the following way. Compare the response {Y = 1} with {Y = 2 or 3}; that is, compare the best response versus the rest. (We can do this because the response variable is ordinal.)

ln[Pr(Y = 1 | X1, X2) / (Pr(Y = 2 | X1, X2) + Pr(Y = 3 | X1, X2))] = α1 + β1X1 + β2X2.

Compare the responses {Y = 1 or 2} with {Y = 3}; that is, compare the best two responses versus the worst:

ln[(Pr(Y = 1 | X1, X2) + Pr(Y = 2 | X1, X2)) / Pr(Y = 3 | X1, X2)] = α2 + β1X1 + β2X2.

Some writers write the proportional odds model, in the above case, in the following way:

Pr(Y = 3 | X1, X2) = exp{α1 + β1X1 + β2X2}/(1 + exp{α1 + β1X1 + β2X2})

and

Pr(Y = 2 | X1, X2) + Pr(Y = 3 | X1, X2) = exp{α2 + β1X1 + β2X2}/(1 + exp{α2 + β1X1 + β2X2}),

with α1 < α2. The package 'Design' (now 'rms') uses this form of the proportional odds model. In whatever way we decide to write the model, the final conclusions about the significance of the covariates and the predicted probabilities remain the same.
Why is this called a proportional odds model? Let us look at the general situation. Suppose we have a categorical response variable Y which is ordinal. Let the possible values of Y be denoted by 1, 2, … , J. (1 means the best and J means the worst.) Suppose we have only one covariate X. The proportional odds model is given by
Pr(Y ≤ 1 | X) = exp(α1 + βX)/[1 + exp(α1 + βX)],
Pr(Y ≤ 2 | X) = exp(α2 + βX)/[1 + exp(α2 + βX)],
Pr(Y ≤ 3 | X) = exp(α3 + βX)/[1 + exp(α3 + βX)],
… …
Pr(Y ≤ J-1 | X) = exp(αJ-1 + βX)/[1 + exp(αJ-1 + βX)],
with the provision that α1 ≤ α2 ≤ … ≤ αJ-1.
Equivalently, we can formulate the model as
ln[Pr(Y ≤ j | X)/Pr(Y > j | X)] = αj + βX, j = 1, 2, … , J-1.
We are comparing the odds of the best j responses with the worst J-j responses. What is special about the model? The same β is present in every equation. This is what makes the model have proportional odds. Suppose we have an individual with X = x1 and another individual with X = x2. Compare the odds of the best j responses versus the worst J-j responses for the two individuals:
[Pr(Y ≤ j | x1)/Pr(Y > j | x1)] ÷ [Pr(Y ≤ j | x2)/Pr(Y > j | x2)] = exp(β(x1 - x2)).
The odds are proportional. The ratio depends only on the conditions X = x1 and X = x2. It is free of j.
Actual data on the arthritis experiment:

Sex     Treatment   Marked   Some   None   Total
Female  Active        16       5      6     27
Female  Placebo        6       7     19     32
Male    Active         5       2      7     14
Male    Placebo        1       0     10     11

Analyze the data. The analysis of the data is tantamount to comparing the distribution of the response variable Y among four distinct populations:
1. Those who are female and on the active treatment;
2. Those who are female and on the placebo;
3. Those who are male and on the active treatment;
4. Those who are male and on the placebo.
The empirical distributions are:

             Female              Male
           Active  Placebo   Active  Placebo
Pr(Y = 1)   0.59    0.19      0.36    0.09
Pr(Y = 2)   0.19    0.22      0.14    0.00
Pr(Y = 3)   0.22    0.59      0.50    0.91

Empirical analysis (an R sketch of these two comparisons is given below, just before the alligator data are entered):
1. Compare the responses of males and females who are on the active medication. Is the observed difference in the responses statistically significant?
2. Compare the responses between the active treatment and the placebo for females. Is the observed difference in the responses statistically significant?
OPERATION R
A number of packages are available to fit logistic models.
Base: Binary Logistic Regression model
nnet: Multinomial Logistic Regression model for raw data
VGAM: Multinomial and Proportional Odds models
rms (formerly Design): Proportional Odds model
Raw data or ungrouped data
The following data came from the Florida Game and Fresh Water Fish Commission. They wanted to investigate factors influencing the primary food choice of alligators. For 59 alligators sampled in Lake George, Florida, the numbers pertain to the alligator's length (in meters) and the primary food type found in the alligator's stomach. Primary food type has three categories: Fish, Invertebrate, and Other. The invertebrates are primarily apple snails, aquatic insects, and crayfish. The 'Other' category includes reptiles (primarily turtles, though one stomach contained the tags of 23 baby alligators that had been released in the lake during the previous year).
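Before entering the alligator data, here is the sketch promised above for the two empirical comparisons on the arthritis table. It applies the base R command chisq.test to the relevant 2 x 3 sub-tables; with counts this small, R will warn that the chi-squared approximation may be inaccurate, and fisher.test is a reasonable alternative.

# Question 2: active versus placebo among females
females <- matrix(c(16, 5, 6,      # Female, Active
                     6, 7, 19),    # Female, Placebo
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Active", "Placebo"),
                                  c("Marked", "Some", "None")))
chisq.test(females)

# Question 1: females versus males on the active treatment
active <- matrix(c(16, 5, 6,       # Female, Active
                    5, 2, 7),      # Male, Active
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Female", "Male"),
                                 c("Marked", "Some", "None")))
chisq.test(active)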
Size: 1.24 1.45 1.63 1.78 1.98 2.36 2.79 3.68 1.30
Food:  I    I    I    I    I    F    F    O    I
Size: 1.45 1.65 1.78 2.03 2.39 2.84 3.71 1.30 1.47
Food:  O    O    I    F    F    F    F    I    I
Size: 1.65 1.78 2.03 2.41 3.25 3.89 1.32 1.47 1.65
Food:  I    O    F    F    O    F    F    F    F
Size: 1.80 2.16 2.44 3.28 1.32 1.50 1.65 1.80 2.26
Food:  I    F    F    O    F    I    F    F    F
Size: 2.46 3.33 1.40 1.52 1.68 1.85 2.31 2.56 3.56
Food:  F    F    F    I    F    F    F    O    F
Size: 1.42 1.55 1.70 1.88 2.31 2.67 3.58 1.42 1.60
Food:  I    I    I    I    F    F    F    F    I
Size: 1.73 1.93 2.36 2.72 3.66
Food:  O    I    F    I    F

Here, the response variable is 'Food Type,' which is a nominal categorical variable. The independent variable is 'Size,' which is quantitative. We want to investigate how the food choice depends on size. Let us entertain a multinomial logistic regression model:
Pr(Food = F) = exp(α1 + β1*Size)/D,
Pr(Food = I) = exp(α2 + β2*Size)/D,
Pr(Food = O) = 1/D,
where D = 1 + exp(α1 + β1*Size) + exp(α2 + β2*Size).
Multinomial logistic model fitting is not available in the 'base' system of R. It is available in several packages. The package I will use is 'nnet.' It is also available in 'Design.'
Get into the R console. Download the package 'nnet' from the Ohio CRAN mirror. We have two columns of data. Input the data.
> size <- c(1.24, 1.45, 1.63, 1.78, 1.98, 2.36, 2.79, 3.68, 1.30, 1.45, 1.65,
+ 1.78, 2.03, 2.39, 2.84, 3.71, 1.30, 1.47, 1.65, 1.78, 2.03, 2.41, 3.25, 3.89,
+ 1.32, 1.47, 1.65, 1.80, 2.16, 2.44, 3.28, 1.32, 1.50, 1.65, 1.80, 2.26, 2.46,
+ 3.33, 1.40, 1.52, 1.68, 1.85, 2.31, 2.56, 3.56, 1.42, 1.55, 1.70, 1.88, 2.31,
+ 2.67, 3.58, 1.42, 1.60, 1.73, 1.93, 2.36, 2.72, 3.66)
'Food type' is categorical. One can input the 59 entries in toto. Or, one can exploit the repetitions as follows.
> food <- factor(c(rep("I", 5), rep("F", 2), "0", "I", rep("0", 2), "I",
+ rep("F", 4), rep("I", 3), "0", rep("F", 2), "0", rep("F", 4), "I",
+ rep("F", 2), "0", "F", "I", rep("F", 6), "I", rep("F", 3), "0", "F",
+ rep("I", 4), rep("F", 4), "I", "0", "I", "F", "I", "F"))
Note that 'Other' has been entered as the character "0" (zero); this will matter for which category nnet treats as the baseline, as we will see.
Put both variables into a single data frame.
> allig <- data.frame(food, size)
> allig
   food size
1     I 1.24
2     I 1.45
3     I 1.63
4     I 1.78
5     I 1.98
6     F 2.36
7     F 2.79
8     0 3.68
9     I 1.30
10    0 1.45
11    0 1.65
12    I 1.78
13    F 2.03
14    F 2.39
15    F 2.84
16    F 3.71
17    I 1.30
18    I 1.47
19    I 1.65
20    0 1.78
21    F 2.03
22    F 2.41
23    0 3.25
24    F 3.89
25    F 1.32
26    F 1.47
27    F 1.65
28    I 1.80
29    F 2.16
30    F 2.44
31    0 3.28
32    F 1.32
33    I 1.50
34    F 1.65
35    F 1.80
36    F 2.26
37    F 2.46
38    F 3.33
39    F 1.40
40    I 1.52
41    F 1.68
42    F 1.85
43    F 2.31
44    0 2.56
45    F 3.56
46    I 1.42
47    I 1.55
48    I 1.70
49    I 1.88
50    F 2.31
51    F 2.67
52    F 3.58
53    F 1.42
54    I 1.60
55    0 1.73
56    I 1.93
57    F 2.36
58    I 2.72
59    F 3.66
Put the package 'nnet' into service.
> library(nnet)
The R command is 'multinom.'
> allig1 <- multinom(food ~ size, data = allig)
# weights:  9 (4 variable)
initial  value 64.818125
iter  10 value 49.170785
final  value 49.170622
converged
> summary(allig1)
Call:
multinom(formula = food ~ size, data = allig)

Coefficients:
  (Intercept)       size
F    1.617952 -0.1101836
I    5.697543 -2.4654695

Std. Errors:
  (Intercept)      size
F    1.307291 0.5170838
I    1.793820 0.8996485

Residual Deviance: 98.34124
AIC: 106.3412

Correlation of Coefficients:
              F:(Intercept)     F:size I:(Intercept)
F:size           -0.9528240
I:(Intercept)     0.5905923 -0.5608696
I:size           -0.4442392  0.4637660    -0.9591611

It does not provide p-values. Let us write down the estimated model:
Pr(Food = F) = exp(1.62 - 0.11*Size)/D,
Pr(Food = I) = exp(5.70 - 2.47*Size)/D,
Pr(Food = O) = 1/D,
where D = 1 + exp(1.62 - 0.11*Size) + exp(5.70 - 2.47*Size).
Let us calculate the z-values associated with every parameter.
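Although multinom does not print p-values, the Wald z-values and the corresponding two-sided p-values can be pulled from the summary object in one shot. This is a standard recipe, not part of nnet's printed output; the component names below are those returned by summary.multinom.

s <- summary(allig1)
z <- s$coefficients / s$standard.errors       # Wald z = estimate / standard error
p <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-sided p-values
round(z, 2)
round(p, 4)

The hand calculations that follow should agree with round(z, 2).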
α1: z = Estimate/S.E. = 1.62/1.31 = 1.24
β1: z = Estimate/S.E. = -0.11/0.52 = -0.21
α2: z = Estimate/S.E. = 5.70/1.79 = 3.18
β2: z = Estimate/S.E. = -2.47/0.90 = -2.74
Interpretation. Only two estimates are significant: α2 and β2, the parameters of the Invertebrate equation.
Let us plot all three logistic curves as a function of size. We use the basic command 'curve.'
> curve(exp(1.62 - 0.11*x)/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)),
+ xlim = c(1, 4), xlab = "Size", ylab = "Probability")
> curve(exp(5.70 - 2.47*x)/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)), add = TRUE, col = "blue")
> curve(1/(1 + exp(1.62 - 0.11*x) + exp(5.70 - 2.47*x)), add = TRUE, col = "red")
> title(main = "Probability of Choice of Food as a function of Size")
> text(3, 0.7, "FISH")
> text(1.5, 0.7, "Invertebrate")
> text(3, 0.2, "Other")

[Figure: "Probability of Choice of Food as a function of Size." Three fitted probability curves, labeled FISH, Invertebrate, and Other, plotted against size from 1 to 4 meters.]

Comment on the graph. As the alligator grows bigger and bigger, it prefers fish over invertebrates.
Comment. The nnet package orders the categories of the response variable alphanumerically and takes the first one as the baseline category (i.e., the 1/D category). Here 'Other' was entered as "0", which sorts before "F" and "I" and hence serves as the baseline.
Multinomial logistic regression model and grouped data
Recall the arthritis data. There were 84 subjects in the sample. These 84 pieces of information (raw data) have been summarized into 12 counts.
Actual data on the arthritis experiment:

Sex     Treatment   Marked   Some   None   Total
Female  Active        16       5      6     27
Female  Placebo        6       7     19     32
Male    Active         5       2      7     14
Male    Placebo        1       0     10     11

Write the theoretical model:
Pr(Y = Marked) = exp(α1 + β1*Gender + β2*Treatment)/D,
Pr(Y = Some) = exp(α2 + β3*Gender + β4*Treatment)/D,
Pr(Y = None) = 1/D,
where D is the sum of the three numerators, i.e., D = 1 + exp(α1 + β1*Gender + β2*Treatment) + exp(α2 + β3*Gender + β4*Treatment).
Estimate the unknown parameters using the data. We will use the package VGAM. We need to enter the data into R in the form presented above. Each covariate is structured. Exploit this. Use the command 'gl' (generate levels).
> Gender <- gl(2, 2, 4, labels = c("Female", "Male"))
> Treatment <- gl(2, 1, 4, labels = c("Active", "Placebo"))
> Marked <- c(16, 6, 5, 1)
> Some <- c(5, 7, 2, 0)
> None <- c(6, 19, 7, 10)
> Arthritis <- data.frame(Gender, Treatment, Marked, Some, None)
> Arthritis
  Gender Treatment Marked Some None
1 Female    Active     16    5    6
2 Female   Placebo      6    7   19
3   Male    Active      5    2    7
4   Male   Placebo      1    0   10
Download the package VGAM and make it active.
> Multi <- vglm(cbind(Marked, Some, None) ~ Gender + Treatment, family =
+ multinomial, data = Arthritis)
> summary(Multi)
Call:
vglm(formula = cbind(Marked, Some, None) ~ Gender + Treatment,
    family = multinomial, data = Arthritis)

Pearson Residuals:
  log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
1         -0.0019173           -0.30625
2         -0.0672899            0.26014
3         -0.0701808            0.53312
4          0.2363841           -0.79075

Coefficients:
                        Value Std. Error    t value
(Intercept):1       1.0324454    0.45755  2.2564433
(Intercept):2      -0.0029107    0.55587 -0.0052363
GenderMale:1       -1.3784722    0.63848 -2.1589737
GenderMale:2       -1.6615071    0.86029 -1.9313430
TreatmentPlacebo:1 -2.1686535    0.59430 -3.6490942
TreatmentPlacebo:2 -1.1055147    0.67377 -1.6407881

Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 1.70347 on 2 degrees of freedom
Log-likelihood: -74.51039 on 2 degrees of freedom
Number of Iterations: 4

Check model adequacy: look at the residual deviance and the corresponding degrees of freedom.
H0: The response probabilities follow the pattern of a multinomial logistic regression model.
Technically, if the sample is large, the residual deviance has a chi-squared distribution (here on 2 degrees of freedom) if the null hypothesis is true. Let us calculate the p-value.
p-value = probability of getting a residual deviance as large as what was observed if the null hypothesis is true.
> pchisq(1.70347, 2, lower.tail = F)
[1] 0.426674
The null hypothesis cannot be rejected. The multinomial model is a good fit. Let us write down the estimated model:
Pr(Y = Marked) = exp(1.03 - 1.38*Male - 2.17*Placebo)/D,
Pr(Y = Some) = exp(0.00 - 1.66*Male - 1.11*Placebo)/D,
Pr(Y = None) = 1/D.
How do we know that the category {Y = None} is the baseline? Look at the vglm command. I had 'None' put at the end of the 'cbind' command. You can have any category be the baseline.
Look at the response probabilities as per the model.
> Multi1 <- fitted(Multi)
> Multi1
      Marked       Some      None
1 0.58437331 0.20751090 0.2081158
2 0.19443502 0.19991267 0.6056523
3 0.37299433 0.09980040 0.5272053
4 0.07073449 0.05479949 0.8744660
How do we get these?
Look at the expected frequencies as per the fitted model. How does one calculate these?
> total <- c(27, 32, 14, 11)
> Multi2 <- fitted(Multi)*total
> Multi2
      Marked      Some      None
1 15.7780793 5.6027944  5.619126
2  6.2219207 6.3972056 19.380874
3  5.2219207 1.3972056  7.380874
4  0.7780793 0.6027944  9.619126
Too many decimals? Round the numbers to 3 decimal places.
> Multi3 <- round(fitted(Multi), 3)
> Multi3
  Marked  Some  None
1  0.584 0.208 0.208
2  0.194 0.200 0.606
3  0.373 0.100 0.527
4  0.071 0.055 0.874
I want a barplot of these numbers. I want to make the above output a matrix.
> Multi4 <- as.matrix(Multi3)
> Multi4
  Marked  Some  None
1  0.584 0.208 0.208
2  0.194 0.200 0.606
3  0.373 0.100 0.527
4  0.071 0.055 0.874
There is no problem. Identify the rows more descriptively.
> rownames(Multi4) <- c("FemaleActive", "FemalePlacebo", "MaleActive",
+ "MalePlacebo")
Look at what happens.
> Multi4
              Marked  Some  None
FemaleActive   0.584 0.208 0.208
FemalePlacebo  0.194 0.200 0.606
MaleActive     0.373 0.100 0.527
MalePlacebo    0.071 0.055 0.874
Discuss what type of barplot we want.
> barplot(Multi4, beside = T, legend.text = colnames(t(Multi4)), col =
+ c("red", "blue", "green", "magenta"))
Comments. The barplot command on a matrix always gives barplots column by column. Look at the output.

[Barplot: for each response category (Marked, Some, None), four bars side by side for FemaleActive, FemalePlacebo, MaleActive, and MalePlacebo.]

Let us look at another type of barplot.
> barplot(t(Multi4), beside = T, legend.text = colnames(Multi4), col =
+ c("red", "blue", "green"))
(t = transpose of a matrix)
Add more description to the barplot.
> title(main = "Multinomial Logistic Regression of Improvement Response on
+ Gender and Treatment", xlab = "Gender Treatment Combination", ylab =
+ "Probability")

[Barplot: "Multinomial Logistic Regression of Improvement Response on Gender and Treatment." For each gender-treatment combination, three bars (Marked, Some, None) with probability on the vertical axis.]

Comments.
1. For females on the active treatment, the predominant response is marked improvement, with chances hovering around 60%.
2. For females on the placebo, the predominant response is no improvement, with chances around 60%.
3. For males on the active treatment, the predominant response is no improvement, with chances around 50%.
4. For males on the placebo, the predominant response is no improvement, with chances more than 90%.
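As a sanity check on fitted(Multi), the probabilities can be reproduced by hand from the coefficients printed by summary(Multi). A minimal sketch for the last row of Multi4, a male on the placebo:

# Linear predictors for Gender = Male, Treatment = Placebo
eta1 <- 1.0324454 - 1.3784722 - 2.1686535   # log(Pr(Marked)/Pr(None))
eta2 <- -0.0029107 - 1.6615071 - 1.1055147  # log(Pr(Some)/Pr(None))
D <- 1 + exp(eta1) + exp(eta2)
round(c(Marked = exp(eta1)/D, Some = exp(eta2)/D, None = 1/D), 3)
# 0.071 0.055 0.874, matching the MalePlacebo row of Multi4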
The fitted model is:
Pr(Improvement = None) = exp(-1.03 + 1.38*gender + 2.17*treatment)/D,
Pr(Improvement = Some) = exp(-1.03 - 0.28*gender + 1.06*treatment)/D,
Pr(Improvement = Marked) = 1/D,
where D = 1 + exp(-1.03 + 1.38*gender + 2.17*treatment) + exp(-1.03 - 0.28*gender + 1.06*treatment),
with the understanding that
gender = 1 if male, = 0 if female;
treatment = 1 if placebo, = 0 if active.
What is the rationale behind this kind of scoring?
Let us compare the empirical and model probabilities for each configuration of the factors.

   improvement  gender  treatment  Emp. Prob.  Model Prob.
1  marked       female  active       0.59        0.58
2  some         female  active       0.19        0.21
3  none         female  active       0.22        0.21
4  marked       male    active       0.36        0.37
5  some         male    active       0.14        0.10
6  none         male    active       0.50        0.53
7  marked       female  placebo      0.19        0.19
8  some         female  placebo      0.22        0.20
9  none         female  placebo      0.59        0.61
10 marked       male    placebo      0.09        0.07
11 some         male    placebo      0.00        0.05
12 none         male    placebo      0.91        0.88

Let us compare the observed and expected frequencies.

   improvement  gender  treatment  Obs. Freq.  Exp. Freq.
1  marked       female  active       16          15.66
2  some         female  active        5           5.67
3  none         female  active        6           5.67
4  marked       male    active        5           5.18
5  some         male    active        2           1.40
6  none         male    active        7           7.42
7  marked       female  placebo       6           6.40
8  some         female  placebo       7           6.40
9  none         female  placebo      19          19.52
10 marked       male    placebo       1           0.77
11 some         male    placebo       0           0.55
12 none         male    placebo      10           9.68

An eye inspection of the frequencies tells us that the model is a good summary of the data.
Table of z-values (Estimate/Standard Error):

Response  Intercept  gender  treatment
None        -2.256    2.159     3.649
Some        -2.168   -0.311     1.522

Let us have a different baseline. Let us have {Y = Marked} as the baseline.
> Multi5 <- vglm(cbind(Some, None, Marked) ~ Gender + Treatment, family =
+ multinomial, data = Arthritis)
> summary(Multi5)
Call:
vglm(formula = cbind(Some, None, Marked) ~ Gender + Treatment,
    family = multinomial, data = Arthritis)

Pearson Residuals:
  log(mu[,1]/mu[,3]) log(mu[,2]/mu[,3])
1           -0.26907           0.146256
2            0.26106          -0.063604
3            0.52039          -0.135416
4           -0.81536           0.127844

Coefficients:
                      Value Std. Error  t value
(Intercept):1      -1.03536    0.47750 -2.16828
(Intercept):2      -1.03245    0.45755 -2.25644
GenderMale:1       -0.28303    0.90900 -0.31137
GenderMale:2        1.37847    0.63848  2.15897
TreatmentPlacebo:1  1.06314    0.69859  1.52184
TreatmentPlacebo:2  2.16865    0.59430  3.64909

Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 1.70347 on 2 degrees of freedom
Log-likelihood: -11.29448 on 2 degrees of freedom
Number of Iterations: 4

> Multi6 <- fitted(Multi5)
> Multi6
        Some      None     Marked
1 0.20751090 0.2081158 0.58437331
2 0.19991267 0.6056523 0.19443502
3 0.09980040 0.5272053 0.37299433
4 0.05479949 0.8744660 0.07073449

1. The residual deviance is exactly the same.
2. The predicted probabilities are exactly the same.
3. The expected frequencies are exactly the same.
4. The choice of the baseline is yours.

Module 11: PROPORTIONAL ODDS MODEL ON R + VGAM package
The following is a data set coming from Lancaster coalmines in England. The publication of these data has revolutionized the way coalface workers are compensated for years of exposure in the coal mines. The data contain information on a sample of 371 coalface workers: how many years each was exposed, and the severity of pneumoconiosis, categorized as 1 = normal, 2 = mild pneumoconiosis, 3 = severe pneumoconiosis. The response variable is the severity of pneumoconiosis, which is clearly ordinal.
The covariate is the number of years X of exposure. For simplicity of presentation and analysis, the covariate values are divided into 8 intervals: (0, 12], (12, 18], (18, 24], (24, 30], (30, 36], (36, 42], (42, 48], (48, 54] years of exposure. The median value of the years of exposure in each interval is taken as representative of the interval. Take any interval of exposure. Identify all workers whose years of exposure fall into the interval. For that group of workers, find out how many of them are normal, how many have mild pneumoconiosis, and how many have severe pneumoconiosis.
Data:

  exposure normal mild severe
1      5.8     98    0      0
2     15.0     51    2      1
3     21.5     34    6      3
4     27.5     35    5      8
5     33.5     32   10      9
6     39.5     23    7      8
7     46.0     12    6     10
8     51.5      4    2      5

Interpretation of the data: Out of 98 coalface workers whose median exposure time is 5.8 years, all are normal. Out of 54 coalface workers whose median exposure time is 15 years, 51 are normal, two have mild pneumoconiosis, and one has severe pneumoconiosis. Etc.
This is grouped data with a ternary response variable and one quantitative covariate.
Earlier, I presented two data sets with a multi-level response variable. Let us discuss the structure of these data sets.
1. Alligator data
a. The response variable is ternary.
b. The response variable is nominal.
c. The proportional odds model cannot be entertained.
d. There is only one covariate, which is quantitative.
e. The data are presented in raw form.
f. We have fitted the multinomial logistic regression model using the packages nnet and VGAM.
2. Arthritis data
a. The response variable is ternary.
b. The response variable is ordinal.
c. Both the multinomial logistic regression and proportional odds models can be entertained.
d. There are two covariates, both categorical.
e. The data are presented in grouped form.
f. We have fitted the multinomial logistic regression model using the VGAM package.
g. What about the proportional odds model?
3. Coal mine data
a. The response variable is ternary.
b. The response variable is ordinal.
c. Both the multinomial logistic regression and proportional odds models can be entertained.
d. There is only one covariate, which is quantitative.
e. The data are, perforce, presented in grouped form.
f. We are going to fit both the multinomial logistic regression model and the proportional odds model using the VGAM package.
Download the package VGAM. The coal mine data set is available in VGAM under the name 'pneumo.' Make the package active.
> library(VGAM)
Load the data and see what it contains.
> data(pneumo)
> pneumo
  exposure.time normal mild severe
1           5.8     98    0      0
2          15.0     51    2      1
3          21.5     34    6      3
4          27.5     35    5      8
5          33.5     32   10      9
6          39.5     23    7      8
7          46.0     12    6     10
8          51.5      4    2      5
The covariate is continuous. The data on the number of years of exposure are skewed. Take logarithms (natural logarithms) to make the numbers look more like a sample from a normal distribution. Keep the same name for the data set. Transform only the exposure time. Keep all other entries intact.
> pneumo = transform(pneumo, logexpo = log(exposure.time))
Look at what we have in the new data frame.
> pneumo
  exposure.time normal mild severe  logexpo
1           5.8     98    0      0 1.757858
2          15.0     51    2      1 2.708050
3          21.5     34    6      3 3.068053
4          27.5     35    5      8 3.314186
5          33.5     32   10      9 3.511545
6          39.5     23    7      8 3.676301
7          46.0     12    6     10 3.828641
8          51.5      4    2      5 3.941582
Two models can be entertained.
Multinomial Logistic Regression Model
Pr(Severity = normal) = exp(α1 + β1*logexpo)/D,
Pr(Severity = mild) = exp(α2 + β2*logexpo)/D,
Pr(Severity = severe) = 1/D,
where D = 1 + exp(α1 + β1*logexpo) + exp(α2 + β2*logexpo).
The baseline category is taken to be 'severe.' The R command as laid out below will respect our choice. This model has four parameters.
Proportional Odds Model (Cumulative Model)
Pr(Severity = normal) = exp(α1 + β*logexpo)/[1 + exp(α1 + β*logexpo)],
Pr(Severity = normal) + Pr(Severity = mild) = exp(α2 + β*logexpo)/[1 + exp(α2 + β*logexpo)].
This model has three parameters.
We will now fit the multinomial logistic regression model. 'vglm' is the acronym for the vector generalized linear model.
> fit.multi <- vglm(cbind(normal, mild, severe) ~ logexpo, family = multinomial, pneumo)
> summary(fit.multi)
Call:
vglm(formula = cbind(normal, mild, severe) ~ logexpo, family = multinomial,
    data = pneumo)

Pearson Residuals:
                        Min       1Q    Median      3Q     Max
log(mu[,1]/mu[,3]) -0.75473 -0.20032 -0.043255 0.43644 0.92283
log(mu[,2]/mu[,3]) -0.81385 -0.43818 -0.121281 0.15716 1.03229

Coefficients:
                Value Std. Error t value
(Intercept):1 11.9751    2.00044  5.9862
(Intercept):2  3.0391    2.37607  1.2790
logexpo:1     -3.0675    0.56521 -5.4272
logexpo:2     -0.9021    0.66898 -1.3485

Number of linear predictors: 2
Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
Dispersion Parameter for multinomial family: 1
Residual Deviance: 5.34738 on 12 degrees of freedom

Let us now check on goodness-of-fit. The null hypothesis is:
H0: The response probabilities follow the multinomial logistic regression model pattern.
Let us calculate the p-value of the observed residual deviance on 12 degrees of freedom.
> pchisq(5.34738, 12, lower.tail = F)
[1] 0.945359
p-value = probability of getting a residual deviance at least as large as 5.34738 when the null hypothesis is true. How does one calculate the degrees of freedom?
The model is a good fit. We do not reject the null hypothesis. The fitted model is:
P̂r(Severity = normal) = exp(11.9751 - 3.0675*logexpo)/D,
P̂r(Severity = mild) = exp(3.0391 - 0.9021*logexpo)/D,
P̂r(Severity = severe) = 1/D,
where D = 1 + exp(11.9751 - 3.0675*logexpo) + exp(3.0391 - 0.9021*logexpo).
Let us calculate the model probabilities. The following command gives the probabilities as per the model for each exposure group.
> fitted(fit.multi)
     normal        mild      severe
1 0.9927503 0.005875947 0.001373768
2 0.9329702 0.043219077 0.023810688
3 0.8488899 0.085745054 0.065365011
4 0.7485338 0.128835331 0.122630879
5 0.6393787 0.168725388 0.191895881
6 0.5334715 0.201127232 0.265401245
7 0.4313692 0.226188995 0.342441766
8 0.3581471 0.239824757 0.402028109
The empirical probabilities are:

Exposure group  normal  mild  severe
1                1.00   0.00   0.00
2                0.94   0.04   0.02
3                0.79   0.14   0.07
4                0.73   0.10   0.17
5                0.63   0.20   0.17
6                0.61   0.18   0.21
7                0.43   0.21   0.36
8                0.36   0.18   0.45

It is more instructive to compare observed and expected frequencies. Let us go to R. Calculate the total number of coalminers in each exposure group. Open a data vector consisting of these numbers.
> total <- c(98, 54, 43, 48, 51, 38, 28, 11)
> fitted(fit.multi)*total
     normal      mild     severe
1 97.289528 0.5758428  0.1346293
2 50.380393 2.3338302  1.2857771
3 36.502267 3.6870373  2.8106955
4 35.929622 6.1840959  5.8862822
5 32.608315 8.6049948  9.7866899
6 20.271918 7.6428348 10.0852473
7 12.078339 6.3332919  9.5883694
8  3.939618 2.6380723  4.4223092
Compare these numbers with the observed frequencies. The comparison is tabulated at the end of this module. We now fit the proportional odds model.
ln[Pr(Severity = normal)/(Pr(Severity = mild) + Pr(Severity = severe))] = α1 + β*logexpo,
ln[(Pr(Severity = normal) + Pr(Severity = mild))/Pr(Severity = severe)] = α2 + β*logexpo.
The log odds of the best condition versus the worst two are a linear function of the lone covariate. The log odds of the best two conditions versus the worst one are also a linear function of the lone covariate. Further, these two linear functions are parallel (same slope). There are other ways to write the proportional odds model, as we shall see later. All are equivalent.
> fit.prop <- vglm(cbind(normal, mild, severe) ~ logexpo, propodds, data = pneumo, trace = T)
> summary(fit.prop)
Call:
vglm(formula = cbind(normal, mild, severe) ~ logexpo, family = propodds,
    data = pneumo, trace = T)

Pearson Residuals:
                    Min       1Q   Median       3Q    Max
logit(P[Y>=2]) -0.77144 -0.30860 -0.14410 0.071638 1.2479
logit(P[Y>=3]) -0.50476 -0.33528 -0.30926 0.184145 1.0444

Coefficients:
                 Value Std. Error t value
(Intercept):1  -9.6761     1.3241 -7.3078
(Intercept):2 -10.5817     1.3454 -7.8649
logexpo         2.5968     0.3811  6.8139

Number of linear predictors: 2
Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
Dispersion Parameter for cumulative family: 1
Residual Deviance: 5.02683 on 13 degrees of freedom
Log-likelihood: -25.09026 on 13 degrees of freedom
Number of Iterations: 4

We have to be very careful in writing the estimated model. Look at the output carefully. In the model statement, we see 'cbind(normal, mild, severe).' These responses are labeled in the output as 1, 2, and 3, respectively. The output says it is providing estimates of the parameters in logit Pr(Y ≥ 2) and logit Pr(Y ≥ 3). What does that mean? Understand what logit means.
Estimate of logit Pr(Y ≥ 2) = ln[(Pr(Y = mild) + Pr(Y = severe))/Pr(Y = normal)] = -9.6761 + 2.5968*logexpo.
From this, we get
Pr(Y = mild) + Pr(Y = severe) = exp(-9.6761 + 2.5968*logexpo)/[1 + exp(-9.6761 + 2.5968*logexpo)].
In addition, estimate of logit Pr(Y ≥ 3) = ln[Pr(Y = severe)/(Pr(Y = normal) + Pr(Y = mild))] = -10.5817 + 2.5968*logexpo.
From this, we get
Pr(Y = severe) = exp(-10.5817 + 2.5968*logexpo)/[1 + exp(-10.5817 + 2.5968*logexpo)].
These expressions did not come out as we planned originally. Never mind! We can look at the predicted probabilities as per this model.
> fitted(fit.prop)
     normal        mild      severe
1 0.9940077 0.003560995 0.002431267
2 0.9336285 0.038433830 0.027937700
3 0.8467004 0.085093969 0.068205607
4 0.7445575 0.133635126 0.121807334
5 0.6358250 0.176154091 0.188020946
6 0.5323177 0.205582603 0.262099709
7 0.4338530 0.220783923 0.345363119
8 0.3636788 0.222017003 0.414304232
We can look at the expected frequencies as per this model.
> total <- c(98, 54, 43, 48, 51, 38, 28, 11)
> fitted(fit.prop)*total
     normal      mild    severe
1 97.412758 0.3489776 0.2382642
2 50.415937 2.0754268 1.5086358
3 36.408118 3.6590407 2.9328411
4 35.738762 6.4144861 5.8467520
5 32.427073 8.9838586 9.5890683
6 20.228072 7.8121389 9.9597889
7 12.147883 6.1819498 9.6701673
8  4.000466 2.4421870 4.5573466
See how close these expected frequencies are to the observed frequencies. Common sense tells me that the fit must be excellent.
We can reverse the responses in the 'cbind' input. We will have the same model in a different guise.
> fit.prop <- vglm(cbind(severe, mild, normal) ~ logexpo, propodds, data =
+ pneumo, trace = T)
> summary(fit.prop)
Call:
vglm(formula = cbind(severe, mild, normal) ~ logexpo, family = propodds,
    data = pneumo, trace = T)

Pearson Residuals:
                   Min        1Q  Median      3Q     Max
logit(P[Y>=2]) -1.0444 -0.184145 0.30926 0.33528 0.50476
logit(P[Y>=3]) -1.2479 -0.071638 0.14410 0.30860 0.77144

Coefficients:
                Value Std. Error t value
(Intercept):1 10.5817     1.3454  7.8649
(Intercept):2  9.6761     1.3241  7.3078
logexpo       -2.5968     0.3811 -6.8139

Number of linear predictors: 2
Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
Dispersion Parameter for cumulative family: 1
Residual Deviance: 5.02683 on 13 degrees of freedom
Log-likelihood: -25.09026 on 13 degrees of freedom
Number of Iterations: 4

Interpretation of the output and the estimated model: In the model statement, we see 'cbind(severe, mild, normal).' These responses are labeled in the output as 1, 2, and 3, respectively. The output says it is providing estimates of the parameters in logit Pr(Y ≥ 2) and logit Pr(Y ≥ 3). What does that mean? Note which intercept goes with which logit: (Intercept):1 = 10.5817 belongs to logit Pr(Y ≥ 2), and (Intercept):2 = 9.6761 belongs to logit Pr(Y ≥ 3).
Estimate of logit Pr(Y ≥ 2) = ln[(Pr(Y = mild) + Pr(Y = normal))/Pr(Y = severe)] = 10.5817 - 2.5968*logexpo.
From this, we get
Pr(Y = mild) + Pr(Y = normal) = exp(10.5817 - 2.5968*logexpo)/[1 + exp(10.5817 - 2.5968*logexpo)].
In addition, estimate of logit Pr(Y ≥ 3) = ln[Pr(Y = normal)/(Pr(Y = severe) + Pr(Y = mild))] = 9.6761 - 2.5968*logexpo.
From this, we get
Pr(Y = normal) = exp(9.6761 - 2.5968*logexpo)/[1 + exp(9.6761 - 2.5968*logexpo)].
The predicted probabilities and other entities are exactly the same.
Let us now check on goodness-of-fit. The null hypothesis is:
H0: The response probabilities follow the proportional odds model pattern.
Let us calculate the p-value.
> pchisq(5.02683, 13, lower.tail = F)
[1] 0.9746078
The model is a good fit. We do not reject the null hypothesis. The fitted model, using the second run of the command, is:
P̂r(Severity = normal) = exp(9.6761 - 2.5968*logexpo)/[1 + exp(9.6761 - 2.5968*logexpo)],
P̂r(Severity = normal) + P̂r(Severity = mild) = exp(10.5817 - 2.5968*logexpo)/[1 + exp(10.5817 - 2.5968*logexpo)].
A commentary on the models: We have two models which are good fits. We prefer the tighter model, the one with fewer parameters. Our choice is the Proportional Odds model. There is another reason for choosing this model: every parameter in it is significant. Why? In the case of the multinomial logistic regression model, some parameters are not significant.
Additional comments. The Multinomial Logistic model has 4 parameters, whereas the Proportional Odds model (the cumulative model with parallelism) has 3. We now tabulate the observed and expected frequencies for each of the fitted models.
> MB <- data.frame(pneumo, round(fitted(fit.multi)*total, 2), round(fitted(fit.prop)*total, 2))

Exposure   Observed          Expected (Multinomial)    Expected (Prop. odds)
group      norm. mild sev.   norm.  mild  sev.         norm.  mild  sev.
1            98    0    0    97.29  0.58   0.13        97.41  0.35  0.24
2            51    2    1    50.38  2.33   1.29        50.42  2.08  1.51
3            34    6    3    36.50  3.69   2.81        36.41  3.66  2.93
4            35    5    8    35.93  6.18   5.89        35.73  6.41  5.85
5            32   10    9    32.61  8.60   9.79        32.43  8.98  9.59
6            23    7    8    20.27  7.64  10.09        20.23  7.81  9.96
7            12    6   10    12.08  6.33   9.59        12.15  6.18  9.67
8             4    2    5     3.94  2.64   4.42         4.00  2.44  4.56

Final word. The Multinomial model carries the entire data on four numbers, whereas the Proportional Odds model does it on three!
Final, final word! Does the covariate have a significant impact on the response variable? Which model is in a good position to answer this question?
Let us mull over this.
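One way to mull it over in R, sketched here under the assumption that fit.prop (the second run, with 'cbind(severe, mild, normal)') is still in the workspace; fit.null is a name introduced just for this comparison. Drop logexpo from the proportional odds model and compare residual deviances; the difference is a likelihood ratio statistic.

# Intercept-only proportional odds model (no covariate)
fit.null <- vglm(cbind(severe, mild, normal) ~ 1, propodds, data = pneumo)
# Difference in residual deviances: chi-squared on 1 degree of
# freedom if logexpo has no effect on severity
lr <- deviance(fit.null) - deviance(fit.prop)
pchisq(lr, df = 1, lower.tail = FALSE)
# A tiny p-value would say that years of exposure matter.

VGAM also has an lrtest( ) function that automates such comparisons.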