LOGISTIC REGRESSION (Chapter 20)

Example - High Dieldrin Levels in Western Australian Breast Feeding Mothers
Data File: Pestmilk.JMP

These data come from a study of breast feeding mothers in Western Australia in 1979-80. Earlier research had discovered surprisingly high levels of pesticides in human breast milk. The research conducted in 1979-80 hoped to show that the levels had decreased as a result of stricter government regulations on the use of pesticides on food crops. The researchers did find decreases for several types of pesticides. Levels of the pesticide Dieldrin, however, had substantially increased. These data were collected in the hope of explaining why.

For 45 breast milk donors, we have information on the mother's age in years, whether she lived in a new suburb (0 = no, 1 = yes), whether her house had been treated for termites within the past three years (0 = no, 1 = yes), and whether her breast milk contained above average (> .009 ppm) levels of the pesticide Dieldrin. By law, new houses are treated for termites in Australia.

The variables in the Pestmilk.JMP data file are:

age - age of mother (yrs.)
ns - new suburb indicator (1 = yes, 0 = no)
ht - house treated for termites in the last 3 years (1 = yes, 0 = no)
hd - high Dieldrin level (1 = yes, 0 = no)
New Sub - new suburb status (New or Old)
Treated - termite treatment status (HT = house treated or NT = not treated)
High Dieldrin - Dieldrin level (High or Low)

Important JMP Note: For interpretation purposes it is best to code the outcome so that the adverse outcome is alphabetically first. The same is true for risk factors: code them so that the level associated with increased risk is alphabetically first.

One way to examine the relationship between the response (High Dieldrin) and the predictors (age, New Sub & Treated) is to construct 2 X 2 contingency tables for the two-level predictors and compute conditional probabilities, relative risks, and odds ratios.
The tables and plots below were obtained in JMP by using Fit Y by X and placing each of the predictors (New Sub & Treated) in the X box and the response (High Dieldrin) in the Y box.

The plots and the contingency tables, with the conditional probabilities (row percentages) added, suggest that both living in a new suburb (New) and living in a home treated for termites (HT) are associated with an increased risk of having high dieldrin levels in breast milk.

Contingency Analysis of High Dieldrin By Treated

[Mosaic plot: proportion of High and Low Dieldrin for Treated = HT and NT]

Count (Row %)    High          Low           Total
HT               13 (54.17)    11 (45.83)    24
NT                3 (15.79)    16 (84.21)    19
Total            16            27            43

$$OR = \frac{13 \times 16}{3 \times 11} = 6.30$$

The odds of having high dieldrin levels in breast milk are 6.30 times as large for mothers living in a home treated for termites as for mothers living in homes not treated for termites.

Contingency Analysis of High Dieldrin By New Sub

[Mosaic plot: proportion of High and Low Dieldrin for New Sub = New and Old]

Count (Row %)    High          Low           Total
New               7 (58.33)     5 (41.67)    12
Old               9 (29.03)    22 (70.97)    31
Total            16            27            43

$$OR = \frac{7 \times 22}{9 \times 5} = 3.42$$

The odds of having high dieldrin levels in breast milk are 3.42 times as large for mothers living in a new suburb as for mothers living in an older suburb.

Logistic Regression Model

In logistic regression we model the log of the odds for success as a linear function of the predictors. For example, consider the logistic regression model for the risk factor New Suburb:
$$\ln(\text{odds for high dieldrin}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb}$$

where

$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in new suburb} \\ -1 & \text{if mother lives in old suburb} \end{cases}$$

The log odds for a breast feeding mother living in a new suburb is given by

$$\ln(\text{odds for High for mothers living in a new suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1$$

and for a mother living in an old suburb is given by

$$\ln(\text{odds for High for mothers living in an old suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 - \beta_1$$

The difference in the log odds is equivalent to the log of the odds ratio (OR) because of the following property of logarithms:

$$\ln(x) - \ln(y) = \ln\left(\frac{x}{y}\right)$$

Applying this property here we have

$$\ln(\text{odds for High, new suburb}) - \ln(\text{odds for High, old suburb}) = \ln\left(\frac{\text{odds for High, new suburb}}{\text{odds for High, old suburb}}\right) = (\beta_0 + \beta_1) - (\beta_0 - \beta_1) = 2\beta_1$$

This says that the OR associated with living in a new suburb is given by

$$OR = e^{2\beta_1}$$

Fitting the New Suburb Logistic Regression Model in JMP

Select Fit Model and place High Dieldrin in the Y box and New Suburb in the Model Effects box.

Resulting output…

The estimated OR associated with living in a new suburb is then $\widehat{OR} = e^{2\hat\beta_1}$.

We can use JMP to compute the ORs by selecting Nominal Logistic… > Odds Ratio.

Similarly, for House Treated we have the following logistic regression model, with Treated coded +1 for a treated home and -1 for an untreated home:

$$\ln\left(\frac{p}{1-p}\right) = -.7535 + .9205\,\text{Treated}$$

Finding Predicted Probabilities

The logistic regression model can be used to estimate the probability of “success” given a set of predictor values as follows:

$$\hat{p} = P(\text{success} \mid X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}}$$

for situations where we have a single predictor, and

$$\hat{p} = P(\text{success} \mid X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}$$

for situations where we have p predictors.

For the House Treated model we can estimate the probability of high dieldrin levels for women living in treated and untreated homes as follows:

$$P(\text{High} \mid \text{House Treated}) = \frac{e^{-.7535 + .9205}}{1 + e^{-.7535 + .9205}} = .5417$$

$$P(\text{High} \mid \text{House Not Treated}) = \frac{e^{-.7535 - .9205}}{1 + e^{-.7535 - .9205}} = .1579$$

How do these estimated probabilities compare to those we obtain by using a 2 X 2 contingency table?
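The question above can be checked numerically. The following is a minimal Python sketch (not part of the JMP workflow) that plugs the House Treated estimates quoted above into the inverse-logit formula and compares the results with the row proportions and odds ratio from the High Dieldrin by Treated contingency table. The function name `inv_logit` and the variable names are illustrative choices, and the coefficients are the rounded values from the text.

```python
import math

# Counts from the High Dieldrin by Treated contingency table.
ht_high, ht_low = 13, 11   # treated homes (HT)
nt_high, nt_low = 3, 16    # untreated homes (NT)

# Rounded estimates quoted in the text; Treated is coded +1 (HT) / -1 (NT).
b0, b1 = -0.7535, 0.9205

def inv_logit(eta):
    """Convert a log odds value into a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Predicted probabilities from the logistic model.
p_ht_model = inv_logit(b0 + b1)
p_nt_model = inv_logit(b0 - b1)

# Conditional probabilities (row proportions) from the table.
p_ht_table = ht_high / (ht_high + ht_low)
p_nt_table = nt_high / (nt_high + nt_low)

# Odds ratio from the model (exp(2 * b1)) and from the table cross-product.
or_model = math.exp(2 * b1)
or_table = (ht_high * nt_low) / (nt_high * ht_low)

print(f"P(High|HT): model {p_ht_model:.4f}, table {p_ht_table:.4f}")
print(f"P(High|NT): model {p_nt_model:.4f}, table {p_nt_table:.4f}")
print(f"OR:         model {or_model:.2f}, table {or_table:.2f}")
```

With a single two-level predictor the logistic model has as many parameters as the table has rows, so the fitted probabilities match the row proportions up to rounding of the coefficients.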
We now consider the age effect. Again select Fit Y by X from the Analyze menu and place High Dieldrin in the Y box and age in the X box. The resulting output is given below.

Logistic Fit of High Dieldrin By age

[Logistic plot: fitted curve of P(High) against age (20 to 35 years); the region below the curve is High and the region above it is Low]

Whole Model Test
Model         -LogLikelihood    DF    ChiSquare    Prob>ChiSq
Difference          0.924219     1     1.848438        0.1740
Full               27.458371
Reduced            28.382590

RSquare (U)                   0.0326
Observations (or Sum Wgts)    43
Converged by Gradient

Parameter Estimates
Term          Estimate       Std Error    ChiSquare    Prob>ChiSq
Intercept    -4.0886156      2.7245511         2.25        0.1334
age           0.12223765     0.0922011         1.76        0.1849
For log odds of High/Low

The logistic model using age as a predictor is given by

$$\ln\left(\frac{p}{1-p}\right) = -4.0886 + .1222\,\text{Age}$$

Note: The response in logistic regression is the natural log of the odds for “success”.

The blue curve added to the plot gives P(High|Age) = p. For example, for mothers 25 years of age the predicted probability of finding a high dieldrin level in breast milk is about .25. For mothers 35 years of age this probability increases to around .50. The distance from the top of the plot to the curve represents P(Low|Age).

To attach an odds ratio to mother's age we need to pick an incremental increase of interest, e.g. suppose we wanted to find the odds ratio associated with a 5-year increase in age. The associated odds ratio is found as follows:

$$OR \text{ for 5-year increase in age} = e^{5 \times .122} = 1.84$$

Thus for a 5-year increase in age a mother's odds for having high dieldrin are multiplied by 1.84, or alternatively there is an 84% increase in her odds for having high dieldrin levels in her breast milk.

Predicted Probabilities for Logistic Model Using Age

We can use the logistic regression model to obtain predicted probabilities of high dieldrin levels as a function of age by using

$$P(\text{High} \mid \text{Age}) = \frac{e^{-4.089 + .1222\,\text{Age}}}{1 + e^{-4.089 + .1222\,\text{Age}}}$$

For example,

$$P(\text{High} \mid \text{Age} = 25) = \frac{e^{-4.089 + .1222(25)}}{1 + e^{-4.089 + .1222(25)}} = .2623$$

$$P(\text{High} \mid \text{Age} = 35) = \frac{e^{-4.089 + .1222(35)}}{1 + e^{-4.089 + .1222(35)}} = .5469$$

Multiple Logistic Regression Model

Now we consider a logistic regression model that uses all three predictors:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb} + \beta_2\,\text{Treated} + \beta_3\,\text{Age}$$

where

$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in new suburb} \\ -1 & \text{if mother lives in old suburb} \end{cases}$$

$$\text{Treated} = \begin{cases} +1 & \text{if mother lives in a home treated for termites} \\ -1 & \text{if mother lives in a home not treated for termites} \end{cases}$$

Age = mother's age in years

Select Fit Model from the Analyze menu and put the high dieldrin indicator in the Y box and Age, HT, and New Sub in the Effects in Model box as shown at the top of the following page. The resulting output is shown below.

The Whole Model Test is testing
$H_0$: The logistic model is NOT useful.
$H_a$: The logistic model is useful.
The p-value = .0013, so here we have evidence to suggest that the model is useful for explaining the presence of high dieldrin levels in a mother's breast milk.

The Lack of Fit test is testing
$H_0$: The model is adequate.
$H_a$: The model is inadequate, i.e. there is lack of fit.
The p-value = .2220, so there is no evidence of lack of fit.

The Parameter Estimates and Effect Wald Tests both contain the results of tests that are used to assess the significance of the predictors in the logistic model. Here we see that both the new suburb and house treated indicators are statistically significant at the .05 level, while mother's age is significant at the .10 level.

Finding ORs associated with the predictors

For a dichotomous (two-level) categorical predictor, e.g. new suburb and house treated, we find the associated OR as follows:

$$OR \text{ associated with risk factor } i = \exp(2\hat\beta_i) = e^{2\hat\beta_i}$$
Examples:
For New Suburb we have: $\exp(2 \times 1.0703) = 8.50$
For House Treated we have: $\exp(2 \times 1.2984) = 13.42$

To find a crude 95% CI for the OR associated with risk factor i we compute

$$\exp\left(2\left(\hat\beta_i \pm (\text{normal or t-table value}) \times SE(\hat\beta_i)\right)\right)$$

which gives lower and upper confidence limits for the true OR associated with the risk factor.

Examples:
For New Suburb we have: $\exp(2(1.0703 \pm 1.96 \times .4678)) = (1.359,\ 53.22)$
For House Treated we have: $\exp(2(1.2984 \pm 1.96 \times .4873)) = (1.986,\ 90.65)$

These intervals are very wide because the sample size (n = 45) is not very big. Typically these types of studies require a larger sample size to get precise CIs for ORs.

We can obtain both the ORs and their confidence intervals using JMP by selecting the following options:

Odds Ratios - calculates the odds ratios for all predictors in the model.
Confidence Intervals - provides CIs for the odds ratios, calculated using a method slightly different from the approach above.
ROC Curve - draws an ROC curve, which is shown and discussed later in the handout. (Professional JMP only!)

The resulting output is shown on the following page.

Multiple Logistic Regression Model

The ORs associated with living in a home treated for termites and living in a new suburb are considerably larger than those found when examining their effects independently. The differences arise because the factors themselves are potentially related, and as a result their estimated effects when placed in a model jointly differ from their separate estimates.

The odds ratio reported for age is found by using Max(Age) - Min(Age) as the incremental increase. For these data Max(Age) = 37 and Min(Age) = 21, thus a mother who is 37 has odds of high dieldrin levels in her breast milk that are 28.055 times those of a mother who is 21 years of age. It is better to use an increment like 5 years instead, i.e. the OR associated with a 5-year increase in age is calculated as follows:

$$OR = \exp(.2083 \times 5) = \exp(1.042) = 2.833$$
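The hand calculations in this section can be reproduced with a few lines of Python. This is a minimal sketch, not JMP's method: the helper name `or_and_ci` is hypothetical, the estimates and standard errors are the rounded values quoted above, and the factor of 2 reflects the +1/-1 coding of the two-level predictors.

```python
import math

def or_and_ci(beta, se, z=1.96, scale=2.0):
    """Return (OR, lower, upper) using the crude Wald interval
    exp(scale * (beta +/- z * se)); scale=2 matches +1/-1 coding."""
    point = math.exp(scale * beta)
    lower = math.exp(scale * (beta - z * se))
    upper = math.exp(scale * (beta + z * se))
    return point, lower, upper

# Estimates and standard errors as quoted in the text.
new_sub = or_and_ci(1.0703, 0.4678)
treated = or_and_ci(1.2984, 0.4873)

# Age is numeric, so the OR depends on the chosen increment.
age_5yr = math.exp(0.2083 * 5)            # 5-year increase
age_range = math.exp(0.2083 * (37 - 21))  # Max(Age) - Min(Age); close to JMP's 28.055,
                                          # which uses the unrounded coefficient

for name, (point, lo, hi) in [("New Suburb", new_sub), ("House Treated", treated)]:
    print(f"{name}: OR = {point:.2f}, 95% CI = ({lo:.3f}, {hi:.2f})")
print(f"Age, 5-year increase: OR = {age_5yr:.3f}")
print(f"Age, 16-year range:   OR = {age_range:.2f}")
```

Changing `z` to a t-table value, or `scale` to 1 for a 0/1 coded predictor, adapts the same helper to other settings.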
As stated previously, the confidence intervals for all of the ORs are quite broad in this study because the sample size is small (n = 45).

Predicted Probabilities Using All Available Predictors

The predicted probabilities of high dieldrin can be found as follows:

$$P(\text{High Dieldrin} \mid \text{House Treated, New Suburb, Age}) = \frac{e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + .2084\,\text{Age}}}{1 + e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + .2084\,\text{Age}}}$$

For example, the probability of high dieldrin for a 30 year old mother living in a home treated for termites in an old suburb is estimated to be:

$$P(\text{High} \mid \text{Old Suburb, House Treated, Age} = 30) = \frac{e^{-6.604 - 1.070 + 1.298 + .2084(30)}}{1 + e^{-6.604 - 1.070 + 1.298 + .2084(30)}} = .4690$$

For a 25 year old mother living in a home treated for termites located in a new suburb the probability of high dieldrin is estimated to be:

$$P(\text{High} \mid \text{New Suburb, House Treated, Age} = 25) = \frac{e^{-6.604 + 1.070 + 1.298 + .2084(25)}}{1 + e^{-6.604 + 1.070 + 1.298 + .2084(25)}} = .7259$$

Estimates of P(High Dieldrin|New Suburb, House Treated, Age) Using Professional Version of JMP (FYI)

Selecting Save Probability Formula from the Nominal Logistic Fit pull down menu places the predicted probabilities of high and low dieldrin levels in the spreadsheet along with the predicted status. The predicted status is whichever outcome has the larger predicted probability, low dieldrin level or high dieldrin level, given the mother's characteristics. A portion of this output appears back in the original data spreadsheet.

[Data table excerpt: saved columns P(Low|X) and P(High|X) with the predicted status]

We can compare the predicted dieldrin status to the actual status via a contingency table. Select Fit Y by X from the Analyze menu and place Most Likely High Dieldrin in the X box and High Dieldrin in the Y box. The table and mosaic plot are shown below.
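The predicted probabilities and the most-likely classification rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than JMP's saved formula: the function names `p_high` and `most_likely` are hypothetical, the coefficients are the rounded values from the text, and the two-level predictors use the +1/-1 coding.

```python
import math

# Rounded coefficient estimates quoted in the text.
B0, B_NEW, B_TREATED, B_AGE = -6.604, 1.070, 1.298, 0.2084

def p_high(new_suburb, treated, age):
    """Predicted P(High Dieldrin); new_suburb and treated are booleans
    that are converted to the +1 / -1 coding used by the model."""
    ns = 1 if new_suburb else -1
    ht = 1 if treated else -1
    eta = B0 + B_NEW * ns + B_TREATED * ht + B_AGE * age  # log odds
    return 1.0 / (1.0 + math.exp(-eta))                   # inverse logit

def most_likely(new_suburb, treated, age):
    """Predicted status: whichever of High or Low has the larger probability."""
    return "High" if p_high(new_suburb, treated, age) > 0.5 else "Low"

# The two worked examples from the text.
print(round(p_high(False, True, 30), 4), most_likely(False, True, 30))
print(round(p_high(True, True, 25), 4), most_likely(True, True, 25))
```

Because P(Low|X) = 1 - P(High|X), choosing the larger of the two probabilities is the same as comparing P(High|X) with .5.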
Contingency Analysis of High Dieldrin By MostLikely High Dieldrin

[Mosaic plot: actual High Dieldrin status (High/Low) by MostLikely High Dieldrin (predicted High, predicted Low)]

Count (Row %)      Actual High    Actual Low    Total
Predicted High     11 (73.33)      4 (26.67)    15
Predicted Low       5 (17.86)     23 (82.14)    28
Total              16             27            43

From the table we see that 26.7% of mothers classified as having high dieldrin levels actually had low dieldrin levels. Similarly, 17.9% of those classified as having low dieldrin levels actually had high dieldrin levels. In total 9 out of 43 mothers were misclassified, for an estimated overall error rate of 20.9%.

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic curve plots the true positive probability (sensitivity) vs. the false positive probability (1 - specificity). As the sensitivity increases the false positive rate increases, as expected. A good classification rule based upon a logistic model should have an area beneath the ROC curve of .90 or higher. Here we do not quite meet that standard.

[ROC curve: True Positive (Sensitivity) against False Positive (1 - Specificity)]

Area Under ROC Curve = 0.83449

Example 2: Risk Factors for Low Birth Weight

These data come from a case-control study where risk factors for having an infant with low birth weight (< 2500g) were studied. The following information was recorded for each mother in the study: (Data File: LowBirth)

Low Birth Weight - indicator of birth weight status (Low or Normal)
Prev? - previous history of premature labor (History or None)
Hyper - hypertension during pregnancy (HT or Normal)
Smoke - mother smoked during pregnancy (Cig or No Cig)
Uterine - uterine irritability during pregnancy (Irritation or None)
Minority - minority status of mother (Nonwhite or White)
Age - age of mother
Lwt - mother's weight at last menstrual cycle

Important JMP Note: For interpretation purposes it is best to code the outcome so that the adverse outcome is alphabetically first. The same is true for risk factors: code them so that the level associated with increased risk is alphabetically first.

To fit the multiple logistic regression model select Analyze > Fit Model and set up the dialog box as shown below.

After using backward elimination to remove non-significant predictors, here uterine irritability and mother's age, we have the following.

The only predictor which represents something a mother could control or change is smoking during pregnancy. This is the primary factor of interest in this study and the other factors, while interesting, are there for control purposes only. In summarizing the effect of smoking we would use a phrase such as: “adjusting for age, pre-pregnancy weight, race, hypertension, uterine irritability, and previous history of premature labor we find the OR associated with smoking is OR = 2.66.” This says that, after adjusting for these factors, the odds for having a low birth weight infant are 2.66 times as large for mothers who smoked during pregnancy as for mothers who did not.