Handout 15 – Introduction to Logistic Regression

This handout covers material found in Section 13.8 of your text. You may also want to review regression techniques in Chapter 11.

These data are taken from the text “Applied Logistic Regression” by Hosmer and Lemeshow. Researchers are interested in the relationship between age and the presence or absence of evidence of coronary heart disease (CHD).

The smooth is an estimate of:

\[ E(CHD \mid Age) = P(CHD = 1 \mid Age) \]

Why?

Expectation of a Bernoulli Random Variable

Let $\theta(Age_i)$ denote the probability of having CHD for a given $Age_i$. Note that $CHD_i \mid Age_i$ is a Bernoulli random variable with the following probability distribution:

    CHD_i | Age_i         0                1
    P(CHD_i | Age_i)      1 - θ(Age_i)     θ(Age_i)

We can find the expected value of the Bernoulli random variable as follows:

\[ E(CHD_i \mid Age_i) = 0 \cdot [1 - \theta(Age_i)] + 1 \cdot \theta(Age_i) = \theta(Age_i) \]

How do we develop a parametric model for a dichotomous response like CHD using the age of the person as the covariate? We might try a linear regression model with Age as the predictor and CHD as the response. Before we do this in SAS, consider our linear regression model:

\[ CHD_i \mid Age_i = \eta_0 + \eta_1 Age_i + e_i, \qquad \text{where } CHD_i \mid Age_i = \begin{cases} 0 & \text{in the absence of CHD} \\ 1 & \text{in the presence of CHD.} \end{cases} \]

Note that the mean function is given by $E(CHD_i \mid Age_i) = \eta_0 + \eta_1 Age_i$.

As we saw above, the expected value of the Bernoulli random variable is

\[ E(CHD_i \mid Age_i) = 0 \cdot [1 - \theta(Age_i)] + 1 \cdot \theta(Age_i) = \theta(Age_i). \]

Why is this important? This shows that (see above)

\[ E(CHD_i \mid Age_i) = \eta_0 + \eta_1 Age_i = \theta(Age_i) = P(CHD_i = 1 \mid Age_i). \]

That is, the regression line gives an estimate of the probability of having CHD for a given Age.

In SAS (using file CHD.sas):

    proc reg;
      model CHD = age;
      plot CHD*age;
    run;

Some Problems that Arise Using this Model

1. Non-normality of the error terms. Only two different error terms are possible for each $Age_i$: $-\theta(Age_i)$ if the response is 0, and $1 - \theta(Age_i)$ if the response is 1.

2. Non-constant variance of the error terms.
Since $CHD_i \mid Age_i$ is a Bernoulli random variable, we know that

\[ Var(CHD_i \mid Age_i) = \theta(Age_i) \times [1 - \theta(Age_i)]. \]

This then implies that

\[ Var(CHD_i \mid Age_i) = [\eta_0 + \eta_1 Age_i] \times [1 - (\eta_0 + \eta_1 Age_i)]. \]

That is, the variance function varies with Age and is NOT constant.

3. Constraints on the response function. A linear representation permits estimates or predictions outside the range 0 to 1, which is not correct when modeling probabilities. For example, what is our estimate of θ(Age = 20) if we use a linear regression model?

Comment: The constraint that the mean function fall between 0 and 1 frequently rules out a linear response function. For our CHD example, the use of a linear response function might require us to assume a probability of 0 for the mean response for all individuals beneath a certain age and a probability of 1 for all individuals over a certain age (see below). Such a model is often considered unreasonable, however. Ideally, we’d like to find a model where the probabilities 0 and 1 are reached asymptotically. One such model is the logistic regression model.

Recall:

The Simple Logistic Mean Function

We parameterize this model as follows:

\[ E(y_i \mid x_i) = \theta(x_i) = \frac{\exp(\eta_0 + \eta_1 x_i)}{1 + \exp(\eta_0 + \eta_1 x_i)}. \]

Some examples of simple logistic mean functions are shown below:

[Figure: two panels of logistic mean functions, titled “With η0 = 0” and “With η1 = -1”]

Comments:

1. The logistic mean function is always between 0 and 1.
2. As η1 increases, the function becomes more S-shaped; therefore, the function changes more rapidly in the center.
3. When η1 is positive, the function is monotone increasing; when η1 is negative, the function is monotone decreasing.
4. Changing η0 shifts the function horizontally.
5. The logistic function possesses the property of symmetry. If the response variable is recoded by changing 1s to 0s and 0s to 1s, the signs of all coefficients will be reversed.
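The comments above can be checked numerically. The following is a minimal R sketch, not part of the original handout; the helper name logistic_mean and the grid of x values are assumptions chosen only for illustration.

```r
# Logistic mean function: theta(x) = exp(eta0 + eta1*x) / (1 + exp(eta0 + eta1*x)).
# logistic_mean is a hypothetical helper name, not something defined in the handout.
logistic_mean <- function(x, eta0, eta1) {
  lin <- eta0 + eta1 * x
  exp(lin) / (1 + exp(lin))
}

x <- seq(-6, 6, by = 0.5)

# Comment 1: the values stay strictly between 0 and 1.
theta <- logistic_mean(x, eta0 = 0, eta1 = 1)
range(theta)

# Comment 3: a positive slope gives an increasing curve, a negative slope a decreasing one.
increasing <- all(diff(logistic_mean(x, 0,  1)) > 0)
decreasing <- all(diff(logistic_mean(x, 0, -1)) < 0)

# Comment 5: recoding 1s as 0s (modeling 1 - theta) flips the sign of both coefficients.
symmetric <- isTRUE(all.equal(1 - logistic_mean(x, 0.5, 1),
                              logistic_mean(x, -0.5, -1)))
c(increasing, decreasing, symmetric)
```

Plotting logistic_mean(x, 0, eta1) for several values of eta1 should reproduce the qualitative shapes described in comments 2 and 4.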
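To fit the logistic regression model in SAS, you can use the following programming statements:

    ods html;
    ods graphics on;
    proc logistic descending;
      model CHD = age / link=logit;
      graphics estprob;
    run;
    ods graphics off;
    ods html close;

This curve is a plot of:

\[ \hat{P}(CHD \mid Age_i) = \hat{\theta}(Age_i) = \frac{\exp(\hat{\eta}_0 + \hat{\eta}_1 Age_i)}{1 + \exp(\hat{\eta}_0 + \hat{\eta}_1 Age_i)} \]

Questions:

1. Based on the plot, find $\hat{\theta}(40) = \hat{P}(CHD \mid Age = 40)$.
2. Based on the plot, find $\hat{\theta}(60) = \hat{P}(CHD \mid Age = 60)$.

Interpreting the Model Parameters

Mean function:

\[ E(CHD \mid Age_i) = \theta(Age_i) = \frac{\exp(\eta_0 + \eta_1 Age_i)}{1 + \exp(\eta_0 + \eta_1 Age_i)}. \]

Fitted Model Equation (or Fitted Probabilities):

\[ \hat{E}(CHD \mid Age_i) = \hat{\theta}(Age_i) = \frac{\exp(\hat{\eta}_0 + \hat{\eta}_1 Age_i)}{1 + \exp(\hat{\eta}_0 + \hat{\eta}_1 Age_i)} \]

Note that in the mean function, the probabilities θ(Age_i) are nonlinear functions of η0 and η1. However, a simple transformation results in a linear model. That is, we can show the following:

\[ \ln\left( \frac{\theta(Age_i)}{1 - \theta(Age_i)} \right) = \eta_0 + \eta_1 Age_i \]

Proof of the previous claim: Since $\theta(Age_i) = \exp(\eta_0 + \eta_1 Age_i) / [1 + \exp(\eta_0 + \eta_1 Age_i)]$, we have $1 - \theta(Age_i) = 1 / [1 + \exp(\eta_0 + \eta_1 Age_i)]$. Dividing gives $\theta(Age_i) / [1 - \theta(Age_i)] = \exp(\eta_0 + \eta_1 Age_i)$, and taking the natural log of both sides gives the result.

Fitting the Model in JMP

Select Analyze > Fit Y by X and place CHD (y/n) in the Y box and age in the X box. The resulting output is shown below. Because the response is a dichotomous categorical variable, logistic regression is performed.

The curve is a plot of:

\[ \hat{P}(CHD \mid Age) = \frac{\exp(\hat{\beta}_o + \hat{\beta}_1 Age)}{1 + \exp(\hat{\beta}_o + \hat{\beta}_1 Age)} \]

Example:

$\hat{P}(CHD \mid Age = 40) = $

$\hat{P}(CHD \mid Age = 60) = $

Interpretation of Model Parameters

\[ P(CHD = 1 \mid Age) = \theta(\tilde{x}) = \frac{e^{\beta_o + \beta_1 Age}}{1 + e^{\beta_o + \beta_1 Age}} \]

\[ \text{Odds for Success} = \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \]

Thus

\[ \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \beta_o + \beta_1 Age. \]

Suppose we contrast individuals who are Age = x to those who are Age = x + c. What can we say about the increased risk associated with a c year increase in age? The logistic model gives us a means to do this through the odds ratio (OR).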
\[
\begin{aligned}
\ln(\text{OR associated with a } c \text{ year increase in age})
&= \ln\left( \frac{\theta(Age = x + c) / [1 - \theta(Age = x + c)]}{\theta(Age = x) / [1 - \theta(Age = x)]} \right) \\
&= \ln\left( \frac{\theta(Age = x + c)}{1 - \theta(Age = x + c)} \right) - \ln\left( \frac{\theta(Age = x)}{1 - \theta(Age = x)} \right) \\
&= \beta_o + \beta_1 (Age + c) - (\beta_o + \beta_1 Age) = c\beta_1
\end{aligned}
\]

Exponentiating both sides gives

\[ \text{OR associated with a } c \text{ year increase in age} = e^{c\beta_1}. \]

Thus the multiplicative increase (or decrease if $\beta_1 < 0$) in odds associated with a c year increase in age is $e^{c\beta_1}$.

Example: Interpreting a c year increase in age.

Question: Is it reasonable to assume that the effect of a c unit increase in a continuous predictor is constant regardless of starting point? For example, does the risk associated with a 5 year increase in age remain constant throughout one’s life?

Statistical Inference for the Logistic Regression Model

Given estimates for the model parameters and their estimated standard errors, what types of statistical inferences can be made? One approach is to use the normal-theory based methods outlined below.

Hypothesis Testing

For testing:

\[ H_o: \beta_i = 0 \qquad H_a: \beta_i \neq 0 \]

Large sample test for significance of the “slope” parameter ($\beta_i$):

\[ z = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \sim N(0, 1) \]

Confidence Intervals for Parameters and Corresponding OR’s

For dichotomous categorical predictors (i.e. 0/1 predictors):

100(1 − α)% CI for $\beta_i$:

\[ \hat{\beta}_i \pm z_{1-\alpha/2} \, SE(\hat{\beta}_i) \]

100(1 − α)% CI for the OR associated with $\beta_i$:

\[ \exp\left( \hat{\beta}_i \pm z_{1-\alpha/2} \, SE(\hat{\beta}_i) \right) \]

If $\beta_i$ corresponds to a continuous predictor and we wish to examine the OR associated with a c unit increase, the CI for the OR becomes

\[ \exp\left( c\hat{\beta}_i \pm z_{1-\alpha/2} \, c \, SE(\hat{\beta}_i) \right) \]

Categorical predictors often have more than two levels; we will see how to handle that case later in the notes.

Example: What is the OR for CHD associated with a 10 year increase in age? Give a 95% confidence interval based on this estimate.

Some Mathematical Details: Estimation of the Model Parameters

In a linear regression analysis, the regression coefficients are estimated based on the least squares method. That is, the estimates are obtained by minimizing the sum of the squared residuals.
In a logistic regression analysis, the model parameters are estimated through a process called the maximum likelihood method. The basic principle of maximum likelihood is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have actually observed. This involves:

1. Finding an expression (i.e., the likelihood function) for the probability of the data as a function of the unknown parameters.

For the logistic model, the binary response variable is assumed to follow a binomial distribution with a single trial (n = 1) and probability of “success” equal to $\theta(x_i)$. Therefore, for the ith observed pair $(x_i, y_i)$, the contribution to the likelihood is

\[ \theta(x_i)^{y_i} (1 - \theta(x_i))^{1 - y_i}, \qquad \text{where } \theta(x_i) = \frac{e^{\beta_o + \beta_1 x_i}}{1 + e^{\beta_o + \beta_1 x_i}} \text{ and } y_i = 0 \text{ or } 1. \]

Then, since we assume independence across observations, the likelihood function is given by

\[ L = L(\beta_o, \beta_1) = \prod_{i=1}^{n} \theta(x_i)^{y_i} (1 - \theta(x_i))^{1 - y_i} \]

2. Finding the values of the unknown parameters which make the value of this expression as large as possible.

For computational purposes it is usually easier to maximize the logarithm of the likelihood function rather than the likelihood function itself. This works because the logarithm is a monotonic increasing function; therefore, the maximizing parameters are the same for the likelihood and log-likelihood functions. The log-likelihood function is given by

\[ \ln L(\beta_o, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln \theta(x_i) + (1 - y_i) \ln(1 - \theta(x_i)) \right] \]

To find the parameter estimates, we solve simultaneously the equations given by setting the partial derivatives with respect to each parameter equal to 0:

\[ \frac{\partial}{\partial \beta_o} \ln L(\beta_o, \beta_1) = 0 \qquad \frac{\partial}{\partial \beta_1} \ln L(\beta_o, \beta_1) = 0 \]

Several different nonlinear optimization routines are used to find solutions to such systems. This process gets increasingly computationally intensive as the number of terms in the model increases.
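The two steps above can be sketched in R. This is a minimal illustration rather than what any package does internally: it assumes the CHD by Over55 counts tabulated later in this handout (51 and 22 subjects without and with CHD under 55; 6 and 21 at 55 or over), uses a general-purpose optimizer (optim with BFGS) in place of the Fisher scoring used by glm, and the helper name neg_loglik is hypothetical.

```r
# Rebuild the 100 binary observations from the CHD by Over55 counts.
x <- c(rep(0, 73), rep(1, 27))                          # Over55 indicator
y <- c(rep(0, 51), rep(1, 22), rep(0, 6), rep(1, 21))   # CHD indicator

# Step 1: the log-likelihood as a function of (beta_o, beta_1).
# neg_loglik (a hypothetical name) returns its negative because optim() minimizes.
neg_loglik <- function(beta) {
  theta <- plogis(beta[1] + beta[2] * x)   # exp(.) / (1 + exp(.))
  -sum(y * log(theta) + (1 - y) * log(1 - theta))
}

# Step 2: numerically find the parameter values that maximize the log-likelihood.
fit <- optim(c(0, 0), neg_loglik, method = "BFGS")
fit$par

# For a single 0/1 predictor the maximizers have a closed form:
# the log odds in the x = 0 group and the log odds ratio.
closed_form <- c(log(22 / 51), log((21 / 6) / (22 / 51)))
closed_form

# glm() solves the same problem with Fisher scoring.
coef(glm(y ~ x, family = binomial))
```

All three sets of numbers should agree to several decimal places, which is a useful check that the likelihood has been written down correctly.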
Example: Estimating Model Parameters with a Single Dichotomous Predictor

CHD and Indicator of Age Over 55

Computed using standard approach

Logistic Model

There are two different ways to code dichotomous variables: (0,1) coding or (-1,+1) (i.e. contrast) coding. JMP uses contrast coding by default, whereas other packages will generally use the (0,1) coding as default. The two coding types are shown below.

    (0,1) coding:       Age 55+ =  1 if Age > 55,    0 if Age < 55
    Contrast coding:    Age 55+ = +1 if Age > 55,   -1 if Age < 55

For the purposes of discussion we will consider the (0,1) coding. Recall that

\[ \theta(x) = P(CHD = 1 \mid x) = \frac{e^{\beta_o + \beta_1 x}}{1 + e^{\beta_o + \beta_1 x}} \]

where x = Age 55+ indicator. With η0 and η1 denoting the intercept and slope, we have the following.

                        CHD = 1                                   CHD = 0
    Age ≥ 55 (x = 1)    θ(x = 1) = exp(η0 + η1)/[1 + exp(η0 + η1)]    1 - θ(x = 1) = 1/[1 + exp(η0 + η1)]
    Age < 55 (x = 0)    θ(x = 0) = exp(η0)/[1 + exp(η0)]              1 - θ(x = 0) = 1/[1 + exp(η0)]

Estimating the model parameters “by hand”

\[ OR = \frac{\theta(x = 1) / (1 - \theta(x = 1))}{\theta(x = 0) / (1 - \theta(x = 0))} \]

EXAMPLE: Using the data in the file CHD.sas we will create a dummy variable indicating whether the subject is age 55 or over. Then, instead of examining the relationship between the CONTINUOUS variable age and the presence or absence of evidence of coronary heart disease (CHD), we could consider the dichotomous predictor:

\[ Over55 = \begin{cases} 0 & \text{if } Age < 55 \\ 1 & \text{if } Age \geq 55 \end{cases} \]

    data CHD;
      input ageGrp$ age CHD;
      if (age ge 55) then Over55=1;
      else Over55=0;
      datalines;
    1 20 0
    . . .
    . . .

    proc sort data=CHD;
      by descending CHD descending Over55;
    run;

    proc freq order=data;
      tables CHD*Over55 / all;
    run;

Standard OR output from SAS:

Using PROC LOGISTIC to fit the model:

    proc logistic data=CHD descending;
      model CHD = Over55 / link=logit;
      output out=probs predicted=predicted_probabilities;
    run;

Questions:

1. Use the model parameters to predict the probability of having CHD for a person who is 55 or over and for a person who is younger than 55.
2. Given only the estimates of the model parameters, could you find the odds ratio for having CHD associated with being 55 or over?
Verify these values from the SAS output:

    proc print data=probs;
    run;

Analysis in JMP (CHD55.JMP)

To fit a logistic regression model it is best to use the Analyze > Fit Model option. We place CHD (1 = Yes, 2 = No) in the Y box and Age > 55 (1 = Yes, 2 = No) in the model effects box. The key is to have “Yes” for risk and disease come alpha-numerically before “No”, thus the use of 1 for “Yes” and 2 for “No”. The summary of the fitted logistic model is shown below.

Notice that the parameter estimates are not the same as those obtained from SAS. This is because JMP uses contrast coding for the Age > 55 predictor (+1 = Age > 55 and -1 = Age < 55).

OR’s and Fitted Probabilities

Using JMP to Compute OR’s, CI’s, and Fitted Probabilities

Because the disease and the risk factor are both alpha-numerically ordered, the OR’s are correct as given.

By selecting Save Probability Formula we can save the fitted probabilities to the spreadsheet.

Example: CHD and Age Over 55 in R

    > Over55
      [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     [53] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    Levels: 0 1
    > chd
      [1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1 1
     [53] 0 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1
    Levels: 0 1
    > table(chd,Over55)
       Over55
    chd  0  1
      0 51  6
      1 22 21
    > chd55 = glm(chd~Over55,family="binomial")
    > summary(chd55)

    Call:
    glm(formula = chd ~ Over55, family = "binomial")

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.734  -0.847  -0.847   0.709   1.549

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.8408     0.2551  -3.296  0.00098 ***
    Over55        2.0935     0.5285   3.961 7.46e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 117.96  on 98  degrees of freedom
    AIC: 121.96

    Number of Fisher Scoring iterations: 4

How do we measure discrepancy between observed and fitted values?

In OLS regression with a continuous response we used

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\eta}^T u_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\eta}_o + \hat{\eta}_1 u_{1i} + \cdots + \hat{\eta}_k u_{ki}) \right)^2 \]

In logistic regression modeling we can use the deviance (typically denoted D or G²), which is defined as

\[ D = 2 \ln\left( \frac{\text{likelihood of saturated model}}{\text{likelihood of fitted model}} \right) \]

\[ D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left( \frac{y_i}{\hat{\theta}(x_i)} \right) + (1 - y_i) \ln\left( \frac{1 - y_i}{1 - \hat{\theta}(x_i)} \right) \right] \]

Because the likelihood function of the saturated model is equal to 1 when the response ($y_i$) is 0 or 1, the deviance reduces to:

\[ D = -2 \ln(\text{likelihood of the fitted model}) = -2 \sum_{i=1}^{n} \left[ y_i \ln(\hat{\theta}(x_i)) + (1 - y_i) \ln(1 - \hat{\theta}(x_i)) \right] \]

The deviance can be used to compare two potential models, where one model is nested within the other, by using the “General Chi-Square Test” for comparing rival logistic regression models. We will see this applied in more detail when we discuss multiple logistic regression and model development; however, we will demonstrate this process below when considering a single predictor x.

The general nested model concept:

General Chi-Square Test

Consider comparing two rival models, where the alternative hypothesis model contains the terms of the null hypothesis model plus additional terms:

\[ H_o: \log\left( \frac{\theta(x)}{1 - \theta(x)} \right) = \eta_1^T x_1 \qquad \text{(reduced model OK)} \]

\[ H_1: \log\left( \frac{\theta(x)}{1 - \theta(x)} \right) = \eta_1^T x_1 + \eta_2^T x_2 \qquad \text{(full model needed)} \]

General Chi-Square Statistic

\[
\begin{aligned}
\chi^2 &= (\text{residual deviance of reduced model}) - (\text{residual deviance of full model}) \\
&= D(\text{for model without the terms in } x_2) - D(\text{for model with the terms in } x_2) \sim \chi^2_{\Delta df}
\end{aligned}
\]

If the full model is needed, $\chi^2$ is BIG and the associated p-value $= P(\chi^2_{\Delta df} > \chi^2)$ is small.
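As a quick illustration of the General Chi-Square Test, the sketch below compares the intercept-only model (reduced) with the Over55 model (full), using the two deviances and their degrees of freedom printed by summary(chd55) above. The variable names are assumptions made for this example only.

```r
# Deviances copied from the summary(chd55) output.
dev_reduced <- 136.66   # null deviance: intercept-only model, 99 df
dev_full    <- 117.96   # residual deviance: model with Over55, 98 df

# Test statistic and its degrees of freedom (change in df between the models).
chisq_stat <- dev_reduced - dev_full
delta_df   <- 99 - 98

# Upper-tail chi-square probability gives the p-value.
p_value <- pchisq(chisq_stat, df = delta_df, lower.tail = FALSE)
c(chisq_stat = chisq_stat, df = delta_df, p_value = p_value)
```

With the model objects available, anova(reduced_fit, full_fit, test = "Chisq") performs the same comparison directly.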
Example: CHD and Age ~ a single numeric predictor

Ho :

H1 :

From JMP

From R

    > summary(chd.glm)

    Call:
    glm(formula = chd ~ Age, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.9718  -0.8456  -0.4576   0.8253   2.2859

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
    Age          0.11092    0.02406   4.610 4.02e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 107.35  on 98  degrees of freedom

Statistical Inference for the Logistic Regression Model (In SAS)

First, consider the following output from PROC LOGISTIC:

    proc logistic descending;
      model CHD = age / link=logit;
      output out=get_values predicted=predicted_probabilities;
    run;

All of these statistics are testing the same null hypothesis:

Ho: all explanatory variables in the model have coefficients of zero.
Ha: at least one explanatory variable in the model has a coefficient different from zero.

The Likelihood Ratio test compares the log-likelihood for the fitted model with the log-likelihood for a model with NO explanatory variables. PROC LOGISTIC reports -2×log-likelihood for each of these models, and the chi-square test statistic is the difference of these two numbers. Note that the df = 1 corresponds to the one independent variable in the model.

The Score statistic is a function of the first and second derivatives of the log-likelihood function under the null hypothesis. There is some evidence that this test does not perform as well as the likelihood ratio test for small samples.

The Wald statistic is an approximation that is more accurate with larger sample sizes.
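The likelihood ratio and Wald statistics can be reproduced from the R output for the same model. The following is a minimal sketch that only re-uses numbers printed by summary(chd.glm) above (the score statistic is omitted because it needs the derivatives of the log-likelihood); the variable names are assumptions for illustration.

```r
# Values copied from the summary(chd.glm) output.
b1        <- 0.11092   # estimated coefficient for Age
se_b1     <- 0.02406   # its standard error
null_dev  <- 136.66    # -2 log-likelihood, model with no explanatory variables
resid_dev <- 107.35    # -2 log-likelihood, model with Age

# Likelihood ratio chi-square: difference of the two -2 log-likelihoods, df = 1.
lr_chisq <- null_dev - resid_dev
lr_p     <- pchisq(lr_chisq, df = 1, lower.tail = FALSE)

# Wald chi-square: square of the z statistic, df = 1.
wald_z     <- b1 / se_b1
wald_chisq <- wald_z^2
wald_p     <- pchisq(wald_chisq, df = 1, lower.tail = FALSE)

c(lr_chisq = lr_chisq, wald_chisq = wald_chisq)
c(lr_p = lr_p, wald_p = wald_p)
```

The two statistics differ in value but lead to the same conclusion here, which is typical when the sample is reasonably large.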
Hypothesis Testing For Individual Coefficients

\[ H_o: \beta_i = 0 \qquad H_a: \beta_i \neq 0 \]

When the sample size is large, the test for significance of the “slope” parameter ($\beta_i$) can be calculated as follows:

\[ z = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} = \]

\[ \chi^2 = z^2 = \]

Confidence Intervals for Coefficients and Corresponding Odds Ratios

A 100(1 − α)% confidence interval for $\beta_i$ can be calculated as follows:

\[ \hat{\beta}_i \pm z_{1-\alpha/2} \, SE(\hat{\beta}_i) \]

A 100(1 − α)% confidence interval for the odds ratio associated with $\beta_i$ is calculated as follows:

\[ \exp\left( \hat{\beta}_i \pm z_{1-\alpha/2} \, SE(\hat{\beta}_i) \right) \]

These intervals can be calculated in SAS PROC LOGISTIC as follows:

    proc logistic descending;
      model CHD = age / link=logit clparm=wald;
    run;

If $\beta_i$ corresponds to a continuous predictor and we wish to examine the odds ratio associated with a c unit increase, the confidence interval for the odds ratio becomes

\[ \exp\left( c\hat{\beta}_i \pm z_{1-\alpha/2} \, c \, SE(\hat{\beta}_i) \right) \]

Example: Find the odds ratio for CHD associated with a 10 year increase in age, and give a 95% confidence interval based on this estimate.

    proc logistic descending;
      model CHD = age / link=logit clparm=wald clodds=wald;
      units age=10;
    run;

The preceding intervals are all known as Wald intervals (based on normal-theory methods). These may not be appropriate for small samples; therefore, you may want to consider another method called the Profile Likelihood method. This involves an iterative evaluation of the likelihood function and produces intervals that may not be symmetric around the estimate.

    proc logistic descending;
      model CHD = age / link=logit clparm=pl;
    run;

Questions:

1. How do these compare to the Wald confidence intervals?
2. How would you find the profile likelihood confidence interval for the odds ratio?

Statistics Measuring Predictive Power

Once again, consider the model using the continuous variable age to predict CHD:

    proc logistic descending;
      model CHD = age / link=logit;
      output out=get_values predicted=predicted_probabilities;
    run;

Recall that the p-values shown above are used to test the usefulness of the logistic regression model.
We can also consider a few other statistics to investigate the model’s predictive power:

Generalized R²

\[ \text{Generalized } R^2 = 1 - \exp\left( -\frac{\text{Likelihood Ratio Chi-Square}}{n} \right) \]

This is calculated as follows:

You can also request this quantity from SAS:

    proc logistic descending;
      model CHD = age / link=logit rsq;
    run;

Note that the upper bound of the generalized R² is less than 1. Therefore, PROC LOGISTIC also reports a quantity labeled the “Max-rescaled R-Square,” which divides the original generalized R² by its upper bound.

Ordinal Measures of Association

SAS PROC LOGISTIC also reports the following statistics:

The idea behind these statistics is as follows. For the 100 observations in the data set, there exist 100×(99)/2 = 4,950 different ways to pair them up (without pairing an observation with itself). Of these pairs, 2,499 have either both 1s or both 0s for an observed response. These are ignored, leaving 2,451 pairs in which one case has a 0 and the other case has a 1. For these pairs, SAS determines whether the observation with a 1 has a higher predicted value (based on the model) than does the observation with a 0. If this is the case, the pair is called concordant. If not, the pair is discordant.

Let
C = the number of concordant pairs =
D = the number of discordant pairs =
T = the number of ties =
N = the total number of pairs (before eliminating any) =

The four measures of association are given as

1. Somers’ D = $\dfrac{C - D}{C + D + T}$
2. Gamma = $\dfrac{C - D}{C + D}$
3. Tau-a = $\dfrac{C - D}{N}$
4. C = .5×(1 + Somers’ D)

All four measures typically fall between 0 and 1 (the first three can be negative only when the model orders the pairs worse than chance), with larger values corresponding to stronger associations between the predicted and observed values.

Finally, note that the measure known as C has another familiar interpretation. Consider the following programming statements.

    ods html;
    ods graphics on;
    proc logistic data=CHD descending;
      model CHD = age / link=logit outroc=roc_data;
    run;
    ods graphics off;
    ods html close;

    proc print data=roc_data;
    run;

These statements request the following output. . .
The ROC curve is obtained by changing the classification rule based on the estimated probability. Note that the area under the ROC curve is the same as C.

More Analysis in JMP – Logistic Regression with a Single Numeric Predictor

OPTIONS FOR LOGISTIC REGRESSION

Likelihood Ratio Tests – same as in SAS.
Wald Tests – normal-theory based.
Confidence Intervals – gives CI’s for population parameters in the model.
Odds Ratios – gives the odds ratio associated with a unit increase in x (i.e. c = 1) and the odds ratio associated with being at the maximum of x vs. the minimum of x.
ROC Curve – if we use $\hat{\theta}(\tilde{x}) = \hat{P}(CHD \mid \tilde{x})$ to construct a rule for classifying a patient as having CHD vs. No CHD, this option gives the ROC curve coming from all possible cutpoints based on this estimated probability.

Estimated Odds Ratios

ROC Curve and Table

By changing the classification rule based on estimated probability we can obtain an ROC curve.

Analysis in R – Logistic Regression with Single Numeric Predictor

    > CHD <- read.table(file.choose(),header=T)
    > CHD
        agegrp age chd
    1        1  20   0
    2        1  23   0
    3        1  24   0
    4        1  25   0
    5        1  25   1
    .        .   .   .
    .        .   .   .
    .        .   .   .
    96       8  63   1
    97       8  64   0
    98       8  64   1
    99       8  65   1
    100      8  69   1
    > names(CHD)
    [1] "agegrp" "age"    "chd"

Make sure that you specify family="binomial" or R will perform ordinary least squares.

    > attach(CHD)
    > chd <- factor(chd)
    > chd.glm <- glm(chd~age,family="binomial")
    > summary(chd.glm)

    Call:
    glm(formula = chd ~ age, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.9718  -0.8456  -0.4576   0.8253   2.2859

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -5.30945    1.13263  -4.688 2.76e-06 ***
    age          0.11092    0.02404   4.614 3.95e-06 ***
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 136.66  on 99  degrees of freedom
    Residual deviance: 107.35  on 98  degrees of freedom
    AIC: 111.35

    Number of Fisher Scoring iterations: 3

    > probCHD <- exp(-5.30945 + .11092*age)/(1+exp(-5.30945 + .11092*age))
    > plot(age,probCHD,type="b",ylab="P(CHD|Age)",xlab="Age")

The plotted curve is

\[ \hat{P}(CHD \mid Age) = \frac{e^{\hat{\beta}_o + \hat{\beta}_1 Age}}{1 + e^{\hat{\beta}_o + \hat{\beta}_1 Age}}, \qquad \hat{\beta}_o = -5.310, \quad \hat{\beta}_1 = .11092 \]

An easier way to obtain the estimated probabilities is to extract them from the model object.

    > probCHD <- fitted(chd.glm)
    > plot(age,probCHD,type="b",ylab="P(CHD|Age)")   # This produces plot above

We can obtain the estimated logit ($\hat{L}_i = \hat{\beta}_o + \hat{\beta}_1 Age_i$) by using the predict command.

    > chd.logit = predict(chd.glm)
    > plot(age,chd.logit,type="b",ylab="L = no + n1*Age")
    > title(main="Plot of Estimated Logit vs. Age")

Multiple Logistic Regression

The multiple logistic mean function has the basic form

\[ \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_1 u_1 + \eta_2 u_2 + \cdots + \eta_{k-1} u_{k-1} \]

where the $u_i$ are terms based on the $x_j$’s.

What are terms?

Terms (cont’d)

EXAMPLE 1: The data in the file OC_Use.sas are from a case-control study comparing the use of oral contraceptives and the occurrence of myocardial infarctions. Subjects were also classified into one of five age groups.

    data OCUse;
      input AgeGrp$ Status$ OCuse$ count;
      datalines;
    1 Case Yes 4
    1 Case No 2
    1 Control Yes 62
    1 Control No 224
    2 Case Yes 9
    2 Case No 12
    2 Control Yes 33
    2 Control No 390
    3 Case Yes 4
    3 Case No 33
    3 Control Yes 26
    3 Control No 330
    4 Case Yes 6
    4 Case No 65
    4 Control Yes 9
    4 Control No 362
    5 Case Yes 6
    5 Case No 93
    5 Control Yes 5
    5 Control No 301
    ;

In Handout 14, we discussed using the Cochran-Mantel-Haenszel test for controlling for a single categorical covariate while assessing the association between two other variables.
We could apply this test to these data in order to “adjust” for age group when examining the relationship between oral contraceptive use and disease status:

    proc freq order=data;
      tables AgeGrp*status*OCuse / cmh all;
      weight count;
    run;

Question: What do you conclude from this test?

The “/ all” option gives us the odds ratio for having myocardial infarction associated with oral contraceptive use for each age group:

    Age Group    Odds Ratio
    1
    2
    3
    4
    5

Recall that we can test for a difference in these odds ratios:

Finally, the CMH methods also provide us with a common odds ratio:

You can think of this as an estimate of the odds ratio for having myocardial infarction associated with oral contraceptive use after controlling for age group.

Fitting Multiple Logistic Regression Models in SAS

Logistic regression methods also provide us with a method for controlling for confounding variables. Note that we can use a multiple logistic regression model to predict the probability of myocardial infarction based on oral contraceptive use. Moreover, we can add age group to the model in order to adjust for age.

First, we must make our binary response variable numeric in order to use PROC LOGISTIC:

    data OCuse2;
      set OCuse;
      if Status="Case" then MI = 1;
      else MI = 0;
    run;

We can leave the two predictor variables (OC use and Age group) in categorical format; however, we must place these variable names in the ‘class’ statement in PROC LOGISTIC:

    proc logistic descending;
      class OCUse agegrp;
      model MI = OCUse agegrp;
      weight count;
    run;

Questions:

1. Is the logistic regression model useful? Explain.
2. Why does SAS report four coefficients for Age Group?

The Multiple Logistic Mean Function

The multiple logistic regression model for this example is parameterized as follows:

\[ E(MI \mid \tilde{x} = (OCuse, AgeGrp)) = \theta(\tilde{x}) = \frac{\exp(\eta_0 + \eta_1 OCuse + \eta_2 Age1 + \eta_3 Age2 + \eta_4 Age3 + \eta_5 Age4)}{1 + \exp(\eta_0 + \eta_1 OCuse + \eta_2 Age1 + \eta_3 Age2 + \eta_4 Age3 + \eta_5 Age4)}. \]

Also, recall that

\[ \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_1 OCuse + \eta_2 Age1 + \eta_3 Age2 + \eta_4 Age3 + \eta_5 Age4. \]

Note that since Age Group has five levels, its definition requires four dummy (or indicator) variables. SAS lists the value of these dummy variables in the PROC LOGISTIC output:

This method is known as “effects coding” (the reference group is identified by -1):

    Age1  =  1 if Age group 1,  -1 if Age group 5,  0 otherwise
    Age2  =  1 if Age group 2,  -1 if Age group 5,  0 otherwise
    Age3  =  1 if Age group 3,  -1 if Age group 5,  0 otherwise
    Age4  =  1 if Age group 4,  -1 if Age group 5,  0 otherwise
    OCuse =  1 if non-user,     -1 if user

PROC LOGISTIC reports the following parameter estimates and odds ratios:

Questions:

1. How does SAS calculate the odds ratio for MI associated with not using oral contraceptives for those in Age group 1?

\[ \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_1 OCuse + \eta_2 Age1 + \eta_3 Age2 + \eta_4 Age3 + \eta_5 Age4 \]

2. How does SAS calculate the odds ratio for MI associated with not using oral contraceptives for those in Age group 2?

3. How does SAS calculate the odds ratio for MI associated with being in Age group 1 versus Age group 5, adjusted for oral contraceptive use?

Reordering the Factors

To examine the effects of OC use and Age Group, we may want to “reorder” the levels of both variables. That is, we may want to use the non-OC users as the reference group. Also, we may want to use the youngest age group as our baseline. PROC LOGISTIC allows you to specify a reference group in the class statement:

    proc logistic descending;
      class OCUse(param=ref ref='No') agegrp(param=ref ref='1');
      model MI = OCUse agegrp;
      weight count;
    run;

Note that the indicator variables are now defined using a method known as “dummy coding”:

    Age2  = 1 if Age group 2,  0 otherwise
    Age3  = 1 if Age group 3,  0 otherwise
    Age4  = 1 if Age group 4,  0 otherwise
    Age5  = 1 if Age group 5,  0 otherwise
    OCuse = 0 if non-user,     1 if user

Questions:

1. Suppose we want to find the odds ratio associated with being in Age group 5 when compared to being in Age group 1 after adjusting for oral contraceptive use.
\[ \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_1 OCuse + \eta_2 Age2 + \eta_3 Age3 + \eta_4 Age4 + \eta_5 Age5 \]

2. Find the odds ratio associated with oral contraceptive use ADJUSTED for age. How does this compare to the CMH estimate?

3. Find a 95% confidence interval for this odds ratio.

Note that PROC LOGISTIC returns these odds ratios and their confidence intervals:

4. Interpret the age effect in terms of odds ratios after adjusting for OC use.

5. How would you compare OC users in Age group 5 to non-OC users in Age group 1?

6. How would you compare OC users in Age group 4 to non-OC users in Age group 3?

Example 1 in JMP

To fit a logistic regression model using OC use and Age group as covariates in JMP, select Analyze > Fit Model, place both Age and OC use in the Construct Model Effects box, and use Case-Control status as the response, as shown below:

Then click Run to fit the model for these data. The resulting output is shown below.

The output indicates that the odds are for MI, which is the response category of interest here. Had it read No MI/MI, we would want to recode the response so that MI was the response value of interest.

OC Use = No is the category of interest and OC Use = Yes is being used as the reference group, i.e. the denominator odds in the odds ratio. This is not what we want. We want to find the OR for having an MI associated with OC Use = Yes, i.e. the risk associated with oral contraceptive use. We can achieve this by recoding OC Use so that OC Use = Yes is the category of interest, using the Value Ordering option in the Column Info… dialog. This process is shown below.

Highlight Yes and click Move Up so Yes is at the top of the list. This will make OC Use = Yes the value of interest when computing the odds ratio for MI associated with OC Use.

Repeating the model fit above, we obtain the results shown below. I have selected the Wald Tests and the Odds Ratio options from the Nominal Logistic Fit pull-down menu.
Interpretation of the results from JMP:

EXAMPLE 2: Consider the data found in the file Lowbirth.JMP. These data are from a study to identify potential risk factors for low birth weight. A random sample of new mothers was taken and the following variables were recorded:

    Low      = birth weight less than 2500 grams (Y or N)
    Prev     = previous history of premature labor (History or None)
    Hyper    = hypertension during pregnancy (HT or Normal)
    Smoke    = smoked during pregnancy (Cig or No Cig)
    Uterine  = uterine irritability during pregnancy (Irritation or None)
    Minority = minority status of mother (White or Nonwhite)
    Age      = mother’s age in years (yrs.)
    Lwt      = mother’s weight at last menstrual cycle (lbs.)

Let’s begin by fitting a model with all predictors in JMP using Analyze > Fit Model.

Questions:

1. Is the overall model useful? Explain.
2. Are all predictors significant in the model? Explain.

Comparing Models with the Likelihood Ratio Test

We can fit the reduced model, eliminating those terms that are not significant, and then test whether the reduced model is adequate.

\[ H_o: \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_2 Lwt + \eta_3 Minority + \eta_4 Smoke + \eta_5 Prev + \eta_6 Hyper \]

\[ H_a: \ln\left( \frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} \right) = \eta_0 + \eta_1 Age + \eta_2 Lwt + \eta_3 Minority + \eta_4 Smoke + \eta_5 Prev + \eta_6 Hyper + \eta_7 Uterine \]

The test statistic is given by

χ² = (Residual Deviance of reduced model) - (Residual Deviance of full model)

Note: Residual Deviance = -2×log-likelihood

Under the null hypothesis, this test statistic follows the chi-square distribution with degrees of freedom equal to the change in degrees of freedom between the two competing models.

Fitting the Null Hypothesis Model: (Age and Uterine dropped from the model)

Fitting the Alternative Hypothesis Model:

Carrying out the test:

Residual Deviance for Null Hypothesis Model:

Residual Deviance for Alternative Hypothesis Model:

Test Statistic, χ² =

df =

To find the p-value use R:

    > 1 - pchisq(3.60666,df=2)
    [1] 0.1647543

Conclusion:

Interpretation of Model Parameters for Reduced Model

Questions:

1. Find and interpret the odds ratio for low birth weight associated with being a minority.

2. Find and interpret the odds ratio for low birth weight associated with being a smoker.

3. Find and interpret the odds ratio for low birth weight associated with having hypertension.

4. Find and interpret the odds ratio for low birth weight associated with having a history of preterm labor.

5. Find and interpret the odds ratio for low birth weight associated with a 10 pound increase in pre-pregnancy weight.

Logistic Regression Diagnostics: Residuals and Influence Statistics

As in the case of ordinary least squares (OLS) regression, we need to be wary of cases that are poorly fit and those that may have excessive influence on our results.

Residuals

Pearson and deviance residuals are useful in identifying observations that are not explained well by the model. Pearson residuals are components of the Pearson chi-square statistic and deviance residuals are components of the deviance.

Pearson Residual: The Pearson residual for the ith observation is defined by

\[ \hat{e}_{\chi i} = \frac{y_i - \hat{y}_i}{\sqrt{n_i \hat{\theta}(\tilde{x}_i)(1 - \hat{\theta}(\tilde{x}_i))}} \]

where $n_i$ is the number of trials for the ith observation (covariate pattern) and $\hat{y}_i = n_i \hat{\theta}(\tilde{x}_i)$. Note that the Pearson chi-square statistic is the sum of the squared Pearson (chi) residuals.

Deviance Residual: The deviance residual for the ith observation is defined by

\[ D_i = \text{sgn}(y_i - n_i \hat{\theta}(x_i)) \left[ 2 y_i \ln\left( \frac{y_i}{n_i \hat{\theta}(x_i)} \right) + 2 (n_i - y_i) \ln\left( \frac{n_i - y_i}{n_i (1 - \hat{\theta}(x_i))} \right) \right]^{1/2} \]

Note that the deviance is the sum of squares of the deviance residuals.

Influence Statistics

These measures can be used to identify cases that are highly influential on the logistic regression estimates.

DFBETAS: For each parameter estimate, a DFBETAS diagnostic is calculated for each observation. This is the standardized difference in the parameter estimate due to deleting the observation, and it can be used to assess the effect of an individual observation on each estimated parameter of the fitted model. These measures are useful for detecting observations that are causing instability in the selected coefficients.
C and CBAR: These diagnostics provide scalar measures of the influence of individual observations on the regression estimates. They are based on the same idea as the Cook distance in linear regression theory.

DIFDEV and DIFCHISQ: These are diagnostics for detecting ill-fitted observations; in other words, observations that contribute heavily to the disagreement between the data and the predicted values of the fitted model. DIFDEV is the change in the deviance due to deleting an individual observation, while DIFCHISQ is the change in the Pearson chi-square statistic for the same deletion.

In cases of both poor fit and high influence, it is good to look at the covariate values for these individuals to address the role they play in the analysis. In many cases there will be several individuals with the same covariate pattern, especially if most or all of the predictors are categorical in nature.

To obtain these measures from SAS PROC LOGISTIC, use the following code:

    ods html;
    ods graphics on;
    *Reduced model;
    proc logistic data=LowBirthWeight descending;
      class MINORITY(param=ref ref='0') SMOKE(param=ref ref='0')
            PTL(param=ref ref='0') HTN(param=ref ref='0');
      model LOW = WEIGHT MINORITY SMOKE PTL HTN / link=logit influence;
    run;
    ods graphics off;
    ods html close;

SAS returns a series of diagnostic plots (plots not reproduced here).

Logistic Regression in R

In this section of the notes we examine logistic regression in R. There are several functions that I wrote for plotting diagnostics similar to what SAS does, although the inspiration for them came from work Prof. Malone and I did for OLS as part of his senior project.

Example 1: Oral Contraceptive Use and Myocardial Infarctions

Set up a text file with the data in columns with variable names at the top. The case and control counts are in separate columns. The risk factor OC use and stratification variable Age follow.
    > OCMI.data = read.table(file.choose(),header=T)   # read in text file
    > OCMI.data
       MI NoMI Age OCuse
    1   4   62   1   Yes
    2   2  224   1    No
    3   9   33   2   Yes
    4  12  390   2    No
    5   4   26   3   Yes
    6  33  330   3    No
    7   6    9   4   Yes
    8  65  362   4    No
    9   6    5   5   Yes
    10 93  301   5    No
    > attach(OCMI.data)
    > OC.glm <- glm(cbind(MI,NoMI)~Age+OCuse,family=binomial)   # fit model
    > summary(OC.glm)

    Call:
    glm(formula = cbind(MI, NoMI) ~ Age + OCuse, family = binomial)

    Deviance Residuals:
    [1]  0.456248 -0.520517  1.377693 -0.886710 -1.685521  0.714695 -0.130922  0.033643
    [9] -0.045061  0.008822

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -4.3698     0.4347 -10.054  < 2e-16 ***
    Age2          1.1384     0.4768   2.388   0.0170 *
    Age3          1.9344     0.4582   4.221 2.43e-05 ***
    Age4          2.6481     0.4496   5.889 3.88e-09 ***
    Age5          3.1943     0.4474   7.140 9.36e-13 ***
    OCuseYes      1.3852     0.2505   5.530 3.19e-08 ***
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 158.0085  on 9  degrees of freedom
    Residual deviance:   6.5355  on 4  degrees of freedom
    AIC: 58.825

    Number of Fisher Scoring iterations: 3

Find OR associated with oral contraceptive use ADJUSTED for age. Recall: CMH procedure gave 3.97.

    > exp(1.3852)
    [1] 3.995625

Find a 95% CI for OR associated with OC use.

    > exp(1.3852-1.96*.2505)
    [1] 2.445428
    > exp(1.3852+1.96*.2505)
    [1] 6.528518

Interpreting the age effect in terms of OR's ADJUSTING for OC use.
Note: The reference group is Age = 1, which was women 25-29 years of age.

    > OC.glm$coefficients
    (Intercept)        Age2        Age3        Age4        Age5    OCuseYes
      -4.369850    1.138363    1.934401    2.648059    3.194292    1.385176
    > Age.coefs <- OC.glm$coefficients[2:5]
    > exp(Age.coefs)
         Age2      Age3      Age4      Age5
     3.121653  6.919896 14.126585 24.392906

Find 95% CI for age = 5 group.
    > exp(3.1943-1.96*.4474)
    [1] 10.14921
    > exp(3.1943+1.96*.4474)
    [1] 58.62751

Example 2: Coffee Drinking and Myocardial Infarctions

    > CoffeeMI.data = read.table(file.choose(),header=T)
    > CoffeeMI.data
          Smoking Coffee MI NoMI
    1       Never    > 5  7   31
    2       Never    < 5 55  269
    3      Former    > 5  7   18
    4      Former    < 5 20  112
    5   1-14 Cigs    > 5  7   24
    6   1-14 Cigs    < 5 33  114
    7  15-25 Cigs    > 5 40   45
    8  15-25 Cigs    < 5 88  172
    9  25-34 Cigs    > 5 34   24
    10 25-34 Cigs    < 5 50   55
    11 35-44 Cigs    > 5 27   24
    12 35-44 Cigs    < 5 55   58
    13   45+ Cigs    > 5 30   17
    14   45+ Cigs    < 5 34   17
    > attach(CoffeeMI.data)
    > Coffee.glm = glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)
    > summary(Coffee.glm)

    Call:
    glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7650  -0.4510  -0.0232   0.2999   0.7917

    Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)        -1.2981     0.1819  -7.136 9.60e-13 ***
    Smoking15-25 Cigs   0.6892     0.2119   3.253  0.00114 **
    Smoking25-34 Cigs   1.2462     0.2398   5.197 2.02e-07 ***
    Smoking35-44 Cigs   1.1988     0.2389   5.017 5.24e-07 ***
    Smoking45+ Cigs     1.7811     0.2808   6.342 2.27e-10 ***
    SmokingFormer      -0.3291     0.2778  -1.185  0.23616
    SmokingNever       -0.3153     0.2279  -1.384  0.16646
    Coffee> 5           0.3200     0.1377   2.324  0.02012 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 173.7899  on 13  degrees of freedom
    Residual deviance:   3.7622  on  6  degrees of freedom
    AIC: 84.311

    Number of Fisher Scoring iterations: 3

OR for drinking 5 or more cups of coffee per day. Note: CMH procedure gave OR = 1.375

    > exp(.3200)
    [1] 1.377128

95% CI for OR associated with heavy coffee drinking

    > exp(.3200 - 1.96*.1377)
    [1] 1.051385
    > exp(.3200 + 1.96*.1377)
    [1] 1.803794

Reordering a Factor

To examine the effect of smoking we might want to "reorder" the levels of smoking status so that individuals who have never smoked are used as the reference group.
To do this in R you must do the following:

    > Smoking = factor(Smoking,levels=c("Never","Former","1-14 Cigs","15-25 Cigs",
    +                  "25-34 Cigs","35-44 Cigs","45+ Cigs"))

The first level specified in the levels argument will be used as the reference group, "Never" in this case. Refitting the model with the reordered smoking status factor gives the following:

    > Coffee.glm2 <- glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)
    > summary(Coffee.glm2)

    Call:
    glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.7650  -0.4510  -0.0232   0.2999   0.7917

    Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)       -1.61344    0.14068 -11.469  < 2e-16 ***
    SmokingFormer     -0.01376    0.25376  -0.054   0.9568
    Smoking1-14 Cigs   0.31533    0.22789   1.384   0.1665
    Smoking15-25 Cigs  1.00451    0.17976   5.588 2.30e-08 ***
    Smoking25-34 Cigs  1.56150    0.21254   7.347 2.03e-13 ***
    Smoking35-44 Cigs  1.51417    0.21132   7.165 7.77e-13 ***
    Smoking45+ Cigs    2.09646    0.25855   8.108 5.13e-16 ***
    Coffee> 5          0.31995    0.13766   2.324   0.0201 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 173.7899  on 13  degrees of freedom
    Residual deviance:   3.7622  on  6  degrees of freedom
    AIC: 84.311

    Number of Fisher Scoring iterations: 3

Notice that "SmokingNever" is now absent from the output, so we know it is being used as the reference group. The OR's associated with the various levels of smoking are computed below from the refitted model.

    > Smoke.coefs = Coffee.glm2$coefficients[2:7]
    > exp(Smoke.coefs)
        SmokingFormer  Smoking1-14 Cigs Smoking15-25 Cigs Smoking25-34 Cigs
             0.986338          1.370715          2.730561          4.765984
    Smoking35-44 Cigs   Smoking45+ Cigs
             4.545632          8.137279

Confidence intervals for each could be computed in the standard way.

Some Details for Categorical Predictors with More Than Two Levels

Consider the coffee drinking/MI study above. The stratification variable smoking has seven levels.
Thus it requires six dummy variables to define it. The level that is not defined using a dichotomous dummy variable serves as the reference group. The table below shows the values of the dummy variables for each level:

    Level                     D2  D3  D4  D5  D6  D7
    Never (Reference Group)    0   0   0   0   0   0
    Former                     1   0   0   0   0   0
    1-14 Cigs                  0   1   0   0   0   0
    15-24 Cigs                 0   0   1   0   0   0
    25-34 Cigs                 0   0   0   1   0   0
    35-44 Cigs                 0   0   0   0   1   0
    45+ Cigs                   0   0   0   0   0   1

Example: Coffee Drinking and Myocardial Infarctions

    > CoffeeMI.data = read.table(file.choose(),header=T)
    > CoffeeMI.data
          Smoking Coffee MI NoMI
    1       Never    > 5  7   31
    2       Never    < 5 55  269
    3      Former    > 5  7   18
    4      Former    < 5 20  112
    5   1-14 Cigs    > 5  7   24
    6   1-14 Cigs    < 5 33  114
    7  15-25 Cigs    > 5 40   45
    8  15-25 Cigs    < 5 88  172
    9  25-34 Cigs    > 5 34   24
    10 25-34 Cigs    < 5 50   55
    11 35-44 Cigs    > 5 27   24
    12 35-44 Cigs    < 5 55   58
    13   45+ Cigs    > 5 30   17
    14   45+ Cigs    < 5 34   17

The Logistic Model

$$\ln\frac{\theta(\mathbf{x})}{1-\theta(\mathbf{x})} = \eta_0 + \eta_1\,\mathrm{Coffee} + \eta_2 D_2 + \eta_3 D_3 + \eta_4 D_4 + \eta_5 D_5 + \eta_6 D_6 + \eta_7 D_7$$

where Coffee is a dichotomous predictor equal to 1 if they drink 5 or more cups of coffee per day.

Comparing the log-odds of a heavy coffee drinker who smokes 15-24 cigarettes a day to a heavy coffee drinker who has never smoked, we have

$$\ln\frac{\theta_1(\mathbf{x})}{1-\theta_1(\mathbf{x})} = \eta_0 + \eta_1 + \eta_4$$

$$\ln\frac{\theta_2(\mathbf{x})}{1-\theta_2(\mathbf{x})} = \eta_0 + \eta_1$$

Taking the difference gives

$$\ln\left[\frac{\theta_1(\mathbf{x})\,/\,\bigl(1-\theta_1(\mathbf{x})\bigr)}{\theta_2(\mathbf{x})\,/\,\bigl(1-\theta_2(\mathbf{x})\bigr)}\right] = \eta_4$$

thus $e^{\eta_4}$ is the odds ratio associated with smoking 15-24 cigarettes per day when compared to individuals who have never smoked, amongst heavy coffee drinkers. Because $\eta_1$ is not involved in the odds ratio, the result is the same for non-heavy coffee drinkers as well!

You can also consider combinations of factors, e.g. if we compared heavy coffee drinkers who smoked 15-24 cigarettes to non-heavy coffee drinkers who have never smoked, the associated OR would be given by $e^{\eta_1+\eta_4}$.

Using our fitted model (the refit with "Never" as the reference group), the odds ratios discussed above would be:

    > summary(Coffee.glm2)
    Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)       -1.61344    0.14068 -11.469  < 2e-16 ***
    SmokingFormer     -0.01376    0.25376  -0.054   0.9568
    Smoking1-14 Cigs   0.31533    0.22789   1.384   0.1665
    Smoking15-25 Cigs  1.00451    0.17976   5.588 2.30e-08 ***
    Smoking25-34 Cigs  1.56150    0.21254   7.347 2.03e-13 ***
    Smoking35-44 Cigs  1.51417    0.21132   7.165 7.77e-13 ***
    Smoking45+ Cigs    2.09646    0.25855   8.108 5.13e-16 ***
    Coffee> 5          0.31995    0.13766   2.324   0.0201 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

OR for 15-24 cigarette smokers vs. never smokers (regardless of coffee drinking status)

    > exp(1.00451)
    [1] 2.730569

OR for 15-24 cigarette smokers who are also heavy coffee drinkers vs. non-smokers who are not heavy coffee drinkers

    > exp(.31995 + 1.00451)
    [1] 3.760154

Similar calculations could be done for other combinations of coffee and cigarette use.

Example 3: Risk Factors for Low Birth Weight

Response
Low = low birth weight, i.e. birth weight < 2500 grams (1 = yes, 0 = no)

Set of potential predictors
Prev = previous history of premature labor (1 = yes, 0 = no)
Hyper = hypertension during pregnancy (1 = yes, 0 = no)
Smoke = smoker (1 = yes, 0 = no)
Uterine = uterine irritability (1 = yes, 0 = no)
Minority = minority (1 = yes, 0 = no)
Age = mother's age in years
Lwt = mother's weight at last menstrual cycle

Analysis in R

    > Lowbirth = read.table(file.choose(),header=T)
    > Lowbirth[1:5,]   # print first 5 rows of the data set
      Low Prev Hyper Smoke Uterine Minority Age Lwt race  bwt
    1   0    0     0     0       1        1  19 182    2 2523
    2   0    0     0     0       0        1  33 155    3 2551
    3   0    0     0     1       0        0  20 105    1 2557
    4   0    0     0     1       1        0  21 108    1 2594
    5   0    0     0     1       1        0  18 107    1 2600

Make sure categorical variables are interpreted as factors by using the factor command. The commands below refer to the columns by name, so the data frame is attached first, as in the earlier examples.

    > attach(Lowbirth)
    > Low = factor(Low)
    > Prev = factor(Prev)
    > Hyper = factor(Hyper)
    > Smoke = factor(Smoke)
    > Uterine = factor(Uterine)
    > Minority = factor(Minority)

Note: This is not really necessary for dichotomous variables that are coded (0,1).
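The note above can be checked with a small sketch. It uses simulated data rather than the Lowbirth file, and the names smoke, y, fit_num and fit_fac are hypothetical, chosen only for illustration. Under those assumptions, a 0/1 predictor gives the same fitted coefficients whether it is entered as a number or as a factor; only the coefficient label changes.

```r
# Simulated data (hypothetical; this is NOT the Lowbirth data set).
set.seed(1)
n <- 200
smoke <- rbinom(n, 1, 0.4)                    # 0/1 indicator stored as numeric
y <- rbinom(n, 1, plogis(-1 + 0.9 * smoke))   # binary response

fit_num <- glm(y ~ smoke, family = binomial)          # numeric 0/1 predictor
fit_fac <- glm(y ~ factor(smoke), family = binomial)  # same predictor as a factor

# The design matrices are identical, so the estimates agree;
# only the label differs ("smoke" versus "factor(smoke)1").
est_num <- unname(coef(fit_num))
est_fac <- unname(coef(fit_fac))
print(names(coef(fit_fac)))
print(all.equal(est_num, est_fac))
```

Converting to a factor still matters for predictors with more than two levels, or for codings other than (0,1), because R would otherwise treat the codes as a numeric scale.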
Fit a preliminary model using all available covariates

    > low.glm = glm(Low~Prev+Hyper+Smoke+Uterine+Minority+Age+Lwt,family=binomial)
    > summary(low.glm)

    Call:
    glm(formula = Low ~ Prev + Hyper + Smoke + Uterine + Minority +
        Age + Lwt, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.6010  -0.8149  -0.5128   1.0188   2.1977

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.378479   1.170627   0.323  0.74646
    Prev1        1.196011   0.461534   2.591  0.00956 **
    Hyper1       1.452236   0.652085   2.227  0.02594 *
    Smoke1       0.959406   0.405302   2.367  0.01793 *
    Uterine1     0.647498   0.466468   1.388  0.16511
    Minority1    0.990929   0.404969   2.447  0.01441 *
    Age         -0.043221   0.037493  -1.153  0.24900
    Lwt         -0.012047   0.006422  -1.876  0.06066 .
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

        Null deviance: 232.40  on 185  degrees of freedom
    Residual deviance: 196.71  on 178  degrees of freedom
    AIC: 212.71

    Number of Fisher Scoring iterations: 3

It appears that neither uterine irritability nor mother's age is significant. We can fit the reduced model eliminating both terms and test whether the model is significantly degraded by using the general chi-square test (see the JMP example above).

    > low.reduced = glm(Low~Prev+Hyper+Smoke+Minority+Lwt,family=binomial)
    > summary(low.reduced)

    Call:
    glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.7277  -0.8219  -0.5368   0.9867   2.1517

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.261274   0.885803  -0.295  0.76803
    Prev1        1.181940   0.444254   2.661  0.00780 **
    Hyper1       1.397219   0.656271   2.129  0.03325 *
    Smoke1       0.981849   0.398300   2.465  0.01370 *
    Minority1    1.044804   0.394956   2.645  0.00816 **
    Lwt         -0.014127   0.006387  -2.212  0.02697 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 232.40  on 185  degrees of freedom
    Residual deviance: 200.32  on 180  degrees of freedom
    AIC: 212.32

    Number of Fisher Scoring iterations: 3

Reduced Model:

$$H_0:\ \ln\frac{\theta(\mathbf{x})}{1-\theta(\mathbf{x})} = \eta_0 + \eta_2\,\mathrm{Lwt} + \eta_3\,\mathrm{Minority} + \eta_4\,\mathrm{Smoke} + \eta_5\,\mathrm{Prev} + \eta_6\,\mathrm{Hyper}$$

Full Model:

$$H_1:\ \ln\frac{\theta(\mathbf{x})}{1-\theta(\mathbf{x})} = \eta_0 + \eta_1\,\mathrm{Age} + \eta_2\,\mathrm{Lwt} + \eta_3\,\mathrm{Minority} + \eta_4\,\mathrm{Smoke} + \eta_5\,\mathrm{Prev} + \eta_6\,\mathrm{Hyper} + \eta_7\,\mathrm{Uterine}$$

* Recall: $\theta(\mathbf{x}) = P(\mathrm{Low} = 1 \mid \mathbf{x})$

Residual Deviance, Null Hypothesis Model: $D_{H_0} = 200.32$, df = 180
Residual Deviance, Alternative Hypothesis Model: $D_{H_1} = 196.71$, df = 178

General Chi-Square Test

$$\chi^2 = D_{H_0} - D_{H_1} = 200.32 - 196.71 = 3.607, \qquad \text{df} = 180 - 178 = 2$$

$$p\text{-value} = P(\chi^2_2 > 3.607) = .1647$$

We fail to reject the null hypothesis; the reduced model is adequate.

Interpretation of Model Parameters

OR's Associated with Categorical Predictors

    > low.reduced

    Call:  glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

    Coefficients:
    (Intercept)        Prev1       Hyper1       Smoke1    Minority1          Lwt
       -0.26127      1.18194      1.39722      0.98185      1.04480     -0.01413

    Degrees of Freedom: 185 Total (i.e. Null);  180 Residual
    Null Deviance:      232.4
    Residual Deviance: 200.3        AIC: 212.3

Estimated OR's

    > exp(low.reduced$coefficients[2:5])
        Prev1    Hyper1    Smoke1 Minority1
     3.260693  4.043938  2.669388  2.842841

95% CI for OR Associated with History of Premature Labor (Wald Intervals)

    > exp(1.182 - 1.96*.444)
    [1] 1.365827
    > exp(1.182 + 1.96*.444)
    [1] 7.78532

Holding everything else constant, we estimate that the odds of having an infant with low birth weight are between 1.366 and 7.785 times larger for mothers with a history of premature labor.

95% CI for OR Associated with Hypertension

    > exp(1.397 - 1.96*.6563)
    [1] 1.117006
    > exp(1.397 + 1.96*.6563)
    [1] 14.63401

Holding everything else constant, we estimate that the odds of having an infant with low birth weight are between 1.117 and 14.63 times larger for mothers with hypertension during pregnancy.
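The interval arithmetic above repeats the same pattern for every coefficient, so it can be wrapped in a small helper. The sketch below is one possible way to do that; the function name wald_or_ci is hypothetical (it is not part of R or of the handout), and the estimates and standard errors are typed in from the summary(low.reduced) output rather than taken from a fitted object.

```r
# Wald confidence interval for an odds ratio, given a log-odds estimate
# and its standard error. (wald_or_ci is a hypothetical helper name.)
wald_or_ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(OR = exp(est), lower = exp(est - z * se), upper = exp(est + z * se))
}

# Estimates and standard errors copied from the summary(low.reduced) output.
prev_ci  <- wald_or_ci(1.181940, 0.444254)   # history of premature labor
hyper_ci <- wald_or_ci(1.397219, 0.656271)   # hypertension during pregnancy

print(round(rbind(Prev = prev_ci, Hyper = hyper_ci), 3))
```

The results differ slightly in the third decimal from the hand calculations because unrounded inputs and the exact normal quantile are used. When the fitted object is available, exp(confint.default(low.reduced)) should give the same Wald-type intervals for all coefficients at once.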
95% CI for OR Associated with Smoking

    > exp(.981849 - 1.96*.3983)
    [1] 1.222846
    > exp(.981849 + 1.96*.3983)
    [1] 5.827086

Holding everything else constant, we estimate that the odds of having an infant with low birth weight are between 1.223 and 5.827 times larger for mothers who smoked during pregnancy.

95% CI for OR Associated with Minority Status

    > exp(1.0448 - 1.96*.3950)
    [1] 1.310751
    > exp(1.0448 + 1.96*.3950)
    [1] 6.16569

Holding everything else constant, we estimate that the odds of having an infant with low birth weight are between 1.311 and 6.166 times larger for non-white mothers.

OR Associated with Mother's Weight at Last Menstrual Cycle

Because this is a continuous predictor with values over 100, we should use an increment larger than one when considering the effect of mother's weight on low birth weight. Here we will use an increment of c = 10 lbs., although certainly there are other possibilities.

    > exp(-10*.014127)
    [1] 0.8682549

i.e. the OR is 0.868, a 13.2% decrease in the odds of low birth weight for each additional 10 lbs. of weight at the last menstrual cycle. A 95% CI for this OR is:

    > exp(10*(-.014127) - 1.96*10*.006387)
    [1] 0.7660903
    > exp(10*(-.014127) + 1.96*10*.006387)
    [1] 0.9840439

Create a sequence of weights from the smallest observed weight to the largest observed weight in ½ pound increments.

    > x = seq(min(Lwt),max(Lwt),.5)

Here I have set the other covariates as follows: previous history (1 = yes), hypertension (0 = no), smoking status (1 = yes), and minority (0 = no).

    > fit = predict(low.reduced,data.frame(Prev=factor(rep(1,length(x))),
    +       Hyper=factor(rep(0,length(x))),Smoke=factor(rep(1,length(x))),
    +       Minority=factor(rep(0,length(x))),Lwt=x),type="response")
    > plot(x,fit,xlab="Mother's Weight",ylab="P(Low|Prev=1,Smoke=1,Lwt)")

This is a plot of the effect of mother's weight for smoking mothers with a history of premature labor. Using the predict command above, similar plots could be constructed by examining other combinations of the categorical predictors.
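As a sketch of how the other combinations might be handled in one pass, the code below loops over every combination of two categorical predictors and collects the predicted curves in a matrix. It runs on simulated data, since the Lowbirth file is not available here; the data frame sim, the model fit and the coefficients used to simulate are all hypothetical and only mimic the structure of the example.

```r
# Simulated stand-in for the low birth weight data (hypothetical values).
set.seed(2)
n <- 300
sim <- data.frame(
  Lwt   = round(runif(n, 80, 250)),
  Smoke = factor(rbinom(n, 1, 0.4)),
  Prev  = factor(rbinom(n, 1, 0.2))
)
eta <- 0.5 - 0.015 * sim$Lwt + 0.9 * (sim$Smoke == "1") + 1.1 * (sim$Prev == "1")
sim$Low <- rbinom(n, 1, plogis(eta))

fit <- glm(Low ~ Prev + Smoke + Lwt, family = binomial, data = sim)

# Grid of weights and all combinations of the categorical predictors.
x <- seq(min(sim$Lwt), max(sim$Lwt), by = 0.5)
combos <- expand.grid(Prev = c("0", "1"), Smoke = c("0", "1"),
                      stringsAsFactors = FALSE)

# One column of predicted probabilities per combination.
curves <- sapply(seq_len(nrow(combos)), function(i) {
  nd <- data.frame(
    Prev  = factor(rep(combos$Prev[i],  length(x)), levels = levels(sim$Prev)),
    Smoke = factor(rep(combos$Smoke[i], length(x)), levels = levels(sim$Smoke)),
    Lwt   = x
  )
  predict(fit, newdata = nd, type = "response")
})

matplot(x, curves, type = "l", lty = 1:4, col = 1:4, ylim = c(0, 1),
        xlab = "Mother's Weight", ylab = "P(Low = 1)")
```

Setting the factor levels explicitly in the new data keeps them aligned with the levels used when the model was fit, which matters when a new data frame contains only one level of a factor.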
Case Diagnostics (Delta Deviance and Cook's Distance)

As in the case of ordinary least squares (OLS) regression, we need to be wary of cases that may have unduly high influence on our results and of those that are poorly fit. The most common influence measure is Cook's Distance, and a good measure of poorly fit cases is the Delta Deviance.

Essentially, Cook's Distance ($\Delta\hat{\beta}_{(-i)}$ or $D_i$) measures the change in the estimated parameters when the ith observation is deleted. This change is computed for each of the observations and can be plotted versus $\hat{\theta}(\mathbf{x})$ or observation number to aid in the identification of high influence cases. Several cut-offs have been proposed for Cook's Distance, the most common being to classify an observation as having large influence if $\Delta\hat{\beta}_{(-i)} > 1$ or, in the case of a large sample size n, $\Delta\hat{\beta}_{(-i)} > 4/n$.

Cook's Distance

$$\Delta\hat{\beta}_{(-i)} = \frac{1}{k}\left(\frac{\hat{e}_{\chi_i}}{\sqrt{1-h_i}}\right)^{2}\frac{h_i}{1-h_i}$$

where k is the number of estimated parameters, $h_i$ is the leverage of the ith case, and

$$\hat{e}_{\chi_i} = \frac{y_i - \hat{y}_i}{\sqrt{n_i\,\hat{\theta}(\mathbf{x}_i)\,\bigl(1-\hat{\theta}(\mathbf{x}_i)\bigr)}}$$

is the Pearson residual defined above.

Delta deviance measures the change in the deviance (D) when the ith case is deleted. Cases with values around 4 or larger are considered poorly fit. These correspond to individuals where $y_i = 1$ but $\hat{\theta}(\mathbf{x})$ is small, or where $y_i = 0$ but $\hat{\theta}(\mathbf{x})$ is large.

In cases of both high influence and poor fit, it is good to look at the covariate values for these individuals so that we can begin to address the role they play in the analysis. In many cases there will be several individuals with the same covariate pattern, especially if most or all of the predictors are categorical in nature.

    > Diagplot.glm(low.reduced)
    > Diagplot.log(low.reduced)

Cases 11 and 13 have the highest Cook's distances, although they are not that large. It should also be noted that they are somewhat poorly fit. Cases 129, 144, 152, and 180 appear to be poorly fit. The information on all of these cases is shown below.
    > Lowbirth[c(11,13,129,144,152,180),]
        Low Prev Hyper Smoke Uterine Minority Age Lwt race  bwt
    11    0    0     1     0       0        1  19  95    3 2722
    13    0    0     1     0       0        1  22  95    3 2750
    129   1    0     0     0       1        0  29 130    1 1021
    144   1    0     0     0       1        1  21 200    2 1928
    152   1    0     0     0       0        0  24 138    1 2100
    180   1    0     0     1       0        0  26 190    1 2466

Case 152 had a low birth weight infant even in the absence of the identified potential risk factors. The fitted values for all four of the poorly fit cases are quite small.

    > fitted(low.reduced)[c(11,13,129,144,152,180)]
            11         13        129        144        152        180
    0.69818500 0.69818500 0.10930602 0.11486743 0.09877858 0.12307383

Cases 11 and 13 have high predicted probabilities despite the fact that they had babies with normal birth weight. Their relatively high leverage might come from the fact that there were very few hypertensive minority women in the study. These two facts combined lead to the relatively large Cook's Distances for these two cases.

Plotting Estimated Conditional Probabilities $\hat{P}(\mathrm{Low} = 1 \mid \mathbf{x})$

A summary of the reduced model is given below:

    > low.reduced

    Call:  glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

    Coefficients:
    (Intercept)        Prev1       Hyper1       Smoke1    Minority1          Lwt
       -0.26127      1.18194      1.39722      0.98185      1.04480     -0.01413

    Degrees of Freedom: 185 Total (i.e. Null);  180 Residual
    Null Deviance:      232.4
    Residual Deviance: 200.3        AIC: 212.3

To easily plot probabilities in R we can write a function that takes covariate values and computes the desired conditional probability.

    > x <- seq(min(Lwt),max(Lwt),.5)
    > PrLwt <- function(x,Prev,Hyper,Smoke,Minority) {
    +   L <- -.26127 + 1.18194*Prev + 1.39722*Hyper + .98185*Smoke +
    +        1.0448*Minority - .01413*x
    +   exp(L)/(1 + exp(L))
    + }
    > plot(x,PrLwt(x,1,1,1,1),xlab="Mother's Weight",ylab="P(Low=1|x)",
    +      ylim=c(0,1),type="l")
    > title(main="Plot of P(Low=1|X) vs. Mother's Weight")
    > lines(x,PrLwt(x,0,0,0,0),lty=2,col="red")
    > lines(x,PrLwt(x,1,1,0,0),lty=3,col="blue")
    > lines(x,PrLwt(x,0,0,1,1),lty=4,col="green")

R Function: Diagplot.log
Plot Cook's Distance and Delta Deviance for Logistic Regression Models

    Diagplot.log = function(glm1)
    {
        k <- length(glm1$coef)                   # number of estimated parameters
        h <- lm.influence(glm1)$hat              # leverages
        fv <- fitted(glm1)
        pr <- resid(glm1, type = "pearson")
        dr <- resid(glm1, type = "deviance")
        par(mfrow = c(2, 1))
        n <- length(fv)
        index <- seq(1, n, 1)
        Ck <- (1/k)*((pr^2) * h)/((1 - h)^2)     # Cook's distance
        Cd <- dr^2/(1 - h)                       # delta deviance
        plot(index, Ck, type = "n", xlab = "Index", ylab = "Cook's Distance",
            cex = 0.7, main = "Plot of Cook's Distance vs. Index", col = 1)
        points(index, Ck, col = 2)
        identify(index, Ck)                      # click to label cases
        plot(index, Cd, type = "n", xlab = "Index", ylab = "Delta Deviance",
            cex = 0.7, main = "Plot of Delta Deviance vs. Index")
        points(index, Cd, col = 2)
        identify(index, Cd)
        par(mfrow = c(1, 1))
        invisible()
    }

Diagplot.glm: displays case diagnostic plots for a logistic regression

    Diagplot.glm = function(lm1, lms = summary(lm1), lmi = lm.influence(lm1))
    {
        par(mfrow = c(2, 2))
        h <- lmi$hat
        pr <- residuals(lm1, type = "pearson")
        dr <- residuals(lm1, type = "deviance")
        dB <- ((pr^2) * h)/((1 - h)^2)           # influence measure (not divided by k)
        dD <- dr^2/(1 - h)                       # delta deviance
        fv <- lm1$fitted.values
        plot(fv, dB, main = "Plot of dB vs. Fitted Values",
            xlab = "Fitted Values", ylab = "dB")
        points(fv[dB > 1], dB[dB > 1], col = "blue")
        plot(fv, dD, main = "Plot of dD vs. Fitted Values",
            xlab = "Fitted Values", ylab = "dD")
        points(fv[dD > 4], dD[dD > 4], col = "blue")
        index <- seq_along(fv)
        plot(dB, main = "Plot of dB vs. Index Number", xlab = "Index Number")
        points(index[dB > 1], dB[dB > 1], col = "blue")
        identify(index, dB, cex = 0.75)
        plot(dD, main = "Plot of dD vs. Index Number", xlab = "Index Number")
        points(index[dD > 4], dD[dD > 4], col = "blue")
        identify(index, dD, cex = 0.75)
        par(mfrow = c(1, 1))
        invisible()
    }

Interactions and Higher Order Terms (Note: uses data frame Lowbwt)

Here we work with a slightly different version of the low birth weight data, which includes an additional predictor, ftv, a factor that indicates the number of first trimester doctor visits the woman had (coded as: 0, 1, or 2+). We will examine how the model below was developed in the next section, where we discuss model development. In the model below we have added an interaction between age and the number of first trimester visits, and an interaction between smoking and uterine irritability. The logistic model is:

$$\log\frac{\theta(\mathbf{x})}{1-\theta(\mathbf{x})} = \eta_0 + \eta_1\,\mathrm{Age} + \eta_2\,\mathrm{Lwt} + \eta_3\,\mathrm{Smoke} + \eta_4\,\mathrm{Prev} + \eta_5\,\mathrm{HT} + \eta_6\,\mathrm{UI} + \eta_7\,\mathrm{FTV1} + \eta_8\,\mathrm{FTV2} + \eta_9\,\mathrm{Age}\cdot\mathrm{FTV1} + \eta_{10}\,\mathrm{Age}\cdot\mathrm{FTV2} + \eta_{11}\,\mathrm{Smoke}\cdot\mathrm{UI}$$

    > summary(bigmodel)

    Call:
    glm(formula = low ~ age + lwt + smoke + ptd + ht + ui + ftv +
        age:ftv + smoke:ui, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.8945  -0.7128  -0.4817   0.7841   2.3418

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.582389   1.420834  -0.410 0.681885
    age          0.075538   0.053945   1.400 0.161428
    lwt         -0.020372   0.007488  -2.721 0.006513 **
    smoke1       0.780047   0.420043   1.857 0.063302 .
    ptd1         1.560304   0.496626   3.142 0.001679 **
    ht1          2.065680   0.748330   2.760 0.005773 **
    ui1          1.818496   0.666670   2.728 0.006377 **
    ftv1         2.921068   2.284093   1.279 0.200941
    ftv2+        9.244460   2.650495   3.488 0.000487 ***
    age:ftv1    -0.161823   0.096736  -1.673 0.094360 .
    age:ftv2+   -0.411011   0.118553  -3.467 0.000527 ***
    smoke1:ui1  -1.916644   0.972366  -1.971 0.048711 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 234.67  on 188  degrees of freedom
    Residual deviance: 183.07  on 177  degrees of freedom
    AIC: 207.07

    Number of Fisher Scoring iterations: 4

    > bigmodel$coefficients
    (Intercept)         age         lwt      smoke1       prev1         ht1
    -0.58238913  0.07553844 -0.02037234  0.78004747  1.56030401  2.06567991
            ui1        ftv1       ftv2+    age:ftv1   age:ftv2+  smoke1:ui1
     1.81849631  2.92106773  9.24445985 -0.16182328 -0.41101103 -1.91664380

Calculate P(Low|Age,FTV) for women of average pre-pregnancy weight with all other risk factors absent. Similar calculations could be done if we wanted to add in other factors as well.

First we calculate the logits as a function of age for the three levels of FTV (0, 1, and 2+ respectively). Here agex is a vector of ages at which the curves are evaluated; its definition is not shown.

    > L <- -.5824 + .0755*agex - .02037*mean(lwt)
    > L1 <- -.5824 + .0755*agex - .02037*mean(lwt) + 2.9211 - .16182*agex
    > L2 <- -.5824 + .0755*agex - .02037*mean(lwt) + 9.2445 - .4110*agex

Next we calculate the associated conditional probabilities.

    > P <- exp(L)/(1+exp(L))
    > P1 <- exp(L1)/(1+exp(L1))
    > P2 <- exp(L2)/(1+exp(L2))

Finally we plot the probability curves as a function of age and FTV.

    > plot(agex,P,type="l",xlab="Age",ylab="P(Low|Age,FTV)",ylim=c(0,1))
    > lines(agex,P1,lty=2,col="blue")
    > lines(agex,P2,lty=3,col="red")
    > title(main="Interaction Between Age and First Trimester Visits",cex=.6)

The interaction between age and FTV produces differences in the direction and magnitude of the age effect. For women with no first trimester doctor visits, the probability of low birth weight increases with age. However, for women with at least one first trimester visit, the probability of low birth weight decreases with age. The magnitude of that drop is largest for women with 2 or more first trimester visits.

We also have an interaction between smoking and uterine irritability added to the model. This will affect how we interpret the two in terms of odds ratios.
We need to consider the OR associated with smoking for women without uterine irritability, the OR associated with uterine irritability for nonsmokers, and finally the OR associated with both smoking and having uterine irritability during pregnancy. These estimated odds ratios are given below:

OR for Smoking with No Uterine Irritability

    > exp(.7800)
    [1] 2.181472

OR for Uterine Irritability with No Smoking

    > exp(1.8185)
    [1] 6.162608

OR for Smoking and Uterine Irritability

    > exp(.7800+1.8185-1.91664)
    [1] 1.977553

This last result is hard to explain physiologically, since the OR for having both risk factors is smaller than the OR for either one alone, and so this interaction term might be removed from the model.

Model Selection Methods

Stepwise methods used in logistic regression are the same as those used in ordinary least squares regression; however, the measure is the AIC (Akaike Information Criterion) as opposed to Mallows' Ck statistic. Like Mallows' statistic, AIC balances residual deviance and the number of parameters in the model.

$$\mathrm{AIC} = D + 2k\hat{\phi}$$

where D = residual deviance, k = total number of estimated parameters, and $\hat{\phi}$ is an estimate of the dispersion parameter, which is taken to be 1 in models where overdispersion is not present. Overdispersion occurs when the data consist of the number of successes out of $m_i > 1$ trials and the trials are not independent (e.g. male birth data from your last homework).

Forward, backward, both forward and backward simultaneously, and all possible subsets regression methods can be employed to find models with small AIC values. When a scope is supplied, R by default uses both forward and backward selection simultaneously; without a scope, step performs backward elimination from the current model. The command to do this in R has the basic form:

    > step(current model name)

To have it select from models containing all potential two-way interactions use:

    > step(current model name, scope=~.^2)

This sometimes will have problems with convergence due to overfitting (i.e. the estimated probabilities approach 0 and 1 as in the saturated model).
If this occurs you can have R consider adding each of the potential interaction terms, and then you can scan the list and decide which you might want to add to your existing model. You can then continue adding terms until the AIC criterion suggests that additional terms do not improve the current model.

These commands are illustrated for the low birth weight data, with first trimester visits included, in the output shown below.

Base Model

    > low.glm <- glm(low~age+lwt+race+smoke+ht+ui+ptd+ftv,family=binomial)
    > summary(low.glm)

    Call:
    glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptd +
        ftv, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.7038  -0.8068  -0.5009   0.8836   2.2151

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.822706   1.240174   0.663  0.50709
    age         -0.037220   0.038530  -0.966  0.33404
    lwt         -0.015651   0.007048  -2.221  0.02637 *
    race2        1.192231   0.534428   2.231  0.02569 *
    race3        0.740513   0.459769   1.611  0.10726
    smoke1       0.755374   0.423246   1.785  0.07431 .
    ht1          1.912974   0.718586   2.662  0.00776 **
    ui1          0.680162   0.463464   1.468  0.14222
    ptd1         1.343654   0.479409   2.803  0.00507 **
    ftv1        -0.436331   0.477792  -0.913  0.36112
    ftv2+        0.178939   0.455227   0.393  0.69426
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 234.67  on 188  degrees of freedom
    Residual deviance: 195.48  on 178  degrees of freedom
    AIC: 217.48

Find "best" model that includes all potential two-way interactions.
    > low.step <- step(low.glm,scope=~.^2)
    Start:  AIC= 217.48
     low ~ age + lwt + race + smoke + ht + ui + ptd + ftv

                 Df Deviance    AIC
    + age:ftv     2   183.00 209.00
    - ftv         2   196.83 214.83
    - age         1   196.42 216.42
    <none>            195.48 217.48
    - ui          1   197.59 217.59
    + smoke:ui    1   193.76 217.76
    + lwt:smoke   1   194.04 218.04
    + ui:ptd      1   194.24 218.24
    + lwt:ui      1   194.28 218.28
    + ptd:ftv     2   192.38 218.38
    + ht:ptd      1   194.55 218.55
    + age:ptd     1   194.58 218.58
    + age:ht      1   194.59 218.59
    + age:smoke   1   194.61 218.61
    + race:ui     2   192.63 218.63
    - smoke       1   198.67 218.67
    + smoke:ht    1   195.03 219.03
    + smoke:ptd   1   195.16 219.16
    - race        2   201.23 219.23
    + race:smoke  2   193.24 219.24
    + lwt:ptd     1   195.35 219.35
    + lwt:ht      1   195.44 219.44
    + age:lwt     1   195.46 219.46
    + age:ui      1   195.47 219.47
    + ht:ftv      2   194.00 220.00
    + lwt:ftv     2   194.19 220.19
    + smoke:ftv   2   194.47 220.47
    + age:race    2   194.58 220.58
    + lwt:race    2   194.63 220.63
    + race:ptd    2   194.83 220.83
    - lwt         1   200.95 220.95
    + race:ht     2   195.19 221.19
    + ui:ftv      2   195.32 221.32
    - ht          1   202.93 222.93
    - ptd         1   203.58 223.58
    + race:ftv    4   193.81 223.81

    Step:  AIC= 209
     low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv

                 Df Deviance    AIC
    + smoke:ui    1   179.94 207.94
    + lwt:smoke   1   180.89 208.89
    - race        2   186.99 208.99
    <none>            183.00 209.00
    + ui:ptd      1   181.42 209.42
    + lwt:ui      1   181.90 209.90
    + ht:ptd      1   182.06 210.06
    - smoke       1   186.11 210.11
    + age:smoke   1   182.16 210.16
    + race:ui     2   180.32 210.32
    + age:ptd     1   182.50 210.50
    - ui          1   186.61 210.61
    + smoke:ht    1   182.71 210.71
    + lwt:ptd     1   182.75 210.75
    + smoke:ptd   1   182.82 210.82
    + age:ht      1   182.90 210.90
    + age:ui      1   182.96 210.96
    + age:lwt     1   183.00 211.00
    + lwt:ht      1   183.00 211.00
    + race:smoke  2   181.23 211.23
    + lwt:ftv     2   181.44 211.44
    + ptd:ftv     2   181.57 211.57
    + age:race    2   181.62 211.62
    + smoke:ftv   2   181.65 211.65
    + ht:ftv      2   181.82 211.82
    + lwt:race    2   182.55 212.55
    + race:ht     2   182.78 212.78
    + race:ptd    2   182.85 212.85
    - lwt         1   188.88 212.88
    + ui:ftv      2   182.94 212.94
    - ht          1   190.13 214.13
    - ptd         1   191.05 215.05
    + race:ftv    4   181.69 215.69
    - age:ftv     2   195.48 217.48

    Step:  AIC= 207.94
     low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv + smoke:ui

                 Df Deviance    AIC
    - race        2   183.07 207.07
    <none>            179.94 207.94
    + lwt:smoke   1   178.34 208.34
    + ht:ptd      1   178.89 208.89
    - smoke:ui    1   183.00 209.00
    + ui:ptd      1   179.07 209.07
    + age:ptd     1   179.35 209.35
    + age:smoke   1   179.37 209.37
    + smoke:ptd   1   179.58 209.58
    + lwt:ptd     1   179.61 209.61
    + lwt:ui      1   179.76 209.76
    + age:ht      1   179.78 209.78
    + smoke:ht    1   179.82 209.82
    + age:lwt     1   179.84 209.84
    + age:ui      1   179.86 209.86
    + lwt:ht      1   179.94 209.94
    + lwt:ftv     2   178.25 210.25
    + ptd:ftv     2   178.53 210.53
    + smoke:ftv   2   178.64 210.64
    + race:smoke  2   178.73 210.73
    + age:race    2   178.84 210.84
    + ht:ftv      2   178.89 210.89
    + race:ui     2   179.13 211.13
    + ui:ftv      2   179.50 211.50
    + race:ht     2   179.52 211.52
    + lwt:race    2   179.68 211.68
    + race:ptd    2   179.86 211.86
    - lwt         1   187.15 213.15
    - ht          1   187.66 213.66
    + race:ftv    4   178.51 214.51
    - ptd         1   188.83 214.83
    - age:ftv     2   193.76 217.76

    Step:  AIC= 207.07
     low ~ age + lwt + smoke + ht + ui + ptd + ftv + age:ftv + smoke:ui

                 Df Deviance    AIC
    <none>            183.07 207.07
    + lwt:smoke   1   181.40 207.40
    + ui:ptd      1   181.88 207.88
    + ht:ptd      1   181.93 207.93
    + race        2   179.94 207.94
    + age:smoke   1   181.97 207.97
    + age:ht      1   182.64 208.64
    + age:ptd     1   182.69 208.69
    + lwt:ptd     1   182.73 208.73
    + lwt:ui      1   182.76 208.76
    + smoke:ptd   1   182.85 208.85
    + age:lwt     1   182.92 208.92
    - smoke:ui    1   186.99 208.99
    + age:ui      1   182.99 208.99
    + smoke:ht    1   183.02 209.02
    + lwt:ht      1   183.06 209.06
    + smoke:ftv   2   181.48 209.48
    + lwt:ftv     2   181.69 209.69
    + ptd:ftv     2   181.85 209.85
    + ui:ftv      2   182.28 210.28
    + ht:ftv      2   182.41 210.41
    - ht          1   191.21 213.21
    - lwt         1   191.56 213.56
    - ptd         1   193.59 215.59
    - age:ftv     2   199.00 219.00

Summarize the model returned from the stepwise search

    > summary(low.step)

    Call:
    glm(formula = low ~ age + lwt + smoke + ht + ui + ptd + ftv +
        age:ftv + smoke:ui, family = binomial)

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.582389   1.420834  -0.410 0.681885
    age          0.075538   0.053945   1.400 0.161428
    lwt         -0.020372   0.007488  -2.721 0.006513 **
    smoke1       0.780047   0.420043   1.857 0.063302 .
    ht1          2.065680   0.748330   2.760 0.005773 **
    ui1          1.818496   0.666670   2.728 0.006377 **
    ptd1         1.560304   0.496626   3.142 0.001679 **
    ftv1         2.921068   2.284093   1.279 0.200941
    ftv2+        9.244460   2.650495   3.488 0.000487 ***
    age:ftv1    -0.161823   0.096736  -1.673 0.094360 .
    age:ftv2+   -0.411011   0.118553  -3.467 0.000527 ***
    smoke1:ui1  -1.916644   0.972366  -1.971 0.048711 *
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 234.67  on 188  degrees of freedom
    Residual deviance: 183.07  on 177  degrees of freedom
    AIC: 207.07

    Number of Fisher Scoring iterations: 4

This is the model used to demonstrate model interpretation in the presence of interactions.

An alternative to the full-blown search above is to consider adding a single interaction term to the "Base Model" from the set of all possible terms.

    > add1(low.glm,scope=~.^2)
    Single term additions

    Model:
    low ~ age + lwt + race + smoke + ht + ui + ptd + ftv
               Df Deviance    AIC
    <none>          195.48 217.48
    age:lwt     1   195.46 219.46
    age:race    2   194.58 220.58
    age:smoke   1   194.61 218.61
    age:ht      1   194.59 218.59
    age:ui      1   195.47 219.47
    age:ptd     1   194.58 218.58
    age:ftv     2   183.00 209.00 *
    lwt:race    2   194.63 220.63
    lwt:smoke   1   194.04 218.04
    lwt:ht      1   195.44 219.44
    lwt:ui      1   194.28 218.28
    lwt:ptd     1   195.35 219.35
    lwt:ftv     2   194.19 220.19
    race:smoke  2   193.24 219.24
    race:ht     2   195.19 221.19
    race:ui     2   192.63 218.63
    race:ptd    2   194.83 220.83
    race:ftv    4   193.81 223.81
    smoke:ht    1   195.03 219.03
    smoke:ui    1   193.76 217.76
    smoke:ptd   1   195.16 219.16
    smoke:ftv   2   194.47 220.47
    ht:ui       0   195.48 217.48
    ht:ptd      1   194.55 218.55
    ht:ftv      2   194.00 220.00
    ui:ptd      1   194.24 218.24
    ui:ftv      2   195.32 221.32
    ptd:ftv     2   192.38 218.38

We can then "manually" add the term with the smallest AIC (age:ftv, marked * above) to our base model by using the update command in R.
> low.glm2 <- update(low.glm,.~.+age:ftv)
> summary(low.glm2)

Call:
glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0338  -0.7690  -0.4510   0.8354   2.3383

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.636485   1.558677  -1.050  0.29376
age          0.085461   0.055734   1.533  0.12519
lwt         -0.017599   0.007653  -2.300  0.02147 *
race2        0.994134   0.550962   1.804  0.07118 .
race3        0.700669   0.491400   1.426  0.15391
smoke1       0.792972   0.452303   1.753  0.07957 .
ht1          1.936204   0.747576   2.590  0.00960 **
ui1          0.938620   0.492240   1.907  0.05654 .
ptd1         1.373390   0.495738   2.770  0.00560 **
ftv1         2.877889   2.253710   1.277  0.20162
ftv2+        8.264965   2.594444   3.186  0.00144 **
age:ftv1    -0.149619   0.096342  -1.553  0.12043
age:ftv2+   -0.359454   0.115429  -3.114  0.00185 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 183.00  on 176  degrees of freedom
AIC: 209

Number of Fisher Scoring iterations: 4

Next we could use add1 to consider the remaining interaction terms for addition to this model.
> add1(low.glm2,scope=~.^2)
Single term additions

Model:
low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv
           Df Deviance    AIC
<none>          183.00 209.00
age:lwt     1   183.00 211.00
age:race    2   181.62 211.62
age:smoke   1   182.16 210.16
age:ht      1   182.90 210.90
age:ui      1   182.96 210.96
age:ptd     1   182.50 210.50
lwt:race    2   182.55 212.55
lwt:smoke   1   180.89 208.89 *
lwt:ht      1   183.00 211.00
lwt:ui      1   181.90 209.90
lwt:ptd     1   182.75 210.75
lwt:ftv     2   181.44 211.44
race:smoke  2   181.23 211.23
race:ht     2   182.78 212.78
race:ui     2   180.32 210.32
race:ptd    2   182.85 212.85
race:ftv    4   181.69 215.69
smoke:ht    1   182.71 210.71
smoke:ui    1   179.94 207.94 **
smoke:ptd   1   182.82 210.82
smoke:ftv   2   181.65 211.65
ht:ui       0   183.00 209.00
ht:ptd      1   182.06 210.06
ht:ftv      2   181.82 211.82
ui:ptd      1   181.42 209.42
ui:ftv      2   182.94 212.94
ptd:ftv     2   181.57 211.57
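The AIC columns in the step and add1 output can be checked by hand. For a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below is only an arithmetic check, not part of the original analysis. The deviances are copied from the output in this handout. The coefficient counts are read off the summary tables (13 for low.glm2, 12 for low.step), and the count of 11 for the base model is an assumption inferred from the 2 degrees of freedom that age:ftv adds. The variable names (fits, aic, lrt_stat, lrt_df, lrt_p) are made up for illustration. The last lines also compute a drop-in-deviance (likelihood ratio) test for adding age:ftv to the base model.

```r
# Residual deviances and coefficient counts copied from the output in this handout.
# n_coef for low.glm (11) is inferred: low.glm2 has 13 coefficients and age:ftv adds 2.
fits <- data.frame(
  model    = c("low.glm", "low.glm2", "low.step"),
  deviance = c(195.48, 183.00, 183.07),
  n_coef   = c(11, 13, 12)
)

# For a 0/1 response: AIC = residual deviance + 2 * (number of coefficients)
aic <- fits$deviance + 2 * fits$n_coef
names(aic) <- fits$model
print(aic)

# Drop-in-deviance test for adding age:ftv to the base model
lrt_stat <- fits$deviance[1] - fits$deviance[2]   # change in deviance
lrt_df   <- fits$n_coef[2] - fits$n_coef[1]       # extra coefficients
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
cat("Change in deviance:", lrt_stat, "on", lrt_df, "df, p-value:", signif(lrt_p, 3), "\n")
```

The three computed AIC values should agree with the AIC reported for the corresponding models (217.48, 209.00, and 207.07). The same subtraction can be repeated for any row of an add1 table, using that row's Df as the degrees of freedom.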