Chapter 2-11. Logistic Regression & Dummy Variables

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course
Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.

When the outcome is dichotomous (scored as 0 or 1), logistic regression is
almost universally used. Before turning to it, let's see what happens if we
just use linear regression. We will use the fev dataset, this time attempting
to predict being a current smoker, rather than what is predictive of FEV.

Reading in the data,

    File > Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on fev.dta
        Open

    use fev, clear

Obtaining the means and percents,

    bysort male: sum smoker
    tab smoker male, col

-> male = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       318    .1226415    .3285422          0          1

-> male = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      smoker |       336     .077381    .2675934          0          1

                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00

We see that the means of the dichotomous variable smoker are identical to the
proportions computed using the crosstabulation approach. This is because the
mean and the proportion are computed identically for a 0-1 scored variable:

    $$\bar{X} = \frac{\sum X}{n} = \frac{1 + 1 + 0 + 1 + 0 + \cdots + 1}{n}
             = \frac{\#1\text{'s}}{n} = \text{proportion}$$

Since linear regression fits a straight line through the group means, it seems
reasonable that it will fit a straight line through the proportions if we have
a dichotomous outcome variable. Fitting the regression line to predict smoker
from male,

    Statistics > Linear models and related > Linear regression
        Model tab: Dependent variable: smoker
                   Independent variables: male
        OK

    regress smoker male

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =    3.75
       Model |  .334678982     1  .334678982           Prob > F      =  0.0533
    Residual |  58.2050764   652   .08927159           R-squared     =  0.0057
-------------+------------------------------           Adj R-squared =  0.0042
       Total |  58.5397554   653  .089647405           Root MSE      =  .29878

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.0452606   .0233756    -1.94   0.053     -.091161    .0006399
       _cons |   .1226415   .0167549     7.32   0.000     .0897413    .1555417
------------------------------------------------------------------------------

Asking for predicted values,

    Statistics > Postestimation > Predictions, residuals, etc.
        Main tab: New variable name: pred_smoker
                  Produce: fitted values
        OK

    predict pred_smoker

Creating a frequency table of the fitted values,

    tab pred_smoker

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
    .077381 |        336       51.38       51.38
   .1226415 |        318       48.62      100.00
------------+-----------------------------------
      Total |        654      100.00

We see that for the n = 336 males the predicted value is 0.077, and for the
n = 318 females the predicted value is 0.123.
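As a quick check, the two fitted values can be reproduced by hand from the
stored regression coefficients. A minimal sketch, run while the regress
estimates are still in memory:

    * Sketch: rebuild the two fitted values from the stored
    * coefficients of the regress command above.
    display _b[_cons] + _b[male]*1    // males:   .0773809
    display _b[_cons] + _b[male]*0    // females: .1226415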
These are identically the means computed using the sum command above, and
identically the proportions from the crosstabulation computed using the tab
command above.

From the linear regression output, the prediction equation is shown to be

    smoker-hat = a + bX = 0.1226415 - 0.0452606(male)

which is the straight line fitted to the data, and which provides the
predicted proportions of smokers:

    smoker-hat = 0.1226415 - 0.0452606(1) = 0.0774  for males
               = 0.1226415 - 0.0452606(0) = 0.1226  for females

Finally, notice that the p value from the Pearson chi-square test, p = 0.053,
is identical to the p value from the regression model, p = 0.053. (It only
comes out this close when the sample size is large.) From this example, we see
that using linear regression to compare two groups on a dichotomous variable
is just as good as using the chi-square test from a crosstabulation approach.
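The agreement between the two p values can also be checked programmatically,
since both commands leave their p value behind in r(p). A minimal sketch:

    * Sketch: pull both p values from the stored results and compare.
    quietly tab smoker male, chi2
    display "Pearson chi-square p = " %6.4f r(p)
    quietly regress smoker male
    quietly test male
    display "linear regression p  = " %6.4f r(p)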
Linear regression is not used for modeling a dichotomous outcome, however. The
major criticism is that it sometimes produces predicted values outside of the
0-1 range, which are impossible values for proportions. Statisticians are
driven crazy by such inconsistencies.

An example dataset for which such inconsistencies arise is vaso.dta. These
data, originally published by Finney (1947), were obtained in a carefully
controlled study of the effect of the RATE and VOLume of air inspired by human
subjects on the occurrence (coded 1) or non-occurrence (coded 0) of a
transient vasoconstriction RESPonse in the skin of the fingers.

Opening this data file,

    File > Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on vaso.dta
        Open

    use vaso, clear

Fitting a multivariable linear regression,

    Statistics > Linear models and related > Linear regression
        Model tab: Dependent variable: resp
                   Independent variables: vol rate
        OK

    regress resp vol rate

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =   14.70
       Model |  4.37997786     2  2.18998893           Prob > F      =  0.0000
    Residual |  5.36361188    36  .148989219           R-squared     =  0.4495
-------------+------------------------------           Adj R-squared =  0.4189
       Total |  9.74358974    38  .256410256           Root MSE      =  .38599

------------------------------------------------------------------------------
        resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vol |   .4011113    .084634     4.74   0.000     .2294657    .5727569
        rate |    .343427   .0783695     4.38   0.000     .1844862    .5023678
       _cons |   -.612613   .2176896    -2.81   0.008    -1.054108   -.1711181
------------------------------------------------------------------------------

The model output seems reasonable enough. However, when we look at the fitted
values, we discover the problem.

    predict pred_resp
    tab pred_resp

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
  -.1143759 |          1        2.56        2.56
  -.0970707 |          1        2.56        5.13
  -.0959705 |          1        2.56        7.69
   .0059574 |          1        2.56       10.26
    and so on ...
   .9760718 |          1        2.56       92.31
   1.154826 |          1        2.56       94.87
   1.165612 |          1        2.56       97.44
   1.220426 |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00

We see that the predicted proportion of vasoconstriction was < 0 for three
observations and > 1 for three observations. These are undefined values, since
a proportion by definition is between 0 and 1. [You can increase or decrease a
proportion by more than 1, or > 100%, but the proportion itself must be a
number between 0 and 1.]

Statisticians demand that statistical approaches be "consistent" across all
datasets; they expect logical results all the time. Primarily for this reason,
linear regression lost credibility for dichotomous outcomes, even though it
usually does predict between 0 and 1. Logistic regression was developed to
fill the need for a regression model for a dichotomous outcome. Logistic
regression is defined in such a way that it is impossible to predict a
proportion outside of the 0-1 interval. (How it does this is explained in
detail in the K30 regression models class.)
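The bounding happens because logistic regression models the proportion through
the inverse-logit transform, p = exp(xb)/(1 + exp(xb)), which squeezes any
real-valued linear predictor into the open interval (0, 1). A minimal sketch
using Stata's invlogit() function:

    * Sketch: invlogit() maps any linear predictor into (0,1),
    * so logistic fitted proportions can never escape the 0-1 range.
    display invlogit(-10)    // .0000454 -- near 0, but positive
    display invlogit(0)      // .5
    display invlogit(10)     // .9999546 -- near 1, but below it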
Fitting a logistic regression to this same dataset,

    Statistics > Binary outcomes > Logistic regression (reporting odds ratios)
        Model tab: Dependent variable: resp
                   Independent variables: vol rate
        OK

    logistic resp vol rate

Logistic regression                               Number of obs   =         39
                                                  LR chi2(2)      =      24.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -14.886152                       Pseudo R2       =     0.4491

------------------------------------------------------------------------------
        resp | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         vol |   48.52846   69.32595     2.72   0.007     2.951221    797.9788
        rate |   14.14156   12.92809     2.90   0.004     2.356874     84.8513
------------------------------------------------------------------------------

and requesting the predicted values of resp,

    predict pred_resp_logistic
    tab pred_resp_logistic

   Pr(resp) |      Freq.     Percent        Cum.
------------+-----------------------------------
   .0054134 |          1        2.56        2.56
   .0072905 |          1        2.56        5.13
   .0078175 |          1        2.56        7.69
    and so on ...
   .9990379 |          1        2.56       94.87
   .9991069 |          1        2.56       97.44
   .9992014 |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00

We see that logistic regression predicted proportions consistently inside the
0-1 range.

Returning to the fev dataset,

    use fev, clear
    tab smoker male, col chi2

                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
                   |     87.74      92.26 |     90.06
-------------------+----------------------+----------
    current smoker |        39         26 |        65
                   |     12.26       7.74 |      9.94
-------------------+----------------------+----------
             Total |       318        336 |       654
                   |    100.00     100.00 |    100.00

          Pearson chi2(1) =   3.7390   Pr = 0.053

and computing the odds ratio,

    Statistics > Observational/Epi. analysis > Tables for epidemiologists >
        Case-control odds ratio
        Main tab: Case variable: smoker
                  Exposed variables: male
        OK

    cc smoker male

                 |   Exposed   Unexposed |      Total     Proportion Exposed
-----------------+------------------------+----------------------------------
           Cases |        26          39 |         65               0.4000
        Controls |       310         279 |        589               0.5263
-----------------+------------------------+----------------------------------
           Total |       336         318 |        654               0.5138
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |              .6        |    .3414402    1.041289  (exact)
 Prev. frac. ex. |              .4        |   -.0412895    .6585598  (exact)
 Prev. frac. pop |        .2105263        |
                 +-------------------------------------------------
                               chi2(1) =     3.74  Pr>chi2 = 0.0532

We'll define the odds ratio in a moment. Now, analyzing the data using
logistic regression,

    logistic smoker male

Logistic regression                               Number of obs   =        654
                                                  LR chi2(1)      =       3.75
                                                  Prob > chi2     =     0.0527
Log likelihood = -209.84678                       Pseudo R2       =     0.0089

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |         .6   .1597765    -1.92   0.055     .3560256    1.011163
------------------------------------------------------------------------------

We see that logistic regression models the odds ratio, which agrees exactly
with the "cc" command, where the odds ratio is computed directly from the
2 × 2 table. The predicted values from logistic regression are proportions,

    predict pred_smoker
    tab pred_smoker

 Pr(smoker) |      Freq.     Percent        Cum.
------------+-----------------------------------
    .077381 |        336       51.38       51.38
   .1226415 |        318       48.62      100.00
------------+-----------------------------------
      Total |        654      100.00

These predicted proportions agree with the proportions in the crosstabulation
table above. Notice also that the chi-square p value from the crosstabulation
(p = 0.053) is nearly identical to the p value from the logistic regression
(p = 0.055). Logistic regression can be thought of as extending the chi-square
analysis of the 2 × 2 table to allow for covariates.
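To make the connection concrete, the two predicted proportions can be
recovered by hand from the logistic coefficients. A minimal sketch (after
logistic, _b[] holds the log-odds coefficients, even though odds ratios are
displayed):

    * Sketch: invert the logit to recover the fitted proportions.
    quietly logistic smoker male
    display invlogit(_b[_cons] + _b[male])   // males:   .077381
    display invlogit(_b[_cons])              // females: .1226415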
Definition of Odds Ratio

We define "odds" as the ratio of the probability of some event occurring to
the probability of it not occurring (such as the odds of heads vs tails on a
coin flip): odds = p/(1 - p). We then define the odds ratio (also called the
exposure odds ratio) as

    $$OR = \frac{P(E=1 \mid D=1)\,/\,P(E=0 \mid D=1)}
                {P(E=1 \mid D=0)\,/\,P(E=0 \mid D=0)}
         = \frac{(a/N_1)\,/\,(b/N_1)}{(c/N_0)\,/\,(d/N_0)}
         = \frac{a/b}{c/d} = \frac{ad}{bc}$$

where D = disease and E = exposure. In our example, D = smoking status and
E = male.

It's hard to apply this definition to the 2 × 2 table from the
crosstabulation, because it is in ascending sort order.

                   |         male
    Smoking Status |         0          1 |     Total
-------------------+----------------------+----------
not current smoker |       279        310 |       589
-------------------+----------------------+----------
    current smoker |        39         26 |        65
-------------------+----------------------+----------
             Total |       318        336 |       654

The 2 × 2 table from the cc command, however, is in the correct format for
applying the odds ratio formula.

                 |   Exposed   Unexposed |
                 |    (E=1)      (E=0)   |
-----------------+-----------------------+
     Cases (D=1) |      a          b     |
  Controls (D=0) |      c          d     |
-----------------+-----------------------+
           Total |     336        318

    odds ratio (OR) = ad/bc

    display (279*26)/(310*39)
    .6

Interpreting this, the odds of being male among smokers are 0.6 times the odds
of being male among nonsmokers. This can also be flipped around: the odds of
smoking for males are 0.6 times the odds of smoking for females, that is,
(1 - 0.6) = 40% lower. This 40% is close to the corresponding calculation done
directly on the percentages:

    (12.26 - 7.74)/12.26 = 0.37, or 37%

Now let's add a covariate.

    logistic smoker male age

Logistic regression                               Number of obs   =        654
                                                  LR chi2(2)      =     111.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -155.83995                       Pseudo R2       =     0.2639

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4525142   .1394614    -2.57   0.010     .2473423    .8278772
         age |   1.650156   .0941799     8.78   0.000     1.475516    1.845465
------------------------------------------------------------------------------

The way you interpret each line of a logistic regression model is: the odds
fold-increase (or multiplicative increase) of the outcome variable for each
one-unit increase in the predictor variable, after controlling for all other
predictor variables in the model. If the OR = 1, there is no effect. If the
OR < 1, there is a protective effect (the odds decrease). If the OR > 1, there
is a deleterious effect (the odds increase).

We might report the male effect as:

    The odds of smoking for males are approximately one-half the odds of
    smoking for females, after controlling for age [adjusted OR, 0.45,
    95% CI (0.25-0.83), p = 0.010].

We might report the age effect as:

    Age was associated with increased odds of smoking, controlling for gender
    [adjusted OR, per 1 year increase in age, 1.65, 95% CI (1.48-1.85),
    p < 0.001].

or,

    The odds of smoking increased 1.65-fold for each one year increase in age,
    controlling for gender [adjusted OR, per 1 year increase in age, 1.65,
    95% CI (1.48-1.85), p < 0.001].
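When writing up adjusted results like these, the lincom postestimation command
can re-display any single coefficient from the last model as an odds ratio
with its confidence interval. A minimal sketch:

    * Sketch: re-display coefficients from the last model as odds
    * ratios with 95% CIs, ready for a results paragraph.
    quietly logistic smoker male age
    lincom male, or     // adjusted OR for male, 0.45 (0.25-0.83)
    lincom age, or      // adjusted OR per year of age, 1.65 (1.48-1.85)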
Exercise

Look at the logistic regression models in Table 3 of the Bergstrom et al.
(2004) article. Notice how it follows the linear regression exercise we did in
Chapter 6, looking at increasingly complete models.

In their Table 3, they report the odds ratio:    OR = (7 × 734)/(240 × 4) = 5.35
In their Table 2, they report the relative risk: RR = (7/247)/(4/738) = 5.23

When the disease outcome is rare (< 10%), the odds ratio is a good estimate of
the risk ratio.

Assessing Linearity of Effect

Linear regression assumes that as the predictor variable increases, such as
increasing age, the effect on the outcome increases by a constant amount for
each one-unit increase in the predictor variable. Logistic regression assumes
something similar: it assumes that the log odds increases linearly, by a
constant amount, for each one-unit increase in the predictor variable. This is
the same thing as saying it assumes the odds increase exponentially for each
one-unit increase in the predictor variable. Letting y = odds = p/(1 - p),

    log(y) = a + bX

so, exponentiating both sides,

    y = exp(a + bX) = exp(a) × [exp(b)]^X = exp(a) × OR^X

since exp(b) = OR in logistic regression. Apart from the constant factor
exp(a), each one-unit increase in X multiplies the odds by OR.

We can easily verify that exp(b) = OR in logistic regression by requesting
that the regression coefficient, b, be displayed instead of the OR.

    logistic smoker male age, coef

Logistic regression                               Number of obs   =        654
                                                  LR chi2(2)      =     111.77
                                                  Prob > chi2     =     0.0000
Log likelihood = -155.83995                       Pseudo R2       =     0.2639

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.7929362   .3081923    -2.57   0.010    -1.396982   -.1888904
         age |   .5008697   .0570733     8.78   0.000      .389008    .6127314
       _cons |  -7.586072   .7205451   -10.53   0.000    -8.998315    -6.17383
------------------------------------------------------------------------------

and then,

    display exp(-.7929362)
    .45251417

which matches the odds ratio for male, 0.4525142, from the logistic smoker
male age output above.
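The "odds increase exponentially" statement can be checked the same way with
display: a k-year increase in age multiplies the odds by exp(b × k) = OR^k.
A minimal sketch:

    * Sketch: a k-unit increase multiplies the odds by OR^k.
    display exp(.5008697)       // OR per 1 year of age = 1.650156
    display exp(3*.5008697)     // OR per 3 years       = 4.49
    display 1.650156^3          // the same number, computed as OR^3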
To assess the linearity assumption in regression models, it is useful to group
the predictor into quartiles, quintiles, or some other set of percentiles. It
is always a good idea to examine the linearity assumption, no matter what type
of model you are fitting. This is easy to do in Stata,

    xtile age5 = age, nq(5)

which creates a variable of age categorized into 5 quantiles, or quintiles.
The nq() option is used to specify the number of quantiles.

    tab age5

5 quantiles |
     of age |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        215       32.87       32.87
          2 |         94       14.37       47.25
          3 |        171       26.15       73.39
          4 |         57        8.72       82.11
          5 |        117       17.89      100.00
------------+-----------------------------------
      Total |        654      100.00

We see the categories were meant to hold roughly 20% each, but Stata could not
come any closer because of the many tied ages.

To discover what the categories include, we use

    bysort age5: sum age

-> age5 = 1
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       215         6.8    1.257501          3          8

-> age5 = 2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        94           9           0          9          9

-> age5 = 3
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       171    10.52632    .5007734         10         11

-> age5 = 4
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |        57          12           0         12         12

-> age5 = 5
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       117    14.55556    1.663215         13         19

If we were now to use

    logistic smoker male age5

the age5 variable would be treated as if it were continuous, which is no
better than using age itself. Categorical variables, nominal or ordinal, have
to be modeled using indicator, or dummy, variables, which are a series of 0-1
coded variables. This can be done very quickly in Stata with the xi (generate
indicator variables) facility: precede the regression command name with "xi:"
and precede each categorical variable with "i.".

    xi: logistic smoker male i.age5

i.age5            _Iage5_1-5          (naturally coded; _Iage5_1 omitted)

Logistic regression                               Number of obs   =        654
                                                  LR chi2(5)      =     125.00
                                                  Prob > chi2     =     0.0000
Log likelihood = -149.22261                       Pseudo R2       =     0.2952

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899082   .1460738    -2.39   0.017     .2730962     .878848
    _Iage5_2 |    1022168          .        .       .            .           .
    _Iage5_3 |    8588918    8969815    15.29   0.000      1109145    6.65e+07
    _Iage5_4 |   1.31e+07   1.42e+07    15.11   0.000      1563884    1.10e+08
    _Iage5_5 |   5.77e+07   5.92e+07    17.43   0.000      7735301    4.30e+08
------------------------------------------------------------------------------
note: 215 failures and 0 successes completely determined.

First, note that the first category, _Iage5_1 (where age5 = 1), is omitted
from the model. One category must be left out, becoming part of the intercept,
to act as the referent group; all included categories are interpreted relative
to the referent. Then, notice that the rest of the model looks like some kind
of disaster. The reason becomes clear by looking at

    tab age5 smoker

5 quantiles |    Smoking Status
     of age | not curre  current s |     Total
------------+----------------------+----------
          1 |       215          0 |       215
          2 |        93          1 |        94
          3 |       157         14 |       171
          4 |        50          7 |        57
          5 |        74         43 |       117
------------+----------------------+----------
      Total |       589         65 |       654

We discover there are no smokers at all in the lowest category, which
represents ages 3 to 8. Relative to this referent, every other age category
appears infinitely deleterious, which is surely not the case in the
population. We need to drop the first category from the analysis and let the
second category be the referent group. By default, the xi facility uses the
first category as the referent.
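One way to change xi's referent is its documented "omit" characteristic. A
sketch (here category 1 would still be dropped for perfectly predicting
failure, reproducing the hand-built model below):

    * Sketch: tell xi to omit category 2 instead of category 1.
    char age5[omit] 2
    xi: logistic smoker male i.age5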
If we create the indicator variables ourselves, we can choose whichever
category we want as the referent, simply by leaving that indicator variable
out. Using the tabulate command with the generate option, we specify the stub
name of the indicator variables we want to create.

    tabulate age5, gen(agecat)
    describe

-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              long   %12.0g                 ID
age             byte   %8.0g                  Age (years)
fev             float  %9.0g                  Forced Expiratory Volume (liters)
height          float  %9.0g                  Height (inches)
male            byte   %8.0g
smoker          byte   %18.0g      smokerlab  Smoking Status
age5            byte   %8.0g                  5 quantiles of age
agecat1         byte   %8.0g                  age5== 1.0000
agecat2         byte   %8.0g                  age5== 2.0000
agecat3         byte   %8.0g                  age5== 3.0000
agecat4         byte   %8.0g                  age5== 4.0000
agecat5         byte   %8.0g                  age5== 5.0000
-------------------------------------------------------------------------------

Notice it created agecat1 through agecat5, where the suffix denotes which age
category the variable is an indicator for. Now, leaving out the second
category indicator, making it the referent,

    logistic smoker male agecat1 agecat3 agecat4 agecat5

note: agecat1 != 0 predicts failure perfectly
      agecat1 dropped and 215 obs not used

Logistic regression                               Number of obs   =        439
                                                  LR chi2(4)      =      69.73
                                                  Prob > chi2     =     0.0000
Log likelihood = -149.22261                       Pseudo R2       =     0.1894

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899082   .1460738    -2.39   0.017     .2730962     .878848
     agecat3 |   8.402648   8.775284     2.04   0.042     1.085091    65.06784
     agecat4 |     12.828   13.91722     2.35   0.019     1.529968    107.5563
     agecat5 |   56.45461   57.88341     3.93   0.000     7.567545    421.1569
------------------------------------------------------------------------------

This model looks a lot better. Notice Stata warns that it dropped the agecat1
subjects from the analysis. When a variable has no variability in the outcome
to explain (all category 1 subjects were nonsmokers), the variable has to be
dropped or Stata cannot converge on a solution to the regression problem.

We might try some more meaningful age groups, such as school attended. The
social pressure to smoke is probably different depending on where you are in
the school system. The following age categories roughly approximate the school
system:

    3-8     before grade 4
    9-11    elementary school (grades 4 through 6)
    12-14   junior high (grades 7 through 9)
    15-19   high school (grades 10 through 12)

    gen age3to8   = cond(age>=3  & age<=8, 1, 0)
    gen age9to11  = cond(age>=9  & age<=11, 1, 0)
    gen age12to14 = cond(age>=12 & age<=14, 1, 0)
    gen age15to19 = cond(age>=15 & age<=19, 1, 0)
    replace age3to8   = . if age==.
    replace age9to11  = . if age==.
    replace age12to14 = . if age==.
    replace age15to19 = . if age==.
    tab age if age3to8==1 , missing
    tab age if age9to11==1 , missing
    tab age if age12to14==1 , missing
    tab age if age15to19==1 , missing

The frequency tables shown below confirm the coding.
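Before eyeballing the frequency tables, the coding can also be verified
programmatically. A minimal sketch: the four indicators should be mutually
exclusive and exhaustive, that is, sum to exactly 1 for every subject with a
nonmissing age (assert stops with an error if any observation fails):

    * Sketch: every nonmissing age must fall in exactly one band.
    assert age3to8 + age9to11 + age12to14 + age15to19 == 1 if !missing(age)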
. tab age if age3to8==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          2        0.93        0.93
          4 |          9        4.19        5.12
          5 |         28       13.02       18.14
          6 |         37       17.21       35.35
          7 |         54       25.12       60.47
          8 |         85       39.53      100.00
------------+-----------------------------------
      Total |        215      100.00

. tab age if age9to11==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
          9 |         94       35.47       35.47
         10 |         81       30.57       66.04
         11 |         90       33.96      100.00
------------+-----------------------------------
      Total |        265      100.00

. tab age if age12to14==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
         12 |         57       45.60       45.60
         13 |         43       34.40       80.00
         14 |         25       20.00      100.00
------------+-----------------------------------
      Total |        125      100.00

. tab age if age15to19==1 , missing

Age (years) |      Freq.     Percent        Cum.
------------+-----------------------------------
         15 |         19       38.78       38.78
         16 |         13       26.53       65.31
         17 |          8       16.33       81.63
         18 |          6       12.24       93.88
         19 |          3        6.12      100.00
------------+-----------------------------------
      Total |         49      100.00

From the frequency tables, we see we created the new variables correctly.

For illustration of what will happen, let's put all of the indicator variables
in the model,

    logistic smoker male age3to8 age9to11 age12to14 age15to19

note: age3to8 != 0 predicts failure perfectly
      age3to8 dropped and 215 obs not used
note: age15to19 dropped due to collinearity

Logistic regression                               Number of obs   =        439
                                                  LR chi2(3)      =      60.58
                                                  Prob > chi2     =     0.0000
Log likelihood = -153.7973                        Pseudo R2       =     0.1645

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899327   .1461839    -2.39   0.017     .2729975    .8792535
    age9to11 |   .0636365   .0252995    -6.93   0.000     .0291945    .1387114
   age12to14 |   .2913906   .1068836    -3.36   0.001     .1419875    .5979993
------------------------------------------------------------------------------

Notice the statement "note: age15to19 dropped due to collinearity". This
occurred because we did not leave a category out; all forms of regression
models have this requirement. Collinearity means that a predictor variable is
highly correlated with (here, exactly equal to) a linear combination of other
predictor variables. Regression models use a variable of all 1's for the
constant, or intercept, term. After the age3to8 subjects were excluded, we had
a dataset that looked like

    Constant   age9to11   age12to14   age15to19
       1          1           0           0
       1          1           0           0
       1          0           1           0
       1          0           1           0
       1          0           0           1
       1          0           0           1

so that

    Constant = age9to11 + age12to14 + age15to19

and the linear combination of the dummy variables predicts the constant
perfectly. This makes the model fitting routines go nuts (the least squares
algorithm cannot compute the inverse of the data matrix in linear regression,
and the maximum likelihood algorithm used in logistic regression cannot
converge on a solution), so one dummy variable must be kicked out of the
regression equation.
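The collinearity can be seen directly in the data. A minimal sketch (rowsum is
a scratch variable introduced here for illustration):

    * Sketch: among the retained observations (age3to8 == 0), the
    * three indicators always sum to 1 -- the same as the constant
    * column -- which is exactly the collinearity Stata detected.
    gen byte rowsum = age9to11 + age12to14 + age15to19
    tab rowsum if age3to8 == 0
    drop rowsum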
Let's throw out, or use as the referent, the age9to11 category instead, by
simply leaving it out of the list of predictor variables.

    logistic smoker male age3to8 age12to14 age15to19

note: age3to8 != 0 predicts failure perfectly
      age3to8 dropped and 215 obs not used

Logistic regression                               Number of obs   =        439
                                                  LR chi2(3)      =      60.58
                                                  Prob > chi2     =     0.0000
Log likelihood = -153.7973                        Pseudo R2       =     0.1645

------------------------------------------------------------------------------
      smoker | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .4899327   .1461839    -2.39   0.017     .2729975    .8792535
   age12to14 |   4.578983   1.582391     4.40   0.000      2.32602    9.014148
   age15to19 |   15.71425   6.247394     6.93   0.000     7.209212    34.25306
------------------------------------------------------------------------------

This looks like a very believable model.

References

Bergstrom L, Yocum DE, Ampel NM, et al. (2004). Increased risk of
coccidioidomycosis in patients treated with tumor necrosis factor α
antagonists. Arthritis & Rheumatism 50(6):1959-1966.

Finney DJ. (1947). The estimation from original records of the relationship
between dose and quantal response. Biometrika 34:320-334.