Chapter 3-12. Standardization

<< This chapter uses too many examples out of Rothman (2002), done that way to quickly prepare a lecture while teaching out of the Rothman text. It needs to be updated with more of my own examples. >>

In Chapter 3-11, we used pooling to combine stratum-specific estimates of an effect measure (such as the risk ratio) into a single summary effect measure. The summary effect measure was essentially a weighted average of the stratum-specific estimates.

Another approach is standardization, which combines stratum-specific risks (cases/N) or rates (cases/PT, where PT is person-time) into a single summary value by taking a weighted average of them. The weights come from a standard population, in contrast to pooling, which weights each stratum by the amount of information it contains.

Suppose we choose the U.S. population in the year 2000 as our standard. We would then weight our age-specific rates with weights that reflect the age distribution of the U.S. population in 2000. Our summary rate would be the rate we would expect if our population had the same age distribution as the U.S. population in 2000.

_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.

Chapter 3-12 (revised 16 May 2010) p. 1

Example: Sweden and Panama (Rothman, 2002, pp. 1-2)

It would seem that residents of Sweden, where the standard of living is generally high, should have lower death rates than residents of Panama, where poverty and more limited health care take their toll. However, a greater proportion of Swedish residents than Panamanian residents die each year. The reason for this unexpected result is confounding by age: the two countries have different age distributions, with Panama having the younger population.
[Figure: Population pyramids for Panama (2000) and Sweden (2000), showing males and females in the age groups 0-29, 30-59, and 60+; horizontal axis: population in thousands.]

In both countries, older people die at a greater rate than younger people. However, because Sweden's population is on average older than Panama's, a greater proportion of all Swedes die in a given year, despite the lower death rates within specific age categories.

If we standardized the Panama rate and the Sweden rate to a single population (such as the U.S. population), the two rates would be directly comparable (not confounded by age), since both would be based on the same age distribution.

Two advantages of this approach over pooling are:

1) It might be interesting to know the standardized rate itself for each country, rather than just the rate ratio. For example, it might be interesting to see a graph of 10 different countries, all standardized to the same age distribution.

2) Pooling requires approximately equal stratum-specific rate ratios (a homogeneous effect), whereas standardization does not. For example, the relative effects might be very different in the newborn, young adult, and old adult age groups for these two countries, so that a pooled estimate may not be appropriate.

We will return to this example below and standardize the rates using Stata.

Simple Examples of Standardization

Using Rothman's example (2002, p. 158), suppose we have rates of:

    Males:   10/1000 person-years
    Females:  5/1000 person-years

We can standardize these sex-specific rates to any standard we wish. For example, we might simply choose to weight males and females equally. We would then obtain a weighted average of the two rates that equals their simple average:

    standardized rate = weight_males × Rate_males + weight_females × Rate_females
                      = (1/2) × 10/1000 py + (1/2) × 5/1000 py
                      = ((10 + 5)/2)/1000 py
                      = 7.5/1000 person-years

Suppose the rates reflected the disease experience of nurses, 95% of whom are female. In that case, we might wish to use as our standard a weight of 5% for males and 95% for females:

    standardized rate = 0.05 × 10/1000 py + 0.95 × 5/1000 py
                      = (0.5 + 4.75)/1000 py
                      = 5.25/1000 person-years

To compare rates for exposed and unexposed people, we standardize both to the same standard and then compare them.

An advantage of standardization is that it uses a defined set of weights that are independent of the data. Other investigators can therefore standardize using the same weights and directly compare their stratified results to yours.

Example: Mortality rates for current and past clozapine users, by age category

                                     Age 10-54 years       Age 55-94 years
                                     Current      Past     Current      Past
    Deaths                               196       111         167       157
    Person-years                      62,119    15,763       6,085     2,780
    Rate (per 10^5 person-years)       315.5     704.2        2744      5647
    Rate difference
      (per 10^5 person-years)             -388.7                -2903
    Rate ratio                             0.45                  0.49

Using the Mantel-Haenszel pooling approach, the mortality rate difference is -720/100,000 person-years and the mortality rate ratio is 0.47.

Let's now standardize the rates for age over the two age strata. We will standardize to the age distribution of current clozapine use in the study, since that is the age distribution of those who use the drug.
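Before standardizing, the stratum-specific entries in the clozapine table can be recomputed from the deaths and person-years. The following is a minimal sketch in Python, used here only as a calculator (the course itself uses Stata); the dictionary layout and the function name `rate_per_100k` are illustrative and do not come from the text. The printed values should agree with the table up to rounding.

```python
# Recompute the stratum-specific measures in the clozapine table.
# Each entry is (deaths, person-years) as reported in the table.
# All names here are illustrative, not from the course materials.
strata = {
    "10-54": {"current": (196, 62119), "past": (111, 15763)},
    "55-94": {"current": (167, 6085), "past": (157, 2780)},
}


def rate_per_100k(deaths, person_years):
    """Mortality rate per 100,000 person-years."""
    return deaths / person_years * 100_000


results = {}
for age, groups in strata.items():
    cur = rate_per_100k(*groups["current"])
    past = rate_per_100k(*groups["past"])
    results[age] = {
        "current": cur,
        "past": past,
        "difference": cur - past,  # rate difference, current minus past
        "ratio": cur / past,       # rate ratio, current over past
    }
    print(f"Age {age}: current {cur:.1f}, past {past:.1f}, "
          f"difference {cur - past:.1f}, ratio {cur / past:.2f}")
```

Both rate ratios fall below 1 and are close to each other, which is why pooled and standardized summaries give similar answers for these data.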
Current clozapine use

    Age 10-54 years:   62,119 py ( 91.1%)
    Age 55-94 years:    6,085 py (  8.9%)
    Total:             68,204 py (100.0%)

Standardizing,

current clozapine use:

    standardized mortality rate = 0.911 × 315.5/100,000 py + 0.089 × 2744/100,000 py
                                = 532.2/100,000 py

past clozapine use:

    standardized mortality rate = 0.911 × 704.2/100,000 py + 0.089 × 5647/100,000 py
                                = 1144/100,000 py

Combining these into standardized effect measures:

    standardized rate difference = (532.2 – 1144)/100,000 py = -612/100,000 py

which is slightly smaller in magnitude than the pooled estimate, and

    standardized rate ratio = 532.2/1144 = 0.47

which is identical to the pooled estimate to two decimal places. The stratum-specific rate ratios were very similar, so any weighting, whether pooled or standardized, would give a result close to this value.

Standardized Mortality Ratio (SMR)

When the standardized rate ratio is calculated using the exposed group as the standard, the result is usually referred to as a standardized mortality ratio, or standardized morbidity ratio (Rothman, 2002, p. 161). Thus, we computed an SMR in the preceding example.

Direct Standardization

Rothman and Greenland (1998, pp. 45-46) give the following formula for direct standardization.

Let T_1, T_2, …, T_k be the person-years in k strata (e.g., age-sex categories) of some selected standard population. The T's are called the standard distribution on which the standardized rate is based.

Let I_1, I_2, …, I_k be the stratum-specific incidence rates computed from your data.

Then the standardized rate is given by

$$ I_s = \frac{I_1 T_1 + \cdots + I_k T_k}{T_1 + \cdots + T_k} = \frac{\sum_{i=1}^{k} I_i T_i}{\sum_{i=1}^{k} T_i} $$

The numerator is the number of cases one would see in a population that had the person-time distribution T_1, T_2, …, T_k and the stratum-specific rates I_1, I_2, …, I_k. The denominator is the total person-time in such a population.
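The direct standardization formula is short enough to write out as code. The sketch below, in Python rather than Stata, applies it to the clozapine data, taking the current users' person-years as the standard distribution. The function name `direct_standardized_rate` is hypothetical. Because it works with unrounded weights and rates, its results can differ in the last digit or two from hand calculations that use rounded values.

```python
def direct_standardized_rate(rates, standard_person_time):
    """Directly standardized rate: sum(I_i * T_i) / sum(T_i).

    rates                -- stratum-specific rates I_1..I_k from the study data
    standard_person_time -- person-time T_1..T_k of the standard population
    """
    if len(rates) != len(standard_person_time):
        raise ValueError("rates and standard must cover the same strata")
    expected_cases = sum(i * t for i, t in zip(rates, standard_person_time))
    return expected_cases / sum(standard_person_time)


# Standard distribution: person-years of current clozapine users
standard = [62119, 6085]                  # ages 10-54, ages 55-94

# Stratum-specific mortality rates (deaths / person-years)
current_rates = [196 / 62119, 167 / 6085]
past_rates = [111 / 15763, 157 / 2780]

std_current = direct_standardized_rate(current_rates, standard)
std_past = direct_standardized_rate(past_rates, standard)

print(f"current users: {std_current * 1e5:.1f} per 100,000 py")
print(f"past users:    {std_past * 1e5:.1f} per 100,000 py")
print(f"standardized rate difference: "
      f"{(std_current - std_past) * 1e5:.0f} per 100,000 py")
print(f"standardized rate ratio: {std_current / std_past:.2f}")
```

A useful sanity check on any implementation: standardizing a group to its own distribution returns that group's crude rate. Here the current users' standardized rate equals 363 deaths / 68,204 py.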
Therefore, the standardized rate, I_s, is the rate one would see in a population with person-time distribution T_1, T_2, …, T_k and stratum-specific rates I_1, I_2, …, I_k.

The standardization process can be carried out with incidence proportions or prevalence proportions as well.

Let N_1, N_2, …, N_k be the number of persons in k strata.

Let R_1, R_2, …, R_k be the stratum-specific incidence proportions (or prevalence proportions).

Then the standardized risk, or standardized prevalence, is given by

$$ R_s = \frac{R_1 N_1 + \cdots + R_k N_k}{N_1 + \cdots + N_k} = \frac{\sum_{i=1}^{k} R_i N_i}{\sum_{i=1}^{k} N_i} $$

These are the formulas used by Stata's direct standardization command, dstdize.

Exercise (direct standardization)

Returning to the Sweden and Panama example, reading the data in,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on panswedmortality.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\panswedmortality.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use panswedmortality.dta, clear

Listing the data,

    Data
      Describe data
        List data
          Main tab: Variables: (leave empty for all variables): < leave empty >
                    Override minimum abbreviation of variable names: 15
          Options tab: Table options: Draw divider lines between columns
                       Separators: When these variables change: nation
          OK

    list, abbreviate(15) divider sepby(nation)

        +---------------------------------------------+
        | nation | age_category | population | deaths |
        |--------+--------------+------------+--------|
     1. | Sweden |       0 - 29 |    3145000 |  3,523 |
     2. | Sweden |      30 - 59 |    3057000 | 10,928 |
     3. | Sweden |          60+ |    1294000 | 59,104 |
        |--------+--------------+------------+--------|
     4. | Panama |       0 - 29 |    741,000 |  3,904 |
     5. | Panama |      30 - 59 |    275,000 |  1,421 |
     6. | Panama |          60+ |     59,000 |  2,456 |
        +---------------------------------------------+

We see that this file contains the variables needed to compute the age-specific incidence proportions, or mortality proportions.

We will use the following standard population:

    File
      Open
        Find the directory where you copied the course CD
        Find the subdirectory datasets & do-files
        Single click on panswedstdpop.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\panswedstdpop.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use panswedstdpop.dta, clear

Double clicking on the last list command in the Review Window, and changing it to:

    list, abbreviate(15) divider

        +---------------------------+
        | age_category | population |
        |--------------+------------|
     1. |       0 - 29 |        .35 |
     2. |      30 - 59 |        .35 |
     3. |          60+ |         .3 |
        +---------------------------+

we see that this file holds the proportion of the standard population in each age stratum (the same proportions are applied to both countries).

When you wish to use a reference population that is different from any group in your incidence or mortality data file, Stata requires:

1) the standard population to be saved in a separate Stata-formatted data file (.dta file extension),
2) this file to have strata identical to those in the risk data, and
3) the morbidity or mortality data to be the current file in Stata memory.

Bringing the mortality data back in:

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on panswedmortality.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\panswedmortality.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use panswedmortality.dta, clear

To obtain the directly standardized rates, we use

    Statistics
      Epidemiology and related
        Other
          Direct standardization
            Main tab: Characteristic variable: deaths
                      Population variable: population
                      Strata variable: age_category
                      Group variables: nation
                      Use standard population from Stata dataset: panswedstdpop
            OK

    dstdize deaths population age_category, by(nation) using(panswedstdpop)

    -----------------------------------------------------------
    -> nation= Panama
                      -----Unadjusted-----     Std.
                                    Pop.  Stratum    Pop.
     Stratum       Pop.    Cases   Dist.  Rate[s]  Dst[P]     s*P
    -----------------------------------------------------------
      0 - 29     741000     3904   0.689   0.0053   0.350  0.0018
     30 - 59     275000     1421   0.256   0.0052   0.350  0.0018
         60+      59000     2456   0.055   0.0416   0.300  0.0125
    -----------------------------------------------------------
     Totals:    1075000     7781      Adjusted Cases:     17351.2
                                          Crude Rate:      0.0072
                                       Adjusted Rate:      0.0161
                               95% Conf. Interval: [0.0156, 0.0166]

    -----------------------------------------------------------
    -> nation= Sweden
                      -----Unadjusted-----     Std.
                                    Pop.  Stratum    Pop.
     Stratum       Pop.    Cases   Dist.  Rate[s]  Dst[P]     s*P
    -----------------------------------------------------------
      0 - 29    3145000     3523   0.420   0.0011   0.350  0.0004
     30 - 59    3057000    10928   0.408   0.0036   0.350  0.0013
         60+    1294000    59104   0.173   0.0457   0.300  0.0137
    -----------------------------------------------------------
     Totals:    7496000    73555      Adjusted Cases:    115032.5
                                          Crude Rate:      0.0098
                                       Adjusted Rate:      0.0153
                               95% Conf. Interval: [0.0152, 0.0155]

    Summary of Study Populations:
      nation          N      Crude     Adj_Rate       Confidence Interval
     --------------------------------------------------------------------------
      Panama    1075000   0.007238     0.016141    [  0.015645,    0.016637]
      Sweden    7496000   0.009813     0.015346    [  0.015235,    0.015457]

Notice that the standardized risks are given for each country, but Stata reports no standardized risk ratio or standardized risk difference. These have to be computed manually, as

standardized risk ratio:

    display 0.016141/0.015346
    1.051805

standardized risk difference:

    display 0.016141 - 0.015346
    .000795

These two formulas are given in Rothman and Greenland (1998, p. 63). Given two standardized rates, I_s and I_s^*, both computed using the same standard distribution, the standardized rate ratio and standardized rate difference are given by

$$ IR_s = \frac{I_s}{I_s^*} \qquad \text{and} \qquad IRD_s = I_s - I_s^* $$

The same formulas apply for computing the standardized risk ratio and the standardized prevalence ratio, and for computing the standardized risk difference and the standardized prevalence difference.

The confidence intervals for these standardized effect measures cannot be obtained simply by forming ratios and differences of the confidence limits of the individual standardized measures. Rothman and Greenland (1998, p. 263) present formulas for the confidence intervals. These are not available in Stata for direct standardization with the dstdize command.

Example

Look at the article by Van Den Eden et al (2003). This paper reports results obtained using direct standardization.

1) Look at the Statistical Methods section. You should now be able to understand it. Notice that they cite the US Census website. The website provides US population data so that researchers around the world can standardize to a common population distribution.

2) Notice how standardization allowed them to compare rates across race/ethnic groups in Table 3, and across studies/countries in Table 4.
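Returning to the Sweden and Panama exercise, the crude and adjusted rates reported by dstdize, and the standardized risk ratio and difference computed by hand from them, can be reproduced from the two data listings. This is a sketch in Python rather than Stata, with illustrative function and variable names, using the standard weights 0.35, 0.35, and 0.30 from panswedstdpop.dta. It does not attempt the confidence intervals.

```python
# Age strata, in order: 0-29, 30-59, 60+
standard_weights = [0.35, 0.35, 0.30]   # proportions listed in panswedstdpop.dta

# Counts listed in panswedmortality.dta
mortality = {
    "Panama": {"population": [741000, 275000, 59000],
               "deaths": [3904, 1421, 2456]},
    "Sweden": {"population": [3145000, 3057000, 1294000],
               "deaths": [3523, 10928, 59104]},
}


def crude_risk(deaths, population):
    """Total deaths divided by total population."""
    return sum(deaths) / sum(population)


def standardized_risk(deaths, population, weights):
    """Weighted average of stratum-specific risks, divided by the total weight."""
    risks = [d / n for d, n in zip(deaths, population)]
    return sum(r * w for r, w in zip(risks, weights)) / sum(weights)


crude = {}
adjusted = {}
for nation, data in mortality.items():
    crude[nation] = crude_risk(data["deaths"], data["population"])
    adjusted[nation] = standardized_risk(
        data["deaths"], data["population"], standard_weights)
    print(f"{nation}: crude {crude[nation]:.6f}, "
          f"adjusted {adjusted[nation]:.6f}")

# Effect measures that dstdize does not report
risk_ratio = adjusted["Panama"] / adjusted["Sweden"]
risk_difference = adjusted["Panama"] - adjusted["Sweden"]
print(f"standardized risk ratio:      {risk_ratio:.4f}")
print(f"standardized risk difference: {risk_difference:.6f}")
```

The crude comparison favors Panama, while the standardized comparison (same age weights for both countries) reverses the ordering, which is the confounding by age described at the start of the chapter.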
Indirect Standardization

The Stata command for indirect standardization is istdize.

The following description and formula for indirect standardization were taken from the Stata reference manual under the dstdize command (StataCorp, 2003, Reference A-F, p. 295):

“Standardization of rates can be performed via the indirect method whenever the stratum-specific rates are either unknown or unreliable. If the stratum-specific rates are known, the direct standardization method is preferred.

In order to apply the indirect method, the following must be available:

1. The observed number of cases in each population to be standardized, O. For example, if death rates in two states are being standardized using the US data rate for the same time period, then you must know the total number of deaths in each state.

2. The distribution across the various strata for the population being studied, n1,…,nk. If you are standardizing the death rate in the two states adjusting for age, then you must know the number of individuals in each of the k age groups.

3. The stratum-specific rates for the standard population, p1,…,pk. For the example, you must have the US death rate for each stratum (age group).

4. The crude rate of the standard population, C. For the example, you must have the mortality rate for all the US for the year.”

The calculation is then (StataCorp, 2003, Reference A-F, p. 299):

“For indirect standardization, define O as the observed number of cases in each population to be standardized; n1,…,nk, the distribution across the various strata for the population being studied; R1,…,Rk, the stratum-specific rates for the standard population; and C, the crude rate of the standard population. Then the expected number of cases (deaths), E, in each population is obtained by applying the standard population stratum-specific rates, R1,…,Rk, to the study populations:

$$ E = \sum_{i=1}^{k} n_i R_i $$

The indirectly adjusted rate is then

$$ R_{indirect} = C \, \frac{O}{E} $$

and O/E is the study population’s standardized mortality ratio (SMR) if death is the event of interest or the standardized incidence ratio (SIR) for studies of disease (or other) incidences.”

Exercise (indirect standardization)

We will use data borrowed from Kahn and Sempos (1989, pp. 95-105) that are available on the Stata website, and in the datasets & do-files subdirectory.

The problem is (StataCorp, 2003, Ref A-F, p. 295),

“We want to compare 1970 mortality rates in California and Maine, adjusting for age. Although we have age-specific population counts for the two states, we lack age-specific death rates. In this situation, direct standardization is not feasible. We can use the US population census data for the same year to produce indirectly standardized rates for these two states.”

The 1970 US population age stratum-specific rates are found in KahnStdPopRates.dta.

        +---------------------------------------+
        |   age   population    deaths     rate |
        |---------------------------------------|
     1. |   <15   57,900,000   103,062   .00178 |
     2. | 15-24   35,441,000    45,261   .00128 |
     3. | 25-34   24,907,000    39,193   .00157 |
     4. | 35-44   23,088,000    72,617   .00315 |
        |---------------------------------------|
     5. | 45-54   23,220,000   169,517    .0073 |
     6. | 55-64   18,590,000   308,373   .01659 |
     7. | 65-74   12,436,000   445,531   .03583 |
     8. |   75+    7,630,000   736,758   .09656 |
        +---------------------------------------+

The observed number of cases and the age stratum-specific population sizes for the study populations are found in the file KahnStudyPopSizes.dta.

        +-------------------------------------------+
        |      state     age   population     death |
        |-------------------------------------------|
     1. | California     <15    5,524,000   166,285 |
     2. | California   15-24    3,558,000   166,285 |
     3. | California   25-34    2,677,000   166,285 |
     4. | California   35-44    2,359,000   166,285 |
        |-------------------------------------------|
     5. | California   45-54    2,330,000   166,285 |
     6. | California   55-64    1,704,000   166,285 |
     7. | California   65-74    1,105,000   166,285 |
     8. | California     75+      696,000   166,285 |
        |-------------------------------------------|
     9. |      Maine     <15      286,000    11,051 |
    10. |      Maine   15-24      168,000    11,051 |
    11. |      Maine   25-34      110,000    11,051 |
    12. |      Maine   35-44      109,000    11,051 |
        |-------------------------------------------|
    13. |      Maine   45-54      110,000    11,051 |
    14. |      Maine   55-64       94,000    11,051 |
    15. |      Maine   65-74       69,000    11,051 |
    16. |      Maine     75+       46,000    11,051 |
        +-------------------------------------------+

Notice that the death variable here is simply the total number of deaths in the state, repeated on every row, not the age-specific deaths.

Bringing the study data into Stata,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on KahnStudyPopSizes.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\KahnStudyPopSizes.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use KahnStudyPopSizes.dta, clear

Obtaining the indirectly standardized rates,

    Statistics
      Epidemiology and related
        Other
          Indirect standardization
            Main tab: # cases variable: death
                      Population variable: population
                      Strata variables: age
                      Use standard population from Stata dataset: KahnStdPopRates
                      Use population variables:
                        Case variable: death
                        Population variable: population
            Options tab: Group variables: state
                         Include table summary of standard population in output
            OK

    istdize death population age using KahnStdPopRates.dta, popvars(death population) by(state) print

    ------Standard Population------
     Stratum            Rate
    -------------------------------
         <15         0.00178
       15-24         0.00128
       25-34         0.00157
       35-44         0.00315
       45-54         0.00730
       55-64         0.01659
       65-74         0.03583
         75+         0.09656
    -------------------------------
    Standard population's crude rate:  0.00945

    ----------------------------------------------------------
    -> state= California
                     Indirect Standardization
                  Standard
                 Population     Observed        Cases
     Stratum        Rate       Population     Expected
    ----------------------------------------------------------
         <15       0.0018        5524000       9832.72
       15-24       0.0013        3558000       4543.85
       25-34       0.0016        2677000       4212.46
       35-44       0.0031        2359000       7419.59
       45-54       0.0073        2330000      17010.10
       55-64       0.0166        1704000      28266.14
       65-74       0.0358        1105000      39587.63
         75+       0.0966         696000      67206.23
    ----------------------------------------------------------
     Totals:                    19953000     178078.73

                                 Observed Cases:    166285
                                  SMR (Obs/Exp):      0.93
              SMR exact 95% Conf. Interval: [0.9293, 0.9383]
                                     Crude Rate:    0.0083
                                  Adjusted Rate:    0.0088
                        95% Conf. Interval: [0.0088, 0.0089]

    ----------------------------------------------------------
    -> state= Maine
                     Indirect Standardization
                  Standard
                 Population     Observed        Cases
     Stratum        Rate       Population     Expected
    ----------------------------------------------------------
         <15       0.0018         286000        509.08
       15-24       0.0013         168000        214.55
       25-34       0.0016         110000        173.09
       35-44       0.0031         109000        342.83
       45-54       0.0073         110000        803.05
       55-64       0.0166          94000       1559.28
       65-74       0.0358          69000       2471.99
         75+       0.0966          46000       4441.79
    ----------------------------------------------------------
     Totals:                      992000      10515.67

                                 Observed Cases:     11051
                                  SMR (Obs/Exp):      1.05
              SMR exact 95% Conf. Interval: [1.0314, 1.0707]
                                     Crude Rate:    0.0111
                                  Adjusted Rate:    0.0099
                        95% Conf. Interval: [0.0097, 0.0101]

    Summary of Study Populations (Rates):
                      Cases
      state        Observed      Crude     Adj_Rate      Confidence Interval
     --------------------------------------------------------------------------
      California     166285   0.008334     0.008824    [0.008782, 0.008866]
      Maine           11051   0.011140     0.009931    [0.009747, 0.010118]

    Summary of Study Populations (SMR):
                      Cases        Cases                      Exact
      state        Observed     Expected     SMR       Confidence Interval
     --------------------------------------------------------------------------
      California     166285    178078.73   0.934     [0.929290, 0.938271]
      Maine           11051     10515.67   1.051     [1.031405, 1.070688]

Direct Standardized Rates Using Individual Level Data

In this example, we will use a dataset from the Stata website. The file contains individual-level data on persons in four cities over a number of years. For the standard population, we will simply use the total sample distribution.

Three of the cities (1, 2, and 3) introduced a public health campaign in 1991 for reducing high blood pressure. City 5 was the control city, which received no public health campaign.

The task is to obtain standardized high blood pressure rates for each city for each of the years 1990 and 1992, using, as the standard, the age, sex, and race distribution of the four cities and two years combined. Using these standardized rates, the goal is to judge whether or not the campaign was successful.

Bringing the data into Stata

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on highBP.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\highBP.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use highBP.dta, clear

Look at the data in the browser.
You will notice n=1130 subjects, with age categorized into 5-year age categories and a variable indicating whether or not the subject had high blood pressure.

The dstdize command is designed to work with aggregate data. It will work, however, with individual-level data if we create a variable recording the population size represented by each observation. For individual-level data, that size is one.

Enter the following command to set up this variable:

    gen pop = 1

In the examples above, we always used a separate standard population file. This time, the problem is to use the study population's four cities and two years as our standard population, stratified by age, sex, and race.

Requesting the standardized rates,

    Statistics
      Epidemiology and related
        Other
          Direct standardization
            Main tab: Characteristic variable: hbp
                      Population variable: pop
                      Strata variable: age_group race sex
                      Group variables: city year
                      Use standard population from data in memory
            if/in tab: If (expression): year==1990 | year==1992
            Options tab: Include table summary of standard population in output
            OK

    dstdize hbp pop age_group race year if year==1990 | year==1992, by(city year) print

Note that the command as shown lists year, rather than sex, as the third stratum variable, so the standard population table that follows is broken down by age group, race, and year.

    ---------------Standard Population---------------
                  Stratum               Pop.    Dist.
    -------------------------------------------------
     15 - 19    Black       1990          44    0.097
     15 - 19    Black       1992          35    0.077
     15 - 19    Hispanic    1990           4    0.009
     15 - 19    Hispanic    1992          11    0.024
     15 - 19    White       1990           5    0.011
     15 - 19    White       1992           7    0.015
     20 - 24    Black       1990          49    0.108
     20 - 24    Black       1992          61    0.134
     20 - 24    Hispanic    1990          11    0.024
     20 - 24    Hispanic    1992          16    0.035
     20 - 24    White       1990          12    0.026
     20 - 24    White       1992          13    0.029
     25 - 29    Black       1990          39    0.086
     25 - 29    Black       1992          22    0.048
     25 - 29    Hispanic    1990           9    0.020
     25 - 29    Hispanic    1992          11    0.024
     25 - 29    White       1990          13    0.029
     25 - 29    White       1992          12    0.026
     30 - 34    Black       1990          24    0.053
     30 - 34    Black       1992          24    0.053
     30 - 34    Hispanic    1990           2    0.004
     30 - 34    Hispanic    1992           3    0.007
     30 - 34    White       1990          14    0.031
     30 - 34    White       1992          14    0.031
    -------------------------------------------------
     Total:                              455

    (6 observations excluded due to missing values)

    Summary of Study Populations:
      city   year      N      Crude     Adj_Rate       Confidence Interval
     -----------------------------------------------------------------------
         1   1990     47   0.063830     0.024689    [  0.000000,    0.050569]
         1   1992     56   0.017857     0.003719    [  0.000000,    0.010723]
         2   1990     64   0.046875     0.024762    [  0.000000,    0.050174]
         2   1992     67   0.029851     0.007033    [  0.000000,    0.015751]
         3   1990     69   0.159420     0.062777    [  0.028154,    0.097400]
         3   1992     37   0.189189     0.028587    [  0.012874,    0.044300]
         5   1990     46   0.043478     0.025000    [  0.000000,    0.052890]
         5   1992     69   0.014493     0.015385    [  0.000000,    0.036706]

Does it appear that the campaign worked? It is hard to say.

This illustrates the limitation of standardization. It is a good way to get descriptive estimates that are adjusted for potential confounders, which is sometimes all that is needed. To test hypotheses, however, researchers must turn to stratification or regression models.

Fitting a regression model to the high blood pressure data,

    keep if year==1990 | year==1992

    * post-intervention indicator
    gen post = 1 if year == 1992
    replace post = 0 if year == 1990
    replace post = . if year == .
    tab year post
    drop year

    * male indicator
    gen male = 1 if sex == "Male"
    replace male = 0 if sex == "Female"
    replace male = . if sex == ""
    tab sex male
    drop sex

    * race/ethnicity indicators
    gen black = 0
    replace black = 1 if race == "Black"
    replace black = . if race == ""
    *
    gen white = 0
    replace white = 1 if race == "White"
    replace white = . if race == ""
    *
    gen hispanic = 0
    replace hispanic = 1 if race == "Hispanic"
    replace hispanic = . if race == ""
    *
    tab race black
    tab race white
    tab race hispanic
    drop race

    * continuous age variable (equal 5-year intervals)
    gen age = .
    replace age = 1 if age_group == "15 - 19"
    replace age = 2 if age_group == "20 - 24"
    replace age = 3 if age_group == "25 - 29"
    replace age = 4 if age_group == "30 - 34"
    tab age_group age

    * intervention indicator
    * (note: as coded, intervention = 1 marks city 5, described above as the
    *  control city, and 0 marks cities 1-3; reversing the coding would invert
    *  the intervention and interaction odds ratios but leave the interaction
    *  p-value unchanged)
    gen intervention = .
    replace intervention = 1 if city==5
    replace intervention = 0 if city<5
    tab city intervention

    * post x intervention interaction
    gen postxint = post*intervention
    tab postxint

    tab hbp    // consider overfitting
    logistic hbp post intervention postxint male black hispanic age

From the tabulation of hbp, only 30 subjects had high blood pressure, so we should limit the number of predictors to 3 or 6, using the 30/10 or 30/5 rules discussed in the K30 Intro Biostat course.

            hbp |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        431       93.49       93.49
              1 |         30        6.51      100.00
    ------------+-----------------------------------
          Total |        461      100.00

We will ignore this for the sake of completing the exercise.

The results of the logistic regression model were,

    Logistic regression                               Number of obs   =        455
                                                      LR chi2(7)      =      34.75
                                                      Prob > chi2     =     0.0000
    Log likelihood = -93.186581                       Pseudo R2       =     0.1572

    ------------------------------------------------------------------------------
             hbp | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            post |   .6630758    .290103    -0.94   0.348     .2812892    1.563052
    intervention |   .7409793   .5953338    -0.37   0.709     .1534317    3.578467
        postxint |    .409221   .5471474    -0.67   0.504     .0297757    5.624109
            male |   2.165947   1.128395     1.48   0.138     .7801839    6.013105
           black |   .2669251   .1136918    -3.10   0.002      .115834    .6150958
        hispanic |   .2379172   .1882277    -1.81   0.070     .0504661    1.121637
             age |   1.759359   .3734157     2.66   0.008     1.160622    2.666969
    ------------------------------------------------------------------------------

To test our hypothesis, we use the post × intervention interaction term (p = 0.504). We conclude that the effect of the intervention was not statistically significant.

Exercise

Look at the article by Adams, et al (2006). Notice that in the footnote of Table 2, 3rd line from the bottom, they state,

“Mortality rates are per 100,000 person-years, directly standardized to the age distribution of the cohort (according to sex).”

They are using individual-level data (one observation per subject) and creating their own standard population, which consists of age-by-sex population proportions. In their Statistical Methods they report using 5-year age categories, which give more stable estimates (since there are more observations per category than there are per year of age).

Also, notice that they report standardized rates in their Table 2, but they also use multivariable Cox regression to test for associations.

References

Kahn HA, Sempos CT. (1989). Statistical Methods in Epidemiology. New York, Oxford University Press.

Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University Press.

Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA, Lippincott-Raven Publishers.

StataCorp. (2003). Stata Statistical Software: Release 8.0. College Station, Texas.