Appendix 1. Dataset Descriptions

Births Dataset (births.dta)

This dataset is distributed with the textbook:

Hills M, De Stavola BL. A Short Introduction to Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002. http://www.timberlake.co.uk

The dataset concerns 500 mothers who had singleton births in a large London hospital.

Codebook

    Variable   Labels
    id         subject number
    bweight    birth weight (grams)
    lowbw      birth weight < 2500 g               1=yes, 0=no
    gestwks    gestational age (weeks)
    preterm    gestational age < 37 weeks          1=yes, 0=no
    matage     maternal age (years)
    hyp        maternal hypertension               1=hypertensive, 0=normal
    sex        sex of baby                         1=male, 2=female
    sexalph    sex of baby (alphabetic coding)     “male”, “female”

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385

Appendix 1 (revision 12 Jul 2010) p. 1

CASS Dataset (cass.dta)

Source

This dataset is available at the website for the textbook by Pepe (2003): http://www.fhcrc.org/science/labs/pepe/book, where it is listed as est1.dta. This version of the data was published in Leisenring et al. (2000, Table 5).

Description

[useful dataset to illustrate comparing two diagnostic tests with a common reference standard (gold standard)]

Described in Pepe (2003, p.8), the data come from the Coronary Artery Surgery Study (CASS) and were originally reported by Weiner et al. (1979, Table 2). In a cohort study of N=1465 men undergoing coronary arteriography (the gold standard) for suspected or probable coronary heart disease, both an exercise stress test (EST) and chest pain history (CPH) were recorded.

Codebook

N = 1,465

    cad   coronary artery disease (gold standard)           1 = yes        0 = no
    est   exercise stress test (diagnostic test for CAD)    1 = positive   0 = negative
    cph   chest pain history (diagnostic test for CAD)      1 = positive   0 = negative

__________
Pepe MS. (2003).
The Statistical Evaluation of Medical Tests for Classification and Prediction. New York, Oxford University Press.

Leisenring W, Alonzo T, Pepe MS. (2000). Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics 56:345-51.

Weiner DA, Ryan TJ, McCabe CH, Kennedy JW, Schloss M, Tristani F, Chaitman BR, Fisher LD. (1979). Exercise stress testing. Correlations among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). New England Journal of Medicine 301(5):230-5.

Evans County Dataset (evans.dta)

Source

dataset to accompany Kleinbaum and Klein (2002, chapter 2)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data

Brief Description

[modeled with standard (unconditional) logistic regression]

Data are from a cohort study in which n=609 white males were followed for 7 years, with coronary heart disease as the outcome of interest.

Codebook

n = 609

outcome
    chd   coronary heart disease (1=presence, 0=absence)
predictors
    cat   catecholamine level (1=high, 0=normal)
    age   age in years (continuous)
    chl   cholesterol (continuous)
    smk   smoker (1=ever smoked, 0=never smoked)
    ecg   electrocardiogram abnormality (1=presence, 0=absence)
    dbp   diastolic blood pressure (continuous)
    sbp   systolic blood pressure (continuous)
    hbp   high blood pressure (1=presence, 0=absence)
          defined as: SBP ≥ 160 or DBP ≥ 95
data management
    id    subject identifier (unique #, one observation per subject)

. list in 1/10

     +-----------------------------------------------------------+
     |  id   chd   cat   age   chl   smk   ecg   dbp   sbp   hbp |
     |-----------------------------------------------------------|
  1. |  21     0     0    56   270     0     0    80   138     0 |
  2. |  31     0     0    43   159     1     0    74   128     0 |
  3. |  51     1     1    56   201     1     1   112   164     1 |
  4. |  71     0     1    64   179     1     0   100   200     1 |
  5. |  74     0     0    49   243     1     0    82   145     0 |
     |-----------------------------------------------------------|
  6. |  91     0     0    46   252     1     0    88   142     0 |
  7. | 111     1     0    52   179     1     1    80   128     0 |
  8. | 131     0     0    63   217     0     0    92   135     0 |
  9. | 141     0     0    42   176     1     0    76   114     0 |
 10. | 191     0     0    55   250     0     1   114   182     1 |
     +-----------------------------------------------------------+

_______________
Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text, 2nd ed. New York, Springer-Verlag.

Forced Expiratory Volume (FEV) dataset (fev.dta)

Source

dataset that accompanies text: Rosner (1995).

Brief Description

[modeled with linear regression]

Data are determinations of FEV in 654 children, ages 3-19, who were seen in the Childhood Respiratory Disease Study in East Boston, Massachusetts (Tager et al, 1979).

Codebook

n = 654

outcome
    fev      FEV1, forced expiratory volume (liters)
predictors
    age      age (years)
    height   height (inches)
    male     male gender (1=male, 0=female)
    smoker   smoking status (1=ever smoked, 0=never smoked)
             “ever smoked” means currently smoking or having at some time smoked as much as one cigarette per week; “never smoked” means never having smoked as much as one cigarette per week (Tager et al, 1979).
data management
    id       subject identifier (unique #, one observation per subject)

note: the study collected smoking data on both parents and children. The children are the subjects in this dataset, and the smoking variable applies to them.

_________________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.

Tager IB, Weiss ST, Rosner B, Speizer FE. (1979). Effect of parental cigarette smoking on pulmonary function in children. Am J Epidemiol 110:15-26.

Framingham Heart Study (2.20.Framingham.dta)

This dataset is distributed with the textbook:

Dupont WD. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge UK, Cambridge University Press, 2002, p. 77.

downloadable from: www.mc.vanderbilt.edu/prevmed/wddtext.
The dataset comes from a long-term follow-up study of cardiovascular risk factors on 4699 patients living in the town of Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline exam (recruitment of patients started in 1948).

Codebook

    Variable   Labels

    Baseline exam:
    sbp        systolic blood pressure (SBP) in mm Hg
    dbp        diastolic blood pressure (DBP) in mm Hg
    age        age in years
    scl        serum cholesterol (SCL) in mg/100ml
    bmi        body mass index (BMI) = weight/height² in kg/m²
    sex        gender   1=male, 2=female
    month      month of year in which baseline exam occurred
    id         patient identification variable (numbered 1 to 4699)

    Follow-up information on coronary heart disease:
    followup   follow-up in days
    chdfate    CHD outcome   1=patient develops CHD at the end of follow-up
                             0=otherwise

Lee Life Table and Lee Survival Dataset (LeeLifeTable.dta & LeeSurvival.dta)

Source

This dataset is listed in the survival analysis textbook by Lee (1980, Table 3.5, p.31). The data originally came from Myers (1969).

Brief Description

The data concern male patients with localized cancer of the rectum diagnosed in Connecticut from 1935 to 1954. The research question is whether survival improved for the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort (cohort = 0).

The file LeeLifeTable.dta is used for generating life tables (see Chapter 5-7) using

    ltable interval died, survival by(cohort)

Follow-up is to the end of the 10th year. To make the life table come out right, if the subject was still alive and still being followed at the end of year 10, a score of 11 was assigned.

The file LeeSurvival.dta is used for all other survival analyses, such as log-rank tests, Kaplan-Meier curves, and Cox regression. Subjects followed to the end of the 10th year are assigned a score of 10.
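
As an illustration of the analyses just named, a minimal sketch follows. It assumes LeeSurvival.dta has already been created in the current working directory by the do-file given in this appendix. The particular commands (stset, sts test, sts graph, stcox) are our choice for illustration and are not prescribed by Lee (1980).

```stata
* Sketch only: assumes LeeSurvival.dta exists in the working directory
use LeeSurvival, clear

* declare survival-time data: interval is follow-up time in years,
* died==1 marks the event (0 = withdrawn alive or lost to follow-up)
stset interval, failure(died==1)

* log-rank test comparing the two diagnosis cohorts
sts test cohort, logrank

* Kaplan-Meier survival curves, one per cohort
sts graph, by(cohort)

* Cox regression of survival on cohort (reports a hazard ratio)
stcox cohort
```

Because follow-up times are recorded only to the nearest year, many subjects share the same time, so the results depend somewhat on how ties are handled.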
Codebook

    id         study ID number
    cohort     1 = 1945-1954 patient cohort
               0 = 1935-1944 patient cohort
    interval   1 to 10, time interval (year) following cancer diagnosis
               11 = still alive and being followed at end of year 10, with follow-up ending at end of year 10 (LeeLifeTable.dta only)
    died       1 = died
               0 = withdrawn alive or lost to follow-up during year interval

____________
Lee ET. (1980). Statistical Methods for Survival Data Analysis. Belmont CA, Lifetime Learning Publications.

Myers MH. (1969). A Computing Procedure for a Significance Test of the Difference Between Two Survival Curves, Methodological Note No. 18 in Methodological Notes compiled by the End Results Sections, National Cancer Institute, National Institutes of Health, Bethesda, Maryland.

To create the two data files, copy the following into the Stata do-file editor, highlight it, and hit the run button (last icon on do-file editor menu bar).
    clear
    * each data row holds four blocks of: cohort, interval, died, count
    input cohort1 interval1 died1 count1 ///
          cohort2 interval2 died2 count2 cohort3 interval3 died3 count3 ///
          cohort4 interval4 died4 count4
    0  1 1 167   1  1 1 185   0  1 0  2   1  1 0 10
    0  2 1  45   1  2 1  88   0  2 0  1   1  2 0 10
    0  3 1  45   1  3 1  55   0  3 0  1   1  3 0 10
    0  4 1  19   1  4 1  43   0  4 0  1   1  4 0 10
    0  5 1  17   1  5 1  32   0  5 0  1   1  5 0 14
    0  6 1  11   1  6 1  31   0  6 0  1   1  6 0 52
    0  7 1   8   1  7 1  20   0  7 0  1   1  7 0 38
    0  8 1   0   1  8 1   7   0  8 0  1   1  8 0 24
    0  9 1   6   1  9 1   6   0  9 0  1   1  9 0 25
    0 10 1   7   1 10 1   6   0 10 0  1   1 10 0 24
    0 11 1   0   1 11 1   0   0 11 0 52   1 11 0 59
    end
    * stack the four blocks so each row is one cohort-interval-status cell
    quietly gen tempid=_n
    quietly reshape long cohort interval died count , ///
        i(tempid) j(block)
    * turn each cell count into that many subject-level observations
    drop if count==0
    expand count
    drop block tempid count
    label define cohortlab 0 "0) 1935-1944" 1 "1) 1945-1954"
    label values cohort cohortlab
    label variable interval "follow-up time (one-year intervals)"
    label define diedlab 0 "0) still alive" 1 "1) died"
    label values died diedlab
    sort cohort interval died
    save LeeLifeTable, replace // for Life Tables
    * subjects alive at end of year 10 are scored 10 rather than 11
    recode interval (11=10)
    save LeeSurvival, replace // for all other survival analysis

Medpar1 Dataset (medpar1.dta)

Source

This dataset accompanies the textbook by Hardin and Hilbe (2007). It was downloaded from the publisher’s website: http://www.stata-press.com/data/hh2/

Brief Description

[modeled with generalized gamma regression or linear regression after log transformation of LOS]

The data are from the U.S. Medicare database, for a single hospital and one diagnostic code (DRG). The study aim is to determine if type of admission is associated with length of stay (LOS), after adjusting for the number of ICD-9 codes (Hardin and Hilbe, 2007, p.110).

Codebook

n = 3,676 observations (1 observation per subject)

outcome
    los         length of stay (continuous, 1 to 87 days)
predictors
    admittype   type of admission (1=elective, 0=emergency or urgent)
    codes       number of ICD-9 codes recorded (continuous, 1 to 9)

___________
Hardin JW, Hilbe JM. (2007).
Generalized Linear Models and Extensions, 2nd ed. Stata Press.

MI Dataset (mi.dta)

Source

dataset to accompany Kleinbaum and Klein (2002, Chapter 8)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data

Brief Description

[modeled with conditional logistic regression]

Data are from a 1:2 matched case-control study in which n=117 subjects are formed into 39 matched strata. Each stratum contains three subjects, one of whom is a case diagnosed with myocardial infarction and the other two are matched controls. Matching was done on age, race, sex, and hospital status.

Codebook

n = 117 observations (39 matches)

outcome
    mi         myocardial infarction (1=presence, 0=absence)
predictors
    smk        smoker (1=current smoker, 0=not current smoker)
    sbp        systolic blood pressure (continuous)
    ecg        electrocardiogram abnormality (1=presence, 0=absence)
data management
    match      variable indicating subject’s matched stratum (range 1 to 39)
    person     subject identifier (unique #, one observation per subject)
    survtime   was never defined in the Kleinbaum and Klein (2002) textbook.

. list in 1/10

     +--------------------------------------------------+
     | match   person   mi   smk   sbp   ecg   survtime |
     |--------------------------------------------------|
  1. |     1        1    1     0   160     1          1 |
  2. |     1        2    0     0   140     0          2 |
  3. |     1        3    0     0   120     0          2 |
  4. |     2        4    1     0   160     1          1 |
  5. |     2        5    0     0   140     0          2 |
     |--------------------------------------------------|
  6. |     2        6    0     0   120     0          2 |
  7. |     3        7    1     0   160     0          1 |
  8. |     3        8    0     0   140     0          2 |
  9. |     3        9    0     0   120     0          2 |
 10. |     4       10    1     0   160     0          1 |
     +--------------------------------------------------+

_________
Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text, 2nd ed. New York, Springer-Verlag.

Resting Metabolic Rate Dataset (rmr.dta)

Data published by Nawata et al (2004). The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta

Codebook

    group   urinary excretion of albumin group (U-Alb)
            a = U-Alb < 30 mg/d
            b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
            c = 300 mg/d < U-Alb
    lbm     lean body mass (kg)
    rmr     resting metabolic rate (kJ/h/m²)

_______
Nawata K, Sohmiya M, Kawaguchi M, et al. (2004). Increased resting metabolic rate in patients with type 2 diabetes mellitus accompanied by advanced diabetic nephropathy. Metabolism 53(11):1395-1398.

Smoking Cessation Study (smoke.csv)

This dataset was distributed with the Rosner (1995) biostatistics textbook.

In this dataset, 234 smokers who expressed a willingness to quit smoking were followed for one year to estimate the proportion of recidivism (quit for a time and then started again).

Codebook

    Variable   Labels
    id         subject identification number
    age        age when subject quit smoking
    gender     gender   1=male, 2=female
    cigs       smoking habit at time subject quit smoking (cigarettes/day)
    days       days abstinent (up to 365, which ends the follow-up period)

__________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont CA, Duxbury Press.

Wieand Dataset (wiedat2b.dta)

Source

This dataset is available at the website for the textbook by Pepe (2003): http://www.fhcrc.org/science/labs/pepe/book, where it is listed as wiedat2b.dta. This version of the data was published in Wieand et al. (1989).

Description

Described in Pepe (2003, p.10), the data were first published by Wieand et al. (1989), taken from a case-control study at the Mayo Clinic with 90 pancreatic cancer cases and 51 non-cancer controls with pancreatitis. The predictors are serum samples assayed for CA-19-9, a carbohydrate antigen, and CA-125, a cancer antigen. The study question is which of the two biomarkers best discriminates between cases and controls.
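
One common way to approach this question is to compare the areas under the ROC curves of the two markers. The sketch below assumes wiedat2b.dta is in the working directory and uses the variable names given in this dataset's codebook (d, y1, y2). The choice of roctab and roccomp is ours for illustration and is not necessarily the analysis presented in Pepe (2003).

```stata
* Sketch only: assumes wiedat2b.dta exists in the working directory
use wiedat2b, clear

* ROC area for each marker separately, with d as the reference standard
roctab d y1
roctab d y2

* test equality of the two ROC areas (both markers measured on the
* same subjects) and plot the two curves on one graph
roccomp d y1 y2, graph summary
```

The marker with the larger area under the curve discriminates better on average, although the curves may cross, so it is worth inspecting the graph as well as the test.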
Codebook

N = 141

    Variable   Labels
    y1         CA19-9 carbohydrate antigen (continuous) [Del Villano et al., 1983]
    y2         CA125 cancer antigen (continuous) [Bast et al., 1983]
    d          pancreatic cancer (referent standard, or “gold” standard)   1 = yes   0 = no

-----
Bast RC, Klug TL, St. John E, et al. (1983). Radio-immunoassay using a monoclonal antibody to monitor the course of epithelial ovarian cancer. N Engl J Med 309:883-7.

Del Villano BC, Brennan S, Brock P, et al. (1983). Radioimmunometric assay for a monoclonal antibody-defined tumor marker, CA19-9. Clin Chem 29:549-52.

Pepe MS. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York, Oxford University Press.

Wieand S, Gail MH, James BR, and James KL. (1989). A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76(3):585-92.

Wright Low Birthweight Dataset (wright_lowbw.dta)

The dataset concerns 900 birthweight outcomes and risk factors attributable to the mother.

Codebook

    Variable   Labels
    lowbw      low birthweight delivery              1=yes 0=no
    alc        mother’s alcohol drinking frequency   1=Light, 2=Moderate, 3=Heavy
    smo        mother smoked                         1=no 2=yes
    soc        mother’s social status                1=I and II (lower), 2=III (middle), 3=IV and V (upper)

________
Source: This dataset, in a more condensed form, came from Stata’s website by using the command “use http://www.stata-press.com/dta/48/binreg”. It can be found in Wacholder (1986), who got it from Wright et al. (1983).

References:

Wacholder S (1986). Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol 123:174-184.

Wright JT, Waterson EJ, Barrison PJ, et al. (1983). Alcohol consumption, pregnancy and low birthweight. Lancet 1:663-665.

Vaso Dataset (vaso.dta)

Source

Dataset to accompany textbook: Aitkin M, Anderson D, Francis B, Hinde J. Statistical Modeling in GLIM. Oxford, Clarendon Press, 1989.
Data were originally published in Finney DJ. The estimation from original records of the relationship between dose and quantal response. Biometrika 1947;34:320-334.

Brief Description

The data were obtained in a carefully controlled study of the effect of the RATE and VOLume of air inspired by human subjects on the occurrence (coded 1) or non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of the fingers. Three subjects were involved in the study: the first contributed 9 observations at different values of RATE and VOL, the second 8, and the third 22 observations. The experiment was designed to ensure as far as possible that successive observations obtained on each subject were independent: serial correlation between successive observations on the same subject in such studies is always a possibility. The aim is to fit a statistical model relating RESP to RATE and VOL. (Aitkin, et al. p.167).

Codebook

n = 39

outcome
    resp   response (1=occurrence of vasoconstriction, 0=non-occurrence of vasoconstriction)
predictors
    vol    volume of air inspired
    rate   rate of air inspired
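
Since the stated aim is a model relating RESP to RATE and VOL, and the outcome is binary, logistic regression is one natural starting point. The sketch below assumes vaso.dta is in the working directory with the variables resp, vol, and rate as listed in the codebook. The log-scale specification and the names lnvol and lnrate are assumptions introduced here for illustration only; they are not part of the dataset.

```stata
* Sketch only: assumes vaso.dta exists in the working directory
use vaso, clear

* logistic regression of the vasoconstriction response on volume and rate
logit resp vol rate

* the same model, reported as odds ratios
logistic resp vol rate

* alternative specification on the log scale
* (lnvol and lnrate are hypothetical names created here)
generate lnvol = ln(vol)
generate lnrate = ln(rate)
logit resp lnvol lnrate
```

These models treat the 39 observations as independent. The codebook lists no subject identifier, so the within-subject correlation mentioned in the description cannot be modeled from these variables alone.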