Appendix 1. Dataset Descriptions

Births Dataset (births.tab)

This dataset is distributed with the textbook:

Hills M, De Stavola BL. A Short Introduction to Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002. http://www.timberlake.co.uk

The dataset concerns 500 mothers who had singleton births in a large London hospital.

Codebook

Variable    Labels
id          subject number
bweight     birth weight (grams)
lowbw       birth weight < 2500 g  1=yes, 0=no
gestwks     gestational age (weeks)
preterm     gestational age < 37 weeks  1=yes, 0=no
matage      maternal age (years)
hyp         maternal hypertension  1=hypertensive, 0=normal
sex         sex of baby  1=male, 2=female
sexalph     sex of baby (alphabetic coding)  “male”, “female”

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata [unpublished manuscript] University of Utah School of Medicine, 2009. Appendix 1 p. 1

Evans County Dataset (evans.dta)

Source

Dataset to accompany Kleinbaum and Klein (K&K chapter 2)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data

Brief Description [modeled with standard (unconditional) logistic regression]

Data are from a cohort study in which n=609 white males were followed for 7 years, with coronary heart disease as the outcome of interest.

Codebook    n = 609

outcome
  chd       coronary heart disease (1=presence, 0=absence)
predictors
  cat       catecholamine level (1=high, 0=normal)
  age       age in years (continuous)
  chl       cholesterol (continuous)
  smk       smoker (1=ever smoked, 0=never smoked)
  ecg       electrocardiogram abnormality (1=presence, 0=absence)
  dbp       diastolic blood pressure (continuous)
  sbp       systolic blood pressure (continuous)
  hpt       high blood pressure (1=presence, 0=absence)
            defined as: DBP ≥ 95 or SBP ≥ 160
  cc        product term of cat × chl
  ch        product term of cat × hpt
data management
  id        subject identifier (unique #, one observation per subject)
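The three derived variables in the Evans County codebook (hpt, cc, ch) can be checked against the first ten observations shown in this dataset's listing. The sketch below is in Python rather than Stata and is only an illustration. It assumes the cutoffs DBP ≥ 95 or SBP ≥ 160 for hpt, which is the reading consistent with the listed rows, and the helper names derive and mismatches are hypothetical.

```python
# First ten observations of evans.dta, copied from the listing.
# Column order: id, chd, cat, age, chl, smk, ecg, dbp, sbp, hpt, cc, ch
ROWS = [
    (21, 0, 0, 56, 270, 0, 0, 80, 138, 0, 0, 0),
    (31, 0, 0, 43, 159, 1, 0, 74, 128, 0, 0, 0),
    (51, 1, 1, 56, 201, 1, 1, 112, 164, 1, 201, 1),
    (71, 0, 1, 64, 179, 1, 0, 100, 200, 1, 179, 1),
    (74, 0, 0, 49, 243, 1, 0, 82, 145, 0, 0, 0),
    (91, 0, 0, 46, 252, 1, 0, 88, 142, 0, 0, 0),
    (111, 1, 0, 52, 179, 1, 1, 80, 128, 0, 0, 0),
    (131, 0, 0, 63, 217, 0, 0, 92, 135, 0, 0, 0),
    (141, 0, 0, 42, 176, 1, 0, 76, 114, 0, 0, 0),
    (191, 0, 0, 55, 250, 0, 1, 114, 182, 1, 0, 0),
]


def derive(cat, chl, dbp, sbp):
    """Recompute (hpt, cc, ch) from the raw variables.

    Assumption: hpt = 1 when DBP >= 95 or SBP >= 160.
    """
    hpt = int(dbp >= 95 or sbp >= 160)
    cc = cat * chl   # product term of cat and chl
    ch = cat * hpt   # product term of cat and hpt
    return hpt, cc, ch


def mismatches(rows):
    """Return the ids of rows whose stored hpt/cc/ch differ from the recomputed values."""
    bad = []
    for id_, chd, cat, age, chl, smk, ecg, dbp, sbp, hpt, cc, ch in rows:
        if derive(cat, chl, dbp, sbp) != (hpt, cc, ch):
            bad.append(id_)
    return bad


print(mismatches(ROWS))
```

An empty list means every listed row agrees with the assumed definitions. Ten rows cannot confirm the cutoffs for the full 609 observations, so the same check would need to be run on the complete file.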
. list in 1/10

     +----------------------------------------------------------------------+
     |  id   chd   cat   age   chl   smk   ecg   dbp   sbp   hpt    cc   ch |
     |----------------------------------------------------------------------|
  1. |  21     0     0    56   270     0     0    80   138     0     0    0 |
  2. |  31     0     0    43   159     1     0    74   128     0     0    0 |
  3. |  51     1     1    56   201     1     1   112   164     1   201    1 |
  4. |  71     0     1    64   179     1     0   100   200     1   179    1 |
  5. |  74     0     0    49   243     1     0    82   145     0     0    0 |
     |----------------------------------------------------------------------|
  6. |  91     0     0    46   252     1     0    88   142     0     0    0 |
  7. | 111     1     0    52   179     1     1    80   128     0     0    0 |
  8. | 131     0     0    63   217     0     0    92   135     0     0    0 |
  9. | 141     0     0    42   176     1     0    76   114     0     0    0 |
 10. | 191     0     0    55   250     0     1   114   182     1     0    0 |
     +----------------------------------------------------------------------+

Forced Expiratory Volume (FEV) dataset (fev.dta)

Source

Dataset that accompanies text: Rosner (1995).

Brief Description [modeled with linear regression]

Data are determinations of FEV in 654 children, ages 3-19, who were seen in the Childhood Respiratory Disease Study in East Boston, Massachusetts (Tager et al, 1979).

Codebook    n = 654

outcome
  fev       FEV1, forced expiratory volume (liters)
predictors
  age       age (years)
  height    height (inches)
  male      male gender (1=male, 0=female)
  smoker    smoking status (1=ever smoked, 0=never smoked)
            “ever smoked” is defined as currently smoking or had at some time smoked as much as one cigarette per week, “never smoked” is never smoked as much as one cigarette per week (Tager et al).
data management
  id        subject identifier (unique #, one observation per subject)

note: the study collected smoking data on both parents and children. The children are the subjects in this dataset, and the smoking variable applies to them.
_________________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
Tager IB, Weiss ST, Rosner B, Speizer FE. (1979). Effect of parental cigarette smoking on pulmonary function in children. Am J Epidemiol 110:15-26.
Framingham Heart Study (2.20.Framingham.dta)

This dataset is distributed with the textbook:

Dupont WD. Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge UK, Cambridge University Press, 2002, p. 77.

downloadable from: www.mc.vanderbilt.edu/prevmed/wddtext.

The dataset comes from a long-term follow-up study of cardiovascular risk factors on 4699 patients living in the town of Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline exam (recruitment of patients started in 1948).

Codebook

Variable    Labels
Baseline exam:
  sbp       systolic blood pressure (SBP) in mm Hg
  dbp       diastolic blood pressure (DBP) in mm Hg
  age       age in years
  scl       serum cholesterol (SCL) in mg/100ml
  bmi       body mass index (BMI) = weight/height^2 in kg/m^2
  sex       gender  1=male, 2=female
  month     month of year in which baseline exam occurred
  id        patient identification variable (numbered 1 to 4699)
Follow-up information on coronary heart disease:
  followup  follow-up in days
  chdfate   CHD outcome
            1=patient develops CHD at the end of follow-up
            0=otherwise

LeeLife Dataset (LeeLife.dta)

Source

This dataset is listed in the survival analysis textbook by Lee (1980, Table 3.5, p.31). The data originally came from Myers (1969).

Brief Description

The data concern male patients with localized cancer of the rectum diagnosed in Connecticut from 1935 to 1954. The research question is whether survival improved for the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort (cohort = 0).

Codebook

id          study ID number
cohort      1 = 1945-1954 patient cohort
            0 = 1935-1944 patient cohort
interval    1 to 10, time interval (year) following cancer diagnosis
            11 = still alive and being followed at end of year 10
died        1 = died
            0 = withdrawn alive or lost to follow-up during year interval
____________
Lee ET. (1980). Statistical Methods for Survival Data Analysis. Belmont CA, Lifetime Learning Publications.
Myers MH. (1969). A Computing Procedure for a Significance Test of the Difference Between Two Survival Curves, Methodological Note No. 18 in Methodological Notes compiled by the End Results Sections, National Cancer Institute, National Institutes of Health, Bethesda, Maryland.

MI Dataset (mi.dta)

Source

Dataset to accompany Kleinbaum and Klein (2002, Chapter 8)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data

Brief Description [modeled with conditional logistic regression]

Data are from a 1:2 matched case-control study in which n=117 subjects are formed into 39 matched strata. Each stratum contains three subjects: one case diagnosed with myocardial infarction and two matched controls. Matching was done on age, race, sex, and hospital status.

Codebook    n = 117 observations (39 matches)

outcome
  mi        myocardial infarction (1=presence, 0=absence)
predictors
  smk       smoker (1=current smoker, 0=not current smoker)
  sbp       systolic blood pressure (continuous)
  ecg       electrocardiogram abnormality (1=presence, 0=absence)
data management
  match     variable indicating subject’s matched stratum (range 1 to 39)
  person    subject identifier (unique #, one observation per subject)
  survtime  was never defined in the Kleinbaum and Klein textbook.

. list in 1/10

     +--------------------------------------------------+
     | match   person   mi   smk   sbp   ecg   survtime |
     |--------------------------------------------------|
  1. |     1        1    1     0   160     1          1 |
  2. |     1        2    0     0   140     0          2 |
  3. |     1        3    0     0   120     0          2 |
  4. |     2        4    1     0   160     1          1 |
  5. |     2        5    0     0   140     0          2 |
     |--------------------------------------------------|
  6. |     2        6    0     0   120     0          2 |
  7. |     3        7    1     0   160     0          1 |
  8. |     3        8    0     0   140     0          2 |
  9. |     3        9    0     0   120     0          2 |
 10. |     4       10    1     0   160     0          1 |
     +--------------------------------------------------+

Resting Metabolic Rate Dataset (rmr.dta)

Data published by Nawata et al (2004) (on course CD).
The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.

File rmr.dta

Codebook

group       urinary excretion of albumin group (U-Alb)
            a = U-Alb < 30 mg/d
            b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
            c = 300 mg/d < U-Alb
lbm         lean body mass (kg)
rmr         resting metabolic rate (kJ/h/m^2)

Smoking Cessation Study (smoke.csv)

This dataset was distributed with the Rosner (1995) biostatistics textbook. In this dataset, 234 smokers who expressed a willingness to quit smoking were followed for one year to estimate the proportion of recidivism (quit for a time and then started again).

Codebook

Variable    Labels
id          subject identification number
age         age quit smoking
gender      gender  1=male, 2=female
cigs        smoking habit at time quit smoking (cigarettes/day)
days        days abstinent (up to 365, which ends the follow-up period)

Wright Low Birthweight Dataset (wright_lowbw.dta)

The dataset concerns 900 birthweight outcomes and risk factors attributable to the mother.

Codebook

Variable    Labels
lowbw       low birthweight delivery  1=yes, 0=no
alc         mother’s alcohol drinking frequency  1=Light, 2=Moderate, 3=Heavy
smo         mother smoked  1=no, 2=yes
soc         mother’s social status  1=I and II (lower), 2=III (middle), 3=IV and V (upper)
________
Source: This dataset, in a more condensed form, came from Stata’s website by using the command “use http://www.stata-press.com/dta/48/binreg”. It can be found in Wacholder (1986), who got it from Wright et al. (1983).

References:
Wacholder S (1986). Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol 123:174-184.
Wright JT, Waterson EJ, Barrison PJ, et al. (1983). Alcohol consumption, pregnancy and low birthweight. Lancet 1:663-665.

Vaso Dataset (vaso.dta)

Source

Dataset to accompany textbook: Aitkin M, Anderson D, Francis B, Hinde J. Statistical Modeling in GLIM. Oxford, Clarendon Press, 1989.
Data were originally published in Finney DJ. The estimation from original records of the relationship between dose and quantal response. Biometrika 1947;34:320-334.

Brief Description

The data were obtained in a carefully controlled study of the effect of the RATE and VOLume of air inspired by human subjects on the occurrence (coded 1) or non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of the fingers. Three subjects were involved in the study: the first contributed 9 observations at different values of RATE and VOL, the second 8, and the third 22 observations. The experiment was designed to ensure as far as possible that successive observations obtained on each subject were independent: serial correlation between successive observations on the same subject in such studies is always a possibility. The aim is to fit a statistical model relating RESP to RATE and VOL. (Aitkin, et al. p.167).

Codebook    n = 39

outcome
  resp      response (1=occurrence of vasoconstriction, 0=non-occurrence of vasoconstriction)
predictors
  vol       volume of air inspired
  rate      rate of air inspired
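The description of the Vaso dataset says only that the aim is a model relating RESP to RATE and VOL. One possible form for a 0/1 response is a logistic model, sketched below in Python. This is an assumption for illustration: the source does not state which link or which transformation of vol and rate was used. The function name vaso_probability is hypothetical, and the coefficients are placeholders, not estimates from vaso.dta.

```python
import math


def vaso_probability(vol, rate, b0, b_vol, b_rate):
    """P(resp = 1) under an assumed logistic model.

    The linear predictor is b0 + b_vol * vol + b_rate * rate; the
    coefficients are supplied by the caller, not taken from the data.
    """
    eta = b0 + b_vol * vol + b_rate * rate
    return 1.0 / (1.0 + math.exp(-eta))


# Observations per subject as stated in the description; they add up
# to the n = 39 given in the codebook.
OBS_PER_SUBJECT = [9, 8, 22]
print(sum(OBS_PER_SUBJECT))

# With every coefficient at zero the linear predictor is zero, so the
# modeled probability is one half.
print(vaso_probability(vol=1.0, rate=1.0, b0=0.0, b_vol=0.0, b_rate=0.0))
```

Because the 39 observations come from only three subjects, any fitted model of this kind treats repeated measurements on one subject as independent, which is the caveat the description itself raises.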