Appendix 1. Dataset Descriptions

Births Dataset (births.tab)
This dataset is distributed with the textbook: Hills M, De Stavola BL. A Short
Introduction to Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002.
http://www.timberlake.co.uk
The dataset concerns 500 mothers who had singleton births in a large London hospital.
Codebook
Variable    Labels
id          subject number
bweight     birth weight (grams)
lowbw       birth weight < 2500 g
            1=yes, 0=no
gestwks     gestational age (weeks)
preterm     gestational age < 37 weeks
            1=yes, 0=no
matage      maternal age (years)
hyp         maternal hypertension
            1=hypertensive, 0=normal
sex         sex of baby
            1=male, 2=female
sexalph     sex of baby (alphabetic coding)
            “male”, “female”
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata [unpublished manuscript] University of Utah School of
Medicine, 2009.
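The two indicator variables in the births codebook are simple cutoffs on bweight and gestwks. A minimal sketch of that derivation follows, in Python rather than Stata. The function name and the example values are hypothetical and are not taken from births.tab, and missing values are not handled.

```python
def derive_indicators(bweight, gestwks):
    """Return (lowbw, preterm) as 1/0 flags using the codebook cutoffs."""
    lowbw = 1 if bweight < 2500 else 0    # birth weight < 2500 g
    preterm = 1 if gestwks < 37 else 0    # gestational age < 37 weeks
    return lowbw, preterm


# Hypothetical inputs for illustration only (not rows from births.tab).
print(derive_indicators(2400, 36.5))
print(derive_indicators(2500, 37.0))
```

Both cutoffs are strict inequalities, so a baby of exactly 2500 g or exactly 37 weeks is coded 0.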
Evans County Dataset (evans.dta)
Source
dataset to accompany Kleinbaum and Klein (K&K chapter 2)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data
Brief Description [modeled with standard (unconditional) logistic regression]
Data are from a cohort study in which n=609 white males were followed for 7
years, with coronary heart disease as the outcome of interest.
Codebook
n = 609
outcome
    chd     coronary heart disease (1=presence, 0=absence)
predictors
    cat     catecholamine level (1=high, 0=normal)
    age     age in years (continuous)
    chl     cholesterol (continuous)
    smk     smoker (1=ever smoked, 0=never smoked)
    ecg     electrocardiogram abnormality (1=presence, 0=absence)
    dbp     diastolic blood pressure (continuous)
    sbp     systolic blood pressure (continuous)
    hpt     high blood pressure (1=presence, 0=absence)
            defined as: SBP ≥ 160 or DBP ≥ 95
    cc      product term of cat × chl
    ch      product term of cat × hpt
data management
    id      subject identifier (unique #, one observation per subject)
. list in 1/10

     +----------------------------------------------------------------------+
     |  id   chd   cat   age   chl   smk   ecg   dbp   sbp   hpt    cc   ch |
     |----------------------------------------------------------------------|
  1. |  21     0     0    56   270     0     0    80   138     0     0    0 |
  2. |  31     0     0    43   159     1     0    74   128     0     0    0 |
  3. |  51     1     1    56   201     1     1   112   164     1   201    1 |
  4. |  71     0     1    64   179     1     0   100   200     1   179    1 |
  5. |  74     0     0    49   243     1     0    82   145     0     0    0 |
     |----------------------------------------------------------------------|
  6. |  91     0     0    46   252     1     0    88   142     0     0    0 |
  7. | 111     1     0    52   179     1     1    80   128     0     0    0 |
  8. | 131     0     0    63   217     0     0    92   135     0     0    0 |
  9. | 141     0     0    42   176     1     0    76   114     0     0    0 |
 10. | 191     0     0    55   250     0     1   114   182     1     0    0 |
     +----------------------------------------------------------------------+
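The derived variables hpt, cc, and ch can be recomputed from the raw columns and checked against the ten listed observations. The sketch below is in Python rather than Stata and assumes the hypertension cutoffs are SBP ≥ 160 or DBP ≥ 95, which is the reading consistent with the listing. The function name `derive` is hypothetical.

```python
# Columns: id, chd, cat, age, chl, smk, ecg, dbp, sbp, hpt, cc, ch
ROWS = [
    (21, 0, 0, 56, 270, 0, 0, 80, 138, 0, 0, 0),
    (31, 0, 0, 43, 159, 1, 0, 74, 128, 0, 0, 0),
    (51, 1, 1, 56, 201, 1, 1, 112, 164, 1, 201, 1),
    (71, 0, 1, 64, 179, 1, 0, 100, 200, 1, 179, 1),
    (74, 0, 0, 49, 243, 1, 0, 82, 145, 0, 0, 0),
    (91, 0, 0, 46, 252, 1, 0, 88, 142, 0, 0, 0),
    (111, 1, 0, 52, 179, 1, 1, 80, 128, 0, 0, 0),
    (131, 0, 0, 63, 217, 0, 0, 92, 135, 0, 0, 0),
    (141, 0, 0, 42, 176, 1, 0, 76, 114, 0, 0, 0),
    (191, 0, 0, 55, 250, 0, 1, 114, 182, 1, 0, 0),
]


def derive(cat, chl, dbp, sbp):
    """Recompute (hpt, cc, ch) from the raw predictors."""
    # Assumed cutoffs: SBP >= 160 or DBP >= 95 (mm Hg).
    hpt = 1 if (sbp >= 160 or dbp >= 95) else 0
    cc = cat * chl   # product term cat x chl
    ch = cat * hpt   # product term cat x hpt
    return hpt, cc, ch


# Collect the ids of any listed rows whose stored values disagree.
mismatches = []
for (id_, chd, cat, age, chl, smk, ecg, dbp, sbp, hpt, cc, ch) in ROWS:
    if derive(cat, chl, dbp, sbp) != (hpt, cc, ch):
        mismatches.append(id_)

print(mismatches)  # prints [] for the ten listed rows
```

With the cutoffs reversed (DBP ≥ 160 or SBP ≥ 95), the first listed subject (sbp = 138, hpt = 0) would already fail the check.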
Forced Expiratory Volume (FEV) dataset (fev.dta)
Source
dataset that accompanies text: Rosner (1995).
Brief Description [modeled with linear regression]
Data are determinations of FEV in 654 children, ages 3-19, who were seen in the
Childhood Respiratory Disease Study in East Boston, Massachusetts (Tager et al,
1979).
Codebook
n = 654
outcome
    fev     FEV1, forced expiratory volume (liters)
predictors
    age     age (years)
    height  height (inches)
    male    male gender (1=male, 0=female)
    smoker  smoking status (1=ever smoked, 0=never smoked)
            “Ever smoked” means currently smoking or having at some time
            smoked as much as one cigarette per week; “never smoked” means
            never having smoked as much as one cigarette per week (Tager et al).
data management
    id      subject identifier (unique #, one observation per subject)
Note: the study collected smoking data on both parents and children. The
children are the subjects in this dataset, and the smoking variable applies to them.
_________________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
Tager IB, Weiss ST, Rosner B, Speizer FE. (1979). Effect of parental cigarette smoking
on pulmonary function in children. Am J Epidemiol 110:15-26.
Framingham Heart Study (2.20.Framingham.dta)
This dataset is distributed with the textbook: Dupont WD. Statistical Modeling for
Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data.
Cambridge UK, Cambridge University Press, 2002, p. 77.
downloadable from: www.mc.vanderbilt.edu/prevmed/wddtext.
The dataset comes from a long-term follow-up study of cardiovascular risk factors on
4699 patients living in the town of Framingham, Massachusetts. The patients were free
of coronary heart disease at their baseline exam (recruitment of patients started in 1948).
Codebook
Variable        Labels
Baseline exam:
    sbp         systolic blood pressure (SBP) in mm Hg
    dbp         diastolic blood pressure (DBP) in mm Hg
    age         age in years
    scl         serum cholesterol (SCL) in mg/100ml
    bmi         body mass index (BMI) = weight/height² in kg/m²
    sex         gender
                1=male, 2=female
    month       month of year in which baseline exam occurred
    id          patient identification variable (numbered 1 to 4699)
Follow-up information on coronary heart disease:
    followup    follow-up in days
    chdfate     CHD outcome
                1=patient develops CHD at the end of follow-up
                0=otherwise
LeeLife Dataset (LeeLife.dta)
Source
This dataset is listed in the survival analysis textbook by Lee (1980, Table 3.5, p.31).
The data originally came from Myers (1969).
Brief Description
The data concern male patients with localized cancer of the rectum diagnosed in
Connecticut from 1935 to 1954. The research question is whether survival improved for
the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort
(cohort = 0).
Codebook
    id          study ID number
    cohort      1 = 1945-1954 patient cohort
                0 = 1935-1944 patient cohort
    interval    1 to 10, time interval (year) following cancer diagnosis
                11 = still alive and being followed at end of year 10
    died        1 = died
                0 = withdrawn alive or lost to follow-up during year interval
____________
Lee ET. (1980). Statistical Methods for Survival Data Analysis. Belmont CA, Lifetime
Learning Publications.
Myers MH. (1969). A Computing Procedure for a Significance Test of the Difference
Between Two Survival Curves, Methodological Note No. 18 in Methodological
Notes compiled by the End Results Sections, National Cancer Institute, National
Institutes of Health, Bethesda, Maryland.
MI Dataset (mi.dta)
Source
dataset to accompany Kleinbaum and Klein (2002, Chapter 8)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data
Brief Description [modeled with conditional logistic regression]
Data are from a 1:2 matched case-control study in which n=117 subjects are
formed into 39 matched strata. Each stratum contains three subjects, one of
whom is a case diagnosed with myocardial infarction and the other two are
matched controls. Matching was done on age, race, sex, and hospital status.
Codebook
n = 117 observations (39 matches)
outcome
    mi      myocardial infarction (1=presence, 0=absence)
predictors
    smk     smoker (1=current smoker, 0=not current smoker)
    sbp     systolic blood pressure (continuous)
    ecg     electrocardiogram abnormality (1=presence, 0=absence)
data management
    match   variable indicating subject’s matched stratum (range 1 to 39)
    person  subject identifier (unique #, one observation per subject)
Note: survtime was never defined in the Kleinbaum and Klein textbook.
. list in 1/10

     +--------------------------------------------------+
     |  match  person   mi   smk   sbp   ecg   survtime |
     |--------------------------------------------------|
  1. |      1       1    1     0   160     1          1 |
  2. |      1       2    0     0   140     0          2 |
  3. |      1       3    0     0   120     0          2 |
  4. |      2       4    1     0   160     1          1 |
  5. |      2       5    0     0   140     0          2 |
     |--------------------------------------------------|
  6. |      2       6    0     0   120     0          2 |
  7. |      3       7    1     0   160     0          1 |
  8. |      3       8    0     0   140     0          2 |
  9. |      3       9    0     0   120     0          2 |
 10. |      4      10    1     0   160     0          1 |
     +--------------------------------------------------+
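Before fitting a conditional model it can help to confirm the 1:2 matching structure, that is, one case and two controls in every stratum. The sketch below, in Python rather than Stata, counts cases and controls per stratum for the ten listed observations only. Stratum 4 is incomplete in the listing because the listing stops at observation 10. The function name `stratum_counts` is hypothetical.

```python
from collections import defaultdict

# (match, person, mi) for the ten listed observations
LISTED = [
    (1, 1, 1), (1, 2, 0), (1, 3, 0),
    (2, 4, 1), (2, 5, 0), (2, 6, 0),
    (3, 7, 1), (3, 8, 0), (3, 9, 0),
    (4, 10, 1),
]


def stratum_counts(rows):
    """Map each stratum to a (cases, controls) tuple."""
    counts = defaultdict(lambda: [0, 0])
    for match, _person, mi in rows:
        counts[match][0 if mi == 1 else 1] += 1
    return {m: tuple(c) for m, c in counts.items()}


counts = stratum_counts(LISTED)
# Keep only strata that are fully visible in the listing (3 subjects).
complete = {m: c for m, c in counts.items() if sum(c) == 3}
print(complete)  # prints {1: (1, 2), 2: (1, 2), 3: (1, 2)}
```

Run on the full file, the same check would be expected to return (1, 2) for all 39 strata, giving 39 × 3 = 117 observations.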
Resting Metabolic Rate Dataset (rmr.dta)
Data published by Nawata et al (2004) (on course CD). The data were taken from the
authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the
authors.
File rmr.dta Codebook
    group   urinary excretion of albumin group (U-Alb)
            a = U-Alb < 30 mg/d
            b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
            c = 300 mg/d < U-Alb
    lbm     lean body mass (kg)
    rmr     resting metabolic rate (kJ/h/m²)
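The three albumin-excretion groups partition the U-Alb scale at 30 and 300 mg/d, with both boundary values falling in group b. A minimal sketch of that classification follows, in Python rather than Stata. The function name is hypothetical, and rmr.dta itself stores only the group letter, not the underlying U-Alb value.

```python
def ualb_group(ualb_mg_per_day):
    """Classify urinary albumin excretion (mg/d) into group a, b, or c."""
    if ualb_mg_per_day < 30:
        return "a"            # U-Alb < 30 mg/d
    if ualb_mg_per_day <= 300:
        return "b"            # 30 mg/d <= U-Alb <= 300 mg/d
    return "c"                # 300 mg/d < U-Alb


# Boundary values: 30 and 300 both belong to group b.
print([ualb_group(x) for x in (29.9, 30, 300, 300.1)])
```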
Smoking Cessation Study (smoke.csv)
This dataset was distributed with the Rosner (1995) biostatistics textbook. In this dataset,
234 smokers who expressed a willingness to quit smoking were followed for one year to
estimate the proportion of recidivism (quit for a time and then started again).
Codebook
Variable    Labels
id          subject identification number
age         age quit smoking
gender      gender
            1=male, 2=female
cigs        smoking habit at time quit smoking (cigarettes/day)
days        days abstinent (up to 365, which ends the follow-up period)
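The proportion of recidivism can be estimated from the days variable under one assumption that the codebook does not state outright: a subject with days below 365 resumed smoking during follow-up, while days equal to 365 means abstinent for the whole year. The sketch below is in Python rather than Stata. The function name and the example values are hypothetical and are not taken from smoke.csv.

```python
def recidivism_proportion(days_abstinent):
    """Share of subjects who relapsed before the end of the 365-day follow-up.

    Assumption: days < 365 marks a relapse; days == 365 marks a full year
    of abstinence.
    """
    relapsed = sum(1 for d in days_abstinent if d < 365)
    return relapsed / len(days_abstinent)


# Hypothetical values for illustration only.
example_days = [365, 12, 90, 365, 200]
print(recidivism_proportion(example_days))  # prints 0.6
```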
Wright Low Birthweight Dataset (wright_lowbw.dta)
The dataset concerns 900 birthweight outcomes and risk factors attributable to the
mother.
Codebook
Variable    Labels
lowbw       low birthweight delivery
            1=yes, 0=no
alc         mother’s alcohol drinking frequency
            1=Light, 2=Moderate, 3=Heavy
smo         mother smoked
            1=no, 2=yes
soc         mother’s social status
            1=I and II (lower), 2=III (middle), 3=IV and V (upper)
________
Source: This dataset, in a more condensed form, came from Stata’s website by using the
command “use http://www.stata-press.com/dta/48/binreg”. It can be found in Wacholder
(1986), who got it from Wright et al. (1983).
References:
Wacholder S (1986). Binomial regression in GLIM: estimating risk ratios and risk
differences. Am J Epidemiol 123:174-184.
Wright JT, Waterson EJ, Barrison PJ, et al. (1983). Alcohol consumption, pregnancy and
low birthweight. Lancet 1:663-665.
Vaso Dataset (vaso.dta)
Source
Dataset to accompany textbook: Aitkin M, Anderson D, Francis B, Hinde J.
Statistical Modelling in GLIM. Oxford, Clarendon Press, 1989. Data were
originally published in Finney DJ, The estimation from original records of the
relationship between dose and quantal response. Biometrika 1947;34:320-334.
Brief Description
The data were obtained in a carefully controlled study of the effect of the RATE
and VOLume of air inspired by human subjects on the occurrence (coded 1) or
non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of
the fingers. Three subjects were involved in the study: the first contributed 9
observations at different values of RATE and VOL, the second 8, and the third 22
observations. The experiment was designed to ensure as far as possible that
successive observations obtained on each subject were independent: serial
correlation between successive observations on the same subject in such studies is
always a possibility. The aim is to fit a statistical model relating RESP to RATE
and VOL. (Aitkin, et al. p.167).
Codebook
n = 39
outcome
    resp    response (1=occurrence of vasoconstriction,
            0=non-occurrence of vasoconstriction)
predictors
    vol     volume of air inspired
    rate    rate of air inspired