Appendix 1. Dataset Descriptions
Births Dataset (births.dta)
This dataset is distributed with the textbook: Hills M, De Stavola BL. A Short
Introduction to Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002.
http://www.timberlake.co.uk
The dataset concerns 500 mothers who had singleton births in a large London hospital.
Codebook
Variable    Labels
id          subject number
bweight     birth weight (grams)
lowbw       birth weight < 2500 g
            1=yes, 0=no
gestwks     gestational age (weeks)
preterm     gestational age < 37 weeks
            1=yes, 0=no
matage      maternal age (years)
hyp         maternal hypertension
            1=hypertensive, 0=normal
sex         sex of baby
            1=male, 2=female
sexalph     sex of baby (alphabetic coding)
            “male”, “female”
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University
of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Appendix 1 (revision 12 Jul 2010)
CASS Dataset (cass.dta)
Source
This dataset is available at the website for the textbook by Pepe (2003):
http://www.fhcrc.org/science/labs/pepe/book, where it is listed as
est1.dta. This version of the data was published in Leisenring et al. (2000, Table 5).
Description [useful dataset to illustrate comparing two diagnostic tests with a common
reference standard (gold standard)]
Described in Pepe (2003, p.8), the data come from the Coronary Artery Surgery Study
(CASS) and were originally reported by Weiner et al. (1979, Table 2). In a cohort study of
N=1465 men undergoing coronary arteriography (the gold standard) for suspected or
probable coronary heart disease, both an exercise stress test (EST) and chest pain history
(CPH) were recorded.
Codebook
N = 1,465
cad    coronary artery disease (gold standard)
       1 = yes
       0 = no
est    exercise stress test (diagnostic test for CAD)
       1 = positive
       0 = negative
cph    chest pain history (diagnostic test for CAD)
       1 = positive
       0 = negative
__________
Pepe MS. (2003). The Statistical Evaluation of Medical Tests for Classification and
Prediction. New York, Oxford University Press.
Leisenring W, Alonzo T, Pepe MS. (2000). Comparisons of predictive values of binary
medical diagnostic tests for paired designs. Biometrics 56:345-51.
Weiner DA, Ryan TJ, McCabe CH, Kennedy JW, Schloss M, Tristani F, Chaitman BR,
Fisher LD. (1979). Exercise stress testing. Correlations among history of angina,
ST-segment response and prevalence of coronary-artery disease in the Coronary
Artery Surgery Study (CASS). New England Journal of Medicine 301(5):230-5.
Evans County Dataset (evans.dta)
Source
dataset to accompany Kleinbaum and Klein (2002, chapter 2)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data
Brief Description [modeled with standard (unconditional) logistic regression]
Data are from a cohort study in which n=609 white males were followed for 7
years, with coronary heart disease as the outcome of interest.
Codebook
n = 609
outcome
  chd    coronary heart disease (1=presence, 0=absence)
predictors
  cat    catecholamine level (1=high, 0=normal)
  age    age in years (continuous)
  chl    cholesterol (continuous)
  smk    smoker (1=ever smoked, 0=never smoked)
  ecg    electrocardiogram abnormality (1=presence, 0=absence)
  dbp    diastolic blood pressure (continuous)
  sbp    systolic blood pressure (continuous)
  hbp    high blood pressure (1=presence, 0=absence)
         defined as: DBP ≥ 95 or SBP ≥ 160
data management
  id     subject identifier (unique #, one observation per subject)
. list in 1/10
     +-----------------------------------------------------------+
     |  id   chd   cat   age   chl   smk   ecg   dbp   sbp   hbp |
     |-----------------------------------------------------------|
  1. |  21     0     0    56   270     0     0    80   138     0 |
  2. |  31     0     0    43   159     1     0    74   128     0 |
  3. |  51     1     1    56   201     1     1   112   164     1 |
  4. |  71     0     1    64   179     1     0   100   200     1 |
  5. |  74     0     0    49   243     1     0    82   145     0 |
     |-----------------------------------------------------------|
  6. |  91     0     0    46   252     1     0    88   142     0 |
  7. | 111     1     0    52   179     1     1    80   128     0 |
  8. | 131     0     0    63   217     0     0    92   135     0 |
  9. | 141     0     0    42   176     1     0    76   114     0 |
 10. | 191     0     0    55   250     0     1   114   182     1 |
     +-----------------------------------------------------------+
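The ten listed observations are consistent with hbp being coded 1 when DBP ≥ 95 or SBP ≥ 160. The following is a minimal sketch of how that could be checked in Stata. It re-enters only the ten rows shown above, and hbp_check is a hypothetical variable name introduced for illustration; it is not part of evans.dta.

```stata
* Sketch: re-derive hbp from dbp and sbp for the ten listed subjects.
* hbp_check is a hypothetical name, not a variable in evans.dta.
clear
input id chd cat age chl smk ecg dbp sbp hbp
 21 0 0 56 270 0 0  80 138 0
 31 0 0 43 159 1 0  74 128 0
 51 1 1 56 201 1 1 112 164 1
 71 0 1 64 179 1 0 100 200 1
 74 0 0 49 243 1 0  82 145 0
 91 0 0 46 252 1 0  88 142 0
111 1 0 52 179 1 1  80 128 0
131 0 0 63 217 0 0  92 135 0
141 0 0 42 176 1 0  76 114 0
191 0 0 55 250 0 1 114 182 1
end

* 1 if either threshold is met, 0 otherwise; missing if dbp or sbp is missing
generate byte hbp_check = (dbp >= 95 | sbp >= 160) if !missing(dbp, sbp)

* stops with an error if any derived value disagrees with the stored hbp
assert hbp_check == hbp
tabulate hbp hbp_check
```

The same generate and assert lines could be run after loading the full evans.dta to see whether the rule holds for all 609 subjects.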
_______________
Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text, 2nd ed.
New York, Springer-Verlag.
Forced Expiratory Volume (FEV) dataset (fev.dta)
Source
dataset that accompanies text: Rosner (1995).
Brief Description [modeled with linear regression]
Data are determinations of FEV in 654 children, ages 3-19, who were seen in the
Childhood Respiratory Disease Study in East Boston, Massachusetts (Tager et al,
1979).
Codebook
n = 654
outcome
  fev      FEV1, forced expiratory volume (liters)
predictors
  age      age (years)
  height   height (inches)
  male     male gender (1=male, 0=female)
  smoker   smoking status (1=ever smoked, 0=never smoked)
           “ever smoked” is defined as currently smoking or had at some time
           smoked as much as one cigarette per week, “never smoked” is
           never smoked as much as one cigarette per week (Tager et al,
           1979).
data management
  id       subject identifier (unique #, one observation per subject)
note: the study collected smoking data on both parents and children. The
children are the subjects in this dataset, and the smoking variable applies to them.
_________________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed., Belmont CA, Duxbury Press.
Tager IB, Weiss ST, Rosner B, Speizer FE. (1979). Effect of parental cigarette smoking
on pulmonary function in children. Am J Epidemiol 110:15-26.
Framingham Heart Study (2.20.Framingham.dta)
This dataset is distributed with the textbook: Dupont WD. Statistical Modeling for
Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data.
Cambridge UK, Cambridge University Press, 2002, p. 77.
downloadable from: www.mc.vanderbilt.edu/prevmed/wddtext.
The dataset comes from a long-term follow-up study of cardiovascular risk factors on
4699 patients living in the town of Framingham, Massachusetts. The patients were free
of coronary heart disease at their baseline exam (recruitment of patients started in 1948).
Codebook
Variable    Labels
Baseline exam:
sbp         systolic blood pressure (SBP) in mm Hg
dbp         diastolic blood pressure (DBP) in mm Hg
age         age in years
scl         serum cholesterol (SCL) in mg/100ml
bmi         body mass index (BMI) = weight/height^2 in kg/m^2
sex         gender
            1=male, 2=female
month       month of year in which baseline exam occurred
id          patient identification variable (numbered 1 to 4699)
Follow-up information on coronary heart disease:
followup    follow-up in days
chdfate     CHD outcome
            1=patient develops CHD at the end of follow-up
            0=otherwise
Lee Life Table and Lee Survival Dataset (LeeLifeTable.dta & LeeSurvival.dta)
Source
This dataset is listed in the survival analysis textbook by Lee (1980, Table 3.5, p.31).
The data originally came from Myers (1969).
Brief Description
The data concern male patients with localized cancer of the rectum diagnosed in
Connecticut from 1935 to 1954. The research question is whether survival improved for
the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort
(cohort = 0).
The file LeeLifeTable.dta is used for generating life tables (see Chapter 5-7) using,
ltable interval died, survival by(cohort)
Follow-up is to the end of the 10th year. To make the life table come out right, if the
subject was still alive and still being followed at the end of year 10, a score of 11 was
assigned.
The file LeeSurvival.dta is used for all other survival analyses, such as log-rank tests,
Kaplan-Meier curves, and Cox regression. Subjects followed to the end of the 10th year
are assigned a score of 10.
Codebook
id          study ID number
cohort      1 = 1945-1954 patient cohort
            0 = 1935-1944 patient cohort
interval    1 to 10, time interval (year) following cancer diagnosis
            11 = still alive and being followed at end of year 10,
                 with follow-up ending at end of year 10 (LeeLifeTable.dta
                 only)
died        1 = died
            0 = withdrawn alive or lost to follow-up during year interval
____________
Lee ET. (1980). Statistical Methods for Survival Data Analysis. Belmont CA, Lifetime
Learning Publications.
Myers MH. (1969). A Computing Procedure for a Significance Test of the Difference
Between Two Survival Curves, Methodological Note No. 18 in Methodological
Notes compiled by the End Results Sections, National Cancer Institute, National
Institute of Health, Bethesda, Maryland.
To create the two data files, copy the following into the Stata do-file editor, highlight it,
and hit the run button (last icon on do-file editor menu bar).
clear
input cohort1 interval1 died1 count1 ///
cohort2 interval2 died2 count2 cohort3 interval3 died3 count3 ///
cohort4 interval4 died4 count4
0 1 1 167 1 1 1 185 0 1 0 2 1 1 0 10
0 2 1 45 1 2 1 88 0 2 0 1 1 2 0 10
0 3 1 45 1 3 1 55 0 3 0 1 1 3 0 10
0 4 1 19 1 4 1 43 0 4 0 1 1 4 0 10
0 5 1 17 1 5 1 32 0 5 0 1 1 5 0 14
0 6 1 11 1 6 1 31 0 6 0 1 1 6 0 52
0 7 1 8 1 7 1 20 0 7 0 1 1 7 0 38
0 8 1 0 1 8 1 7 0 8 0 1 1 8 0 24
0 9 1 6 1 9 1 6 0 9 0 1 1 9 0 25
0 10 1 7 1 10 1 6 0 10 0 1 1 10 0 24
0 11 1 0 1 11 1 0 0 11 0 52 1 11 0 59
end
quietly gen tempid=_n
quietly reshape long cohort interval died count , ///
i(tempid) j(block)
drop if count==0
expand count
drop block tempid count
label define cohortlab 0 "0) 1935-1944" 1 "1) 1945-1954"
label values cohort cohortlab
label variable interval "follow-up time (one-year intervals)"
label define diedlab 0 "0) still alive" 1 "1) died"
label values died diedlab
sort cohort interval died
save LeeLifeTable, replace // for Life Tables
recode interval 11=10 , replace
save LeeSurvival, replace // for all other survival analysis
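Once the do-file above has been run, the two files could be used roughly as sketched below. This assumes LeeLifeTable.dta and LeeSurvival.dta were saved in the current working directory. The ltable call is the one given earlier in this appendix. The stset, sts test, and stcox lines are standard Stata survival commands chosen here for illustration and are not necessarily the exact commands used in the course chapters.

```stata
* Assumes the do-file above has already been run in the current working directory.

* Life table: interval 11 marks subjects still being followed at the end of year 10
use LeeLifeTable, clear
assert inrange(interval, 1, 11)
ltable interval died, survival by(cohort)

* All other survival analyses: the 11s have been recoded to 10
use LeeSurvival, clear
assert inrange(interval, 1, 10)
stset interval, failure(died)      // declare the data to be survival-time data
sts test cohort, logrank           // log-rank test comparing the two cohorts
stcox cohort                       // Cox regression with cohort as the only covariate
```

The two assert lines are only sanity checks on the interval coding; they stop the run if a file does not have the coding described above.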
Medpar1 Dataset (medpar1.dta)
Source
This dataset accompanies the textbook by Hardin and Hilbe (2007). It was downloaded
from the publisher’s website: http://www.stata-press.com/data/hh2/
Brief Description [modeled with generalized gamma regression or linear regression after
log transformation of LOS].
The data are from the U.S. Medicare database, for a single hospital and one diagnostic
code (DRG). The study aim is to determine if type of admission is associated with LOS,
after adjusting for the number of ICD-9 codes. (Hardin and Hilbe, 2007, p.110).
Codebook
n = 3,676 observations (1 observation per subject)
outcome
  los         length of stay (continuous, 1 to 87 days)
predictors
  admittype   type of admission (1=elective, 0=emergency or urgent)
  codes       number of ICD-9 codes recorded (continuous, 1 to 9)
___________
Hardin JW, Hilbe JM. (2007). Generalized Linear Models and Extensions, 2nd ed. Stata
Press.
MI Dataset (mi.dta)
Source
dataset to accompany Kleinbaum and Klein (2002, Chapter 8)
http://www.sph.emory.edu/~dkleinb/logreg2.htm#data
Brief Description [modeled with conditional logistic regression]
Data are from a 1:2 matched case-control study in which n=117 subjects are
formed into 39 matched strata. Each stratum contains three subjects, one of
whom is a case diagnosed with myocardial infarction and the other two are
matched controls. Matching was done on age, race, sex, and hospital status.
Codebook
n = 117 observations (39 matches)
outcome
  mi       myocardial infarction (1=presence, 0=absence)
predictors
  smk      smoker (1=current smoker, 0=not current smoker)
  sbp      systolic blood pressure (continuous)
  ecg      electrocardiogram abnormality (1=presence, 0=absence)
data management
  match    variable indicating subject’s matched stratum (range 1 to 39)
  person   subject identifier (unique #, one observation per subject)
survtime was never defined in the Kleinbaum and Klein (2002) textbook.
. list in 1/10
     +--------------------------------------------------+
     | match   person   mi   smk   sbp   ecg   survtime |
     |--------------------------------------------------|
  1. |     1        1    1     0   160     1          1 |
  2. |     1        2    0     0   140     0          2 |
  3. |     1        3    0     0   120     0          2 |
  4. |     2        4    1     0   160     1          1 |
  5. |     2        5    0     0   140     0          2 |
     |--------------------------------------------------|
  6. |     2        6    0     0   120     0          2 |
  7. |     3        7    1     0   160     0          1 |
  8. |     3        8    0     0   140     0          2 |
  9. |     3        9    0     0   120     0          2 |
 10. |     4       10    1     0   160     0          1 |
     +--------------------------------------------------+
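The 1:2 matching structure can be checked directly from the data. The sketch below re-enters only the first nine listed observations, which form three complete strata, and confirms that each stratum holds three subjects, exactly one of whom is a case. The names ncases and nsubj are hypothetical, introduced here for illustration.

```stata
* Sketch: verify the 1:2 matched structure on the first three listed strata.
* ncases and nsubj are hypothetical names, not variables in mi.dta.
clear
input match person mi smk sbp ecg survtime
1 1 1 0 160 1 1
1 2 0 0 140 0 2
1 3 0 0 120 0 2
2 4 1 0 160 1 1
2 5 0 0 140 0 2
2 6 0 0 120 0 2
3 7 1 0 160 0 1
3 8 0 0 140 0 2
3 9 0 0 120 0 2
end

bysort match: egen ncases = total(mi)   // number of cases in each stratum
by match: generate nsubj = _N           // number of subjects in each stratum

* stops with an error unless every stratum has one case and three subjects
assert ncases == 1 & nsubj == 3
```

If mi.dta matches the description above, the same three lines after the input block should also pass on the full file of 117 observations.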
_________
Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text, 2nd ed.
New York, Springer-Verlag.
Resting Metabolic Rate Dataset (rmr.dta)
Data published by Nawata et al (2004). The data were taken from the authors’ Figure 1,
a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta Codebook
group   urinary excretion of albumin group (U-Alb)
        a = U-Alb < 30 mg/d
        b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
        c = 300 mg/d < U-Alb
lbm     lean body mass (kg)
rmr     resting metabolic rate (kJ/h/m^2)
_______
Nawata K, Sohmiya M, Kawaguchi M, et al. (2004). Increased resting metabolic rate in
patients with type 2 diabetes mellitus accompanied by advanced diabetic
nephropathy. Metabolism 53(11) Nov: 1395-1398.
Smoking Cessation Study (smoke.csv)
This dataset was distributed with the Rosner (1995) biostatistics textbook. In this dataset,
234 smokers who expressed a willingness to quit smoking were followed for one year to
estimate the proportion of recidivism (quit for a time and then started again).
Codebook
Variable    Labels
id          subject identification number
age         age quit smoking
gender      gender
            1=male, 2=female
cigs        smoking habit at time quit smoking (cigarettes/day)
days        days abstinent (up to 365, which ends the follow-up period)
__________
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont CA, Duxbury Press.
Wieand Dataset (wiedat2b.dta)
Source
This dataset is available at the website for the textbook by Pepe (2003):
http://www.fhcrc.org/science/labs/pepe/book, where it is listed as
wiedat2b.dta. This version of the data was published in Wieand et al. (1989).
Description
Described in Pepe (2003, p.10), the data were first published by Wieand et al. (1989),
taken from a case-control study at the Mayo Clinic with 90 pancreatic cancer cases and
51 non-cancer controls with pancreatitis. The predictors are serum samples assayed for
CA-19-9, a carbohydrate antigen, and CA-125, a cancer antigen. The study question is
which of the two biomarkers best discriminates between cases and controls.
Codebook
N = 141
Variable    Labels
y1          CA19-9 carbohydrate antigen (continuous) [Del Villano et al., 1983]
y2          CA125 cancer antigen (continuous) [Bast et al., 1983]
d           pancreatic cancer (referent standard, or “gold” standard)
            1 = yes
            0 = no
__________
Bast RC, Klug TL, St. John E, et al. (1983). Radio-immunoassay using a monoclonal
antibody to monitor the course of epithelial ovarian cancer. N Engl J Med
309:883-7.
Del Villano BC, Brennan S, Brock P, et al. (1983). Radioimmunometric assay for a
monoclonal antibody-defined tumor marker, CA19-9. Clin Chem 29:549-52.
Pepe MS. (2003). The Statistical Evaluation of Medical Tests for Classification and
Prediction. New York, Oxford University Press.
Wieand S, Gail MH, James BR, and James KL. (1989). A family of nonparametric
statistics for comparing diagnostic markers with paired or unpaired data.
Biometrika 76(3):585-92.
Wright Low Birthweight Dataset (wright_lowbw.dta)
The dataset concerns 900 birthweight outcomes and risk factors attributable to the
mother.
Codebook
Variable    Labels
lowbw       low birthweight delivery
            1=yes 0=no
alc         mother’s alcohol drinking frequency
            1=Light, 2=Moderate, 3=Heavy
smo         mother smoked
            1=no 2=yes
soc         mother’s social status
            1=I and II (lower), 2=III (middle), 3=IV and V (upper)
________
Source: This dataset, in a more condensed form, came from Stata’s website by using the
command “use http://www.stata-press.com/dta/48/binreg”. It can be found in Wacholder
(1986), who got it from Wright et al. (1983).
References:
Wacholder S (1986). Binomial regression in GLIM: estimating risk ratios and risk
differences. Am J Epidemiol 123:174-184.
Wright JT, Waterson EJ, Barrison PJ, et al. (1983). Alcohol consumption, pregnancy and
low birthweight. Lancet 1:663-665.
Vaso Dataset (vaso.dta)
Source
Dataset to accompany textbook: Aitkin M, Anderson D, Francis B, Hinde J.
Statistical Modeling in GLIM. Oxford, Clarendon Press, 1989. Data were
originally published in Finney DJ, The estimation from original records of the
relationship between dose and quantal response. Biometrika 1947;34:320-334.
Brief Description
The data were obtained in a carefully controlled study of the effect of the RATE
and VOLume of air inspired by human subjects on the occurrence (coded 1) or
non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of
the fingers. Three subjects were involved in the study: the first contributed 9
observations at different values of RATE and VOL, the second 8, and the third 22
observations. The experiment was designed to ensure as far as possible that
successive observations obtained on each subject were independent: serial
correlation between successive observations on the same subject in such studies is
always a possibility. The aim is to fit a statistical model relating RESP to RATE
and VOL. (Aitkin, et al. p.167).
Codebook
n = 39
outcome
  resp    response (1=occurrence of vasoconstriction,
          0=non-occurrence of vasoconstriction)
predictors
  vol     volume of air inspired
  rate    rate of air inspired