Biostatistics 140.632
Final Exam
May 10, 2006
The exam is a take-home and will be due May 17, 2006 by 4:00 p.m. Please send
your answers by e-mail to sas@jhsph.edu. Please make sure that
your name is in the subject line of the e-mail. You may use lecture notes, lab
exercises, homework exercises, “The Little SAS Book”, and SAS Help during the
exam. Please ask for clarification if you do not understand any of the questions.
You must work by yourself.
Please Note:
By forwarding this exam to the course instructor, I acknowledge abiding by the
Bloomberg School of Public Health's Code of Academic Ethics. I neither gave nor
sought the assistance of any other person in the preparation of this exam.
Name (Please print) : _____________________________________________
Are you graduating this May?
Yes ____
No ____
Download the files from the class website
http://www.biostat.jhsph.edu/bstcourse/bio632/default.htm
There are 3 SAS data sets containing a subset of data from The Johns Hopkins Precursors
Study. This study was designed and initiated by Caroline Bedell Thomas in 1947 to identify the
precursors of cardiovascular disease. It is an ongoing longitudinal cohort study of 1337 former
medical students. During medical school each participant underwent a detailed medical history
and physical examination. After medical school, data collection was performed using annual
mailed questionnaires.
All three data sets have a unique identifying study number variable named STUDYN.
The BASELINE data set contains the following data collected during medical school:
Variable   Description                       Codes
age        graduation age
bmonth     month of birth
bday       day of birth
byear      birth year
bmi        body mass index, kg/m2            999=unknown
dbp        diastolic blood pressure, mmHg    999=unknown
sbp        systolic blood pressure, mmHg     999=unknown
smoker     smoking status                    0=no 1=current 2=former 9=unknown
studyn     study number
The CHOL data set contains cholesterol values measured in medical school:
Variable   Description             Codes
chol1      cholesterol 1, mg/dl    999=unknown
chol2      cholesterol 2           999=unknown
chol3      cholesterol 3           999=unknown
chol4      cholesterol 4           999=unknown
chol5      cholesterol 5           999=unknown
studyn     study number
The OUTCOME data set contains information on parental history of diabetes, the
occurrence of diabetes and the date of diagnosis, and the age of the participant at the last follow-up questionnaire.
Variable   Description                              Codes
fdiab      father’s diabetes status                 0=no 1=yes
age        age at last follow-up questionnaire
mdiab      mother’s diabetes status                 0=no 1=yes
dm         diabetes status of participant           .=no 1=yes
dmdate     date of diabetes diagnosis (SAS date)    .=no diabetes present
studyn     study number
PART A. (15 points)
In order to analyze these data, you must merge them into one data set and create new
variables. Create a PERMANENT file, TOTAL, located in the mylib library, by merging
baseline, chol, and outcome. RESTRICT the total file to contain only those records that are
present in the baseline file.
Please include ALL of the original variables from the 3 files. Do not add any new variables,
such as index variables if you use arrays. Use only ONE data step to create the file. However,
you can run a PROC FREQ or PROC MEANS to answer the questions and check your coding.
Answer questions 1-2 based on your results and the SAS log.
Please place an X next to the correct answer.
1.
How many records are in the TOTAL file?
__ a. 1329
__ b. 1331
__ c. 898
__ d. 607
__ e. none of the above
2.
How many variables are in the TOTAL file? Do not create any new variables in your
data step.
__ a. 18
__ b. 21
__ c. 24
__ d. 55
__ e. none of the above
Now write a SAS program to answer questions 3 and 4. You can create new variables and use
procedures, if needed.
3.
How many participants in the BASELINE file did not have a match on the CHOL file?
__ a. 898
__ b. 289
__ c. 433
__ d. 683
__ e. none of the above
4.
How many participants who are current or former smokers have diastolic blood pressure
less than or equal to 80 mmHg?
__ a. 493
__ b. 416
__ c. 392
__ d. 417
__ e. none of the above
Please answer the following questions:
5.
You want to add a date variable, birthdate, which contains the birth date of each
participant in the TOTAL file. Write the statements that you will need to add to the data
step (written to answer questions 1-2 above) to create the birthdate variable.
________________________________________________________________
________________________________________________________________
________________________________________________________________
6.
You want to add a variable, mchol, which contains the mean serum cholesterol level (of
all known values for chol1-chol5) in the TOTAL file. Write the statements that you will
need to add to the data step (written to answer questions 1-2 above) to create the mchol
variable.
________________________________________________________________
________________________________________________________________
_______________________________________________________________
7.
You also want to add another new variable, dmage, to the TOTAL file.
dmage is defined as follows:
Age at diagnosis of diabetes if diabetic
OR
Age at last follow-up if NOT diabetic
Write the statements that you will add to the data step to create the dmage variable.
_________________________________________________________________
_________________________________________________________________
_________________________________________________________________
8.
Now, we want to create a categorical variable, bmigrp, in the DATA step creating
TOTAL.
bmigrp is defined as follows for all known bmi values:
1 = bmi < 23 kg/m2
2 = 23 <= bmi < 25 kg/m2
3 = bmi >= 25 kg/m2
Write the statements that you will add to the data step to create the bmigrp variable.
_________________________________________________________________
_________________________________________________________________
_________________________________________________________________
9.
You want to create 2 temporary files (filenames: early and late) from the TOTAL file in
ONE data step. EARLY will contain only those participants that were <30 years old
when they graduated from medical school. The LATE file will contain all the
participants that are >= 30 years old at graduation. Write one data step to create these
files from TOTAL.
_______________________________________________________
_______________________________________________________
_______________________________________________________
How many records are in the EARLY file? ___________
10.
Which of the following are valid SAS variable names? Place an X next to all valid
names.
__ a. _var_1
__ b. clinic name
__ c. 10_bp
__ d. session#
__ e. f1visit
PART B. (25 points)
Please place an X by the correct answer to the following questions:
11.
In the PROC GCHART step below, which statement or statements (if any) contain an
error?
proc gchart data=clinic;
hbar company / sumvar=pctinsured type=cfreq;
vbar total / type=mean;
pie company / sumvar=total;
run;
__ a. the hbar and vbar statements only
__ b. the vbar and pie statements only
__ c. the vbar statement only
__ d. none of the above
12.
Consider the program below. If the variable charge contains the value 6914, how
will it appear in the PROC PRINT output?
data costs;
set clinic;
charge = numdays * costperday;
format charge 8.2;
run;
proc print data=costs;
format charge dollar6.;
run;
a. $6914.00
b. 6914
c. $6,914
d. $6,914.00
13.
Which program will produce a set of statistics grouped by the variable region?
a.
proc sort data=mortality out=mortality;
by region;
run;
proc means data=mortality mean range sum maxdec=0;
var total cvd resp suicide;
by region;
run;
b.
proc means data=mortality mean range sum maxdec=0;
var total cvd resp suicide;
class region;
run;
c. neither program
__ d. both programs
14.
If you had originally submitted the following statement, select the statement you
would then use to change only the plotting symbol for the first plot line in subsequent
plots.
symbol1 interpol=spline color=blue width=2 value=star;
a. symbol2 interpol=spline color=blue
width=2 value=square;
b. symbol2 value=square;
c. symbol1;
d. symbol1 value=square;
15.
Which of the programs produced the output shown?
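Table of Weight by Height

Weight      Height
Frequency |
Percent   |
Row Pct   |
Col Pct   | <63 in  | 63+ in  | Total
----------+---------+---------+--------
<100 lb   |       8 |       2 |     10
          |   42.11 |   10.53 |  52.63
          |   80.00 |   20.00 |
          |   80.00 |   22.22 |
----------+---------+---------+--------
100+ lb   |       2 |       7 |      9
          |   10.53 |   36.84 |  47.37
          |   22.22 |   77.78 |
          |   20.00 |   77.78 |
----------+---------+---------+--------
Total     |      10 |       9 |     19
          |   52.63 |   47.37 | 100.00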
a. proc freq data=class;
table weight height;
run;
b. proc freq data=class;
table weight*height;
run;
c. proc freq data=class;
table height * weight;
run;
e. proc freq data=class;
table weight, height;
run;
16.
For each of the following answer true or false. In a PROC FORMAT, ranges in the
VALUE statement can be specified as:
a. single value, such as 24 or T _______
b. range of values, such as 0-22 _______
c. range of characters, such as 'A'-'M' _______
d. list of values separated by commas, such as 25,18,31 _______
17.
When the code below is run, what will the output file “d:\temp\saslab\body.htm”
contain?
ods html body='d:\temp\saslab\body.htm';
proc print data=work.alpha;
run;
proc print data=work.beta;
run;
ods html close;
a. the PROC PRINT output for work.alpha
b. the PROC PRINT output for work.beta
c. the PROC PRINT output for both work.alpha and
work.beta
d. Nothing – no output will be written
18.
Which of the following programs performs a regression with weight as the outcome
(dependent) variable and produces a plot of the residuals against the independent
variable?
a. symbol1 v=dot h=1 color=blue;
proc reg data=sashelp.class;
model weight = height;
plot r. * height;
run;
b. symbol1 v=dot h=1 color=blue;
proc reg data=sashelp.class;
model height = weight;
plot r. * weight;
run;
c. symbol1 v=dot h=1 color=blue;
proc reg data=sashelp.class;
model height = weight;
plot p. * height r. * height;
run;
d. symbol1 v=dot h=1 color=blue;
proc reg data=sashelp.class;
model weight = height;
plot p. * height;
run;
19.
You wish to use PROC GENMOD to do a logistic regression. Which is the correct
choice for options in the model statement?
proc genmod data=mort;
model death = weight age / ………… ;
run;
__ a. link = log dist = binomial
__ b. link = log dist = poisson
__ c. link = logit
__ d. link = logit dist = binomial
20.
The following SAS code has been submitted:
proc format;
value split low - 34 = "low"
34 - 67 = "Medium"
67 - high = "High";
run;
Which of the following correctly apply the format? Mark all that apply.
a. __
proc print data=mydata;
var id split ;
run;
b. __
proc print data=mydata;
var id mark ;
format mark split;
run;
c. __
proc print data=mydata;
var id mark;
format mark $split.;
run;
d. __
proc print data=mydata;
var id mark;
format mark split.;
run;
21.
What are the type and value of each variable in the trial data set?
data trial;
ix = 0;
do i = 1 to 5;
x = 2*i;
end;
run;
_________________________________________________
Fill in the blanks for the next 2 questions.
22.
You wish to compare the Kaplan-Meier curves for three age groups of people. You
have data:
time = time in days from when the person joined the study until they left or died
cvd = indicator variable (1 means death, 0 means no death)
age = age group (1 means <65, 2 means 65-75 and 3 means 75+)
smk = smoking indicator (1 means yes, 0 means no)
Write the correct statements to go in the PROC LIFETEST step.
proc lifetest data=mort plots=(s);
run;
23.
Using the same data as in Q22, write the correct statements to go into the PROC
PHREG step to model the time to death using age and smk. Treat age as a continuous
variable and include an age-by-smk interaction.