Biostatistics 140.632
Final Exam
May 10, 2006

The exam is a take-home and is due May 17, 2006 by 4:00 p.m. Please e-mail your answers to sas@jhsph.edu, and make sure that your name is in the subject line of the e-mail. You may use lecture notes, lab exercises, homework exercises, “The Little SAS Book”, and SAS Help during the exam. Please ask for clarification if you do not understand any of the questions. You must work by yourself.

Please Note: By forwarding this exam to the course instructor, I acknowledge abiding by the Bloomberg School of Public Health's Code of Academic Ethics. I neither gave nor sought the assistance of any other person in the preparation of this exam.

Name (Please print): _____________________________________________

Are you graduating this May?   Yes ____   No ____

Download the files from the class website http://www.biostat.jhsph.edu/bstcourse/bio632/default.htm

There are 3 SAS data sets containing a subset of data from The Johns Hopkins Precursors Study. This study was designed and initiated by Caroline Bedell Thomas in 1947 to identify the precursors of cardiovascular disease. It is an ongoing longitudinal cohort study of 1337 former medical students. During medical school each participant underwent a detailed medical history and physical examination. After medical school, data were collected using annual mailed questionnaires. All three data sets have a unique identifying study number variable named STUDYN.

The BASELINE data set contains the following data collected during medical school:

    Variable   Description                       Codes
    age        graduation age
    bmonth     month of birth
    bday       day of birth
    byear      birth year
    bmi        body mass index, kg/m²            999=unknown
    dbp        diastolic blood pressure, mmHg    999=unknown
    sbp        systolic blood pressure, mmHg     999=unknown
    smoker     smoking status                    0=no 1=current 2=former 9=unknown
    studyn     study number

The CHOL data set contains cholesterol values measured in medical school:

    Variable   Description                       Codes
    chol1      cholesterol 1, mg/dl              999=unknown
    chol2      cholesterol 2                     999=unknown
    chol3      cholesterol 3                     999=unknown
    chol4      cholesterol 4                     999=unknown
    chol5      cholesterol 5                     999=unknown
    studyn     study number

The OUTCOME data set contains information on parental history of diabetes, the occurrence of diabetes and the date of diagnosis, and the age of the participant at the last follow-up questionnaire.

    Variable   Description                                   Codes
    fdiab      father’s diabetes status                      0=no 1=yes
    age        age at last follow-up questionnaire
    mdiab      mother’s diabetes status                      0=no 1=yes
    dm         diabetes status of participant                .=no 1=yes
    dmdate     date of diabetes diagnosis (SAS date)         .=no diabetes present
    studyn     study number

PART A. (15 points)

In order to analyze these data, you must merge them into one data set and create new variables. Create a PERMANENT file, TOTAL, located in the mylib library, by merging baseline, chol, and outcome. RESTRICT the TOTAL file to contain only those records that are present in the baseline file. Please include ALL of the original variables from the 3 files. Do not add any new variables, such as index variables if you use arrays. Use only ONE data step to create the file. However, you can run a PROC FREQ or PROC MEANS to answer the questions and check your coding.

Answer questions 1-2 based on your results and the SAS log. Please place an X next to the correct answer.

1. How many records are in the TOTAL file?

    __ a. 1329
    __ b. 1331
    __ c. 898
    __ d. 607
    __ e. none of the above

2. How many variables are in the TOTAL file?
Do not create any new variables in your data step.

    __ a. 18
    __ b. 21
    __ c. 24
    __ d. 55
    __ e. none of the above

Now write a SAS program to answer questions 3 and 4. You can create new variables and use procedures, if needed.

3. How many participants in the BASELINE file did not have a match on the CHOL file?

    __ a. 898
    __ b. 289
    __ c. 433
    __ d. 683
    __ e. none of the above

4. How many participants who are current or former smokers have diastolic blood pressure less than or equal to 80 mmHg?

    __ a. 493
    __ b. 416
    __ c. 392
    __ d. 417
    __ e. none of the above

Please answer the following questions:

5. You want to add a date variable, birthdate, which contains the birth date of each participant in the TOTAL file. Write the statements that you will need to add to the data step (written to answer questions 1-2 above) to create the birthdate variable.

    ________________________________________________________________
    ________________________________________________________________
    ________________________________________________________________

6. You want to add a variable, mchol, which contains the mean serum cholesterol level (of all known values for chol1-chol5) in the TOTAL file. Write the statements that you will need to add to the data step (written to answer questions 1-2 above) to create the mchol variable.

    ________________________________________________________________
    ________________________________________________________________
    ________________________________________________________________

7. You also want to add another new variable, dmage, to the TOTAL file. Dmage is defined as follows:

    Age at diagnosis of diabetes if diabetic
    OR
    Age at last follow-up if NOT diabetic

Write the statements that you will add to the data step to create the dmage variable.

    ________________________________________________________________
    ________________________________________________________________
    ________________________________________________________________

8.
Now, we want to create a categorical variable, bmigrp, in the DATA step creating TOTAL. bmigrp is defined as follows for all known bmi values:

    1 = bmi < 23 kg/m²
    2 = 23 <= bmi < 25 kg/m²
    3 = bmi >= 25 kg/m²

Write the statements that you will add to the data step to create the bmigrp variable.

    ________________________________________________________________
    ________________________________________________________________
    ________________________________________________________________

9. You want to create 2 temporary files (filenames: early and late) from the TOTAL file in ONE data step. EARLY will contain only those participants that were <30 years old when they graduated from medical school. The LATE file will contain all the participants that were >=30 years old at graduation. Write one data step to create these files from TOTAL.

    _______________________________________________________
    _______________________________________________________
    _______________________________________________________

How many records are in the EARLY file? ___________

10. Which of the following are valid SAS variable names? Place an X next to all valid names.

    __ a. _var_1
    __ b. clinic name
    __ c. 10_bp
    __ d. session#
    __ e. f1visit

PART B. (25 points)

Please place an X by the correct answer to the following questions:

11. In the PROC GCHART step below, which statement or statements (if any) contain an error?

    proc gchart data=clinic;
       hbar company / sumvar=pctinsured type=cfreq;
       vbar total / type=mean;
       pie company / sumvar=total;
    run;

    __ a. the hbar and vbar statements only
    __ b. the vbar and pie statements only
    __ c. the vbar statement only
    __ d. none of the above

12. Consider the program below. If the variable charge contains the value 6914, how will it appear in the PROC PRINT output?

    data costs;
       set clinic;
       charge = numdays * costperday;
       format charge 8.2;
    run;

    proc print data=costs;
       format charge dollar6.;
    run;

    __ a. $6914.00
    __ b. 6914
    __ c. $6,914
    __ d. $6,914.00

13.
Which program will produce a set of statistics grouped by the variable region?

    __ a. proc sort data=mortality out=mortality;
             by region;
          run;
          proc means data=mortality mean range sum maxdec=0;
             var total cvd resp suicide;
             by region;
          run;

    __ b. proc means data=mortality mean range sum maxdec=0;
             var total cvd resp suicide;
             class region;
          run;

    __ c. neither program
    __ d. both programs

14. If you had originally submitted the following statement, select the statement you would then use to change only the plotting symbol for the first plot line in subsequent plots.

    symbol1 interpol=spline color=blue width=2 value=star;

    __ a. symbol2 interpol=spline color=blue width=2 value=square;
    __ b. symbol2 value=square;
    __ c. symbol1;
    __ d. symbol1 value=square;

15. Which of the programs produced the output shown?

    Table of Weight by Height

    Weight       Height
    Frequency
    Percent
    Row Pct
    Col Pct      <63 in    63+ in     Total
    <100 lb           8         2        10
                  42.11     10.53     52.63
                  80.00     20.00
                  80.00     22.22
    100+ lb           2         7         9
                  10.53     36.84     47.37
                  22.22     77.78
                  20.00     77.78
    Total            10         9        19
                  52.63     47.37    100.00

    __ a. proc freq data=class;
             table weight height;
          run;

    __ b. proc freq data=class;
             table weight*height;
          run;

    __ c. proc freq data=class;
             table height * weight;
          run;

    __ e. proc freq data=class;
             table weight, height;
          run;

16. For each of the following answer true or false. In a PROC FORMAT, ranges in the VALUE statement can be specified as:

    a. single value, such as 24 or T                           _______
    b. range of values, such as 0-22                           _______
    c. range of characters, such as 'A'-'M'                    _______
    d. list of values separated by commas, such as 25,18,31    _______

17. When the code below is run, what will the output file “d:\temp\saslab\body.htm” contain?

    ods html body='d:\temp\saslab\body.htm';
    proc print data=work.alpha;
    run;
    proc print data=work.beta;
    run;
    ods html close;

    __ a. the PROC PRINT output for work.alpha
    __ b. the PROC PRINT output for work.beta
    __ c. the PROC PRINT output for both work.alpha and work.beta
    __ d. Nothing – no output will be written

18.
Which of the following programs performs a regression with weight as the outcome (dependent) variable and produces a plot of the residuals against the independent variable?

    __ a. symbol1 v=dot h=1 color=blue;
          proc reg data=sashelp.class;
             model weight = height;
             plot r. * height;
          run;

    __ b. symbol1 v=dot h=1 color=blue;
          proc reg data=sashelp.class;
             model height = weight;
             plot r. * weight;
          run;

    __ c. symbol1 v=dot h=1 color=blue;
          proc reg data=sashelp.class;
             model height = weight;
             plot p. * height r. * height;
          run;

    __ d. symbol1 v=dot h=1 color=blue;
          proc reg data=sashelp.class;
             model weight = height;
             plot p. * height;
          run;

19. You wish to use PROC GENMOD to do a logistic regression. Which is the correct choice for options in the model statement?

    proc genmod data=mort;
       model death = weight age / ………… ;
    run;

    __ a. link = log dist = binomial
    __ b. link = log dist = poisson
    __ c. link = logit
    __ d. link = logit dist = binomial

20. The following SAS code has been submitted:

    proc format;
       value split low - 34  = "low"
                   34 - 67   = "Medium"
                   67 - high = "High";
    run;

Which of the following correctly applies the format? Mark all that apply.

    __ a. proc print data=mydata;
             var id split;
          run;

    __ b. proc print data=mydata;
             var id mark;
             format mark split;
          run;

    __ c. proc print data=mydata;
             var id mark;
             format mark $split.;
          run;

    __ d. proc print data=mydata;
             var id mark;
             format mark split.;
          run;

21. What is the type and value of the variables in the trial dataset?

    data trial;
       ix = 0;
       do i = 1 to 5;
          x = 2*i;
       end;
    run;

    _________________________________________________

Fill in the blanks for the next 2 questions.

22. You wish to compare the Kaplan-Meier curves for three age groups of people. You have the following data:

    time = time in days since person joined study till they left or died
    cvd  = indicator variable (1 means death, 0 means no death)
    age  = age group (1 means <65, 2 means 65-75 and 3 means 75+)
    smk  = smoking indicator (1 means yes, 0 means no)

Write the correct statements to go in the PROC LIFETEST.

    proc lifetest data=mort plots=(s);

    run;

23. Using the same data as in Q22, write the correct statements to go into the PROC PHREG to model the time to death using age and smk. Treat age as a continuous variable and include an interaction for age by smoke.