Exploratory Analysis of Survey Data Ian Duling, AstraZeneca LP, Wilmington DE Valuable information can be derived from sample survey data that is collected on a sample of observations, which are selected, from the population of interest using a probability-based sample design. The complex multistage probability sample design used in a survey like the National Health and Nutrition Examination Survey (NHANES)1 improves the precision and controls costs of survey data collection, but makes analysis more complex in order to obtain unbiased estimates. Understanding the design of the questionnaires and the flow of data collection based on conditional responses to initial interview questions can be challenging. How this conditional logic influences the structure of the resulting datasets has a direct impact of the ease of identification, extraction and unbiased interpretation of responses to questions. Statistical inference to the entire population, requires the use of sample weights due to the differential probabilities of selection., i.e. the oversampling of certain subsets of the population. This discussion will relate specific examples of the author’s use of SAS® software to identify correlated variables within survey data and the generalization of population characteristics. The task undertaken in this example is to identify the overlap between NHANES participants responding YES to “Has a doctor or other health professional ever told you that you had arthritis?” and those responding YES to “During the past 12 months, have you had pain, aching, stiffness or swelling in or around a joint? From the Medical Conditions Section of the Sample Person Questionnaire (MCQ): MCQ.160 Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . . MCQ.190 Which type of arthritis was it? a. had arthritis? YES . . . . . . . . . . 1 NO . . . . . . . . . . . 2 (b) REFUSED . . . . . 7 (b) DON'T KNOW . . 9 (b) RHEUMATOID ARTHRITIS . . . . 1 OSTEOARTHRITIS . . . . . . . . . . 2 OTHER . . . . . . . . . . . . . . . . . . . .3 REFUSED . . . . . . . . . . . . . . . . 7 DON'T KNOW . . . . . . . . . . . . . . 9 1 National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000. 1 From the Miscellaneous Pain Section of Sample Person Questionnaire (MPQ): MPQ.010 During the past 12 months, {have you/has SP} had pain, aching, stiffness or swelling in or around a joint? [Do not include neck pain.] YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 (MPQ.060) REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7 (MPQ.060) DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9 (MPQ.060) MPQ.020 Were these symptoms present on most days for at least 1 month? YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7 DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9 MPQ.030 Did these symptoms begin only because of an injury? YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 (MPQ.050) REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7 (MPQ.050) DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9 (MPQ.050) MPQ.050 Please look at this card and give me the joints that were affected. CODE ALL THAT APPLY. HAND CARD MPQ1 SHOULDER - RIGHT . . . . . . . . .10 SHOULDER - LEFT . . . . . . . . . 11 ELBOW - RIGHT . . . . . . . . . . 12 ELBOW - LEFT . . . . . . . . . . 13 HIP - RIGHT . . . . . . . . . . . 14 HIP - LEFT . . . . . . . . . . . 15 WRIST - RIGHT . . . . . . . . . . 16 WRIST - LEFT . . . . . . . . . . .17 KNEE - RIGHT . . . . . . . . . . .18 KNEE - LEFT . . . . . . . . . . . 19 ANKLE - RIGHT . . . . . . . . . . 20 ANKLE - LEFT . . . . . . . . . . 21 TOES - RIGHT . . . . . . . . . . 22 TOES - LEFT . . . . . . . . . . . 23 FINGERS/THUMB - RIGHT . . . . . . 24 FINGERS/THUMB - LEFT . . . . . . .25 OTHER (SPECIFY) . . . . . . . . . 26 REFUSED . . . . . . . . . . . . . 77 DON'T KNOW . . . . . . . . . . . .99 2 The NHANES codebook provides SAS metadata: NHANES 99+ Codebook for Data Collection (1999-2000) Create SAS variables for each instance of joint pain by recoding the original values: Shoulder50 = Refused50 = Dontknow50 = Elbow50 = Hip50 = Wrist50 = Knee50 = Ankle50 = Toes50 = Fingers50 = Other50 = (MPD050a (MPD050a (MPD050a (MPD050c (MPD050e (MPD050g (MPD050i (MPD050k (MPD050m (MPD050o (MPD050q = = = = = = = = = = = 10 or 77); 99); 12 or 14 or 16 or 18 or 20 or 22 or 24 or 26); MPD050b = 11); MPD050d MPD050f MPD050h MPD050j MPD050l MPD050n MPD050p = = = = = = = 13); 15); 17); 19); 21); 23); 25); MPQ.060 The following questions are about pain {you/SP} may have experienced in the past 3 months. Please refer to pain that lasted a whole day or more. Do not report aches and pains that were fleeting or minor. MPQ.120 Regarding {your/SP's} pain problem, which regions are affected? CODE ALL THAT APPLY HAND CARD MPQ2 HEAD . . . . . . . . . . . . . . . .10 FACE/DENTAL . . . . . . . . . . . . 11 SHOULDER GIRDLE - RIGHT . . . . . . 12 SHOULDER GIRDLE - LEFT . . . . . . .13 UPPER ARM - RIGHT . . . . . . . . . 14 UPPER ARM - LEFT . . . . . . . . . .15 MID-ARM - RIGHT . . . . . . . . . . 16 MID-ARM - LEFT . . . . . . . . . . .17 LOWER ARM - RIGHT . . . . . . . . . 18 LOWER ARM - LEFT . . . . . . . . . .19 UPPER BACK - RIGHT . . . . . . . . .20 UPPER BACK - LEFT . . . . . . . . . 21 3 Create SAS variables for each instance of region pain by recoding the original values: Head120 = (MPQ120a = 10); Refused120 = (MPQ120a = 77); Dontknow120 = (MPQ120a = 99); Facedental120 = (MPQ120b = 11); Shoulder120 = (MPQ120c = 12 or MPQ120d = 13); Arm120 = (MPQ120e = 14 or MPQ120f = 15 or MPQ120g = 16 or MPQ120h = 17 or MPQ120i = 18 or MPQ120j = 19); Upperback120 = (MPQ120k = 20 or MPQ120l = 21); Lowerback120 = (MPQ120m = 22 or MPQ120n = 23); Buttocks120 = (MPQ120o = 24 or MPQ120p = 25); Leg120 = (MPQ120q = 26 or MPQ120r = 27 or MPQ120s = 28 or MPQ120t = 29 or MPQ120u = 30 or MPQ120v = 31); Neck120 = (MPQ120w = 32); Sternum120 = (MPQ120x = 33); Chest120 = (MPQ120y = 34 or MPQ120z = 35); Abdomen120 = (MPQ120aa = 36); Spine120 = (MPQ120ab = 37); Hand120 = (MPQ120ac = 38 or MPQ120ad = 39); Foot120 = (MPQ120ae = 40 or MPQ120af = 41); Generate a frequency table for every possible combination of joint pain and chronic region pain values: %macro wtfreqs(var,table); proc freq data=anlys_1 noprint; table &var*Head120 / out=&table.1; table &var*Refused120 / out=&table.2; table &var*Dontknow120 / out=&table.3; table &var*Facedental120 / out=&table.4; table &var*Shoulder120 / out=&table.5; table &var*Arm120 / out=&table.6; table &var*Upperback120 / out=&table.7; table &var*Lowerback120 / out=&table.8; table &var*Buttocks120 / out=&table.9; table &var*Leg120 / out=&table.10; table &var*Neck120 / out=&table.11; table &var*Sternum120 / out=&table.12; table &var*Chest120 / out=&table.13; table &var*Abdomen120 / out=&table.14; table &var*Spine120 / out=&table.15; table &var*Hand120 / out=&table.16; table &var*Foot120 / out=&table.17; weight wtint2yr; run; %mend wtfreqs; 4 From the NHANES ANALYTIC AND REPORTING GUIDELINES 2 regarding weighting of sample data: “NHANES is based on a complex multistage probability sample design. Several aspects of the NHANES design must be taken into account in data analysis, including the sampling weights and the complex survey design. Appropriate sampling weights are needed to estimate prevalence, means, medians, and other statistics. Sampling weights are used to produce correct population estimates because each sample person does not have an equal probability of selection. The sampling weights incorporate the differential probabilities of selection and include adjustments for noncoverage and nonresponse. Although initial exploratory analyses may be performed on unweighted data with standard statistical packages assuming simple random sampling, final analyses should be done on weighted data using appropriate sampling weights.” The SAS FREQ procedure weight statement treats observations as if they appear multiple times in the input data set. The syntax is as follows, WEIGHT variable, where variable specifies a numeric variable whose value represents the frequency of the observation3. The FREQ procedure is unique in its application of sampling weights using the weight statement. Other SAS procedures such as the REPORT procedure apply weighting using the FREQ statement. %wtfreqs(Shoulder50,a) %wtfreqs(Refused50,b) %wtfreqs(Dontknow50,c) %wtfreqs(Elbow50,d) %wtfreqs(Hip50,e) %wtfreqs(Wrist50,f) %wtfreqs(Knee50,g) %wtfreqs(Ankle50,h) %wtfreqs(Toes50,i) %wtfreqs(Fingers50,j) %wtfreqs(Other50,k) run; Create a comprehensive dataset representing only those combinations of joint pain and chronic region pain values that include non-missing observations from both pain questions %macro dsfreqs(ds,table,var); data &ds; 2 ANALYTIC AND REPORTING GUIDELINES, The Third National Health and Nutrition Examination Survey, NHANES III (1988-94), October, 1996, 2-3 National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000. 3 SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000 Copyright ©2000, SAS Institute Inc.. 5 set &table.1 (where=(&var and Head120)) &table.2 (where=(&var and Refused120)) &table.3 (where=(&var and Dontknow120)) &table.4 (where=(&var and Facedental120)) &table.5 (where=(&var and Shoulder120)) &table.6 (where=(&var and Arm120)) &table.7 (where=(&var and Upperback120)) &table.8 (where=(&var and Lowerback120)) &table.9 (where=(&var and Buttocks120)) &table.10 (where=(&var and Leg120)) &table.11 (where=(&var and Neck120)) &table.12 (where=(&var and Sternum120)) &table.13 (where=(&var and Chest120)) &table.14 (where=(&var and Abdomen120)) &table.15 (where=(&var and Spine120)) &table.16 (where=(&var and Hand120)) &table.17 (where=(&var and Foot120)); run; data all (drop=percent); set a b c d e f g h i j k; Create two variables (dim1 and dim2), that will represent row and column frequencies of the recoded survey response variables in a cross tabulation attrib dim1 length=$13 label='Affected Regions' dim2 length=$10 label='Affected Joints'; select; when (Head120) when (Refused120) when (Dontknow120) when (Facedental120) when (Shoulder120) when (Arm120) when (Upperback120) when (Lowerback120) when (Buttocks120) when (Leg120) when (Neck120) when (Sternum120) when (Chest120) when (Abdomen120) when (Spine120) when (Hand120) when (Foot120) otherwise end; select; when when when when when when (Shoulder50) (Refused50) (Dontknow50) (Elbow50) (Hip50) (Wrist50) dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim1 dim2 dim2 dim2 dim2 dim2 dim2 = = = = = = = = = = = = = = = = = = = = = = = = 'Head'; 'Refused'; 'Dont know'; 'Face/dental'; 'Shoulder'; 'Arm'; 'Upper back'; 'Lower back'; 'Buttocks'; 'Leg'; 'Neck'; 'Sternum'; 'Chest'; 'Abdomen'; 'Spine'; 'Hand'; 'Foot'; ''; 'Shoulder'; 'Refused'; 'Dont know'; 'Elbow'; 'Hip'; 'Wrist'; 6 when (Knee50) when (Ankle50) when (Toes50) when (Fingers50) when (Other50) otherwise dim2 dim2 dim2 dim2 dim2 dim2 = = = = = = 'Knee'; 'Ankle'; 'Toes'; 'Fingers'; 'Other'; ''; end; run; %mend dsfreqs; %dsfreqs(a,a,Shoulder50); %dsfreqs(b,b,Refused50); %dsfreqs(c,c,Dontknow50); %dsfreqs(d,d,Elbow50); %dsfreqs(e,e,Hip50); %dsfreqs(f,f,Wrist50); %dsfreqs(g,g,Knee50); %dsfreqs(h,h,Ankle50); %dsfreqs(i,i,Toes50); %dsfreqs(j,j,Fingers50); %dsfreqs(k,k,Other50); Generate the final report proc report data=all nowindows missing headskip; column dim1 dim2,count; define dim1 / group width=13 ; define dim2 / across; define count / analysis sum format=COMMA20.0 'n'; run; In the previous example saving weighted output generated by the FREQ procedure to an output dataset and displaying that output using the REPORT procedure provides a means for display of wide integer values. The REPORT procedure can accommodate these values and allows for the control of cell formatting. The default output of cell values in a FREQ table could result in the display of cell values using scientific notation, e.g. numeric values equal to or greater than eight positions wide. A solution to this problem on the Microsoft Windows Platform is available from SAS support documentation: FAQ # 1786 Is there away to format the statistics in the crosstable produced by PROC FREQ? Answer: Beginning with Release 8.1 of SAS, there is an undocumented FORMAT= TABLES statement option that allows you to specify a format for the statistics. This option was added because there is no crosstabfreq ODS template that could be modified to change the format of the statistics4. 4 http://support.sas.com/faq/017/FAQ01786.html 7 NHANES 99-00 Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected) and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160 Unweighted Affected Joints Ankle Elbow Fingers Hip Knee Other Shoulder Toes Wrist Affected Regions n n n n n n n n n Abdomen 16 11 17 12 22 1 16 4 8 Arm 45 48 37 36 59 7 55 20 38 Buttocks 40 24 32 37 56 9 34 17 26 Chest 8 5 7 6 9 2 9 3 4 Dont know . . 1 . . . . 1 . Face/dental 13 11 12 8 14 2 14 5 11 Foot 53 25 33 26 48 6 39 28 26 Hand 37 33 48 28 49 1 42 26 35 Head 33 23 31 29 45 5 30 13 24 Leg 79 44 51 65 138 12 75 26 47 Lower back 73 53 62 66 107 16 75 25 50 Neck 44 40 44 41 69 10 58 24 38 Refused 1 1 1 1 1 . 1 1 1 Shoulder 54 44 52 50 76 6 103 27 41 Spine 25 20 22 25 34 7 34 12 19 5 5 6 4 7 1 6 4 4 24 22 16 21 28 5 27 7 19 Sternum Upper back 8 NHANES 99-00 Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected) and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160 Full Sample 2 Year Interview Weight Affected Joints Affected Regions Abdomen Ankle Elbow Fingers Hip Knee n n n n n 429,958 400,104 274,076 316,051 Other Shoulder n 640,187 n 2,710 329,883 Toes Wrist n n 88,574 136,008 Arm 1,525,207 2,056,496 1,526,275 1,404,824 2,561,355 194,005 2,194,921 640,246 1,405,588 Buttocks 1,651,581 1,178,071 1,403,960 1,249,935 2,272,391 327,929 1,690,884 753,215 1,240,717 280,376 152,394 251,292 Chest 398,286 253,426 323,141 303,541 352,429 11,507 Dont know . . 3,498 . . . Face/dental 605,156 607,129 540,795 296,764 606,602 . 3,498 . 11,507 654,113 147,754 504,916 Foot 2,223,468 1,082,243 1,100,993 746,624 1,891,356 271,638 1,742,052 812,139 1,140,223 Hand 1,364,416 1,365,395 1,813,441 981,372 1,862,868 62,508 1,512,258 995,245 1,405,502 Head 1,419,714 1,126,895 1,326,280 991,503 1,761,166 143,371 1,191,221 434,422 918,543 Leg 2,570,419 1,843,554 1,868,701 2,265,130 5,050,079 335,623 2,784,371 956,380 1,633,988 Lower back 2,632,807 2,301,164 2,608,339 2,521,009 4,193,287 455,485 3,341,103 912,681 1,880,029 Neck 1,886,262 1,935,908 1,865,593 1,683,254 2,976,353 228,065 2,526,951 963,832 1,611,639 Refused 26,311 26,311 26,311 26,311 26,311 . 26,311 26,311 26,311 Shoulder 2,157,715 2,097,878 2,052,722 1,917,835 3,386,778 299,144 4,161,594 888,175 1,549,446 Spine 1,212,026 1,060,117 1,028,620 1,010,831 1,458,014 251,270 1,656,260 563,453 1,111,666 Sternum 230,539 Upper back 315,435 303,362 188,447 327,222 72,823 318,273 147,161 190,315 1,146,370 1,082,466 737,917 830,672 1,263,165 129,676 1,464,090 287,056 943,161 Conclusion Weighted frequencies are more appropriate when analyzing the potential effect of nonresponse on survey estimates. Weighted response frequencies provide a more reliable tool for identifying data trends and issues. In this example the weighted survey response rates provide a more statistically significant basis for conclusions taken from sample estimates. REFERENCES National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000. SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000 9 Copyright ©2000, SAS Institute Inc.. CONTACT INFORMATION Ian Duling Astra Zeneca LP 1800 Concord Pike PO Box 15437 Wilmingotn, DE 19850-5437 10