Exploratory Analysis of Survey Data

Ian Duling, AstraZeneca LP, Wilmington DE
Valuable information can be derived from sample survey data collected on observations
selected from the population of interest using a probability-based sample design. The
complex multistage probability sample design used in a survey like the National Health
and Nutrition Examination Survey (NHANES)1 improves precision and controls the cost of
data collection, but it complicates analysis because the design must be accounted for
to obtain unbiased estimates.
Understanding the design of the questionnaires and the flow of data collection based on
conditional responses to initial interview questions can be challenging. How this
conditional logic influences the structure of the resulting datasets has a direct impact on
the ease of identification, extraction and unbiased interpretation of responses to
questions. Statistical inference to the entire population requires the use of sample
weights because of the differential probabilities of selection, i.e. the oversampling of
certain subsets of the population.
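As a rough illustration of why weights matter, a textbook design-weighted estimator can be written as below. This is a generic sketch, not a formula taken from the NHANES documentation: s denotes the sample, \pi_i is a hypothetical selection probability for sampled person i, w_i is the resulting base weight, and y_i is 1 for a YES response and 0 otherwise. Published survey weights usually also carry adjustments for nonresponse and noncoverage, so this is a simplification.

```latex
w_i = \frac{1}{\pi_i}, \qquad
\hat{N}_{\mathrm{YES}} = \sum_{i \in s} w_i \, y_i, \qquad
\hat{p}_{\mathrm{YES}} = \frac{\sum_{i \in s} w_i \, y_i}{\sum_{i \in s} w_i}
```

Under this sketch, a person drawn from an oversampled subgroup has a larger \pi_i and therefore a smaller weight, which keeps that subgroup from dominating the population estimate.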
This discussion relates specific examples of the author’s use of SAS® software to
identify correlated variables within survey data and to generalize population
characteristics from the sample.
The task undertaken in this example is to identify the overlap between NHANES
participants responding YES to “Has a doctor or other health professional ever told you
that you had arthritis?” and those responding YES to “During the past 12 months, have
you had pain, aching, stiffness or swelling in or around a joint?”

From the Medical Conditions Section of the Sample Person Questionnaire
(MCQ):
MCQ.160 Has a doctor or other health professional ever told {you/SP} that {you/s/he} . . .
a. had arthritis?
YES . . . . . . . . . . 1
NO . . . . . . . . . . . 2 (b)
REFUSED . . . . . 7 (b)
DON'T KNOW . . 9 (b)
MCQ.190 Which type of arthritis was it?
RHEUMATOID ARTHRITIS . . . . 1
OSTEOARTHRITIS . . . . . . . . . . 2
OTHER . . . . . . . . . . . . . . . . . . . .3
REFUSED . . . . . . . . . . . . . . . . 7
DON'T KNOW . . . . . . . . . . . . . . 9
1 National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health
Service. 2000.

From the Miscellaneous Pain Section of Sample Person Questionnaire
(MPQ):
MPQ.010 During the past 12 months, {have you/has SP} had pain, aching, stiffness or swelling in or around
a joint?
[Do not include neck pain.]
YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 (MPQ.060)
REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7 (MPQ.060)
DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9 (MPQ.060)
MPQ.020 Were these symptoms present on most days for at least 1 month?
YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7
DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9
MPQ.030 Did these symptoms begin only because of an injury?
YES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
NO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 (MPQ.050)
REFUSED . . . . . . . . . . . . . . . . . . . . . . . . . 7 (MPQ.050)
DON'T KNOW . . . . . . . . . . . . . . . . . . . . . . 9 (MPQ.050)
MPQ.050 Please look at this card and give me the joints that were affected.
CODE ALL THAT APPLY. HAND CARD MPQ1
SHOULDER - RIGHT . . . . . . . . .10
SHOULDER - LEFT . . . . . . . . . 11
ELBOW - RIGHT . . . . . . . . . . 12
ELBOW - LEFT . . . . . . . . . . 13
HIP - RIGHT . . . . . . . . . . . 14
HIP - LEFT . . . . . . . . . . . 15
WRIST - RIGHT . . . . . . . . . . 16
WRIST - LEFT . . . . . . . . . . .17
KNEE - RIGHT . . . . . . . . . . .18
KNEE - LEFT . . . . . . . . . . . 19
ANKLE - RIGHT . . . . . . . . . . 20
ANKLE - LEFT . . . . . . . . . . 21
TOES - RIGHT . . . . . . . . . . 22
TOES - LEFT . . . . . . . . . . . 23
FINGERS/THUMB - RIGHT . . . . . . 24
FINGERS/THUMB - LEFT . . . . . . .25
OTHER (SPECIFY) . . . . . . . . . 26
REFUSED . . . . . . . . . . . . . 77
DON'T KNOW . . . . . . . . . . . .99

The NHANES codebook provides SAS metadata:
NHANES 99+
Codebook for Data Collection (1999-2000)

Create SAS variables for each instance of joint pain by recoding the original
values:
Shoulder50 = (MPD050a = 10 or MPD050b = 11);
Refused50  = (MPD050a = 77);
Dontknow50 = (MPD050a = 99);
Elbow50    = (MPD050c = 12 or MPD050d = 13);
Hip50      = (MPD050e = 14 or MPD050f = 15);
Wrist50    = (MPD050g = 16 or MPD050h = 17);
Knee50     = (MPD050i = 18 or MPD050j = 19);
Ankle50    = (MPD050k = 20 or MPD050l = 21);
Toes50     = (MPD050m = 22 or MPD050n = 23);
Fingers50  = (MPD050o = 24 or MPD050p = 25);
Other50    = (MPD050q = 26);
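The joint pain recodes rely on the fact that a SAS comparison evaluates to 1 when true and 0 when false. The following minimal sketch uses a made-up four-row dataset (the name `demo` and its values are hypothetical, not NHANES data) to show that behavior, including the point that a missing response compares as false and is therefore recoded to 0 rather than to missing.

```sas
/* Hypothetical toy data: two of the MPD050 response slots */
data demo;
   input MPD050a MPD050b;
   /* A Boolean expression in an assignment yields 1 (true) or 0 (false) */
   Shoulder50 = (MPD050a = 10 or MPD050b = 11);
   datalines;
10 .
. 11
77 .
. .
;
run;

proc print data=demo noobs;
run;
```

Shoulder50 is 1 for the first two rows and 0 for the last two. Because respondents who were never asked the question also receive 0, the recoded flags identify positive responses only and should not be read as explicit NO answers.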
MPQ.060 The following questions are about pain {you/SP} may have experienced in the past 3 months.
Please refer to pain that lasted a whole day or more. Do not report aches and pains that were fleeting or
minor.
MPQ.120 Regarding {your/SP's} pain problem, which regions are affected?
CODE ALL THAT APPLY HAND CARD MPQ2
HEAD . . . . . . . . . . . . . . . .10
FACE/DENTAL . . . . . . . . . . . . 11
SHOULDER GIRDLE - RIGHT . . . . . . 12
SHOULDER GIRDLE - LEFT . . . . . . .13
UPPER ARM - RIGHT . . . . . . . . . 14
UPPER ARM - LEFT . . . . . . . . . .15
MID-ARM - RIGHT . . . . . . . . . . 16
MID-ARM - LEFT . . . . . . . . . . .17
LOWER ARM - RIGHT . . . . . . . . . 18
LOWER ARM - LEFT . . . . . . . . . .19
UPPER BACK - RIGHT . . . . . . . . .20
UPPER BACK - LEFT . . . . . . . . . 21

Create SAS variables for each instance of region pain by recoding the
original values:
Head120       = (MPQ120a = 10);
Refused120    = (MPQ120a = 77);
Dontknow120   = (MPQ120a = 99);
Facedental120 = (MPQ120b = 11);
Shoulder120   = (MPQ120c = 12 or MPQ120d = 13);
Arm120        = (MPQ120e = 14 or MPQ120f = 15 or MPQ120g = 16 or MPQ120h = 17
                 or MPQ120i = 18 or MPQ120j = 19);
Upperback120  = (MPQ120k = 20 or MPQ120l = 21);
Lowerback120  = (MPQ120m = 22 or MPQ120n = 23);
Buttocks120   = (MPQ120o = 24 or MPQ120p = 25);
Leg120        = (MPQ120q = 26 or MPQ120r = 27 or MPQ120s = 28 or MPQ120t = 29
                 or MPQ120u = 30 or MPQ120v = 31);
Neck120       = (MPQ120w = 32);
Sternum120    = (MPQ120x = 33);
Chest120      = (MPQ120y = 34 or MPQ120z = 35);
Abdomen120    = (MPQ120aa = 36);
Spine120      = (MPQ120ab = 37);
Hand120       = (MPQ120ac = 38 or MPQ120ad = 39);
Foot120       = (MPQ120ae = 40 or MPQ120af = 41);

Generate a frequency table for every possible combination of joint pain and
chronic region pain values:
%macro wtfreqs(var,table);
proc freq data=anlys_1 noprint;
   table &var*Head120       / out=&table.1;
   table &var*Refused120    / out=&table.2;
   table &var*Dontknow120   / out=&table.3;
   table &var*Facedental120 / out=&table.4;
   table &var*Shoulder120   / out=&table.5;
   table &var*Arm120        / out=&table.6;
   table &var*Upperback120  / out=&table.7;
   table &var*Lowerback120  / out=&table.8;
   table &var*Buttocks120   / out=&table.9;
   table &var*Leg120        / out=&table.10;
   table &var*Neck120       / out=&table.11;
   table &var*Sternum120    / out=&table.12;
   table &var*Chest120      / out=&table.13;
   table &var*Abdomen120    / out=&table.14;
   table &var*Spine120      / out=&table.15;
   table &var*Hand120       / out=&table.16;
   table &var*Foot120       / out=&table.17;
   weight wtint2yr;
run;
%mend wtfreqs;
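The seventeen TABLE statements in wtfreqs differ only in the region variable and the output suffix, so the same requests could be generated with a macro loop. The sketch below is one possible variant, not the author's code: the macro name `wtfreqs_loop` and its `regions` parameter are hypothetical, and the COUNTW function it uses assumes a more recent SAS release than the Version 8 software cited in this paper.

```sas
%macro wtfreqs_loop(var, table, regions);
   %local i region;
   proc freq data=anlys_1 noprint;
      /* One TABLES request per region; output datasets are numbered 1, 2, ... */
      %do i = 1 %to %sysfunc(countw(&regions, %str( )));
         %let region = %scan(&regions, &i, %str( ));
         tables &var*&region / out=&table.&i;
      %end;
      weight wtint2yr;
   run;
%mend wtfreqs_loop;

/* Example call: produces a1 through a17, matching the naming used by wtfreqs */
%wtfreqs_loop(Shoulder50, a,
   Head120 Refused120 Dontknow120 Facedental120 Shoulder120 Arm120
   Upperback120 Lowerback120 Buttocks120 Leg120 Neck120 Sternum120
   Chest120 Abdomen120 Spine120 Hand120 Foot120)
```

Separate TABLES statements are kept on purpose: when a single TABLES statement contains several requests, the OUT= dataset holds only the last one.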
From the NHANES ANALYTIC AND REPORTING GUIDELINES 2
regarding weighting of sample data:
“NHANES is based on a complex multistage probability sample design. Several
aspects of the NHANES design must be taken into account in data analysis,
including the sampling weights and the complex survey design. Appropriate
sampling weights are needed to estimate prevalence, means, medians, and other
statistics. Sampling weights are used to produce correct population estimates
because each sample person does not have an equal probability of selection. The
sampling weights incorporate the differential probabilities of selection and
include adjustments for noncoverage and nonresponse. Although initial
exploratory analyses may be performed on unweighted data with standard
statistical packages assuming simple random sampling, final analyses should be
done on weighted data using appropriate sampling weights.”
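One way to take the survey design into account along with the weights is a survey-specific procedure. The sketch below assumes a SAS release that includes PROC SURVEYFREQ, which became available after the Version 8 software cited in this paper, and it assumes that the analysis dataset carries design variables named `sdmvstra` (stratum) and `sdmvpsu` (primary sampling unit). Neither assumption comes from the original analysis; the dataset, weight and recoded variable names are those used in this paper.

```sas
proc surveyfreq data=anlys_1;
   strata  sdmvstra;        /* assumed name of the design stratum variable */
   cluster sdmvpsu;         /* assumed name of the primary sampling unit variable */
   weight  wtint2yr;        /* full sample 2 year interview weight */
   tables  Knee50*Leg120;   /* weighted crosstab with design-based standard errors */
run;
```

The weighted frequencies agree with those from a weighted PROC FREQ run; what changes is that the standard errors reflect the stratified, clustered design rather than simple random sampling.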
The WEIGHT statement of the SAS FREQ procedure treats each observation as if it appeared
multiple times in the input data set. The syntax is WEIGHT variable, where variable
specifies a numeric variable whose value represents the frequency of the observation3.
In the FREQ procedure the WEIGHT statement therefore plays the role that the FREQ
statement plays in other SAS procedures such as the REPORT procedure, with the
difference that the weight value is not required to be an integer.
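The effect of the WEIGHT statement can be seen on a small made-up example. The dataset `toy`, its variables `resp` and `wt`, and the weight values below are hypothetical and chosen only so that the sums are easy to check by hand.

```sas
/* Hypothetical toy data: three respondents with made-up sample weights */
data toy;
   input resp wt;
   datalines;
1 1500.5
1 2500
2 4000
;
run;

proc freq data=toy noprint;
   tables resp / out=toy_counts;   /* COUNT holds the sum of wt per response */
   weight wt;
run;

proc print data=toy_counts noobs;
run;
```

Here COUNT is 4000.5 for resp=1 and 4000 for resp=2, the sums of the weights, rather than the raw counts of 2 and 1.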
%wtfreqs(Shoulder50,a)
%wtfreqs(Refused50,b)
%wtfreqs(Dontknow50,c)
%wtfreqs(Elbow50,d)
%wtfreqs(Hip50,e)
%wtfreqs(Wrist50,f)
%wtfreqs(Knee50,g)
%wtfreqs(Ankle50,h)
%wtfreqs(Toes50,i)
%wtfreqs(Fingers50,j)
%wtfreqs(Other50,k)
run;

2 ANALYTIC AND REPORTING GUIDELINES, The Third National Health and Nutrition Examination
Survey, NHANES III (1988-94), October, 1996, 2-3
National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health
Service. 2000.
3 SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000
Copyright ©2000, SAS Institute Inc.

Create a comprehensive dataset representing only those combinations of
joint pain and chronic region pain values that include non-missing
observations from both pain questions:
%macro dsfreqs(ds,table,var);
data &ds;
   set &table.1  (where=(&var and Head120))
       &table.2  (where=(&var and Refused120))
       &table.3  (where=(&var and Dontknow120))
       &table.4  (where=(&var and Facedental120))
       &table.5  (where=(&var and Shoulder120))
       &table.6  (where=(&var and Arm120))
       &table.7  (where=(&var and Upperback120))
       &table.8  (where=(&var and Lowerback120))
       &table.9  (where=(&var and Buttocks120))
       &table.10 (where=(&var and Leg120))
       &table.11 (where=(&var and Neck120))
       &table.12 (where=(&var and Sternum120))
       &table.13 (where=(&var and Chest120))
       &table.14 (where=(&var and Abdomen120))
       &table.15 (where=(&var and Spine120))
       &table.16 (where=(&var and Hand120))
       &table.17 (where=(&var and Foot120));
run;
%mend dsfreqs;
%dsfreqs(a,a,Shoulder50);
%dsfreqs(b,b,Refused50);
%dsfreqs(c,c,Dontknow50);
%dsfreqs(d,d,Elbow50);
%dsfreqs(e,e,Hip50);
%dsfreqs(f,f,Wrist50);
%dsfreqs(g,g,Knee50);
%dsfreqs(h,h,Ankle50);
%dsfreqs(i,i,Toes50);
%dsfreqs(j,j,Fingers50);
%dsfreqs(k,k,Other50);

Create two variables (dim1 and dim2) that will represent the row and column
categories of the recoded survey response variables in a cross tabulation:
/* Run once, after all eleven dsfreqs calls, so that datasets a through k exist */
data all (drop=percent);
   set a b c d e f g h i j k;
   attrib dim1 length=$13 label='Affected Regions'
          dim2 length=$10 label='Affected Joints';
   /* Row label: the region flag that is set on this observation */
   select;
      when (Head120)       dim1 = 'Head';
      when (Refused120)    dim1 = 'Refused';
      when (Dontknow120)   dim1 = 'Dont know';
      when (Facedental120) dim1 = 'Face/dental';
      when (Shoulder120)   dim1 = 'Shoulder';
      when (Arm120)        dim1 = 'Arm';
      when (Upperback120)  dim1 = 'Upper back';
      when (Lowerback120)  dim1 = 'Lower back';
      when (Buttocks120)   dim1 = 'Buttocks';
      when (Leg120)        dim1 = 'Leg';
      when (Neck120)       dim1 = 'Neck';
      when (Sternum120)    dim1 = 'Sternum';
      when (Chest120)      dim1 = 'Chest';
      when (Abdomen120)    dim1 = 'Abdomen';
      when (Spine120)      dim1 = 'Spine';
      when (Hand120)       dim1 = 'Hand';
      when (Foot120)       dim1 = 'Foot';
      otherwise            dim1 = '';
   end;
   /* Column label: the joint flag that is set on this observation */
   select;
      when (Shoulder50)    dim2 = 'Shoulder';
      when (Refused50)     dim2 = 'Refused';
      when (Dontknow50)    dim2 = 'Dont know';
      when (Elbow50)       dim2 = 'Elbow';
      when (Hip50)         dim2 = 'Hip';
      when (Wrist50)       dim2 = 'Wrist';
      when (Knee50)        dim2 = 'Knee';
      when (Ankle50)       dim2 = 'Ankle';
      when (Toes50)        dim2 = 'Toes';
      when (Fingers50)     dim2 = 'Fingers';
      when (Other50)       dim2 = 'Other';
      otherwise            dim2 = '';
   end;
run;

Generate the final report
proc report data=all nowindows missing headskip;
column dim1 dim2,count;
define dim1 / group width=13 ;
define dim2 / across;
define count / analysis sum format=COMMA20.0 'n';
run;
In the previous example, saving the weighted output generated by the FREQ procedure to an
output dataset and displaying it with the REPORT procedure provides a means of displaying
wide integer values. The REPORT procedure can accommodate these values and allows control
of cell formatting. By default, a FREQ crosstabulation table may display large cell
values, e.g. numeric values eight or more positions wide, in scientific notation.
A solution to this problem on the Microsoft Windows platform is available from SAS
support documentation: FAQ # 1786, “Is there a way to format the statistics in the
cross-table produced by PROC FREQ?” Answer: Beginning with Release 8.1 of SAS, there is
an undocumented FORMAT= TABLES statement option that allows you to specify a
format for the statistics. This option was added because there is no crosstabfreq ODS
template that could be modified to change the format of the statistics4.
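Assuming a SAS release in which the FORMAT= option described in the FAQ is available, a direct weighted cross tabulation with formatted cell frequencies might look like the sketch below. It reuses the dataset and variable names from this paper; the choice of Leg120 by Knee50 and of the COMMA20. format is only illustrative.

```sas
proc freq data=anlys_1;
   /* FORMAT= applies to the frequency cells of the crosstabulation table */
   tables Leg120*Knee50 / format=comma20. norow nocol nopercent;
   weight wtint2yr;
run;
```

This covers one pair of variables at a time, whereas the REPORT approach above lays out all region and joint combinations in a single table.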
4 http://support.sas.com/faq/017/FAQ01786.html
NHANES 99-00
Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected)
and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160
Unweighted

                  Affected Joints (n)
Affected Regions  Ankle  Elbow  Fingers  Hip  Knee  Other  Shoulder  Toes  Wrist
Abdomen              16     11       17   12    22      1        16     4      8
Arm                  45     48       37   36    59      7        55    20     38
Buttocks             40     24       32   37    56      9        34    17     26
Chest                 8      5        7    6     9      2         9     3      4
Dont know             .      .        1    .     .      .         .     1      .
Face/dental          13     11       12    8    14      2        14     5     11
Foot                 53     25       33   26    48      6        39    28     26
Hand                 37     33       48   28    49      1        42    26     35
Head                 33     23       31   29    45      5        30    13     24
Leg                  79     44       51   65   138     12        75    26     47
Lower back           73     53       62   66   107     16        75    25     50
Neck                 44     40       44   41    69     10        58    24     38
Refused               1      1        1    1     1      .         1     1      1
Shoulder             54     44       52   50    76      6       103    27     41
Spine                25     20       22   25    34      7        34    12     19
Sternum               5      5        6    4     7      1         6     4      4
Upper back           24     22       16   21    28      5        27     7     19
NHANES 99-00
Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected)
and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160
Full Sample 2 Year Interview Weight

                  Affected Joints (n)
Affected Regions      Ankle      Elbow    Fingers        Hip       Knee      Other   Shoulder       Toes      Wrist
Abdomen             429,958    400,104    274,076    316,051    640,187      2,710    329,883     88,574    136,008
Arm               1,525,207  2,056,496  1,526,275  1,404,824  2,561,355    194,005  2,194,921    640,246  1,405,588
Buttocks          1,651,581  1,178,071  1,403,960  1,249,935  2,272,391    327,929  1,690,884    753,215  1,240,717
Chest               398,286    253,426    323,141    303,541    352,429     11,507    280,376    152,394    251,292
Dont know                 .          .      3,498          .          .          .          .      3,498          .
Face/dental         605,156    607,129    540,795    296,764    606,602     11,507    654,113    147,754    504,916
Foot              2,223,468  1,082,243  1,100,993    746,624  1,891,356    271,638  1,742,052    812,139  1,140,223
Hand              1,364,416  1,365,395  1,813,441    981,372  1,862,868     62,508  1,512,258    995,245  1,405,502
Head              1,419,714  1,126,895  1,326,280    991,503  1,761,166    143,371  1,191,221    434,422    918,543
Leg               2,570,419  1,843,554  1,868,701  2,265,130  5,050,079    335,623  2,784,371    956,380  1,633,988
Lower back        2,632,807  2,301,164  2,608,339  2,521,009  4,193,287    455,485  3,341,103    912,681  1,880,029
Neck              1,886,262  1,935,908  1,865,593  1,683,254  2,976,353    228,065  2,526,951    963,832  1,611,639
Refused              26,311     26,311     26,311     26,311     26,311          .     26,311     26,311     26,311
Shoulder          2,157,715  2,097,878  2,052,722  1,917,835  3,386,778    299,144  4,161,594    888,175  1,549,446
Spine             1,212,026  1,060,117  1,028,620  1,010,831  1,458,014    251,270  1,656,260    563,453  1,111,666
Sternum             230,539    315,435    303,362    188,447    327,222     72,823    318,273    147,161    190,315
Upper back        1,146,370  1,082,466    737,917    830,672  1,263,165    129,676  1,464,090    287,056    943,161
Conclusion
Weighted frequencies are more appropriate than unweighted counts when analyzing the
potential effect of nonresponse on survey estimates, and they provide a more reliable tool
for identifying data trends and issues. In this example the weighted response frequencies,
which estimate population totals rather than sample counts, provide a sounder basis for
conclusions drawn from the sample.
REFERENCES
National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health
Service. 2000.
SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000
Copyright ©2000, SAS Institute Inc.
CONTACT INFORMATION
Ian Duling
AstraZeneca LP
1800 Concord Pike
PO Box 15437
Wilmington, DE 19850-5437