SAS Frequency Tabulations and Contingency Tables (Crosstabs)

SAS Oneway Frequency Tabulations and Twoway Contingency Tables (Crosstabs)
/***********************************************************
This example illustrates:
How to create user-defined formats
How to recode continuous variables into ordinal categories
How to generate oneway and twoway tables and basic tests
The following tests are illustrated:
Chi-square goodness of fit test
Binomial test of proportion for a two-level variable
Exact Binomial test
Pearson Chi-square test
Fisher’s exact test
Cochran-Armitage test for trend
Procs used:
Proc Format
Proc Means
Proc Freq
Proc Contents
Filename: frequencies.sas
************************************************************/
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
OPTIONS NODATE PAGENO=1 FORMDLIM=" ";
PROC FORMAT;
  VALUE AGEFMT     1 = "1:19-29"
                   2 = "2:30-39"
                   3 = "3:>39";
  VALUE HIAGEFMT   1 = "1:AGE > 39"
                   2 = "2:AGE <= 39";
  VALUE HICHOLFMT  1 = "1:>=240"
                   2 = "2:<240";
  VALUE CHOLCATFMT 1 = "1:<200"
                   2 = "2:200-239"
                   3 = "3:>=240";
  VALUE PILLFMT    1 = "1:PILL"
                   2 = "2:NO PILL";
  VALUE WTFMT      1 = "1:<120"
                   2 = "2:120-139"
                   3 = "3:>=140";
  VALUE HIBMIFMT   1 = "1:BMI>23"
                   2 = "2:BMI<=23";
RUN;
The log that results from these Proc Format commands is shown below. These formats will be stored in
the Work library, and thus will be temporary. In the document that follows, you will see the formats being
applied within each procedure, by using a format statement. These formats will not be automatically attached to
variables, and have to be specified for each procedure.
4    PROC FORMAT;
5    VALUE AGEFMT     1 = "1: Age 19-29"
6                     2 = "2: Age 30-39"
7                     3 = "3: Age >39";
NOTE: Format AGEFMT has been output.
8
9    VALUE HIAGEFMT   1 = "1: Age >39"
10                    2 = "2: Age <=39";
NOTE: Format HIAGEFMT has been output.
11
12   VALUE HICHOLFMT  1 = "1: Chol >=240"
13                    2 = "2: Chol <240";
NOTE: Format HICHOLFMT has been output.
14
15   VALUE CHOLCATFMT 1 = "1: Chol <200"
16                    2 = "2: Chol 200-239"
17                    3 = "3: Chol >=240";
NOTE: Format CHOLCATFMT has been output.
18
19   VALUE PILLFMT    1 = "1: Pill"
20                    2 = "2: No Pill";
NOTE: Format PILLFMT has been output.
21
22   VALUE WTFMT      1 = "1: Wt <120kg"
23                    2 = "2: Wt 120-139kg"
24                    3 = "3: Wt >=140kg";
NOTE: Format WTFMT has been output.
25
26   VALUE HIBMIFMT   1 = "1: BMI>23"
27                    2 = "2: BMI<=23";
NOTE: Format HIBMIFMT has been output.
28   RUN;
NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.25 seconds
      cpu time            0.00 seconds
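Because formats stored in the Work library disappear when the SAS session ends, one alternative is to write them to a permanent format catalog. The sketch below is not part of the original program: it assumes the b510 libref that this document assigns later, and it re-creates only PILLFMT as an illustration. LIBRARY= and FMTSEARCH= are standard SAS options, but the choice of library is an assumption.

```sas
/* Sketch only: the original program keeps its formats in WORK. */
libname b510 "e:\510\";          /* same folder the document uses later */

proc format library=b510;        /* writes to the catalog B510.FORMATS  */
  value pillfmt 1 = "1: Pill"
                2 = "2: No Pill";
run;

/* Add B510 to the libraries that SAS searches for formats */
options fmtsearch=(b510);
```

A FORMAT statement placed in the DATA step that creates a data set attaches the format to the variable itself, so it would not need to be repeated in each procedure.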
Now, we create a permanent SAS data set from the raw data file, Werner2.dat. The raw data are read in, then
missing value codes are assigned appropriately, and new variables are created. Note that missing values are
assigned before the new variables are created.
libname b510 "e:\510\";
DATA B510.WERNER;
  INFILE "werner2.dat";
  INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16
        PILL 17-20 CHOL 21-24 ALB 25-28
        CALC 29-32 URIC 33-36 PAIR 37-39;

  /* Assign missing value codes */
  IF HT   = 999 then HT   = .;
  IF WT   = 999 then WT   = .;
  IF ALB  = 99  then ALB  = .;
  IF CALC = 99  then CALC = .;
  IF URIC = 99  then URIC = .;

  /* Create new variables */
  WTKG = WT*.39;   /* factor as in the original program; note that 1 lb = 0.4536 kg */
  HTCM = HT*2.54;
  BMI = WTKG/(HTCM/100)**2;
  IF BMI > 23 then HIBMI = 1;
  IF 0<=BMI<=23 then HIBMI = 2;

  IF AGE NOT=. THEN DO;
    IF AGE <= 29 THEN AGEGROUP=1;
    IF AGE > 29 AND AGE <= 39 THEN AGEGROUP=2;
    IF AGE > 39 THEN AGEGROUP=3;
    IF AGE > 39 THEN HIAGE=1;
    IF AGE <= 39 THEN HIAGE=2;
  END;

  IF CHOL >= 400 OR CHOL < 100 THEN CHOL=.;
  IF CHOL NOT=. THEN DO;
    IF CHOL >= 240 THEN HICHOL=1;
    IF CHOL <  240 THEN HICHOL=2;
    IF CHOL <  200 THEN CHOLCAT=1;
    IF CHOL >= 200 AND CHOL < 240 THEN CHOLCAT=2;
    IF CHOL >= 240 THEN CHOLCAT=3;
  END;

  IF WT NOT=. THEN DO;
    IF WT <  120 THEN WTCAT=1;
    IF WT >= 120 AND WT < 140 THEN WTCAT=2;
    IF WT >= 140 THEN WTCAT=3;
  END;

  DROP WTKG HTCM;
RUN;
We use two methods for checking the newly created variables. The simplest one is Proc Means. This tells us
most importantly if we have included all cases in our new variables, and if we have avoided adding data where
there should be none! We will carefully examine the sample size for each original variable, and each new
variable that was created, to be sure they match. This simple check should always be done first!
TITLE "DESCRIPTIVE STATISTICS";
PROC MEANS;
RUN;
DESCRIPTIVE STATISTICS
The MEANS Procedure
Variable     N           Mean        Std Dev        Minimum        Maximum
--------------------------------------------------------------------------
ID         188        1598.96        1057.09      3.0000000        3519.00
AGE        188     33.8191489     10.1126942     19.0000000     55.0000000
HT         186     64.5107527      2.4850673     57.0000000     71.0000000
WT         186    131.6720430     20.6605767     94.0000000    215.0000000
PILL       188      1.5000000      0.5013351      1.0000000      2.0000000
CHOL       186    236.1505376     42.5555145    155.0000000    390.0000000
ALB        186      4.1112903      0.3579694      3.2000000      5.0000000
CALC       185      9.9621622      0.4795556      8.6000000     11.1000000
URIC       187      4.7705882      1.1572312      2.2000000      9.9000000
PAIR       188     47.5000000     27.2063810      1.0000000     94.0000000
BMI        184     19.0736235      2.6285786     15.2305671     29.6996059
HIBMI      184      1.9021739      0.2978899      1.0000000      2.0000000
AGEGROUP   188      1.9255319      0.8432096      1.0000000      3.0000000
HIAGE      188      1.6808511      0.4673916      1.0000000      2.0000000
HICHOL     186      1.5322581      0.5003051      1.0000000      2.0000000
CHOLCAT    186      2.2634409      0.7783954      1.0000000      3.0000000
WTCAT      186      2.0322581      0.7490767      1.0000000      3.0000000
--------------------------------------------------------------------------
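The sample-size check described above can be made more direct by asking for the missing count as well. The following is a minimal sketch, assuming the B510.WERNER data set created earlier; the variable list is simply one way to pair each original variable with its recodes.

```sas
/* Sketch: show non-missing (N) and missing (NMISS) counts side by side, */
/* so each recode can be compared with the variable it was built from.  */
proc means data=b510.werner n nmiss min max;
  var wt wtcat chol hichol cholcat bmi hibmi age agegroup hiage;
run;

/* Cross-check a single recode, listing missing values as their own row */
proc freq data=b510.werner;
  tables wtcat / missing;
run;
```

If a recode is correct, N and NMISS for the new variable should equal those of its source variable, as they do for WT and WTCAT (186 non-missing) in the output above.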
A second way to check recodes of continuous variables into categories is illustrated below. Basically, you can
check the minimum and maximum value of the original variable in each category of the new categorical
variable to be sure the range of values is specified as you wanted it to be. Do this only after you have checked
the sample sizes by using a simple Proc Means statement, as illustrated above.
TITLE "CHECKING RECODE OF WT INTO WTCAT";
PROC MEANS DATA=B510.WERNER;
CLASS WTCAT;
VAR WT;
FORMAT WTCAT WTFMT.;
RUN;
CHECKING RECODE OF WT INTO WTCAT
The MEANS Procedure
Analysis Variable : WT
                     N
WTCAT              Obs     N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------------------
1: Wt <120kg        49    49    109.4489796      7.0209841     94.0000000    119.0000000
2: Wt 120-139kg     82    82    128.6097561      5.9103510    120.0000000    138.0000000
3: Wt >=140kg       55    55    156.0363636     17.2969315    140.0000000    215.0000000
----------------------------------------------------------------------------------------
TITLE "CHECKING RECODE OF AGE INTO AGEGROUP";
PROC MEANS DATA=B510.WERNER;
CLASS AGEGROUP;
VAR AGE;
FORMAT AGEGROUP AGEFMT.;
RUN;
CHECKING RECODE OF AGE INTO AGEGROUP
The MEANS Procedure
Analysis Variable : AGE
                     N
AGEGROUP           Obs     N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------------------
1: Age 19-29        74    74     23.8378378      2.7846302     19.0000000     29.0000000
2: Age 30-39        54    54     33.5925926      3.0376165     30.0000000     39.0000000
3: Age >39          60    60     46.3333333      4.6892111     40.0000000     55.0000000
----------------------------------------------------------------------------------------
TITLE "CHECKING RECODE OF CHOL INTO HICHOL";
PROC MEANS DATA=B510.WERNER;
CLASS HICHOL;
VAR CHOL;
FORMAT HICHOL HICHOLFMT.;
RUN;
CHECKING RECODE OF CHOL INTO HICHOL
The MEANS Procedure
Analysis Variable : CHOL
                     N
HICHOL             Obs     N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------------------
1: Chol >=240       87    87    272.4712644     29.0159696    240.0000000    390.0000000
2: Chol <240        99    99    204.2323232     21.8985734    155.0000000    238.0000000
----------------------------------------------------------------------------------------
TITLE "CHECKING RECODE OF CHOL INTO CHOLCAT";
PROC MEANS DATA=B510.WERNER;
CLASS CHOLCAT;
VAR CHOL;
FORMAT CHOLCAT CHOLCATFMT.;
RUN;
CHECKING RECODE OF CHOL INTO CHOLCAT
The MEANS Procedure
Analysis Variable : CHOL
                     N
CHOLCAT            Obs     N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------------------
1: Chol <200        38    38    181.7631579     12.9894639    155.0000000    198.0000000
2: Chol 200-239     61    61    218.2295082     12.6601651    200.0000000    238.0000000
3: Chol >=240       87    87    272.4712644     29.0159696    240.0000000    390.0000000
----------------------------------------------------------------------------------------
TITLE "ONEWAY FREQUENCIES";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES PILL WTCAT AGEGROUP HIAGE HICHOL CHOLCAT;
FORMAT AGEGROUP AGEFMT. HICHOL HICHOLFMT. PILL PILLFMT. WTCAT WTFMT.
       HIAGE HIAGEFMT. CHOLCAT CHOLCATFMT.;
RUN;
ONEWAY FREQUENCIES
The FREQ Procedure

                                          Cumulative   Cumulative
PILL               Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Pill                   94      50.00           94        50.00
2: No Pill                94      50.00          188       100.00

                                          Cumulative   Cumulative
WTCAT              Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Wt <120kg              49      26.34           49        26.34
2: Wt 120-139kg           82      44.09          131        70.43
3: Wt >=140kg             55      29.57          186       100.00
Frequency Missing = 2

                                          Cumulative   Cumulative
AGEGROUP           Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Age 19-29              74      39.36           74        39.36
2: Age 30-39              54      28.72          128        68.09
3: Age >39                60      31.91          188       100.00

                                          Cumulative   Cumulative
HIAGE              Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Age >39                60      31.91           60        31.91
2: Age <=39              128      68.09          188       100.00

                                          Cumulative   Cumulative
HICHOL             Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Chol >=240             87      46.77           87        46.77
2: Chol <240              99      53.23          186       100.00
Frequency Missing = 2

                                          Cumulative   Cumulative
CHOLCAT            Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1: Chol <200              38      20.43           38        20.43
2: Chol 200-239           61      32.80           99        53.23
3: Chol >=240             87      46.77          186       100.00
Frequency Missing = 2
One-Sample Tests for Categorical Variables
Binomial Confidence Intervals and Tests for Binary Variables:
If you have a categorical variable with only two levels, you can use the binomial option to request a 95%
confidence interval for the proportion in the first level of the variable, and a test of the null hypothesis:
H0: proportion in first category of the variable = π0
In the option (P= ) you specify the hypothesized proportion, π0, in the first category of the tabled variable. By
default, SAS reports both one-sided and two-sided asymptotic p-values.
TITLE "BINOMIAL TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES HIBMI / BINOMIAL (P=.20);
FORMAT HIBMI HIBMIFMT.;
RUN;
The hypotheses that we are testing are shown below:
H0: proportion with high BMI = 0.20
HA: proportion with high BMI ≠ 0.20
BINOMIAL TEST
                                          Cumulative   Cumulative
HIBMI              Frequency    Percent    Frequency      Percent
-----------------------------------------------------------------
1:BMI>23                  18       9.78           18         9.78
2:BMI<=23                166      90.22          184       100.00
Frequency Missing = 4

Binomial Proportion
for HIBMI = 1:BMI>23
--------------------------------
Proportion (P)            0.0978
ASE                       0.0219
95% Lower Conf Limit      0.0549
95% Upper Conf Limit      0.1408

Exact Conf Limits
95% Lower Conf Limit      0.0590
95% Upper Conf Limit      0.1502

Test of H0: Proportion = 0.2
ASE under H0              0.0295
Z                        -3.4649
One-sided Pr <  Z         0.0003
Two-sided Pr > |Z|        0.0005
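The asymptotic test above can be retraced by hand, which helps show where "ASE under H0" and Z come from. The DATA step below is a sketch that uses only the counts printed in the table (18 of 184) and the hypothesized proportion of 0.20; the variable names are illustrative.

```sas
/* Sketch: recompute the asymptotic binomial test from the printed counts. */
data _null_;
  x  = 18;      /* women with BMI > 23             */
  n  = 184;     /* non-missing HIBMI values        */
  p0 = 0.20;    /* hypothesized proportion (P=.20) */
  phat  = x / n;
  ase0  = sqrt(p0 * (1 - p0) / n);   /* standard error under H0 */
  z     = (phat - p0) / ase0;
  p_one = probnorm(z);               /* left tail, Pr < Z       */
  p_two = 2 * probnorm(-abs(z));     /* two-sided, Pr > |Z|     */
  put phat= 8.4 ase0= 8.4 z= 8.4 p_one= 8.4 p_two= 8.4;
run;
```

The values written to the log can be compared with the Proportion, ASE under H0, Z, and p-value rows of the output.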
If you wish to obtain an exact binomial test of the null hypothesis, use the exact statement.
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES HIBMI / BINOMIAL (P=.20);
exact binomial;
FORMAT HIBMI HIBMIFMT.;
RUN;
This results in an exact test of the null hypothesis, in addition to the default asymptotic test.
Exact Test
One-sided Pr <= P             1.415E-04
Two-sided = 2 * One-sided     2.829E-04
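The exact one-sided p-value is the binomial probability of observing 18 or fewer women with high BMI out of 184 when the true proportion is 0.20. A minimal sketch of that calculation with the CDF function follows; doubling the one-sided value mirrors the "Two-sided = 2 * One-sided" label in the output.

```sas
/* Sketch: exact binomial tail probability from the printed counts. */
data _null_;
  x  = 18;
  n  = 184;
  p0 = 0.20;
  p_one = cdf('BINOMIAL', x, p0, n);   /* Pr(X <= 18) under H0       */
  p_two = min(1, 2 * p_one);           /* doubled one-sided p-value  */
  put p_one= e10. p_two= e10.;
run;
```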
Chi-square Goodness of Fit Tests for Categorical Variables:
Use the chisq option in the tables statement to get a chi-square goodness of fit test, which can be used for
categorical variables with two or more levels. By default SAS assumes that you wish to test the null hypothesis
that the proportion of cases is equal in all categories.
Use the testp= option to specify the proportions that you wish to test, if you don't want to assume equal
proportions in all categories. The total of all the proportions must be 1.0. You can also use percentages, in
which case, the total must add up to 100%. Give the appropriate proportions in the testp= option, specifying
them in order as they apply to each category.
TITLE "CHISQUARE GOODNESS OF FIT TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES CHOLCAT / CHISQ TESTP=(.20 .30 .50);
FORMAT CHOLCAT CHOLCATFMT.;
RUN;
The null hypothesis that we are testing is:
H0: π1 = 0.20, π2 = 0.30, π3 = 0.50
CHISQUARE GOODNESS OF FIT TEST
The FREQ Procedure
                                               Test   Cumulative   Cumulative
CHOLCAT            Frequency    Percent     Percent    Frequency      Percent
-----------------------------------------------------------------------------
1: Chol <200              38      20.43       20.00           38        20.43
2: Chol 200-239           61      32.80       30.00           99        53.23
3: Chol >=240             87      46.77       50.00          186       100.00
Frequency Missing = 2

Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square         0.8889
DF                      2
Pr > ChiSq         0.6412
Effective Sample Size = 186
Frequency Missing = 2
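The goodness of fit statistic is the sum over categories of (observed - expected)² / expected, where each expected count is the sample size times the hypothesized proportion. The sketch below redoes that arithmetic from the CHOLCAT counts and the TESTP= proportions shown above; the array and variable names are illustrative.

```sas
/* Sketch: chi-square goodness of fit computed by hand. */
data _null_;
  array observed[3] _temporary_ (38 61 87);        /* CHOLCAT counts     */
  array p0[3]       _temporary_ (0.20 0.30 0.50);  /* TESTP= proportions */
  n = 0;
  do i = 1 to 3;
    n = n + observed[i];
  end;
  chisq = 0;
  do i = 1 to 3;
    expected = n * p0[i];                          /* 37.2, 55.8, 93.0   */
    chisq = chisq + (observed[i] - expected)**2 / expected;
  end;
  df = 3 - 1;                                      /* categories minus 1 */
  p  = 1 - probchi(chisq, df);
  put n= chisq= 8.4 df= p= 8.4;
run;
```

With these inputs the three terms are 0.64/37.2, 27.04/55.8, and 36/93, which sum to 0.8889, the chi-square value reported by Proc Freq.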
Two-Sample Tests for Categorical Variables:
Chi-Square test of Independence
Two by Two Table:
If you wish to examine the relationship between two categorical variables, you can use Proc Freq. Use the
chisq option to obtain the Pearson chi-square test of independence (or of homogeneity), and use the expected
option to get the expected value in each cell. The commands below can be used to get a cross-tabulation. In this
case, we have a 2 by 2 table, because each categorical variable has two levels. We test:
H0: HIAGE is independent of HICHOL status
HA: HIAGE is not independent of HICHOL status
Note that Fisher’s exact test is produced by default for a 2 x 2 table when the chisq option is specified. Read
either the one-sided or the two-sided p-value for Fisher’s exact test at the bottom of the respective
panel of output.
TITLE "2x2 TABLE";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES HIAGE*HICHOL / CHISQ EXPECTED;
FORMAT HIAGE HIAGEFMT. HICHOL HICHOLFMT.;
RUN;
2x2 TABLE
Table of HIAGE by HICHOL
HIAGE         HICHOL
Frequency   |
Expected    |
Percent     |
Row Pct     |
Col Pct     |1: Chol |2: Chol |  Total
            |>=240   |<240    |
------------+--------+--------+
1: Age >39  |     42 |     18 |     60
            | 28.065 | 31.935 |
            |  22.58 |   9.68 |  32.26
            |  70.00 |  30.00 |
            |  48.28 |  18.18 |
------------+--------+--------+
2: Age <=39 |     45 |     81 |    126
            | 58.935 | 67.065 |
            |  24.19 |  43.55 |  67.74
            |  35.71 |  64.29 |
            |  51.72 |  81.82 |
------------+--------+--------+
Total             87       99      186
               46.77    53.23   100.00
Frequency Missing = 2
Statistics for Table of HIAGE by HICHOL
Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1     19.1914    <.0001
Likelihood Ratio Chi-Square    1     19.5296    <.0001
Continuity Adj. Chi-Square     1     17.8389    <.0001
Mantel-Haenszel Chi-Square     1     19.0882    <.0001
Phi Coefficient                       0.3212
Contingency Coefficient               0.3058
Cramer's V                            0.3212

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        42
Left-sided Pr <= F          1.0000
Right-sided Pr >= F      1.045E-05

Table Probability (P)    8.118E-06
Two-sided Pr <= P        1.741E-05
Effective Sample Size = 186
Frequency Missing = 2
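When only the cell counts of a table are available, the same analysis can be run from a small data set of counts by using the WEIGHT statement. The sketch below enters the four cell frequencies from the table above; the data set name and the variable name count are illustrative, and no formats are applied, so the rows and columns are labeled 1 and 2.

```sas
/* Sketch: rebuild the 2x2 table from its cell counts. */
data hiage_hichol;
  input hiage hichol count;
  datalines;
1 1 42
1 2 18
2 1 45
2 2 81
;
run;

proc freq data=hiage_hichol order=internal;
  weight count;                        /* each record stands for COUNT women */
  tables hiage*hichol / chisq expected;
run;
```

The Pearson chi-square can also be confirmed by hand for a 2 x 2 table: N(ad - bc)² divided by the product of the four marginal totals, here 186(42·81 - 18·45)² / (60·126·87·99), which gives 19.19.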
Cochran-Armitage test for trend:
R x 2 table, or 2 x C table
The Cochran-Armitage test for trend is appropriate when either the row or column variable is binary (has two
levels) and the other variable is ordinal. It tests whether there is a linear trend in the proportion of subjects
having the binary characteristic. The Mantel-Haenszel test statistic tests for a linear by linear association and
can be used when both row and column variables are ordinal; it always has 1 degree of freedom. In the table
below, both the row and column variables could be considered to be ordinal, because a binary variable can be
thought of as a very simple case of an ordinal variable.
TITLE1 "3X2 TABLE";
TITLE2 "THE ROW VARIABLE IS ORDINAL";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES AGEGROUP*HICHOL / CHISQ TREND NOCOL NOPERCENT;
FORMAT AGEGROUP AGEFMT. HICHOL HICHOLFMT. ;
RUN;
For the Cochran-Armitage test, we are testing the null hypothesis:
H0: There is no linear trend in the proportion of women with high cholesterol, with increasing age
We are not testing whether the trend is in a positive or negative direction. To see that, simply examine the
proportions of participants with high cholesterol in each age group.
3X2 TABLE
THE ROW VARIABLE IS ORDINAL
The FREQ Procedure
Table of AGEGROUP by HICHOL
AGEGROUP       HICHOL
Frequency    |
Row Pct      |1: Chol |2: Chol |  Total
             |>=240   |<240    |
-------------+--------+--------+
1: Age 19-29 |     25 |     47 |     72
             |  34.72 |  65.28 |
-------------+--------+--------+
2: Age 30-39 |     20 |     34 |     54
             |  37.04 |  62.96 |
-------------+--------+--------+
3: Age >39   |     42 |     18 |     60
             |  70.00 |  30.00 |
-------------+--------+--------+
Total              87       99      186
Frequency Missing = 2
Statistics for Table of AGEGROUP by HICHOL
Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     2     19.2578    <.0001
Likelihood Ratio Chi-Square    2     19.6016    <.0001
Mantel-Haenszel Chi-Square     1     15.5677    <.0001
Phi Coefficient                       0.3218
Contingency Coefficient               0.3063
Cramer's V                            0.3218

Statistics for Table of AGEGROUP by HICHOL
Cochran-Armitage Trend Test
---------------------------
Statistic (Z)        3.9562
One-sided Pr > Z     <.0001
Two-sided Pr > |Z|   <.0001
Effective Sample Size = 186
Frequency Missing = 2
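The Cochran-Armitage and Mantel-Haenszel statistics in this output are closely related. With the default table scores, the squared trend statistic equals N·r² and the Mantel-Haenszel chi-square equals (N - 1)·r², where r is the correlation between the row and column scores. The sketch below checks that relation numerically using only the values printed above.

```sas
/* Sketch: relate the trend statistic Z to the Mantel-Haenszel chi-square. */
data _null_;
  n   = 186;        /* effective sample size                */
  z   = 3.9562;     /* Cochran-Armitage statistic (rounded) */
  qmh = z**2 * (n - 1) / n;
  put qmh= 8.4;
run;
```

The result is about 15.567, which agrees with the reported Mantel-Haenszel value of 15.5677 up to the rounding of Z.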
Mantel-Haenszel test for a linear association between two ordinal categorical variables:
R x C table, both row and column variables are ordinal
In the next table, both the row and column variable are ordinal. In this case the Mantel-Haenszel test is
appropriate to test for a linear by linear association between the ordinal row variable and the ordinal column
variable. The Pearson Chi-square test is appropriate for testing general association (H0: the row variable is
independent of the column variable) whether there is ordering of the row and/or column variable or not. In a
table like this, which does have ordering of both row and column variables, the Pearson Chi-square test ignores
the ordering of the variables.
TITLE "3X3 TABLE BOTH ORDINAL VARIABLES";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
TABLES AGEGROUP*WTCAT / CHISQ nocol nopercent;
FORMAT AGEGROUP AGEFMT. WTCAT WTFMT.;
RUN;
3X3 TABLE BOTH ORDINAL VARIABLES
Table of AGEGROUP by WTCAT
AGEGROUP       WTCAT
Frequency    |
Row Pct      |1: Wt <1|2: Wt 12|3: Wt >=|  Total
             |20kg    |0-139kg |140kg   |
-------------+--------+--------+--------+
1: Age 19-29 |     24 |     34 |     16 |     74
             |  32.43 |  45.95 |  21.62 |
-------------+--------+--------+--------+
2: Age 30-39 |     15 |     26 |     12 |     53
             |  28.30 |  49.06 |  22.64 |
-------------+--------+--------+--------+
3: Age >39   |     10 |     22 |     27 |     59
             |  16.95 |  37.29 |  45.76 |
-------------+--------+--------+--------+
Total              49       82       55      186
Frequency Missing = 2
Statistics for Table of AGEGROUP by WTCAT
Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     4     11.7418    0.0194
Likelihood Ratio Chi-Square    4     11.4638    0.0218
Mantel-Haenszel Chi-Square     1      8.7820    0.0030
Phi Coefficient                       0.2513
Contingency Coefficient               0.2437
Cramer's V                            0.1777
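The Mantel-Haenszel chi-square for two ordinal variables equals (N - 1)·r², where r is the Pearson correlation between the row and column scores (here the codes 1, 2, 3). The sketch below reproduces it from the nine cell counts in the table above; the data set names and the variable count are illustrative, and N = 186 is typed in from the output rather than computed.

```sas
/* Sketch: Mantel-Haenszel chi-square as (N - 1) * r**2. */
data age_wt;
  input agegroup wtcat count;
  datalines;
1 1 24
1 2 34
1 3 16
2 1 15
2 2 26
2 3 12
3 1 10
3 2 22
3 3 27
;
run;

/* Correlation of the category codes, with each cell weighted by its count */
proc corr data=age_wt outp=corr_out noprint;
  var agegroup wtcat;
  freq count;
run;

data _null_;
  set corr_out;
  where _type_ = 'CORR' and upcase(_name_) = 'AGEGROUP';
  n   = 186;                     /* effective sample size from the output */
  qmh = (n - 1) * wtcat**2;      /* WTCAT column holds r on this row      */
  p   = 1 - probchi(qmh, 1);     /* 1 degree of freedom                   */
  put qmh= 8.4 p= 8.4;
run;
```

The value written to the log should be close to the 8.7820 reported by Proc Freq.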
We now look at some examples using the Cars.sas7bdat SAS data set.
We first use Proc Contents to learn what variables are in the data set, and the types of all the variables.
title;
proc contents data=b510.cars;
run;
The CONTENTS Procedure
Data Set Name        B510.CARS                              Observations          406
Member Type          DATA                                   Variables             8
Engine               V9                                     Indexes               0
Created              Monday, August 21, 2006 09:41:24 PM    Observation Length    64
Last Modified        Monday, August 21, 2006 09:41:24 PM    Deleted Observations  0
Protection                                                  Compressed            NO
Data Set Type                                               Sorted                NO
Label
Data Representation  WINDOWS_32
Encoding             wlatin1 Western (Windows)

Alphabetic List of Variables and Attributes
#    Variable    Type    Len    Format    Label
5    ACCEL       Num       8    4.        Time to Accelerate from 0 to 60 mph (sec)
8    CYLINDER    Num       8    1.        Number of Cylinders
2    ENGINE      Num       8    5.        Engine Displacement (cu. inches)
3    HORSE       Num       8    5.        Horsepower
1    MPG         Num       8    4.        Miles per Gallon
7    ORIGIN      Num       8    1.        Country of Origin
4    WEIGHT      Num       8    4.        Vehicle Weight (lbs.)
6    YEAR        Num       8    2.        Model Year (modulo 100)
proc format;
value originfmt 1="USA"
2="Europe"
3="Japan";
run;
Output from the SAS log is shown below. Because this format had already been defined in the current run of
SAS, there is a note in the log stating that it is already on the library. If this format were to be resubmitted with
new values, the new values would over-write the old values.
142  proc format;
143     value originfmt 1="USA"
144                     2="Europe"
145                     3="Japan";
NOTE: Format ORIGINFMT is already on the library.
NOTE: Format ORIGINFMT has been output.
146  run;
We now take a look at a 3 by 5 table (the row variable has 3 levels and the column variable has 5 levels) to see
if there is any association between Country of Origin, and Number of Cylinders. The Pearson chi-square test is
perhaps appropriate here…but let’s see.
title "Row variable is nominal, column variable is ordinal";
proc freq data = b510.cars;
tables origin*cylinder / chisq expected;
format origin originfmt.;
run;
Row variable is nominal, column variable is ordinal
Table of ORIGIN by CYLINDER
ORIGIN(Country of Origin)     CYLINDER(Number of Cylinders)
Frequency|
Expected |
Percent  |
Row Pct  |
Col Pct  |       3|       4|       5|       6|       8|  Total
---------+--------+--------+--------+--------+--------+
USA      |      0 |     72 |      0 |     74 |    107 |    253
         | 2.4988 | 129.31 | 1.8741 | 52.474 | 66.842 |
         |   0.00 |  17.78 |   0.00 |  18.27 |  26.42 |  62.47
         |   0.00 |  28.46 |   0.00 |  29.25 |  42.29 |
         |   0.00 |  34.78 |   0.00 |  88.10 | 100.00 |
---------+--------+--------+--------+--------+--------+
Europe   |      0 |     66 |      3 |      4 |      0 |     73
         |  0.721 | 37.311 | 0.5407 | 15.141 | 19.286 |
         |   0.00 |  16.30 |   0.74 |   0.99 |   0.00 |  18.02
         |   0.00 |  90.41 |   4.11 |   5.48 |   0.00 |
         |   0.00 |  31.88 | 100.00 |   4.76 |   0.00 |
---------+--------+--------+--------+--------+--------+
Japan    |      4 |     69 |      0 |      6 |      0 |     79
         | 0.7802 | 40.378 | 0.5852 | 16.385 | 20.872 |
         |   0.99 |  17.04 |   0.00 |   1.48 |   0.00 |  19.51
         |   5.06 |  87.34 |   0.00 |   7.59 |   0.00 |
         | 100.00 |  33.33 |   0.00 |   7.14 |   0.00 |
---------+--------+--------+--------+--------+--------+
Total           4      207        3       84      107      405
             0.99    51.11     0.74    20.74    26.42   100.00
Frequency Missing = 1
Statistics for Table of ORIGIN by CYLINDER
Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     8    185.7937    <.0001
Likelihood Ratio Chi-Square    8    217.1249    <.0001
Mantel-Haenszel Chi-Square     1    129.7702    <.0001
Phi Coefficient                       0.6773
Contingency Coefficient               0.5608
Cramer's V                            0.4789
WARNING: 40% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
Effective Sample Size = 405
Frequency Missing = 1
Because the table contains a high proportion of small expected values (expected values less than 5), SAS gives
a warning message in the output. In this case, we can use a Fisher’s exact test. Here are the commands we first
try to use:
title "Row variable is nominal, column variable is ordinal";
proc freq data = b510.cars;
tables origin*cylinder / chisq expected;
exact fisher;
format origin originfmt.;
run;
The following message in the SAS log warns us that this may take a long time. We interrupted processing by
clicking on the “break” key, which looks like a circle around an exclamation point (!).
160  proc freq data = b510.cars;
161     tables origin*cylinder / chisq;
162     exact fisher;
163     format origin originfmt.;
164  run;
WARNING: Computing exact p-values for this problem may require much time and memory. Press the
         system interrupt key to terminate exact computations.
NOTE: There were 406 observations read from the data set B510.CARS.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           31.02 seconds
      cpu time            23.54 seconds
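Rather than interrupting a long exact computation by hand, the Exact statement accepts a MAXTIME= option that caps the clock time, in seconds, spent on exact p-values. The sketch below assumes the same B510.CARS data set; the 60-second limit is an arbitrary illustrative value.

```sas
/* Sketch: give up on the exact computation after 60 seconds. */
proc freq data=b510.cars;
  tables origin*cylinder / chisq;
  exact fisher / maxtime=60;
run;
```

If the limit is reached, the exact computation stops and the exact p-value is not reported, while the asymptotic tests are unaffected.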
We now resubmit the commands, using instead the Monte Carlo option in SAS (mc). This will give us a quite
good approximation to the Fisher’s exact test p-value, but based on 10,000 randomly chosen tables.
title "Row variable is nominal, column variable is ordinal";
title2 "Try Fisher's Exact test";
proc freq data = b510.cars;
tables origin*cylinder / chisq expected;
exact fisher / mc;
format origin originfmt.;
run;
The output for these tests is shown below. The appropriate p-value is the portion labeled Pr <= P. SAS also
reports a 99% lower and upper confidence limit for the p-value. When reporting a p-value that is displayed as
0.0000, it is more acceptable to use p< 0.0001.
Note that we did not specify an initial seed for the Monte Carlo simulation, so SAS chose one for us. This seed
is reported at the bottom of the output. You can re-generate the same results by specifying the seed in your
Exact Statement, as shown below.
exact fisher / mc seed=210470001;
Statistics for Table of ORIGIN by CYLINDER
Fisher's Exact Test
----------------------------------
Table Probability (P)    3.582E-49

Monte Carlo Estimate for the Exact Test
Pr <= P                     0.0000
99% Lower Conf Limit        0.0000
99% Upper Conf Limit     4.604E-04
Number of Samples            10000
Initial Seed             210470001
Effective Sample Size = 405
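The width of the confidence interval around the Monte Carlo p-value depends on the number of sampled tables. The Exact statement's N= option sets that number (10,000 by default, as in the output above), and ALPHA= sets the confidence level for the limits. The sketch below assumes the same B510.CARS data set; the value of 100,000 samples is only an illustration.

```sas
/* Sketch: tighten the Monte Carlo estimate by drawing more tables. */
proc freq data=b510.cars;
  tables origin*cylinder / chisq;
  /* n=     number of Monte Carlo samples                      */
  /* alpha= 0.01 requests 99% confidence limits for the p-value */
  /* seed=  fixes the random number stream for repeatability    */
  exact fisher / mc n=100000 alpha=0.01 seed=210470001;
run;
```

A run with a different N= draws a different set of tables, so it will not reproduce the estimate shown above even when the seed is the same.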