SAS Oneway Frequency Tabulations and Twoway Contingency Tables (Crosstabs)

/***********************************************************
This example illustrates:
    How to create user-defined formats
    How to recode continuous variables into ordinal categories
    How to generate oneway and twoway tables and basic tests

The following tests are illustrated:
    Chi-square goodness of fit test
    Binomial test of proportion for a two-level variable
    Exact Binomial test
    Pearson Chi-square test
    Fisher's exact test
    Cochran-Armitage test for trend

Procs used:
    Proc Format
    Proc Means
    Proc Freq
    Proc Contents

Filename: frequencies.sas
************************************************************/

OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
OPTIONS NODATE PAGENO=1 FORMDLIM=" ";

PROC FORMAT;
   VALUE AGEFMT     1 = "1: Age 19-29"
                    2 = "2: Age 30-39"
                    3 = "3: Age >39";

   VALUE HIAGEFMT   1 = "1: Age >39"
                    2 = "2: Age <=39";

   VALUE HICHOLFMT  1 = "1: Chol >=240"
                    2 = "2: Chol <240";

   VALUE CHOLCATFMT 1 = "1: Chol <200"
                    2 = "2: Chol 200-239"
                    3 = "3: Chol >=240";

   VALUE PILLFMT    1 = "1: Pill"
                    2 = "2: No Pill";

   VALUE WTFMT      1 = "1: Wt <120kg"
                    2 = "2: Wt 120-139kg"
                    3 = "3: Wt >=140kg";

   VALUE HIBMIFMT   1 = "1: BMI>23"
                    2 = "2: BMI<=23";
RUN;

The log that results from these Proc Format commands is shown below. The formats are stored in the Work library, and thus are temporary. In the examples that follow, the formats are applied within each procedure by using a format statement. They are not automatically attached to variables, so they have to be specified in each procedure that uses them.

4    PROC FORMAT;
5       VALUE AGEFMT 1 = "1: Age 19-29"
6                    2 = "2: Age 30-39"
7                    3 = "3: Age >39";
NOTE: Format AGEFMT has been output.
8
9       VALUE HIAGEFMT 1 = "1: Age >39"
10                     2 = "2: Age <=39";
NOTE: Format HIAGEFMT has been output.
11
12      VALUE HICHOLFMT 1 = "1: Chol >=240"
13                      2 = "2: Chol <240";
NOTE: Format HICHOLFMT has been output.
14
15      VALUE CHOLCATFMT 1 = "1: Chol <200"
16                       2 = "2: Chol 200-239"
17                       3 = "3: Chol >=240";
NOTE: Format CHOLCATFMT has been output.
18
19      VALUE PILLFMT 1 = "1: Pill"
20                    2 = "2: No Pill";
NOTE: Format PILLFMT has been output.
21
22      VALUE WTFMT 1 = "1: Wt <120kg"
23                  2 = "2: Wt 120-139kg"
24                  3 = "3: Wt >=140kg";
NOTE: Format WTFMT has been output.
25
26      VALUE HIBMIFMT 1 = "1: BMI>23"
27                     2 = "2: BMI<=23";
NOTE: Format HIBMIFMT has been output.
28   RUN;

NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.25 seconds
      cpu time            0.00 seconds

Now, we create a permanent SAS data set from the raw data file, Werner2.dat. The raw data are read in, then missing value codes are assigned appropriately, and new variables are created. Note that missing values are assigned before the new variables are created.

libname b510 "e:\510\";

DATA B510.WERNER;
   INFILE "werner2.dat";
   INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16 PILL 17-20 CHOL 21-24
         ALB 25-28 CALC 29-32 URIC 33-36 PAIR 37-39;

   IF HT   = 999 then HT   = .;
   IF WT   = 999 then WT   = .;
   IF ALB  = 99  then ALB  = .;
   IF CALC = 99  then CALC = .;
   IF URIC = 99  then URIC = .;

   WTKG = WT*.39;
   HTCM = HT*2.54;
   BMI = WTKG/(HTCM/100)**2;

   IF BMI > 23 then HIBMI = 1;
   IF 0<=BMI<=23 then HIBMI = 2;

   IF AGE NOT=. THEN DO;
      IF AGE <= 29 THEN AGEGROUP=1;
      IF AGE > 29 AND AGE <= 39 THEN AGEGROUP=2;
      IF AGE > 39 THEN AGEGROUP=3;
      IF AGE > 39 THEN HIAGE=1;
      IF AGE <= 39 THEN HIAGE=2;
   END;

   IF CHOL >= 400 OR CHOL < 100 THEN CHOL=.;

   IF CHOL NOT=. THEN DO;
      IF CHOL >= 240 THEN HICHOL=1;
      IF CHOL <  240 THEN HICHOL=2;
      IF CHOL <  200 THEN CHOLCAT=1;
      IF CHOL >= 200 AND CHOL < 240 THEN CHOLCAT=2;
      IF CHOL >= 240 THEN CHOLCAT=3;
   END;

   IF WT NOT=. THEN DO;
      IF WT <  120 THEN WTCAT=1;
      IF WT >= 120 AND WT < 140 THEN WTCAT=2;
      IF WT >= 140 THEN WTCAT=3;
   END;

   DROP WTKG HTCM;
RUN;

We use two methods for checking the newly created variables. The simplest one is Proc Means. Most importantly, it tells us whether all cases have been included in the new variables, and whether we have avoided creating values where the original variable was missing.
We will carefully examine the sample size for each original variable, and each new variable that was created, to be sure they match. This simple check should always be done first!

TITLE "DESCRIPTIVE STATISTICS";
PROC MEANS;
RUN;

DESCRIPTIVE STATISTICS

The MEANS Procedure

Variable       N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------
ID           188        1598.96        1057.09      3.0000000        3519.00
AGE          188     33.8191489     10.1126942     19.0000000     55.0000000
HT           186     64.5107527      2.4850673     57.0000000     71.0000000
WT           186    131.6720430     20.6605767     94.0000000    215.0000000
PILL         188      1.5000000      0.5013351      1.0000000      2.0000000
CHOL         186    236.1505376     42.5555145    155.0000000    390.0000000
ALB          186      4.1112903      0.3579694      3.2000000      5.0000000
CALC         185      9.9621622      0.4795556      8.6000000     11.1000000
URIC         187      4.7705882      1.1572312      2.2000000      9.9000000
PAIR         188     47.5000000     27.2063810      1.0000000     94.0000000
BMI          184     19.0736235      2.6285786     15.2305671     29.6996059
HIBMI        184      1.9021739      0.2978899      1.0000000      2.0000000
AGEGROUP     188      1.9255319      0.8432096      1.0000000      3.0000000
HIAGE        188      1.6808511      0.4673916      1.0000000      2.0000000
HICHOL       186      1.5322581      0.5003051      1.0000000      2.0000000
CHOLCAT      186      2.2634409      0.7783954      1.0000000      3.0000000
WTCAT        186      2.0322581      0.7490767      1.0000000      3.0000000
----------------------------------------------------------------------------

A second way to check recodes of continuous variables into categories is illustrated below. Basically, you can check the minimum and maximum value of the original variable in each category of the new categorical variable to be sure the range of values is specified as you wanted it to be. Do this only after you have checked the sample sizes by using a simple Proc Means statement, as illustrated above.
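Before running the per-category checks, the sample-size comparison described above can be made a little more targeted. The sketch below is an addition to the handout, not part of the original program: it lists only the recoded variables next to their source variables and asks Proc Means for the missing-value count (NMISS) as well as N, so that each pair can be compared at a glance. It assumes the B510.WERNER data set created earlier.

```sas
/* Sketch (not from the original handout): compare N and NMISS for each
   original variable and the new variables derived from it. Within each
   group, N and NMISS should agree if the recode handled missing values. */
TITLE "CHECKING N AND NMISS FOR ORIGINAL AND RECODED VARIABLES";
PROC MEANS DATA=B510.WERNER N NMISS MIN MAX;
   VAR AGE AGEGROUP HIAGE      /* age and its two recodes         */
       WT WTCAT                /* weight and its recode           */
       CHOL HICHOL CHOLCAT     /* cholesterol and its two recodes */
       BMI HIBMI;              /* BMI and its recode              */
RUN;
```

If a recoded variable shows a larger N than its source variable, values were created for cases that should have been missing.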
TITLE "CHECKING RECODE OF WT INTO WTCAT";
PROC MEANS DATA=B510.WERNER;
   CLASS WTCAT;
   VAR WT;
   FORMAT WTCAT WTFMT.;
RUN;

CHECKING RECODE OF WT INTO WTCAT

The MEANS Procedure

Analysis Variable : WT

                      N
WTCAT               Obs     N           Mean        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------------------
1: Wt <120kg         49    49    109.4489796      7.0209841     94.0000000    119.0000000
2: Wt 120-139kg      82    82    128.6097561      5.9103510    120.0000000    138.0000000
3: Wt >=140kg        55    55    156.0363636     17.2969315    140.0000000    215.0000000
-----------------------------------------------------------------------------------------

TITLE "CHECKING RECODE OF AGE INTO AGEGROUP";
PROC MEANS DATA=B510.WERNER;
   CLASS AGEGROUP;
   VAR AGE;
   FORMAT AGEGROUP AGEFMT.;
RUN;

CHECKING RECODE OF AGE INTO AGEGROUP

The MEANS Procedure

Analysis Variable : AGE

                      N
AGEGROUP            Obs     N           Mean        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------------------
1: Age 19-29         74    74     23.8378378      2.7846302     19.0000000     29.0000000
2: Age 30-39         54    54     33.5925926      3.0376165     30.0000000     39.0000000
3: Age >39           60    60     46.3333333      4.6892111     40.0000000     55.0000000
-----------------------------------------------------------------------------------------

TITLE "CHECKING RECODE OF CHOL INTO HICHOL";
PROC MEANS DATA=B510.WERNER;
   CLASS HICHOL;
   VAR CHOL;
   FORMAT HICHOL HICHOLFMT.;
RUN;

CHECKING RECODE OF CHOL INTO HICHOL

The MEANS Procedure

Analysis Variable : CHOL

                      N
HICHOL              Obs     N           Mean        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------------------
1: Chol >=240        87    87    272.4712644     29.0159696    240.0000000    390.0000000
2: Chol <240         99    99    204.2323232     21.8985734    155.0000000    238.0000000
-----------------------------------------------------------------------------------------

TITLE "CHECKING RECODE OF CHOL INTO CHOLCAT";
PROC MEANS DATA=B510.WERNER;
   CLASS CHOLCAT;
   VAR CHOL;
   FORMAT CHOLCAT CHOLCATFMT.;
RUN;

CHECKING RECODE OF CHOL INTO CHOLCAT

The MEANS Procedure

Analysis Variable : CHOL

                      N
CHOLCAT             Obs     N           Mean        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------------------
1: Chol <200         38    38    181.7631579     12.9894639    155.0000000    198.0000000
2: Chol 200-239      61    61    218.2295082     12.6601651    200.0000000    238.0000000
3: Chol >=240        87    87    272.4712644     29.0159696    240.0000000    390.0000000
-----------------------------------------------------------------------------------------

TITLE "ONEWAY FREQUENCIES";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES PILL WTCAT AGEGROUP HIAGE HICHOL CHOLCAT;
   FORMAT AGEGROUP AGEFMT. HICHOL HICHOLFMT. CHOLCAT CHOLCATFMT.
          PILL PILLFMT. WTCAT WTFMT. HIAGE HIAGEFMT.;
RUN;

ONEWAY FREQUENCIES

The FREQ Procedure

                                            Cumulative    Cumulative
PILL               Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Pill                   94       50.00            94        50.00
2: No Pill                94       50.00           188       100.00

                                            Cumulative    Cumulative
WTCAT              Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Wt <120kg              49       26.34            49        26.34
2: Wt 120-139kg           82       44.09           131        70.43
3: Wt >=140kg             55       29.57           186       100.00

Frequency Missing = 2

                                            Cumulative    Cumulative
AGEGROUP           Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Age 19-29              74       39.36            74        39.36
2: Age 30-39              54       28.72           128        68.09
3: Age >39                60       31.91           188       100.00

                                            Cumulative    Cumulative
HIAGE              Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Age >39                60       31.91            60        31.91
2: Age <=39              128       68.09           188       100.00

                                            Cumulative    Cumulative
HICHOL             Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Chol >=240             87       46.77            87        46.77
2: Chol <240              99       53.23           186       100.00

Frequency Missing = 2

                                            Cumulative    Cumulative
CHOLCAT            Frequency     Percent     Frequency      Percent
--------------------------------------------------------------------
1: Chol <200              38       20.43            38        20.43
2: Chol 200-239           61       32.80            99        53.23
3: Chol >=240             87       46.77           186       100.00

Frequency Missing = 2

One-Sample Tests for Categorical Variables

Binomial Confidence Intervals and Tests for Binary Variables:

If you have a categorical variable with only two levels, you can use the binomial option to request a 95% confidence interval for the proportion in the first level of the variable, and a test of the null hypothesis:

H0: proportion in first category of the variable = π

In the option (P= ) you specify π, the hypothesized proportion in the first category of the tabled variable. By default, SAS reports both one-sided and two-sided asymptotic p-values.

TITLE "BINOMIAL TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES HIBMI / BINOMIAL (P=.20);
   FORMAT HIBMI HIBMIFMT.;
RUN;

The hypotheses that we are testing are shown below:

H0: proportion with high BMI = 0.20
HA: proportion with high BMI ≠ 0.20

BINOMIAL TEST

                                        Cumulative    Cumulative
HIBMI          Frequency     Percent     Frequency      Percent
----------------------------------------------------------------
1:BMI>23              18        9.78            18         9.78
2:BMI<=23            166       90.22           184       100.00

Frequency Missing = 4

Binomial Proportion for HIBMI = 1:BMI>23
----------------------------------------
Proportion (P)               0.0978
ASE                          0.0219
95% Lower Conf Limit         0.0549
95% Upper Conf Limit         0.1408

Exact Conf Limits
95% Lower Conf Limit         0.0590
95% Upper Conf Limit         0.1502

Test of H0: Proportion = 0.2
ASE under H0                 0.0295
Z                           -3.4649
One-sided Pr <  Z            0.0003
Two-sided Pr > |Z|           0.0005

If you wish to obtain an exact binomial test of the null hypothesis, use the exact statement.

PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES HIBMI / BINOMIAL (P=.20);
   EXACT BINOMIAL;
   FORMAT HIBMI HIBMIFMT.;
RUN;

This results in an exact test of the null hypothesis, in addition to the default asymptotic test.
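As a check on where these numbers come from, the asymptotic Z statistic in the binomial output, and the exact one-sided p-value that the exact statement produces, can be reproduced by hand in a short data step. This is a sketch added for illustration, not part of the original program. It uses the counts from the frequency table for HIBMI (18 of 184 women with BMI > 23) and the hypothesized proportion 0.20. The variable names are made up for this example.

```sas
/* Sketch (not from the original handout): reproduce the binomial test
   for HIBMI by hand. x = 18 women with BMI > 23 among n = 184
   non-missing cases; hypothesized proportion p0 = 0.20. */
DATA _NULL_;
   n  = 184;
   x  = 18;
   p0 = 0.20;

   phat = x / n;                        /* sample proportion             */
   ase0 = SQRT(p0 * (1 - p0) / n);      /* standard error under H0       */
   z    = (phat - p0) / ase0;           /* asymptotic test statistic     */

   p_one_asym  = PROBNORM(z);           /* lower tail, because z < 0     */
   p_two_asym  = 2 * p_one_asym;

   p_one_exact = CDF('BINOMIAL', x, p0, n);   /* P(X <= 18) under H0     */
   p_two_exact = 2 * p_one_exact;

   PUT phat= ase0= z=;
   PUT p_one_asym= p_two_asym=;
   PUT p_one_exact= p_two_exact=;
RUN;
```

The values written to the log for phat, ase0 and z should agree, after rounding, with Proportion (P), ASE under H0 and Z in the Proc Freq output. The exact one-sided value is simply the lower-tail binomial probability, and the two-sided value is twice that, which is how the output labels it.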
Exact Test
One-sided Pr <=  P              1.415E-04
Two-sided = 2 * One-sided       2.829E-04

Chi-square Goodness of Fit Tests for Categorical Variables:

Use the chisq option in the tables statement to get a chi-square goodness of fit test, which can be used for categorical variables with two or more levels. By default SAS assumes that you wish to test the null hypothesis that the proportion of cases is equal in all categories. Use the testp= option to specify the proportions that you wish to test, if you don't want to assume equal proportions in all categories. The total of all the proportions must be 1.0. You can also use percentages, in which case the total must add up to 100%. Give the appropriate proportions in the testp= option, specifying them in order as they apply to each category.

TITLE "CHISQUARE GOODNESS OF FIT TEST";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES CHOLCAT / CHISQ TESTP=(.20 .30 .50);
   FORMAT CHOLCAT CHOLCATFMT.;
RUN;

The null hypothesis that we are testing is:

H0: π1 = 0.20, π2 = 0.30, π3 = 0.50

CHISQUARE GOODNESS OF FIT TEST

The FREQ Procedure

                                            Test     Cumulative    Cumulative
CHOLCAT           Frequency     Percent    Percent    Frequency      Percent
-----------------------------------------------------------------------------
1: Chol <200             38       20.43      20.00           38        20.43
2: Chol 200-239          61       32.80      30.00           99        53.23
3: Chol >=240            87       46.77      50.00          186       100.00

Frequency Missing = 2

Chi-Square Test for Specified Proportions
-----------------------------------------
Chi-Square        0.8889
DF                     2
Pr > ChiSq        0.6412

Effective Sample Size = 186
Frequency Missing = 2

Two-Sample Tests for Categorical Variables:

Chi-Square test of Independence

Two by Two Table:

If you wish to examine the relationship between two categorical variables, you can use Proc Freq. Use the chisq option to obtain the Pearson chi-square test of independence (or of homogeneity), and use the expected option to get the expected value in each cell. The commands below can be used to get a cross-tabulation.
In this case, we have a 2 by 2 table, because each categorical variable has two levels. We test:

H0: HIAGE is independent of HICHOL status
HA: HIAGE is not independent of HICHOL status

Note that Fisher's exact test is produced by default for a 2 x 2 table when the chisq option is specified. Read either the one-sided or the two-sided p-value for Fisher's exact test; both are at the bottom of the Fisher's Exact Test panel of the output.

TITLE "2x2 TABLE";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES HIAGE*HICHOL / CHISQ EXPECTED;
   FORMAT HIAGE HIAGEFMT. HICHOL HICHOLFMT.;
RUN;

2x2 TABLE

Table of HIAGE by HICHOL

HIAGE          HICHOL

Frequency   |
Expected    |
Percent     |
Row Pct     |
Col Pct     |1: Chol |2: Chol |  Total
            |>=240   |<240    |
------------+--------+--------+
1: Age >39  |     42 |     18 |     60
            | 28.065 | 31.935 |
            |  22.58 |   9.68 |  32.26
            |  70.00 |  30.00 |
            |  48.28 |  18.18 |
------------+--------+--------+
2: Age <=39 |     45 |     81 |    126
            | 58.935 | 67.065 |
            |  24.19 |  43.55 |  67.74
            |  35.71 |  64.29 |
            |  51.72 |  81.82 |
------------+--------+--------+
Total             87       99      186
               46.77    53.23   100.00

Frequency Missing = 2

Statistics for Table of HIAGE by HICHOL

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1     19.1914    <.0001
Likelihood Ratio Chi-Square    1     19.5296    <.0001
Continuity Adj. Chi-Square     1     17.8389    <.0001
Mantel-Haenszel Chi-Square     1     19.0882    <.0001
Phi Coefficient                       0.3212
Contingency Coefficient               0.3058
Cramer's V                            0.3212

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        42
Left-sided Pr <= F          1.0000
Right-sided Pr >= F      1.045E-05

Table Probability (P)    8.118E-06
Two-sided Pr <= P        1.741E-05

Effective Sample Size = 186
Frequency Missing = 2

Cochran-Armitage test for trend: R x 2 table, or 2 x C table

The Cochran-Armitage test for trend is appropriate when either the row or column variable is binary (has two levels) and the other variable is ordinal.
It tests whether there is a linear trend in the proportion of subjects having the binary characteristic. The Mantel-Haenszel test statistic tests for a linear by linear association and can be used when both row and column variables are ordinal; it always has 1 degree of freedom. In the table below, both the row and column variables could be considered to be ordinal, because a binary variable can be thought of as a very simple case of an ordinal variable.

TITLE1 "3X2 TABLE";
TITLE2 "THE ROW VARIABLE IS ORDINAL";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES AGEGROUP*HICHOL / CHISQ TREND NOCOL NOPERCENT;
   FORMAT AGEGROUP AGEFMT. HICHOL HICHOLFMT.;
RUN;

For the Cochran-Armitage test, we are testing the null hypothesis:

H0: There is no linear trend in the proportion of women with high cholesterol with increasing age

The two-sided test does not tell us whether the trend is in a positive or negative direction. To see that, simply examine the proportions of participants with high cholesterol in each age group.
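One way to look at those proportions directly is sketched below. This is an addition to the handout, not part of the original program. It creates a temporary 0/1 indicator (the data set name WORK.CHECKTREND and the variable HIGHCHOL01 are made up for this example) and asks Proc Means for its mean within each age group, which is the proportion with high cholesterol. The same proportions also appear as row percentages in the Proc Freq table.

```sas
/* Sketch (not from the original handout): proportion of women with high
   cholesterol in each age group, to see the direction of the trend. */
DATA WORK.CHECKTREND;
   SET B510.WERNER;
   /* 1 = high cholesterol, 0 = not high; stays missing if HICHOL is missing */
   IF HICHOL NOT=. THEN HIGHCHOL01 = (HICHOL = 1);
RUN;

TITLE "PROPORTION WITH HIGH CHOLESTEROL BY AGE GROUP";
PROC MEANS DATA=WORK.CHECKTREND N MEAN MAXDEC=4;
   CLASS AGEGROUP;
   VAR HIGHCHOL01;
   FORMAT AGEGROUP AGEFMT.;
RUN;
```

If the means rise from the youngest to the oldest age group, the trend is positive; if they fall, it is negative.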
3X2 TABLE
THE ROW VARIABLE IS ORDINAL

The FREQ Procedure

Table of AGEGROUP by HICHOL

AGEGROUP       HICHOL

Frequency    |
Row Pct      |1: Chol |2: Chol |  Total
             |>=240   |<240    |
-------------+--------+--------+
1: Age 19-29 |     25 |     47 |     72
             |  34.72 |  65.28 |
-------------+--------+--------+
2: Age 30-39 |     20 |     34 |     54
             |  37.04 |  62.96 |
-------------+--------+--------+
3: Age >39   |     42 |     18 |     60
             |  70.00 |  30.00 |
-------------+--------+--------+
Total              87       99      186

Frequency Missing = 2

Statistics for Table of AGEGROUP by HICHOL

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     2     19.2578    <.0001
Likelihood Ratio Chi-Square    2     19.6016    <.0001
Mantel-Haenszel Chi-Square     1     15.5677    <.0001
Phi Coefficient                       0.3218
Contingency Coefficient               0.3063
Cramer's V                            0.3218

Statistics for Table of AGEGROUP by HICHOL

Cochran-Armitage Trend Test
---------------------------
Statistic (Z)            3.9562
One-sided Pr >  Z        <.0001
Two-sided Pr > |Z|       <.0001

Effective Sample Size = 186
Frequency Missing = 2

Mantel-Haenszel test for a linear association between two ordinal categorical variables: R x C table, both row and column variables are ordinal

In the next table, both the row and column variables are ordinal. In this case the Mantel-Haenszel test is appropriate to test for a linear by linear association between the ordinal row variable and the ordinal column variable. The Pearson Chi-square test is appropriate for testing general association (H0: the row variable is independent of the column variable), whether or not the row and/or column variables are ordered. In a table like this, where both the row and column variables are ordered, the Pearson Chi-square test ignores the ordering of the variables.

TITLE "3X3 TABLE BOTH ORDINAL VARIABLES";
PROC FREQ DATA=B510.WERNER ORDER=INTERNAL;
   TABLES AGEGROUP*WTCAT / CHISQ NOCOL NOPERCENT;
   FORMAT AGEGROUP AGEFMT. WTCAT WTFMT.;
RUN;

3X3 TABLE BOTH ORDINAL VARIABLES

Table of AGEGROUP by WTCAT

AGEGROUP       WTCAT

Frequency    |
Row Pct      |1: Wt <1|2: Wt 12|3: Wt >=|  Total
             |20kg    |0-139kg |140kg   |
-------------+--------+--------+--------+
1: Age 19-29 |     24 |     34 |     16 |     74
             |  32.43 |  45.95 |  21.62 |
-------------+--------+--------+--------+
2: Age 30-39 |     15 |     26 |     12 |     53
             |  28.30 |  49.06 |  22.64 |
-------------+--------+--------+--------+
3: Age >39   |     10 |     22 |     27 |     59
             |  16.95 |  37.29 |  45.76 |
-------------+--------+--------+--------+
Total              49       82       55      186

Frequency Missing = 2

Statistics for Table of AGEGROUP by WTCAT

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     4     11.7418    0.0194
Likelihood Ratio Chi-Square    4     11.4638    0.0218
Mantel-Haenszel Chi-Square     1      8.7820    0.0030
Phi Coefficient                       0.2513
Contingency Coefficient               0.2437
Cramer's V                            0.1777

We now look at some examples using the Cars.sas7bdat SAS data set. We first use Proc Contents to learn what variables are in the data set, and the types of all the variables.

title;
proc contents data=b510.cars;
run;

The CONTENTS Procedure

Data Set Name        B510.CARS                              Observations           406
Member Type          DATA                                   Variables              8
Engine               V9                                     Indexes                0
Created              Monday, August 21, 2006 09:41:24 PM    Observation Length     64
Last Modified        Monday, August 21, 2006 09:41:24 PM    Deleted Observations   0
Protection                                                  Compressed             NO
Data Set Type                                               Sorted                 NO
Label
Data Representation  WINDOWS_32
Encoding             wlatin1 Western (Windows)

Alphabetic List of Variables and Attributes

#    Variable    Type    Len    Format    Label
5    ACCEL       Num       8    4.        Time to Accelerate from 0 to 60 mph (sec)
8    CYLINDER    Num       8    1.        Number of Cylinders
2    ENGINE      Num       8    5.        Engine Displacement (cu. inches)
3    HORSE       Num       8    5.        Horsepower
1    MPG         Num       8    4.        Miles per Gallon
7    ORIGIN      Num       8    1.        Country of Origin
4    WEIGHT      Num       8    4.        Vehicle Weight (lbs.)
6    YEAR        Num       8    2.        Model Year (modulo 100)

proc format;
   value originfmt 1="USA"
                   2="Europe"
                   3="Japan";
run;

Output from the SAS log is shown below.
Because this format had already been defined in the current run of SAS, there is a note in the log stating that it is already on the library. If this format were to be resubmitted with new values, the new values would overwrite the old values.

142  proc format;
143     value originfmt 1="USA"
144                     2="Europe"
145                     3="Japan";
NOTE: Format ORIGINFMT is already on the library.
NOTE: Format ORIGINFMT has been output.
146  run;

We now take a look at a 3 by 5 table (the row variable has 3 levels and the column variable has 5 levels) to see if there is any association between Country of Origin and Number of Cylinders. The Pearson chi-square test may be appropriate here, but we need to check the expected counts first.

title "Row variable is nominal, column variable is ordinal";
proc freq data = b510.cars;
   tables origin*cylinder / chisq expected;
   format origin originfmt.;
run;

Row variable is nominal, column variable is ordinal

Table of ORIGIN by CYLINDER

ORIGIN(Country of Origin)     CYLINDER(Number of Cylinders)

Frequency|
Expected |
Percent  |
Row Pct  |
Col Pct  |       3|       4|       5|       6|       8|  Total
---------+--------+--------+--------+--------+--------+
USA      |      0 |     72 |      0 |     74 |    107 |    253
         | 2.4988 | 129.31 | 1.8741 | 52.474 | 66.842 |
         |   0.00 |  17.78 |   0.00 |  18.27 |  26.42 |  62.47
         |   0.00 |  28.46 |   0.00 |  29.25 |  42.29 |
         |   0.00 |  34.78 |   0.00 |  88.10 | 100.00 |
---------+--------+--------+--------+--------+--------+
Europe   |      0 |     66 |      3 |      4 |      0 |     73
         |  0.721 | 37.311 | 0.5407 | 15.141 | 19.286 |
         |   0.00 |  16.30 |   0.74 |   0.99 |   0.00 |  18.02
         |   0.00 |  90.41 |   4.11 |   5.48 |   0.00 |
         |   0.00 |  31.88 | 100.00 |   4.76 |   0.00 |
---------+--------+--------+--------+--------+--------+
Japan    |      4 |     69 |      0 |      6 |      0 |     79
         | 0.7802 | 40.378 | 0.5852 | 16.385 | 20.872 |
         |   0.99 |  17.04 |   0.00 |   1.48 |   0.00 |  19.51
         |   5.06 |  87.34 |   0.00 |   7.59 |   0.00 |
         | 100.00 |  33.33 |   0.00 |   7.14 |   0.00 |
---------+--------+--------+--------+--------+--------+
Total           4      207        3       84      107      405
             0.99    51.11     0.74    20.74    26.42   100.00

Frequency Missing = 1

Statistics for Table of ORIGIN by CYLINDER

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     8    185.7937    <.0001
Likelihood Ratio Chi-Square    8    217.1249    <.0001
Mantel-Haenszel Chi-Square     1    129.7702    <.0001
Phi Coefficient                       0.6773
Contingency Coefficient               0.5608
Cramer's V                            0.4789

WARNING: 40% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.

Effective Sample Size = 405
Frequency Missing = 1

Because the table contains a high proportion of small expected values (expected values less than 5), SAS gives a warning message in the output. In this case, we can use Fisher's exact test. Here are the commands we first try to use:

title "Row variable is nominal, column variable is ordinal";
proc freq data = b510.cars;
   tables origin*cylinder / chisq expected;
   exact fisher;
   format origin originfmt.;
run;

The following message in the SAS log warns us that this may take a long time. We interrupted processing by clicking on the "break" button, which looks like a circle around an exclamation point (!).

160  proc freq data = b510.cars;
161     tables origin*cylinder / chisq;
162     exact fisher;
163     format origin originfmt.;
164  run;
WARNING: Computing exact p-values for this problem may require much time and memory. Press the
         system interrupt key to terminate exact computations.
NOTE: There were 406 observations read from the data set B510.CARS.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           31.02 seconds
      cpu time            23.54 seconds

We now resubmit the commands, using instead the Monte Carlo option in SAS (mc). This gives a good approximation to the Fisher's exact test p-value, based on 10,000 randomly chosen tables rather than on complete enumeration.

title "Row variable is nominal, column variable is ordinal";
title2 "Try Fisher's Exact test";
proc freq data = b510.cars;
   tables origin*cylinder / chisq expected;
   exact fisher / mc;
   format origin originfmt.;
run;

The output for these tests is shown below. The appropriate p-value is the portion labeled Pr <= P.
SAS also reports a 99% lower and upper confidence limit for the p-value. When a p-value is displayed as 0.0000, report it as p < 0.0001 rather than p = 0.0000. Note that we did not specify an initial seed for the Monte Carlo simulation, so SAS chose one for us. This seed is reported at the bottom of the output. You can regenerate the same results by specifying the seed in your exact statement, as shown below.

exact fisher / mc seed=210470001;

Statistics for Table of ORIGIN by CYLINDER

Fisher's Exact Test
----------------------------------
Table Probability (P)    3.582E-49

Monte Carlo Estimate for the Exact Test
Pr <= P                     0.0000
99% Lower Conf Limit        0.0000
99% Upper Conf Limit     4.604E-04

Number of Samples            10000
Initial Seed             210470001

Effective Sample Size = 405
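For completeness, the sketch below (an addition, not part of the original program) puts the seed into the full Proc Freq step and writes out the number of samples and the confidence level explicitly. N=10000 and ALPHA=.01 are assumed here to match the defaults implied by the output above (10000 samples, 99% limits), so stating them should not change the results. The short data step that follows is also an illustration: when none of the sampled tables is as extreme as the observed one, the reported upper limit of 4.604E-04 agrees with 1 - alpha**(1/N).

```sas
/* Sketch (not from the original handout): reproducible Monte Carlo
   estimate of the Fisher's exact test p-value. N= and ALPHA= are written
   out only to make the settings visible. */
title "Row variable is nominal, column variable is ordinal";
title2 "Monte Carlo estimate of Fisher's Exact test, fixed seed";
proc freq data = b510.cars;
   tables origin*cylinder / chisq expected;
   exact fisher / mc n=10000 alpha=.01 seed=210470001;
   format origin originfmt.;
run;

/* Check of the 99% upper confidence limit when the estimate is 0:
   with 0 of 10000 sampled tables as extreme as the observed table,
   the upper limit is 1 - alpha**(1/nsamples). */
data _null_;
   nsamples = 10000;
   alpha    = 0.01;
   upper    = 1 - alpha**(1/nsamples);
   put upper=;
run;
```

The value written to the log should round to 4.604E-04, the upper limit shown in the output. Increasing N= narrows this interval at the cost of more computing time.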