Chapter 9 Producing Descriptive Statistics PROC MEANS; Summarize descriptive statistics for continuous numeric variables. PROC FREQ; Summarize frequency tables for discrete numeric variables or categorical variables. Objectives • Compute statistical summaries such as mean, median, std, min, max, and so on for numeric continuous variables • Control # of decimals for reporting the summary statistics • Difference between PROC MEANS and PROC SUMMARY procedures. • Create one-way frequency table • Create 2-way, n-way cross frequency table 2 PROC MEANS Output Salary by Job Code The MEANS Procedure Analysis Variable : Salary Job N Code Obs N Mean Std Dev Minimum Maximum ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ FLTAT1 14 14 25642.86 2951.07 21000.00 30000.00 FLTAT2 18 18 35111.11 1906.30 32000.00 38000.00 FLTAT3 12 12 44250.00 2301.19 41000.00 48000.00 PILOT1 8 8 69500.00 2976.10 65000.00 73000.00 PILOT2 9 9 80111.11 3756.48 75000.00 86000.00 PILOT3 8 8 99875.00 7623.98 92000.00 112000.00 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 3 Calculating Summary Statistics for Numeric Variables The MEANS procedure displays simple descriptive statistics for the numeric variables in a SAS data set. General form of a simple PROC MEANS step: PROC MEANS DATA=SAS-data-set; RUN; Example: proc means data=mylib.crew; title 'Salary Analysis'; run; 4 Calculating Summary Statistics Salary Analysis The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ HireDate 69 9812.78 1615.44 7318.00 12690.00 Salary 69 52144.93 25521.78 21000.00 112000.00 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ NOTE: PROC MEANS computes summary statistics for any variable we want. However, it is meaningless to compute some variables, such as Hiredate. 5 Calculating Summary Statistics By default, PROC MEANS – analyzes every numeric variable in the SAS data set – prints the statistics N, MEAN, STD, MIN, and MAX – excludes missing values before calculating statistics. 6 Specifying summary statistics to be computed PROC MEANS data = mylib.crew mean median range std ; To specify the summary statistics to be computed, add them to the PROC MEANS statement as options. Limitting Decimal Places By default, RPOC MEANS uses the BEST. Format to display values in the report. It can be many decimal places such as 52.000000 To specify the # of decimal places to k places: PROC MEAN Data = Mylib.crew MAXDEC=k ; Maxdec =2 will result in 2 decimals in the report. Maxdec =0 will result in no decimal place in the report. 8 Selecting Variables The VAR statement restricts the variables processed by PROC MEANS. General form of the VAR statement: VAR SAS-variable(s); 9 Selecting Variables Mylib.crew HireDate LastName FirstName Location Phone EmpID JobCode Salary 07NOV1992 BEAUMONT SALLY T. LONDON 1132 E00525 PILOT1 72000 1151 E02466 FLTAT3 41000 BARBARA ANN FRANKFURT 1163 E00802 PILOT2 81000 12MAY1985 BERGAMASCO CHRISTOPHER CARY 04AUG1988 BETHEA proc means data=Mylib.crew; var Salary; title 'Salary Analysis'; run; Salary Analysis The MEANS Procedure Analysis Variable : Salary N Mean Std Dev Minimum Maximum ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 69 52144.93 25521.78 21000.00 112000.00 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 10 Grouping Observations Using CLASS statement The CLASS statement in the MEANS procedure groups the observations of the SAS data set for analysis. General form of the CLASS statement: CLASS SAS-variable(s); 11 Grouping Observations Mylib.crew HireDate LastName FirstName Location Phone EmpID JobCode Salary 07NOV1992 BEAUMONT SALLY T. LONDON 1132 E00525 PILOT1 72000 1151 E02466 FLTAT3 41000 BARBARA ANN FRANKFURT 1163 E00802 PILOT2 81000 12MAY1985 BERGAMASCO CHRISTOPHER CARY 04AUG1988 BETHEA proc means data=mylib.crew maxdec=2; var Salary; class JobCode; title 'Salary by Job Code'; run; NOTE: The MAXDEC= option controls the number of decimal places displayed in the output. 12 Grouping Observations using CLASS statement Salary by Job Code The MEANS Procedure Analysis Variable : Salary Job N Code Obs N Mean Std Dev Minimum Maximum ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ FLTAT1 14 14 25642.86 2951.07 21000.00 30000.00 FLTAT2 18 18 35111.11 1906.30 32000.00 38000.00 FLTAT3 12 12 44250.00 2301.19 41000.00 48000.00 PILOT1 8 8 69500.00 2976.10 65000.00 73000.00 PILOT2 9 9 80111.11 3756.48 75000.00 86000.00 PILOT3 8 8 99875.00 7623.98 92000.00 112000.00 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ The summary is displayed based on the order of the categories of the CLASS variable. Variables in CLASS statement can be character or numeric. 13 Results due to CLASS Statement • The summary is displayed based on the order of the categories of the CLASS variable. • Variables in CLASS statement can be character or numeric. It is important to make sure you do not use continuous numeric variable in the CLASS statement. • If there are two or more variables in CLASS statement, the order of the variables in the CLASS statement determined the order in the output report. 14 PROC MEAN procedure Using BY Statement PROC MEANS; VAR variable list ; BY Variable; • It is important to know that when using BY statement, the data set MUST be sorted in ascending order based on the variables in the BY statement first using PROC SORT. •The result using BY statement is displayed as separate tables each is for the category of the variable in the BY statement. •If there are two or more variables in the BY statement, the order determines the order of the displayed tables in the report. 15 Exercise Write a program to read diabetes data set and use PROC Means to produce summary statistics • for variables Age, Height and Weight. Run the program and see the results. • Produce the summary statistics N, mean median, max, min std, and range, and Set decimal places to two by Maxdec =2. Run the program and see the results. • Ass the CLASS statement to produce summary results for each sex. Run the program to see the results. • Practice using BY statement for each sex. Before you add the BY SEX statement, Make sure you sort the data by SEX. Run the program and see the result. • Add a WHERE statement to select cases for AGE > 30 to the program. Run the program and see the results. 16 Create data set for summary statistics in PROC MEANS In many occasions, we may want to create a SAS data set consisting of the summary statistics calculated by PROC MEANS. OUTPUT OUT=sas-data-set summary-keyword(s) = variablename(s); NOTE: summary-keywords are: Mean, Min, Max, Range, Std, etc. Variablenames are the variable names you want to call for each summary statistics for each variable. 17 Create Summary Data Set using PROC MEANS Examples: PROC MEANS data= mylib.crew; VAR Hiredate salary; OUTPUT OUT = mylib.discrip mean = avghiredate avgsalary Median= medhiredate medsalary; Run; 18 Exercise Revise the following program to do the following task: Use the OUTPUT OUT= statement to save the summary statistics Mean, Median and Std to a sas data set dia_summary, then print this data set to see what’s in there. PROC MEANS data = mylib.diabetes maxdec =2 ; var age height weight; class sex; run; 19 PROC SUMMARY procedure PROC SUMMARY procedure uses the same program codes as PROC MEANS. PROC SUMMARY does not produce report by default. In order to produce the report, you need to add PRINT as the option: PROC SUMMARY data = sasdataset PRINT; When do we use PROC SUMMARY? If you only want to produce and save the summary to a SAS data set, you can use PROC SUMMARY. OR you can use the option: NOPRINT in PROC MEANS. PROC FREQ procedure Objectives – Generate simple descriptive statistics using the MEANS procedure. – Group observations of a SAS data set for analysis using the CLASS statement in the MEANS procedure. – Create one-way and two-way frequency tables using the FREQ procedure. – Restrict the variables processed by the FREQ procedure. 21 PROC FREQ Output Distribution of Job Code Values The FREQ Procedure Job Cumulative Cumulative Code Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ FLTAT1 14 20.29 14 20.29 FLTAT2 18 26.09 32 46.38 FLTAT3 12 17.39 44 63.77 PILOT1 8 11.59 52 75.36 PILOT2 9 13.04 61 88.41 PILOT3 8 11.59 69 100.00 22 Goal Report 1 International Airlines wants to know how many employees are in each job code. Distribution of Job Code Values The FREQ Procedure Job Cumulative Cumulative Code Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ FLTAT1 14 20.29 14 20.29 FLTAT2 18 26.09 32 46.38 FLTAT3 12 17.39 44 63.77 PILOT1 8 11.59 52 75.36 PILOT2 9 13.04 61 88.41 PILOT3 8 11.59 69 100.00 23 Goal Report 2 Categorize job code and salary values to determine how many employees fall into each group. Salary Distribution by Job Codes The FREQ Procedure Table of JobCode by Salary JobCode Salary Frequency Percent Row Pct Col Pct ‚ ‚ ‚ ‚Less tha‚25,000 t‚More tha‚ Total ‚n 25,000‚o 50,000‚n 50,000‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Flight Attendant ‚ 5 ‚ 39 ‚ 0 ‚ 44 ‚ 7.25 ‚ 56.52 ‚ 0.00 ‚ 63.77 ‚ 11.36 ‚ 88.64 ‚ 0.00 ‚ ‚ 100.00 ‚ 100.00 ‚ 0.00 ‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Pilot ‚ 0 ‚ 0 ‚ 25 ‚ 25 ‚ 0.00 ‚ 0.00 ‚ 36.23 ‚ 36.23 ‚ 0.00 ‚ 0.00 ‚ 100.00 ‚ ‚ 0.00 ‚ 0.00 ‚ 100.00 ‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 5 39 25 69 7.25 56.52 36.23 100.00 24 Creating a Frequency Report • PROC FREQ displays frequency counts of the data values in a SAS data set. • General form of a simple PROC FREQ step: PROC FREQ DATA=SAS-data-set; RUN; Example: proc freq data=mylib.crew; run; 25 Creating a Frequency Report • By default, PROC FREQ – analyzes every variable in the SAS data set – displays each distinct data value – calculates the number of observations in which each data value appears (and the corresponding percentage) – indicates for each variable how many observations have missing values. 26 Default Frequency Reports mylib.crew HireDate LastName • 07NOV1992 BEAUMONT FirstName Location Phone EmpID JobCode Salary SALLY T. LONDON 1132 E00525 PILOT1 72000 1151 E02466 FLTAT3 41000 BARBARA ANN FRANKFURT 1163 E00802 PILOT2 81000 12MAY1985 BERGAMASCO CHRISTOPHER CARY 04AUG1988 BETHEA Distribution of proc freq data=mylib.crew; run; HireDate Distribution of Salary Distribution of Distribution of LastName JobCode Distribution of FirstName Distribution of Distribution of Distribution of Location Phone EmpID 27 ... One-Way Frequency Report • Use the TABLES statement to limit the variables included in the frequency counts. These are typically variables that have a limited number of distinct values. • General form of a PROC FREQ step with a TABLES statement: PROC FREQ DATA=SAS-data-set ; TABLES SAS-variables / NOCUM; RUN; NOCUM option in the TABLES statement suppress Cumulative frequency and Cumulative percentage 28 Creating a Frequency Report proc freq data=mylib.crew; tables JobCode; title 'Distribution of Job Code Values'; run; Distribution of Job Code Values The FREQ Procedure Job Cumulative Cumulative Code Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ FLTAT1 14 20.29 14 20.29 FLTAT2 18 26.09 32 46.38 FLTAT3 12 17.39 44 63.77 PILOT1 8 11.59 52 75.36 PILOT2 9 13.04 61 88.41 PILOT3 8 11.59 69 100.00 29 Using PROC FORMAT to redefine Categories of Values in TABLES statement International Airlines wants to use formats to categorize the flight crew by job code. Stored values Formatted values PILOT1 PILOT2 PILOT3 Pilot FLTAT1 FLTAT2 FLTAT3 Flight Attendant 30 Analyzing Categories of Values proc format; • value $codefmt 'FLTAT1'-'FLTAT3'='Flight Attendant' 'PILOT1'-'PILOT3'='Pilot'; run; proc freq data = mylib.crew; format JobCode $codefmt.; tables JobCode; run; NOTE: The original data values for Jobocde are not changed. They are still FLTAT1 FLTAT2, and so on. 31 Analyzing Categories of Values Distribution of Job Code Values The FREQ Procedure Cumulative Cumulative JobCode Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Flight Attendant 44 63.77 44 63.77 Pilot 25 36.23 69 100.00 32 Crosstabular Frequency Reports • A two-way, or crosstabular, frequency report analyzes all possible combinations of the distinct values of two variables. • The asterisk (*) operator in the TABLES statement is used to cross variables. • General form of the FREQ procedure to create a crosstabular report: PROC FREQ DATA=SAS-data-set; TABLES variable1*variable2; RUN; Variable1 is ROW and Variable2 is Column 33 Crosstabular Frequency Reports proc format; • value $codefmt 'FLTAT1'-'FLTAT3'='Flight Attendant' 'PILOT1'-'PILOT3'='Pilot'; value money low-<25000 ='Less than 25,000' 25000-50000='25,000 to 50,000' 50000<-high='More than 50,000'; run; proc freq data=mylib.crew; tables JobCode*Salary; format JobCode $codefmt. Salary money.; title 'Salary Distribution by Job Codes'; run; 34 Crosstabular Frequency Reports Salary Distribution by Job Codes The FREQ Procedure Table of JobCode by Salary JobCode Salary Frequency Percent Row Pct Col Pct ‚ ‚ ‚ ‚Less tha‚25,000 t‚More tha‚ Total ‚n 25,000‚o 50,000‚n 50,000‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Flight Attendant ‚ 5 ‚ 39 ‚ 0 ‚ 44 ‚ 7.25 ‚ 56.52 ‚ 0.00 ‚ 63.77 ‚ 11.36 ‚ 88.64 ‚ 0.00 ‚ ‚ 100.00 ‚ 100.00 ‚ 0.00 ‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Pilot ‚ 0 ‚ 0 ‚ 25 ‚ 25 ‚ 0.00 ‚ 0.00 ‚ 36.23 ‚ 36.23 ‚ 0.00 ‚ 0.00 ‚ 100.00 ‚ ‚ 0.00 ‚ 0.00 ‚ 100.00 ‚ ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 5 39 25 69 7.25 56.52 36.23 100.00 35 Additional Syntax for TABLES statement in PROC FREQ; statement Syntax Equivalent to tables A*(B C); tables A*B A*C; tables (A B)*(C D); tables A*C B*C A*D B*D; tables (A B C)*D; tables A*D B*D C*D; tables A - - C; tables A B C; tables (A - - C)*D; tables A*D B*D C*D TABLES A*B*C; Produces separate two-way tables of B*C for each value of A. To Suppress some columns in the PROC FREQ summary report PROC FREQ; TABLES var1*var2/ <OPTIONS>; Options for suppressing cell frequency: NOFREQ Options for suppressing cell percent: NOPERCENT Options for suppressing ROW percent: NOROW Options for suppressing COLUMN percent: NOCOL Additional usages of PROC FREQ statement In addition to reporting tables, PROC FREQ; statement also conduct many statistical tests for analyzing categorical data such as Chi-square test, Cochran-Mantel-Haenszel test, Fisher’s exact test, Kappa coefficient, Risk, Odds ratio and so on. This is beyond the programming course. Exercise The Diabetes data set consists of Sex, Age, Height, Weight, Pulse FastGluc PostGluc for 20 patients. Revise the following program by using PROC FREQ procedure to perform the following tasks: 1. Use IF statement to create AGE_G variable : IF AGE > 45 then, AGE_G = ‘Senior’ , otherwise AGE_G = ‘Young’. Create one-way table for variables SEX , Age_G, and Pulse using user-defined format. Run the program and see the results. 2. Create cross tabular table sex*(Age_G Pulse), make sure the user-defined format is applied for Pulse variable. Run the program and see the results. 3. Suppressing ROW percent and Column percent. Run the program and see the results. proc format; value pulft LOW-70 = 'Low' 71-High = 'High'; run; data diab; set mylib.diabetes; run; 39