SAS Global Forum 2007 Coders’ Corner Paper 072-2007 Calculating Statistics Using PROC MEANS versus PROC SQL Jyotheeswara Naidu Yellanki, Newark, DE, USA. ABSTRACT: Base SAS provided the PROC MEANS, which was very powerful/flexible procedure used to perform descriptive statistical analysis. This lead to a widespread use of MEANS procedure, as a result it became very popular amongst the older generation analysts. The proliferation of Relational Data base Management System (RDBMS) in information technology world lead to the introduction of PROC SQL in SAS v6.0. Being as English like language, Structure Query Language (SQL) become very popular amongst the newer generation analysts. Current SAS developer community has some analysts who use PROC MEANS to do all of the statistical work and some who use PROC SQL to do their part of the statistical work. But PROC SQL can perform or cater majority of the functionality that a PROC MEANS can deliver. Hence there is no real need for either community to learn the other’s PROCs. However analysts who do maintenance or enhancement projects need to be familiar or proficient with both PROCs. This presentation is intended to explain and emphasize the similarities and differences between SQL and MEANS procedures with examples. It will also act as a good reference document with ideas for MEANS users on how to code equivalent SQL and vice versa. INTRODUCTION: PROC SQL is a widely used language for retrieving and updating data in tables/views. Mainly it is used to retrieve data from RDBMS, calculate the descriptive statistics or summarize the data. The MEANS procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. Both PROC MEANS and PROC SQL are now part of the base SAS product. This Paper will attempt to compare and contrast the data analysis of MEANS Procedure with equivalent methods in SQL procedure (with SQL Procedure we can do many things, but this paper will discuss about only SELECT statement). This paper will not access the efficiency or other system issues. SIMILARITIES AND DIFFERENCES: SYNTAX: THE GENERAL SYNTAX OF MEANS PROCEDURES IS: PROC MEANS DATA=sas-dataset-name <option(s)> <statistic-keyword(s)>; BY <DESCENDING> variable-1 <... <DESCENDING> variable-n><NOTSORTED>; CLASS variable(s) </ option(s)>; FREQ variable; ID variable(s); OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)> <id-group-specification(s)> <maximum-id-specification(s)> <minimum-id-specification(s)> </ option(s)> ; TYPES request(s); VAR variable(s) < / WEIGHT=weight-variable>; WAYS list; WEIGHT variable; THE GENERAL SYNTAX OF SQL PROCEDURES IS: SELECT <DISTINCT> object-item <,object-item>... <INTO :macro-variable-specification <, :macro-variable-specification>...> FROM from-list <WHERE sql-expression> <GROUP BY group-by-item <,group-by-item>...> <HAVING sql-expression> <ORDER BY order-by-item <,order-by-item>...>; Note: The order of the statements after PROC MEANS statement is not important. In PROC SQL procedure the order of the clauses are very important. NAMING CONVENTIONS: In SQL we will use RDBMS words and in MEANS we will use SAS words. The correlation between RDBMS words and SAS words are shown in the below table. 1 SAS Global Forum 2007 Coders’ Corner SAS Words Data Set Observations Variables RDBMS Words Table Rows Columns DESCRIPTIVE STATISTICS IN EACH PROCEDURE: Here is the sort list of main statistical functions that wither of PROCs can or can’t perform. Descriptive Statistics Keyword MEANS SQL CLM CSS CV KURTOSIS | KURT LCLM MAX MEAN | AVG MIN N NMISS PRT RANGE SKEWNESS | SKEW STDDEV | STD STDERR SUM SUMWGT UCLM USS VAR Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes No No Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes Yes No Yes Yes INPUT DATA SOURCE: In MEANS procedure the input data is supplied through DATA=sas-datset-name. If this is omitted then it will take the most recently created SAS data set in that session. In SQL procedure we must have to give the table name in the FROM Clause. OUTPUT FROM PROCEDURE : Both the procedures will print the output in OUTPUT window or we can create SAS dataset/table. But in MEANS procedure the format of the data display in the output window is not same as the data created in the output SAS dataset. Whereas in SQL it is same. DEFAULT STATISTICS: By default the MEANS procedure will produce N (counts), mean, standard deviation, max and min on all numeric variables in the dataset. The MEANS procedure will not perform any statistics on character variables. If there are no numeric variable in the dataset then PROC MEANS procedure will only give the number of observations(N) in that dataset. Here is the simple code. PROC MEANS DATA=sas-dataset-name; run; SQL procedure will not calculate any statistics by default. But the SQL procedure will calculate some statistics on character variable. For Example sex is a char variable in SASHELP.CLASS dataset, we can perform max, min, n and nmiss values. Proc SQL; Select max(sex),Min(sex),n(sex),nmiss(sex) From sashelp.class; CODE COMPARISON: Let us take couple of examples to compare the code between SQL and MEANS Calculating the simple statistics: First we will consider how the results are displayed in the output window without creating into dataset/table. 2 SAS Global Forum 2007 Coders’ Corner By default the MEANS procedure will produces N, mean, standard deviation, max and min on all numeric variables in the dataset. We select the required statistics by specifying the statistics keywords in the MEANS statement. Similarly we can restrict the variable on which you want perform the statistics by specifying the variable names in VAR statement. The following MEANS procedure will display the results in output window and calculate the SUM, MEAN and STD on Age and HEIGHT variable. PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ; VAR AGE HEIGHT; RUN; Variable Sum Mean Std Dev --------------------------------------------------------Age 253.00 13.32 1.49 Height 1184.40 62.34 5.13 NOTE: The option MXDEC= is used to limit the decimal places in the result. NONOBS is used to suppress reporting the total number of observations for each unique combination of the class variables To produce the same result as above, we have use the following SQL: PROC SQL; select 'Age' as Variable, sum(age) as Sum format 10.2, avg(age) as Mean format 10.2, std(age) as std format 10.2 label 'Std Dev' from sASHELP.CLASS union select 'Height' as Variable, sum(height) as Sum format 10.2, avg(height) as Mean format 10.2, std(height) as std format 10.2 label 'Std Dev' from sASHELP.CLASS; QUIT; GROUP PROCESSING USING CLASS/GROUP BY: Often we want statistics for grouped observations instead of for whole observations. Use the CLASS statement to calculate the statistics based on category. PROC MEANS DATA=SASHELP.CLASS NONOBS MAXDEC=2 SUM MEAN STD ; CLASS sex; VAR AGE HEIGHT; RUN; Sex Variable Sum Mean Std Dev --------------------------------------------------------------F Age 119.00 13.22 1.39 Height 545.30 60.59 5.02 M Age 134.00 13.40 1.65 Height 639.10 63.91 4.94 ---------------------------------------------------------------- To produce the similar result as above we have use the following SQL: PROC SQL; select sex as Sex, 'Age' as Variable, sum(age) as Sum format 10.2, avg(age) as Mean format 10.2, std(age) as std format 10.2 label 'Std Dev' from SASHELP.CLASS Group by select sex as Sex, 3 SAS Global Forum 2007 Coders’ Corner 'Height' as Variable, sum(height) as Sum format 10.2, avg(height) as Mean format 10.2, std(height) as std format 10.2 label 'Std Dev' from sASHELP.CLASS group by sex; quit; But the SQL output looks slightly different. Sex Variable Sum Mean Std Dev ------------------------------------------------F Age 119.00 13.22 1.39 F Height 545.30 60.59 5.02 M Age 134.00 13.40 1.65 M Height 639.10 63.91 4.94 The CLASS or GROUP BY variable can be numeric or char, but they should contain a limited number of discrete values that represents meaningful groupings. GROUP PROCESSING USING CLASS WITH TYPES STATEMENT IN PROC MEANS: By default Grouping will takes place all combinations of CLASS variables. By using TYPES statement we can select overall or individual CLASS variable. For example. data grade; input Name $ Gender $ datalines; Abbott F 2 97 A 90 David M 3 99 c 87 Dennison M 1 97 A 85 Nancy F 3 99 B 79 Greeley F 2 97 A 82 Mick M 3 98 c 77 Jasper M 1 97 B 91 Mary F 3 97 C 75 Billy M 2 98 C 77 Roy M 3 98 B 92 ; run; proc means data=grade class Status Year; var Score; types () status*year; run; Status $ 87 96 72 88 91 91 93 79 83 84 Year $ Section $ Score Branford Crandell Edgar Faust Hart Isley Ray Nick Taylor Mandy M M F M F M M M F F 1 2 1 1 1 2 3 2 3 3 97 98 98 97 98 99 97 98 98 97 A B B B B A B B A C 92 81 89 78 84 88 76 85 86 88 FinalGrade @@; 97 71 80 73 80 86 90 89 81 87 nonobs n mean sum maxdec=2; It will produce overall counts, means and sum on entire data as one output and by grouping status and year as separate output. The results looks like this. Output due to :Types () N Mean Std Dev ----------------------------------20 84.10 1682.00 ----------------------------------- Output due to: types status*year Status Year N Mean Sum ----------------------------------------------------------1 97 4 86.50 346.00 98 2 86.50 173.00 2 97 2 86.00 172.00 98 3 81.00 243.00 4 SAS Global Forum 2007 Coders’ Corner 99 1 88.00 88.00 97 3 79.67 239.00 98 3 85.00 255.00 99 2 83.00 166.00 ----------------------------------------------------------3 The SQL code to produce similar output as below. proc sql; select count(*) as N, avg(score) as Mean format 10.2 , std(score) as Sum format 10.2 from grade; quit; proc sql; select status,year, count(*) as N, avg(score) as Mean format 10.2 , sum(score) as Sum format 10.2 from grade group by status, year; quit; Exclude analysis variable values that are not in CLASSDATA. In PROC MEANS we can specify a secondary data set that contains the combinations of class variables to analyze and exclude from the analysis for all combinations of class variable values that are not in the CLASSDATA= data set For example we want select only the class variables combination listed below SAS dataset statyesr then use the SAS dataset name in CLASSDATA= option in PROC MEANS statement. data statyear; input Status $ Year $ @@; datalines; 1 97 1 98 2 97 2 98 3 97 3 99 ; run; proc means data=grade nonobs nway n mean sum maxdec=0 classdata=statyear exclusive; class status year; var score; The excluded output looks like Status Year N Mean Sum ----------------------------------------------------------1 97 4 86.50 346.00 98 2 86.50 173.00 2 97 2 86.00 172.00 98 3 81.00 243.00 3 97 3 79.67 239.00 99 2 83.00 166.00 ------------------------------------------------------------ We can produce the same result using SQL with sub quires proc sql; select status,year, count(*) as N, mean(score) as Mean format 10.2 , sum(score) as Sum format 10.2 from grade 5 SAS Global Forum 2007 Coders’ Corner where status||year in (select status||year from statyear) group by status, year; quit; Creating output SAS dataset from PROC MEANS / PROC SQL: All the examples listed above are producing the output in the printed report form. Now we discuss how to create the results in SAS datasets using PROC MEANS and PROC SQL. In PROC MEANS using OUTPUT statement we can create the SAS dataset. The syntax is OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)> The output-statistic-specification(s) may be one or more of the following forms. In each form, stat is a statistics request keyword. Form 1: stat=name-list. The name list specifies variables containing stat. The variables in name list have a one-to-one correspondence with the variable listed in the VAR statement. Form 2: stat (varlist) =name-list. The name list identifies variables containing statistic stat for analysis variables varlist. The variables in name list have a one-to-one correspondence with the variable listed in the VAR statement. Form 3: stat=. This form requests an output dataset that contains the same variables as in the VAR statement list. Form 4: stat(varlist)=. This form specifies that VAR statement variables in varlist represent statistic stat in the output dataset. This form and form 3 may naot be used in the same OUTPUT statement and vice versa. For example Proc means data=grade nonobs maxdec=2 noprint; Class status year; Var score finalgrade; Output out=sumgrade max=scr_max grade_max mean(score finalgrade) = scr_mean grade_mean sum=; format scr_mean grade_mean 10.1; Run; proc print data=sumgrade;run; This will create two automatic variables _type_ and _FREQ_. _TYPE_ identifies the level of summary. Value of _TYPE_ = 0 indicates the summary record for the entire dataset. Value of _TYPE_ = 1 identifies summary data for each level of year across all status(status is ignored). Value of _TYPE_ = 2 identifies summary data for each level of status across all year (year is ignored). Value of _TYPE_ = 3 identifies summary data contain statistics for each level of status within each level of year. Note: The option NWAY in the PROC MEANS statement will select the maximum value of _TYPE_. The _FREQ_ is the number of observations in each level of summary. The output from the above code will look like: Status Year _TYPE_ _FREQ_ scr_max 97 98 99 1 2 3 1 1 2 2 2 3 97 98 97 98 99 97 0 1 1 1 2 2 2 3 3 3 3 3 3 20 9 8 3 6 6 8 4 2 2 3 1 3 92 92 92 88 92 90 92 92 89 90 85 88 88 grade_ max 97 97 91 96 97 91 96 97 80 91 89 86 90 Final scr_mean grade_mean Score Grade 84.1 84.1 83.9 84.7 86.5 83.8 82.5 86.5 86.5 86.0 81.0 88.0 79.7 6 84.9 85.4 82.4 90.0 82.5 84.5 87.0 83.8 80.0 89.0 81.0 86.0 85.3 1682 757 671 254 519 503 660 346 173 172 243 88 239 1698 769 659 270 495 507 696 335 160 178 243 86 256 SAS Global Forum 2007 3 3 98 99 3 3 Coders’ Corner 3 2 92 87 91 96 85.0 83.0 85.3 92.0 255 166 256 184 To get the same results from PROC SQL, use the following code PROC SQL; Create table sumgrade as Select ' ' as Status, ' ' as Year ,0 as _TYPE_ , count(*) as _FREQ_, Max(score) as scr_max, max(finalgrade) as grade_mean , avg(score) as scr_max format 10.1, avg(finalgrade) as grade_mean format 10.1, sum(score) as score, sum(finalgrade) as finalgrade from grade union Select ' ' as Status, Year ,1 as _TYPE_ , count(*) as _FREQ_, Max(score) as scr_max, max(finalgrade) as grade_mean , avg(score) as scr_max format 10.1 , avg(finalgrade) as grade_mean format 10.1, sum(score) as score, sum(finalgrade) as finalgrade from grade group by year union Select Status, ' ' as Year ,2 as _TYPE_ , count(*) as _FREQ_, Max(score) as scr_max, max(finalgrade) as grade_mean , avg(score) as scr_max format 10.1 , avg(finalgrade) as grade_mean format 10.1, sum(score) as score, sum(finalgrade) as finalgrade from grade group by status union Select Status, Year ,3 as _TYPE_ , count(*) as _FREQ_, Max(score) as scr_max, max(finalgrade) as grade_mean , avg(score) as scr_max format 10.1 , avg(finalgrade) as grade_mean format 10.1, sum(score) as score, sum(finalgrade) as finalgrade from grade group by status,year order by _TYPE_ ; quit; proc print data=sumgrade; run; Here is table with the summaries of our findings. Description MEANS SQL Input data Output dataset Analysis variables Grouping Variables Statistics to be performed DATA=dataset-name OUTPUT OUT=temp VAR var1 var2 CLASS var1 var2 Specify in PROC MEANS statement FROM table-name CREATE TABLE temp AS SELECT stat(var1),stat(var2) GROUP BY var1,var2 Specify in SELECT statement Note: in the above table the stat is any statistical function. CONCLUSION: This paper covered frequently used options and statements in PROC MEANS and equivalent code using PROC SQL.. This paper will be a good reference document those who are proficient in PROC MEANS and need work with PROC SQL and vice versa. There are certain functionality that are provided by PROC MEANS which can’t be replicated by using PROC SQL and vice versa. REFERENCES: SAS Institute Inc., SAS OnlineDoc ® 9, http://v9doc.sas.com/sasdoc/ SAS Institute Inc., Base SAS 9.1 ® Procedures Guide. 7 SAS Global Forum 2007 Coders’ Corner CONTACT INFORMATION: Your comments and questions are valued and encouraged. Contact the author at: Jyotheeswara Naidu Yellanki Email: yjnaidu@hotmail.com SAS and all other SAS Institute Inc., product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® Indicates USA registration. Other brand and product names are trademarks of their respective companies. 8