NESUG 2007 Foundations & Fundamentals

Summing with SAS®
Tatiana Homonoff, MDRC, New York, NY

ABSTRACT
This paper reviews methods used to sum data, including horizontal summation (summing data across variables), vertical summation (summing data across observations), and cumulative summation (summing data across both observations and variables to create running totals). Techniques include the addition operator, the SUM function, PROC PRINT, PROC MEANS, PROC SQL, the RETAIN statement, FIRST./LAST. processing, and the SUM statement. Special attention is paid to how SAS® handles missing values in each technique.

INTRODUCTION
Summing data seems like a simple concept, but there are actually many complexities. Depending on the data set and the analysis question, one should use different methods. This paper presents four analysis questions involving summation and answers them using the most appropriate SAS technique.

SAMPLE DATA SET
Below is a data set, FINANCIALS, that contains weekly revenue and cost information for two branches of a toy store in the weeks around Christmas. The two branches (LOCATION) are Boston and New York. Revenue comes from three different product lines: toys (REV1), games (REV2), and books (REV3). There is also one cost variable (COSTS). DATE indicates the start date of each holiday week.

    location  date          rev1     rev2    rev3     costs
    Boston    12/15/2006    $500     $150    $300     -$200
    Boston    12/22/2006    $300     $500    $200     -$200
    Boston    12/29/2006    $100     $200     $50     -$200
    Boston    1/6/2007      $100     $100    $150     -$200
    New York  12/15/2006       .     $600    $400     -$400
    New York  12/22/2006    $700     $600    $250     -$400
    New York  12/29/2006    $200     $300    $100     -$400
    New York  1/6/2007       $50     $100     $50     -$400
              1/6/2007      $950   $1,600    $800   -$1,600

Note that this data set has missing values for two variables: REV1 and LOCATION. In these data, a missing value of REV1 means that there was no revenue from toys in that week (i.e., a value of missing is equivalent to a value of zero, so the value is not, in fact, “missing”).
On the other hand, the observation with missing data for LOCATION does not look like good data. It appears to be a total of the revenues and costs of the New York branch over the four holiday weeks that was included in this data set by accident. This paper describes how to deal with both good and bad missing data so as to avoid unexpected results.

ANALYSIS QUESTIONS
The owner of the toy store is interested in determining whether certain locations or certain product lines are more profitable than others. This paper will demonstrate various summation techniques in SAS to answer the following analysis questions:

1. What is the weekly profit for each branch?
2. How much revenue did each product line bring in during the holiday period (and did this differ by branch)?
3. How much profit did each branch earn during the holiday period, in dollars and as a percentage of total profits?
4. What is the cumulative profit to date of this company by branch at each week?

HORIZONTAL SUMMATION
Horizontal summation refers to adding values across variables within each observation (or row). Since each observation in our data set has revenue and cost information by week and branch, horizontal summation is used to answer the first question:

• What is the weekly profit for each branch?

The total weekly profit for each branch and week is the sum of revenue from the three product lines minus costs, computed within each observation. In this data set the costs are actually added, since the COSTS variable stores negative numbers.

THE ADDITION OPERATOR
The first and most straightforward horizontal summation method is the addition operator (+). The addition operator returns a numeric value that is the sum of its operands. The code to produce profit using this method is below. It also deletes the observation in which the branch location is missing, since this observation was erroneously included in the data.

    data financials2;
        set financials;
        /* horizontal sum: missing if any operand is missing */
        profit = rev1 + rev2 + rev3 + costs;
        /* drop the erroneous observation with no branch */
        where location ne ' ';
    run;

The resulting data set, FINANCIALS2, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit
    Boston    12/15/2006     500    150    300    -200      750
    Boston    12/22/2006     300    500    200    -200      800
    Boston    12/29/2006     100    200     50    -200      150
    Boston    1/6/2007       100    100    150    -200      150
    New York  12/15/2006       .    600    400    -400        .
    New York  12/22/2006     700    600    250    -400     1150
    New York  12/29/2006     200    300    100    -400      200
    New York  1/6/2007        50    100     50    -400     -200

This method produces the expected value of profit when there are complete data (i.e., no missing values) for all revenue and cost variables, but it assigns a missing value to the profit variable when any of the operands is missing (as is the case for the toy revenue in the New York branch for the week beginning December 15th). Depending on the data set, this might be the desired result. If there is truly missing revenue data, the profit variable should be missing as well; if the missing data were ignored, SAS would produce a numeric value for the profit variable that would look complete but in reality would be inaccurate, since it would contain revenue from only two of the three product lines.

As mentioned above, however, in this data set revenue is missing when there was no revenue for that product line and is, in fact, “good” data. So rather than assigning the profit variable a value of missing when there are missing revenue data, SAS should ignore the arguments with missing values and generate the profit variable using only the arguments with non-missing values. This requires a different technique.

THE SUM FUNCTION
The SUM function is another horizontal summation method. It works in the same way that the addition operator does but handles missing values differently. The SUM function ignores missing values and excludes them from the summation. If all values of the arguments are missing, it returns a missing value.
If there is even one non-missing argument, however, it returns the sum. The code to produce profit using the SUM function is:

    profit = SUM(rev1, rev2, rev3, costs);

The resulting data set, FINANCIALS2, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit
    Boston    12/15/2006     500    150    300    -200      750
    Boston    12/22/2006     300    500    200    -200      800
    Boston    12/29/2006     100    200     50    -200      150
    Boston    1/6/2007       100    100    150    -200      150
    New York  12/15/2006       .    600    400    -400      600
    New York  12/22/2006     700    600    250    -400     1150
    New York  12/29/2006     200    300    100    -400      200
    New York  1/6/2007        50    100     50    -400     -200

SAS ignored the argument with a missing value when calculating the profit variable, which is the desired result. See the appendix for various shortcuts using the SUM function.

VERTICAL SUMMATION
Now that the first analysis question has successfully been answered, we turn to the second:

• How much revenue did each product line bring in during the holiday period (and did this differ by branch)?

While the first analysis question required summation of several variables within observations, the second analysis question requires summation of individual variables across observations. This is called vertical summation.

THE PRINT PROCEDURE
The simplest method of vertical summation is the PRINT procedure. PROC PRINT does not allow for the creation of any new variables, nor does it create a data set with summary variables. It does, however, print sums of the existing variables named in a SUM statement; the sums appear below the raw data. The WHERE statement excludes observations with missing branch data from being printed and from being included in the summation. Missing values of the variables in the SUM statement are ignored; if all values are missing, the total is zero.

    proc print data = financials;
        sum rev1 rev2 rev3;
        where location ne ' ';
    run;

SAS prints the following output:

    location  date          rev1   rev2   rev3   costs
    Boston    12/15/2006     500    150    300    -200
    Boston    12/22/2006     300    500    200    -200
    Boston    12/29/2006     100    200     50    -200
    Boston    1/6/2007       100    100    150    -200
    New York  12/15/2006       .    600    400    -400
    New York  12/22/2006     700    600    250    -400
    New York  12/29/2006     200    300    100    -400
    New York  1/6/2007        50    100     50    -400
                            ====   ====   ====
                            1950   2550   1500

A BY statement can also be used with PROC PRINT to sum revenue by branch. Both the totals by branch and the grand total are printed. Note that the data must be sorted by the BY variable.

    proc sort data=financials;
        by location;
    run;

    proc print data = financials;
        sum rev1 rev2 rev3;
        where location ne ' ';
        by location;
    run;

SAS prints the following output:

    location=Boston

    date          rev1   rev2   rev3   costs
    12/15/2006     500    150    300    -200
    12/22/2006     300    500    200    -200
    12/29/2006     100    200     50    -200
    1/6/2007       100    100    150    -200
                  ----   ----   ----
                  1000    950    700

    location=New York

    date          rev1   rev2   rev3   costs
    12/15/2006       .    600    400    -400
    12/22/2006     700    600    250    -400
    12/29/2006     200    300    100    -400
    1/6/2007        50    100     50    -400
                  ----   ----   ----
                   950   1600    800
                  ====   ====   ====
                  1950   2550   1500

THE MEANS PROCEDURE
The MEANS procedure is a data summarization tool used to calculate descriptive statistics for variables across all observations and within groups of observations. One of these statistics is SUM. This procedure is far more flexible than PROC PRINT, mainly because it can store results in an output data set that can be manipulated.

SUMMATION ACROSS ALL OBSERVATIONS
The first part of question two can be answered by using PROC MEANS to sum each of the three revenue variables across every observation in the data set. Note that without a WHERE statement, the summary variables would include the observation where the branch location was missing, thereby double-counting the revenue from New York.

    proc means data=financials noprint;
        var rev1 rev2 rev3;
        output out=revsum (drop=_type_ _freq_)
               sum(rev1-rev3)=revsum1-revsum3;
        where location ne ' ';
    run;

The resulting data set, REVSUM, looks as follows:

    revsum1   revsum2   revsum3
       1950      2550      1500

This data set contains the grand total of the revenue of each product line across all branches and weeks. Note that SAS ignores the missing revenue value in the first product line rather than generating a missing value for the total. There is no option in PROC MEANS that assigns a missing value to the summary variable when the analysis variable has missing values. If the NMISS statistic is specified, however, SAS will create a variable that counts the number of missing values in the specified analysis variable. This variable can be inspected to determine whether the summary variable is excluding any observations due to missing data.

SUMMATION BY GROUP USING THE BY STATEMENT
The second part of the second analysis question requires vertical summation by branch rather than across all observations.

• How much revenue did each product line bring in during the holiday period (and did this differ by branch)?

This can be accomplished by adding a BY statement to the MEANS procedure. The input data set must be sorted by the BY group variable (in this case, LOCATION). Note that PROC MEANS considers a missing value to be a legitimate BY group value. Without the WHERE statement, all observations with a missing branch location would be summed together as their own group. This can create unexpected results if there are many observations with missing branch location values.

    proc sort data=financials;
        by location;
    run;

    proc means data=financials noprint;
        var rev1 rev2 rev3;
        output out=revsum_bybranch (drop=_type_ _freq_)
               sum(rev1-rev3)=revsum1-revsum3;
        by location;
        where location ne ' ';
    run;

The resulting data set, REVSUM_BYBRANCH, looks as follows:

    location  revsum1   revsum2   revsum3
    Boston       1000       950       700
    New York      950      1600       800

THE SQL PROCEDURE
The third analysis question is:

• How much profit did each branch earn during the holiday period, in dollars and as a percent of total profits?

This question highlights some of the key limitations of PROC PRINT and PROC MEANS. PROC PRINT prints summary statistics but cannot store them in a SAS data set; therefore, they are not available for future calculations (e.g., creating branch profits as a percent of total profits). While PROC MEANS does allow summary statistics to be stored in an output data set, it only summarizes variables that already exist in the input data set. Since the PROFIT variable does not exist in the raw data set, it would have to be created in a DATA step before summing vertically using PROC MEANS. A second DATA step would then be required to create branch profits as a percent of total profits.

NESTED SUM FUNCTIONS
The SQL procedure provides a way to do all of this in one step. In PROC SQL, when multiple columns are specified in an aggregate function (like the SUM function), the function is evaluated across those columns within each row. If that SUM function is then nested in a second SUM function, SAS produces a grand total of the calculated variable across all observations. In other words, the inner SUM function performs the horizontal summation while the outer SUM function performs the vertical summation. This method can be used to calculate total profit over all weeks and branches. Note that SQL does not support variable lists in the SUM function. Observations with missing branch data should be removed with a WHERE clause.

    proc sql;
        create table financials_sum as
        select sum(sum(rev1,rev2,rev3,costs)) as branch_profit
        from financials
        where location ne ' ';
    quit;

The resulting data set, FINANCIALS_SUM, looks as follows:

    branch_profit
             3600

THE GROUP BY STATEMENT
However, the third analysis question asks for the total profits by branch, not overall profits. PROC SQL can vertically sum profit by branch to create the variable BRANCH_PROFIT, the dollar amount of profit by branch, using a GROUP BY statement. As with PROC MEANS, PROC SQL ignores missing values of the analysis variables in the SELECT statement, but it treats missing GROUP BY variables as valid data. Unlike PROC MEANS, PROC SQL does not require data to be sorted by the GROUP BY variable(s).

    proc sql;
        create table financials_sum as
        select location,
               sum(sum(rev1,rev2,rev3,costs)) as branch_profit
        from financials
        where location ne ' '
        group by location;
    quit;

The resulting data set, FINANCIALS_SUM, looks as follows:

    location  branch_profit
    Boston             1850
    New York           1750

THE SELECT STATEMENT SUBQUERY
The previous step calculated profit by branch in dollars, but the analysis question also asks for profit by branch as a percent of total profits. The first step is to add total profits to the data set created above. In PROC MEANS, this would require a second step; in PROC SQL, this can be done in the same step by using a subquery nested in parentheses. A subquery is a query that is nested in another query. This subquery is executed first. Note that the variable TOTAL_PROFIT that is created in the subquery must also be referenced in the outer query in order to be included in the created table. In order to exclude observations with missing branch information when calculating the BRANCH_PROFIT and TOTAL_PROFIT variables, there must be a WHERE clause in the subquery as well as in the outer query.

    proc sql;
        create table financials_sum as
        select location,
               sum(sum(rev1,rev2,rev3,costs)) as branch_profit,
               total_profit
        from financials,
             (select sum(sum(rev1,rev2,rev3,costs)) as total_profit
              from financials
              where location ne ' ')
        where location ne ' '
        /* total_profit is constant; grouping by it too keeps one row per branch */
        group by location, total_profit;
    quit;

The resulting data set, FINANCIALS_SUM, looks as follows:

    location  branch_profit  total_profit
    Boston             1850          3600
    New York           1750          3600

The final step in answering this analysis question is to create the variable BRANCH_PCT, the profits by branch as a percent of total profits. This, too, can be created in the same SQL procedure. Note that since BRANCH_PROFIT is calculated in the outer query, it must be preceded by the word “calculated,” but since TOTAL_PROFIT was calculated in the subquery, it is not. This is because the subquery result is added to all rows selected by the outer query.

    proc sql;
        create table financials_sum as
        select location,
               sum(sum(rev1,rev2,rev3,costs)) as branch_profit,
               total_profit,
               calculated branch_profit/total_profit as branch_pct format=percent8.2
        from financials,
             (select sum(sum(rev1,rev2,rev3,costs)) as total_profit
              from financials
              where location ne ' ')
        where location ne ' '
        /* total_profit is constant; grouping by it too keeps one row per branch */
        group by location, total_profit;
    quit;

The resulting data set, FINANCIALS_SUM, looks as follows:

    location  branch_profit  total_profit  branch_pct
    Boston             1850          3600      51.39%
    New York           1750          3600      48.61%

CUMULATIVE SUMMATION
The final analysis question is:

• What is the cumulative profit to date of this company by branch at each week?

The previous section showed that PROC SQL was far more versatile than the DATA step and PROC MEANS for answering the analysis questions. SQL, however, does not process rows (observations) in a particular order. There is no easy way to use PROC SQL to sum data cumulatively (i.e., to sum data across variables and observations to create running totals), as the final analysis question requires.
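
For comparison with the single PROC SQL step shown earlier, the multi-step DATA step and PROC MEANS route to the same branch percentages might look like the sketch below. It is an illustration rather than code from the paper: the data set names FINANCIALS_PROFIT, PROFIT_SUMS, and FINANCIALS_SUM_ALT are hypothetical, and a CLASS statement is used in place of the BY statement shown earlier so that one PROC MEANS call returns both the grand total and the branch totals.

```sas
/* Step 1: horizontal summation in a DATA step (names are hypothetical) */
data financials_profit;
    set financials;
    where location ne ' ';
    profit = sum(rev1, rev2, rev3, costs);
run;

/* Step 2: vertical summation. With CLASS and no NWAY option, the output
   holds a grand-total row (_TYPE_=0) followed by one row per branch
   (_TYPE_=1). CLASS does not require sorted input. */
proc means data=financials_profit noprint;
    class location;
    var profit;
    output out=profit_sums sum=branch_profit;
run;

/* Step 3: carry the grand total down to the branch rows and
   compute each branch's share of total profit */
data financials_sum_alt (drop=_type_ _freq_);
    set profit_sums;
    retain total_profit;
    if _type_ = 0 then do;
        total_profit = branch_profit;
        delete;
    end;
    branch_pct = branch_profit / total_profit;
    format branch_pct percent8.2;
run;
```

Under these assumptions, FINANCIALS_SUM_ALT should contain the same four columns as the PROC SQL result, at the cost of three steps instead of one.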
THE RETAIN STATEMENT AND THE SUM FUNCTION
The SUM function was introduced in the section on horizontal summation in order to sum revenue and costs across variables within an observation to create the profit variable. SAS, however, automatically sets variables that are created within an assignment statement (like the profit variable) to “missing” before each iteration of the DATA step. The RETAIN statement can be used to prevent SAS from re-initializing the values of created variables before each iteration of the DATA step. In other words, the value calculated in the previous observation is carried down to the following observation. When the SUM function is combined with the RETAIN statement, the sum from the previous observation is carried down and added to the value in the current observation.

This method can be used to calculate a running total of profits, CUMPROFIT, at each observation. In the following code, the variable CUMPROFIT is initialized to zero, summed with the PROFIT variable, then carried down to the following observation. If the zero were omitted from the RETAIN statement, CUMPROFIT would be initialized to “missing.”

    data financials_cum;
        set financials;
        if location ne ' ';
        profit = sum(rev1,rev2,rev3,costs);
        /* keep CUMPROFIT across iterations, starting at zero */
        retain cumprofit 0;
        cumprofit = sum(cumprofit,profit);
    run;

The resulting data set, FINANCIALS_CUM, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit  cumprofit
    Boston    12/15/2006     500    150    300    -200      750        750
    Boston    12/22/2006     300    500    200    -200      800       1550
    Boston    12/29/2006     100    200     50    -200      150       1700
    Boston    1/6/2007       100    100    150    -200      150       1850
    New York  12/15/2006       .    600    400    -400      600       2450
    New York  12/22/2006     700    600    250    -400     1150       3600
    New York  12/29/2006     200    300    100    -400      200       3800
    New York  1/6/2007        50    100     50    -400     -200       3600

FIRST./LAST. PROCESSING
The cumulative profit variable created above contains the correct cumulative profits for the Boston branch.
The RETAIN statement, however, carries over the final total profit for the Boston branch to the first observation of the New York branch. In order to obtain the correct cumulative profit for the New York branch, the CUMPROFIT variable must be reset to zero at the first observation of the New York branch. This can be done with FIRST./LAST. processing.

The following code sets the cumulative profit variable to zero at the first occurrence of each value of LOCATION. Note that this statement must come before the SUM function in order for the cumulative profit to include the profits from the first week in each branch. Observations are sorted by branch and week. The FIRST./LAST. code would cause errors if the data were not sorted by the BY variable LOCATION, but it does not require that the dates be in order. However, if the dates were out of order, the cumulative profit variable would be meaningless, or at least would not help answer the final analysis question. The next step adds a PROC SORT to ensure that the data are sorted correctly.

    proc sort data=financials;
        by location date;
    run;

    data financials_cum;
        set financials;
        by location;
        if location ne ' ';
        profit = sum(rev1,rev2,rev3,costs);
        retain cumprofit;
        /* restart the running total at the first week of each branch */
        if first.location then cumprofit = 0;
        cumprofit = sum(cumprofit,profit);
    run;

The resulting data set, FINANCIALS_CUM, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit  cumprofit
    Boston    12/15/2006     500    150    300    -200      750        750
    Boston    12/22/2006     300    500    200    -200      800       1550
    Boston    12/29/2006     100    200     50    -200      150       1700
    Boston    1/6/2007       100    100    150    -200      150       1850
    New York  12/15/2006       .    600    400    -400      600        600
    New York  12/22/2006     700    600    250    -400     1150       1750
    New York  12/29/2006     200    300    100    -400      200       1950
    New York  1/6/2007        50    100     50    -400     -200       1750

THE SUM STATEMENT
The SUM statement creates the same results as combining the SUM function with the RETAIN statement, but is slightly more efficient.
The SUM statement initializes the variable on the left of the plus sign (+) to zero, retains the variable, and adds the value of the expression on the right of the plus sign to the variable. It ignores missing values and treats an expression that produces a missing value as zero. The following code creates the same output as shown above.

    proc sort data=financials;
        by location date;
    run;

    data financials_cum;
        set financials;
        by location;
        if location ne ' ';
        profit = sum(rev1,rev2,rev3,costs);
        if first.location then cumprofit = 0;
        /* SUM statement: CUMPROFIT is retained automatically */
        cumprofit + profit;
    run;

CONCLUSION
While there are many methods to sum data in SAS, some may be more appropriate or save more time than others. It is important to think critically about the analysis question before deciding which type of summation (horizontal, vertical, or cumulative) is needed to answer it and which SAS technique is best. It is also important to be aware of missing values in the data and to understand how SAS handles them in each technique.

CONTACT INFORMATION
Tatiana Homonoff
Research Analyst
MDRC
16 East 34th Street, 19th Floor
New York, NY 10016
Phone: (212) 340-8629
Fax: (212) 684-0832
Tatiana.Homonoff@mdrc.org
www.mdrc.org

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

APPENDIX
While there are only three types of revenue in this data set, suppose there were 50. Creating the profit variable would appear to require a lot of typing, but SAS has a few shortcuts.

NUMERIC VARIABLE LISTS
The three revenue variables all have the same naming convention, “Rev”<number>, and can be referenced as a numeric list (e.g., REV1-REV3). SAS produces unexpected results, however, when combining numeric lists with the SUM function.

    profit = SUM(rev1-rev3, costs);

The resulting data set, FINANCIALS2, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit
    Boston    12/15/2006     500    150    300    -200        0
    Boston    12/22/2006     300    500    200    -200     -100
    Boston    12/29/2006     100    200     50    -200     -150
    Boston    1/6/2007       100    100    150    -200     -250
    New York  12/15/2006       .    600    400    -400     -400
    New York  12/22/2006     700    600    250    -400       50
    New York  12/29/2006     200    300    100    -400     -300
    New York  1/6/2007        50    100     50    -400     -400

Rather than reading “REV1-REV3” as a numeric list, SAS interprets it as subtraction of REV3 from REV1. This is not the desired result. If the numeric list is preceded with OF, however, SAS produces the desired result.

    profit = SUM(OF rev1-rev3, costs);

Note that if there were more than one variable list in the set of arguments, each list would need to be preceded with its own OF.

PREFIX LISTS
A prefix list can be used to shorten the code even further. Specify the variable prefix followed by a colon (e.g., REV:) to reference all variables whose names start with that prefix. Used without an OF, SAS generates an error. With an OF, SAS produces the desired result.

    profit = SUM(OF rev: , costs);

POSITIONAL VARIABLE LISTS
In this data set, the three revenue variables have the same naming convention, which allowed the use of the numeric and prefix lists. If that were not the case, but the summation argument variables were still adjacent to one another in the data set, a positional list could be used. Specify the first variable to use, two dashes, and the last variable to use (e.g., REV1--COSTS). Again, without the use of an OF, SAS produces unexpected results: it treats the double dash as double subtraction (i.e., addition) of REV1 and COSTS, rather than as a positional list. SAS evaluates the expression in parentheses first. Since it assumes that the double dash is a pair of arithmetic operators rather than a positional list, the result of the inner expression is a single argument.
If either of the variables in the inner expression is missing, the result of the inner expression is missing. Therefore, since the only argument to the SUM function is missing, the result of the whole expression is missing.

    profit = SUM(rev1--costs);

The resulting data set, FINANCIALS2, looks as follows:

    location  date          rev1   rev2   rev3   costs   profit
    Boston    12/15/2006     500    150    300    -200      300
    Boston    12/22/2006     300    500    200    -200      100
    Boston    12/29/2006     100    200     50    -200     -100
    Boston    1/6/2007       100    100    150    -200     -100
    New York  12/15/2006       .    600    400    -400        .
    New York  12/22/2006     700    600    250    -400      300
    New York  12/29/2006     200    300    100    -400     -200
    New York  1/6/2007        50    100     50    -400     -350

Use OF to produce the desired results.

    profit = SUM(OF rev1--costs);

WHERE CAN THESE TECHNIQUES BE USED?
The examples above use the addition operator and the SUM function in assignment statements to create variables in a DATA step. They can also be used to subset data in an IF or WHERE statement in a DATA step or in a WHERE statement in a PROC step. However, while the IF statement supports the variable list shortcuts described above, the WHERE statement does not.
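
As a minimal sketch of the point above, the two steps below subset the sample data to weeks with positive profit. The output data set name PROFITABLE_WEEKS and the choice of a zero threshold are assumptions made for illustration only; they do not come from the paper's analysis questions.

```sas
/* Subsetting IF in a DATA step: variable lists with OF are accepted.
   PROFITABLE_WEEKS is a hypothetical data set name. */
data profitable_weeks;
    set financials;
    if location ne ' ' and sum(of rev1-rev3, costs) > 0;
run;

/* WHERE statement in a PROC step: the SUM function is accepted,
   but each variable is listed explicitly instead of using a list */
proc print data=financials;
    where location ne ' ' and sum(rev1, rev2, rev3, costs) > 0;
run;
```

Because the SUM function ignores missing arguments, the New York week with missing toy revenue is evaluated on its remaining revenue and costs rather than being dropped from either subset.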