Summing with SAS®

NESUG 2007
Foundations & Fundamentals
Tatiana Homonoff, MDRC, New York, NY
ABSTRACT
This paper reviews methods used to sum data, including horizontal summation (summing data across
variables), vertical summation (summing data across observations), and cumulative summation (summing
data across both observations and variables to create running totals). Techniques include the addition
operator, the SUM function, PROC PRINT, PROC MEANS, PROC SQL, the RETAIN statement,
FIRST./LAST. processing, and the SUM statement. Special attention is paid to how SAS® handles
missing values in each technique.
INTRODUCTION
Summing data seems like a simple concept, but there are actually many complexities. Depending on the
data set and the analysis question, one should use different methods. This paper presents four analysis
questions involving summation and answers them using the most appropriate SAS technique.
SAMPLE DATA SET
Below is a data set, FINANCIALS, that contains weekly revenue and cost information for two branches of
a toy store in the weeks around Christmas. The two branches (LOCATION) are Boston and New York.
Revenue comes from three different product lines: toys (REV1), games (REV2), and books (REV3).
There is also one cost variable (COSTS). DATE indicates the start date of each holiday week.
location   date          rev1    rev2    rev3     costs
Boston     12/15/2006    $500    $150    $300     -$200
Boston     12/22/2006    $300    $500    $200     -$200
Boston     12/29/2006    $100    $200     $50     -$200
Boston     1/6/2007      $100    $100    $150     -$200
New York   12/15/2006       .    $600    $400     -$400
New York   12/22/2006    $700    $600    $250     -$400
New York   12/29/2006    $200    $300    $100     -$400
New York   1/6/2007       $50    $100     $50     -$400
           1/6/2007      $950  $1,600    $800   -$1,600
Note that this data set has missing values for two variables: REV1 and LOCATION. In these data, a
missing value of REV1 means that there was no revenue from toys in that week (i.e., a value of missing is
equivalent to a value of zero, so the value is not, in fact, “missing”). On the other hand, the observation
with missing data for LOCATION does not look like good data. It seems to be a total of the revenues and
costs of the New York branch over the four holiday weeks that was included in this data set by accident.
This paper describes how to deal with both good and bad missing data so as to avoid unexpected results.
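For readers who want to follow along, the sample data can be entered with a DATA step. The following is a minimal sketch, assuming comma-delimited in-stream data; the informats and the LENGTH of LOCATION are assumptions, since the paper does not show how FINANCIALS was created.

```sas
/* Sketch: build FINANCIALS from in-stream data (layout is an assumption) */
data financials;
   infile datalines dsd truncover;   /* DSD: comma-delimited, empty field = missing */
   length location $8;
   input location :$8. date :mmddyy10. rev1 rev2 rev3 costs;
   format date mmddyy10.;
   datalines;
Boston,12/15/2006,500,150,300,-200
Boston,12/22/2006,300,500,200,-200
Boston,12/29/2006,100,200,50,-200
Boston,1/6/2007,100,100,150,-200
New York,12/15/2006,.,600,400,-400
New York,12/22/2006,700,600,250,-400
New York,12/29/2006,200,300,100,-400
New York,1/6/2007,50,100,50,-400
,1/6/2007,950,1600,800,-1600
;
run;
```

The last data line has an empty first field, which reproduces the observation with a missing LOCATION.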
ANALYSIS QUESTIONS
The owner of the toy store is interested in determining whether certain locations or certain product lines
are more profitable than others. This paper will demonstrate various summation techniques in SAS to
answer the following analysis questions:
1. What is the weekly profit for each branch?
2. How much revenue did each product line bring in during the holiday period (and did this differ by
branch)?
3. How much profit did each branch earn during the holiday period, in dollars and as a percent of
total profits?
4. What is the cumulative profit to date of this company by branch at each week?
HORIZONTAL SUMMATION
Horizontal summation refers to adding values across variables within each observation (or row). Since
each observation in our data set has revenue and cost information by week and branch, horizontal
summation is used to answer the first question:
• What is the weekly profit for each branch?
The sum of revenue from the three product lines plus costs (the COSTS variable stores negative numbers,
so adding it subtracts the costs) gives the total weekly profit for each branch and week.
THE ADDITION OPERATOR
The first and most straightforward horizontal summation method is the addition operator (+). The addition
operator returns a numeric value that is the sum of the arguments. The code to produce profit using this
method is below. It also deletes the observation where the information on the branch location is missing
since this observation was erroneously included in the data.
data financials2;
   set financials;
   /* exclude the erroneous observation with no branch location */
   where location ne ' ';
   profit = rev1 + rev2 + rev3 + costs;
run;
The resulting data set, FINANCIALS2, looks as follows:
location   date          rev1   rev2   rev3   costs   profit
Boston     12/15/2006     500    150    300    -200      750
Boston     12/22/2006     300    500    200    -200      800
Boston     12/29/2006     100    200     50    -200      150
Boston     1/6/2007       100    100    150    -200      150
New York   12/15/2006       .    600    400    -400        .
New York   12/22/2006     700    600    250    -400     1150
New York   12/29/2006     200    300    100    -400      200
New York   1/6/2007        50    100     50    -400     -200
This method produces the expected value of profit when there are complete data (i.e., no missing values)
for all revenue and cost variables, but it assigns a missing value to the profit variable when any of the
arguments is missing (as is the case for the toy revenue in the New York branch for the week beginning
December 15th). Depending on the data set, this might be the desired result. If there is truly missing
revenue data, the profit variable should be missing as well; if the missing data were ignored, SAS would
produce a numeric value for the profit variable that would look complete but in reality would be
inaccurate, since it would contain revenue from only two of the three product lines.
As mentioned above, however, in this data set, revenue is missing when there was no revenue for that
product line and is, in fact, “good” data. So rather than assigning the profit variable a value of missing
when there are missing revenue data, SAS should ignore the arguments with missing values and
generate the profit variable using only the arguments with non-missing values. This will require a different
technique.
THE SUM FUNCTION
The SUM function is another horizontal summation method. It works in the same way that the addition
operator does but handles missing values differently. The SUM function ignores missing values and
excludes them in the summation. If all values of the arguments are missing, it returns a missing value. If
there is even one non-missing argument, however, it returns the sum. The code to produce profit using
the SUM function is:
profit = SUM(rev1, rev2, rev3, costs) ;
The resulting data set, FINANCIALS2, looks as follows:
location   date          rev1   rev2   rev3   costs   profit
Boston     12/15/2006     500    150    300    -200      750
Boston     12/22/2006     300    500    200    -200      800
Boston     12/29/2006     100    200     50    -200      150
Boston     1/6/2007       100    100    150    -200      150
New York   12/15/2006       .    600    400    -400      600
New York   12/22/2006     700    600    250    -400     1150
New York   12/29/2006     200    300    100    -400      200
New York   1/6/2007        50    100     50    -400     -200
SAS ignored the argument with a missing value when calculating the profit variable, which is the desired
result.
See the appendix for various shortcuts using the SUM function.
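The difference between the two horizontal methods can be checked on a single row. The sketch below uses the New York values for the week of December 15 and writes the results to the log; the variable names PROFIT_PLUS, PROFIT_SUM, ALL_MISSING, and ZERO_FLOOR are hypothetical and chosen only for illustration.

```sas
/* Sketch: addition operator versus SUM function on one row */
data _null_;
   rev1 = .; rev2 = 600; rev3 = 400; costs = -400;

   profit_plus = rev1 + rev2 + rev3 + costs;     /* any missing operand gives missing */
   profit_sum  = sum(rev1, rev2, rev3, costs);   /* missing argument is ignored: 600 */
   all_missing = sum(., .);                      /* every argument missing: missing */
   zero_floor  = sum(0, ., .);                   /* a literal 0 guarantees a numeric result */

   put profit_plus= profit_sum= all_missing= zero_floor=;
run;
```

The log line should read `profit_plus=. profit_sum=600 all_missing=. zero_floor=0`. SAS also writes a note that missing values were generated by the addition operator. Including a literal zero in the argument list is one way to avoid a missing total when every argument might be missing.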
VERTICAL SUMMATION
Now that the first analysis question has successfully been answered, we turn to the second:
• How much revenue did each product line bring in during the holiday period (and did this differ by
branch)?
While the first analysis question required summation of several variables within observations, the second
analysis question requires summation of individual variables across observations. This is what is called
vertical summation.
THE PRINT PROCEDURE
The simplest method of vertical summation is the PRINT procedure. PROC PRINT does not allow for the
creation of any new variables nor does it create a data set with summary variables. It does, however, print
sums of existing variables that are specified in a SUM statement below the raw data. The WHERE
statement excludes observations with missing branch data from being printed and being included in the
summation. Missing values of the variables in the SUM statement are ignored; if all values are missing,
the total is zero.
proc print data = financials;
   sum rev1 rev2 rev3;
   where location ne ' ';
run;
SAS prints the following output:
location   date          rev1   rev2   rev3   costs
Boston     12/15/2006     500    150    300    -200
Boston     12/22/2006     300    500    200    -200
Boston     12/29/2006     100    200     50    -200
Boston     1/6/2007       100    100    150    -200
New York   12/15/2006       .    600    400    -400
New York   12/22/2006     700    600    250    -400
New York   12/29/2006     200    300    100    -400
New York   1/6/2007        50    100     50    -400
                         ====   ====   ====
                         1950   2550   1500
A BY statement can also be used with PROC PRINT to sum revenue by branch. Both the totals by branch
and the grand total are printed. Note that the data must be sorted by the BY variable.
proc sort data=financials;
   by location;
run;
proc print data = financials;
   sum rev1 rev2 rev3;
   where location ne ' ';
   by location;
run;
SAS prints the following output:
location=Boston

date          rev1   rev2   rev3   costs
12/15/2006     500    150    300    -200
12/22/2006     300    500    200    -200
12/29/2006     100    200     50    -200
1/6/2007       100    100    150    -200
              ----   ----   ----
              1000    950    700

location=New York

date          rev1   rev2   rev3   costs
12/15/2006       .    600    400    -400
12/22/2006     700    600    250    -400
12/29/2006     200    300    100    -400
1/6/2007        50    100     50    -400
              ----   ----   ----
               950   1600    800
              ====   ====   ====
              1950   2550   1500
THE MEANS PROCEDURE
The MEANS procedure is a data summarization tool used to calculate descriptive statistics for variables
across all observations and within groups of observations. One of these statistics is SUM. This procedure
is far more flexible than PROC PRINT, mainly because it can store results in an output data set that can
be manipulated.
SUMMATION ACROSS ALL OBSERVATIONS
The first part of question two can be answered by using PROC MEANS to sum each of the three revenue
variables across every observation in the data set. Note that without the inclusion of a WHERE statement,
the summary variables would include the observation where the branch location was missing, thereby
double-counting the revenue from New York.
proc means data=financials noprint;
   var rev1 rev2 rev3;
   output out=revsum (drop=_type_ _freq_)
      sum(rev1-rev3)=revsum1-revsum3;
   where location ne ' ';
run;
The resulting data set, REVSUM, looks as follows:
revsum1   revsum2   revsum3
   1950      2550      1500
This data set contains the grand total of the revenue of each product line across all branches and weeks.
Note that SAS ignores the missing revenue value in the first product line rather than generating a missing
value for the total. While there is no option in PROC MEANS that assigns a missing value to the summary
variable when there are missing values for the analysis variable, if the NMISS statistic is specified, SAS
will create a variable that counts the number of missing values in the specified analysis variable. This
variable can be inspected to determine whether the summary variable is excluding any observations due
to missing data.
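As a sketch of the NMISS check described above, the OUTPUT statement can request both statistics at once. The output data set name REVSUM_CHECK and the variable names REVMISS1-REVMISS3 are hypothetical.

```sas
/* Sketch: store sums and counts of missing values side by side */
proc means data=financials noprint;
   where location ne ' ';
   var rev1 rev2 rev3;
   output out=revsum_check (drop=_type_ _freq_)
      sum(rev1-rev3)=revsum1-revsum3
      nmiss(rev1-rev3)=revmiss1-revmiss3;   /* number of missing values per variable */
run;
```

With the sample data, REVMISS1 should be 1 (the missing toy revenue for New York) and REVMISS2 and REVMISS3 should be 0, which signals that only REVSUM1 was computed from fewer than eight observations.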
SUMMATION BY GROUP USING THE BY STATEMENT
The second part of the second analysis question requires vertical summation by branch rather than
across all observations.
• How much revenue did each product line bring in during the holiday period (and did this differ by
branch)?
This can be accomplished by adding a BY statement to the MEANS procedure. The input data set must
be sorted by the BY group variable – in this case, LOCATION. Note that PROC MEANS considers a
missing value to be a legitimate BY group value. Without the WHERE statement, all observations with a
missing branch location would be summed together. This can create unexpected results if there are many
observations with missing branch location values.
proc sort data=financials;
   by location;
run;
proc means data=financials noprint;
   var rev1 rev2 rev3;
   output out=revsum_bybranch (drop=_type_ _freq_)
      sum(rev1-rev3)=revsum1-revsum3;
   by location;
   where location ne ' ';
run;
The resulting data set, REVSUM_BYBRANCH looks as follows:
location   revsum1   revsum2   revsum3
Boston        1000       950       700
New York       950      1600       800
THE SQL PROCEDURE
The third analysis question is:
• How much profit did each branch earn during the holiday period, in dollars and as a percent of
total profits?
This question highlights some of the key limitations of PROC PRINT and PROC MEANS. PROC PRINT
prints summary statistics, but cannot store them in a SAS data set; therefore, they are not available for
future calculations (e.g., creating branch profits as a percent of total profits). While PROC MEANS does
allow summary statistics to be stored in an output data set, it only summarizes variables that already exist
in the input data set. Since the PROFIT variable does not exist in the raw data set, it would have to be
created in a DATA step before summing vertically using PROC MEANS. A second DATA step is required
to create branch profits as a percent of total profits.
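For comparison, the multi-step approach just described might look like the sketch below. It is one possible arrangement, not code from the paper: the data set names FINANCIALS_PROFIT, BRANCH_SUM, TOTAL_SUM, and FINANCIALS_PCT are hypothetical, and a second PROC MEANS call is used here to obtain the grand total.

```sas
/* Step 1: create PROFIT, since PROC MEANS only summarizes existing variables */
data financials_profit;
   set financials;
   where location ne ' ';
   profit = sum(rev1, rev2, rev3, costs);
run;

/* Step 2: sum PROFIT by branch and over all branches */
proc sort data=financials_profit;
   by location;
run;
proc means data=financials_profit noprint;
   by location;
   var profit;
   output out=branch_sum (drop=_type_ _freq_) sum=branch_profit;
run;
proc means data=financials_profit noprint;
   var profit;
   output out=total_sum (drop=_type_ _freq_) sum=total_profit;
run;

/* Step 3: attach the grand total to each branch row and compute the percent */
data financials_pct;
   if _n_ = 1 then set total_sum;   /* TOTAL_PROFIT is retained for every row */
   set branch_sum;
   branch_pct = branch_profit / total_profit;
   format branch_pct percent8.2;
run;
```

The number of steps and intermediate data sets is what motivates the single-step PROC SQL approach.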
NESTED SUM FUNCTIONS
The SQL procedure provides a way to do all of this in one step. In PROC SQL, when the SUM function is
given multiple columns as arguments, it adds the values of those columns within each row. If that SUM
function is then nested in a second SUM function, SAS produces a grand total of the calculated value
across all observations. In other words, the inner SUM function is performing the horizontal summation
while the outer SUM function is performing the vertical summation. This method can be used to calculate
total profit over all weeks and branches. Note that SQL does not support variable lists in the SUM
function. Observations with missing branch data should be removed with a WHERE clause.
proc sql;
   create table financials_sum as
   select
      sum(sum(rev1,rev2,rev3,costs)) as branch_profit
   from financials
   where location ne ' ';
quit;
The resulting data set, FINANCIALS_SUM, looks as follows:
branch_profit
3600
THE GROUP BY STATEMENT
However, the third analysis question asks for the total profits by branch, not overall profits. PROC SQL
can vertically sum profit by branch to create the variable BRANCH_PROFIT, the dollar amount of profit by
branch, using a GROUP BY clause. As with PROC MEANS, PROC SQL ignores missing values of the
analysis variables in the SELECT statement, but it treats missing GROUP BY variables as valid data.
Unlike PROC MEANS, PROC SQL does not require data to be sorted by the GROUP BY variable(s).
proc sql;
   create table financials_sum as
   select
      location,
      sum(sum(rev1,rev2,rev3,costs)) as branch_profit
   from financials
   where location ne ' '
   group by location;
quit;
The resulting data set, FINANCIALS_SUM, looks as follows:
location   branch_profit
Boston              1850
New York            1750
THE SELECT STATEMENT SUBQUERY
The previous step calculated profit by branch in dollars, but the analysis question also asks for profit by
branch as a percent of total profits. The first step to do this is to add total profits to the data set created
above. In PROC MEANS, this would require a second step; in PROC SQL, this can be done in the same
step by using a subquery nested in parentheses.
A subquery is a query that is nested in another query. This subquery is executed first. Note that the
variable TOTAL_PROFIT that is created in the subquery must also be referenced in the outer query in
order to be included in the created table. In order to exclude observations with missing branch information
when calculating the BRANCH_PROFIT and TOTAL_PROFIT variables, there must be a WHERE clause
in the subquery as well as in the outer query.
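proc sql;
   create table financials_sum as
   select
      location,
      sum(sum(rev1,rev2,rev3,costs)) as branch_profit,
      total_profit
   from financials,
      (select
         sum(sum(rev1,rev2,rev3,costs)) as total_profit
       from financials
       where location ne ' ')
   where location ne ' '
   /* grouping by TOTAL_PROFIT as well keeps one row per branch;     */
   /* a selected column left out of GROUP BY makes PROC SQL remerge  */
   group by location, total_profit;
quit;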
The resulting data set, FINANCIALS_SUM, looks as follows:
location   branch_profit   total_profit
Boston              1850           3600
New York            1750           3600
The final step to answering this analysis question is to create the variable BRANCH_PCT, the profits by
branch as a percent of total profits. This, too, can be created in the same SQL procedure. Because
BRANCH_PROFIT is computed in the same SELECT clause that uses it, it must be preceded by the
keyword CALCULATED. TOTAL_PROFIT needs no keyword: it is already a column of the subquery's
result, which is joined to every row selected by the outer query.
proc sql;
   create table financials_sum as
   select
      location,
      sum(sum(rev1,rev2,rev3,costs)) as branch_profit,
      total_profit,
      calculated branch_profit/total_profit as branch_pct
         format=percent8.2
   from financials,
      (select
         sum(sum(rev1,rev2,rev3,costs)) as total_profit
       from financials
       where location ne ' ')
   where location ne ' '
   /* grouping by TOTAL_PROFIT as well keeps one row per branch */
   group by location, total_profit;
quit;
The resulting data set, FINANCIALS_SUM, looks as follows:
location   branch_profit   total_profit   branch_pct
Boston              1850           3600       51.39%
New York            1750           3600       48.61%
CUMULATIVE SUMMATION
The final analysis question is:
• What is the cumulative profit to date of this company by branch at each week?
The previous section showed that PROC SQL was far more versatile than the DATA step and PROC
MEANS for answering the analysis questions. SQL, however, does not process rows (observations) in a
particular order. There is no easy way to use PROC SQL to sum data cumulatively (i.e., to sum data
across variables and observations to create running totals), as the final analysis question requires.
THE RETAIN STATEMENT AND THE SUM FUNCTION
The SUM function was introduced in the section on horizontal summation in order to sum revenue and
costs across variables within an observation to create the profit variable. SAS, however, automatically
sets variables that are created within an assignment statement (like the profit variable) to “missing” before
each iteration of the DATA step.
The RETAIN statement can be used to prevent SAS from re-initializing the values of created variables
before each iteration of the DATA step. In other words, the value calculated in the previous observation is
carried down to the following observation. When the SUM function is combined with the RETAIN
statement, the sum from the previous observation is carried down and added to the value in the current
observation. This method can be used to calculate a running total of profits, CUMPROFIT, at each
observation. In the following code, the variable CUMPROFIT is initialized to zero, summed with the
PROFIT variable, then carried down to the following observation. If the zero were omitted from the
RETAIN statement, CUMPROFIT would be initialized to “missing.”
data financials_cum;
   set financials;
   if location ne ' ';
   profit=sum(rev1,rev2,rev3,costs);
   retain cumprofit 0;
   cumprofit=sum(cumprofit,profit);
run;
The resulting data set, FINANCIALS_CUM, looks as follows:
location   date          rev1   rev2   rev3   costs   profit   cumprofit
Boston     12/15/2006     500    150    300    -200      750         750
Boston     12/22/2006     300    500    200    -200      800        1550
Boston     12/29/2006     100    200     50    -200      150        1700
Boston     1/6/2007       100    100    150    -200      150        1850
New York   12/15/2006       .    600    400    -400      600        2450
New York   12/22/2006     700    600    250    -400     1150        3600
New York   12/29/2006     200    300    100    -400      200        3800
New York   1/6/2007        50    100     50    -400     -200        3600
FIRST./LAST. PROCESSING
The cumulative profit variable created above contains the correct cumulative profits for the Boston
branch. The RETAIN statement, however, carries over the final total profit for the Boston branch to the
first observation of the New York branch. In order to obtain the correct cumulative profit for the New York
branch, the CUMPROFIT variable must be reset to zero at the first observation of the New York branch.
This can be done with FIRST./LAST. processing. The following code sets the cumulative profit variable to
zero at the first occurrence of each value of LOCATION. Note that this statement must come before the
SUM function in order for the cumulative profit to include the profits from the first week in each branch.
Observations are sorted by branch and week. The FIRST./LAST. code would cause errors if the data
were not sorted by the BY variable LOCATION, but it does not require that the dates be in order.
However, if the dates were out of order, the cumulative profit variable would be meaningless, or at least
would not help answer the final analysis question. The next step adds a PROC SORT to ensure that the
data are sorted correctly.
proc sort data=financials;
   by location date;
run;
data financials_cum;
   set financials;
   by location;
   if location ne ' ';
   profit=sum(rev1,rev2,rev3,costs);
   retain cumprofit;
   if first.location then cumprofit=0;
   cumprofit=sum(cumprofit,profit);
run;
The resulting data set, FINANCIALS_CUM, looks as follows:
location   date          rev1   rev2   rev3   costs   profit   cumprofit
Boston     12/15/2006     500    150    300    -200      750         750
Boston     12/22/2006     300    500    200    -200      800        1550
Boston     12/29/2006     100    200     50    -200      150        1700
Boston     1/6/2007       100    100    150    -200      150        1850
New York   12/15/2006       .    600    400    -400      600         600
New York   12/22/2006     700    600    250    -400     1150        1750
New York   12/29/2006     200    300    100    -400      200        1950
New York   1/6/2007        50    100     50    -400     -200        1750
THE SUM STATEMENT
The SUM statement creates the same results as combining the SUM function with the RETAIN statement,
but is slightly more efficient. The SUM statement initializes the variable on the left of the plus sign (+) to
zero, retains the variable, and adds the value of the expression on the right of the plus sign to the
variable. It ignores missing values and treats an expression that produces a missing value as zero. The
following code creates the same output as shown above.
proc sort data=financials;
   by location date;
run;
data financials_cum;
   set financials;
   by location;
   if location ne ' ';
   profit=sum(rev1,rev2,rev3,costs);
   if first.location then cumprofit=0;
   cumprofit+profit;
run;
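The three behaviors of the SUM statement (initializing to zero, retaining, and treating a missing expression as zero) can be seen in isolation with a small sketch. The variables X and TOTAL and the three input values are hypothetical.

```sas
/* Sketch: the SUM statement on a column that contains a missing value */
data _null_;
   input x;
   total + x;        /* no RETAIN or initial assignment is needed */
   put x= total=;
   datalines;
10
.
5
;
run;
```

The log should show `x=10 total=10`, then `x=. total=10`, then `x=5 total=15`: the running total starts at zero, carries across observations, and is unchanged by the missing value.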
CONCLUSION
While there are many methods to sum data in SAS, some may be more appropriate or save more time
than others. It is important to think critically about the analysis question before deciding which type of
summation (horizontal, vertical, or cumulative) is needed to answer it and which SAS technique is best. It
is also important to be aware of missing values in the data and to understand how SAS handles them in
each technique.
CONTACT INFORMATION
Tatiana Homonoff
Research Analyst
MDRC
16 East 34th Street, 19th Floor
New York, NY 10016
Phone: (212) 340-8629
Fax: (212) 684-0832
Tatiana.Homonoff@mdrc.org
www.mdrc.org
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
APPENDIX
While there are only three types of revenue in this data set, suppose there were 50. Creating the profit
variable would appear to require a lot of typing, but SAS has a few shortcuts.
NUMERIC VARIABLE LISTS
The three revenue variables all have the same naming convention, “Rev”<number>, and can be
referenced as a numeric list (e.g., REV1-REV3). SAS produces unexpected results, however, when
combining numeric lists with the SUM function.
profit = SUM(rev1-rev3, costs) ;
The resulting data set, FINANCIALS2, looks as follows:
location   date          rev1   rev2   rev3   costs   profit
Boston     12/15/2006     500    150    300    -200        0
Boston     12/22/2006     300    500    200    -200     -100
Boston     12/29/2006     100    200     50    -200     -150
Boston     1/6/2007       100    100    150    -200     -250
New York   12/15/2006       .    600    400    -400     -400
New York   12/22/2006     700    600    250    -400       50
New York   12/29/2006     200    300    100    -400     -300
New York   1/6/2007        50    100     50    -400     -400
Rather than reading “REV1- REV3” as a numeric list, SAS interprets it as subtraction of REV3 from
REV1. This is not the desired result. If the numeric list is preceded with OF, however, SAS produces the
desired result.
profit = SUM(OF rev1-rev3, costs) ;
Note that if there were more than one variable list in the set of arguments, each list would need to be
preceded with its own OF.
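As a sketch of the multiple-list case, suppose the data also held two cost variables, COST1 and COST2. These two variables and their values are hypothetical and do not exist in FINANCIALS.

```sas
/* Sketch: each variable list in the argument list gets its own OF */
data _null_;
   rev1 = 500; rev2 = 150; rev3 = 300;
   cost1 = -200; cost2 = -50;            /* hypothetical cost variables */
   profit = sum(of rev1-rev3, of cost1-cost2);
   put profit=;
run;
```

The log should show `profit=700`, i.e., 950 in revenue plus -250 in costs.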
PREFIX LISTS
A prefix list can be used to shorten the code even further. Specify the variable prefix followed by a
colon (e.g., REV:) to reference all variables whose names start with that prefix. Without an OF, SAS
generates an error. With an OF, SAS produces the desired result.
profit = SUM(OF rev: , costs) ;
POSITIONAL VARIABLE LISTS
In this data set, the three revenue variables have the same naming convention, which allowed the use of
the numeric and prefix lists. If that were not the case, but the summation argument variables were still
adjacent to one another in the data set, a positional list could be used. Specify the first variable to use,
two dashes, and the last variable to use (e.g. REV1--COSTS).
Again, without the use of an OF, SAS produces unexpected results: it treats the double dash as two
subtraction operators (REV1 minus negative COSTS, i.e., the addition of REV1 and COSTS) rather than
as a positional list. SAS evaluates the expression in parentheses first. Because it reads the double dash
as arithmetic operators, the result of the inner expression is a single argument. If either of the variables
in the inner expression is missing, the result of the inner expression is missing. Therefore, since the only
argument to the SUM function is missing, the result of the whole expression is missing.
profit = SUM(rev1--costs) ;
The resulting data set, FINANCIALS2, looks as follows:
location   date          rev1   rev2   rev3   costs   profit
Boston     12/15/2006     500    150    300    -200      300
Boston     12/22/2006     300    500    200    -200      100
Boston     12/29/2006     100    200     50    -200     -100
Boston     1/6/2007       100    100    150    -200     -100
New York   12/15/2006       .    600    400    -400        .
New York   12/22/2006     700    600    250    -400      300
New York   12/29/2006     200    300    100    -400     -200
New York   1/6/2007        50    100     50    -400     -350
Use OF to produce the desired results.
profit = SUM(OF rev1--costs) ;
WHERE CAN THESE TECHNIQUES BE USED?
The examples above use the addition operator and the SUM function in assignment statements to create
variables in a DATA step. They can also be used to subset data in an IF or WHERE statement in a DATA
step or in a WHERE statement in a PROC step. However, while the IF statement supports the variable
lists shortcuts described above, the WHERE statement does not.
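A sketch of the two subsetting forms follows. The output data set name PROFITABLE_WEEKS and the threshold of zero are assumptions made for illustration.

```sas
/* Subsetting IF in a DATA step: the OF variable list is accepted */
data profitable_weeks;
   set financials;
   if sum(of rev1-rev3, costs) > 0;
run;

/* WHERE statement in a PROC step: each variable is listed explicitly */
proc print data=financials;
   where sum(rev1, rev2, rev3, costs) > 0;
run;
```

Both forms keep only the observations whose weekly profit is positive; the difference lies only in which argument syntax each statement accepts.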