SET, MERGE, UPDATE, or MODIFY? (And What About Indexing?) ,Philip J. Weiss, Walsh Amerlca/PMSI the most part, older versions 01 SAS provided a smaller collection 01 Statements for performing these tasks than is now available. The basic DATA step tool kit, if it can be so defined, consisted primarily 01 SET, MERGE, and UPDATE statements which are used to combine SAS data sets in avariety 01 ways. AbstractManipulating multiple data sets within the same DATA step is a common programming task. The latest irltroduction 01 new data manipulation procedureS provides both increased flexibility and improved efficiency. Although these techniques are powerful arid eilective, their utility is weakened by potential misuse. Because the new statements process records using random and dynamic WHERE processing methods, SAse programmers may shun the, newer techniques and opt instead for more familiar and proven methods. Under earlier verSions of SAS, the SET statement had basically three uses. One could a) concatenate one or more data sets, b) interleave records from one data set with records from another data set, and c) Combine records from different data sets by utilizing more than one SET statement in the same DATA step. These capabilities still exist. Consequently, tools are needed to assist in the determination 01 better solutions. The objective 01 this, p<!per is to provide two essential tools. More specifically, the paper 1) compares and contrasts the new teclmiques with those' previously available; 2) develops decision-making criteria for choosing an appropriate data manipulation technique with efficiency as.a guiding principle., The MERGE statement has not changed syntactically from earlier to later SAS versions and is still used to joinrecords from two or more SAS data sets into a single observation. A MERGE statement is usually accompanied by a' BY Statement to combine or interleave records in a specific order. The UPDATE statement replaces variable values on a master (or primary) file with record values from what is called atransactlon file. This occurs through a comparison or merging 01 records based on variables contained in the BY group (a BY statement must be used). In addition, new records are capable 01 being added (i.e. appended). One constraint is that the master data set cannot have duplicate records in it by the variables contained in the BY statement. A new SAS data set is always created as output. IntrOductionSeve,ral new techniques are now available for manipulating data in the DATA step with Version 6.07. They consist 01: 1) a new SET option, specifically the KEY = option on the SET statement, 2) the MODIFY statement, and 3) the capability to create indexes in the DATA step. The addition of these techniques has introduced confusion as to which technique is best suited for a specific application., , ,The increase in flexibility has ultimately brought with it more complexity. The programmer now has to ask a more involved series 01 questions before choosing one 01 the available techniques over another. By not taking advantage 01 the new capabilities, and by performing data manipUlation in old ways, there is the potential to degrade computer resource effiCiency. It is also possible to misuse or misapply the new ,techniques in situations where altemate programming solutions are actually required. Enhancements to Version 6.07New to Version 6.07 are several techniques that significantly extend the capabilities of the DATA step. They are the MODIFY statement, the KEY= option on the SET statement, and the ability to create indexes during the DATA step with the INDEX= option. All 01 these features take advantage of the latest ability to read and modify records using random, or direct, access to the data Standard File Comparison Techniques- The MODIFY statement is new to Version 6.07 and can be used in a number 01 different ways. As with the UPDATE statement, the general purpose of the MODIFY statement is to update or replace variable values in a master file with those in a transaction file in the past, it was fairly easy to determine which data manipulation technique to use when combining or comparing records from two different data sets. For 127 data set have changed dramatically as new versions of Base SAS have been introduced. From a machine/CPU efficiency perspective, the process of indexing a data set has become easier and more cost effective. Base SAS Version 6.06 allowed batch users to create indexes only through PROC DATASETS or PROe SOL In addition to these methods, SAS Version 6.07 allows the programmer to index an Input (or external) file in the Data step using the INDEX= data set option. Index creation is accomplished by including INDEX= within parentheses after the data set name. An examination of the INDEX= data set option indicates that .either simple or complex indices can be created, the nature of which is determined by the number of variables that follow the INDEX= statement On separate parentheses, see Example 1).. in much the same way as is accomplished using the UPDATE statement. Unlike the UPDATE statement, however, the MODIFY statement 1) can be used without a BY statement, 2) can have duplicate values on the master data set, 3) does not require the data sets to be in sorted order if a BY statement is used, 4) allows the programmer to delete and append new observations to the master data set, and 5) replaces the SAS data set instead of making another copy, thus potentially saving on disk or work space. Processing differences are also apparent between the two statements. In the absence of an explicit WHERE clause, the UPDATE statement matches records based on specific variables contained in the BY statement; both data sets must be in sorted order. The MODIFY statement by contrast has identical record matching functionality through the inclusion of a BY statement; no preliminary sorting is necessary. While this process can be CPU intensive, the master data set can either be indexed or both the master and the transaction file can be sorted to reduce system processing overhead (see Beatrous and Armstrong, 1991 for indexing guidelines). Once an index t'las been successfully created, it . can be used to provide non-sequential .!\CCeSS to observations in the indexed data set Of greater impqrtance, however, is the use of indexes. for cOmparing or updating observations from one dala set with matching observations of another. One new technique for. data set comparison' is available by using the KEY= options on either the SET or MODIFY statements. The increased processing flexibility prOVided by the MODIFY statement (when compared to the UPDATE statement) creates situations where there is a greater chance for the programmer to produce erroneous output. When two data sets are being compared using the MODIFY statement,it is imperative that the programmer have a thorough understanding of the relationship between the two SAS files to avoid potential problems. This includes knowing whether or not the transaction data set contains duplicates or .observations not contained in the master data set, or whether . the master file contains duplicate observations. For bott'l the SET and MODIFY statements, data is accessed randomly according to the value of the key index variable for that observation. Syntactically, either statement (SET or MODIFY) cOntaining a KEY= option must be preceded by a SET statement. If no index has been created, or an index exists on a variable other than the' one listed in the KEY= option, an error message is issued. Even though the SET and the MODIFY statements appear similar to each other when the KEY= option is considered (see Example 1, following page), several significant differences do exist. In the MODIFY example on the left, data in the first SET statement (A2) acts as the transaction data set which will be used to perform operations on the master data set (A1). In the SET example, the reverse is true! In the latter example, the first SET statement contains the data (A2, called the primary data set) that will be modified and hence is somewhat equivalent to the master data set in the previous example. Data in the second SET statement (A 1, called' the lookup data set) is used to add variables to the output data set (A3) when a match is made. The reason new variables are initialized to blanks (after an observation in the primary data set has been retrieved) is because lookup table values are retained from the previous iteration if no match is found with the primary data set. Apart from some of the enhancements listed above, what the MODIFY statement holds in common with other Version 6.07 DATA step enhancements is the ability to access data directly or randomly (i.e. not sequentially). Either dynamic WHERE processing can be used (with a BY statement) or record processing can be handled via an index key (with KEY=). The new capability of controlling how records are processed brings with it the new responsibility on the part of the programmer to know when indexing is appropriate. Therefore, what follows is a discussion of the new index creation method available in the DATA step and the use of indexes with the MODIFY and SET statements. IndexingThe methods by which one is capable of indexing a 128 EXAMPLE 1 - KEY= options Used with MODIFY (left) and SET (right) statements OATA Al ONOEX.",(SPEC); INALE FlLE1 i DATA A1~NDEX=(SPEC»; INFILE ALE1; INPlIT @l MENUM $10. @37 SPEC S3. @42 OlDZlP $5.; DATAA2; INPUT @, MENUM $10. SPEC @42 NEWZIP 15.; sa P OATAA2; INALE FlLE2; INF1LE FiLa; INPUT @10 SPEC INPUT @10 . @15 DAlAA3; SET A2; MENUM-' $3. @15 NEWZIP $5.; DATAA1j SET A2: MODtFY A1 KEY=SPEC; OLOZlP,.,NEWZIP; , SPEC OlDZlP sa $5.; NEWZIP - ' SET AllCEY_SPEQ; IF NEWZIP NOT EQ • THEN OLOZlP_NEWZIP; RUN' RUN' option) does not me<lningfullyextend its data manipulation capabilities. Efficiency may, however, be imprOl(ed. The SET statement, on the ot~r hand, functions very differently •. when used In combination with a KEY= option. The SET statement actually loses some of Its functionality when used with the KEY= .option, while becoming more efficient at processing records. Processing ConsiderationsGiven that the SAS tool kit for comparing' and updating data sets has expanded so dramatically, the issue becomes one of determining. which technique should be used for a particular application. The question of which technique is more appropriate really begins by asking the question, 'What needs to be done with the files?'. In the chart that follows (Chart 1), several generic tasks are listed along with the tasks that are supported by the particular technique. As with the rest of this paper, this list is confined to techniques where two SAS data sets are being compared. While the list is not intended to be. exhaustive, it serves to illustrate some meaningful differences and similarities between the techniques shown in the chart. The chart serves to highlight the fact that several techniques can be used to perform essentially the same function. This is where the confusion may arise. For example, if the programmer wishes to add new variables from one data set to another, then it is unclear whether an UPDATE,.a MERGE, a SET, or a SET with a KEY = option is mand.ated. Ukewise, if one wishes to update or replace specific observations on one data set with another, all the listed techniques support that operation. This leads to doubts about which methOd should be used. Specifically, it shows that combining indexing with the MODIFY statement (specilically with a KEY= CHART 1 - Some SAS data set compsrlson techniques and their functionality . ACTION Techniques Update .. Update/ Replace Varlabla Values X Add New or Missing Variables Add New Records X '.' Delete Records Ap.....dI Inter- Concat- laeve Records _a Records X X Modily X X X X Modily w/ Key= X X X X Merge X X X X Set X X X Setw/Key- X X 129 X X X significantly larger (in terms of number of observations) than the other. While size is not in and of itseH a determinant of indexing efficiency, there is a moderate degree of correspondence. Large, when used in this context, can also mean that one file has less than 50% of records in common with the other based on one variable or a combination of variables. H one of the files can be judged to be larger than the other, then randomly accessing the data may' avoid having to process a signHicant number of the records contained in one or more of the files. Still, there needs to be some kind of CPU investment to create the index, but reuse of t,he same file may diminish the initial cost. Decision CrlterlaIn an effort to make the seleelion of an appropriate technique easier, a flow diagram (fOllowing page) is constructed based On specifiC, decision making criteria. Reelangles are process or action steps, while diamonds indicate depiSiQn branching processes. The object of this methodology is to begin with a collection of techniques and gradually to refine that collection until a smaller subset (or one) exists. The only rules to follow in using the diagram are that once a technique is dropped from consideration it should not be re-added to lower stages of the decisiQn making process. Unfortunately, the decision making process is not "strictly hierarchical in nature, but is organizable in terms of stages:' This specific Ordering has value in the sense thanposescquestions that are critical to ',processing !itrategy' and statement seleCtion. ,'i: I ":-(.;-\ When processing data non-sequentially, the Investment in computer resources occurs not at the comparison stage (as in a sequential MERGE), but c!uring the actual index creation for one of the files. The saVings occ,urs in being able to index one of the SAS data sets, save it as a SAS permanent data set, and then to be able to reuse that indexed data set aiter the initial invest(l1ent of constructing the index. Success of this strategy is predicated on the fact that the indexed file does not change very often. - BasicallY'there are three separate parts, or stages, to the decision'making process: They are; " 1) Accurately determining the relationship(s) '" betWeen the files. This invOlves'correctly identifYing the task that rnust oCcur 6h the master or primary datil set, as well 'as, '", analyiing the relative size of the master versus , the tranSaction <:lata Set. 'An example would be deciding whether one is goihg to' update or add r e t o r d s . · , Even though Stage 2 deals with potential resource use, the questiOn of whether to index or not remains , a central one. The key to quantifying resource use is to ask oneseH, how manY times a data set may be used to . update or modify records, especially considering subsequent processing. 2)-' Determining the repetitive nature of the specific , 'lip plication: The number of times a particular , data Set will be used ih successive operations is a good predictor of the kind and amount of computer resources that may be required. 3) Assess potentiaLiO(;ongrulties betwElllO the file and the'technique(s) under consideration. This meahs asking questions about the file parameters that may affect ,the uSeflilness or, success Of a particular Statement. ' . , " The first part of the decision tree (Stage 1) involves selecting the primaryfunctio,n of the, programming task, namely, to add records, change variable values, etc. The previous chart prpvides a starting point for identifying a set of manipulation techniques for conSideration. One of the first questions should, be whether or not the data set comparison process can be done with the aid of indexing. One method of'identifying whether indexing can be used is to determine whether one of the two files is 130 Despite the fact that indexing can be done in the , DATA stepwitho~ passing the data to a PROC, the cost (or investment) required to index a data Set will exceed the CPU necessary to process the data once through sequential methods (including sorting). l:h refore, H CPU is a concern, and file comparison , ,'will not occur on arej:Jetitive basis, then sequential proceSSing should be used. The MODIFY statement when accompanied with a BY statement, by virtue of the fact that it uses' dynamic WHERE processing (H no index has been created), will use more CPU than sequentiill processing methods. This is especially true Ha large number of observations are involved in the modification process. ,But, in those cases where MODIFY has to select on a small(er) portion of the overall data set (somewhere less than 50%), it can actually be more efficient. If more than 50% of the "observations are to be modHied, steps should be taken to reduce the amount of overhead by sorting the data sets prior to performing the merge, or by indexing' the master data set. If disk space or storage is restricted, then it is desirable to reduce ,the number of data sets being created in the e Decision Flow Diagram Stage 1 • SAS Data Set Comparison Stage 2 • -.. 1oouo7 No IStage 3 No peef•• a I~-:.;=I 131 • refinements in one's understanding take place and unique situations are not only encountered, but also accounted for, the decision making process can be expanded. It might even be the case that repetition and/or experience ingrains these ideas so firmly upon oneseH that diagrams and articles may never again be consulted. In the meantime.... program. The MODIFY works well in the latter scenario due to the fact that it modifies the data set 'in place'; MERGE and SET can also be modified to write over one of the output data sets. The final stage (Stage 3) of the decision making process involves determining whether the proposed task can be successfully processed using a particular technique. Various questions need to be posed regarding how that technique expects records to relate to one another between the files. Also important is how the proposed technique processes the records. References SAS Institute Inc. (1990) SAS Language, Reference Version 6, First Edition Cary, N.C.: SAS Institute Inc. SAS Institute Inc. (1991) The SAS System under MVS, Highlights of Release 6.07 Cary, N.C.: SAS Institute Inc. The MODIFY technique, for example, expects that transaction records have master file counterparts based on specific variable values. Transactions are defined literally (and logically) as successor records. Therefore, the SAS system treats transaction records as though they have a parent assignment to a master record. When two data sets are listed in a MODIFY statement and one of which is not a subset of the other, an error messages is issued regarding the presumed inconsistency. SAS Institute Inc. (1992) SAS Technical Report P222, Changes and Enhancements to Base SAS Software, Release 6.07 Cary, N.C.: SAS Institute Inc. Beatrous, Steve and Karen Armstrong, Effective Use of Indexes in the SAS System Proceedings of the Sixteenth Annual SAS Users Group International Conference, 16, 605-614. The issue of duplicates illustrates the last point. Integral to using one technique over another is understanding how that technique processes the records from each of the two files. Regarding DATA step file comparison, duplicates are treated differently when sequential and non-sequential data access methods are used. Depending on the desired output and/or the processing that must occur, certain techniques can give unwanted or unanticipated results. The final step, examining the remaining techniques, is left to programmer preference. SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. -indicates USA registration. ConcluslonsFrom the comparison between standard and new techniques and from the decision making analysis presented here, it is clear that the new collection of file comparison techniques have their greatest utility , in specific applications. From a programming point 'of view, it behooves the programmer to know what those application scenarios are, as well as, what constraints are imposed on certain techniques. By structuring and formalizing the decision making process the programmer is less apt to make simple errors and spend less time trying to figure out exactly which technique to use. The previous analysis also gives the programmer a platform through which more detailed investigations into specific techniques can be launched. As 132