"The Top Ten": Frequently Encountered Errors of Beginning SAS Programmers Debby Vivari, Westat Inc. Correct Learning any new language, computer or otherwise, involves a period of trial and error. Because SAS is relatively easy to use, beginning SAS programmers get ambitious quickly, and problems can result. This is often worse for programmers experienced in other high level languages, because of preconceived notions about the structure of a computer language. Those for whom SAS is a first language do not have this prejudice. data dl; infile x; input abc; if a=l then output dl; proc print; 2. This is also more common for programmers accustomed to another high level language. Using an OS data set is not a bad idea if the data is being manipulated as preparation for an existing program in another language. But, frequently, SAS is used over and over again and the data file is input from an os data file each time. It's not difficult to copy the input lines from program to program, but it's a waste of computer time to convert the data to SAS format again and again. Some of the 'errors' that occur are impossible to ignore--they produce error messages from SAS and the program does not execute successfully. However, a SAS program can execute with no errors and still be 'wrong' in terms. of efficiency, because it does not make good use of the unique features of SAS. These kinds of errors can cost the most in terms of time and resources. The following list consists of some of the most commo.n errors encountered by new SAS programmers. It was compiled after a number of years of debugging other people's programs. It is by no means complete and the order is not necessarily significant. 1. Not using SAS data sets •• OS in and OS out. Acceptable data dl: infile input abc; d=a/b, file osout: put a b c d; Confusion between procs and data steps. 
Xi and then another program: This is especially true for experienced programmers, who are used to printing out a few results, or computing a few statistics during the course of inputting and reformatting the data. This certainly can be done in SAS in a data step, but frequently proc prints or proc means show up in the middle of a data step. It is important to stress the two-unit structure of data steps and procs and how these units con be ordered within a program. It is not a matter (as in Fortran or Cobol) of reading a record, converting it into a useful form and then accumulating totals, etc., and printing. First the data is read and converted. The file is then passed again in the proc and acted upon. data dl; in£ile osout; input abc d: proc freq; tables abc d; Better data x.dl; infile y: input abc; d=a/b, proc freq data=x.dl: tables abc d; Actually, this probably stems from a lack of understanding of the purpose for and structure of SAS data sets. New SAS programmers tend to be very concerned with 'what columns' the data items are in, and other non-pertinent issues. Many are not aware of the history or data description segments of SAS data sets, although new programmers could make the best use of them. A common example of this kind of confusion might be: Incorrect data dl, infile x; input abc; proc print; if a=l then output dl: 3. Passing and passing a file. Somehow the word 'set' isn't quite as explicit as the word 'read', even though it involves just about the same 93 thing--passing the file. Experienced programmers are usually more sensitive to this, but beginning programmers who are starting out on SAS just go crazy 'se.tting' again and aqain unnecessarily. Most examples in the SAS manuals show fairly short data steps with only one data set output. An example of a relatively long data step creating several output data sets often helps this situation. 
Fortran, Cobol or PLIl, indirect subscripting with 1 dimensional arrays can be considered a little, well l clumsy (don't get me wrong ... we're grateful to have subscripting at all). "Subscript out of range at line xxx., is a frequently seen error. Lots of examples seem to help •.. the documentation on arrays is a little sketchy and experience is really the only answer. It's not a bad idea to emphasize that, while using arrays in SAS can save quite a bit of programmer time, they seem to take up a lot of computer time and should not be used indiscrimi~ nately. Incorrect ---data dli set x.old; if sex=' f' ; data d2i set dl; if age<lO then x=l; else x=O; proc freq: tables x; 6. Correct An important difference between SAS and Fortran or Cobol is that SAS initializes numeric variables to . and character variables to ' , at the beginning of every observation. The retain statement is needed to carryover values of variables from one observation to another. This can be a real problem if it's not understood, because no error message is given, but incorrect output will aLmost certainly result. An example miqht be: data dl; set x.old; if· sex=' f' i if age<lO then x=l; else x=O; proc freq: tables x: There's no need for data set d2; that line could be left out completely. In fact, if x.ald had 500,000 observations in it, the point would seem even more import ant. 4. Incorrect No length statements, or all numeric data. data dl; set dO; by id; if first.id then y=O; Unfortunately, no one notices this problem until it becomes a big problem .... and someone runs out of work space. The problem usually turns out to be that 1000 indicator variables, all 0 or 1, were needed and each one defaulted to 8 bytes apiece. A single length statement can make all the difference. Also, very often data is not used computationally .•. it's only needed cateqorically, in crosstabs or for suhsetting. In that case, character data is really more appropriate, and will take only the space that it needs. 
It's important to add a warning about length statements. A default length of 4 for all the numeric variables can really clobber a tendigit number -- and non-integer values are never quite the same. In many instances, a judicious use of length statements and character data (where appropriate) can save a lot of work space. 5. The default of missing and' not using retain. y=y+l; if last.id then output dli Correct data dl; set dO; by id; retain y; if first.id then y=O: y=y+l; if last.id then output dl; 'y' should be a count of the number of observations per id, but without a retain statement, 'y' will- be missing whenever there is more than one observation per id. 7. When is a data set output? The default in SAS is to output all data sets at the end of the data step unless otherwise specified. However, the output statement allows one the option _of outputting at any point in the data step. This freedom is great but the programmer must remember that SAS will output the data set right then and any statements following the output statement will be ineffective. Not understanding indirect subscripting. ~ew programmers rarely understand subscripting of any type, so this really applies to more experienced programmers. For anyone used to the ease of double and triple subscripting in 94 records of students, with school id and district id. One might want to create school level and district level files at the same time the new student level file is being created. One pass of the data set is enough, if the file is sorted by district, school and student. A concrete example with at least 3 'by' variables seems to be the most help in reducing the confusion. (A similar type of problem occurs with proc summary and the type variable, and the same type of example would be useful here). 
Incorrect data dl d2; set dO; if sex='m' then output dl; if sex='f' then output d2; if age>2S then age2=1; Correct data dl d2; set dO: if age>2S then age2=1; if sex='m' then output dl; if sex='f' then output d2i data dstud dschl ddist; set dl; by district schl id stud_id; retain cntrl cntr2 -0; if first.schl id then cntrl=O; if first.district then cntr2=0; In the incorrect example, 'age2' will al....·ays be equal to missing, because it is set to missing at the beginnir.q of the data step and the data sets are output before it is given a value. 8. $char and $ and leading (trailing) blanks. cntrl+l; if last.schl id then do; cntr2=cntr2+cntrl; output dschl; end; if last. district then output ddist; output dstud; This has caused many new SAS proqrammers considerable grief. They know about the w.d. format and the $w. format and that seems to just about cover their needs, so they don't look any further. All character variables are input with the $w. format. However, the first time they have data with leading blanks, they're in trouble. They may want those leading blanks and the $w. format is going to truncdte them. This is even worse when an as file is output with the same $w. format because all the columns will be off. Usually, though, most programmers only make this mistake once -and it often looks something like this: The three data sets output are all different levels. 'Dstud' is one record per student, 'dschl' is one record per school, and 'ddist' is one record per district. When this method is not understood, the tendency is to create the student level file, set again by school id and create the school level file, and then set a third time by district id, to create the district file. This works, but it involves unnecessary steps. Incorrect data null ; infile x; input -x $ 6:- 10. file osout; put x $ 6. . .. ; Again, this mistake is usually made only once (that 1 s enough). It most of,... 
ten occurs in cleaning up a group of programs, and saving various files for later use, in documenting. The SAS data sets sort of mix in with all of the other ones. The utility runs with condition code 0, and everything's fine until someone tries to read the tape (usually two years later). Frequent warnings often help, but sometimes even that isn't enough. Correct data null; infile x; input -x $cnar6. file osout; put x $char6. If columns 1-6 of the input file are 123' there's trouble when that OS file is written out. 9. Copying SAS data sets with a gener (or other ibm utility) SAS is a powerful language, but is different in structure from the common higher level languages in several basic ways. Mistakes are bound to occur, but the more experienced SAS programmer can smooth the way for the novice, if he knows what to expect. First., last., and more than 1 'by' variable. Once the concept of 'by' variables is understood, it's hard to-" imagine what the problem could be for someone having difficulty. It's important to stress that 'by' variables only work when the file is sorted in that same order. For instance, suppose a file contains 95
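
As a supplement to item 4 (no length statements), the following is a minimal sketch of how a single length statement might be applied to a large group of 0/1 indicator variables. The data set names d0 and d1 and the variable names ind1-ind1000 are hypothetical, chosen only for illustration. The length of 3 is an assumption that suits small integer flags; the minimum numeric length allowed depends on the host system, and, as item 4 warns, short lengths should not be applied to large or non-integer values.

```sas
/* Sketch only: d0, d1 and ind1-ind1000 are hypothetical names. */
data d1;
   length ind1-ind1000 3;   /* store each 0/1 indicator in 3 bytes, not the default 8 */
   set d0;
run;

/* proc contents lists the stored length of every variable in d1. */
proc contents data=d1;
run;
```

With 1000 indicators, this would reduce the storage for those variables from 8000 bytes to 3000 bytes per observation. Storing a purely categorical value as a short character variable is the other option the paper mentions.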