"The Top Ten":
Frequently Encountered Errors of Beginning SAS Programmers
Debby Vivari, Westat Inc.
Learning any new language, computer or otherwise, involves a period of
trial and error. Because SAS is relatively easy to use, beginning SAS
programmers get ambitious quickly, and problems can result. This is
often worse for programmers experienced in other high-level languages,
because of preconceived notions about the structure of a computer
language. Those for whom SAS is a first language do not have this
prejudice.

Some of the 'errors' that occur are impossible to ignore--they produce
error messages from SAS and the program does not execute successfully.
However, a SAS program can execute with no errors and still be 'wrong'
in terms of efficiency, because it does not make good use of the unique
features of SAS. These kinds of errors can cost the most in terms of
time and resources.

The following list consists of some of the most common errors
encountered by new SAS programmers. It was compiled after a number of
years of debugging other people's programs. It is by no means complete
and the order is not necessarily significant.

1. Not using SAS data sets -- OS in and OS out.

This is also more common for programmers accustomed to another
high-level language. Using an OS data set is not a bad idea if the data
is being manipulated as preparation for an existing program in another
language. But, frequently, SAS is used over and over again and the data
is input from an OS data file each time. It's not difficult to copy the
input lines from program to program, but it's a waste of computer time
to convert the data to SAS format again and again.

Acceptable

    data d1; infile x;
      input a b c;
      d=a/b;
      file osout;
      put a b c d;

and then another program:

    data d1; infile osout;
      input a b c d;
    proc freq;
      tables a b c d;

Better

    data x.d1; infile y;
      input a b c;
      d=a/b;
    proc freq data=x.d1;
      tables a b c d;

Actually, this probably stems from a lack of understanding of the
purpose for and structure of SAS data sets. New SAS programmers tend to
be very concerned with 'what columns' the data items are in, and other
non-pertinent issues. Many are not aware of the history or data
description segments of SAS data sets, although new programmers could
make the best use of them.

2. Confusion between procs and data steps.

This is especially true for experienced programmers, who are used to
printing out a few results, or computing a few statistics during the
course of inputting and reformatting the data. This certainly can be
done in SAS in a data step, but frequently proc prints or proc means
show up in the middle of a data step. It is important to stress the
two-unit structure of data steps and procs and how these units can be
ordered within a program. It is not a matter (as in Fortran or Cobol)
of reading a record, converting it into a useful form and then
accumulating totals, etc., and printing. First the data is read and
converted. The file is then passed again in the proc and acted upon.

A common example of this kind of confusion might be:

Incorrect

    data d1;
      infile x;
      input a b c;
    proc print;
      if a=1 then output d1;

Correct

    data d1;
      infile x;
      input a b c;
      if a=1 then output d1;
    proc print;
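
The two-unit structure of data steps and procs can be made visible by closing each unit with its own run statement. What follows is only a sketch in present-day SAS syntax, not the author's code: the inline datalines records are made-up values standing in for the external file x, and the explicit run statements and the data= option are additions that the examples above do not show.

```sas
/* Unit 1: the data step reads the records and keeps those with a=1. */
/* Every data step statement, including the IF, belongs before RUN.  */
data d1;
   input a b c;
   if a = 1 then output d1;
   datalines;
1 2 3
0 5 6
1 8 9
;
run;

/* Unit 2: the proc is a separate step that makes its own pass */
/* over the finished data set d1.                              */
proc print data=d1;
run;
```

Naming the input data set on the proc is optional, since a proc uses the most recently created data set by default, but spelling it out makes it harder to misplace a data step statement after the proc.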
3. Passing and passing a file.

Somehow the word 'set' isn't quite as explicit as the word 'read', even
though it involves just about the same thing--passing the file.
Experienced programmers are usually more sensitive to this, but
beginning programmers who are starting out on SAS just go crazy
'setting' again and again unnecessarily. Most examples in the SAS
manuals show fairly short data steps with only one data set output. An
example of a relatively long data step creating several output data
sets often helps this situation.

Incorrect

    data d1; set x.old;
      if sex='f';
    data d2; set d1;
      if age<10 then x=1;
      else x=0;
    proc freq; tables x;

Correct

    data d1; set x.old;
      if sex='f';
      if age<10 then x=1;
      else x=0;
    proc freq; tables x;

There's no need for data set d2; that line could be left out
completely. In fact, if x.old had 500,000 observations in it, the point
would seem even more important.

4. No length statements, or all numeric data.

Unfortunately, no one notices this problem until it becomes a big
problem ... and someone runs out of work space. The problem usually
turns out to be that 1000 indicator variables, all 0 or 1, were needed
and each one defaulted to 8 bytes. A single length statement can make
all the difference. Also, very often data is not used
computationally ... it's only needed categorically, in crosstabs or for
subsetting. In that case, character data is really more appropriate,
and will take only the space that it needs. It's important to add a
warning about length statements. A default length of 4 for all the
numeric variables can really clobber a ten-digit number -- and
non-integer values are never quite the same. In many instances, a
judicious use of length statements and character data (where
appropriate) can save a lot of work space.

5. The default of missing and not using retain.

An important difference between SAS and Fortran or Cobol is that SAS
initializes numeric variables to . and character variables to ' ' at
the beginning of every observation. The retain statement is needed to
carry over values of variables from one observation to another. This
can be a real problem if it's not understood, because no error message
is given, but incorrect output will almost certainly result. An example
might be:

Incorrect

    data d1; set d0;
      by id;
      if first.id then y=0;
      y=y+1;
      if last.id then output d1;

Correct

    data d1; set d0;
      by id;
      retain y;
      if first.id then y=0;
      y=y+1;
      if last.id then output d1;

'y' should be a count of the number of observations per id, but without
a retain statement, 'y' will be missing whenever there is more than one
observation per id.

6. Not understanding indirect subscripting.

New programmers rarely understand subscripting of any type, so this
really applies to more experienced programmers. For anyone used to the
ease of double and triple subscripting in Fortran, Cobol or PL/I,
indirect subscripting with 1-dimensional arrays can be considered a
little, well, clumsy (don't get me wrong ... we're grateful to have
subscripting at all). "Subscript out of range at line xxx" is a
frequently seen error. Lots of examples seem to help ... the
documentation on arrays is a little sketchy and experience is really
the only answer.

It's not a bad idea to emphasize that, while using arrays in SAS can
save quite a bit of programmer time, they seem to take up a lot of
computer time and should not be used indiscriminately.

7. When is a data set output?

The default in SAS is to output all data sets at the end of the data
step unless otherwise specified. However, the output statement allows
one the option of outputting at any point in the data step. This
freedom is great but the programmer must remember that SAS will output
the data set right then, and any statements following the output
statement will have no effect on the observation already written.
Incorrect

    data d1 d2; set d0;
      if sex='m' then output d1;
      if sex='f' then output d2;
      if age>25 then age2=1;

Correct

    data d1 d2; set d0;
      if age>25 then age2=1;
      if sex='m' then output d1;
      if sex='f' then output d2;

In the incorrect example, 'age2' will always be equal to missing,
because it is set to missing at the beginning of the data step and the
data sets are output before it is given a value.

8. $char and $ and leading (trailing) blanks.

This has caused many new SAS programmers considerable grief. They know
about the w.d format and the $w. format and that seems to just about
cover their needs, so they don't look any further. All character
variables are input with the $w. format. However, the first time they
have data with leading blanks, they're in trouble. They may want those
leading blanks and the $w. format is going to truncate them. This is
even worse when an OS file is output with the same $w. format because
all the columns will be off. Usually, though, most programmers only
make this mistake once -- and it often looks something like this:

Incorrect

    data _null_; infile x;
      input x $6.;
      file osout;
      put x $6. ...;

Correct

    data _null_; infile x;
      input x $char6.;
      file osout;
      put x $char6.;

If columns 1-6 of the input file are '   123' there's trouble when that
OS file is written out.

9. Copying SAS data sets with IEBGENER (or other IBM utility).

Again, this mistake is usually made only once (that's enough). It most
often occurs in cleaning up a group of programs, and saving various
files for later use, in documenting. The SAS data sets sort of mix in
with all of the other ones. The utility runs with condition code 0, and
everything's fine until someone tries to read the tape (usually two
years later). Frequent warnings often help, but sometimes even that
isn't enough.

10. First., last., and more than 1 'by' variable.

Once the concept of 'by' variables is understood, it's hard to imagine
what the problem could be for someone having difficulty. It's important
to stress that 'by' variables only work when the file is sorted in that
same order. For instance, suppose a file contains records of students,
with school id and district id. One might want to create school level
and district level files at the same time the new student level file is
being created. One pass of the data set is enough, if the file is
sorted by district, school and student. A concrete example with at
least 3 'by' variables seems to be the most help in reducing the
confusion. (A similar type of problem occurs with proc summary and the
_type_ variable, and the same type of example would be useful here.)

    data dstud dschl ddist;
      set d1;
      by district schl_id stud_id;
      retain cntr1 cntr2 0;
      if first.schl_id then cntr1=0;
      if first.district then cntr2=0;
      cntr1+1;
      if last.schl_id then do;
        cntr2=cntr2+cntr1;
        output dschl;
      end;
      if last.district then output ddist;
      output dstud;

The three data sets output are all different levels. 'Dstud' is one
record per student, 'dschl' is one record per school, and 'ddist' is
one record per district. When this method is not understood, the
tendency is to create the student level file, set again by school id
and create the school level file, and then set a third time by district
id, to create the district file. This works, but it involves
unnecessary steps.

SAS is a powerful language, but is different in structure from the
common higher level languages in several basic ways. Mistakes are bound
to occur, but the more experienced SAS programmer can smooth the way
for the novice, if he knows what to expect.