Paper Template - Western Users of SAS Software

SAS® Programming Basics: Getting Your Data In and Understanding the
DATA Step
Helen Carey, Carey Consulting, Kaneohe, HI
ABSTRACT
SAS data sets are used in the SAS analysis and reporting procedures. There are various ways to get your data into a
SAS data set. In this presentation, we will concentrate on the DATA step and the Import/Export Wizard.
The DATA step is one of the building blocks of SAS programming. Understanding the basic structure and
components of the DATA step is fundamental in learning to create your own SAS data sets. You will learn how the
DATA step works, which includes understanding the input buffer and program data vector, the structure of the SAS
data set and what happens at compile time and execution time. By understanding DATA step processing, you can
debug your programs and interpret your results with confidence.
INTRODUCTION
SAS is a powerful information delivery system. Using SAS, you are able to analyze or process your data to solve your
problem and generate reports. First, you need to get your data into a form that SAS can use, which is the data set.
SAS is organized into steps: DATA steps and PROC steps. Usually, the DATA step is used to create and
manipulate data sets while the PROC step analyzes the data or generates reports. The SAS data set is used by
SAS procedures to analyze the data and produce a finished report.
The DATA step has an implicit loop that cycles through each record of the input data. There is an implicit OUTPUT
statement at the end of the DATA step. During the compile phase of processing of the DATA step, SAS creates the
program data vector, a logical area in memory that holds the values during the current loop through the DATA step.
Understanding this implied DO loop (until the end of file) and the program data vector is important to becoming a
better programmer. Throughout this paper, the program data vector will also be referred to as the PDV.
WHAT IS A SAS DATA SET?
If your data is not stored in a SAS data set or a form accessible by SAS/ACCESS on your computer, you need to create
a SAS data set by entering data, by reading raw data, or by accessing files created by other software. You can create
a SAS data set from in-stream data (datalines, cards), a raw data file (infile), another SAS data set (set, merge,
update), and DBMS files (SAS/ACCESS, Oracle, import and export). Also, PROC steps can create SAS data sets with
an OUT= option or an OUTPUT statement.
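As a minimal sketch of three of these routes, the following steps create a data set from in-stream data, from another SAS data set, and from a PROC step. The data set and variable names (scores, high_scores, score_stats, avg_score) are made up for illustration.

```sas
/* 1. In-stream data read with INPUT and DATALINES */
data work.scores;
    input name $ score;
    datalines;
Ann 90
Bob 85
;
run;

/* 2. Another SAS data set read with SET */
data work.high_scores;
    set work.scores;
    where score >= 90;
run;

/* 3. A PROC step that writes a data set with an OUTPUT statement */
proc means data=work.scores noprint;
    var score;
    output out=work.score_stats mean=avg_score;
run;
```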
A SAS data set is a table of data values organized into variables and observations. The variables in a SAS data set
are the columns of the table and the observations in a SAS data set are the rows.
Figure 1 SAS Data Set
SAS data sets can be classified as a SAS data file (member type DATA) or as a SAS data view (member type VIEW).
A data file contains the data and data description. A data view contains the location of the data that is stored
elsewhere and either a description or where to find the data description. In most SAS programs, it doesn’t matter
which type it is. Data files and data views both can be processed in DATA steps and PROC steps.
DATA SET PROCESSING
Typically, you use the DATA step to:
put your data into a SAS data set
create new variables
manipulate the values of your variables to meet the needs of your analysis
check your data for errors and correct data errors
create new SAS data sets by subsetting, combining, and updating existing data sets.
DATA steps typically create or modify SAS data sets, but can also produce reports.
SAS processes the DATA step in two phases. It first compiles the program and then executes the machine code.
When you submit a DATA step for execution, SAS first scans and checks the syntax, translates the statements to
machine code, creates the input buffer if you have an INPUT or INFILE statement, and then sets up the program data
vector (PDV) and data set descriptor information.
After the compile phase, the DATA step executes. During execution, the DATA step loops by first reading values into
the PDV, executing statements that may change the values in the PDV and then eventually writing the values in the
PDV out as an observation into the new data set.
Execution
Executes statement by statement until end of program
Repeats execution for each observation in input data until there are no more observations or records
Writes values in PDV to new data set
Figure 2. Remember The Data Step Flow
When a DATA step is executed, information is written to a log to explain how it was executed. Always check your log
for any error messages and check the number of observations read and the number written.
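To watch the implicit loop at work, a small step such as the following sketch can help. The data set name loop_demo and the variables x and y are made up for illustration. The PUT statement at the top of the step writes _N_ to the log on every pass.

```sas
data work.loop_demo;
    put 'Start of iteration ' _n_=;   /* runs once per pass through the step */
    input x;
    y = x * 2;
    /* implicit OUTPUT and implicit return to the top happen here */
    datalines;
1
2
3
;
run;
```

The log should show four iterations but only three observations written: on the fourth pass the INPUT statement finds no more records, so the step stops before reaching the implicit OUTPUT.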
DECLARATIVE STATEMENTS
Declarative statements are compile-time only statements. They provide information to the program data vector, and
cannot be conditionally executed. The declarative statements are:
drop, keep, rename
label
retain
length
format, informat
attrib
array
by
where
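The following sketch illustrates what "cannot be conditionally executed" means in practice. It assumes the SASHELP.CLASS sample data set is available. The DROP statement sits inside an IF-THEN DO group, but because it takes effect at compile time, HEIGHT is dropped from every observation, not only from those where AGE is greater than 13.

```sas
data work.decl_demo;
    set sashelp.class;
    if age > 13 then do;
        drop height;   /* compile-time only: applies to all observations */
        flag = 1;      /* executable: assigned only when the condition is true */
    end;
run;
```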
PROGRAM DATA VECTOR (PDV)
When the DATA step is compiled, a buffer (computer memory) is allocated for the temporary storage of the observation
being built. It is called the PDV (program data vector). The PDV has space for one observation and all variables.
Variable names are added to the PDV in the order that they are encountered in the program. It contains the variables
in the input SAS data set or input file and the variables created in the DATA step statements.
Variable names can be mixed case. They are stored as written on first occurrence, and subsequent uses can be in a
different case.
SAS automatic variables are created when a DATA step executes. You can use these variables in your DATA step
programming, but they are not stored in the created data sets. To save the value of an automatic variable, assign its
value to a data set variable. The values of automatic variables are retained from one iteration of the DATA step to the
next.
Automatic variables include
_N_: the number of times the DATA step has iterated
_ERROR_: indicates whether an error has occurred in the DATA step. The value is 0 if no errors occurred and
1 if one or more errors occurred. Errors include such things as division by zero or invalid input values.
The FIRST.variable and LAST.variable automatic variables are available when you are using a BY variable in a
DATA step.
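Here is a minimal sketch that uses these automatic variables, assuming the SASHELP.CLASS sample data set is available. The names class_sorted, class_counts, count and iteration are made up for illustration. It counts the observations in each BY group and saves _N_ in a data set variable so that it is kept.

```sas
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

data work.class_counts;
    set work.class_sorted;
    by sex;
    if first.sex then count = 0;   /* reset at the start of each BY group */
    count + 1;                     /* sum statement: count is retained */
    iteration = _n_;               /* save the automatic variable */
    if last.sex then output;       /* write one observation per BY group */
    keep sex count iteration;
run;
```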
The DATA step has an implicit loop that cycles through each record of the input data. There is an implicit OUTPUT
statement at the end of the DATA step. When the DATA step executes, the PDV contains the observation currently
being processed. At the end of the step, the data in the PDV is written to the new data set. DROP, KEEP, and
RENAME statements indicate which variables to drop, keep, or rename on the output data set.
The PUT statement can be a useful debugging tool. For example, the following statement writes the values of all
variables, including the automatic variables _ERROR_ and _N_, that are defined in the current DATA step:
put _all_;
Sample Code:
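DATA Visit;
   LENGTH Test 3;                /* Test is the first variable added to the PDV */
   LABEL Date='First Visit';     /* Date is the second */
   INFORMAT Score Myfmt3.;       /* Myfmt3. is assumed to be a user-defined informat */
   FORMAT Date MMDDYY8.;
   /* An INFILE or DATALINES statement is still needed to supply the raw data */
   INPUT Id $ 1-3 Test 5-6
         @8 Score @11 Date DATE7.;
   Avg=MEAN(Test,Score);         /* Avg is the last variable added */
RUN;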
In this program notice that the first variable encountered is TEST, then DATE. The last one is AVG.
Figure 3. Program Data Vector (PDV)
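One way to confirm the order of the variables is to list them by position with PROC CONTENTS and the VARNUM option. The following is a self-contained sketch with made-up data and a hypothetical data set name (order_demo). It uses only built-in informats so that it can run without the user-defined informat in the sample above.

```sas
data work.order_demo;
    length test 3;            /* TEST enters the PDV first */
    format date mmddyy8.;     /* DATE enters second */
    input id $ test score date :date7.;
    avg = mean(test, score);  /* AVG enters last */
    datalines;
A01 85 90 12JAN12
;
run;

/* VARNUM lists the variables in the order they are stored */
proc contents data=work.order_demo varnum;
run;
```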
RETAIN STATEMENT
The order of the variables in the PDV is the same order that the variables become known to SAS, as well as the order
they will be in the SAS data set. If you want to reorder the variables in the data set, you need to create a new SAS
data set.
One way is to list the variables in a RETAIN statement in the order that you want them. The RETAIN statement must
be placed before the SET or INPUT statement that reads the variables. Keep in mind that the values of the variables
named in a RETAIN statement are retained from one iteration of the DATA step to the next.
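DATA Visit;
   RETAIN Id Avg;                /* Id and Avg now come first in the PDV */
   LENGTH Test 3;
   LABEL Date='First Visit';
   INFORMAT Score Myfmt3.;       /* Myfmt3. is assumed to be a user-defined informat */
   FORMAT Date MMDDYY8.;
   INPUT Id $ 1-3 Test 5-6
         @8 Score @11 Date DATE7.;
   Avg=MEAN(Test,Score);
RUN;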
Figure 4. Change Order of Variables With Retain
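The same technique works when the input is an existing SAS data set. This is a minimal sketch, assuming the SASHELP.CLASS sample data set is available. The output name class_reordered is made up. Because the variables come from a SET statement, their values are replaced on every iteration, so the RETAIN statement here affects only the order.

```sas
data work.class_reordered;
    retain sex age name;   /* listed first, so they lead the PDV */
    set sashelp.class;     /* remaining variables follow in their original order */
run;
```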
INPUT BUFFER
When the data step is compiled, an input buffer is allocated in memory for the temporary storage of a raw data line.
This is created only if raw data are read.
To print the contents of the input buffer, code:
put _infile_;
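For example, the following sketch writes the raw record to the log only when SAS flags an error while reading it. The data set name, variable names and data lines are made up for illustration.

```sas
data work.buffer_demo;
    input name $ amount;
    /* _ERROR_ is 1 when the current record could not be read cleanly */
    if _error_ then put 'Bad record: ' _infile_;
    datalines;
Ann 10
Bob x
;
run;
```

The second record has a non-numeric value for AMOUNT, so AMOUNT is set to missing and the contents of the input buffer are written to the log.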
DATA DESCRIPTOR
The descriptor portion is written to the data set and contains general information about the data set, including
the name of the data set
the date and time the data set was created
the number of observations and the number of variables
name and attributes of the variables
In addition to general information about the data set, the descriptor portion contains attribute information for each
variable in the data set. This includes the variable's name, type, length, format, informat, and label.
An informat (input format) tells SAS how to read raw data. SAS provides many informats for reading different kinds
of data values. A format tells SAS how to write or group the data values.
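As a short sketch of the difference, the step below reads a date and an amount with informats and stores display formats with the variables. The data set name, variable names and data line are made up for illustration.

```sas
data work.fmt_demo;
    /* informats control how the raw text is read */
    input when :mmddyy10. amount :comma8.;
    /* formats control how the stored values are displayed */
    format when date9. amount dollar10.2;
    datalines;
01/12/2012 1,250
;
run;

proc print data=work.fmt_demo;
run;
```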
SAS DATA VALUES AND DATES
There are only two types of variables in SAS: character and numeric. Numeric variables are stored with a length of 8 in
the PDV. If you use a LENGTH statement to shorten a numeric variable and save storage space, the shorter length is
applied when the observation is written to the data set.
SAS dates are stored as numbers.
SAS represents a date internally as the number of days between January 1, 1960 and the specified date. Therefore,
if you were born before 1960, your date value is a negative number. When a variable is a SAS date value, you can
add and subtract dates. To find the number of days between two dates, simply subtract the two SAS date variables.
duration = date1 - date2;
You can compare dates.
if date1 < date2 then do;
There are many built-in functions and formats to work with dates.
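The following sketch shows a few of them, using two dates from the sample data later in this paper. The variable names are made up for illustration. A date literal and the MDY function create the date values, subtraction gives the number of days, and INTCK counts the month boundaries crossed.

```sas
data _null_;
    date1 = '14MAR2012'd;                    /* date literal */
    date2 = mdy(1, 12, 2012);                /* month, day, year */
    duration = date1 - date2;                /* 62 days */
    months = intck('month', date2, date1);   /* month boundaries crossed */
    put date1= date2= duration= months=;     /* unformatted: day counts */
    put date1= date9. date2= mmddyy10.;      /* formatted: readable dates */
run;
```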
STEPPING THROUGH THE DATA STEP
SAS sequentially reads each observation in the named data sets, one observation at a time, until there are no further
observations to process.
This is a sample program that we will use to step through the processing of a DATA step.
data Trip;
input Name $ When mmddyy10. COST;
Total= cost * 1.05;
cards;
Bev 1/12/2012 10
Phe 1/25/2012 40
Phe 2/25/2012 80
Bev 3/14/2012 50
Phe 2/30/2012 20
Phe 3/01/2012 30
run;
We are reading the variable Name in as a character value, indicated by the $, and When and COST as numeric
values. The variable When is read using the informat mmddyy10. and will be stored as the number of days since
January 1, 1960. Although it is a good idea, we are not storing a format with the variable When so that we can show it
as a number.
These are the only executable statements in the sample program.
input Name $ When mmddyy10. COST;
Total= cost * 1.05;
The next figure, Figure 5, shows stepping through the DATA step one statement at a time. The line numbers are in the
yellow column on the left of the representation of the program data vector (PDV) and will be used in the explanation of
stepping through the DATA step. The Statement column shows whether we are executing the INPUT statement or
the Total assignment statement in the program. So here we go.
Figure 5. Stepping Through the Data Step
Line 1 The DATA step is in its first iteration, as shown by the automatic variable _N_. All variables named
in the INPUT statement and assignment statements are set to missing, represented by the period (.). _N_ is
initialized to 1 and _ERROR_ to 0, because there are no errors at this point.
Line 2 The INPUT statement reads the first line of the input file, that is, the first line of data after the CARDS
statement. The variables in the INPUT statement are Name, When and Cost, so their values are placed in the PDV.
Line 3 The assignment statement calculates a value for Total, which is placed in the PDV.
Line 4 We have reached the end of the implied DO loop of the DATA step. Because there are no OUTPUT
statements in the program, there is an implied OUTPUT statement at the end of the DATA step.
Therefore the values in the PDV, except for the automatic variables _N_ and _ERROR_, are written to the
SAS data set. Now there is one observation in the data set work.Trip.
Line 5 Next, there is a return to the top of the DATA step and all the values for Name, When, Cost and Total are
initialized to missing. That means that Name, which is character, is set to blank (' ') and the numeric variables
are set to missing, represented by the period. _N_ is incremented by 1. _ERROR_ is set to 0 because there
is not an error at this point in the processing.
Lines 6 to 8 follow the same process as Lines 2 to 4. The second record from the CARDS file is read and written to
the data set.
Lines 9-12 process the third record.
Lines 13-16 process the fourth record.
Line 17 is the top of the DATA step and initializes the variables.
Line 18 differs from the above processing because the fifth input record contains an error. The record is:
Phe 2/30/2012 20
February 30 is an invalid date for the informat mmddyy10., which reads the variable When. Therefore, the value
of When is set to missing and _ERROR_ is set to 1 to indicate an error.
Line 19 The Total value is calculated and put in the PDV.
Line 20 The PDV is written to the SAS data set. There are now 5 observations in the data set.
Lines 21-24 follow the same logic as for the previous observations.
Line 25 Variables are set to missing, _N_ is increased by 1 and _ERROR_ is set to 0.
Line 26 When the INPUT statement is processed, there are no more records. It is the end of the file, so SAS
immediately leaves the DATA step.
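If you want to reproduce this walk-through in your own log, one possible approach is to add PUT statements to the sample program, as in the sketch below. The data set name trip_trace and the LENGTH statement are additions for illustration. The LENGTH statement defines the variables up front so that the first PUT statement can name them before the INPUT statement is reached.

```sas
data work.trip_trace;
    length Name $ 8 When Cost Total 8;   /* define the variables before the first PUT */
    put 'Top of step: ' _n_= _error_= Name= When= Cost= Total=;
    input Name $ When mmddyy10. Cost;
    put 'After INPUT: ' _n_= _error_= Name= When= Cost= Total=;
    Total = Cost * 1.05;
    put 'After Total: ' _n_= _error_= Name= When= Cost= Total=;
    cards;
Bev 1/12/2012 10
Phe 1/25/2012 40
Phe 2/25/2012 80
Bev 3/14/2012 50
Phe 2/30/2012 20
Phe 3/01/2012 30
;
run;
```

On the seventh iteration only the "Top of step" line should appear, because the INPUT statement reaches the end of the file and the step stops.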
These are the results: the SAS data set work.trip. Notice that the values for the variable When are 19004, 19017, etc.
because they are stored as numbers.
Figure 6. SAS Data Set WORK.TRIP
Let’s use PROC REPORT to write out the data set so that the variable is written twice, once to print it as a number
value and once with the more understandable display format DATE10. Notice that the value for the When variable for
the fifth observation is missing because there was an invalid date in the CARDS file.
Sample code:
proc report data=work.trip nowindows;
   column Name When When=PDate Total;
   /* PDate is an alias for When, displayed with a date format */
   define PDate / display 'Purchased' format=date10.;
run;
Figure 7. PROC REPORT Results
OTHER WAYS OF CREATING A SAS DATA SET
SAS ENTERPRISE GUIDE
I like how easy it is to import data from other sources using SAS Enterprise Guide. You do not need the
SAS/ACCESS Interface to PC Files to import Microsoft Excel and Access files into SAS Enterprise Guide.
However, if you do have the license, you can select an option that lets Enterprise Guide use this capability to
improve performance. Once a SAS data set is imported and permanently stored, you can always use it in your own
SAS programs. It is worth checking out SAS Enterprise Guide.
IMPORTING DATA
Importing an Excel File into SAS
PROC IMPORT and the Import Wizard are very useful tools for converting flat or ASCII files and external data
sources into SAS data sets.
The Import Wizard presents a series of windows with simple choices to guide you through the process of importing or
exporting data. The wizard is easy to use.
Here are the steps to import an Excel file if you have licensed SAS/ACCESS Interface to PC Files, which lets you
import PC files, such as Excel or Access. There are also other ways to create SAS data sets from Excel.
By default, the variable names come from the Excel column headers. The data values begin in the second row.
To start the wizard, first make sure the Excel file is closed so that you do not get a file sharing error. Open SAS, then
click File > Import Data.
This opens the dialog box so that you can select the data source type for your input file. The default type is Microsoft
Excel. Click Next to accept the default.
Figure 8. Import Wizard: Import Data
Figure 9. Import Wizard: Select Import Type
In the Connect to MS Excel dialog box, click Browse to locate
the Excel file you want to import.
Figure 10. Import Wizard: Connect to MS Excel
Click OK. This opens the Select table dialog box. Select the
table you want to import. If the workbook has multiple worksheets,
click the pull-down menu for the list of worksheets.
SAS uses the first 8 observations to determine whether a column holds
character or numeric values. In the case of mixed types, the type
that appears most often in the first 8 observations will be
applied. Values that do not conform to the assigned type will be
converted to missing values.
Figure 11. Import Wizard: Select Table
Click Next to open the Select library and member dialog box.
Enter the name of the library or click the pull-down menu to
find it. Enter the name of the SAS data set in the Member
box.
Figure 12. Import Wizard: Select Library and Member
Click Next.
Here you can save the PROC IMPORT statements for
subsequent use. If you want to import multiple worksheets,
saving the program file may save you time. You can edit the
PROC IMPORT statements and replace the name of the
worksheet in the RANGE= statement with the name of another
worksheet.
Click Finish.
Figure 13. Import Wizard: Create SAS Statements
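The saved statements look something like the following sketch. The file path, output data set name and worksheet name are hypothetical, and the exact statements the wizard writes may differ by SAS release. DBMS=EXCEL assumes that SAS/ACCESS Interface to PC Files is licensed.

```sas
proc import out=work.visits                   /* hypothetical output data set */
            datafile="C:\mydata\visits.xls"   /* hypothetical path */
            dbms=excel replace;
    range="Sheet1$";     /* worksheet to read; edit this to import another sheet */
    getnames=yes;        /* take variable names from the first row */
    mixed=no;
run;
```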
GETTING INFORMATION ABOUT THE DATA SET
Along with data values, each SAS data set contains metadata or data about the data. This information, recorded in
the descriptor portion of the data set, contains information like the names and attributes of all the variables, the
number of observations in the data set, and the date and time that the data set was created and updated.
PROC DATASETS
To view the descriptor portion, you can right-click on the data set in the SAS Explorer window and select View
Columns, or print it with PROC DATASETS. The DETAILS option lists the number of observations, the number of
variables, and the label of the data set. This is also a way to find typos in the variable names.
For example:
proc datasets library=work details;
   contents data=trip;
run;
quit;
Figure 14. PROC DATASETS Results
Michael Raithel’s paper PROC DATASETS; The Swiss Army Knife of SAS® has everything you want to know about
PROC DATASETS, a powerful procedure. If you are just changing the attributes of the variables, such as their
names, informats and labels, then use PROC DATASETS to do the work for you, not the DATA step.
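As a minimal sketch, the MODIFY statement below changes attributes of the WORK.TRIP data set created earlier without reading or rewriting the data values. The new name Amount and the label text are made up for illustration.

```sas
proc datasets library=work nolist;
    modify trip;
        rename cost = Amount;            /* change a variable name */
        label when = 'Purchase date';    /* add a label */
        format when date9.;              /* store a display format */
run;
quit;
```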
Use PROC DATASETS instead of the DATA step to concatenate SAS data sets. For example:
proc datasets library=youthlib;
   append base=allyears
          data=year1997;
run;
quit;
VIEWTABLE WINDOW
The ViewTable window in a SAS session is an interactive way to view, enter and edit data. It is accessible from the
SAS Explorer window by clicking on the data set or view, or by using the viewtable (abbreviated vt) command from the
command box. The command box is below the menu bar.
Once you open the ViewTable window, you can select to view only
specific columns by typing the columns command from the command
box, such as columns 'memname name label'. This is the same
as using the hide/unhide on the Data Menu of the ViewTable window.
Figure 15. Viewtable
Close the ViewTable window before submitting the program that
recreates a SAS data set. More than once, I have not closed the
window, have not read the log, and wondered why the results did not
change. This is an example of where it is important to read the log after every run. By reading the log, I would have
found out that I could not re-create my data set because it was open in the ViewTable window and “The SAS System
stopped processing this step because of errors.” Reading the log after every run is a good practice to follow.
DICTIONARY TABLES
A DICTIONARY table is a read-only SAS view that contains information
about SAS libraries, SAS data sets, SAS macros, and external files that
are in use or available in the current SAS session. Each DICTIONARY
table has an associated PROC SQL view in the SASHELP library. You can see the entire contents of a DICTIONARY
table by opening its SASHELP view in the ViewTable window. The names of these SAS views start with V, for example,
VCOLUMN or VMEMBER. SASHELP.VCOLUMN was the view that we were using above in the ViewTable Window
section.
Figure 16. Dictionary Tables
Here is an example of accessing SASHELP.VCOLUMN using PROC SQL.
proc sql;
   select memname format=$8.,
          varnum,
          name format=$15.,
          label
      from sashelp.vcolumn
      where libname='SASHELP'
        and memname='ZIPCODE';
quit;
Results
Figure 17. SASHELP.VCOLUMN
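Within PROC SQL you can also query the DICTIONARY table directly instead of going through its SASHELP view. The sketch below lists the variables of the WORK.TRIP data set created earlier. Library and member names are stored in uppercase, so the WHERE clause uses uppercase values.

```sas
proc sql;
    select name, type, length
        from dictionary.columns
        where libname = 'WORK'
          and memname = 'TRIP';
quit;
```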
SAVE DISK SPACE
DROP, KEEP AND RENAME VARIABLES
Save disk space and make your data sets easier to understand by inputting only the data you need. Drop variables
you no longer need.
DATA mylib.yearly(DROP=Rain1-Rain12);   /* not written to the output data set */
   SET Old(DROP=Snow1-Snow12);          /* never read into the PDV */
   Total = SUM(of Rain1-Rain12);
   /* more programming statements */
RUN;
Drop DO loop indexing variables.
data Mylib.NewCost (DROP=i);
set Mylib.Cost;
array Amt(100) Amt1-Amt100;
do i=1 TO 100;
Amt(i)=MAX(0,Amt(i));
end;
run;
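The KEEP= and RENAME= data set options work the same way. This is a minimal sketch, assuming the SASHELP.CLASS sample data set is available. The names class_small, student and height_cm are made up for illustration.

```sas
data work.class_small (rename=(name=student));   /* renamed as it is written */
    set sashelp.class (keep=name age height);    /* read only what is needed */
    height_cm = height * 2.54;
    drop height;                                 /* not needed in the output */
run;
```

Using KEEP= on the SET statement means the other variables are never brought into the PDV, which saves processing time as well as disk space.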
STORING NUMERIC CATEGORICAL DATA
Store numeric categorical data in character variables to save space.
length quest1-quest40
$ 1;
Suppose you had a 40 question survey with 500 respondents and categorical responses from 1 to 9. It would take
20,000 bytes to store it as 1 character (40 x 500 = 20,000) versus storing it as a numeric of length 8 (160,000 bytes).
The default length of numeric variables in SAS data sets is 8 bytes. To save space, store integer numeric variables in
a length less than 8, if you can, by using the LENGTH statement for integer variables. For example, dummy variables
that have only a value of 0 or 1 are good candidates for a length of 3. Use the LENGTH statement only
for variables whose values are always integers. Non-integer numbers lose precision if they are truncated.
length Id s1-s5 4
Income 8
default=3;
Be careful when choosing the length, because the largest integer that can be represented exactly varies from one
host system to another. The largest integer that you can store with a length of 3 bytes is 65,536 under z/OS. On Unix
and Windows it is 8,192. See the SAS Companion for your host system for more information.
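A quick way to see the effect on your own system is a test like the following sketch. The data set and variable names are made up for illustration. It stores the same value in a 3-byte and an 8-byte variable and reads it back. On Unix or Windows you would expect the 3-byte copy of 8193 to come back changed, because it is one more than the largest integer that fits.

```sas
data work.len_demo;
    length small 3 full 8;
    small = 8193;   /* one more than the 3-byte limit on Unix and Windows */
    full  = 8193;
run;

data _null_;
    set work.len_demo;
    put small= full=;   /* compare the stored values in the log */
run;
```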
RECOMMENDED READING
SAS ONLINE DOCUMENTATION
Visit Getting Started with SAS Software in SAS Help and Documentation. It is available from your
windowing environment or online under Documentation at support.sas.com.
Figure 18. SAS Help and Documentation
BOOKS
I wholeheartedly recommend the book Step-by-Step Programming with Base SAS Software. You can purchase it
from SAS Press or download the PDF for free. Do a Google search for “Step-by-Step Programming with Base SAS
Software PDF”. It is more than 788 pages, so you may want to just store it on your computer, Kindle, or tablet.
The Little SAS Book: A Primer, Fourth Edition by Lora Delwiche and Susan Slaughter is an easy-to-read classic and
is available from SAS Press and also from amazon.com as either a paperback book or a Kindle edition. Check with the
other programmers in your office. They may already have a copy.
Carpenter's Guide to Innovative SAS Techniques is a programming reference that includes advanced topics. It shows
DATA step techniques that solve complex data problems. Check out Amazon for better prices and shipping rates and
to view the table of contents and a free chapter.
SAS CONFERENCE PAPERS
One way to find SAS papers is to do a Google search. Another way is to visit Lex Jansen's website at
lexjansen.com. This site searches 7360 SAS papers from SAS Global Forum, SUGI, PharmaSUG, NESUG,
SESUG, PhUSE, WUSS, MWSUG, PNWSUG and SCSUG. Bound copies of old proceedings are being scanned and
added to the collection of SAS papers.
Figure 19. SAS Conference Papers Web Site
Save a tree. Do not print out every SAS paper or article that you find that you might want to read one day. They are
easier to find if you use a book marking tool, Instapaper, or your own list to keep a record of the papers you might
want to read later. You can download papers to store on your computer or e-reader. The iPad and e-readers can
store and read PDFs and allow markup or bookmarks. With an e-reader you can even read in bed.
A SAS essential is learning to write clear flexible code with a consistent style that is well documented. Read the
WUSS 2012 conference paper Habits that Help: Developing Good Programming Style by Casey Cantrell to learn
how.
CONCLUSIONS
The takeaway message is that you will be a better programmer if you don’t just learn the language but take the time
to understand how SAS works, which includes understanding the program data vector and implied DO loop.
REFERENCES

Howard, Neil (2003), “How SAS Thinks or Why the DATA Step Does What It Does”, Proceedings of the 28th
Annual SAS Users Group International Conference

Whitlock, Ian (2006), “How to Think Through the SAS DATA Step”, Proceedings of the 31st Annual SAS Users
Group International Conference

Whitlock, Marianne (2007), “The Program Data Vector As an Aid to DATA Step Reasoning”, Proceedings of the
2007 North East SAS Users Group Conference
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Name: Helen Carey
E-mail: careyhi@gmail.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.