SAS® Programming Basics: Getting Your Data In and Understanding the DATA Step
Helen Carey, Carey Consulting, Kaneohe, HI

ABSTRACT
SAS data sets are used in the SAS analysis and reporting procedures. There are various ways to get your data into a SAS data set. In this presentation, we will concentrate on the DATA step and the Import/Export Wizard. The DATA step is one of the building blocks of SAS programming. Understanding the basic structure and components of the DATA step is fundamental to learning to create your own SAS data sets. You will learn how the DATA step works, which includes understanding the input buffer and program data vector, the structure of the SAS data set, and what happens at compile time and execution time. By understanding DATA step processing, you can debug your programs and interpret your results with confidence.

INTRODUCTION
SAS is a powerful information delivery system. Using SAS, you are able to analyze or process your data to solve your problem and generate reports. First, you need to get your data into a form that SAS can use, which is the data set.

SAS is organized into steps: DATA steps and PROC steps. Usually, the DATA step is used to create and manipulate data sets, while the PROC step analyzes the data or generates reports. The SAS data set is used by SAS procedures to analyze the data and produce a finished report.

The DATA step has an implicit loop that cycles through each record of the input data. There is an implicit OUTPUT statement at the end of the DATA step. During the compile phase of the DATA step, SAS creates the program data vector, a logical area in memory that holds the values during the current pass through the DATA step. Understanding this implied DO loop (which runs until the end of file) and the program data vector is important to becoming a better programmer. Throughout this paper, the program data vector will also be referred to as the PDV.

WHAT IS A SAS DATA SET?
If your data is not stored in a SAS data set or in a form accessible by SAS/ACCESS on your computer, you need to create a SAS data set by entering data, by reading raw data, or by accessing files created by other software. You can create a SAS data set from in-stream data (DATALINES, CARDS), a raw data file (INFILE), another SAS data set (SET, MERGE, UPDATE), and DBMS files (SAS/ACCESS, Oracle, import and export). PROC steps can also create SAS data sets with the OUT= option or the OUTPUT statement.

A SAS data set is a table of data values organized into variables and observations. The variables in a SAS data set are the columns of the table and the observations are the rows.

Figure 1. SAS Data Set

SAS data sets can be classified as a SAS data file (member type DATA) or as a SAS data view (member type VIEW). A data file contains the data and the data description. A data view contains the location of data that is stored elsewhere and either a description of the data or where to find that description. In most SAS programs, it doesn’t matter which type it is. Data files and data views can both be processed in DATA steps and PROC steps.

DATA SET PROCESSING
Typically, you use the DATA step to:
- put your data into a SAS data set
- create new variables
- manipulate the values of your variables to meet the needs of your analysis
- check your data for errors and correct data errors
- create new SAS data sets by subsetting, combining, and updating existing data sets.

DATA steps typically create or modify SAS data sets, but can also produce reports.

SAS processes the DATA step in two phases. It first compiles the program and then executes the machine code.
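The two phases can be seen in a small experiment. The following is a minimal sketch, not one of the paper's original examples; the data set name work.two_phase and the variables x and y are made up for illustration.

```sas
/* Hypothetical example: names are for illustration only */
data work.two_phase;
   x = 1;
   /* The condition is never true, so this assignment never executes. */
   /* The compiler still saw y, so y is created as a variable.        */
   if x = 2 then y = 5;
run;

/* Print the result to see which variables exist and what they hold */
proc print data=work.two_phase;
run;
```

The printed data set should contain one observation in which x is 1 and y is missing. The variable y exists because variables are set up when the step is compiled, while its value stays missing because the assignment is never reached during execution.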
When you submit a DATA step for execution, SAS first scans and checks the syntax, translates the statements to machine code, creates the input buffer if you have an INPUT or INFILE statement, and then sets up the program data vector (PDV) and the data set descriptor information.

After the compile phase, the DATA step executes. During execution, the DATA step loops by first reading values into the PDV, then executing statements that may change the values in the PDV, and finally writing the values in the PDV out as an observation in the new data set.

Execution
- Executes statement by statement until the end of the program
- Repeats execution for each observation in the input data until there are no more observations or records
- Writes the values in the PDV to the new data set

Figure 2. Remember the Data Step Flow

When a DATA step is executed, information is written to a log to explain how it was executed. Always check your log for any error messages, and check the number of observations read and the number written.

DECLARATIVE STATEMENTS
Declarative statements are compile-time only statements. They provide information to the program data vector and cannot be conditionally executed. The declarative statements are:
- drop, keep, rename
- label
- retain
- length
- format, informat
- attrib
- array
- by
- where

PROGRAM DATA VECTOR (PDV)
When the DATA step is compiled, a buffer (computer memory) is allocated for the temporary storage of the observation being built. It is called the PDV (program data vector). The PDV has space for one observation and all variables. Variable names are added to the PDV in the order that they are encountered in the program. The PDV contains the variables in the input SAS data set or input file and the variables created in the DATA step statements. Variable names can be mixed case. They are stored as written on first occurrence; later references can use a different case.
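One way to check the order in which variables entered the PDV is to list the data set's variables by position. This is a minimal sketch under assumed names (work.pdv_order, Zeta, Alpha, beta are all hypothetical); it relies on the VARNUM option of PROC CONTENTS, which lists variables in creation order instead of alphabetically.

```sas
/* Hypothetical example: names are for illustration only */
data work.pdv_order;
   length Zeta 8;          /* first variable the compiler sees           */
   Alpha = 1;              /* second variable                            */
   Zeta  = 2;              /* already in the PDV; its position is fixed  */
   beta  = ALPHA + zeta;   /* third variable; the different-case         */
                           /* references are to the same Alpha and Zeta  */
run;

/* VARNUM lists the variables in their position order */
proc contents data=work.pdv_order varnum;
run;
```

The listing should show Zeta, Alpha, and beta in that order, each spelled as it was first written in the program.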
SAS automatic variables are created when a DATA step executes. You can use these variables in your DATA step programming, but they are not stored in the created data sets. To save the value of an automatic variable, assign its value to a data set variable. The values of automatic variables are retained from one iteration of the DATA step to the next. Automatic variables include:
- _N_: the number of times the DATA step has iterated
- _ERROR_: indicates whether an error has occurred in the DATA step. The value is 0 if no errors occurred and 1 if one or more errors occurred. Errors include such things as division by zero or invalid input values.

The FIRST.variable and LAST.variable automatic variables are available when you are using a BY variable in a DATA step.

The DATA step has an implicit loop that cycles through each record of the input data. There is an implicit OUTPUT statement at the end of the DATA step. When the DATA step executes, the PDV contains the observation currently being processed. At the end of the step, the data in the PDV is written to the new data set. DROP, KEEP, and RENAME statements indicate which variables to drop, keep, or rename on the output data set.

The PUT statement can be a useful debugging tool. For example, the following statement writes the values of all variables that are defined in the current DATA step, including the automatic variables _ERROR_ and _N_:

   put _all_;

Sample Code:

   DATA Visit;
      LENGTH Test 3;
      LABEL Date='First Visit';
      INFORMAT Score Myfmt3.;
      FORMAT Date MMDDYY8.;
      INPUT Id $ 1-3 test 5-6 @8 score @11 date DATE7.;
      Avg=MEAN(Test,Score);
   RUN;

In this program, notice that the first variable encountered is TEST, then DATE. The last one is AVG.

Figure 3. Program Data Vector (PDV)

RETAIN STATEMENT
The order of the variables in the PDV is the order in which the variables become known to SAS, and it is also the order they will have in the SAS data set.
If you want to reorder the variables in the data set, you need to create a new SAS data set. One way is to list the variables in a RETAIN statement in the order that you want them. The RETAIN statement must be placed before the SET statement (or, when reading raw data as in the example below, before any other statement that names the variables). The values of the listed variables are retained from one iteration of the DATA step to the next.

   DATA Visit;
      RETAIN Id Avg;
      LENGTH Test 3;
      LABEL Date='First Visit';
      INFORMAT Score Myfmt3.;
      FORMAT Date MMDDYY8.;
      INPUT Id $ 1-3 test 5-6 @8 score @11 date DATE7.;
      Avg=MEAN(Test,Score);
   RUN;

Figure 4. Change Order of Variables With Retain

INPUT BUFFER
When the DATA step is compiled, an input buffer is allocated in memory for the temporary storage of a raw data line. It is created only if raw data are read. To print the contents of the input buffer, code:

   put _infile_;

DATA DESCRIPTOR
The descriptor portion is written to the data set and contains general information about the data set, including:
- the name of the data set
- the date and time the data set was created
- the number of observations and the number of variables
- the names and attributes of the variables

In addition to general information about the data set, the descriptor portion contains attribute information for each variable in the data set. This includes the variable's name, type, length, format, informat, and label. An informat (input format) tells SAS how to read raw data. SAS provides many informats for reading different kinds of data values. A format tells SAS how to write or group the data values.

SAS DATA VALUES AND DATES
There are only two types of variables in SAS: character and numeric. Numeric variables are stored with a length of 8 in the PDV. If you use a LENGTH statement for a numeric variable to save storage space, that length is applied when the observation is written to the data set.

SAS dates are stored as numbers. SAS represents a date internally as the number of days between January 1, 1960 and the specified date.
Therefore, if you were born before 1960, your date value is a negative number. When a variable is a SAS date value, you can add and subtract dates. To find the number of days between two dates, simply subtract the two SAS date variables.

   duration = date1 - date2;

You can compare dates.

   if date1 < date2 then do;
      /* statements to run when date1 is earlier */
   end;

There are many built-in functions and formats to work with dates.

STEPPING THROUGH THE DATA STEP
SAS sequentially reads each observation in the named data sets, one observation at a time, until there are no further observations to process. This is a sample program that we will use to step through the processing of a DATA step.

   data Trip;
      input Name $ When mmddyy10. COST;
      Total = cost * 1.05;
      cards;
   Bev 1/12/2012 10
   Phe 1/25/2012 40
   Phe 2/25/2012 80
   Bev 3/14/2012 50
   Phe 2/30/2012 20
   Phe 3/01/2012 30
   ;
   run;

We are reading the variable Name as a character value, indicated by the $, and When and COST as numeric values. The variable When is read using the informat mmddyy10. and will be stored as the number of days since January 1, 1960. Although storing a format with When would be a good idea, we are not doing so here, so that we can show it as a number.

These are the only executable statements in the sample program.

   input Name $ When mmddyy10. COST;
   Total = cost * 1.05;

The next figure, Figure 5, shows stepping through the DATA step one statement at a time. The line numbers are in the yellow column on the left of the representation of the program data vector (PDV) and will be used in the explanation of stepping through the DATA step. The Statement column shows whether we are executing the INPUT statement or the Total assignment statement in the program. So here we go.

Figure 5.
Stepping Through the Data Step

Line 1: The DATA step is in its first iteration, as shown by the automatic variable _N_. All values of variables in the INPUT statement and assignment statements are set to missing, represented by the period (.). _N_ is initialized to 1 and _ERROR_ to 0, because there are no errors at this point.

Line 2: Inputs the first line in the input file, that is, the first of the lines of data after the CARDS statement. The variables in the INPUT statement are Name, When, and Cost, so their values are placed in the PDV.

Line 3: The assignment statement calculates a value for Total, and that value is placed in the PDV.

Line 4: We have reached the end of the implied DO loop of the DATA step. Because there are no OUTPUT statements in the program, there is an implied OUTPUT statement at the end of the DATA step. Therefore the values in the PDV, except for the automatic variables _N_ and _ERROR_, are written to the SAS data set. Now there is one observation in the data set work.Trip.

Line 5: Next, there is a return to the top of the DATA step, and all the values for Name, When, Cost, and Total are initialized to missing. That means that Name, which is character, is set to blank (' ') and the numeric variables are set to missing, represented by the period. _N_ is incremented by 1. _ERROR_ is set to 0 because there is no error at this point in the processing.

Lines 6 to 8 follow the same process as Lines 2 to 4. The second record from the CARDS file is read and output to the data set.

Lines 9-12 process the third record.

Lines 13-16 process the fourth record.

Line 17 is the top of the DATA step and initializes the variables.

Line 18 is different from the processing above because the fifth input record contains an error. The record is:

   Phe 2/30/2012 20

February 30 is an invalid date for the mmddyy10. informat that reads the variable When.
Therefore, the value of When is set to missing and _ERROR_ is set to 1 to indicate an error.

Line 19: The Total value is calculated and put in the PDV.

Line 20: The PDV is written to the SAS data set. There are now 5 observations in the data set.

Lines 21-24 follow the same logic as for the previous observations.

Line 25: Variables are set to missing, _N_ is increased by 1, and _ERROR_ is set to 0.

Line 26: When the INPUT statement is processed, there are no more records. It is the end of the file, so SAS immediately leaves the DATA step.

This is the result: the SAS data set work.trip. Notice that the values for the variable When are 19004, 19017, etc., because they are stored as numbers.

Figure 6. SAS Data Set WORK.TRIP

Let’s use PROC REPORT to write out the data set so that the variable When is written twice, once as a number and once with the more understandable display format DATE10. Notice that the value of When for the fifth observation is missing because there was an invalid date in the CARDS file.

Sample code:

   proc report data=Trip nowindows;
      column Name When When=PDate Total;
      define PDate / display 'Purchased' format=date10.;
   run;

Figure 7. PROC REPORT Results

OTHER WAYS OF CREATING A SAS DATA SET

SAS ENTERPRISE GUIDE
I like how easy it is to import data from other sources using SAS Enterprise Guide. You do not need the SAS/ACCESS Interface to PC Files to import Microsoft Excel and Access files into SAS Enterprise Guide. However, if you do have the license, you can select an option to have Enterprise Guide use it to improve performance. Once a SAS data set is imported and permanently stored, you can always use it in your own SAS programs. It is worth checking out SAS Enterprise Guide.
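Using a permanently stored data set in your own program comes down to pointing a library reference at the folder that holds it. The following is a minimal sketch; the folder path, the libref mylib, and the member name survey are all hypothetical and need to be replaced with your own.

```sas
/* Hypothetical folder: point this at wherever the data set was saved */
libname mylib 'C:\myproject\data';

/* Hypothetical member name: a data set stored earlier as MYLIB.SURVEY */
/* OBS=5 limits the listing to the first five observations             */
proc print data=mylib.survey (obs=5);
run;
```

Any DATA step or PROC step can then refer to the data set by its two-level name, libref.member.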
IMPORTING DATA

Importing an Excel File into SAS
PROC IMPORT and the Import Wizard are very useful tools for converting flat or ASCII files and external data sources into SAS data sets. The Import Wizard presents a series of windows with simple choices to guide you through the process of importing or exporting data. The wizard is easy to use.

Here are the steps to import an Excel file if you have licensed SAS/ACCESS Interface to PC Files, which lets you import PC files such as Excel or Access. There are also other ways to create SAS data sets from Excel. By default, the variable names come from the Excel column headers, and the data values begin in the second row.

To start the wizard, first make sure the Excel file is closed so that you do not get a file sharing error. Open SAS, then click File > Import Data. This opens the dialog box where you select the data source type for your input file. The default type is Microsoft Excel; click Next to accept the default.

Figure 8. Import Wizard: Import Data

Figure 9. Import Wizard: Select Import Type

In the Connect to MS Excel dialog box, click Browse to locate the Excel file you want to import.

Figure 10. Import Wizard: Connect to MS Excel

Click OK. This opens the Select Table dialog box. Select the table you want to import. If you have multiple worksheets, click the pull-down menu for the list of worksheets.

SAS uses the first 8 observations to determine whether a column holds character or numeric values. In the case of mixed types, the type that appears most often in the first 8 observations is applied. Values that do not conform to the assigned type are converted to missing values.

Figure 11. Import Wizard: Select Table

Click Next to open the Select Library and Member dialog box. Enter the name of the library or click the pull-down menu to find it. Enter the name of the SAS data set in the Member box.

Figure 12.
Import Wizard: Select Library and Member

Click Next. Here you can save the PROC IMPORT statements for subsequent use. If you want to import multiple worksheets, saving the program file may save you time. You can edit the PROC IMPORT statements and replace the name of the worksheet in the RANGE= statement with another worksheet name. Click Finish.

Figure 13. Import Wizard: Create SAS Statements

GETTING INFORMATION ABOUT THE DATA SET
Along with data values, each SAS data set contains metadata, or data about the data. This information, recorded in the descriptor portion of the data set, includes the names and attributes of all the variables, the number of observations in the data set, and the date and time that the data set was created and updated.

PROC DATASETS
To view the descriptor portion, you can right-click the data set in the SAS Explorer window and select View Columns, or print it with PROC DATASETS. The DETAILS option lists the number of observations, the number of variables, and the label of the data set. This is also a way to find typos in the variable names. For example:

   proc datasets library=work details;
      contents data=trip;
   run;
   quit;

Figure 14. PROC DATASETS Results

Michael Raithel’s paper PROC DATASETS; The Swiss Army Knife of SAS has everything you want to know about PROC DATASETS, a powerful procedure. If you are just changing the attributes of the variables, such as their names, informats, and labels, then use PROC DATASETS to do the work for you, not the DATA step.

Use PROC DATASETS instead of the DATA step to concatenate SAS data sets. For example:

   proc datasets library=youthlib;
      append base=allyears data=year1997;
   run;
   quit;

VIEWTABLE WINDOW
The ViewTable window in a SAS session is an interactive way to view, enter, and edit data.
It is accessible from the SAS Explorer window by clicking on the data set or view, or by using the viewtable (abbreviated vt) command from the command box. The command box is below the menu bar. Once you open the ViewTable window, you can choose to view only specific columns by typing the columns command in the command box, such as columns 'memname name label'. This is the same as using Hide/Unhide on the Data menu of the ViewTable window.

Figure 15. Viewtable

Close the ViewTable window before submitting a program that re-creates a SAS data set. More than once, I have not closed the window, have not read the log, and wondered why the results did not change. This is an example of where it is important to read the log after every run. By reading the log, I would have found out that I could not re-create my data set because it was open in the ViewTable window and “The SAS System stopped processing this step because of errors.” Reading the log after every run is a good practice to follow.

DICTIONARY TABLES
A DICTIONARY table is a read-only SAS view that contains information about SAS libraries, SAS data sets, SAS macros, and external files that are in use or available in the current SAS session. Each DICTIONARY table has an associated PROC SQL view in the SASHELP library. You can see the entire contents of a DICTIONARY table by opening its SASHELP view in the ViewTable window. The names of these SAS views start with V, for example, VCOLUMN or VMEMBER. SASHELP.VCOLUMN was the view that we were using above in the ViewTable Window section.

Figure 16. Dictionary Tables

Here is an example of accessing SASHELP.VCOLUMN using PROC SQL.

   proc sql;
      select memname format=$8., varnum, name format=$15., label
         from sashelp.vcolumn
         where libname='SASHELP' and memname='ZIPCODE';
   quit;

Results
Figure 17.
SASHELP.VCOLUMN

SAVE DISK SPACE

DROP, KEEP AND RENAME VARIABLES
Save disk space and make your data sets easier to understand by inputting only the data you need. Drop variables you no longer need.

   DATA mylib.yearly(DROP=Rain1-Rain12);
      SET Old(DROP=Snow1-Snow12);
      Total = SUM(of Rain1-Rain12);
      /* more programming statements */
   RUN;

Drop DO loop indexing variables.

   data Mylib.NewCost (DROP=i);
      set Mylib.Cost;
      array Amt(100) Amt1-Amt100;
      do i=1 TO 100;
         Amt(i)=MAX(0,Amt(i));
      end;
   run;

STORING NUMERIC CATEGORICAL DATA
Store numeric categorical data in character variables to save space.

   length quest1-quest40 $ 1;

Suppose you had a 40-question survey with 500 respondents and categorical responses from 1 to 9. It would take 20,000 bytes to store the responses as 1-byte character variables (40 x 500 x 1 = 20,000), versus 160,000 bytes to store them as numeric variables of length 8 (40 x 500 x 8 = 160,000).

The default length of numeric variables in SAS data sets is 8 bytes. To save space, store integer numeric variables in a length less than 8, if you can, by using the LENGTH statement. For example, dummy variables that only take the value 0 or 1 are good candidates for a length of 3. Use the LENGTH statement only for variables whose values are always integers. Non-integer numbers lose precision if they are truncated.

   length Id s1-s5 4 Income 8 default=3;

Be careful when choosing the length, because the largest integer that can be represented varies from one host system to another. The largest integer that you can store with a length of 3 bytes is 65,536 under z/OS; on Unix and Windows it is 8,192. See the SAS Companion for your specific host system for more information.

RECOMMENDED READING

SAS ONLINE DOCUMENTATION
Visit Getting Started with SAS Software in SAS Help and Documentation. It is available from your windowing environment or online under Documentation at support.sas.com.

Figure 18.
SAS Help and Documentation

BOOKS
I wholeheartedly recommend the book Step-by-Step Programming with Base SAS Software. You can purchase it from SAS Press or download the PDF for free. Do a Google search for “Step-by-Step Programming with Base SAS Software PDF”. It is more than 788 pages, so you may want to just store it on your computer, Kindle, or tablet.

The Little SAS Book: A Primer, Fourth Edition by Lora Delwiche and Susan Slaughter is an easy-to-read classic. It is available from SAS Press and also from amazon.com as both a paperback and a Kindle edition. Check with the other programmers in your office. They may already have a copy.

Carpenter's Guide to Innovative SAS Techniques is a programming reference that includes advanced topics. It shows DATA step techniques that solve complex data problems. Check out Amazon for better prices and shipping rates and to view the table of contents and a free chapter.

SAS CONFERENCE PAPERS
One way to find SAS papers is to do a Google search. Another way is to visit Lex Jansen's website at lexjansen.com. This site searches 7360 SAS papers from SAS Global Forum, SUGI, PharmaSUG, NESUG, SESUG, PhUSE, WUSS, MWSUG, PNWSUG and SCSUG. Bound copies of old proceedings are being scanned and added to the collection of SAS papers.

Figure 19. SAS Conference Papers Web Site

Save a tree. Do not print out every SAS paper or article that you might want to read one day. They are easier to find if you use a bookmarking tool, Instapaper, or your own list to keep a record of the papers you might want to read later. You can download papers to store on your computer or e-reader. The iPad and e-readers can store and read PDFs and allow markup or bookmarks. With an e-reader you can even read in bed.

A SAS essential is learning to write clear, flexible code with a consistent style that is well documented.
Read the WUSS 2012 conference paper Habits that Help: Developing Good Programming Style by Casey Cantrell to learn how.

CONCLUSIONS
The takeaway message is that you will be a better programmer if you don’t just learn the language but take the time to understand how SAS works, which includes understanding the program data vector and the implied DO loop.

REFERENCES
Howard, Neil (2003), “How SAS Thinks or Why the DATA Step Does What It Does”, Proceedings of the 28th Annual SAS Users Group International Conference

Whitlock, Ian (2006), “How to Think Through the SAS DATA Step”, Proceedings of the 31st Annual SAS Users Group International Conference

Whitlock, Marianne (2007), “The Program Data Vector As an Aid to DATA Step Reasoning”, Proceedings of the 2007 North East SAS Users Group Conference

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Name: Helen Carey
E-mail: careyhi@gmail.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.