A. Schweder SAS Basics Workshop 03/28/03

SAS BASICS WORKSHOP

Amanda E. Schweder, Ph.D.
Department of Psychology
Yale University

StatLab: Main Classroom
140 Prospect Street
March 28th 2003
1:00-3:00pm

This workshop is designed to introduce new users to some of the basic concepts necessary for using the SAS System by covering the following topics:

1. The SAS window system
2. How to construct a SAS program, review the SAS log and SAS output
3. How to input and read raw data files and SAS data files
4. Difference between a DATA step and a PROC step
5. Difference between a temporary SAS dataset and a permanent SAS dataset
6. The LIBNAME statement and librefs
7. Simple ways to manipulate the data and the output
8. Simple procedures available for analysis
9. Importance of the semicolon in SAS

INTRODUCTION to the SAS SYSTEM

SAS (the Statistical Analysis System) grew out of a project begun in the late 1960s at North Carolina State University to analyze agricultural research data; SAS Institute was founded in 1976. SAS is essentially a series of computer programs used for data management, analysis, and presentation. It is also considered a fourth-generation programming language because it requires only the names of operations in order to perform them: the programs have already been written, and the code you write simply invokes them. Third-generation languages such as BASIC, FORTRAN, C, and Pascal require much more involved code to perform the same operations.

Generally, working in SAS entails writing a SAS program, submitting it, and reviewing the log and output files to help you understand your data and the results of your analyses.

Click on the SAS icon to start the SAS System. Three main windows appear. You can also use the pull-down menu "View" to open the different types of windows in the SAS environment.

1. ENHANCED EDITOR window (in v. 8.0 and above; provides color codes for syntax)
   SAS Program – <filename>.sas
2. LOG window
   SAS Log – <filename>.log
3. OUTPUT window
   SAS Output – <filename>.out

Consider each of these to be a file. Differentiate between them based on the extension (sas, log, out) after the dot.

The SAS environment looks similar to the Windows environment. There are pull-down menus and icon buttons for controlling some of the features of SAS. Much of SAS, however, requires writing code in the SAS programming language.

OVERVIEW for WRITING the SAS PROGRAM

General syntax rules when writing SAS code:

1. SAS statements can be written in upper or lower case (or mixed).
2. SAS statements can begin and end in any column, but a word may not be split between two lines.
3. More than one SAS statement can be entered per line.
4. Blank lines can be used and are recommended to aid readability.
5. SAS processes statements in steps, so make sure that the DATA steps and PROC steps are in the proper order.
6. Every SAS statement ends with a semicolon ;

Building a SAS Program:

1. Practice writing a program in the Editor window. Save the file with a name like TEST.SAS.

2. Format the output with some OPTIONS listed at the top of the program.

OPTIONS PS=66 LS=165 NOCENTER NOFMTERR;

PS= tells SAS to fit up to 66 lines of output per page; ranges from 15 to 32,767 lines.
LS= tells SAS to fit up to 165 columns of output per line; ranges from 64 to 256 columns.
NOCENTER tells SAS to print output flush left instead of the default centered.
NOFMTERR tells SAS to continue processing when it cannot find a format that has been assigned to a variable, instead of stopping with an error.

3. Document as much as possible directly in the program with comments. It helps you keep track of what you are doing in your program and why you are doing it. Comments can go anywhere in your program.
A SAS comment can start with an asterisk and end with a semicolon:

* PROGRAMMER: AES   DATE: 11/15/02   PROJECT: SAS BASICS WORKSHOP;

You can also use the following, which helps avoid semicolon mishaps:

/* LEARN HOW TO INPUT A RAW DATA FILE */

Within an asterisk-style comment, do not use semicolons and avoid using quotation marks (single & double).

4. Handling data and invoking procedures always occurs in one of 2 steps in SAS.

DATA step: builds a SAS data set (e.g., adds variables, merges datasets)
OR
PROC step: processes a SAS data set (e.g., produces means, frequencies)

WORKING with RAW DATA

There are a number of different ways to input and read raw data in SAS (i.e., the instructions given to SAS about the location and format of the variables).

1. Characteristics of a raw data file:
   A. Each row represents an observation, containing data values for one subject.
   B. Each column represents a variable across all subjects: e.g., sex, birth date, test scores
   C. Values assigned to variables can be:
      Numeric – includes only numbers
      Character – includes letters, sometimes letters and numbers (alphanumeric)
   D. The kind of values that are assigned to variables can influence the way in which SAS reads the data and performs certain analyses. It's important to gain familiarity with your raw data.

2. To create the raw data file, key in the lines of data using Word, Notepad, or any text editor and save the file as plain text named <filename>.dat or <filename>.txt. We will use a raw data file called testdata.txt that is saved in 'c:\temp\sasbasics'.

3. Avoid errors in keying the data.
   A. In the raw data file, data must be entered starting on Line 1.
   B. Leave no blank lines at the top or bottom of the file. (If data is missing for a subject, the line should be left blank: key in "pretend data" by using the space bar to fill the number of columns that would be there if the data were not missing.)
   C. Make sure variables are keyed into the correct column
   D. Right-justify numeric data
   E. Left-justify character data
   F. Use blank columns between variables to aid in readability

Two of the main ways to input a raw dataset in a SAS program are:

1. Using the INPUT and CARDS (or DATALINES) statements to include the actual raw data within the SAS program. (Note: Use when you have a small set of data.)
2. Using an INFILE statement to refer SAS to an external raw data file saved somewhere (e.g., floppy disk, hard drive, network). (Note: Better to use when you have a large set of data.)

I. INPUT and CARDS Examples:

Example 1. General Template:

DATA <data-set-name>;
INPUT <variable-name1> <variable-name2> <variable-name3>;
CARDS;
<keyed-in lines of data go here: each row an observation, each column a variable>
;
PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;
VAR <names-of-variables-to-be-processed>;
RUN;

Example 1. Sample Program:

* PROGRAMMER: AES   DATE: 11/15/02   PROJECT: SAS BASICS WORKSHOP;
/* LEARN HOW TO INPUT A RAW DATA FILE */
/* Example 1 using INPUT and CARDS commands */
DATA TEMP;
INPUT SUBJECT SATV SATM;
CARDS;
1 520 490
2 610 590
3 470 450
4 410 390
5 510 460
6 580 350
;
* COMMENT: Below is a PROC step, which allows you to manipulate and analyze your SAS data set. This produces means for SATV and SATM;
PROC MEANS DATA=TEMP;
VAR SATV SATM;
RUN;

Example 2. General Template:

DATA <data-set-name>;
INPUT #line-number
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   ;
CARDS;
<keyed-in lines of data go here: each row an observation, each column a variable>
;
PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;
VAR <names-of-variables-to-be-processed>;
RUN;

Example 2. Sample Program:

/* Example 2 using INPUT and CARDS commands */
DATA TEMP;
INPUT #1
   @ 1  (V1)     (1.)
   @ 2  (V2)     (1.)
   @ 3  (V3)     (1.)
   @ 4  (V4)     (1.)
   @ 5  (V5)     (1.)
   @ 6  (V6)     (1.)
   @ 7  (V7)     (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;
CARDS;
2234243 22  98 1
3424325 20 105 2
3242424 32  90 3
3242323 19 119 4
3232143 18 101 5
;
* COMMENT: Below is a PROC step, which allows you to manipulate and analyze your SAS data set. This produces means for V1, V2, AGE, and IQ;
PROC MEANS;
VAR V1 V2 AGE IQ;
RUN;

The parts of the example programs above are described below:

1. DATA statement:
   General form: DATA <data-set-name>;
   Data-set-name: TEMP

2. INPUT statement (as in Example 2):

INPUT #line-number
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   ;

3. Line number directions:
   #line-number   Tells SAS what line to start on to read each subject's data
   INPUT #1       In this example, it starts at line 1

4. Column location, variable name, and column width directions:
   @ column-number    # of the column at which each variable begins
   (variable-name)    name given to each variable
   (column-width.)    # of columns to be occupied by each variable

   Note: Column width must be followed by a period. The period marks the entry as an informat (width, period, then an optional number of decimal places) rather than a variable name. Also, above IQ was given 3 columns (even though some IQ values were only 2 digits). Don't forget the semicolon at the end of the INPUT statement!

   INPUT #1 @ 1 (V1) (1.) ;   At Line 1, Column 1, the variable is called V1 and is 1 column wide

5. CARDS (or DATALINES) statement:
   Right after the INPUT statement goes the CARDS statement to tell SAS that raw data follows. There must be a semicolon after the word CARDS and again after the raw data.

6. Data lines:
   Data lines are the values for each row/observation/subject. Again, leave no blank lines (otherwise SAS will think that a subject has missing data) and very carefully check the columns of the variables to make sure they are aligned correctly. Make sure you have a semicolon on the line right below your last line of data.

7. PROC and RUN Statements:
   PROC tells SAS to perform a given procedure or statistical analysis: e.g., CONTENTS, MEANS, TTEST, UNIVARIATE, FREQ, GLM, ANOVA, or PRINT. RUN tells SAS to execute the PROC.

8. General rules for data set names and variable names:
   A. Must begin with a letter or an underscore (not a number)
   B. May be no more than 32 characters long in SAS version 7 and later (8 characters in version 6)
   C. May contain no special characters such as "*" or "#"
   D. May contain no blank spaces
   E. Example data set names: MYDATA, survey2, Dissert, TEMP

II. INFILE Example:

General Template:

DATA <data-set-name>;
INFILE <directory-path-and-name-of-data-file>;
INPUT #line-number
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   @ column-number (variable-name) (column-width.)
   ;
PROC <name-of-desired-statistical-procedure> DATA=<data-set-name>;
VAR <names-of-variables-to-be-processed>;
RUN;

Example Program:

/* COMMENT: Example using INFILE command */
/* COMMENT: INFILE indicates the name of the data file in which the raw data exists and where it can be found (need to specify the directory path if the file is not located in the same place as the program). The INPUT statement indicates the structure of the data file (as referred to by the INFILE command). */
DATA TEMP;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1)     (1.)
   @ 2  (V2)     (1.)
   @ 3  (V3)     (1.)
   @ 4  (V4)     (1.)
   @ 5  (V5)     (1.)
   @ 6  (V6)     (1.)
   @ 7  (V7)     (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;
/* This will produce a mean for each variable in the data set because vars were not specified */
PROC MEANS DATA=TEMP;
RUN;

The code is identical to using CARDS, except that the INFILE statement is added and the CARDS statement and data lines are deleted. Instead of including the raw data in the program, the INFILE statement indicates where to find the raw data. The INPUT statement is still needed to tell SAS the structure of the raw data.
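One way to try the INFILE approach without first keying a file on disk is sketched below. It is only an illustration under stated assumptions: the fileref RAWTMP and the scratch file it points to are hypothetical and not part of the workshop materials, and the five data lines are copied from Example 2. The first step writes the scratch raw file, the second reads it back with the same column layout, and PROC PRINT lets you check that each value landed in the intended variable.

```sas
/* Sketch only: RAWTMP is a hypothetical fileref for a scratch file */
/* that SAS creates for this session.                               */
FILENAME rawtmp TEMP;

/* Write the Example 2 data lines to the scratch file, starting in column 1 */
DATA _NULL_;
  FILE rawtmp;
  PUT '2234243 22  98 1';
  PUT '3424325 20 105 2';
  PUT '3242424 32  90 3';
  PUT '3242323 19 119 4';
  PUT '3232143 18 101 5';
RUN;

/* Read the file back; (V1-V7) (1.) reads seven 1-column variables in a row */
DATA TEMP;
  INFILE rawtmp;
  INPUT #1
     @ 1  (V1-V7)  (1.)
     @ 9  (AGE)    (2.)
     @ 12 (IQ)     (3.)
     @ 16 (NUMBER) (1.)
     ;
RUN;

/* Print every observation to confirm the columns were read as intended */
PROC PRINT DATA=TEMP;
RUN;
```

If a value shows up under the wrong variable in the printout, the column numbers in the INPUT statement and the columns in the raw file do not agree.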
Additional tips for handling variables when inputting data:

Input a string of variables with the same prefix and different numeric suffixes. Think about the variables V1-V7 from above. The prefix (V) is the same, but the suffix is a different number. This is useful when you have a survey or questionnaire with many items. If you have multiple surveys, the prefix could be some abbreviated form of the name of the particular survey.

INPUT #1 @1 (V1-V7) (1.) @9 (AGE) (2.) @12 (IQ) (3.) ;

Writing (V1-V7) saves lines of code because the seven variables are read as one string of variables.

Inputting character variables requires that you indicate in the INPUT statement that it is a character variable. The use of a $ before the number of columns required tells SAS that it's a character variable. For example, if we added a variable called SEX, it could be inputted with values of M or F instead of values of 1 or 2.

INPUT #1 @1 (V1-V7) (1.) @9 (AGE) (2.) …. @18 (SEX) ($1.) ;

The $ is included to indicate character values for SEX.

Sometimes multiple lines of data are needed for each subject.

INPUT #1
   @ 1  (V1-V7)  (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   @ 18 (SEX)    ($1.)
   #2
   @ 1  (SATV)   (3.)
   @ 5  (SATM)   (3.)
   ;

Raw data for this input statement would look like this for 3 subjects (two lines per subject):

2234243 22  98 1 M
520 490
3424325 20 105 2 M
       
3242424 32  90 3 F
390 420

Subject 1 has data for SATV and SATM. Subject 2 is missing data for SATV and SATM, so that subject's second line is keyed as blanks. If data is missing for an observation, leave the space there as if it were present so SAS doesn't misalign the rows.

Create decimal places on input for numeric variables so you don't have to key in the decimal point. If you had a variable called GPA, key it in without the decimals:

Actual value    Keyed value
3.56            356
2.20            220

INPUT #1 @ 1 (GPA) (3.2) ;

This tells SAS to read 3 columns and to treat the last 2 digits as decimal places.

CARDS;
356
220
;

Inputting "check all that apply" questions as multiple variables: Treat single questions with multiple parts to them as a set of questions.
For each question there can be a value of either 0 (not checked) or 1 (checked), making each question a dichotomous variable using dummy coding.

WORKING with TEMPORARY and PERMANENT DATASETS

The DATA statement tells SAS to build a SAS data set.

1. Building a Temporary SAS Data Set

The syntax for building a temporary SAS data set is:

DATA <data-set-name> ;
INFILE 'drive:\path\filename.dat' ;
INPUT variable information ;

Here, the DATA statement refers to the data-set-name as the name of a temporary SAS data set. TEMP was used in the previous programs as a data set name.

Example Program:

DATA TEMP;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1)     (1.)
   @ 2  (V2)     (1.)
   @ 3  (V3)     (1.)
   @ 4  (V4)     (1.)
   @ 5  (V5)     (1.)
   @ 6  (V6)     (1.)
   @ 7  (V7)     (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;

This code will not create a physical SAS dataset called testdata. Instead, the code reads the physical raw data file called testdata.txt and creates a temporary dataset called TEMP, which SAS keeps in its WORK library only for the current SAS session. When the session ends, SAS deletes TEMP. A permanent data set, however, is kept after the session ends.

2. Building a Permanent SAS Data Set

The syntax that creates a permanent SAS data set is:

LIBNAME libref 'drive:\path';
DATA <libref.filename>;     /* two-level name */

The LIBNAME statement defines a libref, or a nickname, for the drive and the directory path in which to save or to find the permanent SAS data set. A libref is 1-8 characters long, contains no spaces, and can start with an "_" or a letter, but not a number (i.e., any valid SAS name of up to 8 characters). It works by giving a nickname to the 'drive:\path' (the path must be in quotes) that stays in effect until the SAS session ends or the libref is reassigned or cleared. Define all librefs at the beginning of a SAS program to document where permanent SAS data sets are saved (or used) by the SAS program.
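The libref idea described above can be sketched as follows. This is an illustration rather than part of the workshop files: it assumes the folder c:\temp\sasbasics already exists, and it uses the LIST and CLEAR forms of the LIBNAME statement, which the workshop itself does not cover, to show what the nickname points to and how to remove it.

```sas
/* Assign the nickname SASDATA to the workshop folder (the folder must already exist) */
LIBNAME sasdata 'c:\temp\sasbasics';

/* Write the libref's details (engine and physical path) to the SAS log */
LIBNAME sasdata LIST;

/* Remove the nickname; the folder and any files in it are left untouched */
LIBNAME sasdata CLEAR;
```

Checking the log after the LIST form is a quick way to confirm that a libref points where you expect before saving a permanent data set there.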
The DATA step tells SAS to create a permanent SAS data set by using a two-level name, i.e., <libref.filename>. The 1st level of the name is the libref, the nickname previously defined in the LIBNAME statement to represent the 'drive:\path' where the permanent SAS data set is stored. The libref is followed by a period. The 2nd level of the name is the filename of the permanent SAS data set stored in the libref. SAS automatically appends an extension to permanent SAS data sets: .SD2 under SAS version 6 for Windows, or .sas7bdat under the default engine in version 7 and later. The examples below refer to the file by its .sd2 name.

Example Program:

/* COMMENT: Example of saving a permanent SAS dataset from the raw dataset */
LIBNAME sasdata 'c:\temp\sasbasics' ;
* Step below creates a permanent SAS dataset called testdata.sd2 ;
DATA sasdata.testdata ;
* Step below uses the raw dataset to create testdata.sd2 ;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1)     (1.)
   @ 2  (V2)     (1.)
   @ 3  (V3)     (1.)
   @ 4  (V4)     (1.)
   @ 5  (V5)     (1.)
   @ 6  (V6)     (1.)
   @ 7  (V7)     (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;
PROC CONTENTS DATA=sasdata.testdata ;
RUN;

The LIBNAME statement above uses sasdata as the libref to refer to 'c:\temp\sasbasics'. The DATA step (using the two-level name) tells SAS to create a permanent SAS data set called testdata and to save it in sasdata (a.k.a. c:\temp\sasbasics). The INFILE statement tells SAS where the raw data file is located so that it can create testdata.sd2, the SAS dataset. After this program is run, check that the permanent SAS data set called testdata.sd2 exists in c:\temp\sasbasics. Also, check the output to see the contents of testdata.sd2.

3. Processing a Temporary SAS Data Set

Now that we have created a permanent SAS dataset, we can copy it into a temporary dataset and process that copy. This is helpful when you are testing out some code and don't necessarily want to save the changes you are making. Note that we no longer need to use the INFILE statement to indicate where to find the file; instead, we use a SET statement.
LIBNAME <libref> 'drive:\path';
DATA <data-set-name> ;      /* a temporary SAS dataset used as the working file for the code that follows */
SET <libref.filename> ;     /* the existing dataset to copy from */

A permanent dataset (in some cases it can be a temporary SAS dataset) must be named in the SET statement so that a temporary data set can be created from it.

Example Program:

/* COMMENT: Setting a permanent SAS dataset to process temporarily */
LIBNAME sasdata 'c:\temp\sasbasics' ;
* Step below creates a temporary SAS dataset called TEMP ;
DATA TEMP ;
SET sasdata.testdata ;
PROC CONTENTS DATA=TEMP ;
RUN;

Note that no physical SAS dataset file called temp.sd2 is saved in c:\temp\sasbasics. In the output, the contents will indicate that this data set is called TEMP.

4. Processing a Permanent SAS Data Set

One way to process a permanent SAS data set with a procedure is illustrated in this syntax:

LIBNAME libref 'drive:\path';
PROC <name-of-statistical-procedure> DATA = <libref.filename> ;     /* two-level name */

Note that the INPUT and INFILE statements are not needed now. The LIBNAME statement defines the libref so that it refers to the 'drive:\path' where the permanent SAS data set is stored. The PROC statement tells SAS to perform a procedure on the SAS data set. Follow PROC with the name of the procedure you want SAS to perform (e.g., MEANS, PRINT). After the procedure name, but before the semicolon, comes the DATA= option, which gives the libref (telling SAS the directory where the permanent SAS data set is stored), followed by a period and the filename of the permanent SAS data set.
Example Program:

/* COMMENT: Setting a permanent SAS dataset on a PROC step */
LIBNAME sasdata 'c:\temp\sasbasics' ;
* Step below prints the data out for the permanent dataset called testdata;
PROC PRINT DATA=sasdata.testdata ;
RUN;

The PROC statement tells SAS to perform the PRINT procedure on the permanent SAS data set testdata.sd2 stored in c:\temp\sasbasics (as referred to by "sasdata", the libref we created using the LIBNAME statement).

Another way to work with permanent data sets is to SET an existing permanent SAS data set in order to make a new permanent data set with a different name and with changes to the data.

Example Program:

/* COMMENT: Creating a new permanent SAS dataset by setting a permanent SAS dataset */
LIBNAME sasdata 'c:\temp\sasbasics' ;
* Step below saves a new data set called testdat2.sd2 that is identical to the data set called testdata.sd2 but with a new variable called sex;
DATA sasdata.testdat2 ;
SET sasdata.testdata ;
* Create a variable called sex based on ID number ;
if number in (1,2,3) then sex = 1;
if number in (4,5) then sex = 0;
PROC PRINT DATA=sasdata.testdat2;
RUN;

The DATA statement tells SAS to save a new permanent SAS data set called testdat2.sd2 in 'c:\temp\sasbasics', and the SET statement tells SAS to build it from the data set called testdata.sd2. Check in 'c:\temp\sasbasics' to make sure that it was created. Also, check the output to see that the new variable is included in the dataset.

Note that an INFILE statement tells SAS what raw data file to use, whereas a SET statement tells SAS what existing SAS data set (temporary or permanent) to use.

WAYS to MANIPULATE the DATA

Data manipulation transforms the data set in some way, e.g., by adding new variables or changing existing variables.
Data manipulation code can go on a DATA step, usually in one of two places:

1) Immediately after the INPUT statement (whether you use CARDS or INFILE)

Example Program:

DATA TEMP;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1-V7)  (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;
/* data-manipulation & data-subsetting statements go here */
if number in (1,2,3) then sex = 1;
if number in (4,5) then sex = 0;
PROC PRINT DATA = TEMP;
RUN;

2) Immediately after the creation of a new data set:

Example Program:

DATA TEMP;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1-V7)  (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;
DATA TEMP2;     /* name of new data set to create */
SET TEMP;       /* name of existing data set */
/* data-manipulation & data-subsetting statements go here */
if number in (1,2,3) then sex = 1;
if number in (4,5) then sex = 0;
PROC PRINT DATA = TEMP;      /* the variable SEX will not be in this dataset */
RUN;
PROC PRINT DATA = TEMP2;     /* the variable SEX will be in this dataset */
RUN;

Ways to manipulate the data can include creating variables in a DATA step with an assignment statement (see syntax below). Variables can be created or recoded in a DATA step, but not in a PROC step.

1. Create duplicate variables with new variable names:

General syntax:
<new-variable-name> = <existing-variable-name> ;

Examples:
V1 = BDI1;
GENDER = SEX;

2. Duplicating variables vs. renaming variables:

In the previous examples, the variables were not renamed; instead, duplicate variables were created with new names. Both original and duplicate variables remain in the data set. There is also a RENAME statement to permanently rename variables without duplicating them.

3. Create new variables from existing variables:

Use these symbols in SAS to perform operations on variables: ( +, - , * , / , = )
Use parentheses and follow rules for order of operations.
Use SAS functions such as SUM, MEAN, or ROUND in an assignment statement.
Always check created variables to verify that they were created correctly.

General syntax:
<new-variable-name> = <formula-including-existing-variable-name> ;

Examples:
VTOTAL = V1 + V2 + V3 + V4 ;       /* VTOTAL is missing for any observation with a missing value in V1-V4 */
VTOTAL = SUM(V1,V2,V3,V4) ;        /* SAS ignores missing values and sums the values that are present */

Summing variables V1 through V4 creates a new variable called VTOTAL.

4. Recode variables to have a different value:

SAS can overwrite existing variables or create a new variable to store recoded values. Variable values can be recoded upon INPUT or recoded after they are saved in a SAS data set. SAS can recode variable values or ranges into user-specified values with IF-THEN statements. A numeric variable cannot hold a character value, so a numeric code that is recoded to a letter must be stored in a new character variable.

Example:
IF SEX = 1 THEN GENDER = 'M' ;

5. Recode reversed variables:

Sometimes questionnaires have reversed items: a question is stated so that its meaning is the opposite of the meaning of the other items on the questionnaire. In general, perform the reversal before other data manipulations are performed on those items. It is good practice to store recoded variable values as a new variable and leave the existing variable intact.

<new-variable> = <constant - existing-variable> ;

For response options numbered from 1, the constant is equal to the number of response options plus 1.

V1R = 6 - V1;     (in the case of 5 response options)

SUBSETTING DATA

Data subsetting eliminates unwanted observations from a sample so that only a specified subgroup is in the data set. For example, you may want to look only at males and not females, or at a particular age range. Use what is called a subsetting IF statement to perform analyses on only a subset of the observations in the data set.

General syntax:
DATA <new-data-set-name> ;
SET <existing-data-set-name> ;
IF statement;

Example: To obtain the mean for each variable only for ages greater than 20 in the data set:

DATA TEMP2;
SET TEMP;
IF AGE > 20 ;
PROC MEANS DATA = TEMP2;     /* displays means only for those subjects older than 20 */
RUN;

LABELS for VARIABLES

Use the LABEL statement to associate a label with any or all of the variables. Many SAS procedures print a variable name followed by its label to help document what is in the output.

General syntax:
LABEL var1 = 'label for var1'
      var2 = 'label for var2'
      …
      var[n] = 'label for var[n]' ;

The label can be up to 40 characters (including blanks) in SAS version 6, or up to 256 characters in version 7 and later.

The LABEL statement tells SAS to associate the label "label for var1" with the variable var1, the label "label for var2" with variable var2, and so on.

Use the LABEL statement within a DATA step to associate the label(s) permanently with the variable(s). These labels will be used in subsequent PROCs.

Use the LABEL statement within a PROC step to associate the label(s) temporarily with the variable(s). Labels associated with variables in a PROC step will be used in that PROC only.

FORMATS for VARIABLES

A format is a set of instructions that tells SAS how to print variable values in the output. A format can be associated with one or more variables temporarily in a PROC step or permanently in a DATA step.

You need to provide a place for SAS to keep the format library that you create. You use the LIBNAME statement to do this, and it must come before the PROC FORMAT step that refers to the libref. The libref LIBRARY is always used to refer to the format library. SAS will create a separate file for the format library (extension .SC2 under SAS version 6 for Windows, .sas7bcat under version 7 and later). This file must be available whenever you use a SAS data set that has those formats permanently assigned, or else you will encounter errors.

1. To associate a format temporarily, use the FORMAT statement on a PROC step.

Example:

LIBNAME sasdata 'c:\temp\sasbasics' ;
LIBNAME library 'c:\temp\sasbasics' ;
PROC FORMAT LIBRARY=LIBRARY;
VALUE $sex 'M' = 'Male'
           'F' = 'Female' ;
VALUE affinity 1 = 'not at all'
               2 = 'a little'
               3 = 'in the middle'
               4 = 'a lot'
               5 = 'I LOVE IT' ;
PROC MEANS DATA=sasdata.testdata;
VAR v1 v2 v3 v4 v5 v6 v7 ;
FORMAT v1-v7 affinity. ;
RUN;

2. To associate a format permanently, use the FORMAT statement on a DATA step.

Example:

LIBNAME sasdata 'c:\temp\sasbasics' ;
LIBNAME library 'c:\temp\sasbasics' ;
PROC FORMAT LIBRARY=LIBRARY;
VALUE $sex 'M' = 'Male'
           'F' = 'Female' ;
VALUE affinity 1 = 'not at all'
               2 = 'a little'
               3 = 'in the middle'
               4 = 'a lot'
               5 = 'I LOVE IT' ;
DATA TEMP;
SET sasdata.testdata;
FORMAT v1-v7 affinity. ;
PROC MEANS DATA=TEMP;
RUN;

PROCEDURES

1. Examining the variables in a SAS data set

To print descriptor information about a SAS data set, use PROC CONTENTS.

General syntax:
PROC CONTENTS DATA = <libref.filename> or <filename>;

For example, PROC CONTENTS DATA = TEMP; tells SAS to run the CONTENTS procedure on the temporary SAS data set called TEMP. PROC CONTENTS will list the name, type (numeric or character), length in bytes, and ordinal position in the SAS data set for each variable, in alphabetical order.

General syntax with options:
PROC CONTENTS DATA = <libref.filename> or <filename> POSITION;

You can use statement options to change the defaults for PROC CONTENTS:
POSITION – will list variables in the order of their position in the SAS data set
SHORT – will print only a list of the variable names in the SAS data set

2. Examining the values in a SAS data set

To print the actual data (the actual observations) in a SAS data set, use PROC PRINT.

General syntax:
PROC PRINT DATA = <libref.filename> or <filename>;

For example, PROC PRINT DATA = TEMP; tells SAS to run the PRINT procedure on the temporary SAS data set TEMP. PRINT numbers each observation and lists variable values in columns under the variable name.

General syntax with options:
PROC PRINT DATA = <libref.filename> or <filename> DOUBLE NOOBS ;

You can use statement options to change the defaults for PROC PRINT:
DOUBLE – double-spaces output
NOOBS – suppresses printing of the observation number
UNIFORM – formats all pages uniformly (by default, SAS fits as much per page as possible)

3. Producing frequency tables and crosstabulations

To produce frequency tables and/or crosstabulations and any relevant statistics, use PROC FREQ.

General syntax:
PROC FREQ DATA = <libref.filename> or <filename>;
TABLES var   var * var   var * var * var / options ;

var = simple (one-way) frequency table
var * var = crosstabulation (two-way table) where values of the variable before the asterisk (*) will occupy the rows of the table and the values of the variable after the asterisk will occupy the columns of the table (row * column).
var * var * var = crosstabulations of the second variable by the third variable for each level of the first (control) variable (control * row * column).

The slash (/) introduces options that request additional statistics for the tables (e.g., / CHISQ ; )

4. Producing univariate descriptive statistics

To calculate univariate descriptive statistics (e.g., mean, standard deviation, maximum, minimum, median, percentiles) for one or more numeric variables, use PROC UNIVARIATE.

General syntax:
PROC UNIVARIATE DATA = <libref.filename> or <filename>;
VAR var1 var2 … var[n] ;

PROC UNIVARIATE can provide additional detail on the distribution of a variable, including plots, frequency tables, and a test to determine whether the data are normally distributed. Add the PLOT, FREQ, and/or NORMAL option to the PROC UNIVARIATE statement to include this information in the output.

General syntax with options:
PROC UNIVARIATE DATA = <libref.filename> or <filename> PLOT FREQ NORMAL ;
VAR var1 var2 … var[n] ;

PROC UNIVARIATE will print a separate page of output for each variable. It is useful for examining percentiles and outliers. Use PROC MEANS to print univariate descriptive statistics for more than one variable on the same page.

Note that there are many, many more procedures that SAS uses to perform analyses.

TITLES

Document your output with the use of titles. Titles can be used anywhere in the program.
General syntax:
TITLE '<Insert your title here: This is a title to be printed on line 1 of each page of output>' ;

Note that SAS processes a program in steps. A step begins with either a DATA or a PROC statement. A step ends with another DATA or PROC statement (or the end of the program). All TITLEs encountered from the beginning of the step until the beginning of the next step are used for the current step. Use optional RUN; statements to end a step at a specific point. Suppress a TITLE by writing the TITLE; statement with no text following it.

PUTTING A PROGRAM TOGETHER

/* PUT A PROGRAM TOGETHER */
OPTIONS PS = 66 LS = 165 NOCENTER NOFMTERR;

/* Define the librefs first: PROC FORMAT LIBRARY=LIBRARY needs the libref LIBRARY to exist. */
LIBNAME sasdata 'c:\temp\sasbasics' ;
LIBNAME library 'c:\temp\sasbasics' ;

/* Assign formats to the variables. Numbers generally don't require formats unless you categorize them. This step just lays out the formats, but does not permanently assign them. */
PROC FORMAT LIBRARY=LIBRARY;
VALUE $sex 'M' = 'Male'
           'F' = 'Female' ;
VALUE affinity 1 = 'not at all'
               2 = 'a little'
               3 = 'in the middle'
               4 = 'a lot'
               5 = 'I LOVE IT' ;

DATA sasdata.testdata ;
INFILE 'c:\temp\sasbasics\testdata.txt';
INPUT #1
   @ 1  (V1)     (1.)
   @ 2  (V2)     (1.)
   @ 3  (V3)     (1.)
   @ 4  (V4)     (1.)
   @ 5  (V5)     (1.)
   @ 6  (V6)     (1.)
   @ 7  (V7)     (1.)
   @ 9  (AGE)    (2.)
   @ 12 (IQ)     (3.)
   @ 16 (NUMBER) (1.)
   ;

/* Create a permanent data set called TESTDAT2. We need a new data set because we are about to change the data by adding labels and formats to the variables and creating new variables. We need to SET the data set we want to work from (called TESTDATA) in order to create the new version (called TESTDAT2). */
DATA sasdata.testdat2 ;
SET sasdata.testdata ;

/* Create some new variables */
if number in (1,2,3) then sex = 'M';
if number in (4,5) then sex = 'F';
GENDER = SEX;
VTOTAL = V1 + V2 + V3 + V4;

/* Assign labels to the variables. */
LABEL V1 = 'Variable 1'
      V2 = 'Variable 2'
      V3 = 'Variable 3'
      V4 = 'Variable 4'
      V5 = 'Variable 5'
      V6 = 'Variable 6'
      V7 = 'Variable 7'
      age = 'Age of Subject'
      IQ = 'IQ of Subject'
      number = 'ID Number'
      gender = 'Gender of Subject'
      vtotal = 'Total sum of V1-V4' ;

/* Permanently assign the formats to the variables. V1-V7 use the same format. */
FORMAT gender $sex. V1-V7 affinity. ;

/* When you want SAS to use the data set that you most recently created, you do not need to identify it in the PROC statement. SAS defaults to the last data set created - in this case, it is TESTDAT2. */

/* Print the variables in the data set for each person. */
PROC PRINT DOUBLE;
TITLE 'Print of data in TESTDAT2.SD2';
RUN;

/* Produce means of the variables in the data set */
PROC MEANS;
TITLE 'Means of numeric variables in TESTDAT2.SD2';
RUN;

/* Correlate age and IQ */
PROC CORR;
VAR AGE IQ;
TITLE 'Correlation b/t age and IQ';
RUN;

AFTER RUNNING YOUR SAS PROGRAM

Always check the log file that is produced when you run a SAS program. Check the number of observations read. The log will indicate if there are any errors in the program that must be fixed. The log also provides notes about what SAS did with your program.

When an error is found, return to the program and, starting from the beginning of the program, edit one thing at a time and re-run the program. This helps isolate where the problem is located, because the log doesn't always specify exactly where the problem occurred.

MISCELLANEOUS NOTES

An excellent collection of searchable SAS resources: http://www.ats.ucla.edu/stat/sas/

SAS can be fairly abstract, but it is also very powerful. SAS is great for large data sets with hundreds or thousands of observations and variables.

SAS relies heavily on programming code as opposed to using icons and pull-down menus to execute commands.

SAS is a very logical language and is useful for planning out the steps necessary to do complicated data work.
Also, note that certain statements must go before other statements.

One of the hardest concepts to grasp is the distinction between a temporary data set and a permanent data set.

Know your data well. Know what kind of file you will be working from. Think about whether you need to build a data file from scratch or utilize an existing data file.

There are MANY ways to accomplish the same goal in SAS. Go with what feels most comfortable.

You can always look up how to do things in SAS if you can't remember!
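Because the temporary versus permanent distinction is singled out above as the hardest concept, a side-by-side sketch may help. It is only an illustration under stated assumptions: the folder c:\temp\sasbasics is assumed to exist, the data set name SCORES is hypothetical, and the three data lines are borrowed from Example 1.

```sas
/* Assumes the folder c:\temp\sasbasics exists; SCORES is a hypothetical name */
LIBNAME sasdata 'c:\temp\sasbasics';

/* One-level name: a temporary data set, stored in the WORK library */
DATA scores;
  INPUT SUBJECT SATV SATM;
  CARDS;
1 520 490
2 610 590
3 470 450
;
RUN;

/* Two-level name: a permanent copy saved in the folder behind SASDATA */
DATA sasdata.scores;
  SET scores;          /* same as SET work.scores; */
RUN;

/* Describe each version; compare the library names shown in the output */
PROC CONTENTS DATA=work.scores;
RUN;
PROC CONTENTS DATA=sasdata.scores;
RUN;
```

After closing and restarting SAS, only the two-level version can still be found (once the LIBNAME statement is run again); the one-level version in WORK is gone.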