DATA STEP by DATA STEP you’ll go far

Aaron J. Rabushka
Statistical Programmer
INC Research, Inc.
Austin, TX

Note that the code examples in this presentation were developed and run under SAS V9.2 (TS2M2), and that the coloring in the displays of code and results comes from the author without necessarily representing actual SAS displays.

A FEW BASICS

The SAS system offers a halfway house between canned routines and procedural programming. Its pre-programmed procedures save a lot of work and time since programmers do not have to re-code standardized and routinized procedures and utilities every time they use them.

Most SAS code goes into STEPs, either DATA STEPs or PROC (PROCedure) STEPs. OPEN CODE refers to instructions not associated with either of these (e.g., OPTIONS statements). Some DATA STEP features will seem very familiar to procedural programmers, and some will seem annoyingly foreign.

Every SAS DATA STEP begins with the word DATA. Note that in this instance it is not followed by an equal sign as DATA= references an already existing data set during the course of a PROC statement.

SAS has two data types, NUMERIC and CHARACTER. SAS users derive all of their variables from these two types. SAS does not have special types for LOGICAL or DATE fields.

If the programmer does not name a dataset in the DATA statement the system will name it as DATA with a sequence number appended.

data;
x = 1;
output;
run;

data;
y = 10;
output;
run;

NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds

NOTE: The data set WORK.DATA2 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Leaving the dataset unnamed is not recommended, as it can result in world-class confusion.

SAS dataset names officially have two parts, a library name and a data set name. A period separates the two. If a programmer does not specify a library name for a dataset, the SAS system will attach WORK. to the dataset name assigned. The programmer does not need to articulate WORK. in the code.

data demonstration;
x = 1;
output;
run;

NOTE: The data set WORK.DEMONSTRATION has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

WORK. files disappear when the SAS session ends.

SAS datasets that need to be saved, or that have been saved into libraries from previous SAS sessions, need to have both their dataset names and their library names articulated every time the program references them. The programmer must declare library names with LIBNAME before using them in this way.

* NOTE THAT LIBRARY DEFINITIONS ARE OPERATING-SYSTEM SPECIFIC;
libname ajrdata "h:\";

data ajrdata.demonstration;
x = 1;
output;
run;

NOTE: Libref AJRDATA was successfully assigned as follows:
      Engine:        V9
      Physical Name: h:\
NOTE: The data set AJRDATA.DEMONSTRATION has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.21 seconds
      cpu time            0.00 seconds

Dataset names can have at most 32 characters and must start with a letter or underscore.

data this_is_an_example_of_a_dataset_name_that_is_too_long;
x = 1;
output;
run;

data this_is_an_example_of_a_dataset_name_that_is_too_long;
     ---------------------------------------------------
     307
ERROR 307-185: The data set name cannot have more than 32 characters.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

data 123_this_will_not_work;
x = 1;
output;
run;

data 123_this_will_not_work;
     --
     22
     200
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string, /, ;, _DATA_, _LAST_, _NULL_.
ERROR 200-322: The symbol is not recognized and will be ignored.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK._THIS_WILL_NOT_WORK may be incomplete. When this step was stopped there were 0 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

If a programmer uses the name of a dataset that already exists then SAS will simply write the new dataset over the old one of that name, without warning.

data one_num;
x = 1;
output;
run;

data one_num;
y = 10;
output;
run;

proc print data=one_num;
title1 "one_num";
title2 "note that this contains the data";
title3 "from the second DATA ONE_NUM step";
run;

NOTE: The data set WORK.ONE_NUM has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

NOTE: The data set WORK.ONE_NUM has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

NOTE: There were 1 observations read from the data set WORK.ONE_NUM.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

one_num
note that this contains the data
from the second DATA ONE_NUM step

Obs     y
  1    10

A DATA statement can create a single data set or multiple datasets:

DATA SUBJECTS;
DATA MEN WOMEN;

GETTING DATA INTO SAS DATASETS

SAS users usually refer to records in datasets as observations. SAS DATA STEPs operate as implied loops which iterate as necessary to handle the data involved.

A programmer can assign data values directly through assignment statements.

data assignments;
length country $ 12;
subject = 25;
country = "PARAGUAY";
run;

NOTE: The data set WORK.ASSIGNMENTS has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

assignments

Obs    country     subject
  1    PARAGUAY         25

A programmer can assign data values by including a DATALINES or CARDS section in a DATA STEP. Note that SAS accepts these two interchangeably even when no actual cards are involved.

data free_form;
input age sex $;
datalines;
54 MALE
35 MALE
40 FEMALE
29 FEMALE
;;;;

proc print data=free_form;
title1 "data free_form";
run;

NOTE: The data set WORK.FREE_FORM has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.14 seconds
      cpu time            0.03 seconds

NOTE: There were 4 observations read from the data set WORK.FREE_FORM.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.17 seconds
      cpu time            0.04 seconds

data free_form

Obs    age    sex
  1     54    MALE
  2     35    MALE
  3     40    FEMALE
  4     29    FEMALE

GETTING DATA INTO SAS DATASETS FROM EXTERNAL FLAT OR DELIMITED FILES

INFILE statements locate and describe external source files. INPUT statements direct SAS to read and incorporate the data from these external source files.

The way an INFILE statement identifies its file varies by operating system. EXAMPLES:

WINDOWS:

data test;
infile 'c:\work\space\sasajr\test.dat';

UNIX:

data test;
infile 'users/sasajr/test.dat';

MAINFRAME:

//FILEIN DD DSN=YAHUPITZ.AJRDATA,DISP=SHR
. . .
DATA TEST;
INFILE FILEIN;

A FILENAME statement in open SAS code can also refer to an external file. It, too, is operating-system-specific. Example from Windows:

FILENAME TESTDATA 'c:\work\space\sasajr\test.dat';
. . .
DATA TEST;
INFILE TESTDATA;

INFILE statements can also describe a file as delimited with the DLM option, which identifies the delimiter used in the file in question. EXAMPLE FOR A COMMA-DELIMITED FILE:

infile 'users/sasajr/test.dat' dlm = ',';

This is useful in turning .CSV files from Excel spreadsheets into SAS datasets.
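
As a rough, self-contained sketch of the DLM option, the step below reads comma-delimited records from in-stream data instead of an external file. INFILE DATALINES stands in for the external file reference, and the dataset name csv_demo and its values are made up for illustration; they do not come from the presentation.

```sas
* Hypothetical example: csv_demo and its values are illustrative only;
data csv_demo;
  * DATALINES stands in here for an external comma-delimited file;
  infile datalines dlm = ',';
  input age sex $ country $;
  datalines;
54,MALE,URUGUAY
35,MALE,JAPAN
40,FEMALE,ISRAEL
;;;;
run;

proc print data=csv_demo;
  title1 "data csv_demo";
run;
```

With a real file, the INFILE statement would name the file (or a FILENAME reference) in place of DATALINES, and the DATALINES section would be dropped.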
A couple of options that are useful with delimited-file INFILE statements are DSD, which will recognize missing values between two delimiters in a row, and MISSOVER, which keeps SAS from reading data from the following line if the current observation is not completely filled in. EXAMPLE FOR A COMMA-DELIMITED FILE:

infile 'users/sasajr/test.dat' dlm = ',' dsd missover;

Once the data source is identified with either INFILE or DATALINES, INPUT creates the variables in the resultant SAS dataset.

The simplest form of an INPUT statement is often called free-form input. It does not require the data to be laid out consistently in columns. With this form, character values are stored with a default length of 8 characters, so longer values are truncated, and they cannot include embedded spaces. Note the use of the dollar sign to indicate that a variable is character rather than numeric.

data free_form;
input age sex $;
datalines;
54 MALE
35 MALE
40 FEMALE
29 FEMALE
;;;;

proc print data=free_form;
title1 "data free_form";
run;

NOTE: The data set WORK.FREE_FORM has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

NOTE: There were 4 observations read from the data set WORK.FREE_FORM.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

data free_form

Obs    age    sex
  1     54    MALE
  2     35    MALE
  3     40    FEMALE
  4     29    FEMALE

If the data are column-aligned in their source you can use column pointers, each of which consists of an @ sign followed by a number, to indicate their placement within the source record.
data column_aligned;
input @1 age @4 sex $;
datalines;
54 MALE
35 MALE
40 FEMALE
29 FEMALE
;;;;

proc print data = column_aligned;
title1 "data column_aligned";
run;

NOTE: The data set WORK.COLUMN_ALIGNED has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

NOTE: There were 4 observations read from the data set WORK.COLUMN_ALIGNED.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

data column_aligned

Obs    age    sex
  1     54    MALE
  2     35    MALE
  3     40    FEMALE
  4     29    FEMALE

If the source data have more than one row per observation, SAS offers row pointers, each of which consists of a pound sign (#) followed by a number.

data column_and_row_pointers;
input #1 @1 age @4 sex $
      #2 country $;
datalines;
54 MALE
URUGUAY
35 MALE
KAZAKHSTAN
40 FEMALE
UNITED KINGDOM
29 FEMALE
AUSTRALIA
;;;;

proc print data=column_and_row_pointers;
title1 'data column_and_row_pointers';
run;

NOTE: The data set WORK.COLUMN_AND_ROW_POINTERS has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

NOTE: There were 4 observations read from the data set WORK.COLUMN_AND_ROW_POINTERS.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

data column_and_row_pointers

Obs    age    sex       country
  1     54    MALE      URUGUAY
  2     35    MALE      KAZAKHST
  3     40    FEMALE    UNITED
  4     29    FEMALE    AUSTRALI

SAS also allows moving forward to a subsequent row within an observation by using a slash (/).
data row_and_column;
input @1 age @4 sex $ /
      @1 country $ ;
datalines;
54 MALE
URUGUAY
35 MALE
KAZAKHSTAN
40 FEMALE
UNITED KINGDOM
29 FEMALE
AUSTRALIA
;;;;

proc print data=row_and_column;
title1 "data row_and_column";
run;

NOTE: The data set WORK.ROW_AND_COLUMN has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

NOTE: There were 4 observations read from the data set WORK.ROW_AND_COLUMN.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

data row_and_column

Obs    age    sex       country
  1     54    MALE      URUGUAY
  2     35    MALE      KAZAKHST
  3     40    FEMALE    UNITED
  4     29    FEMALE    AUSTRALI

Formatted input can work with source data that do not fit the requirements of the less specific types of INPUT statements, for example character values that are longer than 8 characters and/or include embedded spaces. Note that SAS often refers to formats used in an INPUT statement as INFORMATs.

data row_and_column_with_a_format;
input @1 age @4 sex $ /
      country $14.;
datalines;
54 MALE
URUGUAY
35 MALE
KAZAKHSTAN
40 FEMALE
UNITED KINGDOM
29 FEMALE
AUSTRALIA
;;;;

proc print data=row_and_column_with_a_format;
title1 "data row_and_column_with_a_format";
run;

NOTE: The data set WORK.ROW_AND_COLUMN_WITH_A_FORMAT has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

NOTE: There were 4 observations read from the data set WORK.ROW_AND_COLUMN_WITH_A_FORMAT.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

data row_and_column_with_a_format

Obs    age    sex       country
  1     54    MALE      URUGUAY
  2     35    MALE      KAZAKHSTAN
  3     40    FEMALE    UNITED KINGDOM
  4     29    FEMALE    AUSTRALIA

GETTING DATA INTO SAS DATASETS FROM OTHER SAS DATASETS

SAS offers three commands to make new datasets out of previously existing SAS datasets: SET, MERGE, and UPDATE, all of which can be used in situations that range from extremely simple to extremely complex.

SET allows a programmer to simply copy the contents of one dataset into another without any further modifications.

data column_and_row_pointers;
input #1 @1 age @4 sex $
      #2 country $;
datalines;
54 MALE
URUGUAY
35 MALE
KAZAKHSTAN
40 FEMALE
UNITED KINGDOM
29 FEMALE
AUSTRALIA
;;;;

data subjects;
set column_and_row_pointers;
run;

NOTE: The data set WORK.COLUMN_AND_ROW_POINTERS has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

NOTE: There were 4 observations read from the data set WORK.COLUMN_AND_ROW_POINTERS.
NOTE: The data set WORK.SUBJECTS has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

In tandem with IF and OUTPUT statements SET can create multiple datasets from a single source.
data men women unknown;
set column_and_row_pointers;
if sex = "MALE" then output men;
if sex = "FEMALE" then output women;
if not (sex in ("MALE","FEMALE")) then output unknown;
run;

NOTE: There were 4 observations read from the data set WORK.COLUMN_AND_ROW_POINTERS.
NOTE: The data set WORK.MEN has 2 observations and 3 variables.
NOTE: The data set WORK.WOMEN has 2 observations and 3 variables.
NOTE: The data set WORK.UNKNOWN has 0 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.07 seconds
      cpu time            0.01 seconds

Without getting into the intricacies of this here, a single SET statement with 2 or more datasets gives different results than multiple SET statements for individual datasets.

MERGE can create a dataset from two or more pre-existing SAS datasets. Although SAS does allow its use with or without a BY statement, using MERGE without BY is dangerous since it puts observations together with no regard for their content.

A system option called MERGENOBY is often used to avoid the chaotic results of a MERGE statement without an attendant BY. OPTIONS MERGENOBY=ERROR will terminate any DATA step that has a MERGE without a BY (a reminder that having nothing is sometimes preferable to having garbage), and OPTIONS MERGENOBY=WARN will allow the DATA step to complete while issuing a WARNING to the log.

Using a BY statement along with MERGE forces SAS to match observations on common values. This can be used for one-to-one, one-to-many, and many-to-one matches, whose common values must match exactly.
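
A one-to-many match might look like the following sketch. The dataset names subject_sex, visits, and visits_with_sex and all of their values are hypothetical, invented only to illustrate the idea; both input datasets are entered already in order of the BY variable.

```sas
* Hypothetical data: one row per subject;
data subject_sex;
  input subject sex $;
  datalines;
1 M
2 F
;;;;
run;

* Hypothetical data: several visits per subject;
data visits;
  input subject visit sbp;
  datalines;
1 1 120
1 2 118
2 1 133
2 2 129
2 3 131
;;;;
run;

* One-to-many match: each visit row picks up the sex value for its subject;
data visits_with_sex;
  merge subject_sex visits;
  by subject;
run;
```

Under these assumptions the merged dataset should have one observation per visit, with the single sex value for each subject carried onto every one of that subject's visits.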
Data sets need to be SORTed by the matching variables in order for a MERGE...BY command to work.

data sbp; * systolic blood pressure;
input subject sbp;
datalines;
1 120
3 122
4 108
5 133
6 120
7 129
8 139
9 123
10 139
run;

data dbp;
input subject dbp; * diastolic blood pressure;
datalines;
4 80
5 79
1 95
3 88
2 80
10 88
8 77
6 84
7 82
9 90
run;

Running a MERGE statement without SORTing:

data bp;
merge sbp dbp;
by subject;
run;

ERROR: BY variables are not properly sorted on data set WORK.DBP.
subject=5 sbp=133 dbp=79 FIRST.subject=1 LAST.subject=1 _ERROR_=1 _N_=4
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 5 observations read from the data set WORK.SBP.
NOTE: There were 3 observations read from the data set WORK.DBP.
WARNING: The data set WORK.BP may be incomplete. When this step was stopped there were 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

Running a MERGE with the previous data steps properly SORTed:

proc sort data=sbp;
by subject;
run;

proc sort data=dbp;
by subject;
run;

data bp;
merge sbp dbp;
by subject;
run;

NOTE: There were 9 observations read from the data set WORK.SBP.
NOTE: There were 10 observations read from the data set WORK.DBP.
NOTE: The data set WORK.BP has 10 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.01 seconds

RESULT OF MERGE
WHICH CONTAINS ALL VARIABLES FROM THE SOURCE DATASETS

Obs    subject    sbp    dbp
  1          1    120     95
  2          2      .     80
  3          3    122     88
  4          4    108     80
  5          5    133     79
  6          6    120     84
  7          7    129     82
  8          8    139     77
  9          9    123     90
 10         10    139     88

Datasets named in a MERGE statement can have special flag variables associated with them, which can then be used to select the observations that show up in the resulting dataset. This involves the IN= option.

* MERGE, keeping only observations for subjects in both source datasets:;
data bp3;
merge sbp(in=insbp) dbp(in=indbp);
by subject;
if insbp and indbp;
run;

NOTE: There were 9 observations read from the data set WORK.SBP.
NOTE: There were 10 observations read from the data set WORK.DBP.
NOTE: The data set WORK.BP3 has 9 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

A problem to watch out for: if two datasets have identically named variables apart from those used in the BY statement, the value from the dataset listed second in the MERGE statement overwrites the value from the first, even if the value in the second dataset is missing.

data sbp2;
input subject sbp sex $;
datalines;
1 120 M
3 122 M
4 108 M
5 133 F
6 120 F
7 129 F
8 139 M
9 123 F
10 139 M
run;

data dbp2;
input subject dbp sex $;
datalines;
4 80 ?
5 79 ?
1 95 ?
3 88 ?
2 80 ?
10 88 ?
8 77 ?
6 84 ?
7 82 ?
9 90 ?
run;

proc sort data=sbp2;
by subject;
run;

proc sort data=dbp2;
by subject;
run;

data bp2;
merge sbp2 dbp2;
by subject;
run;

proc print data=bp2;
var subject sbp dbp sex;
title1 'MERGED DATA';
title2 'SEX FROM THE SECOND DATASET';
title3 'HAS WRITTEN OVER SEX FROM THE FIRST DATASET';
run;

NOTE: There were 9 observations read from the data set WORK.SBP2.
NOTE: There were 10 observations read from the data set WORK.DBP2.
NOTE: The data set WORK.BP2 has 10 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds

MERGED DATA
SEX FROM THE SECOND DATASET
HAS WRITTEN OVER SEX FROM THE FIRST DATASET

Obs    subject    sbp    dbp    sex
  1          1    120     95    ?
  2          2      .     80    ?
  3          3    122     88    ?
  4          4    108     80    ?
  5          5    133     79    ?
  6          6    120     84    ?
  7          7    129     82    ?
  8          8    139     77    ?
  9          9    123     90    ?
 10         10    139     88    ?

One solution to this: rename at least one of the variables.

data sbp3;
input subject sbp sbpsex $;
datalines;
1 120 M
3 122 M
4 108 M
5 133 F
6 120 F
7 129 F
8 139 M
9 123 F
10 139 M
run;

data dbp3;
input subject dbp dbpsex $;
datalines;
4 80 ?
5 79 ?
1 95 ?
3 88 ?
2 80 ?
10 88 ?
8 77 ?
6 84 ?
7 82 ?
9 90 ?
run;

proc sort data=sbp3;
by subject;
run;

proc sort data=dbp3;
by subject;
run;

data bp3;
merge sbp3 dbp3;
by subject;
run;

proc print data=bp3;
var subject sbp dbp sbpsex dbpsex;
title1 "MERGED DATA";
title2 "WITH NO VARIABLES WRITTEN OVER";
run;

MERGED DATA
WITH NO VARIABLES WRITTEN OVER

Obs    subject    sbp    dbp    sbpsex    dbpsex
  1          1    120     95    M         ?
  2          2      .     80              ?
  3          3    122     88    M         ?
  4          4    108     80    M         ?
  5          5    133     79    F         ?
  6          6    120     84    F         ?
  7          7    129     82    F         ?
  8          8    139     77    M         ?
  9          9    123     90    F         ?
 10         10    139     88    M         ?

The UPDATE command applies the values of one dataset (the TRANSACTION DATASET) to another (the MASTER DATASET). It can write a new version of the MASTER DATASET, or it can create a third dataset.

UPDATE copies non-missing values from the TRANSACTION dataset over MASTER dataset values where appropriate. It makes no change to the MASTER value if the corresponding TRANSACTION value is missing. Like MERGE, UPDATE requires the input datasets to be sorted.

* MASTER DATASET:;
data weeks_in_study;
input subject lastweek;
datalines;
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
;;;;
run;

* TRANSACTION DATASET:;
data recent_weeks;
input subject lastweek;
datalines;
11 2
9 3
7 4
5 2
3 4
1 1
2 .
4 .
6 .
;;;;
run;

proc sort data=weeks_in_study;
by subject;
run;

proc sort data=recent_weeks;
by subject;
run;

data current_weeks;
update weeks_in_study recent_weeks;
by subject;
run;

proc print data=current_weeks;
title1 "MASTER DATASET";
title2 "AFTER APPLYING THE TRANSACTION DATASET";
run;

MASTER DATASET
AFTER APPLYING THE TRANSACTION DATASET

Obs    subject    lastweek
  1          1           1
  2          2           1
  3          3           4
  4          4           1
  5          5           2
  6          6           1
  7          7           4
  8          8           1
  9          9           3
 10         10           1
 11         11           2

WORKING WITH DATA ONCE THEY ARE IN A SAS DATASET

Since we can rarely command the form in which we get our data, the tools for working with them in SAS are greatly helpful. DATA step statements can be positional (that is, their order matters), or non-positional (that is, their order does not matter).

LENGTH statements give programmers a pro-active way to set the length of a variable without leaving it to the chance of the variable’s content.

The LENGTH statement is most often associated with CHARACTER variables. It can make the difference between usable data and garbage, especially when trying to match with other datasets, as with MERGE statements or PROC APPEND.
data length_1;
input @1 yesno_code;
if yesno_code = 1 then yesno = "NO";
else if yesno_code = 2 then yesno = "YES";
datalines;
2
1
;;;;
run;

LENGTH_1, WITH LENGTH OF YESNO DETERMINED FROM THE DATA VALUES AND THEIR ORDER
NOTE THAT THE YES VALUE IS TRUNCATED TO YE

Obs    yesno_code    yesno
  1             2    YE
  2             1    NO

data length_correct;
length yesno $ 3;
input @1 yesno_code;
if yesno_code = 1 then yesno = "NO";
else if yesno_code = 2 then yesno = "YES";
datalines;
2
1
;;;;
run;

DATA LENGTH CORRECT
NOTE THAT THE LENGTH OF YESNO IS SET IN THE LENGTH STATEMENT
AND THE DATA ARE PRESENTED IN FULL

Obs    yesno_code    yesno
  1             2    YES
  2             1    NO

Position is VERY important for LENGTH statements: SAS is extremely finicky about their coming at or near the top of a DATA step, before any other statement has had a chance to set the variable's length.

data length_wrong;
input @1 yesno_code;
if yesno_code = 1 then yesno = "NO";
else if yesno_code = 2 then yesno = "YES";
length yesno $ 3;
datalines;
2
1
;;;;
run;

data length_wrong;
input @1 yesno_code;
if yesno_code = 1 then yesno = "NO";
else if yesno_code = 2 then yesno = "YES";
length yesno $ 3;
WARNING: Length of character variable yesno has already been set.
         Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.
datalines;

NOTE: The data set WORK.LENGTH_WRONG has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.01 seconds

DATA LENGTH_WRONG
NOTE THAT THE LENGTH OF YESNO IS NOT SET TO WHAT IS IN THE LENGTH STATEMENT
AND DATA ARE TRUNCATED

Obs    yesno_code    yesno
  1             2    YE
  2             1    NO

data countries;
length country $ 20;
input @1 country $25.;
datalines;
URUGUAY
PAKISTAN
JAPAN
ISRAEL
UNITED ARAB EMIRATES
UNITED KINGDOM
;;;;

data more_countries;
length country $ 25;
input @1 country $25.;
datalines;
CANADA
ARGENTINA
UNITED STATES OF AMERICA
ZAIRE
NEW ZEALAND
ZAMBIA
;;;;

proc append base=countries data=more_countries;
run;

NOTE: Appending WORK.MORE_COUNTRIES to WORK.COUNTRIES.
WARNING: Variable country has different lengths on BASE and DATA files (BASE 20 DATA 25).
ERROR: No appending done because of anomalies listed above. Use FORCE option to append these files.
NOTE: 0 observations added.
NOTE: The data set WORK.COUNTRIES has 6 observations and 1 variables.
NOTE: Statements not processed because of errors noted above.
NOTE: PROCEDURE APPEND used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
NOTE: The SAS System stopped processing this step because of errors.

HOLDING ON TO WHAT YOU WANT AND GETTING RID OF THE REST

Often programming requires working with variables that are unnecessary and unwanted in the final output.

KEEP and DROP statements allow for holding on only to those variables that you want after the work is done. They can be applied either as stand-alone statements placed anywhere in the DATA step, or as parenthetical modifications to the DATA statement.
dataset SUBJECTS

Obs    age    sex       country
  1     54    MALE      URUGUAY
  2     35    MALE      KAZAKHSTAN
  3     40    FEMALE    UNITED KINGDOM
  4     29    FEMALE    AUSTRALIA

data countries_1;
set subjects;
keep country;
run;

dataset COUNTRIES_1

Obs    country
  1    URUGUAY
  2    KAZAKHSTAN
  3    UNITED KINGDOM
  4    AUSTRALIA

data countries_2;
keep country;
set subjects;
run;

dataset COUNTRIES_2

Obs    country
  1    URUGUAY
  2    KAZAKHSTAN
  3    UNITED KINGDOM
  4    AUSTRALIA

data countries_3 (keep=country);
set subjects;
run;

dataset COUNTRIES_3

Obs    country
  1    URUGUAY
  2    KAZAKHSTAN
  3    UNITED KINGDOM
  4    AUSTRALIA

DROP mirrors KEEP in that it dispenses with variables that you don’t want to hold on to rather than holding on to those that you do want.

data countries_4;
drop age sex;
set subjects;
run;

dataset COUNTRIES_4

Obs    country
  1    URUGUAY
  2    KAZAKHSTAN
  3    UNITED KINGDOM
  4    AUSTRALIA

The RETAIN statement sounds similar to KEEP, but functions differently. Use RETAIN to hold a variable’s value across observations. It is great for building counters within datasets.

In addition to facilities for KEEPing and DROPping variables within observations, SAS has facilities to hold on to and discard observations in a dataset.

To exclude unwanted observations use the DELETE command. This is most often done conditionally, in conjunction with an IF statement.

data men4;
set subjects;
if sex ne "MALE" then delete;
run;

NOTE: There were 4 observations read from the data set WORK.SUBJECTS.
NOTE: The data set WORK.MEN4 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds

To include only those observations that are wanted use the OUTPUT command.
This can be done conditionally, in conjunction with an IF statement.

data men5;
set subjects;
if sex = "MALE" then output;
run;

NOTE: There were 4 observations read from the data set WORK.SUBJECTS.
NOTE: The data set WORK.MEN5 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

Note that OUTPUT does not always need to be articulated. If there is no explicit OUTPUT or DELETE statement anywhere in a DATA step SAS will include all observations processed.

One command that may take some getting used to is the SUBSETTING IF statement. It takes getting used to because it sounds like an incomplete sentence, an IF with no THEN. A good way to think of it is IF XXX THEN include this observation. If the observation passes the IF condition it is included in the dataset and SAS applies the rest of the commands in the DATA step. If the observation fails the IF condition the observation is excluded and SAS does not process the observation further.

data men6;
set subjects;
if sex = "MALE";
run;

NOTE: There were 4 observations read from the data set WORK.SUBJECTS.
NOTE: The data set WORK.MEN6 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds

Often a WHERE statement can be used to the same effect as a subsetting IF. If the datasets in question are large, WHERE can save a lot of computer time. One drawback: with WHERE the log reports only the observations that satisfied the condition, not how many observations are in the source dataset, whereas with a subsetting IF it reports every observation read.
    data men6;
        set subjects;
        if sex = "MALE";
    run;

    data men7;
        set subjects;
        where sex = "MALE";
    run;

SAS log for MEN6 (subsetting IF):

    NOTE: There were 4 observations read from the data set WORK.SUBJECTS.
    NOTE: The data set WORK.MEN6 has 2 observations and 3 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.03 seconds
          cpu time            0.01 seconds

SAS log for MEN7 (WHERE):

    NOTE: There were 2 observations read from the data set WORK.SUBJECTS.
          WHERE sex='MALE';
    NOTE: The data set WORK.MEN7 has 2 observations and 3 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

DATASETS WHERE WE KEEP NOTHING

Sometimes we use DATA steps to accomplish tasks that do not involve keeping any data. For these we often use the special reserved name _NULL_. Uses include writing reports, setting or checking macro variables, and writing messages to the log.

    data _null_;
        x = ('16MAY2010'D - '10DEC2009'D)/7;
        Y = ('10JAN2010'D - '10DEC2009'D)/7;
        put x y;
    run;

SAS log:

    22.428571429 4.4285714286
    NOTE: DATA statement used (Total process time):
          real time           0.03 seconds
          cpu time            0.00 seconds

PROCEDURAL PROGRAMMING TOOLS

Position is extremely important for procedural commands since the same commands in different orders can give some very different results.
    data _null_;
        x = 4;
        x = x + 1;
        x = x * 4;
        put "x from first set of statements: " x;
        x = 4;
        x = x * 4;
        x = x + 1;
        put "x from second set of statements: " x;
    run;

SAS log:

    x from first set of statements: 20
    x from second set of statements: 17
    NOTE: DATA statement used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

Note that assignment statements such as those just shown are quite economical in terms of computer time.

PROCEDURAL PROGRAMMING TOOLS
CONDITIONAL BRANCHING: THE VERY HEART OF PROCEDURAL PROGRAMMING!

The basic IF...THEN statement instructs the system to execute a command or group of commands if the specified condition is true, and possibly another command or command group if it is false. Here is a basic IF statement that shows what to do if the stated condition is true, and nothing else.

    data showing_if1;
        length yesno $ 3;
        input @1 yesno_code;
        if yesno_code = 1 then yesno = "NO";
    datalines;
    2
    1
    4
    3
    ;;;;
    run;

    proc print data=showing_if1;
        title1 "DEMONSTRATING RESULTS OF A BASIC IF STATEMENT";
    run;

    DEMONSTRATING RESULTS OF A BASIC IF STATEMENT

    Obs    yesno    yesno_code
      1                  2
      2    NO            1
      3                  4
      4                  3

In order to instruct the system what to do when the IF condition fails, follow the IF statement with an ELSE.

    data showing_if2;
        length yesno $ 3;
        input @1 yesno_code;
        if yesno_code = 1 then yesno = "NO";
        else yesno = "YES";
    datalines;
    2
    1
    4
    3
    ;;;;
    run;

    proc print data=showing_if2;
        title1 "DEMONSTRATING RESULTS";
        title2 "OF AN IF STATEMENT";
        title3 "WITH A FOLLOWING ELSE STATEMENT";
    run;

    DEMONSTRATING RESULTS
    OF AN IF STATEMENT
    WITH A FOLLOWING ELSE STATEMENT

    Obs    yesno    yesno_code
      1    YES           2
      2    NO            1
      3    YES           4
      4    YES           3

SAS allows for complex constructions involving IF and ELSE statements.
These are helpful in expressing the outcomes of complex logic.

    data showing_if3;
        length yesno $ 3;
        input @1 yesno_code;
        if yesno_code = 1 then yesno = "NO";
        else if yesno_code = 2 then yesno = "YES";
        else yesno = "INV";
    datalines;
    2
    1
    4
    3
    ;;;;
    run;

    proc print data=showing_if3;
        title1 "DEMONSTRATING RESULTS";
        title2 "OF AN IF STATEMENT";
        title3 "WITH A FOLLOWING ELSE IF AND ELSE";
    run;

    DEMONSTRATING RESULTS
    OF AN IF STATEMENT
    WITH A FOLLOWING ELSE IF AND ELSE

    Obs    yesno    yesno_code
      1    YES           2
      2    NO            1
      3    INV           4
      4    INV           3

In order to have SAS execute multiple commands pursuant to a condition use a DO block in connection with the appropriate IF or ELSE statement. Note that a DO block must close with an END statement, and that END statements close out DO blocks and not IF blocks.

    data showing_if4;
        length yesno $ 3;
        input @1 yesno_code;
        if yesno_code = 1 then yesno = "NO";
        else if yesno_code = 2 then yesno = "YES";
        else do;
            yesno = "INV";
            put "Observation " _n_ " has an invalid code.";
        end;
    datalines;
    2
    1
    4
    3
    ;;;;
    run;

SAS log:

    Observation 3 has an invalid code.
    Observation 4 has an invalid code.
    NOTE: The data set WORK.SHOWING_IF4 has 4 observations and 2 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

    DEMONSTRATING RESULTS
    OF AN IF STATEMENT
    WITH A FOLLOWING ELSE IF AND ELSE

    Obs    yesno    yesno_code
      1    YES           2
      2    NO            1
      3    INV           4
      4    INV           3

PROCEDURAL PROGRAMMING TOOLS
LOOPS

In addition to designating blocks of commands to be executed once, you can also use DO to loop through groups of commands subject to certain conditions.

DO WHILE loops iterate as long as the test condition is true.
    * demonstrating DO--WHILE loop:;
    data _null_;
        index_var = 0;
        do while (index_var < 10);
            put "index_var: " index_var;
            index_var = index_var + 1;
        end;
    run;

SAS log:

    index_var: 0
    index_var: 1
    index_var: 2
    index_var: 3
    index_var: 4
    index_var: 5
    index_var: 6
    index_var: 7
    index_var: 8
    index_var: 9
    NOTE: DATA statement used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

DO UNTIL loops iterate as long as the test condition is false.

    * demonstrating DO--UNTIL loop:;
    data _null_;
        index_var = 0;
        do until (index_var > 10);
            put "index_var: " index_var;
            index_var = index_var + 1;
        end;
    run;

SAS log:

    index_var: 0
    index_var: 1
    index_var: 2
    index_var: 3
    index_var: 4
    index_var: 5
    index_var: 6
    index_var: 7
    index_var: 8
    index_var: 9
    index_var: 10
    NOTE: DATA statement used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

DO loops can also be set up to iterate a set number of times.
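The two conditional loops also differ in when they test their condition. As a minimal sketch, with a starting value of 20 chosen only for illustration: DO WHILE evaluates its condition at the top of the loop, so its body may never run, while DO UNTIL evaluates its condition at the bottom, so its body always runs at least once.

```sas
data _null_;
    /* 20 is an arbitrary illustrative starting value */
    index_var = 20;
    /* condition is false on entry and is tested at the top: body never runs */
    do while (index_var < 10);
        put "WHILE body ran with index_var: " index_var;
        index_var = index_var + 1;
    end;

    index_var = 20;
    /* condition is already true but is tested at the bottom: body runs once */
    do until (index_var > 10);
        put "UNTIL body ran with index_var: " index_var;
        index_var = index_var + 1;
    end;
run;
```

Only the message from the DO UNTIL loop should appear in the log, which is worth keeping in mind when a loop body must not run for values that already satisfy the exit condition.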
    * demonstrating a DO loop with a set number of iterations:;
    data _null_;
        do index_var = 1 to 5;
            put "index_var: " index_var;
        end;
    run;

SAS log:

    index_var: 1
    index_var: 2
    index_var: 3
    index_var: 4
    index_var: 5
    NOTE: DATA statement used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

PROCEDURAL PROGRAMMING TOOLS
ARRAYS

SAS arrays have some fetching and often frustrating differences from arrays in other programming languages. In other languages an array holds a group of values that have identical attributes and no life outside of the array. Arrays in SAS instead consist of groups of variables that need to be of the same type, but that do not necessarily have anything else in common.

For example, SAS would allow programmers to put RACE, SEX, COUNTRY, and ETHNICITY into a single array, even though they all have different lengths.

SAS offers two methods for handling array subscripts, called IMPLICIT and EXPLICIT subscripting. With IMPLICIT subscripting the programmer does not need to be concerned with articulating array subscripts.

    data implicit_arrays;
        LENGTH RACE $ 15 SEX $ 6 ETHNICITY $ 12 COUNTRY $ 27;
        subject = "99";
        * NOTE THE USE OF THE DOLLAR SIGN FOR THE ARRAY OF CHARACTER VARIABLES:;
        ARRAY DEMOS $ RACE SEX ETHNICITY COUNTRY;
        ARRAY VITALS SYSTOLIC_BP DIASTOLIC_BP PULSE RESPIRATION WEIGHT;
    run;

To loop through elements of an implicitly subscripted array use the DO OVER looping command.
    * DEMONSTRATION OF NULLING OUT A SET OF VARIABLES THROUGH USING ARRAYS:;
    data implicit_arrays;
        LENGTH RACE $ 15 SEX $ 6 ETHNICITY $ 12 COUNTRY $ 27;
        subject = "99";
        ARRAY DEMOS $ RACE SEX ETHNICITY COUNTRY;
        ARRAY VITALS SYSTOLIC_BP DIASTOLIC_BP PULSE RESPIRATION WEIGHT;
        DO OVER DEMOS;
            DEMOS = " ";
        END;
        DO OVER VITALS;
            VITALS = .;
        END;
    run;

For obscure reasons, implicitly subscripted arrays are often frowned upon.

For EXPLICITLY subscripted arrays the programmer needs to declare the number of elements or use an * for a number of elements that is not pre-determined.

    data explicit_arrays1;
        LENGTH RACE $ 15 SEX $ 6 ETHNICITY $ 12 COUNTRY $ 27;
        subject = "99";
        ARRAY DEMOS {4} $ RACE SEX ETHNICITY COUNTRY;
        ARRAY VITALS {5} SYSTOLIC_BP DIASTOLIC_BP PULSE RESPIRATION WEIGHT;
    run;

Looping through elements of an explicitly subscripted array requires DO loops that use index variables.

    data explicit_arrays1;
        LENGTH RACE $ 15 SEX $ 6 ETHNICITY $ 12 COUNTRY $ 27;
        subject = "99";
        ARRAY DEMOS {4} $ RACE SEX ETHNICITY COUNTRY;
        ARRAY VITALS {5} SYSTOLIC_BP DIASTOLIC_BP PULSE RESPIRATION WEIGHT;
        DO INDEX = 1 TO 4;
            DEMOS{INDEX} = " ";
        END;
        DO INDEX = 1 TO 5;
            VITALS{INDEX} = .;
        END;
    run;

Note that in no case does the name of the array stay in any datasets that are created. Using an array in a subsequent DATA step requires declaring it again.

PROCEDURAL PROGRAMMING TOOLS
RECODING DATA

Since data do not always arrive in the form that users of SAS output require, programmers often have to recode them. They often do this in DATA steps. For example, users often need to report ages in groups rather than as a specific year value.

    data men4;
        set four;
        if sex = "MALE";
        * RECODING DATA USING ASSIGNMENT STATEMENTS:;
        If age = .
            then age_group = "not reported";
        If 0 <= age < 15 then age_group = "< 15";
        If 15 <= age <= 24 then age_group = "15 - 24";
        If 25 <= age <= 34 then age_group = "25 - 34";
        If 35 <= age <= 44 then age_group = "35 - 44";
        If 45 <= age <= 54 then age_group = "45 - 54";
        If 55 <= age <= 64 then age_group = "55 - 64";
        if age >= 65 then age_group = "65+";
    run;

    proc format;
        value agegr
            .        = "not reported"
            0 - 14   = "< 15"
            15 - 24  = "15 - 24"
            25 - 34  = "25 - 34"
            35 - 44  = "35 - 44"
            45 - 54  = "45 - 54"
            55 - 64  = "55 - 64"
            /* USING AN IMPOSSIBLY HIGH VALUE AS AN UPPER BOUND */
            65 - 999 = "65+";
    run;

    data men4;
        set four;
        if sex = "MALE";
        * RECODING DATA USING A FORMAT:;
        age_group = put(age,agegr.);
    run;

Note that the PUT function used here (as distinct from the PUT statement that writes to the log) always returns its result as a CHARACTER value, and that PUT and OUTPUT are not inverses of one another.

VARIABLE ATTRIBUTES
LABELS

As the data names that programmers use frequently don’t mean anything to anyone else, information users frequently request labels to go with the variables. A single LABEL statement can assign labels to multiple variables.

    data blood_pressure;
        input subject sbp dbp;
        label sbp = "Systolic Blood Pressure"
              dbp = "Diastolic Blood Pressure"
              ;
    datalines;
    1 120 80
    3 122 79
    4 108 95
    5 133 88
    6 120 77
    7 129 84
    8 139 84
    9 123 89
    10 139 80
    ;
    run;

Note that the LABEL statement is non-positional: it can appear anywhere in the DATA step.

VARIABLE ATTRIBUTES
FORMATS

SAS can format variables in PROC steps, in which case the format is only applied through the duration of that PROC step, or in a DATA step, in which case the format stays with the variable throughout the program.

    data men4;
        set four;
        * SINCE IT IS ASSIGNED IN A DATA STEP THE FORMAT AGEGR.
          WILL STAY WITH THE VARIABLE AGE THROUGHOUT THIS PROGRAM;
        format age agegr.;
    run;

Note that the FORMAT statement is non-positional: it can appear anywhere in the DATA step.

CONTACT AARON RABUSHKA AT ARABUSHKA@INCRESEARCH.COM

QUESTIONS?