Advanced Data Manipulation
Greg Jenkins

Topics
• How a data step really works.
• Advanced data step features: arrays, loops, SAS system variables, retain, output statements, advanced “input” statements, using the data step as an output device.
• Data manipulation procedures: proc transpose, and an overview of proc sql.

Compile Part of a Data Step
• SAS scans the syntax of the data step (checks for syntax errors).
• Translates the source code to machine language.
• Defines the input/output files.
• Creates:
  - the input buffer (for non-SAS data)
  - the Program Data Vector (PDV)
• Sets the variable attributes for the output dataset.
• Determines which variables to set to missing.

Execute Part of a Data Step
• Executes the code written in the data step.
• Reads one line of data from the data source (if one exists) into the PDV.
• The PDV is just a temporary storage area that can be thought of as one row of data.
• Any calculations, subsetting, etc. that are part of the data step code are then carried out.
• Statements are processed until an “output” statement is reached, or until the last executable statement of the data step has run.

Example

Sample code:

    data main;
      set old;
      length y $20.;
      z = x + 2;
      keep x y z;
    run;

OLD dataset:

    X   A
    3   Qsrt
    5   f
    2   #

Compile Part of Example Data Step
• SAS scans the code for syntax errors; there are none (at least I hope there are no typos), so it moves on to the next step.
• SAS translates the SAS code into machine language.
• Defines the input file as a SAS dataset named “old”, and the output file as a SAS dataset named “main”.
• Since the input file is a SAS dataset (there is no input statement), SAS doesn’t create an input buffer.
• SAS creates the PDV:

    X   A   Y   Z   SAS System Variables
    …   …

• SAS now creates the attributes of the variables in the output dataset.
• Attributes of a variable are:
  - type (character or numeric)
  - formats/informats
  - labels
  - length
• We’ll suppose that the variables in the old dataset were defined as:
  - “x” numeric, “a” character
  - no formats, informats, or labels for either
  - “x” has a length of 8, and “a” has a length of 4
• SAS creates the attributes in the new “main” dataset for the variables coming from the “old” dataset, using the same attributes as in “old”, since no statements were issued to change them.
• The attributes of the variables that will be created are:
  - “z” numeric, “y” character (this is defined by the length statement, since a $ appears after the name)
  - no formats, informats, or labels, since none are defined in the data step
  - the length of “z” is 8 by default (this can be changed, but doing so is not recommended), and the length of “y” is 20 because of the length statement
• SAS then initializes the new variables to missing and starts the execution phase of the data step.
• SAS does other things during this phase as well, but for a basic understanding of what’s going on we’ll ignore them and move on (there are some books and papers on this subject if you’re really interested).
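The compile-phase results described above can be inspected directly. What follows is a minimal sketch, assuming a small copy of the OLD dataset built with datalines (the `:$4.` informat is an assumption chosen to match the stated length of “a”). `put _all_` writes the contents of the PDV, including `_N_` and `_ERROR_`, to the log on every iteration, and `proc contents` lists the attributes that were fixed at compile time.

```sas
/* Hypothetical copy of the OLD dataset from the slides */
data old;
  input x a :$4.;          /* x numeric (length 8), a character (length 4) */
  datalines;
3 Qsrt
5 f
2 #
;
run;

data main;
  set old;
  length y $20;            /* declares y as character with length 20 */
  z = x + 2;
  put _all_;               /* dump the PDV to the log once per iteration */
  keep x y z;
run;

/* Show the type and length of each variable stored in MAIN */
proc contents data=main;
run;
```

In the `proc contents` listing, y should appear as character with length 20 and z as numeric with length 8, while “a” is absent because of the keep statement. SAS also writes a note that y is uninitialized, since the step never assigns it a value.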
Executable Part of the Data Step

    data main;
      set old;
      length y $20.;
      z = x + 2;
      keep x y z;
    run;

    OLD dataset
    X   A
    3   Qsrt
    5   f
    2   #

• Read the 1st row of data from the “old” dataset into the PDV:

    X   A      Y   Z   SAS System Variables
    3   Qsrt           _n_=1 …

• Run through the executable code, z = x + 2;

    X   A      Y   Z   SAS System Variables
    3   Qsrt       5   _n_=1 …

• End of executable code: the “implicit output” adds the data in the PDV to the output dataset:

    MAIN dataset
    X   Y   Z
    3       5

• Read the 2nd row of data from the “old” dataset into the PDV:

    X   A      Y   Z   SAS System Variables
    5   f              _n_=2 …

• Run through the executable code, z = x + 2;

    X   A      Y   Z   SAS System Variables
    5   f          7   _n_=2 …

• End of executable code: the “implicit output” adds the data in the PDV to the output dataset:

    MAIN dataset
    X   Y   Z
    5       7

• Read the 3rd row of data from the “old” dataset into the PDV:

    X   A      Y   Z   SAS System Variables
    2   #              _n_=3 …

• Run through the executable code, z = x + 2;

    X   A      Y   Z   SAS System Variables
    2   #          4   _n_=3 …

• End of executable code: the “implicit output” adds the data in the PDV to the output dataset:

    MAIN dataset
    X   Y   Z
    2       4

• SAS can’t find any more observations in the input dataset, so the data step ends and the PDV is destroyed.

Output Statement
• “Implicit” vs. “explicit” output statement.
• The previous example used the “implicit” output: after all the executable code in the data step has run, SAS writes a row of data from the PDV to the output dataset.
• An “explicit” output statement is used to control when data is written from the PDV to the output dataset.
• Whenever an output statement is encountered in a data step, data is written from the PDV to the output dataset:

    data gsdfs;
      x = 3;
      output;
      x = 5;
      output;
    run;

• In this example there is no data to be read in, so SAS runs through the executable code only once.
• x is set to a value of 3 in the PDV and then written to the output dataset gsdfs.
• x is then set to a value of 5 in the PDV and written to the output dataset gsdfs.

    GSDFS dataset
    X
    3
    5

SAS System Variables
• SAS system variables are created during the data step but are not stored in the output dataset, although they can be used in the executable code.
• _n_ is a variable that counts the iterations of the data step.
• _error_ is related to data errors when inputting data.
• Other variables that can be used in the executable code come from: in=, first.“by-variable”, last.“by-variable”, end=, point=
• The in= dataset option creates a variable with a user-specified name that takes the value 1 if the current observation came from that input dataset, and 0 if it did not.

    DATA1         DATA2
    ID  X         ID  X
    3   4         2   -6

    data main;
      set data1(in=myvar) data2(in=other1);
      nv1 = myvar;
      nv2 = other1;
    run;

    MAIN
    ID  X   NV1  NV2
    3   4   1    0
    2   -6  0    1

• The first. and last. variables can only be used if a by statement is specified:

    data temp;
      set temp2;
      by id date;
      if first.id then f = 1; else f = 0;
      if last.id then l = 1; else l = 0;
    run;

    TEMP2 (input)        TEMP (output)
    ID  DATE             ID  DATE      F  L
    1   1/5/02           1   1/5/02    1  0
    1   2/6/02           1   2/6/02    0  1
    2   12/31/01         2   12/31/01  1  0
    2   6/15/02          2   6/15/02   0  0
    2   7/7/02           2   7/7/02    0  1
    3   4/7/02           3   4/7/02    1  1

Retain Statement
• Allows variables to keep their values in the PDV instead of being re-initialized to missing when a new line is read from an input statement or dataset.
• Often useful in combination with the first. and last. system variables.

    data data1;
      set temp;
      by id date;
      retain visit;
      if first.id then visit = 1;
      else visit = visit + 1;
    run;

    TEMP                 DATA1
    ID  DATE             ID  DATE      VISIT
    1   1/5/02           1   1/5/02    1
    1   2/6/02           1   2/6/02    2
    2   12/31/01         2   12/31/01  1
    2   6/15/02          2   6/15/02   2
    2   7/7/02           2   7/7/02    3
    3   4/7/02           3   4/7/02    1

Arrays
• Arrays are a way of referring to a group of variables, often more efficiently.
• The two parts of using an array are the array “declaration” statement and the array “call” (reference).
• An array name can be any valid SAS name, provided that the name is not used by any of the variables in the data step.
• The array “declaration” statement that refers to a group of variables has the following form: array arrayname variable-list;
• This statement makes the first “column” (element) of arrayname the first variable in the variable-list, the second column the second variable in the list, and so on.
• To use the array, reference an element with the general call: arrayname[column #].
• The array statement can also create new variables and initialize the elements to starting values (these can be made temporary, i.e. not written to the output dataset, by using the _temporary_ keyword).
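A minimal sketch of that last point, assuming hypothetical names (`scores`, `w`, `adj`, `wsum`) and made-up weights: one array refers to existing variables, one temporary array holds constants, and one array creates new variables.

```sas
data scores;
  input a b c;
  array x[3] a b c;                 /* refers to the existing variables a, b, c */
  array w[3] _temporary_ (1 2 3);   /* hypothetical constants; never written to the output dataset */
  array adj[3];                     /* creates the new variables adj1, adj2, adj3 */
  adj[1] = x[1] * w[1];
  adj[2] = x[2] * w[2];
  adj[3] = x[3] * w[3];
  wsum = adj[1] + adj[2] + adj[3];
  datalines;
1 3 -7
4 2 6
;
run;
```

The output dataset contains a, b, c, adj1 to adj3 and wsum, but no w variables. For the first row wsum is 1*1 + 3*2 + (-7)*3 = -14, and for the second row it is 4*1 + 2*2 + 6*3 = 26.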
Arrays

    data data1;
      set data2;
      array x a b c;
      q = x[1] + x[3] - x[2];
    run;

    DATA2            DATA1
    A  B  C          A  B  C   Q
    1  3  -7         1  3  -7  -9
    4  2  6          4  2  6   8

• To use an element index other than the default, which starts at one, change the array declaration statement: array array-name[start:stop] varlist;

    data data1;
      set data2;
      array x[2:4] a b c;
      q = x[2] + x[4] - x[3];
    run;

    DATA2            DATA1
    A  B  C          A  B  C   Q
    1  3  -7         1  3  -7  -9
    4  2  6          4  2  6   8

• There are also multidimensional arrays; to define these use the following general array declaration statement: array array-name[dim1,dim2, …] varlist;

    data data1;
      set data2;
      array x[2,2] a b c d;
      q = x[1,1] + x[1,2] - x[2,1] + x[2,2];
    run;

    DATA2               DATA1
    A  B  C   D         A  B  C   D  Q
    1  3  -7  2         1  3  -7  2  13
    4  2  6   3         4  2  6   3  3

Loops
• Two kinds: iterative do loops and conditional do loops.
• Iterative loops work with a counter, running a specified group of statements a specified number of times.
• Conditional loops continue to loop until, or while, a conditional statement is true.
• All loops are started with a do statement and ended with an end statement.

Iterative Do Loop
• Basic syntax:

    do counter-variable = start-value to end-value <by by-value>;
      <programming statements (body of the loop);>
    end;

    data temp;
      do i = 1 to 5;
        x = i + 2;
        output;
      end;
    run;

    TEMP
    I  X
    1  3
    2  4
    3  5
    4  6
    5  7

Conditional Do Loop
• There are two types of conditional do loops: do until and do while.
• A do until loop continues looping until its conditional statement becomes true (note that the body always executes at least once).
• A do while loop continues looping while its conditional statement is true.
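The “at least once” difference can be seen with a condition that is already false for the while loop and already true for the until loop. This is a minimal sketch; the starting value of 10 and the message text are arbitrary. It uses `data _null_` so that no dataset is created, and `put` to write to the log.

```sas
data _null_;
  i = 10;
  do while (i le 5);     /* condition is tested at the top: the body never runs */
    put 'while: ' i=;
    i = i + 1;
  end;

  i = 10;
  do until (i > 5);      /* condition is tested at the bottom: the body runs once */
    put 'until: ' i=;
    i = i + 1;
  end;
run;
```

The log shows a single line, `until: i=10`, and nothing from the while loop.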
Do While Loop
• General syntax:

    do while(conditional statements);
      <programming statements (body of the loop);>
    end;

    data temp;
      i = 1;
      do while(i le 5);
        x = i + 2;
        output;
        i = i + 1;
      end;
    run;

    TEMP
    I  X
    1  3
    2  4
    3  5
    4  6
    5  7

Do Until Loop
• General syntax:

    do until(conditional statements);
      <programming statements (body of the loop);>
    end;

    data temp;
      i = 1;
      do until(i > 5);
        x = i + 2;
        output;
        i = i + 1;
      end;
    run;

    TEMP
    I  X
    1  3
    2  4
    3  5
    4  6
    5  7

Advanced Input Statements
• Up to this point we have looked at the basic input statement, assuming the input was given as space-delimited values with each row as an observation, and mainly at the syntax for reading numeric data.
• General syntax: input variable-list;

Basic Input Statement Arguments
• For general character variables, the variable name must be followed by a $:

    data temp;
      input d $;
      cards;
    Banana
    Orange
    ;

• For character variables with an embedded blank use the $ then &. Without a length statement, list input stores only the first 8 characters, so a length is declared here:

    data temp;
      length d $20;
      input d $ &;
      cards;
    Banana Rama
    Orange Roughy
    ;

• For character variables with quotation marks that are not meant as delimiters use the $ then ~:

    data temp;
      input d $ ~;
      cards;
    Banana's
    hasn't
    ;

• For inputting missing values of character or numeric data use a period (.).
• Another way to input missing values that are indicated by a special value is to use the missing statement.

    data temp;
      missing r;
      input a b;
      cards;
    2 r
    . 4
    ;

    TEMP dataset
    A  B
    2  .R
    .  4

Changing Delimiters in Input Statement
• To specify which delimiter to use in the list input style, add the delimiter= option to the infile statement:

    data data1;
      infile 'U:/sasclass/datafile.txt' delimiter=',';
      input a b c;
    run;

Formatted Input
• If you need to read in input that is formatted, such as dates, follow the variable name with a : and its informat:

    data temp;
      input date : mmddyy10.;
      cards;
    11/05/1996
    12/13/2000
    ;

Column Input
• Read in data by specifying the “columns” the data occupies in the input file:

    data temp;
      infile in;
      input x 1-5 y 12-14;
    run;

    File “in”
    Column #
    123456789012345678901234567890
    3234 4321  132344

• The previous example would read from the input file and create the following dataset:

    TEMP dataset
    X     Y
    3234  132

Pointer Controls
• Use @n to set the column at which reading of a variable starts. Note that the value is then read as in list input, i.e. up to the next blank.

    data temp;
      input @2 a @4 b;
      cards;
    12345
    67890
    ;

    TEMP dataset
    A     B
    2345  45
    7890  90

    data temp;
      input @2 a @4 b;
      cards;
    12 45
    67 90
    ;

    TEMP dataset
    A  B
    2  45
    7  90

• Use +n to set the start column of a variable relative to where reading of the previous variable ended.

    data temp;
      input @2 a 1. +2 b;
      cards;
    12345
    67890
    ;

    TEMP dataset
    A  B
    2  5
    7  0

• There are also record (input file row) control symbols such as #; #n moves to the nth record.

    data temp;
      input #2 a;
      cards;
    1
    2
    3
    4
    ;

    TEMP dataset
    A
    2
    4

• Be careful though, this can be tricky with multiple variables.

    data temp;
      input #2 a #1 b;
      cards;
    1 5
    2 6
    3 7
    4 8
    ;

    TEMP dataset
    A  B
    2  1
    4  3

• The / symbol starts the input for the next variable at the 1st column of the next line of input.

    data temp;
      input a / b c;
      cards;
    1
    2 3
    4
    5 6
    ;

    TEMP dataset
    A  B  C
    1  2  3
    4  5  6

Using Data Step as an Output Device
• Use the put statement in a data step to output to somewhere other than a data file.

    data temp;
      x = 3;
      put x;
    run;

• The above example will output a 3 to the log window or .log file.
• If you want to output to a place other than the log window or .log file, use the file statement.

    data temp;
      file print;
      x = 3;
      put x;
    run;

• This will output to the output window or the .lst file.
• You can also specify a file to output to instead of the output window or .lst file:

    data temp;
      file 'U:/myfile.txt';
      x = 3;
      put x;
    run;

• This will output to U:/myfile.txt
• Sometimes you don’t need to create a dataset if you’re just using the data step as an output device, so use the _null_ keyword in the data statement:

    data _null_;
      x = 3;
      put x;
    run;

Proc Transpose
• This procedure transposes data. The data manipulation it does can also be done in the data step, but this procedure is often simpler to use.
• Example: You have a dataset with many observations for each person, say blood pressure measurements taken at several clinical visits, and you want a dataset with one observation per person and a variable for each blood pressure measurement.

YOU HAVE THIS:

    ID  Visit  BP
    1   1      90
    1   2      98
    1   3      76
    2   1      82
    2   2      104
    3   1      115

BUT YOU WANT THIS:

    ID  BP1  BP2  BP3
    1   90   98   76
    2   82   104  .
    3   115  .    .

• General syntax:

    proc transpose data = dsname1 out = dsname2;
      var varname;
      id idvar;
      by byvar;
    run;

• The data= option gives the input dataset name (i.e. the dataset to be transposed), and the out= option gives the output dataset (i.e. the transposed dataset).
• The var statement indicates the variable to be transposed.
• The id statement indicates the variable whose values supply the names of the variables created for the transposed data; the default is to create the variables col1 to coln.
• The by variables are like by variables in all the other procedures and indicate how to transpose the data.
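The syntax above can be sketched end to end on the blood pressure data. This is a minimal sketch, not the only way to do it: the dataset names `bp_long` and `bp_wide` are hypothetical, and building `idvar` with `cats` is one assumed way of supplying the new variable names.

```sas
/* Long layout: one row per person per visit */
data bp_long;
  input id visit bp;
  idvar = cats('BP', visit);   /* BP1, BP2, ... used as the new variable names */
  datalines;
1 1 90
1 2 98
1 3 76
2 1 82
2 2 104
3 1 115
;
run;

/* A by statement needs the data sorted by the by variable */
proc sort data=bp_long;
  by id;
run;

/* Wide layout: one row per person, one variable per visit */
proc transpose data=bp_long out=bp_wide;
  var bp;
  id idvar;
  by id;
run;

proc print data=bp_wide;
run;
```

The result has one row per id with BP1, BP2 and BP3, plus the automatic `_NAME_` variable; visits that a person did not have are missing. An alternative that avoids the extra variable is to use `id visit;` together with the `prefix=BP` option on the proc transpose statement.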
Previous Example
• Using the previous blood pressure example, another variable must be added to the input dataset to indicate the new variable names.

    OLD dataset
    ID  Visit  idvar  BP
    1   1      BP1    90
    1   2      BP2    98
    1   3      BP3    76
    2   1      BP1    82
    2   2      BP2    104
    3   1      BP1    115

• So the code needed to transpose the data and achieve the desired data structure would be:

    proc sort data = old;
      by id;
    run;

    proc transpose data = old out = new;
      var bp;
      id idvar;
      by id;
    run;

Proc SQL
• Proc SQL can do much of, if not more than, what a data step can do.
• It has the ability to access data from other sources (databases, etc.) and, in the case of some databases, can pass native SQL to the database for added efficiency.
• Another ability that SQL has is to do complex merging of data.
• This is an overview of the procedure, so we’ll just look at some of its basic parts.
• Proc SQL is an interactive procedure and is started by issuing the statement: proc sql;
• There is no data= option for this procedure; data is read in a little differently.
• The main option for the proc sql; statement is print|noprint.
• In one respect this is a “query” procedure intended to be used with databases, meaning that you are asking SAS to tell you something about the data. The noprint option suppresses the printed “query” output, so the procedure just builds datasets (for the most part), or tables as they are called in the SQL procedure language.
• There is a nice graphical user interface that is often helpful, since it will output code.
• The SQL GUI can be found by selecting “Tools” then “Query”.
• This interface is also mainly intended as a “query” tool, but it can be used to perform most of the procedure’s capabilities.

Select Statement
• The next most important statement in the SQL procedure is the select statement.
• The select statement allows you to decide which variables are of interest, or which will be used in the “query” or dataset building.

    proc sql;
      select <options>
        from <options>
        where <options>
        group by <options>
        having <options>
        order by <options>;

• To get data into the procedure use the from clause and include a table name (a SAS dataset name, for example).
• In the select statement you can use an asterisk (*) to indicate you are interested in all the variables in the dataset, or supply a list of variables separated by commas.

    proc sql;
      select * from data1;

    proc sql;
      select a, b, c from data1;

• Proc SQL has many functions such as min, max, avg, etc. which can be used to create new variables; however, they work across rows (observations) instead of across columns (variables) as in the data step.
• To create a new variable using one of these functions, write the function call and then the keyword as followed by the new variable name.
• Example 1 creates a new variable y that takes the value one-half of x.
• Example 2 creates a new variable named dsmin that is the minimum x value in the entire dataset data1.

EXAMPLE 1

    proc sql;
      select x*0.5 as y from data1;

EXAMPLE 2

    proc sql;
      select min(x) as dsmin from data1;

• If you want to use a variable that you have created in proc sql in other clauses of the same query, you have to precede the new variable name with the word calculated.
• The following example uses the order by clause to sort the data, as the proc sort procedure does.

    proc sql;
      select x*0.5 as y
        from data1
        order by calculated y;
    quit;

• The group by clause “collapses” rows of like data, as determined by the variables that follow it. It is similar to the familiar by statement in other SAS procedures.
• Variables in the group by clause do not have to be specified in the select clause.
• This is like a by statement, but no sorting is necessary prior to running proc sql, for this or any other part of the SQL procedure.
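A minimal sketch of group by on unsorted input, with a having clause added to filter the collapsed rows. The dataset name `visits`, its values, and the 3000 cutoff are assumptions made for illustration.

```sas
/* Hypothetical data, deliberately not sorted by id */
data visits;
  input id income;
  datalines;
2 5235
1 3213
3 2155
1 545
2 8768
1 654
;
run;

proc sql;
  select id,
         count(*)    as days,     /* number of rows in each id group */
         sum(income) as totinc    /* total income in each id group */
    from visits
    group by id
    having sum(income) > 3000;    /* keep only groups whose total exceeds 3000 */
quit;
```

The query returns two rows: id 1 with 3 days and a total of 4412, and id 2 with 2 days and a total of 14003. The id 3 group (total 2155) is removed by the having clause, and no proc sort is needed beforehand.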

    DATA1 dataset
    ID  DATE     INCOME
    1   1/13/00  3213
    1   2/7/01   545
    1   6/3/99   654
    2   2/7/02   5235
    2   1/8/00   8768
    3   12/2/89  2155

    proc sql;
      select id, count(*) as days, sum(income) as totinc
        from data1
        group by id;
    quit;

    Result of query
    ID  DAYS  TOTINC
    1   3     4412
    2   2     14003
    3   1     2155

Select Statement
• The where and having clauses work the same way as where statements in other SAS procedures and the data step; where filters rows before grouping, while having filters the groups produced by group by.
• If the procedure is connected to a database, however, you can pass database “native” SQL code instead of using SAS code.
• Also, if the procedure is connected to another database and SAS statements are used instead of SQL “native” to the database, SAS must make a temporary copy of the entire database before it subsets (not very efficient).

SQL Joins
• Joins in SQL are basically merging and concatenating datasets together.
• There are many types of joins: left, right, inner, outer, etc.
• It is often easier to use the proc SQL graphical user interface than to figure out how to code this, but if you do code it, you need to put the datasets of interest in the from clause and create aliases for them.
• An alias for a dataset works in the same way that a libname is an alias for a SAS library.

    proc sql;
      select a.id, a.x, b.y
        from data1 a, data2 b
        where a.id = b.id;

    DATA1 dataset      DATA2 dataset
    ID  X              ID  Y
    3   14             3   6
    4   17             4   -5

• The previous example performs the same merge that a data step does with a merge and a by statement.

    Result of query
    ID  X   Y
    3   14  6
    4   17  -5

• More complex joins are completed by using the “type of join name” between two or more select … from statements, or in the from clause alone.

    proc sql;
      select * from data4
      union
      select * from data5;

    proc sql;
      select *
        from data4 d4 left join data5 d5
        on d4.id = d5.person;

Creating a SAS Dataset from a Query
• Up until now we have just been creating queries, which basically do not store any information.
• To create a SAS dataset from a query use the “create table table-name as” statement.

PROC SQL SYNTAX

    proc sql;
      create table data2 as
        select * from data1;
    quit;

DATA STEP SYNTAX

    data data2;
      set data1;
    run;

Accessing Databases
• There are many different statements, depending on the operating environment you are working in, the type of database you’re connecting to, and how that database is set up.
• The basic statements you can use are the connect to and disconnect from statements.

    proc sql;
      connect to odbc …
      < SQL statements >
      disconnect from odbc … ;
    quit;
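Returning to the join and create table slides, the two ideas can be combined without any database connection. This is a minimal sketch, assuming hypothetical contents for data4 and data5 and a hypothetical output table named `joined`; the key is called id in one table and person in the other, as in the left join shown earlier.

```sas
data data4;
  input id x;
  datalines;
3 14
4 17
5 20
;
run;

data data5;
  input person y;
  datalines;
3 6
4 -5
;
run;

proc sql;
  /* Store the query result as a SAS dataset instead of printing it */
  create table joined as
    select d4.id, d4.x, d5.y
      from data4 d4 left join data5 d5
        on d4.id = d5.person;
quit;

proc print data=joined;
run;
```

Because this is a left join, all three rows of data4 are kept; the row with id 5 has no match in data5, so its y is missing.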