UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas

The "Data Step" in SAS

Recall that a SAS program is composed of four main parts:

    Comments/Notes
    Options
    The "Data Step"
    Procedures ("Procs")

This handout gives additional detail on the "Data Step" part of a SAS program.

Purpose of the Data Step

The Data Step has several purposes:

1. Opening SAS data sets (recall that Proc Import is not used with SAS data sets)
2. Saving SAS data sets (recall that Proc Export is not used with SAS data sets)
3. Select particular variables (columns) for analysis, and drop the rest
4. Select particular observations (rows) for analysis, and delete the rest
5. Create new variables and add them to the data set
6. Modify or change the values of existing variables

Notice that all of these purposes are related to preparing data for analysis. The purpose of the Data Step is to prepare the data for analysis by the Proc commands that will follow the Data Step.

Structure of the Data Step

The Data Step is composed of three parts:

1. The "data" command that creates a new dataset and gives it a name
2. Various data manipulation commands
3. The "run" command

TIP: Think of the initial "data" command and the final "run" command as "bookends" marking the beginning and the end of the Data Step.

Importing Data into SAS

Refer to the handout “Variables, Data Sets, Data Files, Importing and Exporting Data in SAS” to review how to import data into SAS. The remainder of this handout assumes that you have imported a dataset into SAS and named it “dataset01”!

A SAS Program Can Have More than One Active Dataset

There can be more than one active dataset in SAS’s brain. Each Data Step creates a new dataset inside SAS's brain. For example, suppose you have already imported data into SAS and created a dataset in SAS named “dataset01”.
Using the commands below, you can create another dataset named “dataset02”, and you can use the “set” command to bring a copy of the data from dataset01 into dataset02. So, at this point, you have two datasets, each with a copy of all the data.

    data dataset02;
    set dataset01;
    run;

Similarly, the commands below create a new dataset “dataset03” from dataset02.

    data dataset03;
    set dataset02;
    run;

Note: Looking ahead, when you have more than one active data set in SAS’s brain, you need to be careful to tell SAS which data set to work on when you issue a Proc command.

Selecting/Deleting Observations (Rows) in the Dataset Using the Data Step

Suppose you have a large dataset named dataset01 with many observations (rows), and you don't need all of them. There are two methods that you could use to tell SAS which rows to select for further analysis and which rows to delete.

First, you could create a new dataset, named, say, dataset02, and use the "firstobs=" and "obs=" options in a "set" command to bring only some of the rows from dataset01 into dataset02:

    data dataset02;
    set dataset01 (firstobs=10 obs=90);
    run;

In the example program above, SAS creates dataset02 and then begins to “set” a copy of dataset01 into dataset02. As SAS does so, it will only copy observations 10 through 90 of dataset01 into dataset02; the "firstobs" option specifies the first observation that you want to copy, and the "obs" option specifies the last observation you want to copy. You must use the parentheses. If you use the “set” command without the firstobs and obs options, then SAS will copy all of the data rows from dataset01 into dataset02.

The second method that you could use to select/delete data rows is to use "If" and "If ... then delete" commands inside the Data Step.
For example, if you only want to keep observations in which variable "a" is equal to 7, you could use the following program:

    data dataset02;
    set dataset01;
    if a=7;
    run;

In the example program above, SAS creates dataset02 and then begins to “set” a copy of dataset01 into dataset02. As SAS does so, it looks at each observation (row) and copies the observation from dataset01 to dataset02 only "if a=7". All other data rows are not copied. Implicitly, the "if a=7" is saying "if a=7, then keep and copy the observation," but you don't need to write the "...then keep the observation" part. If instead you want to keep rows where a=7 or b<5, you could use "if a=7 or b<5;" in place of "if a=7;" in the program above.

Now suppose, instead, that you want to delete rows in which variable x is greater than 5 (x>5). When you want to delete rows, you must actually write the "... then delete" part, for example:

    data dataset02;
    set dataset01;
    if x>5 then delete;
    run;

These commands work with text variables as well. For example, suppose you have a text variable called "studentname", and you want to drop rows of data related to the student named "slacker" from the data set, because he quit coming to class:

    data dataset02;
    set dataset01;
    if studentname='slacker' then delete;
    run;

Notice in the program above that you must put quotes around a value of a text variable (not the variable name, like studentname, but a variable value, like 'slacker'). This handout uses single quotes; SAS also accepts double quotes. Text comparisons are case-sensitive, so 'Slacker' would not match 'slacker'.

The statement below will delete any row of data where the value of the numeric variable d equals "." (missing). Commands like this are useful for dropping rows with missing data from the data set.

    if d=. then delete;

Keeping/Dropping Variables (Columns) in the Dataset Using the Data Step

Suppose you have a large dataset with many variables (columns). You may not need all of the variables in the data set. If you want SAS to drop all of the variables except for a few, you can use the "Keep" command inside the Data Step.
On the other hand, if you want SAS to keep most of the variables and drop only a few, you can use the "Drop" command inside the Data Step.

Example -- Keep Variables X, Q and R, and drop all other variables

    keep X Q R;

Example -- Drop Variables a, b and m, and keep all other variables

    drop a b m;

TIP: You can use either Drop or Keep; use whichever one requires less typing.

Creating New Variables in the Dataset Using the Data Step

The statement "q=50;" in the Data Step below creates a new variable (column of data) called "q," adds it to the dataset, and sets its value equal to 50 for every observation (row) of data. If you create a new variable, be sure to choose a name for the variable that is different from any other variable names that you are currently using in the program.

    data dataset02;
    set dataset01;
    q=50;
    run;

NOTE: Any commands that create or modify variables need to be placed inside a Data Step!!!

Similarly, we could place the statement below inside the Data Step to create a new variable (column of data) called "h," add it to the dataset, and set its value equal to variable "a" multiplied by variable "b".

    h=a*b;

In SAS, "**" means "raise to the power of." The statement below creates a new variable (column of data) called "g," adds it to the dataset, and sets its value equal to the value of variable "a" squared.

    g=a**2;

The statement below creates a new variable "i" equal to 4 plus the natural log of variable "a" (Note: in SAS, "log" means natural log.):

    i=4+log(a);

If you want the base 10 log instead, use the "log10" function, as shown below:

    i=4+log10(a);

TIP: See the handout titled "Mathematical and Logical Operators in SAS" for a list of additional math and logic operators available in SAS that can be used to create and modify variables.

Modifying Existing Variables in the Dataset Using the Data Step

REMEMBER: Any commands that create or modify variables need to be placed inside a Data Step!!!
You can replace the value of a variable with a new value by referring to the variable itself:

    data dataset02;
    set dataset01;
    if d=12 then d=20;
    run;

Above, if d=12, then the value of d will be changed from 12 to 20.

The "if/then" command below shows how to change the value of one variable based on the value of another variable. If d=12 on a row of data, then the value of variable c on that row is changed to 1000.

    if d=12 then c=1000;

The "if/then" command below shows how to change the value of a text variable. If text variable e is equal to 'wired', then it is changed to 'zippy'. Again, text variable values must be enclosed in quotes.

    if e='wired' then e='zippy';
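The commands covered in this handout can be combined in a single Data Step. The program below is only a sketch for practice, not part of the original handout examples. It assumes a tiny, made-up dataset01 typed in with "datalines" (the handout assumes dataset01 was imported instead), and it uses Proc Print simply to display the result. The variable names follow the handout; the data values are hypothetical.

```sas
/* Hypothetical practice data: names and values are made up for illustration */
data dataset01;
    input studentname $ a b d;
    datalines;
alice 7 3 12
bob 2 9 .
slacker 7 1 12
carol 5 4 8
;
run;

data dataset02;
    set dataset01;                          /* copy rows from dataset01 */
    if studentname='slacker' then delete;   /* delete rows for one student */
    if d=. then delete;                     /* delete rows where d is missing */
    h=a*b;                                  /* new variable: a times b */
    g=a**2;                                 /* new variable: a squared */
    if d=12 then d=20;                      /* modify an existing variable */
    keep studentname a d h g;               /* keep these columns, drop b */
run;

/* Tell SAS which dataset the Proc should work on */
proc print data=dataset02;
run;
```

With the made-up data above, dataset02 should contain two rows: alice (a=7, d=20, h=21, g=49) and carol (a=5, d=8, h=20, g=25). The row for bob is deleted because d is missing, and the row for slacker is deleted by name.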