SAS--The Data Step

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
The "Data Step" in SAS
Recall that a SAS program is composed of four main parts:




1. Comments/Notes
2. Options
3. The "Data Step"
4. Procedures ("Procs")
This handout gives additional detail on the "Data Step" part of a SAS program.
Purpose of the Data Step
The Data Step has several purposes:
1. Opening SAS data sets (recall that Proc Import is not used with SAS data sets)
2. Saving SAS data sets (recall that Proc Export is not used with SAS data sets)
3. Select particular variables (columns) for analysis, and drop the rest
4. Select particular observations (rows) for analysis, and delete the rest
5. Create new variables and add them to the data set
6. Modify or change the values of existing variables
Notice that all of these purposes are related to preparing data for analysis. The purpose of the
Data Step is to prepare the data for analysis by the Proc commands that will follow the Data
Step.
Structure of the Data Step
The Data Step is composed of three parts:
1. The "data" command that creates a new dataset and gives it a name
2. Various data manipulation commands
3. The "run" command
TIP: Think of the initial "data" command and the final "run" command as "bookends"
marking the beginning and the end of the Data Step.
Importing Data into SAS
Refer to the handout “Variables, Data Sets, Data Files, Importing and Exporting Data in
SAS” to review how to import data into SAS.
NOTE: The remainder of this handout assumes that you have already imported a dataset into
SAS and named it “dataset01”.
A SAS Program Can Have More than One Active Dataset
There can be more than one active dataset in SAS’s brain. Each Data Step creates a new dataset
inside SAS's brain. For example, suppose you have already imported data into SAS and created
a dataset in SAS named “dataset01”. Using the commands below, you can create another dataset
named “dataset02”, and you can use the “set” command to bring a copy of the data from
dataset01 into dataset02. So, at this point, you have two datasets, each with a copy of all the
data.
data dataset02;
set dataset01;
run;
Similarly, the commands below create a new dataset “dataset03” from dataset02.
data dataset03;
set dataset02;
run;
Note: Looking ahead, when you have more than one active data set in SAS’s brain, you need to
be careful to tell SAS which data set to work on when you issue a Proc command.
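As a minimal sketch of the note above, the program below shows one way to tell a Proc which dataset to use, via the "data=" option. The dataset names demo01 and demo02 and the values in them are made up for illustration, and the "input" and "datalines" statements are used only to build a small practice dataset (they are not covered in this handout).

```sas
/* Build a tiny made-up dataset (hypothetical name: demo01) */
data demo01;
    input a b;
    datalines;
1 10
2 20
3 30
;
run;

/* Create a second active dataset holding a copy of demo01 */
data demo02;
    set demo01;
run;

/* The data= option names the dataset that each Proc should work on */
proc print data=demo01;
run;

proc print data=demo02;
run;
```

If you leave out "data=", SAS uses the most recently created dataset, which may not be the one you intended. Naming the dataset explicitly avoids that surprise.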
Selecting/Deleting Observations (Rows) in the Dataset Using the Data Step
Suppose you have a large dataset named dataset01 with many observations (rows), and you don't
need all of them. There are two methods that you could use to tell SAS which rows to select for
further analysis and which rows to delete. First, you could create a new dataset, named, say,
dataset02, and use the "firstobs=" and "obs=" options in a "set" command to bring only some of
the rows from dataset01 into dataset02:
data dataset02;
set dataset01 (firstobs=10 obs=90);
run;
In the example program above, SAS creates dataset02 and then begins to “set” a copy of
dataset01 into dataset02. As SAS does so, it will only copy observations 10 through 90 of
dataset01 into dataset02; the "firstobs" option specifies the first observation that you want to
copy, and the "obs" option specifies the last observation you want to copy. You must use the
parentheses. If you use the “set” command without the firstobs and obs options, then SAS will
copy all of the data rows from dataset01 into dataset02.
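To see "firstobs=" and "obs=" at work without touching your own data, here is a small sketch under stated assumptions: demo01, demo02, and rownum are hypothetical names, and the "do ... output ... end" loop is used only to manufacture 100 numbered practice rows (it is not covered in this handout).

```sas
/* Build a made-up dataset with 100 rows, numbered 1 to 100 */
data demo01;
    do rownum = 1 to 100;
        output;          /* write one row per pass through the loop */
    end;
run;

/* Copy only observations 10 through 90 (81 rows in total) */
data demo02;
    set demo01 (firstobs=10 obs=90);
run;

/* Print the first 5 rows of demo02; rownum starts at 10, not 1 */
proc print data=demo02 (obs=5);
run;
```

Notice that "obs=90" means "stop at observation 90," not "copy 90 observations." That is why demo02 ends up with 81 rows.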
The second method that you could use to select/delete data rows is to use "If" and "If ... then
delete" commands inside the Data Step. For example, if you only want to keep observations in
which variable "a" is equal to 7, you could use the following program:
data dataset02;
set dataset01;
if a=7;
run;
In the example program above, SAS creates dataset02 and then begins to “set” a copy of
dataset01 into dataset02. As SAS does so, it looks at each observation (row) and copies the
observation from dataset01 to dataset02 only "if a=7". All other data rows are not copied.
Implicitly, the "if a=7" is saying "if a=7, then keep and copy the observation," but you don't need
to write the "...then keep the observation" part.
If instead you want to keep rows where a=7 or b<5 (or both), you could use "if a=7 or
b<5;" in place of "if a=7;" in the program above.
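The sketch below tries the "or" condition on four made-up rows so you can check the result by hand. The dataset names demo01 and demo02 and the data values are assumptions for illustration only, and "input"/"datalines" are used just to build the practice data.

```sas
/* Four made-up rows of practice data */
data demo01;
    input a b;
    datalines;
7 9
7 2
3 4
3 8
;
run;

/* Keep a row if a is 7, or b is below 5, or both */
data demo02;
    set demo01;
    if a=7 or b<5;
run;

/* Rows (7,9), (7,2) and (3,4) are kept; row (3,8) fails both tests */
proc print data=demo02;
run;
```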
Now suppose, instead, that you want to delete rows in which variable x is greater than 5 (x>5).
When you want to delete rows, you must actually write the ". . . then delete" part, for example:
data dataset02;
set dataset01;
if x>5 then delete;
run;
These commands work with text variables as well. For example, suppose you have a text
variable called "studentname", and you want to drop rows of data related to the student named
"slacker" from the data set, because he quit coming to class:
data dataset02;
set dataset01;
if studentname='slacker' then delete;
run;
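Notice in the program above that you must put quotes around a value of a text variable
(not the variable name, like studentname, but a variable value, like 'slacker'). This handout
uses single quotes; SAS also accepts double quotes. The comparison is case-sensitive, so
'slacker' and 'Slacker' are treated as different values.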
The statement below will delete any row of data where the value of numeric variable d is missing,
which SAS writes as "." (a period). Commands like this are useful for dropping rows with
missing data from the data set.
if d=. then delete;
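Here is a short sketch that combines the two kinds of "then delete" commands described above in one Data Step. The dataset names demo01 and demo02 and the four rows of data are made up for illustration, and "input"/"datalines" are used only to build the practice data (the "$" marks studentname as a text variable).

```sas
/* Made-up practice data: one text variable and one numeric variable */
data demo01;
    input studentname $ d;
    datalines;
alice 4
slacker 7
bob .
carol 9
;
run;

data demo02;
    set demo01;
    if studentname='slacker' then delete;   /* drop one student by name */
    if d=. then delete;                     /* drop rows where d is missing */
run;

/* Only the rows for alice and carol remain */
proc print data=demo02;
run;
```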
Keeping/Dropping Variables (Columns) in the Dataset Using the Data Step
Suppose you have a large dataset with many variables (columns). You may not need all of the
variables in the data set. If you want SAS to drop all of the variables except for a few, you can
use the "Keep" command inside the Data Step. On the other hand, if you want SAS to keep most
of the variables and drop only a few, you can use the "Drop" command inside the Data Step.
Example -- Keep Variables X, Q and R, and drop all other variables
keep X Q R;
Example -- Drop Variables a, b and m, and keep all other variables
drop a b m;
TIP: You can use either Drop or Keep; use the one that results in the least typing for you.
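The Keep and Drop examples above show only the single statement, so the sketch below places each one inside a complete Data Step. The dataset names demo01, demo02, and demo03 and the data values are assumptions made for illustration.

```sas
/* Made-up practice data with six variables */
data demo01;
    input X Q R a b m;
    datalines;
1 2 3 4 5 6
7 8 9 10 11 12
;
run;

/* Version 1: name the variables to keep */
data demo02;
    set demo01;
    keep X Q R;
run;

/* Version 2: name the variables to drop */
data demo03;
    set demo01;
    drop a b m;
run;
```

In this example demo02 and demo03 end up with the same three variables (X, Q and R), which is why you can choose whichever command needs less typing.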
Creating New Variables in the Dataset Using the Data Step
The statement "q=50;" in the Data Step below creates a new variable (column of data) called "q,"
adds it to the dataset, and sets its value equal to 50 for every observation (row) of data. If you
create a new variable, be sure to choose a name for the variable that is different from any other
variable names that you are currently using in the program.
data dataset02;
set dataset01;
q=50;
run;
NOTE: Any commands that create or modify variables need to be placed inside a Data Step!!!
Similarly, we could place the statement below inside the data step to create a new variable
(column of data) called "h," add it to the dataset, and set its value equal to variable "a" multiplied
by variable "b".
h=a*b;
In SAS, "**" means "raise to the power of." The statement below creates a new variable
(column of data) called "g," adds it to the dataset, and sets its value equal to the value of variable
"a" squared.
g=a**2;
The statement below creates a new variable "i" equal to 4 plus the natural log of variable "a"
(Note: in SAS, "log" means natural log.):
i=4+log(a);
If you want the base 10 log instead, use the "log10" function, as shown below:
i=4+log10(a);
TIP: See the handout titled "Mathematical and Logical Operators in SAS" for a list of
additional math and logic operators available in SAS that can be used to create and
modify variables.
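The statements in this section can all sit together in one Data Step. The sketch below does that on three made-up rows; demo01 and demo02 are hypothetical dataset names, and the variable name "j" is introduced here only so that the base 10 result does not overwrite "i".

```sas
/* Made-up practice data */
data demo01;
    input a b;
    datalines;
1 5
10 2
100 3
;
run;

data demo02;
    set demo01;
    q = 50;              /* same constant on every row */
    h = a*b;             /* product of two existing variables */
    g = a**2;            /* a squared */
    i = 4 + log(a);      /* 4 plus the natural log of a */
    j = 4 + log10(a);    /* 4 plus the base 10 log of a; equals 6 when a=100 */
run;

proc print data=demo02;
run;
```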
Modifying Existing Variables in the Dataset Using the Data Step
REMEMBER: Any commands that create or modify variables need to be placed inside a Data
Step!!!
You can replace the value of a variable with a new value by referring to the variable itself:
data dataset02;
set dataset01;
if d=12 then d=20;
run;
Above, if d=12, then the value of d will be changed from 12 to 20.
The "if/then" command below shows how to change the value of one variable based on the value
of another variable. If d = 12 on a row of data, then the value of variable c on that row is
changed to 1000.
if d=12 then c=1000;
The "if/then" command below shows how to change the value of a text variable. If text variable
e is equal to 'wired', then it is changed to 'zippy'. Again, text variable values must be enclosed
in quotes.
if e='wired' then e='zippy';
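As a sketch of how these "if/then" commands behave when they are used together, the program below runs all three on made-up data. The dataset names demo01 and demo02 and the data values are assumptions for illustration.

```sas
/* Made-up practice data: two numeric variables and one text variable */
data demo01;
    input d c e $;
    datalines;
12 1 wired
5 2 tired
12 3 calm
;
run;

data demo02;
    set demo01;
    if d=12 then c=1000;           /* test d while it still holds its original value */
    if d=12 then d=20;             /* only now change d itself */
    if e='wired' then e='zippy';
run;

/* Result: (20, 1000, zippy), (5, 2, tired), (20, 1000, calm) */
proc print data=demo02;
run;
```

SAS carries out the statements in order, from top to bottom, on each row. If the "d=20" line came first, d would no longer equal 12 when the "c=1000" line was reached, and c would never change. Also note that 'zippy' has the same number of characters as 'wired'; a longer replacement value could be cut off to fit the text variable's stored length.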