MFE SAS Workshop

Haas MFE SAS Workshop
Lecture 2: Data Management
Alex Vedrashko
For sample code and these slides, see Peng Liu’s page
http://faculty.haas.berkeley.edu/peliu/computing
Creating datasets (recap from L1)

The ultimate goal: save data to disk as a permanent SAS dataset.

Read from a SAS library file (e.g. downloaded from CRSP).
libname mylib 'r:\temp\';
data d; set mylib.crspsample; run;
Read using INFILE and INPUT from an external file. Example from Lecture 1:
DATA LOAN1;
INFILE 'R:\bulk\SAS\MFE\loan.txt' DELIMITER=',';
/* the : modifier reads each delimited field with the given informat */
INPUT ID Origination : mmddyy10. Term Rate Balance
Appraisal LTV FICO_orig City $ State : $2. ;
RUN;


Use SAS menu “File – Import Data”. Very flexible.

Read using INPUT and DATALINES (CARDS).
DATA portf; INPUT portfolioreturn @@; /* @@ keeps reading values from the same line */
datalines;
15.9 -2.1 0.3
;
run;
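The goal stated at the top of this slide, saving a permanent dataset, could be sketched as follows. The library path reuses r:\temp\ from the libname example above; the dataset name portf_saved is hypothetical, and the folder is assumed to exist and be writable.

```sas
/* Sketch: read inline data into WORK, then save a permanent copy. */
libname mylib 'r:\temp\';      /* assumed folder, as in the libname example */

data portf;
  input portfolioreturn @@;    /* @@ holds the line so all three values are read */
datalines;
15.9 -2.1 0.3
;
run;

data mylib.portf_saved;        /* two-level name: stored on disk as portf_saved.sas7bdat */
  set portf;
run;
```

The one-level name portf lives in the temporary WORK library; only the two-level name mylib.portf_saved survives after SAS is closed.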
Viewing Datasets

• Browse the saved SAS dataset (file extension sas7bdat): double-click the file in Windows Explorer, or open it in SAS Explorer by clicking the Libraries icon, then your library name.
• Datasets are automatically assigned to the WORK library if no libname is given, e.g. dataset d1 is actually WORK.d1 in
data d1; set mylib.d0; … run;
• The WORK library is temporary: all datasets in it disappear when you close SAS.
Viewing Datasets: PUT statement
1. Proc Print; (see Lecture 1)
2. PUT statement in the DATA step. Syntax: PUT variable-names; Writes to the LOG window. Useful for debugging and simple output to text files.
data d; set loan1;
put origination term city;
run;
• Show variable names: put origination= term= city=;

Output to a file, rather than LOG window.
filename f "r:\mysasoutput.txt";
data d; set loan1;
file f; put origination term city;
run;
• Automatic SAS variables:
_N_ (the DATA step iteration counter, usually the observation number), _ALL_ (stands for "all variables")
put _n_ city state;
FORMAT statement
• FORMAT is an instruction that tells SAS how to write variable values.
• Specifying a format is usually necessary to make date variables readable.
• Permanently associate a format with a variable in a given dataset:
data d; set loan1;
format Origination mmddyy8.; put Origination rate;
proc print; run;
• The PUT statement and Proc Print automatically use this format.
• Temporarily: you can also specify the format in Proc Print:
proc print;
format Origination mmddyy8.;
run;


SAS stores dates as the number of days from Jan. 1, 1960.
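Because dates are stored as day counts, the same value looks very different with and without a format. A minimal sketch of this (the variable names d0 and d1 are illustrative and not part of the lecture data):

```sas
/* Sketch: a SAS date is the number of days since 01JAN1960. */
data _null_;
  d0 = '01JAN1960'd;             /* date literal for day zero */
  d1 = '02JAN1960'd;             /* one day later */
  put d0= d1=;                   /* unformatted: writes d0=0 d1=1 to the LOG */
  put d0= mmddyy10. d1= date9.;  /* the same values written with date formats */
run;
```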
Data Type Conversion: the issue




SAS has only two data types: numeric and character. Dates and times are stored as numeric values.
When you accidentally mix variable types, SAS tries to fix your program by converting them.
Log file: "NOTE: Numeric values have been converted to character values" Do not ignore this!
For example, 110 can be stored as numeric or as character. When you apply a numeric function to a character variable (or vice versa), SAS first tries to convert the value to the appropriate type and then evaluates the function.
How to fix? A practical way is to use the INPUT/PUT functions.

These are close cousins of the INPUT/PUT statements, but different!
Data Type Conversion: Solution

Numeric to Character:
new = PUT(old, format);
The format must match the type you are converting from (numeric).
Rate_chr = put(rate, 5.2);
To verify, apply a character function:
Digit1 = substr(Rate_chr, 1, 1);

Character to Numeric:
new = INPUT(old, informat);
The informat must match the type you are converting to (numeric or SAS date).
Rate_num = input(Rate_chr, 5.);
To verify, apply a numeric function:
Intgr_r = floor(Rate_num);
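The two conversions can be tried together in one short DATA step. This is only a sketch: the value 12.5 is an illustrative rate, and the dataset name conv is hypothetical.

```sas
/* Sketch: round trip numeric -> character -> numeric with PUT and INPUT. */
data conv;
  rate = 12.5;                        /* illustrative numeric value */
  Rate_chr = put(rate, 5.2);          /* numeric -> character: '12.50' */
  Rate_num = input(Rate_chr, 5.);     /* character -> numeric: 12.5 */
  Digit1   = substr(Rate_chr, 1, 1);  /* character function on the character copy: '1' */
  Intgr_r  = floor(Rate_num);         /* numeric function on the numeric copy: 12 */
  put Rate_chr= Rate_num= Digit1= Intgr_r=;
run;
```

Since both conversions are explicit, the LOG should not show the automatic conversion note for this step.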
Titles and Footnotes

SAS allows up to ten lines of text at the top (titles) and bottom
(footnotes) of each page of output, specified with title and
footnote statements. The form of these statements is
title<n> text; or footnote<n> text;
where n, if specified, can range from 1 to 10, and text is enclosed in single or double quotes (unquoted text is also accepted).
Title 'Mortgage dataset';
Proc print; run;

If text is omitted, the title or footnote is deleted; otherwise it remains in
effect until it is redefined. Thus, to have no titles, use:
title;
• By default SAS includes the date and page number at the top of each piece of output. These can be suppressed with the nodate and nonumber system options.
System Options for Output Control

Syntax: options opt;
Useful options to control how SAS output (the OUTPUT window) looks:
Date/nodate (shows current date)
Number/nonumber (shows page number)
Center/nocenter (centers output; useful for proc means, etc.)
• formdlim = '-'; (defines the delimiter between pages. Results in more readable output of econometric procs)

See all available options (in the LOG window):
proc options;
IF-THEN-ELSE
The DATA step is where all variable assignment takes place
• Sometimes you will want to make an assignment conditional by using an IF-THEN-ELSE statement






IF condition THEN action;
ELSE IF condition THEN action;
…
ELSE action;
Example:
data p1; set portf;
if portfolioreturn>10 then promotion=1;
else if portfolioreturn<0 then promotion=-1; /*fired*/
else promotion=0;
run;


The ELSE statements are optional
With the above syntax you can only assign a single action with
each statement
IF-THEN-DO-END

Use a DO-END block inside an IF-THEN statement to perform
multiple actions when the condition holds





IF condition THEN DO;
action;
action;
END;
Examples:
if portfolioreturn>10 then do; promotion=1;
bonus=50000+10000*sqrt(portfolioreturn); end;
else if portfolioreturn<0 then do; promotion=-1; bonus=10000; end;
else do; promotion=0; bonus=50000; end;

Conditions can be specified with symbols or mnemonics:
Symbol     Mnemonic
=          EQ
^= , ~=    NE
&          AND
| , !      OR
>          GT
>=         GE
<          LT
<=         LE
Logical Conditions

Other useful conditions can be set by the following:
IF var1 IN(val1, val2, val3 …) THEN …;
if state in ('OR', 'WA', 'CA') then
region='Pacific';

WHERE var1 BETWEEN val1 AND val2; (BETWEEN-AND is valid only in WHERE expressions, not in IF statements)
where GPA between 3.7 and 4;
In an IF statement, use: if 3.7<=GPA<=4 then letterGPA='A';


Conditions can contain functions, numeric and
character variables, constants, and mathematical
expressions
if rate**2 > 25 then highsqrate=1;
Statements and Options That Control
Reading and Writing
Task                  Statements      Data set options   System options
Manage variables      DROP            DROP=
                      KEEP            KEEP=
                      RENAME          RENAME=
Manage observations   WHERE           WHERE=
                      subsetting IF   FIRSTOBS=          FIRSTOBS=
                      DELETE          OBS=               OBS=
                      OUTPUT
Manage variables:
KEEP, DROP statements

DROP list-of-variables
tells which variables from the input dataset should
NOT be included in the output dataset.
data d1; set d; drop i j temp_variable;

KEEP list-of-variables
tells which variables from the input dataset should be
included in the output dataset (the other variables are
dropped).
data d1; set d; keep rate balance;
Manage Observations:
Subsetting IF statement. DELETE statement

A special case of the IF-THEN statement is an IF statement
without a ‘then’ action, i.e. IF condition;
data d1; set d; if Origination>'01Jan2002'd;
rte=rate/100; run;


If the condition is true, then SAS continues with the DATA
statements for this observation.
Otherwise no further statements are processed for that observation,
and the observation is not added to the data set.
To delete certain observations (the opposite of the subsetting
IF statement) use:

IF condition THEN DELETE;
data d1; set d; if Origination>'01Jan2002'd
then delete;
Subsetting IF: dealing with
missing observations

If you want to keep only non-missing observations:
if portfolioreturn^=.; leaves only observations
where portfolioreturn is not missing. Note that the shorter
if portfolioreturn; is not equivalent: it also drops observations where portfolioreturn is 0.

Note that a missing value in SAS is considered to be
smaller than all numeric or character values.
Thus, if portfolioreturn<0; includes
observations with missing returns!
To avoid “firing” traders with missing return records, add:
if portfolioreturn=. then promotion=.;
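One way to put the missing-value check together with the earlier promotion logic is to test for missing first, so that a missing return never reaches the "<0" branch. This is a sketch only; the inline data values are illustrative.

```sas
/* Sketch: handle missing returns before the numeric comparisons. */
data p1;
  input portfolioreturn @@;
  if missing(portfolioreturn) then promotion = .;   /* no record: no decision */
  else if portfolioreturn > 10 then promotion = 1;
  else if portfolioreturn < 0 then promotion = -1;  /* fired */
  else promotion = 0;
datalines;
15.9 -2.1 . 0.3
;
run;

proc print data=p1; run;
```

Without the first test, the third observation would satisfy portfolioreturn<0 and be assigned promotion=-1.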
Manage Observations:
WHERE statement

Alternative to IF statement for sub-setting data
WHERE condition;
data d1; set loan1; where year(Origination)>2002;


Differences between IF and WHERE: http://support.sas.com/faq/042/FAQ04278.html
WHERE can be used in both DATA and PROC steps. IF is only for
DATA steps
proc print data=loan1; where year(Origination)>2002;

Can use WHERE with CONTAINS operator:
data d1; set loan1; where city contains 'SANTA';
• WHERE cannot be used to subset data read with INPUT statements. It only controls data that comes from existing SAS data sets via SET or MERGE. (wrong use: data d; input a; where a>0;)
• WHERE cannot be applied to new variables created in the current DATA step; IF can. (wrong: data d1; set loan1; where Rate_num>3; Use: if Rate_num>3;)

Manage Observations:
Data Set Options

WHERE = condition
data d1; set d (where=
(2002<=year(Origination)<2005));
• KEEP = variable list, DROP = variable list
data d1; set d (keep= origination rate);
data d1 (drop=x y);
infile f; input x y z; ... run;
• RENAME = (oldvar = newvar) tells SAS to rename certain variables.
data d1;
set LOAN1 (rename= (origination=issued));
Dataset options (contd.)

• Start reading from observation # n. Syntax: FIRSTOBS = n
• Stop reading at observation # n. Syntax: OBS = n
Data d1; set d (firstobs=5 obs=20); … run;
In procedures:
proc means data=d (firstobs=5 obs=20);
run;

Here Proc Means analyzes only observations
5 through 20 of the data set d.
Concatenating Two Data Sets




• Concatenating the data sets appends the observations from one data set to another data set.
• The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2.
• Data set COMBINED contains the results of the concatenation.
• Note that the data sets are processed in the order in which they are listed in the SET statement.
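In code, concatenation is a SET statement that lists both data sets. The sketch below uses the names DATA1, DATA2 and COMBINED from the slide; the variables year and x and their values are hypothetical.

```sas
/* Sketch: concatenate two small data sets. */
data data1;
  input year x;
datalines;
2001 10
2003 30
;
run;

data data2;
  input year x;
datalines;
2002 20
2004 40
;
run;

data combined;
  set data1 data2;   /* all rows of DATA1 first, then all rows of DATA2 */
run;                 /* year order in COMBINED: 2001, 2003, 2002, 2004 */
```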
Interleaving Two Data Sets

• The datasets must be sorted by the values of the variables listed in the BY statement.
• Similar to concatenating, but preserves the sorting order.
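Interleaving is the same SET statement followed by a BY statement. A minimal sketch, with hypothetical variables year and x and illustrative values:

```sas
/* Sketch: interleave two data sets that are each sorted by year. */
data data1;
  input year x;
datalines;
2001 10
2003 30
;
run;

data data2;
  input year x;
datalines;
2002 20
2004 40
;
run;

proc sort data=data1; by year; run;   /* already sorted here; shown for completeness */
proc sort data=data2; by year; run;

data combined;
  set data1 data2;
  by year;           /* rows come out in year order: 2001, 2002, 2003, 2004 */
run;
```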
One-to-One Reading and One-to-One Merging
(use this method with caution)

• One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set.
• The first observation in one data set is combined with the first in the other, and so on.
• The DATA step stops after it has read the last observation from the smallest data set.
• One-to-one merging is similar to one-to-one reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets.
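The difference between the two methods shows up when the inputs have different lengths. In this sketch the data set and variable names (data1, data2, x, y, read_1to1, merge_1to1) are hypothetical.

```sas
/* Sketch: one-to-one reading vs. one-to-one merging. */
data data1;
  input x;
datalines;
1
2
3
;
run;

data data2;
  input y;
datalines;
10
20
;
run;

data read_1to1;
  set data1;
  set data2;          /* step stops when the shorter data set runs out: 2 observations */
run;

data merge_1to1;
  merge data1 data2;  /* no BY statement: 3 observations, y is missing in the third */
run;
```

Neither method checks that the paired rows actually belong together, which is why the slide advises caution.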
Match-Merging
(most common data set manipulation)

Match-merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables.
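A match-merge is a MERGE statement followed by a BY statement on the common variable. A minimal sketch with hypothetical data sets and variables (data1 with sales, data2 with costs, matched on year):

```sas
/* Sketch: match-merge two data sets on a common key. */
data data1;
  input year sales;
datalines;
2001 100
2002 120
2003 140
;
run;

data data2;
  input year costs;
datalines;
2002 70
2003 80
2004 90
;
run;

proc sort data=data1; by year; run;
proc sort data=data2; by year; run;

data combined;
  merge data1 data2;
  by year;   /* one row per year 2001-2004; sales or costs is missing where a year has no match */
run;
```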
Updating
• Input data sets must be
sorted by the values of the
variables listed in the BY
statement. (In this example,
MASTER and
TRANSACTION are both
sorted by Year.)
• UPDATE replaces an
existing file with a new file
• UPDATE does not replace
nonmissing values in a
master data set with
missing values from a
transaction data set.
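The missing-value rule is easiest to see on a tiny example. The data set names MASTER and TRANSACTION and the BY variable Year follow the slide; the variables sales and costs and all values are hypothetical.

```sas
/* Sketch: UPDATE applies transactions to a master data set. */
data master;
  input year sales costs;
datalines;
2001 100 60
2002 120 70
;
run;

data transaction;
  input year sales costs;
datalines;
2002 125 .
2003 140 80
;
run;

data master;
  update master transaction;
  by year;
run;
/* 2002 becomes sales=125, costs=70: the missing costs does not overwrite 70. */
/* 2003 is added as a new observation; 2001 is unchanged. */
```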
Merging datasets
1. Sort the datasets according to the variable list in BY.
2. Use the MERGE statement inside a DATA step.
proc sort data=d1; by var_list;
proc sort data=d2; by var_list;
DATA newdata; MERGE d1 d2 …; BY var_list;
• The input data sets specified in MERGE will not be modified.
• Values of any common variables not specified in the BY statement are overwritten in the new data set (typically the value from the data set listed last in MERGE is kept). To prevent this, use the RENAME data set option.
• Example. Dataset d1 has variable ret containing market returns, and d2 has variable ret containing individual stock returns. Merge the datasets by tradedate.
• We rename ret in d1 to mktret:
data newd;
merge d1 (rename= (ret=mktret)) d2; by tradedate;

Merging datasets.
IN= Data Set Option


• The IN= option allows the user to omit observations that are not common to all data sets.
• It creates a temporary variable that tracks whether that data set contributed to the current observation:
IN = index_var_name
data d; merge d1 (in=indicator1) d2; by tradedate;
if indicator1;
• The indicator is 1 if the data set contributed and 0 otherwise.
• Use the IF statement on the index variables. In the above example, only observations found in d1 will be included in d.
• The indicator variable is not written to the new data set. To include it in d, assign its value to a standard variable, e.g. ind1=indicator1;
Example of IN= option.
Merging by ID variable.
Dataset d1 (in=a)    Dataset d2 (in=b)
ID   V1              ID   V2
2    421             1    343
3    129             2    85
4    122             4    763
6    534             5    229
7    343             6    554
8    324             8    895
If a;
(i.e. observation must be in dataset d1)
Dataset d1 (in=a)    Dataset d2 (in=b)
ID   V1              ID   V2
                     1    343
2    421             2    85
3    129             .
4    122             4    763
                     5    229
6    534             6    554
7    343             .
8    324             8    895
If b;
(i.e. observation must be in dataset d2)
Dataset d1 (in=a)    Dataset d2 (in=b)
ID   V1              ID   V2
.                    1    343
2    421             2    85
3    129
4    122             4    763
.                    5    229
6    534             6    554
7    343
8    324             8    895
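The three slides above can be reproduced with the same ID, V1 and V2 values. This is a sketch; the output data set names in_a, in_b and in_both are hypothetical.

```sas
/* Sketch: the IN= example from the slides, run on the same values. */
data d1;
  input ID V1;
datalines;
2 421
3 129
4 122
6 534
7 343
8 324
;
run;

data d2;
  input ID V2;
datalines;
1 343
2 85
4 763
5 229
6 554
8 895
;
run;

data in_a in_b in_both;
  merge d1 (in=a) d2 (in=b);
  by ID;
  if a then output in_a;           /* IDs 2 3 4 6 7 8; V2 is missing for 3 and 7 */
  if b then output in_b;           /* IDs 1 2 4 5 6 8; V1 is missing for 1 and 5 */
  if a and b then output in_both;  /* IDs 2 4 6 8 */
run;
```

Writing all three subsets in one step relies on explicit OUTPUT statements, so no observation is written unless one of the conditions holds.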
Preview of Lecture 3:

Procedures for dataset manipulation:
• PROC APPEND adds the observations from one SAS data set to the end of another SAS data set.
• PROC SQL reads observations from up to 32 SAS data sets and joins them into single observations; manipulates observations in a SAS data set in place; easily produces a Cartesian product.

OUTPUT command

The OUTPUT statement is used in the DATA step to write the current values of all variables to a data set. There is an implicit OUTPUT statement at the end of each DATA step iteration (unless an OUTPUT statement appears somewhere in the DATA step).
• The following pieces of code are equivalent:
data d; input r1-r9; cards; ...
and
data d; input r1-r9; output; cards; ...



The OUTPUT statement is commonly used to create several SAS data sets in a single
datastep. Specify the dataset name after OUTPUT.
Example: Split the mortgage data into separate datasets for each state.
data ca wa;
set loan1;
if state='WA' then OUTPUT wa;
if state='CA' then OUTPUT ca;
proc print data=wa; proc print data=ca; run;
Once an OUTPUT statement is specified, the implied OUTPUT at the end of the DATA
step no longer exists and all observation writing must be specified by the user.
DO loop
Example: The input data is a line of four quarters of earnings for 100
firms. Read the data, indexing each observation by quarter.
data earnings;
input ticker $ @;
do quarter=1 to 4;
input earn @;
output;
end;
Datalines;
ibm 10.2 15 12 8
msft 25.1 27 29.4 35
;
run;

Other examples:
do state='CA','OR'; ... end;
do weekdays=1,3,5; ... end;
Output:
Obs  ticker  quarter  earn
1    ibm     1        10.2
2    ibm     2        15.0
3    ibm     3        12.0
4    ibm     4         8.0
5    msft    1        25.1
6    msft    2        27.0
7    msft    3        29.4
8    msft    4        35.0
Variable Arrays

Arrays are used mainly to group variables



Useful for performing the same calculations on a group of
variables or searching through a set of variables.
For example, your balance sheet data variables d_1 … d_150
are in millions, and you need to make them in 100s of millions.
Arrays are defined using the ARRAY statement in a DATA step.
Syntax: ARRAY name (n) variable_list;
ARRAY all_vars var1-var10;
• n is the number of elements in the array and is optional.
• The variable list is also optional, but either n or the variable list must be specified.
• The variable list can contain variables that have not yet been created (an option for initializing variable values).
• A $ should precede a variable list of character variables.

Arrays (cont.)

In the calculation section of a DATA step the array can
be referenced by name(i) where i is the position of the
element you wish to refer to

Since parentheses are also used in functions, it is not a good
idea to give your array the same name as a SAS function

Example.
DATA d1; input var1-var10;
ARRAY all_vars var1-var10;
DO i = 1 to 10;
all_vars(i) = i/100;
END;
RUN;
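The balance sheet rescaling mentioned earlier (d_1 … d_150 from millions to hundreds of millions) could be sketched as follows. To keep the example short it uses only three variables; the data set names balance and scaled, the array name d, and the values are hypothetical.

```sas
/* Sketch: apply the same rescaling to every variable in a group. */
data balance;               /* stand-in with 3 variables instead of 150 */
  input d_1-d_3;
datalines;
250 1200 40
;
run;

data scaled;
  set balance;
  array d(3) d_1-d_3;       /* with the full data: array d(150) d_1-d_150; */
  do i = 1 to dim(d);       /* dim() returns the number of array elements */
    d(i) = d(i) / 100;      /* millions -> hundreds of millions */
  end;
  drop i;                   /* the loop counter is not needed in the output */
run;
```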
Controlling the Built-in Data Loop:
RETAIN Statement
• The built-in loop stores the data for a given observation for the current run of the DATA step.
• When the loop reaches the end of the DATA step and returns to the top to read the next observation, variables created by INPUT or assignment statements are reset to missing.
• To force the built-in loop to keep values from previous observations, use the RETAIN statement: RETAIN variable-list;
• The variables specified in the RETAIN statement keep their values until they are reset by an INPUT or assignment statement.
• Example: Calculate the highest mortgage balance to date.
proc sort data=loan1; by origination;
data d1; set loan1;
retain maxbal;
maxbal=max(maxbal, balance);
run;

Controlling the Built-in Data Loop:
SUM statement
• A special case is the sum statement, which uses a plus sign but no equal sign, e.g. cumsum + newvar;
• The sum statement implicitly retains the previous value of cumsum and adds newvar to it. cumsum starts at 0, and missing values of newvar are ignored.
• Example. Calculate the growth of the total appraised value of houses in the dataset to date.
proc sort data=loan1; by origination;
data d1; set loan1;
totalvalue + appraisal;
run;

This is equivalent to
retain totalvalue 0; /*initialize to 0*/
totalvalue = sum(totalvalue, appraisal);

LAG Function



• In general, SAS does not make it convenient to access other observations directly, e.g. those for particular dates.
• If you want to do serious time series analysis, you should use the procedures in the SAS/ETS package.
• The LAG function is used to reference previous values of a variable:
newvar = LAG(variable); or newvar = LAGn(variable);
where n refers to the number of observations to go back.
• Example (quarterly earnings):
lag2_earn = LAG2(earn);
If we are in observation 100, for example, this statement will assign earn from observation 98 to the variable lag2_earn in observation 100. Similarly, the value of lag2_earn in observation 98 will be the value of earn in observation 96.
LAG Function (contd.)







• The order of observations is determined by the current sort.
• BY does not work with lags, so you need to do manual checks to prevent nonsensical lags when dealing with panel data. (This is the issue with the earnings example.)
• Lags are tricky to use because of the built-in loop. Sometimes the lag value is not available (missing).
• The lag queue is not initialized until the LAG function is called.
• Similarly, the lag queue is not updated until the LAG function is called.
• Hints: Use separate DATA steps to create lags and levels. Do not use the LAG function in a loop.
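One possible form of the manual check for panel data is to call LAG on every observation and then blank out the lag at the first row of each firm. The LAG function itself ignores BY groups; the BY statement below is used only to obtain the FIRST.ticker flag. The sketch reuses the earnings data from the DO loop slide; the names earn_lag and lag1_earn are hypothetical.

```sas
/* Sketch: one-quarter lag of earnings that does not cross firms. */
data earnings;
  input ticker $ @;
  do quarter = 1 to 4;
    input earn @;
    output;
  end;
datalines;
ibm 10.2 15 12 8
msft 25.1 27 29.4 35
;
run;

proc sort data=earnings; by ticker quarter; run;

data earn_lag;
  set earnings;
  by ticker;
  lag1_earn = lag(earn);               /* called unconditionally, so the queue is updated on every row */
  if first.ticker then lag1_earn = .;  /* otherwise msft quarter 1 would receive ibm quarter 4 */
run;
```

Calling LAG inside the IF condition instead would update the queue only on some rows, which is the pitfall the hints above warn about.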
Lecture 2 References
• SAS OnlineDoc > "Base SAS", "SAS Language Reference: Dictionary" > "Data step options"
• Manuals in pdf: http://www.math.wpi.edu/saspdf/common/mainpdf.htm ("Base SAS" section)
• SAS User Group International "Beginning tutorials": http://www.lexjansen.com/cgi-bin/sugi.php?x=sbt&s=sugi_s
• Merging datasets: http://support.sas.com/techsup/technote/ts644.html