Haas MFE SAS Workshop
Lecture 2: Data Management
Alex Vedrashko
For sample code and these slides, see Peng Liu’s page http://faculty.haas.berkeley.edu/peliu/computing

Creating datasets (recap from L1)
The ultimate goal: save data to disk as a permanent SAS dataset.
Read from a SAS library file (e.g. downloaded from CRSP).
    libname mylib 'r:\temp\';
    data d; set mylib.crspsample; run;
Read using INFILE and INPUT from an external file. Example from Lect 1:
    DATA LOAN1;
    INFILE 'R:\bulk\SAS\MFE\loan.txt' DELIMITER=',';
    INPUT ID Origination mmddyy10. Term Rate Balance Appraisal LTV FICO_orig City $ State $2. ;
Use SAS menu “File – Import Data”. Very flexible.
Read using INPUT and DATALINES (CARDS). The double trailing @@ holds the line so that several observations can be read from it.
    DATA portf;
    INPUT portfolioreturn @@;
    datalines;
    15.9 -2.1 0.3
    ;

Viewing Datasets
Browse the saved SAS dataset with extension sas7bdat. Double click on the file in Windows Explorer or use
Browse the saved SAS dataset in SAS Explorer. Click on the Libraries icon, then on your library name.
Datasets are automatically assigned to the WORK library if no libname is given, e.g. dataset d1 is actually WORK.d1 in
    data d1; set mylib.d0; … run;
The WORK library is temporary: all datasets in it disappear when you close SAS.

Viewing Datasets: PUT statement
1. Proc Print; (see lec.1)
2. PUT statement in the DATA step. Syntax: PUT variable names;
Writes to the LOG window. Useful for debugging and simple output to text files.
    data d; set loan1;
    put origination term city;
    run;
Show variable names:
    put origination= term= city;
Output to a file, rather than the LOG window:
    filename f "r:\mysasoutput.txt";
    data d; set loan1;
    file f;
    put origination term city;
    run;
Preset SAS variables: _N_ (the number of the current DATA step iteration, usually the observation number), _all_ (stands for “all variables”)
    put _n_ city state;

FORMAT statement
FORMAT is an instruction that tells SAS how to write variable values.
Specifying a format is usually necessary to make date variables readable.
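To see why a format matters for dates, the following minimal sketch writes the same date value to the log twice, once without and once with a format. The date constant is made up for illustration; only the variable name origination comes from the loan example.

```sas
data _null_;                                   /* _null_: run the step without creating a dataset */
  origination = '15MAR2004'd;                  /* illustrative date constant, stored as a day count */
  put 'Unformatted: ' origination;             /* writes the raw number of days since 01JAN1960 */
  put 'Formatted:   ' origination mmddyy10.;   /* writes 03/15/2004 */
run;
```

The first PUT shows the internal number that SAS stores; the second shows the same value through the mmddyy10. format.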
Permanently associate a format with a variable in a given dataset:
    data d; set loan1;
    format Origination mmddyy8.;
    put Origination rate;
    proc print; run;
The PUT statement and Proc Print automatically use this format.
Temporarily: you can also specify a format in Proc Print
    proc print; format Origination mmddyy8.; run;
SAS stores dates as the number of days from Jan. 1, 1960.

Data Type Conversion: the issue
SAS has only two data types: numeric and character. Dates and times are stored as numeric values.
When you accidentally mix variable types, SAS tries to fix your program by converting them.
Log file: “Note: Numeric Values have been converted to Character!” Cannot ignore this!
For example, 110 can be numeric or character. When you use a numeric function on character variables or vice versa, SAS tries to convert to the appropriate data type first, then performs the function calculations.
How to fix? A practical way is to use the INPUT/PUT functions. They are close cousins of the INPUT/PUT statements, but different!

Data Type Conversion: Solution
Numeric to Character: new=PUT(old, format);
    The format must match the type you are converting from (numeric).
    Rate_chr=put(rate, 5.2);
    To verify, apply a character function: Digit1=substr(Rate_chr,1,1);
Character to Numeric: new=INPUT(old, informat);
    The informat must match the type you are converting to (numeric or SAS date).
    Rate_num=input(Rate_chr, 5.);
    To verify, apply a numeric function: Intgr_r=floor(Rate_num);

Titles and Footnotes
SAS allows up to ten lines of text at the top (titles) and bottom (footnotes) of each page of output, specified with title and footnote statements. The form of these statements is
    title<n> text; or footnote<n> text;
where n, if specified, can range from 1 to 10, and text is normally enclosed in single or double quotes (unquoted text is also accepted).
    Title 'Mortgage dataset';
    Proc print; run;
If text is omitted, the title or footnote is deleted; otherwise it remains in effect until it is redefined.
Thus, to have no titles, use:
    title;
By default SAS includes the date and page number on the top of each piece of output. These can be suppressed with the nodate and nonumber system options.

System Options for Output Control
Syntax: options opt;
Useful options to manage how SAS output (the OUTPUT window) looks:
    date/nodate (shows current date)
    number/nonumber (shows page number)
    center/nocenter (centers output; useful for proc means, etc.)
    formdlim='-' (defines the delimiter between pages; results in more readable output of econometric procs)
See all available options (in the LOG window):
    proc options; run;

IF-THEN-ELSE
The DATA step is where all variable assignment takes place.
Sometimes you will want to condition assignment by using an IF-THEN-ELSE statement:
    IF condition THEN action;
    ELSE IF condition THEN action;
    …
    ELSE action;
Example:
    data p1; set portf;
    if portfolioreturn>10 then promotion=1;
    else if portfolioreturn<0 then promotion=-1; /*fired*/
    else promotion=0;
    run;
The ELSE statements are optional.
With the above syntax you can only assign a single action with each statement.

IF-THEN-DO-END
Use a DO-END group inside an IF-THEN statement to perform multiple actions on the given condition:
    IF condition THEN DO;
        action;
        action;
    END;
Examples:
    if portfolioreturn>10 then do;
        promotion=1;
        bonus=50000+10000*sqrt(portfolioreturn);
    end;
    else if portfolioreturn<0 then do;
        promotion=-1;
        bonus=10000;
    end;
    else do;
        promotion=0;
        bonus=50000;
    end;
Conditions can be specified with symbols or mnemonics:
    Symbol      Mnemonic
    =           EQ
    ^= , ~=     NE
    &           AND
    | , !       OR
    >           GT
    >=          GE
    <           LT
    <=          LE

Logical Conditions
Other useful conditions can be set by the following:
IF var1 IN(val1, val2, val3 …) THEN …;
    if state in ('OR', 'WA', 'CA') then region='Pacific';
Range check: IF val1<=var1<=val3 THEN …;
    if 3.7<=GPA<=4 then letterGPA='A';
The BETWEEN-AND operator is available only in WHERE expressions, not in IF statements:
    where GPA between 3.7 and 4;
Conditions can contain functions, numeric and character variables, constants, and mathematical expressions:
    if rate**2 > 25 then highsqrate=1;

Statements and Options That Control Reading and Writing
    Task                  Statements       Data set options   System options
    Manage variables      DROP             DROP=
                          KEEP             KEEP=
                          RENAME           RENAME=
    Manage observations   WHERE            WHERE=
                          subsetting IF    FIRSTOBS=          FIRSTOBS=
                          DELETE           OBS=               OBS=
                          OUTPUT

Manage variables: KEEP, DROP statements
DROP list-of-variables tells which variables from the input dataset should NOT be included in the output dataset.
    data d1; set d;
    drop i j temp_variable;
KEEP list-of-variables tells which variables from the input dataset should be included in the output dataset (the other variables are dropped).
    data d1; set d;
    keep rate balance;

Manage Observations: Subsetting IF statement. DELETE statement
A special case of the IF-THEN statement is an IF statement without a ‘then’ action, i.e. IF condition;
    data d1; set d;
    if Origination>'01Jan2002'd;
    rte=rate/100;
    run;
If the condition is true, then SAS continues with the DATA step statements for this observation. Otherwise no further statements are processed for that observation, and the observation is not added to the data set.
To delete certain observations (the opposite of the subsetting IF statement) use: IF condition THEN DELETE;
    data d1; set d;
    if Origination>'01Jan2002'd then delete;

Subsetting IF: dealing with missing observations
If you want to keep only non-missing observations:
    if portfolioreturn;
leaves only observations where portfolioreturn is neither missing nor zero.
To drop only the missing values and keep zero returns, test for missing explicitly:
    if portfolioreturn^=.;
(The bare form if portfolioreturn; also drops observations where the return is exactly 0, because SAS treats both 0 and missing as false.)
Note that a missing value in SAS is considered to be smaller than all numeric or character values. Thus,
    if portfolioreturn<0;
includes observations with missing returns!
To avoid “firing” traders with missing return records, add:
    if portfolioreturn=. then promotion=.;

Manage Observations: WHERE statement
Alternative to the IF statement for subsetting data: WHERE condition;
    data d1; set loan1;
    where year(Origination)>2002;
Differences between IF and WHERE: http://support.sas.com/faq/042/FAQ04278.html
WHERE can be used in both DATA and PROC steps. IF is only for DATA steps.
    proc print data=loan1;
    where year(Origination)>2002;
Can use WHERE with the CONTAINS operator:
    data d1; set loan1;
    where city contains 'SANTA';
WHERE cannot be used to subset data read with INPUT statements. It only controls data that comes from existing SAS data sets via SET or MERGE. (wrong use: data d; input a; where a>0;)
WHERE cannot be applied to new variables created in the current DATA step; IF can. (wrong: data d1; set loan1; where Rate_num>3; Use: If Rate_num>3;)

Manage Observations: Data Set Options
WHERE= condition
    data d1; set d (where=(2002<=year(Origination)<2005));
KEEP= variable list, DROP= variable list
    data d1; set d (keep=origination rate);
    data d1 (drop=x y); infile f; input x y z; ... run;
RENAME= (oldvar=newvar) tells SAS to rename certain variables.
    data d1; set LOAN1 (rename=(origination=issued));

Dataset options (contd.)
Start reading from observation # n. Syntax: FIRSTOBS=n
Stop reading at observation # n. Syntax: OBS=n
    data d1; set d (firstobs=5 obs=20); … run;
In procedures:
    proc means data=d (firstobs=5 obs=20); run;
Here Proc Means analyzes only observations 5 through 20 of the data set d.

Concatenating Two Data Sets
Concatenating the data sets appends the observations from one data set to another data set.
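A minimal sketch of concatenation with the SET statement is given here. The data set names DATA1, DATA2 and COMBINED follow this slide; the variables year and x and all data values are made up for illustration.

```sas
/* Two small illustrative inputs; variable names and values are hypothetical. */
data data1;
  input year x;
  datalines;
1991 10
1992 20
;
run;

data data2;
  input year x;
  datalines;
1991 30
1993 40
;
run;

/* Concatenate: all observations of DATA1 first, then all of DATA2. */
data combined;
  set data1 data2;
run;

proc print data=combined;
run;
```

COMBINED ends up with four observations, in the order the inputs are listed in SET.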
The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2. Data set COMBINED contains the results of the concatenation. Note that the data sets are processed in the order in which they are listed in the SET statement.

Interleaving Two Data Sets
The datasets must be sorted by the values of the variables listed in the BY statement.
Similar to concatenating, but preserves the sorting order.

One-to-One Reading and One-to-One Merging (use this method with caution)
One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set. The first observation in one data set is paired with the first in the other, and so on. The DATA step stops after it has read the last observation from the smallest data set.
One-to-one merging is similar to one-to-one reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets.

Match-Merging (most common data set manipulation)
Match-merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables.

Updating
• Input data sets must be sorted by the values of the variables listed in the BY statement. (In this example, MASTER and TRANSACTION are both sorted by Year.)
• UPDATE replaces an existing file with a new file
• UPDATE does not replace nonmissing values in a master data set with missing values from a transaction data set.

Merging datasets
1. Sort the datasets according to the var list in BY.
2. Use the MERGE statement inside a DATA step.
    proc sort data=d1; by var_list;
    proc sort data=d2; by var_list;
    DATA newdata;
    MERGE d1 d2 …;
    BY var_list;
The input data sets specified in MERGE will not be modified.
Values of any common variables not specified in the BY statement are likely to be mixed up in the new data set.
To prevent this, use the RENAME data set option.
Example. Dataset d1 has variable ret containing market returns, and d2 has variable ret containing individual stock returns. Merge the datasets by tradedate. We rename ret in d1 to mktret:
    data newd;
    merge d1 (rename=(ret=mktret)) d2;
    by tradedate;

Merging datasets. IN= Data Set Option
The IN= option allows the user to omit observations that are not common to all data sets.
It creates a temporary variable for tracking whether that data set contributed to the current observation. Syntax: IN=index_var_name
    data d;
    merge d1 (in=indicator1) d2;
    by tradedate;
    if indicator1;
The indicator is 1 if the data set contributed and 0 otherwise.
Use the IF statement on the index variables. In the above example, only observations found in d1 will be included in d.
The indicator variable will not be written to the new data set. To include it in d, assign its value to a standard variable, e.g. ind1=indicator1;

Example of IN= option. Merging by ID variable.
    Dataset d1 (in=a)      Dataset d2 (in=b)
    ID   V1                ID   V2
    2    421               1    343
    3    129               2    85
    4    122               4    763
    6    534               5    229
    7    343               6    554
    8    324               8    895

If a; (i.e. observation must be in dataset d1)
    ID   V1    V2
    2    421   85
    3    129   .
    4    122   763
    6    534   554
    7    343   .
    8    324   895

If b; (i.e. observation must be in dataset d2)
    ID   V1    V2
    1    .     343
    2    421   85
    4    122   763
    5    .     229
    6    534   554
    8    324   895

Preview of Lecture 3: Procedures for dataset manipulation
PROC APPEND adds the observations from one SAS data set to the end of another SAS data set.
PROC SQL reads observations from up to 32 SAS data sets and joins them into single observations; manipulates observations in a SAS data set in place; easily produces a Cartesian product.

OUTPUT command
The OUTPUT statement is used in the datastep to write the current values of all variables to a data set.
There is an IMPLICIT output statement at the end of each datastep iteration (unless an output statement appears somewhere in the datastep). The following pieces of code are equivalent:
    data d;
    input r1-r9;
    cards;
    ...
    ;
    run;
and
    data d;
    input r1-r9;
    output;
    cards;
    ...
    ;
    run;
The OUTPUT statement is commonly used to create several SAS data sets in a single datastep. Specify the dataset name after OUTPUT.
Example: Split the mortgage data into separate datasets for each state.
    data ca wa; set loan1;
    if state='WA' then OUTPUT wa;
    if state='CA' then OUTPUT ca;
    proc print data=wa;
    proc print data=ca;
    run;
Once an OUTPUT statement is specified, the implied OUTPUT at the end of the DATA step no longer exists and all observation writing must be specified by the user.

DO loop
Example: The input data has one line per firm with four quarters of earnings, for 100 firms. Read the data, indexing each observation by quarter.
    data earnings;
    input ticker $ @;
    do quarter=1 to 4;
        input earn @;
        output;
    end;
    datalines;
    ibm 10.2 15 12 8
    msft 25.1 27 29.4 35
    ;
    run;
Other examples:
    do state='CA','OR'; ... end;
    do weekdays=1,3,5; ... end;
Output:
    Obs   ticker   quarter   earn
    1     ibm      1         10.2
    2     ibm      2         15.0
    3     ibm      3         12.0
    4     ibm      4          8.0
    5     msft     1         25.1
    6     msft     2         27.0
    7     msft     3         29.4
    8     msft     4         35.0

Variable Arrays
Arrays are used mainly to group variables.
Useful for performing the same calculations on a group of variables or searching through a set of variables. For example, your balance sheet data variables d_1 … d_150 are in millions, and you need to express them in hundreds of millions.
Arrays are defined using the ARRAY statement in a DATA step.
Syntax: ARRAY name (n) variable_list;
    ARRAY all_vars var1-var10;
n is the number of elements in the array and is optional.
The variable list is also optional, but either n or the variable list must be specified.
The variable list can contain variables that have not yet been created (an option for initializing variable values).
A $ should precede a variable list of character variables.

Arrays (cont.)
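The balance-sheet rescaling mentioned on the previous slide could be sketched as follows. This is an illustration under stated assumptions: the data set names balsheet and balsheet_scaled are hypothetical, the values are made up, and three variables stand in for d_1 … d_150.

```sas
/* Hypothetical three-variable stand-in for d_1 ... d_150 (values in millions). */
data balsheet;
  input d_1-d_3;
  datalines;
1200 350 80
900 410 65
;
run;

data balsheet_scaled;
  set balsheet;
  array d(3) d_1-d_3;      /* group the variables under one array name */
  do i = 1 to dim(d);      /* dim() returns the number of array elements */
    d(i) = d(i) / 100;     /* millions -> hundreds of millions */
  end;
  drop i;                  /* the loop index is not needed in the output */
run;
```

For the full set of 150 variables only the array declaration would change, e.g. array d(150) d_1-d_150; the loop bound dim(d) adjusts automatically.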
In the calculation section of a DATA step the array can be referenced by name(i), where i is the position of the element you wish to refer to.
Since parentheses are also used in functions, it is not a good idea to give your array the same name as a SAS function.
Example.
    DATA d1;
    input var1-var10;
    ARRAY all_vars var1-var10;
    DO i = 1 to 10;
        all_vars(i) = i/100;
    END;
    RUN;

Controlling the Built-in Data Loop: RETAIN Statement
The built-in loop stores the data for a given observation for the current run of the DATA step.
When the loop reaches the end of the DATA step and returns to the top to read the next observation, all values created by INPUT or assignment statements are reset to missing.
To force the built-in loop to keep values from previous observations, use the RETAIN statement: RETAIN variable-list;
The variables specified in the RETAIN statement keep their values until they are reset by an INPUT or assignment statement.
Example: Calculate the highest mortgage balance to date.
    proc sort data=loan1; by origination;
    data d1; set loan1;
    retain maxbal;
    maxbal=max(maxbal, balance);
    run;

Controlling the Built-in Data Loop: SUM statement
A special case is a plus sign in a statement that does not have an equal sign, e.g. cumsum + newvar;
This sum statement implicitly retains the previous value of cumsum and adds newvar to it.
Example. Calculate the growth of the total appraised value of houses in the dataset to date.
    proc sort data=loan1; by origination;
    data d1; set loan1;
    totalvalue + appraisal;
    run;
This is equivalent to
    retain totalvalue 0; /*initialize to 0*/
    totalvalue=sum(totalvalue, appraisal);

LAG Function
In general SAS is not very convenient about directly accessing observations, e.g. for particular dates.
If you want to do serious time series analysis you should use the procedures in the SAS/ETS package.
The lag function is used to reference previous values of a variable:
    newvar = LAG(variable); or newvar = LAGn(variable);
where n refers to the number of observations to go back.
Example (quarterly earnings):
    lag2_earn = LAG2(earn);
If we are in observation 100, for example, this statement will assign the value of earn from observation 98 to the variable lag2_earn in observation 100. Similarly the value of lag2_earn in observation 98 will be the value of earn in observation 96.

LAG Function (contd.)
The order of observations is determined by the current sort.
BY does not work with lags, so you need to do manual checks to prevent nonsensical lags when dealing with panel data. (This is the issue with the earnings example.)
Lags are tricky to use because of the built-in loop.
Sometimes the lag value is not available (missing).
The lag queue is not initialized until the lag function is called.
Similarly the lag queue is not updated until the lag function is called.
Hints:
    Use separate data steps to create lags and levels.
    Do not use the LAG function in a loop.

Lecture 2 References
SAS onlinedoc > “BASE SAS”, “SAS Language Reference: Dictionary” > “Data step options”
Manuals in pdf: http://www.math.wpi.edu/saspdf/common/mainpdf.htm “Base SAS” section
SAS User Group International “Beginning tutorials” http://www.lexjansen.com/cgi-bin/sugi.php?x=sbt&s=sugi_s
Merging datasets: http://support.sas.com/techsup/technote/ts644.html
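Appendix to the LAG slides: one possible form of the manual check for panel data, sketched on the two-firm earnings example from the DO loop slide. The output data set name earnings_lag and the variable lag_earn are hypothetical; the approach (call LAG on every observation, then blank the lag at the first observation of each firm) is one common pattern, not the only one.

```sas
/* Rebuild the earnings panel from the DO loop example. */
data earnings;
  input ticker $ @;
  do quarter = 1 to 4;
    input earn @;
    output;
  end;
  datalines;
ibm 10.2 15 12 8
msft 25.1 27 29.4 35
;
run;

proc sort data=earnings;
  by ticker quarter;
run;

data earnings_lag;
  set earnings;
  by ticker;                            /* makes first.ticker available */
  lag_earn = lag(earn);                 /* called on every observation, so the queue stays in step */
  if first.ticker then lag_earn = .;    /* no previous quarter exists within this firm */
run;
```

Without the first.ticker check, the first msft observation would pick up 8, the last ibm value, as its lag.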