Modifying and Combing SAS Data Sets (Chapter in the 6 Little SAS Book) Animal Science 500 Lecture No. 6 September 16, 2010 IOWA STATE UNIVERSITY Department of Animal Science Modifying Variable Lengths The code below uses only 3 character spaces for the county by specifying the length of the location variable You can also use this with numeric variables data new; length location $3; input location $ 1-15 date $ rainfall; Cards; /*remember you could use the datalines statement to do the same thing);*/ Story county 6/10/10 4.5 Polk county 6/10/10 6.5 Story county 7/10/10 4.9 Polk county 7/10/10 2.4 ; Run; Quit; IOWA STATE UNIVERSITY Department of Animal Science Creating new variables You can create variables in the initial data step where you are inputting the data or you can use a new data step to create the variables. As we did when we created 1. 2. 3. Adjusted backfat variables for barrows and gilts Total weight gain during the test period Average daily gain during the test period You do not have to make calculations when making new variables. You could divide backfat, loin muscle are, gain, etc into categories to evaluate if model effects differ by the classes you developed IOWA STATE UNIVERSITY Department of Animal Science Creating new variables you choose to create a new data step – you will need to either create a new file in SAS or modify the existing file. If Example what we have done in Lab by the following statements Data Pig13 Set Pig12; Alternatively you could create a new file in SAS If you create a new file you may need to merge it with the original data file; This is a place where students often have difficulty IOWA STATE UNIVERSITY Department of Animal Science Creating New Variables in SAS Both of these options will result in a data set named “New” with all the variables that have been defined This option creates the variables lograinfall and sqrtrainfall in the initial data step In the second set of code you are creating a new file in SAS and naming it “New” the set statement tells SAS to Assign the data from the first “New” to this file “New” data new; input @1 location $ 1-15 date mmddyy8. rainfall; log_rain = log(rainfall); sqrt_rain= sqrt(rain); datalines; Story county 6/10/10 4.5 Polk county 6/10/10 6.5 Story county 7/10/10 4.9 Polk county 7/10/10 2.4 ; run; data new; set new; log_rain = log(rainfall); sqrt_rain = sqrt(rain); run; IOWA STATE UNIVERSITY Department of Animal Science Creating New Variables in SAS SAS has many other functions to perform various calculations for trigonometry, finance, and other applications. Some examples assuming x is the variable you want to modify: Log = log(x) Sin = sin(x) Cos = cos(x) As we will see in the next lab, you may find the distribution does not meet the assumptions of the analysis of variance Will need to transform data often using some of the Use SAS help to find the correct notation Some helpful search hints are: Search under SAS Functions, Arithmetic Functions, Numeric Variables, Logical Operators IOWA STATE UNIVERSITY Department of Animal Science Operators Recall from previous discussions Addition, subtraction, multiplication, and division are specified by +, -, *, and /, respectively. For exponentiation, a double asterisk ** is used. exprainfall = exprainfall**2 Parentheses can be used to group expressions, and these expressions can be nested within several levels. SAS follows the standard PEMDAS order, () ** * / + - , for evaluating functions. IOWA STATE UNIVERSITY Department of Animal Science Logical Operators SAS can also evaluate logical expressions < ,>, =, if, then, else, else if, and (&), or(|), not (^)… Search Logical Operators in SAS Help data new; input @1 location $ 1-15 date mmddyy8. rainfall; *creating a new variable based on location; if location = “Story county" then x = rainfall +5; else x = 5; *creating a new variable based on level of rainfall; if rainfall < 3 then y=1; else if rainfall < 4.9 then y=2; else y = 3; datalines; Story county 6/10/10 4.5 Polk county 6/10/10 6.5 Story county 7/10/10 4.9 Polk county 7/10/10 2.4 ; Run; Quit; data new; set new; log_rain = log(rainfall); sqrt_rain = sqrt(rain); run; IOWA STATE UNIVERSITY Department of Animal Science Other ways to Modify Variables - Do Loops DO loops can be used to create an ordered sequence of numbers. Below is an example of a do loop in SAS The program "loops" through the values of Q from 1 to 5 and performs the calculations requested for the current value of Q. The OUTPUT statement tells SAS to export Q and the new variables to the dataset EXAMPLE. The END statement signifies the end of the loop. An END statement is necessary for each DO statement! Notice that neither INPUT nor DATALINES statements are used. data do; set new; do q=1 to 5; q_rain=q*rainfall; q_rainsquared=q_rain**2; output; end; proc print data = do; run; Quit; IOWA STATE UNIVERSITY Department of Animal Science Modifying your data using PROC Transpose Sometimes you need to reshape your data which is in a long format (shown below) FamID Year FamInc 1 2007 40000 1 2008 40500 1 2009 41000 2 2007 45000 2 2008 45400 2 2009 45800 3 2007 75000 3 2008 76000 3 2009 77000 IOWA STATE UNIVERSITY Department of Animal Science Modifying your data using PROC Transpose into a wide format (shown below). FamID FamInc07 FamInc08 FamInc0 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000 IOWA STATE UNIVERSITY Department of Animal Science Modifying your data using PROC Transpose How do we accomplish this? SAS proc transpose to reshape the data from a long to a wide format. IOWA STATE UNIVERSITY Department of Animal Science Modifying your data using PROC Transpose data long1 ; input famid year faminc ; cards ; 1 96 40000 1 97 40500 1 98 41000 2 96 45000 2 97 45400 2 98 45800 3 96 75000 3 97 76000 3 98 77000 ; run; quit; proc transpose data=long1 out=wide1 prefix=faminc; by famid ; id year; var famin; run; quit; proc print data = wide1; run; quit; Notice that the option prefix= faminc specifies a prefix to use in constructing names for transposed variables in the output data set. SAS automatic variable _NAME_ contains the name of the variable being transposed IOWA STATE UNIVERSITY Department of Animal Science Modifying your data using PROC Transpose What SAS does this get you? output that looks like the following Obs famid _Name_ faminc96 faminc97 1 1 faminc 40000 40500 41000 2 2 faminc 45000 45400 45800 3 3 faminc 75000 76000 77000 IOWA STATE UNIVERSITY Department of Animal Science faminc98 Removing Observations When performing statistical calculations, SAS, by default, uses all of the observations that are in the dataset. You can selectively delete observations that you do not wish to use with: IF statements - specify which observations to keep, IF and THEN DELETE will delete observations. IOWA STATE UNIVERSITY Department of Animal Science Removing Observations Both statements result in a file with only data from Story County data new; set new; if location = “Story county"; Run; Quit; data new; set new; if location = “Polk county" then delete; run; IOWA STATE UNIVERSITY Department of Animal Science Removing Variables You may only need to use a few variables from a larger dataset. This can be done with KEEP or DROP statements. For many datasets, you can keep unneeded variables in the dataset, and SAS can handle them with ease. This will be the case for most of you dealing with several hundred or even several thousand observations Those dealing with very large data sets might benefit from using the Keep or Drop statements Caution be sure you are thinking ahead so that you do not drop a variable needed later in some calculation or function IOWA STATE UNIVERSITY Department of Animal Science Removing Variables Let’s say we wanted to keep only the location, x and y variables from the “new” file. Both task. data steps below will accomplish the data subset; set new; drop date rainfall; run; data subset; set new; keep location x y; run; IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets There are two different ways to combine datasets using the SET and MERGE statements in a Data Step. The SET statement is used to add observations to an existing dataset. This is what we have done in lab We have developed new variables using calculations with existing data and have adjusted current data in the data set. Data Pig12 Set Pig12; ADG = (OFFWT – ONWT) / DOT; Consider the following example, using monthly rainfall totals in Gainesville in 1995 and 1996. When used to combine observations into one dataset, the SET command works as follows: IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets – Set data rain95; input month rainfall @@; year=1995; datalines; 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66 8 7.20 9 2.10 10 4.33 11 3.15 12 1.29 ; run; quit; data rain96; input month rainfall @@; year=1996; datalines; 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6 6.05 7 11.00 8 4.90 9 2.23 10 6.18 11 1.73 12 6.63 ; run; quit; data rain9596; set rain95 rain96; run; quit; IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets The MERGE statement adds the variables in one dataset to another dataset. Consider the following example using Southern teams in the National Basketball Association: IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets - Merge data nba1; input @1 city $11. @11 division $; Cards; Orlando Atlantic Miami Atlantic Atlanta Central Charlotte Central ; run; quit; data nba2; input mascot $ @@; Cards; Magic Heat Hawks Hornets ; run; quit; data nba3; merge nba1 nba2; run; quit; Proc Print; run; quit; IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets – Merge Output Results Obs city division mascot 1 Orlando Atlantic Magic 2 Miami Atlantic Heat 3 Atlanta 4 Charlotte Central Central IOWA STATE UNIVERSITY Department of Animal Science Hawks Hornets Combining Datasets In the last example, the observations were ordered so that each dataset corresponds to the other. You will often need to put datasets together based on values of variables which are included in both datasets. To do this, both datasets must first be sorted in order by the common variable or variables. IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets Proc sort sorts the dat file using the variable chosen in the by state we will discuss Proc S more detail later on IOWA STATE UNIVERSITY Department of Animal Science Combining Datasets with Merge -Output The SAS System 10:16 Saturday, September 11, 2010 14 Obs city division mascot 1 Atlanta Central Hawks 2 Charlotte Central Hornets 3 Miami Atlantic Heat 4 Orlando Atlantic Magic IOWA STATE UNIVERSITY Department of Animal Science