Timothy Forsyth, Ashok Viswanathan, Debbie McCullough, Donn Garvert

Some operations are more convenient when there is one observation per subject
    Example: Regression analysis
Some operations are more convenient when there are several observations per subject
    Example: Plot individual observations as a time series

[Figure: Participant #206. Ave. Positive Mood and Ave. Negative Mood (scale 0.0 to 3.0) plotted against Day Since Injury (days 1 to 16).]

• First method: Using arrays
    – Can use arrays to transpose the data set
    – Gives user more control
• Original data:
• Transposing the entire data set
• Transposing the entire data set back to original form
• Key commands:
• RETAIN: because these variables are not in the data set rainfall1, we must retain their values; otherwise SAS resets them to missing on every iteration
• CALL MISSING: sets any number of numeric or character variables to missing all at once
• In our example, if a data point is missing somewhere between month 1 and month 5 for a subject, the CALL MISSING routine sets these values to missing all at once. We use the OF keyword so that SAS looks at all 5 values.
• Without adding options to PROC TRANSPOSE we get confusing output, but the data will be correct:
• When we add options to PROC TRANSPOSE we get a cleaner-looking data set
    – Renamed _name_ to "month"
    – Renamed col1 to "rainfall"
    – Dropped subjects with missing data
• Transposing the rainfall data back to original form

PREFIX= option: this was not used in the example, but it can be used in this conversion
    Useful when your ID variable is a number
    For example, if month was listed as 1, 2, 3, 4, 5 we could use PREFIX=Month
        ▪ This will result in Month1, Month2, …, Month5

Data sets often have two or more observations per subject (or other grouping)
    Patients who have repeated visits to the doctor's office or clinic
    Sales at a store on a given day
    Inches of rainfall over many months in a given city (current example)
This is known as longitudinal data

SAS processes data one observation at a time, so special techniques are needed to perform calculations across observations

I will cover:
1) Identifying the first and last observation in a group.
2) A few different ways to count the number of occurrences of a subject (or other grouping). In our example we will count, for a given city, the number of months that had their average rainfall recorded.

Listing of SAS Data Set Rainfall

    City        Rainfall   Ave Temp   Average Sunlight
    Alameda        5.8       53.6          8.9
    Fremont        5.2       52.9          8.7
    Fremont        4.5       57.6          9.5
    Fremont        3.3       59.7         10.1
    Fremont        2.2       60.6         11.0
    Fremont        1.8       63.9         11.5
    Hayward        1.4       60.4          9.3
    Hayward        3.6       57.8         10.1
    Hayward        2.0       62.5         10.2
    Hayward        4.1       55.2         10.8
    Hayward        5.7       53.6         11.2
    Oakland        3.1       59.2          9.5
    Oakland        3.4       60.1         10.1
    Sunnyvale      5.7       54.2          9.4
    Sunnyvale      3.4       56.8          9.8
    Sunnyvale      4.9       55.2         10.3
    Sunnyvale      5.0       52.9         10.7

Three-step process:
Step 1) Sort the data first by the grouping variable (City) and then by the counting variable (Rainfall).
Step 2) Create the First and Last variables.
Step 3) Count the months of recorded rainfall using either a DATA step or PROC SQL.
Step 1: Sort the data

    proc sort data=rainfall5;
        by city rainfall;
    run;

    proc print data=rainfall5;
        var city rainfall;
    run;

Dataset Rainfall has been sorted first by city and second by rainfall:

    Before sorting            After sorting
    city        Rainfall      city        Rainfall
    Alameda       5.8         Alameda       5.8
    Fremont       5.2         Fremont       1.8
    Fremont       4.5         Fremont       2.2
    Fremont       3.3         Fremont       3.3
    Fremont       2.2         Fremont       4.5
    Fremont       1.8         Fremont       5.2
    Hayward       1.4         Hayward       1.4
    Hayward       3.6         Hayward       2.0
    Hayward       2.0         Hayward       3.6
    Hayward       4.1         Hayward       4.1
    Hayward       5.7         Hayward       5.7
    Oakland       3.1         Oakland       3.1
    Oakland       3.4         Oakland       3.4
    Sunnyvale     5.7         Sunnyvale     3.4
    Sunnyvale     3.4         Sunnyvale     4.9
    Sunnyvale     4.9         Sunnyvale     5.0
    Sunnyvale     5.0         Sunnyvale     5.7

Step 2: Create First and Last variables

    data rainfall_last rainfall_first;
        set rainfall5;
        by city;
        if last.city then output rainfall_last;
        else if first.city then output rainfall_first;
    run;

    proc print data=rainfall_first;
    run;

Listing of First_City

    city        Rainfall
    Fremont       1.8
    Hayward       1.4
    Oakland       3.1
    Sunnyvale     3.4

Alameda has a single observation, which is both first and last in its group. Because of the ELSE, it is written only to rainfall_last, so it does not appear in this listing.

Using a SET statement followed by a BY statement (here, by city;) creates two temporary SAS variables, first.var and last.var. These two are logical variables: they equal 1 if true and 0 if false. In our case we have generated the variables first.city and last.city.
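Because first.city and last.city are temporary, SAS does not write them to the output data set. The following is a minimal sketch, assuming the sorted rainfall5 data set used above, of copying the flags into ordinary variables so they can be inspected with PROC PRINT. The names rainfall_flags, is_first and is_last are hypothetical and are not part of the original examples.

```sas
/* Sketch only: rainfall_flags, is_first and is_last are illustrative names */
data rainfall_flags;
    set rainfall5;            /* assumed to be sorted by city already */
    by city;
    is_first = first.city;    /* 1 on the first observation of each city, else 0 */
    is_last  = last.city;     /* 1 on the last observation of each city, else 0 */
run;

proc print data=rainfall_flags;
    var city rainfall is_first is_last;
run;
```

A city with a single observation, such as Alameda, would be expected to show 1 in both columns.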
Adding a PUT statement writes the values of first.city and last.city to the log:

    data rainfall_last rainfall_first;
        set rainfall5;
        by city;
        put city= rainfall= first.city= last.city=;
        if last.city then output rainfall_last;
        else if first.city then output rainfall_first;
    run;

Contents from the log

    city=Alameda Rainfall=5.8 FIRST.city=1 LAST.city=1
    city=Fremont Rainfall=1.8 FIRST.city=1 LAST.city=0
    city=Fremont Rainfall=2.2 FIRST.city=0 LAST.city=0
    city=Fremont Rainfall=3.3 FIRST.city=0 LAST.city=0
    city=Fremont Rainfall=4.5 FIRST.city=0 LAST.city=0
    city=Fremont Rainfall=5.2 FIRST.city=0 LAST.city=1
    city=Hayward Rainfall=1.4 FIRST.city=1 LAST.city=0
    city=Hayward Rainfall=2 FIRST.city=0 LAST.city=0

Observation 1, for Alameda, is both the first and the last observation in its group, so first.city = 1 (true) and last.city = 1 (true). The first observation for Fremont has first.city = 1 and last.city = 0, and so on.

Step 3: Count the months of recorded rainfall (DATA step)

    data months_of_rec_rainfall;
        set rainfall5;
        by city;
        if first.city then N_months_rec = 0;
        N_months_rec + 1;
        if last.city then output;
    run;

    title 'Listing of Counts';
    proc print data=months_of_rec_rainfall label noobs;
        label N_months_rec = '# of Months Recorded';
        var City Rainfall N_months_rec;
    run;

Listing of Counts

    City        Rainfall   # of Months Recorded
    Alameda       5.8               1
    Fremont       5.2               5
    Hayward       5.7               5
    Oakland       3.4               2
    Sunnyvale     5.7               4

Predicted (actual) counts: Alameda 1 (1), Fremont 5 (5), Hayward 5 (5), Oakland 2 (2), Sunnyvale 4 (4).
The same DATA step, annotated:

    data months_of_rec_rainfall;
        set rainfall5;
        by city;
        if first.city then N_months_rec = 0;   /* 1) */
        N_months_rec + 1;                      /* 2) */
        if last.city then output;              /* 3) */
    run;

    title 'Listing of Counts';
    proc print data=months_of_rec_rainfall label noobs;
        label N_months_rec = '# of Months Recorded';
        var City Rainfall N_months_rec;
    run;

1) Initialize the counter at zero at the start of each city
2) Sum statement: adds 1 for every observation; the counter is retained across iterations automatically
3) Conditional statement: writes one observation per city, at the last observation of the group

Step 3 (alternative): Count the months of recorded rainfall with PROC SQL

    proc sql;
        create table Months_Rec_Rain as
        select city, count(city) as months_rec_rain
        from rainfall5
        group by city;
    quit;

    city        months_rec_rain
    Alameda            1
    Fremont            5
    Hayward            5
    Oakland            2
    Sunnyvale          4

The PROC SQL step gives the same result as the DATA step.

It was shown, via a three-step process, how to create a variable that counts the number of occurrences of a grouping variable. The example dealt with the number of months of recorded rainfall in a given city. These techniques are also useful in medical and clinical settings.
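The per-city counts can also be obtained with PROC FREQ. The sketch below is not taken from the original examples; it assumes the same rainfall5 data set, and the output data set name months_rec_freq is hypothetical. The OUT= data set of a one-way TABLES request contains the variables COUNT and PERCENT, so PERCENT is dropped and COUNT is renamed to match the DATA step example.

```sas
/* Sketch only: months_rec_freq is an illustrative name */
proc freq data=rainfall5 noprint;
    /* one-way table of city; write the counts to a data set instead of printing */
    tables city / out=months_rec_freq (drop=percent rename=(count=N_months_rec));
run;

proc print data=months_rec_freq noobs;
run;
```

Like count(city) in the PROC SQL version, this counts rows per city regardless of whether the rainfall value is missing, so it matches the DATA step counter only when every row holds a recorded value, as is the case in this example.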
SAS CODE

    data rainfall1;
        set rainfall;
        array rainfall_array{5} month_1-month_5;
        array temp_array{5} temp_1-temp_5;
        array hours_array{5} hours_1-hours_5;
        do month = 1 to 5;
            /* eliminate missing values: LEAVE exits the loop at the first missing month */
            if missing(rainfall_array{month}) then leave;
            /* create longitudinal data */
            Rain = rainfall_array{month};
            AveTemp = temp_array{month};
            AverageSunlight = hours_array{month};
            output;
        end;
        keep city rain month AveTemp AverageSunlight;
    run;

    proc print data=rainfall1;
    run;

SAS OUTPUT

    Obs   city       month   Rain   AveTemp   Average Sunlight
      1   Hayward      1      1.4     60.4          9.3
      2   Hayward      2      3.6     57.8         10.1
      3   Hayward      3      2.0     62.5         10.2
      4   Hayward      4      4.1     55.2         10.8
      5   Hayward      5      5.7     53.6         11.2
      6   Oakland      1      3.1     59.2          9.5
      7   Oakland      2      3.4     60.1         10.1
      8   Alameda      1      5.8     53.6          8.9
      9   Sunnyval     1      5.7     54.2          9.4
     10   Sunnyval     2      3.4     56.8          9.8
     11   Sunnyval     3      4.9     55.2         10.3
     12   Sunnyval     4      5.0     52.9         10.7
     13   Fremont      1      5.2     52.9          8.7
     14   Fremont      2      4.5     57.6          9.5
     15   Fremont      3      3.3     59.7         10.1
     16   Fremont      4      2.2     60.6         11.0
     17   Fremont      5      1.8     63.9         11.5

A FEW POINTS TO NOTE:
    A PROC FREQ approach can be used in addition to a PROC MEANS approach
    We present here the PROC MEANS approach
    Both ways can give us the same result/output

CODE

    proc means data=rainfall1 nway noprint;
        class city;
        output out=counts (rename=(_freq_ = N_Recorded) drop = _type_);
    run;

    proc print data=counts;
    run;

Things to notice:
    nway: keeps only the rows for the full CLASS combination (here, one group per city)
    noprint: suppresses the printed PROC MEANS report
    rename: renames _freq_ to N_Recorded in the output data set

The SAS System                                   14:46 Sunday, May 30, 2010

    Obs   city       N_Recorded   _STAT_     month      Rain      AveTemp   Average Sunlight
      1   Alameda         1       N        1.00000   1.00000     1.0000       1.0000
      2   Alameda         1       MIN      1.00000   5.80000    53.6000       8.9000
      3   Alameda         1       MAX      1.00000   5.80000    53.6000       8.9000
      4   Alameda         1       MEAN     1.00000   5.80000    53.6000       8.9000
      5   Alameda         1       STD       .         .           .            .
      6   Fremont         5       N        5.00000   5.00000     5.0000       5.0000
      7   Fremont         5       MIN      1.00000   1.80000    52.9000       8.7000
      8   Fremont         5       MAX      5.00000   5.20000    63.9000      11.5000
      9   Fremont         5       MEAN     3.00000   3.40000    58.9400      10.1600
     10   Fremont         5       STD      1.58114   1.45430     4.0685       1.1261
     11   Hayward         5       N        5.00000   5.00000     5.0000       5.0000
     12   Hayward         5       MIN      1.00000   1.40000    53.6000       9.3000
     13   Hayward         5       MAX      5.00000   5.70000    62.5000      11.2000
     14   Hayward         5       MEAN     3.00000   3.36000    57.9000      10.3200
     15   Hayward         5       STD      1.58114   1.71552     3.6469       0.7259
     16   Oakland         2       N        2.00000   2.00000     2.0000       2.0000
     17   Oakland         2       MIN      1.00000   3.10000    59.2000       9.5000
     18   Oakland         2       MAX      2.00000   3.40000    60.1000      10.1000
     19   Oakland         2       MEAN     1.50000   3.25000    59.6500       9.8000
     20   Oakland         2       STD      0.70711   0.21213     0.6364       0.4243
     21   Sunnyval        4       N        4.00000   4.00000     4.0000       4.0000
     22   Sunnyval        4       MIN      1.00000   3.40000    52.9000       9.4000
     23   Sunnyval        4       MAX      4.00000   5.70000    56.8000      10.7000
     24   Sunnyval        4       MEAN     2.50000   4.75000    54.7750      10.0500
     25   Sunnyval        4       STD      1.29099   0.96782     1.6460       0.5686

We would like to see the differences in values by city and by month for the following variables:
    The average rainfall recorded (given by 'Rain')
    The average temperature recorded (given by 'AveTemp')
    The average sunlight recorded (given by 'AverageSunlight')

    proc sort data=rainfall1 out=rainfall1;
        by city month;
    run;

    data last;
        set rainfall1;
        by city;
        put city= month= first.city= last.city=;
        if last.city;
    run;

log file:

    NOTE: There were 17 observations read from the data set WORK.RAINFALL1.
    NOTE: The data set WORK.RAINFALL1 has 17 observations and 5 variables.
    NOTE: PROCEDURE SORT used (Total process time):
          real time           0.01 seconds
          cpu time            0.00 seconds

    data last;
        set rainfall1;
        by city;
        put city= month= first.city= last.city=;
        if last.city;
    run;

    city=Alameda month=1 FIRST.city=1 LAST.city=1
    city=Fremont month=1 FIRST.city=1 LAST.city=0
    city=Fremont month=2 FIRST.city=0 LAST.city=0
    city=Fremont month=3 FIRST.city=0 LAST.city=0
    city=Fremont month=4 FIRST.city=0 LAST.city=0
    city=Fremont month=5 FIRST.city=0 LAST.city=1
    city=Hayward month=1 FIRST.city=1 LAST.city=0
    city=Hayward month=2 FIRST.city=0 LAST.city=0
    city=Hayward month=3 FIRST.city=0 LAST.city=0
    city=Hayward month=4 FIRST.city=0 LAST.city=0
    city=Hayward month=5 FIRST.city=0 LAST.city=1
    city=Oakland month=1 FIRST.city=1 LAST.city=0
    city=Oakland month=2 FIRST.city=0 LAST.city=1
    city=Sunnyval month=1 FIRST.city=1 LAST.city=0
    city=Sunnyval month=2 FIRST.city=0 LAST.city=0
    city=Sunnyval month=3 FIRST.city=0 LAST.city=0
    city=Sunnyval month=4 FIRST.city=0 LAST.city=1
    NOTE: There were 17 observations read from the data set WORK.RAINFALL1.
    NOTE: The data set WORK.LAST has 5 observations and 5 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

Differences between consecutive months, using LAG:

    data difference;
        set rainfall1;
        by city;
        if first.city and last.city then delete;
        Diff_Rain = rain - lag(rain);
        Diff_AveTemp = AveTemp - lag(AveTemp);
        Diff_AverageSunlight = AverageSunlight - lag(AverageSunlight);
        if not first.city then output;
    run;

    proc print data=difference;
    run;

    Obs   city       month   Rain   AveTemp   Average    Diff_   Diff_     Diff_Average
                                              Sunlight   Rain    AveTemp   Sunlight
      1   Fremont      2      4.5     57.6      9.5      -0.7      4.7        0.8
      2   Fremont      3      3.3     59.7     10.1      -1.2      2.1        0.6
      3   Fremont      4      2.2     60.6     11.0      -1.1      0.9        0.9
      4   Fremont      5      1.8     63.9     11.5      -0.4      3.3        0.5
      5   Hayward      2      3.6     57.8     10.1       2.2     -2.6        0.8
      6   Hayward      3      2.0     62.5     10.2      -1.6      4.7        0.1
      7   Hayward      4      4.1     55.2     10.8       2.1     -7.3        0.6
      8   Hayward      5      5.7     53.6     11.2       1.6     -1.6        0.4
      9   Oakland      2      3.4     60.1     10.1       0.3      0.9        0.6
     10   Sunnyval     2      3.4     56.8      9.8      -2.3      2.6        0.4
     11   Sunnyval     3      4.9     55.2     10.3       1.5     -1.6        0.5
     12   Sunnyval     4      5.0     52.9     10.7       0.1     -2.3        0.4

Difference between the first and last month of each city, using LAG:

    data first_last;
        set rainfall1;
        by city;
        if first.city and last.city then delete;
        if first.city or last.city then do;
            diff_rain = rain - lag(rain);
            diff_temp = avetemp - lag(avetemp);
            diff_sunlight = averagesunlight - lag(averagesunlight);
        end;
        if last.city then output;
    run;

    proc print data=first_last;
    run;

LAG returns the value from the previous time the function executed, not from the previous observation. Here it executes only on the first and last observation of each city, so on the last observation it returns the value from the first.

    Obs   city       month   Rain   AveTemp   Average    diff_   diff_   diff_
                                              Sunlight   rain    temp    Sunlight
      1   Fremont      5      1.8     63.9     11.5      -3.4    11.0     2.8
      2   Hayward      5      5.7     53.6     11.2       4.3    -6.8     1.9
      3   Oakland      2      3.4     60.1     10.1       0.3     0.9     0.6
      4   Sunnyval     4      5.0     52.9     10.7      -0.7    -1.3     1.3

• Using a RETAIN statement is one of the best ways to "remember" values from previous observations
• Variables that do not come from SAS data sets are set to a missing value during each iteration of the DATA step
• A RETAIN statement allows you to tell SAS not to do this

General form:

    data new_data_set;
        set old_data_set;
        by ID;        /* Need to sort by the ID, or whichever variable you are grouping by! */
        if first.ID and last.ID then delete;
        retain First_Var_1 First_Var_2 … First_Var_n;
        if first.ID then do;
            First_Var_1 = Var_1;
            First_Var_2 = Var_2;
            …
            First_Var_n = Var_n;
        end;
        if last.ID then do;
            Diff_Var_1 = Var_1 - First_Var_1;
            Diff_Var_2 = Var_2 - First_Var_2;
            …
            Diff_Var_n = Var_n - First_Var_n;
        end;
        drop First_: ;
    run;

The RETAIN statement ensures that the variables named in it are not set back to missing values between iterations. When the first observation of a group is processed, the retained variables are set to the respective variable values. On the last observation of the group, the retained first values are subtracted from the respective last values.

    proc sort data=rainfall;
        by City Month;
    run;

    data last;
        set rainfall;
        by City;
        put City= Month= First.City= Last.City=;
        if last.City;
    run;

    data first_last;
        set rainfall;
        by City;
        if first.city and last.city then delete;
        retain First_Rainfall First_Temp First_Hours;
        if first.City then do;
            First_Rainfall = Rainfall;
            First_Temp = Temp;
            First_Hours = Hours;
        end;
        if last.City then do;
            Diff_Rainfall = Rainfall - First_Rainfall;
            Diff_Temp = Temp - First_Temp;
            Diff_Hours = Hours - First_Hours;
            output;
        end;
        drop First_: ;
    run;

Displaying the RETAIN Statement

    City        Month   Rainfall   Temp   Hours   Diff_Rainfall   Diff_Temp   Diff_Hours
    Fremont       5       1.8      63.9   11.5        -3.4          11.0         2.8
    Hayward       5       5.7      53.6   11.2         4.3          -6.8         1.9
    Oakland       5       3.6      60.3   10.8         0.5           1.1         1.3
    Sunnyvale     4       5.0      52.9   10.7        -0.7          -1.3         1.3

Output from LAG statement:

    city        month   Rain   Temp   Sunlight   diff_rain   diff_temp   diff_Sunlight
    Fremont       5     1.8    63.9    11.5        -3.4        11.0          2.8
    Hayward       5     5.7    53.6    11.2         4.3        -6.8          1.9
    Oakland       2     3.4    60.1    10.1         0.3         0.9          0.6
    Sunnyvale     4     5.0    52.9    10.7        -0.7        -1.3          1.3

Output from RETAIN statement:

    City        Month   Rainfall   Temp   Hours   Diff_Rainfall   Diff_Temp   Diff_Hours
    Fremont       5       1.8      63.9   11.5        -3.4          11.0         2.8
    Hayward       5       5.7      53.6   11.2         4.3          -6.8         1.9
    Oakland       5       3.6      60.3   10.8         0.5           1.1         1.3
    Sunnyvale     4       5.0      52.9   10.7        -0.7          -1.3         1.3

The two methods agree for Fremont, Hayward and Sunnyvale. The Oakland rows differ because the two examples read different inputs: rainfall1 contains Oakland months 1 and 2 only, while rainfall also contains Oakland month 5.

Suppose you want to know whether a certain value is the maximum over all of your observations. The RETAIN statement lets us find this easily, because it preserves a variable's value from the previous iteration.

    data Maximums;
        set rainfall;
        retain Max_Rainfall Max_Temp Max_Hours;
        Max_Rainfall = max(Max_Rainfall, Rainfall);
        Max_Temp = max(Max_Temp, Temp);
        Max_Hours = max(Max_Hours, Hours);
    run;

    title "Displaying the RETAIN Statement- Finding Maximums";
    proc print data=Maximums noobs;
    run;

Displaying the RETAIN Statement- Finding Maximums

    city        Month   Rainfall   Temp   Hours   Max_Rainfall   Max_Temp   Max_Hours
    Alameda       1       5.8      53.6    8.9        5.8          53.6        8.9
    Fremont       1       5.2      52.9    8.7        5.8          53.6        8.9
    Fremont       2       4.5      57.6    9.5        5.8          57.6        9.5
    Fremont       3       3.3      59.7   10.1        5.8          59.7       10.1
    Fremont       4       2.2      60.6   11.0        5.8          60.6       11.0
    Fremont       5       1.8      63.9   11.5        5.8          63.9       11.5
    Hayward       1       1.4      60.4    9.3        5.8          63.9       11.5
    Hayward       2       3.6      57.8   10.1        5.8          63.9       11.5
    Hayward       3       2.0      62.5   10.2        5.8          63.9       11.5
    Hayward       4       4.1      55.2   10.8        5.8          63.9       11.5
    Hayward       5       5.7      53.6   11.2        5.8          63.9       11.5
    Oakland       1       3.1      59.2    9.5        5.8          63.9       11.5
    Oakland       2       3.4      60.1   10.1        5.8          63.9       11.5
    Oakland       5       3.6      60.3   10.8        5.8          63.9       11.5
    Sunnyvale     1       5.7      54.2    9.4        5.8          63.9       11.5
    Sunnyvale     2       3.4      56.8    9.8        5.8          63.9       11.5
    Sunnyvale     3       4.9      55.2   10.3        5.8          63.9       11.5
    Sunnyvale     4       5.0      52.9   10.7        5.8          63.9       11.5
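The running maximum above is taken over the whole data set. A possible variation, not part of the original examples, combines RETAIN with first.City and last.City to get one maximum per city. This is a minimal sketch assuming the same rainfall data set with variables City, Month and Rainfall; the names rainfall_sorted, city_maximums and Max_Rainfall_City are hypothetical.

```sas
/* Sketch only: rainfall_sorted, city_maximums and Max_Rainfall_City are illustrative names */
proc sort data=rainfall out=rainfall_sorted;
    by City Month;
run;

data city_maximums;
    set rainfall_sorted;
    by City;
    retain Max_Rainfall_City;
    /* reset the retained value at the start of each city */
    if first.City then call missing(Max_Rainfall_City);
    /* MAX ignores missing arguments, so the first value in a city becomes the maximum so far */
    Max_Rainfall_City = max(Max_Rainfall_City, Rainfall);
    /* write one observation per city, holding that city's maximum */
    if last.City then output;
    keep City Max_Rainfall_City;
run;

proc print data=city_maximums noobs;
run;
```

The same property of MAX explains why the whole-data-set version needs no explicit initialization: the retained variables start out missing, and the first comparison simply returns the first non-missing value.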