Timothy Forsyth
Ashok Viswanathan
Debbie McCullough
Donn Garvert

Some operations are more convenient when there is one observation per subject.
Example: Regression analysis
Some operations are more convenient when there are several observations per subject.
Example: Plot individual observations as a time series
[Figure: Participant #206. Ave. Positive Mood and Ave. Negative Mood (y-axis, 0.0 to 3.0) plotted against Day Since Injury (x-axis, 1 to 16).]
• First method: Using arrays
– Can use arrays to transpose the data set
– Gives user more control
• Original data:
• Transposing the entire data set
• Transposing the entire data set back to original form
• Key commands:
• RETAIN: these variables are not in the data set rainfall1, so we must retain their values; otherwise SAS resets them to missing at the start of each iteration of the DATA step
• CALL MISSING: sets any number of numeric or character variables to missing all at once
• In our example, if a subject has no data point for some month between month 1 and month 5, CALL MISSING makes sure that month is left missing rather than carrying over a retained value. We use the OF keyword so that SAS looks at all 5 variables in the list.
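The code for transposing back is not reproduced in these notes. The following is a minimal sketch of how RETAIN, CALL MISSING, and OF might fit together, assuming a long data set with variables city, month, and rain. The data set names rainfall_long and rainfall_wide are hypothetical; the values are the Hayward and Oakland rows shown elsewhere in this deck.

```sas
/* Hypothetical long data: one row per city per recorded month */
data rainfall_long;
   length city $ 9;
   input city $ month rain;
   datalines;
Hayward 1 1.4
Hayward 2 3.6
Hayward 3 2.0
Hayward 4 4.1
Hayward 5 5.7
Oakland 1 3.1
Oakland 2 3.4
;
run;

proc sort data=rainfall_long;
   by city month;
run;

data rainfall_wide;
   set rainfall_long;
   by city;
   array rainfall_array{5} month_1-month_5;
   /* keep the month values across iterations of the DATA step */
   retain month_1-month_5;
   /* start each city with all five months missing */
   if first.city then call missing(of month_1-month_5);
   rainfall_array{month} = rain;
   /* write one row per city once its last month has been read */
   if last.city then output;
   keep city month_1-month_5;
run;
```

Without the CALL MISSING line, Oakland's month_3 through month_5 would still hold the values retained from Hayward.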
• Without adding options to PROC TRANSPOSE we will get a confusing output, but the data will be correct:
• When we add options to PROC TRANSPOSE we can get a cleaner looking data set
– Renamed _name_ to “month”
– Renamed col1 to “rainfall”
– Dropped subjects with missing data
• Transposing the rainfall data back to original form

PREFIX= option: this was not used in the example, but can be used as an option in this conversion
• Useful when your ID variable is a number
• For example, if month was listed as 1, 2, 3, 4, 5 we could use PREFIX=Month
▪ This will result in variables named Month1, Month2, …, Month5
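A short sketch of the PREFIX= option, assuming a long data set in which month is numeric. The data set names long_rain and wide_rain are hypothetical, and only three rows are used to keep the example small.

```sas
/* Hypothetical long data with a numeric ID variable (month) */
data long_rain;
   length city $ 9;
   input city $ month rain;
   datalines;
Alameda 1 5.8
Oakland 1 3.1
Oakland 2 3.4
;
run;

proc sort data=long_rain;
   by city;
run;

/* PREFIX=Month turns the ID values 1 and 2 into variables Month1 and Month2 */
proc transpose data=long_rain out=wide_rain(drop=_name_) prefix=Month;
   by city;
   id month;
   var rain;
run;

proc print data=wide_rain;
run;
```

Alameda has no month 2 record, so its Month2 value is missing in wide_rain.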
Data sets often have two or more observations per subject (or other grouping):
• Patients who have repeated visits to the doctor’s office or clinic
• Sales at a store on a given day
• Inches of rainfall over many months in a given city (current example)
This is known as longitudinal data.
SAS processes data one observation at a time, so special techniques are needed to perform calculations across observations.
I will cover:
1) Identifying the first and last observation in a group.
2) A few different ways to count the number of occurrences of a subject (or other grouping). In our example we will count, for a given city, the number of months that had their average rainfall recorded.
Listing of SAS Data Set Rainfall

City        Rainfall   Ave Temp   Average Sunlight
Alameda       5.8        53.6          8.9
Fremont       5.2        52.9          8.7
Fremont       4.5        57.6          9.5
Fremont       3.3        59.7         10.1
Fremont       2.2        60.6         11.0
Fremont       1.8        63.9         11.5
Hayward       1.4        60.4          9.3
Hayward       3.6        57.8         10.1
Hayward       2.0        62.5         10.2
Hayward       4.1        55.2         10.8
Hayward       5.7        53.6         11.2
Oakland       3.1        59.2          9.5
Oakland       3.4        60.1         10.1
Sunnyvale     5.7        54.2          9.4
Sunnyvale     3.4        56.8          9.8
Sunnyvale     4.9        55.2         10.3
Sunnyvale     5.0        52.9         10.7
Three-step process:
Step 1) Sort the data first by the grouping variable (City) and then by the counting variable (Rainfall).
Step 2) Create First and Last variables.
Step 3) Count the months of recorded rainfall using either a DATA step or PROC SQL.
city        Rainfall
Alameda       5.8
Fremont       5.2
Fremont       4.5
Fremont       3.3
Fremont       2.2
Fremont       1.8
Hayward       1.4
Hayward       3.6
Hayward       2.0
Hayward       4.1
Hayward       5.7
Oakland       3.1
Oakland       3.4
Sunnyvale     5.7
Sunnyvale     3.4
Sunnyvale     4.9
Sunnyvale     5.0
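The steps that follow read a data set named rainfall5, whose creation is not shown. A minimal sketch that enters the city and rainfall values from the full Rainfall listing with DATALINES, so the later steps can be run. It assumes only these two variables are needed; the real rainfall5 may contain more.

```sas
/* Assumption: rainfall5 needs only city and rainfall for the counting example */
data rainfall5;
   length city $ 9;   /* long enough for "Sunnyvale" */
   input city $ rainfall;
   datalines;
Alameda 5.8
Fremont 5.2
Fremont 4.5
Fremont 3.3
Fremont 2.2
Fremont 1.8
Hayward 1.4
Hayward 3.6
Hayward 2.0
Hayward 4.1
Hayward 5.7
Oakland 3.1
Oakland 3.4
Sunnyvale 5.7
Sunnyvale 3.4
Sunnyvale 4.9
Sunnyvale 5.0
;
run;
```

The LENGTH statement matters here: with list input and no declared length, a character variable defaults to 8 bytes and "Sunnyvale" would be stored as "Sunnyval".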
proc sort data=rainfall5;
by city rainfall;
run;
proc print data=rainfall5;
var city rainfall;
run;
The data set has been sorted first by city and second by rainfall:

city        Rainfall
Alameda       5.8
Fremont       1.8
Fremont       2.2
Fremont       3.3
Fremont       4.5
Fremont       5.2
Hayward       1.4
Hayward       2.0
Hayward       3.6
Hayward       4.1
Hayward       5.7
Oakland       3.1
Oakland       3.4
Sunnyvale     3.4
Sunnyvale     4.9
Sunnyvale     5.0
Sunnyvale     5.7
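data rainfall_last rainfall_first;
   set rainfall5;
   by city;
   /* A city with a single observation (Alameda) is both first and last. */
   /* Because of the ELSE, it is written only to rainfall_last.          */
   if last.city then output rainfall_last;
   else if first.city then output rainfall_first;
run;
proc print data=rainfall_first;
run;

Listing of First_City

city        Rainfall
Fremont       1.8
Hayward       1.4
Oakland       3.1
Sunnyvale     3.4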
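Using a SET statement followed by a BY statement (here, by city;) creates two temporary SAS variables, first.var and last.var. These two are logical variables: they equal 1 if true and 0 if false. In our case we have generated the variables first.city and last.city. They are available during the DATA step but are not written to the output data set.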
data rainfall_last rainfall_first;
   set rainfall5;
   by city;
   /* write the values of the temporary variables to the log */
   put city= rainfall= first.city= last.city=;
   if last.city then output rainfall_last;
   else if first.city then output rainfall_first;
run;

Contents from the log:
city=Alameda Rainfall=5.8 FIRST.city=1 LAST.city=1
city=Fremont Rainfall=1.8 FIRST.city=1 LAST.city=0
city=Fremont Rainfall=2.2 FIRST.city=0 LAST.city=0
city=Fremont Rainfall=3.3 FIRST.city=0 LAST.city=0
city=Fremont Rainfall=4.5 FIRST.city=0 LAST.city=0
city=Fremont Rainfall=5.2 FIRST.city=0 LAST.city=1
city=Hayward Rainfall=1.4 FIRST.city=1 LAST.city=0
city=Hayward Rainfall=2 FIRST.city=0 LAST.city=0

The only observation for Alameda is both the first and the last observation for that city, so first.city=1 (true) and last.city=1 (true). For the first Fremont observation, first.city=1 and last.city=0, and so on.
data months_of_rec_rainfall;
   set rainfall5;
   by city;
   if first.city then N_months_rec = 0;
   N_months_rec + 1;
   if last.city then output;
run;
title 'Listing of Counts';
proc print data=months_of_rec_rainfall label noobs;
   label N_months_rec = '# of Months Recorded';
   var City Rainfall N_months_rec;
run;
Listing of Counts

City        Rainfall   # of Months Recorded
Alameda       5.8              1
Fremont       5.2              5
Hayward       5.7              5
Oakland       3.4              2
Sunnyvale     5.7              4

Predicted (actual) counts: Alameda 1 (1), Fremont 5 (5), Hayward 5 (5), Oakland 2 (2), Sunnyvale 4 (4).
data months_of_rec_rainfall;
   set rainfall5;
   by city;
   if first.city then N_months_rec = 0;   /* 1) Initialize the counter at zero */
   N_months_rec + 1;                      /* 2) Sum statement */
   if last.city then output;              /* 3) Conditional statement */
run;
title 'Listing of Counts';
proc print data=months_of_rec_rainfall label noobs;
   label N_months_rec = '# of Months Recorded';
   var City Rainfall N_months_rec;
run;
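The sum statement in 2) retains the counter across iterations without an explicit RETAIN. A sketch of an equivalent step with the retention written out, assuming rainfall5 is already sorted by city; the output name months_explicit is hypothetical.

```sas
data months_explicit;
   set rainfall5;
   by city;
   /* keep the counter from one observation to the next */
   retain N_months_rec 0;
   /* restart the count at the first observation of each city */
   if first.city then N_months_rec = 0;
   /* same effect as the sum statement: N_months_rec + 1; */
   N_months_rec = sum(N_months_rec, 1);
   /* write one row per city, carrying the final count */
   if last.city then output;
run;
```

The sum statement is shorter, which is presumably why the slides use it; the explicit form only makes the retained counter visible.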
proc sql;
create table Months_Rec_Rain as
select city,
count(city) as months_rec_rain
from rainfall5
group by city;
quit;
city        months_rec_rain
Alameda            1
Fremont            5
Hayward            5
Oakland            2
Sunnyvale          4

The PROC SQL step gives the same result as the DATA step.
It was shown, via a three step process, how to
create a variable to count the number of
occurrences of a grouping variable. The
example shown dealt with the number of
months of recorded rainfall in a given city.
These techniques have utility in the medical or
clinical setting.
SAS code:
data rainfall1;
   set rainfall;
   array rainfall_array{5} month_1-month_5;
   array temp_array{5} temp_1-temp_5;
   array hours_array{5} hours_1-hours_5;
   do month = 1 to 5;
      /* eliminate missing values: stop at the first missing month */
      if missing(rainfall_array{month}) then leave;
      Rain = rainfall_array{month};
      AveTemp = temp_array{month};
      AverageSunlight = hours_array{month};
      /* create longitudinal data: one row per city per month */
      output;
   end;
   keep city rain month AveTemp AverageSunlight;
run;
proc print data=rainfall1;
run;

SAS output:
Obs   city       month   Rain   AveTemp   AverageSunlight
  1   Hayward      1     1.4     60.4          9.3
  2   Hayward      2     3.6     57.8         10.1
  3   Hayward      3     2.0     62.5         10.2
  4   Hayward      4     4.1     55.2         10.8
  5   Hayward      5     5.7     53.6         11.2
  6   Oakland      1     3.1     59.2          9.5
  7   Oakland      2     3.4     60.1         10.1
  8   Alameda      1     5.8     53.6          8.9
  9   Sunnyval     1     5.7     54.2          9.4
 10   Sunnyval     2     3.4     56.8          9.8
 11   Sunnyval     3     4.9     55.2         10.3
 12   Sunnyval     4     5.0     52.9         10.7
 13   Fremont      1     5.2     52.9          8.7
 14   Fremont      2     4.5     57.6          9.5
 15   Fremont      3     3.3     59.7         10.1
 16   Fremont      4     2.2     60.6         11.0
 17   Fremont      5     1.8     63.9         11.5
A FEW POINTS TO NOTE:
• A PROC FREQ approach can be used as an alternative to a PROC MEANS approach
• We present here the PROC MEANS approach
• Both ways can give us the same result/output
CODE
proc means data=rainfall1 nway noprint;
class city;
output out=counts (rename=(_freq_ = N_Recorded)
drop = _type_);
run;
proc print data=counts;
run;
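Things to notice:
• nway: keeps only the output rows for the combination of all CLASS variables (here, per city), omitting the overall summary
• noprint: suppresses the printed PROC MEANS output
• rename: renames _freq_ to N_Recorded in the output data set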
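The PROC FREQ alternative mentioned earlier is not shown in the slides. One way it might look, as a sketch under the assumption that only the count per city is wanted; the output name counts_freq is hypothetical.

```sas
/* OUT= writes one row per city with the variables COUNT and PERCENT */
proc freq data=rainfall1 noprint;
   tables city / out=counts_freq(rename=(count=N_Recorded) drop=percent);
run;

proc print data=counts_freq;
run;
```

Unlike the PROC MEANS output, which here carries a block of statistics rows for each city, this data set has a single row per city.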
The SAS System                                   14:46 Sunday, May 30, 2010  13

Obs   city       N_Recorded   _STAT_    month      Rain      AveTemp   AverageSunlight
  1   Alameda        1        N        1.00000   1.00000     1.0000        1.0000
  2   Alameda        1        MIN      1.00000   5.80000    53.6000        8.9000
  3   Alameda        1        MAX      1.00000   5.80000    53.6000        8.9000
  4   Alameda        1        MEAN     1.00000   5.80000    53.6000        8.9000
  5   Alameda        1        STD       .         .           .             .
  6   Fremont        5        N        5.00000   5.00000     5.0000        5.0000
  7   Fremont        5        MIN      1.00000   1.80000    52.9000        8.7000
  8   Fremont        5        MAX      5.00000   5.20000    63.9000       11.5000
  9   Fremont        5        MEAN     3.00000   3.40000    58.9400       10.1600
 10   Fremont        5        STD      1.58114   1.45430     4.0685        1.1261
 11   Hayward        5        N        5.00000   5.00000     5.0000        5.0000
 12   Hayward        5        MIN      1.00000   1.40000    53.6000        9.3000
 13   Hayward        5        MAX      5.00000   5.70000    62.5000       11.2000
 14   Hayward        5        MEAN     3.00000   3.36000    57.9000       10.3200
 15   Hayward        5        STD      1.58114   1.71552     3.6469        0.7259
 16   Oakland        2        N        2.00000   2.00000     2.0000        2.0000
 17   Oakland        2        MIN      1.00000   3.10000    59.2000        9.5000
 18   Oakland        2        MAX      2.00000   3.40000    60.1000       10.1000
 19   Oakland        2        MEAN     1.50000   3.25000    59.6500        9.8000
 20   Oakland        2        STD      0.70711   0.21213     0.6364        0.4243
 21   Sunnyval       4        N        4.00000   4.00000     4.0000        4.0000
 22   Sunnyval       4        MIN      1.00000   3.40000    52.9000        9.4000
 23   Sunnyval       4        MAX      4.00000   5.70000    56.8000       10.7000
 24   Sunnyval       4        MEAN     2.50000   4.75000    54.7750       10.0500
 25   Sunnyval       4        STD      1.29099   0.96782     1.6460        0.5686

We would like to see the differences in values by city and by month for the following variables:
• The average rainfall recorded (given by ‘Rain’)
• The average temperature recorded (given by ‘AveTemp’)
• The average sunlight recorded (given by ‘AverageSunlight’)
proc sort data=rainfall1 out=rainfall1;
   by city month;
run;
data last;
   set rainfall1;
   by city;
   put city= month= first.city= last.city=;
   if last.city;   /* keep only the last observation for each city */
run;
Log file:
NOTE: There were 17 observations read from the data set WORK.RAINFALL1.
NOTE: The data set WORK.RAINFALL1 has 17 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

data last;
set rainfall1;
by city;
put city= month= first.city= last.city=;
if last.city;
run;

city=Alameda month=1 FIRST.city=1 LAST.city=1
city=Fremont month=1 FIRST.city=1 LAST.city=0
city=Fremont month=2 FIRST.city=0 LAST.city=0
city=Fremont month=3 FIRST.city=0 LAST.city=0
city=Fremont month=4 FIRST.city=0 LAST.city=0
city=Fremont month=5 FIRST.city=0 LAST.city=1
city=Hayward month=1 FIRST.city=1 LAST.city=0
city=Hayward month=2 FIRST.city=0 LAST.city=0
city=Hayward month=3 FIRST.city=0 LAST.city=0
city=Hayward month=4 FIRST.city=0 LAST.city=0
city=Hayward month=5 FIRST.city=0 LAST.city=1
city=Oakland month=1 FIRST.city=1 LAST.city=0
city=Oakland month=2 FIRST.city=0 LAST.city=1
city=Sunnyval month=1 FIRST.city=1 LAST.city=0
city=Sunnyval month=2 FIRST.city=0 LAST.city=0
city=Sunnyval month=3 FIRST.city=0 LAST.city=0
city=Sunnyval month=4 FIRST.city=0 LAST.city=1
NOTE: There were 17 observations read from the data set WORK.RAINFALL1.
NOTE: The data set WORK.LAST has 5 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
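data difference;
   set rainfall1;
   by city;
   /* a city with a single observation has no month-to-month change */
   if first.city and last.city then delete;
   /* change from the previous observation */
   Diff_Rain = rain - lag(rain);
   Diff_AveTemp = AveTemp - lag(AveTemp);
   Diff_AverageSunlight = AverageSunlight - lag(AverageSunlight);
   /* on the first row of a city, LAG returns a value from the previous city, so skip it */
   if not first.city then output;
run;
proc print data=difference;
run;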
Obs   city       month   Rain   AveTemp   AverageSunlight   Diff_Rain   Diff_AveTemp   Diff_AverageSunlight
  1   Fremont      2     4.5     57.6          9.5            -0.7          4.7               0.8
  2   Fremont      3     3.3     59.7         10.1            -1.2          2.1               0.6
  3   Fremont      4     2.2     60.6         11.0            -1.1          0.9               0.9
  4   Fremont      5     1.8     63.9         11.5            -0.4          3.3               0.5
  5   Hayward      2     3.6     57.8         10.1             2.2         -2.6               0.8
  6   Hayward      3     2.0     62.5         10.2            -1.6          4.7               0.1
  7   Hayward      4     4.1     55.2         10.8             2.1         -7.3               0.6
  8   Hayward      5     5.7     53.6         11.2             1.6         -1.6               0.4
  9   Oakland      2     3.4     60.1         10.1             0.3          0.9               0.6
 10   Sunnyval     2     3.4     56.8          9.8            -2.3          2.6               0.4
 11   Sunnyval     3     4.9     55.2         10.3             1.5         -1.6               0.5
 12   Sunnyval     4     5.0     52.9         10.7             0.1         -2.3               0.4
data first_last;
   set rainfall1;
   by city;
   if first.city and last.city then delete;
   /* LAG is executed only on the first and last rows of a city, so on the */
   /* last row it returns the value stored when the first row was read.    */
   if first.city or last.city then do;
      diff_rain = rain - lag(rain);
      diff_temp = avetemp - lag(avetemp);
      diff_sunlight = averagesunlight - lag(averagesunlight);
   end;
   if last.city then output;
run;
proc print data=first_last;
run;
Obs   city       month   Rain   AveTemp   AverageSunlight   diff_rain   diff_temp   diff_sunlight
  1   Fremont      5     1.8     63.9         11.5            -3.4        11.0           2.8
  2   Hayward      5     5.7     53.6         11.2             4.3        -6.8           1.9
  3   Oakland      2     3.4     60.1         10.1             0.3         0.9           0.6
  4   Sunnyval     4     5.0     52.9         10.7            -0.7        -1.3           1.3
• Using a RETAIN statement is one of the best ways to “remember” values from previous observations
• Variables that do not come from SAS data sets are set to a missing value at the start of each iteration of the DATA step
– A RETAIN statement allows you to tell SAS not to do this
data new_data_set;
   set old_data_set;
   by ID;   /* Need to sort by the ID, or whichever variable you are grouping by! */
   if first.ID and last.ID then delete;
   /* The variables named in the RETAIN statement are not set back to */
   /* missing values between iterations.                              */
   retain First_Var_1 First_Var_2 … First_Var_n;
   if first.ID then do;
      First_Var_1 = Var_1;
      First_Var_2 = Var_2;
      …
      First_Var_n = Var_n;
   end;
   if last.ID then do;
      Diff_Var_1 = Var_1 - First_Var_1;
      Diff_Var_2 = Var_2 - First_Var_2;
      …
      Diff_Var_n = Var_n - First_Var_n;
      output;   /* one row per ID, as in the rainfall example */
   end;
   drop First_: ;
run;

• The RETAIN statement ensures your variables are not set back to missing values.
• When the first observation of a group is processed, the retained variables are set to the values of the respective variables.
• When the last observation of a group is processed, the retained first values are subtracted from the respective last values.
proc sort data=rainfall;
by City Month;
run;
data last;
set rainfall;
by City;
put City= Month= First.City= Last.City=;
if last.City;
run;
data first_last;
set rainfall;
by City;
if first.city and last.city then delete;
retain First_Rainfall First_Temp First_Hours;
if first.City then do;
First_Rainfall = Rainfall;
First_Temp = Temp;
First_Hours = Hours;
end;
if last.City then do;
Diff_Rainfall = Rainfall - First_Rainfall;
Diff_Temp = Temp - First_Temp;
Diff_Hours = Hours - First_Hours;
output;
end;
drop First_: ;
run;
Displaying the RETAIN Statement

City        Month   Rainfall   Temp   Hours   Diff_Rainfall   Diff_Temp   Diff_Hours
Fremont       5       1.8      63.9   11.5        -3.4           11.0        2.8
Hayward       5       5.7      53.6   11.2         4.3           -6.8        1.9
Oakland       5       3.6      60.3   10.8         0.5            1.1        1.3
Sunnyvale     4       5.0      52.9   10.7        -0.7           -1.3        1.3
Output from LAG statement:

city        month   Rain   Temp   Sunlight   diff_rain   diff_temp   diff_Sunlight
Fremont       5     1.8    63.9    11.5        -3.4        11.0          2.8
Hayward       5     5.7    53.6    11.2         4.3        -6.8          1.9
Oakland       2     3.4    60.1    10.1         0.3         0.9          0.6
Sunnyvale     4     5.0    52.9    10.7        -0.7        -1.3          1.3
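Output from RETAIN statement:

City        Month   Rainfall   Temp   Hours   Diff_Rainfall   Diff_Temp   Diff_Hours
Fremont       5       1.8      63.9   11.5        -3.4           11.0        2.8
Hayward       5       5.7      53.6   11.2         4.3           -6.8        1.9
Oakland       5       3.6      60.3   10.8         0.5            1.1        1.3
Sunnyvale     4       5.0      52.9   10.7        -0.7           -1.3        1.3

The Oakland rows differ between the two outputs because the input to the RETAIN example contains a month 5 record for Oakland, whereas rainfall1 has only months 1 and 2 for that city.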


Suppose you want to know whether a certain variable value is the maximum across all of your observations.
The RETAIN statement lets us find this easily, because it preserves a variable’s value from the previous iteration.
data Maximums;
   set rainfall;
   retain Max_Rainfall Max_Temp Max_Hours;
   /* MAX ignores missing arguments, so the first row initializes each maximum */
   Max_Rainfall = max(Max_Rainfall, Rainfall);
   Max_Temp = max(Max_Temp, Temp);
   Max_Hours = max(Max_Hours, Hours);
run;
title "Displaying the RETAIN Statement- Finding Maximums";
proc print data=Maximums noobs;
run;
Displaying the RETAIN Statement- Finding Maximums

city        Month   Rainfall   Temp   Hours   Max_Rainfall   Max_Temp   Max_Hours
Alameda       1       5.8      53.6    8.9        5.8          53.6        8.9
Fremont       1       5.2      52.9    8.7        5.8          53.6        8.9
Fremont       2       4.5      57.6    9.5        5.8          57.6        9.5
Fremont       3       3.3      59.7   10.1        5.8          59.7       10.1
Fremont       4       2.2      60.6   11.0        5.8          60.6       11.0
Fremont       5       1.8      63.9   11.5        5.8          63.9       11.5
Hayward       1       1.4      60.4    9.3        5.8          63.9       11.5
Hayward       2       3.6      57.8   10.1        5.8          63.9       11.5
Hayward       3       2.0      62.5   10.2        5.8          63.9       11.5
Hayward       4       4.1      55.2   10.8        5.8          63.9       11.5
Hayward       5       5.7      53.6   11.2        5.8          63.9       11.5
Oakland       1       3.1      59.2    9.5        5.8          63.9       11.5
Oakland       2       3.4      60.1   10.1        5.8          63.9       11.5
Oakland       5       3.6      60.3   10.8        5.8          63.9       11.5
Sunnyvale     1       5.7      54.2    9.4        5.8          63.9       11.5
Sunnyvale     2       3.4      56.8    9.8        5.8          63.9       11.5
Sunnyvale     3       4.9      55.2   10.3        5.8          63.9       11.5
Sunnyvale     4       5.0      52.9   10.7        5.8          63.9       11.5