Lesson 5 (Oct 12)

Lesson 5 - Topics
• Creating new variables in the data step
• SAS Functions
• Programs 5-6 in course notes
• LSB 3:1-6,11-12
Creating New Variables
• Direct assignments (formulas):
c = a + b ;
d = 2*a + 3*b + 7*c ;
bmi = weight/(height*height);
• Indirect assignments (if/then/else)
if age < 50 then young = 1; else young = 2;
if income < 15 then tax = 1; else
if income < 25 then tax = 2; else
if income >=25 then tax = 3;
Direct Assignments (Formulas)
• Example
c = a + b ;
So if a = 2 and b = 3, then c = 5.
What if a is missing, what is c? c will be missing.
What if b is missing?
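A minimal sketch of how a formula behaves with missing inputs. The dataset name demo and the three rows of values are made up for illustration and are not from the course data.

```sas
/* Illustrative data only: one complete row, one with a missing, one with b missing */
DATA demo;
  INPUT a b;
  c = a + b;   /* c is missing whenever a or b is missing */
  DATALINES;
2 3
. 3
2 .
;
RUN;

PROC PRINT DATA=demo;
RUN;
```

Only the first row gets a value for c (5). In the other two rows c is missing, and the SAS log notes that missing values were generated by an operation on missing values.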
If/then/else Statements
With if-then-else definitions SAS stops
executing after the first true statement
if income < 15 then tax = 1; else
if income < 25 then tax = 2; else
if income >=25 then tax = 3;
What if income is 10?        Tax = 1
What if income is 23?        Tax = 2
What if income is 30?        Tax = 3
What if income is missing?   Tax = 1
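The four cases above can be checked with a small test dataset. This is a sketch; the dataset name taxdemo is made up, and the income values are the ones asked about on the slide.

```sas
DATA taxdemo;
  INPUT income;
  if income < 15 then tax = 1; else
  if income < 25 then tax = 2; else
  if income >= 25 then tax = 3;
  DATALINES;
10
23
30
.
;
RUN;

PROC PRINT DATA=taxdemo;
RUN;
```

The last row has income missing. Because a missing value compares as less than any number, the first condition is true and tax is set to 1.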
Creating New Variables
Create a new variable with 2 levels, one for
college graduates and one for non-college
graduates.
Program 5
DATA tdata;
INFILE 'C:\SAS_Files\tomhs.data' ;
INPUT
@ 1 ptid $10.
@ 49 educ 1.
@123 sbp12 3. ;
* New variable definitions go after the INPUT statement;
* This way will code missing values to the value 2;
if educ < 7 then grad1 = 2 ; else
if educ >=7 then grad1 = 1 ;
* The next two ways are equivalent and are correct;
if educ < 7 and educ ne . then grad2 = 2; else
if educ >=7 then grad2 = 1;
* IN is a special operator in SAS ;
if educ IN(1,2,3,4,5,6) then grad3 = 2; else
if educ IN(7,8,9) then grad3 = 1;
PROC FREQ DATA=tdata;
TABLES educ grad1 grad2 grad3 ;
                                Cumulative    Cumulative
educ    Frequency    Percent    Frequency      Percent
--------------------------------------------------------
   1            3       3.03            3         3.03
   3            4       4.04            7         7.07
   4           23      23.23           30        30.30
   5           14      14.14           44        44.44
   6           12      12.12           56        56.57
   7           16      16.16           72        72.73
   8           10      10.10           82        82.83
   9           17      17.17           99       100.00
Frequency Missing = 1
                                 Cumulative    Cumulative
grad1    Frequency    Percent    Frequency      Percent
----------------------------------------------------------
    1           43      43.00           43        43.00
    2           57      57.00          100       100.00
                                 Cumulative    Cumulative
grad2    Frequency    Percent    Frequency      Percent
----------------------------------------------------------
    1           43      43.43           43        43.43
    2           56      56.57           99       100.00
Frequency Missing = 1
                                 Cumulative    Cumulative
grad3    Frequency    Percent    Frequency      Percent
----------------------------------------------------------
    1           43      43.43           43        43.43
    2           56      56.57           99       100.00
Frequency Missing = 1
Note on grad1: the missing value for educ was coded to 2, so grad1 has
57 in level 2 and no missing values.
PROC FREQ DATA=tdata;
TABLES educ*grad1 /MISSING NOCUM NOPERCENT NOROW
NOCOL;
TITLE 'Use Crosstabulation to Verify Recoding';
RUN;
Table of educ by grad1

educ        grad1
Frequency |      1 |      2 |  Total
----------+--------+--------+
        . |      0 |      1 |      1
----------+--------+--------+
        1 |      0 |      3 |      3
----------+--------+--------+
        3 |      0 |      4 |      4
----------+--------+--------+
        4 |      0 |     23 |     23
----------+--------+--------+
        5 |      0 |     14 |     14
----------+--------+--------+
        6 |      0 |     12 |     12
----------+--------+--------+
        7 |     16 |      0 |     16
----------+--------+--------+
        8 |     10 |      0 |     10
----------+--------+--------+
        9 |     17 |      0 |     17
----------+--------+--------+
Total           43       57      100
This shows that the
missing value for educ got
assigned a value of 2
* Recode sbp12 into 3 levels;
if sbp12 = .    then sbp12c = . ; else
if sbp12 < 120  then sbp12c = 1 ; else
if sbp12 < 140  then sbp12c = 2 ; else
if sbp12 >= 140 then sbp12c = 3 ;
With if-then-else definitions SAS stops
executing after the first true statement
Values < 120 will be assigned value of 1
Values 120-139 will be assigned value of 2
Values >=140 will be assigned value of 3
Missing values will be assigned to missing
PROC FREQ DATA=tdata;
TABLES sbp12c sbp12;
RUN;
OUTPUT
                                  Cumulative    Cumulative
sbp12c    Frequency    Percent    Frequency      Percent
-----------------------------------------------------------
     1           36      39.13           36        39.13
     2           43      46.74           79        85.87
     3           13      14.13           92       100.00
Frequency Missing = 8
                                 Cumulative    Cumulative
sbp12    Frequency    Percent    Frequency      Percent
----------------------------------------------------------
   93            1       1.09            1         1.09
   94            1       1.09            2         2.17
  101            1       1.09            3         3.26
  104            1       1.09            4         4.35
  105            1       1.09            5         5.43
(more values)
  147            1       1.09           87        94.57
  148            1       1.09           88        95.65
  149            1       1.09           89        96.74
  153            1       1.09           90        97.83
  154            1       1.09           91        98.91
  158            1       1.09           92       100.00
Frequency Missing = 8
* Easy but costly error to make;
if sbp12 = .    then sbp12c = . ; else
if sbp12 < 120  then sbp12c = 1 ; else
if sbp12 < 140  then sbp12  = 2 ; else
if sbp12 >= 140 then sbp12c = 3 ;
PROC FREQ DATA=tdata;
TABLES sbp12c;
RUN;
How come no values
of 2 and why so
many missing?
The FREQ Procedure
                                  Cumulative    Cumulative
sbp12c    Frequency    Percent    Frequency      Percent
-------------------------------------------------------------
     1           36      73.47           36        73.47
     3           13      26.53           49       100.00
Frequency Missing = 51
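One reading of this output: the third statement assigns to sbp12 instead of sbp12c. For the 43 patients with values 120-139, sbp12c is never set and stays missing (43 + 8 = 51 missing), and sbp12 itself is overwritten with 2. The sketch below reproduces the effect on four made-up values. The dataset name errdemo and the variable orig are illustrative additions.

```sas
DATA errdemo;
  INPUT sbp12;
  orig = sbp12;   /* keep a copy so the overwrite is visible */
  if sbp12 = .    then sbp12c = . ; else
  if sbp12 < 120  then sbp12c = 1 ; else
  if sbp12 < 140  then sbp12  = 2 ; else   /* typo: should be sbp12c */
  if sbp12 >= 140 then sbp12c = 3 ;
  DATALINES;
110
130
150
.
;
RUN;

PROC PRINT DATA=errdemo;
  VAR orig sbp12 sbp12c;
RUN;
```

For the row with orig = 130, sbp12 becomes 2 and sbp12c is left missing. SAS does not flag this as an error, which is why checking the frequency table matters.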
Important Facts When Creating New Variables
1. New variables are initialized to missing
2. Missing values are < any value
   if var < value (true if var is missing)
3. Reference missing values for numeric variables as .
4. Reference missing values for character variables as ' '
if sbp = . then ... (or if missing(sbp))
if clinic = ' ' then ...
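A short sketch of facts 2-4, using a one-row dataset built by assignment. The dataset name missdemo and the flag variable names are made up for illustration.

```sas
DATA missdemo;
  sbp    = .;     /* numeric missing */
  clinic = ' ';   /* character missing */
  num_is_missing = missing(sbp);     /* 1 when sbp is missing */
  chr_is_missing = (clinic = ' ');   /* 1 when clinic is blank */
  low_flag       = (sbp < 120);      /* also 1: missing is less than any number */
RUN;

PROC PRINT DATA=missdemo;
RUN;
```

All three flags come out as 1. The last one is the trap from fact 2: a comparison such as sbp < 120 is true for a missing sbp unless missing is handled first.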
SAS Handling of Missing Data When Creating
New Variables
• Direct assignments (formulas):
c = a + b ;
d = 2*a + 3*b + 7*c ;
bmi = weight/(height*height);
If any variable on the right-hand side is
missing then the new variable will be missing
• Indirect assignments
if age < 50 then young = 1; else young=2;
New variables are initialized to missing but
may be given a value if any of the IF
statements are true
Checks you can make to be sure new variables
are created correctly
1. Display original and new variables.
PROC PRINT DATA=tdata (OBS=20);
VAR educ college ;
2. Run PROC MEANS on original and new
variable. Make sure both variables have
same number of missing values.
PROC MEANS DATA=tdata;
VAR educ college;
3. Run PROC FREQ on original and new
variable.
PROC FREQ DATA=tdata;
TABLES educ college educ*college;
What Value to Set New Variable
if age < 20 then teenager = 1; else
if age >=20 then teenager = 2;
if age < 20 then teenager = 1; else
if age >=20 then teenager = 0;
if age < 20 then teenager = 'YES'; else
if age >=20 then teenager = 'NO';
If-then-do statements
* Conditionally execute several statements;
* Create indicator variables for race;
* Make sure race variable not missing;
if race ne . then do;
  white = 0;
  black = 0;
  asian = 0;
  other = 0;
  if race = 1 then white = 1;
  if race = 2 then black = 1;
  if race = 3 then asian = 1;
  if race = 4 then other = 1;
end;
DO LOOPS WITH ARRAYS
- Used to Shorten Code
- Used when repeating same code
- Used with DO/END loop
ARRAY wtlb(3) wt1 wt2 wt3;
ARRAY wtkg(3) newwt1 newwt2 newwt3;
DO index = 1 to 3;
wtkg(index) = wtlb(index) / 2.2;
END;
/* same as the following code
Newwt1 = wt1 / 2.2 ;
Newwt2 = wt2 / 2.2;
Newwt3 = wt3 / 2.2;
*************************************/
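A runnable sketch of the array loop above on one made-up row of weights. The dataset name wtdemo, the data values, and the DROP statement are illustrative additions.

```sas
DATA wtdemo;
  INPUT wt1 wt2 wt3;
  ARRAY wtlb(3) wt1 wt2 wt3;
  ARRAY wtkg(3) newwt1 newwt2 newwt3;
  DO index = 1 to 3;
    wtkg(index) = wtlb(index) / 2.2;   /* pounds to kilograms */
  END;
  DROP index;   /* the loop counter is not needed in the output */
  DATALINES;
220 110 165
;
RUN;

PROC PRINT DATA=wtdemo;
RUN;
```

With these inputs newwt1-newwt3 come out at about 100, 50, and 75. Without the DROP statement, index would be kept as an extra variable in the dataset.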
* Program 6 SAS Functions ;
DATA example;
INFILE 'C:\SAS_Files\tomhs.data' ;
INPUT @058 height 4.1
@085 weight 5.1
@172 ursod 3.
@236 (se1-se10) (1.0 + 1);
bmi    = (weight*703.0768)/(height*height);
rbmi1  = ROUND(bmi,1);
rbmi2  = ROUND(bmi,.1);
lursod = LOG(ursod);
seavg = MEAN (OF se1-se10);
semax = MAX (OF se1-se10);
semin = MIN (OF se1-se10);
* Use of dash notation ;
seavg = MEAN (OF se1-se10);
This is the same as
seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
The OF is very important. Otherwise SAS thinks
you are subtracting se10 from se1.
To use this notation the ROOT of the name must
be the same.
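A small sketch of the difference OF makes, using three made-up scores instead of ten. The dataset name ofdemo and the variable names with_of and without_of are illustrative.

```sas
DATA ofdemo;
  INPUT se1-se3;
  with_of    = MEAN(OF se1-se3);   /* average of se1, se2, se3 */
  without_of = MEAN(se1-se3);      /* average of one value: se1 minus se3 */
  DATALINES;
1 2 4
;
RUN;

PROC PRINT DATA=ofdemo;
RUN;
```

Here with_of is the average of 1, 2, and 4, while without_of is the single value 1 - 4 = -3. SAS gives no error in the second case, so the mistake is easy to miss.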
* Two ways of computing average ;
seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
Versus
seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;
Using the MEAN function computes the average of the nonmissing values. The
result is missing only if all values are missing.
Using the + formula requires all values to be nonmissing;
otherwise the result will be missing.
if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10);
What does this statement do?
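One way to explore the question is to run the statement on two made-up patients, one with 8 nonmissing scores and one with only 2. The dataset name ndemo and the variable nvalid are illustrative additions.

```sas
DATA ndemo;
  INPUT se1-se10;
  nvalid = N(OF se1-se10);   /* number of nonmissing values among se1-se10 */
  if N(OF se1-se10) > 5 then seavg = MEAN(OF se1-se10);
  DATALINES;
1 2 1 1 . . 1 3 1 1
1 . . . . . 2 . . .
;
RUN;

PROC PRINT DATA=ndemo;
  VAR nvalid seavg;
RUN;
```

The first patient has nvalid = 8, so seavg is computed from the 8 nonmissing scores (11/8 = 1.375). The second has nvalid = 2, so the condition is false and seavg stays missing.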
* Compute 10 new variables, 100 if se is
present and 0 if not;
ARRAY se (10) se1-se10;
ARRAY hse(10) hse1-hse10;   * hse1-hse10 are the new variables;
DO senumber = 1 to 10;
if se(senumber) = 1 then hse(senumber) = 0; else
if se(senumber) in(2,3,4) then hse(senumber) = 100;
END;
*** For senumber = 1 the code is *************
if se1 = 1 then hse1 = 0; else
if se1 in(2,3,4) then hse1 = 100;
PROC PRINT DATA = example (OBS=10);
VAR bmi rbmi1 rbmi2 seavg semin semax ;
TITLE 'Listing of Selected Data for 10 Patients ';
RUN;
PROC FREQ DATA = example;
TABLES semax;
TITLE 'Distribution of Worse Side Effect Value';
TITLE2 'Side Effect Scores Range from 1 to 4';
RUN;
PROC MEANS DATA = example;
VAR hse1-hse10;
TITLE 'Percent of Patients With Condition by Condition';
RUN;
ods graphics on;
PROC UNIVARIATE DATA = example ;
VAR ursod lursod;
QQPLOT ursod lursod;
TITLE 'Quantile Plots for Urine Sodium Data';
RUN;
Listing of Selected Data for 10 Patients
Obs        bmi    rbmi1    rbmi2    seavg    semin    semax
  1    28.2620       28     28.3      1.1        1        2
  2    35.9963       36     36.0      1.0        1        1
  3    27.0489       27     27.0      1.0        1        1
  4    28.2620       28     28.3      1.1        1        2
  5    33.2008       33     33.2      1.0        1        1
  6    27.7691       28     27.8      1.2        1        2
  7    32.6040       33     32.6      1.0        1        1
  8    22.4057       22     22.4      1.2        1        2
  9    37.2037       37     37.2      1.1        1        2
 10    33.1717       33     33.2      1.7        1        3
Distribution of Worse Side Effect Value
Side Effect Scores Range from 1 to 4
The FREQ Procedure
                                 Cumulative    Cumulative
semax    Frequency    Percent    Frequency      Percent
---------------------------------------------------------
    1           33      33.00           33        33.00
    2           52      52.00           85        85.00
    3           13      13.00           98        98.00
    4            2       2.00          100       100.00
2 patients had at least 1
severe side effect
Percent of Patients With Condition by Condition Type
The MEANS Procedure
Variable      N          Mean       Std Dev    Minimum        Maximum
----------------------------------------------------------------------
hse1        100    12.0000000    32.6598632          0    100.0000000
hse2        100    21.0000000    40.9360181          0    100.0000000
hse3        100     8.0000000    27.2659924          0    100.0000000
hse4        100    13.0000000    33.7997669          0    100.0000000
hse5        100    10.0000000    30.1511345          0    100.0000000
hse6        100    30.0000000    46.0566186          0    100.0000000
hse7        100    16.0000000    36.8452949          0    100.0000000
hse8        100    31.0000000    46.4823199          0    100.0000000
hse9        100     7.0000000    25.6432400          0    100.0000000
hse10       100    14.0000000    34.8735088          0    100.0000000
These means are percent
of patients with se
Log transformed value shows a
better linear pattern