Lecture 4

Lesson 4 - Topics
• Creating new variables in the data step
• SAS Functions
Creating New Variables
• Direct assignments(formulas):
c = a + b ;
d = 2*a + 3*b + 7*c ;
bmi = weight/(height*height);
• Indirect assignments (if/then/else)
if age < 50 then young = 1; else young = 2;
if income < 15 then tax = 1; else
if income < 25 then tax = 2; else
if income >=25 then tax = 3;
Direct Assignments (Formulas)
• Example
c = a + b ;
So if a = 2 and b = 3, then c = 5.
What if a is missing, what is c? c will be missing.
What if b is missing?
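The question above can be checked with a small inline dataset. This is a minimal sketch; the dataset name `demo` and the three data lines are made up for illustration and are not part of the course data.

```sas
* Hypothetical inline data to show how missing values propagate in a formula;
DATA demo;
  INPUT a b;
  * c is missing whenever a or b is missing;
  c = a + b;
  DATALINES;
2 3
. 3
2 .
;
RUN;

PROC PRINT DATA=demo;
RUN;
```

The first observation gives c = 5. The second and third give a missing c, because a formula with a missing value on the right-hand side returns missing.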
If/then/else Statements
With if-then-else definitions SAS stops
executing after the first true statement
if income < 15 then tax = 1; else
if income < 25 then tax = 2; else
if income >=25 then tax = 3;
What if income is 10?       Tax = 1
What if income is 23?       Tax = 2
What if income is 30?       Tax = 3
What if income is missing?  Tax = 1
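The four cases above can be reproduced with a short DATA step. This is a sketch only; the dataset name `taxdemo` and the inline values are illustrative assumptions.

```sas
* Hypothetical inline data covering the four income cases;
DATA taxdemo;
  INPUT income;
  * SAS stops at the first true condition in the chain;
  IF income < 15 THEN tax = 1; ELSE
  IF income < 25 THEN tax = 2; ELSE
  IF income >=25 THEN tax = 3;
  DATALINES;
10
23
30
.
;
RUN;

PROC PRINT DATA=taxdemo;
RUN;
```

The listing should show tax = 1, 2, 3, and 1. The missing income gets tax = 1 because a missing numeric value compares as less than 15.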
Creating New Variables
Create a new variable with 2 levels, one for
college graduates and one for non-college
graduates.
Program 5
DATA tdata;
INFILE 'C:\SAS_Files\tomhs.data' ;
INPUT
@ 1 ptid $10.
@ 49 educ 1.
@123 sbp12 3. ;
* New variable definitions go after the INPUT statement;
* This way will code missing values to the value 2;
if educ < 7 then grad1 = 2 ; else
if educ >=7 then grad1 = 1 ;
* The next two ways are equivalent and are correct;
if educ < 7 and educ ne . then grad2 = 2; else
if educ >=7 then grad2 = 1;
* IN is a useful operator in SAS ;
if educ IN(1,2,3,4,5,6) then grad3 = 2; else
if educ IN(7,8,9) then grad3 = 1;
PROC FREQ DATA=tdata;
TABLES educ grad1 grad2 grad3 ;
RUN;
                                 Cumulative    Cumulative
educ    Frequency     Percent     Frequency      Percent
--------------------------------------------------------
   1           3        3.03             3         3.03
   3           4        4.04             7         7.07
   4          23       23.23            30        30.30
   5          14       14.14            44        44.44
   6          12       12.12            56        56.57
   7          16       16.16            72        72.73
   8          10       10.10            82        82.83
   9          17       17.17            99       100.00
Frequency Missing = 1
                                  Cumulative    Cumulative
grad1    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
    1          43       43.00            43        43.00
    2          57       57.00           100       100.00
                                  Cumulative    Cumulative
grad2    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
    1          43       43.43            43        43.43
    2          56       56.57            99       100.00
Frequency Missing = 1
                                  Cumulative    Cumulative
grad3    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
    1          43       43.43            43        43.43
    2          56       56.57            99       100.00
Frequency Missing = 1
Note: grad1 has no missing values (57 rather than 56 in level 2) because
the missing value for educ was coded to 2.
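The difference between the three recodes can be seen on a tiny inline dataset. This is a minimal sketch; the dataset name `graddemo` and the three educ values are chosen for illustration and do not come from the TOMHS file.

```sas
* Hypothetical inline data with one low, one high, and one missing educ value;
DATA graddemo;
  INPUT educ;
  * grad1 treats a missing educ as less than 7;
  IF educ < 7 THEN grad1 = 2; ELSE
  IF educ >=7 THEN grad1 = 1;
  * grad2 excludes missing explicitly;
  IF educ < 7 AND educ NE . THEN grad2 = 2; ELSE
  IF educ >=7 THEN grad2 = 1;
  * grad3 lists the valid codes so missing matches neither list;
  IF educ IN(1,2,3,4,5,6) THEN grad3 = 2; ELSE
  IF educ IN(7,8,9) THEN grad3 = 1;
  DATALINES;
5
8
.
;
RUN;

PROC PRINT DATA=graddemo;
RUN;
```

For educ = 5 all three variables are 2, and for educ = 8 all three are 1. For the missing educ, grad1 is 2 while grad2 and grad3 stay missing.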
PROC FREQ DATA=tdata;
TABLES educ*grad1 /MISSING NOCUM NOPERCENT NOROW
NOCOL;
TITLE 'Use Crosstabulation to Verify Recoding';
RUN;
Table of educ by grad1

educ       grad1
Frequency |      1 |      2 |  Total
----------+--------+--------+-------
        . |      0 |      1 |      1
----------+--------+--------+-------
        1 |      0 |      3 |      3
----------+--------+--------+-------
        3 |      0 |      4 |      4
----------+--------+--------+-------
        4 |      0 |     23 |     23
----------+--------+--------+-------
        5 |      0 |     14 |     14
----------+--------+--------+-------
        6 |      0 |     12 |     12
----------+--------+--------+-------
        7 |     16 |      0 |     16
----------+--------+--------+-------
        8 |     10 |      0 |     10
----------+--------+--------+-------
        9 |     17 |      0 |     17
----------+--------+--------+-------
Total           43       57      100

This shows that the missing value for educ got
assigned a value of 2
* Recode sbp12 into 3 levels;
if sbp12 = .   then sbp12c = . ; else
if sbp12 < 120 then sbp12c = 1 ; else
if sbp12 < 140 then sbp12c = 2 ; else
if sbp12 >=140 then sbp12c = 3 ;
With if-then-else definitions SAS stops
executing after the first true statement
Values < 120 will be assigned value of 1
Values 120-139 will be assigned value of 2
Values >=140 will be assigned value of 3
Missing values will be assigned to missing
PROC FREQ DATA=tdata;
TABLES sbp12c sbp12;
RUN;
OUTPUT
                                   Cumulative    Cumulative
sbp12c    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
     1          36       39.13            36        39.13
     2          43       46.74            79        85.87
     3          13       14.13            92       100.00
Frequency Missing = 8
                                  Cumulative    Cumulative
sbp12    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
   93           1        1.09             1         1.09
   94           1        1.09             2         2.17
  101           1        1.09             3         3.26
  104           1        1.09             4         4.35
  105           1        1.09             5         5.43
(more values)
  147           1        1.09            87        94.57
  148           1        1.09            88        95.65
  149           1        1.09            89        96.74
  153           1        1.09            90        97.83
  154           1        1.09            91        98.91
  158           1        1.09            92       100.00
Frequency Missing = 8
* Easy but costly error to make;
if sbp12 = .   then sbp12c = . ; else
if sbp12 < 120 then sbp12c = 1 ; else
if sbp12 < 140 then sbp12  = 2 ; else
if sbp12 >=140 then sbp12c = 3 ;
PROC FREQ DATA=tdata;
TABLES sbp12c;
RUN;
How come no values of 2 and why so many missing?
The FREQ Procedure
                                   Cumulative    Cumulative
sbp12c    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
     1          36       73.47            36        73.47
     3          13       26.53            49       100.00
Frequency Missing = 51
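One way to see what went wrong is to run the same statements on a few known values. This is a sketch under stated assumptions; the dataset name `typo_demo` and the four inline values are hypothetical.

```sas
* Hypothetical inline data - the third IF assigns to sbp12 instead of sbp12c;
DATA typo_demo;
  INPUT sbp12;
  IF sbp12 = .   THEN sbp12c = . ; ELSE
  IF sbp12 < 120 THEN sbp12c = 1 ; ELSE
  IF sbp12 < 140 THEN sbp12  = 2 ; ELSE
  IF sbp12 >=140 THEN sbp12c = 3 ;
  DATALINES;
110
130
150
.
;
RUN;

PROC PRINT DATA=typo_demo;
RUN;
```

For the value 130 the original sbp12 is overwritten with 2 and sbp12c is never assigned, so it keeps its initial missing value. In the lecture output this accounts for the 43 extra missing values (51 instead of 8) and the absence of level 2.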
Important Facts When Creating New Variables
1. New variables are initialized to missing
2. A numeric missing value compares as less than any nonmissing number
   if var < value (true if var is missing)
3. Reference missing values for numeric variables as .
   if sbp = . then ... (or if missing(sbp))
4. Reference missing values for character variables as ' '
   if clinic = ' ' then ...
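These facts can be checked in a one-observation DATA step. This is a minimal sketch; the dataset name `missing_demo` and the variables `low`, `num_missing`, `char_missing`, and `newvar` are invented for the illustration.

```sas
* Hypothetical one-row dataset to check how missing values behave;
DATA missing_demo;
  sbp = .;
  clinic = ' ';
  * low is 1 because a missing number compares as less than 120;
  low = (sbp < 120);
  * MISSING returns 1 for a missing value and 0 otherwise;
  num_missing = MISSING(sbp);
  * character missing is tested against a blank;
  char_missing = (clinic = ' ');
  * newvar is never assigned here so it keeps its initial missing value;
  IF sbp > 200 THEN newvar = 1;
RUN;

PROC PRINT DATA=missing_demo;
RUN;
```

The listing should show low, num_missing, and char_missing all equal to 1, and newvar as missing.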
SAS Handling of Missing Data When Creating New Variables
• Direct assignments(formulas):
c = a + b ;
d = 2*a + 3*b + 7*c ;
bmi = weight/(height*height);
If any variable on the right-hand side is
missing then the new variable will be missing
• Indirect assignments
if age < 50 then young = 1; else young=2;
New variables are initialized to missing but
may be given a value if any of the IF
statements are true
What Value to Set New Variable
if age < 20 then teenager = 1; else
if age >=20 then teenager = 2;
if age < 20 then teenager = 1; else
if age >=20 then teenager = 0;
if age < 20 then teenager = 'YES'; else
if age >=20 then teenager = 'NO';
* Program 6 SAS Functions ;
DATA example;
INFILE 'C:\SAS_Files\tomhs.data' ;
INPUT @058 height 4.1
@085 weight 5.1
@172 ursod 3.
@236 (se1-se10) (1.0 +1);
bmi = (weight*703.0768)/(height*height);
rbmi1 = ROUND(bmi,1);
lursod = LOG(ursod);
seavg = MEAN (OF se1-se10);
semax = MAX (OF se1-se10);
semin = MIN (OF se1-se10);
* Use of dash notation ;
seavg = MEAN (OF se1-se10);
This is the same as
seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
The OF is very important. Otherwise SAS thinks
you are subtracting se10 from se1.
To use this notation the ROOT of the name must
be the same.
* Two ways of computing average ;
seavg = MEAN (se1,se2,se3,se4,se5,se6,se7,se8,se9,se10);
Versus
seavg = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;
The MEAN function computes the average of the nonmissing values. The result
is missing only if all values are missing.
The + formula requires all values to be nonmissing;
otherwise the result will be missing.
if N(of se1-se10) > 5 then seavg = MEAN(of se1-se10);
What does this statement do?
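The two ways of averaging, and the N() condition, can be compared on a few made-up rows. This is a sketch only; the dataset name `sedemo`, the variable names `avg_mean`, `avg_plus`, `nvalid`, and `avg_n`, and the inline scores are assumptions for illustration.

```sas
* Hypothetical inline side effect scores with 10, 8, and 3 nonmissing values;
DATA sedemo;
  INPUT se1-se10;
  * average of the nonmissing values;
  avg_mean = MEAN(OF se1-se10);
  * missing if any one of the ten values is missing;
  avg_plus = (se1+se2+se3+se4+se5+se6+se7+se8+se9+se10)/10;
  * N counts the nonmissing arguments;
  nvalid = N(OF se1-se10);
  * average only when more than 5 of the 10 scores are present;
  IF N(OF se1-se10) > 5 THEN avg_n = MEAN(OF se1-se10);
  DATALINES;
1 1 2 1 1 1 1 1 1 1
1 . 2 1 1 . 1 1 1 3
. . . . . 1 2 . . 1
;
RUN;

PROC PRINT DATA=sedemo;
RUN;
```

In the first row all three averages are 1.1. In the second row (8 nonmissing values) avg_mean and avg_n are 1.375 while avg_plus is missing. In the third row only 3 values are present, so avg_mean is about 1.33 but avg_n stays missing because the N() condition is false.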
PROC PRINT DATA = example (OBS=10);
VAR bmi rbmi1 seavg semin semax ;
TITLE 'Listing of Selected Data for 10 Patients ';
RUN;
PROC FREQ DATA = example;
TABLES semax;
TITLE 'Distribution of Worse Side Effect Value';
TITLE2 'Side Effect Scores Range from 1 to 4';
RUN;
ods graphics on;
PROC UNIVARIATE DATA = example ;
VAR ursod lursod;
QQPLOT ursod lursod;
TITLE 'Quantile Plots for Urine Sodium Data';
RUN;
Listing of Selected Data for 10 Patients

Obs        bmi    rbmi1    seavg    semin    semax
  1    28.2620       28      1.1        1        2
  2    35.9963       36      1.0        1        1
  3    27.0489       27      1.0        1        1
  4    28.2620       28      1.1        1        2
  5    33.2008       33      1.0        1        1
  6    27.7691       28      1.2        1        2
  7    32.6040       33      1.0        1        1
  8    22.4057       22      1.2        1        2
  9    37.2037       37      1.1        1        2
 10    33.1717       33      1.7        1        3
Distribution of Worse Side Effect Value
Side Effect Scores Range from 1 to 4
The FREQ Procedure
                                  Cumulative    Cumulative
semax    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
    1          33       33.00            33        33.00
    2          52       52.00            85        85.00
    3          13       13.00            98        98.00
    4           2        2.00           100       100.00
2 patients had at least 1
severe side effect
[Q-Q plots of ursod and lursod]
The log-transformed value (lursod) shows a
more nearly linear pattern