Week 11+ “IML” Class Activities
File:
week-11-19oct05.doc
Directory (hp/compaq):
C:\baileraj\Classes\Fall 2005\sta402\handouts
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2004\sta402\handouts\week-1111nov04.doc
SAS/IML Programming
* Basic matrix concepts: rows, columns, scalars
* matrix operators
* subscripting
* matrix functions
* creating matrices from data sets and vice versa
* sample applications
Acknowledgments:
Thanks to Jim Deddens (UC) for sharing handouts that served
as the starting place for a number of the illustrations
contained herein (Deddens Handouts #17-22)
References:
SAS/IML User's Guide on SAS OnlineDoc, Version 8
(http://www.units.muohio.edu/doc/sassystem/sasonlinedocv8/sasdoc/sashtml/iml/index.htm)
Introductory remarks
* IML = Interactive Matrix Language. It is a programming
language for matrix manipulations; the flow of operations can
be controlled with statements such as DO/END, START/FINISH and
IF-THEN/ELSE. Many built-in matrix manipulation functions
and subroutines are available.
* basic object = data matrix (so can perform operations on
entire matrices)
* can run interactively or store statements in a module
* can build functions/subroutines using these modules
Matrix Definitions and vocabulary
* Matrices have entries of the same type (either all numeric or all character; note the difference
from a SAS data set, which can include variables of either type).
* Matrix names are 1 to 8 characters (begin with a letter or underscore and continue with letters,
numbers or underscores)
* A matrix possesses the property of DIMENSION (# rows and # columns)
* ROW VECTOR = 1xn matrix
* COLUMN VECTOR = mx1 matrix
* SCALAR = 1x1 matrix
Example 1: Basic matrix definition + subscripting
options nodate nocenter formdlim="-";
PROC IML;
*makes a 2x3 matrix;
C = {1 2 3,4 5 6};
print '2x3 example matrix C = {1 2 3,4 5 6} =' C;
*select 2nd row;
C_r2 = C[2,];
print '2nd row of C = C[2,] =' C_r2;
*select 3rd column;
C_c3 = C[,3];
print '3rd column of C = C[,3] =' C_c3;
*select the (2,3) element of C;
C23 = C[2,3];
print '(2,3) element of C = C[2,3] =' C23;
*makes a 1x3 matrix by summing the columns;
C_csum=C[+,];
print '1x3 column sums of C = C[+,] =' C_csum;
*makes a 2x1 matrix by summing the rows;
C_rsum=C[,+];
print '2x1 row sums of C = C[,+] =' C_rsum;
run;
SAS OUTPUT . . .

2x3 example matrix C = {1 2 3,4 5 6} =
C
1  2  3
4  5  6

2nd row of C = C[2,] =
C_R2
4  5  6

3rd column of C = C[,3] =
C_C3
3
6

(2,3) element of C = C[2,3] =
C23
6

1x3 column sums of C = C[+,] =
C_CSUM
5  7  9

2x1 row sums of C = C[,+] =
C_RSUM
6
15
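The subscript reduction operators used above (C[+,] and C[,+]) amount to summing down columns and across rows. As a cross-check outside SAS, here is a small Python sketch; it is not part of the original handout, and the helper names col_sums and row_sums are made up for illustration.

```python
# Python analogue of the IML reductions C[+,] and C[,+] (illustrative only).
C = [[1, 2, 3],
     [4, 5, 6]]

def col_sums(m):
    """Sum down each column, like C[+,] in IML (a 1 x ncol result)."""
    return [sum(col) for col in zip(*m)]

def row_sums(m):
    """Sum across each row, like C[,+] in IML (an nrow x 1 result)."""
    return [sum(row) for row in m]

print(col_sums(C))   # column sums
print(row_sums(C))   # row sums
print(C[1][2])       # the (2,3) element; Python indexes from 0
```

The two printed lists should agree with C_CSUM and C_RSUM in the SAS output.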
Example 2: Basic IML code and matrix definitions
options nodate nocenter formdlim="-";
PROC IML;
*makes a 2x3 matrix;
C = {1 2 3,4 5 6};
print '2x3 example matrix C = {1 2 3,4 5 6} =' C;
*makes a 1x3 matrix by summing the columns;
C1=C[+,];
print '1x3 column sums of C = C[+,] =' C1;
*makes a 2x1 matrix by summing the rows;
C2=C[,+];
print '2x1 row sums of C = C[,+] =' C2;
*makes a matrix out of second column of C;
F = C[,2];
print 'extract 2nd column of C into new vector (F) = C[,2] =' F;
*puts second column of c on diagonal;
D = DIAG( C[,2] );
print 'put 2nd column of C into a diagonal matrix (D) = DIAG(C[,2]) =' D;
*makes a vector out of the diagonal;
CC= VECDIAG(D);
print 'convert diagonal (of D) into vector (CC) = VECDIAG(D) =' CC;
*exponentiates each entry;
DD = EXP(D);
print 'exponentiate D yielding DD = EXP(D) =' DD;
*puts c next to itself - column binds C with itself;
E = C || C;
print 'Column bind C with itself yielding E = C||C =' E;
*puts a row of 2's below C - row bind ;
F = C // SHAPE(2,1,3);
print "Row bind C with vector of 2's (F) = C // SHAPE(2,1,3) =" F;
*creates a 3x2 block matrix in which every block is C (6x6 overall);
K = REPEAT(C,3,2);
print '6x6 matrix = ' K;
*raises each entry of columns 2 & 3 of C to the third power then multiplies by
3 and adds 3;
G = 3+3*(C[,2:3]##3);
print '3 + 3*(col2&3)^3 (G) = ' G;
*raises each entry of C to itself;
H = C ## C;
print 'raise C elements to itself (H) = C##C =' H;
*multiplies each entry of C by itself;
J = C # C;
print 'elementwise multiplication of C with itself (J) = C#C =' J;
quit;
SAS OUTPUT . . .

2x3 example matrix C = {1 2 3,4 5 6} =
C
1  2  3
4  5  6

1x3 column sums of C = C[+,] =
C1
5  7  9

2x1 row sums of C = C[,+] =
C2
6
15

extract 2nd column of C into new vector (F) = C[,2] =
F
2
5

put 2nd column of C into a diagonal matrix (D) = DIAG(C[,2]) =
D
2  0
0  5

convert diagonal (of D) into vector (CC) = VECDIAG(D) =
CC
2
5

exponentiate D yielding DD = EXP(D) =
DD
7.3890561          1
        1  148.41316

Column bind C with itself yielding E = C||C =
E
1  2  3  1  2  3
4  5  6  4  5  6

Row bind C with vector of 2's (F) = C // SHAPE(2,1,3) =
F
1  2  3
4  5  6
2  2  2

6x6 matrix =
K
1  2  3  1  2  3
4  5  6  4  5  6
1  2  3  1  2  3
4  5  6  4  5  6
1  2  3  1  2  3
4  5  6  4  5  6

3 + 3*(col2&3)^3 (G) =
G
 27   84
378  651

raise C elements to itself (H) = C##C =
H
  1     4     27
256  3125  46656

elementwise multiplication of C with itself (J) = C#C =
J
 1   4   9
16  25  36
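To make the difference between the elementwise operators # and ## and the binding operators || and // concrete, the following Python sketch rebuilds a few of the Example 2 matrices with list comprehensions. It is only an illustration written for this handout, not IML code; the variable names simply mirror the IML ones.

```python
# Python analogue of a few IML operators from Example 2 (illustrative only).
C = [[1, 2, 3],
     [4, 5, 6]]

# C # C : elementwise product
J = [[a * a for a in row] for row in C]
# C ## C : each element raised to the power of the matching element
H = [[a ** a for a in row] for row in C]
# 3 + 3*(C[,2:3]##3) : columns 2 and 3 cubed, times 3, plus 3
G = [[3 + 3 * a ** 3 for a in row[1:3]] for row in C]
# C || C : column bind
E = [row + row for row in C]
# C // SHAPE(2,1,3) : row bind with a 1x3 row of 2s
F = C + [[2, 2, 2]]

for name, m in (("J", J), ("H", H), ("G", G), ("E", E), ("F", F)):
    print(name, m)
```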
Example 3 More IML Matrix definitions
proc iml;
c1 = shape(1,5,1); * using shape function;
c2 = {1,1,1,1,1};
c3 = T({1 1 1 1 1}); * transpose matrix;
c4 = T({[5] 1}) ;                * using repetition factors;
print 'shape(1,5,1) = ' c1;
print '{1,1,1,1,1} = ' c2;
print 'T({1 1 1 1 1}) = ' c3;
print 'T({[5] 1}) = ' c4;
quit;
shape(1,5,1) =
C1
1
1
1
1
1

{1,1,1,1,1} =
C2
1
1
1
1
1

T({1 1 1 1 1}) =
C3
1
1
1
1
1

T({[5] 1}) =
C4
1
1
1
1
1
Example 4: creating matrices from data sets and vice versa
libname mydat 'D:\baileraj\Classes\Fall 2003\sta402\data';
proc iml;
/* read SAS data in IML */
use mydat.nitrofen;
read all var { total conc } into nitro;
/* alternative coding */
use mydat.nitrofen var{ total conc };
read all into nitro2;
nitro = nitro || nitro[,2]##2;     * adding column with conc^2;
* add column with centered concentration;
nitro2 = nitro2 || (nitro2[,2]- nitro2[+,2]/nrow(nitro2)) ;
nitro2 = nitro2 || nitro2[,3]##2;  * adding column with centered conc^2;
show names;                        * show matrices constructed in IML;
*print nitro;
*print nitro2;
create n2 from nitro2;             * creates SAS data set n2 from matrix nitro2;
append from nitro2;
/* a little graphing in IML */
call pgraf(nitro[,2:1],'*','Nitrofen concentration', 'Number of young',
'Plot of #young vs. conc');
quit;
proc print data=n2;
title 'print of data constructed in IML';
run;
[Figure: PGRAF character scatter plot titled "Plot of #young vs. conc"; x-axis: Nitrofen concentration (0 to 320); y-axis: Number of young (0 to 40)]
print of data constructed in IML

Obs  COL1  COL2  COL3   COL4
  1    27     0  -157  24649
  2    32     0  -157  24649
  3    34     0  -157  24649
  4    33     0  -157  24649
  5    36     0  -157  24649
  6    34     0  -157  24649
  7    33     0  -157  24649
  8    30     0  -157  24649
  9    24     0  -157  24649
 10    31     0  -157  24649
 11    33    80   -77   5929
 12    33    80   -77   5929
 13    35    80   -77   5929
 14    33    80   -77   5929
 15    36    80   -77   5929
 16    26    80   -77   5929
 17    27    80   -77   5929
 18    31    80   -77   5929
 19    32    80   -77   5929
 20    29    80   -77   5929
 21    29   160     3      9
 22    29   160     3      9
 23    23   160     3      9
 24    27   160     3      9
 25    30   160     3      9
 26    31   160     3      9
 27    30   160     3      9
 28    26   160     3      9
 29    29   160     3      9
 30    29   160     3      9
 31    23   235    78   6084
 32    21   235    78   6084
 33     7   235    78   6084
 34    12   235    78   6084
 35    27   235    78   6084
 36    16   235    78   6084
 37    13   235    78   6084
 38    15   235    78   6084
 39    21   235    78   6084
 40    17   235    78   6084
 41     6   310   153  23409
 42     6   310   153  23409
 43     7   310   153  23409
 44     0   310   153  23409
 45    15   310   153  23409
 46     5   310   153  23409
 47     6   310   153  23409
 48     4   310   153  23409
 49     6   310   153  23409
 50     5   310   153  23409
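COL3 and COL4 in the listing come from centering the concentration column and squaring the centered values. The Python sketch below redoes that arithmetic as a check; it is not part of the original handout, and it assumes the design visible in the listing (five concentrations, ten observations each).

```python
# Sketch of the column operations in Example 4 (illustrative, not IML).
# Assumes 10 observations at each of the 5 concentrations, as in the listing.
conc = [c for c in (0, 80, 160, 235, 310) for _ in range(10)]

mean_conc = sum(conc) / len(conc)           # nitro2[+,2]/nrow(nitro2)
centered = [c - mean_conc for c in conc]    # third column (COL3)
centered_sq = [d ** 2 for d in centered]    # fourth column (COL4)

print(mean_conc, centered[0], centered_sq[0])
```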
Example 5: sample applications I: estimating PI
proc iml;
nsim = 4000;
temp_mat = J(nsim,3,0);
do isim = 1 to nsim;
temp_mat[isim,1] = uniform(0);   * generate X ~ unif(0,1);
temp_mat[isim,2] = uniform(0);   * generate Y ~ unif(0,1);
end;
/*
print 'temp_mat = ' temp_mat;
junk1 = temp_mat[,1]#temp_mat[,1];
print 'temp_mat[,1]#temp_mat[,1] = ' junk1;
junk2 = temp_mat[,2];
print 'temp_mat[,2] = ' junk2;
*/
temp_mat[,3] = (temp_mat[,2]<=
sqrt(J(nsim,1,1)-temp_mat[,1]#temp_mat[,1]));
pi_over4 = temp_mat[+,3]/nsim;
pi_est = 4*pi_over4;
se_est = 4*sqrt(pi_over4*(1-pi_over4)/nsim);
pi_LCL = pi_est - 2*se_est;
pi_UCL = pi_est + 2*se_est;
print '------------------------------------------------------------------';
print 'Estimating PI using MC simulation methods with ' nsim 'data points';
print 'PI-estimate = ' pi_est;
print 'SE(PI-estimate) = ' se_est;
print 'CI: [' pi_LCL ' , ' pi_UCL ']';
print '------------------------------------------------------------------';
quit;
------------------------------------------------------------------
                                                NSIM
Estimating PI using MC simulation methods with  4000 data points
               PI_EST
PI-estimate =  3.126
                   SE_EST
SE(PI-estimate) =  0.0261349
        PI_LCL        PI_UCL
CI: [   3.0737303  ,  3.1782697 ]
------------------------------------------------------------------
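The same Monte Carlo idea can be sketched outside IML. The Python version below is an illustrative cross-check only; the function name mc_pi and the seed argument are assumptions introduced here. It uses the same estimator (4 times the proportion of points with Y <= sqrt(1 - X*X)) and the same rough interval of plus or minus 2 standard errors.

```python
import math
import random

def mc_pi(nsim, seed=None):
    """Estimate pi from nsim uniform points in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        x, y = rng.random(), rng.random()
        if y <= math.sqrt(1.0 - x * x):   # point lies under the quarter circle
            hits += 1
    p = hits / nsim                        # estimates pi/4
    pi_est = 4.0 * p
    se_est = 4.0 * math.sqrt(p * (1.0 - p) / nsim)
    return pi_est, se_est, pi_est - 2.0 * se_est, pi_est + 2.0 * se_est

pi_est, se_est, pi_lcl, pi_ucl = mc_pi(4000, seed=1)
print(pi_est, se_est, pi_lcl, pi_ucl)
```

Because the points are random, the printed numbers will differ from the SAS run; only the standard error (about 0.026 for 4000 points) should be similar.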
Example 6: sample applications I: estimating PI using a MODULE
options nocenter nodate;
proc iml;
/* MODULE TO ESTIMATE PI
- Monte Carlo simulation used
- Strategy:
Generate X~Unif(0,1) and Y~Unif(0,1)
Determine if Y <= sqrt(1-X*X)
PI/4 estimated by proportion of times condition above is true
- INPUT
nsim
- OUTPUT
estimate of PI along with SE and CI
*/
start MC_PI(nsim);
temp_mat = J(nsim,3,0);
do isim = 1 to nsim;
temp_mat[isim,1] = uniform(0);   * generate X ~ unif(0,1);
temp_mat[isim,2] = uniform(0);   * generate Y ~ unif(0,1);
end;
temp_mat[,3] = (temp_mat[,2]<=
sqrt(J(nsim,1,1)-temp_mat[,1]#temp_mat[,1]));
pi_over4 = temp_mat[+,3]/nsim;
pi_est = 4*pi_over4;
se_est = 4*sqrt(pi_over4*(1-pi_over4)/nsim);
pi_LCL = pi_est - 2*se_est;
pi_UCL = pi_est + 2*se_est;
print '------------------------------------------------------------------';
print 'Estimating PI using MC simulation methods with ' nsim 'data points';
print 'PI-estimate = ' pi_est;
print 'SE(PI-estimate) = ' se_est;
print 'CI: [' pi_LCL ' , ' pi_UCL ']';
print '------------------------------------------------------------------';
finish MC_PI;
/**************************************************************************/
run MC_PI(400);
run MC_PI(1600);
run MC_PI(4000);
quit;
SAS OUTPUT . . . (edited)
------------------------------------------------------------------
                                                NSIM
Estimating PI using MC simulation methods with   400 data points
               PI_EST
PI-estimate =  3.08
                   SE_EST
SE(PI-estimate) =  0.0841665
        PI_LCL        PI_UCL
CI: [   2.911667   ,  3.248333 ]
------------------------------------------------------------------
                                                NSIM
Estimating PI using MC simulation methods with  1600 data points
               PI_EST
PI-estimate =  3.125
                   SE_EST
SE(PI-estimate) =  0.0413399
        PI_LCL        PI_UCL
CI: [   3.0423203  ,  3.2076797 ]
------------------------------------------------------------------
                                                NSIM
Estimating PI using MC simulation methods with  4000 data points
               PI_EST
PI-estimate =  3.198
                   SE_EST
SE(PI-estimate) =  0.0253219
        PI_LCL        PI_UCL
CI: [   3.1473562  ,  3.2486438 ]
------------------------------------------------------------------
Example 7: Using IML to construct a randomization test for testing equality
of 2 populations (revisiting WEEK7 example)
/*
RECALL - t-test previously conducted
proc ttest;
title NITROFEN: t-test of (0, 160) concentrations;
class conc;
var total;
run;

NITROFEN: t-test of (0, 160) concentrations
The TTEST Procedure

Statistics
                            Lower CL            Upper CL   Lower CL             Upper CL
Variable  conc          N     Mean      Mean      Mean     Std Dev   Std Dev    Std Dev   Std Err
total     0            10    28.827     31.4     33.973    2.4737    3.5963     6.5654    1.1372
total     160          10    26.612     28.3     29.988    1.6229    2.3594     4.3073    0.7461
total     Diff (1-2)         0.2424      3.1     5.9576    2.2981    3.0414     4.4977    1.3601

T-Tests
Variable  Method         Variances    DF    t Value   Pr > |t|
total     Pooled         Equal        18      2.28     0.0351
total     Satterthwaite  Unequal      15.5    2.28     0.0372

Equality of Variances
Variable  Method     Num DF   Den DF   F Value   Pr > F
total     Folded F        9        9      2.32   0.2252
*/
/* -------------------------------------------------------------
use PLAN to generate a set of indices for the randomization test
and then use TRANSPOSE to package the output
Do the calculations in PROC IML instead of DATA step programming
------------------------------------------------------------- */
options nocenter nodate;
libname class 'D:\baileraj\Classes\Fall 2003\sta402\data';
title "Randomization test in IML - Nitrofen conc 0 vs. 160 compared";
data test; set class.nitrofen;
if conc=0 | conc=160;
proc plan;
factors test=4000 ordered in=20;
output out=d_permut;
run;
proc transpose data=d_permut prefix=in out=out_permut(keep=in1-in20);
by test;
run;
proc iml;
/* read SAS data in IML */
use class.nitrofen;
read all var { total conc } where (conc=0|conc=160) into nitro;
use out_permut;
read all into perm_index;
obs_vec = nitro[,1];
obs_diff = sum(obs_vec[1:10]) - sum(obs_vec[11:20]);    * test statistic;
PERM_RESULTS = J(nrow(perm_index),2,0);                 * initialize results matrix;
do iperm = 1 to nrow(perm_index);
ind = perm_index[iperm,];                               * extract permutation index;
perm_resp = obs_vec[ind];                               * select corresponding obs;
perm_diff = sum(perm_resp[1:10]) - sum(perm_resp[11:20]);
PERM_RESULTS[iperm,1] = perm_diff;                      * store perm TS value/indicator;
PERM_RESULTS[iperm,2] = abs(perm_diff) >= abs(obs_diff);
end;
perm_Pvalue = PERM_RESULTS[+,2]/nrow(PERM_RESULTS);
print 'Permutation P-value = ' perm_Pvalue;
from SAS OUTPUT . . .
                        PERM_PVALUE
Permutation P-value =   0.03575
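The randomization logic (shuffle the 20 pooled responses, split them 10/10, and count how often the absolute difference in sums is at least as large as the observed one) can also be sketched in Python. This is an illustrative cross-check rather than the handout's method: it shuffles with Python's random module instead of PROC PLAN, the function name perm_pvalue is made up, and the two data vectors are the COL1 values for conc = 0 and conc = 160 from the Example 4 listing.

```python
import random

# Totals for the conc = 0 and conc = 160 groups, as listed in Example 4.
conc_0 = [27, 32, 34, 33, 36, 34, 33, 30, 24, 31]
conc_160 = [29, 29, 23, 27, 30, 31, 30, 26, 29, 29]

def perm_pvalue(g1, g2, nperm=4000, seed=0):
    """Two-sided randomization p-value for the difference in group sums."""
    rng = random.Random(seed)
    pooled = g1 + g2                 # new list; the inputs are left untouched
    n1 = len(g1)
    obs_diff = sum(g1) - sum(g2)
    extreme = 0
    for _ in range(nperm):
        rng.shuffle(pooled)          # one random relabelling of the 20 values
        perm_diff = sum(pooled[:n1]) - sum(pooled[n1:])
        if abs(perm_diff) >= abs(obs_diff):
            extreme += 1
    return obs_diff, extreme / nperm

obs_diff, p_value = perm_pvalue(conc_0, conc_160)
print(obs_diff, p_value)
```

The p-value depends on the random permutations drawn, so it should land near, not exactly on, the value printed by IML.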
/* code for testing components */
print nitro;
print perm_index;
obs_vec = nitro[,1];
print obs_vec;
ind = perm_index[1,];
print ind;
permdat = obs_vec[ind];
print permdat;
tranind = T(ind);
print obs_vec tranind permdat;
/* alternative coding */
obs_vec = shape(nitro[,1],1,20);
obs_diff = sum(obs_vec[1,1:10]) - sum(obs_vec[1,11:20]);
print obs_vec obs_diff;
PERM_RESULTS = J(nrow(perm_index),2,0);     * initialize results matrix;
do iperm = 1 to nrow(perm_index);
ind = perm_index[iperm,];
perm_resp = shape(obs_vec[1,ind],1,20);
perm_diff = sum(perm_resp[1,1:10]) - sum(perm_resp[1,11:20]);
PERM_RESULTS[iperm,1] = perm_diff;
PERM_RESULTS[iperm,2] = abs(perm_diff) >= abs(obs_diff);
end;
perm_Pvalue = PERM_RESULTS[+,2]/nrow(PERM_RESULTS);
print 'Permutation P-value = ' perm_Pvalue;
Noble IML example 1: bisection method
options ls=78 formdlim='-' nodate pageno=1;
/* find sqrt(x = 3) using bisection */
proc iml;
x = 3;
hi = x;
lo = 0;
history = 0||lo||hi;
iteration = 1;
delta = hi - lo;
do while(delta > 1e-7);
mid = (hi + lo)/2;
check = mid**2 > x;
if check
then hi = mid;
else lo = mid;
delta = hi - lo;
history = history//(iteration||lo||hi);
iteration = iteration + 1;
end;
print mid;
create process var {iteration low high};
append from history;
-----------------------------------------------------------------------------
MID
1.7320509
proc print data=process;
run;
/*
output from PROC PRINT
-----------------------------------------------------------------------------
Obs  ITERATION      LOW      HIGH
  1          0  0.00000   3.00000
  2          1  1.50000   3.00000
  3          2  1.50000   2.25000
  4          3  1.50000   1.87500
  5          4  1.68750   1.87500
  6          5  1.68750   1.78125
  7          6  1.68750   1.73438
  8          7  1.71094   1.73438
  9          8  1.72266   1.73438
 10          9  1.72852   1.73438
 11         10  1.73145   1.73438
 12         11  1.73145   1.73291
 13         12  1.73145   1.73218
 14         13  1.73181   1.73218
 15         14  1.73199   1.73218
 16         15  1.73199   1.73209
 17         16  1.73204   1.73209
 18         17  1.73204   1.73206
 19         18  1.73204   1.73205
 20         19  1.73205   1.73205
 21         20  1.73205   1.73205
 22         21  1.73205   1.73205
 23         22  1.73205   1.73205
 24         23  1.73205   1.73205
 25         24  1.73205   1.73205
 26         25  1.73205   1.73205
------------------------------------------------------------------------------
*/
symbol interpol=join value=#;
axis1 label=('Limits');
proc gplot data=process;
plot (low high)*iteration / overlay vaxis=axis1;
run;
quit;
[Figure: GPLOT of the LOW and HIGH limits (joined points) against ITERATION; y-axis: Limits (0 to 3); x-axis: ITERATION (0 to 20)]
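For readers who want to trace the bisection loop step by step, here is a short Python sketch of the same algorithm. It is an illustration only; the function name bisect_sqrt is made up, and like the IML code it brackets the root on [0, x], which assumes x >= 1.

```python
import math

def bisect_sqrt(x, tol=1e-7):
    """Bisection for sqrt(x) on [0, x]; assumes x >= 1 so the root is bracketed."""
    lo, hi = 0.0, float(x)
    history = [(0, lo, hi)]            # (iteration, low, high), as in the data set
    iteration = 0
    while hi - lo > tol:
        mid = (hi + lo) / 2.0
        if mid * mid > x:              # root is in the lower half
            hi = mid
        else:                          # root is in the upper half
            lo = mid
        iteration += 1
        history.append((iteration, lo, hi))
    return mid, history

mid, history = bisect_sqrt(3)
print(mid, len(history) - 1)           # final midpoint and number of halvings
```

The interval starts at width 3 and is halved each pass, so the loop stops once 3/2^n drops to 1e-7 or below, which matches the 25 iterations in the PROC PRINT listing.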
Noble IML example 2: Ridge regression
Ridge regression
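$$Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2) \text{ for } i = 1, \ldots, n$$

1. Center and scale all variables

$$\frac{Y_i - \bar{Y}}{s_Y \sqrt{n-1}} = \beta_1^* \left( \frac{x_{1i} - \bar{x}_1}{s_1 \sqrt{n-1}} \right) + \cdots + \beta_k^* \left( \frac{x_{ki} - \bar{x}_k}{s_k \sqrt{n-1}} \right) + \epsilon_i^*$$

$$\tilde{y} = \tilde{X} \beta^R + \tilde{\epsilon}$$

Note:

$$\beta_0 = \bar{Y} - \beta_1^* \frac{s_Y}{s_1} \bar{x}_1 - \cdots - \beta_k^* \frac{s_Y}{s_k} \bar{x}_k, \qquad \beta_j = \beta_j^* \frac{s_Y \sqrt{n-1}}{s_j \sqrt{n-1}} \text{ for } j = 1, \ldots, k$$

2. Let

$$b^R = (\tilde{X}'\tilde{X} + cI)^{-1} \tilde{X}'\tilde{y}$$

$$E[b^R] = (\tilde{X}'\tilde{X} + cI)^{-1} \tilde{X}'\tilde{X} \beta^R \quad \text{(i.e. biased estimation)}$$

$$\mathrm{var}(b^R) = (\tilde{X}'\tilde{X} + cI)^{-1} \tilde{X}'\tilde{X} (\tilde{X}'\tilde{X} + cI)^{-1}$$

OLS solution when c = 0

$$\tilde{X}'\tilde{X} = \begin{pmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1k} \\ r_{12} & 1 & r_{23} & \cdots & r_{2k} \\ r_{13} & r_{23} & 1 & \cdots & r_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{1k} & r_{2k} & r_{3k} & \cdots & 1 \end{pmatrix}$$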
Example:
A. Suppose k = 3 and the independent variables are orthogonal to each other;
then

$$\tilde{X}'\tilde{X} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \;\Rightarrow\; \mathrm{var}(b^R) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

B. Suppose k = 3 and two of the independent variables are
highly correlated with each other; then

$$\tilde{X}'\tilde{X} = \begin{pmatrix} 1 & 0.98 & 0 \\ 0.98 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \;\Rightarrow\; (\tilde{X}'\tilde{X})^{-1} = \begin{pmatrix} 25.2525 & -24.7474 & 0 \\ -24.7474 & 25.2525 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Let c = 0.01

$$\mathrm{var}(b^R) = (\tilde{X}'\tilde{X} + cI)^{-1} \tilde{X}'\tilde{X} (\tilde{X}'\tilde{X} + cI)^{-1} = \begin{pmatrix} 11.3611 & -10.8611 & 0 \\ -10.8611 & 11.3611 & 0 \\ 0 & 0 & 0.9803 \end{pmatrix}$$
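The numbers in case B can be checked numerically. The Python sketch below evaluates the ridge variance expression, (R + cI)^-1 R (R + cI)^-1, for the 3x3 correlation matrix of case B. It is a cross-check written for this handout, not IML code; matmul, inverse and ridge_var are helper names introduced here, and the inverse is a plain Gauss-Jordan routine meant only for small matrices.

```python
def matmul(a, b):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ridge_var(r, c):
    """(R + cI)^-1 R (R + cI)^-1 for correlation matrix R and constant c."""
    k = len(r)
    shifted = [[r[i][j] + (c if i == j else 0.0) for j in range(k)]
               for i in range(k)]
    inv = inverse(shifted)
    return matmul(matmul(inv, r), inv)

R = [[1.0, 0.98, 0.0],
     [0.98, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

for row in ridge_var(R, 0.01):
    print([round(v, 4) for v in row])
```

Setting c = 0 in ridge_var returns the inverse of R, which is the OLS case; the small constant c = 0.01 roughly halves the two inflated diagonal entries.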
Illustration:
Measurements were taken on 17 U.S. Navy hospitals
Independent variables =
Average daily patient load
Monthly x-ray exposures
Monthly occupied bed days
Eligible population in area (divided by 1000)
Average length of patients’ stay in days
Response = Monthly labor hours
options ls=78 formdlim='-' nodate pageno=1;
data hospital;
input patient_load x_ray bed_days population stay_length labor_hours;
cards;
 15.57   2463    472.92   18.0   4.45    566.52
 44.02   2048   1339.75    9.5   6.92    696.82
 20.42   3940    620.25   12.8   4.28   1033.15
 18.74   6505    568.33   36.7   3.90   1603.62
 49.20   5723   1497.60   35.7   5.50   1611.37
 44.92  11520   1365.83   24.0   4.60   1613.27
 55.48   5779   1687.00   43.3   5.62   1854.17
 59.28   5969   1639.92   46.7   5.15   2160.55
 94.39   8461   2872.33   78.7   6.18   2305.58
128.02  20106   3655.08  180.5   6.15   3503.93
 96.00  13313   2912.00   60.9   5.88   3571.89
131.42  10771   3912.00  103.7   4.88   3741.40
127.21  15543   3865.67  126.8   5.50   4026.52
252.90  36194   7684.10  157.7   7.00  10343.81
409.20  34703  12446.33  169.4  10.78  11732.17
463.70  39204  14098.40  331.4   7.05  15414.94
510.22  86533  15524.00  371.6   6.35  18854.45
;
proc reg data=hospital;
model labor_hours = patient_load x_ray bed_days population stay_length / vif;
run;
-----------------------------------------------------------------------------
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: labor_hours

Analysis of Variance
                             Sum of         Mean
Source             DF       Squares       Square   F Value   Pr > F
Model               5     490195304     98039061    238.74   <.0001
Error              11       4517236       410658
Corrected Total    16     494712540

Root MSE           640.82590    R-Square   0.9909
Dependent Mean    4978.48000    Adj R-Sq   0.9867
Coeff Var           12.87192

Parameter Estimates
                     Parameter      Standard                          Variance
Variable       DF     Estimate         Error   t Value   Pr > |t|    Inflation
Intercept       1   1957.65555    1062.65900      1.84     0.0925            0
patient_load    1    -19.08612      96.18624     -0.20     0.8463   9348.14904
x_ray           1      0.05574       0.02123      2.62     0.0236      7.95401
bed_days        1      1.69311       3.04722      0.56     0.5896   8710.16847
population      1     -4.07037       7.11731     -0.57     0.5789     23.00120
stay_length     1   -392.64933     207.75252     -1.89     0.0854      4.21971
%macro find_c(datain=,y=,x=,dataout=);
/*--------------------------------------------------+
| parameters                                        |
|   datain  = dataset to be analyzed                |
|   y       = response variable                     |
|   x       = list of independent variables         |
|   dataout = dataset containing ridge estimates    |
+--------------------------------------------------*/
proc iml;
/* read in data */
use &datain;
read all var {&x} into data;
/* center */
n = nrow(data);
k = ncol(data);
centered = data -j(n,n,1)*data/n;
/* scale */
r = j(k,k,.);
do i = 1 to k;
do j = 1 to k;
r[i,j] = (centered[,i]`*centered[,j])
/sqrt(centered[,i]`*centered[,i]*centered[,j]`*centered[,j]);
end;
end;
/* find c via bisection */
hi = 1;
lo = 0;
delta = hi - lo;
do while(delta > 1e-8);
c = (hi + lo)/2;
maxvif = max(diag(inv(r+c*i(k))*r*inv(r+c*i(k))));
if maxvif > 5
then lo = c;
else hi = c;
delta = hi - lo;
end;
mattrib c label='Biasing constant';
print c;
create temp var {c};
append from c;
data _null_;
set temp;
call symput('c',c);
run;
proc reg data=&datain outest=&dataout noprint;
model &y = &x / ridge=&c;
run;
%mend;
%find_c(datain=hospital,
y=labor_hours,
x=patient_load x_ray bed_days population stay_length,
dataout=ridge);
proc print data = ridge;
run;
quit;
-----------------------------------------------------------------------------
Biasing constant
0.0318075
-----------------------------------------------------------------------------
Obs  _MODEL_  _TYPE_  _DEPVAR_     _RIDGE_   _PCOMIT_  _RMSE_   Intercept
  1  MODEL1   PARMS   labor_hours  .         .         640.826    1957.66
  2  MODEL1   RIDGE   labor_hours  0.031808  .         724.538     798.25

     patient_                                      stay_     labor_
Obs      load     x_ray  bed_days  population     length      hours
  1  -19.0861  0.055738   1.69311    -4.07037   -392.649         -1
  2   12.5864  0.063608   0.43146     2.24981   -171.973         -1
Example: Diagonalize a symmetric matrix
(based on Deddens h.o.)
PROC IML;
A={1 2 3,4 5 6,5 7 9};     * this inputs a matrix;
print A;
G=GINV(A);                 * this is the generalized inverse;
print G;
B=A`*A;
print B;
CALL EIGEN(M,E,B);         * this computes the eigenvalues of B;
print M;                   * M is a vector of eigenvalues;
print E;                   * E is a matrix whose columns are;
                           * orthogonal eigenvectors;
M=FUZZ(M);
print M;
BB=E*DIAG(M)*E`;
print BB;
RUN;

A
1  2  3
4  5  6
5  7  9

G
-0.777778  0.6111111  -0.166667
-0.111111  0.1111111  7.027E-17
0.5555556  -0.388889  0.1666667

B
42  57   72
57  78   99
72  99  126

M
245.33969
0.660309
-1.14E-14

E
0.4115876  0.8148184  -0.408248
0.5638129  0.1242915  0.8164966
0.7160382  -0.566235  -0.408248

M
245.33969
0.660309
0

BB
42  57   72
57  78   99
72  99  126
Example Generating Multivariate Normal random variables
(based on Deddens h.o.)
/*
Generate a random sample of size 100 from a bivariate normal
distribution with mean (mu1,mu2)=(8,10), and variance covariance
matrix V=(4 3,3 9). X1~ N(8,4); X2 ~ N(10,9)
with Cov(X1, X2) = 3, so the correlation is 3/sqrt(4*9)=.5
*/
PROC IML;
X=SHAPE(0,2,100);                        *X= 2x100 matrix of 0s;
MU=8*SHAPE(1,1,100)//10*SHAPE(1,1,100);  *MU=2x100 matrix of 8s and 10s;
DO N=1 TO 100;
DO M=1 TO 2;
X[M,N]=RANNOR(0);                        * X = 2x100 matrix with indep. z-values;
END;
END;
V={4 3,3 9};
CALL EIGEN (M,E,V);
T=E*DIAG(SQRT(M))*E`;                    * this is the square root of V;
Y=MU+T*X;                                *Y is a 2x100 matrix;
W=Y`;                                    *W is a 100x2 matrix;
CREATE LAST FROM W;                      *creates a data set from the matrix W;
APPEND FROM W;
PROC CORR COV DATA=LAST;                 *computes covariances and correlations;
VAR COL1 COL2;
RUN;
The CORR Procedure
2 Variables: COL1 COL2

Covariance Matrix, DF = 99
              COL1          COL2
COL1    4.42686739    3.57748316
COL2    3.57748316   10.77696443

Simple Statistics
Variable     N       Mean   Std Dev         Sum   Minimum    Maximum
COL1       100    8.00686   2.10401   800.68631   2.11136   13.18996
COL2       100   10.01085   3.28283        1001   2.90976   17.50832

Pearson Correlation Coefficients, N = 100
Prob > |r| under H0: Rho=0
           COL1      COL2
COL1    1.00000   0.51794
                   <.0001
COL2    0.51794   1.00000
         <.0001
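The construction Y = MU + T*X with T a symmetric square root of V can be sketched in Python as well. One caveat: instead of calling an eigenvalue routine, the sketch uses the closed-form square root of a 2x2 positive definite matrix, (V + sqrt(det V) I) / sqrt(trace V + 2 sqrt(det V)), which is a swapped-in shortcut and not what the IML code does. The names sqrt_2x2 and rbvnorm are made up for illustration.

```python
import math
import random

V = [[4.0, 3.0],
     [3.0, 9.0]]
MU = (8.0, 10.0)

def sqrt_2x2(v):
    """Symmetric square root of a 2x2 positive definite matrix.

    Closed form (V + s*I)/t with s = sqrt(det V), t = sqrt(trace V + 2*s);
    it stands in for E*DIAG(SQRT(M))*E` in the IML code.
    """
    s = math.sqrt(v[0][0] * v[1][1] - v[0][1] * v[1][0])
    t = math.sqrt(v[0][0] + v[1][1] + 2.0 * s)
    return [[(v[0][0] + s) / t, v[0][1] / t],
            [v[1][0] / t, (v[1][1] + s) / t]]

def rbvnorm(n, mu, v, seed=None):
    """Draw n bivariate normal pairs as mu + T z, with z standard normal."""
    rng = random.Random(seed)
    t = sqrt_2x2(v)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((mu[0] + t[0][0] * z1 + t[0][1] * z2,
                    mu[1] + t[1][0] * z1 + t[1][1] * z2))
    return out

sample = rbvnorm(20000, MU, V, seed=1)
n = len(sample)
m1 = sum(a for a, _ in sample) / n
m2 = sum(b for _, b in sample) / n
cov12 = sum((a - m1) * (b - m2) for a, b in sample) / (n - 1)
print(m1, m2, cov12)
```

With a sample this large the means and covariance should sit close to (8, 10) and 3; the 100-observation SAS run above is noisier.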
This could also be done using the following code:
PROC IML;
MU={8 10};
COV={4 3, 3 9};
CALL VNORMAL(W,MU,COV,1000,0);
CREATE LAST FROM W;
APPEND FROM W;
PROC CORR COV DATA=LAST;
VAR COL1 COL2;
RUN;
The CORR Procedure
2 Variables: COL1 COL2

Covariance Matrix, DF = 999
               COL1          COL2
COL1    4.252231110   3.409982609
COL2    3.409982609   9.263162292

Simple Statistics
Variable      N       Mean   Std Dev     Sum    Minimum    Maximum
COL1       1000    8.00001   2.06209    8000    1.44165   13.93833
COL2       1000   10.03342   3.04354   10033   -0.04513   19.35617

Pearson Correlation Coefficients, N = 1000
Prob > |r| under H0: Rho=0
           COL1      COL2
COL1    1.00000   0.54333
                   <.0001
COL2    0.54333   1.00000
Example: Regression using matrix definition for LSE
(based on Deddens h.o.)
/*
PROC IML is an interactive matrix language that contains lots of
matrix functions, SAS functions, and various programming statements.
This handout uses PROC IML to perform usual linear regression. (from
h.o. #22)
*/
DATA DRAPER;
INPUT X1 X2 X3 X4 Y;
datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
PROC IML;
USE DRAPER;
READ ALL INTO XX;
*this makes a matrix out of a data set;
Y=XX[,5]; N=NROW(XX); R=NCOL(XX);
ONE=SHAPE(1,N,1);
X=ONE||XX[,1:4];
BETA=INV(X`*X)*X`*Y;
SIGMA=SQRT((Y-X*BETA)`*(Y-X*BETA)/(N-R));
SE=SIGMA*SQRT(VECDIAG(INV(X`*X)));
TVALUE=BETA/SE;
PVALUE=(1-PROBT(ABS(TVALUE),N-R))*2;
FINAL=BETA||SE||TVALUE||PVALUE;
print FINAL (|colname={ESTIMATE SE TVALUE PVALUE}
rowname={INTERCPT X1 X2 X3 X4} FORMAT=8.4|);
CREATE LAST FROM FINAL;
APPEND FROM FINAL;
*this creates a SAS data set from a matrix;
PROC print DATA=LAST;
RUN;
PROC REG DATA=DRAPER;
MODEL Y=X1 X2 X3 X4;
RUN;
Alternative coding:
could read in the matrices XX, Y, and X using:
READ ALL VAR{ Y } INTO Y;
READ ALL VAR{ X1 X2 X3 X4 } INTO XX;
X=ONE||XX;
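The matrix formula BETA = INV(X`*X)*X`*Y can be followed line by line in a Python sketch. This is a cross-check under stated assumptions: it solves the normal equations by Gaussian elimination rather than forming the inverse explicitly, the helper name solve is made up, and the data are copied from the DATA DRAPER step.

```python
# X1 X2 X3 X4 Y, copied from the DATA DRAPER step
draper = [
    (7, 26, 6, 60, 78.5), (1, 29, 15, 52, 74.3), (11, 56, 8, 20, 104.3),
    (11, 31, 8, 47, 87.6), (7, 52, 6, 33, 95.9), (11, 55, 9, 22, 109.2),
    (3, 71, 17, 6, 102.7), (1, 31, 22, 44, 72.5), (2, 54, 18, 22, 93.1),
    (21, 47, 4, 26, 115.9), (1, 40, 23, 34, 83.8), (11, 66, 9, 12, 113.3),
    (10, 68, 8, 12, 109.4),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

X = [[1.0, *row[:4]] for row in draper]      # ONE || XX[,1:4]
y = [row[4] for row in draper]
p = len(X[0])                                # number of coefficients
XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
beta = solve(XtX, Xty)                       # solves (X`X) beta = X`y
resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
sigma = (sum(e * e for e in resid) / (len(y) - p)) ** 0.5

print([round(b, 4) for b in beta], round(sigma, 4))
```

Solving the linear system avoids the explicit inverse; the estimates can be compared with the PROC REG output for the same model.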
Example: MLE for Weibull Parameters using IML
/* -----------------------------------------------------------------
Find the maximum likelihood estimate for the Weibull distribution
using the Newton Raphson method.
SUPPOSE
   f(x) = alpha * beta * x^(beta-1) * exp(- alpha * x^beta)     for x > 0
   log[f(x)] = log(alpha) + log(beta) + (beta-1) log(x) - alpha * x^beta
In order to somewhat conform with PROC LIFEREG we will let:
   mu = log(alpha), i.e.
   log[f(x)] = mu + log(beta) + (beta-1)log(x) - exp(mu)*x^beta
(this is the form coded below; the LIFEREG scale is then 1/beta and the
LIFEREG intercept is -mu/beta).
Recall that in order to find the mle, one forms the log-likelihood,
takes the derivative F(theta) = LL'(theta).
To find the maximum,
one needs to find when the function F(theta) equals 0. One then uses
the second derivative to find the standard error.
Newton's method is
an iterative procedure to find when a function is equal to 0.
   theta(n+1) = theta(n) - F'(theta(n))^(-1) * F(theta(n))
here theta=(mu,beta) is the vector of unknown parameters. (based on
Deddens h.o. #23)
revised: 2/26/2003
comments and elaboration: 11/09/2003
----------------------------------------------------------------- */
DATA D1;
DO I=1 TO 500;              *we generate some data;
X=RANEXP(123456789)/10;
ONE=1;
OUTPUT;
END;
/* obtain estimates from LIFEREG */
PROC LIFEREG data=D1;
MODEL X=ONE/D=WEIBULL; RUN;
                                   Std        Lower      Upper       Chi-
Parameter       DF   Estimate    Error        Limit      Limit     Square   Pr > ChiSq
Intercept        1    -2.3355   0.0494      -2.4322    -2.2388    2239.54       <.0001
ONE              0     0.0000   0.0000       0.0000     0.0000          .            .
Scale            1     1.0474   0.0366       0.9781     1.1216
Weibull Shape    1     0.9547   0.0333       0.8916     1.0224

/* repeat parameter estimation using IML */
DATA D2; SET D1; KEEP X ;
PROC IML;
USE D2;
READ ALL INTO X;
X=X`;
LL1= { 1, 1 };
NN=NCOL(X);
* compute an initial estimate of theta;
THETA={ 0 , .1 };
* do at most 20 iterations or until the deriv(of LL) is quite small;
DO JJ=1 TO 20 UNTIL ( SUM(LL1) < .0001 );
MU=THETA[1,];
BETA=THETA[2,];
SUMIT=SHAPE(1,NN,1); * vector of 1s;
* compute the Log-Likelihood;
LL = (MU +LOG(BETA) +LOG(X)*(BETA-1) -EXP(MU)*(X##BETA)) * SUMIT;
* compute the derivative of the log-likelihood;
LL1 = ( ( 1 - EXP(MU)*(X##BETA) ) * SUMIT ) //
( ( (1/BETA) + LOG(X) - EXP(MU)*LOG(X)#(X##BETA) )* SUMIT );
* compute the matrix of second derivatives;
LL2 = ( ( - EXP(MU)*(X##BETA) ) *SUMIT ||
( -(EXP(MU)*LOG(X)#(X##BETA) ) *SUMIT ) ) //
( ( -(EXP(MU)*LOG(X)#(X##BETA) ) * SUMIT ) ||
( (-1/(BETA*BETA) -EXP(MU)*LOG(X)#LOG(X)#(X##BETA) ) *SUMIT ) );
* iterate to find the mle;
THETA=THETA - INV(LL2)*LL1;
* compute the estimates corresponding to PROC LIFEREG;
SIGMA=1/BETA;
INTERCEPT=-MU*SIGMA;
* print standard errors using the second derivative;
SE=SQRT(-VECDIAG(INV(LL2)));
ZVALUE=THETA/SE;
* print iteration history;
print JJ THETA SE ;
END;
* print the final estimates;
print JJ THETA SE ZVALUE INTERCEPT SIGMA;
JJ      THETA         SE
 1  0.6000362  0.0530356
    0.1972324  0.0044464
 2  1.0112191  0.0490821
    0.3726667  0.0085816
 3  1.6117082  0.0578591
    0.6303589  0.0154567
 4  2.0731078  0.0678525
    0.8612001  0.0240567
 5  2.2180598  0.0746581
    0.9469739   0.030719
 6  2.2296905   0.077051
    0.9546784  0.0331178
 7  2.2297713  0.0772825
    0.9547346  0.0333382
 8  2.2297713  0.0772842
    0.9547346  0.0333398

JJ      THETA         SE     ZVALUE  INTERCEPT      SIGMA
 8  2.2297713  0.0772842  28.851564  -2.335488  1.0474115
    0.9547346  0.0333398   28.63646
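The Newton-Raphson iteration can be sketched in Python with the same score vector and second-derivative matrix that the IML code builds as LL1 and LL2. Treat this as an illustration under assumptions: the data are simulated with Python's generator, so the estimates will not reproduce the SAS run; the function name weibull_newton is made up; and the step-halving guard is an addition that is not in the IML code.

```python
import math
import random

def weibull_newton(x, theta=(0.0, 0.1), tol=1e-8, max_iter=50):
    """Newton-Raphson for log f = mu + log(b) + (b-1)*log(x) - exp(mu)*x**b."""
    n = len(x)
    logx = [math.log(v) for v in x]
    slog = sum(logx)

    def loglik(mu, b):
        return (n * mu + n * math.log(b) + (b - 1.0) * slog
                - math.exp(mu) * sum(v ** b for v in x))

    mu, b = theta
    for it in range(1, max_iter + 1):
        em = math.exp(mu)
        xb = [v ** b for v in x]
        s0 = sum(xb)
        s1 = sum(l * w for l, w in zip(logx, xb))
        s2 = sum(l * l * w for l, w in zip(logx, xb))
        g = (n - em * s0, n / b + slog - em * s1)      # score (LL1)
        h11, h12 = -em * s0, -em * s1                  # second derivatives (LL2)
        h22 = -n / (b * b) - em * s2
        det = h11 * h22 - h12 * h12
        if max(abs(g[0]), abs(g[1])) < tol:
            break
        d_mu = -(h22 * g[0] - h12 * g[1]) / det        # -INV(LL2)*LL1
        d_b = -(h11 * g[1] - h12 * g[0]) / det
        # Added safeguard (not in the IML code): halve the step while it would
        # make b non-positive or clearly lower the log-likelihood.
        step, old = 1.0, loglik(mu, b)
        slack = 1e-8 * (1.0 + abs(old))
        while step > 1e-10 and (
                b + step * d_b <= 0.0
                or loglik(mu + step * d_mu, b + step * d_b) < old - slack):
            step /= 2.0
        mu, b = mu + step * d_mu, b + step * d_b
    se = (math.sqrt(-h22 / det), math.sqrt(-h11 / det))  # SQRT(-VECDIAG(INV(LL2)))
    return (mu, b), se, g, it

rng = random.Random(123456789)
x = [rng.expovariate(1.0) / 10 for _ in range(500)]    # exponential, mean 0.1
theta_hat, se, score, n_iter = weibull_newton(x)
print(theta_hat, se, n_iter)
print("LIFEREG-style scale and intercept:",
      1.0 / theta_hat[1], -theta_hat[0] / theta_hat[1])
```

Since the simulated data are exponential with mean 0.1, the fitted shape should be near 1 and exp(mu) near 10.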
Example: Poisson Regression using IML and GENMOD
/*
This is a SAS MACRO which uses PROC IML to perform maximum
likelihood estimation, using Newton's method to find the iterative
solution (based on Deddens h.o. #26).
Incidence of nonmelanoma skin cancer among women in Minn./
St. Paul vs. women in Dallas/Ft. Worth;
Ref:
Lunneborg (1994) ch. 19 problem 2;
Kleinbaum, Kupper and Muller Regression book as well
Input variables:
age = midpt of age categ. 15-24, 25-34, . . . 75-84, 85+
city = 0 (Minneapolis-St. Paul)
     = 1 (Dallas-Ft. Worth)
Response variables:
Cases = # nonmelanoma skin cancers
PYRS = person years of exposure
Additional documentation added: 9 November 2003
*/
DATA KKM;
INPUT CITY AGE CASES PYRS @@;
LAGE=LOG((AGE-15)/35);
LPYRS=LOG(PYRS);
CARDS;
0 20   1 172675    1 20   4 181343
0 30  16 123065    1 30  38 146207
0 40  30  96216    1 40 119 121374
0 50  71  92051    1 50 221 111353
0 60 102  72159    1 60 259  83004
0 70 130  54722    1 70 310  55932
0 80 133  32185    1 80 226  29007
0 90  40   8328    1 90  65   7538
;
%MACRO POISSON(DATASET,COUNT,NUMBER,XVAR);
%* &count = response/count of the number of cases;
%* &number = person years of exposure;
%* &xvar = predictor variables;
PROC IML;
/* select response variables, predictors and offset */
USE &DATASET;
READ ALL VAR{&XVAR}
INTO XX; NR=NROW(XX); NC=NCOL(XX); NV=NC+1;
READ ALL VAR{&COUNT} INTO Y;
READ ALL VAR{&NUMBER} INTO P;
/* initialize BETA vector and set up design matrix */
BETA=J(NV,1,0);
ONE=J(NR,1,1);
XX=ONE||XX;
DO ITER=1 TO 20 UNTIL (ADLL<.00001);
EB=EXP(XX*BETA);     * estimated RATE;
PEB=P#EB;            * predicted number of cases = person-years X RATE;
LL=ONE`*(Y#LOG(PEB)-PEB);
LL0=ONE`*(Y#LOG(Y)-Y);
DLL=XX`*(-PEB+Y);
ADLL=MAX(ABS(DLL));
TWO=J(1,NV,1);
DDLL=-XX`*((PEB*TWO)#(XX));
BETA=BETA-INV(DDLL)*DLL; * update estimate;
BETAP=BETA`;
print BETAP (|FORMAT=8.4|) ITER ADLL (|FORMAT=12.4|);
DEV=2*(LL0-LL);
VAR=-INV(DDLL);
SE=SQRT(VECDIAG(VAR));
ZVALUE=BETA/SE;
LOWER=BETA-1.96*SE;
UPPER=BETA+1.96*SE;
CHISQ=ZVALUE##2;
PVALUE=2*PROBNORM(-ABS(ZVALUE));
END;
FINAL=BETA||SE||LOWER||UPPER||CHISQ||PVALUE;
print FINAL (|colname={ESTIMATE SE LOWER UPPER CHISQUARE PVALUE}
rowname={intercept &xvar} FORMAT=12.4|);
print ITER ;
print DEV (| FORMAT=12.4 |) LL (| FORMAT=12.4 |) LL0 (| FORMAT=12.4 |);
%MEND;
%POISSON(KKM,CASES,PYRS,CITY LAGE)
         BETAP                ITER           ADLL
-0.9983   0.0009   0.0014        1   1385394.0000
-1.9937   0.0032   0.0053        2    509192.4582
-2.9814   0.0096   0.0157        3    186857.4228
-3.9485   0.0266   0.0437        4     68281.0472
-4.8631   0.0707   0.1167        5     24670.1512
-5.6579   0.1763   0.2958        6      8655.4209
-6.2430   0.3814   0.6737        7      2835.8993
-6.6197   0.6290   1.2675        8       930.6167
-6.8890   0.7648   1.8734        9       355.1142
-7.0416   0.7992   2.2156       10       107.8144
-7.0748   0.8030   2.2859       11        21.2470
-7.0760   0.8031   2.2883       12         0.7735
-7.0760   0.8031   2.2883       13         0.0009
-7.0760   0.8031   2.2883       14         0.0000

FINAL
            ESTIMATE       SE    LOWER    UPPER   CHISQUARE   PVALUE
INTERCEPT    -7.0760   0.0476  -7.1694  -6.9826  22070.6964   0.0000
CITY          0.8031   0.0522   0.7008   0.9054    236.9187   0.0000
LAGE          2.2883   0.0627   2.1654   2.4112   1331.6379   0.0000

ITER
14

DEV            LL        LL0
14.3722  7198.7749  7205.9610
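The Newton iteration inside the POISSON macro can be replayed in Python on the same KKM data. The sketch below is an illustrative cross-check, not the macro itself: the helper names solve and fit_poisson are made up, the linear systems are solved by Gaussian elimination instead of INV, and the standard errors and deviance are evaluated at the final estimate rather than inside the loop.

```python
import math

# (city, age, cases, person-years), copied from the KKM data step
kkm = [
    (0, 20, 1, 172675), (1, 20, 4, 181343),
    (0, 30, 16, 123065), (1, 30, 38, 146207),
    (0, 40, 30, 96216), (1, 40, 119, 121374),
    (0, 50, 71, 92051), (1, 50, 221, 111353),
    (0, 60, 102, 72159), (1, 60, 259, 83004),
    (0, 70, 130, 54722), (1, 70, 310, 55932),
    (0, 80, 133, 32185), (1, 80, 226, 29007),
    (0, 90, 40, 8328), (1, 90, 65, 7538),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_poisson(rows, tol=1e-5, max_iter=20):
    """Newton's method for log(rate) = b0 + b1*city + b2*lage with offset log(pyrs)."""
    X = [[1.0, float(city), math.log((age - 15) / 35)]
         for city, age, _, _ in rows]
    y = [float(r[2]) for r in rows]
    pyrs = [float(r[3]) for r in rows]
    k = len(X[0])

    def mean_counts(beta):                  # PEB = P # EXP(XX*BETA)
        return [p * math.exp(sum(b * v for b, v in zip(beta, xi)))
                for p, xi in zip(pyrs, X)]

    def information(mu):                    # -DDLL
        return [[sum(m * xi[i] * xi[j] for xi, m in zip(X, mu))
                 for j in range(k)] for i in range(k)]

    beta = [0.0] * k
    for it in range(1, max_iter + 1):
        mu = mean_counts(beta)
        score = [sum(xi[j] * (yi - m) for xi, yi, m in zip(X, y, mu))
                 for j in range(k)]         # DLL
        step = solve(information(mu), score)
        beta = [b + s for b, s in zip(beta, step)]   # BETA - INV(DDLL)*DLL
        if max(abs(s) for s in score) < tol:         # ADLL, checked after the update
            break
    mu = mean_counts(beta)
    info = information(mu)
    se = [math.sqrt(solve(info, [float(i == j) for i in range(k)])[j])
          for j in range(k)]
    dev = 2 * sum(yi * math.log(yi / m) - (yi - m) for yi, m in zip(y, mu))
    return beta, se, dev, it

beta, se, dev, n_iter = fit_poisson(kkm)
print([round(b, 4) for b in beta], [round(s, 4) for s in se],
      round(dev, 4), n_iter)
```

Starting from BETA = 0 as the macro does, the iteration should retrace the history shown above and arrive at the same estimates, standard errors and deviance.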
PROC GENMOD;
MODEL CASES=CITY LAGE/D=POISSON OFFSET=LPYRS ITPRINT;
RUN;
The GENMOD Procedure

Model Information
Data Set              WORK.KKM
Distribution          Poisson
Link Function         Log
Dependent Variable    CASES
Offset Variable       LPYRS
Observations Used     16

Parameter Information
Parameter    Effect
Prm1         Intercept
Prm2         CITY
Prm3         LAGE

Iteration History For Parameter Estimates
                      Log
Iter   Ridge   Likelihood        Prm1        Prm2        Prm3
   0       0   7198.75807   -7.069337   0.7996953    2.287589
   1       0   7198.77491    -7.07598   0.8030978   2.2882666
   2       0   7198.77491   -7.076002    0.803114   2.2882695

Criteria For Assessing Goodness Of Fit
Criterion             DF       Value   Value/DF
Deviance              13     14.3722     1.1056
Scaled Deviance       13     14.3722     1.1056
Pearson Chi-Square    13     14.2004     1.0923
Scaled Pearson X2     13     14.2004     1.0923
Log Likelihood             7198.7749

Last Evaluation Of The Negative
Of The Gradient and Hessian
         Gradient        Prm1        Prm2        Prm3
Prm1    1.2864E-7        1765        1242   486.23975
Prm2    1.4635E-8        1242        1242   328.74852
Prm3     3.068E-8   486.23975   328.74852   388.75768
--------------------------------------------------------------------------------------------
Algorithm converged.

Analysis Of Parameter Estimates
                             Standard     Wald 95% Confidence
Parameter   DF   Estimate       Error           Limits           ChiSquare   Pr > ChiSq
Intercept    1    -7.0760      0.0476     -7.1694    -6.9826       22070.7       <.0001
CITY         1     0.8031      0.0522      0.7008     0.9054        236.92       <.0001
LAGE         1     2.2883      0.0627      2.1654     2.4112       1331.64       <.0001
Scale        0     1.0000      0.0000      1.0000     1.0000
NOTE: The scale parameter was held fixed