Lecture 12 Slides

Today: Feb 28
• Reading Data from existing SAS
dataset
• One-way ANOVA
• Reading Le 7:5
• Reading C&S 7:A-H
Reading SAS Datasets
Sometimes your “raw” data is already a SAS dataset
LIBNAME tomhs 'c:/my documents/ph5415/';
PROC CONTENTS DATA=tomhs.bpstudy;
PROC PRINT DATA=tomhs.bpstudy (obs=10);
RUN;
The libname statement tells SAS which directory (folder) the
dataset is in.
DATA=tomhs.bpstudy
Tells SAS to look for a SAS dataset called bpstudy in the
directory referenced by tomhs.
PROC CONTENTS OUTPUT
The CONTENTS Procedure

Data Set Name:  TOMHS.BPSTUDY                       Observations:          902
Member Type:    DATA                                Variables:             16
Engine:         V8                                  Indexes:               0
Created:        9:07 Saturday, February 26, 2005    Observation Length:    128
Last Modified:  9:07 Saturday, February 26, 2005    Deleted Observations:  0

-----Alphabetic List of Variables and Attributes-----

 #    Variable    Type    Len    Pos
-------------------------------------
 3    AGE         Num       8     16
 6    CHOL12      Num       8     40
 2    GROUP       Num       8      8
 8    HDL12       Num       8     56
 9    PULSE12     Num       8     64
10    PULSEBL     Num       8     72
 4    SBP12       Num       8     24
 5    SBPBL       Num       8     32
 1    SEX         Num       8      0
 7    TRIG12      Num       8     48
11    WT12        Num       8     80
12    WTBL        Num       8     88
13    cholbl      Num       8     96
14    hdlbl       Num       8    104
16    id          Char      6    120
15    trigbl      Num       8    112
PROC PRINT – 10 Observations
Obs  SEX  GROUP  AGE  SBP12  SBPBL  CHOL12  TRIG12  HDL12  PULSE12  PULSEBL
  1    1      3   54      .  139.5       .       .      .        .       76
  2    2      6   62    129  144.0     241      65     66       80       72
  3    2      5   64    118  141.0     307     425     41       80       81
  4    1      5   47      .  134.0       .       .      .        .       80
  5    1      3   51      .  132.5       .       .      .        .       73
  6    1      2   62    133  133.0     196      72     44       72       76
  7    2      2   59    113  136.0     231      75     61       72       74
  8    1      3   63    127  137.5     217     137     35       64       74
  9    2      4   64    122  151.0     201      57     44       56       63
 10    2      5   52    122  140.0     209     105     57       60       81

Obs   WT12   WTBL  cholbl  hdlbl  trigbl  id
  1      .  224.0     205     24     179  A00001
  2  124.0  141.0     260     75      67  A00010
  3  144.0  157.0     228     29     564  A00021
  4      .  214.0     194     66      49  A00023
  5      .  206.5     226     40      53  A00056
  6  211.0  227.5     207     47     126  A00075
  7  125.0  137.0     214     62     119  A00083
  8  195.0  211.5     214     37     165  A00105
  9  150.0  159.5     214     47     133  A00133
 10  168.5  196.5     215     55     105  A00143
Reading a SAS Dataset
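DATA temp;
SET tomhs.bpstudy;         * reads in an observation from the SAS dataset;
sbpdif = sbp12-sbpbl;      * change in SBP (sbp12 minus baseline);
RUN;
PROC MEANS DATA=temp;
RUN;
The SET statement reads in an observation. It replaces the INFILE and INPUT
statements used when reading in text data.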
The MEANS Procedure

Variable      N           Mean        Std Dev        Minimum        Maximum
----------------------------------------------------------------------------
SEX         902      1.3824834      0.4862633      1.0000000      2.0000000
GROUP       902      3.7882483      1.7874130      1.0000000      6.0000000
AGE         902     54.7727273      6.4039396     44.0000000     69.0000000
SBP12       848    124.1002358     15.1891840     87.0000000    187.0000000
SBPBL       902    140.3636364     12.4446043    113.5000000    190.0000000
CHOL12      849    220.8386337     38.8624342    111.0000000    456.0000000
TRIG12      849    106.9634865     62.5307082     24.0000000    592.0000000
HDL12       849     45.4923439     12.1059688     18.0000000    102.0000000
PULSE12     847     69.3506494     10.0301471     44.0000000    112.0000000
PULSEBL     901     73.6925638      8.6698610     48.0000000    109.0000000
WT12        848    176.8225236     30.4251368    105.5000000    286.0000000
WTBL        902    187.3791574     31.0782720    113.0000000    289.2500000
cholbl      900    228.2511111     38.4169684    113.0000000    357.0000000
hdlbl       900     43.6122222     11.6124701     17.0000000     97.0000000
trigbl      900    131.7366667     76.5211232     17.0000000    815.0000000
sbpdif      848    -16.5176887     14.4532685    -75.5000000     30.0000000
One-Way Analysis of Variance
• Two-sample t-test: compares means of two
groups
– Are the means different?
• What if we have more than two groups?
Examples:
• compare three different behavioral
interventions
• compare 5 different BP drugs
Analysis of Variance
Could compare all pairs of means with t-tests
three groups (3 pairs): A-B, B-C, A-C
five groups (10 pairs):
A-B, A-C, A-D, A-E
B-C, B-D, B-E
C-D, C-E
D-E
Analysis of Variance
Problem: multiple comparisons!
When performing many tests, we may reject a true
null hypothesis by chance (Type I error)
With $\alpha = 0.05$, about 1 out of every 20 tests of a
true null hypothesis is rejected by chance
Even if all group means are equal, there is a
fairly large chance that at least one pair will test as different
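To put a rough number on "fairly large chance", the DATA step below is a minimal sketch that tabulates the probability of at least one false rejection among all pairwise tests for k = 2 to 6 groups. It assumes the tests are independent, which pairwise t-tests sharing a group are not, so the results are only a guide. The dataset name `multtest` and the variable names are illustrative.

```sas
/* Chance of at least one Type I error among m pairwise tests,   */
/* ASSUMING independent tests (an approximation for this setting) */
DATA multtest;
  alpha = 0.05;
  DO k = 2 TO 6;
    m = k*(k-1)/2;                 /* number of pairwise comparisons   */
    p_any = 1 - (1 - alpha)**m;    /* P(at least one false rejection)  */
    OUTPUT;
  END;
RUN;

PROC PRINT DATA=multtest NOOBS;
  VAR k m p_any;
  FORMAT p_any 6.3;
RUN;
```

Under this assumption, three groups (m = 3) give about 0.14 and five groups (m = 10) give about 0.40.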
Analysis of Variance
ANOVA simultaneously tests for a difference
in k means
• Y - continuous
• k samples from k normal distributions
• each of size $n_i$, not necessarily equal
• each with possibly different mean $\mu_i$
• each with constant variance $\sigma^2$
Constant variance
ANOVA is robust to moderate violations of constant
variance (and normality)
Rule of thumb:
If the largest standard deviation is less than twice the
smallest standard deviation, you’re ok.
Can sometimes transform Y to achieve equal variance
or normality
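One way to apply the rule of thumb is to print the standard deviation of the outcome within each group and compare the largest to the smallest. The sketch below assumes the `temp` dataset and the `sbpdif` and `group` variables used elsewhere in this lecture.

```sas
/* Per-group N, mean and standard deviation of the outcome */
PROC MEANS DATA=temp N MEAN STD MAXDEC=2;
  CLASS group;     /* one row of statistics per treatment group */
  VAR sbpdif;
RUN;
```

For the TOMHS data shown later in the lecture, the group standard deviations range from about 11.6 to about 15.3, a ratio of roughly 1.3, so the rule of thumb is satisfied.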
Analysis of Variance
Ho: 1 = 2= ... = k
Ha: Not all i equal
Two-sample t-test is
special case; k = 2
Sometimes referred to as a global
or omnibus test
For each group i;
ni = number of observations
Yi = sample mean
2
si = sample variance
Y = overall mean
Two-sample T-test
• Compares means
for two groups
$t = \dfrac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
• This compares
variation between
groups (numerator) with
variation within
groups (denominator)
ANOVA F-test
• Compares means
for all groups
$F = \dfrac{\sum_i n_i (\bar{Y}_i - \bar{Y})^2 / (k-1)}{s_p^2}$
• This compares
variation between
groups (numerator: group means compared to the grand mean) with
variation within
groups (denominator: pooled variance $s_p^2$)
Analysis of Variance
Variation for all observations:
$\sum_i \sum_j (Y_{ij} - \bar{Y})^2$
Called the “(corrected) total sum of squares” or
SST
Can be divided into two parts:
• deviation of individual observation from its
sample mean
• deviation of sample means from overall mean
$Y_{ij} - \bar{Y} = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y})$
Similar to regression
Analysis of Variance
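$(Y_{ij} - \bar{Y}_i)$ measures variation within samples
$(\bar{Y}_i - \bar{Y})$ measures variation between samples
Each has a corresponding “sum of squares”
Sum of squares within (SSW): $\sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2$
Sum of squares between (SSB): $\sum_i \sum_j (\bar{Y}_i - \bar{Y})^2 = \sum_i n_i (\bar{Y}_i - \bar{Y})^2$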
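The two sums of squares add up to the total sum of squares. A short derivation, using the notation of this lecture ($Y_{ij}$ for observation j in group i, $\bar{Y}_i$ for the group mean, $\bar{Y}$ for the overall mean), shows why the cross term drops out:

```latex
\begin{align*}
\mathrm{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij}-\bar{Y})^2
 &= \sum_{i}\sum_{j} \bigl[(Y_{ij}-\bar{Y}_i) + (\bar{Y}_i-\bar{Y})\bigr]^2 \\
 &= \sum_{i}\sum_{j} (Y_{ij}-\bar{Y}_i)^2
  + \sum_{i} n_i(\bar{Y}_i-\bar{Y})^2
  + 2\sum_{i} (\bar{Y}_i-\bar{Y}) \sum_{j} (Y_{ij}-\bar{Y}_i) \\
 &= \mathrm{SSW} + \mathrm{SSB} + 0
\end{align*}
```

The last term is zero because, within each group, the deviations from the group mean sum to zero: $\sum_j (Y_{ij}-\bar{Y}_i) = 0$.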
Analysis of Variance
Each has corresponding degrees of freedom (DF), where n is the total number of observations
SST: n-1 df
SSB: k-1 df
SSW: (n-1) - (k-1) = n-k df
Ratio of each sum of squares over its degrees of
freedom gives us the mean squares
MSW = SSW / (n-k) = average variation within k samples
MSB = SSB / (k-1) = average variation between k samples
Analysis of Variance
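MSW is an estimate of the common within-group variance, $\sigma^2$
MSW = SSW/(n-k)
$SSW = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2$
Sample variance for ith group: $s_i^2 = \dfrac{\sum_j (Y_{ij} - \bar{Y}_i)^2}{n_i - 1}$
so $SSW = \sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 = \sum_i (n_i - 1) s_i^2$
$MSW = \dfrac{\sum_i (n_i - 1) s_i^2}{\sum_i (n_i - 1)}$ = Pooled variance for k groups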
Analysis of Variance
The null hypothesis is tested by looking at the F ratio:
F = MSB/MSW, compared to an F distribution with k-1 and n-k df
If variation between groups is much greater than variation
within groups:
F >> 1, reject null hypothesis
F $\approx$ 1, fail to reject null hypothesis
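How large F has to be depends on the two degrees of freedom. The sketch below uses the SAS functions FINV (quantile of the F distribution) and PROBF (its cumulative distribution function) to get a 5% critical value and a p-value. The inputs k = 6, n = 848 and F = 13.52 are taken from the TOMHS example later in this lecture and are used here only as illustrative values; the dataset name `ftest` is arbitrary.

```sas
/* Critical value and p-value for an observed F ratio */
DATA ftest;
  k = 6;                            /* number of groups (illustrative)     */
  n = 848;                          /* usable observations (illustrative)  */
  f = 13.52;                        /* observed F = MSB/MSW (illustrative) */
  crit = FINV(0.95, k-1, n-k);      /* reject H0 at the 5% level if f > crit */
  p    = 1 - PROBF(f, k-1, n-k);    /* upper-tail p-value                  */
RUN;

PROC PRINT DATA=ftest NOOBS;
RUN;
```

With 5 and 842 df the 5% critical value is roughly 2.2, so an F ratio of 13.52 is far into the rejection region.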
Analysis of Variance
Results often presented in an ANOVA table

Source     SS     df     MS     F          p-value
Between    SSB    k-1    MSB    MSB/MSW    p
Within     SSW    n-k    MSW
Total      SST    n-1

SAS uses “Model” for “Between” and “Error” for “Within”
ANOVA in SAS; two ways
PROC ANOVA DATA = LIPID;
CLASS diet;
MODEL lipid = diet;
RUN;
PROC GLM DATA = LIPID;
CLASS diet;
MODEL lipid = diet;
RUN;
Both test for a difference in mean lipid reduction
for the two diets
PROC ANOVA and GLM
• Almost exactly the same for this case
• GLM is a more general procedure
TOMHS Study
• 6 Treatment groups (Variable GROUP)
– Beta-blocker
– Calcium channel blocker
– Diuretic
– Alpha-blocker
– ACE inhibitor
– Placebo
– All treatments given lifestyle intervention to
lower BP
ANOVA – TOMHS Study
PROC GLM DATA=temp;
CLASS group;
MODEL sbpdif = group;
MEANS group;
RUN;
The CLASS statement creates 5 dummy variables for
you
OUTPUT
The GLM Procedure
Class Level Information

Class    Levels    Values
GROUP         6    1 2 3 4 5 6

Number of observations    902
NOTE: Due to missing values, only 848 observations can be used in this analysis
GLM – OUTPUT
The GLM Procedure
Dependent Variable: sbpdif
ANOVA TABLE

                              Sum of
Source              DF       Squares    Mean Square    F Value    Pr > F
Model                5    13149.8402      2629.9680      13.52    <.0001
Error              842   163785.8945       194.5201
Corrected Total    847   176935.7347

R-Square    Coeff Var    Root MSE    sbpdif Mean
0.074320    -84.43703    13.94705      -16.51769

If H0 is true then F should be near 1; here F = 2629.97/194.52 = 13.52
Root MSE is the pooled (over 6 groups) standard deviation; it estimates $\sigma$
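Every derived quantity in this output can be recomputed from the two sums of squares, their degrees of freedom, and the overall mean. The DATA step below is a small check using the numbers printed above; the dataset and variable names are illustrative.

```sas
/* Recompute the derived ANOVA quantities from the printed output */
DATA anova_check;
  ssb  = 13149.8402;   dfb = 5;      /* Model (between) SS and df   */
  ssw  = 163785.8945;  dfw = 842;    /* Error (within) SS and df    */
  ybar = -16.51769;                  /* overall mean of sbpdif      */
  sst      = ssb + ssw;              /* corrected total SS          */
  msb      = ssb / dfb;              /* mean square between         */
  msw      = ssw / dfw;              /* mean square within (pooled variance) */
  f        = msb / msw;              /* F ratio                     */
  rsquare  = ssb / sst;              /* share of variation explained by group */
  rootmse  = SQRT(msw);              /* pooled standard deviation   */
  coeffvar = 100 * rootmse / ybar;   /* coefficient of variation, % */
RUN;

PROC PRINT DATA=anova_check NOOBS;
  VAR sst msb msw f rsquare rootmse coeffvar;
RUN;
```

The results should agree with the output up to rounding: SST 176935.73, F 13.52, R-Square 0.0743, Root MSE 13.95, Coeff Var -84.44.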
GLM – OUTPUT
Source    DF      Type I SS    Mean Square    F Value    Pr > F
GROUP      5    13149.84018     2629.96804      13.52    <.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
GROUP      5    13149.84018     2629.96804      13.52    <.0001

With no covariates in the model, this portion of the output is the same as the
ANOVA table, because the model includes only GROUP.
The GLM Procedure

Level of            ------------sbpdif-----------
GROUP         N            Mean          Std Dev
1           126     -20.0555556       15.3474717
2           121     -17.5289256       11.6080607
3           124     -21.8467742       14.4977118
4           129     -16.0697674       14.0005223
5           127     -17.6023622       13.1844874
6           221     -10.5950226       14.3539675
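The pooled variance formula, MSW = $\sum_i (n_i - 1) s_i^2 / \sum_i (n_i - 1)$, can be checked directly against these group summaries. The sketch below reads the N and Std Dev columns as inline data and accumulates the sums; the dataset name `pooled` and the variable names are illustrative.

```sas
/* Pooled variance from per-group N and standard deviation */
DATA pooled;
  INPUT group n sd;
  ssw + (n - 1) * sd**2;    /* sum statement: running within-group SS  */
  dfw + (n - 1);            /* running within-group degrees of freedom */
  msw = ssw / dfw;          /* pooled variance so far                  */
  sp  = SQRT(msw);          /* pooled standard deviation so far        */
  DATALINES;
1 126 15.3474717
2 121 11.6080607
3 124 14.4977118
4 129 14.0005223
5 127 13.1844874
6 221 14.3539675
;
RUN;

PROC PRINT DATA=pooled NOOBS;
RUN;
```

The last row holds the totals over all six groups. They should match the Error line of the ANOVA table up to rounding: SSW about 163785.89 on 842 df, MSW about 194.52, pooled standard deviation about 13.95.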
Contrasts
PROC GLM DATA=temp;
CLASS group;
MODEL sbpdif = group;
MEANS group;
ESTIMATE 'BB vs Placebo'   group 1 0 0 0 0 -1;
ESTIMATE 'CCB vs Placebo'  group 0 1 0 0 0 -1;
ESTIMATE 'Diur vs Placebo' group 0 0 1 0 0 -1;
ESTIMATE 'AB vs Placebo'   group 0 0 0 1 0 -1;
ESTIMATE 'ACE vs Placebo'  group 0 0 0 0 1 -1;
RUN;
OUTPUT
The GLM Procedure
Dependent Variable: sbpdif

                                      Standard
Parameter              Estimate          Error    t Value    Pr > |t|
BB vs Placebo        -9.4605329     1.55691725      -6.08      <.0001
CCB vs Placebo       -6.9339030     1.57727142      -4.40      <.0001
Diur vs Placebo     -11.2517516     1.56489344      -7.19      <.0001
AB vs Placebo        -5.4747448     1.54534422      -3.54      0.0004
ACE vs Placebo       -7.0073396     1.55300848      -4.51      <.0001
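Each ESTIMATE result is a difference of two group means, with a standard error built from the pooled standard deviation (Root MSE) and the two group sizes. The sketch below reproduces the "BB vs Placebo" line from the group means, group sizes, Root MSE and error df reported earlier in this lecture; the dataset and variable names are illustrative.

```sas
/* Reproduce the 'BB vs Placebo' contrast by hand */
DATA contrast_check;
  mean_bb = -20.0555556;  n_bb = 126;    /* group 1 (beta-blocker) */
  mean_pl = -10.5950226;  n_pl = 221;    /* group 6 (placebo)      */
  rootmse = 13.94705;     dfe  = 842;    /* pooled SD and error df */
  estimate = mean_bb - mean_pl;                    /* difference in means */
  stderr   = rootmse * SQRT(1/n_bb + 1/n_pl);      /* standard error      */
  t        = estimate / stderr;                    /* t statistic         */
  p        = 2 * (1 - PROBT(ABS(t), dfe));         /* two-sided p-value   */
RUN;

PROC PRINT DATA=contrast_check NOOBS;
  VAR estimate stderr t p;
RUN;
```

This should give an estimate of about -9.46, a standard error of about 1.557 and t of about -6.08, matching the output. Note that the standard error uses the variance pooled over all six groups, not just the two groups being compared.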
Compare all Groups
PROC GLM DATA=temp;
CLASS group;
MODEL sbpdif = group;
LSMEANS group/PDIF;
RUN;
GLM – OUTPUT
The GLM Procedure Least Squares Means

              sbpdif      LSMEAN
GROUP         LSMEAN      Number
1        -20.0555556           1
2        -17.5289256           2
3        -21.8467742           3
4        -16.0697674           4
5        -17.6023622           5
6        -10.5950226           6

Least Squares Means for effect GROUP
Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: sbpdif

i/j         1         2         3         4         5         6
1                0.1550    0.3103    0.0228    0.1622    <.0001
2      0.1550              0.0156    0.4087    0.9669    <.0001
3      0.3103    0.0156              0.0010    0.0161    <.0001
4      0.0228    0.4087    0.0010              0.3796    0.0004
5      0.1622    0.9669    0.0161    0.3796              <.0001
6      <.0001    <.0001    <.0001    0.0004    <.0001

P-value for Group 1 v Group 2: 0.1550 (row 1, column 2)
NOTE: To ensure overall protection level, only probabilities associated with pre-planned
comparisons should be used.
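The NOTE ties back to the multiple comparisons problem: six groups give 15 pairwise tests. If all pairs are to be compared without pre-planning, one option is to request adjusted p-values with the ADJUST= option of the LSMEANS statement. This is a sketch of that option and is not part of the analysis shown in this lecture; ADJUST=BON (Bonferroni) is another available choice.

```sas
/* All pairwise comparisons with Tukey-Kramer adjusted p-values */
PROC GLM DATA=temp;
  CLASS group;
  MODEL sbpdif = group;
  LSMEANS group / PDIFF ADJUST=TUKEY;
RUN;
QUIT;
```

Adjusted p-values are never smaller than the unadjusted ones, so some pairs that look different in the unadjusted table may no longer do so.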