Today: Feb 28
• Reading Data from existing SAS dataset
• One-way ANOVA
• Reading Le 7:5
• Reading C&S 7:A-H

Reading SAS Datasets

Sometimes your “raw” data is already a SAS dataset

    LIBNAME tomhs 'c:/my documents/ph5415/';
    PROC CONTENTS DATA=tomhs.bpstudy;
    PROC PRINT DATA=tomhs.bpstudy (obs=10);
    RUN;

• The LIBNAME statement tells SAS which directory (folder) the dataset is in.
• DATA=tomhs.bpstudy tells SAS to look for a SAS dataset called bpstudy in the directory referenced by tomhs.

PROC CONTENTS OUTPUT

The CONTENTS Procedure

Data Set Name:  TOMHS.BPSTUDY                         Observations:          902
Member Type:    DATA                                  Variables:             16
Engine:         V8                                    Indexes:               0
Created:        9:07 Saturday, February 26, 2005      Observation Length:    128
Last Modified:  9:07 Saturday, February 26, 2005      Deleted Observations:  0

-----Alphabetic List of Variables and Attributes-----

 #    Variable    Type    Len    Pos
 3    AGE         Num       8     16
 6    CHOL12      Num       8     40
 2    GROUP       Num       8      8
 8    HDL12       Num       8     56
 9    PULSE12     Num       8     64
10    PULSEBL     Num       8     72
 4    SBP12       Num       8     24
 5    SBPBL       Num       8     32
 1    SEX         Num       8      0
 7    TRIG12      Num       8     48
11    WT12        Num       8     80
12    WTBL        Num       8     88
13    cholbl      Num       8     96
14    hdlbl       Num       8    104
16    id          Char      6    120
15    trigbl      Num       8    112

PROC PRINT – 10 Observations

Obs  SEX  GROUP  AGE  SBP12  SBPBL  CHOL12  TRIG12  HDL12  PULSE12  PULSEBL   WT12   WTBL  cholbl  hdlbl  trigbl  id
  1    1      3   54      .  139.5       .       .      .        .       76      .  224.0     205     24     179  A00001
  2    2      6   62    129  144.0     241      65     66       80       72  124.0  141.0     260     75      67  A00010
  3    2      5   64    118  141.0     307     425     41       80       81  144.0  157.0     228     29     564  A00021
  4    1      5   47      .  134.0       .       .      .        .       80      .  214.0     194     66      49  A00023
  5    1      3   51      .  132.5       .       .      .        .       73      .  206.5     226     40      53  A00056
  6    1      2   62    133  133.0     196      72     44       72       76  211.0  227.5     207     47     126  A00075
  7    2      2   59    113  136.0     231      75     61       72       74  125.0  137.0     214     62     119  A00083
  8    1      3   63    127  137.5     217     137     35       64       74  195.0  211.5     214     37     165  A00105
  9    2      4   64    122  151.0     201      57     44       56       63  150.0  159.5     214     47     133  A00133
 10    2      5   52    122  140.0     209     105     57       60       81  168.5  196.5     215     55     105  A00143

Reading a SAS Dataset

    DATA temp;
      SET tomhs.bpstudy;
      sbpdif = sbp12 - sbpbl;
    RUN;

    PROC MEANS DATA=temp;
    RUN;

The SET statement reads in an observation. It replaces the INFILE and INPUT statements used when reading in text data.

The MEANS Procedure

Variable      N            Mean        Std Dev        Minimum        Maximum
SEX         902       1.3824834      0.4862633      1.0000000      2.0000000
GROUP       902       3.7882483      1.7874130      1.0000000      6.0000000
AGE         902      54.7727273      6.4039396     44.0000000     69.0000000
SBP12       848     124.1002358     15.1891840     87.0000000    187.0000000
SBPBL       902     140.3636364     12.4446043    113.5000000    190.0000000
CHOL12      849     220.8386337     38.8624342    111.0000000    456.0000000
TRIG12      849     106.9634865     62.5307082     24.0000000    592.0000000
HDL12       849      45.4923439     12.1059688     18.0000000    102.0000000
PULSE12     847      69.3506494     10.0301471     44.0000000    112.0000000
PULSEBL     901      73.6925638      8.6698610     48.0000000    109.0000000
WT12        848     176.8225236     30.4251368    105.5000000    286.0000000
WTBL        902     187.3791574     31.0782720    113.0000000    289.2500000
cholbl      900     228.2511111     38.4169684    113.0000000    357.0000000
hdlbl       900      43.6122222     11.6124701     17.0000000     97.0000000
trigbl      900     131.7366667     76.5211232     17.0000000    815.0000000
sbpdif      848     -16.5176887     14.4532685    -75.5000000     30.0000000

One-Way Analysis of Variance

• Two-sample t-test: compare means of two groups
  – Are the means different?
• What if we have more than two groups? Examples:
  – compare three different behavioral interventions
  – compare 5 different BP drugs

Analysis of Variance

Could compare all pairs of means with t-tests

three groups: A-B, B-C, A-C
five groups:  A-B, A-C, A-D, A-E,
              B-C, B-D, B-E,
              C-D, C-E,
              D-E

Analysis of Variance

Problem: multiple comparisons!
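A rough calculation puts numbers on the pairwise lists for three and five groups. With k groups there are k(k-1)/2 pairs. The second formula assumes the m tests are independent, which is only an approximation here because pairwise t-tests that share a group are correlated; it is meant as a sketch of the size of the problem, not an exact error rate.

```latex
\[
  \binom{k}{2} = \frac{k(k-1)}{2}:
  \qquad k = 3 \;\Rightarrow\; 3 \text{ tests},
  \qquad k = 5 \;\Rightarrow\; 10 \text{ tests}
\]
\[
  P(\text{at least one Type I error}) \approx 1 - (1-\alpha)^{m}
\]
\[
  \alpha = 0.05:\qquad
  m = 3 \;\Rightarrow\; 1 - 0.95^{3} \approx 0.14,
  \qquad
  m = 10 \;\Rightarrow\; 1 - 0.95^{10} \approx 0.40
\]
```

So under this approximation, testing all ten pairs among five groups at the 0.05 level carries roughly a 40% chance of at least one false rejection even when all group means are equal.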
When performing many tests, you may reject a true null hypothesis by chance (Type I error)

With $\alpha = 0.05$, you allow for about 1 in 20 tests of a true null hypothesis to be rejected by chance

Even if all group means are equal, there is a fairly large chance that at least one pair will test as different

Analysis of Variance

ANOVA simultaneously tests for difference in k means
• Y is continuous
• k samples from k normal distributions
• each of size $n_i$, not necessarily equal
• each with possibly different mean $\mu_i$
• each with constant variance $\sigma^2$

Constant variance

ANOVA is robust to violations of constant variance (and normality)

Rule of thumb: If the largest standard deviation is less than twice the smallest standard deviation, you’re ok.

Can sometimes transform to achieve equal variance or normality

Analysis of Variance

$H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$
$H_a$: Not all $\mu_i$ equal

Two-sample t-test is the special case k = 2

Sometimes referred to as a global or omnibus test

For each group i:
  $n_i$ = number of observations
  $\bar{Y}_i$ = sample mean
  $s_i^2$ = sample variance
  $\bar{Y}$ = overall mean

Two-sample T-test

• Compares means for two groups
• $t = \dfrac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
• This compares variation between groups (numerator) with variation within groups (denominator)

ANOVA F-test

• Compares means for all groups
• $F = \dfrac{\sum_{i} n_i (\bar{Y}_i - \bar{Y})^2 / (k-1)}{s_p^2}$
• This compares variation between groups (numerator, measured against the grand mean) with variation within groups (denominator)

Analysis of Variance

Variation for all observations: $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y})^2$

Called the “(corrected) total sum of squares” or SST

Can be divided into two parts:
• deviation of individual observation from its sample mean
• deviation of sample means from overall mean

$Y_{ij} - \bar{Y} = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y})$

Similar to regression

Analysis of Variance

$(Y_{ij} - \bar{Y}_i)$ measures variation within samples
$(\bar{Y}_i - \bar{Y})$ measures variation between samples

Each has a corresponding “sum of squares”

$\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_i)^2$ = Sum of squares within (SSW)
$\sum_{i}\sum_{j} (\bar{Y}_i - \bar{Y})^2 = \sum_{i} n_i (\bar{Y}_i - \bar{Y})^2$ = Sum of squares between (SSB)

Analysis of Variance

Each has a corresponding degrees of freedom (DF)

SST: n-1 df
SSB: k-1 df
SSW: (n-1) - (k-1) = n-k df

Ratio of each sum of squares over its degrees of freedom gives us the mean squares

MSW = SSW / (n-k) = average variation within k samples
MSB = SSB / (k-1) = average variation between k samples

Analysis of Variance

MSW is an estimate of the common within-group variance, $\sigma^2$

MSW = SSW/(n-k), where SSW = $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_i)^2$

Sample variance for ith group: $s_i^2 = \dfrac{\sum_{j} (Y_{ij} - \bar{Y}_i)^2}{n_i - 1}$

SSW = $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_i)^2 = \sum_{i} (n_i - 1) s_i^2$

MSW = $\dfrac{\sum_{i} (n_i - 1) s_i^2}{\sum_{i} (n_i - 1)}$ = Pooled variance for k groups

Analysis of Variance

The null hypothesis is tested by looking at the F ratio:

F = MSB/MSW, compared to an F distribution with k-1, n-k df

If variation between groups is much greater than variation within groups, F >> 1: reject null hypothesis

$F \approx 1$: fail to reject null hypothesis

Analysis of Variance

Results often presented in an ANOVA table

Source     SS     df     MS     F          p-value
Between    SSB    k-1    MSB    MSB/MSW    p
Within     SSW    n-k    MSW
Total      SST    n-1

SAS uses “Model” for “Between” and “Error” for “Within”

ANOVA in SAS; two ways

    PROC ANOVA DATA = LIPID;
      CLASS diet;
      MODEL lipid = diet;
    RUN;

    PROC GLM DATA = LIPID;
      CLASS diet;
      MODEL lipid = diet;
    RUN;

Both test for difference in mean lipid reduction for the two diets

PROC ANOVA and GLM

• Almost exactly the same for this case
• GLM is a more general procedure

TOMHS Study

• 6 Treatment groups (Variable GROUP)
  – Beta-blocker
  – Calcium channel blocker
  – Diuretic
  – Alpha-blocker
  – ACE inhibitor
  – Placebo
• All treatment groups given lifestyle intervention to lower BP

ANOVA – TOMHS Study

    PROC GLM DATA=temp;
      CLASS group;
      MODEL sbpdif = group;
      MEANS group;
    RUN;

The CLASS statement creates 5 dummy variables for you

OUTPUT

The GLM Procedure

Class Level Information

Class    Levels    Values
GROUP         6    1 2 3 4 5 6

Number of observations    902

NOTE: Due to missing values, only 848 observations can be used in this analysis

GLM – OUTPUT

The GLM Procedure

Dependent Variable: sbpdif

ANOVA TABLE

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5        13149.8402      2629.9680      13.52    <.0001
Error             842       163785.8945       194.5201
Corrected Total   847       176935.7347

R-Square    Coeff Var    Root MSE    sbpdif Mean
0.074320    -84.43703    13.94705      -16.51769

• If H0 is true then F should be near 1
• F = 2629.97/194.52 = 13.52
• Root MSE estimates the pooled (over 6 groups) standard deviation

GLM – OUTPUT

Source    DF      Type I SS    Mean Square    F Value    Pr > F
GROUP      5    13149.84018     2629.96804      13.52    <.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
GROUP      5    13149.84018     2629.96804      13.52    <.0001

Because the model includes only GROUP (no covariates), this portion of the output is the same as the ANOVA table.

The GLM Procedure

Level of           ------------sbpdif-----------
GROUP        N            Mean         Std Dev
1          126     -20.0555556      15.3474717
2          121     -17.5289256      11.6080607
3          124     -21.8467742      14.4977118
4          129     -16.0697674      14.0005223
5          127     -17.6023622      13.1844874
6          221     -10.5950226      14.3539675

Contrasts

    PROC GLM DATA=temp;
      CLASS group;
      MODEL sbpdif = group;
      MEANS group;
      ESTIMATE 'BB vs Placebo'   group 1 0 0 0 0 -1;
      ESTIMATE 'CCB vs Placebo'  group 0 1 0 0 0 -1;
      ESTIMATE 'Diur vs Placebo' group 0 0 1 0 0 -1;
      ESTIMATE 'AB vs Placebo'   group 0 0 0 1 0 -1;
      ESTIMATE 'ACE vs Placebo'  group 0 0 0 0 1 -1;
    RUN;

OUTPUT

The GLM Procedure

Dependent Variable: sbpdif

Parameter              Estimate    Standard Error    t Value    Pr > |t|
BB vs Placebo        -9.4605329        1.55691725      -6.08      <.0001
CCB vs Placebo       -6.9339030        1.57727142      -4.40      <.0001
Diur vs Placebo     -11.2517516        1.56489344      -7.19      <.0001
AB vs Placebo        -5.4747448        1.54534422      -3.54      0.0004
ACE vs Placebo       -7.0073396        1.55300848      -4.51      <.0001

Compare all Groups

    PROC GLM DATA=temp;
      CLASS group;
      MODEL sbpdif = group;
      LSMEANS group/PDIF;
    RUN;

GLM – OUTPUT

The GLM Procedure
Least Squares Means

GROUP    sbpdif LSMEAN    LSMEAN Number
1          -20.0555556                1
2          -17.5289256                2
3          -21.8467742                3
4          -16.0697674                4
5          -17.6023622                5
6          -10.5950226                6

Least Squares Means for effect GROUP
Pr > |t| for H0: LSMean(i)=LSMean(j)

Dependent Variable: sbpdif

i/j         1         2         3         4         5         6
1                0.1550    0.3103    0.0228    0.1622    <.0001
2      0.1550              0.0156    0.4087    0.9669    <.0001
3      0.3103    0.0156              0.0010    0.0161    <.0001
4      0.0228    0.4087    0.0010              0.3796    0.0004
5      0.1622    0.9669    0.0161    0.3796              <.0001
6      <.0001    <.0001    <.0001    0.0004    <.0001

The entry 0.1550 (row 1, column 2) is the p-value for Group 1 v Group 2

NOTE: To ensure overall protection level, only probabilities associated with pre-planned comparisons should be use