CROP 590 Experimental Design in Agriculture
Lab exercise – 4th week
ANOVA assumptions
Transformations

SAS On-line Documentation
PROC GLM
PROC UNIVARIATE
PROC BOXPLOT
PROC GLIMMIX

Part 1. Data Entry

Up to this point we have entered data into SAS in the format of a SAS dataset. If you are working with large datasets that are in a different format, you may prefer to write a short program to rearrange the data in SAS. This can be achieved using Do loops and a trailing ‘@’ symbol, which tells SAS to hold the current line so that the next Input statement reads another data point from it. A double trailing ‘@@’ tells SAS to keep reading from the same line, across iterations of the data step, until there are no more data points to be read. Run the program below, and note how the data are reformatted.

Data one;
Input herbicide $ @;
Do i=1 to 10;
Input weedcount @;
Output;
End;
Datalines;
H-A 33 21 48 18 53 31 39 26 44 25
H-B 25 28 20 15 14 30 27 17 23 13
H-C 4 5 2 5 4 1 3 4 2 5
H-D 8 11 9 13 6 9 7 6 6 12
;
Proc Print;
Run;

Part 2. Testing ANOVA assumptions

Conduct a one-way ANOVA of the above data set using herbicide as the independent variable. Use the plots=diagnostics option in PROC GLM to obtain a standard group of preformatted diagnostic plots for evaluating ANOVA assumptions. Using plots=diagnostics(unpack) instead will print each of the default diagnostic plots as a separate image. Request Bartlett’s test for homogeneity of variances, or use the default, which is Levene’s test. Output the residuals and predicted values to a new data set for further diagnosis.

PROC GLM plots=diagnostics;
Class herbicide;
Model weedcount = herbicide;
Means herbicide / hovtest=bartlett;
Means herbicide / hovtest;
output out=new r=residual p=predicted;
Run;

Manually obtain residual plots, using the herbicide names as labels and the vref=0 option to include a reference line at e_ij = 0:

PROC GPLOT data=new;
plot residual*predicted = herbicide / vref=0;
run;

You can obtain box plots similar to the ones that SAS creates when you run PROC GLM by using the BOXPLOT procedure.

PROC BOXPLOT data=new;
Plot residual*herbicide;
Run;

Proc Univariate can be used to test for normality (normal option) and to obtain a variety of descriptive plots, including normal probability plots (plots option).

proc univariate data=new normal plots;
histogram residual / kernel;
QQPLOT residual / NORMAL(MU=EST SIGMA=EST L=1);
var residual;
run;

Part 3. Data Transformation

Are the residuals for the variable ‘weedcount’ normally distributed? Do they have homogeneous variance? What is your evidence? If not, can you determine what transformation is needed? Rerun your analysis on the transformed data and recheck the ANOVA assumptions.

The hovtest option in PROC GLM can only be used with one-way models. According to the SAS help information, "homogeneity of variance testing for more complex models is a subject of current research." Two approaches can be found in the literature for testing for HOV in a Randomized Block Design. One is to simply ignore blocks and check for HOV of treatment groups. Another is to adjust the data for blocks (remove the block effects) and then run hovtests using a one-way model for the treatment groups.

If no suitable transformation can be identified, data with heterogeneous variance can still be analyzed with PROC MIXED, using a Repeated statement with an unstructured covariance option (Type=un). We will discuss this topic more in the lab exercise on repeated measures.

Part 4.
Generalized Linear Models

Generalized Linear Models are becoming the method of choice for analyzing data that are not normally distributed, but follow another probability distribution in the exponential family (which includes the normal, binomial, Poisson, gamma, and negative binomial distributions). Appropriate use of Generalized Linear Models requires an understanding of the underlying theory, which is beyond the scope of this class. To get some idea about the SAS syntax and output for this type of analysis, run the following program:

proc glimmix data=one;
Class herbicide;
Model weedcount = herbicide / link=log s dist=poisson;
lsmeans herbicide / ilink diff CL;
run;

How do the F tests and estimates of the means compare to the analysis of transformed data?

Later in the term, we will conduct combined analyses across multiple experiments and learn how to determine if the variances for those experiments are homogeneous. PROC MIXED and PROC GLIMMIX can both be used to account for heterogeneous variances without the use of transformations.

Part 5. Detection of Outliers

In general, obvious outliers can be identified visually from residual plots. If you are confident that a mistake or abnormality occurred during data collection, it may be best to treat that observation as a missing plot. On the other hand, unusual variation that is real may provide valuable insights and should not be ignored or discarded. We should also avoid the temptation to massage our data to make them meet our expectations.

If you are not sure how to handle an outlier, you may want to “studentize” the residuals to express them in standardized units. This may help in establishing an objective criterion for outlier detection. For example, you might decide that studentized residuals more than 4 standard errors from a mean of zero are too extreme and were likely the result of a mistake during data collection.
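
One way to apply a criterion like this is sketched below. It assumes the data set one from Part 1 is still available. The data set names diag and flagged and the variable name studres are illustrative choices, and the cutoff of 4 is only the example value mentioned above, not a recommended standard.

```sas
/* Fit the one-way model without printed output and save the
   raw and studentized residuals (names here are illustrative) */
proc glm data=one noprint;
  class herbicide;
  model weedcount = herbicide;
  output out=diag r=residual student=studres;
run;
quit;

/* Keep only observations whose studentized residual is more
   than 4 units from zero (assumed example cutoff) */
data flagged;
  set diag;
  if abs(studres) > 4;
run;

/* List any flagged observations; nothing is printed if none qualify */
proc print data=flagged;
  var herbicide weedcount residual studres;
run;
```

Writing the cutoff into a short data step keeps the rule explicit and repeatable, so the same criterion is applied to every observation rather than judged by eye.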

Replace one of the data points in the original data set with an outlier, save the result as a new data set (named three in the programs below), and rerun the analysis. Various options can be used in PROC GLM to obtain diagnostic statistics for outlier detection.

Proc GLM data=three plots=diagnostics;
Class herbicide;
Model weedcount = herbicide;
Means herbicide / hovtest=bartlett;
Means herbicide / hovtest;
output out=new r=residual p=predicted student=studres stdr=rstderr cookd=cookct covratio=covrat dffits=dffit;
Run;

Proc print data=new;
Run;

proc gplot data=new;
plot residual*predicted = herbicide;
plot studres*predicted = herbicide;
Run;

Note that the shapes of the two residual plots are the same, but values of studres are expressed in standard units.

PROC MIXED also has some nice features for analyzing the influence of data outliers.

proc mixed data=three;
Class herbicide;
Model weedcount = herbicide / influence;
run;
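
The programs in this part read a data set named three, which has to be created before they are run. A minimal sketch of one way to do that follows. It assumes the data set one from Part 1, with the loop counter i kept as a variable. The observation chosen (the first H-C count) and the replacement value of 60 are arbitrary illustrations, not values from the exercise.

```sas
/* Copy the Part 1 data and overwrite one count with an
   artificial outlier */
data three;
  set one;
  /* hypothetical choice: first plot of herbicide H-C set to 60 */
  if herbicide = 'H-C' and i = 1 then weedcount = 60;
run;

/* Check the modified group before rerunning the analysis */
proc print data=three;
  where herbicide = 'H-C';
run;
```

Working on a copy leaves the data set one unchanged, so the analyses with and without the outlier can be compared side by side.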