Plots, Correlations, and Regression

Getting a feel for the data using plots, then analyzing the data with correlations and linear regression.

Introduction to Plots

Before you conduct simple linear regression on a data set, it is important to determine whether a linear relationship between the two variables appears justified. If a linear relationship appears to exist, you can proceed with regression analysis. If there is clearly no linear relationship between the two variables of interest, a different type of analysis may be preferable. An easy way to eyeball the data for a linear relationship is to plot the two variables: the independent variable on the x-axis and the dependent variable on the y-axis.

Two Types of Plots in SAS

You can plot your data in SAS using either PROC PLOT or PROC GPLOT. Both are acceptable methods, although some find that GPLOT creates a better-looking graph. GPLOT also creates the plot in the separate Graph window in SAS, whereas PLOT creates the plot in the Output window.

Fog Data Set

The data set densefog.csv contains the number of deaths and the sulfur dioxide (SO2) level for various locations. Read the data set into SAS using the following code (with the necessary modifications to the file location):

    DATA fog;
        INFILE 'C:\Documents and Settings\My Documents\densefog.csv' dsd firstobs = 1;
        INPUT deaths sd;
    RUN;
    PROC PRINT DATA = fog;
    RUN;

Plotting the Data

We now want to plot the data to determine whether a linear relationship between number of deaths and SO2 level seems justified. SO2 is the independent variable (X) and number of deaths is the dependent variable (Y). The PLOT statement generally has the form "PLOT Y*X". First, use PROC PLOT:

    PROC PLOT DATA = fog;
        TITLE 'Scatterplot of Deaths by SO2 Using Proc Plot';
        PLOT deaths * sd;
    RUN;

Notice the Plot is in the Output

Interpreting the Plot

• Each data point is represented by the letter A. If two points have the same X and Y values, the letter B denotes this.
If three points were to fall on the same location, they would be denoted by a C, and so on.
• From this plot, it appears a linear relationship could be justified (imagine drawing a line through the points).

PROC GPLOT

Now plot the same data using a slightly different method, PROC GPLOT. This plot also indicates that a linear relationship appears to be justified (it is the same plot as in PROC PLOT, only in a slightly different format):

    PROC GPLOT DATA = fog;
        TITLE 'Scatterplot Using GPLOT';
        PLOT deaths * sd;
    RUN;

GPLOT in GRAPH Window

Notes on GPLOT

• Notice that the GPLOT is nicely contained on one page, whereas the plot from PLOT is more spread out in the Output.
• To save the PLOT, simply save the Output as an .rtf file; it can be opened in Word later.
• To save the GPLOT, you can copy and paste the graph into Word. If this doesn't work, you can export the image under File -> Export as Image… (see previous slide) and save the graph as a .bmp file. This file can then be inserted into a Word document later.

Correlation

One way to test whether two variables are linearly related is to find the correlation between them and test the hypotheses

    H0: ρ = 0 vs. H1: ρ ≠ 0

where ρ is the population correlation and r is its estimate from the sample. A value of r close to 1 or -1 indicates a strong linear relationship. A positive r indicates a positive correlation (as one variable increases, the other variable also increases); a negative r indicates a negative correlation (as one variable increases, the other variable decreases).

PROC CORR in SAS

An easy way to calculate the correlation between variables in SAS is with the CORR procedure. Make sure to check your Log after running this program:

    PROC CORR DATA = fog;
        TITLE 'Correlation of Deaths vs. SO2';
        VAR deaths sd;
    RUN;

PROC CORR Output

Interpreting Output

• The correlation (r) between deaths and sulfur dioxide is 0.89.
• The p-value for this correlation is p < 0.0001, so we reject the null hypothesis and conclude that there is a correlation between deaths and sulfur dioxide.
• There is a strong, positive, linear relationship between deaths and sulfur dioxide.

Linear Regression

Now that we have determined that a linear relationship exists between these two variables, we can conduct linear regression analysis to quantify this relationship. Linear regression defines a line that describes the relationship between the two variables. (Note: It is not necessary to test for a correlation before doing regression analysis; it is only important to eyeball the data to determine whether a linear relationship seems justified.)

PROC REG in SAS

The following code runs the regression procedure in SAS. The general model statement is:

    model y-variable = x-variable

You can also request a plot of the two variables showing the fitted regression line.

    PROC REG DATA = fog;
        MODEL deaths = sd;
        PLOT deaths * sd;
    RUN;
    QUIT;

Linear Regression Output

Regression Line Plot

(You may have to scroll down in your GRAPH window to see it. Notice it has the same title as the PROC CORR output, because we did not define a new title.)

Interpreting Output

The value for b0 can be found under Parameter Estimates to the right of "Intercept." The value for b1 can also be found under Parameter Estimates, to the right of the name of the predictor variable (in this case, sulfur dioxide (sd)). Using the output from PROC REG, you can now write the estimated regression equation:

    Yhat = 138.64 + 234.10x

If you wanted to estimate the number of deaths at a sulfur dioxide level of 0.20, you would put this value into the regression equation and solve for Yhat:

    Yhat = 138.64 + 234.10(0.20) = 185.46 deaths

Interpreting Output, cont.

Find the R² value on the output (0.7960). This value is the proportion of variability in the dependent variable that is explained by the predictor variable in the model. In this case, about 80% of the variability in number of deaths is explained by sulfur dioxide levels. Also notice that R² = r² = (0.89217)² = 0.7960.
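
The prediction and the R² check above can also be reproduced in a short DATA step. This is only a sketch: it hard-codes the rounded estimates quoted from the PROC REG and PROC CORR output, and the data set name predict and the variable names b0, b1, yhat, r, and rsq are made up for illustration.

```sas
/* Sketch: plug a sulfur dioxide level into the fitted regression line. */
/* The numbers are the rounded values quoted from the output above;     */
/* the data set and variable names here are illustrative assumptions.   */
DATA predict;
    b0 = 138.64;           /* intercept from Parameter Estimates      */
    b1 = 234.10;           /* slope for sd from Parameter Estimates   */
    sd = 0.20;             /* SO2 level at which to predict           */
    yhat = b0 + b1 * sd;   /* predicted number of deaths              */
    r = 0.89217;           /* correlation reported by PROC CORR       */
    rsq = r**2;            /* matches R-square in simple regression   */
RUN;

PROC PRINT DATA = predict;
    VAR sd yhat rsq;
RUN;
```

The printed yhat should be 185.46, and rsq should agree with the reported R² of 0.7960 after rounding to four decimals.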
Testing β1 = 0

When conducting linear regression, you want to test whether there is a significant linear relationship between your predictor and outcome variables. In simple linear regression (only one predictor variable), this can be done by testing either ρ = 0 (the population correlation) or β1 = 0. This test of independence is:

    H0: β1 = 0 vs. Ha: β1 ≠ 0

Because β1 is the slope of the regression line, if there is no relationship between the two variables (i.e., they are independent), you would expect the slope of the line to be 0 (meaning that levels of y do not change with changes in the levels of x). The alternative is that the line has some non-zero slope, indicating that the two variables are dependent.

Testing Independence, cont.

In simple linear regression, the overall F-test and the individual t-test of β1 = 0 have the same p-value. The test statistic for H0: β1 = 0 is t* = b1/se(b1). In this example, t* = 234.10/32.87 = 7.12, with a p-value < 0.0001. Notice that SAS computes t* and the p-value for you. Because the p-value < 0.05, we reject the null hypothesis and conclude that sulfur dioxide and number of deaths are not independent.

Conclusion

Now you are familiar with conducting simple linear regression in SAS. The next tutorial introduces you to model diagnostics using SAS. These help you determine whether the assumptions of the regression model are met, whether the model is a good fit for the data, and whether there are any outlying data points.
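
As a check on the test statistic for the slope discussed above, the arithmetic can be sketched in a DATA step. The estimate and standard error are the rounded values quoted from the PROC REG output; the data set name slopetest and the variable names are hypothetical.

```sas
/* Sketch: recompute t* = b1 / se(b1) from the reported values.        */
/* Data set and variable names are illustrative assumptions.           */
DATA slopetest;
    b1 = 234.10;             /* slope estimate for sd                  */
    se_b1 = 32.87;           /* standard error of the slope            */
    t_star = b1 / se_b1;     /* t statistic for H0: beta1 = 0          */
    f_equiv = t_star**2;     /* in simple regression, F equals t*t     */
RUN;

PROC PRINT DATA = slopetest;
    VAR b1 se_b1 t_star f_equiv;
RUN;
```

The printed t_star should round to 7.12. Because the inputs are rounded, f_equiv will only approximate the F value shown in the PROC REG output. Reproducing the p-value by hand would also require the error degrees of freedom (n - 2) from the output, for example with 2 * (1 - PROBT(ABS(t_star), df)); that value is not quoted here, so it is left out of the sketch.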