Plots, Correlations, and Regression

Getting a feel for the data using
plots, then analyzing the data with
correlations and linear regression.
Introduction to Plots
Before you decide to conduct simple linear
regression on a data set, it is important to
determine whether a linear relationship between
the two variables appears justified. If it appears
a linear relationship exists, you can proceed with
regression analysis. If there is clearly no linear
relationship between the two variables of
interest, a different type of analysis may be
preferred.
An easy way to eye-ball the data for a linear
relationship is to plot the two variables: the
independent variable on the x-axis, and the
dependent variable on the y-axis.
Two Types of Plots in SAS
You can plot your data in SAS using either
PROC PLOT or PROC GPLOT. Both are
acceptable methods, although some find
that GPLOT creates a better-looking graph
(it also creates the plot in the separate
Graph window in SAS, as opposed to
PLOT, which creates the plot in the Output
window).
Fog Data Set
The data set densefog.csv contains information on
number of deaths and sulfur dioxide (SO2) level
for various locations. Input the data set into SAS
using the following code (with the necessary
modifications to the file location):
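DATA fog;
  /* DSD reads comma-delimited values; FIRSTOBS = 1 begins reading at line 1 */
  INFILE 'C:\Documents and Settings\My Documents\densefog.csv' DSD FIRSTOBS = 1;
  INPUT deaths sd;  /* deaths = number of deaths, sd = sulfur dioxide level */
RUN;

PROC PRINT DATA = fog;
RUN;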
Plotting the Data
We now want to plot the data to determine whether a linear
relationship between number of deaths and SO2 levels
seems justified. SO2 is considered the independent
variable (X) and the number of deaths is the dependent variable (Y).
The PLOT statement has the general form “PLOT Y*X”.
First, use PROC PLOT:
PROC PLOT DATA = fog;
  TITLE 'Scatterplot of Deaths by SO2 Using Proc Plot';
  PLOT deaths * sd;
RUN;
Notice that the plot appears in the Output window.
Interpreting the Plot
• Each data point is represented by the
letter A. If two points have the same X
and Y values, the letter B denotes this. If
three points were to fall on the same
location, it would be denoted by a C, etc.
• From this plot, it appears a linear
relationship could be justified (imagine
drawing a line through the points).
PROC GPLOT
Now plot the same data using a slightly different
method, PROC GPLOT. This plot will also
indicate that a linear relationship appears to be
justified (it is the same plot as in PROC PLOT,
only in a slightly different format):
PROC GPLOT DATA = fog;
TITLE 'Scatterplot Using GPLOT';
PLOT deaths * sd;
RUN;
GPLOT in GRAPH Window
Notes on GPLOT
• Notice that the GPLOT is nicely contained on
one page, whereas the plot from PLOT is more
spread out in the Output.
• To save the PLOT, simply save the Output as a
.rtf; it can be opened in Word later.
• To save the GPLOT, you can copy and paste the
graph into Word. If this doesn’t work, you can
export the image under File -> Export as
Image…(see previous slide) and save the graph
as a .bmp file. This file can then be accessed
later and inserted into a Word document.
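The saving steps above can also be scripted. The following is a minimal sketch, assuming the ODS RTF destination is available in your SAS installation. The output path is a hypothetical placeholder; change it to a folder you can write to. Output produced between the two ODS statements should be written to a single .rtf file that Word can open.

```sas
/* Hypothetical output path: replace with a folder you can write to */
ODS RTF FILE = 'C:\temp\fog_plots.rtf';

PROC PLOT DATA = fog;
  TITLE 'Scatterplot of Deaths by SO2 Using Proc Plot';
  PLOT deaths * sd;
RUN;

PROC GPLOT DATA = fog;
  TITLE 'Scatterplot Using GPLOT';
  PLOT deaths * sd;
RUN;
QUIT;

/* Close the destination so the file is finished and can be opened */
ODS RTF CLOSE;
```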
Correlation
One way to test whether two variables are
linearly related is to compute the sample
correlation r between them and test the hypotheses
H0: ρ = 0 vs. H1: ρ ≠ 0
where ρ is the population correlation that r estimates.
An r value close to 1 or -1 indicates a
strong linear relationship. A positive r indicates a
positive correlation (as one variable
increases, the other variable also
increases); a negative r indicates a
negative correlation (as one variable
increases, the other variable decreases).
PROC CORR in SAS
An easy way to calculate the correlation between
variables in SAS is with the CORR procedure.
Make sure to check your Log after running this
program:
PROC CORR DATA = fog;
TITLE 'Correlation of Deaths vs. SO2';
VAR deaths sd;
RUN;
PROC CORR Output
Interpreting Output
• The correlation (r) between deaths and
sulfur dioxide is 0.89.
• The p-value of this correlation is p<0.0001,
indicating we reject the null hypothesis
and conclude that there is a correlation
between deaths and sulfur dioxide.
• There is a strong, positive, linear
relationship between deaths and sulfur
dioxide.
Linear Regression
Now that we have determined a linear
relationship exists between these two
variables, we can conduct linear
regression analysis to quantify this
relationship. Linear regression will define
a line that describes the relationship
between these two variables.
(Note: It is not necessary to test for a
correlation before doing regression
analysis; it is only important to eye-ball the
data to determine whether a linear
relationship seems justified.)
PROC REG in SAS
The following code runs the regression procedure
in SAS. The general model statement is:
model y-variable = x-variable
You can also request a plot of the two variables
showing the fitted regression line.
PROC REG DATA = fog;
MODEL deaths = sd;
PLOT deaths * sd;
RUN;
QUIT;
Linear Regression Output
Regression Line Plot (you may have to scroll down in your
GRAPH window to see it—notice it has the same title as
the PROC CORR, because we did not define a new title)
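A fitted-line plot with its own title can also be sketched with PROC SGPLOT. This is an alternative to the PLOT statement in PROC REG, not part of the original program, and it assumes a SAS release that includes PROC SGPLOT (9.2 or later). The title text is illustrative.

```sas
/* Sketch: scatterplot with a fitted least-squares line using PROC SGPLOT */
PROC SGPLOT DATA = fog;
  TITLE 'Deaths vs. SO2 with Fitted Regression Line';  /* replaces the title left over from PROC CORR */
  REG X = sd Y = deaths;  /* plots the points and overlays the fitted line */
RUN;

TITLE;  /* clear the title so it does not carry over to later output */
```

Because TITLE is a global statement, it stays in effect until it is redefined or cleared, which is why the earlier PROC CORR title appeared on the PROC REG plot.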
Interpreting Output
The value for b0 can be found under Parameter
Estimates to the right of “Intercept.” The value
for b1 can also be found under Parameter
Estimates, to the right of the name of the
predictor variable (in this case, sulfur dioxide
(sd)).
Using the output from PROC REG, you can now
estimate the regression equation:
Yhat = 138.64 + 234.10x
If you wanted to estimate the number of deaths
with a sulfur dioxide level of 0.20, you would put
this value into your regression equation and
solve for Yhat: Yhat = 138.64 + 234.10(0.20) =
185.46 deaths.
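The same arithmetic can be checked in a short DATA step. This is a minimal sketch that hard-codes the rounded estimates quoted above; the variable names b0, b1, and yhat are chosen here for illustration.

```sas
DATA _NULL_;
  b0 = 138.64;          /* intercept estimate from Parameter Estimates */
  b1 = 234.10;          /* slope estimate for sd */
  sd = 0.20;            /* sulfur dioxide level of interest */
  yhat = b0 + b1 * sd;  /* predicted number of deaths */
  PUT yhat= 8.2;        /* writes the prediction (185.46) to the Log */
RUN;
```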
Interpreting Output, cont.
Find the R² value on the output (0.7960).
This value is the proportion of variability in
your dependent variable explained by the
predictor variable in the
model. In this case, about 80% of the variability
in number of deaths is explained by sulfur
dioxide levels.
Also notice that R² = r² = (0.89217)² = 0.7960.
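The relationship between the correlation and R² can be verified with a short DATA step. This is a sketch that hard-codes the correlation reported by PROC CORR; the variable names are illustrative.

```sas
DATA _NULL_;
  r = 0.89217;    /* correlation from the PROC CORR output */
  rsq = r ** 2;   /* squared correlation */
  PUT rsq= 6.4;   /* rounds to 0.7960, matching R-Square from PROC REG */
RUN;
```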
Testing β1 = 0
When conducting linear regression, you want to test
whether there is a significant linear relationship between
your predictor and outcome variables. In simple linear
regression (only one predictor variable), this can be done by
testing either that the correlation is 0 or that β1 = 0. This test of independence is:
H0: β1 = 0 vs. Ha: β1 ≠ 0
Because β1 is the slope of the regression line, if there is no
relationship between the two variables (i.e. they are
independent), you would expect the slope of the line to
be 0 (meaning that levels of y do not change with
changes in the levels of x). The alternative to this is that
the line does have some non-zero slope, indicating that
the two variables are dependent.
Testing Independence, cont.
In simple linear regression, the overall F-test
and individual t-test of β1=0 have the same
p-value. The test statistic for H0: β1 = 0 is t* =
b1/se(b1). In this example, t* =
234.10/32.87 = 7.12, with a p-value
< 0.0001. Notice that SAS computes t*
and calculates the p-value. Because the
p-value < 0.05, we reject the null
hypothesis and conclude that sulfur
dioxide and number of deaths are not
independent.
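The test statistic can be reproduced by hand in a DATA step. This is a sketch using the rounded estimates quoted above, so the results may differ slightly in the last digit from the PROC REG output. It also computes t* squared, which in simple linear regression equals the overall F statistic; this is why the two tests share a p-value.

```sas
DATA _NULL_;
  b1 = 234.10;      /* slope estimate from Parameter Estimates */
  se_b1 = 32.87;    /* standard error of the slope */
  t = b1 / se_b1;   /* t* = b1 / se(b1) */
  f = t ** 2;       /* with one predictor, the overall F equals t* squared */
  PUT t= 6.2 f= 8.2;  /* writes both values to the Log */
RUN;
```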
Conclusion
Now you are familiar with conducting simple
linear regression in SAS. The next tutorial
introduces you to model diagnostics using
SAS. These help you determine whether
the assumptions of the regression model
are met, whether the model is a good fit
for the data, and whether there are any
outlying data points.