Simple Linear Regression and Correlation

advertisement
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Regression is used to study relationships between variables. Linear regression is used for
a special class of relationships, namely, those that can be described by straight lines, or
by generalizations of straight lines to many dimensions.
In regression we seek to understand how the value of a response of variable (Y) is related
to a set of explanatory variables or predictors (X’s). This is done by specifying a model
that relates the mean or average value of the response to the predictors in mathematical
form. In simple linear regression we have single predictor variable (X) and we use a line
(of the form y = mx + b) to relate the response to the predictor.
Regression Model Notation:
E(Y|X) =
Var(Y|X) =
In regression we first specify a functional form or model for E(Y|X). We then fit that
model with using available data and then assess the adequacy of the model. We may then
decide to remove some of the explanatory variables that do not appear important, add
others to improve the model, change the function form of the model, etc. Regression is
an iterative process where we fit a preliminary model and then modify our model based
on the results.
Multiple Regression Examples:
a) E(Mercury Level in Walleyes |Weight, Length, Alkalinity, pH) =
 0  1Weight   2 Length   3 Alkalinity   4 pH
b) E(Systolic BP|
)=
1
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Example 1: Length of Sunfish in Camp Lake
Data File: Camplake.JMP
Background: These data were collected as part of study of bluegill sunfish in Camp
Lake.
The variables measured in this study were:
 Age - age of the sunfish (determined by counting growth rings on scales, and
recorded to the nearest year)
 Scalerad - scale radius (mm/100)
 Length - length of sunfish (mm)  Response of interest (Y)
Goal: Investigate the relationship between age (X) and the length of a sunfish from
Camp Lake (Y). Also the relationship between scale radius and length is of potential
interest.
Assumptions for a Simple Linear Regression Model
1. The mean of the response variable (Y) can be modeled using X in the following
form: E (Y | X )   o   x * X
2. The variability in the response variable (Y) must be the same for each X, i.e.
Var (Y | X )   2 or equivalently SD(Y | X )   . In other words, the variation in the
response Y must be constant across the entire range of X.
3. The response measurements (Y’s) should be independent of each other. If a
simple random sample is taken from the population this typically the case. One
situation where this assumption is violated is when the data are collected over time.
4. The response measurements (Y), for a given value of X, should follow a normal
distribution.
We will discuss how to check the assumptions outlined above after fitting our initial
model.
2
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Correlation (initial investigation):
This gives us some idea about whether or not the predictor(s) X and response Y are
linearly related. The correlations are obtained under Analyze > Multivariate.
Next, select the variables of interest and place them all in the Y box.
Correlation measures to what degree two numeric variables X and Y are related to
another in a linear fashion. If the correlation is 1 or -1, then the variables are perfectly
linearly related. A scale for correlations is given below.
Sample Correlation (r)  (Pearson’s)
n
r
 (x
i 1
n
 (x
i 1
i
i
 x )( y i  y )
 x)2
n
(y
i 1
i
 y) 2
What is the correlation between the variables here? In particular, what is the correlation
between age and scale radius with length?
3
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
NEVER CALCULATE CORRELATIONS WITHOUT PLOTTING YOUR DATA!
To test whether a correlation is significantly different from 0, we select Multivariate >
Pairwise Correlations. Here we see that all three pair-wise correlations are significantly
different from zero.
4
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
More on Correlation
5
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Fitting the regression model relating Length (Y) to Age (X)
What is the population model? We want to model the mean value of Y using X, so the
model is given by:
E (Y | X )   o   x * X
or being specific for this situation, we have
E ( Length | Age)   o  1  Age
Again: E(Y|X) is the notation we use to denote the mean value of Y given X.
Taking a closer look at its pieces:
For this data set, Y = Length (mm) and X = Age (years).
How is the line that best fits our data determined? Using the Method of Least Squares.
6
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
To fit the model
E ( Length | Age)   o   Age * Age
we first select Analyze > Fit Y by X and place Length in the Y box and Age in the X box
as shown below. This will give us scatter plot of Y vs. X, from which we can fit the
simple linear regression model.
The resulting scatter plot is shown below.
With regression line added.
To perform the regression of Length on Age in JMP, select Fit Line from the Bivariate
Fit... pull-down menu.
7
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
The resulting output is shown below.
We begin by looking at whether or not this regression stuff is even helpful:
H o : Regression is NOT useful
H a : Regression is useful
This says that Age (X) explains a significant
amount of the variation in the response Length (Y).
Next we can assess the importance of the X variable, Age in this case:
H o :  Age  0
H 1 :  Age  0
This is equivalent to saying the X is not
useful for explaining Y. Thus the results
of these two tests are identical in every
8
way!
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Conclusions from both tests:
Determining how well the model is doing in terms of explaining the response is done
using the R-Square and Root Mean Square Error:
The proportion or percentage of variation explained
by the regression of Length (Y) on Age (X) is given
by the R-Square = .838 or 83.8%
More on the R-Square
The amount of unexplained variability in Length (Y)
taking Age (X) into consideration is given by the
RMSE (root mean square error). This is an estimate
of the SD(Length|Age).
Describing the Relationship:
Eˆ ( Length | Age)  70.99  16.71 * Age
Interpret each of the parameter estimates:
9
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Estimating the Mean Y for a Given X and Making Individual Predictions
What do we estimate for the mean length of all sunfish in Camp Lake that are 4 years
old?
If picked a 4 year old fish from Camp Lake at random, what do we predict its length will
be?
C
hecking the Assumptions:
To check the adequacy of the model in terms of the basic assumptions involves looking a
plots of the residuals from the fit. One plot that is particular helpful for checking a
number of the model assumptions is a plot of the residuals vs. the fitted values. When we
are performing simple linear regression we can alternatively plot the residuals vs. X,
which is what JMP gives by default when selecting the Linear Fit > Plot Residuals
option.
Ideal Residual Plot:
Violations to Assumption #1:
The trend need not be linear (BAD)
The trend need not be linear (BAD)
10
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Violations to Assumption #2:
Variance of Y given X is NOT constant.
Megaphone (BAD)
Violations to Assumption #3:
One point closely following another -positive autocorrelation, (BAD)
Extreme bouncing back and forth -- negative
autocorrelation (BAD)
Violations to Assumption #4:
To check this assumption, simple save the residuals out and make a histogram of the
residuals and/or look at a normal quantile plot. Recall, you can easily make a histogram of a
variable under Analyze > Distribution. We should generally assess normality using a
normal quantile plot as well.
11
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Checking for outliers:
Determine the value of 2*RMSE. Any observations outside these bands are potential outliers
and should be investigated further to determine whether or not they adversely affect the
model.
Checking for outliers in this example we find:
To obtain the residual plot in JMP select Linear Fit > Plot Residuals
12
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
THE ASSUMPTION CHECKLIST:
Model Appropriate:
Constant Variance:
Independence:
Normality Assumption (see histogram above):
Identify Outliers:
Confidence Interval for the Average Length:
(i.e. the average for the entire population of sunfish of a specified age.)
Select Confid Curves Fit from the Linear Fit pull-down menu located below the scatter
plot. The narrow bands in plot below represent the CI for the Average Length. For
example, from the plot below we estimate that the mean length for 2 year old sunfish is
likely to be somewhere between 95 – 107 mm. We will examine a more precise way for
obtaining such intervals later in the tutorial.
13
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Prediction Interval for the Length of Single Sunfish :
(i.e., for a single sunfish sampled from the population of all sunfish with a specified age.)
Select Confid Curves Indiv from the Linear Fit pull-down menu which is located
directly beneath the scatter plot. These are the wider bands in the plot above. For
example if we were to sampled a single 4 year old sunfish we estimate with 95%
confidence that its length will be somewhere between 115 – 155 mm. We will examine a
more precise way for obtaining exact intervals of this form later in the tutorial.
14
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Using the Analyze > Fit Model Option to Perform the Regression
An alternative to using Fit Y by X to perform simple linear regression, is to use the Fit
Model option from the Analyze menu. The advantages of this approach are two-fold:
1) You have access to more detailed results from your regression and have
enhanced features for estimation/prediction of Y.
2) Allows for the addition of more predictors (X’s) to your model. This is
called multiple regression and will be discussed later.
For the Camp Lake sunfish example we fit the model as follows:
Select Analyze > Fit Model and place Length in the Y box and Age in Model Effects
box.
Y variable
X variables
go here.
If we had more predictors (X’s) that we wanted to add to our model we would simply put
them in the Model Effects box, e.g. we could scale radius as a predictor of length as well.
Remember when we have more than one predictor in a linear regression we call it a
multiple regression.
15
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
The output from the Analyze > Fit Model option is shown below:
The same numeric summaries, parameter estimates,
and test results are contained as part of the standard
output. The plot at the top is NOT a scatter plot of
Y vs. X. It is a plot of the actual Y values vs. the
predicted values from the regression model. The
stronger the trend exhibited the better the fit. It is
essentially a visualization of the R-square for the
fitted model.
In cases where we have multiple values of the response
for some of the X values we will be given the results of
the Lack Of Fit test. If p-value here is small it can
indicate that our model does not adequately represent
the mean of the response variable Y. For example, if we
fit line to clearly nonlinear relationship we would expect
this p-value to be significant. In cases where there is
significant lack of fit the plot of the residuals vs. the
fitted values will generally show some non-linear trend.
Notice the bulk of the output is same as that obtained using the Analyze > Fit Y by X
approach.
16
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Interval Estimation for E(Y|X), the Mean Value of Y for a given X &
Prediction of Y for an Individual with a given X (when using Fit Model)
You can save 95% Confidence
Intervals for E(Y|X) to the data
spread sheet by selecting Mean
Confidence Interval.
You can save 95% Prediction
Intervals for Individual Y values
to the data spread sheet by
selecting Indiv Confidence
Interval.
The table on the following page is a portion data spread sheet showing both types of
intervals for these data.
Interpretation of the 95% Confidence Interval for E(Length|Age=5)
Consider estimating the average/mean length for all five year old sunfish in Camp Lake.
A 95% confidence interval for this mean is given by the interval 151.76 mm to 157.28
mm. There is a 95% chance this interval covers the true mean length of 5 year old
sunfish in Camp Lake.
Interpretation of the 95% Prediction Interval for Length|Age=5
Suppose we picked one five year old sunfish at random from the population of all 5 year
old sunfish in Camp Lake. What do estimate the length of this sunfish will be? We
estimate, with 95% confidence, that the actual length for this particular sunfish will be
somewhere between 135.2 mm and 173.8 mm. This range of scores has a 95% chance of
covering the actual length for this randomly selected 5 year old sunfish. Notice how
much wider this interval is when compared to interval for the mean length for all 5 year
old sunfish in Camp Lake. This should seem natural as it is much harder to predict the
length of a single randomly selected fish than the average score for all fish.
17
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Example 2 – Length (Y) vs. Scale Radius (X)
190
180
Length
170
160
150
140
130
120
110
100
90
80
1.5
2
2.5
3
3.5
Scalerad
Linear Fit
Length = 42.346763 + 40.17735 Scalerad
Summary of Fit
RSquare
RSquare Adj
Root Mean Square Error
Mean of Response
Observations (or Sum Wgts)
0.731336
0.726858
12.28663
142.9355
62
Summary of R-square:
Analysis of Variance
Source
Model
Error
C. Total
DF
1
60
61
Sum of Squares
24656.064
9057.678
33713.742
Mean Square
24656.1
151.0
F Ratio
163.3271
Prob > F
<.0001
Parameter Estimates
Term
Intercept
Scalerad
Estimate
42.346763
40.17735
Std Error
8.02401
3.143781
t Ratio
5.28
12.78
Prob>|t|
<.0001
<.0001
Conclusions from tests results highlighted above:
18
STAT 602 - Simple Linear Regression and Correlation
Spring 2015
Residual Plot (residuals vs. scale radius (X))
40
30
Residual
20
10
0
-10
-20
-30
1.5
2
2.5
3
3.5
Scalerad
Discussion of Residual Plot:
Distribution of the Residuals
19
Download