X. Simple Linear Regression

Data File:
NFL Offense 2003
Background: I collected this data on Nov 3, 2002. This dataset contains some
summaries for all NFL football teams.
Variables:
> Name: Name of the team
> YDS: Total offense yards for the season thus far
> YPG: Yards per game played
> RUSH: Total rushing yards for the season thus far
> YPG: Rushing yards per game played
> PASS: Total passing yards for the season thus far
> YPG: Passing yards per game played
> PTS: Total points scored for the season thus far
> PTS/G: Points scored per game played
Goal:
Investigate the relationship between yards per game (X) and points scored
per game (Y).
Assumptions
1. The mean of the response variable (Y) can be modeled using X in the following
form: $E(Y \mid X) = \beta_0 + \beta_X X$
(i.e. using a line to summarize the mean value of Y as a function of X)
2. The variability in the response variable (Y) must be the same for each X, i.e.
$\mathrm{Var}(Y \mid X) = \sigma^2$ or $\mathrm{SD}(Y \mid X) = \sigma$.
3. The response measurements (Y’s) should be independent of each other.
4. The response measurements (Y) should follow a normal distribution for each X (in practice
this is checked using the residuals).
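Taken together, the four assumptions amount to the usual normal-error regression model. One compact way to write it is sketched below; the symbol $\varepsilon_i$ for the deviation of the i-th response from the line is an assumption of this sketch and is not used elsewhere in this handout.

```latex
Y_i = \beta_0 + \beta_X X_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{indep.}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
```

The straight-line mean corresponds to Assumption 1, the common $\sigma^2$ to Assumption 2, the independence of the $\varepsilon_i$ to Assumption 3, and their normal shape to Assumption 4.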
You should also take the time to identify outliers. Outliers can be very problematic in a
regression model.
We will discuss how to check the assumptions outlined above after fitting our initial
model.
Correlations (initial investigation):
This gives us some idea about whether or not the X and Y variables are linearly related.
The correlations are obtained under Analyze > Multivariate.
Next, select the variables of interest and place them all in the Y box.
Correlations away from 0 mean X and Y are related in a linear fashion; the closer the correlation
is to 1 or -1, the stronger the linear relationship. If the correlation is exactly 1 or -1, then the
points fall exactly on a line. A scale for correlations is given below.
What is the correlation between PTS/G and YPG?
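For readers who want to see what the correlation measures outside of JMP, here is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson_r` and the data values are made up for illustration; they are not the NFL data, so the printed value does not answer the question above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for illustration only; NOT the NFL data.
ypg = [280.0, 310.0, 335.0, 360.0, 400.0]
pts_per_game = [15.0, 19.0, 20.0, 24.0, 28.0]

r = pearson_r(ypg, pts_per_game)
print(round(r, 3))
```

A value near 1 or -1 from a calculation like this corresponds to points that lie close to a straight line in the scatter plot.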
Fitting the model
What is the population model? We want to model the mean value of Y using X, so the
model is given by:
$E(Y \mid X) = \beta_0 + \beta_X X$
or being specific for this situation, we have
$E(\mathrm{PTS/G} \mid \mathrm{YPG}) = \beta_0 + \beta_{\mathrm{YPG}} \cdot \mathrm{YPG}$
Note: E(Y|X) is the notation we use to denote the mean value of Y given X.
Taking a closer look at its pieces:
For this data set, Y = PTS/G (points per game) and X = YPG (yards per game).
To fit the model
$E(\mathrm{PTS/G} \mid \mathrm{YPG}) = \beta_0 + \beta_{\mathrm{YPG}} \cdot \mathrm{YPG}$
we first select Analyze > Fit Y by X and place PTS/G in the Y box and YPG in the X
box as shown below. This will give us a scatter plot of Y vs. X, from which we can fit the
model.
The resulting scatter plot is shown below.
To perform the regression of PTS/G on YPG, select Fit Line from the Bivariate Fit...
pull-down menu. The resulting output is shown below.
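For readers without JMP, the fitted line can be reproduced by hand. The sketch below assumes the line is the ordinary least-squares line; the function name `fit_line` and the data values are hypothetical and are not the NFL data.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up values for illustration only; NOT the NFL data.
ypg = [280.0, 310.0, 335.0, 360.0, 400.0]
pts_per_game = [15.0, 19.0, 20.0, 24.0, 28.0]

intercept, slope = fit_line(ypg, pts_per_game)
print(f"estimated E(PTS/G | YPG) = {intercept:.3f} + {slope:.4f} * YPG")
```

The two returned numbers play the role of the intercept and YPG estimates in the Parameter Estimates part of the output.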
We begin by looking at whether or not this regression stuff is even helpful:
Next we can assess the importance of the X variable, YPG in this case:
Conclusions from tests:
$H_0: \beta_{\mathrm{YPG}} = 0$
$H_1: \beta_{\mathrm{YPG}} \neq 0$
Determining how well the model is doing in terms of explaining the response is done
using the R-Square and Root Mean Square Error:
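To make these summaries concrete, the following sketch computes the t ratio for the slope, the R-Square, and the Root Mean Square Error for a least-squares line. It is an illustration under assumptions: the function name `regression_summary` and the data are made up (not the NFL data), and the p-value that JMP reports next to the t ratio is left out because it requires a t distribution table or library.

```python
import math

def regression_summary(x, y):
    """Slope t ratio, R-square, and root mean square error for a least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)  # total variation in y
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))            # n - 2 degrees of freedom
    se_slope = rmse / math.sqrt(sxx)           # standard error of the slope
    return {
        "slope": slope,
        "t_ratio": slope / se_slope,           # tests H0: slope = 0
        "r_square": 1 - sse / syy,             # share of variation explained
        "rmse": rmse,
    }

# Made-up values for illustration only; NOT the NFL data.
summary = regression_summary([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A large t ratio (in absolute value) is evidence against the hypothesis that the slope is 0, an R-Square near 1 means the line explains most of the variation in Y, and the RMSE is in the same units as Y.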
Describing the Relationship:
Interpret each of the parameter estimates:
Checking the Assumptions:
Ideal Residual Plot:
Violations to Assumption #1:
Some existing trend remaining (BAD)
The trend need not be linear (BAD)
Violations to Assumption #2:
Megaphone opening to right (BAD)
Megaphone opening to the left (BAD)
Violations to Assumption #3:
One point closely following another -- positive autocorrelation (BAD)
Extreme bouncing back and forth -- negative autocorrelation (BAD)
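As a numeric companion to these residual plots, one common summary is the Durbin-Watson statistic. This handout does not use it, so treat the sketch below as a swapped-in illustration; it assumes the residuals are listed in a meaningful order (for example, time order), and the residual values are made up.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals for illustration only; NOT from the NFL fit.
smooth = [1.0, 0.8, 0.5, -0.2, -0.7, -1.0]   # each point closely follows the previous one
bouncy = [1.0, -1.0, 0.9, -0.8, 1.1, -1.0]   # extreme bouncing back and forth

print(round(durbin_watson(smooth), 2))
print(round(durbin_watson(bouncy), 2))
```

Values near 2 suggest little autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 (up to 4) suggest negative autocorrelation.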
Violations to Assumption #4:
To check this assumption, simply save the residuals and make a histogram of the
residuals and/or look at a normal quantile plot. Recall, you can easily make a histogram of a
variable under Analyze > Distribution.
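The idea behind a normal quantile plot can be sketched in a few lines: sort the residuals and pair each with the standard normal quantile it would be expected to match. The plotting positions (i - 0.5) / n used here are one common convention and are an assumption of this sketch; JMP may use a different one. The residual values are made up.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with the standard normal quantile at (i - 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i - 0.5) / n), e)
        for i, e in enumerate(ordered, start=1)
    ]

# Made-up residuals for illustration only; NOT from the NFL fit.
residuals = [0.4, -1.2, 0.1, 2.3, -0.6, -0.9, 0.7]

pairs = normal_quantile_pairs(residuals)
for z, e in pairs:
    print(f"{z:6.2f}  {e:5.1f}")
```

If the residuals are roughly normal, plotting the second column against the first gives points that fall close to a straight line.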
Checking for outliers:
Determine the value of 2*RMSE. Any observations whose residuals fall outside the bands at
±2*RMSE are potential outliers and should be investigated further to determine whether or not
they adversely affect the model.
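The 2*RMSE rule can be sketched as follows. The function name `flag_outliers` and the data are hypothetical (the data are made up so that one point sits far from the line); this is an illustration of the rule, not JMP's implementation.

```python
import math

def flag_outliers(x, y):
    """Return (rmse, indices) of points whose residual exceeds 2 * RMSE in absolute value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rmse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    flagged = [i for i, e in enumerate(residuals) if abs(e) > 2 * rmse]
    return rmse, flagged

# Made-up values for illustration only; the fifth y value is deliberately far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 11, 6, 7, 8, 9, 10]

rmse, flagged = flag_outliers(x, y)
print(f"2*RMSE = {2 * rmse:.2f}, flagged indices: {flagged}")
```

Note that a flagged point is only a candidate: an outlier also inflates the RMSE itself, so the flagged observations still need to be examined individually.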
Checking for outliers in this example we find:
THE ASSUMPTION CHECKLIST:
Model Appropriate:
Constant Variance:
Independence:
Normality Assumption (see histogram above):
Identify Outliers:
Making Predictions:
The model allows you to make predictions for observations not necessarily in your data
set. You just need to plug in an arbitrary team’s YPG to get the predicted PTS/G.
Confidence Interval for the Average PTS/G:
(the average over a set of games where we gain a specified number of yards.)
Select Confid Curves Fit from the Linear Fit pull-down menu located below the scatter
plot. The narrow bands in the plot below give a 95% CI for the mean points per game as
a function of yards per game. For example, we estimate that teams that have 300 yards of
total offense for a game will score between 14 and 17 points on average. We will examine a way in
which we can get these limits precisely later on in this tutorial.
Prediction Interval for an Individual PTS/G:
(for a single game where a specified number of yards has been gained)
Select Confid Curves Indiv from the Linear Fit pull-down menu. These are the wider
bands in the plot above.
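Both sets of bands can be computed at a single X value with the standard textbook formulas. The sketch below is an illustration under assumptions: the function name `intervals_at` is hypothetical, the data are made up (not the NFL data), and the t critical value is supplied by the caller from a t table (3.182 is the tabulated 97.5th percentile for 3 degrees of freedom, matching the 5 made-up points).

```python
import math

def intervals_at(x, y, x_new, t_crit):
    """Return (prediction, mean_interval, individual_interval) at x_new."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    y_hat = intercept + slope * x_new
    # Uncertainty in the fitted line at x_new (grows as x_new moves away from mean_x).
    line_part = 1 / n + (x_new - mean_x) ** 2 / sxx
    half_mean = t_crit * rmse * math.sqrt(line_part)
    # A single observation adds its own variation around the line (the extra 1).
    half_indiv = t_crit * rmse * math.sqrt(1 + line_part)
    return (
        y_hat,
        (y_hat - half_mean, y_hat + half_mean),
        (y_hat - half_indiv, y_hat + half_indiv),
    )

# Made-up values for illustration only; NOT the NFL data.
ypg = [280.0, 310.0, 335.0, 360.0, 400.0]
pts_per_game = [15.0, 19.0, 20.0, 24.0, 28.0]

y_hat, mean_ci, indiv_pi = intervals_at(ypg, pts_per_game, x_new=300.0, t_crit=3.182)
print(f"prediction: {y_hat:.2f}")
print(f"95% CI for the mean: ({mean_ci[0]:.2f}, {mean_ci[1]:.2f})")
print(f"95% interval for an individual: ({indiv_pi[0]:.2f}, {indiv_pi[1]:.2f})")
```

Evaluating the function over a grid of `x_new` values traces out the narrow and wide bands around the fitted line.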
Using the Analyze > Fit Model Option to Perform the Regression
An alternative to using Fit Y by X to perform simple linear regression is to use the Fit
Model option from the Analyze menu. The advantages of this approach are two-fold:
1) You have access to more detailed results from your regression and have
enhanced features for estimation/prediction of Y.
2) You can add more predictors (X’s) to your model. This is
called multiple regression and will be discussed in the next tutorial.
For the 2003 NFL Offense example we fit the model as follows:
Select Analyze > Fit Model and place PTS/G in the Y box and YPG in the Model Effects
box.
The output is shown below:
The same numeric summaries, parameter estimates, and test results are contained as part of
the standard output. The plot at the top is NOT a scatter plot of Y vs. X. It is a plot of the
actual Y values vs. the predicted values from the regression model. The stronger the trend
exhibited, the better the fit.
Estimation of E(Y|X), the Mean Value of Y for a Given X, and Prediction of Y for an Individual with a Given X
You can save 95% Confidence Intervals for E(Y|X) to the data spreadsheet by selecting
Mean Confidence Interval.
You can save 95% Prediction Intervals for individual Y values to the data spreadsheet by
selecting Indiv Confidence Interval.
Below is a portion of the data spreadsheet showing both types of intervals.
Interpretation of the 95% Confidence Interval for E(PTS/G|YPG=406.6)
Consider estimating the average/mean number of points NFL football teams would score if they gained
406.6 yards during the course of a game. Using the 95% confidence interval for the mean, we estimate that
on average teams will score between 25 and 32 points given they have 406.6 yards of total offense.
Interpretation of the 95% Prediction Interval for PTS/G|YPG=406.6
Suppose we haven’t looked at the scoreboard, but we know that the Minnesota Vikings have just
accumulated 406.6 yards of total offense in a game against the Green Bay Packers. What do we estimate the
Vikings’ score will be given only this information? We estimate, with 95% confidence, that the
actual score the Vikings will have for the game is between 19.5 and 37 points. This range of scores has a
95% chance of covering the actual score for the Vikings in the game. Notice how much wider this interval
is compared to the interval for the mean score for all teams that gain 406.6 yards in a game. This
should seem natural, as it is much harder to predict a single score than the average score for all teams that
meet a certain offensive performance statistic.
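The extra width can also be seen in the standard textbook formulas for the two intervals. The notation here is assumed rather than taken from the JMP output: $\hat{y}_0$ is the predicted value at $x_0$, $s$ is the Root Mean Square Error, $n$ is the number of observations, and $t^{*}$ is the 97.5th percentile of a t distribution with $n - 2$ degrees of freedom.

```latex
\text{95\% CI for } E(Y \mid X = x_0): \quad
\hat{y}_0 \pm t^{*} \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

\text{95\% PI for an individual } Y \text{ at } X = x_0: \quad
\hat{y}_0 \pm t^{*} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The only difference is the extra 1 under the square root, which represents the variation of a single score around the mean. That term does not shrink as $n$ grows, so the prediction interval is always the wider of the two.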