Multiple Regression Models

Recall that a simple linear regression model is a model of the form
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
where ε is a normal random variable with mean E(ε) = 0 and variance Var(ε) = σ². We obtained
the least squares estimates by choosing b0 and b1 so that the line
𝑦̂ = 𝑏0 + 𝑏1 𝑥
minimized the sum of squared vertical distances to the data points. This was accomplished by
(1) Taking the derivative of

SSE(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2

with respect to b0 and b1 and setting ∂SSE/∂b0 = 0 and ∂SSE/∂b1 = 0.
(2) Solving these two resulting equations for the unknowns b0 and b1. This gave us the
equations
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

b_0 = \bar{y} - b_1 \bar{x}
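These formulas are easy to verify numerically. Here is a minimal Python sketch that computes b0 and b1 directly from them; the data values are hypothetical, not from the course files.

```python
import numpy as np

# Hypothetical data, purely for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.1, 5.2, 6.0, 8.3, 10.1])

# Least squares estimates from the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```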
In theory, any model can be fit (that is, its least squares estimates can be found) by an
analogous procedure. For example, if we had a model
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝛽2 𝑥² + 𝜀    (Quadratic)
we would simply take the derivative of

SSE(b_0, b_1, b_2) = \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i + b_2 x_i^2) \right)^2
with respect to b0, b1, and b2 and set the three resulting partial derivatives equal to zero.
These three equations are then solved for the three unknowns, b0, b1, and b2, to give the
estimated function

𝑦̂ = 𝑏0 + 𝑏1 𝑥 + 𝑏2 𝑥²

of equation (Quadratic). Equation (Quadratic) is an example of a multiple regression model.
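The same fit can be sketched numerically by building a design matrix with columns 1, x, and x² and solving the least squares problem; the data below are made up for illustration.

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.8, 9.3, 16.2, 24.9, 36.1])

# Design matrix [1, x, x^2]; lstsq solves the same normal equations
# produced by setting the three partial derivatives to zero
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"y-hat = {b[0]:.3f} + {b[1]:.3f} x + {b[2]:.3f} x^2")
```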
Another example of a multiple regression model is
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽3 𝑥3 + 𝜀
where x1, x2, and x3 are three independent variables used to describe the
dependent variable y.
How does a multiple regression model compare to a simple regression model?
An analyst typically does not know what the preferred or best regression model is a priori.
Instead, the model is typically identified through careful exploration and assessment.
Unfortunately, the equations for b0 and b1 (estimates of β0 and β1) will not be the same in model
(Quadratic) as they were in the simple linear regression model! This implies that every time we
select a different model, we have to refit it from scratch. Fortunately, Excel provides a
simple way to get the results for each model.
Example: The amount of electrical power (y) consumed by a manufacturing facility during a
shift is thought to depend upon (1) the number of units produced during the shift (x1), and (2) the
shift involved (first or second). The following model is being investigated.
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝜀    (1)
where
𝑥2 = 0 if the units are produced during shift 1, and 𝑥2 = 1 if they are produced during shift 2.
Data was collected to fit the model (see the workbook Multiple Regression.xlsx).
According to the output in Multiple Regression.xlsx, our least squares fit is
𝑦̂ = 7.82 + 1.47𝑥1 + 3.75𝑥2
Note: To run Data/Data Analysis/Regression, it is necessary to have all of the independent
variables (x’s) in contiguous columns.
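For readers working outside of Excel, here is a Python sketch of the same kind of fit; the numbers are hypothetical stand-ins, since the actual data live in Multiple Regression.xlsx. It shows how the 0/1 shift indicator enters the design matrix as an ordinary column.

```python
import numpy as np

# Hypothetical power data: units produced (x1) and shift indicator (x2)
x1 = np.array([20, 25, 30, 35, 22, 27, 32, 37], dtype=float)
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([37, 45, 52, 59, 44, 52, 58, 66], dtype=float)

# Columns: intercept, units produced, shift dummy
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"y-hat = {b[0]:.2f} + {b[1]:.2f} x1 + {b[2]:.2f} x2")
```

Because x2 only takes the values 0 and 1, its coefficient simply shifts the intercept for second-shift observations.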
Exercise: For the electrical power problem, fit the model
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽3 𝑥1² + 𝛽2 𝑥2 + 𝜀    (2)
Evaluating the Model
(1) F-test: The outputs for models (1) and (2) both provided an F-statistic in the ANOVA table
and t-statistics in the table describing the parameter estimates (the bottom table). In the
case of a multiple regression model, the F-test is testing whether any of the independent variable
coefficients in the model are nonzero. For model (1) this becomes:
𝐻0 : 𝛽1 = 𝛽2 = 0
And for model (2):
𝐻0 : 𝛽1 = 𝛽2 = 𝛽3 = 0
versus the alternative that at least one βi ≠ 0, which is to say that at least one of the x terms
belongs in the model. What would we conclude for model (1)? Model (2)?
(2) t-tests: The t-tests are, as before, testing hypotheses about the individual βi:
𝐻0 : 𝛽𝑖 = 0
𝐻𝑎 : 𝛽𝑖 ≠ 0
Consider the term for units produced in model (1). Does it appear that β1 ≠ 0? How about in
model (2)?
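These F- and t-statistics can also be reproduced outside of Excel. The sketch below uses Python's statsmodels package on hypothetical data shaped like the power example (the real data and output are in the workbook):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data in the same form as the power example
x1 = np.array([20, 25, 30, 35, 22, 27, 32, 37], dtype=float)
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([37, 45, 52, 59, 44, 52, 58, 66], dtype=float)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.fvalue, fit.f_pvalue)  # overall F-test of H0: beta1 = beta2 = 0
print(fit.tvalues, fit.pvalues)  # per-coefficient t-tests of H0: beta_i = 0
```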
This is one of the main difficulties with building models using multiple regression. It is possible
(our models are an immediate example) for a variable to appear important or unimportant
depending on what other variables are included in the model. (Consider the model which only
includes units produced, as shown in the sheet Units produced only.)
These independent variables contain information about why the dependent variable y varies from
observation to observation. A problem arises because the independent variables overlap with
regard to this information; in fact, they contain information about each other, referred to as
multicollinearity. The most direct way to see this is through a correlation analysis. Correlation
is a measure of the strength of the linear relationship between two variables. In our model, x1
and x1² happen to be strongly correlated (this is certainly not always the case) because of the
values in the data set; see the sheet entitled Model (3). Multicollinearity confounds our ability
to attribute influence on the dependent variable to the individual independent variables.
Further analysis and reflection may resolve the problem. In our case, the units-produced-squared
term is not really justified unless there is reason to believe that units produced influences
the power consumption in a non-linear way.
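A quick way to see the multicollinearity between x1 and x1² is to compute their correlation directly, as in this small sketch with hypothetical x1 values:

```python
import numpy as np

# Over a narrow range of positive values, x1 and x1^2 are nearly
# linearly related, so their correlation is close to 1
x1 = np.array([20, 25, 30, 35, 22, 27, 32, 37], dtype=float)

print(np.corrcoef(x1, x1 ** 2)[0, 1])  # near 1 => strong multicollinearity
```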
(3) r² and its derivatives: As before, r² is the fraction of variation explained by the independent
variables in the model:

r^2 = \frac{SSR}{SST}

In our case for model (1), r² = 257.71/269.875 = 95.49%, while for model (2),
r² = 261.421/269.875 = 96.87%.
It appears that a little more variation was explained by including the term x1² in our model;
however, one needs to be careful in using this measure to compare models. For example, if you
decided to include another variable in the model, say x3 = sunspot activity during the shift, then
you are guaranteed that your new model will have a higher r², even though x3 probably does not
have an effect on y.
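For reference, r² can be computed from any fitted model's sums of squares; this helper (my own naming, not from the workbook) uses the identity SSR/SST = 1 - SSE/SST:

```python
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """r^2 = SSR/SST, computed as 1 - SSE/SST from fitted values."""
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    sse = np.sum((y - y_hat) ** 2)     # error (residual) sum of squares
    return 1.0 - sse / sst
```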
It is reasonable to compare model (1) with another model that also has two independent
variables. The model with the higher r² would appear to be explaining a greater proportion of the
variability, although other factors concerning the models should also be examined, such as the
residuals.
The measure Multiple R is obtained by taking the square root of r². This measures the correlation
between the dependent variable y and the set of the independent variables, as if they were a
single variable.
The measure Adjusted R Square is obtained from the computation

Adj\ r^2 = 1 - \left[ (1 - r^2) \, \frac{n - 1}{n - k - 1} \right]
where k = the number of independent variables in the model. This measure attempts to correct
for two issues: (1) the number of independent variables in the model, thus allowing us to better
compare this measure among models with different numbers of independent variables, and (2)
predicting the actual amount of variation which will be explained if the model is used for future
data sets. Adjusted R Square is always less than or equal to r², and unlike r², it can be
negative.
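The computation translates directly into code. In this sketch, r² = 0.9549 is model (1)'s value from above, but the sample size n = 12 is a made-up placeholder, since n is not stated in this handout:

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adj r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# r^2 from model (1); n = 12 is hypothetical, k = 2 independent variables
print(adjusted_r_squared(0.9549, n=12, k=2))
```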