BUSINESS ANALYTICS: DATA ANALYSIS AND DECISION MAKING
Chapter 11
Regression Analysis: Statistical Inference

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Introduction

Two basic problems are discussed in this chapter:

• Population regression model
  • Inferring its characteristics—that is, its intercept and slope term(s)—from the corresponding terms estimated by least squares
  • Determining which explanatory variables belong in the equation
  • Inferring whether there is any population regression equation worth pursuing
• Prediction
  • Predicting values of the dependent variable for new observations
  • Calculating prediction intervals to measure the accuracy of the predictions

The Statistical Model (slide 1 of 7)

• To perform statistical inference in a regression context, a statistical model is required—that is, we must first make several assumptions about the population.
• These assumptions represent an idealization of reality and are never likely to be entirely satisfied for the population in any real study.
  • From a practical point of view, all we can ask is that they represent a close approximation to reality.
  • If the assumptions are grossly violated, statistical inferences that are based on these assumptions should be viewed with suspicion.

The Statistical Model (slide 2 of 7)

Regression assumptions:
• There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables. For any fixed values of the explanatory variables, the mean of the errors is zero.
• For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.
• For any values of the explanatory variables, the dependent variable is normally distributed.
• The errors are probabilistically independent.
The Statistical Model (slide 3 of 7)

• The first assumption is probably the most important. It implies that for some set of explanatory variables, there is an exact linear relationship in the population between the means of the dependent variable and the values of the explanatory variables.
• Equation for the population regression line joining means:

  Mean of Y = α + β1X1 + β2X2 + ... + βkXk

  α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)
• Most individual Ys do not lie on the population regression line. The vertical distance from any point to the line is an error.
• Equation for the population regression line with error:

  Y = α + β1X1 + β2X2 + ... + βkXk + ε
The Statistical Model (slide 4 of 7)

• Assumption 2 concerns variation around the population regression line. It states that the variation of the Ys about the regression line is the same, regardless of the values of the Xs.
  • The technical term for this property is homoscedasticity. A simpler term is constant error variance.
• This assumption is often questionable—the variation in Y often increases as X increases.
  • Heteroscedasticity means that the variability of Y values is larger for some X values than for others. A simpler term for this is nonconstant error variance.
  • The easiest way to detect nonconstant error variance is through a visual inspection of a scatterplot, as in the sketch below.

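As a concrete illustration (not part of the original slides, which use StatTools in Excel), here is a minimal Python sketch of that visual check; the simulated fan-shaped data and the variable names are assumptions chosen for the example.

```python
# Sketch: detecting nonconstant error variance with a residuals-vs-fitted plot.
# The data are simulated so that the error spread grows with X (a fan shape).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 2.0 + 3.0 * X + rng.normal(0, 0.5 * X)   # error std dev grows with X

model = sm.OLS(y, sm.add_constant(X)).fit()
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A fan shape suggests nonconstant error variance")
plt.show()
```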
The Statistical Model (slide 5 of 7)

• Assumption 3 is equivalent to stating that the errors are normally distributed.
  • You can check this by forming a histogram (or a Q-Q plot) of the residuals, as in the sketch below.
  • If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.
  • If there is obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.
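A minimal sketch of both checks in Python (again an assumption, since the slides use StatTools); it reuses the fitted `model` from the previous sketch.

```python
# Sketch: normality check for the residuals of a fitted model.
# Assumes `model` is a fitted statsmodels OLS results object.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(model.resid, bins=20)                       # should look bell-shaped
ax1.set_title("Histogram of residuals")
sm.qqplot(model.resid, line="45", fit=True, ax=ax2)  # points near the 45° line
ax2.set_title("Q-Q plot of residuals")
plt.show()
```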
The Statistical Model (slide 6 of 7)

• Assumption 4 requires probabilistic independence of the errors. This assumption means that information on some of the errors provides no information on the values of the other errors.
  • For cross-sectional data, this assumption is usually taken for granted.
  • For time series data, this assumption is often violated, because of a property called autocorrelation.
• The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which assumption 4 is violated. A computational sketch follows.
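A short sketch of the Durbin-Watson calculation, with statsmodels' built-in version for comparison; the simulated independent errors are an assumption.

```python
# Sketch: the Durbin-Watson statistic, DW = Σ(e_t − e_{t−1})² / Σ e_t².
# DW near 2 suggests no lag-1 autocorrelation; DW well below 2 suggests
# positive autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw_statistic(residuals):
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.random.default_rng(1).normal(size=100)   # independent errors
print(dw_statistic(e))       # hand-rolled: should be near 2
print(durbin_watson(e))      # statsmodels' built-in: identical value
```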
The Statistical Model (slide 7 of 7)

• One other assumption is important for numerical calculations: No explanatory variable can be an exact linear combination of any other explanatory variables.
  • The violation occurs if one of the explanatory variables can be written as a weighted sum of several of the others. This is called exact multicollinearity.
  • If it exists, there is redundancy in the data.
• A more common and serious problem is multicollinearity, where explanatory variables are highly, but not exactly, correlated.
Inferences about the Regression Coefficients

• In the equation for the population regression line, α and the βs are called the regression coefficients.
• There is one other unknown constant in the model: the variance of the errors, labeled σ².
• The choice of relevant explanatory variables is almost never obvious. Two guiding principles are relevance and data availability.
• One overriding principle is parsimony—to explain the most with the least. It favors a model with fewer explanatory variables, assuming that this model explains the dependent variable almost as well as a model with additional explanatory variables.
Sampling Distribution of the Regression Coefficients

• The sampling distribution of any estimate derived from sample data is the distribution of this estimate over all possible samples.
• Sampling distribution of a regression coefficient: If the regression assumptions are valid, the standardized value

  t = (b − β) / s_b

  has a t distribution with n − k − 1 degrees of freedom, where n is the sample size and k is the number of explanatory variables.
• This result has three important implications:
  • The estimate b is unbiased in the sense that its mean is β, the true but unknown value of the slope.
  • The estimated standard deviation of b is labeled s_b. It is usually called the standard error of a regression coefficient, or the standard error of b. It measures how much the bs would vary from sample to sample.
  • The shape of the distribution of b is symmetric and bell-shaped.
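These properties can be seen directly by simulation. The following sketch is an illustration, not textbook material; the true coefficients and noise level are assumptions.

```python
# Sketch: the sampling distribution of a slope estimate b.
# True slope is 3, so the b values should center on 3 (unbiasedness)
# with a symmetric, bell-shaped spread.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)               # fixed X values, n = 50
X = sm.add_constant(x)
bs = [sm.OLS(1.0 + 3.0 * x + rng.normal(0, 2, 50), X).fit().params[1]
      for _ in range(2000)]              # 2000 simulated samples
print(np.mean(bs), np.std(bs))           # mean ≈ 3; std ≈ standard error of b
```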
Example 11.1: Overhead Costs.xlsx (slide 1 of 2)

• Objective: To use standard regression output to make inferences about the regression coefficients of machine hours and production runs in the equation for overhead costs.
• Solution: The dependent variable is Overhead and the explanatory variables are Machine Hours and Production Runs. The output from StatTools's Regression procedure is shown below.
  • The estimates of the regression coefficients appear under the label Coefficient. The column labeled Standard Error shows the s_b values.
  • Each b represents a point estimate of the corresponding β. The corresponding s_b indicates the accuracy of this point estimate.
Example 11.1: Overhead Costs.xlsx (slide 2 of 2)

• The sample data can be used to obtain a confidence interval for a regression coefficient.
• A confidence interval for any β is of the form:

  b ± t-multiple × s_b

  where the t-multiple depends on the confidence level and the degrees of freedom (n − k − 1).
• StatTools always provides these 95% confidence intervals for the regression coefficients automatically, as shown at the bottom right of the figure on the previous slide. A sketch of the calculation follows.
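A sketch of the interval computation; the coefficient, standard error, and sample size below are placeholders, not the textbook's actual Bendrix output.

```python
# Sketch: 95% confidence interval for a slope, b ± t-multiple × s_b.
from scipy import stats

b, s_b = 43.5, 3.4                    # hypothetical coefficient and std error
n, k = 36, 2                          # hypothetical sample size, 2 X variables
t_mult = stats.t.ppf(0.975, df=n - k - 1)
print(b - t_mult * s_b, b + t_mult * s_b)
```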
Hypothesis Tests for the Regression Coefficients and p-Values

• There is another important piece of information in regression outputs: the t-values for the individual regression coefficients.
  • Each t-value is the ratio of the estimated coefficient to its standard error, t = b / s_b. It indicates how many standard errors the regression coefficient is from zero.
• A t-value can be used in a hypothesis test for a regression coefficient.
  • If a variable's coefficient is zero, there is no point in including this variable in the equation.
  • To run this test, simply compare the t-value in the regression output with a tabulated t-value, and reject the null hypothesis only if the t-value from the computer output is greater in magnitude than the tabulated t-value. Equivalently, reject if the corresponding p-value is below the chosen significance level, as in the sketch below.
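The p-value reported next to each t-value is the two-tailed tail probability from the t distribution, as in this sketch (placeholder numbers again):

```python
# Sketch: two-tailed p-value for H0: coefficient = 0.
from scipy import stats

t_value = 43.5 / 3.4                  # coefficient / standard error
df = 36 - 2 - 1                       # n − k − 1
p_value = 2 * stats.t.sf(abs(t_value), df)
print(t_value, p_value)               # a small p-value favors keeping the variable
```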
A Test for the Overall Fit: The ANOVA Table (slide 1 of 3)

• It is conceivable that none of the explanatory variables in the regression equation explains the dependent variable.
  • An indication of this problem is a very small R² value.
  • An equation has no explanatory power if the same value of Y will be predicted regardless of the values of the Xs.
• The null hypothesis is that all coefficients of the explanatory variables are zero.
• The alternative is that at least one of these coefficients is not zero.
A Test for the Overall Fit: The ANOVA Table (slide 2 of 3)

• To test the null hypothesis, use an F test, a formal procedure for testing whether the explained variation is large compared to the unexplained variation.
• This is also called the ANOVA (analysis of variance) test because the elements for calculating the required F-value are shown in an ANOVA table for regression.
• The ANOVA table splits the total variation of the Y variable (SST):

  SST = Σ(Yi − Ȳ)²

  into the part unexplained by the regression equation (SSE):

  SSE = Σ(Yi − Ŷi)²

  and the part that is explained (SSR):

  SSR = Σ(Ŷi − Ȳ)²

  so that SST = SSE + SSR.
A Test for the Overall Fit: The ANOVA Table (slide 3 of 3)

• The required F-ratio for the test is:

  F = MSR / MSE

  where MSR = SSR / k and MSE = SSE / (n − k − 1).
• If the F-ratio is small, the explained variation is small relative to the unexplained variation, and there is evidence that the regression equation provides little explanatory value.
• The F-ratio has an associated p-value that allows you to run the test easily; it is reported in most regression outputs (see the sketch below).
  • Reject the null hypothesis—and conclude that the X variables have at least some explanatory value—if the F-value in the ANOVA table is large and the corresponding p-value is small.
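A self-contained sketch of the calculation from actual and fitted values; the function name and arguments are my own.

```python
# Sketch: the ANOVA F test for overall fit, built from SSR, SSE, and SST.
import numpy as np
from scipy import stats

def overall_f_test(y, yhat, k):
    """F-ratio and p-value; y = actual values, yhat = fitted, k = # of Xs."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    n = len(y)
    sse = np.sum((y - yhat) ** 2)           # unexplained variation
    ssr = np.sum((yhat - y.mean()) ** 2)    # explained variation
    f = (ssr / k) / (sse / (n - k - 1))     # MSR / MSE
    return f, stats.f.sf(f, k, n - k - 1)
```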
Multicollinearity

• Multicollinearity occurs when there is a fairly strong linear relationship among a set of explanatory variables.
• In this case, the relationship between an explanatory variable X and the dependent variable Y is not always accurately reflected in the coefficient of X; it depends on which other Xs are included or not included in the equation.
  • There are various degrees of multicollinearity, but in each of them, there is a linear relationship between two or more explanatory variables.
  • The symptoms of multicollinearity can be "wrong" signs of the coefficients, smaller-than-expected t-values, and larger-than-expected (insignificant) p-values.
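One standard numeric check, not shown on the slide, is the variance inflation factor (VIF); this sketch assumes the explanatory variables sit in a pandas DataFrame X.

```python
# Sketch: variance inflation factors as a multicollinearity diagnostic.
# A common rule of thumb treats VIFs above roughly 5–10 as a warning sign.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    Xc = sm.add_constant(X)                 # VIFs are computed with an intercept
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns,
    )
```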
Example 11.2: Heights Simulation.xlsx (slide 1 of 2)

• Objective: To illustrate the problem of multicollinearity when both foot length variables are used in a regression for height.
• Solution: The dependent variable is Height, and the explanatory variables are Right and Left, the lengths of the right foot and the left foot, respectively.
  • Simulation is used to generate a hypothetical data set of heights and left and right foot lengths.
  • Height is approximately 31.8 plus 3.2 times foot length (all expressed in inches). A sketch of such a simulation follows.
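A sketch of the simulation in Python; the 31.8 intercept and 3.2 slope come from the slide, while the sample size and noise levels are assumptions.

```python
# Sketch: highly collinear Right and Left foot lengths in a regression for Height.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
foot = rng.normal(11, 1, n)              # underlying foot length, in inches
right = foot + rng.normal(0, 0.1, n)     # two nearly identical measurements,
left = foot + rng.normal(0, 0.1, n)      # so Right and Left are highly correlated
height = 31.8 + 3.2 * foot + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([right, left]))
print(sm.OLS(height, X).fit().summary())   # x1 = Right, x2 = Left
# Expect individually insignificant (possibly wrong-signed) coefficients on
# Right and Left, despite a highly significant overall F test.
```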
Example 11.2: Heights Simulation.xlsx (slide 2 of 2)

• The regression output when both Right and Left are entered in the equation for Height appears at the bottom right of the figure below.
Include/Exclude Decisions

• The t-values of regression coefficients can be used to make include/exclude decisions for explanatory variables in a regression equation.
• Finding the best Xs to include in a regression equation is the most difficult part of any real regression analysis.
  • You are always trying to get the best fit possible, but the principle of parsimony suggests using as few variables as possible.
  • This presents a trade-off, for which there are not always easy answers.
  • To help with this decision, several guidelines are presented on the next slide.
Guidelines for Including/Excluding Variables in a Regression Equation

1. Look at a variable's t-value and its associated p-value. If the p-value is above some accepted significance level, such as 0.05, this variable is a candidate for exclusion.
2. Check whether a variable's t-value is less than 1 or greater than 1 in magnitude. If it is less than 1, then it is a mathematical fact that se will decrease (and adjusted R² will increase) if this variable is excluded from the equation.
3. Look at t-values and p-values, rather than correlations, when making include/exclude decisions. An explanatory variable can have a fairly high correlation with the dependent variable, but because of other variables included in the equation, it might not be needed.
4. When there is a group of variables that are in some sense logically related, it is sometimes a good idea to include all of them or exclude all of them.
5. Use economic and/or physical theory to decide whether to include or exclude variables, and put less reliance on t-values and/or p-values.
Example 11.3: Catalog Marketing.xlsx (slide 1 of 2)

• Objective: To see which potential explanatory variables are useful for explaining current-year spending amounts at HyTex with multiple regression.
• Solution: The data file contains data on 1000 customers who purchased mail-order products from the HyTex Company. For each customer, data on several variables are included.
• Base the regression on the first 750 observations and use the other 250 for validation.
• Enter all of the potential explanatory variables. Then exclude unnecessary variables based on their t-values and p-values.
  • Four variables—Age, Gender, Own Home, and Married—have p-values well above 0.05 and are obvious candidates for exclusion.
  • Exclude variables one at a time, starting with the variable that has the highest p-value, and rerun the regression after each exclusion (see the sketch below).
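A sketch of that one-at-a-time exclusion loop; the helper is my own, assuming y is a pandas Series and X a DataFrame of candidate explanatory variables.

```python
# Sketch: backward elimination by highest p-value, refitting after each drop.
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Drop the highest-p-value variable and refit, until all p-values <= alpha."""
    X = X.copy()
    while True:
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = fit.pvalues.drop("const")
        if pvals.empty or pvals.max() <= alpha:
            return fit
        X = X.drop(columns=pvals.idxmax())   # exclude the worst variable
```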
Example 11.3: Catalog Marketing.xlsx (slide 2 of 2)

• The resulting output appears below.
Stepwise Regression

• Many statistical packages provide some assistance with include/exclude decisions by including automatic equation-building options.
  • These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules.
  • Generically, these methods are referred to as stepwise regression.
• There are three types of equation-building procedures:
  • Forward—begins with no explanatory variables in the equation and successively adds one at a time until no remaining variables make a significant contribution.
  • Backward—begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good.
  • Stepwise—is much like a forward procedure, except that it also considers possible deletions along the way.
• All of these procedures have the same basic objective—to find an equation with a small se and a large R² (or adjusted R²). A sketch of a forward procedure follows.
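A sketch of a forward procedure, as a simplified stand-in for a package's built-in option; the helper and its p-value entry rule are assumptions.

```python
# Sketch: forward selection. At each step, add the candidate with the smallest
# p-value when entered; stop when no remaining variable is significant.
import pandas as pd
import statsmodels.api as sm

def forward_select(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    chosen, remaining = [], list(X.columns)
    while remaining:
        def entry_pvalue(v):
            return sm.OLS(y, sm.add_constant(X[chosen + [v]])).fit().pvalues[v]
        best = min(remaining, key=entry_pvalue)
        if entry_pvalue(best) > alpha:
            break                          # no remaining variable contributes
        chosen.append(best)
        remaining.remove(best)
    return sm.OLS(y, sm.add_constant(X[chosen])).fit()
```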
Example 11.3 (Continued): Catalog Marketing.xlsx

• Objective: To use StatTools's Stepwise Regression procedure to analyze the HyTex data.
• Solution: Choose either the forward, backward, or stepwise procedure from the Regression Type dropdown list in the Regression dialog box.
• Specify Amount Spent as the dependent variable and select all of the other variables (besides Customer) as potential explanatory variables.
• A sample of the stepwise output appears to the right. The variables that enter or exit the equation are listed at the bottom of the output.
Outliers (slide 1 of 2)

• An observation can be considered an outlier for one or more of the following reasons:
  • It has an extreme value for at least one variable.
  • Its value of the dependent variable is much larger or smaller than predicted by the regression line, and its residual is abnormally large in magnitude. An example of this type of outlier is shown below.
Outliers (slide 2 of 2)

  • Its residual is not only large in magnitude, but the point "tilts" the regression line toward it. This type of outlier is called an influential point. An example of this type of outlier is shown below, on the left.
  • Its values of individual explanatory variables are not extreme, but they fall outside the general pattern of the other observations. An example of this type of outlier is shown below, on the right.
• In most cases, the regression output will look "nicer" if you delete the outliers, but this is not necessarily appropriate.
Example 11.4: Bank Salaries.xlsx (slide 1 of 2)

• Objective: To locate possible outliers in the bank salary data, and to see to what extent they affect the regression model.
• Solution: Examine each variable for outliers, using box plots of the variables or scatterplots of the residuals versus the fitted values.
Example 11.4: Bank Salaries.xlsx (slide 2 of 2)

• Then run the regression with and without the outlier.
• The output with the outlier included is shown on the top right; the output with the outlier excluded is shown on the bottom right.
Violations of Regression Assumptions

There are three major issues related to violations of regression assumptions:
• How to detect violations of the assumptions
  • This is usually relatively easy, using scatterplots, histograms, time series graphs, and numerical measures.
• What goes wrong if the violations are ignored
  • This depends on the type of violation and its severity.
• What to do about violations if they are detected
  • This issue is the most difficult to resolve.
Nonconstant Error Variance

• The second regression assumption—that the variance of the errors should be constant for all values of the explanatory variables—is almost always violated to some extent.
• Mild violations do not have much effect on the validity of the regression output.
• One common form of nonconstant error variance that should be dealt with is the fan-shape phenomenon.
  • It occurs when increases in a variable result in increases in variability.
  • It can cause an incorrect value for the standard error of estimate, so that confidence intervals and hypothesis tests for the regression coefficients are not valid.
• There are two ways to deal with it (see the sketch below):
  • Use a different estimation method than least squares, called weighted least squares.
  • Use a logarithmic transformation of the dependent variable.
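A sketch of both remedies on simulated fan-shaped data; the weight choice 1/x² assumes the error standard deviation is proportional to x, and the simulated model is an assumption.

```python
# Sketch: two remedies for fan-shaped residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, 200))   # multiplicative errors
X = sm.add_constant(x)

wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()   # remedy 1: weighted least squares
log_fit = sm.OLS(np.log(y), X).fit()               # remedy 2: model log(y) instead
print(wls_fit.params, log_fit.params)
```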
Nonnormality of Residuals

• The third regression assumption states that the error terms are normally distributed.
• Check this assumption by forming a histogram of the residuals.
  • Unless the distribution of the residuals is severely nonnormal, the inferences made from the regression output are still approximately valid.
• One form of nonnormality often encountered is skewness to the right.
  • This can often be remedied by the same logarithmic transformation of the dependent variable that remedies nonconstant error variance.
Autocorrelated Residuals

• The fourth regression assumption states that the error terms are probabilistically independent, but this assumption is often violated for time series data.
  • The problem with time series data is that the residuals are often correlated with nearby residuals, a property called autocorrelation of residuals.
  • The most frequent type of autocorrelation is positive autocorrelation.
  • If residuals separated by one time period are correlated, it is called lag 1 autocorrelation.
• The Durbin-Watson (DW) statistic is a numerical measure used to check for lag 1 autocorrelation.
  • A DW statistic below 2 signals that nearby residuals are positively correlated with one another.
  • When the number of observations is about 30 and the number of explanatory variables is fairly small, any DW statistic less than 1.2 warrants attention.
Example 11.1 (Continued): Overhead Costs.xlsx

• Objective: To use the Durbin-Watson statistic to check whether there is any lag 1 autocorrelation in the residuals from the Bendrix regression model for overhead costs.
• Solution: Run the usual multiple regression and check the graph of residuals versus fitted values.
• Check for lag 1 autocorrelation in two ways: with the DW statistic and by examining the time series graph of the residuals.
Prediction (slide 1 of 4)

• Once you have estimated a regression equation from a set of data, you might want to use it to predict the value of the dependent variable for new observations.
• There are two types of prediction problems in regression:
  1. Predicting the value of the dependent variable for one or more individual members of the population
  2. Predicting the mean of the dependent variable for all members of the population with certain values of the explanatory variables
• The second problem is inherently easier in the sense that the resulting prediction is bound to be more accurate.
  • When you predict a mean, there is a single source of error: the possibly inaccurate estimates of the regression coefficients.
  • When you predict an individual value, there are two sources of error: the inaccurate estimates of the regression coefficients and the inherent variation of individual points around the regression line.
Prediction (slide 2 of 4)

• Predictions for values of the Xs close to their means are likely to be more accurate than predictions for Xs far from their means.
• Trying to predict for Xs beyond the range of the data set is called extrapolation, and it is quite risky.
Prediction (slide 3 of 4)

• The point prediction, or best guess, is found by substituting the given values of the Xs into the estimated regression equation.
• To measure the accuracy of the point predictions, calculate standard errors of prediction.
  • Standard error of prediction for a single Y: this error is approximately equal to the standard error of estimate, se.
  • Standard error of prediction for the mean Y: this error is approximately equal to the standard error of estimate divided by the square root of the sample size, se/√n.
Prediction (slide 4 of 4)

• These standard errors can be used to calculate a 95% prediction interval for an individual value and a 95% confidence interval for a mean value.
  • Go out a t-multiple of the relevant standard error on either side of the point prediction.
  • The term prediction interval (rather than confidence interval) is used for an individual value because an individual value of Y is not a population parameter.
  • However, the interpretation is basically the same.
Example 11.1 (Continued): Overhead Costs.xlsx

• Objective: To predict Overhead at Bendrix for the next three months, given anticipated values of Machine Hours and Production Runs.
• Solution: Suppose Bendrix expects the values of Machine Hours and Production Runs for the next three months to be 1430, 1560, 1520 and 35, 45, 40, respectively.
• StatTools has the capability to provide predictions and 95% prediction intervals, but you must set up a second data set to capture the results.
  • It should have the same variable name headings, and it should include the values of the explanatory variables to be used for prediction. A sketch of the same calculation in Python follows.
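For comparison, here is how the same predictions could be produced in Python; it assumes the Bendrix data have been loaded into a pandas DataFrame df, with the spaces removed from the column names (an assumption about the file's headings).

```python
# Sketch: point predictions and 95% prediction intervals for the three
# anticipated months, using statsmodels rather than StatTools.
import pandas as pd
import statsmodels.formula.api as smf

fit = smf.ols("Overhead ~ MachineHours + ProductionRuns", data=df).fit()

new = pd.DataFrame({"MachineHours": [1430, 1560, 1520],
                    "ProductionRuns": [35, 45, 40]})
pred = fit.get_prediction(new)
print(pred.summary_frame(alpha=0.05))   # obs_ci_* columns = 95% prediction interval
```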