
W10,11,12 Module 8 Simple and Multiple Linear Regression 2

Chapters 4, 16 & 17
Module 8: Simple and Multiple Linear Regression
Week10 (Jul 17-23)
Correlation and Regression Analysis
• Correlation
• Simple Linear Regression
• Multiple Linear Regression
4.1
Week10 (Jul 17-23), Week11 (Jul 24-30), Week12 (Jul 31 – Aug 6)
Correlation Analysis
4.2
▪ Correlation is a statistical term indicating a relationship between two
variables. For example, the temperature is correlated with the number of
cars that will not start in the morning. As the temperature decreases, the
number of cars that will not start in the morning increases.
▪ The sample correlation coefficient, denoted r, is a measure of the
strength of a linear relationship between two quantitative variables x and
y.
➢ If large values of x are associated with large values of y, or if as x
increases, the corresponding value of y tends to increase, then x and y
are positively related.
➢ If small values of x are associated with large values of y, or if as x
increases, the corresponding value of y tends to decrease, then x and y
are negatively related.
Correlation Analysis
4.3
▪ Definition: Suppose there are n pairs of observations (x1, y1), (x2, y2), . . ., (xn, yn). The sample Pearson correlation coefficient for these n pairs is defined as the covariance divided by the product of the standard deviations of the variables:

$$r = \frac{S_{xy}}{\sqrt{S_x^2\,S_y^2}} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2\,\sum (y_i-\bar{y})^2}} = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]\left[\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right]}}$$

This coefficient answers the question: what is the direction, and how strong is the association, between X and Y? (The corresponding population parameter is denoted by the Greek letter ρ, "rho".)
Correlation Analysis
▪ The value of r does not depend on
the order of the variables and is
independent of units.
▪ −1 ≤ r ≤ +1.
r is exactly +1 if and only if all the
ordered pairs lie on a straight line
with positive slope.
r is exactly −1 if and only if all the
ordered pairs lie on a straight line
with negative slope.
▪ If r is near 0, there is no evidence
of a linear relationship, but x and y
may be related in another way.
▪ Correlation between two variables
does not imply causation.
4.4
Coefficient of Correlation
4.5
Values of r (or the population ρ):
+1: strong positive linear relationship
0: no linear relationship
−1: strong negative linear relationship
General Guidelines
4.6
The sign of r gives the direction of the relationship; its magnitude gives the strength:
−1.0 to −0.7: strong negative
−0.7 to −0.3: moderate negative
−0.3 to −0.1: weak negative
−0.1 to +0.1: none (no linear relationship)
+0.1 to +0.3: weak positive
+0.3 to +0.7: moderate positive
+0.7 to +1.0: strong positive

Example: r = +0.85. Direction: positive correlation. Strength: a strong relationship.
Example 1: Income and Credit Score
Although income is not used in calculating a credit score, some
evidence suggests that income is related to credit score. A random
sample of adult consumers was obtained, and their credit score and
yearly income level (in hundreds of thousands of dollars) were
recorded. The data are given in the table. Calculate the sample
correlation coefficient between credit score and yearly income; and
interpret this value.
4.7
Using r = S_xy / √(S_x² S_y²), the sample correlation coefficient is r = 0.2794. Because r = 0.2794 < 0.3, there is a weak positive linear relationship between income level and credit score (as a person's income level increases, the credit score tends to increase).
Example 2: Work Experience and Hourly Wage
4.8
A researcher wants to explore the relationship between work experience and
the hourly wage received by selected workers. The researcher obtained
sampled data which provide information on hourly wage and work
experience of the selected workers. The researcher wants to know whether
work experience is related to how much a worker is paid hourly. The
researcher decides to conduct a correlation analysis to explore the direction
and strength of the relationship between work experience and hourly wage.
Years of Experience (X)   Hourly Wage (Y)
3                         17
4                         18
1                         15
5                         22
2                         19
4                         20
7                         23
Example 2: Work Experience and Hourly Wage
4.9

X   Y    X²   Y²    XY
3   17   9    289   51
4   18   16   324   72
1   15   1    225   15
5   22   25   484   110
2   19   4    361   38
4   20   16   400   80
7   23   49   529   161
Total (∑):  ∑X = 26, ∑Y = 134, ∑X² = 120, ∑Y² = 2612, ∑XY = 527

$$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 527 - \frac{(26)(134)}{7} \approx 29.29$$

$$S_x^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 120 - \frac{(26)^2}{7} \approx 23.43, \qquad S_y^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 2612 - \frac{(134)^2}{7} \approx 46.86$$

$$r = \frac{S_{xy}}{\sqrt{S_x^2 S_y^2}} = \frac{29.29}{\sqrt{(23.43)(46.86)}} \approx 0.88$$

There is a strong positive relationship between work experience and hourly wage: a higher wage is strongly associated with longer work experience (a worker with longer work experience is expected to receive a higher wage). A Python cross-check follows below.
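To check these hand computations, here is a minimal Python sketch (an addition, not part of the original slides) that applies the shortcut formulas above to the Example 2 data and compares the result with numpy's built-in correlation:

```python
import numpy as np

# Example 2 data: years of experience (x) and hourly wage (y)
x = np.array([3, 4, 1, 5, 2, 4, 7], dtype=float)
y = np.array([17, 18, 15, 22, 19, 20, 23], dtype=float)

n = len(x)
s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # 527 - (26)(134)/7 ≈ 29.29
s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # 120 - 26^2/7     ≈ 23.43
s_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n         # 2612 - 134^2/7   ≈ 46.86

r = s_xy / np.sqrt(s_xx * s_yy)
print(round(r, 4))                         # 0.8839, as in Excel's CORREL
print(round(np.corrcoef(x, y)[0, 1], 4))   # same value from numpy directly
```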
Correlation Using Excel
4.10
Data > Data Analysis > Correlation > ok > select the 2 columns of
variables in the input range > check off the box of Labels in First Row > ok
Correlation output:
            Experience   Wage
Experience  1
Wage        0.883883     1

=CORREL(array1,array2)

Drawing the scatter plot in Excel: Select the 2 vars. > Insert > Insert Scatter
Regression Analysis
16.11
▪ If we are interested only in determining whether a relationship between 2 quantitative variables exists, we employ correlation analysis.
▪ To determine which variable is the dependent and which one is the independent, we use regression analysis. It is used to predict the value of the dependent variable based on other independent variable(s).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, …, Xk
▪ The linear equation that relates the dependent and independent variables is called the regression model.
▪ Deterministic Model: an equation or set of equations that allow us to fully determine the value of the dependent variable from the values of the independent variables. Deterministic models are usually unrealistic. E.g., is it reasonable to believe that we can determine the selling price of a house solely based on its size?
▪ Probabilistic Model: a method used to capture the randomness that is part of a real-life process. E.g., do all houses of the same size (measured in square feet) sell for the same price?
A Model
16.12
▪ To construct a probabilistic model, we start with a deterministic model that approximates the relationship we want to model and add a random term that measures the error of the deterministic component.
E.g., the cost of building a new house is about $100 per square foot and most lots sell for about $100,000. Hence the approximate selling price (y) would be (Deterministic Model):

y = $100,000 + ($100/ft²)(x)

where x is the size of the house (independent var.) in square feet. In this model, the price of the house is completely determined by the size. (Figure: house price vs. house size; the line starts at $100,000, since most lots sell for $100,000.)
A Model
16.13
In real life, however, the house cost will vary even among houses of the same size. We now represent the price of a house as a function of its size in this Probabilistic Model:

Price = 100,000 + 100(Size) + ɛ

where ɛ (Greek letter epsilon) is the random term (a.k.a. error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the square footage (i.e., x) remains the same. (Figure: houses with the same square footage sell at different price points, e.g., décor options, lot location, illustrating lower vs. higher variability.)
Simple Linear Regression Model
16.14
A straight-line model with one independent variable is called a first order linear model or a simple linear regression model. It is written as:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where y is the dependent variable, x is the independent variable, β0 is the y-intercept (constant), β1 is the slope of the line (regression coefficient), and ɛ is the error variable.

Note that both β0 and β1 are population parameters which are usually unknown and hence estimated from the data.

x = independent = predictor = explanatory = covariate = input = effect
y = dependent = response = output
Simple Linear Regression Model
16.15
The least squares estimates of the y-intercept β0 and the slope β1 of the true regression model are:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_x^2} = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}, \qquad \hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$

The estimated regression model/line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where β̂0 is the y-intercept and β̂1 is the slope (= rise/run).
The Least Squares Method
4.16
The line of best fit, or estimated regression line, is obtained using the principle of least squares: minimize the sum of the squared deviations (errors), i.e., the vertical distances from the observed points to the regression line. In other words, the principle of least squares produces an estimated regression line such that the sum of all squared vertical distances is a minimum. This line is represented by the equation ŷ = β̂0 + β̂1x, where ŷ ("y hat") is the value of y determined by the line. The least squares line minimizes

$$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$
Interpretation of the Slope, Intercept, and R²
4.17
▪ The intercept is the estimated average value of y when the value of x is zero.
▪ The slope is the estimated change (increase or decrease) in the average value of y as a result of a one-unit increase in x.

Coefficient of Determination, denoted R²:
▪ The coefficient of determination, which is calculated by squaring the coefficient of correlation (R² = r²), measures the amount of variation in the dependent variable that is explained by the variation in the independent variable. It answers: what percentage of the variation in the y-variable can be explained by the variation in the x-variable?
▪ In general, the higher the value of R², the better the model fits the data. There is no cutoff value for R² that indicates that we have a good model.
▪ 0 ≤ R² ≤ 1
Example 16.1:
16.18
The annual bonuses ($1,000s) of six employees with different years of experience were recorded as follows. We wish to determine the straight-line relationship between annual bonus and years of experience.

Years of experience (x):  1   2   3   4   5   6
Annual bonus (y):         6   1   9   5  17  12

To apply the shortcut formula, we compute four summations, then the covariance and the variance of x, using a calculator:

∑x = 21, ∑y = 50, ∑xy = 212, ∑x² = 91
S_xy = 212 − (21)(50)/6 = 37,  S_x² = 91 − (21)²/6 = 17.5
β̂1 = S_xy/S_x² = 37/17.5 ≈ 2.114,  β̂0 = ȳ − β̂1x̄ = 50/6 − 2.114(3.5) ≈ 0.934

Example 16.1:
16.19
Thus, the least squares line is ŷ = 0.934 + 2.114x.
What is the predicted annual bonus of an employee who has 5 years of experience?

ŷ = 0.934 + 2.114(5) = 11.504

The predicted annual bonus is 11.504 × ($1,000) = $11,504.
Note: The regression equation should not be used to make predictions for x values that fall outside the range of values covered by the original data. The differences between the observed y values and the values predicted by the line are called residuals.
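As a cross-check on Example 16.1, the sketch below (an addition, not from the slides) computes the least squares estimates from the same six data pairs; small differences from the slide's 0.934 and 11.504 come from the slide rounding b1 first:

```python
import numpy as np

# Example 16.1 data
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # years of experience
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)  # annual bonus ($1,000s)

n = len(x)
s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # 212 - (21)(50)/6 = 37.0
s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # 91 - 21^2/6 = 17.5

b1 = s_xy / s_xx                # slope ≈ 2.114
b0 = y.mean() - b1 * x.mean()   # intercept ≈ 0.933 (slide: 0.934 after rounding b1)

print(round(b1, 3), round(b0, 3))
print(round(b0 + b1 * 5, 3))    # ≈ 11.505, i.e. about an $11,504 bonus
```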
Example: House Price and its Size
4.20
A real estate agent wishes to examine the relationship between the selling price of a home (in $1,000s) and its size (in square feet). A random sample of 10 houses is selected.

House Price (y)   House Size (x)
245               1400
312               1600
279               1700
308               1875
199               1100
219               1550
405               2350
324               2450
319               1425
255               1700

The fitted regression equation is Price = 98.24833 + 0.10977 Size, with R² = 58.08%: 58.08% of the variation in house prices is explained by the variation in house size.
Example: House Price and its Size
Interpreting the Intercept and the Slope: Price = 98.24833 + 0.10977 Size
4.21
• The intercept, b0 = 98.24833. One interpretation would be that when x = 0 (a house with 0 square feet), the house price is 98.24833 × ($1,000) = $98,248.33. However, in this case the intercept is probably meaningless: because our sample did not include any house with zero square feet, we have no basis for interpreting b0.
• The slope, b1 = 0.10977, tells us that the average value of a house increases by 0.10977 × ($1,000) = $109.77 for each additional square foot of house size.

Predict the price for a house with 2,000 square feet:
Price = 98.24833 + 0.10977(2000) = 317.78833
The predicted price for a house with 2,000 square feet is 317.78833 × ($1,000) = $317,788.33.
The forecast would not be reliable if the house size is an outlier, such as 4,000 sq ft.
Example: House Price and its Size
4.22
How much is the price expected to change if the house size increases by 1,400 square feet?
Price = 98.24833 + 0.10977 Size
Expected change = 0.10977(1400) = 153.678, i.e., 153.678 × ($1,000) = $153,678.
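The whole example can be reproduced with scipy's linregress (an illustrative addition; the slides use Excel's Regression tool instead):

```python
from scipy.stats import linregress

size  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # sq ft
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # $1,000s

fit = linregress(size, price)
print(round(fit.intercept, 5))    # 98.24833
print(round(fit.slope, 5))        # 0.10977
print(round(fit.rvalue ** 2, 4))  # 0.5808 -> 58.08% of variation explained

# Point prediction for a 2,000-square-foot house
print(round(fit.intercept + fit.slope * 2000, 3))
# ≈ 317.784; the slide's 317.78833 uses the rounded slope 0.10977
```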
Regression Using Excel
4.23
Data > Data Analysis > Regression
Selecting "Line Fit Plots" on the Regression dialog box will produce a scatter plot and the regression line.
Regression Using Excel
4.24
To show the regression/trend line, R², and the regression equation on the chart:
Right click on any point in the scatter plot > Add Trendline > select Linear, then scroll down to check the 2 boxes:
Display Equation on chart
Display R-squared value on chart
Then close the pane on the right. You can then drag the equation and R² to any place in the chart where they show clearly.
Example 16.2
Week11 (Jul 24-30)
16.25
Car dealers use the "Blue Book" to determine the value of used cars that their customers trade in when purchasing new cars. The book, which is published monthly, lists the trade-in values for all basic models of cars. It provides alternative values for each car model according to its condition and optional features. The values are determined on the basis of the average paid at recent used-car auctions, the source of supply for many used-car dealers.

However, the Blue Book does not indicate the value determined by the odometer reading, despite the fact that a critical factor for used-car buyers is how far the car has been driven.

To examine this issue, a used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction during the past month. The dealer recorded the price ($1,000s) and the number of miles (thousands) on the odometer. The dealer wants to find the regression line/model.

Data file: Xm16-02. Part of the dataset:

Price   Odometer
14.6    37.4
14.1    44.8
14.0    45.8
15.6    30.9
15.6    31.7
14.7    34.0
14.5    45.9
15.7    19.1
15.1    40.1
14.8    40.2
15.2    32.4
Example 16.2
COMPUTE
16.26
(Excel regression output; the estimated line is ŷ = 17.250 − 0.0669x.)

Example 16.2 – Using Excel
COMPUTE
16.27
From the output, the standard error of estimate is sɛ = 0.3266: the predicted car price will typically differ from the actual price by 0.3266 × $1,000 = $326.60.
Example 16.2 – Using Excel
4.28
Selecting "Line Fit Plots" on the Regression dialog box will produce a scatter plot of the data and the regression line.
Example 16.2
INTERPRET
16.29
The slope coefficient, b1, is –0.0669; that is, for each additional mile on the odometer, the price decreases on average by $0.0669, or 6.69¢. Equivalently, for each additional 1,000 miles on the odometer, the price decreases on average by $0.0669(1000) = $66.90.

The intercept is b0 = 17.250. One interpretation would be that when x = 0 (no miles on the car, i.e., the car was not driven at all), the selling price is $17,250. However, in this case the intercept is probably meaningless: because our sample did not include any cars with zero miles on the odometer, we have no basis for interpreting b0.

As a general rule, we cannot determine the value of ŷ for a value of x that is far outside (an outlier to) the range of the sample values of x.

Coefficient of Determination
R² = 0.6483. This means 64.83% of the variation in the auction selling prices (y) is explained by the variation in the odometer readings (x). The remaining 35.17% is unexplained, i.e., due to error.
Example 16.2
16.30
To predict the selling price of a car with 40 (thousand) miles on it:
ŷ = 17.250 – 0.0669x = 17.250 – 0.0669(40) = 14.574

We call this value (14.574 × 1,000 = $14,574) a point prediction. The chance of finding a different actual selling price is expected, hence we can estimate the selling price in terms of an interval (beyond the scope of this course).

Note: Say the regression output shows r² = 0.65 and the slope = –0.07. What is the linear correlation coefficient, r? Since r takes the sign of the slope, r = –√0.65 ≈ –0.81.

Note: If the question does not mention how many decimals you should round the result to, round to 2 decimals.
Testing the Slope, β1
16.31
▪ We can draw inferences about the population slope β1 from the sample slope b1.
▪ The process of testing hypotheses about β1 is identical to the process of testing any parameter. We begin with the hypotheses.
▪ We can conduct one- or two-tail tests of β1. Most often, we perform a two-tail test.
▪ The null and alternative hypotheses become:
H0: β1 = 0 (there is no linear relationship)
H1: β1 ≠ 0 (there is a linear relationship)
▪ TS:

$$t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$$

where s_{b1} is the standard deviation (standard error) of b1. If the error variable (ɛ) is normally distributed, the test statistic has a Student t-distribution with n–2 degrees of freedom.
Example 16.4 (Using Example 16.2)
16.32
Test to determine if there is a linear relationship between the price and the odometer reading at the 5% significance level.
H0: β1 = 0
H1: β1 ≠ 0 (claim; if the null hypothesis is true, no linear relationship exists)
The rejection region is: t < –t.025,98 = –1.984 or t > 1.984.
TS:

$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-0.0669 - 0}{0.00497} = -13.46$$

where, using sɛ = 0.3265 and s_x² = 43.509 from the Excel output (slide 16.27),

$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \frac{0.3265}{\sqrt{99(43.509)}} = 0.00497$$
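The same test statistic and its p-value can be reproduced in Python from the quantities reported in the Excel output (a sketch added for verification; not part of the slides):

```python
import math
from scipy import stats

# Quantities reported in the Example 16.2 Excel output
b1  = -0.0669   # estimated slope
s_e = 0.3265    # standard error of estimate
sx2 = 43.509    # sample variance of the odometer readings
n   = 100

s_b1 = s_e / math.sqrt((n - 1) * sx2)   # ≈ 0.00497
t = (b1 - 0) / s_b1                     # ≈ -13.45 (slide: -13.46 via rounded s_b1)

p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tail p-value, df = 98
print(round(s_b1, 5), round(t, 2), p)   # p ≈ 0, so H0 is rejected at α = .05
```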
Example 16.4 (Using Example 16.2)
INTERPRET
16.33
Although the F-statistic is mainly used in multiple regression, we can use it in simple regression as an alternative to the t-statistic.
Decision: The value of the TS = –13.46, which is less than the CV = –1.984, lies in the rejection region. Equivalently, we found p ≈ 0, which is < α = 0.05. Therefore, we reject the null hypothesis in favor of H1 at α = 0.05.
Conclusion: There is overwhelming evidence to infer that a significant linear relationship exists. What this means is that the odometer reading may affect the auction selling price of the cars.
Note: Regression analysis can only show that a statistical relationship exists. We cannot infer that one variable causes another.
Testing the Slope
16.34
If we wish to test for a negative or positive linear relationship, we conduct a one-tail test, i.e., our research hypothesis becomes:
H1: β1 < 0 (testing for a negative slope)
or
H1: β1 > 0 (testing for a positive slope)
Of course, the null hypothesis remains H0: β1 = 0.
However, in this case the p-value would be the two-tail p-value divided by 2; using Excel's two-tail p-value, this would be p/2, which is still approximately 0 here.
Note: Remember that Excel gives us the two-tail p-value.
Required Conditions
16.35
For these regression methods to be valid, the following four conditions for the error variable (ɛ) must be met:
• The probability distribution of ɛ is normal.
• The mean of the distribution is 0; that is, E(ɛ) = 0.
• The standard deviation of ɛ is σɛ, which is a constant regardless of the value of x.
• The value of ɛ associated with any particular value of y is independent of ɛ associated with any other value of y.
Chapter 17
Multiple Regression
• Model and Required Conditions
• Estimating the Coefficients and Assessing the Model
• Regression Diagnostics-I
• Regression Diagnostics-II (Time Series)
17.36
Multiple Regression
17.37
▪ The simple linear regression model was used to analyze how one
interval variable (the dependent variable y) is related to one other
interval variable (the independent variable x).
▪ Multiple regression allows for any number of independent variables.
▪ We expect to develop models that fit the data better than would a
simple linear regression model.
▪ The data for a simple linear regression problem consists of n
observations (𝑥𝑖, 𝑦𝑖) of two variables.
▪ Data for multiple linear regression
consists of the value of a response
variable y and k explanatory variables
(𝑥1, 𝑥2, …, 𝑥k) on each of n cases.
▪ We write the data and enter them into software in the form:

Individual   x1    x2   …   xk    y
1            x11   x12  …   x1k   y1
2            x21   x22  …   x2k   y2
⁞            ⁞     ⁞        ⁞     ⁞
n            xn1   xn2  …   xnk   yn
The Model
17.38
We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first order linear equation:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ɛ is the error variable.

The coefficient βi (i = 1, …, k) has the following interpretation: it represents the average change (increase or decrease) in the response variable, y, when the independent variable xi increases by one unit and all other x variables are held constant.
The Model
17.39
▪ In the simple regression model with one independent variable, we drew a
straight regression line.
▪ When there is more than one independent variable in the regression
model, we refer to the graphical depiction of the equation as a response
surface rather than as a straight line.
▪ The figure (a scatter diagram with a fitted plane, not reproduced here) depicts a response surface with k = 2. Whenever k > 2, we cannot draw the response surface.
Required Conditions
17.40
For these regression methods to be valid the following four conditions for
the error variable (ɛ) must be met:
• The probability distribution of the error variable (ɛ) is normal.
• The mean of the error variable is 0.
• The standard deviation of ɛ is σɛ , which is a constant.
• The errors are independent.
Estimating the Coefficients
17.41
The multiple regression equation is expressed as:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

We will use computer output to:
Assess the model:
• How well does it fit the data?
• Are any required conditions violated?
Employ the model:
• Interpreting the coefficients
• Estimating the expected value of the dependent variable
Regression Analysis Steps
17.42
1. Use a computer and software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit:
   standard error of estimate,
   coefficient of determination,
   F-test of the analysis of variance.
4. If steps 1, 2, and 3 are OK, use the model to predict or estimate the expected value of the dependent variable.
File name in Assignment 3
The Excel file name must be “Lastname, Firstname A3”
In A3, the Excel file should include 1 sheet.
4.43
Chapter-Opening Example
Week12 (Jul 31 – Aug 6)
Data file: Xm17-00
17.44
GENERAL SOCIAL SURVEY: VARIABLES THAT AFFECT INCOME
In the Chapter 16 opening example, it was shown using the General Social Survey that income and education are linearly related. This raises the question: what other variables affect one's income?
To answer this question, we need to expand the simple linear regression technique used in the previous chapter to allow for more than one independent variable.
Here is a list of selected variables the General Social Survey created:
1. Age (AGE): For most people, income increases with age.
2. Years of education (EDUC): It is possible that education and income
are linearly related.
3. Hours of work per week (HRS1): More hours of work should produce
more income.
Chapter-Opening Example
17.45
4. Spouse’s hours of work (SPHRS1): It is possible that, if one’s spouse
works more and earns more, the other spouse may choose to work less
and thus earn less.
5. Number of family members earning money (EARNRS): As is the case with SPHRS1, if more family members earn income, there may be less pressure on the respondent to work harder.
6. Number of children (CHILDS): Children are expensive, which may
encourage their parents to work harder and thus earn more.
Chapter-Opening Example– Using Excel
17.46
Data > Data Analysis > Regression
Data file: Xm17-00
Chapter-Opening Example– Using Excel
17.47
The results (Excel regression output):
Chapter-Opening Example– Using Excel
17.48
The estimated regression model is:

ŷ = −110186.4004 + 921.9746x1 + 5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6

where
y  = Income
x1 = Age
x2 = Education
x3 = Hours of work
x4 = Spouse's hours of work
x5 = Number of family members earning money
x6 = Number of children

Does there appear to be a significant linear relationship between income and at least one of the 6 independent variables?
Model Assessment
17.49
We will assess the estimated model in three ways: the standard error of estimate, the coefficient of determination, and the F-test of the analysis of variance.

In multiple regression, the standard error of estimate is defined as:

$$s_\varepsilon = \sqrt{\frac{SSE}{n-k-1}}$$

where SSE = sum of squares for error, n = sample size, and k = number of independent variables in the model.

From the Excel output: sɛ = 35901.56. The standard error of estimate seems quite large; however, we use it for comparison with other estimated models. The estimated model with the smallest sɛ is the best one (the closer the data values are to the regression line).
Model Assessment
17.50
Coefficient of Determination:
Again, the coefficient of determination R² is defined as:

$$R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2}$$

R² = 34.34%. This means that 34.34% of the variation in income is explained by the six independent variables, but 65.66% remains unexplained.

Adjusted R² is the coefficient of determination adjusted for the sample size n and the number of independent variables k:

$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{\sum (y_i - \bar{y})^2/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$

In the income model, the Adjusted R² = 33.44%. We use it in the multiple regression model.
Testing the Validity of the Model
17.51
In a multiple regression model (i.e., more than one independent variable), we utilize an analysis of variance (ANOVA) technique to test the overall validity of the model. Here are the hypotheses:
H0: β1 = β2 = … = βk = 0 (the model is not valid/significant)
H1: At least one βi is not equal to zero.
▪ If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid.
▪ If at least one βi is not equal to 0, the model does have some validity.
Testing the Validity of the Model
17.52
ANOVA table for regression analysis (here n = 446):

Source of Variation   Degrees of Freedom   Sums of Squares   Mean Squares         F-Statistic
Regression            k                    SSR               MSR = SSR/k          F = MSR/MSE
Error                 n – k – 1            SSE               MSE = SSE/(n–k–1)
Total                 n – 1

Chapter-Opening Example
A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid; a small value of F indicates that most of the variation in y is unexplained. (The Excel output also reports the p-value of the F-test.)
Explained and Unexplained Variation
4.53
(Figure: scatter of student grade (Y) against number of study hours (X), showing, for one point yᵢ, the split of its deviation from ȳ into an explained and an unexplained part.)

Summing the squared deviations over all observations gives

$$SST = \sum (y_i - \bar{y})^2, \qquad SSR = \sum (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum (y_i - \hat{y}_i)^2$$

with SST = SSR + SSE: SSR is the explained (predicted) portion and SSE is the unexplained (unpredicted) portion of the total variation.
Testing the Validity of the Model
17.54
Our rejection region is: F > Fα,k,n-k-1 = F.05,6,439 ≈ F.05,6,∞ = 2.10 (Table 6). (Here n = 446, so the numerator degrees of freedom are v1 = k = 6 and the denominator degrees of freedom are v2 = n − k − 1 = 439.)
Decision: We reject H0 in favor of H1 because the F statistic = 38.26 is > 2.10, or equivalently because p ≈ 0 is < α = .05.
Conclusion: There is a great deal of evidence to infer that there is a significant linear relationship between income and at least one of the 6 independent variables.
Notes:
▪ F is always zero or positive.
▪ Large values of the F statistic are evidence against H0.
▪ The F test is upper-one-sided.
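A useful identity, F = (R²/k) / ((1−R²)/(n−k−1)), recovers the F statistic from the coefficient of determination alone. The sketch below (an addition for verification, not from the slides) applies it to the income model:

```python
from scipy import stats

R2, n, k = 0.3434, 446, 6

F = (R2 / k) / ((1 - R2) / (n - k - 1))        # ≈ 38.27 (Excel reports 38.26)
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)  # ≈ 0.3344, as on the slide

p = stats.f.sf(F, k, n - k - 1)          # upper-tail p-value, ≈ 0
crit = stats.f.ppf(0.95, k, n - k - 1)   # ≈ 2.1, close to the table value 2.10
print(round(F, 2), round(adj_R2, 4), p, round(crit, 2))
```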
Relationship among SSE, sℇ, R², and F
Summary
17.55

SSE      sℇ       R²            F        Assessment of Model
0        0        1                      Perfect
small    small    close to 1    large    Good
large    large    close to 0    small    Poor
                  0             0        Invalid
Once we’re satisfied that the model fits the data as well as possible, and that
the required conditions are satisfied, we can interpret and test the individual
coefficients and use the model for prediction.
Interpreting the Coefficients
17.56
Intercept
The intercept is b0 = −110186.40. This is the average income when all the
independent variables are zero. As we observed in Chapter 16, it is often
misleading to try to interpret this value, particularly if 0 is outside the range
of the values of the independent variables (as is the case here).
Age, x1
The relationship between income and age is described by b1 = 921.97. For
each additional year of age, the income increases on average by $921.97,
assuming that the other independent variables in this model are held
constant.
Education, x2
The coefficient b2 = 5780.66 specifies that for each additional year of
education the income increases on average by $5,780.66, assuming all
other independent variables in this model are held constant.
yˆ = −110186.4004 + 921.9746x1 +5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6
Interpreting the Coefficients
17.57
Hours of work, x3
The relationship between annual income and hours of work per week is expressed by b3 = 1,095.60. We interpret this number as the average increase in annual income for each additional hour of work per week, keeping the other independent variables fixed.
Spouse's hours of work, x4
The relationship between annual income and a spouse's hours of work per week is described by b4 = −238.99, which means that for each additional hour a spouse works per week, the income decreases on average by $238.99 when the other variables are constant.
Number of family members earning income, x5
The relationship between annual income and the number of family members
who earn money is expressed by b5 = 149.79, which tells us that for each
additional family member earning money, the annual income increases on
average by $149.79 assuming that the other independent variables are
constant.
yˆ = −110186.4004 + 921.9746x1 +5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6
Interpreting the Coefficients
17.58
Number of children, x6
The relationship between annual income and number of children is expressed
by b6 = 469.40, which tells us that for each additional child, annual income
increases on average by $469.40 assuming that the other independent
variables are held constant.
yˆ = −110186.4004 + 921.9746x1 +5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6
Testing the Coefficients
17.59
For each independent variable, we can test to determine whether there is enough evidence of a linear relationship between this independent variable and the dependent variable in the entire population.
H0: βi = 0
H1: βi ≠ 0 (for i = 1, 2, …, 6)
TS:

$$t = \frac{b_i - \beta_i}{s_{b_i}}$$

(with n–k–1 degrees of freedom)
Testing the Coefficients
17.60
Test of β1 (Coefficient of age): Value of the test statistic: t = 6.16; p-value ≈ 0. We
reject H0 in favor of H1
Test of β2 (Coefficient of education): Value of the test statistic: t = 9.57; p-value ≈ 0.
We reject H0 in favor of H1
Test of β3 (Coefficient of number of hours of work per week): Value of the test
statistic: t = 9.34; p-value ≈ 0. We reject H0 in favor of H1
Test of β4 (coefficient of spouse's number of hours of work per week): Value of the test statistic: t = −1.88; p-value = .061. We do not reject H0.
Test of β5 (coefficient of number of earners in family): Value of the test statistic: t = .05; p-value = .960. We do not reject H0.
Test of β6 (coefficient of number of children): Value of the test statistic: t = .387; p-value = .699. We do not reject H0.
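These p-values follow from the t statistics with df = n − k − 1 = 439. A short verification sketch (not in the original slides):

```python
from scipy import stats

# t statistics reported in the Excel output (df = n - k - 1 = 439)
t_stats = {"AGE": 6.16, "EDUC": 9.57, "HRS1": 9.34,
           "SPHRS1": -1.88, "EARNRS": 0.05, "CHILDS": 0.387}

for name, t in t_stats.items():
    p = 2 * stats.t.sf(abs(t), df=439)   # two-tail p-value
    print(f"{name}: t = {t:6.2f}, p-value = {p:.3f}")
# The first three give p ≈ 0; SPHRS1 gives p ≈ 0.061,
# EARNRS p ≈ 0.960, and CHILDS p ≈ 0.699, matching the slide.
```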
Testing the Coefficients
INTERPRET
17.61
Conclusion:
There is sufficient evidence at the 5% significance level to infer that each of
the following variables is linearly related to income:
• Age
• Education
• Number of hours of work per week
In this model there is not enough evidence to conclude that each of the
following variables is linearly related to income:
• Spouse’s number of hours of work per week
• Number of earners in the family
• Number of children
Note that this may mean that there is no evidence of a linear relationship between these 3 independent variables and income. However, it may also mean that a linear relationship does exist between these variables and income, but because of a condition called multicollinearity, the t-test revealed no linear relationship.
We will discuss multicollinearity next.
Using the Regression Equation for Prediction
17.62
As we did with simple linear regression, we can predict the income of a 50-year-old respondent with 12 years of education, who works 40 hours per week, whose spouse also works 40 hours per week, with 2 earners in the family and 2 children:

ŷ = −110186.4004 + 921.9746x1 + 5780.6634x2 + 1095.5988x3 − 238.9901x4 + 149.7858x5 + 469.3977x6
  = −110186.4004 + 921.9746(50) + 5780.6634(12) + 1095.5988(40) − 238.9901(40) + 149.7858(2) + 469.3977(2)
  ≈ $40,783.01
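As an illustrative addition (not from the slides), the same point prediction is simply a dot product of the coefficient vector with the predictor values, with a leading 1 for the intercept:

```python
import numpy as np

# Estimated coefficients from the Excel output, intercept first
b = np.array([-110186.4004, 921.9746, 5780.6634, 1095.5988,
              -238.9901, 149.7858, 469.3977])

# Leading 1 for the intercept, then x1..x6: age 50, 12 years of education,
# 40 hrs/week, spouse 40 hrs/week, 2 earners in the family, 2 children
x = np.array([1, 50, 12, 40, 40, 2, 2])

print(round(b @ x, 2))   # 40783.01 -> predicted income of $40,783.01
```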
Regression Diagnostics I
17.63
▪ Multiple regression models have a problem that simple regressions do
not have, namely multicollinearity. It happens when the independent
variables are highly correlated.
▪ The adverse effect of multicollinearity is that the estimated regression
coefficients of the independent variables that are correlated tend to have
large sampling errors.
▪ The consequence of multicollinearity is that when the coefficients are tested, the t-statistics will be small, which leads to the inference that there is no linear relationship between the affected independent variables and the dependent variable. In some cases, this inference will be wrong.
▪ Fortunately, multicollinearity does not affect the F test of the analysis of
variance.
Multicollinearity
17.64
To illustrate, we’ll use the General Social Survey of 2012. When we conducted a
regression analysis similar to the chapter-opening example, we found that the
number of children in the family was not statistically significant at the 5%
significance level. However, when we tested the coefficient of correlation
between income and number of children, we found it to be statistically
significant.
Correlation Between Income and Number of Children
How do we explain the apparent contradiction between the insignificant
multiple regression t-test of the coefficient of the number of children, β6, and
the significant correlation coefficient of number of children and income?
The answer is multicollinearity.
Multicollinearity
17.65
▪ There is a relatively high degree of correlation between number of family
members who earn income, x5, and number of children, x6.
▪ The result of the t-test of the correlation between number of earners and number of children is significant. This result should not be surprising, as additional earners in a family are very likely to be children.
Correlation Between Number of Earners and Number of Children
Multicollinearity
4.66
▪ Another problem caused by multicollinearity is the interpretation of the
coefficients.
▪ We interpret the coefficients as measuring the change in the
dependent variable when the corresponding independent variable
increases by one unit while all the other independent variables are held
constant.
▪ This interpretation may be impossible when the independent variables
are highly correlated, because when the independent variable
increases by one unit, some or all of the other independent variables
will change.
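A common numerical diagnostic for multicollinearity, not covered in these slides, is the variance inflation factor (VIF): VIF_j = 1/(1 − R_j²), where R_j² comes from regressing x_j on the other independent variables. A self-contained numpy sketch:

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X (one row per case).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of
    determination from regressing column j on all other columns.
    Values well above roughly 5-10 flag strong multicollinearity.
    """
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return factors

# Hypothetical usage: X holds the six predictors (age, education, ...)
# as columns, one row per respondent.
# print(vif(X))
```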
Regression Diagnostics II
17.67
▪ We pointed out that one of the required conditions in regression analysis is that the errors should be independent (review the Required Conditions on slide 16.35).
▪ In time series data (i.e., when the data are gathered sequentially over a series of time periods), there is a high possibility of violating the independence condition between the errors.
▪ When this condition is violated, we have the problem called autocorrelation: a condition in which a relationship exists between consecutive residuals, i.e., eᵢ and eᵢ₋₁ (i is the time period).
▪ The Durbin-Watson test (beyond the scope of this course) allows us to determine whether there is evidence of first-order autocorrelation.
Regression Diagnostics II
17.68
▪ The residual plot over time (not reproduced here) reveals a serious problem: there is a strong relationship between consecutive values of the residuals, which indicates that the requirement that the errors be independent has been violated.
▪ To confirm this diagnosis, we can use Excel to calculate the Durbin–Watson statistic.
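For reference, the first-order Durbin-Watson statistic is d = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ². A minimal sketch (the test itself is beyond the scope of this course, as noted above):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    d lies between 0 and 4; values near 2 suggest no first-order
    autocorrelation, values near 0 suggest positive autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical usage on the residuals of a fitted time-series regression:
# print(durbin_watson(e))
```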