
Larose2e LecturePowerPointSlides Ch13

+ Discovering Statistics, 2nd Edition, Daniel T. Larose
Chapter 13: Inference in Regression
Lecture PowerPoint Slides
+ Chapter 13 Overview

13.1 Inference About the Slope of the Regression Line

13.2 Confidence Intervals and Prediction Intervals

13.3 Multiple Regression
+ The Big Picture
Where we are coming from and where we are headed…
In the later chapters of Discovering Statistics, we have been
studying more advanced methods in statistical inference.

Here in Chapter 13, we return to regression analysis, first
discussed in Chapter 4. At that time, we learned descriptive methods
for regression analysis; now it is time to learn how to perform
statistical inference in regression.


In the last chapter, we will explore nonparametric statistics.
+ 13.1: Inference About the Slope of the Regression Line
Objectives:
Explain the regression model and the regression model assumptions.
Perform the hypothesis test for the slope β₁ of the population regression equation.
Construct confidence intervals for the slope β₁.
Use confidence intervals to perform the hypothesis test for the slope β₁.
The Regression Model
Recall that the regression line approximates the relationship between two continuous variables and is described by the regression equation ŷ = b₁x + b₀.
Regression Model
The population regression equation is defined as:
    y = β₁x + β₀ + ε
where β₀ is the y-intercept of the population regression line, β₁ is the slope, and ε is the error term.

Regression Model Assumptions
1. Zero Mean: The error term ε is a random variable with mean 0.
2. Constant Variance: The variance of ε is the same regardless of the value of x.
3. Independence: The values of ε are independent of each other.
4. Normality: The error term ε is a normal random variable.

Hypothesis Tests for β₁
To test whether there is a linear relationship between x and y, we begin with the hypothesis test to determine whether or not β₁ equals 0.
H0: β₁ = 0  There is no linear relationship between x and y.
Ha: β₁ ≠ 0  There is a linear relationship between x and y.
Test Statistic t_data
    t_data = b₁ / ( s / √(Σ(x − x̄)²) )
where b₁ represents the slope of the regression line,
    s = √( SSE / (n − 2) )
represents the standard error of the estimate, and Σ(x − x̄)² represents the numerator of the sample variance of the x data.
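The test statistic above can be computed directly. Here is a minimal sketch, assuming NumPy and SciPy are available; the (x, y) values are purely illustrative stand-ins, not the data from Table 13.4:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data; illustrative only, not Table 13.4.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 3, 2, 5, 4, 6, 8, 7, 9, 10], dtype=float)
n = len(x)

# Least-squares slope b1 and intercept b0.
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

# Standard error of the estimate: s = sqrt(SSE / (n - 2)).
sse = np.sum((y - (b0 + b1 * x)) ** 2)
s = np.sqrt(sse / (n - 2))

# Test statistic t_data = b1 / (s / sqrt(sum of squared x deviations)).
t_data = b1 / (s / np.sqrt(sxx))

# Two-tailed p-value from the t distribution with n - 2 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_data), df=n - 2)
```

For these strongly linear hypothetical data, the p-value comes out far below any common α, mirroring the conclusion reached in the worked example.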
Hypothesis Tests for β₁
H0: β₁ = 0  There is no linear relationship between x and y.
Ha: β₁ ≠ 0  There is a linear relationship between x and y.
Hypothesis Test for Slope β₁
If the conditions for the regression model are met:
Step 1: State the hypotheses.
Step 2: Find the t critical value and the rejection rule.
Step 3: Calculate the test statistic and the p-value.
    t_data = b₁ / ( s / √(Σ(x − x̄)²) )
Step 4: State the conclusion and the interpretation.
Example
Ten subjects were given a set of nonsense words to memorize
within a certain amount of time and were later scored on the number
of words they could remember. The results are in Table 13.4.
Test whether there is a relationship between time and score using level of significance α = 0.01. Note the graphs on page 640, indicating the conditions for the regression model have been met.
H0: β₁ = 0  There is no linear relationship between time and score.
Ha: β₁ ≠ 0  There is a linear relationship between time and score.
Reject H0 if the p-value is less than α = 0.01.
Example
Since the p-value of about 0.000 is less than α = 0.01, we reject H0. There is evidence for a linear relationship between time and score.
Confidence Interval for β₁
Confidence Interval for Slope β₁
When the regression assumptions are met, a 100(1 − α)% confidence interval for β₁ is given by:
    b₁ ± tα/2 · s / √(Σ(x − x̄)²)
where t has n − 2 degrees of freedom.

Margin of Error
The margin of error for a 100(1 − α)% confidence interval for β₁ is given by:
    E = tα/2 · s / √(Σ(x − x̄)²)
As in earlier sections, we may use a confidence interval for the slope to perform a two-tailed test for β₁. If the interval does not contain 0, we would reject the null hypothesis.
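The interval and its margin of error translate directly into code. A sketch with hypothetical data (NumPy/SciPy assumed; the numbers are illustrative only):

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data; illustrative only.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 3, 2, 5, 4, 6, 8, 7, 9, 10], dtype=float)
n = len(x)

# Slope, intercept, and standard error of the estimate.
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# 95% CI: margin of error E = t_{alpha/2} * s / sqrt(Sxx), df = n - 2.
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
E = t_crit * s / np.sqrt(sxx)
ci_lower, ci_upper = b1 - E, b1 + E
```

For these data the interval excludes 0, so the equivalent two-tailed test would reject H0: β₁ = 0 at α = 0.05, as described above.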
+ 13.2: Confidence Intervals and Prediction Intervals
Objectives:
Construct confidence intervals for the mean value of y for a given
value of x.

Construct prediction intervals for a randomly chosen value of y for
a given value of x.

Confidence Interval for the Mean Value of y for a Given x
A 100(1 − α)% confidence interval for the mean response, that is, for the population mean of all values of y, given a value of x, may be constructed using the following lower and upper bounds:
    Lower Bound: ŷ − tα/2 · s · √( 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
    Upper Bound: ŷ + tα/2 · s · √( 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
where x* represents the given value of the predictor variable. The requirements are that the regression assumptions are met or the sample size is large.
Prediction Interval for an Individual Value of y for a Given x
A 100(1 − α)% prediction interval for a randomly selected value of y given a value of x may be constructed using the following lower and upper bounds:
    Lower Bound: ŷ − tα/2 · s · √( 1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
    Upper Bound: ŷ + tα/2 · s · √( 1 + 1/n + (x* − x̄)² / Σ(xᵢ − x̄)² )
where x* represents the given value of the predictor variable. The requirements are that the regression assumptions are met or the sample size is large.
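The two intervals differ only by the extra 1 under the square root, which is why the prediction interval is always wider than the confidence interval at the same x*. A sketch computing both, with hypothetical data and an arbitrary x* = 5 (NumPy/SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) data; illustrative only.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 3, 2, 5, 4, 6, 8, 7, 9, 10], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x_star = 5.0                           # given value of the predictor
y_hat = b0 + b1 * x_star               # point estimate for both intervals
t_crit = stats.t.ppf(0.975, df=n - 2)  # 95% level

# Confidence interval for the mean value of y at x*.
half_ci = t_crit * s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)
ci = (y_hat - half_ci, y_hat + half_ci)

# Prediction interval for an individual y at x*: note the extra "1 +".
half_pi = t_crit * s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)
pi = (y_hat - half_pi, y_hat + half_pi)
```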
+ 13.3: Multiple Regression
Objectives:
Find the multiple regression equation, interpret the multiple regression coefficients, and use the multiple regression equation to make predictions.
Calculate and interpret the adjusted coefficient of determination.
Perform the F test for the overall significance of the multiple regression.
Conduct t tests for the significance of individual predictor variables.
Explain the use and effect of dummy variables in multiple regression.
Apply the strategy for building a multiple regression model.
Multiple Regression
Thus far, we have examined the relationship between the response
variable y and a single predictor variable x. In our data-filled world,
however, we often encounter situations where we can use more
than one x variable to predict the y variable.
Multiple regression describes the linear relationship between one response variable y and more than one predictor variable x₁, x₂, …. The multiple regression equation is an extension of the regression equation:
    ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
where k represents the number of x variables in the equation and b₀, b₁, … represent the multiple regression coefficients.

The interpretation of the regression coefficients is similar to the interpretation of the slope in simple linear regression, except that we add that the other x variables are held constant.
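One common way to compute the coefficients is ordinary least squares on a design matrix whose first column is all ones (for the intercept). A sketch, assuming NumPy, with hypothetical data constructed so that y = 2 + 3x₁ − x₂ exactly:

```python
import numpy as np

# Hypothetical predictors and response: y = 2 + 3*x1 - x2 exactly.
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = np.array([2., 1., 4., 3., 6., 5.])
y = 2 + 3 * x1 - x2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares coefficients [b0, b1, b2].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a new observation (x1 = 7, x2 = 4).
y_new = coef @ np.array([1.0, 7.0, 4.0])
```

Interpreting coef[1] here: each extra unit of x₁ is associated with a 3-unit increase in predicted y, holding x₂ constant, which is exactly the "other x variables held constant" reading described above.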
Adjusted Coefficient of Determination
We measure the goodness of a regression equation using the coefficient of determination r² = SSR/SST. In multiple regression, we use the same formula for the coefficient of determination (though the letter r is promoted to a capital R).
Multiple Coefficient of Determination R²
The multiple coefficient of determination is given by:
    R² = SSR/SST,  0 ≤ R² ≤ 1
where SSR is the sum of squares regression and SST is the total sum of squares. The multiple coefficient of determination represents the proportion of the variability in the response y that is explained by the multiple regression equation.
Adjusted Coefficient of Determination
Unfortunately, when a new x variable is added to the multiple regression equation, the value of R² always increases, even when the variable is not useful for predicting y. So we need a way to adjust the value of R² as a penalty for having too many unhelpful x variables in the equation.
Adjusted Coefficient of Determination R²adj
The adjusted coefficient of determination is given by:
    R²adj = 1 − (1 − R²) · (n − 1) / (n − k − 1)
where n is the number of observations, k is the number of x variables, and R² is the multiple coefficient of determination.
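Both quantities follow directly from the fitted values. A sketch continuing the hypothetical two-predictor data, with small perturbations added so that R² < 1 (NumPy assumed):

```python
import numpy as np

# Hypothetical data: roughly y = 2 + 3*x1 - x2, plus small perturbations.
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = np.array([2., 1., 4., 3., 6., 5.])
y = 2 + 3 * x1 - x2 + np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3])
n, k = len(y), 2

X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R^2 = SSR / SST = 1 - SSE / SST.
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sst

# Adjusted R^2 penalizes extra predictors via (n - 1)/(n - k - 1).
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that r2_adj is always below r2 whenever k ≥ 1 and R² < 1, which is exactly the penalty described above.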
F Test for Multiple Regression
The multiple regression model is an extension of the model from
Section 13.1, and approximates the relationship between y and the
collection of x variables.
Multiple Regression Model
The population multiple regression equation is defined as:
    y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
where β₀, β₁, β₂, …, βₖ are the parameters of the population regression equation, k is the number of x variables, and ε is the error term that follows a normal distribution with mean 0 and constant variance.
The population parameters are unknown, so we must perform inference to learn about them. We begin by asking: Is our multiple regression useful? To answer this, we perform the F test for the overall significance of the multiple regression.
F Test for Multiple Regression
The hypotheses for the F test are:
H0: β₁ = β₂ = … = βₖ = 0
Ha: At least one of the β's ≠ 0.
The F test is not valid if there is strong evidence that the regression assumptions have been violated.
F Test for Multiple Regression
If the conditions for the regression model are met:
Step 1: State the hypotheses and the rejection rule.
Step 2: Find the F statistic and the p-value. (Located in the ANOVA table of computer output.)
Step 3: State the conclusion and the interpretation.
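The F statistic compares the mean square due to regression with the mean square error, F = (SSR/k) / (SSE/(n − k − 1)). A sketch on the same kind of hypothetical data (NumPy/SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical data: roughly y = 2 + 3*x1 - x2, plus small perturbations.
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = np.array([2., 1., 4., 3., 6., 5.])
y = 2 + 3 * x1 - x2 + np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3])
n, k = len(y), 2

X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

# F = MSR / MSE, with k and n - k - 1 degrees of freedom.
F = (ssr / k) / (sse / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)
```

In computer output this F statistic and p-value appear in the ANOVA table; a small p-value indicates that at least one β differs from 0.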
t Test for Individual Predictor Variables
To determine whether a particular x variable has a significant linear
relationship with the response variable y, we perform the t test that
was used in Section 13.1 to test for the significance of that x variable.
t Test for Individual Predictor Variables
One may perform as many t tests as there are predictor variables in the
model, which is k.
If the conditions for the regression model are met:
Step 1: For each hypothesis test, state the hypotheses and the rejection
rule.
Step 2: For each hypothesis test, find the t statistic and the p-value.
Step 3: For each hypothesis test, state the conclusion and the
interpretation.
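Each coefficient's t statistic is its estimate divided by its standard error, which can be read off the diagonal of s²(XᵀX)⁻¹. A sketch with the same hypothetical data (NumPy/SciPy assumed; the matrix-based standard-error formula is one standard way to compute what software reports):

```python
import numpy as np
from scipy import stats

# Hypothetical data: roughly y = 2 + 3*x1 - x2, plus small perturbations.
x1 = np.array([1., 2., 3., 4., 5., 6.])
x2 = np.array([2., 1., 4., 3., 6., 5.])
y = 2 + 3 * x1 - x2 + np.array([0.2, -0.1, 0.3, -0.2, 0.1, -0.3])
n, k = len(y), 2

X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# s^2 = SSE / (n - k - 1); standard errors from diag of s^2 (X'X)^-1.
s2 = resid @ resid / (n - k - 1)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# One t test per coefficient (intercept included), df = n - k - 1.
t_stats = coef / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)
```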
Dummy Variables
It is possible to include binomial categorical variables in multiple
regression by using a “dummy variable.”
A dummy variable is a predictor variable used to recode a
binomial categorical variable in regression by taking values 0 or 1.
Including the dummy variable in the multiple regression equation results in two different regression equations, one for one value of the categorical variable and one for the other.
These two regression equations will have the same slope, but different y-intercepts.
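This "same slope, different intercepts" effect is easy to see in code. A sketch with hypothetical data coded d = 0 or d = 1 and built so that y = 1 + 2x + 3d exactly (NumPy assumed):

```python
import numpy as np

# Hypothetical data: two groups, same slope 2, intercepts 1 and 1 + 3 = 4.
x = np.array([1., 2., 3., 1., 2., 3.])
d = np.array([0., 0., 0., 1., 1., 1.])   # dummy: 0 = group A, 1 = group B
y = 1 + 2 * x + 3 * d

X = np.column_stack([np.ones_like(x), x, d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# Two regression equations with a common slope b1:
#   group A (d = 0): y-hat = b0 + b1*x
#   group B (d = 1): y-hat = (b0 + b2) + b1*x
intercept_A = b0
intercept_B = b0 + b2
```

The dummy coefficient b₂ is precisely the vertical shift between the two parallel regression lines.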
Building a Multiple Regression Model
Strategy for Building a Multiple Regression Model
Step 1: The F Test – Construct the multiple regression equation using all
relevant predictor variables. Apply the F test in order to make sure that a
linear relationship exists between the response y and at least one of the
predictor variables.
Step 2: The t Tests – Perform the t tests for the individual predictors. If at
least one of the predictors is not significant, then eliminate the x variable
with the largest p-value from the model. Repeat until all remaining predictors
are significant.
Step 3: Verify the Assumptions – For your final model, verify the
regression assumptions.
Step 4: Report and Interpret Your Final Model – Provide the multiple
regression equation, interpret the multiple regression coefficients, and report
and interpret the standard error of the estimate and the adjusted coefficient
of determination.
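Step 2 of the strategy can be sketched as a small backward-elimination loop. Everything here is illustrative: the helper name, the α = 0.05 cutoff, and the data (x2 is deliberately unrelated to y) are assumptions for the sketch, not the book's notation (NumPy/SciPy assumed):

```python
import numpy as np
from scipy import stats

def fit_with_pvalues(X, y):
    """OLS fit; returns coefficients and two-tailed t-test p-values."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, 2 * stats.t.sf(np.abs(beta / se), df=n - p)

# Hypothetical data: y depends on x1; x2 is noise and should be dropped.
x1 = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
x2 = np.array([2., 1., 3., 1., 2., 3., 1., 2.])
y = 1 + 2 * x1 + np.array([.1, -.1, .2, -.2, .1, -.1, .2, -.2])

names = ["x1", "x2"]
X = np.column_stack([np.ones_like(x1), x1, x2])
alpha = 0.05

# Drop the least significant predictor, refit, repeat (Step 2).
while True:
    beta, pvals = fit_with_pvalues(X, y)
    worst = 1 + int(np.argmax(pvals[1:]))   # ignore the intercept's p-value
    if pvals[worst] <= alpha or len(names) == 1:
        break
    X = np.delete(X, worst, axis=1)
    del names[worst - 1]
```

After the loop, the remaining columns are the final model's predictors, and Steps 3 and 4 (checking assumptions, then reporting the equation, s, and R²adj) proceed on that fit.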