Lecture 8-9

Regression Analysis
© 2007 Prentice Hall
17-1
Chapter Outline
1) Correlations
2) Bivariate Regression
3) Statistics Associated with Bivariate
Regression
4) Conducting Bivariate Regression Analysis
i. Scatter Diagram
ii. Bivariate Regression Model
iii. Estimation of Parameters
iv. Standardized Regression Coefficient
v. Significance Testing
vi. Strength and Significance of Association
vii. Assumptions
5) Multiple Regression
6) Statistics Associated with Multiple
Regression
7) Conducting Multiple Regression
i. Partial Regression Coefficients
ii. Strength of Association
iii. Significance Testing
8) Multicollinearity
9) Relative Importance of Predictors
Product Moment Correlation




The product moment correlation, r, summarizes
the strength of association between two metric
(interval or ratio scaled) variables, say X and Y.
It is an index used to determine whether a linear or
straight-line relationship exists between X and Y.
r varies between -1.0 and +1.0.
The correlation coefficient between two variables will
be the same regardless of their underlying units of
measurement.
Explaining Attitude Toward the City of Residence
Table 17.1

Respondent No.   Attitude Toward   Duration of   Importance Attached
                 the City          Residence     to Weather
 1                6                10             3
 2                9                12            11
 3                8                12             4
 4                3                 4             1
 5               10                12            11
 6                4                 6             1
 7                5                 8             7
 8                2                 2             4
 9               11                18             8
10                9                 9            10
11               10                17             8
12                2                 2             5
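As a rough illustration of the product moment correlation, the sketch below computes r between attitude and duration from the Table 17.1 values. The helper name `pearson_r` is illustrative, and the factor of 12 is only a hypothetical change of units (the slides do not state the unit of duration); it is there to show that r does not depend on the measurement scale.

```python
import math

# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]


def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r_duration = pearson_r(duration, attitude)

# Hypothetical rescaling of duration: r should not change.
r_rescaled = pearson_r([12 * d for d in duration], attitude)

print(round(r_duration, 4), round(r_rescaled, 4))
```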
Product Moment Correlation


When it is computed for a population rather than a
sample, the product moment correlation is denoted
by ρ, the Greek letter rho. The coefficient r is an
estimator of ρ.
The statistical significance of the relationship
between two variables measured by using r can be
conveniently tested. The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0
Significance of correlation
• The test statistic has a t distribution with n − 2 degrees of freedom.
• The r between ‘Attitude towards city’ and ‘Duration’ is 0.9361.
• The value of the t statistic is 8.414.
• From the t table (Table 4 in the Statistical Appendix), the critical value
of t for a two-tailed test with 10 degrees of freedom and α = 0.05 is 2.228.
• Hence, the null hypothesis of no relationship between X and
Y is rejected.
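The slide does not spell out the test statistic. A minimal sketch, assuming the usual form t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom, reproduces the reported value to within rounding. The function name is illustrative, and the critical value is the one quoted from the t table above.

```python
import math


def t_for_correlation(r, n):
    """t statistic for H0: rho = 0 (assumed standard formula, n - 2 df)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r = 0.9361          # correlation between attitude and duration
n = 12              # respondents in Table 17.1
t_critical = 2.228  # two-tailed, alpha = 0.05, 10 df (from the slide)

t_stat = t_for_correlation(r, n)
reject_h0 = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_h0)
```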
Regression Analysis
Regression analysis examines associative relationships
between a metric dependent variable and one or more
independent variables in the following ways:



• Determine whether the independent variables explain a
significant variation in the dependent variable: whether a
relationship exists.
• Determine how much of the variation in the dependent
variable can be explained by the independent variables:
strength of the relationship.
• Predict the values of the dependent variable.
Statistics Associated with Bivariate
Regression Analysis



• Regression model. Yi = β0 + β1 Xi + ei, where Y
= dependent variable, X = independent variable, β0 = intercept of the
line, β1 = slope of the line, and ei is the error term
for the i-th observation.
• Coefficient of determination, r². Measures the
strength of association. It varies between 0 and 1 and
signifies the proportion of the variation in Y
accounted for by the variation in X.
• Estimated or predicted value of Yi. This
is Ŷi = a + bXi, where Ŷi is the predicted value of
Yi, and a and b are estimators of β0 and β1.
Statistics Associated with Bivariate
Regression Analysis



• Regression coefficient. The estimated
parameter b is usually referred to as the
nonstandardized regression coefficient.
• Standard error of estimate. This statistic is
the standard deviation of the actual Y values
from the predicted Y values.
• Standard error. The standard deviation of b,
SEb, is called the standard error.
Statistics Associated with Bivariate
Regression Analysis


• Sum of squared errors. The distances of
all the points from the regression line are
squared and added together to arrive at the
sum of squared errors, Σej², which is a measure
of total error.
• t statistic. A t statistic can be used to test
the null hypothesis that no linear
relationship exists between X and Y.
Idea Behind Estimating Regression Eqn



• A scatter diagram, or scattergram, is a plot of the
values of two variables.
• The most commonly used technique for fitting a
straight line to a scattergram is the least-squares
procedure.
• In fitting the line, the least-squares procedure
minimizes the sum of squared errors, Σej².
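The least-squares idea can be sketched with the attitude and duration columns of Table 17.1. The closed-form slope and intercept used here are the standard bivariate solution, not a formula given on the slides, and the function names are illustrative. The last lines compare the fitted line with an arbitrarily shifted one to show that the shifted line has a larger sum of squared errors.

```python
# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(a, b, x, y):
    """Sum of squared vertical distances from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(duration, attitude)
sse_fitted = sum_squared_errors(a, b, duration, attitude)

# An arbitrary alternative line, chosen only for comparison.
sse_other = sum_squared_errors(a + 0.5, b - 0.05, duration, attitude)

print(round(a, 4), round(b, 4), sse_fitted < sse_other)
```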
Conducting Bivariate Regression Analysis
Plot the Scatter Diagram
Formulate the General Model
Estimate the Parameters
Estimate Standardized Regression Coefficients
Test for Significance
Determine the Strength and Significance of Association
Plot of Attitude with Duration
Fig. 17.3
[Scatter diagram: Attitude (vertical axis) plotted against Duration of Residence (horizontal axis).]
Which Straight Line Is Best?
Fig. 17.4
[Scatter diagram with four candidate straight lines, labeled Line 1 to Line 4, drawn through the points.]
Decomposing the Total Variation
Fig. 17.6
[Figure: Y plotted against X (points X1 to X5), marking the residual variation, SSres, and the explained variation, SSreg.]
Decomposing the Total Variation
The total variation, SSy, may be decomposed into the variation
accounted for by the regression line, SSreg, and the error or residual
variation, SSerror or SSres, as follows:
SSy = SSreg + SSres
where
$$SS_y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
$$SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
$$SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
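The decomposition can be checked numerically. The sketch below, using the attitude and duration columns of Table 17.1 and a least-squares line fitted in the standard closed form (an assumption, since the slides do not give the estimation formulas), computes the three sums of squares and the ratio SSreg/SSy. Variable names are illustrative.

```python
# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]

n = len(attitude)
mean_x = sum(duration) / n
mean_y = sum(attitude) / n

# Least-squares slope and intercept (standard closed form).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(duration, attitude))
sxx = sum((x - mean_x) ** 2 for x in duration)
b = sxy / sxx
a = mean_y - b * mean_x

# Predicted value for every respondent.
predicted = [a + b * x for x in duration]

ss_y = sum((y - mean_y) ** 2 for y in attitude)                 # total
ss_reg = sum((yh - mean_y) ** 2 for yh in predicted)            # explained
ss_res = sum((y - yh) ** 2 for y, yh in zip(attitude, predicted))  # residual

# Share of the total variation accounted for by the regression line.
r_squared = ss_reg / ss_y

print(round(ss_y, 4), round(ss_reg, 4), round(ss_res, 4), round(r_squared, 4))
```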
Strength and Significance of Association
The strength of association is:
$$R^2 = \frac{SS_{reg}}{SS_y}$$
It answers the question: “What percentage of the total
variation in Y is explained by X?”
Test for Significance
The statistical significance of the linear relationship
between X and Y may be tested by examining the
hypotheses:
H0: β1 = 0
H1: β1 ≠ 0
A t statistic with n − 2 degrees of freedom can be used, where t = b/SEb.
SEb denotes the standard deviation of b and is called
the standard error.
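The slides define SEb but do not show how it is obtained. A minimal sketch, assuming the standard bivariate formulas (standard error of estimate = sqrt(SSres/(n − 2)) and SEb = standard error of estimate / sqrt(Σ(Xi − X̄)²)), computes both from the Table 17.1 data and then forms t = b/SEb. Variable names are illustrative.

```python
import math

# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]

n = len(attitude)
mean_x = sum(duration) / n
mean_y = sum(attitude) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(duration, attitude))
sxx = sum((x - mean_x) ** 2 for x in duration)

# Least-squares estimates.
b = sxy / sxx
a = mean_y - b * mean_x

# Residual sum of squares around the fitted line.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(duration, attitude))

# Assumed standard formulas (not stated on the slides).
std_error_of_estimate = math.sqrt(ss_res / (n - 2))
se_b = std_error_of_estimate / math.sqrt(sxx)

t_stat = b / se_b

print(round(std_error_of_estimate, 5), round(se_b, 5), round(t_stat, 2))
```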
Illustration of Bivariate Regression
The regression of attitude on duration of residence, using the
data shown in Table 17.1, yielded the results shown in Table
17.2: a = 1.0793, b = 0.5897. The estimated equation is:
Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence)
The standard error, or standard deviation, of b is 0.07008, and
t = 0.5897/0.07008 ≈ 8.414 (the value reported in Table 17.2).
The p-value corresponding to the calculated t is reported as 0.000. Since this
is smaller than α = 0.05, the null hypothesis is rejected.
Bivariate Regression
Table 17.2

Multiple R        0.93608
R2                0.87624
Adjusted R2       0.86387
Standard Error    1.22329

ANALYSIS OF VARIANCE
              df    Sum of Squares    Mean Square
Regression     1    105.95222         105.95222
Residual      10     14.96444           1.49644
F = 70.80266        Significance of F = 0.0000

VARIABLES IN THE EQUATION
Variable      b          SEb        Beta (β)    T        Significance of T
Duration      0.58972    0.07008    0.93608     8.414    0.0000
(Constant)    1.07932    0.74335                1.452    0.1772
Strength and Significance of Association
The predicted values (Ŷ) can be calculated using
Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence)
• For the first observation in Table 17.1, this value is:
Ŷ = 1.0793 + 0.5897 × 10 = 6.9763.
• For each observation, we can obtain this value.
• Using these, SSreg = 105.9524, SSres = 14.9644, and
R² = 105.95/(105.95 + 14.96) = 0.8762.
Strength and Significance of Association
Another, equivalent test for examining the
significance of the linear relationship between X and
Y (significance of b) is the test for the significance of
the coefficient of determination. The hypotheses in
this case are:
H0: R²pop = 0
H1: R²pop > 0
Strength and Significance of Association
• The appropriate test statistic is the F statistic, which has an F
distribution with 1 and n − 2 degrees of freedom.
• The p-value corresponding to the F statistic is 0.0000.
Therefore, the relationship is significant at the α = 0.05 level,
corroborating the results of the t test.
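To see why the F test corroborates the t test, the sketch below rebuilds F from the sums of squares in Table 17.2, assuming the usual definition F = (SSreg/k)/(SSres/(n − k − 1)) with k = 1 predictor. In the bivariate case this F should equal the square of the t statistic for b, up to rounding. Variable names are illustrative.

```python
# Values taken from Table 17.2.
ss_reg = 105.95222
ss_res = 14.96444
t_stat = 8.414

n = 12  # respondents
k = 1   # one predictor (duration)

# Mean squares: sum of squares divided by its degrees of freedom.
ms_reg = ss_reg / k
ms_res = ss_res / (n - k - 1)

f_stat = ms_reg / ms_res

# With a single predictor, F and t squared test the same hypothesis.
t_squared = t_stat ** 2

print(round(f_stat, 3), round(t_squared, 3))
```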
Assumptions

• The error term is normally distributed.
• The mean of the error term is 0.
• The variance of the error term is constant.
This variance does not depend on the
values assumed by X.
• The error terms are uncorrelated. In other
words, the observations have been drawn
independently.
Multiple Regression
The general form of the multiple regression model
is as follows:
Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + e
which is estimated by the following equation:
Ŷ = a + b1X1 + b2X2 + b3X3 + . . . + bkXk
As before, the coefficient a represents the intercept,
but the b's are now the partial regression coefficients.
Stats Associated with Multiple Reg



• Coefficient of multiple determination. The
strength of association is measured by R².
• Adjusted R². R², the coefficient of multiple
determination, is adjusted for the number of
independent variables and the sample size.
• F test. The F test is used to test the null
hypothesis that the coefficient of multiple
determination in the population, R²pop, is zero.
The test statistic has an F distribution.
The Multiple Regression Equation

• For the data in Table 17.1, suppose we want to
explain ‘Attitude Towards City’ by ‘Duration’
and ‘Importance of Weather’.
• From Table 17.3, the estimated regression
equation is:
Ŷ = 0.33732 + 0.48108 X1 + 0.28865 X2
or
Attitude = 0.33732 + 0.48108 (Duration) +
0.28865 (Importance)
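The estimated equation can be reproduced from Table 17.1. The sketch below assumes ordinary least squares and uses the closed-form solution of the normal equations for exactly two predictors; it is a hand-rolled illustration, not the routine that produced Table 17.3, and the helper name `centered_cross` is hypothetical.

```python
# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
importance = [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5]


def mean(values):
    return sum(values) / len(values)


def centered_cross(u, v):
    """Sum of products of deviations of u and v from their means."""
    mean_u, mean_v = mean(u), mean(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))


s11 = centered_cross(duration, duration)
s22 = centered_cross(importance, importance)
s12 = centered_cross(duration, importance)
s1y = centered_cross(duration, attitude)
s2y = centered_cross(importance, attitude)

# Solve the 2x2 normal equations for the partial regression coefficients.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det  # duration
b2 = (s2y * s11 - s1y * s12) / det  # importance
a = mean(attitude) - b1 * mean(duration) - b2 * mean(importance)

print(round(a, 4), round(b1, 4), round(b2, 4))
```

Note that b1 here differs from the bivariate slope for duration because it is estimated with importance held in the equation.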

Multiple Regression
Table 17.3

Multiple R        0.97210
R2                0.94498
Adjusted R2       0.93276
Standard Error    0.85974

ANALYSIS OF VARIANCE
              df    Sum of Squares    Mean Square
Regression     2    114.26425         57.13213
Residual       9      6.65241          0.73916
F = 77.29364        Significance of F = 0.0000

VARIABLES IN THE EQUATION
Variable      b          SEb        Beta (β)    T        Significance of T
IMPORTANCE    0.28865    0.08608    0.31382     3.353    0.0085
DURATION      0.48108    0.05895    0.76363     8.160    0.0000
(Constant)    0.33732    0.56736                0.595    0.5668
Strength of Association
The strength of association is measured by R², which is
interpreted as in the bivariate case.
R² is adjusted for the number of independent variables
and the sample size. The result is called adjusted R².
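The slides do not give the adjustment formula. A minimal sketch, assuming the common form adjusted R² = R² − k(1 − R²)/(n − k − 1), where k is the number of independent variables and n the sample size, recovers the adjusted values reported in Tables 17.2 and 17.3 to within rounding. The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared (assumed standard form) for k predictors, n cases."""
    return r2 - k * (1 - r2) / (n - k - 1)


# R-squared values reported in Table 17.2 (k = 1) and Table 17.3 (k = 2).
adj_bivariate = adjusted_r2(0.87624, n=12, k=1)
adj_multiple = adjusted_r2(0.94498, n=12, k=2)

print(round(adj_bivariate, 4), round(adj_multiple, 4))
```

The penalty grows with k, so adding a predictor raises adjusted R² only if the gain in R² outweighs the lost degree of freedom.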
Conducting Multiple Regression Analysis:
Significance Testing
H0: R²pop = 0. This is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = . . . = βk = 0
The overall test (for all βi’s collectively) can be conducted by
using an F statistic, which has an F distribution.
Testing for the significance of the individual βi’s can be done in a
manner similar to that in the bivariate case, by using t tests.
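Both tests can be illustrated with the figures in Table 17.3. The sketch assumes the usual forms F = (R²/k)/((1 − R²)/(n − k − 1)), equivalently the ratio of the regression and residual mean squares, and t = b/SEb for each coefficient. Small differences from the table come from rounding of the inputs; variable names are illustrative.

```python
n = 12  # respondents
k = 2   # predictors: duration and importance

# Overall F test, computed two equivalent ways from Table 17.3.
r2 = 0.94498
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
f_from_ss = (114.26425 / k) / (6.65241 / (n - k - 1))

# Individual t tests: (b, SEb) pairs from Table 17.3.
coefficients = {
    "IMPORTANCE": (0.28865, 0.08608),
    "DURATION": (0.48108, 0.05895),
}
t_values = {name: b / se_b for name, (b, se_b) in coefficients.items()}

print(round(f_from_r2, 2), round(f_from_ss, 2))
for name, t in t_values.items():
    print(name, round(t, 2))
```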
Multicollinearity


• Multicollinearity arises when intercorrelations
among the predictors are very high.
• Multicollinearity can result in several problems,
including:
  - The partial regression coefficients may not
    be estimated precisely. The standard errors
    are likely to be high.
  - It becomes difficult to assess the relative
    importance of the independent variables in
    explaining the variation in the dependent
    variable.
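A first, informal check for multicollinearity is to look at the correlation between the predictors themselves. The sketch below does this for the two predictors in Table 17.1. The slides give no cutoff for "very high", so the code only reports the value and leaves the judgment to the reader; the function name is illustrative.

```python
import math

# Predictor columns of Table 17.1 (12 respondents).
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
importance = [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5]


def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Intercorrelation between the two predictors.
r_predictors = pearson_r(duration, importance)

print(round(r_predictors, 4))
```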
Relative Importance of Predictors


• Statistical significance. If the partial
regression coefficient of a variable is not
significant, that variable is judged to be
unimportant.
• Measures based on standardized
coefficients or beta weights. The most
commonly used measures are the absolute
values of the beta weights, |Bi|, or the squared
values, Bi².
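As an illustration of the beta-weight measure, the sketch below converts the partial regression coefficients of Table 17.3 into standardized coefficients, assuming the usual relation Beta = b × (standard deviation of X / standard deviation of Y), and then ranks the predictors by absolute beta weight. The relation is not stated on the slides, and the names are illustrative.

```python
import math

# Columns of Table 17.1 (12 respondents).
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
predictors = {
    "DURATION": [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2],
    "IMPORTANCE": [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5],
}

# Partial regression coefficients reported in Table 17.3.
b = {"DURATION": 0.48108, "IMPORTANCE": 0.28865}


def std_dev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


sd_y = std_dev(attitude)

# Standardized coefficients (beta weights), assumed relation.
beta = {name: b[name] * std_dev(x) / sd_y for name, x in predictors.items()}

# Rank predictors by absolute beta weight, largest first.
ranked = sorted(beta, key=lambda name: abs(beta[name]), reverse=True)

for name in ranked:
    print(name, round(beta[name], 4), round(beta[name] ** 2, 4))
```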