Uploaded by Hamza Ben Nasr

Ch2 Eco

Chapter II
Simple Regression Analysis
Dr Hédi ESSID
Format of the simple linear regression model
We write the simple linear regression model as
Yi = β0 + β1 Xi + ui
for i = 1,2,...,n
Yi, the value of the dependent variable in observation i,
has two components:
(1) the non-random component β0 + β1Xi, X being
described as the explanatory (or independent)
variable, and the fixed quantities β0 and β1 as the
parameters of the equation, and
(2) the disturbance term, ui.
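The two components can be illustrated with a small simulation. This is only a sketch: the parameter values (beta0 = 2, beta1 = 0.5), the four X values, and the choice of a normally distributed disturbance term are assumptions made for the example, not part of the model as stated above.

```python
import random

def simulate(x_values, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * X_i + u_i for each X_i.

    The disturbance u_i is drawn from N(0, sigma^2); normality is an
    assumption of this sketch only.
    """
    rng = random.Random(seed)
    rows = []
    for x in x_values:
        systematic = beta0 + beta1 * x   # non-random component
        u = rng.gauss(0.0, sigma)        # disturbance term
        rows.append((x, systematic, u, systematic + u))
    return rows

# Hypothetical parameter values, chosen only for illustration.
rows = simulate([1.0, 2.0, 3.0, 4.0], beta0=2.0, beta1=0.5, sigma=1.0)
for x, q, u, y in rows:
    print(f"X={x:.1f}  non-random part={q:.2f}  u={u:+.2f}  Y={y:.2f}")
```

Each printed row separates the value on the line from the disturbance that moves the observed Y away from it.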
Econometrics
Figure 2.1 illustrates how these two components
combine to determine Y.
X1, X2, X3, and X4 are four hypothetical values of the
explanatory variable. If the relationship between Y and X
were exact, the corresponding values of Y would be
represented by the points Q1 – Q4 on the line.
The disturbance term causes the actual values of Y to be
different. In the diagram, the disturbance term has been
assumed to be positive in the first and fourth
observations and negative in the other two, with the
result that, if one plots the actual values of Y against the
values of X, one obtains the points P1 – P4.
It must be emphasized that in practice the P points are all one can see of
Figure 2.1. The actual values of β0 and β1, and hence the location of the
Q points, are unknown, as are the values of the disturbance term in the
observations.
The task of regression analysis is to obtain estimates of β0 and β1, and
hence an estimate of the location of the line, given the P points.
Why does the disturbance term exist? There are several
reasons:
1. Omission of explanatory variables
2. Aggregation of variables
3. Model misspecification
4. Functional misspecification
5. Measurement error
The disturbance term is the collective outcome of all these
factors.
Least squares regression
Suppose that you are given the four observations on X
and Y represented in Figure 2.1 and you are asked to
obtain estimates of the values of β0 and β1. As a rough
approximation, you could do this by plotting the four P
points and drawing a line to fit them as best you can.
This has been done in Figure 2.2.
The intersection of the line with the Y-axis provides an
estimate of the intercept β0, which will be denoted $\hat{\beta}_0$,
and the slope provides an estimate of the slope
coefficient β1, which will be denoted $\hat{\beta}_1$.
The line, known as the fitted model, will be written
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
the caret mark over Y indicating that it is the fitted value
of Y corresponding to X, not the actual value.
In Figure 2.3, the fitted points are represented by the
points R1 – R4.
Drawing a regression line by eye is all very well, but it
leaves a lot to subjective judgment. Furthermore, as will
become obvious, it is not even possible when you have
a variable Y depending on two or more explanatory
variables instead of only one.
The question arises: is there a way of calculating good
estimates of β0 and β1 algebraically?
The first step is to define what is known as a residual for
each observation.
This is the difference between the actual value of Y in
any observation and the fitted value given by the
regression line, that is, the vertical distance between Pi
and Ri in observation i. It will be denoted $e_i$ (or $\hat{u}_i$):
$$e_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$
Obviously, we wish to fit the regression line, that is,
choose the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, in such a way as to make the
residuals as small as possible.
Equally obviously, a line that fits some observations well
will fit others badly and vice versa.
We need to devise a criterion of fit that takes account of
the size of all the residuals simultaneously.
One way of overcoming the problem is to minimize RSS,
the sum of the squares of the residuals.
$$RSS = \sum_i e_i^2 \quad \left(\text{or } \sum_i \hat{u}_i^2\right)$$
The smaller one can make RSS, the better is the fit,
according to this criterion. If one could reduce RSS to 0,
one would have a perfect fit, for this would imply that all
the residuals are equal to 0. The line would go through
all the points, but of course in general the disturbance
term makes this impossible.
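As a rough illustration of the criterion, the sketch below computes the residuals and RSS for three candidate lines. The four observations and the candidate intercept/slope pairs are hypothetical values invented for this example; line B happens to be the least squares line for these points.

```python
def residuals(xs, ys, b0, b1):
    """Residual e_i = Y_i - (b0 + b1 * X_i) for each observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def rss(xs, ys, b0, b1):
    """Sum of squared residuals for the line with intercept b0 and slope b1."""
    return sum(e * e for e in residuals(xs, ys, b0, b1))

# Hypothetical observations, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 4.0, 8.0]

# Hypothetical candidate lines as (intercept, slope) pairs.
candidates = {"line A": (2.0, 1.0), "line B": (1.5, 1.4), "line C": (0.0, 2.0)}

scores = {name: rss(xs, ys, b0, b1) for name, (b0, b1) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: RSS = {value:.2f}")
print("smallest RSS:", best)
```

Lines A and C each fit some points exactly yet have a larger RSS than line B, which is the sense in which the criterion takes all residuals into account at once.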
There are other quite reasonable solutions, but the least
squares criterion yields estimates of β0 and β1 that are
unbiased and the most efficient of their type, provided
that certain conditions are satisfied.
For this reason, the least squares technique is far and
away the most popular in uncomplicated applications of
regression analysis.
The form used here is usually referred to as Ordinary
Least Squares and abbreviated OLS.
Derivation of the Normal Equations
Least squares estimation chooses the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$
so as to minimise the sum of the squares of the
differences between the actual and fitted values of Y, i.e.
it chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise
$$RSS = \sum_i (Y_i - \hat{Y}_i)^2$$
where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$.
Substituting for $\hat{Y}_i$, we have
$$RSS = \sum_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$
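The necessary conditions for minimising RSS with respect
to $\hat{\beta}_0$ and $\hat{\beta}_1$ are
$$\frac{\partial RSS}{\partial \hat{\beta}_0} = 0, \qquad \frac{\partial RSS}{\partial \hat{\beta}_1} = 0$$
i.e.
$$\frac{\partial RSS}{\partial \hat{\beta}_0} = -2 \sum_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$
and
$$\frac{\partial RSS}{\partial \hat{\beta}_1} = -2 \sum_i X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$
After rearrangement this gives the Normal Equations
$$\sum_i Y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_i X_i$$
$$\sum_i X_i Y_i = \hat{\beta}_0 \sum_i X_i + \hat{\beta}_1 \sum_i X_i^2$$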
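The Normal Equations are two linear equations in two unknowns, so they can be solved directly. The sketch below does this with Cramer's rule and then checks the two first-order conditions numerically. The function name `solve_normal_equations` and the four observations are hypothetical, introduced only for this example.

```python
def solve_normal_equations(xs, ys):
    """Solve  n*b0 + (sum X)*b1 = sum Y  and  (sum X)*b0 + (sum X^2)*b1 = sum XY."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # determinant of the 2x2 coefficient matrix
    if det == 0:
        raise ValueError("X has no variation, so the slope cannot be determined")
    b0 = (sy * sxx - sx * sxy) / det  # Cramer's rule for the intercept
    b1 = (n * sxy - sx * sy) / det    # Cramer's rule for the slope
    return b0, b1

# Hypothetical observations, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 4.0, 8.0]

b0, b1 = solve_normal_equations(xs, ys)
e = [y - b0 - b1 * x for x, y in zip(xs, ys)]

sum_e = sum(e)                                  # first condition: should be ~0
sum_xe = sum(x * ei for x, ei in zip(xs, e))    # second condition: should be ~0
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {sum_e:.2e}, sum of X*residual = {sum_xe:.2e}")
```

The two printed sums are zero up to rounding error, which is exactly what the two first-order conditions require of the residuals.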
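These can now be solved for $\hat{\beta}_0$ and $\hat{\beta}_1$:
• $$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$
and
• $$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$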
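The step from the Normal Equations to these formulas can be sketched as follows. This is a worked outline using only the definitions $\bar{X} = \frac{1}{n}\sum_i X_i$ and $\bar{Y} = \frac{1}{n}\sum_i Y_i$; it is not taken from the slides themselves.

```latex
\begin{align*}
% Divide the first Normal Equation by n:
\bar{Y} &= \hat{\beta}_0 + \hat{\beta}_1 \bar{X}
  \quad\Longrightarrow\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \\
% Substitute into the second Normal Equation, using \sum_i X_i = n\bar{X}:
\sum_i X_i Y_i &= (\bar{Y} - \hat{\beta}_1 \bar{X})\, n \bar{X} + \hat{\beta}_1 \sum_i X_i^2 \\
\sum_i X_i Y_i - n \bar{X} \bar{Y} &= \hat{\beta}_1 \Bigl( \sum_i X_i^2 - n \bar{X}^2 \Bigr) \\
% The deviation form follows from expanding the products:
\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_i X_i Y_i - n \bar{X} \bar{Y}, \qquad
\sum_i (X_i - \bar{X})^2 = \sum_i X_i^2 - n \bar{X}^2
\end{align*}
```

Dividing the third line by the bracketed term gives the second expression for the slope, and the last line shows why it equals the deviation form.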
Example: Some (fictitious) sales-advertising data

Observation   Sales (Y)   Advertising (X)
     1           36            56.7
     2           48            63.9
     3           45            62.7
     4           40            59.7
     5           30            55.9
     6           56            68.7
     7           63            69.2
     8           53            65.5
     9           61            69.4
    10           68            73.4
    11           66            74.1
    12           65            74.4

NOTE: Both variables are measured in thousands of dollars
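The OLS formulas can be applied to these twelve observations directly. The sketch below uses the deviation form of the slope formula; the function name `ols` is a hypothetical helper introduced for this example.

```python
# Sales (Y) and advertising (X) from the table, in thousands of dollars.
sales = [36, 48, 45, 40, 30, 56, 63, 53, 61, 68, 66, 65]
advertising = [56.7, 63.9, 62.7, 59.7, 55.9, 68.7,
               69.2, 65.5, 69.4, 73.4, 74.1, 74.4]

def ols(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope: sum of cross-deviations over sum of squared deviations
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

b0, b1 = ols(advertising, sales)
print(f"intercept = {b0:.2f}, slope = {b1:.3f}")
```

The estimates come out at roughly -75 for the intercept and about 1.93 for the slope.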
The sales-advertising model : regression output
Are the coefficient estimates plausible?
The results here show an estimated intercept of -75
and a slope (X) coefficient of just under 2.
What do you think about these values?
Are they significantly different from zero?
How good is the fit?
Scatter diagram of sales vs ads with fitted regression line
Analysis of Variance (ANOVA) and Sums of Squares
As you can see from the ANOVA table (regression
output) we can decompose the Total Sum of Squares
(of the dependent variable Y around its mean) (TSS)
into two parts:
the Explained (or Regression) Sum of Squares (ESS)
and
the Residual Sum of Squares (RSS).
$$\underbrace{\sum_i (Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{RSS}$$
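The decomposition can be checked numerically on the sales-advertising data. This is a minimal sketch that refits the line with the OLS formulas and then computes the three sums of squares separately; the variable names are chosen for this example.

```python
# Sales (Y) and advertising (X) from the example, in thousands of dollars.
sales = [36, 48, 45, 40, 30, 56, 63, 53, 61, 68, 66, 65]
advertising = [56.7, 63.9, 62.7, 59.7, 55.9, 68.7,
               69.2, 65.5, 69.4, 73.4, 74.1, 74.4]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# OLS estimates from the closed-form formulas.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
      / sum((x - x_bar) ** 2 for x in advertising))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in advertising]

tss = sum((y - y_bar) ** 2 for y in sales)               # total variation of Y
ess = sum((f - y_bar) ** 2 for f in fitted)              # explained by the regression
rss = sum((y - f) ** 2 for y, f in zip(sales, fitted))   # left in the residuals

print(f"TSS = {tss:.2f}")
print(f"ESS = {ess:.2f}")
print(f"RSS = {rss:.2f}")
print(f"ESS + RSS = {ess + rss:.2f}")
```

The last two printed totals agree up to rounding error. The equality relies on the line being fitted by OLS with an intercept; for an arbitrary line it would not hold in general.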
Goodness of fit: R squared (the Coefficient of Determination)
We can now define the Coefficient of Determination or R squared
as the proportion of the Total Variation of the dependent variable
(around its mean) which can be explained by, or attributed to, the
regression.
$$R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
R squared is taken as a measure of the “goodness of fit” of the
regression.
$$0 \le R^2 \le 1$$
The closer R squared is to 1, the better the fit.
Alternative interpretation of R2
It should be intuitively obvious that, the better is the fit achieved by
the regression equation, the higher should be the correlation
coefficient for the actual and predicted values of Y.
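We will show that R2 is in fact equal to the square of this correlation
coefficient, which we will denote $r_{Y,\hat{Y}}$. Because the OLS residuals sum
to zero and are uncorrelated with the fitted values,
$\sum_i (Y_i - \bar{Y})(\hat{Y}_i - \bar{Y}) = \sum_i (\hat{Y}_i - \bar{Y})^2$, so
$$r_{Y,\hat{Y}} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sqrt{\sum_i (Y_i - \bar{Y})^2 \sum_i (\hat{Y}_i - \bar{Y})^2}} = \sqrt{\frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}} = \sqrt{R^2}$$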
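The equality between R squared and the squared correlation of actual and fitted values can be checked on the sales-advertising data. The sketch below computes both quantities independently; the variable names are chosen for this example.

```python
from math import sqrt

# Sales (Y) and advertising (X) from the example, in thousands of dollars.
sales = [36, 48, 45, 40, 30, 56, 63, 53, 61, 68, 66, 65]
advertising = [56.7, 63.9, 62.7, 59.7, 55.9, 68.7,
               69.2, 65.5, 69.4, 73.4, 74.1, 74.4]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# OLS fit from the closed-form formulas.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
      / sum((x - x_bar) ** 2 for x in advertising))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in advertising]
f_bar = sum(fitted) / n

# R squared as the explained share of the total sum of squares.
tss = sum((y - y_bar) ** 2 for y in sales)
ess = sum((f - y_bar) ** 2 for f in fitted)
r_squared = ess / tss

# Correlation between actual and fitted values, computed from its definition.
cross = sum((y - y_bar) * (f - f_bar) for y, f in zip(sales, fitted))
r = cross / sqrt(tss * sum((f - f_bar) ** 2 for f in fitted))

print(f"R squared              = {r_squared:.4f}")
print(f"squared correlation    = {r ** 2:.4f}")
```

Both printed values coincide, at a little under 0.97 for these data.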