Chapter 3
Statistical Estimation of The Regression Function
3.1 Statistical Estimation
If the population can be observed, there is no statistical problem: all the features of the population
are known. The problem of statistical inference arises when the available information consists of a
limited sample that is randomly drawn from the (possibly infinitely large) population and we want to
infer something about the population using the sample at hand. Statistical estimation is one aspect of
statistical inference1 - it concerns the estimation of population parameters such as the population mean
and variance and the coefficients of a linear regression function.

Footnote 1: Two other important inference problems are hypothesis testing and the prediction of random variables.
In Chapter 2 the population regression function of Y given X is defined as the conditional mean
function of Y given X, written as E(Y | X). An important reason for our interest in this functional
relationship is that it allows us to predict Y given values of X and to quantify the effect of a change in X on Y (measured by, say, the derivative with respect to X). Moreover, the conditional predictions of Y that are
produced by the regression function are optimal under specific but generally applicable conditions. This
chapter is concerned with the problem of estimating the population regression function using a sample
drawn from the population.
3.1.1 Parametric Versus Nonparametric Methods
Figure 2.7 and the related discussion illustrate how a sample can be used to estimate a
population regression. Since the population regression function of Y given X is the conditional mean of
Y given X, we simply computed a sequence of conditional means using the sample and plotted them.
Nothing in the procedure constrains the shape of the estimated regression. Indeed, the empirical
regression of Size given Price (the plot of the conditional means in Figure 2.7) wanders about quite irregularly (although as it does so it retains a key feature that we expect of the population regression of S given P, namely that its average slope is steeper than the major axis - the empirical regression starts off below the major axis and then climbs above it). The method used to estimate the empirical regression functions in Figure 2.7
can be described as nonparametric. While there is a huge literature on nonparametric estimation, this
book is concerned almost entirely with parametric models.
To illustrate the distinction between parametric and nonparametric methods, consider the
equation Y = a + bX. This equation has two parameters (or coefficients), a and b, and clearly the
relationship between Y and X is linear. By varying the values of a and b the line’s height and slope can
be changed, but the fundamental relationship is constrained to be linear. If a quadratic term (and one
more parameter) is added: Y = a + bX + cX², the relationship between Y and X becomes more flexible
than the linear function. Indeed, the quadratic form embraces the linear form as a special case (set c = 0).
But the linear form does not embrace the quadratic form: no values of a and b can make the linear
equation quadratic. Of course, the three parameter quadratic equation is also constrained. A quadratic
function can have a single maximum or a single minimum but not both. Quadratic functions are also
symmetric about some axis2. If further powers of X are added, each with its own parameter, the
relationship becomes increasingly flexible in terms of the shape it can take. But as long as the number of
parameters remains finite, the shape remains constrained to some degree. The nonparametric case is
paradoxically not the one with zero parameters but the limiting case as the number of parameters
increases without bound. As the number of terms in the polynomial tends to infinity, the functional
relationship becomes unconstrained - it can take any shape. As noted above, the method used to
construct the empirical regressions in Figure 2.7 did not constrain the shape to be linear, quadratic or any
other specific functional relationship. In that sense the method used in Chapter 2 to estimate the
population regression can be called nonparametric.
In the context of regression estimation, the great appeal of nonparametric methods is that they do
not impose a predetermined shape on the regression function - which seems like a good idea in the
absence of any information as to the shape of the population regression. However, there is a cost
associated with this flexibility and that concerns the sample size. To perform well3, the nonparametric
estimator generally requires a large sample (the empirical regressions in Figure 2.7 used a sample of
almost 5,000 observations). In contrast, parametric methods that estimate a limited number of parameters
can be applied when samples are relatively small. The following examples by-pass the statistical aspect
of the argument but nevertheless provide some intuition. If you know that Y is a linear function of X,
then two points (2 observations) are sufficient to locate the line (and to determine the two parameters.)
If you know the relationship is quadratic, just three points are sufficient to plot the unique quadratic
function that connects the three points and therefore three observations will identify the three parameters
of the quadratic equation. The relationship continues: in general n points will determine the n parameters of a polynomial of degree n - 1.

Footnote 2: The graph of Y = a + bX + cX² is symmetric about the line X = -b/(2c).

Footnote 3: The meaning of "performing well" will be discussed later in the chapter.
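A minimal numerical sketch of this counting argument (illustrative only; the data points below are hypothetical, not from the text): numpy recovers a polynomial's coefficients exactly when the number of points equals the number of parameters, so the fitted curve passes through every point with zero error.

```python
import numpy as np

# Two points determine the two parameters of a line; three points determine a quadratic.
x2, y2 = np.array([2.0, 5.0]), np.array([3.0, 9.0])              # 2 hypothetical points
x3, y3 = np.array([1.0, 2.0, 4.0]), np.array([2.0, 3.0, 11.0])   # 3 hypothetical points

line = np.polyfit(x2, y2, deg=1)   # coefficients of y = a + b*x (highest power first)
quad = np.polyfit(x3, y3, deg=2)   # coefficients of y = a + b*x + c*x**2

print("line coefficients     :", line)
print("quadratic coefficients:", quad)
print("line fits the points exactly     :", np.allclose(np.polyval(line, x2), y2))
print("quadratic fits the points exactly:", np.allclose(np.polyval(quad, x3), y3))
```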
3.2 Principles of Estimation
As discussed in Chapter 2, there are examples of bivariate distributions in which the population
regression functions are known to be linear. In the remainder of this chapter we will be concerned with
linear population regressions and the methods that can be used to estimate them. We begin with a
discussion of alternative approaches to statistical estimation - all of which are parametric.
3.2.1 The Method of Moments
The quantities

    E(X),  E(X²),  E(X³)

are referred to as the first, second and third uncentred moments of the random variable X. The centred moments are measured around the mean μ = E(X):

    E[(X - μ)²],  E[(X - μ)³]

The Method of Moments approach to estimating these quantities is to simply calculate their sample equivalents, all of which take the form of averages. Table 3.1 provides the details for the first two moments. Notice the parallels between the expressions for the population moments and their sample counterparts. First, the estimator uses the sample average (1/n)Σ instead of the expectation operator E. Both "take an average", one in the sample, the other in the population. Second, the estimator is a function of the observations Xi whereas the population moment is defined in terms of the random variable X.
Table 3.1

Population Moment (parameter)          Method of Moments Estimator
Mean:      μ = E(X)                    X̄ = (1/n) Σ Xi
Variance:  σ² = E[(X - μ)²]            σ̂² = (1/n) Σ (Xi - X̄)²
The justification for the Method of Moments approach to estimation is based on a Law of Large
Numbers4 which, loosely, states that as the sample size tends to infinity the probability that the sample
mean differs from the population mean tends to zero. In other words, the probability limit of the sample
mean is the population mean. In fact, the probability limit of any sample average is the expected value of
that quantity. In the following expressions plim refers to the probability limit:

    plim X̄ = E(X) = μ        plim (1/n) Σ (Xi - X̄)² = E[(X - μ)²] = σ²
Recall that the sample covariance is also a sample average, so this too is a consistent estimator of
the population covariance. An estimator whose probability limit is identical to the parameter it estimates
is said to be consistent. By the Law of Large Numbers, the Method of Moments (MM) estimator is a
consistent estimator. An important property of the probability limit is provided by the following
theorem:
Theorem 3.1: If θ̂ is a consistent estimator for the population parameter θ and f(·) is a continuous function, then plim f(θ̂) = f(plim θ̂) = f(θ).
Footnote 4: See the Appendix to this chapter for more details on the Law of Large Numbers and the notion of probability limit.
Theorem 3.1 implies, for example, that plim(X̄²) = (plim X̄)² = μ², whereas E(X̄²) ≠ μ² (in fact E(X̄²) = μ² + σ²/n in every finite sample).
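These limits can be illustrated by simulation. The following sketch (hypothetical numbers, not from the text) draws ever larger samples from a normal population with μ = 5 and shows the sample mean settling down at μ and, in line with Theorem 3.1, its square settling down at μ².

```python
import numpy as np

# Simulation sketch of the Law of Large Numbers and Theorem 3.1.
rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0

for n in (10, 1_000, 100_000):
    x = rng.normal(mu, sigma, size=n)
    xbar = x.mean()
    print(f"n={n:>6}  sample mean={xbar:.4f}  (sample mean)^2={xbar**2:.4f}  mu^2={mu**2:.1f}")
```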
Now let’s apply the MM estimator to the bivariate linear regression. Table 3.2 presents the
details; they are based on Theorem 2.1 of Chapter 2. That theorem states that for any linear population
regression E(Y | X) = α + βX, the slope and intercept are given by

    β = Cov(X, Y) / Var(X)        α = E(Y) - β E(X)
The MM estimator is simply the sample counterpart to the expression that defines the population
parameter of interest.
Table 3.2
The Method of Moments Estimator for The Bivariate Linear Regression: E(Y | X)

Population Parameters                          MM Estimator
Slope:      β = Cov(X, Y) / Var(X)             b = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²
Intercept:  α = E(Y) - β E(X)                  a = Ȳ - b X̄
Later in this chapter we will report the MM estimator for the linear regression of Price on Size
using the house-price data that were discussed in Chapter 2.
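A minimal computational sketch of the Table 3.2 formulas, applied to the small advertising/sales data set that appears later in Table 3.4 (the data are borrowed here purely for illustration):

```python
import numpy as np

# Method of Moments estimates of the slope and intercept (Table 3.2), computed
# from sample moments. Data: advertising and sales for eight stores (Table 3.4).
X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)

# Sample counterparts of Cov(X, Y) and Var(X); the 1/n factors cancel in the ratio.
sample_cov = np.mean((X - X.mean()) * (Y - Y.mean()))
sample_var = np.mean((X - X.mean()) ** 2)

b = sample_cov / sample_var      # MM estimate of the slope beta
a = Y.mean() - b * X.mean()      # MM estimate of the intercept alpha
print(f"MM slope b = {b:.2f}, MM intercept a = {a:.2f}")   # 10.00 and 800.00
```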
3.2.2 The Maximum Likelihood Estimator
An important difference between the Maximum Likelihood Estimator (MLE) and the MM
estimator discussed in the previous section is that the MLE demands that the specific distribution that
generated the data be identified. In the previous section we assumed that the population regression is
linear, but we did not specify or assume that the random variables X and Y are, for example, bivariate
normal. If it is known5 that X and Y are bivariate normal, then intuitively, it seems sensible to take this
into account when estimating the parameters of the regression function. An important property shared by
MM and ML is that both estimators are consistent.

Footnote 5: In practical situations it can rarely be known with certainty what distribution actually generated the data, but through various tests the statistician may be comfortable assuming that X and Y are, say, normally distributed.
In the context of MLE, the researcher is assumed to know the distribution from which the data
are drawn - as noted in Chapter 1, it may be helpful to think of this distribution as a “data generating
process” in the way that rolling a six-sided die generates data. The principle behind MLE is essentially
this: given the data, what are the values of the population parameters that make the observed sample the most likely? That is, what kind of population is likely to have generated this particular sample? Suppose there are two colleges - one specializes in sports, the other in music. Suppose the population mean height
of female students at these colleges is 1.70 metres (sports) and 1.63 metres (music). A random sample of
20 students is taken from one of the colleges and the sample mean height is 1.64 metres. From which
college was the sample drawn? It might never be known for certain, but the music college is more likely
than the sports college6 to generate a sample mean of 1.64. The ML principle identifies the music college
as the source of the sample.

Footnote 6: We are implicitly assuming the variance of height is the same at the two colleges.
Now consider a more formal example that illustrates how the ML principle is applied. Suppose
the object is to estimate the proportion of grade 10 students that smoke cigarettes. In the population of
grade 10 students the true proportion is π. A random sample of size n reveals that n1 smoke and n0 do not. The probability of observing n1 smokers and n0 non-smokers in a sample of n = n1 + n0 is given by the binomial distribution:

    Pr(n1 smokers) = k π^n1 (1 - π)^n0

where k is the binomial coefficient k = n! / (n1! n0!).
The MLE treats the sample as given (n and n1 are thought of as fixed) and asks what value of π makes the actual sample most likely (most probable in this case). Let π̂ be the MLE and π̃ be some other value. The MLE π̂ satisfies:

    k π̂^n1 (1 - π̂)^n0  ≥  k π̃^n1 (1 - π̃)^n0

The value of π̂ can be found using calculus: take the derivative with respect to π of the probability of observing the sample and set it to zero7. The solution is π̂ = n1/n, namely the proportion of smokers in the sample. The MLE of π is therefore perfectly intuitive: the proportion of smokers in the population is estimated by the proportion of smokers in the sample.
To apply the ML principle to the bivariate regression model it is necessary to specify the
distribution that generated the data, such as the bivariate normal. Equation [2.9] of Chapter 2 describes the regression of Y given X for the bivariate normal distribution. It is reproduced here:

    Y = α + βX + ε        [2.9]

where ε is a normally distributed random variable with a mean of zero and a variance σ². The normal density function is given by equation [1.13]. For the random variable ε, it has the form

    f(ε) = (1 / (σ√(2π))) exp(-ε² / (2σ²))

The sample consists of n observations (Xi, Yi), i = 1, 2, ..., n. The corresponding values of εi are not observable, but nevertheless the likelihood of observing the sample can be expressed as the product of the densities:

    L = f(ε1) f(ε2) ... f(εn)

Recall that exp(a)exp(b) = exp(a+b) - the product of exponentials is the exponential of the summed exponents. Apply this idea to the likelihood function and we get

    L = (2πσ²)^(-n/2) exp( -Σ εi² / (2σ²) )

The final step is to substitute for the unobserved ε's using [2.9]. This expresses the likelihood of the sample in terms of observable data:

    L(α, β, σ²) = (2πσ²)^(-n/2) exp( -Σ (Yi - α - βXi)² / (2σ²) )        [3.3]
Footnote 7: Treat the probability as a product, i.e., use the product rule of differentiation. It can be shown that the first order condition identifies a maximum - not a minimum.
In equation [3.3], X and Y represent the n observed values (Xi, Yi), i = 1, 2, ..., n. Note also that the likelihood function is seen as a function of the unknown parameters. The ML estimators are the parameter values (α̂, β̂, σ̂²) that maximize the likelihood function, treating X and Y as fixed. If (α̃, β̃, σ̃²) are any other parameter values then the MLE satisfies

    L(α̂, β̂, σ̂²) ≥ L(α̃, β̃, σ̃²)

As in the previous example, calculus can be used to determine the MLE. The details are omitted and we go straight to the solution. It turns out that in this case the MLE of α and β are identical to the MM estimators given in Table 3.2.
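This claim can be checked numerically. The sketch below (using the Table 3.4 advertising/sales data purely for illustration) evaluates the logarithm of the likelihood [3.3] at the MM/OLS values of α and β and at nearby values; under the normality assumption the MM/OLS solution gives the largest value.

```python
import numpy as np

# Evaluate the log of the likelihood [3.3] at the MM/OLS estimates and at perturbed values.
X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)
n = len(X)

def log_likelihood(alpha, beta, sigma2):
    e = Y - alpha - beta * X
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(e**2) / (2 * sigma2)

b = np.mean((X - X.mean()) * (Y - Y.mean())) / np.mean((X - X.mean())**2)
a = Y.mean() - b * X.mean()
s2 = np.mean((Y - a - b * X)**2)        # ML estimate of the error variance

print("at the MM/OLS solution:", log_likelihood(a, b, s2))
print("slope perturbed by +1 :", log_likelihood(a, b + 1.0, s2))      # lower
print("intercept perturbed   :", log_likelihood(a + 20.0, b, s2))     # lower
```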
3.2.3 Ordinary Least Squares
In Chapter 2 it was explained why the population regression function can be described as the
“least squares function.” The argument is briefly reviewed here. If the object is to find the best
predictor of the random variable Y, say Y*, such that Y* minimises the expected squared prediction error E[(Y - Y*)²], the solution is
Y* = E(Y), the population mean of Y. Further, if X is correlated with Y and the value of X is known
when the forecast of Y is made, then the solution to the optimal prediction problem is Y* = E(Y | X) i.e.,
the conditional mean of Y given X. This is none other than the population regression function of Y given
X. The regression function therefore minimises the expected (long run average) squared prediction error.
Consider now what this implies if the population regression function is linear. In such a case it can be written as

    E(Y | X) = α + βX

Let Y* = γ + δX be a representative linear equation. Consider the problem: determine the values of γ and δ that

    minimise E[(Y - Y*)² | X] = E[(Y - γ - δX)² | X]
Since we know the solution to this minimisation problem is the population regression function E(Y | X)
and since in this case it is linear, the solution values are γ = α and δ = β, where α and β are the specific
parameter values defined above. This analysis suggests that the least squares line drawn through a
sample scatter is a viable estimator of the linear population regression function.
Table 3.3 compares the properties of the linear population regression with the sample least
squares regression. The sample of n observations is represented by the points (Xi , Yi ) for i = 1,2,...,n .
Table 3.3
The Least Squares Method of Estimation

The Population
  Model: Y = α + βX + ε
  Linear Population Regression: E(Y | X) = α + βX
  α and β are the unique values that minimise: E[(Y - α - βX)²]
  Population Parameter Values: β = Cov(X, Y)/Var(X),  α = E(Y) - βE(X)

The Sample
  Model: Yi = a + bXi + ei  for i = 1, 2, ..., n
  Estimated Linear Regression: Ŷ = a + bX
  The Least Squares Principle: choose a and b such that Σ(Yi - a - bXi)² is minimised
  The Least Squares Solution Values: b = Σ(Xi - X̄)(Yi - Ȳ)/Σ(Xi - X̄)²,  a = Ȳ - bX̄
Table 3.3 emphasises that the population regression function is the least squares function. To estimate
the parameters of this function using a sample that has been drawn from the population we find the least
squares function within the available sample. Notice that to apply the least squares principle the
expectation operator E (which gives a population mean) is replaced by its sample equivalent (1/n)Σ (which gives a sample mean).

Before looking at the details of how the least squares solution is obtained, consider the numerical example in Table 3.4.
Table 3.4
Annual Advertising and Sales Data for Eight Stores
(Thousands of Dollars)

Store No.    Advertising Expenditures    Sales
1                      15                 1000
2                      10                  865
3                      12                  945
4                      18                  930
5                      20                  990
6                      28                 1105
7                      25                 1070
8                      32                 1095
The artificial data in Table 3.4 represent the sales for eight stores (the dependent or Y-axis
variable) together with each store’s advertising expenditure (the explanatory or X-axis variable.) The
data are plotted in Figure 3.1
along with the least squares regression line.
[Figure 3.1: The Least Squares Regression. Scatter plot of Sales (vertical axis) against Advertising Expenditure (horizontal axis) for the eight stores, with the fitted line Ŷ = a + bX drawn through the points. The point for store 4, (X = 18, Y = 930), lies below the line: its predicted value is Ŷ = 800 + 10(18) = 980, so its residual is e = -50.]
The least squares equation is written as Ŷ = a + bX. For each data point, the vertical distance from the point to the least squares line is referred to as the least squares residual, which is represented by the symbol e. For the ith data point (Xi, Yi), the least squares residual is

    ei = Yi - Ŷi = Yi - a - bXi

The least squares residual can also be described as a within-sample prediction error since it is the difference between the observed value of Y and the predicted value of Y, that is, the value predicted by the least squares regression equation.
The equation of the least squares regression in Figure 3.1 is Ŷ = 800 + 10X. In thousands of dollars, store number 4 spent 18 on advertising and had sales of 930. The L.S. regression line predicts sales of 800 + 10(18) = 980 (thousands of $). The prediction error is therefore e = 930 - 980 = -50 (thousands of $). Notice that all the data points below the L.S. regression line have negative residuals since they are over-predicted by the L.S. regression while all points above the line have positive residuals.
3.2.4 Solving the Least Squares Problem
The slope and the intercept of the L.S. regression line are chosen in such a way as to minimise
the sum of squared residuals, SSR. If the slope and intercept are changed, the residuals will obviously
change as well and so too will the sum of squared residuals, SSR. In short, SSR is a function of a and b
which can be written as follows:

    SSR(a, b) = Σ (Yi - a - bXi)²

The solution to the L.S. minimisation problem can be found by setting to zero the first derivatives of SSR(a, b) with respect to a and b. The pair of first order conditions provide two equations that determine the solution values of a and b. The two partial derivatives are shown in equations [3.4] and [3.5]:

    ∂SSR(a, b)/∂a = ∂/∂a Σ (Yi - a - bXi)² = 0        [3.4]

    ∂SSR(a, b)/∂b = ∂/∂b Σ (Yi - a - bXi)² = 0        [3.5]
Consider equation [3.4] first. Notice that the differential operator can pass through the summation sign because the derivative of a sum of items is the same as the sum of the derivatives of the individual items. Equation [3.4] can be written as:

    Σ ∂/∂a (Yi - a - bXi)² = 0

The derivative of the typical element (Yi - a - bXi)² can be evaluated in one of two ways. Either the quadratic term can be expanded and then differentiated, or the function of a function rule can be used. Using the function of a function rule we find that

    ∂/∂a (Yi - a - bXi)² = -2 (Yi - a - bXi)
Since the derivative is set to zero at the minimum point of the function SSR(a, b), the term -2 can be cancelled. The first order condition [3.4] can therefore be written as

    Σ (Yi - a - bXi) = 0        [3.6]

Equation [3.6] has an interesting interpretation that will be discussed later. The final step is to rewrite [3.6] in a more useable form. Recall that the sum of n numbers is always equal to n times the mean of the numbers:

    Σ Yi = n Ȳ    and    Σ Xi = n X̄

Also, recall that Σ a = na since a is the same for every observation. The final form of the first order condition [3.4] is

    Ȳ = a + b X̄        [3.7]
Now consider the second of the first order conditions, [3.5]. The derivative of the typical element (Yi - a - bXi)² with respect to the slope coefficient b is

    ∂/∂b (Yi - a - bXi)² = -2 Xi (Yi - a - bXi)

Equation [3.5] can therefore be written as

    -2 Σ Xi (Yi - a - bXi) = 0

Dividing through by minus two yields the following equation, the counterpart of equation [3.6]:

    Σ Xi (Yi - a - bXi) = 0        [3.8]
After some rearrangement, equation [3.8] implies that the least squares coefficients satisfy:

    Σ Xi Yi = a Σ Xi + b Σ Xi²        [3.9]

Equations [3.7] and [3.9] can be solved for the least squares coefficients. Equation [3.7] is used to solve for the intercept a:

    a = Ȳ - b X̄

Now substitute for a in equation [3.9] and solve for b:

    b = (Σ Xi Yi - n X̄ Ȳ) / (Σ Xi² - n X̄²)

Notice that the least squares equation for the slope coefficient b can be expressed in deviation form,

    b = Σ xi yi / Σ xi²

where xi = Xi - X̄ and yi = Yi - Ȳ.
A Numerical Example
Table 3.5 illustrates the calculation of the least squares coefficients for the advertising/sales data that are plotted in Figure 3.1. The first two columns present the original data on advertising expenditure and sales at the eight stores. The least squares slope coefficient b is calculated according to the formula derived above, which requires the computation of ΣXiYi and ΣXi² as well as the means of X and Y. The squared X values appear in the third column and the cross products between X and Y appear in the fourth column of Table 3.5. These sums and the means of X and Y are presented at the bottom of the appropriate columns. Finally, the least squares formulae are used to compute the intercept and slope of the least squares line for these data. These calculations show that the line drawn in Figure 3.1 is indeed the least squares line.
Table 3.5
Calculation of the Least Squares Coefficients

(Advertising)   (Sales)
     Xi           Yi        (Xi)²      XiYi
     15          1000        225       15000
     10           865        100        8650
     12           945        144       11340
     18           930        324       16740
     20           990        400       19800
     28          1105        784       30940
     25          1070        625       26750
     32          1095       1024       35040

(ΣXi)/n = X̄ = 20    (ΣYi)/n = Ȳ = 1000    Σ(Xi)² = 3626    ΣXiYi = 164260

b = (164260 - 8(20)(1000)) / (3626 - 8(20)²) = 4260/426 = 10
a = 1000 - 10(20) = 800
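The same arithmetic can be reproduced in a few lines of code; the sketch below recomputes the Table 3.5 sums and then applies the slope and intercept formulas derived above.

```python
import numpy as np

# Reproduce the Table 3.5 calculation of the least squares coefficients.
X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)              # advertising
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)  # sales
n = len(X)

sum_X2 = np.sum(X**2)            # 3626
sum_XY = np.sum(X * Y)           # 164260
Xbar, Ybar = X.mean(), Y.mean()  # 20 and 1000

b = (sum_XY - n * Xbar * Ybar) / (sum_X2 - n * Xbar**2)   # slope: 10
a = Ybar - b * Xbar                                       # intercept: 800
print(f"b = {b:.1f}, a = {a:.1f}")   # the line in Figure 3.1: Yhat = 800 + 10X
```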
3.2.5 Interpretation of the L.S. Regression Coefficients
The parameter β in Table 3.3 can be described as the slope of the population regression function E(Y | X), i.e., it is the derivative of E(Y | X) with respect to X. A more intuitive interpretation is that β represents the effect on the conditional mean of Y, E(Y | X), of a unit change in X. β therefore has units which are equal to the units of Y divided by the units of X. The L.S. estimator b estimates β and so it is
the estimated effect on E(Y | X) of a unit change in X.
Table 3.6 shows the L.S. regression of house price on a constant and house size. These data have
been described in Chapter 2. Recall that the data were collected over a six year period, 1983-87. The
variable “price” records the price at which the house sold and “size” is its size in square feet.
The coefficient on SIZE is $60.5 per square foot and it represents the estimated effect on market
price of an increase in SIZE of one square foot. More specifically, it is the effect on the conditional
mean price (conditional on size) of a unit increase in size. Consider the population mean price of all
houses that are exactly (a) 1500 square feet and (b) 1501 square feet. The difference in these conditional
means is estimated to be $60.50. Note that the relationship between the conditional mean price and size is linear so this estimate applies over the entire range of house sizes. However, it is best to think of the estimate as being particularly relevant at the sample mean size, since this is where the weight of the data is concentrated (the balance point of the size distribution). Also, since the data were collected over a period of 6 years when house prices were rising, it would be appropriate to think of the estimate of β as applying at a date "in the middle" of the sample period, say January 1985.
Table 3.6
The Least Squares Regression of Price on Size

Dependent variable: PRICE
Number of observations: 2515
Mean of dep. var. = 95248.7
Std. dev. of dep. var. = 43887.6

Variable      Estimated Coefficient
C                   15476.9
SIZE                60.5055
The intercept of the L.S. regression is $15,476.90. Note that the intercept has the same units as
the dependent or Y-axis variable which is PRICE in this case. In most L.S. regressions the intercept has
no meaningful interpretation. On the other hand it is usually important to include the intercept in the
equation otherwise the estimated linear relationship between Y and X will be forced through the origin
(0, 0) and this is rarely justified. It could be argued that in the current example, the predicted price of a
house of zero size refers to the price of an empty lot. However, since the sample did not include any
market transactions in which empty lots were bought and sold, it is unlikely that the value of $15,476.90
is a particularly good estimate of the market value of an empty lot in say January 1985. L.S. chooses a
slope and intercept to fit the data and the resulting linear equation is an approximation to the population
regression over the range of the available data. In this case the scatter plot is “a long way” from SIZE = 0.
What is meant by “a long way?” Table 3.7 shows that the minimum SIZE in the sample is 700 square
feet and the standard deviation of SIZE is 392 square feet so SIZE = 0 is 1.8 standard deviations below
the minimum size in the sample and 3.4 standard deviations below the sample mean of SIZE.
It is extremely important to bear in mind that the interpretation of a particular regression
coefficient depends crucially on the list of explanatory variables that is included in the regression. To illustrate this important point consider a model in which there are two continuous explanatory variables:

    E(Y | X1, X2) = α + β1 X1 + β2 X2

To make the example specific, you might think of X1 as house size and X2 as lot size. The coefficient β1 is the partial derivative of E(Y | X1, X2) with respect to X1. It is therefore the effect of a change in X1 on the conditional mean price while holding X2 constant. This "holding X2 constant" is a new condition that did not apply when X2 was not in the model. To make this point clear, compare the conditional mean of Y at two values, say X1 and X1 + 1. The change in the conditional mean is

    E(Y | X1 + 1, X2) - E(Y | X1, X2) = [α + β1(X1 + 1) + β2 X2] - [α + β1 X1 + β2 X2] = β1

The important point to note is that the terms β2 X2 cancel only if X2 takes the same value in the two conditional means. When we consider the coefficient β1 we are therefore comparing the mean price in two subpopulations of houses that have the same lot size but differ in house size by one square foot.
We now turn to a model of house prices in which there are several explanatory variables. The
definitions of these variables are given in Table 3.7.
Table 3.7
Variable Definitions

Symbol      Description & Units
PRICE       Transaction price, 1983-87 ($)
SIZE        House size (square feet)
LSIZE       Lot size (square feet)
AGE         Age of house at time of sale (years)
BATHP       Number of bathroom pieces
POOL        If pool exists, POOL = 1, otherwise POOL = 0
SGAR        If single-car garage, SGAR = 1, otherwise SGAR = 0
DGAR        If double-car garage, DGAR = 1, otherwise DGAR = 0
FP          If fireplace exists, FP = 1, otherwise FP = 0
BUSY_RD     BUSY_RD = 1 if on busy road, otherwise BUSY_RD = 0
T           Time of sale to nearest month. T = 1 if Jan. '83; T = 2 if Feb. '83 etc.
Table 3.8 reports summary statistics for the variables described on Table 3.7 and the L.S.
regression of PRICE on ten explanatory variables plus a constant term, which allows for the intercept.
The coefficient on SIZE is $41.22 per square foot, which is just 2/3 the value of the corresponding coefficient in Table 3.6. The reason for this substantial difference is that the regression in Table 3.6
conditions only on SIZE. But in Table 3.8 PRICE is conditioned on a much longer list of variables.
From Table 3.8 we infer that a one square foot increase in size increases the mean price of houses by
$41.22 while holding constant the lot size, the age of the house, the number of bathroom pieces and so
on. If you were to walk round your neighbourhood, you would probably find that bigger houses are likely
to be on bigger lots, have more bathroom pieces and perhaps have a double rather than a single garage.
This is reflected in the larger L.S. regression coefficient on SIZE in Table 3.6 compared to that in Table
3.8.
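The following simulation sketch uses entirely artificial data (not the book's housing sample; the numbers are hypothetical) to illustrate the mechanism: when lot size is omitted, the simple-regression coefficient on house size also picks up part of the effect of the positively correlated lot size, so it exceeds the multiple-regression coefficient.

```python
import numpy as np

# Hypothetical housing data in which lot size is positively correlated with house size.
rng = np.random.default_rng(1)
n = 2000
size = rng.normal(1300, 400, n)                    # house size (sq ft)
lsize = 2000 + 3 * size + rng.normal(0, 1500, n)   # lot size, correlated with house size
price = 20000 + 40 * size + 2 * lsize + rng.normal(0, 20000, n)

# Simple regression of price on size (plus a constant)
X1 = np.column_stack([np.ones(n), size])
coef_simple, *_ = np.linalg.lstsq(X1, price, rcond=None)

# Multiple regression of price on size and lot size
X2 = np.column_stack([np.ones(n), size, lsize])
coef_multi, *_ = np.linalg.lstsq(X2, price, rcond=None)

print("size coefficient, simple regression  :", round(coef_simple[1], 2))  # near 40 + 2*3 = 46
print("size coefficient, multiple regression:", round(coef_multi[1], 2))   # near 40
```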
Now let’s turn to a few other coefficients in Table 3.8. The coefficient on LSIZE is positive,
which confirms that larger lots add to the market value of houses. On the other hand, older houses sell for less. The coefficient on AGE suggests that for every additional 10 years since construction, house prices
fall by $1,891. Being on a busy road reduces the expected price by $3,215 while a fireplace is estimated
to add $6,672 to market value. The coefficient on T is $1,397 which provides an estimate of how quickly
prices were rising over the sample period, 1983-87. T records the month in which the transaction took
place so an increase in T of 1 means one month has passed. The data suggest that house prices rose
$1,397 per month over this six year period. Note that the model is linear in T so it estimates an average
monthly increase - a linear time trend in prices. The model as it stands cannot reveal if prices rose slowly
at first and then accelerated or rose quickly and then slowed. We would need a more sophisticated
model to track anything other than a linear price path.
Table 3.8

Summary Statistics (Number of Observations: 2515)

Variable      Mean           Std Dev        Min        Max
PRICE         95248.74831    43887.55213    22500.0    345000.0
SIZE          1318.42346     391.67924      700.0      3573.0
LSIZE         6058.16581     3711.30361     1024.0     95928.0
AGE           32.34314       30.36091       0.0        150.0
POOL          0.043738       0.20455        0.0        1.0
BATHP         5.88867        2.01060        3.0        16.0
FP            0.38449        0.48657        0.0        1.0
SGAR          0.29463        0.45597        0.0        1.0
DGAR          0.11412        0.31801        0.0        1.0
BUSY_RD       0.11451        0.31850        0.0        1.0
T             37.00915       21.98579       1.0        72.0

Least Squares Regression

Dependent variable: PRICE
Number of observations: 2515
Mean of dep. var. = 95248.7
Std. dev. of dep. var. = 43887.6

Variable      Estimated Coefficient
C             -25768.0
SIZE          41.2247
LSIZE         0.789266
AGE           -189.109
POOL          6284.14
BATHP         1907.28
FP            6672.45
SGAR          3816.29
DGAR          12858.7
BUSY_RD       -3215.07
T             1397.08

3.2.6 Effects of Rescaling Data on The Least Squares Coefficients
In the previous section it was argued that a complete discussion of the least squares coefficients must include
the units in which the variables are measured. This section presents two rules that show precisely how
rescaling the data affects the least squares intercept and slope coefficient. The dependent and
independent variables are quantities that have two parts: one component is the numerical part that is
perhaps stored in a data file or on paper. The other component is the unit of measurement. Consider a
small town with a population of 25,000 people. Clearly the "population" has two parts, a pure number
(25,000) and the units (a person). In symbols: Quantity = (number) times (units). The same quantity can
be expressed in different ways, for example we may prefer to reduce the number of digits we write down
by recording the population in thousands of people: now Quantity = (25) times (thousands of people).
Notice that this rescaling can be expressed as Quantity = (number/1000) times (1000 × units). The number
component is divided by 1000 and the units component is multiplied by 1000 (the units are transformed
from people to thousands of people).
In the equation Y = a + bX, X and Y refer only to the number components of the relevant
quantities, which is why it is so important to be aware of the units of measure. First consider rescaling
the number component X by a scale factor mx; define the result to be X* = mxX. Although X* and X are
different numbers they represent the same quantity. Quantity = X times (units of X) = (mxX) times (units
of X divided by mx) = X* times (units of X*). The units of X* are the units of X divided by mx.
Replacing X with X* in the equation Y = a + bX will result in a new slope coefficient b*. Notice
that Y is simply a number and it will not change as a result of this substitution so the new right hand side
(a + b*X*) must give the same result. The intercept a remains unchanged and the product of the slope
and the X-axis variable, b*X*, is the same as before, i.e., b*X* = bX. The previous equation implies that
the new slope coefficient is b* = b(X/X*) = b/mx. The effect of rescaling the X-axis data is summarized
in the following rule.
Rescaling Rule #1
If the X-axis data are rescaled by a multiplicative factor mx, the least squares intercept is
unchanged but the least squares slope is divided by mx.
This rule is illustrated by the following example. Suppose that the advertising data had been
recorded in dollars instead of thousands of dollars but sales continue to be recorded in thousands of
dollars. For example, store #1 spent $15,000 on advertising so instead of recording 15, suppose 15,000
appeared in Table 3.4. The slope of the least squares line can be recomputed using the method presented
in Table 3.5 or a computer program such as TSP could be used. The result will be Ŷ = 800 + 0.01X*,
where X* is advertising measured in dollars. The new slope coefficient of 0.01 still represents the effect
of a one unit increase in advertising expenditures on sales. A one dollar increase in advertising leads to a
sales increase of (0.01) × (units of sales) = 0.01 × $1,000 = $10. The basic conclusion remains intact and is entirely independent of the units that the data are measured in.8

Footnote 8: Notice also that the rescaling of X into X* has no effect on the predicted value of sales for store #1. When advertising is measured in thousands of dollars, the predicted value of sales is 800 + 10(15) = 950, which represents $950,000. When advertising is measured in dollars, the predicted value of sales is 800 + 0.01(15,000) = 950, which also represents $950,000.
The effects of rescaling Y by a multiplicative factor my can be worked out in a similar way.
When Y is multiplied by my we obtain Y* = myY. Using Y* to compute the least squares line instead of Y
we multiply the original least squares equation by my: Y* = myY = (mya) + (myb)X. In this case, both the
intercept and the slope coefficient are multiplied by my.
Rescaling Rule #2
If the Y-axis data are rescaled by a multiplicative factor my, both the least squares
intercept and slope coefficient are multiplied by my.
To illustrate this rule suppose that the sales data are measured in dollars while advertising figures
continue to be measured in thousands of dollars. This change would cause all the numbers in the last row
of Table 3.4 and all the numbers in the second column of Table 3.5 to be multiplied by my = 1000. If you
work through the calculations in Table 3.5 using the new numbers you will find that the new intercept
coefficient is 800,000, i.e., the previous intercept is multiplied by 1000. Also, the new slope coefficient is 10,000 - it too is increased by a factor of 1000. The new least squares equation is Ŷ = 800,000 + 10,000X. Again, the rescaling does not make any substantive change to the interpretation of the fitted line. A one unit increase in advertising expenditures ($1,000) raises sales by 10,000 times (units of Y), which amounts to a $10,000 increase in sales since the units of Y are simply dollars. Also, the predicted sales for store #1 are $800,000 + (10,000)(15) = $950,000, just as before.
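Both rescaling rules can be verified directly on the advertising/sales data; the sketch below (illustrative only) refits the least squares line after rescaling X and then Y by a factor of 1000.

```python
import numpy as np

def ols(X, Y):
    # Least squares slope and intercept for Y = a + bX
    b = np.mean((X - X.mean()) * (Y - Y.mean())) / np.mean((X - X.mean())**2)
    return Y.mean() - b * X.mean(), b

X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)              # thousands of $
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)  # thousands of $

a, b = ols(X, Y)
print("original units      :", a, b)     # 800.0, 10.0

# Rule #1: advertising in dollars (mx = 1000) -> slope divided by 1000, intercept unchanged
a1, b1 = ols(X * 1000, Y)
print("X rescaled (dollars):", a1, b1)   # 800.0, 0.01

# Rule #2: sales in dollars (my = 1000) -> both coefficients multiplied by 1000
a2, b2 = ols(X, Y * 1000)
print("Y rescaled (dollars):", a2, b2)   # 800000.0, 10000.0
```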
3.2.7 Some Important Properties of the L.S. Regression
The least squares fit has a number of important properties that can be derived from the first order conditions [3.4] and [3.5]. In this section the following properties will be demonstrated.
Least Squares Property #1
If the least squares line includes an intercept term (the line is not forced through the origin) then the sum and mean of the least squares residuals is zero, i.e. Σ ei = 0 and ē = 0.

Least Squares Property #2
The sum of the cross products between the explanatory variable X and the least squares errors is zero, i.e., Σ Xi ei = 0. When an intercept is included in the least squares equation, this means that Cov(X, e) = 0 and Corr(X, e) = 0.

Least Squares Property #3
The sum of the cross products between the least squares errors and the predicted values of the dependent variable, Ŷi, is zero, i.e., Σ Ŷi ei = 0. When an intercept is included in the least squares equation this means that Cov(Ŷ, e) = 0.
Property #1 is based on equation [3.6] which was derived from the partial derivative of the sum of squared errors (first order condition [3.4]). It is reproduced here for convenience:

    Σ (Yi - a - bXi) = 0        [3.6]

Recall that the least squares errors were defined above to be ei = Yi - a - bXi, so equation [3.6] implies that the sum of the least squares errors is zero, that is Σ ei = 0. Clearly, if the sum of the least squares errors is zero, then the average least squares error is zero as well. Another way to think of this property of the least squares fit is that the least squares line passes through the mean point of the data, (X̄, Ȳ). The mean of X in the advertising/sales example is 20 and when this is substituted into the equation of the least squares line, the result is
    Ŷ = 800 + 10(20) = 1000

In other words, when the mean value of X is substituted into the
equation of the least squares line, the result is the mean value of Y. This is not an accident due to the
numbers we have chosen, it is a property of least squares that holds in every case and is directly related to
the fact that the sample mean least squares error is zero. However, it is important to note that these
conclusions are derived from the partial derivative of the sum of squared errors with respect to the
intercept parameter a. This presupposes that least squares is free to determine the intercept parameter. If
the intercept is not included (effectively, fixed at zero), then the least squares errors will generally not
sum to zero and the least squares line will not pass through the sample mean.
The second of the first order conditions, [3.5], is the basis of L.S. Property #2, which says that
the least squares errors are uncorrelated with the explanatory variable X. The partial derivative of the
sum of squared errors with respect to the slope coefficient b takes the form of equation [3.8]:

    Σ Xi (Yi - a - bXi) = 0        [3.8]

It has just been pointed out that the term in parentheses is the least squares error, so [3.8] can be written as

    Σ Xi ei = 0        [3.10]

Recall that the sample covariance between two variables Z and W is

    Cov(Z, W) = (1/n) Σ (Zi - Z̄)(Wi - W̄)
Clearly, if either (or both) of the means of Z and W is zero, then the covariance formula simplifies to

    Cov(Z, W) = (1/n) Σ Zi Wi

It has already been shown that the mean least squares error is zero, so it follows from equation [3.10] that the covariance of the least
squares errors and the explanatory variable X is zero, i.e.,

    Cov(X, e) = 0        [3.11]

Since the numerator of the
correlation coefficient between two variables is the covariance between these same variables, it also
follows that e and X are uncorrelated. Let's consider the intuition behind this property of least squares.
The basic problem that least squares is trying to solve is to find the particular equation Ŷ = a + bX that best explains the variable Y. The value of Y is broken down into two parts, Y = Ŷ + e. The first component, Ŷ, is the part of Y that is explained by X - the fitted line translates changes in X into changes in predicted values of Y. The second component, e, is the error term and this is the part of Y that cannot
be explained by X. But what does it mean to say that X cannot "explain" e? Suppose that X and e are
positively correlated so that Cov(X, e) > 0. A scatter plot of X and e would reveal that whenever X is
above its average value, e tends to be above its average value as well and when X is below average e
tends to be below average. But if this is true, then increases in X would be associated with increases in e.
In other words, changes in X would "explain" changes in e. This situation is clearly not consistent with
the idea that the error e represents the part of Y that cannot be explained by X. To say that X cannot
explain e is the same thing as saying X and e are uncorrelated and this is precisely what equation [3.11]
means9.

Footnote 9: In fact, if Z and W are uncorrelated we can say only that W cannot be explained by linear equations in Z (and vice versa). As shown in Chapter 2, it is possible to find examples in which Z and W are uncorrelated yet functionally related in a nonlinear way.
The calculations in Table 3.9 illustrate the two important properties of least squares that have
been discussed in this section. The first two columns of Table 3.9 present the original advertising and
sales data. The predicted values of Y corresponding to each level of advertising expenditures are in the
third column. These predicted sales levels all lie on the least squares line. The fourth column presents
the differences between actual and predicted sales, i.e. the least squares errors ei. Notice that the sum of the least squares errors is zero, i.e. Σ ei = 0. To demonstrate that the explanatory variable X is uncorrelated with the least squares errors, the fifth column presents the products eiXi. Summing all the numbers in the fifth column shows that Σ eiXi = 0. Since the mean error is zero, this implies that Cov(X,
e) = 0, which in turn means that the correlation coefficient between X and e is also zero.
Finally, consider L.S. Property #3 which says that the predicted values of the dependent variable
are uncorrelated with the least squares errors. A numerical illustration is given in Table 3.9. The
products Ŷi ei are obtained by multiplying together the elements in columns three and four. The sum of these products, Σ Ŷi ei, is

    (950)(50) + (900)(-35) + .... + (1120)(-25) = 0

The general result can be shown algebraically as follows:

    Σ Ŷi ei = Σ (a + bXi) ei = a Σ ei + b Σ Xi ei = 0

Notice that the two unsubscripted constants, a and b, can be factored to the front of the summation signs. Also, the two sums on the right hand side are both zero as direct results of L.S. Properties #1 and #2. Since the mean error is zero, the result Σ Ŷi ei = 0 implies that Cov(Ŷ, e) = 0.
Table 3.9
Some Properties of the Least Squares Fit

(Advertising)  (Sales)
     Xi          Yi        Ŷi        ei       eiXi
     15         1000       950       50        750
     10          865       900      -35       -350
     12          945       920       25        300
     18          930       980      -50       -900
     20          990      1000      -10       -200
     28         1105      1080       25        700
     25         1070      1050       20        500
     32         1095      1120      -25       -800

X̄ = 20    Ȳ = 1000    (ΣŶi)/n = 1000    Σ ei = 0    Σ eiXi = 0
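The checks in Table 3.9 are easy to reproduce; the sketch below recomputes the residuals from the fitted line Ŷ = 800 + 10X and verifies Properties #1 to #3.

```python
import numpy as np

# Verify L.S. Properties #1-#3 for the fitted line Yhat = 800 + 10X (Table 3.9 data).
X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)

Y_hat = 800 + 10 * X          # least squares predicted values
e = Y - Y_hat                 # least squares residuals

print("sum of residuals   :", e.sum())             # 0  (Property #1)
print("sum of X_i * e_i   :", (X * e).sum())       # 0  (Property #2)
print("sum of Yhat_i * e_i:", (Y_hat * e).sum())   # 0  (Property #3)
print("Cov(X, e)          :", np.mean((X - X.mean()) * e))          # 0, since mean(e) = 0
print("Cov(Yhat, e)       :", np.mean((Y_hat - Y_hat.mean()) * e))  # 0
```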
To better understand why least squares predicted values are uncorrelated with the least squares
errors, consider the advertising/sales example. Suppose that as the Vice President's research assistant you have calculated a linear relationship between Y and X that produces predicted values Ŷ that are positively correlated with the errors, i.e. Cov(Ŷ, e) > 0. The VP of Sales is likely to point out that your predictions seem to have a systematic error. Stores with high advertising expenditures have high predicted sales and since Cov(Ŷ, e) > 0, these types of stores tend to have positive errors (sales are under predicted since actual sales lie above the fitted line). Also, stores with low advertising budgets and lower than average sales tend to have below average (negative) errors, that is, sales are over predicted. Since there is a systematic relationship between the prediction errors and the level of sales, the VP will argue that when you present a sales prediction for a store that has above average advertising expenditures, she should raise your sales prediction because she knows you systematically under predict sales in such cases. However, if you
present the VP of Sales with the least squares equation, you can be confident that Cov(Ŷ, e) = 0. The
least squares predicted sales figures have errors that exhibit no systematic pattern that could be used to
improve the forecast10.
Finally, it should be pointed out that L.S. Property #3 actually follows from L.S. Properties #1 and #2:

    Cov(e, Ŷ) = Cov(e, a + bX) = Cov(e, a) + b Cov(e, X)

Since a is a constant, Cov(e, a) = 0, and L.S. Property #2 states that Cov(e, X) = 0.
3.2.8 Measuring Goodness of Fit
By definition, the least squares equation provides the best fitting line through the data. But how good is the best fitting line at explaining the observed variations in sales from store to store? One way to judge how well least squares has done is to compute a statistic known as R-squared. Essentially, R-squared quantifies how useful the information on advertising is for explaining (or predicting) store sales.11
The fundamental problem is to explain the variation in the dependent variable Y. The total
variation in Y is referred to as the Total Sum of Squares, TSS, and is measured by

    TSS = Σ (Yi - Ȳ)²
Notice that TSS is closely related to the concept of sample variance of Y, which is TSS/n. Recall that the
variance is the average value of the squared deviations of Y around its mean. TSS is the total of the
squared deviations of Y around its mean. Whereas the variance does not depend on the size of the
sample, clearly TSS will tend to increase with the number of observations.
Footnote 10: The fitted values of Y have been referred to as predicted values of Y, but it would be better to say they are "within sample" predicted values because the actual values of Y are known to the researcher and indeed have been used to compute the "predicted" values of Y. In a real forecasting situation the forecaster does not know what the actual value of Y will be. Such forecasts go beyond the current sample and are referred to as "out of sample" predictions or forecasts.

Footnote 11: One should keep in mind that it often seems straightforward to explain the past but not as easy to predict the future. R-squared measures how well one can explain the available data, but it is not a guaranteed guide to the future predictive performance of the least squares fit.
An important feature of the least squares fit is that the Total Sum of Squares can be decomposed into two parts: the Regression Sum of Squares, RSS, and the Sum of Squared Residuals, SSR. The explained part of Y is the predicted value Ŷi = a + bXi, so RSS = Σ(Ŷi - Ȳ)². The unexplained part of Y is the least squares residual, e, so SSR = Σ(ei)². The decomposition property of least squares can be stated as TSS = RSS + SSR. Algebraically, the decomposition formula is:

    Σ (Yi - Ȳ)² = Σ (Ŷi - Ȳ)² + Σ ei²        [3.12]

Proof

To prove this important decomposition, begin with the left hand side and substitute Yi = Ŷi + ei:

    Σ (Yi - Ȳ)² = Σ (Ŷi + ei - Ȳ)²

Now open up the square brackets treating ei and (Ŷi - Ȳ) as two separate terms:

    Σ (Ŷi + ei - Ȳ)² = Σ ei² + Σ (Ŷi - Ȳ)² + 2 Σ ei (Ŷi - Ȳ)

The first two terms on the right hand side are SSR and RSS respectively, so to complete the proof it is necessary to show that the last sum is zero:

    Σ ei (Ŷi - Ȳ) = Σ ei Ŷi - Ȳ Σ ei = 0

Notice that on the right hand side, the first sum is zero by L.S. Property #3 and the second sum is zero by L.S. Property #1. (Notice that Ȳ can be brought through the summation sign because it is an unsubscripted constant.) This completes the proof of the decomposition [3.12],
that is:
TSS = SSR + RSS
This decomposition of the total sum of squares provides the foundation for the goodness of fit
measure known as R-squared, or R². Divide through by TSS and obtain

    1 = RSS/TSS + SSR/TSS

which shows that the proportion of the total sum of squares that is explained by the regression (RSS/TSS) plus the proportion that remains unexplained (SSR/TSS) add up to one. R-squared is defined as the proportion of the total sum of squares that is explained, that is,

    R² = RSS/TSS = 1 - SSR/TSS
Interpreting R-squared
First, it is straightforward to show that the goodness of fit measure R² always lies between 0 and 1. Since TSS, RSS and SSR are all sums of squared items, it follows that none of these sums can be negative. Since TSS = RSS + SSR, it also follows that RSS cannot exceed TSS, and hence 0 ≤ R² = RSS/TSS ≤ 1.
To better understand what R² measures, rewrite the decomposition of the total sum of squares in terms of variances by dividing equation [3.12] throughout by n, the number of observations. The result is

    Var(Y) = Var(Ŷ) + Var(e)

This result could also have been found by using the variance of a sum rule (see Chapter 2):

    Var(Y) = Var(Ŷ + e) = Var(Ŷ) + Var(e) + 2 Cov(Ŷ, e)

But L.S. Property #3 says that Cov(Ŷ, e) = 0, which implies the variance of the dependent variable is the sum of two variances. The first of these is the variance of the explained component of Y, Var(Ŷ), and the second is the variance of the least squares residuals - the unexplained component of Y. R-squared can be expressed in terms of these variances:

    R² = Var(Ŷ) / Var(Y)        [3.15]

This demonstrates that R-squared measures the proportion of the unconditional variance of Y that can be explained by the least squares fit.
An interesting observation that can be drawn from equations [3.12] and [3.15] is that the least squares coefficients maximize R-squared: since the total sum of squares is fixed by the data and R² = 1 - SSR/TSS, the line that, by definition, minimizes the sum of squared residuals is also the line that maximizes R-squared.
Figure 3.2 illustrates the decomposition of the variance. Since the concept of “variance” is not
easily represented graphically, the range is used to approximate the variance. The L.S. regression line
translates the range of X, R(X), into the range of Ŷ, R(Ŷ). That is, the minimum value of X in the sample predicts the smallest value of Ŷ and similarly the maximum value of X predicts the maximum value of Ŷ in the sample. Notice that since Ŷ lies on the regression line, the range of Ŷ is not as large as the
range of the observed values of Y that are dispersed above and below the regression line. This illustrates
the point that in all samples Var(Ŷ) ≤ Var(Y).
What does it mean to say that “X explains Y “? Suppose Y is the market price of a house and X
is the house size in square feet. In the housing market, prices vary from house to house and this
variability can be measured by the unconditional variance of prices. It is this variance that the model
seeks to explain. A regression of price on size yields least squares coefficients and a set of predicted
prices that all lie on the fitted regression line. If size “explains” price then the regression equation should
predict a wide range of prices for different sizes. Thus if the variance of the predicted prices is large and
close to the variance of observed prices, then the regression equation explains a large portion of the
variance of prices. In Figure 3.2, a steep regression line contributes to a high R-squared. A relatively flat
regression line is associated with a low R-squared. Notice that in the extreme case that the regression line is horizontal (the least squares coefficient on X is precisely zero), R-squared is zero.
Figure 3.2 can also explain why R-squared is essentially unaffected by the sample size. Note that
the sample size can be increased without affecting the unconditional variance of Y, the variance of the
predicted value of Y or the variance of X. Figure 3.2 remains unchanged except that more and more data
are packed into the parallelogram around the regression line. The quantity or density of points in this
parallelogram has no bearing on R-squared - what matters is the relationship between the variances. In
short, simply increasing the sample size will not help to increase the proportion of the variation in Y that
can be explained by X.
Finally, the fact that the name R-squared has the term "squared" in it raises the question of what R = √(R²) represents. It turns out that R-squared is the square of the correlation coefficient between Y and Ŷ, so R = Corr(Y, Ŷ). It makes intuitive sense that the closer the fitted values Ŷ are to Y, the higher will be the R-squared statistic. The proof of this is straightforward:

    Corr(Y, Ŷ) = Cov(Y, Ŷ) / √( Var(Y) Var(Ŷ) )        [3.16]
The numerator simplifies to:

    Cov(Y, Ŷ) = Cov(Ŷ + e, Ŷ) = Var(Ŷ) + Cov(e, Ŷ) = Var(Ŷ)

To obtain the previous line we have used the fact that the covariance of a variable with itself is its variance and that Cov(e, Ŷ) = 0 by L.S. Property #3. Substituting this into [3.16], we find that

    Corr(Y, Ŷ) = Var(Ŷ) / √( Var(Y) Var(Ŷ) ) = √( Var(Ŷ) / Var(Y) ) = √(R²)
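The sketch below puts the pieces of this section together for the advertising/sales fit: it computes TSS, RSS and SSR, confirms the decomposition, and checks that R² equals both Var(Ŷ)/Var(Y) and the squared correlation between Y and Ŷ.

```python
import numpy as np

# The decomposition TSS = RSS + SSR and R-squared for the advertising/sales fit.
X = np.array([15, 10, 12, 18, 20, 28, 25, 32], dtype=float)
Y = np.array([1000, 865, 945, 930, 990, 1105, 1070, 1095], dtype=float)
Y_hat = 800 + 10 * X
e = Y - Y_hat

TSS = np.sum((Y - Y.mean())**2)        # total sum of squares       (51200)
RSS = np.sum((Y_hat - Y.mean())**2)    # regression sum of squares  (42600)
SSR = np.sum(e**2)                     # sum of squared residuals   (8600)

R2 = RSS / TSS
print("TSS = RSS + SSR ?  ", np.isclose(TSS, RSS + SSR))
print("R-squared          :", round(R2, 3))                        # about 0.832
print("Var(Yhat)/Var(Y)   :", round(Y_hat.var() / Y.var(), 3))     # the same number
print("[Corr(Y, Yhat)]**2 :", round(np.corrcoef(Y, Y_hat)[0, 1]**2, 3))
```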
[Figure 3.2: An Illustration of R-Squared.]