LINEAR REGRESSION

Decisions in business and other areas are often based on predictions of what might
happen in the future. As one's ability to predict the future improves, decisions can be made
whose outcomes are more favorable. The most powerful approach to predicting the future
is to establish quantitative relationships between what is known and what is to be predicted.
Regression and correlation are interrelated statistical techniques that allow decision makers
not only to establish quantitative relationships among such variables but also to measure the
"strength" of those relationships.
In regression analysis an estimating mathematical equation is developed that relates
a known quantity to an unknown variable of interest. Examples include the relationship
between advertising expenditure and level of demand, volume of production and material
cost, or smoking habits and the incidence of heart disease. In all of these examples the relationship
to be established is statistical (stochastic) in nature. This means that we do not pretend
that the level of demand depends exclusively and deterministically on the advertising
budget; rather, we hypothesize that, among other factors, the advertising budget affects the level of
demand. Thus knowing the advertising budget does not allow us to predict sales without
any error, but simply affords a more accurate prediction than would be possible without that
knowledge. This is an important difference from scientific laws, where knowledge of
certain variables allows scientists to make very accurate predictions, as in the case of the
speed of a train determining without error the time it takes to traverse a 100-mile track.
Regression and correlation analyses are based on the relationship or association
between two (or more) variables. The known variable(s) is (are) called the independent
(explanatory) variable(s), while the variable to be predicted is the dependent (or response)
variable. In the example of advertising versus sales volume, the advertising budget is the
independent variable while the sales volume is the dependent (response) variable. In
regression analysis we can have only one dependent variable but can use more than one
independent variable to predict it. If we have only one independent
variable the regression model is called a simple regression model, whereas if we have more
than one independent variable we have a multiple regression model. In what follows we
will first develop the simple regression model and then extend it to the multiple regression case.
In the context of the simple regression model the nature of the relationship can take
many forms: it may be linear or non-linear (concave, convex, or some arbitrary polynomial).
In the majority of applications of the regression method, a linear relationship is assumed.
Especially in business and the other social sciences, non-linear regression models are used
much less frequently. We will therefore restrict our attention to linear regression
models, where the actual relationship between the dependent and the independent variable
can be represented by a straight line of the form:
Y = A + BX + ε
This is called the true model and is assumed to have been obtained from the entire
population. Here Y is the dependent variable and X is the independent variable; the parameters
A and B are, respectively, the intercept and the slope of the regression, while ε is the
error term that represents the influence of all the other, unknown factors on the dependent
variable. If the value of B is positive we speak of a direct relationship between the variables,
as they both move in the same direction: as the independent variable increases so does the
dependent variable, and vice versa. On the other hand, if the parameter B has a negative value
the relationship is inverse: when the independent variable increases the dependent variable
decreases, and vice versa. If B is actually zero then there is no relationship between X and Y.
From this point on let's assume Y represents the monthly sales volume and X the
advertising budget. Since we assume there are other factors besides the advertising budget
affecting the sales volume, even for a fixed value of X a range of Y values is possible
(i.e., the relationship is not deterministic). For any fixed value of X, the distribution of all
Y values is referred to as the conditional distribution of Y, denoted by Y|X and read as
"Y given X." For example, Y|X=500 refers to the distribution of sales volumes in all months
in which the advertising budget has been 500. In regression analysis we make certain assumptions
about the conditional distributions of the dependent variable, the variable we are trying to
predict. Here are the three assumptions, which are quite similar to the assumptions we made in
ANOVA:
- Normality. All conditional distributions are normally distributed (e.g., the
distribution of sales volumes in all months in which advertising has been or
will ever be some fixed level is normal).
- Homoscedasticity. All conditional (normal) distributions have the same
variance, σ².
- Linearity. The means of the conditional distributions are linearly related
to the value of the independent variable.
The last assumption is implicit in the model Y = A + BX + ε: the mean of Y|X is
μY|X = A + BX, since the mean of ε is zero. (There is a very large
number of other factors affecting Y, some positive, some negative; thus their combined
effect averages out to zero.)
Utopia: In the example, if we knew (we probably never will!) the true model, Y =
649.58 + 1.1422X + ε with σ² = 356, we would be able to make fairly accurate
predictions of Y for any given value of X. For example, if we wondered about the sales
volume when advertising is 500, we would calculate μY|X = 649.58 + 1.1422*500 = 1220.
We would then say the mean (expected) value of sales when X = 500 is 1220, period.
However, if we wanted to predict the sales volume next month, knowing that the
advertising budget is set at 500, this is a more difficult question: we now have to
contend with the effect of all the other factors and say that we expect sales to be more or
less 1220, depending on how the other factors, ε, materialize (the magnitude of "more or
less" obviously depends on the value of σ²).
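This "utopian" calculation is easy to sketch in Python (a hypothetical illustration that simply plugs in the true-model numbers above): the point prediction is the conditional mean, and because ε is normal with variance σ², roughly 95% of months with X = 500 should fall within about two standard deviations of that mean.

    import math

    A, B = 649.58, 1.1422      # true (population) intercept and slope
    sigma2 = 356.0             # true error variance
    x = 500.0                  # advertising budget

    mean_y = A + B * x         # expected sales when X = 500: about 1220.7
    sigma = math.sqrt(sigma2)  # about 18.9

    # With eps ~ N(0, sigma^2), about 95% of months with X = 500
    # should see sales within 1.96 standard deviations of the mean.
    low, high = mean_y - 1.96 * sigma, mean_y + 1.96 * sigma
    print(f"expected sales: {mean_y:.2f}")             # 1220.68
    print(f"95% range:      {low:.1f} to {high:.1f}")  # about 1183.7 to 1257.7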
Reality: We do not know the true model (A, B, σ²), but we have a random sample of
observations of Y and X values from which we can calculate estimates of A, B, and σ² (don't
worry for the time being about how the sample data are used to get these estimates). Let us call
these estimates a, b, and sₑ², respectively. Now, in predicting Y values, we have a more
difficult task: in addition to the effect of other factors (ε), we have to consider that our
estimates of A (using a), B (using b), and σ² (using sₑ²) may have errors. After all, they came
from a random sample.
Least-squares method of estimating the true model.
Suppose we are given a sample set of observations in the form of n pairs of X and
Y values, which can be plotted as a scatter of points. Using the least-squares method we want to
determine the values of a, the intercept, and b, the slope, in such a way that the sum of the
squared vertical differences between each observed Y and the estimated Y (Ŷ, read "Y-hat")
is minimized.
The quantity to be minimized (by the choice of a and b) is

SSE = Σ(Yᵢ - Ŷᵢ)² = Σ(Yᵢ - a - bXᵢ)²

Let's call this sum of the squared errors SSE. If we choose

b = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²

and

a = Ȳ - bX̄

then SSE will be minimized. Also, an estimate of σ² is obtained from

sₑ² = SSE / (n - 2)

These three formulas then give the least-squares estimate of the true model (the true
relationship between Y and X) as

Ŷ = a + bX
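A minimal sketch of these three formulas in Python (the sample of advertising/sales pairs below is made up purely for illustration):

    import numpy as np

    # Hypothetical sample of n pairs (advertising budget, sales volume).
    x = np.array([300., 400., 450., 500., 550., 600., 700.])
    y = np.array([1000., 1120., 1150., 1230., 1280., 1320., 1450.])
    n = len(x)

    # Least-squares estimates of the slope and the intercept.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # Sum of squared errors and the estimate of sigma^2.
    y_hat = a + b * x
    sse = np.sum((y - y_hat) ** 2)
    s_e2 = sse / (n - 2)  # n - 2 because two parameters (a and b) were estimated

    print(f"Y-hat = {a:.2f} + {b:.4f} X,  s_e^2 = {s_e2:.1f}")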
Sampling Distributions of a and b.
Since these estimates are obtained from a random sample, their values are not fixed like
A and B but are variable; i.e., they are random variables. If we had taken different random
samples, by chance we would not always find the same values of a and b. The
standard deviation of the possible a's and b's that we could have calculated from
many samples is therefore referred to as the standard error of these estimators.
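A simulation sketch shows this directly (reusing the hypothetical true model Y = 649.58 + 1.1422X with σ² = 356 from the earlier example): drawing many random samples from the same true model and re-fitting each one shows how a and b vary from sample to sample, and the standard deviations of those fitted values are their standard errors.

    import numpy as np

    A, B, sigma = 649.58, 1.1422, np.sqrt(356.0)  # true model from the example
    rng = np.random.default_rng(1)
    x = np.linspace(300, 700, 20)                 # fixed advertising budgets

    slopes, intercepts = [], []
    for _ in range(5000):                         # many hypothetical random samples
        y = A + B * x + rng.normal(0, sigma, x.size)
        b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        a = y.mean() - b * x.mean()
        slopes.append(b)
        intercepts.append(a)

    # The spread of the estimates across samples is the standard error.
    print(f"standard error of b: {np.std(slopes):.4f}")
    print(f"standard error of a: {np.std(intercepts):.2f}")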