LINEAR REGRESSION

Decisions in business and other areas are often based on predictions of what might
happen in the future. As one's ability to predict the future improves, decisions can be made
whose outcomes are more favorable. The most powerful approach to predicting the future
is to establish quantitative relationships between what is known and what is to be predicted.
Regression and correlation are interrelated statistical techniques that allow decision makers
not only to establish quantitative relationships among such variables but also to measure the
"strength" of those relationships.
In regression analysis an estimating mathematical equation is developed that relates
a known quantity to an unknown variable of interest. Examples include the relationship
between advertising expenditure and level of demand, volume of production and material
cost, or smoking habits and the incidence of heart disease. In all of these examples the relationship
to be established is statistical (stochastic) in nature. This means that we do not pretend
that the level of demand depends exclusively and deterministically on the advertising
budget; rather, we hypothesize that, among other factors, the advertising budget affects the level of
demand. Thus knowing the advertising budget does not allow us to predict sales without
any error, but simply affords a more accurate prediction than would be possible without that
knowledge. This is an important difference from scientific laws, where knowledge of
certain variables allows scientists to make very accurate predictions, as in the case of the
speed of a train determining without error the time it takes to traverse a 100-mile track.
Regression and correlation analyses are based on the relationship or association
between two (or more) variables. The known variable(s) is (are) called the independent
(explanatory) variable(s), while the variable to be predicted is the dependent (or response)
variable. In the example of advertising versus sales volume, the advertising budget is the
independent variable while the sales volume is the dependent (response) variable. In
regression analysis we can have only one dependent variable but can use more than one
independent variable to predict it. If we have only one independent
variable the regression model is called a simple regression model, whereas if we have more
than one independent variable we have a multiple regression model. In what follows we
will first develop the simple regression model and then extend it to the multiple regression case.
In the context of the simple regression model the nature of the relationship can take
many forms: it may be linear or non-linear (concave, convex, or some arbitrary polynomial).
In the majority of applications of the regression method, a linear relationship is assumed.
Especially in business and the other social sciences, non-linear regression models are used
much less frequently. We will therefore restrict our attention to linear regression
models, where the actual relationship between the dependent and the independent variable
can be represented by a straight line of the form:
Y = A + BX + ε
This is called the true model and is assumed to have been obtained from the entire
population. Here Y is the dependent variable and X is the independent variable; the parameters
A and B are, respectively, the intercept and the slope of the regression, while ε is the
error term that represents the influence of all the other, unknown factors on the dependent
variable. If the value of B is positive we speak of a direct relationship between the variables,
as they both move in the same direction: as the independent variable increases so does the
dependent variable, and vice versa. On the other hand, if the parameter B has a negative value
the relationship is inverse: when the independent variable increases the dependent variable
decreases, and vice versa. If B is actually zero then there is no relationship between X and Y.
From this point on let's assume Y represents the monthly sales volume and X the
advertising budget. Since we assume there are other factors besides the advertising budget
affecting the sales volume, even for a fixed value of X a range of Y values is possible
(i.e., the relationship is not deterministic). For any fixed value of X, the distribution of all
Y values is referred to as the conditional distribution of Y, denoted by Y|X and read as
"Y given X." For example, Y|X=500 refers to the distribution of sales volumes in all months
in which the advertising budget has been 500. In regression analysis we make certain assumptions
about the conditional distributions of the dependent variable, the variable we are trying to
predict. Here are the three assumptions, which are quite similar to the assumptions we made in
ANOVA:
- Normality. All conditional distributions are normally distributed (e.g., the
distribution of sales volumes in all months in which advertising has been or
will ever be some fixed level is normal).
- Homoscedasticity. All conditional (normal) distributions have the same
variance, σ².
- Linearity. The means of the conditional distributions are linearly related
to the value of the independent variable.
The last assumption is implicit in the model Y = A + BX + ε: the mean of Y|X is
μY|X = A + BX, since the mean of ε is zero. (There is a very large
number of other factors affecting Y, some positive, some negative; thus their combined
effect averages out to zero.)
Utopia: In the example, if we knew (we probably never will!) the true model, Y =
649.58 + 1.1422X + ε with σ² = 356, we would be able to make fairly accurate
predictions of Y for any given value of X. For example, if we wondered about the sales
volume when advertising is 500, we would calculate μY|X = 649.58 + 1.1422*500 = 1220.
We would then say the mean (expected) value of sales when X = 500 is 1220, period.
However, if we wanted to predict the sales volume next month, knowing that the
advertising budget is set at 500, this is a more difficult question: we now have to
contend with the effect of all the other factors and say that we expect sales to be more or
less 1220, depending on how the other factors, ε, materialize (the magnitude of "more or
less" obviously depends on the value of σ²).
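This "utopian" calculation is easy to sketch in Python (a hypothetical illustration that simply plugs in the true-model numbers above): the point prediction is the conditional mean, and because ε is normal with variance σ², roughly 95% of months with X = 500 should fall within about two standard deviations of that mean.

    import math

    A, B = 649.58, 1.1422      # true (population) intercept and slope
    sigma2 = 356.0             # true error variance
    x = 500.0                  # advertising budget

    mean_y = A + B * x         # expected sales when X = 500: about 1220.7
    sigma = math.sqrt(sigma2)  # about 18.9

    # With eps ~ N(0, sigma^2), about 95% of months with X = 500
    # should see sales within 1.96 standard deviations of the mean.
    low, high = mean_y - 1.96 * sigma, mean_y + 1.96 * sigma
    print(f"expected sales: {mean_y:.2f}")             # 1220.68
    print(f"95% range:      {low:.1f} to {high:.1f}")  # about 1183.7 to 1257.7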
Reality: We do not know the true model (A, B, σ²), but we have a random sample of
observations of Y and X values from which we can calculate estimates of A, B, and σ² (don't
worry for the time being about how the sample data are used to get these estimates). Let us call
these estimates a, b, and sₑ², respectively. Now, in predicting Y values, we have a more
difficult task: in addition to the effect of other factors (ε), we have to consider that our
estimates of A (using a), B (using b), and σ² (using sₑ²) may have errors. After all, they came
from a random sample.
Least-squares method of estimating the true model.
Suppose we are given a sample set of observations in the form of n pairs of X and
Y values, which can be plotted as a scatter of points. Using the least-squares method we want to
determine the values of a, the intercept, and b, the slope, in such a way that the sum of the
squared vertical differences between each observed Y and the estimated Y (Ŷ, read "Y-hat")
is minimized.
The quantity to be minimized (by the choice of a and b) is

SSE = Σ(Yᵢ - Ŷᵢ)² = Σ(Yᵢ - a - bXᵢ)²

Let's call this sum of the squared errors SSE. If we choose

b = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²

and

a = Ȳ - bX̄

then SSE will be minimized. Also, an estimate of σ² is obtained from

sₑ² = SSE / (n - 2)

These three formulas then give the least-squares estimate of the true model (the true
relationship between Y and X) as

Ŷ = a + bX
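A minimal sketch of these three formulas in Python (the sample of advertising/sales pairs below is made up purely for illustration):

    import numpy as np

    # Hypothetical sample of n pairs (advertising budget, sales volume).
    x = np.array([300., 400., 450., 500., 550., 600., 700.])
    y = np.array([1000., 1120., 1150., 1230., 1280., 1320., 1450.])
    n = len(x)

    # Least-squares estimates of the slope and the intercept.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # Sum of squared errors and the estimate of sigma^2.
    y_hat = a + b * x
    sse = np.sum((y - y_hat) ** 2)
    s_e2 = sse / (n - 2)  # n - 2 because two parameters (a and b) were estimated

    print(f"Y-hat = {a:.2f} + {b:.4f} X,  s_e^2 = {s_e2:.1f}")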
Sampling Distributions of a and b.
Since these estimates are obtained from a random sample, their values are not fixed like
A and B but are variable; i.e., they are random variables. If we had taken different random
samples, by chance we would not always find the same values of a and b. The
standard deviation of the possible a's and b's that we could have calculated from
many samples is therefore referred to as the standard error of these estimators.
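A simulation sketch shows this directly (reusing the hypothetical true model Y = 649.58 + 1.1422X with σ² = 356 from the earlier example): drawing many random samples from the same true model and re-fitting each one shows how a and b vary from sample to sample, and the standard deviations of those fitted values are their standard errors.

    import numpy as np

    A, B, sigma = 649.58, 1.1422, np.sqrt(356.0)  # true model from the example
    rng = np.random.default_rng(1)
    x = np.linspace(300, 700, 20)                 # fixed advertising budgets

    slopes, intercepts = [], []
    for _ in range(5000):                         # many hypothetical random samples
        y = A + B * x + rng.normal(0, sigma, x.size)
        b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        a = y.mean() - b * x.mean()
        slopes.append(b)
        intercepts.append(a)

    # The spread of the estimates across samples is the standard error.
    print(f"standard error of b: {np.std(slopes):.4f}")
    print(f"standard error of a: {np.std(intercepts):.2f}")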