STATISTICS FOR BUSINESS
TOPIC 6A: SIMPLE LINEAR REGRESSION ANALYSIS

1 Introduction

The main objective of statistical investigations is to establish relationships which make it possible to predict one or more variables in terms of other known variables. The problem of predicting the average value of one variable in terms of the known value of another variable is the problem of regression. Making predictions is crucial for decision making.

Simple linear regression

The common process used in all methods of prediction is:
1. To fit a model to the data which has been collected; and
2. To use this model as the basis for any predictions made.

In practice, we believe there is some relationship between the variables, so we go ahead and record some observations. Once the data is collected, we use it to find a model that can serve as a prediction device.

Clearly, if our predictions are to be ‘close’, it is important that:
o The fitted model gives an accurate representation of the data; and
o The mechanisms which gave rise to the collected data are also valid for the period of the prediction.

For this reason, avoid making predictions for the far future, or for circumstances well outside the scope of the initial data.

Estimates should be revised when either:
o More data becomes available; or
o The underlying mechanisms are known to have changed.

Scatter Diagrams

A scatter diagram shows the relationship between two sets of data. We are interested in estimating a straight-line relationship between variables X and Y.
o X is the independent variable (the predictor).
o Y is the dependent variable (the variable being predicted).

The 1st step in estimating a straight line of best fit to a set of data is to ensure that a straight line really is a reasonable representation of the data. This is done by plotting the data on a graph, called a scatter diagram.
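As a rough illustration of this first step, the Python sketch below draws a text-only scatter diagram using just the standard library. The function name text_scatter, the grid size and the data values are all hypothetical and made up for illustration; in practice a charting tool such as Excel would normally be used. The sketch assumes the X values are not all equal and the Y values are not all equal.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatter diagram of the (x, y) pairs as a string."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point to a column (X axis) and a row (Y axis).
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


# Hypothetical data: X is the independent variable, Y the dependent variable.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

plot = text_scatter(x, y)
print(plot)
```

If the points rise roughly along a line from the bottom left to the top right, as they do for this made-up data, a straight-line model is a reasonable starting point.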
(Example scatter diagram not reproduced here.)

2 Estimating the Simple Linear Regression Line

The equation of a straight line is
o Y = β0 + β1 X (equivalently written as Y = a + b * X or Y = m * X + c)

β0 and β1 are unknown constants:
o β0 represents the Y-intercept (the value of Y when X = 0).
o β1 represents the slope of the line (the change in Y which corresponds to a 1-unit change in X).

These definitions are important when it comes to interpreting the estimated coefficients from a regression analysis.

Once we have decided that a straight line is a reasonable representation of the data (the 1st step), the 2nd step is to estimate the equation of that straight line. We use the sample data to estimate β0 and β1 to give the straight line which 'best' fits the data. These estimates are denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$. The hat (^) denotes an estimate, and they are called “Beta naught hat” and “Beta one hat”.

Estimating the coefficients

Suppose we have a sample of n pairs of observations taken from a population, and the i’th pair is $(x_i, y_i)$. If we have a single predictor variable influencing the outcome of a response variable, the model is

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

where $\varepsilon_i$ is the error term.

To get the best possible fit for the straight line, we minimise the differences between the observed Y-values, $y_i$, and the corresponding points on the line of best fit, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. That is, we wish to minimise the differences $(y_i - \hat{y}_i)$. Each such difference is called a residual.

Problem: some differences will be positive and some negative, so they tend to cancel out when added together.
Solution: use the least squares method to minimise the sum of squares of these differences, that is, to minimise:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Least Squares Criterion

The line that best fits the data is the line for which SSE is minimised. This line is called the regression line.

Regression line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

The equations used to work out $\hat{\beta}_0$ and $\hat{\beta}_1$ are:

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y values.

3 Use of Excel for Simple Linear Regression

It is common to use computer packages when dealing with problems of regression.

4 Residual Analysis

Use residual analysis to determine whether the regression line is a good fit to the data and whether the estimates coming from the regression analysis are valid and reliable.
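A minimal Python sketch of the least squares calculation, and of the residuals that residual analysis examines, is given below. It assumes the standard least squares formulas for the slope and intercept (sums of cross-deviations and squared deviations about the sample means). The function name fit_line and the data values are hypothetical and made up for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1): least squares estimates of the intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx          # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1


# Hypothetical sample of n = 5 pairs of observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)

# Residual = observed y minus the corresponding point on the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Sum of squared residuals: the quantity that least squares minimises.
sse = sum(e ** 2 for e in residuals)

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("residuals:", [round(e, 3) for e in residuals])
print("SSE:", round(sse, 3))
```

For this made-up data the fitted line is approximately ŷ = 2.2 + 0.6x with SSE of about 2.4. The residuals sum to zero (up to rounding), which is why the squared differences are used rather than the raw differences.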
If the model fits the data well, the residuals, which represent the “error” term in the regression model, should be small and should not exhibit any pattern. Residuals should be random, so a pattern in the residuals usually indicates a predictable component that the model has not captured. This usually indicates that:
o The linear model is not appropriate; OR
o The data needs to be transformed (often a log or other transformation).

Standard Error of the Estimate

In estimating the standard error of any parameters or predictions using the model, we require an estimate of the unknown population standard deviation, σ. This estimate, $s_e$, is related to the sum of squares for error (SSE) as:

$s_e = \sqrt{\frac{SSE}{n - 2}}$

The divisor here is (n - 2) as we have had to estimate 2 parameters, β0 and β1. The standard error is also displayed in the Excel printout.

5 Correlation Coefficient

We determine the strength of the association between variables X and Y to judge how well the regression model fits the data. Two measures of the strength of the association between X and Y are:
o The sample correlation coefficient (r).
o The coefficient of determination (r²).

r measures the LINEAR association between X and Y. r² measures the proportion of variation in the response variable (y) explained by the predictor variable (x).

Coefficient of Correlation

The formula for determining the sample coefficient of correlation is

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

Correlation provides a measure of association between the predictor and the response variable and lies between -1 and +1.
o r = +1: X and Y are perfectly positively correlated (or associated) with one another.
o r = -1: X and Y are perfectly negatively correlated (or associated) with one another.
o r = 0: X and Y are uncorrelated; there is no linear association between them. This does not mean that they are statistically independent (dependence may exist in a quadratic, cubic or higher-order form).
o r is also displayed in the Excel printout.
o The population coefficient of correlation is ρ.
This is the Greek letter rho.

Note: A correlation coefficient close to 0 doesn’t mean there is no relationship between the variables being considered, only that there is no LINEAR relationship.

Coefficient of Determination

The formula for determining the sample coefficient of determination is:

$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

This describes what proportion of the variation in the observed y values can be explained by the regression line (i.e. by the variation in the predicted y values).
o r² falls into the range 0 ≤ r² ≤ 1 and is usually written as a percentage.
o An r² value close to 1 implies that most of the variability in the y values is explained by the regression model.
o r² is also displayed in the Excel printout.
o The population coefficient of determination is ρ². This is the Greek letter rho, squared.

6 Related Statistical Inferences

Confidence Interval for the population slope β1

The standard error can be used for confidence intervals and tests on the slope parameter β1.

Significance of the Linear Model

When testing the significance of the linear relationship, test the null hypothesis H0: β1 = 0. The alternative hypothesis HA (as always) will depend on what the question is.
o Test for positive slope: HA: β1 > 0
o Test for negative slope: HA: β1 < 0
o Test if the linear relationship is significant (i.e. does it exist): HA: β1 ≠ 0

Use the six-step procedure. Since the population standard deviation is unknown, use a t-test. For simple linear regression, as we are estimating two parameters, β0 and β1, the degrees of freedom are d.f. = n - 2.
1. State the hypotheses (these depend on the question).
2. Find a suitable test statistic.
3. Specify the level of significance, α (this depends on the question). The value of t is given in the Excel printout. In fact, the p-value is also given. This makes our rejection region easy to determine.
4. Reject H0 if p-value < α.
5. Calculations.
6. Conclusion.

Note: The p-value given in Excel is for a 2-tailed test. So, if we are performing a 1-tailed test, we must halve the p-value from the printout first (this applies when the sign of the estimated slope agrees with HA).

Confidence and Prediction Intervals