Linear regression with one independent variable

Linear regression assumes a linear relationship between the dependent and independent variables:

Yi = b0 + b1Xi + εi,   i = 1, …, n

Yi – dependent variable
b0 – intercept
b1Xi – slope times the independent variable
εi – error term

In a regression with one independent variable (X), the slope coefficient equals Cov(Y, X) / Var(X).

Assumptions of linear regression:
1. The relationship between the dependent variable, Y, and the independent variable, X, is linear in the parameters b0 and b1. This requirement does not exclude X from being raised to a power other than 1.
2. The independent variable, X, is not random.
3. The expected value of the error term equals 0.
4. The variance of the error term is the same for all observations (the homoskedasticity assumption).
5. The error term is uncorrelated across observations.
6. The error term is normally distributed.

Regression analysis uses two principal types of data:
- cross-sectional: data that involve many observations on X and Y for the same time period
- time-series: data that use many observations from different time periods for the same company, asset class, investment fund, person, country, etc.

Linear regression, also known as linear least squares, computes the line that best fits the observations; it chooses the values of the intercept, b0, and slope, b1, that minimize the sum of the squared vertical distances between the observations and the regression line (a short numerical check follows below).
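To make the slope formula and the least-squares fit concrete, here is a minimal Python sketch. The data are simulated for illustration (not the Intel series from the class example), and all variable names are mine; it simply checks that Cov(Y, X) / Var(X) reproduces the slope found by NumPy's least-squares routine.

import numpy as np

# Simulated illustration only: x = market returns, y = stock returns
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.05, size=60)             # 60 observations, as in the class example
y = 2.25 * x + rng.normal(0.0, 0.09, size=60)  # true slope 2.25, intercept 0

# Slope from the formula: b1 = Cov(Y, X) / Var(X)
b1 = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
# The fitted line passes through the sample means, so:
b0 = y.mean() - b1 * x.mean()

# Cross-check against NumPy's least-squares fit of a degree-1 polynomial
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)

print(f"slope:     Cov/Var = {b1:.4f}, least squares = {slope_ls:.4f}")
print(f"intercept: means   = {b0:.4f}, least squares = {intercept_ls:.4f}")

The two slope estimates agree because minimizing the sum of squared vertical distances has the closed-form solution b1 = Cov(Y, X) / Var(X).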
Understanding Excel's linear regression output (example of estimating Intel's beta from the class)

SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.7283
  R Square             0.5304
  Adjusted R Square    0.5223
  Standard Error       0.0927
  Observations         60

ANOVA
                df    SS        MS        F          Significance F
  Regression     1    0.5627    0.5627    65.5155    0.0000
  Residual      58    0.4982    0.0086
  Total         59    1.0609

                                     Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
  Intercept                          -0.0029        0.0120           -0.2438    0.8083    -0.0269     0.0210
  X Variable 1 (beta of the stock)    2.2516        0.2782            8.0942    0.0000     1.6948     2.8084

Multiple R = the correlation between the actual and forecasted values of the dependent variable in the regression.

Coefficient of determination (R Square) = explained variation / total variation = (total variation – unexplained variation) / total variation; a measure of the goodness of fit of an estimated regression to the data. Here, 53% of the total variation in the dependent variable is explained by the model (the higher this number, the better the particular model is specified).

Adjusted coefficient of determination (Adjusted R Square) – a measure of the goodness of fit of an estimated regression to the data, adjusted for degrees of freedom; used in multiple regression.

Standard error of estimate – measures the standard deviation of the residual term in the regression, i.e., how precise the regression is (the lower the number, the more precise the regression).

ANOVA = analysis of variance.

df = degrees of freedom; the number of independent observations used.

SS = sum of squares; for the Residual row this is the sum of squared residuals (SSE).

MS = mean sum of squares, which is SS/df.

F = F-statistic; measures how well the regression equation explains the variation in the dependent variable. If the independent variable explains little of the variation in the dependent variable, the value of the F-statistic will be very small. With one independent variable, F is the squared value of the t-statistic for the slope (X Variable 1): 8.0942² ≈ 65.5161, which equals F = 65.5155 up to rounding.

Significance F = reports the overall significance of the regression. The lower the number, the more significant the regression.

t Stat = value of the t-statistic, used in hypothesis testing (the null hypothesis against the alternative hypothesis). It should be compared with critical values from a table of Student's t-distribution (found at the end of most math or statistics textbooks). A rule of thumb is to take the critical value to be about +/- 2.

p-value = the smallest level of significance at which the null hypothesis can be rejected (the lower the p-value, the stronger the regression results).

Lower 95% / Upper 95% = the lower and upper bounds of a 95% confidence interval.

Confidence interval = a range that has a given probability of containing the population parameter it is intended to estimate. For instance, with 95% confidence we can say that the intercept lies between -0.0269 and 0.0210.
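The relationships among these statistics can be checked directly from the numbers in the ANOVA and coefficient tables. The Python sketch below (SciPy is used only for the t critical value; the variable names are mine) recomputes R Square from the sums of squares, verifies that F equals the slope t-statistic squared, and rebuilds the 95% confidence interval for the slope; small differences from the table are due to rounding.

from scipy import stats

# Numbers taken from the summary output above (Intel beta example)
n = 60                                    # Observations
ss_regression = 0.5627                    # explained variation
ss_residual = 0.4982                      # unexplained variation
ss_total = ss_regression + ss_residual    # 1.0609, total variation
beta, se_beta = 2.2516, 0.2782            # slope and its standard error
df_residual = n - 2                       # 58: two parameters (b0, b1) estimated

# R Square = explained variation / total variation
print(f"R Square: {ss_regression / ss_total:.4f}")           # 0.5304

# With one independent variable, F is the squared t-statistic of the slope
t_stat = beta / se_beta
print(f"t Stat: {t_stat:.4f}, squared: {t_stat ** 2:.4f}")   # ~8.09, ~65.5

# 95% confidence interval for the slope: coefficient +/- t_crit * SE
t_crit = stats.t.ppf(0.975, df_residual)                     # ~2.0017 for 58 df
lower = beta - t_crit * se_beta
upper = beta + t_crit * se_beta
print(f"95% CI for the slope: [{lower:.4f}, {upper:.4f}]")   # ~[1.6947, 2.8085]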