Lecture 1

Reading:
• Brooks 2019, Chapters 1, 2, 3.

Objectives
• Understand simple linear regression
• Understand model evaluation criteria
• Limitations of simple linear regression models
• Multiple regression analysis
• Checking for model adequacy using residual analysis

Techniques in Data Modelling
• Perform an ‘exploratory data analysis’, i.e. plot the data over time to check that the series conforms to your expectations and to allow data entry errors to be spotted
• Obtain summary statistics and simple graphical displays, such as histograms, for each variable to get a feel for the data you are modelling

Relationships Between Variables: Correlation
• Does a relationship exist between two variables?
• How strong is the relationship?
• Is the correlation positive or negative?
• Useful to know prior to model construction

Cause & Effect Relationships
• Patterns will exist between different variables, and coincidence can produce very good correlation
• Correlation does not imply causality
• Scatter diagrams and correlation coefficients are useful, but for business forecasting purposes a more precise relationship is required

Simple Linear Regression Model
• Fitting a straight line through the data we have at our disposal:
  y = constant + βx + error
  where
  β = slope of the line
  constant = intercept with the vertical axis
  error = random fluctuations
• In our example: equity price = 4.91 + (9.33 × GNP)

Prediction From Regression Line
• Solve for any value of x
• How good the prediction is depends upon
  i) the value of the correlation coefficient
  ii) the value of x used
• Confidence limits are associated with model predictions

Prediction From Regression Line (cont.)
• S.E. is the standard deviation of the errors. It is a good indicator of forecasting accuracy when comparing two models.
• In our model, S.E. = 4.823

Model Evaluation Using EViews
Useful indicators of model adequacy are:
• t-ratio
• p-value
• F statistic
• R² (adjusted R²)
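As a rough illustration of the fit-and-predict steps for the simple linear regression model, the sketch below estimates the constant and slope by least squares, computes the S.E. of the errors, and evaluates a prediction. It uses plain Python rather than EViews. The data values and the function names `fit_simple_regression` and `predict` are hypothetical, chosen only for illustration. Only the coefficients 4.91 and 9.33 come from the lecture.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = constant + beta * x.

    Returns (constant, beta, se), where se is the standard
    deviation of the errors on n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                      # slope of the line
    constant = mean_y - beta * mean_x     # intercept with the vertical axis
    residuals = [yi - (constant + beta * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    se = math.sqrt(sse / (n - 2))
    return constant, beta, se


def predict(constant, beta, x_new):
    """Point prediction from the fitted line for a given x."""
    return constant + beta * x_new


# Hypothetical data (not the lecture's data set).
gnp = [1.0, 2.0, 3.0, 4.0, 5.0]
equity_price = [14.0, 24.5, 32.0, 43.0, 51.5]

constant, beta, se = fit_simple_regression(gnp, equity_price)

# Prediction using the lecture's fitted line, at a hypothetical GNP of 2.0.
lecture_prediction = predict(4.91, 9.33, 2.0)
```

When comparing two candidate models on the same data, the one with the smaller `se` would be preferred on forecasting accuracy, in line with the S.E. slide.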
t-ratio & p-value: Fit of Parameter
• t-ratio = coefficient ÷ standard deviation (standard error) of the coefficient
• Compare with t tables on (T-2) degrees of freedom, where T is the number of observations, or use the rule of thumb: the parameter is significant if the t-ratio is greater than +2 or less than -2
• p-value: the significance level at which the hypothesis that the parameter is zero would just be rejected. If it is greater than 0.05, the parameter is not significant at the 5% level.

Model Evaluation (cont.): Fit of Model - F statistic
Analysis of variance (n = number of observations, p = number of estimated parameters including the constant)

Source of variance   DF     SS     MS                   F
Regression           p-1    SSR    MSR = SSR ÷ (p-1)    MSR ÷ MSE
Error                n-p    SSE    MSE = SSE ÷ (n-p)
Total                n-1    SSTO

Model Evaluation (cont.): Fit of Model - R² statistic
• The R² statistic tells you what percentage of the variation in the response variable has been explained by your model
• Derived from SSR ÷ SSTO
• For our model: 8651.0 ÷ 9162.6 = 0.944, i.e. 94.4% of variation explained
• Adjusted R² corrects R² for throwing in irrelevant parameters by penalising the loss of degrees of freedom

Limitation of the Simple Linear Regression Model
• Isolating key explanatory variables is a time-consuming process requiring a thorough understanding of the underlying processes affecting your model
• Very few relationships are ‘linear’: a more complex mathematical function may be required

Limitation: Simple Linear Regression Model (cont.)
• The model predictions may not be very accurate, especially if you are forecasting over time horizons well outside your sample period
• One explanatory variable may not be sufficient. Hence multiple regression analysis.

Multiple Regression Analysis: Objectives
• Multiple regression analysis is a multivariate statistical technique used to examine the relationship between a single dependent variable and a set of independent variables
• It is widely used in business for two broad classes of research problems: prediction and explanation

The Multiple Regression Model
• Instead of having just one predictor variable on the right hand side of the equation we have many, i.e.
  y = C + b₁x₁ + b₂x₂ + ... + bₙxₙ + error
  where the error, or residual term, is the component unexplained by your model.
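The multiple regression model and the analysis-of-variance quantities can be tied together in a short sketch. It fits y = C + b1x1 + b2x2 + error by ordinary least squares (solving the normal equations), then computes SSR, SSE, SSTO, R², adjusted R² and F. This is an illustrative stand-in for what EViews reports, not its implementation. The data set and the names `solve_linear_system`, `fit_multiple_regression` and `anova` are hypothetical. The adjusted R² formula used, 1 - (1 - R²)(n-1)/(n-p), is the standard one and is assumed here because the slides do not state it.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - known) / m[i][i]
    return coef


def fit_multiple_regression(xs, y):
    """OLS via the normal equations. Returns [C, b1, ..., bk]."""
    design = [[1.0] + list(row) for row in xs]  # prepend the constant
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def anova(xs, y, coef):
    """Analysis-of-variance quantities for a fitted model."""
    n, p = len(y), len(coef)
    fitted = [coef[0] + sum(b * v for b, v in zip(coef[1:], row))
              for row in xs]
    mean_y = sum(y) / n
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr = ssto - sse
    msr = ssr / (p - 1)
    mse = sse / (n - p)
    r2 = ssr / ssto
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    return {"ssr": ssr, "sse": sse, "ssto": ssto,
            "r2": r2, "adj_r2": adj_r2, "f": msr / mse}


# Hypothetical data: six observations on two predictors.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [3.0, 7.5, 6.5, 10.5, 11.5, 15.0]

coef = fit_multiple_regression(xs, y)
stats = anova(xs, y, coef)

# R² check using the lecture's own sums of squares.
lecture_r2 = 8651.0 / 9162.6
```

Adjusted R² is always below R² when p > 1, and the gap widens as parameters are added without a matching fall in SSE.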
Prediction with Multiple Regression
• The objective is to maximise the overall predictive power of the independent (x) variables as a means of forecasting the dependent variable, e.g. equity prices
• Often, predictive power is maximised at the expense of interpretation of results
• A second objective is the determination of the relative importance of each independent variable in the prediction of the response variable

Assumptions in Multiple Regression Analysis
• Five basic assumptions are made when calculating a multiple regression relationship

Assumptions: Multiple Regression (cont.)
1. That we are dealing with a linear function of the independent variables plus an error term:
   y = C + β₁X₁ + β₂X₂ + ... + βₙXₙ + error
2. That the error term has a mean of zero
3. That there is a constant variance in the error terms
4. That the error terms are independent
5. That there is no significant linear relationship between the independent variables

Linearity of the Relationship
• The concept of correlation is based on a linear relationship, which makes linearity a critical issue in regression analysis
• It is easily examined in residual plots
• The problem may be remedied by transforming the data, e.g. taking logarithms or square roots, or a polynomial may be fitted to accommodate the curvilinear effects, e.g. equity price = constant + β₁ × GNP²

Error Term
• The error term must have a zero mean by definition, since the line of best fit will pass directly through the centre of the data
• The positive and negative errors cancel out, so the residuals sum to zero

Constant Variance of the Error Terms
• A non-constant variance of the error terms suggests the relationship is changing over time
• If the model spans a long time period, conditions may not be stable
• This will lead to problems of prediction
• Again, taking logarithms or square roots of the data may stabilise the variance

Independence of the Error Terms
• The pattern of residuals should appear random and similar to the null plot
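The residual checks described above are normally done by eye on a residual plot. As a numerical complement, the sketch below computes three simple summaries: the residual mean (zero-mean assumption), the ratio of residual variance in the second half of the sample to the first half (constant variance), and the lag-1 autocorrelation (independence). These particular summaries and the name `residual_diagnostics` are assumptions made for illustration and are not prescribed by the lecture. Both residual series are hypothetical.

```python
def residual_diagnostics(residuals):
    """Summaries for the zero-mean, constant-variance and independence checks."""
    n = len(residuals)
    mean = sum(residuals) / n
    centred = [e - mean for e in residuals]

    # Independence: correlation between each residual and the previous one.
    denom = sum(e * e for e in centred)
    lag1 = sum(centred[t] * centred[t - 1] for t in range(1, n)) / denom

    # Constant variance: compare the two halves of the sample.
    half = n // 2
    var_first = sum(e * e for e in centred[:half]) / half
    var_second = sum(e * e for e in centred[half:]) / (n - half)

    return {"mean": mean,
            "lag1_autocorr": lag1,
            "variance_ratio": var_second / var_first}


# Hypothetical residuals with runs of the same sign (a visible pattern).
patterned = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
patterned_checks = residual_diagnostics(patterned)

# Hypothetical residuals whose spread grows over the sample.
widening = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
widening_checks = residual_diagnostics(widening)
```

A lag-1 autocorrelation well away from zero, or a variance ratio far from 1, would be a prompt to look at the residual plot more closely. Neither is a formal test.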
Independence of the Error Terms (cont.)
• A pattern occurs if the basic model conditions change but the changes are not incorporated in the model
• For example, predicting profits on swimsuits with monthly data that include two winter seasons and one summer season would lead to negative residuals for the winter months and positive residuals for the summer months
• This could be rectified by taking first differences of the data or by including a variable to represent the seasonal component

Relationship Between Independent Variables
• As the complexity of the model increases, so does the degree of inter-relatedness of the variables on the right hand side of the equation
• It is not an ideal world, and when dealing with business problems variables do tend to move in line with one another
• Need to check the correlation matrix to identify the problem

Relationship Between Independent Variables (cont.)
• High correlations between the variables lead to unstable coefficient estimates for the independent variables
• Remedies are to omit one of the correlated variables or to add more data

Selection of Variables
• You will normally have a number of possible independent variables from which to choose
• To help you obtain the ‘best’ regression model objectively, a sequential search process may be employed in the form of a stepwise regression
• Stepwise regression is the most popular technique in variable selection: all variables are examined to determine their contribution to the predictive power of the model

Building a Multiple Regression Model
Checklist
• Select dependent and independent variables
• Plot each variable over time and perform ‘exploratory data analysis’
• Check for high correlations between y and the x’s
• Check for high correlations within the x’s

Building a Multiple Regression Model (cont.)
Checklist
• Construct the model using EViews
• Stepwise regression is recommended for situations where a large number of independent variables are involved
• Delete insignificant variables from the model
• Obtain final model specification
• Check the adequacy of the model by plotting residuals over time: are they random or ‘spherical’?
• If not adequate, carry out the suggestions for model improvement outlined earlier
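The checklist step of checking for high correlations within the x's can be sketched as a pairwise scan of the correlation matrix. The code below is a minimal illustration in plain Python. The variable names, the data values, the function names `correlation` and `highly_correlated_pairs`, and the 0.8 cut-off are all assumptions for illustration, since the lecture does not specify a threshold.

```python
import math
from itertools import combinations


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def highly_correlated_pairs(variables, threshold=0.8):
    """Return (name1, name2, r) for every pair with |r| >= threshold.

    The default threshold of 0.8 is an illustrative assumption.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = correlation(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical candidate independent variables.
candidates = {
    "gnp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "industrial_output": [2.0, 4.0, 6.0, 8.0, 10.5],
    "interest_rate": [5.0, 3.0, 6.0, 2.0, 4.0],
}

flagged = highly_correlated_pairs(candidates)
```

Any flagged pair is a candidate for the remedies given earlier: omit one of the two variables or add more data.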