Correlation And Regression
© 2017 FinTree Education Pvt. Ltd. | https://www.fintreeindia.com/

LOS a

Sample covariance
- Measures how two variables move together; captures the linear relationship between them
- Cov(x,y) = Σ(X − X̄)(Y − Ȳ) / (n − 1)
- Cov(x,y) = r × Sx × Sy
- Unit = %² (when the variables are measured in %); Range = −∞ to +∞
- +ve covariance = variables tend to move together; −ve covariance = variables tend to move in opposite directions

Sample correlation
- Measures the strength of the linear relationship between two variables; a standardized measure of covariance
- r = Cov(x,y) / (Sx × Sy)
- No unit; Range = −1 to +1
- r = +1: perfectly positive correlation; r = 0: no linear relationship; r = −1: perfectly negative correlation

−ve covariance → −ve correlation → −ve slope
+ve covariance → +ve correlation → +ve slope

Scatter plot: graph that shows the relationship between the values of two variables

LOS b   Limitations of correlation analysis
- Nonlinear relationships: correlation measures only linear relationships, not nonlinear ones
- Outliers: extremely large or small values may distort the estimate of correlation
- Spurious correlation: the appearance of a causal linear relationship when no economic relationship exists

LOS c   Test of the hypothesis that the population correlation coefficient equals zero

Eg. r = 0.4, n = 62, confidence level = 95%. Perform a test of significance.
Step 1: Define the hypothesis:  H0: r = 0, Ha: r ≠ 0
Step 2: Calculate the test statistic:  t = r√(n − 2) / √(1 − r²) = 0.4 × √60 / √(1 − 0.4²) ≈ 3.38
Step 3: Calculate the critical values: t-distribution, DoF = 60 → ±2.00
The calculated test statistic lies outside the range, so the conclusion is ‘Reject the null hypothesis’.
‘r’ is statistically significant, which means the population correlation is different from zero.

LOS d

Dependent variable
- The variable you are seeking to explain
- Also referred to as the explained variable / endogenous variable / predicted variable

Independent variable
- The variable you are using to explain changes in the dependent variable
- Also referred to as the explanatory variable / exogenous variable / predicting variable

Eg. Rp = RFR + β(Rm − RFR): Rp is the dependent variable, (Rm − RFR) is the independent variable, RFR is the intercept and β is the slope.

LOS e   Assumptions underlying linear regression
1. The relationship between the dependent and independent variable is linear
2. The independent variable is uncorrelated with the error term
3. The expected value of the error term is zero
4. The variance of the error term is constant (not zero)
5. The economic relationship between the variables is intact for the entire time period (eg. no change in political regime)
6. The error term is uncorrelated across observations (eg. no seasonality)
7. The error term is normally distributed

Regression line: the line that minimizes the SSE
Sum of squared errors (SSE): sum of the squared vertical distances between the actual and estimated Y-values
Slope coefficient (beta): describes the change in ‘y’ for a one-unit change in ‘x’;  b1 = Cov(x,y) / Variance(x)

LOS f   Standard error of estimate, coefficient of determination and confidence interval for a regression coefficient

Eg.
x  | Actual y | Predicted y | Error | Squared error
10 | 17       | 15.81       | 1.19  | 1.42
15 | 19       | 23.36       | −4.36 | 19.01
20 | 35       | 30.91       | 4.09  | 16.73
30 | 45       | 46.01       | −1.01 | 1.02
Sum of squared errors (SSE) ≈ 38.17

Standard error of estimate (SEE) = √(SSE / (n − 2)) = √(38.17 / 2) ≈ 4.37
Coefficient of determination (R²): the proportion of the variation in the dependent variable explained by the independent variable
For a simple linear regression, R² = r²
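The LOS f numbers can be reproduced in a few lines. A minimal sketch, assuming only the x and y values given in the example above (numpy's least-squares fit stands in for the regression):

```python
import numpy as np

# Reproduce the LOS f example: fit y on x, then compute SSE, SEE and R^2
x = np.array([10, 15, 20, 30], dtype=float)
y = np.array([17, 19, 35, 45], dtype=float)
b1, b0 = np.polyfit(x, y, 1)                 # slope ~1.51, intercept ~0.71
y_hat = b0 + b1 * x                          # ~[15.81, 23.34, 30.89, 45.97]
sse = np.sum((y - y_hat) ** 2)               # ~38.17
see = np.sqrt(sse / (len(x) - 2))            # ~4.37
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)   # ~0.93; equals r^2 for a simple regression
print(b1, b0, sse, see, r2)
```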
LOS g   Confidence interval for a regression coefficient

Confidence interval = b1 ± (tc × SE), where b1 = estimated slope, tc = critical t-value, SE = standard error of the slope

Eg. b1 = 0.48, SE = 0.35, n = 42. Calculate the 90% confidence interval.
Confidence interval: 0.48 ± (1.684 × 0.35) = −0.109 to 1.069

Hypothesis testing for the population value of a regression coefficient

Eg. b1 = 0.48, SE = 0.35, n = 42, confidence level = 90%. Perform a test of significance.
Step 1: Define the hypothesis:  H0: b1 = 0, Ha: b1 ≠ 0
Step 2: Calculate the test statistic:  (Sample statistic − Hypothesized value) / Std. error = (0.48 − 0) / 0.35 = 1.371
Step 3: Calculate the critical values: t-distribution, DoF = 40 → ±1.684
The calculated test statistic lies inside the range, so the conclusion is ‘Fail to reject the null hypothesis’.
The slope is not significantly different from zero.

LOS h & i   Predicted value of the dependent variable and its confidence interval

Predicted value:  Ŷ = b0 + b1 × Xp   (intercept + slope × forecasted value of x)
Confidence interval:  Ŷ ± (tc × SE)

Eg. Forecasted return (x) = 12%, n = 32, intercept = −4%, slope = 0.75, standard error = 2.68. Calculate the predicted value of y and its 95% confidence interval.
Predicted value: Ŷ = −4 + 0.75 × 12 = 5%
Confidence interval: 5 ± (2.042 × 2.68) = −0.472 to 10.472

LOS j   Analysis of variance (ANOVA)

Notation: Ȳ = mean, Yi = actual value, Ŷi = predicted value

Sum of squared errors (SSE) = Σ(Yi − Ŷi)²: measures unexplained variation (aka sum of squared residuals)
Regression sum of squares (RSS) = Σ(Ŷi − Ȳ)²: measures explained variation
Total sum of squares (SST) = Σ(Yi − Ȳ)²: measures total variation

- The higher the RSS, the better the quality of the regression
- R² = RSS/SST = explained variation / total variation

ANOVA table
Source of variation     | DoF       | Sum of squares | Mean sum of squares
Regression (explained)  | k         | RSS            | MSR = RSS/k
Error (unexplained)     | n − k − 1 | SSE            | MSE = SSE/(n − k − 1)
Total                   | n − 1     | SST            |

F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF

When to use the F-test and the t-test: for Y = b0 + b1x1 + b2x2 + ε, the F-test checks whether the slope coefficients are jointly significant, while t-tests check the significance of b1 and b2 individually.

LOS k   Limitations of regression analysis
- Linear relationships can change over time (parameter instability)
- Public knowledge of a regression relationship may make it useless in the future
- If the regression assumptions are violated, hypothesis tests are not valid (heteroskedasticity and autocorrelation)
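The LOS g confidence interval and the slope t-test above can be checked the same way. A minimal sketch, assuming only the example values b1 = 0.48, SE = 0.35, n = 42 (scipy supplies the critical t-value):

```python
from scipy import stats

# Confidence interval and t-test for the estimated slope (b1 = 0.48, SE = 0.35, n = 42)
b1, se, n = 0.48, 0.35, 42
t_crit = stats.t.ppf(0.95, df=n - 2)            # 90% two-tailed -> 5% in each tail, ~1.684
ci = (b1 - t_crit * se, b1 + t_crit * se)       # ~(-0.109, 1.069)
t_stat = (b1 - 0) / se                          # ~1.37, inside +/-1.684 -> fail to reject H0
print(ci, t_stat)
```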
Multiple Regression And Issues In Regression Analysis

LOS a   Multiple regression equation

Y = b0 + b1X1 + b2X2 + …. + bkXk + ε
(Y = dependent variable, b0 = intercept, b1…bk = slope coefficients, X1…Xk = independent variables, ε = error term)

LOS b   Interpreting estimated regression coefficients

Intercept term: value of the dependent variable when all independent variables are equal to zero
Slope coefficient: measures how much the dependent variable changes when an independent variable changes by one unit, holding the other independent variables constant

LOS c & d   Hypothesis testing for the population value of a regression coefficient

Eg. b1 = 0.15, SE1 = 0.38; b2 = 0.28, SE2 = 0.043; n = 43; confidence level = 90%. Perform a test of significance.
Step 1: Define the hypotheses:  H0: b1 = 0, Ha: b1 ≠ 0  and  H0: b2 = 0, Ha: b2 ≠ 0
Step 2: Calculate the test statistics:  (0.15 − 0)/0.38 = 0.394  and  (0.28 − 0)/0.043 = 6.511
Step 3: Calculate the critical values: t-distribution, DoF = n − k − 1 = 40 → ±1.684
The test statistic for b1 lies inside the range → ‘Fail to reject the null hypothesis’; the test statistic for b2 lies outside the range → ‘Reject the null hypothesis’.
The variable with slope b1 is not significantly different from zero and the variable with slope b2 is significantly different from zero. The solution is to drop the variable with slope b1.

P-value: the lowest level of significance at which the null hypothesis can be rejected. If the p-value is below the significance level, reject the null; otherwise fail to reject.

LOS e   Confidence interval for a regression coefficient and predicted value of the dependent variable

Confidence interval for a regression coefficient:  b1 ± (tc × SE)   (estimated slope ± critical t-value × standard error)
Predicted value of the dependent variable:  Ŷ = b0 + b1X1 + b2X2 + …. + bkXk   (using the forecasted values of the independent variables)

LOS f   Assumptions of a multiple regression model
1. The relationship between the dependent and independent variables is linear
2. The independent variables are uncorrelated with the error term and there is no exact linear relation between two or more independent variables
3. The expected value of the error term is zero
4. The variance of the error term is constant (not zero)
5. The economic relationship between the variables is intact for the entire time period (eg. no change in political regime)
6. The error term is uncorrelated across observations (eg. no seasonality)
7. The error term is normally distributed

LOS g   F-statistic

- F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
- It is used to check the quality of the entire regression model
- One-tailed test; the rejection region is on the right side
- If the F-test is significant, at least one of the independent variables is able to explain variation in the dependent variable

Eg. n = 48, k = 6, SST = 430, SSE = 190, significance levels = 2.5% and 5%. Perform an F-test.
RSS = SST − SSE = 430 − 190 = 240
MSR = RSS/k = 240/6 = 40
MSE = SSE/(n − k − 1) = 190/41 = 4.634
F-statistic = MSR/MSE = 40/4.634 = 8.63
Critical value (F-table) at the 2.5% significance level (DoF 6, 41) = 2.74
The calculated test statistic lies to the right of the critical value, so the conclusion is ‘Reject the null hypothesis’. Since the conclusion at 2.5% significance is ‘Reject’, the conclusion at 5% significance is also ‘Reject’.
At least one of the slope coefficients is significantly different from zero.

LOS h   R² and adjusted R²

R²: the proportion of the variation in the dependent variable explained by all the independent variables
R² = RSS/SST = explained variation / total variation

Adjusted R² = 1 − [ (n − 1)/(n − k − 1) × (1 − R²) ]
Adjusted R² < R² in multiple regression

Eg. Model 1: n = 30, k = 6, R² = 73%;  Model 2: n = 30, k = 8, R² = 75%
Adjusted R²₁ = 1 − [ (30 − 1)/(30 − 6 − 1) × (1 − 0.73) ] ≈ 66.0%
Adjusted R²₂ = 1 − [ (30 − 1)/(30 − 8 − 1) × (1 − 0.75) ] ≈ 65.5%
Adding the two extra variables is not justified because adjusted R²₂ < adjusted R²₁

LOS i   ANOVA table

Source of variation     | DoF       | Sum of squares | Mean sum of squares
Regression (explained)  | k         | RSS            | MSR = RSS/k
Error (unexplained)     | n − k − 1 | SSE            | MSE = SSE/(n − k − 1)
Total                   | n − 1     | SST            |

F-statistic = MSR/MSE with ‘k’ and ‘n − k − 1’ DoF
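A minimal sketch reproducing the LOS g F-test example and the adjusted R² the same inputs imply; only n = 48, k = 6, SST = 430 and SSE = 190 from above are assumed, with scipy used for the F critical value:

```python
from scipy import stats

# Overall F-test for the example: n = 48, k = 6, SST = 430, SSE = 190
n, k, sst, sse = 48, 6, 430.0, 190.0
rss = sst - sse                                          # 240
msr, mse = rss / k, sse / (n - k - 1)                    # 40 and ~4.63
f_stat = msr / mse                                       # ~8.63
f_crit = stats.f.ppf(1 - 0.025, dfn=k, dfd=n - k - 1)    # ~2.74 at the 2.5% level
r2 = rss / sst
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)            # adjusted R^2 < R^2
print(f_stat, f_crit, r2, adj_r2)
```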
LOS j   Multiple regression equation using dummy variables

Y = b0 + b1X1 + b2X2 + …. + bkXk + ε
(Y = dependent variable, b0 = intercept, b1…bk = slope coefficients, X1…Xk = independent variables, ε = error term)

Dummy variables: independent variables that are binary in nature (i.e. in the form of yes/no)
- They are qualitative variables
- Values: if true = 1, if false = 0
- Use n − 1 dummy variables to represent n categories

LOS k & l   Types of heteroskedasticity

Conditional: the variance of the error term is correlated with the values of the independent variables. Causes problems for statistical inference.
Unconditional: the variance of the error term is not correlated with the independent variables. Does not cause major problems for statistical inference.

Violations of regression assumptions

Conditional heteroskedasticity
- Meaning: variance of the error term is not constant
- Effect: Type I errors
- Detection: examining scatter plots or the Breusch-Pagan test
- Correction: White-corrected standard errors

Positive serial correlation
- Meaning: errors are correlated
- Effect: Type I errors
- Detection: Durbin-Watson test
- Correction: Hansen method

Negative serial correlation
- Meaning: errors are correlated
- Effect: Type II errors
- Detection: Durbin-Watson test
- Correction: Hansen method

Multicollinearity
- Meaning: two or more independent variables are correlated
- Effect: Type II errors
- Detection: F-test significant while t-tests are not significant
- Correction: drop one of the variables

- Breusch-Pagan test statistic = n × R² (from a regression of the squared residuals on the independent variables)
- White-corrected standard errors are also known as robust standard errors
- Durbin-Watson statistic ≈ 2(1 − r)
- Multicollinearity: the question is never yes or no, it is how much
- None of these assumption violations has any impact on the slope coefficients; the impact is on the standard errors and therefore on the t-tests

LOS m   Model specification

A well-specified model:
- should have strong economic reasoning
- should use an appropriate functional form for the variables (transforming variables where needed)
- should be parsimonious (concise/brief)
- should be examined for violations of the regression assumptions
- should be tested on out-of-sample data

Model misspecifications:
- Omitting a variable
- Failing to transform a variable that should be transformed
- Incorrectly pooling data
- Using a lagged dependent variable as an independent variable
- Forecasting the past
- Measuring independent variables with error

Model misspecification can affect both the slope coefficients and the error terms.

LOS n   Models with qualitative dependent variables

Probit: based on the normal distribution
Logit: based on the logistic distribution
Discriminant analysis: similar to probit and logit but uses financial ratios as independent variables

LOS o   Interpretation of a multiple regression model

The values of the slope coefficients suggest that there is an economic relationship between the independent and dependent variables. But a regression can be statistically significant even when there is no economic relationship, so this statistical significance must also be factored into the analysis.
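The detection column in the violations list above maps onto standard library routines. A minimal sketch on simulated data (the data-generating process is purely illustrative; statsmodels supplies the Breusch-Pagan and Durbin-Watson statistics):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Simulated data with conditional heteroskedasticity built in (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1 + 0.5 * x + rng.normal(scale=1 + np.abs(x))   # error variance depends on x
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)   # n * R^2 statistic
dw = durbin_watson(fit.resid)                               # ~2(1 - r); near 2 = no serial correlation
print(bp_stat, bp_pvalue, dw)
```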
Time-series Analysis

LOS a   Trend models

Time series: a set of observations on a variable's outcomes in different time periods. Used to explain the past and make predictions about the future.

Linear trend model
- The dependent variable changes by a constant amount each period
- Plots as a straight line: upward-sloping line = +ve trend, downward-sloping line = −ve trend
- Equation: yt = b0 + b1t + εt

Log-linear trend model
- The dependent variable changes at an exponential (constant growth) rate with time
- Used for financial time series
- Plots as a curve: convex curve = +ve trend, concave curve = −ve trend
- Equation: ln yt = b0 + b1t + εt

LOS b   How to determine which model to use

Plot the data: if the values fall along a straight line, use a linear trend model; if they fall along a curve, use a log-linear trend model.
Limitation of trend models: they are not useful if the error terms are serially correlated.

LOS c   Requirements for a time series to be covariance stationary

A time series is covariance stationary if it satisfies three conditions:
1. Constant and finite mean
2. Constant and finite variance (same as homoskedasticity)
3. Constant and finite covariance of the time series with itself

Eg. Xt = b0 + b1Xt−1 = 5 + 0.5Xt−1
Starting above the mean: Xt−1 = 20 → 15 → 12.5 → 11.25 → … → 10
Starting below the mean: Xt−1 = 6 → 8 → 9 → 9.5 → … → 10
If Xt−1 = 10, then Xt = 10, Xt+1 = 10, Xt+2 = 10 and so on. This is the constant and finite mean.
Mean of the time series = b0 / (1 − b1) = 5 / (1 − 0.5) = 10

- For a model to be valid, the time series must be covariance stationary
- Most economic and financial time series are not stationary
- The model can still be used if the degree of nonstationarity is not significant

LOS d   Autoregressive (AR) models

AR model: a time series regressed on its own past values
Equation AR(1): Xt = b0 + b1Xt−1 + εt
Equation AR(2): Xt = b0 + b1Xt−1 + b2Xt−2 + εt
Chain rule of forecasting: calculating successive forecasts (each forecast becomes the input for the next period's forecast)

LOS e   Autocorrelations of the error terms

If the error terms have significant serial correlation (autocorrelation), the AR model used is not the best model for the time series.

Procedure to test whether the AR model is correct:
Step 1: Estimate the intercept and slope using linear regression
Step 2: Calculate the predicted values
Step 3: Calculate the error terms
Step 4: Calculate the autocorrelations of the error terms
Step 5: Test whether the autocorrelations are significantly different from zero

If the autocorrelations are not statistically significant from zero (decision is FTR): the model fits the time series.
If the autocorrelations are statistically significant from zero (decision is reject): the model does not fit the time series.
Test used: t-test, where t-statistic = autocorrelation / standard error (the standard error is 1/√T, with T observations).

LOS f   Mean reversion

Mean reversion is the tendency of a time series to move toward its mean.
Mean-reverting level = b0 / (1 − b1)

LOS g   In-sample and out-of-sample forecasts and the RMSE criterion

Eg. In-sample forecasts:
Sample value (Xt) | Xt−1 | Predicted value | Error | Squared error
200               | -    | -               | -     | -
220               | 200  | 216.5           | 3.5   | 12.25
215               | 220  | 227.8           | −12.8 | 163.84
205               | 215  | 225.0           | −20   | 400
235               | 205  | 219.4           | 15.6  | 243.36
250               | 235  | 236.4           | 13.6  | 184.96
SSE = 1004.41

In-sample root mean squared error (RMSE) = √(SSE/n) = √(1004.41/5) = 14.17
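A minimal sketch of the RMSE calculation, using only the actual and predicted values from the table above:

```python
import numpy as np

# In-sample RMSE for the AR(1) example above
actual    = np.array([220, 215, 205, 235, 250], dtype=float)
predicted = np.array([216.5, 227.8, 225.0, 219.4, 236.4])
errors = actual - predicted
rmse = np.sqrt(np.sum(errors ** 2) / len(errors))   # sqrt(1004.41 / 5) ~ 14.17
print(rmse)
```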
Eg. Out-of-sample forecasts:
Actual value | Predicted value | Error | Squared error
215          | -               | -     | -
235          | 225             | 10    | 100
220          | 236.4           | −16.4 | 268.96
240          | 227.9           | 12.1  | 146.41
250          | 239.2           | 10.8  | 116.64
SSE = 632

Out-of-sample root mean squared error (RMSE) = √(SSE/n) = √(632/4) = 12.57

Select the time-series model with the lowest out-of-sample RMSE.

LOS h   Instability of coefficients of time-series models

One of the important issues in time series is the choice of sample period:
Shorter sample period → more stability but less statistical reliability
Longer sample period → less stability but more statistical reliability
The data must also be covariance stationary for the model to be valid.

LOS i   Random walks

Random walk: a time series in which the predicted value of the dependent variable in one period equals its value in the previous period plus an error term.
Equation: Xt = Xt−1 + εt

Random walk with a drift: a time series in which the predicted value of the dependent variable in one period equals its value in the previous period plus or minus a constant amount, plus an error term.
Equation: Xt = b0 + Xt−1 + εt

- Both of the above equations have a slope (b1) of 1
- Such time series are said to have a ‘unit root’
- They are not covariance stationary because they do not have a finite mean
- To use standard regression analysis, the data must be converted to a covariance stationary series; this conversion is called ‘first differencing’

LOS j & k   Unit root tests of nonstationarity

Autocorrelation approach: a stationary time series has autocorrelations that are statistically insignificant from zero at all lags, or that drop to zero as the number of lags increases. If the autocorrelations do not exhibit these characteristics, the time series is nonstationary.

Dickey-Fuller test (more definitive than the autocorrelation approach):
Subtract Xt−1 from both sides of the AR(1) equation:
Xt − Xt−1 = b0 + b1Xt−1 − Xt−1 + εt
Xt − Xt−1 = b0 + (b1 − 1)Xt−1 + εt
If the null hypothesis (b1 − 1 = g = 0) cannot be rejected, the time series has a unit root.

First differencing

Eg.
Sales | First difference (Δ sales, current year) | Lag 1 (Δ sales, previous year)
230   | -                                        | -
270   | 40                                       | -
290   | 20                                       | 40
310   | 20                                       | 20
340   | 30                                       | 20

Equation: ŷ = 30 − 0.25x
ŷ = 30 − 0.25(340) = −55
Forecasted sales: 340 − 55 = 285

If a time series is a random walk, it must be converted to a covariance stationary series; this conversion is called first differencing.

LOS l   How to test and correct for seasonality

Seasonality can be detected by plotting the values on a graph or by calculating autocorrelations: seasonality is present if the autocorrelation of the error term at the seasonal lag is significantly different from zero.
Correction: add a lag of the dependent variable (corresponding to the same period in the previous year) to the model as another independent variable.
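A minimal sketch of the unit-root and first-differencing ideas from LOS j & k on simulated data (the series is illustrative; statsmodels' adfuller implements an augmented version of the Dickey-Fuller test):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# A simulated random walk has a unit root; its first difference is stationary (illustrative only)
rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=500))     # random walk: Xt = Xt-1 + error
print(adfuller(x)[1])                   # large p-value -> cannot reject the unit root
dx = np.diff(x)                         # first differencing
print(adfuller(dx)[1])                  # small p-value -> differenced series is covariance stationary
```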
LOS m   Autoregressive conditional heteroskedasticity (ARCH)

ARCH exists if the variance of the error terms in one period is dependent on the variance of the error terms in the previous period.
Testing: the squared residuals from the model are regressed on the first lag of the squared residuals.
Equation: εt² = a0 + a1εt−1² + μt
(εt² = squared residual of the current period, εt−1² = squared residual of the previous period, a0 = intercept, a1 = slope, μt = error term)
If a1 is statistically significant, the time series exhibits ARCH.

LOS n   How time-series variables should be analyzed for nonstationarity and/or cointegration

To test whether the two time series have unit roots, a Dickey-Fuller test is used. Possible scenarios:
1. Both time series are covariance stationary (linear regression can be used)
2. Only the dependent variable time series is covariance stationary (linear regression should not be used)
3. Only the independent variable time series is covariance stationary (linear regression should not be used)
4. Neither time series is covariance stationary and the two series are not cointegrated (linear regression should not be used)
5. Neither time series is covariance stationary and the two series are cointegrated (linear regression can be used)

Cointegration: a long-term economic or financial relationship between two time series.

LOS o   Appropriate time-series model to analyze a given investment problem

- Understand the investment problem you have and make a choice of model
- If you have decided to use a time-series model, plot the values to see whether the time series looks covariance stationary
- Use a trend model if there is no seasonality or structural shift
- If you find significant serial correlation in the error terms, use a more complex model such as an AR model
- If the data has serial correlation, reexamine the data for stationarity before running an AR model
- If you still find significant serial correlation in the residuals of an AR(1) model, use an AR(2) model
- Check for seasonality
- Test whether the error terms have ARCH
- Perform tests of the model's out-of-sample forecasting performance (RMSE)

Probabilistic Approaches: Scenario Analysis, Decision Trees And Simulations

LOS a, b & c   Steps in running a simulation

Step 1: Determine the probabilistic variables. There is no constraint on the number of input variables that can be allowed to vary; focus on the few variables that have a significant impact on value.

Step 2: Define probability distributions for these variables. Three ways to define the distributions:
- Historical data: useful when past data is available and reliable; estimate the distribution from past values
- Cross-sectional data: useful when past data is unavailable or unreliable; estimate the distribution from the values of similar variables
- Statistical distribution and parameters: useful when historical and cross-sectional data are insufficient or unreliable; choose a distribution and estimate its parameters

Step 3: Check for correlation across variables. If the correlation is strong, either allow only one of the variables to vary (focus on the variable that has the greatest impact on value) or build the correlation into the simulation.

Step 4: Run the simulation, i.e. draw an outcome from each distribution and compute the value based on those outcomes (see the sketch below). The number of simulations required increases with:
- Number of probabilistic inputs: the more probabilistic inputs, the more simulations required
- Types of distributions: the greater the diversity of distributions, the more simulations required
- Range of outcomes: the wider the range of outcomes, the more simulations required
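A minimal sketch of Step 4 under purely hypothetical inputs: a value driven by a revenue level and a margin, with assumed distributions (none of these names or parameters come from the notes):

```python
import numpy as np

# Hypothetical Monte Carlo run: value = revenue * margin, distributions assumed for illustration
rng = np.random.default_rng(42)
n_sims = 10_000
revenue = rng.normal(loc=100, scale=15, size=n_sims)   # Step 2: assumed distribution for input 1
margin = rng.uniform(0.08, 0.12, size=n_sims)          # Step 2: assumed distribution for input 2
value = revenue * margin                               # Step 4: compute value for each drawn outcome
print(value.mean(), np.percentile(value, [5, 95]))     # a distribution of value, not a point estimate
```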
LOS d   Advantages of using simulations in decision making

- Better input estimation: an analyst will usually examine both historical and cross-sectional data to select a proper distribution and its parameters, instead of relying on single best estimates. This results in better quality of inputs.
- Simulations provide a distribution of expected value rather than a point estimate.

Cautions: simulations provide a distribution of expected value, but they do not by themselves provide better estimates, and they do not always lead to better decisions.

LOS e   Common constraints introduced into simulations

Book value constraints: regulatory capital restrictions, negative equity
Earnings and cash flow constraints: imposed internally (analyst's expectations) or imposed externally (loan covenants)
Market value constraints: likelihood of financial distress, indirect bankruptcy costs

LOS f   Issues in using simulations in risk assessment

- Garbage in, garbage out: inputs should be based on analysis and data rather than guesswork
- Inappropriate probability distributions: using distributions that bear no resemblance to the true distribution of an input variable will produce misleading results
- Non-stationary distributions: distributions may change over time due to changes in market structure; either the form of the distribution or its parameters can change
- Dynamic correlations: correlations across input variables can be modeled in a simulation only when they are stable; if they are not, they become far more difficult to model

Risk-adjusted value

Cash flows from simulations are not risk-adjusted and should not be discounted at the risk-free rate.

Eg.
Asset | Risk-adjusted discount rate | Expected value using simulation | σ from simulation
A     | 15%                         | $100                            | 17%
B     | 18%                         | $100                            | 21%

- We have already accounted for B's greater risk by using a higher discount rate
- If we choose A over B on the basis of A's lower standard deviation, we would be penalizing asset B twice
- An investor should be indifferent between the two investments

LOS g   Selecting the appropriate probabilistic approach

Type of risk | Correlated? | Sequential?    | Appropriate approach
Continuous   | Yes         | Doesn't matter | Simulation
Discrete     | Yes         | No             | Scenario analysis
Discrete     | No          | Yes            | Decision tree
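As a follow-on to the LOS e constraints, a minimal sketch of how a simulated distribution can be used to quantify one of them, here the probability of ending with negative equity (all names and numbers are hypothetical, not from the notes):

```python
import numpy as np

# Hypothetical book-value constraint check: probability that equity turns negative
rng = np.random.default_rng(7)
starting_equity = 50.0
earnings = rng.normal(loc=5, scale=30, size=10_000)   # assumed earnings distribution
ending_equity = starting_equity + earnings
prob_negative_equity = (ending_equity < 0).mean()     # feeds the likelihood-of-distress discussion
print(prob_negative_equity)
```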