Residual Analysis

Dr. William Lau
Tel: 3943 8572
williamlau@cuhk.edu.hk

The Financial Modelers' Manifesto, by Emanuel Derman and Paul Wilmott (2009)

Dr. Emanuel Derman was a Managing Director at Goldman Sachs, where he headed the firm's Quantitative Risk Strategies group. Dr. Paul Wilmott founded the Mathematical Finance program at Oxford University.

“We do need models and mathematics – you cannot think about finance and economics without them – but one must never forget that models are not the world... It doesn't fit without cutting off some essential parts. And in cutting off parts for the sake of beauty and precision, models inevitably mask the true risk rather than exposing it. The most important question about any financial model is how wrong it is likely to be, and how useful it is despite its assumptions... There is no right model, because the world changes in response to the ones we use… Markets change and newer models become necessary. Simple clear models with explicit assumptions about small numbers of variables are therefore the best way to leverage your intuition without deluding yourself.”

https://www.uio.no/studier/emner/sv/oekonomi/ECON4135/h09/undervisningsmateriale/FinancialModelersManifesto.pdf

Assumptions about the Random Error, ɛ
- E(ɛ) = 0
- Var(ɛ) is constant
- The ɛ's are normally distributed
- The ɛ's are independent (so all pairs of ɛ's are uncorrelated)

Lesson Outline
1. Regression Residuals
2. Detecting Lack of Fit
3. Detecting Unequal Variances
4. Checking the Normality Assumption
5. Detecting Outliers and Identifying Influential Observations
6. Detecting Residual Correlation: The Durbin-Watson Test

Figure: Actual random error $\varepsilon$ and regression residual $\hat{\varepsilon}$
Table: Data for 20 athletes
Figure: SAS printout for first-order model
Figure: SAS printout for quadratic (second-order) model

Detecting Lack of Fit

Detecting Lack of Fit with Residuals
- Plot the residuals on the vertical axis against each of the independent variables on the horizontal axis.
- Plot the residuals on the vertical axis against the predicted values on the horizontal axis.
- In each plot, look for trends, dramatic changes in variability, and/or more than 5% of the residuals lying more than 2s from 0, where s is the estimated standard deviation of the random error. Any of these patterns indicates a problem with model fit.

First-order model
Figure: SAS plot of residuals for the first-order model
Figure: MINITAB plot of cholesterol data with least squares line

Quadratic (second-order) model
Figure: SAS plot of residuals for the quadratic model
Figure panels: First-order model; Second-order model

Figure: SPSS regression printout for the demand model
Figure: SPSS plot of residuals against price for demand model

Partial Regression Residuals

The set of partial regression residuals for the jth independent variable $x_j$ is calculated as follows:

$\hat{\varepsilon}^* = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_{j-1} x_{j-1} + \hat{\beta}_{j+1} x_{j+1} + \dots + \hat{\beta}_k x_k) = \hat{\varepsilon} + \hat{\beta}_j x_j$

where $\hat{\varepsilon} = y - \hat{y}$ is the usual regression residual.

Partial residuals measure the influence of $x_j$ on the dependent variable y after the effects of the other independent variables have been removed or accounted for.

Figure: SPSS partial residual plot for price
Figure: Graphs of some mathematical functions relating E(y) to p
Figure: SPSS regression printout for demand model with transformed price
Figure: Residual plot of the transformed model

Detecting Unequal Variances

Figure: A plot of residuals for Poisson data

Poisson Probability Distribution

Two Properties of a Poisson Distribution:
1. The probability of an occurrence is the same for any two intervals of equal length.
2. The occurrence or nonoccurrence in any interval is independent of the occurrence or nonoccurrence in any other interval.

A Poisson distributed random variable is often useful in estimating the number of occurrences over a specified interval of time or space. It is a discrete random variable that may assume an infinite sequence of values (x = 0, 1, 2, ...).
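A small numerical sketch may help connect the Poisson distribution to the unequal-variance problem. It relies on a standard fact that the slides do not state: a Poisson variable with mean mu has probability mass function P(X = x) = mu^x e^(-mu) / x!, and its variance equals its mean. The function names and the truncation point `upper` are illustrative choices, not part of the lecture.

```python
import math


def poisson_pmf(x, mu):
    """P(X = x) for a Poisson variable with mean mu (computed on the log scale)."""
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))


def poisson_mean_and_variance(mu, upper=400):
    """Mean and variance from the pmf; the infinite sum is truncated at `upper`."""
    probs = [poisson_pmf(x, mu) for x in range(upper)]
    mean = sum(x * p for x, p in enumerate(probs))
    variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, variance


# Illustrative means only: the variance tracks the mean in every case.
for mu in (2.0, 10.0, 25.0):
    mean, variance = poisson_mean_and_variance(mu)
    print(f"mu={mu:5.1f}  mean={mean:.4f}  variance={variance:.4f}")
```

Because the variance grows with the mean, residuals from a model of Poisson counts tend to spread out as the predicted value increases, which is the fan shape to look for in a residual plot.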
Figure: A plot of residuals for binomial data (proportions or percentages)

Binomial Probability Distribution

Four Properties of a Binomial Experiment
1. The experiment consists of a sequence of n identical trials.
2. Two outcomes, success and failure, are possible on each trial.
3. The probability of a success, denoted by p, does not change from trial to trial.
4. The trials are independent.

Figure: A residual plot of data subject to multiplicative errors
Table: Stabilizing transformations for heteroscedastic responses
Table: Salary and work experience data for 50 social workers
Figure: MINITAB regression printout for second-order model of salary
Figure: MINITAB residual plot for second-order model of salary
Figure: A plot of residuals for data subject to multiplicative errors
Figure: MINITAB regression printout for second-order model of natural log of salary
Figure: MINITAB residual plot for second-order model of natural log of salary
Figure: MINITAB regression printout for first-order model of natural log of salary

First-order model of natural log of salary

Statistical Test for Heteroscedasticity
- Divide the sample data in half and fit the regression model to each half.
- Conduct a two-tailed F-test to compare the estimated variances of the random error terms of the two models.

H0: $\sigma_1^2 = \sigma_2^2$
H1: $\sigma_1^2 \neq \sigma_2^2$

Figure: SAS regression printout for second-order model of salary: Subsample 1 (years of experience < 20)
Figure: SAS regression printout for second-order model of salary: Subsample 2 (years of experience ≥ 20)

Checking the Normality Assumption

Figure: MINITAB histogram of residuals from log model of salary
Figure: MINITAB stem-and-leaf plot of residuals from log model of salary
Figure: MINITAB normal probability plot of residuals from log model of salary

Detecting Outliers and Identifying Influential Observations

Standardized Residuals

A standardized residual is the z-score for a residual. Observations with standardized residuals that exceed 3 in absolute value are considered outliers.
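The outlier rule above can be sketched in a few lines. This is a minimal illustration, assuming a straight-line model and assuming the standardized residual is the residual divided by s = sqrt(SSE / (n - 2)); some packages additionally divide by sqrt(1 - h_i), which this sketch does not do. The data are made up for the example, with one response deliberately shifted.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for a straight-line model."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def standardized_residuals(x, y):
    """Residuals divided by s, the estimated standard deviation of the error."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    s = math.sqrt(sse / (len(x) - 2))  # two estimated parameters in the model
    return [e / s for e in residuals]


# Hypothetical data: a clean linear trend with small alternating noise.
x = list(range(1, 21))
y = [2.0 + 3.0 * xi + (0.5 if xi % 2 == 0 else -0.5) for xi in x]
y[9] += 30.0  # plant a gross error in the observation at x = 10

z = standardized_residuals(x, y)
outliers = [i for i, zi in enumerate(z) if abs(zi) > 3]
print("Indices flagged as outliers:", outliers)
```

A flagged index is only a prompt to investigate the observation. The rule itself does not say whether the value is a recording error or a genuine measurement.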
Possible reasons for outliers:
- Experimental procedures may have malfunctioned.
- Experimenters may have misrecorded the measurements.
- Data may have been input incorrectly into the computer.
- If none of the above applies, the outlier may be an accurate observation.

Table: Data for fast-food sales
Figure: MINITAB regression printout for model of fast-food sales
Figure: MINITAB plot of residuals versus traffic flow
Figure: MINITAB plot of residuals versus city
Figure: MINITAB regression printout for model of fast-food sales with the corrected data point
Figure: MINITAB plot of residuals versus traffic flow for model with corrected data point
Figure: MINITAB plot of residuals versus city for model with corrected data point

Numerical Techniques for Identifying Outlying Influential Observations
- Leverage [OPTIONAL]
- Cook's Distance [OPTIONAL]
- The Jackknife

Leverage [OPTIONAL]

$\hat{y}_i = h_1 y_1 + h_2 y_2 + \dots + h_i y_i + \dots + h_n y_n$

Cook's Distance [OPTIONAL]

$D_i = \frac{(y_i - \hat{y}_i)^2}{(k + 1)\,MSE} \left[ \frac{h_i}{(1 - h_i)^2} \right]$

The Jackknife

A deleted residual, denoted $d_i$, is the difference between the observed response $y_i$ and the predicted value $\hat{y}_{(i)}$ obtained when the data for the ith observation is deleted from the analysis:

$d_i = y_i - \hat{y}_{(i)}$

An observation with an unusually large (in absolute value) deleted residual is considered to have a large influence on the fitted model.

Detecting Residual Correlation: The Durbin-Watson Test

$d_L \le d \le d_U$
$d_L \le (4 - d) \le d_U$

Table: A firm's annual sales revenue
Figure: SAS regression printout for model of annual sales
Figure: SAS plot of residuals for model of annual sales
Table: Reproduction of part of Table E.8 on page 676 in the textbook

Check Your Understanding

Which of the following methods is frequently used to check the normality assumption about the error term?
a) Plotting partial regression residuals against $x_i$.
b) Calculating the VIF for each independent variable.
c) Constructing a histogram of the residuals.
d) Plotting y against x.
e) Conducting the Durbin-Watson test.
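As a supplement to the Durbin-Watson section, the test statistic can be sketched as follows. The slides do not show the formula, so this sketch assumes the standard definition: d is the sum of squared differences between successive residuals divided by the sum of squared residuals. The function names are illustrative, and the critical values `d_lower` and `d_upper` must be read from a Durbin-Watson table for the chosen significance level, sample size, and number of predictors. The decision helper covers only a one-tailed test for positive autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson d for residuals listed in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def positive_autocorrelation_decision(d, d_lower, d_upper):
    """One-tailed decision; d_lower and d_upper come from a published table."""
    if d < d_lower:
        return "reject H0"
    if d <= d_upper:
        return "inconclusive"
    return "do not reject H0"


# Hypothetical residual sequences chosen to show the two extremes.
runs_of_same_sign = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
alternating_sign = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(runs_of_same_sign))  # well below 2: positive correlation
print(durbin_watson(alternating_sign))   # well above 2: negative correlation
```

Values of d near 2 suggest uncorrelated residuals. For a test of negative autocorrelation the same comparison is applied to 4 - d, which matches the second inequality in the Durbin-Watson section.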