Statistical Arbitrage problem set 1 Aloke Mukherjee 1. Given n t-stats all of which are insignificant, what is the probability that at least one will indicate significance? P(at least one significant) = 1 – P(none significant) P(none significant) = P(stat1 insignificant & stat2 insignificant & … & statn insignificant) Assuming independence this is equal to P(stat1 insignificant) * P(stat2 insignificant) * … P(statn insignificant) = .95n The probability of any one stat being insignificant given that the model is insignificant is 95%. (i.e. there is only a 5% chance of a false positive) Therefore, the probability that at least one t-stat will indicate significance is: P(at least one significant) = 1 - .95n Note that this goes to 1 as n goes to infinity. The probability of a false positive increases with the number of models being tested. 2. Given three series guess the process used to generate it. a.txt – j = is constant, and the j are i.i.d. but not normal. The first graph below is of the series xn – xn-1 with the mean subtracted and divided by the standard deviation. Plotting this series for a.txt, the first term seems to be an error since it lies almost thirteen standard deviations from the mean. Removing this term allows us to make some sensible conclusions about the series. For the A series we see that the level of fluctuations remains roughly constant over the series supporting constant . a-diff-resid 2.00E+00 1.50E+00 1.00E+00 5.00E-01 -5.00E-01 -1.00E+00 -1.50E+00 197 193 189 185 181 177 173 169 165 161 157 153 149 145 141 137 133 129 125 121 117 113 109 105 97 101 93 89 85 81 77 73 69 65 61 57 53 49 45 41 37 33 29 25 21 17 9 13 5 1 0.00E+00 The histogram of the series compared with the normal pdf below shows that the distribution of the j is not normal. It is much more peaked at the center. This is visible in the residual time series’ generally small fluctuations. Further confirmation is given by the kurtosis which is greater than three and has a t-statistic greater than two allowing us to reject the normal hypothesis for the residuals of A. b.txt - is constant, and the j are i.i.d. standard normal. b-diff-resid 2.50E+00 2.00E+00 1.50E+00 1.00E+00 5.00E-01 -5.00E-01 -1.00E+00 -1.50E+00 -2.00E+00 -2.50E+00 Again, the first sample in this series alters all the subsequent analysis, it lies greater than five standard deviations from the mean. Discarding this sample produces a series with a constant level of fluctuation, supporting the hypothesis of constant . The normalized histogram fits the standard normal distribution well and the kurtosis lies within 1.4 standard deviations of its expected value. 196 191 186 181 176 171 166 161 156 151 146 141 136 131 126 121 116 111 106 96 101 91 86 81 76 71 66 61 56 51 46 41 36 31 26 21 16 6 11 1 0.00E+00 c.txt - The j are i.i.d. standard normal, but j depends on j in a “smooth” fashion: it has some consistent trend through the data sample but it does not vary randomly. The normalized residuals of C clearly show a reducing level of fluctuations. Stratifying the sample into subgroups of twenty and plotting the standard deviations of each clearly shows a linear downward trend in j. This is shown in the second graph below. 3. IBM daily price and returns. Daily holding period return is defined as the return you would’ve earned if you’d bought the stock at the last valid price before today and held till today. The last valid price can be up to ten days earlier. Holding period return: R = (Today/Last) – 1 This is equivalent to % price change if the last price was from the previous trading day and there are no adjustment factors. This is roughly equivalent to log price change but only to order R. IBM return series - 2000-2005 0.15 0.1 0.05 0 -0.05 -0.1 -0.15 1462 1401 1340 1278 1217 1156 1097 1036 975 913 852 791 732 671 610 548 487 426 367 306 245 183 122 61 1 -0.2 The IBM return series is not normal. Looking at the graph of price changes we can clearly see that volatility has changed over time, that small returns are far more likely and that large swings occur more often than would be possible if returns were normally distributed. These observations are confirmed by a comparison of the distribution with the normal PDF. It clearly shows that IBM’s returns distribution has a higher central peak and fatter tails. The kurtosis is high enough to allow the normal hypothesis for IBM’s returns to be safely rejected. The IBM return series does not seem to be i.i.d. The first correlogram below shows that at some lags the IBM return autocorrelation exceeds the 95% interval. For comparison, correlograms were computed for three series of i.i.d normal random variables of the same length as the IBM return series. Only at one lag in one of the series does the autocorrelation exceed the 95% interval. By comparison the IBM return series exceeds the 95% interval three times – at lags of one, four and eighteen days. The third graph below compares the correlograms for the first and second halves of the series. The first half of the series closely mirrors the autocorrelation for the entire series. Interestingly, the second half of the series does not reject the i.i.d. hypothesis. This shows that the autocorrelation is not necessarily stationary for this series. 4. Fit a mean-reverting model to the federal funds rate over the past thirty years. The data from CRSP includes dates with no data. These were assumed to be the same as the previous days data. To fit the model xj = α+βxj−1+σεj we regress xj on xj-1 using Excel’s regression function. This yields the following statistics: SUMMARY OUTPUT Regression Statistics Multiple R 0.994669072 R Square 0.989366562 Adjusted R Square 0.9893652 Standard Error 0.379646306 Observations 7810 ANOVA df Regression Residual Total Intercept X Variable 1 SS 104708.4384 1125.37733 105833.8157 MS 104708.4384 0.144131318 Coefficients Standard Error 0.035284897 0.008838429 0.994669072 0.00116699 t Stat 3.992213677 852.3376296 1 7808 7809 From the regression we have the following parameter values: α = 0.0353 β = 0.9947 σ = 0.3796 If this were a O-U type model (xj = κ (θ - xj−1) + σεj), this would yield parameters: κ (speed of mean-reversion) = 1 – β = 0.00533 (1/κ= 187.5 days) θ (long-run average) = α/κ = 6.62% This model does not fit the data well since the residuals are far from the standard normal distribution. The residuals exhibit time-varying volatility, variations far beyond what could occur under a normal distribution (one greater than twenty standard deviations) and excess kurtosis. This is illustrated below in a graph of the residuals vs. time and a histogram of the residual values compared with the standard normal pdf. distribution of residuals from fed fund regression 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -5.25 -4.75 -4.25 -3.75 -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 normalized histogram 0.75 1.25 1.75 2.25 2.75 3.25 3.75 standard normal pdf 4.25 4.75 5.25