Improving Tests of Abnormal Returns by Bootstrapping the Multivariate Regression Model with Event Parameters

Scott E. Hein
Texas Tech University

Peter Westfall
Texas Tech University

April 2004

We would like to thank the editors, Rene Garcia and Eric Renault, two anonymous referees and an associate editor, as well as Will Ashman, Keldon Bauer, Naomi Boyd, Phil English, Lisa Kramer, Henry Oppenheimer, Bill Maxwell, Ted Moore, Ramesh Rao, Jon Scott, Shawn Strothers, Kate Wilkerson, Eric Walden and Zhaohui Zhang for helpful comments. We would also like to thank Jonathan Stewart and Brad Ewing for providing data used in this analysis. Corresponding author and requests for reprints: Scott Hein, Jerry S. Rawls College of Business, Texas Tech University, Lubbock, TX 79409-2101, (806) 742-3433, odhen@ba.ttu.edu.

Abstract

Parametric dummy variable-based tests for event studies using multivariate regression are not robust to nonnormality of the residual, even for arbitrarily large sample sizes. Bootstrap alternatives are described, investigated, and compared for cases where there are nonnormalities, cross-sectional dependencies, and time series dependencies. Independent bootstrapping of residual vectors from the multivariate regression model controls type I error rates in the presence of cross-sectional correlation and, surprisingly, even in the presence of time-series dependence structures. The proposed methods not only improve upon parametric methods, but also allow development of new and powerful event study tests for which there is no parametric counterpart.

Key words: Event study, Event parameter estimation, Cross-sectional correlation, Bootstrap, Significance level, Simulation.

I. Introduction

Researchers have recently made extensive use of the multivariate regression model (MVRM), with a dummy variable representing a significant event date, to test the significance of many different events for both financial asset prices and interest rates. Binder (1985a, b) and MacKinlay (1997) provide surveys of event studies in finance and economics. Binder (1985a, b) argues that a main advantage of employing an MVRM over examining cumulative abnormal returns (CAR) lies in the fact that joint linear hypothesis tests can be carried out easily under the former. This has been especially useful in testing whether a regulatory event has a significant effect on a sample of firms.¹ The MVRM approach relies on traditional t-statistics and F-tests to test statistical significance, especially of joint hypotheses, whereas testing joint linear hypotheses is not typically done under the CAR approach. Rather, the focus under the CAR approach has been on the impact of events on individual firms.

Much research has been devoted to distributional concerns in the CAR analysis, since it is widely recognized that excess returns are generally not normally distributed. This violation is typically associated with fat tails, but could also be attributed to other nonnormal characteristics such as skewness. It is understood that such problems will cause significant statistical inference problems (Brown & Warner, 1980, 1985). Corrado (1989) suggests a non-parametric rank test for event studies in the face of distributional problems. More recently, Lyon, Barber and Tsai (1999) suggest a bootstrap version of a skewness-adjusted t-statistic to control for the skewness bias in their tests of long-run abnormal returns in a CAR setting. There has been less analysis of distributional violations in the dummy variable/event parameter estimation setting.
The fact that tests on event parameter coefficients in the MVRM can be seriously biased has gone largely unappreciated in the literature. Three exceptions to this are Chou (2001), Kramer (2001) and Hein, Westfall and Zhang (2001, HWZ hereafter). These papers find that violations of normality assumptions in the MVRM setting with event parameter estimation do indeed have a significant influence on the statistical analysis of event significance. As an example to show the need for bootstrapping, we consider the event tests of Stewart and Hein (2002), who examined the December 4, 1990 announcement that the Federal Reserve was eliminating the reserve requirement on non-personal time deposits. There was concern as to whether the shock continued for some days after the announcement; hence event analysis was performed for the days after the announcement. Stewart and Hein only show the univariate tests of the events, but it is also desirable to find a multivariate overall summary to simplify interpretation. We used the same data to test the hypothesis that the reserve requirement announcement significantly affected the stocks of the largest banks in the country. Using the largest five banks (Citicorp, Bank America, Nations, Chemical, and First Chicago), the significance level of the multivariate normality-assuming test was p=0.039, while the HWZ bootstrap shows p=0.078 (95% Confidence interval: 0.073-0.083, based on 10,000 samples). A researcher using traditional MVRM procedures would have concluded that the event was significant for the largest banks in the country at the 5% level, whereas the bootstrap indicates that this significance is exaggerated. Chou, Kramer, and HWZ provide different bootstrapping approaches aimed at rectifying the bias in the event study tests: Chou and HWZ are similar in that they bootstrap the raw data to estimate the distribution of the test statistics, hence such methods are called “data-based bootstrap methods,” hereafter. 
On the other hand, Kramer bootstraps the test statistics themselves to estimate their distribution, hereafter called a “test statistic-based bootstrap method”.

The current paper continues these research efforts by: (i) offering further analytical justifications for these bootstrap methods, (ii) comparing these different bootstrap methods in cases of not only non-normality, but also time series and cross-sectional dependence structures, and (iii) developing alternative, new methods that combine elements of the various proposals. We show that the two basic types of bootstraps work well in the absence of cross-sectional correlation, even though the data may exhibit time-series dependencies such as AR, ARCH and GARCH effects. However, the test statistic-based bootstrap procedure results in grossly inflated type I error rates in the presence of cross-sectional correlation (see also Bernard, 1987, for other problems caused by cross-sectional correlation). The point is subtle but important: test statistic-based bootstrap methods are reasonable for event studies with multiple events at independent time points, but should not be used for clustered event studies. Kramer noted this concern in her dissertation (Kramer, 1998); however, this caution was inadvertently left out of the published article (Kramer, 2001).

Another serious concern with test statistic-based bootstraps is that they cannot be used at all when the number of firms is small. In some cases, event tests on single firms are desirable. Studies of dividend initiation or resumption, like Boehme and Sorescu (2002), illustrate such an event study. While Boehme and Sorescu examine CARs, the data-based bootstrap not only offers the flexibility to accommodate cross-sectional correlation, but also the ability to test for events when there are a small number of firms, or just a single firm, as in this case.
On the other hand, the Kramer procedure of summing t-statistics has potentially more power than the HWZ procedure, which uses the less focused multivariate test of Binder (1985b) as its base. In the final analysis, we recommend a new procedure, where the summed t-statistics are used in conjunction with the data-based bootstrap. This new method has both good power and control of type I error rates in the presence of cross-sectional correlation, as well as under time series dependence structures, and is therefore our recommended procedure.

This paper is organized as follows: in section II we show that the classical parametric test is inconsistent, with asymptotic type I error levels depending on the kurtosis of the excess returns. In section III we motivate and present the data-based and test statistic-based bootstrap tests. Theoretical results concerning effects of cross-sectional correlation are given in section IV, and the various methods are compared theoretically and via simulation using a variety of models for financial data in section V. Conclusions are given in section VI.

II. The multivariate regression model and asymptotic inconsistency

a. The model and event tests

Consider the traditional single market factor model

$$R_t = \beta_0 + \beta_1 R_{mt} + \varepsilon_t, \quad t = 1, \dots, T; \tag{1}$$

where $R_t$ is the return on a specific firm or portfolio of stocks, $R_{mt}$ is the return on the market, and $\varepsilon_t$ represents the excess return. Additional predictors may be included in (1); these are excluded to simplify the exposition. In matrix form,

$$R = X\beta + \varepsilon. \tag{2}$$

If the event time is $t_0$, define the indicator vector $D$, having all elements 0 except for time $t_0$, where the value is 1. The event test is based on the model

$$R = X\beta + D\gamma + \varepsilon, \tag{3}$$

and the event test is a test of $H_0$: $\gamma = 0$, tested using the simple t-test, as reported in any standard software that performs OLS analysis. The numerator of the t-statistic is the OLS estimate of $\gamma$, which is simply the deleted residual of model (1) for time $t_0$.
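The claim that the OLS estimate of the event parameter equals the deleted residual can be checked numerically. The following is a minimal sketch on simulated data; the sample size, event index, coefficients, and error distribution are all hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
T, t0 = 100, 40                       # hypothetical sample size and event index
Rm = rng.normal(0.0, 1.0, T)          # simulated market return
R = 0.1 + 1.2 * Rm + rng.standard_t(4, T)   # simulated return, heavy-tailed noise

X = np.column_stack([np.ones(T), Rm])  # intercept and market factor, as in (1)
D = np.zeros(T)
D[t0] = 1.0                            # event indicator, as in (3)
G = np.column_stack([X, D])

# OLS fit of model (3); the last coefficient is the event parameter estimate.
coef, *_ = np.linalg.lstsq(G, R, rcond=None)
gamma_hat = coef[2]

# Deleted residual: fit model (1) without time t0, then predict time t0.
keep = np.arange(T) != t0
beta_del, *_ = np.linalg.lstsq(X[keep], R[keep], rcond=None)
deleted_residual = R[t0] - X[t0] @ beta_del

print(gamma_hat, deleted_residual)     # the two numbers agree up to rounding
```

Because the dummy variable fits the event-day observation exactly, the residual of model (3) at time $t_0$ is zero, and the estimate of $\gamma$ is a single prediction error rather than an average over many observations.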
The MVRM models may be expressed as the simultaneous equations

$$R_i = X\beta_i + D\gamma_i + \varepsilon_i, \quad i = 1, \dots, g \text{ (firms or portfolios)}. \tag{4}$$

Observations within row $t$ of $\varepsilon = [\varepsilon_1 \mid \dots \mid \varepsilon_g]$ are cross-sectionally correlated. Row vectors of $\varepsilon$ are assumed to be independent and identically distributed (i.i.d.) in the classical MVRM model, and this is also a major assumption of the resampling method. However, we find that the i.i.d.-assuming bootstrap method is marginally robust to non-i.i.d. data.

The clustered event null hypothesis is the multivariate (composite) test

$$H_0: [\gamma_1 \mid \dots \mid \gamma_g] = [0 \mid \dots \mid 0]. \tag{5}$$

Traditional methodologies for testing (5) with MVRM and SUR models are discussed by Binder (1985a, b), Schipper and Thompson (1985), and Karafiath (1988). Cross-sectional correlations are allowed for elements within a given row of the residual matrix; a key feature of MVRM is that such cross-sectional correlations are incorporated in the test of (5). The test of (5) is computed easily and automatically with standard statistical software packages, using exact (under normality) F-tests.

The hypothesis (5) tested for a given event may be restated in a more general sense, amenable to semi-parametric bootstrap-based testing, as follows:

$H_0$: The distribution of the abnormal return (or vector of abnormal returns in the case of MVRM) for the event time $t_0$ is identical to the distribution of the abnormal returns (or vectors of abnormal returns) for a given set of times.

The researcher determines the definition of the "given set of times". It may include all times other than the event, all times other than a collection of event times, or a collection of times from another time horizon. Under the assumption of normally distributed abnormal returns, Schipper and Thompson (1985) give an exact test of $H_0$. However, as with all tests, the true type I error level differs from the nominal (usually 0.05) level under non-normality.

b.
Asymptotic inconsistency It is a common conception that type I error rates approach the nominal α-levels (typically 0.05) with larger sample sizes, because of the central limit theorem. However, such is not the case for dummy variable based event study tests. To see why, consider model (3). The event test is a test of H0: γ = 0, tested using the simple t-test. The numerator of the t-statistic is the OLS estimate of γ, which is simply the deleted residual of model (2) for time t0. The argument for large-sample validity of a regression test, despite nonnormally distributed residuals, requires large-sample normality of the parameter estimate. Such a statement is made using the central limit theorem, provided that the parameter estimate can, in some sense, be viewed as an "average," or at least as a weighted average that is not dominated by one or a few weights. In the case of event tests under our concern, the estimated event parameter does not become normally distributed as the sample size (T) increases, since it is just the deleted residual itself, and not an (weighted) average of residuals, as required by the central limit theorem. The distribution of the estimated event parameter comes closer to the distribution of the true residual at time t0, but unless this distribution is truly normal, the distribution of the estimated event parameter will not converge to normal. For a more formal view of this problem, consider the development of the asymptotic normality of the OLS vector presented in Greene (1990, 295-296). The requirement that the parameter estimate be an appropriate type of weighted average is equivalently formulated by the requirement that (1/T)(G'G) converges (as T tends to infinity) to a non-singular matrix, where G = [X | D]. 
However, in the case of event studies, the convergence requirement fails. Since $(1/T)(D'X) = (1/T)[1 \mid R_{mt_0}]$ tends to $[0 \mid 0]$ and $(1/T)(D'D) = (1/T)(1)$ tends to $(0)$ as $T$ tends to infinity, the matrix $(1/T)(G'G)$ tends to a matrix whose third row and third column (assuming a single-factor market model as in (1)) are composed entirely of zeros, and is therefore not invertible. Thus, the conditions needed for the asymptotic normality of the estimates do not hold for this type of event model test, and convergence of the type I errors to their nominal levels cannot be assumed. This failure to converge to normality greatly affects the probability of finding a "significant" event, even for large $T$, as shown below.

c. Kurtosis and convergence

Suppose that the data consist of known true residuals $[\varepsilon_{t1}, \dots, \varepsilon_{tg}]$, as would be the case when $T$ is so large that the parameters can be estimated with essentially no error. Suppose also, without loss of generality, that the variances are identically 1.0. If they are not, simply divide each $\varepsilon_{tj}$ by its standard deviation, whose value can be assumed to be known for large $T$. Assume also that the variables are i.i.d. Suppose $t_0 = 1$ is the event day in question. In this case, the parametric test reduces to the statistic

$$S = \sum_{i=1}^{g} \varepsilon_{1i}^2,$$

with $H_0$ rejected when $S \ge \chi^2_{g,1-\alpha}$, where $\chi^2_{g,1-\alpha}$ denotes the $(1-\alpha)$ quantile of the chi-squared distribution with $g$ degrees of freedom. This test is exact (i.e., the type I error probability is exactly $\alpha$) when $\varepsilon_{1i}$ has a normal distribution.

Now, suppose we apply the normality-assuming test when $\varepsilon_{1i}$ has a non-normal distribution with excess kurtosis $\kappa$. Then by the central limit theorem,

$$g^{-1/2}(S - g)/(\kappa + 2)^{1/2} \to_d N(0,1) \quad \text{and} \quad g^{-1/2}(\chi^2_{g,1-\alpha} - g)/2^{1/2} \to Z_{1-\alpha},$$

where $Z_{1-\alpha} = \Phi^{-1}(1-\alpha)$ denotes the $1-\alpha$ quantile of the standard normal distribution.
Then

$$P(S \ge \chi^2_{g,1-\alpha}) = P\left( \frac{g^{-1/2}(S - g)}{(\kappa + 2)^{1/2}} \ge \frac{g^{-1/2}(\chi^2_{g,1-\alpha} - g)}{(\kappa + 2)^{1/2}} \right) \to P\left( Z \ge Z_{1-\alpha}\{2/(\kappa + 2)\}^{1/2} \right) = 1 - \Phi\left( Z_{1-\alpha}\{2/(\kappa + 2)\}^{1/2} \right).$$

For outlier-prone data, as are common in financial markets, $\kappa > 0$, and hence $Z_{1-\alpha}\{2/(\kappa+2)\}^{1/2} < Z_{1-\alpha}$, so the true type I error level exceeds $\alpha$, with greater excess for larger $\kappa$, in our asymptotic framework. Conversely, if the error distribution is less outlier-prone than normal, so that $\kappa < 0$, the usual parametric test is too conservative.

III. Alternative Bootstrap Solutions

One of the major benefits of using the bootstrap is that the researcher need not specify any distributions at all. It would be presumptuous to suppose that a single distribution applies equally across all securities, as minor shocks are amplified to a greater extent for some market sectors than others, resulting in differential kurtosis across sectors. It would be equally presumptuous to assume that there is a single distribution that applies across all historic time regimes. First, the data-based bootstrap of HWZ and Chou is described, and then the test statistic-based bootstrap of Kramer is described.

a. The Data-based Bootstrap

We wish to estimate quantities such as $p = P(S \ge s \mid F_{S0})$ for event test statistics $S$ used to test $H_0$, where the probability is calculated under the true null distribution function $F_{S0}$ of $S$. When $s$ is an observed (fixed) value of the random variable $S$, $p$ is the “true p-value” of the test. Under the multivariate normal assumption, the distribution $F_{S0}$ is simply related to the ordinary ANOVA F distribution and is known exactly, as discussed by Schipper and Thompson (1985). However, when the distributions are non-normal, the distribution $F_{S0}$ depends on non-normal characteristics such as kurtosis, and is unknown. We follow the motivation of the bootstrap using the “plug-in principle” (Efron and Tibshirani, 1998, p.
35), estimating $p = P(S \ge s \mid F_{S0})$ by $\hat{p} = P(S \ge s \mid \hat{F}_{S0})$, for a suitable estimate $\hat{F}_{S0}$ of the null distribution of $S$.

Using the notation of Schipper and Thompson (1985, equation 3), suppose the event test statistic is given by

$$S = \hat{\Gamma}' A' [A \{X'(\hat{\Sigma}^{-1} \otimes I) X\}^{-1} A']^{-1} A \hat{\Gamma},$$

with covariance matrix $\hat{\Sigma}$ estimated as in their equation (4). Then $S$ depends on the data $R_i$ in (4) above only through the least squares estimates of the event parameters and the sample residual covariance matrix. Noting that the sample residual covariance matrix is a function of the sample OLS residuals $e_i$, the sample covariance matrix is identical, no matter whether it is calculated from the data $R_i$ or the true residuals $\varepsilon_i$:

$$e_i = (I - G(G'G)^{-1}G')R_i = (I - G(G'G)^{-1}G')\varepsilon_i.$$

Similarly, under the null hypothesis, $\hat{\gamma}_i$ is identical, no matter whether it is calculated from the data $R_i$ or the true residuals $\varepsilon_i$:

$$\hat{\gamma}_i = (0\ 0\ 1)(G'G)^{-1}G'R_i = \gamma_i + (0\ 0\ 1)(G'G)^{-1}G'\varepsilon_i = (0\ 0\ 1)(G'G)^{-1}G'\varepsilon_i$$

under the null hypothesis. Thus, under $H_0$: $\gamma_i = 0$, $i = 1, \dots, g$, the test statistic is identical, no matter whether it is computed using the $R_i$ or the $\varepsilon_i$: $S(R_1, \dots, R_g) = S(\varepsilon_1, \dots, \varepsilon_g)$ under $H_0$, and the null distribution of $S$ is completely determined by the multivariate distribution $F_\varepsilon$ of the row vectors of $\varepsilon = [\varepsilon_1 \mid \dots \mid \varepsilon_g]$. Hence

$$p = P(S(R_1, \dots, R_g) \ge s \mid F_{S0}) = P(S(\varepsilon_1, \dots, \varepsilon_g) \ge s \mid F_\varepsilon).$$

Applying the plug-in principle, we estimate $p = P(S \ge s \mid F_\varepsilon)$ as $\hat{p} = P(S \ge s \mid \hat{F}_\varepsilon)$.² The plug-in estimate is conveniently evaluated by using the bootstrap to estimate the distribution $F_\varepsilon$. Since we do not know the true residual vectors, we estimate $F_\varepsilon$ using the empirical distribution of the sample residual vectors (see Freedman and Peters, 1984, for the univariate regression case; see Rocke, 1989, for the multivariate case).

Below is the bootstrap algorithm as presented in HWZ:

1. Fit model (4). Obtain the usual statistic $S$ for testing $H_0$ using the traditional method (assuming normality).
Obtain also the $T \times g$ sample residual matrix $e = [e_1 \mid \dots \mid e_g]$.
2. Exclude the row corresponding to $D = 1$ from $e$, leaving the $(T-1) \times g$ matrix $e^-$.
3. Sample $T$ row vectors, one at a time and with replacement, from $e^-$. This gives a $T \times g$ matrix $[R_1^* \mid \dots \mid R_g^*]$.
4. Fit the MVRM model $R_i^* = X\beta_i + D\gamma_i + \varepsilon_i$, $i = 1, \dots, g$, and obtain the test statistic $S^*$ using the same technique used to obtain the statistic $S$ from the original sample.
5. Repeat steps 3 and 4 NBOOT times. The bootstrap p-value of the test is the proportion of the NBOOT samples yielding an $S^*$ statistic that is greater than or equal to the original $S$ statistic from step 1.

In steps 1 and 2, the researcher finds residual vectors that are used to estimate the (multivariate) distribution of the abnormal returns. The residual for the event day must be excluded because it is identically zero, and residuals of exactly zero are not anticipated under ordinary circumstances.³ In steps 3 and 4 the researcher creates the bootstrap response variable. Because the null distribution of the test statistic depends only upon the residuals, as described above, the bootstrap response variable need not contain the $Xb_i$ component.

The bootstrap estimate $\hat{p}$ found in step 5 has a Monte Carlo standard error equal to $\{\hat{p}(1 - \hat{p})/\mathrm{NBOOT}\}^{1/2}$. Thus, choosing a sufficiently large value of NBOOT can control the Monte Carlo error; we recommend choosing NBOOT as large as computing resources will comfortably allow, so as to estimate the p-values with sufficient precision. While it is possible to reduce the number of bootstraps if a simple "reject/accept" decision is all that is required (Davidson and MacKinnon, 2000), ultimately one should report p-values in scientific reports.
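The five steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the HWZ implementation: the function names are hypothetical, the data are simulated, and a Wald-type quadratic form in the event parameter estimates stands in for the statistic $S$. Any statistic computed the same way on the original and resampled data could be substituted.

```python
import numpy as np

def event_statistic(Y, X, D):
    """Wald-type quadratic form for H0: all event parameters are zero.

    Assumption: this is a stand-in for the statistic S in the text.
    Returns the statistic and the T x g OLS residual matrix.
    """
    T = Y.shape[0]
    G = np.column_stack([X, D])
    GtG_inv = np.linalg.inv(G.T @ G)
    B = GtG_inv @ G.T @ Y                   # coefficient matrix, one column per firm
    E = Y - G @ B                           # residual matrix
    gamma_hat = B[-1]                       # event parameter estimates
    Sigma_hat = E.T @ E / (T - G.shape[1])  # residual covariance estimate
    S = gamma_hat @ np.linalg.solve(Sigma_hat, gamma_hat) / GtG_inv[-1, -1]
    return S, E

def data_based_bootstrap(R, X, D, nboot=499, seed=0):
    """Return (S, bootstrap p-value, Monte Carlo standard error)."""
    rng = np.random.default_rng(seed)
    T = R.shape[0]
    S_obs, E = event_statistic(R, X, D)          # step 1
    E_minus = E[D == 0]                          # step 2: drop the event row
    exceed = 0
    for _ in range(nboot):                       # step 5: repeat steps 3 and 4
        rows = rng.integers(0, E_minus.shape[0], size=T)   # step 3
        S_star, _ = event_statistic(E_minus[rows], X, D)   # step 4
        exceed += int(S_star >= S_obs)
    p_hat = exceed / nboot
    mc_se = np.sqrt(p_hat * (1.0 - p_hat) / nboot)
    return S_obs, p_hat, mc_se

# Simulated example; every number below is hypothetical.
rng = np.random.default_rng(1)
T, g, t0 = 100, 3, 0
X = np.column_stack([np.ones(T), rng.normal(size=T)])
D = np.zeros(T)
D[t0] = 1.0
eps = rng.standard_t(4, size=(T, g))             # heavy-tailed abnormal returns
beta = np.array([[0.0, 0.0, 0.0], [1.0, 0.8, 1.2]])
R_null = X @ beta + eps                          # no event effect
R_event = R_null.copy()
R_event[t0] += 20.0                              # large common event effect

print(data_based_bootstrap(R_null, X, D))
print(data_based_bootstrap(R_event, X, D))
```

The resampled response is built from residual rows only, reflecting the argument that the null distribution of the statistic does not depend on the fitted regression component. In this sketch NBOOT is kept small for speed; the Monte Carlo standard error returned by the function shows how much precision that costs.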
So that other researchers may verify results, it is best that reported statistics, including p-values, not contain random noise; we therefore recommend reducing such unnecessary noise in published reports by employing as large a number of resampled data sets as time constraints allow. With the easy availability of modern high-speed computing, one can easily avoid unnecessary Monte Carlo error (see also Andrews and Buchinsky, 1998).

b. The Test Statistic-Based Bootstrap

Kramer (2001) recently proposed a different bootstrap approach for event studies, considering the test statistic

$$Z = \sum_{i=1}^{g} t_i / (g^{1/2} s_t)$$

as a measure of the effect of the event, where $t_i$ is the t-statistic from the univariate dummy-variable-based regression model for firm $i$, and $s_t$ is the sample standard deviation of the $g$ t-statistics $t_1, \dots, t_g$. The p-value of the test is obtained by bootstrapping as follows: (i) create a pseudo-population of t-statistics $t_i^* = t_i - \bar{t}$ (where $\bar{t} = \sum t_i / g$), reflecting the null hypothesis case where the true mean of the t-statistics is zero; (ii) sample $g$ values with replacement from the pseudo-population and compute $Z^*$ from these pseudo-values; (iii) repeat (ii) NBOOT times, obtaining $Z_1^*, \dots, Z_{\mathrm{NBOOT}}^*$. The p-value for the test is then $2 \times \min(p_U, p_L)$, where $p_L$ is the proportion of the NBOOT bootstrap samples yielding $Z_i^* \le Z$ and $p_U$ is the proportion yielding $Z_i^* \ge Z$.

In other words, Kramer suggests bootstrapping the test statistic $Z$ under the assumption that the statistics are independent. Lyon, Barber and Tsai (1999) also suggest bootstrapping the test statistic, but they do so in a cumulative abnormal return setting, as opposed to the dummy variable setting.

c. Type I error rates of bootstrap procedures under independence

Since both test statistic-based and data-based bootstraps are justified only asymptotically, simulation is required to evaluate their finite-sample operating characteristics.
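Before turning to the simulations, the test statistic-based bootstrap of section III.b can be made concrete with a minimal sketch of steps (i) to (iii). The function names and the simulated t-statistics are hypothetical, and the sketch takes the t-statistics as given rather than computing them from regressions.

```python
import numpy as np

def kramer_z(t):
    """Z = sum(t_i) / (sqrt(g) * s_t), with s_t the sample standard deviation."""
    t = np.asarray(t, dtype=float)
    return t.sum() / (np.sqrt(t.size) * t.std(ddof=1))

def kramer_bootstrap_pvalue(t, nboot=999, seed=0):
    """Return (Z, two-sided bootstrap p-value) from g t-statistics."""
    rng = np.random.default_rng(seed)
    t = np.asarray(t, dtype=float)
    g = t.size
    z_obs = kramer_z(t)
    centered = t - t.mean()              # (i) pseudo-population with mean zero
    z_star = np.empty(nboot)
    for b in range(nboot):               # (iii) repeat (ii) NBOOT times
        # (ii) resample g pseudo-values with replacement and recompute Z.
        # Note: for very small g a resample can have zero spread, in which
        # case Z* is not finite; g = 30 below makes this practically impossible.
        sample = rng.choice(centered, size=g, replace=True)
        z_star[b] = kramer_z(sample)
    p_lower = np.mean(z_star <= z_obs)
    p_upper = np.mean(z_star >= z_obs)
    return z_obs, 2.0 * min(p_lower, p_upper)

# Hypothetical t-statistics for g = 30 independent firms.
rng = np.random.default_rng(2)
t_null = rng.standard_normal(30)         # no event effect
t_event = t_null + 3.0                   # common shift in every t-statistic

print(kramer_bootstrap_pvalue(t_null))
print(kramer_bootstrap_pvalue(t_event))
```

Because only the $g$ t-statistics are resampled, and they are resampled independently, nothing in this procedure can reflect correlation among firms on the same day.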
Figure 1 shows the results of a simulation to compare traditional normality-assuming F-tests, the data-based bootstrap, and the test statistic-based bootstrap. Data were simulated from the market model, where the distribution of the market return was chosen as normal,⁴ and the abnormal returns were generated independently from distributions varying from extremely heavy-tailed to normal. To model heaviness of the tail, we used t-distributions with 1, 2, 4, and 8 degrees of freedom (T1, T2, T4, and T8)⁵ to simulate extremely heavy to lighter tails, and we included the normal distribution for completeness. The bootstrapped p-values are calculated using 999 samples, and the number of replications of the bootstrap is 10,000 in all cases.⁶ The simulated type I error rates for nominal α=0.05 level tests are shown in Figure 1.

We note the following from these simulations. First, as the number of firms (portfolios), g, increases, the actual type I error rate for the traditional method becomes larger (see panels (a) and (d)). In contrast, both the data-based bootstrap and the test statistic-based bootstrap maintain the type I error rate better for larger g. This finding is important, as there are many examples in the finance literature where the event is supposed to affect a large number (g) of firms. (See, for example, Stewart and Hein (2002).) Second, the closer the underlying distribution of the abnormal returns is to the normal distribution (i.e., the less heavy-tailed it is), the closer the actual type I error rate is to 0.05 for the traditional method, as expected. Third, even with larger sample sizes, the true level of the traditional test does not converge to 0.05 (compare (a) with (d)), confirming the argument that the central limit theorem does not apply in this test setting; on the other hand, the data-based bootstrap does appear to converge asymptotically (compare (b) with (e)).
Finally, we note that the test statistic-based bootstrap cannot be used at all when g=1, because it is undefined. When g=2, the algorithm produces bootstrap t-statistics that are uniformly equal to 0, hence the bootstrap critical values also are zero, meaning that the event hypothesis is always rejected. Thus, the test statistic-based bootstrap cannot be used for g=1 or g=2. While unusual, the case g=1 still may be desired to assess event effects on individual firms, such as dividend announcements. On the other hand, the test statistic-based bootstrap controls type I errors even better with g ≥ 4 than does the data-based bootstrap.

It should be noted that this simulation study considers independent returns only, and that it does not consider the power of the tests. We next consider robustness to dependence structures and power.

IV. The Effect of Cross-Sectional Correlation

While the data-based bootstrap is valid under cross-sectional correlation, as occurs in clustered studies, the test statistic-based bootstrap is not. A simple structural model will shed light on the problem. Suppose again that the data consist only of true residual vectors, again appealing to the large $T$ case, and that the residuals are dependent with $\varepsilon_{ti} = \xi_t + \delta_{ti}$, where $(\xi_t, \delta_{t1}, \dots, \delta_{tg})$ are independent, mean zero, normal random variables with $\mathrm{Var}(\xi_t) \equiv \sigma_\xi^2$ and $\mathrm{Var}(\delta_{ti}) \equiv \sigma_\delta^2$. This model is sensible if the $g$ multivariate responses at time $t$ share an “industry effect” $\xi_t$, in which case there will be a commonality to their excess returns.

In this model, the excess returns $\{\varepsilon_{t1}, \dots, \varepsilon_{tg}\}$ at time $t$ are dependent, with variance $\mathrm{Var}(\varepsilon_{ti}) = \sigma_\xi^2 + \sigma_\delta^2$, and with common cross-sectional correlation $\rho = \mathrm{Corr}(\varepsilon_{ti}, \varepsilon_{tj}) = \sigma_\xi^2/(\sigma_\xi^2 + \sigma_\delta^2)$ ($\rho$ is also known as the intraclass correlation coefficient). Assuming large $T$, the variances $\mathrm{Var}(\varepsilon_{ti}) = \sigma_\xi^2 + \sigma_\delta^2$ are estimated precisely, so the t-statistics for testing the null hypothesis that day $t = 1$ is a non-event day are $t_i \cong \varepsilon_{1i}/(\sigma_\xi^2 + \sigma_\delta^2)^{1/2}$.
The sample variance of the t-statistics is

$$s_t^2 = \frac{1}{g-1} \sum_{i=1}^{g} (t_i - \bar{t})^2 \cong \frac{1}{(g-1)(\sigma_\xi^2 + \sigma_\delta^2)} \sum_{i=1}^{g} (\delta_{1i} - \bar{\delta}_1)^2 \cong \frac{\sigma_\delta^2}{\sigma_\xi^2 + \sigma_\delta^2} = 1 - \rho$$

for large $g$. Thus, for large $g$ and $T$, Kramer’s test statistic is approximated by

$$Z \cong g^{-1/2} \sum_{i=1}^{g} \varepsilon_{1i} / \{(\sigma_\xi^2 + \sigma_\delta^2)(1 - \rho)\}^{1/2}.$$

Now, the “pseudo-population” of centered statistics $\{t_i - \bar{t}\} \cong \{(\delta_{1i} - \bar{\delta}_1)/(\sigma_\xi^2 + \sigma_\delta^2)^{1/2}\}$ approximates a normal distribution with mean zero and variance $(1-\rho)$ for large $g$; hence, the $Z^*$ statistics obtained by sampling from this pseudo-population will approximate a normal distribution with mean zero and variance one (the scale factor $(1-\rho)$ vanishes because of the standardization by $s_t$). This implies that the critical values of the bootstrap procedure are approximately $\pm Z_{1-\alpha/2}$, and that one rejects the null hypothesis that day $t = 1$ is a non-event day approximately when $|Z| \ge Z_{1-\alpha/2}$. Noting that

$$\mathrm{Var}(Z) \cong \frac{1}{1-\rho} \mathrm{Var}\left( g^{-1/2} \sum_{i=1}^{g} \varepsilon_{1i}/(\sigma_\xi^2 + \sigma_\delta^2)^{1/2} \right) = \frac{1 + (g-1)\rho}{1 - \rho}$$

(e.g., Johnson and Wichern, 1998, p. 470), the method has type I error level approximately equal to $\alpha$ when $\rho = 0$, since the variance of the $Z$ score is unity in that case. However, in general the true type I error level is approximately equal to

$$2\left(1 - \Phi\left( Z_{1-\alpha/2} / \{(1 + (g-1)\rho)/(1-\rho)\}^{1/2} \right)\right).$$

Thus, under the assumed model with $\rho \ne 0$, the true type I error level of the Kramer test approaches 1.0 as $g$ increases. In a nutshell, the problem is that natural structural effects $\xi_t$ resulting in cross-sectional correlation are incorrectly determined to be effects of an event, which causes type I error rates to approach 1.0 for large $g$.

V. Comparison of Methods

In this section we begin by developing new alternative test procedures resulting from combining aspects of the alternative bootstrap approaches. We then report simulation results that compare these approaches in a number of different dimensions.

a.
Old and new bootstrap methods

Kramer's summed Z statistic is more powerful than the general multivariate test statistic when the event effects are all in the same direction. However, use of the standard deviation of the t-statistics for normalization of Z can decrease power in cases where g is small, as random noise incurred by estimating the standard deviation is included in the test statistic. While use of such standardization to create pivotal statistics is known to improve the convergence rate of the type I error in i.i.d. cases (Babu and Singh, 1983), the present case differs somewhat in that (a) the statistics $t_i$ are already standardized and hence at least partially pivotal (free of mean and scale factors), and (b) we consider that the $t_i$ are non-i.i.d. Pesarin (2001, p. 148) considers using the statistic $\sum t_i$ (the Liptak test) in related multivariate contexts, and uses resampling to estimate its distribution.

To correct for lack of type I error control, we apply the data-based bootstrap algorithm of section III.a to the statistic $S = Z$, and call this the “BK” test. We also consider the same bootstrapping of the statistic $S = \sum t_i$ suggested by Pesarin, calling this the "BT" test. It should be noted that, unlike the HWZ test, there is no parametric equivalent to the BT and BK tests; bootstrapping is essential. We show below that BT tends to have higher power under consistent shifts, and that it maintains the type I error rate.

In summary, we compare the following four bootstrap methods:

• Method HWZ: The data-based bootstrap of the traditional parametric F test (the HWZ proposal).
• Method BT: The data-based bootstrap of the $\sum t_i$ statistic (a new proposal).
• Method BK: The data-based bootstrap of Kramer's Z statistic (a new proposal).
• Method K: The test statistic-based bootstrap of Kramer’s Z statistic (Kramer’s proposal).

b.
Simulation study: Cross-sectional correlation effects

The simulation study summarized by Figure 1 assumes zero cross-sectional correlation and does not consider power, only level. Our purpose now is to compare the various bootstrap methods in terms of level and power, for non-independent data. Since the methods are supposed to work under a variety of distributions, they should work in particular for the normal distribution, so we initially study normal models, using a variety of different correlation structures.

We consider a normal MVRM with T=100 and equicorrelated cross-sectional errors (as would be implied, for example, by the structural model of section IV) with no event effects. We let g = 5 and g = 30 in two separate cases. Further, we allow the cross-sectional correlation parameter to vary as ρ=0.0, 0.1, …, 0.9. Table I displays simulated type I error rates using b=999 bootstrap samples and NSIM=1000 simulations. All entries should be close to the nominal α=0.05 level.

[ Insert Table I Here ]

It is clear from Table I that the Kramer method does not control the type I error rate when there is cross-sectional correlation. For g=5, with ρ>0.2, the rejection rate is consistently over twice the nominal α = 0.05 level, and the problem is worse with larger g, as shown in Panel B. The values in Panel B for the Kramer method closely agree with the theoretical values $2(1 - \Phi(Z_{1-\alpha/2}/\{(1 + (g-1)\rho)/(1-\rho)\}^{1/2}))$ developed in section IV.

We also note that the bootstrapped HWZ method is too conservative for g = 30. The reason for this is that g is large (30) relative to T (100). In such cases, the covariance matrix of the bootstrapped residual vectors tends toward singularity because of the repeated vectors in the bootstrapped samples. This creates larger-than-expected bootstrapped F statistics, which in turn creates larger-than-expected bootstrap p-values.
This problem disappears when g is small relative to T; we found in unreported simulations that the type I error level of HWZ is approximately correct when g=10 and T=100, (estimated value of 0.053), and only slightly conservative when g=15 when T=100 (estimated value of 0.041). Thus, we cannot recommend the HWZ method where the number of firms or portfolios (g) examined is large relative to the number of time points (T) because the researcher will too frequently fail to reject the null. On the other hand, the BT and BK methods perform well in terms of preserving type I error, showing again that bootstrapping of the residuals is superior to bootstrapping the test statistic in the face of cross sectional correlation in returns. There is a slight tendency toward excess type I errors in the independence case for BT and BK. Westfall and Young (1993, p. 127) document a similar phenomenon in a related application, and note that such effects diminish with increasing T. c. Simulation study: Power To compare the various bootstrap methods in terms of power, we assume that the first observation is the event. We model this with a common mean shift γ, ranging from 0.0 to 4.0 for all g variables, chosen to reflect a range of power values between 0 and 1. Since the variances are unity in this study, the mean shifts are equivalently 0 to 4 standard deviations. Again, we consider g = 5 and g = 30, with T = 100 in both cases. To simplify matters, and to make a fair comparison with the Kramer method, we assume zero cross sectional correlation. Table II shows simulated power in this case. [Insert Table II Here] Based on the simulations above, we can recommend the data-based bootstrapped, modified Kramer test, BT, because the BT procedure (i) maintains the type I error level and (ii) has higher power. As noted above, the HWZ approach suffers in power when g is large relative to T, and it can be too conservative when g is large relative to T. 
We should also note in fairness that the HWZ approach could detect cases where the event effects go in opposite directions (for example, if γ increased for some firms and decreased for others), whereas the BT, BK, and K methods are virtually powerless in such instances. The HWZ approach would be more powerful, say, when examining dividend announcements (initiations and omissions) simultaneously. Such cases may be somewhat rare, but when they arise HWZ has the greater power.

d. The effect of time-series dependence: Analytic and simulation results

Excess returns can be serially correlated, as well as cross-correlated. Interestingly, the data-based bootstrap we have described is robust to autocorrelation despite its use of independent sampling. The reason for this is precisely the same reason that the parametric test is non-robust to non-normality. Namely, the test is based on a single random observation rather than a sum. If the single suspected residual is large relative to the estimated distribution, then the event is called significant. All that is required is that the marginal distribution be estimated consistently in order for marginal type I error control to be maintained asymptotically. It is well known that OLS estimates and error variances are consistent under AR disturbances (Greene, 1993, p. 422), suggesting that the normality-assuming dummy variable tests should become essentially a function of the true residuals and their variances for large T. Further, the empirical distribution of the residuals converges almost surely to the true distribution under stationarity conditions given by Yu (1993) and Zhengyan and Chuanrong (1996, Chapter 12), implying that the bootstrap test procedures should maintain type I error rates in large samples under these conditions as well. Specifically, consider again the case where the test is based on known residuals and known variances (taken to be 1.0 without loss of generality) and consider the BT method in the case with g=1.
In this case, the test statistic is simply $\varepsilon_1$, assuming again that t=1 denotes the suspected event day, and the critical values for the bootstrap test procedures are the α/2 and 1-α/2 quantiles of the distribution of $\varepsilon^*$, where $\varepsilon^*$ is sampled with replacement from the pseudopopulation $\{\varepsilon_2,\ldots,\varepsilon_T\}$. In this case, the true bootstrap critical values are evaluated without resorting to Monte Carlo sampling as $\hat F_{\varepsilon,\alpha/2}$ and $\hat F_{\varepsilon,1-\alpha/2}$, respectively the (suitably interpolated) α/2 and 1-α/2 empirical quantiles of the data $\{\varepsilon_2,\ldots,\varepsilon_T\}$. The empirical distribution of data from stationary processes converges to the marginal distribution under conditions given by Yu (1993) and Zhengyan and Chuanrong (1996, Chapter 12), in which case

$$P(\text{reject } H_0 \mid H_0 \text{ true}) = P(\varepsilon_1 \le \hat F_{\varepsilon,\alpha/2} \text{ or } \varepsilon_1 \ge \hat F_{\varepsilon,1-\alpha/2} \mid H_0 \text{ true}) \to P(\varepsilon_1 \le F_{\varepsilon,\alpha/2} \text{ or } \varepsilon_1 \ge F_{\varepsilon,1-\alpha/2} \mid H_0 \text{ true}) = \alpha,$$

for continuous $F_\varepsilon$.

This result may seem surprising because it is generally thought that independence-based bootstrapping fails in the case of non-i.i.d. data (Politis, 2003); however, the difference in the present case is that we are bootstrapping the distribution of a single observation at a single point in time, not of an average or some other combined statistic over the entire history.

We note that this asymptotic control of type I error rates is only in the marginal sense; that is, the control is not conditional upon the recent history but rather unconditional, averaging over all histories. Conditioning upon the recent histories, the independence-assuming bootstrap test will have type I error rates that are sometimes greater, sometimes less, than the nominal level. On average, over all histories, the type I error level is controlled. Marginal control of type I error rates is a good property, and one that any procedure should possess. In particular, it is instructive to compare marginal type I error rates of different procedures; those that do not control marginal type I error rates are clearly inferior.
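The g=1 argument above lends itself to a small numerical check. The following Python sketch is illustrative only and is not the authors' software. It assumes known residuals generated from a Gaussian AR(1) process with unit marginal variance, takes the first observation as the suspected event day, and uses one particular linear interpolation rule for the empirical quantiles. The function names, the choice of φ, and the number of replications are all assumptions made for this example.

```python
import random


def empirical_quantile(sorted_vals, p):
    """Linearly interpolated empirical quantile of pre-sorted data.

    This is one of several common interpolation conventions; the paper
    only says the quantiles are "suitably interpolated".
    """
    n = len(sorted_vals)
    pos = p * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def quantile_event_test(eps, alpha=0.05):
    """Reject H0 when eps[0] (the event day) lies outside the alpha/2 and
    1 - alpha/2 empirical quantiles of the remaining residuals eps[1:]."""
    rest = sorted(eps[1:])
    lower = empirical_quantile(rest, alpha / 2)
    upper = empirical_quantile(rest, 1 - alpha / 2)
    return eps[0] <= lower or eps[0] >= upper


def simulate_ar1(T, phi, rng):
    """Stationary Gaussian AR(1) series of length T with unit marginal variance."""
    innov_sd = (1 - phi ** 2) ** 0.5
    x = rng.gauss(0, 1)  # draw the first value from the stationary distribution
    out = [x]
    for _ in range(T - 1):
        x = phi * x + innov_sd * rng.gauss(0, 1)
        out.append(x)
    return out


def marginal_rejection_rate(T=100, phi=0.5, alpha=0.05, nsim=2000, seed=0):
    """Monte Carlo estimate of the unconditional type I error rate (no event effect)."""
    rng = random.Random(seed)
    rejections = sum(
        quantile_event_test(simulate_ar1(T, phi, rng), alpha) for _ in range(nsim)
    )
    return rejections / nsim


if __name__ == "__main__":
    for phi in (0.0, 0.5, 0.9):
        print(phi, marginal_rejection_rate(phi=phi))
```

Because the estimate averages over independently generated histories, it speaks only to the marginal error rate; it says nothing about the rate conditional on a particular recent history.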
On the other hand, the practical implementation of a procedure that controls the type I error only marginally requires that recent history be ignored. This is clearly an undesirable feature; therefore, despite the robustness of the independence-assuming bootstrap to time series dependencies, it can only be recommended when the data series is “reasonably close” to i.i.d. (although arbitrary cross-sectional correlation is allowable in any case, even if there are perfect cross-sectional dependencies in the case of BT and BK). Further research is needed to determine how significant the time series dependence must be for the conditionality issue to become problematic for the i.i.d. bootstrap.

Table III shows results of simulations to estimate marginal type I error rates when T=100 and g=5, as a function of the serial autocorrelation parameter φ, when there is no cross-sectional correlation.

[Insert Table III Here]

Table III shows that all bootstrap tests control the type I error rates reasonably well, even under substantial autocorrelation. However, the Kramer method remains invalid when there is cross-sectional correlation (simulation results for cross-sectional correlation and time-series dependence are not shown but are available from the authors). We conclude that autocorrelation itself is not a serious problem for marginal type I error rates of bootstrap-based dummy-variable event tests, despite the fact that the model assumes i.i.d. residuals.

e. Simulation from observed populations

So far, the simulations have been somewhat contrived, and the connection between the simulation models and real return data is not well established. It is well known that observed excess returns frequently exhibit cross-sectional correlations, non-normalities, autoregressive effects, and conditional heteroscedasticity. Rather than develop a simulation model that incorporates all of these characteristics, we instead utilize existing financial data for our simulation model.
We used a data set consisting of daily returns on the S&P 500 as well as the five insurance sub-indexes created by Standard & Poor's, from January 3, 1990 to November 6, 2001 (3086 consecutive trading days). The excess returns are fairly highly correlated cross-sectionally, with correlations of (min, average, max) = (.41, .56, .79). There are also significant ARCH(1), GARCH(1), and AR(1) effects in the univariate models.

In our simulation model, we randomly sampled the 6-dimensional vectors in 10 blocks of 20 consecutive time points to create simulated samples of length T=200 that preserve time-series dependencies, cross-sectional dependencies, and non-normalities of the original data. The starting point for the 20 consecutive time points was chosen at random from the possible starting points 1,…,(3086-20+1); thus the selected series possibly overlap, and data occurring earlier in the original “population” can occur later in the simulated sample. Künsch (1989) pioneered this type of block resampling for time series data, and Hall et al. (1995) provide recommendations for block sizes. Samples from the population exhibited similar time-series (autoregressive and conditional heteroscedastic) characteristics, cross-sectional dependencies, and distributional characteristics as found in the population, albeit with generally less statistical significance due to the smaller sample size, and with somewhat smaller time series effect sizes due to local independence induced by block resampling.

The event day was chosen at random from the simulated series of 200 observations. We would expect that the various procedures should control the type I error rates in this analysis, since there is nothing special about the chosen day. Results are shown in Table IV, panel A, comparing the methods HWZ, BT, BK, and K described above, in addition to the classical parametric F test with no bootstrapping (called BINDER in the Table).
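The block resampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, the "population" is assumed to be a list of 3086 rows with one 6-dimensional tuple per trading day, and the usage example substitutes synthetic placeholder rows for the actual return data.

```python
import random


def block_resample(population, n_blocks=10, block_len=20, rng=None):
    """Concatenate n_blocks randomly positioned blocks, each consisting of
    block_len consecutive rows of the population. Blocks may overlap, and
    earlier rows may appear later in the simulated sample."""
    rng = rng or random.Random()
    n = len(population)
    max_start = n - block_len  # 0-based, so there are n - block_len + 1 starts
    sample = []
    for _ in range(n_blocks):
        start = rng.randint(0, max_start)  # randint is inclusive at both ends
        sample.extend(population[start:start + block_len])
    return sample


def draw_sample_with_event(population, rng, n_blocks=10, block_len=20):
    """Return one simulated sample and a randomly chosen event-day index."""
    sample = block_resample(population, n_blocks, block_len, rng)
    event_index = rng.randrange(len(sample))
    return sample, event_index


if __name__ == "__main__":
    # Placeholder population: row t holds six copies of t, standing in for
    # the six return series on day t (purely synthetic, for illustration).
    population = [(float(t),) * 6 for t in range(3086)]
    sample, event_index = draw_sample_with_event(population, random.Random(1))
    print(len(sample), event_index)
```

Keeping whole rows together preserves the cross-sectional dependence within each day, while keeping runs of consecutive rows together preserves short-range time-series dependence within each block.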
The simulation from observed financial data confirms the essential results from the previous simulations. Namely, the simulation confirms that: (1) the normality-assuming test fails to control the type I error rate, as the Binder statistic too frequently rejects the null hypothesis, (2) the Kramer test statistic-based method fails to control the type I error rate in the presence of cross-sectional correlation, and (3) the data-bootstrap methods do control the type I error rate. Again, as in the previous section, we reiterate that these are unconditional type I error levels, and that, while we would rather control conditional type I error levels, unconditional control remains desirable as a basis for comparison of the various procedures. For all simulations, there were 10,000 samples of size T=200 as described above, and for all bootstrap methods, B=999 simulations were used. [ Insert Table IV Here ] To evaluate power, we added common effects γ=.01,…,.05 to the excess returns for all five indexes, allowing for a range of power from 0 to 1. The results are shown in Table IV, panel B. Among the bootstrap tests that control the type I error levels, the BT test is preferred for γ>.01, presumably because the normalization in the BK procedure adds excess variability. On the other hand, the common shift alternative is meant to favor both BT and BK over HWZ; if the alternative were that the variance was increased and not the mean, resulting in large positive, as well as negative, excess returns, then the HWZ test would be preferred. Since the type I errors of the Binder and Kramer tests are uncontrolled, it is inappropriate to compare their power with the other procedures without a size adjustment. 
However, since these tests are anti-conservative, a size adjustment must make their power smaller, and we can therefore conclude that the size-adjusted Kramer test must be less powerful than the BT test for γ>.03, and that the size-adjusted Binder test must be less powerful than the BT test for γ>0.

VI. Conclusion

In this paper, we examine the popular test statistics used for event studies in MVRM models. We show that when excess returns come from non-normal distributions, these traditional test statistics are misspecified. The traditionally calculated p-values are biased downward dramatically when the number of firms (or portfolios) is large and the residual distribution is heavy-tailed, causing the researcher to conclude an event is significant to financial markets too frequently. Importantly, this bias does not diminish as the sample size increases. Bootstrap methods correctly and automatically adapt to non-normal characteristics of the data, reducing this misspecification significantly and providing more accurate p-values.

Further analysis shows that Kramer's (2001) bootstrap method is attractive in the form of its summed-t test statistic, which has good power, but it is not recommended when there is cross-sectional correlation, in which case the test has a grossly inflated type I error probability. We recommend a modification of the Kramer method using data-based bootstrapping, which provides good power and maintains type I error control. We also note that this type of bootstrap is marginally robust in the face of time series dependence, despite its resampling of independent residual vectors.

It is standard practice to use the t-distribution rather than the z-distribution for analysis of mean values when the variance is unknown, simply because the procedures based on the t-distribution are generally more accurate. We argue for general use of the bootstrap rather than the traditional method of MVRM tests of key events, for precisely the same reasons.
Since the bootstrap method performs better over a range of possible cases, as we have shown in numerous simulations, and since its performance is not noticeably inferior in the case of normal distributions, we feel it is generally much more prudent to use the bootstrap p-values than to use the traditional p-values, regardless of the form of the underlying distribution. At the very least, traditional analysis of event studies should be supplemented with a bootstrap analysis, so investigators can evaluate the robustness of their inferences. Limitations of the research are the restriction to MVRM models with a common event time, and the lack of control of conditional error rates under non-i.i.d. time series data. Since the BT method works well under cross-correlated data, extensions to separate, independent event times, as considered by Kramer, should also work well, but we leave this for future study. Further research is also needed to evaluate the conditionality issues described in section V. Software to perform these analyses is available freely from the authors. Footnotes 1 Smirlock and Kaufold (1987), Karafiath and Glascock (1989), Cornett and Tehranian (1989 and 1990), De Jong and Thompson (1990), Eyssell and Arshadi (1990), Demirguc-Kunt and Huizinga (1993), Madura et al. (1993), Unal, Demirguc-Kunt and Leung (1993), Clark and Perfect (1996), Cornett, Davidson, and Rangan (1996), Johnson and Sarkar (1996), Bin and Chen (1998), Cosimano and McDonald (1998), Sinkey and Carter (1999), and Stewart and Hein (2002) are examples of empirical studies that use such an approach in examining the significance of a wide variety of events. 2 All calculations are conditional on the observed matrix G; see Westfall and Young, 1993, p. 123, for further discussion; see also the description of the SAS/STAT software PROC MULTTEST, SAS Institute (1999), which uses an identical method to estimate p-values from tests in multivariate linear models. 
3 A technical distinction between the bootstrap approach suggested by Chou and HWZ is that Chou fits null restricted models and does not suggest removing the residual from the bootstrap sample. In our formulation, fitting null restricted models is not needed, and the residuals that are identically zero should be removed. In cases where multiple event parameters are modeled, one should also exclude any other sample residuals that are forced to be zero; the program that is freely available from the authors allows this.

4 The distribution of the test statistic depends mainly on the distribution of the residuals; the distribution chosen for the market return has relatively little effect.

5 Student t distributions are used for fitting financial models with heavy tails; see e.g. the SAS/ETS procedure PROC AUTOREG.

6 While a large bootstrap sample size is recommended for reporting significance from an analysis of a data set to ensure replicability, it is not as essential that the bootstrap sample size be so large for simulation studies, in which case the major contributor to variability is the outer loop; see Westfall and Young (1993, pp. 38-41).

Table I. Simulated type I error rates as a function of cross-sectional correlation.

Panel A: g = 5
       ρ=0    ρ=0.1  ρ=0.2  ρ=0.3  ρ=0.4  ρ=0.5  ρ=0.6  ρ=0.7  ρ=0.8  ρ=0.9
HWZ    0.059  0.059  0.059  0.059  0.059  0.059  0.059  0.059  0.059  0.059
BT     0.055  0.057  0.055  0.057  0.056  0.058  0.056  0.057  0.060  0.056
BK     0.053  0.056  0.049  0.053  0.050  0.042  0.041  0.045  0.050  0.049
K      0.057  0.078  0.113  0.168  0.220  0.275  0.335  0.418  0.500  0.624

Panel B: g = 30
       ρ=0    ρ=0.1  ρ=0.2  ρ=0.3  ρ=0.4  ρ=0.5  ρ=0.6  ρ=0.7  ρ=0.8  ρ=0.9
HWZ    0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.002  0.002
BT     0.070  0.057  0.056  0.059  0.062  0.064  0.051  0.058  0.056  0.049
BK     0.073  0.057  0.051  0.055  0.056  0.053  0.055  0.058  0.053  0.050
K      0.056  0.366  0.516  0.590  0.652  0.702  0.718  0.813  0.851  0.885

Table II. Simulated power as a function of event effect γ.

Panel A: g = 5
       γ=0    γ=0.2  γ=0.5  γ=0.75  γ=1.0  γ=1.5  γ=2.0  γ=3.0  γ=4.0
HWZ    0.05   0.04   0.09   0.19    0.28   0.74   0.95   1      1
BT     0.02   0.09   0.2    0.42    0.6    0.92   0.99   1      1
BK     0.05   0.07   0.17   0.34    0.58   0.78   0.92   0.97   1
K      0.06   0.08   0.16   0.27    0.42   0.65   0.75   0.88   0.93

Panel B: g = 30
       γ=0    γ=0.1  γ=0.2  γ=0.3  γ=0.4  γ=0.5  γ=0.6  γ=0.7  γ=0.8  γ=0.9  γ=1.0
HWZ    0.002  0.002  0.003  0.007  0.011  0.013  0.042  0.072  0.130  0.197  0.291
BT     0.070  0.092  0.206  0.363  0.573  0.751  0.871  0.949  0.988  0.999  1.000
BK     0.073  0.090  0.193  0.335  0.539  0.697  0.844  0.930  0.975  0.992  1.000
K      0.056  0.087  0.185  0.345  0.535  0.703  0.853  0.933  0.977  0.996  0.999

Table III. Simulated type I error rates as a function of serial correlation φ.

       φ=0    φ=0.1  φ=0.2  φ=0.3  φ=0.4  φ=0.5  φ=0.6  φ=0.7  φ=0.8  φ=0.9
HWZ    0.052  0.054  0.056  0.057  0.057  0.059  0.072  0.085  0.113  0.176
BT     0.062  0.062  0.067  0.066  0.068  0.07   0.067  0.071  0.079  0.093
BK     0.062  0.06   0.064  0.065  0.067  0.064  0.062  0.067  0.067  0.061
K      0.053  0.052  0.047  0.048  0.045  0.048  0.046  0.046  0.052  0.045

Table IV. Simulated type I error rates and power for the various procedures under block sampling from a financial data series, where the samples exhibit ARCH, GARCH, autoregressive, and cross-sectional correlation effects.

Panel A: Type I error rates
          Nominal type I error rate
          α=0.01  α=0.05  α=0.10
Binder    0.058   0.105   0.137
HWZ       0.014   0.055   0.104
BT        0.015   0.054   0.103
BK        0.014   0.053   0.105
K         0.097   0.275   0.358

Panel B: Power for the nominal α=0.05 tests.
          γ=0.01  γ=0.02  γ=0.03  γ=0.04  γ=0.05
Binder    0.139   0.346   0.712   0.935   0.987
HWZ       0.071   0.183   0.465   0.784   0.933
BT        0.148   0.543   0.858   0.970   0.993
BK        0.193   0.463   0.674   0.820   0.885
K         0.542   0.776   0.880   0.925   0.946

[Figure 1 appears here: six panels, (a) through (f), each plotting Type I error (vertical axis, 0 to 0.2) against Number of Firms (or Portfolios) (horizontal axis: 1, 2, 4, 8), with series labeled T1, T2, T4, T8, and Z.]

Figure 1: True type I error rates for nominal α=0.05 event tests with normal and nonnormal distributions. (a) traditional F test, (b) data bootstrap F test (HWZ), (c) test statistic bootstrap (Kramer), all for T=200; (d), (e), (f) repeat (a), (b), (c) when T=50. True type I error rates are 0% when g=1 for test statistic bootstrap, 100% when g=2.

References

Andrews, D.W.K. and M. Buchinsky. (1998). “On the number of bootstrap repetitions for bootstrap standard errors, confidence intervals, confidence regions, and tests.” Cowles Foundation Discussion Paper No. 1141R, Yale University, revised.
Babu, G.J. and K. Singh. (1983). “Inference on means using the bootstrap.” Annals of Statistics 11, 999-1003.
Bernard, V. L. (1987). “Cross-sectional dependence and problems in inference in market-based accounting research.” Journal of Accounting Research 25 (1), 1-48.
Bin, F. and D. Chen. (1998).
“Casino legislation debates and gaming stock returns: a United States empirical study.” International Journal of Management 15 (4), 397-406.
Binder, J. J. (1985a). “Measuring the effects of regulation with stock price data.” Rand Journal of Economics 16, 167-183.
Binder, J. J. (1985b). “On the use of multivariate regression models in event studies.” Journal of Accounting Research 23, 370-383.
Binder, J. J. (1998). “The Event Study Methodology Since 1969.” Review of Quantitative Finance and Accounting 11, 111-137.
Boehme, R.D. and S.M. Sorescu. (2002). “The long-run performance following dividend initiations and resumptions: underreaction or product of chance?” Journal of Finance 57, 871-900.
Chou, P.H. (2001). “Bootstrap Tests For Multivariate Event Studies.” Advances in Investment Analysis and Portfolio Management.
Clark, J.A. and S.B. Perfect. (1996). “The economic effects of client losses on OTC bank derivative dealers: Evidence from the capital market.” Journal of Money, Credit, and Banking 28, 527-545.
Cornett, M.M., W.N. Davidson, and N. Rangan. (1996). “Deregulation in investment banking: Industry concentration following Rule 415.” Journal of Banking and Finance 20, 85-113.
Cornett, M.M. and H. Tehranian. (1989). “Stock market reaction to the depository institutions deregulation and monetary control act of 1980.” Journal of Banking and Finance 13, 81-100.
Cornett, M.M. and H. Tehranian. (1990). “An examination of the impact of the Garn-St. Germain depository institutions act of 1982 on commercial banks and savings and loans.” Journal of Finance 45, 95-111.
Cosimano, T.F. and B. McDonald. (1998). “What's different among banks?” Journal of Monetary Economics 41, 57-70.
Davidson, R. and J.G. MacKinnon. (2000). “Bootstrap tests: How many bootstraps?” Econometric Reviews 19, 55-68.
De Jong, P. and R. Thompson. (1990). “Testing linear hypothesis in the SUR framework with identical explanatory variables.” Research in Finance 8, 59-76.
Demirguc-Kunt, A. and H. Huizinga. (1993). “Official credits to developing countries: implicit transfers to the banks.” Journal of Money, Credit, and Banking 25, 76-89.
Efron, B. and R. Tibshirani. (1998). An Introduction to the Bootstrap. CRC Press LLC.
Eyssell, T.H. and N. Arshadi. (1990). “The wealth effects of risk-based capital requirements in banking.” Journal of Banking and Finance 14, 179-197.
Fama, Eugene. (1998). “Market Efficiency, long-term returns, and behavioral finance.” Journal of Financial Economics 49, 283-306.
Freedman, D.A., and S.C. Peters. (1984). “Bootstrapping an Econometric Model: Some Empirical Results.” Journal of Business and Economic Statistics 2, 150-158.
Greene, W.H. (1993). Econometric Analysis, Second Edition. Prentice Hall: New Jersey.
Hall, P., Horowitz, J.L., and B.Y. Jing. (1995). “On blocking rules for the bootstrap with dependent data.” Biometrika 82, 561-574.
Hein, S.E., P. Westfall, and Z. Zhang. (2001). “Improvements on Event Study Tests: Bootstrapping the Multivariate Regression Model.” Texas Tech University, Working Paper.
Horowitz, J. L. (2001). “The bootstrap and hypothesis tests in econometrics.” Journal of Econometrics 100, 37-40.
Johnson, R.A. and D.W. Wichern. (1998). Applied Multivariate Statistical Analysis, 4th ed. Prentice-Hall.
Johnson, S.A. and S.K. Sarkar. (1996). “The valuation effects of the 1977 community reinvestment act and its enforcement.” Journal of Banking and Finance 20, 783-803.
Karafiath, I. (1988). “Using dummy variables in the event study methodology.” Financial Review 23, 351-357.
Karafiath, I. and J. Glascock. (1989). “Intra-industry effects of a regulatory shift: Capital market evidence from Penn Square.” Financial Review 24, 123-134.
Kramer, L.A. (1998). “Banking on Event Studies: Statistical Problems, a New Bootstrap Solution and an Application to Failed Bank Acquisitions.” Ph.D. Dissertation, University of British Columbia.
Kramer, L.A. (2001). “Alternative Methods for Robust Analysis in Event Study Applications.” Advances in Investment Analysis and Portfolio Management 8, 109-132.
Künsch, H.R. (1989). “The jackknife and bootstrap for general stationary observations.” Annals of Statistics 17, 1217-1241.
Lyon, J.D., B.M. Barber, and C.L. Tsai. (1999). “Improved Methods for Tests of Long-Run Abnormal Stock Returns.” Journal of Finance 54, 165-201.
Madura, J., A.J. Tucker, and E. Zarrick. (1992). “Reaction of bank share prices to the Third-World debt reduction plan.” Journal of Banking and Finance 16, 853-868.
MacKinlay, A.C. (1997). “Event Studies in Economics and Finance.” Journal of Economic Literature 35, 13-39.
Mitchell, M. and E. Stafford. (2000). “Managerial decisions and long-term stock price performance.” Journal of Business 73, 287-329.
MacKinnon, J.G. (1999). “Bootstrap Testing in Econometrics.” Presented at the May 29, 1999 CEA Annual Meeting.
Pesarin, F. (2001). Multivariate Permutation Tests. Wiley, New York.
Politis, D.N. (2003). “The impact of bootstrap methods on time series analysis.” Statistical Science 18, 219-230.
Rau, P.R. and T. Vermaelen. (1998). “Glamour, value, and the post-acquisition performance of acquiring firms.” Journal of Financial Economics 49, 223-253.
Rocke, D. M. (1989). “Bootstrap Bartlett adjustment in seemingly unrelated regression.” Journal of the American Statistical Association 84, 598-601.
SAS Institute, Inc. (1990). SAS Guide to Macro Processing, Version 6, Second Edition. SAS Institute Inc., Cary, NC.
SAS Institute, Inc. (1999). SAS/STAT User's Guide, Version 8. SAS Institute Inc., Cary, NC.
Schipper, K. and R. Thompson. (1985). “The impact of merger-related regulations using exact distributions of test statistics.” Journal of Accounting Research 23, 408-415.
Sefcik, S. E. and R. Thompson. (1986). “An approach to statistical inference in cross-sectional models with security abnormal returns as the dependent variable.” Journal of Accounting Research 24, 316-334.
Sinkey, J.F. and D.A. Carter. (1999). “The reaction of bank stock prices to news of derivatives losses by corporate clients.” Journal of Banking and Finance 23, 1725-1743.
Smirlock, M. and H. Kaufold. (1987). “Bank foreign lending, mandatory disclosure rules, and the reaction of bank stock prices to the Mexican debt crisis.” Journal of Business 60, 347-364.
Stewart, J.D. and S.E. Hein. (2002). “An investigation of the Impact of the 1990 reserve requirement change on financial asset prices.” Journal of Financial Research 25, 367-382.
Unal, H., A. Demirguc-Kunt and K. Leung. (1993). “The Brady Plan, 1989 Mexican Debt-Reduction Agreement, and Bank Stock Returns in United States and Japan.” Journal of Money, Credit, and Banking 25, 410-429.
Westfall, P.H. and Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley: New York.
Yu, H. (1993). “A Glivenko-Cantelli lemma and weak convergence for empirical processes of associated sequences.” Probability Theory and Related Fields 95, 357-370.
Zellner, A. (1962). “An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias.” Journal of the American Statistical Association 57, 348-368.
Zhengyan, L. and L. Chuanrong. (1996). Limit Theory for Mixing Dependent Random Variables. Science Press, New York.