GEOPHYSICAL RESEARCH LETTERS, VOL. 23, NO. 11, PAGES 1433-1436, MAY 27, 1996

Statistical tests of VAN earthquake predictions: comments and reflections

Y. Y. Kagan and D. D. Jackson

Institute of Geophysics and Planetary Physics, and Department of Earth and Space Sciences, University of California, Los Angeles

Copyright 1996 by the American Geophysical Union. Paper number 95GL03786. 0094-8534/96/95GL-03786$05.00

Abstract. The papers in the debate show that precise consideration of whether various events should be considered as having been successfully predicted is one of the key issues for statistical testing of the VAN earthquake prediction efforts. Other issues involve whether the VAN prediction rules have been adjusted retroactively to fit the data, how to test the prediction significance and, finally, the criteria for rejection of the null hypothesis, namely that the successful VAN predictions are due to chance. The diversity of views on these subjects suggests that the seismological community needs a general strategy for evaluating earthquake prediction methods.

Introduction

The VAN debate shows that the geophysical community has no consensus on what represents an earthquake prediction or how earthquake prediction claims should be evaluated. Hence such a debate is necessary and useful. Since there is no comprehensive theory of earthquake occurrence, proposed earthquake prediction techniques must be validated statistically. By definition, statistics cannot give us the final answer; at best it can be used to evaluate the internal consistency of one hypothesis, and the relative consistency of multiple hypotheses.

In the debate contributions we see several patterns. Many controversies are due to uncertainties in the data: which magnitude should be used, how important errors in the magnitude or location are, which catalog is most appropriate for the testing, etc. Another point of contention is the ambiguity of the VAN predictions: whether the 'rules of the game' were defined before the tests, or the rules were adjusted during the tests, or, indeed, sometimes during the debate. Even now the rules are not fully formulated; it is not clear, for example, what the magnitude cutoff for predicted earthquakes is. Similarly, it has not been specified how one counts 'successes' if more than one event falls into the prediction window, or if the same event is covered by several windows. The lower the magnitude threshold, the more such ambiguous cases exist. To avoid ambiguities in interpreting such predictions, we suggest that the total number of successfully predicted events and the total combined window space be counted in validation tests of earthquake prediction techniques.

It is not yet clear what constitutes a proper null hypothesis and how the space-time-magnitude inhomogeneity of earthquake occurrence should be modeled. Finally, having obtained the test results, it is unclear what conclusions can be drawn in view of the above uncertainties. We comment below on some of those issues. We use the same notation as Kagan [1996].

Earthquake vs. prediction simulation

A convenient artifice for constructing a null hypothesis is to randomize the times and locations of either the earthquakes or the predictions. The debate has shown that such randomization must be done with care, because both the earthquakes and the predictions have some statistical structure (they are not uniformly distributed in time and space), and a reasonable null hypothesis should preserve this structure. This is an especially important issue here, as several of the events that Varotsos et al. [1996] claim to have predicted were either aftershocks or members of earthquake clusters. Kagan [1996, Table 1] provides an accounting of predicted mainshocks and aftershocks, according to one published recognition algorithm.
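To make the randomization alternatives concrete, the following is a minimal sketch in Python; the catalog and alarm structures (planar coordinates in kilometers, simple circular space-time windows) are hypothetical choices of ours, not taken from any of the papers in the debate.

    import numpy as np

    rng = np.random.default_rng(0)

    def count_successes(times, locs, alarms):
        # Number of events falling inside at least one alarm window.
        # `alarms` is a list of (t_start, t_end, (x, y), radius_km) tuples;
        # `locs` is an (N, 2) array of planar epicenter coordinates in km.
        hit = np.zeros(len(times), dtype=bool)
        for t0, t1, (cx, cy), r in alarms:
            in_time = (times >= t0) & (times <= t1)
            in_space = np.hypot(locs[:, 0] - cx, locs[:, 1] - cy) <= r
            hit |= in_time & in_space
        return int(hit.sum())

    def null_success_counts(times, locs, alarms, t_span, n_sim=10000):
        # Null distribution of success counts: epicenters stay fixed,
        # preserving the spatial inhomogeneity of seismicity, while the
        # origin times are redrawn uniformly over the observation span.
        counts = np.empty(n_sim, dtype=int)
        for i in range(n_sim):
            t_rand = rng.uniform(0.0, t_span, size=len(times))
            counts[i] = count_successes(t_rand, locs, alarms)
        return counts

Redrawing the times uniformly, as here, yields a Poisson-in-time null; preserving temporal clustering as well would require relocating entire foreshock-mainshock-aftershock sequences as units.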
Stark [1996] argues that it is preferable to randomize the prediction set, because earthquake times and places are too dependent, and difficult to simulate accurately. Aceves et al. [1996] do randomize the prediction set. However, we argue that the statistical properties of earthquakes have been studied extensively, while the predictions may have a nonuniform distribution that is not well enough characterized to be randomized with confidence. For example, the location of the predicted epicenters depends on the (non-uniform) distribution of electrical stations, and the predicted origin times might arguably be related to funding availability, equipment failures, or even Prof. Varotsos' travel schedule [Dologlou, 1993, Tables 2-4]. Thus we believe that a null hypothesis can best be constructed by randomizing earthquake locations and times, while preserving the observed spatial distribution and temporal clustering.

Kagan [1996] and Aceves et al. [1996] both test the VAN hypothesis against null hypotheses (Table 1). Among the differences between these tests are: (1) Kagan [1996] used the SI-NOA and NOAA catalogs for 1987-1989 separately, while Aceves et al. [1996] used a combination of these catalogs for 1960-1985; (2) Kagan [1996] used a magnitude threshold of 5.0, while Aceves et al. [1996] used 4.3; (3) Kagan [1996] randomized earthquake times to construct his null hypothesis, while Aceves et al. [1996] randomized prediction times as well as locations and magnitudes of predicted events.

Table 1. Comparison of test results by Aceves et al. [1996] and by Kagan [1996]

                                         Aceves et al.            Kagan
    No  Test                           Result  Significance  Result  Significance
    1   Original catalog                 +        0.06%        +        3.6%
    2   Partially declustered (Tc = 1)   +        0.10%        +        3.3%
    3   More fully declustered (Tc = 0)  -        7.96%        -       35.0%
    4   Alternative prediction           not performed         -        ?

Plus (+) means that the null hypothesis can be rejected in favor of the VAN method at the 95% confidence level; minus (-) means that the null hypothesis cannot be rejected. Tc values refer to Eq. 2 in Kagan [1996].

In spite of these differences, both agreed on several points: (1) the VAN method outperforms a Poisson null hypothesis (that is, one with no clustering) at the 95% or higher confidence level, if aftershocks are left in the catalog or only partially removed; and (2) the VAN method does not outperform the Poisson null hypothesis if aftershocks are more fully removed. Kagan [1996] further shows that a null hypothesis which explicitly includes clustering ('Alternative prediction') outperforms the VAN hypothesis. Aceves et al. [1996] did not attempt to include clustering in their null hypothesis. Aceves et al. [1996] appear to favor the VAN hypothesis in their conclusions, even though they could affirm VAN's statistical significance only if aftershock clusters remain in the catalog.

Why is temporal clustering such an important issue? Primarily because a variation in a natural phenomenon, such as the electric field, that follows earthquakes would typically precede the later events in a cluster. The electrical variations might thus appear to have some predictive capability, but this would actually come purely from the clustering of earthquakes. A null hypothesis that neglects such clustering might perform worse than the "electrical" hypothesis, even though the electrical phenomena had no real intrinsic predictive power.
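The significance levels in Table 1 are empirical quantities of exactly this simulation-based kind: the fraction of simulated catalogs that do at least as well as the real predictions. A minimal sketch, reusing the hypothetical null_success_counts above (the +1 correction is a common convention, not necessarily the one used by either paper):

    import numpy as np

    def empirical_significance(observed, null_counts):
        # One-sided empirical p-value: fraction of simulated catalogs with
        # at least as many successes as observed.  The +1 terms keep the
        # estimate away from exactly zero for a finite simulation.
        null_counts = np.asarray(null_counts)
        return (np.sum(null_counts >= observed) + 1) / (len(null_counts) + 1)

    # Example: with 7 observed successes,
    # p = empirical_significance(7, null_success_counts(times, locs, alarms, t_span))
    # and the null is rejected at the 95% level when p < 0.05.

Whether the simulated catalogs retain aftershocks (rows 1 and 2 of Table 1) or are declustered first (row 3) changes the null distribution, and hence the significance level, dramatically.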
Varotsos and Lazaridou [1996] agree that a null hypothesis based on clustering (that is, issuing a prediction following each M ≥ 5 earthquake, as suggested by Kagan [1996]) has a high success rate. In fact, the null hypothesis predicts as many earthquakes (7) as the VAN hypothesis, which requires a 50% larger volume of space-time, using the parameters of Varotsos et al. [1996]. But Varotsos and Lazaridou [1996] protest that the earthquakes predicted by the null hypothesis are largely aftershocks, whereas they claim the VAN method predicts mainshocks more effectively. The distinction is largely semantic, as an aftershock is by definition an earthquake predictable from a previous earthquake. In any case it is too early to claim that the VAN method somehow predicts better or more important earthquakes, when there is no convincing proof that the VAN method predicts any earthquake beyond chance.

Even if we randomize the prediction times in tests of earthquake forecasts, the problem of earthquake temporal clustering needs to be taken into account. A prediction technique can use the earthquake clustering information either directly (for example, by preferentially issuing alarms during times of high seismic activity), or indirectly, if co-seismic and post-seismic SES signals trigger an alarm. In the latter scenario, earthquake prediction based on a scheme similar to the 'alternative' model in Kagan [1996] would perform better than the VAN technique. Hence the rejection of the null hypothesis means only that the plain Poisson process is an inappropriate model for earthquake occurrence. The proper null hypothesis should be a Poisson cluster process incorporating foreshock-mainshock-aftershock sequences, and this model must be tested. Varotsos et al. [1996] also emphasize this point.

Retroactive parameter adjustment

Varotsos and Lazaridou [1996] argue that the prediction rules for 1987-1989 were established before the test period. This is not so. The prediction rules analyzed by Varotsos et al. [1996] have been adjusted retroactively. Varotsos and Lazaridou's [1991] paper was published after the 1987-1989 period, and hence the data and the results were available for parameter fitting. Varotsos and Alexopoulos [1984, p. 121] state "The SES is observed 6-115 h before the EQ ..." Varotsos and Alexopoulos [1987, p. 336] say "... one week (which is the maximum of the time window between the appearance of an SES and the occurrence of the corresponding EQ)." Therefore, the time delays of 11 and 22 days for a single SES and SES activity, respectively, were specified after 1987. Similarly, Varotsos and Alexopoulos [1984, p. 99] use a magnitude prediction error of 0.5. In their 1991 paper it is reported to be 0.7 (repeated in their 1996 manuscript). Varotsos [1986] suggests a window radius of 120 km and a magnitude interval of 0.8 units in the abstract, but on p. 192 the magnitude accuracy is 0.6 units. The prediction time lag is specified as "6 hrs to 1 week" [Varotsos, 1986, p. 192]. Although Varotsos et al. [1996] often state that earthquakes with magnitude M ≥ 5 are predicted, smaller earthquakes are also reported to be successfully forecast [Varotsos and Alexopoulos, 1984, pp. 119-120, 123-124]. Thus, most of the VAN parameters have been adjusted after the 1987-1989 test period, i.e., retroactively.
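The effect of such adjustments is easy to quantify: widening any tolerance can only keep or raise the number of counted successes. A minimal sketch over a grid of the tolerance values quoted above; the flat event and prediction records are hypothetical, not VAN's actual telegram format.

    from itertools import product
    import numpy as np

    def n_successes(quakes, preds, radius_km, dm, dt_days):
        # A quake counts as predicted if some prediction precedes it by at
        # most dt_days, lies within radius_km of its epicenter, and matches
        # its magnitude to within dm.  Records are dicts with t, x, y, mag.
        n = 0
        for q in quakes:
            for p in preds:
                if (0.0 <= q["t"] - p["t"] <= dt_days
                        and np.hypot(q["x"] - p["x"], q["y"] - p["y"]) <= radius_km
                        and abs(q["mag"] - p["mag"]) <= dm):
                    n += 1
                    break
        return n

    def tolerance_table(quakes, preds):
        # Success counts over a grid of tolerances; the count is
        # monotonically non-decreasing in every window parameter.
        for r, dm, dt in product([100, 120], [0.5, 0.7], [7, 11, 22]):
            print(f"R={r:>3} km  dM={dm}  dt={dt:>2} d:",
                  n_successes(quakes, preds, r, dm, dt))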
Furthermore, the tolerances have grown with time, suggesting a consistent pattern of expanding the windows to include more events as "successes." This window enlargement is still continuing; the suggested parameters in Varotsos and Lazaridou [1996] differ from those in Varotsos et al. [1996]. Kagan [1996] counts 7 earthquakes meeting the prediction criteria in the SI-NOA catalog, and 6 in the PDE (NOAA) catalog, using rules apparently accepted by Varotsos et al. [1996]. The values ΔM = 0.7, Mc = 5.0, and three limits for the time delay ta (11 days for a single SES, 22 days for several SES signals, and '... of the order of 1 month ...' for the "gradual variation of electric field," GVEF) were specifically stated, and Varotsos et al. [1996] apparently accept the value Rmax = 100 km used by Hamada [1993]. But Varotsos and Lazaridou [1996] claim that additional earthquakes should be considered successes. For the SI-NOA catalog they wish to include the earthquakes of 1988/07/16 and 1989/09/19. The former occurred at 95 km depth according to the SI-NOA catalog [Mulargia and Gasperini, private correspondence, 1994], so it does not qualify as a shallow earthquake. The latter was 101.3 km from the predicted epicenter in the SI-NOA catalog. From the NOAA (PDE) catalog, Varotsos and Lazaridou [1996] wish to include the earthquakes of 1987/05/14 (mb = 4.7) and 1987/05/29 (mb = 5.2), which followed GVEF activities. Kagan did not count these earthquakes as successfully predicted because the time limits for GVEF prediction were not adequately specified. The latter earthquake followed the report of GVEF by more than 32 days. If these earthquakes are to be counted as successes, then the rules of the game must be revised yet again, and all of the statistical tests repeated with the new (clearly retrospective) rules.

The changes suggested above are motivated by a desire to improve the performance of the SES hypothesis, not the null hypothesis. This type of optimization is perfectly acceptable in the learning phase of prediction research, while a specific hypothesis is being formulated. This debate demonstrates that the optimization has not been completed in the nearly five years since the end of 1989, and that there is not yet a specific hypothesis to test.

There is a way to specify predictions with less sensitivity to adjustable parameters. If the prediction is expressed as a probability per unit time, area, and magnitude, as a function of time, location, and magnitude, the likelihood ratio test can be used to evaluate the test hypothesis relative to the null hypothesis [Rhoades and Evison, 1993]. The probability function may still depend on adjustable parameters, but the extreme sensitivity caused by sharp boundaries can be eliminated if the assumed probability function is continuous.
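In that formulation the forecast is a rate density λ(t, x, m); the standard point-process log-likelihood sums log λ at the observed events and subtracts the rate integrated over the whole space-time-magnitude volume, and the difference between two such log-likelihoods is the test statistic. A minimal sketch of the bookkeeping, in a generic point-process form under our own simplifying assumptions, not the specific formulation of Rhoades and Evison [1993]:

    import numpy as np

    def log_likelihood(rates_at_events, expected_total):
        # Point-process log-likelihood: sum of the log rate density at each
        # observed event minus the rate integrated over the full
        # space-time-magnitude volume (the expected number of events).
        return float(np.sum(np.log(rates_at_events)) - expected_total)

    def log_likelihood_ratio(test_rates, test_expected, null_rates, null_expected):
        # Positive values favor the test forecast over the null; both
        # forecasts must be evaluated at the same observed events.
        return (log_likelihood(test_rates, test_expected)
                - log_likelihood(null_rates, null_expected))

Because the rate density can vary smoothly, a small epicenter shift or a 0.1-unit magnitude change perturbs the statistic only slightly instead of flipping an event between 'hit' and 'miss.'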
Discussion

The above comments call for further discussion of the criteria for acceptance or rejection of statistical hypotheses. Earthquake prediction presents a special problem because of the lack of an appropriate theory of earthquake occurrence, the complex and difficult-to-interpret character of earthquake data, the many dimensions of the earthquake process, and several other factors. Ideally, earthquake prediction experiments should be organized according to the rules proposed by Jackson [1996] or Rhoades and Evison [1996]. Strict compliance with these rules is very difficult. Only a few published methods come close [e.g., Reasenberg and Jones, 1989; Nishenko, 1991; Evison and Rhoades, 1993]. In most cases, retrospective parameter adjustment is nearly inevitable. Molchan and Rotwain [1983] discuss methods for statistical testing of predictions when the degrees of freedom subject to retrospective adjustment can be counted objectively. However, there is no objective way to count the degrees of freedom involved in retrospective adjustment of prediction windows, selection of earthquake catalogs, etc. Because of these adjustments the VAN method is still too vague to allow rigorous testing.

Another problem is the instability of statistical tests caused by the sharp magnitude-space-time boundaries of the VAN alarm zones. A similar problem is encountered in many other earthquake predictions. If an earthquake epicenter misses the zone by even a few kilometers, formal criteria score such an event as a failure. Similarly, magnitude thresholds are expressed in the forecast as sharp cutoffs; thus small magnitude variations, caused either by the use of a preliminary vs. the final catalog or by the use of various magnitude scales, can make the difference between the acceptance and rejection of a hypothesis. Real seismicity lacks such sharp boundaries in either space or magnitude. As we commented above, the sensitivity to various parameters can be significantly reduced by specifying an earthquake probability density for all magnitude-space-time windows of interest.

The major problem which needs to be addressed is whether the observed phenomena can be explained by a null hypothesis. The proper choice of earthquake occurrence model is crucial for these tests. We believe that there is a consensus that a Poisson cluster process (i.e., mainshocks occur according to a Poisson process, and they are accompanied by foreshock/aftershock sequences) is an appropriate model of seismicity [Ogata, 1988; Kagan, 1991]. Unfortunately, there is no widely accepted and persuasive description of aftershock sequences in the space-time-magnitude domain.
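Even so, the time-domain part of such a null is straightforward to simulate. The sketch below draws event times from a simple Poisson cluster model, with mainshocks as a stationary Poisson process and aftershock delays from a truncated Omori-type density proportional to (t + c)^(-p); all parameter values are illustrative, and a null usable for testing would also need the space and magnitude dimensions.

    import numpy as np

    rng = np.random.default_rng(1)

    def poisson_cluster_times(rate_main, t_span, n_aft_mean=3.0,
                              c=0.05, p=1.1, t_max=100.0):
        # Mainshocks form a stationary Poisson process; each spawns a
        # Poisson number of aftershocks whose delays follow a truncated
        # Omori-type density ~ (t + c)**(-p), sampled by inverse transform.
        mains = rng.uniform(0.0, t_span, rng.poisson(rate_main * t_span))
        times = list(mains)
        a, b = c ** (1.0 - p), (t_max + c) ** (1.0 - p)
        for t0 in mains:
            for _ in range(rng.poisson(n_aft_mean)):
                u = rng.uniform()
                delay = (a + u * (b - a)) ** (1.0 / (1.0 - p)) - c
                if t0 + delay <= t_span:
                    times.append(t0 + delay)
        return np.sort(np.asarray(times))

A prediction scheme can then be scored against catalogs drawn from this clustered model rather than from a memoryless Poisson process.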
The null and alternative (research) hypotheses do not compete on an equal basis in such statistical testing. Usually, the null hypothesis is simpler; by the principle of Occam's razor it should be accepted until it is displaced by a hypothesis that fits the data significantly better. Moreover, the history of failed attempts to predict earthquakes over the past 30 years suggests that prediction based on electric and other precursors is a difficult or maybe even impossible task. Therefore, any unresolved doubt in the data analysis, testing, and representation of the test results should be interpreted in favor of the null hypothesis [cf. Riedel, 1996; Stark, 1996]. In particular, if a slight variation of prediction parameters or input data makes the difference between the acceptance and rejection of the null hypothesis, the prediction effect should be considered weak, possibly non-existent. Such test results might arguably justify more extensive and rigorous experiments, but until the experiments and their tests are complete, such a statistically marginal hypothesis should not be accepted.

Given the wide variety of interpretations expressed in this debate about the meaning of the VAN predictions, it is evident that the VAN experiments, the format of VAN's public statements, and the analysis of the VAN data need to be revised significantly before any statistical test of the hypothesis is contemplated.

Acknowledgments. We appreciate partial support from the National Science Foundation through Cooperative Agreement EAR-8920136 and USGS Cooperative Agreement 14-08-0001-A0899 to the Southern California Earthquake Center (SCEC), and partial support by USGS through grant 1434-94-G-2425. Publication 186, SCEC.

References

Aceves, R. L., S. K. Park, and D. J. Strauss, Statistical evaluation of the VAN method using the historic earthquake catalog in Greece, Geophys. Res. Lett., this issue, 1996.

Dologlou, E., A 3 year continuous sample of officially documented predictions issued in Greece using the VAN method - 1987-1989, Tectonophysics, 224, 189-202, 1993.

Evison, F. F., and D. A. Rhoades, The precursory earthquake swarm in New Zealand: hypothesis tests, New Zealand J. Geol. Geophys., 36, 51-60 (Correction p. 267), 1993.

Jackson, D. D., Earthquake prediction evaluation standards applied to the VAN method, Geophys. Res. Lett., this issue, 1996.

Kagan, Y. Y., Likelihood analysis of earthquake catalogs, Geophys. J. Int., 106, 135-148, 1991.

Kagan, Y. Y., VAN earthquake predictions - an attempt at statistical evaluation, Geophys. Res. Lett., this issue, 1996.

Molchan, G. M., and I. M. Rotwain, Statistical analysis of long-term precursors of large earthquakes, in Computational Seismology, 15, edited by V. I. Keilis-Borok and A. L. Levshin, Nauka, Moscow, 52-66 (in Russian; English translation: Allerton Press, New York, 56-70, 1985), 1983.

Mulargia, F., and P. Gasperini, Evaluating the statistical validity beyond chance of VAN earthquake precursors, Geophys. J. Int., 111, 32-44, 1992.

Nishenko, S. P., Circum-Pacific seismic potential: 1989-1999, Pure Appl. Geophys., 135, 169-259, 1991.

Ogata, Y., Statistical models for earthquake occurrence and residual analysis for point processes, J. Amer. Statist. Assoc., 83, 9-27, 1988.

Reasenberg, P. A., and L. M. Jones, Earthquake hazard after a mainshock in California, Science, 243, 1173-1176, 1989.

Rhoades, D. A., and F. F. Evison, Long-range earthquake forecasting based on a single predictor with clustering, Geophys. J. Int., 113, 371-381, 1993.

Rhoades, D. A., and F. F. Evison, The VAN earthquake predictions, Geophys. Res. Lett., this issue, 1996.

Riedel, K. S., Statistical tests for evaluating an earthquake prediction method, Geophys. Res. Lett., this issue, 1996.

Stark, P. B., A few statistical considerations for ascribing statistical significance to earthquake predictions, Geophys. Res. Lett., this issue, 1996.

Varotsos, P., Earthquake prediction in Greece based on seismic electric signals; Period: January 1, 1984 to March 18, 1986, Bollettino di Geodesia e Scienze Affini, 45(2), 191-202, 1986.

Varotsos, P., and K. Alexopoulos, Physical properties of the variations of the electric field of the Earth preceding earthquakes, II, Determination of epicenter and magnitude, Tectonophysics, 110, 99-125, 1984.

Varotsos, P., and K. Alexopoulos, Physical properties of the variations in the electric field of the Earth preceding earthquakes, III, Tectonophysics, 136, 335-339, 1987.

Varotsos, P., and M. Lazaridou, Latest aspects of earthquake prediction in Greece based on seismic electric signals, Tectonophysics, 188, 321-347, 1991.

Varotsos, P., K. Eftaxias, F. Vallianatos, and M. Lazaridou, Basic principles for evaluating an earthquake prediction method, Geophys. Res. Lett., this issue, 1996.
Varotsos, P., and M. Lazaridou, Reply to "VAN earthquake predictions - an attempt at statistical evaluation," by Y. Y. Kagan, Geophys. Res. Lett., this issue, 1996.

Y. Y. Kagan, Institute of Geophysics and Planetary Physics, University of California, Los Angeles, California 90095-1567. (e-mail: kagan@cyclop.ess.ucla.edu)

D. D. Jackson, Department of Earth and Space Sciences, University of California, Los Angeles, California 90095-1567. (e-mail: djackson@ess.ucla.edu)

(Received September 6, 1994; revised December 5, 1995; accepted December 5, 1995.)