The Model Selection Criteria of Akaike, Schwarz, and Kashyap

Stanley L. Sclove
University of Illinois at Chicago

Given a set of models, indexed by k = 1, 2, ..., K, let Lk denote the maximized likelihood under the k-th model. Akaike's (1973), Schwarz's (1978), and Kashyap's (1982) model-selection criteria involve log Lk. It has become customary to multiply this by negative 2 in expressing the criteria (giving a smaller-is-better criterion); the criteria then take the form

    GIC(k) = -2 log Lk + a(n) mk + bk,

where "GIC" denotes "Generalized Information Criterion," "log" denotes the natural (base e) logarithm, mk is the number of independent parameters in the k-th model, and

    a(n) = 2 for all n, bk = 0, for Akaike's criterion AIC;
    a(n) = log n, bk = 0, for Schwarz's criterion;
    a(n) = log n, bk = log(det Bk), for Kashyap's criterion,

where det denotes the determinant and n Bk is the negative of the matrix of second partial derivatives of the log-likelihood with respect to the parameters, evaluated at their maximum likelihood estimates. The expected value of Bk is the Fisher information matrix; note that Bk is O(1) with respect to n. The larger det Bk, the greater the penalty in Kashyap's criterion.

Another criterion would result from replacing Bk by its expectation, the Fisher information matrix (FIM). This criterion would be

    -2 log Lk + a(n) mk + log(det FIMk) = -2 log Lk + a(n) mk - log(det IFIMk),

where IFIM denotes the inverse Fisher information matrix. The IFIM is often the (asymptotic) covariance matrix of the MLEs. The larger this covariance matrix, the lower (better) the value of the model-selection criterion. This seems counter-intuitive, but the larger the covariance matrix involved in Lk, the smaller the likelihood will be, and the larger -2 log likelihood will be; the additional Kashyap term mitigates this. The Kashyap term involves the covariance matrix of the estimates, whereas the likelihood involves the covariance matrix of the residuals.

The factor a(n) is to be thought of as a per-parameter penalty. Its value answers the question of how much an additional parameter must improve the fit before it should be included in the model: an additional parameter is included if the improvement in 2 log(max Lk) exceeds a(n), i.e., if max Lk improves by a factor greater than exp[a(n)/2]. AIC generally chooses a model with more parameters than do the others. Since log n exceeds 2 for n ≥ 8 (indeed for n > e^2 ≈ 7.4), Schwarz's criterion will choose a model no larger than that chosen by Akaike's for such n.

In homoscedastic Gaussian models, -2 log(max Lk) is, except for constants, n times the log of the maximum likelihood estimate of the variance. More precisely, for p-variate normal samples the criterion takes the form

    np log(2π) + n log(det Sk) + np + a(n) mk,

where Sk is the maximum likelihood estimate of the error covariance matrix under model k. The larger the determinant of the estimated covariance matrix, the higher the criterion, and the less favorably the corresponding model is judged. Computational sketches of the criteria, of the per-parameter penalty factor, and of the Gaussian case follow the references.

References

Akaike, H. (1973). "Information Theory and an Extension of the Maximum Likelihood Principle," Proc. 2nd International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki), 267-281, Akademiai Kiado, Budapest.

Kashyap, R. L. (1982). "Optimal Choice of AR and MA Parts in Autoregressive Moving Average Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 4, 99-104.

Schwarz, G. (1978). "Estimating the Dimension of a Model," Annals of Statistics, Vol. 6, 461-464.
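First, the GIC family itself. The following is a minimal sketch in Python (illustrative code, not from the article; the function names and argument conventions are assumptions) that computes the three criteria from a model's maximized log-likelihood, its number of independent parameters m, the sample size n, and, for Kashyap's criterion, the Hessian of the log-likelihood at the MLE.

```python
import numpy as np

def gic(log_lik, m, n, a_n, b=0.0):
    """Generalized Information Criterion -2 log L + a(n) m + b (smaller is better)."""
    return -2.0 * log_lik + a_n * m + b

def aic(log_lik, m, n):
    # Akaike: a(n) = 2 for all n, b = 0.
    return gic(log_lik, m, n, a_n=2.0)

def schwarz(log_lik, m, n):
    # Schwarz: a(n) = log n, b = 0.
    return gic(log_lik, m, n, a_n=np.log(n))

def kashyap(log_lik, m, n, hessian):
    # Kashyap: a(n) = log n, b = log det B, where n B is the negative of the
    # matrix of second partials of the log-likelihood at the MLE (B is O(1) in n).
    B = -np.asarray(hessian) / n
    sign, log_det_B = np.linalg.slogdet(B)  # det B assumed positive at a maximum
    return gic(log_lik, m, n, a_n=np.log(n), b=log_det_B)
```

Given K fitted candidate models, one evaluates the chosen criterion for k = 1, ..., K and selects the model with the smallest value.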
"Simultaneous Inference and the Choice of Variable Subsets," Technometrics, Vol. 16, 221-227. Akaike, H. (1969). "Fitting Autoregressive Models for Prediction," Annals of the Institute of Statistical Mathematics, Vol. 21, 243-247. Akaike, H. (1983). "Statistical Inference and Measurement of Entropy," Scientific Inference, Data Analysis and Robustness (eds. H. Akaike and C.-F. Wu), 165-189, Academic Press, New York. Akaike, H. (1985). "Prediction and Entropy," A Celebration of Statistics: the ISI Centenary Volume (eds. A.C. Atkinson and S.E. Fienberg), 1-24, Springer-Verlag, New York. Akaike, H. (1987). "Factor Analysis and AIC," Psychometrika, Vol. 52, 317-332. Bendel, R. B. and Afifi, A. A. (1977). "Comparison of Stopping Rules in Forward Stepwise Regression," Journal of the American Statistical Association, Vol. 72, 46-53. Bozdogan, H. (1986). "Multi-sample Cluster Analysis as an Alternative to Multiple Comparison Procedures," Bulletin of Informatics and Cybernetics, Vol. 22 (No. 1-2), 95-130. Bozdogan, H. (1988). "ICOMP: A New Model-Selection Criterion," (Classification and Related Methods of Data Analysis) (ed. H. H. Bock), 599-608, North-Holland Publishing Company, Amsterdam. Bozdogan, H. (1990). "On the Information Based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models," Communications in Statistics A (Theory and Methods), Vol. 19, 221-278. Bozdogan, H., and Sclove, S. L. (1984). "Multi-Sample Cluster Analysis using Akaike's Information Criterion," Annals of the Institute of Statistical Mathematics, Vol. 36, 163180. Bozdogan, H., Sclove, S. L. and Gupta, A. K. (1993). "AIC-Replacements for Some Multivariate Tests of Homogeneity with Applications in Multisample Clustering and Variable Selection," in (Multivariate Statistical Modeling), Proc. 1st U.S./Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, H. Bozdogan (ed.), Kluwer Academic Publishers, Dordrecht, the Netherlands. Cheeseman, P. (1993). "Overview of Model Selection," Selecting Models from Data: Artificial Intelligence and Statistics IV (Springer Lecture Notes in Statistics No. 89), pp. 31-39. P. Cheeseman and R.W. Oldford (eds.), Springer Verlag, New York, 1994. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the E-M Algorithm," Journal of the Royal Statistical Society(Series B), Vol. 39, 1-38. Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York. Geisser, S. (1966). "Predictive Discrimination," Multivariate Analysis (ed. P.R. Krishnaiah), 149163, Academic Press, New York. Geisser, S. (1975). "Predictive Sample Reuse Method with Applications," (Journal of the American Statistical Association), Vol. 70, 320-328. Geisser, S., and Eddy, W. F. (1979). "A Predictive Approach to Model Selection," (Journal of the American Statistical Association), Vol. 74, 153-160. Good, I. J. (1950). Probability and the Weighing of Evidence. Charles Griffin, London. Hartigan, J. (1985). "A Failure of Likelihood Asymptotics for Normal Mixtures," Proc. Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (eds. L.M. LeCam and R.A. Olshen), Vol. II, 807-811, Wadsworth, Inc., Belmont, CA. Kitagawa, G. (1979). "On the Use of AIC for the Detection of Outliers," Technometrics, Vol. 21, 193-199. Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York. Reprinted (1968), Dover Publications, New York. Kullback, S. and Leibler, R. A. 
(1951). "On Information and Sufficiency," (Annals of Mathematical Statistics), Vol. 22, 79-86. Leonard, T. (1982). Comment on "A Simple Predictive Density Function" by M. LeJeune and G. D. Faulkenberry, (Journal of the American Statistical Association), Vol. 77, No. 370, 657-658. Linhart, H. and Zucchini, W. (1986). Model Selection. John Wiley and Sons, New York. McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York. Nicholson, G.E., Jr. (1960). "Prediction in Future Samples," Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (ed. I. Olkin), 322-330, Stanford University Press, Palo Alto. Parzen, E. (1982). "Maximum Entropy Interpretation of Autoregressive Spectral Densities," ( Statistics and Probability Letters), Vol. 1, 7-11. Rissanen, J. (1978). "Modeling by Shortest Data Description," (Automatica), Vol. 14, 465-471. Rissanen, J. (1985). "Minimum-Description-Length Principle," (Encyclopedia of Statistical Sciences), Vol. 5, 523-527, John Wiley and Sons, New York. Rissanen, J. (1986). "Stochastic Complexity and Modeling," (Annals of Statistics), Vol. 14, 1080-1100. Rissanen, J. (1987). "Stochastic Complexity," (Journal of the Royal Statistical Society), (Series B), Vol. 49, 223-239. Rissanen, J. (1989). (Stochastic Complexity in Statistical Inquiry.) World Scientific Publishing Company, Teaneck, NJ. Sclove, S. L. (1969). "On Criteria for Choosing a Regression Equation for Prediction," Technical Report No. 28, Dept. of Statistics, Carnegie-Mellon University, Pittsburgh. Sclove, S. L. (1987). "Application of Model-Selection Criteria to Some Problems in Multivariate Analysis," Psychometrika, Vol. 52, 333-343. Titterington, D. M., Smith, A.F.M. and Makov, U. E. (1985). (Statistical Analysis of Finite Mixture Distributions.) John Wiley and Sons, New York. van der Merwe, A. J., Groenewald, P.C.N., de Waal, D. J. and van der Merwe, C. A. (1983). "Model Selection for Future Data in the Case of Multivariate Regression Analysis," (South African Statistical Journal), Vol. 17, 147-164. van Emden, M. H. (1971). An Analysis of Complexity. Mathematical Centre Tracts 35. Mathematisch Centrum, Amsterdam. Wolfe, J. H. (1970). "Pattern Clustering by Multivariate Mixture Analysis," (Multivariate Behavioral Research), Vol. 5, 329-350.