The Model Selection Criteria of Akaike, Schwarz, and Kashyap
Stanley L. Sclove
University of Illinois at Chicago
Given a set of models, indexed by k = 1, 2, ..., K, let Lk denote the maximized likelihood
under the k-th model.
Akaike's (1973), Schwarz's (1978) and Kashyap's (1982) model-selection criteria involve log
Lk. It has become customary to multiply this by negative 2 in expressing the criteria
(giving a smaller-is-better criterion); the criteria then take the form
GIC(k) = -2 log(Lk) + a(n) mk + bk,
where "GIC" denotes "Generalized Information Criterion," "log" denotes the natural (base e)
logarithm, and
mk = number of independent parameters in the k-th model,
a(n) = 2, for all n, bk = 0, for Akaike's criterion AIC,
a(n) = log n, bk = 0, for Schwarz's criterion,
a(n) = log n, bk = log[det Bk], for Kashyap's criterion,
det = determinant,
and
nBk = the negative of the matrix of second partial derivatives of the log likelihood with
respect to the parameters, evaluated at their maximum likelihood estimates.
The expected value of Bk is the Fisher information matrix for a single observation.
Note that Bk is O(1) with respect to n. The bigger det Bk is, the higher the penalty in Kashyap's
criterion.
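As a small illustration, the criteria above might be computed as follows. This Python sketch, including the function name gic and its argument names, is mine rather than the author's; for Kashyap's criterion the matrix Bk must be supplied by the caller.

    import numpy as np

    def gic(log_lik, m, n, criterion="AIC", B=None):
        # Generalized Information Criterion, -2 log L + a(n) m + b; smaller is better.
        #   log_lik : maximized log likelihood under the model
        #   m       : number of independent parameters
        #   n       : sample size
        #   B       : the O(1) matrix Bk, required only for Kashyap's criterion
        if criterion == "AIC":            # a(n) = 2, b = 0
            a, b = 2.0, 0.0
        elif criterion == "Schwarz":      # a(n) = log n, b = 0
            a, b = np.log(n), 0.0
        elif criterion == "Kashyap":      # a(n) = log n, b = log det Bk
            a, b = np.log(n), np.log(np.linalg.det(B))
        else:
            raise ValueError("unknown criterion: " + criterion)
        return -2.0 * log_lik + a * m + b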
Another criterion would result from replacing Bk by its expectation, the FIM. This criterion
would be
-2 log(Lk) + a(n) mk + log det FIMk = -2 log(Lk) + a(n) mk - log det IFIMk,
where IFIM is the Inverse Fisher Information Matrix.
The IFIM is often the asymptotic covariance matrix of the MLEs. The larger this covariance matrix, the
lower (better) the value of the MSC. (This seems counter-intuitive. But the larger the
covariance matrix involved in Lk, the smaller the likelihood will be, and the larger will be -2 log
likelihood. So the additional Kashyap term mitigates this. It involves the covariance matrix of
the estimates, whereas the likelihood involves the covariance matrix of the residuals.)
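To make Bk concrete, the following sketch (my construction, not the author's) estimates it for an i.i.d. normal sample by taking a finite-difference Hessian of the log likelihood at the maximum likelihood estimates; for this simple model the Hessian could also be written down analytically.

    import numpy as np

    def num_hessian(f, theta, h=1e-5):
        # Central finite-difference Hessian of a scalar function f at theta.
        p = len(theta)
        H = np.zeros((p, p))
        for i in range(p):
            for j in range(p):
                def shifted(si, sj):
                    t = np.array(theta, dtype=float)
                    t[i] += si * h
                    t[j] += sj * h
                    return f(t)
                H[i, j] = (shifted(+1, +1) - shifted(+1, -1)
                           - shifted(-1, +1) + shifted(-1, -1)) / (4 * h * h)
        return H

    # Illustrative data: i.i.d. N(mu, sigma^2), parameters theta = (mu, sigma^2).
    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=200)
    n = len(x)

    def log_lik(theta):
        mu, s2 = theta
        return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

    theta_hat = np.array([x.mean(), x.var()])   # MLEs; np.var uses divisor n
    B = -num_hessian(log_lik, theta_hat) / n    # n Bk = -Hessian at the MLE
    b_k = np.log(np.linalg.det(B))              # Kashyap's additive term log det Bk
    print(b_k)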
The factor a(n) is to be thought of as a per-parameter penalty. Its value provides an answer
to the question of how much an additional parameter must improve the fit before it should be
included in the model. An additional parameter will be included in the model if the improvement
in 2 log(max Lk) is greater than a(n), i.e., if max Lk improves by a factor greater than
exp[a(n)/2].
AIC generally chooses a model with more parameters than do the other criteria. Since log n is
greater than 2 for n ≥ 8, Schwarz's criterion will then choose a model no larger than that
chosen by Akaike's.
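As a quick numeric check of these thresholds (a sketch, not from the source): under AIC the improvement factor exp[a(n)/2] is e ≈ 2.718 for every n, while under Schwarz's criterion it is exp[(log n)/2] = sqrt(n), which first exceeds e at n = 8.

    import numpy as np

    # Improvement factor exp[a(n)/2] that a new parameter must achieve.
    for n in (7, 8, 100, 1000):
        aic_factor = np.exp(2 / 2)           # a(n) = 2, so the factor is e
        bic_factor = np.exp(np.log(n) / 2)   # a(n) = log n, so the factor is sqrt(n)
        print(n, round(aic_factor, 3), round(bic_factor, 3))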
In homoscedastic Gaussian models, -2 log(max Lk) is,
except for constants, n times the log of the maximum likelihood
estimate of the variance. More precisely, for p-variate normal samples, the criterion is of the
form
np log(2) + n log(det Sk) + np + a(n)mk,
where Sk is the maximum likelihood estimate of the
error covariance matrix under model k. The larger the determinant of the covariance matrix, the
higher the criterion, and the corresponding model is judged to be less good.
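A minimal sketch of this computation (my own helper, not from the source), assuming the rows of X are observations and Sk is the maximum likelihood estimate, i.e., computed with divisor n:

    import numpy as np

    def gaussian_gic(X, m, a_n):
        # np log(2 pi) + n log(det Sk) + np + a(n) m for a p-variate normal model.
        n, p = X.shape
        S = np.cov(X, rowvar=False, bias=True)   # bias=True gives the MLE (divisor n)
        minus2loglik = n * p * np.log(2 * np.pi) + n * np.log(np.linalg.det(S)) + n * p
        return minus2loglik + a_n * m

With an unstructured mean vector and covariance matrix, for example, mk would be p + p(p+1)/2.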
References
Akaike, H. (1973). "Information Theory and an Extension of the Maximum Likelihood Principle,"
Proc. 2nd International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki),
267-281, Akademiai Kiado, Budapest.
Kashyap, R. L. (1982). "Optimal Choice of AR and MA Parts in Autoregressive Moving Average
Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 4, 99-104.
Schwarz, G. (1978). "Estimating the Dimension of a Model," Annals of Statistics, Vol. 6, 461-464.