Fitting distributions to data

This article explains why the ModelRisk software from Vose Software uses information criteria to measure and rank the degree of fit for distributions fitted to a data set.

Fitting a distribution to data is an essential element of risk analysis, and almost all of the commercial risk analysis software tools currently available offer this facility. Most of these tools will fit a distribution using maximum likelihood methods, which means that they find the parameters of a distribution that give the highest probability that the distribution could have produced the observed values. This is a good technique. Usually one tries several different distribution types, in which case the software needs to rank how well each distribution fits the data. Unfortunately, most software then uses some very old and flawed methods to assess the degree of fit. These methods are:

Chi-squared goodness of fit test

This involves organising the data into histogram form and then comparing the degree to which the histogram matches the fitted distribution. There are two problems with this method: (1) there is no 'correct' number of histogram bars one should use, nor is there a correct range for each bar, and by changing the number and range of the bars one can change the ranking of fit; and (2) the method is based on chi-squared statistics, which implicitly assumes that there is a very large number of data points, which we typically don't have (technically, it assumes that the distribution of the number of observations that could have fallen within the range of a histogram bar follows a Binomial distribution, which can be approximated by a Normal distribution).

Kolmogorov-Smirnov (K-S) goodness of fit test

This involves comparing the cumulative distribution curves of the data and the fitted distribution, and finding the maximum vertical distance between them (Chakravarti et al., 1967).
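As a rough sketch of the idea (this is illustrative code, not ModelRisk's implementation, and the sample data are invented), the K-S distance can be computed directly as the largest vertical gap between the empirical CDF and the fitted CDF:

```python
# Sketch (assumed code, not ModelRisk): the K-S statistic as the maximum
# vertical distance between the empirical CDF and a fitted Normal CDF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = np.sort(rng.normal(0.0, 1.0, 30))       # invented sample for illustration
mu, sigma = data.mean(), data.std(ddof=0)      # parameters estimated from the data

# The empirical CDF steps just below and just above each observation;
# the K-S distance is the largest gap to the fitted CDF at either side.
ecdf_hi = np.arange(1, len(data) + 1) / len(data)
ecdf_lo = np.arange(0, len(data)) / len(data)
fitted = stats.norm.cdf(data, mu, sigma)
ks = max(np.abs(ecdf_hi - fitted).max(), np.abs(ecdf_lo - fitted).max())

# scipy computes the same distance -- but note its p-value assumes the
# parameters were NOT estimated from the data (problem (2) discussed below)
print(ks, stats.kstest(data, stats.norm(mu, sigma).cdf).statistic)
```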
Figure 1 illustrates an example:

Figure 1: Illustration of the Kolmogorov-Smirnov distance, shown as the maximum vertical distance between the cumulative probability function for a fitted distribution (red) and the empirical cumulative distribution (blue). The smaller this distance is, the better the fit is assumed to be.

There are three problems with this technique: (1) the largest distance tends to occur towards the middle of the distribution, so by searching for the smallest K-S distance one is selecting for a distribution that fits best in the middle, when it is usually the tails of a distribution that present us with the big risks; (2) the technique was designed to compare a dataset against a distribution from which the data were hypothesised to have come – it does not apply when one has estimated the distribution's parameters from the data. Modifications have been produced to account for this problem for a few distribution types (Normal, Lognormal, Exponential, Weibull, and Gumbel extreme value), but if the fitted distribution is none of these then a general and sometimes quite inaccurate formula is applied; and (3) the method cannot be used with a discrete distribution.

Anderson-Darling (A-D) goodness of fit test

Like the K-S method, this involves comparing the cumulative distribution curves of the data and the fitted distribution, but it assesses the total area between the two curves (Figure 2).

Figure 2: Illustration of the area (in black) used for calculating the Anderson-Darling statistic. The smaller this area is, the better the fit is assumed to be.

The goodness of fit statistic accounts for the greater chance of having bigger gaps towards the middle of a distribution by weighting the area across the distribution's range (Stephens, 1974).
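For illustration (again a sketch with invented data, not ModelRisk code), scipy exposes this weighted-area statistic for a Normal fit; note that scipy's `anderson` supports only a handful of distribution types, which is exactly the limitation described next:

```python
# Sketch (illustration only, not ModelRisk code): the Anderson-Darling
# statistic for a Normal fitted to data, via scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(50.0, 5.0, 40)             # invented sample for illustration

result = stats.anderson(data, dist='norm')   # parameters are estimated internally
print(result.statistic)                      # weighted-area statistic; smaller = better fit
print(result.critical_values)                # thresholds at standard significance levels
```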
There are three problems with this technique: (1) as with the K-S statistic, the statistical significance of the A-D statistic depends on the distribution type being fitted, and it has only been determined for a few distribution types (Normal, Lognormal, Exponential, Weibull, and Gumbel extreme value), so there is no correct way to compare the fit for different distributions outside this list; (2) like the K-S method, the A-D method does not apply when one has estimated the distribution parameters from data (Stephens, 1976); and (3) the method cannot be used with a discrete distribution.

Other, more general problems

A distribution that is defined by three or more parameters can have a lot more flexibility in shape to match the pattern of data than another distribution with just two parameters. For example, a Normal distribution is determined by its mean and standard deviation, whilst a three-parameter Student distribution has an additional 'degrees of freedom' parameter. If this last parameter is large, the three-parameter Student will look almost exactly like a Normal (Figure ???). It is no surprise, therefore, that a three-parameter Student distribution will often give a better fit under the methods described above than a Normal, even when the data are random samples from a Normal distribution. This is because none of the techniques above penalise a fitted distribution for the number of parameters that can be changed to achieve the best possible fit.

Another problem with measuring the degree of fit using the methods described above is that they do not work where one has truncated, censored or grouped data – a situation we are often faced with.

The solution

Instead of the methods described above, ModelRisk uses three information criteria (IC) to measure goodness of fit: Akaike, Hannan-Quinn, and Schwarz. The smaller the IC value, the better the fit.
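The Student-versus-Normal overfitting problem, and the way a parameter penalty resolves it, can be sketched as follows (assumed code, not ModelRisk's; the SIC formula used here is the one given in the next section):

```python
# Sketch (assumed code, not ModelRisk's): a 3-parameter Student-t fitted by
# MLE to truly Normal data matches the data at least as well as the fitted
# Normal by likelihood alone, but the SIC's k*ln(n) penalty favours the Normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(5.0, 1.0, 100)               # random samples from a Normal

mu, sigma = stats.norm.fit(data)               # Normal: k = 2 parameters
df, loc, scale = stats.t.fit(data)             # Student: k = 3 (extra 'degrees of freedom')

loglik_norm = stats.norm.logpdf(data, mu, sigma).sum()
loglik_t = stats.t.logpdf(data, df, loc, scale).sum()

sic_norm = 2 * np.log(len(data)) - 2 * loglik_norm
sic_t = 3 * np.log(len(data)) - 2 * loglik_t
print(loglik_t - loglik_norm)   # near zero or positive: t 'fits' at least as well
print(sic_norm < sic_t)         # the penalised criterion prefers the true Normal
```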
The criteria penalise the fit for the number of parameters of the fitted distribution in slightly different ways. In general they give the same ranking of fit, but ModelRisk allows the user to switch between the ICs in order to investigate the robustness of the analysis. In addition, ICs can be used for discrete and continuous, truncated, censored and/or grouped data, which allows one to maintain a consistent approach. ModelRisk will report IC values directly in a spreadsheet using dedicated functions linked to fitted distribution 'objects', which can also be linked directly to data sets within a spreadsheet or an external database. This provides the additional benefit of allowing the user to construct a model that will automatically update with the arrival of new data and select the best fitting distribution. Moreover, a single icon click allows the user to review graphically and statistically how well the distribution fits the data.

The following formulae for the three information criteria use:

n = number of observations (e.g. data values, frequencies)
k = number of parameters to be estimated (e.g. the Normal distribution has 2: mu and sigma)
Lmax = the maximised value of the likelihood for the estimated model (i.e. the parameters are fitted by MLE, and ln(Lmax) is the corresponding log-likelihood)

1. SIC (Schwarz information criterion, aka Bayesian information criterion)

The SIC (Schwarz, 1978) is the most strict of the three in penalising the loss of degrees of freedom from having more parameters:

SIC = k ln(n) − 2 ln(Lmax)

2. AICc (Akaike information criterion, small-sample corrected)

The AICc (Akaike, 1974) is the least strict of the three in penalising the loss of degrees of freedom:

AICc = (2n / (n − k − 1)) k − 2 ln(Lmax)

3. HQIC (Hannan-Quinn information criterion)

The HQIC (Hannan and Quinn, 1979) holds the middle ground in its penalising for the number of parameters:

HQIC = 2 k ln(ln(n)) − 2 ln(Lmax)

The aim is to find the model with the lowest value of the selected information criterion.
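The three formulae above can be computed directly from n, k and the maximised log-likelihood. A minimal sketch (the helper name is invented and this is not a ModelRisk function), applied to a Normal fitted by MLE:

```python
# Sketch (invented helper, not a ModelRisk function): the three information
# criteria from the formulae above, for a fit with k parameters on n points.
import numpy as np
from scipy import stats

def info_criteria(loglik_max, n, k):
    """Return (SIC, AICc, HQIC); loglik_max is ln(Lmax). Smaller = better fit."""
    sic = k * np.log(n) - 2.0 * loglik_max
    aicc = (2.0 * n / (n - k - 1)) * k - 2.0 * loglik_max
    hqic = 2.0 * k * np.log(np.log(n)) - 2.0 * loglik_max
    return sic, aicc, hqic

rng = np.random.default_rng(11)
data = rng.normal(0.0, 1.0, 60)
mu, sigma = stats.norm.fit(data)                  # MLE; the Normal has k = 2
loglik = stats.norm.logpdf(data, mu, sigma).sum()
print(info_criteria(loglik, n=len(data), k=2))
```

Since all three criteria share the same −2 ln(Lmax) term, they differ only in the penalty attached to k, which is what drives the strictness ordering discussed next.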
The −2 ln(Lmax) term appearing in each formula is an estimate of the deviance of the model fit. The coefficients of k in the first part of each formula show the degree to which the number of model parameters is penalised. For n greater than about 20, the SIC is the strictest in penalising the loss of degrees of freedom from having more parameters in the fitted model. For n > 40 the AICc is the least strict of the three and the HQIC holds the middle ground, whilst for small n (below about 20) the HQIC becomes the least penalising.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, pp. 716–723.

Chakravarti, I. M., Laha, R. G. and Roy, J. (1967). Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, pp. 392–394.

Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B, 41, pp. 190–195.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, pp. 461–464.

Stephens, M. A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, pp. 730–737.

Stephens, M. A. (1976). Asymptotic results for goodness-of-fit statistics with unknown parameters. Annals of Statistics, 4, pp. 357–369.