Fitting distributions to data
This article explains why the ModelRisk software from Vose Software uses information criteria for
measuring and ranking the degree of fit of distributions fitted to a data set.
Fitting a distribution to data is an essential element of risk analysis and almost all of the commercial
risk analysis software tools currently available offer this facility.
Most of these tools will fit a distribution using maximum likelihood methods, which means that they
find the parameters of a distribution that give the highest probability that the distribution could
have produced the observed values. This is a good technique.
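As a sketch of what maximum likelihood fitting does, here is a minimal example in Python using scipy. This is an illustration on our part, not ModelRisk's implementation (ModelRisk is spreadsheet-based):

```python
# Minimal sketch of maximum likelihood fitting with scipy (illustrative
# only; ModelRisk's own fitting engine is not shown here).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=500)  # the "observed" sample

# .fit() returns the parameter values that maximise the likelihood of
# the data under the chosen distribution family.
mu_hat, sigma_hat = stats.norm.fit(data)
print(mu_hat, sigma_hat)  # close to the true values 10 and 2
```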
Usually one tries several different distribution types, in which case the software needs to rank how
well each distribution fits the data. Unfortunately most software will then use some very old and
flawed methods to assess the degree of fit. These methods are:
Chi-squared goodness of fit test
This involves organising the data into histogram form and then comparing the degree to which the
histogram matches the fitted distribution. There are two problems with this method: (1) there is no
‘correct’ number of histogram bars one should use, nor is there a correct range for each bar, and by
changing the number and range of the bars one can change the ranking of fit; and (2) the method is
based on chi-squared statistics which implicitly assumes that there is a very large number of data
points, which we typically don’t have (technically, the number of observations falling within the
range of a histogram bar follows a binomial distribution, which the test approximates by a Normal
distribution).
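Problem (1) is easy to demonstrate: the chi-squared statistic for the same data and the same fitted distribution changes with the arbitrary choice of bin count. A minimal sketch in Python (the `chi2_stat` helper is hypothetical, not a ModelRisk or scipy routine):

```python
# Sketch: the chi-squared goodness-of-fit statistic depends on the
# (arbitrary) binning choice. chi2_stat is a hypothetical helper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(size=200)
mu, sigma = stats.norm.fit(data)  # fit a Normal by maximum likelihood

def chi2_stat(sample, bins):
    counts, edges = np.histogram(sample, bins=bins)
    # Expected count in each bin under the fitted Normal
    expected = len(sample) * np.diff(stats.norm.cdf(edges, mu, sigma))
    return float(np.sum((counts - expected) ** 2 / expected))

# Same data, same fitted distribution, different statistic:
print(chi2_stat(data, 5), chi2_stat(data, 15))
```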
Kolmogorov-Smirnov (K-S) goodness of fit test
This involves comparing the cumulative distribution curves of the data and fitted distribution, and
finding the maximum vertical distance between them (Chakravarti et al, 1967). Figure 1 illustrates an
example:
Figure 1: Illustration of the Kolmogorov-Smirnov distance, shown as the maximum vertical distance between the
cumulative probability function for a fitted distribution (red) and the empirical cumulative distribution (blue).
The smaller this distance is, the better the fit is assumed to be. There are three problems with this
technique: (1) the largest distance tends to occur towards the middle of the distribution, so by
searching for the smallest K-S distance one is selecting for a distribution that fits best in the middle,
when it is usually the tails of a distribution that present us with the big risks; (2) the technique was
designed to compare a dataset against a distribution from which the data were hypothesised to have
come – it does not apply when the distribution’s parameters have been estimated from the data.
Modifications have been produced to account for this problem for a few distribution types (normal,
lognormal, exponential, Weibull, and Gumbel extreme value), but if the fitted distribution is none of
these then a general and sometimes quite inaccurate formula is applied; and (3) the method
cannot be used with a discrete distribution.
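For reference, the K-S distance itself is straightforward to compute. A sketch using scipy, with the caveat from point (2) built in: because the parameters are estimated from the same data, the test’s reported p-value is not valid, although the statistic is still the maximum vertical distance between the two curves:

```python
# Sketch of the K-S statistic with scipy. Because the Normal's parameters
# are estimated from the same data, kstest's p-value is not valid here
# (the point made in the text); the statistic itself is still the maximum
# vertical distance between the two cumulative curves.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(size=300)
mu, sigma = stats.norm.fit(data)            # parameters from the data

result = stats.kstest(data, "norm", args=(mu, sigma))
print(result.statistic)                     # the K-S distance
```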
Anderson-Darling (A-D) goodness of fit test
Like the K-S method, this involves comparing the cumulative distribution curves of the data and
fitted distribution, but assesses the total area between the two curves (Figure 2).
Figure 2: Illustration of the area (in black) used for calculating the Anderson-Darling statistic.
The smaller this area is, the better the fit is assumed to be. The goodness of fit statistic accounts for
the greater chance of having bigger gaps towards the middle of a distribution by weighting the area
across the distribution’s range (Stephens, 1974). There are three problems with this technique: (1) as
with the K-S statistic, the statistical significance of the A-D statistic is dependent on the distribution
type being fitted, and the statistical significance has only been determined for a few distribution
types (normal, lognormal, exponential, Weibull, and Gumbel extreme value), so there is no correct
way to compare the fit for different distributions outside this list, and (2) like the K-S method, the AD method does not apply when one has estimated the distribution parameters from data (Stephens,
1976) ; and (3) the method cannot be used with a discrete distribution.
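A sketch of the A-D statistic using scipy, whose `anderson` function echoes limitation (1): it only provides critical values for a short list of distribution families such as the Normal, exponential, logistic and Gumbel:

```python
# Sketch with scipy.stats.anderson, which (echoing limitation (1) in the
# text) only supplies critical values for a handful of distribution
# families.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(size=250)

result = stats.anderson(data, dist="norm")
print(result.statistic)          # A-D statistic: smaller = better fit
print(result.critical_values)    # thresholds valid for this family only
```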
Other, more general problems
A distribution that is defined by three or more parameters can have a lot more flexibility in shape to
match the pattern of data than another distribution with just two parameters. For example, a
Normal distribution is determined by its mean and standard deviation, whilst a three-parameter
Student distribution has an additional ‘degrees of freedom’ parameter. If this last parameter is large
the three-parameter Student will look almost exactly like a Normal (Figure ???). It is no surprise
therefore that a three-parameter Student distribution will often give a better fit under the methods
described above than a Normal, even when the data are random samples from a Normal
distribution. This is because none of the techniques above penalise a fitted distribution for the
number of parameters that can be changed to achieve the best possible fit.
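The Student-versus-Normal point can be sketched numerically. In the example below (our illustration, using scipy), data drawn from a genuine Normal are fitted with both families; since the Student-t reduces to a Normal as its degrees-of-freedom parameter grows, its maximised log-likelihood can in theory never be lower:

```python
# Sketch of the over-parameterisation point: on data that really are
# Normal, a 3-parameter Student-t can match or beat the Normal's
# maximised log-likelihood, because for large degrees of freedom the
# t-distribution reduces to a Normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=400)                 # genuinely Normal data

mu, sigma = stats.norm.fit(data)
ll_norm = stats.norm.logpdf(data, mu, sigma).sum()

df, loc, scale = stats.t.fit(data)          # 3 free parameters
ll_t = stats.t.logpdf(data, df, loc, scale).sum()

# In theory ll_t >= ll_norm; a numerical optimiser may fall a whisker short.
print(ll_norm, ll_t)
```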
Another problem with measuring the degree of fit using the methods described above is that they
don’t work where one has truncated, censored or grouped data – a situation we are often faced
with.
The solution
ModelRisk instead uses three information criteria (IC) to measure goodness of fit: Akaike,
Hannan-Quinn, and Schwarz. The smaller the IC value, the better the fit.
The criteria penalise the fit for the number of parameters of the fitted distribution in slightly
different ways. In general they give the same ranking of fit, but ModelRisk allows the user to switch
between the IC used in order to investigate the robustness of the analysis.
In addition, ICs can be used for discrete and continuous, truncated, censored and/or grouped data
which allows one to maintain a consistency in approach.
ModelRisk will report IC values directly in a spreadsheet using dedicated functions linked to fitted
distribution ‘objects’, which can also be linked directly to data sets within a spreadsheet or an
external database. This provides the additional benefit of allowing the user to construct a model that
will automatically update with the arrival of new data and select the best fitting distribution.
Moreover, a single icon click allows the user to review graphically and statistically how well the
distribution fits the data.
The following formulae for the three information criteria use:
n = number of observations (e.g. data values, frequencies)
k = number of parameters to be estimated (e.g. the Normal distribution has 2: mu and
sigma)
Lmax = the maximised value of the Likelihood for the estimated model (i.e. the
parameters are fitted by MLE and the Likelihood is evaluated at those estimates; the
formulae use its natural log, ln[Lmax]).
1. SIC (Schwarz information criterion, aka Bayesian information criterion)
The SIC (Schwarz, 1978) is the strictest of the three in penalising the loss of degrees of freedom
from having more parameters.
SIC = k ln(n) − 2 ln(Lmax)
2. AICc (Akaike information criterion)
AICc = (2n / (n − k − 1)) k − 2 ln(Lmax)
The AICc (Akaike, 1974) is the least strict of the three in penalising loss of degrees of freedom.
3. HQIC (Hannan-Quinn information criterion)
HQIC = 2k ln(ln(n)) − 2 ln(Lmax)
The HQIC (Hannan and Quinn, 1979) holds the middle ground in its penalising for the number of
parameters.
The aim is to find the model with the lowest value of the selected information criterion. The
−2 ln(Lmax) term appearing in each formula is an estimate of the deviance of the model fit. The
coefficients of k in the first part of each formula show the degree to which the number of model
parameters is being penalised. For n > 20 or so the SIC is the strictest in penalising loss of degrees
of freedom by having more parameters in the fitted model. For n > 40 the AICc is the least strict of
the three, and the HQIC holds the middle ground, or is the least penalising for n < 20.
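The three formulae translate directly into code. A sketch in Python (the function names here are ours, not ModelRisk's; ModelRisk exposes IC values through dedicated spreadsheet functions):

```python
# Direct transcription of the three information criteria above.
# (Helper names are hypothetical; ModelRisk exposes these as
# spreadsheet functions, not Python.)
import numpy as np
from scipy import stats

def sic(ll_max, k, n):   # Schwarz / Bayesian IC
    return k * np.log(n) - 2.0 * ll_max

def aicc(ll_max, k, n):  # small-sample Akaike IC
    return (2.0 * n / (n - k - 1.0)) * k - 2.0 * ll_max

def hqic(ll_max, k, n):  # Hannan-Quinn IC
    return 2.0 * k * np.log(np.log(n)) - 2.0 * ll_max

rng = np.random.default_rng(5)
data = rng.normal(size=100)
mu, sigma = stats.norm.fit(data)               # MLE, so k = 2
ll = stats.norm.logpdf(data, mu, sigma).sum()  # ln(Lmax)

n, k = len(data), 2
print(sic(ll, k, n), aicc(ll, k, n), hqic(ll, k, n))
```

For n = 100 and k = 2 the penalty terms are k ln(n) ≈ 9.2 for the SIC, 2k ln(ln n) ≈ 6.1 for the HQIC and (2n/(n − k − 1)) k ≈ 4.1 for the AICc, matching the ordering of strictness described above.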
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic
Control, AC-19, pp. 716–723.
Chakravarti, Laha and Roy (1967). Handbook of Methods of Applied Statistics, Volume I, John Wiley
and Sons, pp. 392–394.
Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal
of the Royal Statistical Society, B 41, pp. 190–195.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, Vol. 6, pp. 461–464.
Stephens, M. A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the
American Statistical Association, Vol. 69, pp. 730–737.
Stephens, M. A. (1976). Asymptotic Results for Goodness-of-Fit Statistics with Unknown Parameters.
Annals of Statistics, Vol. 4, pp. 357–369.