The Relationship Between Least Squares and Likelihood

George P. Smith
Division of Biological Sciences, Tucker Hall
University of Missouri
Columbia, MO 65211-7400
(573) 882-3344; smithgp@missouri.edu

AUTHOR'S FOOTNOTE

George P. Smith is Professor, Division of Biological Sciences, University of Missouri, Columbia, MO 65211 (e-mail: smithgp@missouri.edu).

ABSTRACT

Under broadly applicable assumptions, the likelihood of a theory on a set of observed quantitative data is proportional to 1/D^n, where D is the root mean squared deviation of the data from the predictions of the theory and n is the number of observations.

KEYWORDS

Bayesian statistics; Ignorance prior probability; Root mean squared deviation

1. INTRODUCTION

One of the commonest settings for statistical analysis involves a series of n quantitative observations X = {x_1, x_2, ..., x_n} and a series of competing explanatory theories θ, each of which specifies a theoretical value θ_i corresponding to each of the actual observations x_i. The degree to which the observations fit the expectations of a given theory is usually gauged by the sum of the squares of the deviations for that theory,

    S = \sum_{i=1}^{n} (x_i - \theta_i)^2 ,

or equivalently by the root mean squared (RMS) deviation D = \sqrt{S/n}; D has the advantage of being on the same scale as the observations themselves and for that reason will be used here. The theory for which D is minimized is the best fit to the data according to the least-squares criterion.

Least-squares analysis is on firm theoretical grounds when it can reasonably be assumed that the deviations of the observations from the expectations of the true theory are independently, identically and normally distributed (IIND) with standard deviation σ. In those circumstances, it is well known (and will be demonstrated below) that the theory that minimizes D (or equivalently, S) also maximizes likelihood. The purpose of this article is to explain a deeper relationship between likelihood and RMS deviation that holds under broadly applicable assumptions.

2. ANALYSIS

2.1 RMS Deviation and the Likelihood Function

In consequence of the assumed normal distribution of deviations, the probability density for observing datum x_i at data-point i, given standard deviation σ and a theory θ that predicts a value of θ_i at that point, is

    Pr(x_i | \theta, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x_i - \theta_i)^2}{2\sigma^2} \right) .    Eq. 1

Here and throughout this article, the generic probability function notation Pr( | ) and the summation sign will be used for both continuous and discrete variables; it is to be understood from the context when a probability density function is intended, and when summation is to be accomplished by integration. This notational choice preserves the laws of probability in their usual form while allowing both kinds of random variable to be accommodated in a simple, unified framework.

Because of the IIND assumption, the joint probability density for obtaining the ensemble X of observations {x_1, x_2, x_3, ..., x_n} is the product of all n such probability densities:

    Pr(X | \theta, \sigma) = \prod_{i=1}^{n} Pr(x_i | \theta, \sigma)
        = \frac{\exp\left( -\frac{S}{2\sigma^2} \right)}{(2\pi)^{n/2}\, \sigma^n}
        = \frac{\exp\left( -\frac{nD^2}{2\sigma^2} \right)}{(2\pi)^{n/2}\, \sigma^n} .    Eq. 2

It will be useful in what follows to gauge dispersion in the normal distribution in terms of lnσ rather than σ itself, in which case the above distribution can be written in the form

    Pr(X | \theta, \ln\sigma) = \frac{1}{(2\pi)^{n/2} D^n}
        \exp\left[ -n(\ln\sigma - \ln D) - \frac{n}{2}\exp\big( -2(\ln\sigma - \ln D) \big) \right] .    Eq. 3

The right-hand factor in this expression is a peak-shaped function of lnσ whose peak value occurs when lnσ = lnD, but whose size and shape are independent of D and therefore of both data X and theory θ.
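As a concrete illustration of the quantities S, D, and the joint density of Eq. 2, the following minimal Python sketch evaluates them for a small set of made-up observations and two hypothetical theories (the function names and the numbers are illustrative assumptions, not taken from this article):

    import numpy as np

    def rms_deviation(x, theta):
        """RMS deviation D between observations x and a theory's predictions theta."""
        x, theta = np.asarray(x, float), np.asarray(theta, float)
        return np.sqrt(np.mean((x - theta) ** 2))

    def log_joint_density(x, theta, sigma):
        """Logarithm of Pr(X | theta, sigma) in Eq. 2 for IIND normal deviations."""
        x = np.asarray(x, float)
        n = x.size
        D = rms_deviation(x, theta)
        return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - n * D ** 2 / (2 * sigma ** 2)

    # Made-up observations and the predictions of two competing theories
    x = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
    theta_A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    theta_B = np.array([0.8, 2.1, 3.3, 4.0, 5.4])

    for name, th in (("A", theta_A), ("B", theta_B)):
        print(name, "D =", rms_deviation(x, th),
              " log Pr(X | theta, sigma=0.2) =", log_joint_density(x, th, 0.2))

Whatever value of σ is assumed, the theory with the smaller D has the larger joint density, in keeping with the least-squares criterion.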
The foregoing probability is related to other key probabilities via Bayes's theorem:

    Pr(\theta, \ln\sigma | X) \propto Pr(X | \theta, \ln\sigma)\, Pr(\theta, \ln\sigma)
        = Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma)\, Pr(\theta) ,    Eq. 4

where we assume in the second equality that the prior probability distributions for lnσ and θ are independent. Summing over all possible values of lnσ (from minus to plus infinity),

    Pr(\theta | X) = \sum_{\ln\sigma} Pr(\theta, \ln\sigma | X)
        \propto \Big[ \sum_{\ln\sigma} Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma) \Big]\, Pr(\theta)
        \equiv Pr(X | \theta)\, Pr(\theta) .    Eq. 5

In the Bayesian view, the laws of probability underlying the foregoing relationships embody the fundamental logic of science. In particular, Bayesians interpret the preceding equation as the rule for rationally updating our opinions of the competing theories θ in light of the new evidence embodied in the observations X. The prior distribution Pr(θ) and posterior distribution Pr(θ | X) gauge rational degrees of belief in the theories before and after obtaining (or considering) evidence X, respectively. Updating is achieved by multiplying the prior probability of each theory θ by Pr(X | θ), the probability, given θ, that we would obtain the evidence we actually did obtain. Considered as a function of θ for fixed evidence X (the data actually observed), Pr(X | θ) is the likelihood function L(θ | X). It captures, exactly and quantitatively, the relative weight of the evidence X for the competing theories, allowing a profound arithmetization of empirical judgment in those situations in which it can be calculated. In summary, the likelihood function for this problem can be written

    L(\theta | X) \propto \sum_{\ln\sigma} Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma) .    Eq. 6

2.2 When We Are Sufficiently Ignorant of the Standard Deviation, Likelihood Is a Simple Function of RMS Deviation

The likelihood function in Eq. 6 itself contains a prior probability distribution: Pr(lnσ). Regardless of the form of this distribution, it is obvious from the expression for Pr(X | θ, σ) in Eq. 2 that the theory that maximizes likelihood is the one with the smallest RMS deviation D (the same is true, though less obviously, of the expression for Pr(X | θ, lnσ) in Eq. 3); this confirms the well-known fact stated above that least-squares analysis pinpoints the maximum likelihood theory under the IIND assumption. But if the likelihood function is to be used to weigh the strength of the evidence for the competing theories quantitatively, rather than merely to identify the maximum likelihood theory, the prior distribution Pr(lnσ) must be specified. Occasionally it happens that extensive prior information entails a particular special form for this function. Much more often, though, we are essentially ignorant of lnσ in advance of the data. That is the case that will be considered here.

What probability distribution Pr(lnσ) properly expresses prior ignorance of the value of this parameter? Jaynes (1968; 2003, pp. 372–386) argues compellingly that ignorance, taken seriously, imposes strong constraints on prior probability distributions. In particular, the appropriate distribution for the logarithm of a scale parameter like σ is the uniform distribution Pr(lnσ) = const, or equivalently Pr(σ) ∝ 1/σ; this is the only distribution that remains invariant under a change of scale, a transformation that converts the original inference problem into another that should look identical to the truly ignorant observer. Substituting that ignorant prior distribution into Eq. 6, the likelihood function can be written

    L(\theta | X) \propto \sum_{\ln\sigma} Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma)
        \propto \int \frac{1}{D^n} \exp\left[ -n(\ln\sigma - \ln D) - \frac{n}{2}\exp\big( -2(\ln\sigma - \ln D) \big) \right] d(\ln\sigma)
        = \frac{1}{D^n} \int \exp\left[ -n(\ln\sigma - \ln D) - \frac{n}{2}\exp\big( -2(\ln\sigma - \ln D) \big) \right] d(\ln\sigma) ,    Eq. 7

where constants that do not depend on the variables of interest θ and X are suppressed, because it is only the relative values of the likelihood function that matter.
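Because Eq. 7 is just a one-dimensional integral over lnσ, the marginalization is easy to check numerically. The sketch below (made-up data, and a flat prior over a wide but finite lnσ grid standing in for the idealized ignorance prior; all names and numbers are illustrative assumptions) computes the marginal likelihood of two hypothetical theories and compares their ratio with (D_B/D_A)^n, anticipating the simplification derived next as Eq. 8.

    import numpy as np

    def marginal_likelihood(x, theta, log_sigma_grid):
        """Numerically integrate Pr(X | theta, ln sigma) over ln sigma with a flat
        ignorance prior Pr(ln sigma) = const, as in Eq. 7.  Constant factors such as
        (2*pi)**(-n/2) are dropped, since only relative likelihoods matter."""
        x = np.asarray(x, float)
        n = x.size
        D = np.sqrt(np.mean((x - np.asarray(theta, float)) ** 2))
        log_sigma = np.asarray(log_sigma_grid, float)
        integrand = np.exp(-n * log_sigma - n * D ** 2 / (2 * np.exp(2 * log_sigma)))
        return np.sum(integrand) * (log_sigma[1] - log_sigma[0]), D

    # Made-up data and two competing theories (same illustrative numbers as above)
    x = np.array([1.1, 2.0, 2.9, 4.2, 5.1])
    theta_A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    theta_B = np.array([0.8, 2.1, 3.3, 4.0, 5.4])
    grid = np.linspace(-15.0, 15.0, 30001)      # wide ln(sigma) grid, effectively unbounded

    L_A, D_A = marginal_likelihood(x, theta_A, grid)
    L_B, D_B = marginal_likelihood(x, theta_B, grid)
    print("likelihood ratio L_A / L_B =", L_A / L_B)
    print("(D_B / D_A)**n             =", (D_B / D_A) ** x.size)   # agrees, up to small numerical error

The agreement illustrates that, with the flat prior, the data enter the integrated likelihood only through D.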
As remarked above under Eq. 3, the integrand in the third part of Eq. 7 has the same size and shape regardless of the value of D. The integral in that equation is therefore itself a constant Q that does not depend on θ and X. Dividing the last part of Eq. 7 by Q, the likelihood function further simplifies to

    L(\theta | X) \propto \frac{1}{D^n}    Eq. 8

under the specified conditions. This likelihood function was previously derived in a different context by Zellner (1971, pp. 114–117). This expression does much more than simply remind us that the maximum likelihood theory is the one that minimizes D; it makes it easy for us to articulate numerically the weight of the evidence X for each contending theory θ.

2.3 How Much Ignorance Is Enough?

The following extreme case might be put forward as a counterexample to the above reasoning. Suppose one of the competing theories happens to fit the data exactly. D vanishes altogether for such a theory, and according to Eq. 8 that theory's likelihood would be infinitely greater than the likelihood of a theory that deviates even infinitesimally from the observed data. But common sense rebels at thus according infinite weight to an infinitesimal deviation.

This "counterexample" serves not to undermine the reasoning above, but rather to warn us that in using the "ignorant" prior distribution Pr(lnσ) = const we are pretending to more ignorance than we actually possess. In any situation we choose to analyze in terms of distributions of deviations, we surely must have some vague prior information that convinces us that there is at least some error; that is, that the standard deviation is not infinitesimally small. Likewise, there is ordinarily some limit to how large the standard deviation can plausibly be. If we are making measurements with a light microscope, for instance, we wouldn't credit a standard deviation as low as 1 femtometer or as high as 1 kilometer. This vague state of actual prior knowledge is sketched schematically in the mesa-shaped prior probability distribution in the upper part of Fig. 1. This curve is characterized by a broad central "domain of ignorance" where the curve is flat, and where the scale invariance that underlies the ignorant prior distribution Pr(lnσ) = const holds to a high degree of approximation. On either side of that domain the prior credibility of extreme values of lnσ descends gradually as shown, though in most cases we would be hard-pressed to describe that descent numerically. The ignorant prior distribution Pr(lnσ) = const, on which the simple likelihood function Eq. 8 is based, is represented by the dashed curve in the figure; it corresponds to an idealized domain of ignorance that extends without limit in both directions. For ease of comparison, both curves are shown normalized to the same plateau level of 1, a harmless transformation since it is only the relative values of the likelihoods that matter.
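The contrast between the idealized and the actual prior can be made concrete with a small numerical experiment. In the following Python sketch, a schematic mesa-shaped prior (its plateau limits and shoulder width are arbitrary illustrative choices, not values from this article) is compared with the idealized flat prior by evaluating the likelihood integral of Eq. 7 under each, for theories of progressively smaller RMS deviation; Figure 1 below depicts the same comparison schematically.

    import numpy as np

    def integrand(log_sigma, D, n):
        """Integrand of Eq. 7, i.e. Pr(X | theta, ln sigma) up to constant factors."""
        return np.exp(-n * log_sigma - n * D ** 2 / (2 * np.exp(2 * log_sigma)))

    def mesa_prior(log_sigma, lo=-4.0, hi=4.0, shoulder=0.5):
        """Schematic mesa-shaped prior for ln sigma: flat (plateau value ~1) between
        lo and hi, with soft shoulders of the given width.  All three numbers are
        arbitrary illustrative choices."""
        left = 1.0 / (1.0 + np.exp(-(log_sigma - lo) / shoulder))
        right = 1.0 / (1.0 + np.exp((log_sigma - hi) / shoulder))
        return left * right

    grid = np.linspace(-20.0, 20.0, 40001)
    dg = grid[1] - grid[0]
    n = 12                                     # number of data points, as in Fig. 1

    for D in (1.0, 1e-3, 1e-8):                # progressively better-fitting theories
        f = integrand(grid, D, n)
        L_ideal = np.sum(f) * dg               # idealized flat prior
        L_actual = np.sum(f * mesa_prior(grid)) * dg
        print(f"D = {D:g}:  L_ideal / L_actual = {L_ideal / L_actual:.3g}")

Well inside the domain of ignorance the two priors give essentially the same likelihood; as D shrinks toward zero, the idealized prior's 1/D^n likelihood increasingly overstates the weight of the evidence, which is the point of the "counterexample" above.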
[Figure 1 about here]

Figure 1. Defining conditions under which the simple likelihood function Eq. 8 is valid. Use of the figure is explained in the text. UPPER GRAPH: The mesa-shaped "actual" prior probability distribution Pr(lnσ) represents a typical vague state of prior knowledge of the lnσ parameter. The flat part of the curve spans a "domain of ignorance" in which scale invariance holds to a high degree of approximation; the lower and upper bounds lnσ_min and lnσ_max are chosen to lie unarguably within the domain of ignorance. The flat dashed line represents an idealized prior distribution in which the domain of ignorance extends indefinitely in both directions. The three Pr(X | θ, lnσ) curves are plots of the integrand in the third part of Eq. 7, using n = 12 as the number of data-points; the curves differ in the value of lnD, which is the value of lnσ at which each curve peaks. The value of lnD for the left-hand Pr(X | θ, lnσ) curve, ln(D_min), is chosen so that the tail area to the left of lnσ_min is 10% of the total area under the curve. Similarly, the value of lnD for the right-hand Pr(X | θ, lnσ) curve, ln(D_max), is chosen so that the tail area to the right of lnσ_max is 10% of the total area under the curve. LOWER GRAPH: The ratios D_min/σ_min and D_max/σ_max, as defined above, are plotted against the number of data-points n.

Also shown in the upper part of the figure are peak-shaped Pr(X | θ, lnσ) distributions for three different theories θ, corresponding to three different values of lnD. The three curves have been normalized to the same arbitrary peak height by multiplying each by a factor proportional to D^n; they are thus graphs of the integrand in the third part of Eq. 7. The peak value of each curve occurs when lnσ = lnD. The middle curve corresponds to a theory whose lnD value lies well within the domain of ignorance. In many applications, all contending theories are like that: they all correspond to values of lnD that clearly lie within the domain of ignorance. In those cases, the relative values of their likelihoods

    L(\theta | X) \propto \sum_{\ln\sigma} Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma)
        = \int Pr(X | \theta, \ln\sigma)\, Pr(\ln\sigma)\, d(\ln\sigma)

will be the same whether we use the actual or the idealized prior distribution for Pr(lnσ); that is because both of those prior distributions are uniform over all values of lnσ that are of practical relevance to the inference at hand.

What values of lnD lie safely within the domain of ignorance, as it has rather vaguely been described so far? To put the question in another way: for what range of theories θ, corresponding to what range of RMS deviations D, is the fractional error in likelihood incurred by using the idealized prior distribution Pr_ideal(lnσ) = const in place of the actual prior distribution Pr_actual(lnσ) acceptably low, say 10% or less? Answering that question precisely is typically either prohibitively laborious or beyond our powers altogether. However, it is usually feasible to put upper bounds on the fractional error, which can be written

    \frac{\left| \int Pr(X | \theta, \ln\sigma)\, Pr_{ideal}(\ln\sigma)\, d(\ln\sigma)
        - \int Pr(X | \theta, \ln\sigma)\, Pr_{actual}(\ln\sigma)\, d(\ln\sigma) \right|}
        {\int Pr(X | \theta, \ln\sigma)\, Pr_{ideal}(\ln\sigma)\, d(\ln\sigma)} .

That will be so if, without undue mental effort, we can specify a lower limit lnσ_min and an upper limit lnσ_max that lie unarguably within the domain of ignorance, as shown in the upper part of Fig. 1. For values of lnD that lie near lnσ_min, as for the left-hand peak in the upper part of the figure, it is easy to prove that the fractional error can be no more than

    \frac{\int_{-\infty}^{\ln\sigma_{min}} Pr(X | \theta, \ln\sigma)\, Pr_{ideal}(\ln\sigma)\, d(\ln\sigma)}
        {\int_{-\infty}^{\infty} Pr(X | \theta, \ln\sigma)\, Pr_{ideal}(\ln\sigma)\, d(\ln\sigma)}
        = \frac{\int_{-\infty}^{\ln\sigma_{min}} Pr(X | \theta, \ln\sigma)\, d(\ln\sigma)}
        {\int_{-\infty}^{\infty} Pr(X | \theta, \ln\sigma)\, d(\ln\sigma)} ,

where in the second part of the equation we use the fact that Pr_ideal(lnσ) = const. For the left-hand peak, this error corresponds to the ratio of the blackened tail area to the total area under the curve.
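Because this bound is just a one-dimensional tail area, it is easy to evaluate numerically. The sketch below (hypothetical function names; σ_min = 1 is an arbitrary choice, since only the ratio D_min/σ_min matters) bisects for the RMS deviation D_min at which the left-tail fraction reaches 10%, for a few values of n; this is the quantity plotted, together with its right-tail counterpart, in the lower part of Fig. 1.

    import numpy as np

    def left_tail_fraction(lnD, ln_sigma_min, n):
        """Fraction of the area under the Eq. 7 integrand (peaked at ln sigma = lnD)
        lying to the left of ln_sigma_min; this bounds the fractional error of Eq. 8."""
        ln_sigma = np.linspace(lnD - 30.0, lnD + 30.0, 200001)
        g = np.exp(-n * (ln_sigma - lnD) - 0.5 * n * np.exp(-2.0 * (ln_sigma - lnD)))
        return np.sum(g[ln_sigma <= ln_sigma_min]) / np.sum(g)

    def find_D_min(sigma_min, n, target=0.10):
        """Bisect for the RMS deviation D_min at which the left-tail fraction equals target."""
        lo, hi = np.log(sigma_min) - 5.0, np.log(sigma_min) + 5.0   # bracket for ln(D_min)
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            # moving lnD to the right shrinks the tail beyond ln(sigma_min)
            if left_tail_fraction(mid, np.log(sigma_min), n) > target:
                lo = mid
            else:
                hi = mid
        return np.exp(0.5 * (lo + hi))

    sigma_min = 1.0      # illustrative; only the ratio D_min / sigma_min is meaningful
    for n in (5, 12, 25):
        print(f"n = {n:3d}:  D_min / sigma_min = {find_D_min(sigma_min, n) / sigma_min:.3f}")

An analogous computation with the tail to the right of lnσ_max yields D_max/σ_max.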
Similarly, for values of lnD that lie near lnσ_max, as for the right-hand peak, the fractional error can be no more than

    \frac{\int_{\ln\sigma_{max}}^{\infty} Pr(X | \theta, \ln\sigma)\, d(\ln\sigma)}
        {\int_{-\infty}^{\infty} Pr(X | \theta, \ln\sigma)\, d(\ln\sigma)}

(the blackened tail area over the total area for the right-hand peak). As indicated in the figure, we can define values of RMS deviation D_min and D_max such that these tail areas are only 10%, usually an acceptable error level given all the other uncertainties that beset quantitative inference in practice. When the RMS deviation D lies between D_min and D_max for all theories in contention, we are adequately ignorant to warrant use of the simplified likelihood function Eq. 8. The lower part of Fig. 1 graphs the ratios D_min/σ_min and D_max/σ_max for various numbers of data points, allowing D_min and D_max to be computed from σ_min and σ_max.

Large percentage errors in calculating the likelihoods of the "winning" theories, those with the smallest RMS deviations D, are intolerable because they give rise to serious errors of judgment. As the RMS deviation of the winning theory gets smaller and smaller, that is, as lnD moves farther and farther to the left of the domain of ignorance, its likelihood asymptotically approaches an upper limit that does not depend on D but does depend sensitively on the exact form of the prior probability distribution Pr_actual(lnσ). On the scale of Eq. 8, that limit is

    \lim_{\ln D \to -\infty} \frac{1}{Q} \int \frac{1}{D^n}
        \exp\left[ -n(\ln\sigma - \ln D) - \frac{n}{2}\exp\big( -2(\ln\sigma - \ln D) \big) \right]
        Pr_{actual}(\ln\sigma)\, d(\ln\sigma)
    = \frac{1}{Q} \int \exp(-n\ln\sigma)\, Pr_{actual}(\ln\sigma)\, d(\ln\sigma) ,

where the constant Q is defined under Eq. 7, and where the distribution Pr_actual(lnσ) is assumed to be normalized to a plateau value of 1 (as in Fig. 1). Substituting the simplified likelihood Eq. 8, a likelihood that increases without bound in proportion to 1/D^n, is a gross misrepresentation of the data, vastly overstating the weight of the evidence for the winners. That is precisely the situation in the "counterexample" with which this subsection began.

In contrast, large percentage errors in the likelihoods of the "losing" theories, those with the largest RMS deviations, are frequently harmless, even when those losers' RMS deviations lie far beyond the domain of ignorance. That is because we are not interested in the losers individually, but only in how collectively they affect our judgment of the winners. Although the losers' likelihoods may be greatly exaggerated by Eq. 8, those likelihoods are so small that the losers' posterior probabilities may be collectively negligible in comparison to those of the entire ensemble of contending theories. In that case, the exaggeration will have no significant impact on our judgment of the winners. Again taking 10% as an acceptable error level, this condition will be met if

    \frac{\sum_{\theta:\, D > D_{max}} \frac{1}{D^n}\, Pr(\theta)}{\sum_{all\ \theta} \frac{1}{D^n}\, Pr(\theta)} \le 10\% .    Eq. 9

In summary, sufficient conditions for use of the simple likelihood function Eq. 8 are that the lowest RMS deviation be greater than or equal to D_min and that inequality Eq. 9 be valid. These conditions will be met in the large majority of cases encountered in practice.

3. REFERENCES

Jaynes, E. T. (1968), "Prior Probabilities," IEEE Transactions on Systems Science and Cybernetics, SSC-4, 227–241.

Jaynes, E. T. (2003), Probability Theory: The Logic of Science, Cambridge, UK: Cambridge University Press.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, New York: Wiley.