The Relationship Between Least Squares and Likelihood
George P. Smith
Division of Biological Sciences
Tucker Hall
University of Missouri
Columbia, MO 65211-7400
(573) 882-3344; smithgp@missouri.edu
AUTHOR’S FOOTNOTE
George P. Smith is Professor, Division of Biological Sciences, University of Missouri,
Columbia, MO 65211 (e-mail: smithgp@missouri.edu).
ABSTRACT
Under broadly applicable assumptions, the likelihood of a theory on a set of observed quantitative data is proportional to 1/D^n, where D is the root mean squared deviation of the data from the predictions of the theory and n is the number of observations.
KEYWORDS
Bayesian statistics; Ignorance prior probability; Root mean squared deviation
1. INTRODUCTION
One of the commonest settings for statistical analysis involves a series of n quantitative observations X = {x_1, x_2, ..., x_n} and a series of competing explanatory theories θ, each of which specifies a theoretical value θ_i corresponding to each of the actual observations x_i. The degree to which the observations fit the expectations of a given theory is usually gauged by the sum of the squares of the deviations for that theory,
\[ S \;=\; \sum_{i=1}^{n} (x_i - \theta_i)^{2} , \]
or equivalently by the root mean squared (RMS) deviation
\[ D \;=\; \sqrt{S/n} \, ; \]
D has the advantage of being on the same scale as the observations themselves and for that reason will be used here. The theory for which D is minimized is the best fit to the data according to the least-squares criterion.
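As a small illustration of the least-squares criterion, the sketch below (added here; the observations, the two candidate theories, and the helper name rms_deviation are all hypothetical) computes the RMS deviation D for each theory and selects the one with the smallest value.

import numpy as np

def rms_deviation(x, theta):
    # RMS deviation D = sqrt(S/n) of the observations x from a theory's predictions theta.
    x, theta = np.asarray(x, float), np.asarray(theta, float)
    return np.sqrt(np.mean((x - theta) ** 2))

# Hypothetical observations and the predictions of two competing theories.
x = [2.1, 3.9, 6.2, 7.8, 10.1]
theories = {"theory A": [2.0, 4.0, 6.0, 8.0, 10.0],
            "theory B": [2.5, 4.5, 6.5, 8.5, 10.5]}

D = {name: rms_deviation(x, pred) for name, pred in theories.items()}
print(D, "-> least-squares best fit:", min(D, key=D.get))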
Least-squares analysis is on firm theoretical grounds when it can reasonably be assumed
that the deviations of the observations from the expectations of the true theory are independently,
identically and normally distributed (IIND) with standard deviation σ. In those circumstances, it
is well known (and will be demonstrated below) that the theory that minimizes D (or
equivalently, S) also maximizes likelihood. The purpose of this article is to explain a deeper
relationship between likelihood and RMS deviation that holds under broadly applicable
assumptions.
2. ANALYSIS
2.1 RMS Deviation and the Likelihood Function
In consequence of the assumed normal distribution of deviations, the probability density for observing datum x_i at data-point i given standard deviation σ and a theory θ that predicts a value of θ_i at that point is
\[ \Pr(x_i \mid \theta, \sigma) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left[ -\frac{1}{2} \left( \frac{x_i - \theta_i}{\sigma} \right)^{\!2} \right] . \qquad \text{Eq. 1} \]
Here and throughout this article, the generic probability function notation Pr(· | ·) and the summation sign Σ will be used for both continuous and discrete variables; it is to be understood from the context when a probability density function is intended, and when summation is to be accomplished by integration. This notational choice preserves the laws of probability in their usual form while allowing both kinds of random variable to be accommodated in a simple, unified framework. Because of the IIND assumption, the joint probability density for obtaining the ensemble X of observations {x_1, x_2, x_3, ..., x_n} is the product of all n such probability densities:
\[ \Pr(X \mid \theta, \sigma) \;=\; \prod_{i=1}^{n} \Pr(x_i \mid \theta, \sigma) \;=\; \frac{\exp\!\big( -S/(2\sigma^{2}) \big)}{\sigma^{n}\,(2\pi)^{n/2}} \;=\; \frac{\exp\!\big( -n D^{2}/(2\sigma^{2}) \big)}{\sigma^{n}\,(2\pi)^{n/2}} . \qquad \text{Eq. 2} \]
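A quick numerical check of Eqs. 1 and 2 is sketched below (added here with simulated numbers; the predicted values, the assumed σ, and the seed are arbitrary): the product of the n per-point densities of Eq. 1 equals the closed form of Eq. 2 written in terms of D.

import numpy as np

def normal_pdf(x, mu, sigma):
    # Eq. 1: density of observing x when the theory predicts mu, with standard deviation sigma.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
theta = np.linspace(1.0, 5.0, 8)                    # a theory's predicted values (hypothetical)
sigma = 0.7                                         # assumed standard deviation of the deviations
x = theta + rng.normal(0.0, sigma, theta.size)      # simulated observations

n = x.size
D = np.sqrt(np.mean((x - theta) ** 2))

product_of_densities = np.prod(normal_pdf(x, theta, sigma))                           # Eq. 2, left side
closed_form = np.exp(-n * D**2 / (2 * sigma**2)) / (sigma**n * (2 * np.pi)**(n / 2))  # Eq. 2, right side
print(np.isclose(product_of_densities, closed_form))                                  # True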
It will be useful in what follows to gauge dispersion in the normal distribution in terms of lnσ rather than σ itself, in which case the above distribution can be written in the form
\[ \Pr(X \mid \theta, \ln\sigma) \;=\; \frac{1}{(2\pi)^{n/2} D^{n}}\, \exp\!\Big[ -\tfrac{n}{2} \exp\!\big( -2(\ln\sigma - \ln D) \big) \;-\; n(\ln\sigma - \ln D) \Big] . \qquad \text{Eq. 3} \]
The right-hand factor in this expression is a peak-shaped function of lnσ whose peak value occurs when lnσ = lnD but whose size and shape are independent of D and therefore of both data X and theory θ.
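That claim can be checked numerically; the sketch below (added here, with an arbitrary choice of n) integrates the right-hand factor of Eq. 3 over lnσ for several values of D and obtains the same area each time, anticipating the constant Q introduced under Eq. 7.

import numpy as np
from scipy.integrate import quad

def rhs_factor(ln_sigma, ln_D, n):
    # Right-hand factor of Eq. 3, a function of ln(sigma) peaking at ln(sigma) = ln(D).
    u = ln_sigma - ln_D
    return np.exp(-0.5 * n * np.exp(-2.0 * u) - n * u)

n = 12                                          # arbitrary number of data points
for D in (0.1, 1.0, 10.0):
    area, _ = quad(rhs_factor, -np.inf, np.inf, args=(np.log(D), n))
    print(D, area)                              # the same area for every D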
The foregoing probability is related to other key probabilities via Bayes’s theorem:
\[ \Pr(\theta, \ln\sigma \mid X) \;\propto\; \Pr(X \mid \theta, \ln\sigma)\, \Pr(\theta, \ln\sigma) \;=\; \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma)\, \Pr(\theta) , \qquad \text{Eq. 4} \]
where we assume in the second equality that the prior probability distributions for lnσ and θ are independent. Summing over all possible values of lnσ (from minus to plus infinity),
\[ \Pr(\theta \mid X) \;=\; \sum_{\ln\sigma} \Pr(\theta, \ln\sigma \mid X) \;\propto\; \sum_{\ln\sigma} \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma)\, \Pr(\theta) \;=\; \Pr(X \mid \theta)\, \Pr(\theta) . \qquad \text{Eq. 5} \]
In the Bayesian view, the laws of probability underlying the foregoing relationships embody the fundamental logic of science. In particular, Bayesians interpret the preceding equation as the rule for rationally updating our opinions of the competing theories θ in light of the new evidence embodied in the observations X. The prior distribution Pr(θ) and posterior distribution Pr(θ | X) gauge rational degrees of belief in the theories before and after obtaining (or considering) evidence X, respectively. Updating is achieved by multiplying the prior probability of each theory θ by Pr(X | θ)—the probability, given θ, that we would obtain the evidence we actually did obtain. Considered as a function of θ for fixed evidence X—the data actually observed—Pr(X | θ) is the likelihood function L(θ | X). It captures, exactly and quantitatively, the relative weight of the evidence X for the competing theories, allowing a profound arithmetization of empirical judgment in those situations when it can be calculated. In summary, the likelihood function for this problem can be written
\[ L(\theta \mid X) \;=\; \sum_{\ln\sigma} \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma) . \qquad \text{Eq. 6} \]
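The updating rule just described is compactly expressed in code; the sketch below (added as an illustration, with made-up prior probabilities and likelihood values for three competing theories) normalizes the products Pr(θ)·L(θ | X) to obtain posterior probabilities.

import numpy as np

prior = np.array([0.5, 0.3, 0.2])           # Pr(theta) for three hypothetical theories
likelihood = np.array([0.02, 0.08, 0.01])   # L(theta | X) on any convenient relative scale

posterior = prior * likelihood              # Eq. 5: Pr(theta | X) is proportional to the product
posterior /= posterior.sum()                # normalize over the contending theories
print(posterior)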
2.2 When We Are Sufficiently Ignorant of the Standard Deviation, Likelihood Is a Simple
Function of RMS Deviation
The likelihood function in Eq. 6 itself contains a prior probability distribution: Pr(lnσ). Regardless of the form of this distribution, it is obvious from the expression for Pr(X | θ, σ) in Eq. 2 that the theory that maximizes likelihood is the one with the smallest RMS deviation D (the same is true, though less obviously, of the expression for Pr(X | θ, lnσ) in Eq. 3); this confirms the well-known fact stated above that least-squares analysis pinpoints the maximum likelihood theory under the IIND assumption.
But if the likelihood function is to be used to weigh the strength of the evidence for the competing theories quantitatively, rather than merely to identify the maximum likelihood theory, the prior distribution Pr(lnσ) must be specified. Occasionally it happens that extensive prior information entails a particular special form for this function. Much more often, though, we are essentially ignorant of lnσ in advance of the data. That is the case that will be considered here.
What probability distribution Pr(lnσ) properly expresses prior ignorance of the value of this parameter? Jaynes (1968; 2003, pp. 372–386) argues compellingly that ignorance, taken seriously, imposes strong constraints on prior probability distributions. In particular, the appropriate distribution for the logarithm of a scale parameter like σ is the uniform distribution Pr(lnσ) = const, or equivalently Pr(σ) ∝ 1/σ; this is the only distribution that remains invariant under a change of scale—a transformation that converts the original inference problem into another that should look identical to the truly ignorant observer.
Substituting that ignorant prior distribution into Eq. 6, the likelihood function can be written
\[ L(\theta \mid X) \;=\; \sum_{\ln\sigma} \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma) \;\propto\; \frac{1}{(2\pi)^{n/2} D^{n}} \int_{-\infty}^{\infty} \exp\!\Big[ -\tfrac{n}{2} \exp\!\big( -2(\ln\sigma - \ln D) \big) - n(\ln\sigma - \ln D) \Big]\, d(\ln\sigma) \;\propto\; \frac{1}{D^{n}} \int_{-\infty}^{\infty} \exp\!\Big[ -\tfrac{n}{2} \exp\!\big( -2(\ln\sigma - \ln D) \big) - n(\ln\sigma - \ln D) \Big]\, d(\ln\sigma) , \qquad \text{Eq. 7} \]
where constants that don’t depend on the variables of interest θ and X are suppressed because it is only the relative values of the likelihood function that matter. As remarked above under Eq. 3, the integrand in the third part of the equation has the same size and shape regardless of the value of D. The integral in that equation is therefore itself a constant Q that doesn’t depend on θ and X. Dividing the last part of Eq. 7 by Q, the likelihood function further simplifies to
\[ L(\theta \mid X) \;=\; \frac{1}{D^{n}} \qquad \text{Eq. 8} \]
under the specified conditions. This likelihood function was previously derived in a different
context by Zellner (1971, pp. 114–117). This expression does much more than simply remind us
that the maximum likelihood theory is the one that minimizes D; it makes it easy for us to
articulate numerically the weight of the evidence X for each contending theory θ.
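The content of Eq. 8 can be illustrated numerically. In the sketch below (added here; the simulated data, the two candidate theories, and all function names are hypothetical), the likelihood of each theory is computed both by integrating the last part of Eq. 7 over lnσ and by the simple rule 1/D^n, and the two likelihood ratios agree.

import numpy as np
from scipy.integrate import quad

def rms(x, theta):
    return np.sqrt(np.mean((np.asarray(x) - np.asarray(theta)) ** 2))

def likelihood_by_integration(x, theta):
    # Last part of Eq. 7: (1/D^n) times the integral of the Eq. 3 factor over ln(sigma).
    n, D = len(x), rms(x, theta)
    integrand = lambda ls: np.exp(-0.5 * n * np.exp(-2 * (ls - np.log(D)))
                                  - n * (ls - np.log(D)))
    area, _ = quad(integrand, -np.inf, np.inf)
    return area / D**n

rng = np.random.default_rng(1)
truth = np.linspace(0.0, 9.0, 10)                 # hypothetical "true" predictions
x = truth + rng.normal(0.0, 0.5, truth.size)      # simulated observations
theory_A, theory_B = truth, truth + 0.3           # two competing theories

L_A, L_B = likelihood_by_integration(x, theory_A), likelihood_by_integration(x, theory_B)
D_A, D_B = rms(x, theory_A), rms(x, theory_B)
print(L_A / L_B, (D_B / D_A) ** len(x))           # the two ratios agree, as Eq. 8 requires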
2.3 How Much Ignorance Is Enough?
The following extreme case might be put forward as a counterexample to the above
reasoning. Suppose one of the competing theories happens to fit the data exactly. D vanishes
altogether for such a theory, and according to Eq. 8 that theory’s likelihood would be infinitely
greater than the likelihood of a theory that deviates even infinitesimally from the observed data.
But common sense rebels at thus according infinite weight to an infinitesimal deviation.
This “counterexample” serves not to undermine the reasoning above, but rather to warn us that in using the “ignorant” prior distribution Pr(lnσ) = const we are pretending to more ignorance than we actually possess. In any situation we choose to analyze in terms of distributions of deviations, we surely must have some vague prior information that convinces us that there is at least some error—that is, that the standard deviation σ is not infinitesimally small. Likewise, there is ordinarily some limit to how large the standard deviation can plausibly be. If we are making measurements with a light microscope, for instance, we wouldn’t credit a standard deviation as low as 1 femtometer or as high as 1 kilometer. This vague state of actual prior knowledge is sketched schematically in the mesa-shaped prior probability distribution in the upper part of Fig. 1. This curve is characterized by a broad central “domain of ignorance” where the curve is flat, and where the scale invariance that underlies the ignorant prior distribution Pr(lnσ) = const holds to a high degree of approximation. On either side of that domain the prior credibility of extreme values of lnσ descends gradually as shown, though in most cases we would be hard-pressed to describe that descent numerically. The ignorant prior distribution Pr(lnσ) = const, on which the simple likelihood function Eq. 8 is based, is represented in the dashed curve in the figure; it corresponds to an idealized domain of ignorance that extends without limit in both directions. For ease of comparison, both curves are shown normalized to the same plateau level of 1—a harmless transformation since it’s only relative values of the likelihoods that matter.
[Figure 1 appears here. Upper graph: relative probability density plotted against lnσ, showing the actual and idealized prior distributions Pr(lnσ), the domain of ignorance bounded by lnσ_min and lnσ_max, three Pr(X | θ, lnσ) curves, and 10% tail areas. Lower graph: the ratios D_min/σ_min and D_max/σ_max plotted against the number of data points, from 1 to 100.]
Figure 1. Defining conditions under which the simple likelihood function Eq. 8 is valid. Use of the figure is explained in the text. UPPER GRAPH: The mesa-shaped “actual” prior probability distribution Pr(lnσ) represents a typical vague state of prior knowledge of the lnσ parameter. The flat part of the curve spans a “domain of ignorance” in which scale invariance holds to a high degree of approximation; the lower and upper bounds lnσ_min and lnσ_max are chosen to lie unarguably within the domain of ignorance. The flat dashed line represents an idealized prior distribution in which the domain of ignorance extends indefinitely in both directions. The three Pr(X | θ, lnσ) curves are plots of the integrand in the third part of Eq. 7, using n = 12 as the number of data-points; the curves differ in the value of lnD, which is the value of lnσ at which the curve peaks. The value of lnD for the left-hand Pr(X | θ, lnσ) curve, ln(D_min), is chosen so that the tail area to the left of lnσ_min is 10% of the total area under the curve. Similarly, the value of lnD for the right-hand Pr(X | θ, lnσ) curve, ln(D_max), is chosen so that the tail area to the right of lnσ_max is 10% of the total area under the curve. LOWER GRAPH: The ratios D_min/σ_min and D_max/σ_max, as defined above, are plotted against the number of data-points n.
Also shown in the upper part of the figure are peak-shaped Pr(X | θ, lnσ) distributions for three different theories θ, corresponding to three different values of lnD. The three curves have been normalized to the same arbitrary peak height by multiplying each by a factor proportional to D^n; they are thus graphs of the integrand in the third part of Eq. 7. The peak value of each curve occurs when lnσ = lnD. The middle curve corresponds to a theory θ whose lnD value lies well within the domain of ignorance. In many applications, all contending theories θ are like that: they all correspond to values of lnD that clearly lie within the domain of ignorance. In those cases, the relative values of their likelihoods
\[ L(\theta \mid X) \;=\; \sum_{\ln\sigma} \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma) \;=\; \int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, \Pr(\ln\sigma)\, d(\ln\sigma) \]
will be the same whether we use the actual or idealized prior distribution for Pr(lnσ); that’s because both those prior distributions are uniform over all values of lnσ that are of practical relevance to the inference at hand.
What values of lnD lie safely within the domain of ignorance, as it has rather vaguely been described so far? To put the question in another way: for what range of theories θ, corresponding to what range of RMS deviations D, is the fractional error in likelihood incurred by using the idealized prior distribution Pr_ideal(lnσ) = const in place of the actual prior distribution Pr_actual(lnσ) acceptably low—say, 10% or less? Answering that question precisely is typically either prohibitively laborious or beyond our powers altogether. However, it is usually feasible to put upper bounds on the fractional error, which can be written
\[ \frac{\displaystyle\int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, \Pr_{\mathrm{ideal}}(\ln\sigma)\, d(\ln\sigma) \;-\; \int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, \Pr_{\mathrm{actual}}(\ln\sigma)\, d(\ln\sigma)}{\displaystyle\int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, \Pr_{\mathrm{ideal}}(\ln\sigma)\, d(\ln\sigma)} \, . \]
That will be so if, without undue mental effort, we can specify a lower limit lnσ_min and an upper limit lnσ_max that lie unarguably within the domain of ignorance, as shown in the upper part of Fig. 1. For values of lnD that lie near lnσ_min, as for the left-hand peak in the upper part of the figure, it is easy to prove that the fractional error can be no more than
\[ \frac{\displaystyle\int_{-\infty}^{\ln\sigma_{\min}} \Pr(X \mid \theta, \ln\sigma)\, \Pr_{\mathrm{ideal}}(\ln\sigma)\, d(\ln\sigma)}{\displaystyle\int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, \Pr_{\mathrm{ideal}}(\ln\sigma)\, d(\ln\sigma)} \;=\; \frac{\displaystyle\int_{-\infty}^{\ln\sigma_{\min}} \Pr(X \mid \theta, \ln\sigma)\, d(\ln\sigma)}{\displaystyle\int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, d(\ln\sigma)} \, , \]
where in the second part of the equation we use the fact that Pr_ideal(lnσ) = const. For the left-hand peak, this error corresponds to the ratio of the blackened tail area to the total area under the curve. Similarly, for values of lnD that lie near lnσ_max, as for the right-hand peak, the fractional error can be no more than
\[ \frac{\displaystyle\int_{\ln\sigma_{\max}}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, d(\ln\sigma)}{\displaystyle\int_{-\infty}^{\infty} \Pr(X \mid \theta, \ln\sigma)\, d(\ln\sigma)} \]
(blackened tail area over total area for the right-hand peak). As indicated in the figure, we can define values of RMS deviation D_min and D_max such that these tail areas are only 10%—usually an acceptable error level given all the other uncertainties that beset quantitative inference in practice. When for all theories θ in contention the RMS deviation D lies between D_min and D_max, we are adequately ignorant to warrant use of the simplified likelihood function Eq. 8. The lower part of Fig. 1 graphs the ratios D_min/σ_min and D_max/σ_max for various numbers of data points, allowing D_min and D_max to be computed from σ_min and σ_max.
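One way to compute the quantities plotted in the lower graph is sketched below (added here; the 10% tail criterion comes from the figure caption, but the numerical approach and the function names are illustrative). For each n, the Eq. 7 integrand is integrated numerically and the peak position lnD is placed so that the specified tail area beyond lnσ_min (or lnσ_max) is 10% of the total.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def integrand(u, n):
    # Eq. 7 integrand as a function of u = ln(sigma) - ln(D), rescaled so that its
    # peak value (at u = 0) equals 1; rescaling does not affect area ratios.
    return np.exp(-0.5 * n * (np.exp(-2.0 * u) - 1.0) - n * u)

def fig1_ratios(n, tail=0.10):
    total, _ = quad(integrand, -np.inf, np.inf, args=(n,))
    # Left-hand peak: place ln(D_min) so the area to the LEFT of ln(sigma_min) is
    # `tail` of the total; here u = ln(sigma_min) - ln(D_min).
    left = lambda u: quad(integrand, -np.inf, u, args=(n,))[0] / total - tail
    u_min = brentq(left, -5.0, 5.0)
    # Right-hand peak: place ln(D_max) so the area to the RIGHT of ln(sigma_max) is
    # `tail` of the total; here u = ln(sigma_max) - ln(D_max).
    right = lambda u: quad(integrand, u, np.inf, args=(n,))[0] / total - tail
    u_max = brentq(right, -5.0, 5.0)
    return np.exp(-u_min), np.exp(-u_max)   # D_min/sigma_min, D_max/sigma_max

for n in (1, 5, 12, 25, 100):
    r_min, r_max = fig1_ratios(n)
    print(f"n = {n:3d}   D_min/sigma_min = {r_min:.2f}   D_max/sigma_max = {r_max:.2f}")

Both ratios approach 1 as the number of data points grows.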
Large percentage errors in calculating the likelihoods of the “winning” theories—those determining the smallest RMS deviations D—are intolerable because they give rise to serious errors of judgment. As the RMS deviation for the winning theory gets smaller and smaller—i.e., as lnD moves farther and farther to the left of the domain of ignorance—its likelihood asymptotically approaches an upper limit that doesn’t depend on D but does depend sensitively on the exact form of the prior probability distribution Pr_actual(lnσ). On the scale of Eq. 8, that limit is
\[ \lim_{\ln D \to -\infty} \frac{1}{D^{n} Q} \int_{-\infty}^{\infty} \exp\!\Big[ -\tfrac{n}{2} \exp\!\big( -2(\ln\sigma - \ln D) \big) - n(\ln\sigma - \ln D) \Big]\, \Pr_{\mathrm{actual}}(\ln\sigma)\, d(\ln\sigma) \;=\; \frac{\displaystyle\int_{-\infty}^{\infty} \exp(-n \ln\sigma)\, \Pr_{\mathrm{actual}}(\ln\sigma)\, d(\ln\sigma)}{Q} \, , \]
where the constant Q is defined under Eq. 7, and where the distribution Pr_actual(lnσ) is assumed to be normalized to a plateau value of 1 (as in Fig. 1). Substituting the simplified likelihood Eq. 8—a likelihood that increases without bound in proportion to 1/D^n—is a gross misrepresentation of the data, vastly overstating the weight of the evidence for the winners. That is precisely the situation in the “counterexample” with which this subsection began.
In contrast, large percentage errors in the likelihoods of the “losing” theories—those with
the largest RMS deviations—are frequently harmless, even when those losers’ RMS deviations lie
far beyond the domain of ignorance. That’s because we’re not interested in the losers
individually, but only in how collectively they affect our judgment of the winners. Although the
losers’ likelihoods may be greatly exaggerated by Eq. 8, those likelihoods are so small that the
losers’ posterior probabilities may be collectively negligible in comparison to those of the entire
ensemble of contending theories. In that case, the exaggeration will have no significant impact
on our judgment of the winners. Again taking 10% as an acceptable error level, this condition
will be met if
\[ \frac{\displaystyle\sum_{\theta :\, D > D_{\max}} \frac{1}{D^{n}}\, \Pr(\theta)}{\displaystyle\sum_{\text{all}\ \theta} \frac{1}{D^{n}}\, \Pr(\theta)} \;\le\; 10\% \qquad \text{Eq. 9} \]
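In code, the check expressed by Eq. 9 might look like the sketch below (added here; the RMS deviations, the prior probabilities, and the value of D_max are all hypothetical).

import numpy as np

def losers_are_negligible(D, prior, n, D_max, tol=0.10):
    # Eq. 9: do theories with D > D_max contribute no more than `tol` of the total
    # 1/D^n-weighted prior probability?
    D, prior = np.asarray(D, float), np.asarray(prior, float)
    weights = prior / D**n
    return weights[D > D_max].sum() / weights.sum() <= tol

D = np.array([0.9, 1.0, 1.3, 4.0, 7.0])   # RMS deviations of five contending theories
prior = np.full(5, 0.2)                   # equal prior probabilities
print(losers_are_negligible(D, prior, n=12, D_max=2.5))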
In summary, sufficient conditions for use of the simple likelihood function Eq. 8 are that
the lowest RMS deviation be greater than or equal to D_min and that inequality Eq. 9 be valid.
These conditions will be met in the large majority of cases encountered in practice.
3. REFERENCES
Jaynes, E.T. (1968), “Prior probabilities,” IEEE Transactions on Systems Science and
Cybernetics, SSC-4, 227–241.
Jaynes, E.T. (2003), Probability Theory: The Logic of Science. Cambridge, UK: Cambridge University Press.
Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics. New York: Wiley.