Statistics 580
Monte Carlo Methods
Introduction
Statisticians have used simulation to evaluate the behaviour of complex random variables
whose precise distribution cannot be exactly evaluated theoretically. For example, suppose
that a statistic T based on a sample x1 , x2 , . . . , xn has been formulated for testing a certain
hypothesis; the test procedure is to reject H0 if T (x1 , x2 , . . . , xn ) ≥ c. Since the exact
distribution of T is unknown, the value of c has been determined such that the asymptotic
type I error rate is α = .05 (say), i.e., Pr(T ≥ c | H0 is true) → .05 as n → ∞. We can study
the actual small sample behavior of T by the following procedure:
a. generate x1 , x2 , . . . , xn from the appropriate distribution, say, the Normal distribution
and compute Tj = T (x1 , x2 , . . . , xn ),
b. repeat N times, yielding a sample T1 , T2 , . . . , TN , and
c. compute the proportion of times Tj ≥ c as an estimate of the error rate, i.e.,
   α̂(N) = (1/N) Σ_{j=1}^N I(Tj ≥ c) = #(Tj ≥ c)/N, where I(·) is the indicator function, and
d. note that α̂(N) → α almost surely (a.s.), the type I error rate.
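As a small illustration, here is a minimal sketch in Python (the language, the statistic T = √n x̄/s, the sample size n = 10, and the asymptotic critical value 1.645 are all assumptions made here for concreteness; the notes do not prescribe them):

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 10, 10000            # small sample size n, number of Monte Carlo replicates N
    c = 1.645                   # asymptotic (standard normal) critical value for alpha = .05

    T = np.empty(N)
    for j in range(N):
        x = rng.normal(loc=0.0, scale=1.0, size=n)       # (a) sample under H0: mu = 0
        T[j] = np.sqrt(n) * x.mean() / x.std(ddof=1)     # compute T_j = T(x_1, ..., x_n)

    alpha_hat = np.mean(T >= c)   # (c) proportion of T_j >= c estimates the actual error rate
    print(alpha_hat)              # compare with the nominal .05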
Results of such a study may provide evidence to support the asymptotic theory even for
small samples. Similarly, we may study the power of the test under different alternatives
using Monte Carlo sampling. The study of Monte Carlo methods involves learning about
(i) methods available to perform step (a) above correctly and efficiently, and
(ii) how to perform step (c) above efficiently, by incorporating “variance reduction” techniques to obtain more accurate estimates.
In 1908, Monte Carlo sampling was used by W. S. Gosset, who published under the
pseudonym "Student", to approximate the distribution of z = √n(x̄ − µ)/s, where (x1, x2, . . . , xn)
is a random sample from a Normal distribution with mean µ and variance σ² and s² is the
sample variance. He writes:
“Before I had succeeded in solving my problem analytically, I had endeavoured
to do so empirically [i.e., by simulation]. The material used was a ... table
containing the height and left middle finger measurements of 3000 criminals....
The measurements were written out on 3000 pieces of cardboard, which were
then very thoroughly shuffled and drawn at random ... each consecutive set of 4
was taken as a sample ... [i.e.,n above is = 4] ... and the mean [and] standard
deviation of each sample determined.... This provides us with two sets of ... 750
z’s on which to test the theoretical results arrived at. The height and left middle
finger ... table was chosen because the distribution of both was approximately
normal...”
The object of many simulation experiments in statistics is the estimation of an expectation
of the form E[g(X)] where X is a random vector. Suppose that f (x) is the joint density of
X. Then
θ = E[g(X)] = ∫ g(x) f(x) dx.
Thus, virtually any Monte Carlo procedure may be characterized as a problem of evaluating
an integral (a multidimensional one, as in this case). For example, if T (x) is a statistic based
on a random sample x, computing the following quantities involving functions of T may be
of interest:
the mean, θ1 = ∫ T(x) f(x) dx,
the tail probability, θ2 = ∫ I(T(x) ≤ c) f(x) dx,
or the variance, θ3 = ∫ (T(x) − θ1)² f(x) dx = ∫ T(x)² f(x) dx − θ1²,
where I(·) is the indicator function.
Crude Monte Carlo
As outlined in the introduction, applications of simulation in Statistics involve the computation
of an integral of the form ∫ g(x) f(x) dx, where g(x) is a function of a statistic
computed on a random sample x. A simple estimate of the above integral can be obtained
by using the following simulation algorithm:
1. Generate X1 , X2 , . . . , XN .
2. Compute g(X1), g(X2), . . . , g(XN).
3. Estimate θ̂(N) = (1/N) Σ_{j=1}^N g(Xj).
Here N is the simulation sample size and note that
• θ̂(N) → θ almost surely (a.s.);
• if var(g(X)) = σ², the usual estimate of σ² is
  s² = Σ_{j=1}^N (g(Xj) − θ̂)² / (N − 1);
• thus S.E.(θ̂) = s/√N and, for large N, the Central Limit Theorem could be used to
  construct approximate confidence bounds for θ:
  [θ̂(N) − z_{α/2} √(s²/N), θ̂(N) + z_{α/2} √(s²/N)];
• Clearly, the precision of θ̂ is proportional to 1/√N (in other words, the error in θ̂ is
  O(N^{−1/2})), and it depends on the value of s².
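A minimal sketch of steps 1–3 in Python (the language and the particular choice g(x) = e^{−x²} with X ∼ U(0, 1) are assumptions for illustration), including the standard error and the large-N confidence bounds described above:

    import numpy as np

    rng = np.random.default_rng(2)
    N = 100000
    x = rng.uniform(0.0, 1.0, size=N)        # step 1: generate X_1, ..., X_N
    g = np.exp(-x**2)                        # step 2: compute g(X_j)

    theta_hat = g.mean()                     # step 3: crude Monte Carlo estimate
    s = g.std(ddof=1)
    se = s / np.sqrt(N)                      # S.E.(theta_hat) = s / sqrt(N)
    z = 1.96                                 # z_{alpha/2} for alpha = .05
    print(theta_hat, (theta_hat - z * se, theta_hat + z * se))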
Efficiency of Monte Carlo Simulation
Suppose as usual that it is required to estimate θ = E[g(X)]. Then the standard simulation algorithm as detailed previously gives the estimate
θ̂(N) = (1/N) Σ_{j=1}^N g(Xj)
with standard error s(θ̂) = s/√N. One way to measure the quality of the estimator θ̂(N) is
by the half-width, HW, of the confidence interval for θ. For a fixed α it is given by
HW = z_{α/2} √(Var(g(X))/N).
For increasing the precision of the estimate, HW needs to be made small, but sometimes
this is difficult to achieve. This may be because Var(g(X)) is too large, or too much computational
effort is required to simulate each g(Xj) so that the simulation sample size N
cannot be too large, or both. If the effort required to produce one value of g(Xj) is E, the
total effort for estimation of θ by a Monte Carlo experiment is N × E. Since achieving
a precision of HW requires at least a sample size of
N ≥ (z_{α/2}/HW)² × Var(g(X)),
the quantity Var(g(X)) × E is proportional to the total effort required to achieve
a fixed precision. This quantity may be used to compare the efficiency of different Monte
Carlo methods. When E’s are similar for two methods, the method with the lower variance
of the estimator is considered superior. As a result, statistical techniques are often employed
to increase simulation efficiency by decreasing the variability of the simulation output that
are used to estimate θ. The techniques used to do this are usually called variance reduction
techniques.
Variance Reduction Techniques
Monte Carlo methods attempt to reduce S.E.(θ̂), that is, to obtain more accurate estimates
without increasing N. These techniques are called variance reduction techniques. When
considering variance reduction techniques, one must take into account the cost (in terms of
difficulty level) of implementing such a procedure and weigh it against the gain in precision
of estimates from incorporating such a procedure.
Before proceeding to a detailed description of available methods, a brief example is used
to illustrate what is meant by variance reduction. Suppose the parameter to be estimated is
an upper percentage point of the standard Cauchy distribution:
θ = Pr(X > 2)
where
f(x) = 1/{π(1 + x²)},   −∞ < x < ∞,
is the Cauchy density. To estimate θ, let g(x) = I(x > 2), and express θ = Pr(X > 2) as
the integral ∫_{−∞}^{∞} I(x > 2) f(x) dx. Note that x is a scalar here. Using crude Monte Carlo, θ
is estimated by generating N Cauchy variates x1, x2, . . . , xN and computing the proportion
that exceeds 2, i.e.,
θ̂ = Σ_{j=1}^N I(xj > 2)/N = #(xj > 2)/N.
Note that since N θ̂ ∼ Binomial (N, θ), the variance of θ̂ is given by θ(1 − θ)/N . Since θ can
be directly computed by analytical integration i.e.,
θ = 1 − F(2) = 1/2 − π⁻¹ tan⁻¹(2) ≈ .1476,
the variance of the crude Monte Carlo estimate θ̂ is given approximately as 0.126/N. In order
to construct a more efficient Monte Carlo method, note that for y = 2/x,
θ = ∫_2^∞ 1/{π(1 + x²)} dx = ∫_0^1 2/{π(4 + y²)} dy = ∫_0^1 g(y) dy.
This particular integral is in a form where it can be evaluated using crude Monte Carlo by
sampling from the uniform distribution U(0, 1). If u1, u2, . . . , uN are drawn from U(0, 1), an estimate
of θ is
θ̃ = (1/N) Σ_{j=1}^N g(uj).
For comparison with the estimate of θ obtained from crude Monte Carlo the variance of θ̃
can again be computed by evaluating the integral
var(θ̃) = (1/N) ∫ {g(x) − θ}² f(x) dx
analytically, where f(x) is the density of U(0, 1). This gives var(θ̃) ≈ 9.4 × 10⁻⁵/N, a variance
reduction of 1340.
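A sketch in Python (an assumption; the notes do not give code) comparing the crude estimator with the estimator based on the transformed integral; the true value is ≈ .1476:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 100000

    # Crude Monte Carlo: proportion of standard Cauchy variates exceeding 2
    x = rng.standard_cauchy(size=N)
    theta_hat = np.mean(x > 2)

    # Transformed integral: theta = int_0^1 2 / (pi * (4 + y^2)) dy, estimated from U(0,1) draws
    u = rng.uniform(0.0, 1.0, size=N)
    g = 2.0 / (np.pi * (4.0 + u**2))
    theta_tilde = g.mean()

    print(theta_hat, theta_tilde)
    print(np.var(x > 2, ddof=1) / N, np.var(g, ddof=1) / N)   # estimated variances of the two estimators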
In the following sections, several methods available for variance reduction will be discussed. Among these are
• stratified sampling
• control variates
• importance sampling
• antithetic variates
• conditioning swindles
Stratified Sampling
Consider again the estimation of
θ = E[g(X)] = ∫ g(x) f(x) dx.
For stratified sampling, we first partition the domain of integration S into m disjoint subsets
Si, i = 1, 2, . . . , m. That is, Si ∩ Sj = ∅ for all i ≠ j and ∪_{i=1}^m Si = S. Define θi as
θi = E[g(X) I(X ∈ Si)] = ∫_{Si} g(x) f(x) dx
for i = 1, 2, . . . , m. Then
θ = ∫_S g(x) f(x) dx = Σ_{i=1}^m ∫_{Si} g(x) f(x) dx = Σ_{i=1}^m θi.
The motivation behind this method is to be able to sample more from regions of S that
contribute more variability to the estimator of θ, rather than spread the samples evenly across
the whole region S. This idea is the reason why stratified sampling is often used as a general
sampling technique; for example, instead of random sampling from an entire state, one might
use stratified sampling based on subregions such as counties or districts, sampling more
from those regions that contribute more information about the quantity being estimated.
Define
pi = ∫_{Si} f(x) dx
and
fi(x) = (1/pi) f(x).
Then Σ_{i=1}^m pi = 1 and ∫_{Si} fi(x) dx = 1. By expressing
gi(x) = g(x) if x ∈ Si, and gi(x) = 0 otherwise,
we can show that
θi = ∫_{Si} pi g(x) [f(x)/pi] dx = pi ∫_S gi(x) fi(x) dx
and the crude Monte Carlo estimator of θi is then
θ̂i = (pi/Ni) Σ_{j=1}^{Ni} gi(xj)
where x1 , . . . , xNi is a random sample from the distribution with density fi (x) on Si , with
variance given by
Var(θ̂i) = pi² σi² / Ni,
where σi² = Var(g(X) | X ∈ Si). Thus the combined estimator of θ is
θ̃ = Σ_{i=1}^m θ̂i = Σ_{i=1}^m (pi/Ni) Σ_{j=1}^{Ni} gi(xj),
with variance given by
Var(θ̃) = Σ_{i=1}^m pi² σi² / Ni.
For the best stratification, we need to choose Ni optimally i.e., by minimizing Var(θ̃) with
respect to Ni , i = 1, 2, . . . , m. That is, minimize the Lagrangian
Σ_{i=1}^m pi² σi² / Ni + λ(N − Σ_{i=1}^m Ni).
Equating partial derivatives with respect to Ni to zero, it can be shown that the optimum
occurs when
Ni = N pi σi / Σ_{i=1}^m pi σi
for N = Σ_{i=1}^m Ni. Thus, after the strata are chosen, the sample sizes Ni should be selected
proportional to pi σi. Of course, since the σi are unknown, they need to be estimated by some
method, such as from past data or, in the absence of any information, by conducting a pilot
study. Small samples of points are taken from each Si. The variances are estimated from
these samples. This allows the sample sizes for the stratified sample to be allocated in such
a way that more points are
sampled from those strata where g(x) shows the most variation.
Example: Evaluate I = ∫_0^1 e^{−x²} dx using stratified sampling.
Since in this example f(x) is the U(0, 1) density, we will use 4 equispaced partitions of the
interval (0, 1), giving p1 = p2 = p3 = p4 = .25. To conduct a pilot study to determine strata
sample sizes, we first estimated the strata variances of g(x):
σi² = var(g(x) | x ∈ Si)  for i = 1, 2, 3, 4
by a simulation experiment. We generated 10000 U(0, 1) variables u and computed the standard
deviation of e^{−u²} for u's falling in each of the intervals (0, .25), (.25, .5), (.5, .75), and
(.75, 1).
[Figure: plot of exp(−x²) for x ∈ (0, 1).]
These came out to be σi = .018, .046, .061, and .058. Since p1 = p2 = p3 = p4 and
Ni ∝ σi/Σσi, we get σi/Σσi = .0984, .2514, .3333, and .3169, implying that for N = 10000,
N1 = 984, N2 = 2514, N3 = 3333, and N4 = 3169.
We will round these and use
N1 = 1000, N2 = 2500, N3 = 3500, and N4 = 3000.
as sample sizes of our four strata.
An MC experiment using stratified sampling for obtaining an estimate of I and its standard
error can now be carried out as follows. For each stratum (mi, mi+1), with m1 = 0, m2 =
.25, m3 = .50, m4 = .75, and m5 = 1.0, generate Ni numbers uj ∼ U(mi, mi+1) for j =
1, . . . , Ni and compute xj = e^{−uj²} for the Ni uj's. Then compute the estimates
θ̂i = Σ xj/Ni
σ̂i² = Σ (xj − x̄)²/(Ni − 1)
for i = 1, 2, 3, 4. Pool these estimates to obtain θ̃ = Σ θ̂i/4 and var(θ̃) = Σ (σ̂i²/Ni)/4². We
obtained
θ̃ = 0.7470707 and var(θ̃) = 2.132326e−007.
A crude MC estimate of I is obtained by ignoring strata and using the 10000 U(0, 1)
variables to compute the mean and variance of e^{−uj²}. We obtained
θ̂ = 0.7462886 and var(θ̂) = 4.040211e−006.
Thus the variance reduction is 4.040211e−006 / 2.132326e−007 ≈ 19.
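A sketch of the whole stratified experiment in Python (an assumption), using the four strata and the rounded allocation above:

    import numpy as np

    rng = np.random.default_rng(4)
    g = lambda x: np.exp(-x**2)

    # strata (m_i, m_{i+1}) with p_i = .25, and the rounded allocation N_i
    bounds = [(0.0, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.0)]
    sizes = [1000, 2500, 3500, 3000]
    p = 0.25

    theta_i, var_i = [], []
    for (lo, hi), Ni in zip(bounds, sizes):
        x = g(rng.uniform(lo, hi, size=Ni))          # sample within the stratum
        theta_i.append(x.mean())                     # estimate of E[g | stratum]
        var_i.append(x.var(ddof=1) / Ni)

    theta_tilde = p * np.sum(theta_i)                # theta_tilde = sum_i p_i * theta_hat_i
    var_tilde = p**2 * np.sum(var_i)                 # Var(theta_tilde) = sum_i p_i^2 sigma_i^2 / N_i
    print(theta_tilde, var_tilde)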
Control Variates
R
Suppose it is required to estimate θ = E(g(x)) = g(x)f (x)dx where f (x)
is a joint denR
sity as before. Let h(x) be a function “similar” to g(x) such that E(h(x)) = h(x)f (x)dx =
τ is known. Then θ can be written as
θ=
Z
[g(x) − h(x)]f (x)dx + τ = E[g(x) − h(x)] + τ
and the crude Monte Carlo estimator of θ now becomes
θ̃ = (1/N) Σ_{j=1}^N [g(xj) − h(xj)] + τ,
and
var(θ̃) = [var(g(x)) + var(h(x)) − 2 cov(g(x), h(x))] / N.
If var(h(x)) ≈ var(g(x)), which is possibly so because h mimics g, then there is a reduction
in variance if
corr(g(x), h(x)) > 1/2.
In general, θ can be written as
θ = E[g(x) − β{h(x) − τ }]
in which case the crude Monte Carlo estimator of θ is
θ̃ = (1/N) Σ_{j=1}^N [g(xj) − β{h(xj) − τ}].
As a trivial example, consider a Monte Carlo study to estimate θ = E(median(x1, x2, . . . , xn))
where x = (x1, x2, . . . , xn)′ is a vector of i.i.d. random variables from some distribution. If µ = E(x̄)
is known, then it might be a good idea to use x̄ as a control variate. Crude Monte Carlo is
performed on the difference {median(x) − x̄} rather than on median(x), giving the estimator
θ̃ = (1/N) Σ {median(x) − x̄} + µ
with
var(θ̃) = var{median(x) − x̄} / N.
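A small sketch in Python (an assumption), taking the xᵢ to be N(0, 1) with n = 11 so that µ = E(x̄) = 0 is known:

    import numpy as np

    rng = np.random.default_rng(5)
    n, N, mu = 11, 20000, 0.0

    samples = rng.normal(size=(N, n))
    med = np.median(samples, axis=1)
    xbar = samples.mean(axis=1)

    theta_hat = med.mean()                             # crude Monte Carlo
    theta_tilde = (med - xbar).mean() + mu             # control-variate estimator
    print(theta_hat, theta_tilde)
    print(med.var(ddof=1) / N, (med - xbar).var(ddof=1) / N)   # variance comparison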
The basic problem in using the method is identifying a suitable variate to use as a control
variable. Generally, if there is more than one candidate for a control variate, a possible way
to combine them is to find an appropriate linear combination of them. Let Z = g(x) and let
Wi = hi (x) for i = 1, ..., p be the possible control variates, where as above it is required to
estimate θ = E(Z) and E(Wi ) = τi are known. Then θ can be expressed as
θ = E[g(x) − {β1 (h1 (x) − τ1 ) + . . . + βp (hp (x) − τp )}]
= E[Z − β1 (W1 − τ1 ) − . . . − βp (Wp − τp )].
This is a generalization of the application with a single control variate given above, i.e.,
if p = 1 then
θ = E[Z − β(W − τ )].
An unbiased estimator of θ in this case is obtained by averaging observations of Z − β(W − τ)
as before, and β is chosen to minimize var(θ̂). This amounts to regressing observations of
Z on the observations of W. In the general case, this leads to multiple regression of Z on
W1 , . . . , Wp . However note that Wi ’s are random and therefore the estimators of θ and var(θ̂)
obtained by standard regression methods may not be unbiased. However, if the estimates
β̂i are found in a preliminary experiment and then held fixed, unbiased estimators can be
obtained.
To illustrate this technique, again consider the example of estimating θ = P(X > 2) =
1/2 − ∫_0^2 f(x)dx where X ∼ Cauchy. Expanding f(x) = 1/{π(1 + x²)} suggests control variates
x² and x⁴, i.e., h1(x) = x², h2(x) = x⁴. A small regression experiment based on samples
drawn from U(0, 2) gave
θ̂ = 1/2 − [f(x) + 0.15(x² − 8/3) − 0.025(x⁴ − 32/5)],
with var(θ̂) ≈ 6.3 × 10⁻⁴/N.
Koehler and Larntz (1980) consider the likelihood ratio (LR) statistic for testing
goodness-of-fit in multinomials:
H0 : pi = pi0,   i = 1, . . . , k
where (X1, ..., Xk) ∼ mult(n, p) with p = (p1, ..., pk)′. The chi-squared statistic is given by
χ² = Σ_{j=1}^k (xj − npj0)² / (npj0),
and the LR statistic by
G² = 2 Σ_{j=1}^k xj log(xj / (npj0)).
The problem is to estimate θ = E(G2 ) by Monte Carlo sampling and they consider using
the χ2 statistic as a control variate for G2 as it is known that E(χ2 ) = k − 1 and that G2
and χ2 are often highly correlated. Then θ = E(G2 − χ2 ) + (k − 1) and the Monte Carlo
estimator is given by
θ̃ = (1/N) Σ {G² − χ²} + (k − 1)
and an estimate of var(θ̃) is given by
var{G² − χ²} / N.
Now, since in general, var(G²) and var(χ²) are not necessarily similar in magnitude, an
improved estimator may be obtained by considering
θ = E[G² − β(χ² − (k − 1))]
and the estimator is
θ̃β = (1/N) Σ {G² − β̂ χ²} + β̂(k − 1)
where β̂ is an estimate of β obtained through an independent experiment. To do this, compute
simulated values of G2 and χ2 statistics from a number of independent samples (say, 100)
obtained under the null hypothesis. Then regress the G2 values on χ2 values to obtain β̂.
An estimate of var(θ̃β ) is given by
var{G² − β̂ χ²} / N.
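A sketch of this two-stage procedure in Python (the language, the null cell probabilities, and the sample sizes are hypothetical choices for illustration):

    import numpy as np

    rng = np.random.default_rng(6)
    p0 = np.array([0.2, 0.3, 0.1, 0.4])     # hypothetical null probabilities, k = 4
    n, k = 50, len(p0)

    def g2_chi2(counts):
        e = n * p0
        chi2 = np.sum((counts - e)**2 / e, axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(counts > 0, counts * np.log(counts / e), 0.0)   # 0 log 0 taken as 0
        return 2 * np.sum(terms, axis=1), chi2

    # Stage 1: small preliminary experiment; estimate beta by regressing G^2 on chi^2
    pilot = rng.multinomial(n, p0, size=100)
    G2p, C2p = g2_chi2(pilot)
    beta_hat = np.cov(G2p, C2p, ddof=1)[0, 1] / np.var(C2p, ddof=1)

    # Stage 2: independent run; beta_hat is held fixed
    counts = rng.multinomial(n, p0, size=10000)
    G2, C2 = g2_chi2(counts)
    theta_beta = np.mean(G2 - beta_hat * C2) + beta_hat * (k - 1)
    se = np.std(G2 - beta_hat * C2, ddof=1) / np.sqrt(len(G2))
    print(theta_beta, se)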
Importance Sampling
Again, the interest is in estimating θ = E(g(x)) = ∫ g(x)f(x)dx where f(x) is a joint
density. Importance sampling attempts to reduce the variance of the crude Monte Carlo
estimator θ̂ by changing the distribution from which the actual sampling is carried out.
Recall that in crude Monte Carlo the sample is generated from the distribution of x (with
density f (x)). Suppose that it is possible to find a distribution H(x) with density h(x) with
the property that g(x)f (x)/h(x) has small variability. Then θ can be expressed as
θ = ∫ g(x)f(x)dx = ∫ g(x) [f(x)/h(x)] h(x)dx = E[φ(x)]
where φ(x) = g(x)f (x)/h(x) and the expectation is with respect to H. Thus, to estimate θ
the crude Monte Carlo estimator of φ(x) with respect to the distribution H could be used,
i.e.,
θ̃ = (1/N) Σ_{j=1}^N φ(xj)
where x1, . . . , xN are sampled from H. The variance of this estimator is
var(θ̃) = (1/N) ∫ {φ(x) − θ}² h(x)dx = (1/N) ∫ [f(x)g(x)/h(x)]² h(x)dx − θ²/N.
Thus var(θ̃) will be smaller than var(θ̂) if
∫ [f(x)g(x)/h(x)]² h(x)dx ≤ ∫ [g(x)]² f(x)dx
since
var(θ̂) = (1/N) ∫ [g(x)]² f(x)dx − θ²/N.
It can be shown using the Cauchy–Schwarz inequality for integrals that the above inequality
holds if
h(x) = |g(x)|f(x) / ∫ |g(x)|f(x)dx,
i.e., if h(x) ∝ |g(x)|f(x). For an extreme example, if h(x) = g(x)f(x)/θ, then var(θ̃) = 0!
The choice of an appropriate density h is usually difficult and depends on the problem at
hand. A slightly better idea is obtained by looking at the expression for the left hand side
of the above inequality a little differently:
∫ [f(x)g(x)/h(x)]² h(x)dx = ∫ [g(x)² f(x)] [f(x)/h(x)] dx.
To ensure a reduction in var(θ̃), this suggests that h(x) should be chosen such that h(x) >
f(x) for large values of g(x)² f(x) and h(x) < f(x) for small values of g(x)² f(x).
Since sampling is now done from h(x) , θ̃ will be more efficient if h(x) is selected so that
regions of x for which g(x)2 f (x) is large are more frequently sampled than when using f (x).
By choosing the sampling distribution h(x) this way, the probability mass is redistributed
according to the relative importance of x as measured by g(x)2 f (x). This gives the name
importance sampling to this method since important x’s are sampled more often and h(x) is
called the importance function.
Example 1:
Evaluate θ = Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy = ∫_{−∞}^∞ I(y < x) φ(y) dy, where φ(y) is the standard
Normal density. A density curve similar in shape to φ(y) is the logistic density
f(y) = π exp(−πy/√3) / [√3 (1 + exp(−πy/√3))²]
with mean 0 and variance 1. The cdf of this distribution is
F(y) = 1 / (1 + exp(−πy/√3)).
A crude Monte Carlo estimate of θ is
θ̂ = Σ_{i=1}^N I(zi < x)/N = #(zi < x)/N
where z1 , . . . , zN is a random sample from the standard normal. An importance sampling
estimator of θ can be obtained by considering
θ = ∫_{−∞}^∞ [I(y < x)φ(y)/f(y)] f(y)dy = ∫_{−∞}^∞ [kφ(y)/f(y)] [I(y < x)f(y)/k] dy
where k is determined such that f(y)I(y < x)/k is a density, i.e., f(y)/k is a density over
(−∞, x):
k = 1/(1 + exp(−πx/√3)).
The importance sampling estimator of θ is
θ̃ = (k/N) Σ_{i=1}^N φ(yi)/f(yi)
where y1, . . . , yN is a random sample from the density f(y)I(y < x)/k. Sampling from
this density is easily done through the method of inversion: generate a variate ui from
uniform(0, 1) and transform using
yi = −(√3/π) log{(1 + exp(−πx/√3))/ui − 1}.
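A sketch of both estimators in Python (an assumption), for the particular point x = −1:

    import numpy as np

    rng = np.random.default_rng(7)
    N, x = 100000, -1.0
    s3 = np.sqrt(3.0)

    phi = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)       # standard normal density
    f = lambda y: np.pi * np.exp(-np.pi * y / s3) / (s3 * (1 + np.exp(-np.pi * y / s3))**2)

    # Crude Monte Carlo
    z = rng.normal(size=N)
    theta_hat = np.mean(z < x)

    # Importance sampling from the logistic truncated to (-inf, x), sampled by inversion
    k = 1.0 / (1.0 + np.exp(-np.pi * x / s3))                    # F(x) for the logistic
    u = rng.uniform(0.0, 1.0, size=N)
    y = -(s3 / np.pi) * np.log((1 + np.exp(-np.pi * x / s3)) / u - 1)
    theta_tilde = (k / N) * np.sum(phi(y) / f(y))
    print(theta_hat, theta_tilde)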
Earlier, it was argued that one may expect to obtain variance reduction if h(x) is chosen
similar in shape to g(x)f(x). One way to do this is to attempt to select h(x) so that both h(x)
and g(x)f (x) are maximized at the same value. With this choice if h is chosen from the
same family of distributions as f one might expect to gain substantial variance reduction.
For example, if f is multivariate normal, then h might also be taken to be a multivariate
normal but with a different mean and/or covariance matrix.
Example 2:
Evaluate θ = E[g(X)] = E[X⁴ e^{X²/4} I_{[X≥2]}] where X ∼ N(0, 1). Choose the importance density h to also be Normal,
with mean µ but the same variance 1. Since h is maximized at µ, the above suggests a
good choice for µ to be the value that maximizes g(x)f(x):
µ = arg max_x g(x)f(x)
  = arg max_x x⁴ e^{x²/4} I_{[x≥2]} × e^{−x²/2}/√(2π)
  = arg max_{x≥2} x⁴ e^{−x²/4}
  = √8.
Then
θ̃ = (1/N) Σ_{j=1}^N yj⁴ e^{(yj²/4 − √8 yj + 4)}
where the yj are generated from N(√8, 1) and only terms with yj ≥ 2 contribute (the indicator I_{[yj≥2]} sets the rest to zero).
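A sketch of this estimator in Python (an assumption):

    import numpy as np

    rng = np.random.default_rng(8)
    N = 100000
    mu = np.sqrt(8.0)

    y = rng.normal(loc=mu, scale=1.0, size=N)                 # sample from the importance density N(sqrt(8), 1)
    w = np.where(y >= 2, y**4 * np.exp(y**2 / 4 - mu * y + 4), 0.0)   # g(y) f(y) / h(y), zero when y < 2
    theta_tilde = w.mean()
    print(theta_tilde, w.std(ddof=1) / np.sqrt(N))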
A recommended procedure for selecting an appropriate h(x), given gL ≤ g(x) ≤ gU for
all x, is to choose h(x) so that, for all x,
h(x) ≥ (g(x) − gL)/(gU − gL).
While this does not guarantee variance reduction, it can be used to advantage when crude
Monte Carlo would require large sample sizes to obtain a required accuracy. In many cases,
this occurs when g(x) is unbounded, i.e., gU ≡ ∞ in the region of integration. In
this case, we may choose h(x) such that φ(x) = f(x)g(x)/h(x) is bounded.
Example 3:
Consider the evaluation of
θ = ∫_0^1 x^{α−1} e^{−x} dx,   1/2 < α ≤ 1.
Here, if g(x) = x^{α−1} e^{−x} and f(x) = I_{(0,1)}(x), we find that g(x) is unbounded at x = 0.
Suppose we consider the case when α = .6. Then the crude MC estimate of θ is
θ̂ = Σ_{j=1}^N g(uj)/N = Σ_{j=1}^N uj^{−.4} e^{−uj} / N
where u1, u2, . . . , uN are from U(0, 1).

[Figure: left panel, plot of g(x)²f(x) for x ∈ (0, 1); right panel, plot of h(x) = αx^{α−1} for α = .6, with the U(0, 1) density shown as a dashed line.]
On the left above is a plot of g(x)²f(x) for values of x ∈ (0, 1). This plot shows that we
need to select h(x) > 1 (say) for x < .1 (say) and h(x) < 1 for x > .1. Suppose we choose
h(x) = αx^{α−1}. The plot on the right is a plot of h(x) for α = .6. The dashed line is a plot of
f(x) = I_{(0,1)}(x). This plot shows clearly that h(x) gives more weight to those values of
x < .1, the values for which g(x) gets larger, than does f(x), which assigns equal weight
to all values of x. We also see that since
θ = ∫_0^1 x^{α−1} e^{−x} dx = ∫_0^1 φ(x) αx^{α−1} dx
where φ(x) = e^{−x}/α, and φ(x) is bounded at x = 0.
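A sketch in Python (an assumption); h(x) = αx^{α−1} is sampled by inversion as X = U^{1/α}:

    import numpy as np

    rng = np.random.default_rng(9)
    N, alpha = 100000, 0.6
    u = rng.uniform(0.0, 1.0, size=N)

    # Crude Monte Carlo: average of g(u) = u^(alpha-1) e^(-u), which is unbounded near 0
    g = u**(alpha - 1) * np.exp(-u)
    theta_hat = g.mean()

    # Importance sampling from h(x) = alpha x^(alpha-1) on (0,1): x = u^(1/alpha) by inversion,
    # and phi(x) = g(x) f(x) / h(x) = exp(-x) / alpha is bounded
    x = u**(1.0 / alpha)
    phi = np.exp(-x) / alpha
    theta_tilde = phi.mean()
    print(theta_hat, theta_tilde)
    print(g.var(ddof=1) / N, phi.var(ddof=1) / N)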
Example 4: A more complex application
In sequential testing, the likelihood ratio test statistic λ is used to test
H0: Population density is f0(x)   vs.   HA: Population density is f1(x),
where
λ = f1(x1, . . . , xm)/f0(x1, . . . , xm) = Π_{j=1}^m f1(xj)/f0(xj)
  = exp{ Σ_{j=1}^m log[f1(xj)/f0(xj)] } = exp{ Σ_{j=1}^m zj }
and zj = log{f1(xj)/f0(xj)}. Using the decision rule:
Reject H0 if Σ zj ≥ b,
Accept H0 if Σ zj ≤ a,
continue sampling if a < Σ zj < b.
It is usually assumed that a < 0 < b. Here the stopping time m is a random variable, and
note that E_{H0}(zj) < 0. The theory of sequential tests needs to evaluate probabilities
such as θ = Pr{Σ zj ≥ b | H0 is true}.
As an example, consider X ∼ N(µ, 1) and suppose that H0: µ = µ0 vs. Ha: µ = µ1.
For simplicity, let µ0 ≤ 0. Consider estimating the type I error rate
θ = Pr_{H0}{ Σ zj ≥ b },
where zj = xj, i.e., Σ xj is the test statistic used. Since θ in integral form is
θ = ∫_b^∞ g0(x) dx,
where g0 is the density of Σ zj under the null hypothesis H0, which can also be expressed in
the form
θ = ∫ I(Σ zj ≥ b) f0(x) dx,
where f0 is the density of Xj under H0 , the usual Monte Carlo estimator of θ is
θ̂ = Σ_{i=1}^N I(yi ≥ b)/N = #(yi ≥ b)/N
where y1, . . . , yN are N independently generated values of Σ zj. An estimator of θ under
importance sampling is obtained by considering
θ = ∫ I(Σ zj ≥ b) [f0(x)/f1(x)] f1(x) dx
where f1 is the density of Xj under the alternative hypothesis HA .
Generating N independent samples y1, y2, . . . , yN under HA, the estimator of θ under
importance sampling is
θ̃ = (1/N) Σ_{i=1}^N I(yi ≥ b) f0(yi)/f1(yi).
For this example,
f0(yi)/f1(yi) = e^{−(yi−µ0)²/2} / e^{−(yi−µ1)²/2} = e^{(µ0−µ1)(2yi−(µ0+µ1))/2}.
If µ1 is chosen to be −µ0 (remember µ0 ≤ 0 and thus µ1 ≥ 0), the algebra simplifies to
f0(yi)/f1(yi) = e^{2µ0 yi}.
Thus the importance sampling estimator reduces to θ̃ = (1/N) Σ I(yi ≥ b) exp(2µ0 yi), where the yi are
sampled under HA. The choice of −µ0 for µ1 is optimal in the sense that it minimizes Var(θ̃)
asymptotically. The following results are from Siegmund (1976):
 −µ0     a     b      θ̃         {N Var(θ̂)}^{1/2}   {N Var(θ̃)}^{1/2}    R.E.
  .0    any   any    .5             .5                 .5                 1
  .125   −5    5     .199           .399               .102               15
  .25    −5    5     .0578          .233               .0200              136
  .50    −5    5     .00376         .0612              .00171             1280
  .125   −9    9     .0835          .277               .0273              102
  .25    −9    9     .00824         .0904              .00207             1910
  .50    −9    9     .0000692       .00832             .0000311           7160
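A sketch (in Python, an assumption) of how one line of such a table could be reproduced: the random walk Σ zj is simulated until it leaves (a, b), under H0 for the crude estimator and under HA with weight exp(2µ0 y) for the importance sampling estimator; the values of µ0, a, b and N are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(10)
    mu0, a, b, N = -0.25, -9.0, 9.0, 10000
    mu1 = -mu0                                   # the (asymptotically optimal) choice mu1 = -mu0

    def exit_sum(mu):
        """Random walk with N(mu, 1) increments; return the sum when it first leaves (a, b)."""
        s = 0.0
        while a < s < b:
            s += rng.normal(loc=mu, scale=1.0)
        return s

    # Crude Monte Carlo under H0
    y0 = np.array([exit_sum(mu0) for _ in range(N)])
    theta_hat = np.mean(y0 >= b)

    # Importance sampling: simulate under H_A and weight by exp(2 * mu0 * y)
    y1 = np.array([exit_sum(mu1) for _ in range(N)])
    w = np.where(y1 >= b, np.exp(2 * mu0 * y1), 0.0)
    theta_tilde = w.mean()
    print(theta_hat, theta_tilde)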
More on Importance Sampling
For estimating
θ = E[g(X)] = ∫ g(x)f(x)dx = ∫ g(x) [f(x)/h(x)] h(x)dx
we used
θ̃ = (1/N) Σ_{j=1}^N g(xj) w(xj)
where
w(xj) = f(xj)/h(xj)
and the xj's are generated from H. This estimator θ̃ is called the importance sampling estimator
with unstandardized weights. For estimating a posterior mean, for example, f is known
only up to a constant. In that case, the problem will involve estimating the constant of
integration as well, using Monte Carlo. That is,
θ = E[g(X)] = ∫ g(x)f(x)dx / ∫ f(x)dx = [∫ g(x) {f(x)/h(x)} h(x)dx] / [∫ {f(x)/h(x)} h(x)dx]
is estimated by the importance sampling estimator with standardized weights, θ̃*:
θ̃* = [(1/N) Σ_{j=1}^N g(xj)w(xj)] / [(1/N) Σ_{j=1}^N w(xj)] = Σ_{j=1}^N g(xj) w*(xj)
where
w*(xj) = w(xj) / Σ_{j=1}^N w(xj)
and xj ’s are generated from H as before. Obtaining the standard error of θ̃∗ involves using
an approximation, either based on Fieller’s theorem or the Delta method. To derive this,
the following shorthand notation is introduced: gj ≡ g(xj ), wj ≡ w(xj ), and zj ≡ gj wj . In
the new notation,
θ̃* = [(1/N) Σ_{j=1}^N gj wj] / [(1/N) Σ_{j=1}^N wj] = [(1/N) Σ_{j=1}^N zj] / [(1/N) Σ_{j=1}^N wj] = z̄/w̄.
Now from the central limit theorem, we have
(z̄, w̄)′ is approximately bivariate normal with mean (µz, µw)′ and covariance matrix
(1/n) [ σz²   σzw
        σzw   σw² ],
where µz, µw, σz², σzw and σw² are all expectations derived with respect to H.
It is required to estimate µz /µw and the standard error of this estimate would be the
standard error of θ̃∗ . To estimate µz /µw either the Delta method or Fieller’s theorem could
be used. Delta method gives
z̄/w̄ ≈ N( µz/µw , (1/(n µw²)) [ σz² − 2(µz/µw) σzw + (µz²/µw²) σw² ] )
and the standard error of z̄/w̄ is given by
s.e.(z̄/w̄) = (1/(√n w̄)) [ sz² − 2(z̄/w̄) szw + (z̄²/w̄²) sw² ]^{1/2}
where
sz² = Σ(zi − z̄)²/(n − 1),  szw = Σ(zi − z̄)(wi − w̄)/(n − 1),  and  sw² = Σ(wi − w̄)²/(n − 1).
Rather than using variance reduction to measure the effectiveness of an importance sampling
strategy under a selected h(x), the following quantity, known as the effective sample size, has
been suggested as a measure of efficiency. When using unstandardized weights this
measure is given by
Ñ = N / (1 + sw²).
Clearly, this reduces to N when h is exactly equal to f and is smaller than N otherwise. It
can thus be interpreted to mean that N samples must be used in the importance sampling
procedure using h to obtain the same precision as estimating with Ñ samples drawn exactly
from f (i.e., using crude Monte Carlo). Thus it measures the loss in precision in approximating
f with h, and is a useful measure for comparing different importance functions. When
using standardized weights an equivalent measure is
Ñ = N / (1 + cv²)
where cv = sw/w̄ is the coefficient of variation of w.
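A sketch in Python (an assumption) of the standardized-weight estimator, its Delta-method standard error, and the effective sample size, for a toy problem in which f is the N(0, 1) density known only up to its normalizing constant, h is N(0, 2²), and g(x) = x²:

    import numpy as np

    rng = np.random.default_rng(11)
    N = 50000
    g = lambda x: x**2                                    # estimate theta = E_f[X^2] = 1
    f_unnorm = lambda x: np.exp(-x**2 / 2)                # f known only up to a constant
    h = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)  # importance density N(0, 4)

    x = rng.normal(loc=0.0, scale=2.0, size=N)
    w = f_unnorm(x) / h(x)                                # unstandardized weights
    z = g(x) * w

    theta_star = z.sum() / w.sum()                        # standardized-weight estimator z_bar / w_bar

    # Delta-method standard error of z_bar / w_bar
    zbar, wbar = z.mean(), w.mean()
    s2z, s2w = z.var(ddof=1), w.var(ddof=1)
    szw = np.cov(z, w, ddof=1)[0, 1]
    se = np.sqrt(s2z - 2 * (zbar / wbar) * szw + (zbar / wbar)**2 * s2w) / (np.sqrt(N) * wbar)

    ess = N / (1 + (w.std(ddof=1) / wbar)**2)             # effective sample size N / (1 + cv^2)
    print(theta_star, se, ess)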
From the above results, for an importance sampling procedure to succeed some general
requirements must be met:
• weights must be bounded
• small coefficient of variation of the weights
• infinite variance of the weights must be avoided
• max wi / Σ wi should be small, i.e., for calculating the standard error of the importance sampling estimate the central limit
theorem relies on the average of similar quantities, this condition ensures that a few
large weights will not dominate the weighted sum.
Since standardized weights can be used even when f is completely determined, the following
criterion can be used to make a choice between "unstandardized" and "standardized"
estimators. The mean squared error of the standardized estimator is smaller than that of the
unstandardized estimator, i.e., MSE(θ̃*) < MSE(θ̃), if
Corr(zj, wj) > cv(wj) / (2 cv(zj))
where zj = gj wj as before.
Antithetic Variates
Again, the problem of estimating θ = ∫ g(x)f(x) dx is considered. In the control variate
and importance sampling variance reduction methods the averaging of the function g(x)
done in crude Monte Carlo was replaced by the averaging of g(x) combined with another
function. The knowledge about this second function brought about the variance reduction
of the resulting Monte Carlo method. In the antithetic variate method, two estimators are
found by averaging two different functions, say, g1 and g2 . These two estimators are such
that they are negatively correlated, so that the combined estimator has smaller variance.
Suppose that θ̂1 is the crude Monte Carlo estimator based on g1 and a sample of size N
and θ̂2 , that based on g2 and a sample of size N . Then let the combined estimator be
θ̃ = (θ̂1 + θ̂2)/2,
where θ̂1 = Σ g1(xj)/N and θ̂2 = Σ g2(xj)/N. Then
Var(θ̃) = (1/4)[Var(θ̂1) + Var(θ̂2) + 2 cov(θ̂1, θ̂2)]
        = (1/4)[σ1²/N + σ2²/N + 2ρσ1σ2/N]
where ρ = corr(θ̂1, θ̂2). A substantial reduction in Var(θ̃) over Var(θ̂1) will be achieved if
ρ ≪ 0. If this is so, then g2 is said to be an antithetic variate for g1. The usual way of
generating values for g2 so that the estimate θ̂2 will have high negative correlation with θ̂1
is by using uniform variates 1 − u, where the u's are the same ones used to generate
samples for θ̂1. In practice this is achieved by using the streams uj and 1 − uj to generate two
samples from the required distribution by inversion.
The following example will help clarify the idea. Consider estimation of θ = E[g(u)]
where u is uniformly distributed between 0 and 1. Then for any function g(·), let
θ = E(g(u)) = ∫_0^1 g(u) du,
so (1/N) Σ g(uj) is an unbiased estimator of θ for a random sample u1, . . . , uN. But since 1 − u
is also uniformly distributed on (0, 1), (1/N) Σ g(1 − uj) is also an unbiased estimator of θ. If
g(x) is a monotonic function in x, g(u) and g(1 − u) are negatively correlated and thus are
“antithetic variates.”
In general, let
θ̂1 = (1/N) Σ_{j=1}^N g(F1⁻¹(Uj1), F2⁻¹(Uj2), . . . , Fm⁻¹(Ujm))
θ̂2 = (1/N) Σ_{j=1}^N g(F1⁻¹(1 − Uj1), F2⁻¹(1 − Uj2), . . . , Fm⁻¹(1 − Ujm))
where F1, F2, . . . , Fm are the c.d.f.'s of X1, X2, . . . , Xm and U1, U2, . . . , Um are i.i.d. U(0, 1). Then
it can be shown that if g is a monotone function, cov(θ̂1, θ̂2) ≪ 0 and thus
θ̃ = (θ̂1 + θ̂2)/2
will have a smaller variance.
Example 1:
Evaluate E[h(X)], where h(x) = x/(2x − 1) and X ∼ exp(2).
Generate N samples ui, i = 1, . . . , N, where u ∼ U(0, 1), and compute
θ̃ = (1/2)[Σ h(xi)/N + Σ h(yi)/N]
where xi = −2 log(ui) and yi = −2 log(1 − ui), i = 1, . . . , N.
Example 2: Antithetic Exponential Variates
For X ∼ exp(θ), the inverse transform gives X = −θ log(1 − U) as a variate from exp(θ).
Thus Y = −θ log(U) is an antithetic variate. It can be shown that, since
Cov(X, Y) = θ² [∫_0^1 log(u) log(1 − u) du − ∫_0^1 log(u) du ∫_0^1 log(1 − u) du] = θ²(1 − π²/6)
and Var(X) = Var(Y) = θ², that
Corr(X, Y) = θ²(1 − π²/6) / √(θ² · θ²) = (1 − π²/6) ≈ −.645.
Example 3: Antithetic Variates with Symmetric Distributions
Consider the inverse transform to obtain two standard normal random variables: Z1 =
Φ⁻¹(U) and Z2 = Φ⁻¹(1 − U). Z1 and Z2 are antithetic variates as Φ⁻¹ is monotonic.
Further, it is also easy to see that Φ(Z1 ) = 1−Φ(Z2 ). Since the standard normal distribution
is symmetric it is also true that Φ(Z) = 1 − Φ(−Z) which implies that Z2 = −Z1 . Thus, it
is seen that Z1 and −Z1 are two antithetic random variables from the standard normal and
Corr(Z1 , −Z1 ) = −1.
For example, to evaluate θ = E(e^Z) for Z ∼ N(0, 1), generate z1, z2, . . . , zN and compute
the antithetic estimate θ̃ = (θ̂1 + θ̂2)/2 where θ̂1 = (Σ_{j=1}^N e^{zj})/N and θ̂2 = (Σ_{j=1}^N e^{−zj})/N. In
general, for distributions symmetric around c it is true that F(x − c) = 1 − F(c − x). Then
for two antithetic random variables X and Y from this distribution we have F(X − c) = U
and F(Y − c) = 1 − U, which results in Y = 2c − X and Corr(X, Y) = −1.
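A sketch in Python (an assumption); the true value is E(e^Z) = e^{1/2} ≈ 1.649:

    import numpy as np

    rng = np.random.default_rng(12)
    N = 100000
    z = rng.normal(size=N)

    theta1 = np.exp(z).mean()                    # based on z_j
    theta2 = np.exp(-z).mean()                   # based on the antithetic variates -z_j
    theta_tilde = (theta1 + theta2) / 2

    crude = np.exp(z).var(ddof=1) / N                          # variance of the crude estimator
    anti = ((np.exp(z) + np.exp(-z)) / 2).var(ddof=1) / N      # variance of the antithetic estimator
    print(theta_tilde, crude, anti)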
Conditioning Swindles
For simplicity of discussion, the problem of estimating θ = E(X) is considered.
For a random variable W defined on the same probability space as X
E(X) = E{E(X|W )}
and
var(X) = var{E(X|W )} + E{var(X|W )}
giving var(X) ≥ var{E(X|W )} . The crude Monte Carlo estimate of θ is
θ̂ = (1/N) Σ_{j=1}^N xj
and its variance
var(θ̂) = var(X)/N,
where x1, x2, . . . , xN is a random sample from the distribution of X. The conditional Monte
Carlo estimator is
θ̃ = (1/N) Σ_{j=1}^N q(wj)
and its variance
var(θ̃) = var{E(X|W)}/N,
where w1, w2, . . . , wN is a random sample from the distribution of W and q(wj) = E(X|W = wj).
The usual situation in which the conditioning swindle can be applied is when q(w) = E(X|W = w)
can be easily evaluated analytically.
In general, this type of conditioning is called Rao-Blackwellization. Recall that the Rao-Blackwell
theorem says that the variance of an unbiased estimator can be reduced by conditioning
on a sufficient statistic. Let (X, Y) have a joint distribution and consider the estimation
of θ = EX (g(X)). A crude Monte Carlo estimate of θ is
θ̂ = (1/N) Σ_{j=1}^N g(xj)
where xj, j = 1, . . . , N are generated from the distribution of X. The estimator of θ = E_Y[E_X(g(X)|Y)]
is θ̃ = (1/N) Σ_{j=1}^N E_X(g(X)|yj), where yj, j = 1, . . . , N are generated from the distribution of Y.
This requires the conditional expectation E_X(g(X)|Y) to be explicitly determined. It
can be shown as above that θ̃ dominates θ̂ in variance.
In general, we can write
var(g(X)) = var{E(g(X)|W )} + E{var(g(X)|W )}
and techniques such as importance sampling and stratified sampling were actually based on
this conditioning principle.
Example 1:
Suppose W ∼ Poisson(10) and that X|W ∼ Beta(w, w² + 1). The problem is to estimate
θ = E(X). Attempting to use crude Monte Carlo, i.e., to average samples from X, will be
inefficient. To use the conditioning swindle, first compute E(X|W), which is shown to be
w/(w² + w + 1). Thus the conditional Monte Carlo estimator is
θ̃ = (1/N) Σ_{j=1}^N wj/(wj² + wj + 1)
where w1, w2, . . . , wN is a random sample from Poisson(10).
In this example, the conditioning variable was rather obvious; sometimes real ingenuity
is required to discover a good random variable to condition on.
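A sketch in Python (an assumption) comparing the crude and conditional estimators for this example; the guard against W = 0 (which would make the first Beta parameter invalid) is a practical assumption, since P(W = 0) is negligible here:

    import numpy as np

    rng = np.random.default_rng(13)
    N = 20000
    w = rng.poisson(lam=10, size=N)
    w = np.maximum(w, 1)                          # guard: Beta parameters must be positive

    # Crude Monte Carlo: draw X | W and average
    x = rng.beta(w, w**2 + 1)
    theta_hat = x.mean()

    # Conditioning swindle: average E(X | W) = w / (w^2 + w + 1)
    q = w / (w**2 + w + 1)
    theta_tilde = q.mean()
    print(theta_hat, theta_tilde)
    print(x.var(ddof=1) / N, q.var(ddof=1) / N)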
Example 2:
Suppose x1, x2, . . . , xn is a random sample from N(0, 1) and let M = median(X1, . . . , Xn).
The object is to estimate θ = var(M) = E[M²]. The crude Monte Carlo estimate is obtained
by generating N samples of size n from N(0, 1) and simply averaging the squares of the medians
from each sample:
θ̂ = (1/N) Σ_{j=1}^N Mj².
An alternative is to use the independence decomposition swindle. If X̄ is the sample
mean, note that
var(M) = var(X̄ + (M − X̄))
       = var(X̄) + var(M − X̄) + 2 cov(X̄, M − X̄)
However, X̄ is independent of M − X̄, giving cov(X̄, M − X̄) = 0. Thus
var(M) = 1/n + E(M − X̄)²,
giving the swindle estimate of θ to be
θ̃ = 1/n + (1/N) Σ_{j=1}^N (Mj − X̄j)².
One would expect the variation in M − X̄ to be much less than the variation in M. In
general, the idea is to decompose the estimator X into X = U + V where U and V are
independent; thus var(X) = var(U) + var(V). If var(U) is known, then var(V) ≤ var(X),
and therefore V can be estimated more precisely than X.
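A sketch in Python (an assumption), with n = 11 observations per sample and N replications:

    import numpy as np

    rng = np.random.default_rng(14)
    n, N = 11, 20000
    samples = rng.normal(size=(N, n))

    M = np.median(samples, axis=1)
    xbar = samples.mean(axis=1)

    theta_hat = (M**2).mean()                          # crude estimate of var(M) = E[M^2]
    theta_tilde = 1.0 / n + ((M - xbar)**2).mean()     # swindle: 1/n + average of (M_j - Xbar_j)^2
    print(theta_hat, theta_tilde)
    print((M**2).var(ddof=1) / N, ((M - xbar)**2).var(ddof=1) / N)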