Introduction to Monte-Carlo Methods
Pascal Bianchi
14/11/2011
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Telecom ParisTech and the STA group
- The research group "Statistics and Applications" (STA) is in charge of this course
- 16 senior researchers in the various fields related to statistics (machine learning, time series analysis, distributed algorithms, statistical signal processing, MC methods, ...)
- Monte-Carlo methods: a research theme led by Éric Moulines and Gersende Fort
  Applications to finance, astronomy and robotics
Schedule of the week
Mornings: lectures
Afternoons: labs
- Today: basics in probability theory + introduction to MC methods
- Tuesday: how to generate random variables with a target distribution
  An application to physics
- Wednesday / Thursday: how to reduce the variance of the estimation error
  An application to the computation of the waiting time in a queue
- Friday morning: MC methods for statistical inference in Hidden Markov Models
  An application to self-localization in robotics
- Friday afternoon: seminar by G. Fort
  Recent advances in MC methods and applications
Practical information (1/2)
All materials are available at http://perso.telecom-paristech.fr/bianchi/athens/athens.html
Evaluation: each student receives a grade from 0 to 20 points based on four Lab reports and on a short quiz. The quiz counts for 7 points, the Lab reports for 13 points.
Quiz
The quiz will take place on Friday afternoon. It will consist of a short series of brief questions on the course. Documents are not permitted during the quiz.
Practical information (2/2)
Lab reports: there are four Labs. The person in charge of each Lab is:
- Lab I: Jérémie Jakubowicz, jakubowi@telecom-paristech.fr
- Lab II: Ian Flint, ian.flint@telecom-paristech.fr
- Lab III: Pascal Bianchi, pascal.bianchi@telecom-paristech.fr
- Lab IV: Ian Flint, ian.flint@telecom-paristech.fr
Students work in pairs during each Lab.
Each pair of students submits four Lab reports, one for each Lab.
Each report must be sent by email to the person in charge of the Lab before 28/11.
Please use email subjects of the form "ATHENS / Lab II / Name1 - Name2".
Instructions for Lab reports
The numerical results should be commented. Particular attention will be paid to the relevance of the comments. It is recommended to send the source code in a file separate from the report.
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Monte-Carlo (MC) methods: general definition
MC methods refer to methods that make it possible to:
- sample from a target distribution µ
- use these samples to approximate numerical quantities
- control the approximation error
Applications: statistical physics, molecular dynamics, finance, biology, astronomy, robotics, signal and image processing, ...
Buffon’s needles (1/2)
1777: one of the oldest and most celebrated examples of Monte-Carlo simulation
[Figure: simulated needle throws on a floor ruled with parallel lines]
Consider a floor ruled with parallel lines, and let r be the distance between two consecutive lines.
Throw n needles of length ℓ ≤ r on the floor.
Count the number Nn of needles intersecting a line.
Deduce an estimate of π = 3.14159...
Buffon’s needles (2/2)
Define Xi = 1 if the i-th needle intersects a line, and Xi = 0 otherwise.
This afternoon, you will prove that: P(Xi = 1) = 2ℓ/(πr)
We can estimate the probability P(Xi = 1) by

    θ̂n = Nn/n = (1/n) ∑_{i=1}^n Xi

Thus an estimate of π is given by

    π̂n = 2ℓ/(θ̂n r)

Performance? Optimal choice of ℓ, r?
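A minimal MATLAB-style sketch of the experiment (the values of n, r and ℓ are illustrative; π is only used here to simulate the random angle of the physical throw):

    % Buffon's needle: estimate pi from simulated throws (illustrative sketch)
    n = 1e6; r = 1; l = 0.8;            % illustrative spacing r and needle length l <= r
    d   = (r/2) * rand(n,1);            % distance from the needle centre to the nearest line
    phi = (pi/2) * rand(n,1);           % acute angle between the needle and the lines
    X   = (d <= (l/2) * sin(phi));      % X_i = 1 if the i-th needle crosses a line
    theta_hat = sum(X) / n;             % estimate of P(X_i = 1) = 2*l/(pi*r)
    pi_hat    = 2*l / (theta_hat * r)   % estimate of pi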
General case: evaluating integrals
Estimate

    θ = ∫ f(x) dµ(x)

where µ is a probability measure on R^d.
Assume that we are able to draw i.i.d. samples from µ:

    X1, X2, X3, ... ~ µ  (i.i.d.)

A MC estimator is given by

    θ̂n = (1/n) ∑_{i=1}^n f(Xi)

i.e. approximate the expectation by the empirical mean.
Similarly, an estimator of g(θ) is given by g(θ̂n).
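A minimal sketch of this estimator (the choice f(x) = cos(x) with µ = N(0,1) is purely illustrative; the exact value is then θ = E[cos(X1)] = e^{-1/2}):

    % Standard MC estimator of theta = E[f(X)] (illustrative: f = cos, mu = N(0,1))
    n = 1e5;
    X = randn(n,1);                  % i.i.d. samples from mu
    theta_hat = sum(cos(X)) / n      % empirical mean; exact value is exp(-1/2) ~ 0.6065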
Questions
- How to sample from an arbitrary distribution µ? (cf. tomorrow's lecture)
- Is the estimator consistent, i.e., does it converge to the true value as n → ∞?
- What indications do we have about the estimation error?
- Can we improve the estimator, i.e., reduce its variance? (cf. Wednesday/Thursday's lectures)
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Almost sure convergence, convergence in probability
Let (Ω, F, P) be a probability space and (Xn)n≥1 a sequence of random variables (r.v.) on R^d.
- (Xn)n≥1 converges almost surely (a.s.) to a r.v. X if

      lim_{n→∞} Xn(ω) = X(ω)

  for every ω except on a set of P-measure zero.
  Notation: Xn →a.s. X
- (Xn)n≥1 converges in probability to X if

      ∀ε > 0,  lim_{n→∞} P(|Xn − X| > ε) = 0

  Notation: Xn →P X
Property: convergence a.s. implies convergence in probability.
Theorem of continuity: let f be a continuous function.
- Xn →a.s. X implies f(Xn) →a.s. f(X)
- Xn →P X implies f(Xn) →P f(X)
Law of Large Numbers (LLN)
Theorem: let (Xn)n≥1 be an i.i.d. sequence such that E‖X1‖ < ∞. Then,

    (1/n) ∑_{i=1}^n Xi →a.s. E(X1)
Example #1: convergence of the standard MC estimator
Analyze the MC estimator

    θ̂n = (1/n) ∑_{i=1}^n f(Xi)

where the Xi's are i.i.d. with distribution µ. Assume that E‖f(X1)‖ < ∞.
1. The estimator is unbiased: E θ̂n = θ.
2. By the LLN, θ̂n converges a.s. to θ = E f(X1).
   The sequence of estimators θ̂n is said to be strongly consistent.
   Similarly, g(θ̂n) is a strongly consistent estimator of g(θ) if g is continuous at θ.
3. However, we do not yet know anything about the fluctuations of the estimation error.
Example #2: approximating densities by histograms
Let X1, ..., Xn be n real i.i.d. samples whose density p(x) is Lipschitz-continuous on [a, b].
Fix k ≥ 1 and define a_ℓ = a + ℓ(b−a)/k for ℓ = 0, 1, ..., k.
Consider the histogram:

    h_n(x) = ∑_{ℓ=1}^{k} h_n^{(ℓ)} 1_{[a_{ℓ−1}, a_ℓ]}(x)

where h_n^{(ℓ)} = card{ i = 1, ..., n : Xi ∈ [a_{ℓ−1}, a_ℓ] }.
By the LLN, for each ℓ,

    h_n^{(ℓ)} / n →a.s. ∫_{a_{ℓ−1}}^{a_ℓ} p(t) dt

Noting that ∫_{a_{ℓ−1}}^{a_ℓ} p(t) dt = p(a_{ℓ−1}) (b−a)/k + O(1/k²), we conclude that

    ∀x ∈ [a, b],   (k / ((b−a) n)) h_n(x) →a.s. p(x) + O(1/k)

Normalized histograms can therefore be interpreted as an approximation of the pdf.
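A minimal sketch of this approximation (the standard Gaussian density and the values of n, k, a, b are illustrative choices; note the bin width (b−a)/k in the normalization):

    % Normalised histogram vs. true pdf (illustrative: X_i ~ N(0,1), [a,b] = [-3,3])
    n = 1e5; a = -3; b = 3; k = 30;
    X = randn(n,1);
    edges  = a + (0:k) * (b-a) / k;                          % a_0, ..., a_k
    counts = histc(X, edges);                                % counts(l) = h_n^(l), l = 1..k
    approx = reshape(counts(1:k), 1, k) * k / ((b-a) * n);   % normalised histogram heights
    centres = edges(1:k) + (b-a) / (2*k);                    % bin centres
    plot(centres, approx, 'o', centres, exp(-centres.^2/2) / sqrt(2*pi), '-')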
Convergence in distribution
Let FX denote the distribution function of a r.v. X.
Definition: a sequence (Xn)n≥1 is said to converge in distribution (or in law) to a r.v. X if

    lim_{n→∞} FXn(x) = FX(x)

at every continuity point x of FX.
Notation: Xn →L X
Property: Xn →P X implies Xn →L X
Portmanteau theorem: the following statements are equivalent
1. Xn →L X
2. For any bounded continuous function f : R^d → R, E(f(Xn)) → E(f(X))
3. For any Borel set H such that P(X ∈ ∂H) = 0, P(Xn ∈ H) → P(X ∈ H)
Theorem of continuity: let f be continuous. Then Xn →L X implies f(Xn) →L f(X).
Slutsky's Lemma
Assume that Xn →L X and Yn →L c, where c is a constant. Then:

    Xn + Yn →L X + c
    Xn Yn →L cX
    Xn / Yn →L X / c    if c ≠ 0
Central Limit Theorem (CLT)
Theorem: let (Xn)n≥1 be an i.i.d. sequence such that E(‖X1‖²) < ∞.
Define m = E(X1) and Σ = Cov(X1). Then,

    (1/√n) ∑_{i=1}^n (Xi − m) →L N(0, Σ)

where N(0, Σ) stands for a Gaussian r.v. with zero mean and covariance Σ.
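A quick numerical illustration (uniform samples are an illustrative choice; any square-integrable distribution behaves the same way):

    % CLT illustration: normalised sums of U(0,1) samples are approximately N(0,1)
    n = 1000; m = 0.5; sigma = sqrt(1/12);       % mean and std of U(0,1)
    S = sum(rand(n, 5000)) / n;                  % 5000 independent empirical means
    Z = sqrt(n) * (S - m) / sigma;               % normalised errors
    hist(Z, 50)                                  % bell-shaped histogram centred at 0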
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Control of the estimation error (1/2)
Example: let θ̂n = (1/n) ∑_i f(Xi) be the standard MC estimate of θ = E(f(X1)).
We already know that the estimation error tends a.s. to zero as n → ∞.
Ideal objective: for a given tolerance level δ, find an n large enough to ensure that the error ‖θ̂n − θ‖ does not exceed δ.
Remark: MC methods are random by nature. The (random) error can always exceed a given δ with a small but nonzero probability. Thus, we should reformulate the problem as:
Find an n large enough to ensure that P(‖θ̂n − θ‖ > δ) is at most α.
Control of the estimation error (2/2)
Assume θ ∈ R for simplicity.
First solution: use upper bounds on the probability.
Example: the Chebyshev bound leads to

    P(|θ̂n − θ| > δ) ≤ E(|θ̂n − θ|²) / δ² ≤ σ² / (n δ²)

where σ² = Var(f(X1)).
It is sufficient to set n larger than σ²/(α δ²) to ensure that P(|θ̂n − θ| > δ) < α.
Two major drawbacks:
1. The bound is far from being tight. Finer bounds do exist, but in general a pessimistic value of n is to be expected.
2. As θ = E(f(X1)) is unknown, it is very likely that σ² is unknown as well.
Second solution: consider the asymptotic regime n → ∞ and use the CLT.
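As an order of magnitude (illustrative values): with σ² = 1, δ = 0.01 and α = 0.05, the Chebyshev bound requires n ≥ σ²/(α δ²) = 1/(0.05 × 10⁻⁴) = 200 000 samples, whereas the CLT-based interval of the next slides only requires aσ/√n ≤ δ, i.e. n ≥ (1.96/0.01)² ≈ 38 400.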
Asymptotic confidence intervals (1/2)
By the CLT,

    √n (θ̂n − θ) = (1/√n) ∑_{i=1}^n (f(Xi) − θ) →L N(0, σ²)

Select a such that

    (1/√(2π)) ∫_{−a}^{a} e^{−s²/2} ds = 1 − α

In other words, a is the standard Gaussian quantile of order 1 − α/2. We obtain:

    P( θ ∈ [θ̂n − aσ/√n, θ̂n + aσ/√n] ) → 1 − α

The interval θ̂n ± aσ/√n is called a 100(1 − α)% asymptotic confidence interval.
Example: for a 95% confidence interval, set a ≈ 1.96.
Remark: the computation of this interval requires the knowledge of σ² = Var(f(X1)).
Again, as θ = E(f(X1)) is unknown, it is very likely that σ² is unknown as well.
Can we still compute a confidence interval?
Asymptotic confidence intervals (2/2)
The idea is to replace the unknown variance by its MC estimate:

    σ̂n² = (1/n) ∑_{i=1}^n (f(Xi) − θ̂n)²

By the LLN (and the theorem of continuity), σ̂n² →a.s. σ². In particular, σ̂n² →L σ².
Write

    (√n/σ̂n)(θ̂n − θ) = (√n/σ)(θ̂n − θ) · √(σ²/σ̂n²)

Thus, by Slutsky's Lemma and the CLT,

    (√n/σ̂n)(θ̂n − θ) →L N(0, 1)

Conclusion: the interval θ̂n ± a σ̂n/√n is a 100(1 − α)% asymptotic confidence interval.
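A minimal sketch putting the two slides together (f = cos and X ~ N(0,1) are illustrative choices, as above):

    % 95% asymptotic confidence interval for theta = E[f(X)] (illustrative: f = cos, X ~ N(0,1))
    n = 1e5; a = 1.96;                            % quantile of order 1 - alpha/2 for alpha = 0.05
    fX = cos(randn(n,1));
    theta_hat = mean(fX);                         % MC estimate of theta
    sigma_hat = std(fX, 1);                       % sqrt of the MC estimate of Var f(X_1) (1/n version)
    ci = [theta_hat - a*sigma_hat/sqrt(n), theta_hat + a*sigma_hat/sqrt(n)]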
Delta Method
Theorem: consider a sequence of random variables (θ̂n)n≥1 on R^d which satisfies

    √n (θ̂n − θ) →L Y

where θ ∈ R^d and Y is a r.v. on R^d (typically, a Gaussian r.v.).
Let g : R^d → R^d be a function differentiable at the point θ. Then,

    √n (g(θ̂n) − g(θ)) →L ∇g(θ) Y

where ∇g(θ) is the Jacobian matrix of g at the point θ.
Delta Method: example of application
Let

    X1, X2, ... ~ E(λ)  (i.i.d.)

where E(λ) is the exponential distribution with parameter λ > 0, i.e. with density λ e^{−λx} 1_{R+}(x).
Problem: estimate λ.
Recall that E X1 = 1/λ. Thus, X̄n = (1/n) ∑_{i=1}^n Xi is the standard MC estimator of 1/λ.
A natural estimate of λ is therefore:

    λ̂n = 1 / X̄n

1. λ̂n is a consistent estimator of λ.
2. By the CLT, √n (X̄n − 1/λ) →L N(0, 1/λ²). Thus, by the Delta method,

    √n (λ̂n − λ) →L N(0, λ²)

3. Can you find an asymptotic 100(1 − α)% confidence interval?
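One possible answer to question 3, as a sketch (λ = 2 is an illustrative value; −log(U)/λ with U ~ U(0,1) samples E(λ)):

    % Delta method: estimate lambda and a 95% asymptotic confidence interval (illustrative lambda = 2)
    n = 1e5; lambda = 2; a = 1.96;
    X = -log(rand(n,1)) / lambda;                 % X_i ~ E(lambda) by inversion
    lambda_hat = 1 / mean(X);
    % sqrt(n)*(lambda_hat - lambda) -> N(0, lambda^2); plugging in lambda_hat for the unknown lambda:
    ci = [lambda_hat * (1 - a/sqrt(n)), lambda_hat * (1 + a/sqrt(n))]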
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Deterministic methods
To compute ∫_I f(x) dx, why not use traditional quadrature methods?
Quadrature methods consist in the approximation

    ∫_I f(x) dx ≈ ∑_{j=0}^{n} wj f(xj)        (1)

where (x0, x1, ..., xn) are deterministic points (the grid) and the wj are weights.
- Newton-Cotes quadrature: regular grid
- Gauss quadrature: irregular grid chosen as the roots of a well-chosen polynomial of degree n + 1
The most famous Newton-Cotes quadrature rule is the trapezoidal rule.
We refer to [Stoer-Bulirsch, 2002] for more details about deterministic numerical integration methods.
The trapezoidal rule in dimension 1
Example: approximate θ = ∫_0^1 f(x) dx
Idea: approximate f by a piecewise affine function:

    ∫_{(k−1)/n}^{k/n} f(x) dx ≈ [ f((k−1)/n) + f(k/n) ] / (2n)

The integral θ can then be approximated by:

    In = [ f(0) + f(1) ] / (2n) + (1/n) ∑_{k=1}^{n−1} f(k/n)

This requires (n + 1) evaluations of f.
Assuming f is C², we obtain the following control of the error:

    |In − θ| ≤ (1 / (12 n²)) sup_{x∈[0,1]} |f''(x)|

Conclusion: in dimension 1, the trapezoidal rule outperforms the MC estimate.
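A small sketch comparing the two approaches on a toy integral (f(x) = cos(x) on [0,1] is an illustrative choice; the exact value is sin(1)):

    % Trapezoidal rule vs. standard MC for theta = int_0^1 cos(x) dx = sin(1)
    f = @(x) cos(x);
    n = 1000;
    I_trap = (f(0) + f(1)) / (2*n) + sum(f((1:n-1)/n)) / n;   % (n+1) evaluations of f
    I_mc   = mean(f(rand(n+1,1)));                            % same number of evaluations
    errors = [abs(I_trap - sin(1)), abs(I_mc - sin(1))]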
The trapezoidal rule in dimension d
Approximate θ = ∫_0^1 ··· ∫_0^1 f(x1, ..., xd) dx1 ··· dxd
Idea: by Fubini's theorem, the multiple integral can be written as repeated one-dimensional integrals → apply the trapezoidal rule to each of them.
The trapezoidal approximation has the form:

    In = (1/n^d) ∑_{j1,...,jd=0}^{n} w_{j1,...,jd} f(j1/n, ..., jd/n)

- This requires N = (n + 1)^d evaluations of f.
- Similarly to the 1-D case, one can show that

    In = θ + O(n^{−2}) = θ + O(N^{−2/d})

For a fixed computational budget N, the error bound O(N^{−2/d}) deteriorates quickly as the dimension d increases.
N.B.: more efficient methods than the trapezoidal rule do exist (Simpson's rule, Gauss quadrature), but they remain sensitive to the value of d.
Comparison with Monte-Carlo methods
The standard MC estimator of θ = ∫ f(x) dµ(x) is

    θ̂n = (1/n) ∑_{i=1}^n f(Xi)

- Main drawback: the estimation error is random. One can only ensure that the approximation error of a MC run is small with high probability.
- Main asset: by the LLN the estimation error converges a.s. to zero, and by the CLT it does so at speed 1/√n, for any dimension d.
  To divide the standard deviation of the error by a factor 2, multiply the number of samples by 4.
Thus, MC methods are expected to perform well even for integrals over high-dimensional spaces.
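A sketch in dimension d = 10 (the product of cosines is an illustrative integrand whose exact integral over [0,1]^d is sin(1)^d); note that even a coarse regular grid with n = 2, i.e. three points per axis, would already require 3^10 = 59 049 evaluations of f:

    % Standard MC in dimension d = 10: the 1/sqrt(n) rate does not depend on d
    d = 10; n = 1e5;
    f = @(x) prod(cos(x), 2);                     % exact integral over [0,1]^d is sin(1)^d
    X = rand(n, d);                               % n i.i.d. uniform points in [0,1]^d
    theta_hat = mean(f(X));
    err = abs(theta_hat - sin(1)^d)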
A remark on Quasi Monte-Carlo methods
Principle: approximate θ = ∫_0^1 ··· ∫_0^1 f(x1, ..., xd) dx1 ··· dxd by

    (1/n) ∑_{i=1}^n f(xi)

where x1, x2, ... is a well-chosen deterministic sequence in [0, 1]^d.
- Despite their name, QMC methods are deterministic methods → out of the scope of this course.
- When d ≥ 2, the regular grid is not very efficient. More accurate approximations can be obtained by using sequences (xi)i≥1 with good algebraic properties (low-discrepancy sequences).
  If f is C^k, the error is a O((log n)^d / n).
- The error depends on the dimension d. QMC methods can be a good choice when d is moderate (say d ≈ 15).
- N.B.: it is generally difficult to obtain tight bounds on the approximation error.
Outline
About this week
Introduction to MC methods
Convergence of random variables
Confidence intervals
Approximation of integrals: Monte-Carlo versus deterministic
Some applications
Evaluating the price of an option (1/2)
- Let St be the value of an asset at time t → (St)t∈R+ is a stochastic process.
  At t = 0, S0 is the present (known) value of the asset.
- A European call gives the right to buy the asset at time T (the maturity) at a given price K.
- At maturity T, if ST ≥ K the benefit is ST − K. Otherwise, the option is worthless.
In order to determine the price of the option, one has to compute

    p = E[(ST − K)+]

The above value depends on the probabilistic model underlying St.
- In the Black-Scholes model (1973), ST follows a log-normal distribution → in this case, p admits a simple closed-form expression (no need for MC methods).
- This is no longer the case in more involved models.
Asian option:

    p = E[ ( (1/N) ∑_{i=1}^N S_{ti} − K )+ ]

- No explicit formula (even in the Black-Scholes model).
Evaluating the price of an option (2/2)
Algorithm (case of an Asian option); a runnable version under an explicit model is sketched below:

    estim = 0;
    for j = 1:n
        generate S = [S_{t1}, ..., S_{tN}]
        estim = estim + max(0, mean(S) - K);
    end
    estim = estim / n;
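A possible runnable version, assuming Black-Scholes dynamics (geometric Brownian motion) and illustrative parameter values — not necessarily the model used in the Labs:

    % MC estimate of an Asian option payoff expectation, assuming Black-Scholes dynamics
    S0 = 100; K = 100; r = 0.05; sigma = 0.2; T = 1;          % illustrative parameters
    N = 50; n = 1e5; dt = T / N;
    estim = 0;
    for j = 1:n
        W = cumsum(sqrt(dt) * randn(1, N));                   % Brownian motion on the grid t_i = i*dt
        S = S0 * exp((r - sigma^2/2) * dt * (1:N) + sigma * W);   % S_{t_1}, ..., S_{t_N}
        estim = estim + max(0, mean(S) - K);
    end
    p_hat = estim / n        % in practice a discount factor exp(-r*T) is also applied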
Evaluating the probability of rare events - Example: ruin probability (1/2)
- An insurance company earns premiums at rate r per unit of time.
- It pays claims at random time instants Ti. The amount of the i-th claim is Yi.
- The net gain of the company at time t is

      Xt = r t − ∑_{i=1}^{Nt} Yi

  where Nt is the number of claims in [0, t].
- Denote by R the indicator of ruin:

      R = 1 if ∃ t > 0 such that Xt < −m, and R = 0 otherwise,

  where m is the initial capital.
- Objective: what is the probability of ruin θ = P(R = 1)?
- Note: the actual event of ruin is expected to be rare, say for instance θ ∼ 10^{−20}.
Evaluating the probability of rare events - Example: ruin probability (2/2)
At first glance, a naive MC estimate of θ = P(R = 1) would be

    θ̂n = (1/n) ∑_{i=1}^n Ri

where R1, ..., Rn are i.i.d. with the same distribution as R.
Two issues:
- The event R = 1 depends on the whole path of the process Xt, t ∈ R+:
  it is not obvious how to generate the samples R1, ..., Rn.
- Even if we were able to simulate R1, ..., Rn, the number n of samples needed to obtain an accurate estimate of θ would be huge.
Idea: change the law under which the random variables are simulated.
This is called importance sampling (cf. Tuesday's lecture).
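A toy illustration of the second issue (the Gaussian tail event is purely illustrative, not the ruin problem itself): with θ = P(X > 6) ≈ 10⁻⁹ for X ~ N(0,1), a naive MC run with a million samples almost surely returns 0.

    % Naive MC for a rare event: with high probability, no sample hits the event
    n = 1e6;
    theta_hat = sum(randn(n,1) > 6) / n    % P(X > 6) ~ 1e-9, so theta_hat = 0 on almost every run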