9. Monte Carlo Methods

Monte Carlo (MC) methods use repeated random sampling to solve computational problems for which no affordable deterministic algorithm exists. They are most often used in physics, chemistry, finance and risk analysis, and engineering. MC is typically applied to high dimensional problems where a lower dimensional approximation is inaccurate. Example: an n year mortgage paid once a month requires a risk analysis in 12n dimensions; a 30 year mortgage is a 360 dimensional problem. Integration by quadrature rules above about 8 dimensions is impractical.

The main drawback is that statistical errors are added to the systematic errors. The two error types have to be balanced intelligently, which is not always easy or obvious.

Short history: MC methods have been used since 1777, when the Comte de Buffon and Laplace each solved problems using random sampling. In the 1930's Enrico Fermi used MC to estimate what laboratory experiments would show for neutron transport in fissile materials. Metropolis and Ulam first called the method MC in the 1940's. In the 1950's MC was expanded to use any probability distribution, not just the Gaussian. In the 1960's and 1970's, quantum MC and variational MC methods were developed.

MC simulations: the problem being solved is stochastic and the MC method mimics its stochastic properties well. Example: neutron transport and decay in a nuclear reactor.

MC calculations: the problem is not stochastic, but it is solved using a stochastic MC method. Example: high dimensional integration.

Quick review of probability

An event B is a set of possible outcomes and has probability Pr(B). The set of all outcomes is denoted by Ω and a particular outcome by ω; hence B ⊆ Ω. Suppose B, C ⊆ Ω. Then B ∩ C is the event consisting of outcomes in both B and C, and B ∪ C is the event of outcomes in B or C. Some axioms of probability are
1. Pr(B) ∈ [0,1]
2. Pr(A ∪ B) = Pr(A) + Pr(B) if A ∩ B = ∅
3. B ⊆ C ⇒ Pr(B) ≤ Pr(C)
4. Pr(Ω) = 1

The conditional probability that a C outcome is also a B outcome is given by Bayes' formula,

    Pr(B|C) = Pr(B ∩ C) / Pr(C).

Frequently we already know both Pr(B|C) and Pr(C) and use Bayes' formula to calculate Pr(B ∩ C). Events B and C are independent if Pr(B ∩ C) = Pr(B) Pr(C), which implies Pr(B) = Pr(B|C).

If Ω is either finite or countable, we call Ω discrete. In this case we can specify the probabilities of all possible outcomes as f_k = Pr(ω = ω_k), and an event B has probability

    Pr(B) = Σ_{ω_k ∈ B} Pr(ω_k) = Σ_{ω_k ∈ B} f_k.

A discrete random variable is a number X(ω) that depends on the random outcome ω. For example, in coin tossing X(ω) could be the number of heads or tails that came up. With x_k = X(ω_k), define the expected value by

    E[X] = Σ_{ω ∈ Ω} X(ω) Pr(ω) = Σ_k x_k f_k.

The probability distribution of a continuous random variable is described by a probability density function (PDF) f(x). If X ∈ R^n and B ⊆ R^n, then

    Pr(B) = ∫_{x ∈ B} f(x) dx  and  E[X] = ∫_{R^n} x f(x) dx.

The variance in 1D is given by

    σ² = var(X) = E[(X − E[X])²] = ∫ (x − E[X])² f(x) dx.

The notation is identical for discrete and continuous random variables. In 2 or more dimensions there is a symmetric n × n variance/covariance matrix

    C = E[(X − E[X])(X − E[X])^T],

whose matrix elements are C_jk = E[(X_j − E[X_j])(X_k − E[X_k])] = cov[X_j, X_k]. The covariance matrix is positive semidefinite.

Common random variables

The standard uniform random variable U has probability density

    f(u) = 1 if 0 ≤ u ≤ 1, and 0 otherwise.

We can create a uniform random variable on [a,b] by Y = (b − a)U + a. The PDF of Y is

    g(y) = 1/(b − a) if a ≤ y ≤ b, and 0 otherwise.
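As a concrete illustration, here is a minimal C sketch of this transformation. The POSIX generator drand48() and the seed value are stand-ins for the notes' rng() placeholder, not an endorsement; any good uniform generator on [0,1) would do.

    /* Sample Y = (b - a)U + a, uniform on [a,b].
       drand48() is only a stand-in for a good uniform rng(). */
    #include <stdio.h>
    #include <stdlib.h>

    double uniform_ab(double a, double b) {
        return (b - a) * drand48() + a;   /* Y = (b - a)U + a */
    }

    int main(void) {
        srand48(12345);                   /* seed the stand-in generator */
        double a = 2.0, b = 5.0, sum = 0.0;
        int n = 1000000;
        for (int i = 0; i < n; i++)
            sum += uniform_ab(a, b);
        printf("sample mean = %f, expected (a+b)/2 = %f\n", sum / n, (a + b) / 2);
        return 0;
    }

The sample mean should come out near (a + b)/2 = 3.5, a quick sanity check of the kind recommended under "Testing samplers" below.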
The exponential random variable T with rate constant λ > 0 has PDF

    f(t) = λ e^{−λt} if t ≥ 0, and 0 otherwise.

The standard normal is denoted by Z and has PDF

    f(z) = (1/√(2π)) e^{−z²/2}.

The general normal with mean μ and variance σ² is given by X = σZ + μ and has PDF

    f(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.

We write X ~ N(μ, σ²) in this case; the standard normal is X ~ N(0,1).

If an n component random variable X is a multivariate normal with mean μ and covariance C, then it has probability density

    f(x) = (1/Z) e^{−(x−μ)^T C^{−1} (x−μ)/2},  where  Z = (2π)^{n/2} √(det C).

Multivariate normals possess a linear transformation property: suppose L is an m × n matrix with rank m, so L: R^n → R^m is onto. If X ∈ R^n is multivariate normal, then Y = LX ∈ R^m is also multivariate normal, and its covariance matrix is C_Y = L C_X L^T (assuming the mean is μ = 0).

Finally, there are two probability laws/theorems that are crucial to believing that MC is relevant to any problem:
1. Law of large numbers
2. Central limit theorem

Law of large numbers: suppose A = E[X] and {X_k | k = 1, 2, ...} are independent samples of X. The n sample approximation of A is

    Â_n = (1/n) Σ_{k=1}^n X_k → A as n → ∞.

All estimators satisfying the law of large numbers are called consistent.

Central limit theorem: if σ² = var[X], then

    R_n = Â_n − A ≈ N(0, σ²/n).

Hence, recalling that A is not random,

    var(R_n) = var(Â_n) = (1/n) var(X).

The estimator Â_n is unbiased, since E[Â_n] = A, and by the law of large numbers it is consistent. The central limit theorem follows from the independence of the X_k: when n is large enough, R_n is approximately normal, independent of the distribution of X, as long as var[X] < ∞.

Random number generators

Beware simple random number generators. For example, never, ever use the UNIX/Linux function rand: it repeats much too quickly. The function random repeats less frequently, but is not useful for parallel computing. Matlab has a very good random number generator that is operating system independent. Look for the digital codes developed 20 years ago by Michael Mascagni for good parallel random number generators; these are still the state of the art today. However, the best generators are analog: they measure the deviations in the electrical line over time and normalize them to the interval [0,1]. Some CPUs provide a hardware instruction for sampling this deviation. These are the only true random number generators available on computers. The Itanium2 CPU line has this built in, and some other chips do too, but operating systems that expose the instruction are hard to find.

Sampling

A simple sampler produces an independent sample of X each time it is called. The simple sampler turns standard uniforms into samples of some other random variable. MC codes spend almost all of their time in the sampler, so optimizing the sampler code to reduce its execution time can have a profound effect on the overall run time of the MC computation. In the discussion below, rng() is a good random number generator.

Bernoulli coin tossing

A Bernoulli random variable with parameter p is a random variable X with Pr(X = 1) = p and Pr(X = 0) = 1 − p. If U is a standard uniform, then p = Pr(U ≤ p), so we can sample X using the code fragment

    if ( rng() <= p ) X = 1; else X = 0;

For a random variable with a finite number of values, Pr(X = x_k) = p_k with Σ p_k = 1, we sample by dividing the unit interval into subintervals of length p_k. This works well with Markov chains.

Exponential

If U is a standard uniform, then T = −λ^{−1} ln(U) is an exponential with rate parameter λ (units 1/time). Since 0 < U < 1, ln(U) < 0 and T > 0. We can sample T with the code fragment

    T = -(1/lambda)*log(rng());

The PDF of the resulting T is f(t) = λ e^{−λt} for t ≥ 0.
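A fuller, runnable version of this sampler is sketched below, again with drand48() merely standing in for rng(). The guard against U = 0 avoids log(0), and the sample mean is checked against E[T] = 1/λ.

    /* Sketch of the exponential sampler T = -(1/lambda) ln(U).
       drand48() is only a stand-in for a good rng(). */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    double sample_exponential(double lambda) {
        double u;
        do { u = drand48(); } while (u <= 0.0);  /* avoid log(0) */
        return -log(u) / lambda;                 /* T > 0 since 0 < u < 1 */
    }

    int main(void) {
        srand48(777);
        double lambda = 2.0, sum = 0.0;
        int n = 1000000;
        for (int i = 0; i < n; i++)
            sum += sample_exponential(lambda);
        printf("sample mean = %f, expected 1/lambda = %f\n", sum / n, 1.0 / lambda);
        return 0;
    }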
Cumulative density function (CDF)

Suppose X is a one component random variable with PDF f(x). Then the CDF is

    F(x) = Pr(X ≤ x) = ∫_{x' ≤ x} f(x') dx'.

We know that 0 ≤ F(x) ≤ 1 for all x, and for any u ∈ [0,1] there is an x such that F(x) = u. The simple sampler is then
1. Choose U = rng()
2. Find X such that F(X) = U

Note that step 2 can be quite difficult and time consuming; good programming reduces the time. There is no elementary formula for the cumulative normal N(z), but there is software available to compute it to approximately double precision. The inverse cumulative normal z = N^{−1}(u) can also be approximated.

The Box Muller method

We can generate two independent standard normals from two independent standard uniforms using the formulas

    R   = √(−2 ln(U_1))
    Θ   = 2π U_2
    Z_1 = R cos(Θ)
    Z_2 = R sin(Θ)

We can make N independent standard normals by making N standard uniforms and then using them in pairs to make N/2 pairs of independent standard normals.

Multivariate normals

Let X ∈ R^n be a multivariate normal random variable with mean 0 and covariance matrix C. We sample X using the Cholesky factorization C = L L^T, where L is lower triangular. Let Z ∈ R^n be a vector of n independent standard normals generated by the Box Muller method (or similar). Then cov[Z] = I, and if X = LZ, then X is multivariate normal with cov[X] = L I L^T = C.

There are many more sampling methods that can be studied, e.g., rejection sampling.

Testing samplers

All scientific software should be presumed wrong until demonstrated to be correct. Simple 1D samplers are tested using tables and histograms.

Errors

Estimating the error in an MC calculation is straightforward, and normally a result produced by an MC method is reported together with an error estimate. Suppose X is a scalar random variable and approximate A = E[X] by Â_n = (1/n) Σ_{k=1}^n X_k. The central limit theorem states that

    R_n = Â_n − A ≈ σ_n Z,

where σ_n is the standard deviation of Â_n and Z ~ N(0,1). It can be shown that σ_n = σ/√n, where σ² = var[X] = E[(X − A)²], which we estimate using

    σ̂² = (1/n) Σ_{k=1}^n (X_k − Â_n)²,  then take  σ̂_n = σ̂/√n.

Since Z is of order 1, R_n is of order σ̂_n. We typically report the MC data as A = Â_n ± σ̂_n. In a plot, each value can be drawn as a circle with a line through it, the (standard deviation) error bar. More generally, the k standard deviation error bar [Â_n − k σ̂_n, Â_n + k σ̂_n] gives a confidence interval. The central limit theorem can be used to show that

    Pr(A ∈ [Â_n − σ̂_n, Â_n + σ̂_n]) ≈ 66%  and  Pr(A ∈ [Â_n − 2σ̂_n, Â_n + 2σ̂_n]) ≈ 95%.

It is common in MC to report the one standard deviation error bar. To interpret such data correctly, one must remember that the true value lies outside this error bar about one-third of the time.

Integration (quadrature)

We want to approximate a d dimensional integral to an accuracy of ε > 0, using N quadrature points. Consider Simpson's rule for a function f: R^d → R: its error behaves like ε ∝ N^{−4/d}, which degrades rapidly as d grows. MC integration can be done so that ε ∝ N^{−1/2} independent of d, as long as the variance of the integrand is finite.

MC integration

Let V be the domain of integration and define I(f) = ∫_V f(x) dx. For uniform samples x_i ∈ V, let

    ⟨f⟩ = (1/N) Σ_{i=1}^N f(x_i)  and  ⟨f²⟩ = (1/N) Σ_{i=1}^N f²(x_i).

Then, taking V to have unit volume (otherwise multiply by the volume of V),

    I(f) = ⟨f⟩ ± √( (⟨f²⟩ − ⟨f⟩²) / (N − 1) ).
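To make this concrete, here is a minimal C sketch that estimates a d = 10 dimensional integral over the unit cube with a one standard deviation error bar, using exactly the ⟨f⟩ and ⟨f²⟩ formulas above. The integrand f(x) = x_1² + ... + x_d² is a made-up test function chosen because its exact integral is d/3, and drand48() again stands in for rng().

    /* Sketch: MC estimate of I(f) over [0,1]^D with a one standard
       deviation error bar. The integrand is a test function only. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define D 10   /* dimension */

    double f(const double x[D]) {
        double s = 0.0;
        for (int j = 0; j < D; j++) s += x[j] * x[j];
        return s;
    }

    int main(void) {
        srand48(2024);
        long N = 1000000;
        double sum = 0.0, sum2 = 0.0, x[D];
        for (long i = 0; i < N; i++) {
            for (int j = 0; j < D; j++) x[j] = drand48();  /* uniform in [0,1]^D */
            double fx = f(x);
            sum  += fx;
            sum2 += fx * fx;
        }
        double mean  = sum / N;    /* <f>   */
        double mean2 = sum2 / N;   /* <f^2> */
        double sigma = sqrt((mean2 - mean * mean) / (N - 1));
        /* |V| = 1 here, so I(f) is reported as <f> +/- sigma */
        printf("I(f) = %f +/- %f (exact %f)\n", mean, sigma, D / 3.0);
        return 0;
    }

Doubling N should shrink the error bar by a factor of about √2, the N^{−1/2} behavior described above, independent of D.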