Numerical Methods for Finance
Dr Robert Nürnberg

This course introduces the major numerical methods needed for quantitative work in finance. To this end, the course strikes a balance between a general survey of significant numerical methods that anyone working in a quantitative field should know, and a detailed study of some numerical methods specific to financial mathematics. The first part covers e.g. linear and nonlinear equations, interpolation and optimization, while the second part introduces e.g. binomial and trinomial methods, finite difference methods, Monte Carlo simulation, random number generators, option pricing and hedging. The assessment consists of an exam (80%) and a project (20%).

1. References

1. Burden and Faires (2004), Numerical Analysis.
2. Clewlow and Strickland (1998), Implementing Derivative Models.
3. Fletcher (2000), Practical Methods of Optimization.
4. Glasserman (2004), Monte Carlo Methods in Financial Engineering.
5. Higham (2004), An Introduction to Financial Option Valuation.
6. Hull (2005), Options, Futures, and Other Derivatives.
7. Kwok (1998), Mathematical Models of Financial Derivatives.
8. Press et al. (1992), Numerical Recipes in C. (online)
9. Press et al. (2002), Numerical Recipes in C++.
10. Seydel (2006), Tools for Computational Finance.
11. Wilmott, Dewynne, and Howison (1993), Option Pricing.
12. Winston (1994), Operations Research: Applications and Algorithms.

2. Preliminaries

1. Algorithms. An algorithm is a set of instructions to construct an approximate solution to a mathematical problem. A basic requirement for an algorithm is that the error can be made as small as we like. Usually, the higher the accuracy we demand, the greater the amount of computation required. An algorithm is convergent if it produces a sequence of values which converge to the desired solution of the problem.

Example: Given c > 1 and ε > 0, use the bisection method to seek an approximation to √c with error not greater than ε.

Example. Find x = √c, c > 1 constant.
Answer. x = √c ⟺ x² = c ⟺ f(x) := x² − c = 0.
f(1) = 1 − c < 0 and f(c) = c² − c > 0 ⇒ ∃ x̄ ∈ (1, c) s.t. f(x̄) = 0.
Moreover, f′(x) = 2x > 0, so f is monotonically increasing and x̄ is unique.
Denote I_n := [a_n, b_n] with I_0 = [a_0, b_0] = [1, c], and let x_n := (a_n + b_n)/2.
(i) If f(x_n) = 0 then x̄ = x_n.
(ii) If f(a_n) f(x_n) > 0 then x̄ ∈ (x_n, b_n); let a_{n+1} := x_n, b_{n+1} := b_n.
(iii) If f(a_n) f(x_n) < 0 then x̄ ∈ (a_n, x_n); let a_{n+1} := a_n, b_{n+1} := x_n.
Length of I_n: m(I_n) = ½ m(I_{n−1}) = ··· = (1/2^n) m(I_0) = (c − 1)/2^n.
Algorithm: stop if m(I_n) < ε and let x* := x_n.
Error as small as we like? x̄, x* ∈ I_n ⇒ error |x* − x̄| = |x_n − x̄| ≤ m(I_n) → 0 as n → ∞. ✓
Convergence? I_0 ⊃ I_1 ⊃ ··· ⊃ I_n ⊃ ··· ⇒ ∃! x̄ ∈ ⋂_{n=0}^∞ I_n with f(x̄) = 0, i.e. x̄ = √c. ✓
Implementation: There is no need to store the intervals I_n = [a_n, b_n]; it is sufficient to store only 3 points throughout. Suppose x̄ ∈ (a, b) and define x := (a + b)/2. If x̄ ∈ (a, x) let a := a, b := x; otherwise let a := x, b := b.
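The bisection example above translates directly into code. Below is a minimal C++ sketch (not part of the original notes); the function name, the tolerance handling and the test value printed in main are illustrative assumptions.

```cpp
#include <cmath>
#include <iostream>

// Minimal bisection sketch for f(x) = x*x - c on [1, c], c > 1.
// Stops once the bracketing interval is shorter than eps.
double bisect_sqrt(double c, double eps)
{
    double a = 1.0, b = c;                 // f(a) < 0 < f(b)
    while (b - a >= eps) {
        double x  = 0.5 * (a + b);         // midpoint
        double fx = x * x - c;
        if (fx == 0.0) return x;           // exact root found
        if (fx < 0.0)  a = x;              // root lies in (x, b)
        else           b = x;              // root lies in (a, x)
    }
    return 0.5 * (a + b);
}

int main()
{
    std::cout << bisect_sqrt(7.0, 1e-7) << "\n";   // approx 2.6457513
    return 0;
}
```

Each iteration halves the bracketing interval, in line with m(I_n) = (c − 1)/2^n above, so roughly log₂((c − 1)/ε) iterations are required.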
2. Errors. There are various errors in computed solutions, such as
• discretization error (discrete approximation to continuous systems),
• truncation error (termination of an infinite process), and
• rounding error (finite digit limitation in computer arithmetic).
If a is a number and ã is an approximation to a, then the absolute error is |a − ã| and the relative error is |a − ã| / |a|, provided a ≠ 0.

Example: Discuss sources of errors in deriving the numerical solution of the nonlinear differential equation x′ = f(x) on the interval [a, b] with initial condition x(a) = x0.

Answer.
Discretization error: the differential equation x′ = f(x) is replaced by the difference equation (x(t + h) − x(t))/h = f(x(t)). Hence DE = | (x(t + h) − x(t))/h − x′(t) |.
Truncation error: if lim_{n→∞} x_n = x, we approximate x with x_N, N a large number. Hence TE = |x − x_N|.
Rounding error: we cannot express x exactly, due to the finite digit limitation, and obtain x̂ instead. Hence RE = |x − x̂|.
Total error = DE + TE + RE.

3. Well/Ill Conditioned Problems. A problem is well-conditioned (or ill-conditioned) if every small perturbation of the data results in a small (or large) change in the solution.

Example: Show that the solution to the equations x + y = 2 and x + 1.01 y = 2.01 is ill-conditioned.

Exercise: Show that the following problems are ill-conditioned: (a) the solution to the differential equation x″ − 10 x′ − 11 x = 0 with initial conditions x(0) = 1 and x′(0) = −1, (b) the value of q(x) = x² + x − 1150 if x is near a root.

Answer. Change 2.01 to 2.02:
x + y = 2, x + 1.01 y = 2.01 ⇒ x = 1, y = 1;
x + y = 2, x + 1.01 y = 2.02 ⇒ x = 0, y = 2.
I.e. a 0.5% change in the data produces a 100% change in the solution: ill-conditioned!
[Reason: det [1 1; 1 1.01] = 0.01 ⇒ the matrix is nearly singular.]

4. Taylor Polynomials. Suppose f, f′, . . . , f^(n) are continuous on [a, b] and f^(n+1) exists on (a, b). Let x0 ∈ [a, b]. Then for every x ∈ [a, b] there exists a ξ between x0 and x with
f(x) = Σ_{k=0}^{n} f^(k)(x0)/k! (x − x0)^k + R_n(x),
where R_n(x) = f^(n+1)(ξ)/(n + 1)! (x − x0)^{n+1} is the remainder.
[Equivalently: R_n(x) = ∫_{x0}^{x} f^(n+1)(t)/n! (x − t)^n dt.]
Examples:
• exp(x) = Σ_{k=0}^{∞} x^k / k!
• sin(x) = Σ_{k=0}^{∞} (−1)^k x^{2k+1} / (2k + 1)!

5. Gradient and Hessian Matrix. Assume f : R^n → R. The gradient of f at a point x, written as ∇f(x), is a column vector in R^n with ith component ∂f/∂x_i (x).
The Hessian matrix of f at x, written as ∇²f(x), is an n × n matrix with (i, j)th component ∂²f/(∂x_i ∂x_j) (x). [As ∂²f/(∂x_i ∂x_j) = ∂²f/(∂x_j ∂x_i), ∇²f(x) is symmetric.]
Examples:
• f(x) = a^T x, a ∈ R^n ⇒ ∇f = a, ∇²f = 0.
• f(x) = ½ x^T A x, A symmetric ⇒ ∇f(x) = A x, ∇²f = A.
• f(x) = exp(½ x^T A x), A symmetric ⇒ ∇f(x) = exp(½ x^T A x) A x, ∇²f(x) = exp(½ x^T A x) A x x^T A + exp(½ x^T A x) A.

6. Taylor's Theorem. Suppose that f : R^n → R is continuously differentiable and that p ∈ R^n. Then we have
f(x + p) = f(x) + ∇f(x + t p)^T p for some t ∈ (0, 1).
Moreover, if f is twice continuously differentiable, we have
∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + t p) p dt,
and
f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + t p) p for some t ∈ (0, 1).

7. Positive Definite Matrices. An n × n matrix A = (a_ij) is positive definite if it is symmetric (i.e. A^T = A) and
x^T A x > 0 for all x ∈ R^n \ {0}. [I.e. x^T A x ≥ 0 with "=" only if x = 0.]
The following statements are equivalent:
(a) A is a positive definite matrix,
(b) all eigenvalues of A are positive,
(c) all leading principal minors of A are positive.
The leading principal minors of A are the determinants Δ_k, k = 1, 2, . . . , n, defined by
Δ_1 = det[a_11], Δ_2 = det [a_11 a_12; a_21 a_22], . . . , Δ_n = det A.
A matrix A is symmetric and positive semi-definite if it is symmetric and x^T A x ≥ 0 for all x ∈ R^n.
Exercise. Show that any positive definite matrix A has only positive diagonal entries.

8. Convex Sets and Functions.
A set S ⊂ Rn is a convex set if the straight line segment connecting any two points in S lies entirely inside S, i.e., for any two points x, y ∈ S we have α x + (1 − α) y ∈ S ∀ α ∈ [0, 1]. A function f : D → R is a convex function if its domain D ⊂ Rn is a convex set and if for any two points x, y ∈ D we have f (α x + (1 − α) y) ≤ αf (x) + (1 − α) f (y) ∀ α ∈ [0, 1]. Exercise. Let D ⊂ Rn be a convex, open set. (a) If f : D → R is continuously differentiable, then f is convex if and only if f (y) ≥ f (x) + ∇f (x)T (y − x) ∀ x, y ∈ D . (b) If f is twice continuously differentiable, then f is convex if and only if ∇2f (x) is positive semi-definite for all x in the domain. 15 Exercise 2.8. (a) “⇒” As f is convex we have for any x, y in the convex set D that f (α y + (1 − α) x) ≤ αf (y) + (1 − α) f (x) ∀ α ∈ [0, 1]. Hence f (x + α (y − x)) − f (x) + f (x) . α Letting α → 0 yields f (y) ≥ f (x) + ∇f (x)T (y − x). “⇐” For any x1, x2 ∈ D and λ ∈ [0, 1] let x := λ x1 + (1 − λ) x2 ∈ D and y := x1. On noting that y − x = x1 − λ x1 − (1 − λ) x2 = (1 − λ) (x1 − x2) we have that f (y) ≥ f (x1) = f (y) ≥ f (x) + ∇f (x)T (y − x) = f (x) + (1 − λ) ∇f (x)T (x1 − x2) . (†) Similarly, letting x := λ x1 + (1 − λ) x2 and y := x2 gives, on noting that y − x = λ (x2 − x1), that f (x2) ≥ f (x) + λ ∇f (x)T (x2 − x1) . (‡) 16 Combining λ ∙ (†) + (1 − λ) ∙ (‡) gives λ f (x1) + (1 − λ) f (x2) ≥ f (x) = f (λ x1 + (1 − λ) x2 ⇒ f is convex. (b) “⇐” For any x, x0 ∈ D use Taylor’s theorem at x0: 1 T f (x) = f (x0)+∇f (x0) (x−x0)+ (x−x0)T ∇2f (x0 +θ (x−x0)) (x−x0) 2 As ∇2f is positive semi-definite, this immediately gives f (x) ≥ f (x0) + ∇f (x0)T (x − x0) ⇒ θ ∈ (0, 1) f is convex. “⇒” Assume ∇2f is not positive semi-definite in the domain D. Then there exists x0 ∈ D and x̂ ∈ Rn s.t. x̂T ∇2f (x0) x̂ < 0. As D is open we can find x1 := x0 + α x̂ ∈ D, for α > 0 sufficiently small. Hence (x1 − x0)T ∇2f (x0) (x1 − x0) < 0. For x1 sufficiently close to x0 the continuity of ∇2f gives (x1 − x0)T ∇2f (x0 + θ (x1 − x0)) (x1 − x0) < 0 for all θ ∈ (0, 1). Taylor’s theorem then yields f (x1) < f (x0) + ∇f (x0)T (x1 − x0). This contradicts f being convex, see (a). 17 9. Vector Norms. A vector norm on Rn is a function, k∙k, from Rn into R with the following properties: (i) kxk ≥ 0 for all x ∈ Rn and kxk = 0 if and only if x = 0. (ii) kα xk = |α| kxk for all α ∈ R and x ∈ Rn. (iii) kx + yk ≤ kxk + kyk for all x, y ∈ Rn. Common vector norms are the l1−, l2− (Euclidean), and l∞−norms: ( n )1/2 n X X kxk1 = |xi|, kxk2 = x2i , kxk∞ = max |xi|. i=1 i=1 Exercise. (a) Prove that k ∙ k1, k ∙ k2 and k ∙ k∞ are norms. (b) Given a symmetric positive definite matrix A, prove that √ kxkA := xT A x is a norm. 18 1≤i≤n Example. Draw graphs defined by kxk1 ≤ 1, kxk2 ≤ 1, kxk∞ ≤ 1 when n = 2. l2 l1 l∞ Exercise. Prove that for all x, y ∈ Rn we have n X |xi yi| ≤ kxk2 kyk2 (a) [Scharz inequality] i=1 and (b) √ 1 √ kxk2 ≤ kxk∞ ≤ kxk1 ≤ n kxk2. n 19 10. Spectral Radius. The spectral radius of a matrix A ∈ Rn×n is defined by ρ(A) = max |λi|, where 1≤i≤n λ1, . . . , λn are all the eigenvalues of A. 11. Matrix Norms. For an n × n matrix A, the natural matrix norm kAk for a given vector norm k ∙ k is defined by kAk = max kA xk. kxk=1 The common matrix norms are n q X |aij |, kAk2 = ρ(AT A), kAk1 = max 1≤j≤n | {z } i=1 kAk∞ = max 1≤i≤n ρ(A) if A=AT Exercise: Compute kAk1, kAk∞, and kAk2 for A = (Answer: kAk1 = kAk∞ = 4 and kAk2 = 20 p 7+ √ " 1 1 0 1 2 1 −1 1 2 7 ≈ 3.1.) # . n X j=1 |aij |. 12. Convergence. 
A sequence of vectors {x(k)} ⊂ Rn is said to converge to a vector x ∈ Rn if kx(k) − xk → 0 as k → ∞ for an arbitrary vector norm k ∙ k. This is equivalent (k) to the componentwise convergence, i.e., xi → xi as k → ∞, i = 1, . . . , n. A square matrix A ∈ Rn×n is said to be convergent if kAk k → 0 as k → ∞, which is equivalent to (Ak )ij → 0 as k → ∞ for all i, j. The following statements are equivalent: (i) A is a convergent matrix, (ii) ρ(A) < 1, (iii) limk→∞ Ak x = 0, for every x ∈ Rn. Exercise. Show that A is convergent, where " # 1/2 0 A= . 1/4 1/2 21 3. Algebraic Equations 1. Decomposition Methods for Linear Equations. A matrix A ∈ Rn×n is said to have LU decomposition if A = L U where L ∈ Rn×n is a lower triangular matrix (lij = 0 if 1 ≤ i < j ≤ n) and U ∈ Rn×n is an upper triangular matrix (uij = 0 if 1 ≤ j < i ≤ n). The decomposition is unique if one assumes e.g. lii = 1 for 1 ≤ i ≤ n. l11 l21 l22 , L = l31 l32 l33 . . . . . ln1 ln2 ln3 . . . lnn u11 u12 u13 ... u1n u22 u23 ... u2n . . .. U = . u u n−1,n−1 n−1,n unn 22 In general, the diagonal elements of either L or U are given and the remaining elements of the matrices are determined by directly comparing two sides of the equation. The linear system A x = b is then equivalent to L y = b and U x = y. Exercise. Show that the solution to L y = b is y1 = b1/l11, yi = (bi − i−1 X lik yk )/lii, i = 2, . . . , n k=1 (forward substitution) and the solution to U x = y is xn = yn/unn, (backward substitution). xi = (yi − n X k=i+1 23 uik xk )/uii, i = n − 1, . . . , 1 2. Crout Algorithm. Exercise. Let A be tridiagonal, i.e. aij = 0 if |i − j| > 1 (aij = 0 except perhaps ai−1,i, aii and P ai,i+1), and strictly diagonally dominant (|aii| > j6=i |aij | holds for i = 1, . . . , n). Show that A can be factorized as A = L U where lii = 1 for i = 1, . . . , n, u11 = a11, and ui,i+1 = ai,i+1 li+1,i = ai+1,i/uii ui+1,i+1 = ai+1,i+1 − li+1,i ui,i+1 for i = 1, . . . , n − 1. [Note: L and U are tridiagonal.] C++ Exercise: Write a program to solve a tridiagonal and strictly diagonally dominant linear equation A x = b by the Crout algorithm. The input are the number of variables n, the matrix A, and the vector b. The output is the solution x. 24 Exercise 3.2. u11 = a11 and ui,i+1 = ai,i+1, li+1,i = ai+1,i/uii, ui+1,i+1 = ai+1,i+1 − li+1,i ui,i+1, for i = 1, . . . , n − 1, can easily be shown. It remains to show that uii 6= 0 for i = 1, . . . , n. We proceed by induction to show that |uii| > |ai,i+1|, where for convenience we define an,n+1 := 0. • i = 1: |u11| = |a11| > |a1,2| X • i 7→ i + 1: ai+1,i ai,i+1 |ui+1,i+1| = ai+1,i+1 − uii |ai,i+1| ≥ |ai+1,i+1| − |ai+1,i| > |ai+1,i+2| X |uii| Overall we have that |uii| > 0 and so the Crout algorithm is well defined. Moreover, Qn det(A) = det(L) det(U ) = det(U ) = i=1 uii 6= 0. ≥ |ai+1,i+1| − |ai+1,i| 25 3. Choleski Algorithm. Exercise. Let A be a positive definite matrix. Show that A can be factorized as A = L LT where L is a lower triangular matrix. (i) Compute 1st column: l11 = √ a11, li1 = ai1/l11, i = 2, . . . , n. (ii) For j = 2, . . . , n − 1 compute jth column: ljj = (ajj − lij = (aij − j−1 X k=1 j−1 X 2 12 ljk ) lik ljk )/ljj , k=1 26 i = j + 1, . . . , n. (iii) Compute nth column: lnn = (ann − 27 n−1 X k=1 1 2 2 lnk ) . 4. Iterative Methods for Linear Equations. Split A into A = M + N with M nonsingular and convert the equation A x = b into an equivalent equation x = C x + d with C = −M −1 N and d = M −1 b. 
Choose an initial vector x(0) and then generate a sequence of vectors by x(k) = C x(k−1) + d, k = 1, 2, . . . The resulting sequence converges to the solution of A x = b, for an arbitrary initial vector x(0), if and only if ρ(C) < 1. The objective is to choose M such that M −1 is easy to compute and ρ(C) < 1. The iteration stops if kx(k) − x(k−1)k < ε. [Note: In practice one solves M x(k) = −N x(k−1) + b, for k = 1, 2, . . ..] 28 Claim. The iteration x(k) = C x(k−1) + d is convergent if and only if ρ(C) < 1. Proof. Define e(k) := x(k) − x, the error of the kth iterate. Then e(k) = C x(k−1) + d − (C x + d) = C (x(k−1) − x) = C e(k−1) = C 2 e(k−2) = . . . C k e(0) , where e(0) = x(0) − x is an arbitrary vector. Assume C is similar to the diagonal matrix Λ = diag(λ1, . . . , λn), where λi are the eigenvalues of C. ⇒ ∃ X nonsingular s.t. C = X Λ X −1 ⇒ e(k) = C k e(0) = X Λk X −1 e(0) ⇐⇒ ⇐⇒ |λi| < 1 ρ(C) < 1 . λk1 ... =X ∀ i = 1, . . . , n 29 λkn −1 (0) X e → 0 as k → ∞ 5. Jacobi Algorithm. Exercise: Let M = D and N = L + U (L strict lower triangular part of A, D diagonal, U strict upper triangular part). Show that the ith component at the kth iteration is i−1 n X X 1 (k) (k−1) (k−1) bi − xi = aij xj − aij xj aii j=1 j=i+1 for i = 1, . . . , n. 6. Gauss–Seidel Algorithm. Exercise: Let M = D + L and N = U . Show that the ith component at the kth iteration is i−1 n X X 1 (k) (k) (k−1) xi = bi − aij xj − aij xj aii j=1 j=i+1 for i = 1, . . . , n. 30 7. SOR Algorithm. Exercise. Let M = ω1 D + L and N = U + (1 − ω1 )D where 0 < ω < 2. Show that the ith component at the kth iteration is i−1 n X X 1 (k) (k−1) (k) (k−1) + ω bi − aij xj − aij xj xi = (1 − ω) xi aii j=1 j=i+1 for i = 1, . . . , n. C++ Exercise: Write a program to solve a diagonally dominant linear equation A x = b by the SOR algorithm. The input are the number of variables n, the matrix A, the vector b, the initial vector x0, the relaxation parameter ω, and tolerance ε. The output is the number of iterations k and the approximate solution xk . 31 8. Special Matrices. If A is strictly diagonally dominant, then Jacobi and Gauss–Seidel converge for any initial vector x(0). In addition, SOR converges for ω ∈ (0, 1]. If A is positive definite and 0 < ω < 2, then the SOR method converges for any initial vector x(0). If A is positive definite and tridiagonal, then ρ(CGS ) = [ρ(CJ )]2 < 1 and the optimal 2 p ∈ [1, 2). With this choice choice of ω for the SOR method is ω = 1 + 1 − ρ(CGS ) of ω, ρ(CSOR ) = ω − 1 ≤ ρ(CGS ). Exercise. Find the optimal ω for the SOR method for the matrix 4 3 0 A = 3 4 −1 . 0 −1 4 (Answer: ω ≈ 1.24.) 32 9. Condition Numbers. The condition number of a nonsingular matrix A relative to a norm k ∙ k is defined by κ(A) = kAk ∙ kA−1k. Note that κ(A) ≥ kA A−1k = kIk = max kxk = 1. kxk=1 A matrix A is well-conditioned if κ(A) is close to one and is ill-conditioned if κ(A) is much larger than one. 1 Suppose kδAk < . Then the solution x e to (A + δA) x e = b + δb approximates −1 kA k the solution x of A x = b with error estimate κ(A) kx − x ek kδbk kδAk . ≤ + kxk 1 − kδAk kA−1k kbk kAk In particular, if δA = 0 (no perturbation to matrix A) then kx − x̃k kδbk ≤ κ(A) . kxk kbk 33 Example. Consider Example 1.3. A= −1 ⇒ A Recall 1 1 1 1.01 1 = det A ! 1.01 −1 −1 1 ! kAk1 = max 1≤j≤n Hence n X i=1 = 101 −100 −100 100 |aij | . kA−1k1 = max(201, 200) = 201 . kAk1 = max(2, 2.01) = 2.01 , ⇒ 1.01 −1 −1 1 1 = 0.01 ! κ1(A) = kAk1 ∙ kA−1k1 = 404.01 1 Similarly κ∞ = 404.01 and κ2 = ρ(A) ρ(A−1) = 34 λmax λmin (ill-conditioned!) = 402.0075. ! 
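As a starting point for the C++ exercise on the SOR algorithm above, the following is a minimal sketch (not part of the original notes). It implements the componentwise update formula of Section 3.7 and uses the maximum norm of the update as the stopping criterion; the function name and argument layout are illustrative choices.

```cpp
#include <cmath>
#include <vector>

// SOR sketch for A x = b. A is a dense n x n matrix stored row-major,
// x holds the initial guess and is overwritten with the approximation.
// Returns the number of iterations performed (max_iter if not converged).
int sor(const std::vector<std::vector<double>>& A,
        const std::vector<double>& b,
        std::vector<double>& x, double omega, double eps, int max_iter)
{
    const int n = static_cast<int>(b.size());
    for (int k = 1; k <= max_iter; ++k) {
        double diff = 0.0;                          // || x^(k) - x^(k-1) ||_inf
        for (int i = 0; i < n; ++i) {
            double sum = b[i];
            for (int j = 0; j < n; ++j)
                if (j != i) sum -= A[i][j] * x[j];  // new values for j < i, old for j > i
            double xi_new = (1.0 - omega) * x[i] + omega * sum / A[i][i];
            diff = std::max(diff, std::abs(xi_new - x[i]));
            x[i] = xi_new;
        }
        if (diff < eps) return k;
    }
    return max_iter;
}
```

With ω = 1 the routine reduces to the Gauss–Seidel iteration, so the same code can be used to compare both methods, e.g. on the tridiagonal example with optimal ω ≈ 1.24 above.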
10. Hilbert Matrix. An n × n Hilbert matrix Hn = (hij ) is defined by hij = 1/(i + j − 1) for i, j = 1, 2, . . . , n. Hilbert matrices are notoriously ill-conditioned and κ(Hn) → ∞ very rapidly as n → ∞. Exercise. 1 1 1 . . . 2 n 1 1 1 ... 3 n+1 Hn = 2 . .. . 1 1 1 ... n n+1 2n − 1 Compute the condition numbers κ1(H3) and κ1(H6). 35 (Answer: κ1(H3) = 748 and κ1(H6) = 2.9 × 106.) 36 11. Fixed Point Method for Nonlinear Equations. A function g : R → R has a fixed point xˉ if g(ˉ x) = xˉ. A function g is a contraction mapping on [a, b] if g : [a, b] → [a, b] and |g 0(x)| ≤ L < 1, ∀ x ∈ (a, b) where L is a constant. Exercise. Assume g is a contraction mapping on [a, b]. Prove that g has a unique fixed point xˉ in [a, b], and for any x0 ∈ [a, b], the sequence defined by xn+1 = g(xn), n ≥ 0, converges to xˉ. The algorithm is called fixed point iteration. 37 Exercise 3.11. Existence: Define h(x) = x − g(x) on [a, b]. Then h(a) = a − g(a) ≤ 0 and h(b) = b − g(b) ≥ 0. As h is continuous, ∃ c ∈ [a, b] s.t. h(c) = 0. I.e. c = g(c). X Uniqueness: Suppose p, q ∈ [a, b] are two fixed points. Then |p − q| = |g(p) − g(q)| ⇒ (1 − L) |p − q| ≤ 0 ⇒ = |{z} MVT, α∈(a,b) |g 0(α) (p − q)| ≤ L |p − q| |p − q| ≤ 0 ⇒ p = q. Convergence: |xn − xˉ| = |g(xn−1) − g(ˉ x)| = |g 0(α) (xn−1 − xˉ)| ≤ L |xn−1 − xˉ| ≤ . . . ≤ Ln |x0 − xˉ| → 0 as n → ∞ . Hence xn → xˉ as n → ∞ . 38 X X 12. Newton Method for Nonlinear Equations. Assume that f ∈ C 1([a, b]), f (ˉ x) = 0 (ˉ x is a root or zero) and f 0(ˉ x) 6= 0. The Newton method can be used to find the root xˉ by generating a sequence {xn} satisfying f (xn) , n = 0, 1, . . . xn+1 = xn − 0 f (xn) provided f 0(xn) 6= 0 for all n. The sequence xn converges to the root xˉ as long as the initial point x0 is sufficiently close to xˉ. The algorithm stops if |xn+1 − xn| < ε, a prescribed error tolerance, and xn+1 is taken as an approximation to xˉ . Geometric Interpretation: Tangent line at (xn, f (xn)) is Y = f (xn) + f 0(xn) (X − xn). 39 n) Setting Y = 0 yields xn+1 := X = xn − ff0(x (xn ) . 40 13. Choice of Initial Point. Suppose f ∈ C 2([a, b]) and f (ˉ x) = 0 with f 0(ˉ x) 6= 0. Then there exists δ > 0 such that the Newton method generates a sequence xn converging to xˉ for any initial point x0 ∈ [ˉ x − δ, xˉ + δ] (x0 can only be chosen locally). However, if f satisfies the following additional conditions: 1. f (a)f (b) < 0, 2. f 00 does not change sign on [a, b], 3. tangent lines to the curve y = f (x) at both a and b cut the x-axis within [a, b]; f (b) , b − (i.e. a − ff0(a) (a) f 0 (b) ∈ [a, b]) then f (x) = 0 has a unique root xˉ in [a, b] and Newton method converges to xˉ for any initial point x0 ∈ [a, b] (x0 can be chosen globally). Example. Use the Newton method to compute x = can be any point in [1, c]. 41 √ c, c > 1, and show that the initial point Example. √ Find x = c, c > 1. Answer. x is root of f (x) := x2 − c. 2 f (xn) xn − c 1 c = , = xn − Newton: xn+1 = xn − 0 xn + f (xn) 2 xn 2 xn How to choose x0? Check the 3 conditions on [1, c]. 1. f (1) = 1 − c < 0, f (c) = c2 − c > 0. 2. f 00 = 2 ⇒ f (1) f (c) < 0 X n ≥ 0. X 3. Tangent line at 1: Y = f (1) + f 0(1) (X − 1) = 1 − c + 2 (X − 1) c−1 ∈ (1, c). X Let Y = 0, then X = 1 + 2 Tangent line at c: Y = f (c) + f 0(c) (X − c) = c2 − c + 2 c (X − c) c−1 Let Y = 0, then X = c − ∈ (1, c). X 2 42 ⇒ Newton convergence for any x0 ∈ [1, c]. 43 Numerical Example. √ √ Find 7. (From calculator: 7 = 2.6457513.) Newton converges for all x0 ∈ [1, 7]. Choose x0 = 4. 
1 7 x1 = = 2.875 x0 + 2 x0 x2 = 2.6548913 x3 = 2.6457670 x4 = 2.6457513 Comparison to bisection method with I0 = [1, 7]: I1 = [1, 4] I2 = [2.5, 4] I3 = [2.5, 3.25] I4 = [2.5, 2.875] .. I25 = [2.6457512, 2.6457513] 44 14. Pitfalls. Here are some difficulties which may be encountered with the Newton method: 1. {xn} may wander around and not converge (there are only complex roots to the equation), 2. initial approximation x0 is too far away from the desired root and {xn} converges to some other root (this usually happens when f 0(x0) is small), 3. {xn} may diverge to +∞ (the function f is positive and monotonically decreasing on an unbounded interval), and 4. {xn} may repeat (cycling). 45 15. Rate of Convergence. Suppose {xn} is a sequence that converges to xˉ . The convergence is said to be linear if there is a constant r ∈ (0, 1) such that |xn+1 − xˉ| ≤ r, |xn − xˉ| for all n sufficiently large. The convergence is said to be superlinear if |xn+1 − xˉ| lim = 0. n→∞ |xn − x ˉ| In particular, the convergence is said to be quadratic if |xn+1 − xˉ| ≤ M, 2 |xn − xˉ| for all n sufficiently large. where M is a positive constant, not necessarily less than 1. n Example. xn = xˉ + 0.5n linear, xn = xˉ + 0.52 quadratic. Example. Show that the Newton method converges quadratically. 46 Example. f (x) Define g(x) = x − 0 . Then the Newton method is given by f (x) xn+1 = g(xn) . Moreover, f (ˉ x) = 0 and f 0(ˉ x) 6= 0 imply that g(ˉ x) = xˉ , (f 0)2 − f f 00 f 00(ˉ x) x) = 1 − (ˉ x ) = f (ˉ x ) = 0, g (ˉ 0 2 0 2 (f ) x)) (f (ˉ x) f 00(ˉ 00 g (ˉ x) = 0 . f (ˉ x) 0 Assuming that xn → xˉ we have that |g 0(ˉ x) (xn − xˉ) + 12 g 00(ηn) (xn − xˉ)2| |xn+1 − xˉ| |g(xn) − g(ˉ x)| = = |{z} 2 2 2 |x |xn − xˉ| |xn − xˉ| − x ˉ | n Taylor 1 1 = |g 00(ηn)| → |g 00(ˉ x)| =: λ as n → ∞ . 2 2 47 Hence |xn+1 − xˉ| ≈ λ |xn − xˉ|2 ⇒ quadratic convergence rate. 48 4. Interpolations 1. Polynomial Approximation. For any continuous function f defined on an interval [a, b], there exist polynomials P that can be as “close” to the given function as desired. Taylor polynomials agree closely with a given function at a specific point, but they concentrate their accuracy only near that point. A good polynomial needs to provide a relatively accurate approximation over an entire interval. 49 2. Interpolating Polynomial – Lagrange Form. Suppose xi ∈ [a, b], i = 0, 1, . . . , n, are pairwise distinct mesh points in [a, b]. The Lagrange polynomial p is a polynomial of degree ≤ n such that p(xi) = f (xi), i = 0, 1, . . . , n. p can be constructed explicitly as p(x) = n X Li(x)f (xi) i=0 where Li is a polynomial of degree n satisfying Li(xj ) = 0, This results in Li(x) = j 6= i, Y x − xj j6=i xi − xj Li(xi) = 1. i = 0, 1, . . . , n. p is called linear interpolation if n = 1 and quadratic interpolation if n = 2. 50 Exercise. Find the Lagrange polynomial p for the following points (x, f (x)): (1, 0), (−1, −3), and (2, −4). Assume that a new point (0, 2) is observed, and construct a Lagrange polynomial to incorporate this new information in it. Error formula. Suppose f is n + 1 times differentiable on [a, b]. Then it holds that f (n+1)(ξ) f (x) = p(x) + (x − x0) ∙ ∙ ∙ (x − xn), (n + 1)! where ξ = ξ(x) lies in (a, b). Proof. Define g(x) = f (x) − p(x) + λ n Y j=0 (x − xj ), where λ ∈ R is a constant. Clearly, g(xj ) = 0 for j = 0, . . . , n. To estimate the error at x = α 6∈ {x0, . . . , xn}, choose λ such that g(α) = 0. 51 Hence g(x) = f (x) − p(x) − (f (α) − p(α)) ⇒ g has at least n + 2 zeros: x0, . . . , xn, α. n Y x − xj j=0 α − xj . 
Mean Value Theorem yields that g 0 has at least n + 1 zeros .. g (n+1) has at least 1 zero, say ξ Moreover, as p is polynomial of degree ≤ n, it holds that p(n+1) = 0. 52 Hence ⇒ (n + 1)! 0 = g (n+1)(ξ) = f (n+1)(ξ) − (f (α) − p(α)) Qn j=0 (α − xj ) n f (n+1)(ξ) Y Error = f (α) − p(α) = (α − xj ) . (n + 1)! j=0 53 3. Trapezoid Rule. We can use linear interpolation (n = 1, x0 = a, x1 = b) to approximate f (x) on [a, b] Rb and then compute a f (x) dx to get the trapezoid rule: Z b 1 f (x) dx ≈ (b − a) [f (a) + f (b)] . 2 a If we partition [a, b] into n equal subintervals with mesh points xi = a + ih, i = 0, . . . , n, and step size h = (b − a)/n, we can derive the composite trapezoid rule: Z b n−1 X h f (x) dx ≈ [f (x0) + 2 f (xi) + f (xn)] . 2 a i=1 The approximation error is in the order of O(h2) if |f 00| is bounded. 54 Use linear interpolation (n = 1, x0 = a, x1 = b) to approximate f (x) on [a, b] and then Rb compute a f (x) dx. Answer. The linear interpolating polynomial is p(x) = f (a) L0(x) + f (b) L1(x), where x−a L1(x) = . b−a Z b Z b Z b b x−b x−a dx + f (b) dx f (x) dx ≈ p(x) dx = f (a) a − b b − a a a a a 1 1 1 1 2 b = f (a) (x − b) |a +f (b) (x − a)2 |ba a−b2 b−a2 1 1 1 1 2 (−(a − b) ) + f (b) (b − a)2 = f (a) a−b2 b−a2 b−a = (f (a) + f (b)) ← Trapezoid Rule 2 L0(x) = ⇒ x−b , a−b Z and [Of course, one could have arrived at this formula with a simple geometric argument.] 55 Error Analysis. f 00(ξ) Let f (x) = p(x) + E(x), where E(x) = (x − a) (x − b) with ξ ∈ (a, b). Assume 2 that |f 00| ≤ M is bounded. Then Z Z Z b b b M |E(x)| dx ≤ (x − a) (b − x) dx E(x) dx ≤ a 2 a a Z b M (x − a) [(b − a) − (x − a)] dx = 2 a Z b M 2 = −(x − a) + (b − a) (x − a) dx 2 a M 1 1 − (b − a)3 + (b − a)3 dx = 2 3 2 M (b − a)3 . = 12 56 The composite formula can be obtained by considering the partitioning of [a, b] into b−a a = x0 < x1 < . . . < xn−1 < xn = b, where xi = a + i h with h := . n Z b f (x) dx = a n−1 Z X i=0 xi+1 xi f (x) dx ≈ = n−1 X xi+1 − xi i=0 n−1 X i=0 2 (f (xi) + f (xi+1)) h (f (xi) + f (xi+1)) 2 1 1 =h f (a) + f (x1) + . . . + f (xn−1) + f (b) . 2 2 Error analysis then yields that M (b − a) 2 M 3 h n= h = O(h2) . Error ≤ 12 12 57 4. Simpson’s Rule. Exercise. Use quadratic interpolation (n = 2, x0 = a, x1 = a+b 2 , x2 = b) to approximate f (x) Rb on [a, b] and then compute a f (x) dx to get the Simpson’s rule: Z b 1 a+b f (x) dx ≈ (b − a) [f (a) + 4 f ( ) + f (b)] . 6 2 a Derive the composite Simpson’s rule: Z b n/2 n/2 X X h f (x) dx ≈ [f (x0) + 2 f (x2i−2) + 4 f (x2i−1) + f (xn)] , 3 a i=2 i=1 where n is an even number and xi and h are chosen as in the composite trapezoid rule. [One can show that the approximation error is in the order of O(h4) if |f (4)| is bounded.] 58 5. Newton–Cotes Formula. Suppose x0, . . . , xn are mesh points in [a, b], usually mesh points are equally spaced and x0 = a, xn = b, then integral can be approximated by the Newton–Cotes formula: Z b n X f (x) dx ≈ Ai f (xi) a i=0 where parameters Ai are determined in such a way that the integral is exact for all polynomials of degree ≤ n. [Note: n+1 unknowns Ai and n+1 coefficients for polynomial of degree n.] Exercise. Use Newton–Cotes formula to derive the trapezoid rule and the Simpson’s rule. Prove that if f is n + 1 times differentiable and |f (n+1)| ≤ M on [a, b] then Z b Z bY n n X M | f (x) dx − Ai f (xi)| ≤ |x − xi| dx. (n + 1)! a i=0 a i=0 59 Exercise 4.5. We have that Z b n X q(x) dx = Ai q(xi) a for all polynomials q of degree ≤ n. i=0 for the data points x0, x1, . . . , xn. 
Let q(x) = Lj (x), where Lj is the jth Lagrange polynomial 1 i = j I.e. Lj is of degree n and satisfies Lj (xi) = δij = . Now 0 i 6= j Z b n X Lj dx = Ai Lj (xi) = Aj ⇒ Z a b a f (x) dx ≈ = i=0 n X Ai f (xi) = i=0 Z bX n a n X f (xi) i=0 f (xi) Li(x) dx = i=0 Z Z b Li(x) dx a b p(x) dx, a where p(x) is the interpolating Lagrange polynomial to f . Hence we find trapezoid 60 (n = 1) and Simpson’s rule (n = 2, with x1 = The Lagrange polynomial has the error term f (x) = p(x) + E(x), a+b 2 ). f (n+1)(ξ) (x − x0) ∙ ∙ ∙ (x − xn), E(x) := (n + 1)! where ξ = ξ(x) lies in (a, b). Hence Z Z Z Z b b b b p(x) dx = E(x) dx ≤ |E(x)| dx f (x) dx − a a a a Z bY n M ≤ |x − xi| dx. (n + 1)! a i=0 61 6. Ordinary Differential Equations. An initial value problem for an ODE has the form x0(t) = f (t, x(t)), a ≤ t ≤ b and x(a) = x0. (1) is equivalent to the integral equation: Z t f (s, x(s)) ds, a ≤ t ≤ b. x(t) = x0 + (1) (2) a To solve (2) numerically we divide [a, b] into subintervals with mesh points ti = a+ih, i = 0, . . . , n, and step size h = (b − a)/n. (2) implies Z ti+1 f (s, x(s)) ds, i = 0, . . . , n − 1. x(ti+1) = x(ti) + ti 62 (a) If we approximate f (s, x(s)) on [ti, ti+1] by f (ti, x(ti)), we get the Euler (explicit) method for equation (1): wi+1 = wi + hf (ti, wi), w0 = x0. We have x(ti+1) ≈ wi+1 if h is sufficiently small. [Taylor: x(ti+1) = x(ti) + x0(ti) h + O(h2) = x(ti) + f (ti, x(ti)) h + O(h2).] (b) If we approximate f (s, x(s)) on [ti, ti+1] by linear interpolation with points (ti, f (ti, x(ti)) and (ti+1, f (ti+1, x(ti+1))), we get the trapezoidal (implicit) method for equation (1): h w0 = x0. wi+1 = wi + [f (ti, wi) + f (ti+1, wi+1)], 2 (c) If we combine the Euler method with the trapezoidal method, we get the modified Euler (explicit) method (or Runge–Kutta 2nd order method): wi+1 h = wi + [f (ti, wi) + f (ti+1, wi + hf (ti, wi))], | {z } 2 ≈ wi+1 63 w0 = x0. 7. Divided Differences. Suppose a function f and (n + 1) distinct points x0, x1, . . . , xn are given. Divided differences of f can be expressed in a table format as follows: xk x0 x1 x2 x3 .. 0DD 1DD 2DD 3DD ... f [x0] f [x1] f [x0, x1] f [x2] f [x1, x2] f [x0, x1, x2] f [x3] f [x2, x3] f [x1, x2, x3] f [x0, x1, x2, x3] ... 64 where f [xi] = f (xi) f [xi+1] − f [xi] f [xi, xi+1] = xi+1 − xi f [xi+1, xi+2, . . . , xi+k ] − f [xi, xi+1, . . . , xi+k−1] f [xi, xi+1, . . . , xi+k ] = xi+k − xi f [x2, . . . , xn] − f [x1, . . . , xn−1] f [x1, . . . , xn] = xn − x1 65 8. Interpolating Polynomial – Newton Form. One drawback of Lagrange polynomials is that there is no recursive relationship between Pn−1 and Pn, which implies that each polynomial has to be constructed individually. Hence, in practice one uses the Newton polynomials. The Newton interpolating polynomial Pn of degree n that agrees with f at the points x0, x1, . . . , xn is given by Pn(x) = f [x0] + n X f [x0, x1, . . . , xk ] k−1 Y i=0 k=1 (x − xi). Note that Pn can be computed recursively using the relation Pn(x) = Pn−1(x) + f [x0, x1, . . . , xn](x − x0)(x − x1) ∙ ∙ ∙ (x − xn−1). [Note that f [x0, x1, . . . , xk ] can be found on the diagonal of the DD table.] Exercise. Suppose values (x, y) are given as (1, −2), (−2, −56), (0, −2), (3, 4), (−1, −16), and (7, 376). Is there a cubic polynomial that takes these values? (Answer: 2x3 − 7x2 + 66 5x − 2.) 67 Exercise 4.8. Data points: (1, −2), (−2, −56), (0, −2), (3, 4), (−1, −16), (7, 376). 
xk 1 −2 0 3 −1 7 0DD 1DD 2DD 3DD 4DD 5DD −2 −56 18 −2 27 −9 4 2 −5 2 −16 5 −3 2 0 376 49 11 2 0 0 Newton polynomial: p(x) = −2 + 18 (x − 1) − 9 (x − 1) (x + 2) + 2 (x − 1) (x + 2) x = 2x3 − 7x2 + 5x − 2 . [Note: We could have stopped at the row for x3 = 3 and checked whether p3 goes through the remaining data points.] 68 9. Piecewise Polynomial Approximations. Another drawback of interpolating polynomials is that Pn tends to oscillate widely when n is large, which implies that Pn(x) may be a poor approximation to f (x) if x is not close to the interpolating points. If an interval [a, b] is divided into a set of subintervals [xi, xi+1], i = 0, 1, . . . , n−1, and on each subinterval a different polynomial is constructed to approximate a function f , such an approximation is called spline. The simplest spline is the linear spline P which approximates the function f on the interval [xi, xi+1], i = 0, 1, . . . , n − 1, with P agreeing with f at xi and xi+1. Linear splines are easy to construct but are not smooth at points x1, x2, . . . , xn−1. 69 10. Natural Cubic Splines. Given a function f defined on [a, b] and a set of points a = x0 < x1 < ∙ ∙ ∙ < xn = b, a function S is called a natural cubic spline if there exist n cubic polynomials Si such that: (a) S(x) = Si(x) for x in [xi, xi+1] and i = 0, 1, . . . , n − 1; (b) Si(xi) = f (xi) and Si(xi+1) = f (xi+1) for i = 0, 1, . . . , n − 1; 0 (c) Si+1 (xi+1) = Si0(xi+1) for i = 0, 1, . . . , n − 2; 00 (xi+1) = Si00(xi+1) for i = 0, 1, . . . , n − 2; (d) Si+1 (e) S 00(x0) = S 00(xn) = 0. 70 Natural Cubic Splines. Si Si+1 | | xi (a) 4 n parameters (b) 2 n equations (c) n − 1 equations (d) n − 1 equations (e) 2 equations | xi+1 xi+2 4 n equations 71 Example. Assume S is a natural cubic spline that interpolates f ∈ C 2([a, b]) at the nodes a = x0 < x1 < ∙ ∙ ∙ < xn = b. We have the following smoothness property of cubic splines: Z Z b a In fact, it even holds that Z b [S 00(x)]2 dx ≤ 00 2 b a [S (x)] dx = min g∈G a [f 00(x)]2 dx. Z b [g 00(x)]2 dx, a where G := {g ∈ C 2([a, b]) : g(xi) = f (xi) i = 0, 1, . . . , n}. Exercise: Determine the parameters a to h so that S(x) is a natural cubic spline, where S(x) = ax3 + bx2 + cx + d for x ∈ [−1, 0] and S(x) = ex3 + f x2 + gx + h for x ∈ [0, 1] with interpolation conditions S(−1) = 1, S(0) = 2, and S(1) = −1. (Answer: a = −1, b = −3, c = −1, d = 2, e = 1, f = −3, g = −1, h = 2.) 72 11. Computation of Natural Cubic Splines. Denote ci = S 00(xi), i = 0, 1, . . . , n. Then c0 = cn = 0. Since Si is a cubic function on [xi, xi+1], we know that Si00 is a linear function on [xi, xi+1]. Hence it can be written as xi+1 − x x − xi 00 Si (x) = ci + ci+1 hi hi where hi := xi+1 − xi. 73 Exercise. Show that Si is given by ci ci+1 Si(x) = (xi+1 − x)3 + (x − xi)3 + pi (xi+1 − x) + qi (x − xi), 6 hi 6 hi where f (xi) ci hi f (xi+1) ci+1 hi , qi = pi = − − 6 6 hi hi and c1, . . . , cn−1 satisfy the linear equations: hi−1 ci−1 + 2 (hi−1 + hi) ci + hi ci+1 = ui, where ui = 6 (di − di−1), di = for i = 1, 2, . . . , n − 1. f (xi+1) − f (xi) hi C++ Exercise: Write a program to construct a natural cubic spline. The inputs are the number of points n + 1 and all the points (xi, yi), i = 0, 1, . . . , n. The output is a natural cubic spline expressed in terms of the functions Si defined on the intervals [xi, xi+1] for i = 0, 1, . . . , n − 1. 74 5. Basic Probability Theory 1. CDF and PDF. Let (Ω, F , P ) be a probability space, X be a random variable. 
The cumulative distribution function (cdf) F of X is defined by F (x) = P (X ≤ x), x ∈ R. F is an increasing right-continuous function satisfying F (−∞) = 0, F (+∞) = 1. If F is absolutely continuous then X has a probability density function (pdf) f defined by f (x) = F 0(x), x ∈ R. F can be recovered from f by the relation Z x F (x) = f (u) du. −∞ 75 2. Normal Distribution. A random variable X has a normal distribution with parameters μ and σ 2, written X ∼ N (μ, σ 2), if X has the pdf (x−μ)2 1 − φ(x) = √ e 2 σ2 σ 2π for x ∈ R. If μ = 0 and σ 2 = 1 then X is called a standard normal random variable and its cdf is usually written as Z x 1 − u2 √ e 2 du. Φ(x) = 2π −∞ If X ∼ N (μ, σ 2) then the characteristic function (Fourier transform) of X is given by where i = √ c(s) = E(e i sX 2 2 )=e i μ t− σ 2s , −1. The moment generating function (mgf) is given by sX m(s) = E(e ) = e 76 2 2 μ t+ σ 2s . 3. Approximation of Normal CDF. It is suggested that the standard normal cdf Φ(x) can be approximated by a “polye nomial” Φ(x) as follows: e Φ(x) := 1 − Φ0(x) (a1k + a2k 2 + a3k 3 + a4k 4 + a5k 5) (3) e e when x ≥ 0 and Φ(x) := 1 − Φ(−x) when x < 0. 1 , γ = 0.2316419, a1 = 0.319381530, a2 = The parameters are given by k = 1+γx −0.356563782, a3 = 1.781477937, a4 = −1.821255978, and a5 = 1.330274429. This approximation has a maximum absolute error less than 7.5 × 10−8 for all x. C++ Exercise: Write a program to compute Φ(x) with (3) and compare the result with that derived with the composite Simpson Rule (see Exercise 4.4). The input is a variable x and the number of partitions n over the interval [0, x]. The output is the results from the two methods and their error. 77 4. Lognormal Random Variable. Let Y = eX and X be a N (μ, σ 2) random variable. Then Y is a lognormal random variable. Exercise: Show that 1 2 2 E(Y ) = eμ+ 2 σ , E(Y 2) = e2μ+2σ . 5. An Important Formula in Pricing European Options. If V is lognormally distributed and the standard deviation of ln V is s then 1 E(max(V − K, 0)) = E(V ) Φ(d1) − K Φ(d2) where 1 1 E(V ) s + and d2 = d1 − s. d1 = ln s K 2 K will be later denoted by X. We use a different notation here to avoid an abuse of notation, as K is not a random variable. 78 1 E(V ) s d1 = ln + and d2 = d1 − s. s K 2 + E(V − K) = E(V ) Φ(d1) − K Φ(d2), Proof. Let g be the pdf of V . Then Z E(V − K)+ = ∞ −∞ + (v − K) g(v) dv = Z ∞ K (v − K) g(v) dv. As V is lognormal, ln V is normal ∼ N (m, s2), where m = ln(E(V )) − 12 s2. Let Y := ln V −m , s i.e. V = e √1 2π m+s Y E(V − K)+ = E(em+s Y . Then Y ∼ N (0, 1) with pdf: φ(y) = Z ∞ + m+s y − K) = − K φ(y) dy e ln K−m Z ∞s Z ∞ = em+s y φ(y) dy − K ln K−m s = I1 − K I2 . 79 ln K−m s e 2 − y2 . φ(y) dy Z ∞ 1 − y2 +m+s y √ e 2 I1 = dy ln K−m 2π Z ∞s 1 − (y−s)2 +m+ s2 2 dy √ e 2 = ln K−m 2π s Z ∞ 2 s 1 − y2 m+ 2 √ e 2 dy [y − s 7→ y] =e ln K−m −s 2π s 2 2 s s ln K − m ln K − m = em+ 2 1 − Φ −s = em+ 2 Φ − +s s s ! s2 − ln K + ln E(V ) − 2 = eln E(V ) Φ +s s 1 E(V ) s = E(V ) Φ(d1) , ln + = E(V ) Φ s K 2 on recalling m = ln(E(V )) − 12 s2. 80 Similarly ln K − m I2 = 1 − Φ s ln K − m =Φ − s 81 = Φ(d1 − s) = Φ(d2) . 6. Correlated Random Variables. Assume X = (X1, . . . , Xn) is an n-vector of random variables. The mean of X is an n-vector μ = (E(X1), . . . , E(Xn)). The covariance of X is an n × n-matrix Σ with components Σij = (CovX)ij = E((Xi − μi)(Xj − μj )) . The variance of Xi is given by σi2 = Σii and the correlation between Xi and Xj is Σ given by ρij = σiσijj . 
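Before the notes turn to the multivariate normal distribution, here is a minimal C++ sketch for the exercise on the polynomial approximation (3) of Φ(x) above. It is not from the notes; only the constants γ and a1, …, a5 are taken from the text, while the function name is an illustrative choice. The requested comparison with the composite Simpson rule is omitted here.

```cpp
#include <cmath>

// Polynomial approximation (3) of the standard normal cdf Phi(x),
// using the constants gamma, a1,...,a5 quoted in the notes
// (maximum absolute error below 7.5e-8).
double norm_cdf(double x)
{
    const double pi    = 3.14159265358979323846;
    const double gamma = 0.2316419;
    const double a1 = 0.319381530, a2 = -0.356563782, a3 = 1.781477937,
                 a4 = -1.821255978, a5 = 1.330274429;
    if (x < 0.0)
        return 1.0 - norm_cdf(-x);                                  // symmetry for x < 0
    const double k    = 1.0 / (1.0 + gamma * x);
    const double pdf  = std::exp(-0.5 * x * x) / std::sqrt(2.0 * pi);  // Phi'(x)
    const double poly = k * (a1 + k * (a2 + k * (a3 + k * (a4 + k * a5))));
    return 1.0 - pdf * poly;
}
```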
X is called a multi-dimensional normal vector, written as X ∼ N (μ, Σ), if X has pdf 1 1 1 T −1 exp − Σ (x − μ) (x − μ) f (x) = 2 (2π)n/2 (detΣ)1/2 where Σ is a symmetric positive definite matrix. 82 7. Convergence. Let {Xn} be a sequence of random variables. There are four types of convergence concepts associated with {Xn}: a.s. • Almost sure convergence, written Xn −→ X, if there exists a null set N such that for all ω ∈ Ω \ N one has Xn(ω) → X(ω), P n → ∞. • Convergence in probability, written Xn −→ X, if for every ε > 0 one has P (|Xn − X| > ε) → 0, Lp n → ∞. • Convergence in norm, written Xn −→ X, if Xn, X ∈ Lp and E|Xn − X|p → 0, D n → ∞. • Convergence in distribution, written Xn −→ X, if for any x at which P (X ≤ x) is continuous one has P (Xn ≤ x) → P (X ≤ x), 83 n → ∞. 8. Strong Law of Large Numbers. Let {Xn} be independent, identically distributed (iid) random variables with finite expectation E(X1) = μ. Then Zn a.s. −→ μ n where Zn = X1 + ∙ ∙ ∙ + Xn. 9. Central Limit Theorem. Let {Xn} be iid random variables with finite expectation μ and finite variance σ 2 > 0. For each n, let Zn = X1 + ∙ ∙ ∙ + Xn. Then Zn n −μ √σ n = Zn − nμ D √ −→ Z nσ where Z is a N (0, 1) random variable, i.e., Z z 2 Zn − nμ 1 − x2 P( √ ≤ z) → √ e dx, nσ 2π −∞ 84 n → ∞. 10. Lindeberg–Feller Central Limit Theorem. Suppose X is a triangular array of random variables, i.e., n X = {X1n, X2n, . . . , Xk(n) : n ∈ {1, 2, . . .}}, with k(n) → ∞ as n → ∞, n are independently distributed and are bounded such that, for each n, X1n, . . . , Xk(n) in absolute value by a constant yn with yn → 0. Let n . Zn = X1n + ∙ ∙ ∙ + Xk(n) If E(Zn) → μ and var(Zn) → σ 2 > 0, then Zn converges in distribution to a normally distributed random variable with mean μ and variance σ 2. Note: Lindeberg–Feller implies the Central Limit Theorem (5.9). 85 If X1, X2, . . . are iid with expectation μ and variance σ 2, then define Xi − μ n Xi := √ , i = 1, 2, . . . , k(n) := n. nσ n For each n, X1n, . . . , Xk(n) are independent and E(Xi) − μ μ − μ √ = = √ = 0, nσ nσ 1 1 2 1 VarX − μ = σ = . Var(Xin) = i 2 2 nσ nσ n P n = ( ni=1 Xi − n μ) √1nσ , then Let Zn = X1n + ∙ ∙ ∙ + Xk(n) E(Xin) E(Zn) = k(n) X E(Xin) = 0 , i=1 k(n) Var(Zn) = X Var(Xin) = 1 . i=1 Hence, by Lindeberg–Feller, D Zn −→ Z ∼ N (0, 1) . 86 6. Optimization 1. Unconstrained Optimization. Given f : Rn → R. Minimize f (x) over x ∈ Rn. f has a local minimum at a point xˉ if f (ˉ x) ≤ f (x) for all x near xˉ, i.e. ∃ε>0 s.t. f (ˉ x) ≤ f (x) ∀ x : kx − xˉk < ε . f has a global minimum at xˉ if f (ˉ x) ≤ f (x) 87 ∀ x ∈ Rn . 2. Optimality Conditions. • First order necessary conditions: Suppose that f has a local minimum at xˉ and that f is continuously differentiable in an open neighbourhood of xˉ . Then ∇f (ˉ x) = 0. (ˉ x is called a stationary point.) • Second order sufficient Conditions: Suppose that f is twice continuously differentiable in an open neighbourhood of xˉ and that ∇f (ˉ x) = 0 and ∇2f (ˉ x) is positive definite. Then xˉ is a strict local minimizer of f . Example: Show that f = (2x21 − x2)(x21 − 2x2) has a minimum at (0, 0) along any straight line passing through the origin, but f has no minimum at (0, 0). Exercise: Find the minimum solution of f (x1, x2) = 2x21 + x1x2 + x22 − x1 − 3x2. (Answer. 1 7 (−1, 11).) 88 (4) Sufficient Condition. Taylor gives for any d ∈ Rn: f (ˉ x + d) = f (ˉ x) + ∇f (ˉ x)T d + 12 dT ∇2f (ˉ x + λ d) d If xˉ is not strict local minimizer, then ∃ {xk } ⊂ Rn \ {ˉ x} : xk → xˉ λ ∈ (0, 1) . s.t. f (xk ) ≤ f (ˉ x) . x Define dk := kxxk −ˉ xk . 
Then kdk k = 1 and there exists a subsequence {dkj } such that k −ˉ dkj → d? as j → ∞ and kd?k = 1. W.l.o.g. we assume dk → d? as k → ∞. f (ˉ x) ≥ f (xk ) = f (ˉ x + kxk − xˉk dk ) x)T dk + 12 kxk − xˉk2 dTk ∇2f (ˉ x + λk kxk − xˉk dk ) dk = f (ˉ x) + kxk − xˉk ∇f (ˉ = f (ˉ x) + 12 kxk − xˉk2 dTk ∇2f (ˉ x + λk kxk − xˉk dk ) dk . x + λk kxk − xˉk dk ) dk ≤ 0, and on letting k → ∞ Hence dTk ∇2f (ˉ dT? ∇2f (ˉ x ) d? ≤ 0 . x) being symmetric positive definite. Hence xˉ As d? 6= 0, this is a contradiction to ∇2f (ˉ is a strict local minimizer. 89 Example 6.2. Show that f = (2x21 − x2)(x21 − 2x2) has a minimum at (0, 0) along any straight line passing through the origin, but f has no minimum at (0, 0). Answer. Straight line through (0, 0): x2 = α x1, α ∈ R fixed. g(r) := f (r, α r) = (2 r2 − α r) (r2 − 2 α r) ⇒ g 0(r) = 8 r3 − 15 α r2 + 4 α2 r, g 0(0) = 0 g 00(r) = 24 r2 − 30 α r + 4 α2 and g 00(0) = 4 α2 > 0 . Hence r = 0 is a minimizer for g ⇐⇒ (0, 0) is a minimizer for f along any straight line. Now let (xk1 , xk2 ) = ( k1 , k12 ) → (0, 0) as k → ∞. Then 1 1 ∀ k. f (xk1 , xk2 ) = − 2 2 < 0 = f (0, 0) k k Hence (0, 0) is not a minimizer for f . ! 0 0 2 [Note: ∇f (0, 0) = 0, but ∇ f (0, 0) = .] 0 4 90 3. Convex Optimization. Exercise. When f is convex, any local minimizer xˉ is a global minimizer of f . If in addition f is differentiable, then any stationary point x ˉ is a global minimizer of f . (Hint. Use a contradiction argument.) 91 Exercise 6.3. When f is convex, any local minimizer xˉ is a global minimizer of f . Proof. Suppose xˉ is a local minimizer, but not a global minimizer. Then Since f is convex, we have that ∃x e s.t. f (e x) < f (ˉ x). f (λ x e + (1 − λ) xˉ) ≤ λ f (e x) + (1 − λ) f (ˉ x) < λ f (ˉ x) + (1 − λ) f (ˉ x) = f (ˉ x) Let xλ := λ x e + (1 − λ) xˉ. Then xλ → xˉ and f (xλ) < f (ˉ x) This is a contradiction to xˉ being a local minimizer. Hence xˉ is a global minimizer for f . 92 ∀ λ ∈ (0, 1] . as λ → 0. 4. Line Search. The basic procedure to solve numerically an unconstrained problem (minimize f (x) over x ∈ Rn) is as follows. (i) Choose an initial point x0 ∈ Rn and an initial search direction d0 ∈ Rn and set k = 0. (ii) Choose a step size αk and define a new point xk+1 = xk + αk dk . Check if the stopping criterion is satisfied (k∇f (xk+1)k < ε?). If yes, xk+1 is the optimal solution, stop. If no, go to (iii). (iii) Choose a new search direction dk+1 (descent direction) and set k = k + 1. Go to (ii). The essential and most difficult part in any search algorithm is to choose a descent direction dk and a step size αk with good convergence and stability properties. 93 5. Steepest Descent Method. f is differentiable. Choose dk = −g k , where g k = ∇f (xk ), and choose αk s.t. f (xk + αk dk ) = min f (xk + α dk ). α∈R Note that the successive descent directions are orthogonal to each other, i.e. (g k )T g k+1 = 0, and the convergence for some functions may be very slow, called zigzagging. Exercise. Use the steepest descent (SD) method to solve (4) with the initial point x0 = (1, 1). (Answer. First three iterations give x1 = (0, 1), x2 = (0, 32 ), and x3 = (− 18 , 32 ).) 94 Steepest Descent. Taylor gives: f (xk + α dk ) = f (xk ) + α ∇f (xk )T dk + O(α2). As ∇f (xk )T dk = k∇f (xk )k kdk k cos θk , with θk the angle between dk and ∇f (xk ), we see that dk is a descent direction if cos θk < 0. The descent is steepest when θk = π ⇐⇒ cos θk = −1. Zigzagging. αk is minimizer of φ(α) := f (xk + α dk ) with dk = −g k . 
Hence 0 = φ0(αk ) = ∇f (xk + αk dk )T dk = ∇f (xk+1)T (−g k ) = −(g k+1)T g k . Hence dk+1 ⊥ dk , which leads to zigzagging. 95 Exercise 6.5. Use the SD method to solve (4) with the initial point x0 = (1, 1). [min: Answer. ∇f = (4 x1 + x2 − 1, 2 x2 + x1 − 3). Iteration 0: d0 = −∇f (x0) = −(4, 0) 6= (0, 0). φ(α) = f (x0 + α d0) = f (1 − 4 α, 1) = 2 (1 − 4 α)2 − 2 minimum point at α0 = 14 ⇒ x1 = x0 + α0 d0 = (0, 1), d1 = −∇f (x1) = −(0, −1) = (0, 1) 6= (0, 0). Iteration 1: x2 = (0, 32 ), d2 = (− 12 , 0). Iteration 2: x3 = (− 18 , 32 ), d3 = (0, 18 ). 96 1 7 (−1, 11).] 6. Newton Method. f is twice differentiable. Choose dk = −[H k ]−1g k , where H k = ∇2f (xk ). Set xk+1 = xk + dk . If H k is positive definite then dk is a descent direction. The main drawback of the Newton method is that it requires the computation of ∇2f (xk ) and its inverse, which can be difficult and time-consuming. Exercise. Use the Newton method to solve (4) with x0 = (1, 1). (Answer. First iteration gives x1 = 17 (−1, 11).) 97 Newton Method. Taylor gives f (xk + d) ≈ f (xk ) + dT ∇f (xk ) + 12 dT ∇2f (xk ) d =: m(d) min m(d) ⇒ d ⇒ ∇m(d) = 0 ∇f (xk ) + ∇2f (xk ) d = 0 . Hence choose dk = −[∇2f (xk )]−1 ∇f (xk ) = −[H k ]−1g k . If H k is positive definite, then so is (H k )−1, and we get (dk )T g k = −(g k )T (H k )−1 g k ≤ −σk kg k k2 < 0 for some σk > 0. Hence dk is a descent direction. [Aside: The Newton method for minx f (x) is equivalent to the Newton method for finding a root of the system of nonlinear equations ∇f (x) = 0.] 98 Exercise 6.6. Use the Newton method to minimize f (x1, x2) = 2x21 + x1x2 + x22 − x1 − 3x2 with x0 = (1, 1)T . Answer. ! ! 4 x 1 + x2 − 1 4 1 , H := ∇2f = ∇f = . 2 x 2 + x1 − 3 1 2 ! ! 2 −1 2 −1 1 1 −1 =7 . H = det H −1 4 −1 4 Iteration 0: x0 = (1, 1)T , ∇f (x0) = (4, 0)T . ! 1 x1 = x0 − [H 0]−1 ∇f (x0) = − 17 1 ⇒ ⇒ 2 −1 −1 4 ∇f (x1) = (0, 0)T and H positive definite. x1 is minimum point. 99 ! 4 0 ! = 1 7 ! −1 . 11 7. Choice of Stepsize. In computing the step size αk we face a tradeoff. We would like to choose αk to give a substantial reduction of f , but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function φ : R → R defined by φ(α) = f (xk + α dk ), α > 0, but in general, it is too expensive to identify this value. A common strategy is to perform an inexact line search to identify a step size that achieves adequate reductions in f with minimum cost. α is normally chosen to satisfy the Wolfe conditions: f (xk + αk dk ) ≤ f (xk ) + c1 αk (g k )T dk ∇f (xk + αk dk )T dk ≥ c2 (g k )T dk , (5) (6) with 0 < c1 < c2 < 1. (5) is called the sufficient decrease condition, and (6) is the curvature condition. 100 Choice of Stepsize. The simple condition f (xk + αk dk ) < f (xk ) (†) is not appropriate, as it may not lead to a sufficient reduction. Example: f (x) = (x − 1)2 − 1. So min f (x) = −1, but we can choose xk satisfying (†) such that f (xk ) = k1 → 0. Note that the sufficient decrease condition (5) φ(α) = f (xk + α dk ) ≤ `(α) := f (xk ) + c1 α (g k )T dk yields acceptable regions for α. Here φ(α) < `(α) for small α > 0, as (g k )T dk < 0 for descent directions. The curvature condition (6) is equivalent to φ0(α) ≥ c2 φ0(0) [ > φ0(0) ] i.e. a condition on the desired slope and so rules out unacceptably short steps α. In practice c1 = 10−4 and c2 = 0.9. 101 8. Convergence of Line Search Methods. An algorithm is said to be globally convergent if lim kg k k = 0. 
k→∞ It can be shown that if the step sizes satisfy the Wolfe conditions • then the steepest descent method is globally convergent, • so is the Newton method provided the Hessian matrices ∇2f (xk ) have a bounded condition number and are positive definite. Exercise. Show that the steepest descent method is globally convergent if the following conditions hold (a) αk satisfies the Wolfe conditions, (b) f (x) ≥ M ∀ x ∈ Rn , (c) f ∈ C 1 and ∇f is Lipschitz: k∇f (x) − ∇f (y)k ≤ L kx − yk 102 ∀ x, y ∈ Rn. [Hint: Show that ∞ X k=0 kg k k2 < ∞.] 103 Exercise 6.8. Assume that dk is a descent direction, i.e. (g k )T dk < 0, where g k := ∇f (xk ). Then if 1. αk satisfies the Wolfe conditions, 2. f (x) ≥ M ∀ x ∈ Rn , 3. f ∈ C 1 and ∇f is Lipschitz, i.e. k∇f (x) − ∇f (y)k ≤ L kx − yk ∀ x, y ∈ Rn, ∞ X (g k )T dk 2 k k 2 k cos θ kg k < ∞, where cos θ := kgk k kdk k . it holds that k=0 [Note: SD method is special case with cos2 θk = 1. ⇒ lim kg k k = 0.] k→∞ Proof. Wolfe condition (6) ⇒ (g k+1)T dk ≥ c2 (g k )T dk ⇒ (g k+1 − g k )T dk ≥ (c2 − 1) (g k )T dk . . 104 (†) The Lipschitz condition yields that (g k+1 − g k )T dk ≤ kg k+1 − g k k kdk k = k∇f (xk+1) − ∇f (xk )k kdk k ≤ L kxk+1 − xk k kdk k = αk L kdk k2 . c2 − 1 (g k )T dk , and hence Combining (†) and (‡) gives αk ≥ k 2 L kd k c2 − 1 [(g k )T dk ]2 αk (g ) d ≤ kdk k2 L k T k Together with Wolfe condition (5) we get f (x k+1 c2 − 1 [(g k )T dk ]2 k 2 k k 2 ) ≤ f (x ) + c1 = f (x ) − c cos θ kg k , k 2 L kd k k 105 (‡) 2 where c := c1 1−c L > 0. f (xk+1) ≤ f (xk ) − c cos2 θk kg k k2 ≤ f (x0) − c ⇒ k X 1 cos2 θj kg j k2 ≤ (f (x0) − M ) ∀ k c j=0 106 ⇒ ∞ X j=0 k X j=0 cos2 θj kg j k2 cos2 θj kg j k2 < ∞ . 9. Popular Search Methods. In practice the steepest descent method and the Newton method are rarely used due to the slow convergence rate and the difficulty in computing Hessian matrices, respectively. The popular search methods are • the conjugate gradient method (variation of SD method with superlinear convergence) and • the quasi-Newton method (variation of Newton method without computation of Hessian matrices). There are some efficient algorithms based on the trust-region approach. See Fletcher (2000) for details. 107 10. Constrained Optimization. Minimize f (x) over x ∈ Rn subject to the equality constraints hi(x) = 0, i = 1, . . . , l , and the inequality constraints gj (x) ≤ 0, j = 1, . . . , m . Assume that all functions involved are differentiable. 108 11. Linear Programming. The problem is to minimize z = c1 x1 + ∙ ∙ ∙ + cn xn subject to ai1 x1 + ∙ ∙ ∙ + ain xn ≥ bi, i = 1, . . . , m , and x1, . . . , xn ≥ 0. LPs can be easily and efficiently solved with the simplex algorithm or the interior point method. MS-Excel has a good in-built LP solver capable of solving problems up to 200 variables. MATLAB with optimization toolbox also provides a good LP solver. 109 12. Graphic Method. If an LP problem has only two decision variables (x1, x2), then it can be solved by the graphic method as follows: • First draw the feasible region from the given constraints and a contour line of the objective function, • then, on establishing the increasing direction perpendicular to the contour line, find the optimal point on the boundary of the feasible region, • then find two linear equations which define that point, • and finally solve the two equations to obtain the optimal point. Exercise. Use the graphic method to solve the LP: minimize z = −3 x1 − 2 x2 subject to x1 + x2 ≤ 80, 2x1 + x2 ≤ 100, x1 ≤ 40, and x1, x2 ≥ 0. (Answer. x1 = 20, x2 = 60.) 110 13. 
Quadratic Programming. Minimize xT Q x + cT x subject to Ax ≤ b and x ≥ 0, where Q is an n × n symmetric positive definite matrix, A is an n × m matrix, x, c ∈ Rn, b ∈ Rm. To solve a QP problem, one • first derives a set of equations from the Kuhn–Tucker conditions, and • then applies the Wolfe algorithm or the Lemke algorithm to find the optimal solution. The MS-Excel solver is capable of solving reasonably sized QP problems, similarly for MATLAB. 111 14. Kuhn–Tucker Conditions. min f (x) over x ∈ Rn s.t. hi(x) = 0, i = 1, . . . , l; gj (x) ≤ 0, j = 1, . . . , m. Assume that xˉ is an optimal solution. Under some regularity conditions, called the constraint qualifications, there exist two vectors uˉ = (ˉ u1, . . . , uˉl ) and vˉ = (ˉ v1, . . . , vˉm), called the Lagrange multipliers, such that the following set of conditions is satisfied: x, uˉ, vˉ) = 0, Lxk (ˉ k = 1, . . . , n x) = 0, hi(ˉ i = 1, . . . , l x) ≤ 0, gj (ˉ j = 1, . . . , m vˉj gj (ˉ x) = 0, vˉj ≥ 0, where L(x, u, v) = f (x) + l X ui hi(x) + i=1 is called the Lagrange function or Lagrangian. 112 j = 1, . . . , m m X j=1 vj gj (x) Furthermore, if f : Rn → R and hi, gj : Rn → R are convex, then xˉ is an optimal solution if and only if (ˉ x, uˉ, vˉ) satisfies the Kuhn–Tucker conditions. This holds in particular, when f is convex and hi, gj are linear. Example. Find the minimum solution to the function x2 − x1 subject to x21 + x22 ≤ 1. Exercise. Find the minimum solution to the function x21 +x22 −2x1 −4x2 subject to x1 +2x2 ≤ 2 and x2 ≥ 0. (Answer. (x1, x2) = 15 (2, 4).) 113 Interpretation of Kuhn–Tucker conditions Assume that no equality constraints are present. If xˉ is an interior point, i.e. no constraints are active, then we recover the usual optimality condition: ∇f (ˉ x) = 0. Now assume that xˉ lies on the boundary of the feasible set and let gjk be the active constraints at xˉ. Then a necessary condition for optimality is that we cannot find a descent direction for f at xˉ that is also a feasible direction. Such a vector cannot exist, if X −∇f (ˉ x) = vˉjk ∇gjk (ˉ x) with vˉjk ≥ 0 . (†) k P x) d < 0 and k vˉjk ∇gjk (ˉ x)T d > This is because, if d ∈ R is a descent direction, then ∇f (ˉ 0. So there must exist a jk , such that ∇gjk (ˉ x)T d > 0. But that means that d is an ascent direction for gjk , and as gjk is active at xˉ, it is not a feasible direction. x) = 0 for all inactive constraints, then we can re-write (†) as 0 = If we require vˉj gj (ˉ n T 114 ∇f (ˉ x) + Pm ˉj j=1 v ∇gj (ˉ x) with vˉj ≥ 0. These are the KT conditions. 115 Application of Kuhn–Tucker: LP Duality Let b ∈ Rm, c ∈ Rn and A ∈ Rn×m. min cT x s.t. A x ≥ b, x ≥ 0. (P) Equivalent to min cT x s.t. b − A x ≤ 0, −x ≤ 0 . Lagrangian: L = cT x + v T (b − A x) + y T (−x). Hence, xˉ is the solution, if there exist vˉ and yˉ such that KT conditions: ∇L = c − AT vˉ − yˉ = 0 ⇒ vˉ, yˉ ≥ 0, −ˉ x ≤ 0. vˉT (b − A xˉ) = 0, 116 yˉT (−ˉ x) = 0, b − A xˉ ≤ 0, yˉ = c − AT vˉ, Eliminate yˉ to find vˉ, xˉ: A xˉ ≥ b, AT vˉ ≤ c, xˉ ≥ 0 vˉ ≥ 0 vˉT (b − A xˉ) = 0 xˉT (c − AT vˉ) = 0 feasible region: primal ) feasible region: dual ⇒ xˉT c = xˉT AT vˉ = vˉT b Hence vˉ ∈ Rm solves the dual: max bT v s.t. AT v ≤ c, 117 v ≥ 0. (D) Here we have used that cT xˉ = min x≥0, A x≥b cT x ≥ min max cT x + v T (b − A x) x≥0 v≥0 = max min cT x + v T (b − A x) v≥0 x≥0 = max min v T b + xT (c − AT v) v≥0 ≥ x≥0 max v≥0, AT v≤c T T vT b ≥ vˉ b = c xˉ 118 Example 6.14. Find the minimum solution to the function x2 − x1 subject to x21 + x22 ≤ 1. Answer. 
L = x2 − x1 + v (x21 + x22 − 1), so the KT conditions become Lx1 = −1 + 2 v x1 = 0 (1) Lx2 = 1 + 2 v x2 = 0 (2) x21 + x22 ≤ 1 (3) v (x21 + x22 − 1) = 0, v≥0 (4) (1) ⇒ v > 0 and hence x1 = 2v1 , x2 = − 2v1 . Plugging this into (4) yields 4 2v2 = 1 and hence 1 √ vˉ = 2 ⇒ √ 1 1 √ √ xˉ1 = , xˉ2 = − ; 2 2 with the optimal value being zˉ = − 2. 119 Example. min x1 s.t. x2 − x31 ≤ 0, x1 ≤ 1, x2 ≥ 0. Since x31 ≥ x2 ≥ 0 we have x1 ≥ 0 and hence xˉ = (0, 0) is the unique minimizer. Lagrangian: L = x1 + v1 (x2 − x31) + v2 (x1 − 1) + v3 (−x2). KT conditions for a feasible point x: 2 1 − 3 v 1 x 1 + v2 =0 (1) ∇L = v1 − v3 v1 (x2 − x31) = 0, v2 (x1 − 1) = 0, v3 (−x2) = 0 (2) v1 , v 2 , v 3 ≥ 0 (3) Check KT conditions at xˉ = (0, 0): (1) ⇒ v2 = −1 < 0 impossible! KT condition is not satisfied, since the constraint qualifications do not hold. 0 0 3 Here g1 = x2 − x1 and g3 = −x2 are active at (0, 0), and ∇g1 = 1 , ∇g3 = −1 . Hence ∇g1 and ∇g3 are not linearly independent! 120 Exercise 6.14 Find the minimum solution to the function x21 + x22 − 2 x1 − 4 x2 subject to x1 + 2 x2 ≤ 2 and x2 ≥ 0. Answer. Lagrangian: L = x21 + x22 − 2 x1 − 4 x2 + v1 (x1 + 2 x2 − 2) + v2 (−x2) KT conditions: 2 x 1 − 2 + v1 ∇L = =0 (1) 2 x2 − 4 + 2 v1 − v2 (2) v1 (x1 + 2 x2 − 2) = 0, v2 x2 = 0, v1, v2 ≥ 0 x1 + 2 x2 ≤ 2, x2 ≥ 0 (3) If x1 + 2 x2 − 2 < 0 then v1 = 0 ⇒ x1 = 1 , x2 = 2 + 12 v2 ≥ 2. Hence from (2), v2 = 0, and so x1 = 1, x2 = 2. But that contradicts (3), and so it must hold that x1 + 2 x2 − 2 = 0. If x2 > 0, then v2 = 0 ⇒ solving (1) together with x1 + 2 x2 − 2 = 0 gives x1 = 25 , X x2 = 45 ; and v1 = 65 , v2 = 0. 121 If x2 = 0 then x1 = 2 − 2 x2 = 2. But then from (1), v1 = −2 < 0. Impossible. 122 7. Lattice (Tree) Methods 1. Concepts in Option Pricing Theory. A call (or put) option is a contract that gives the holder the right to buy (or sell) a prescribed asset (underlying asset) by a certain date T (expiration date) for a predetermined price X (exercise price). A European option can only be exercised on the expiration date while an American option can be exercised at any time prior to the expiration date. The other party to the holder of the option contract is called the writer. The holder and the writer are said to be in long and short positions of the contract, respectively. The terminal payoffs of a European call (or put) option is (S(T )−X)+ := max(S(T )− X, 0) (or (X − S(T ))+), where S(T ) is the underlying asset price at time T . 123 2. Random Walk Model and Assumption. Assume S(t) and V (t) are the asset price and the option price at time t, respectively, and the current time is t and current asset price is S, i.e. S(t) = S. After one period of time Δt, the asset price S(t + Δt) is either u S (“up” state) with probability q or d S (“down” state) with probability 1 − q, where q is a real (subjective) probability. To avoid riskless arbitrage opportunities, we must have u>R>d where R := er Δt and r is the riskless interest rate. Here r is the riskless interest rate that allows for unlimited borrowing or lending at the same rate r. Investing $1 at time t yields a value (return) at time t + Δt of $R = $(er Δt). (continuous compounding). 124 Continuous compounding Saving an amount B at time t, yields with interest rate r at time t + Δt without compounding: (1 + r Δt) B. Compounding the interest n times yields n r Δt 1+ B → er ΔtB as n → ∞ . n Example: Consider Δt = 1 and r = 0.05. Then R = er = 1.0512711, equivalent to an AER of 5.127%. 
  n      (1 + r/n)^n    AER
  1      1.05           5%
  4      1.05094534     5.095%
  12     1.05116190     5.116%
  52     1.05124584     5.125%
  365    1.05126750     5.127%

No Arbitrage: u > R > d.

If R ≥ u, then short sell α > 0 units of the asset and deposit α S in the bank. This gives a portfolio value at time t of π(t) = 0 and at time t + Δt of

  π(t + Δt) = R (α S) − α S(t + Δt) ≥ R (α S) − α (u S) ≥ 0.

Moreover, in the "down" state (with probability 1 − q > 0) the value is

  π(t + Δt) = R (α S) − α (d S) > 0.

Hence we can make a riskless profit with positive probability. No arbitrage implies u > R. A similar argument yields that d < R.

3. Replicating Portfolio. We form a portfolio consisting of α units of the underlying asset (α > 0 buying, α < 0 short selling) and a cash amount B in the riskless cash bond (B > 0 lending, B < 0 borrowing). If we choose

  α = [V(u S, t + Δt) − V(d S, t + Δt)] / (u S − d S),
  B = [u V(d S, t + Δt) − d V(u S, t + Δt)] / (u R − d R),

we have replicated the payoffs of the option at time t + Δt, no matter how the asset price changes.

Replicating Portfolio. Value of the portfolio at time t: π(t) = α S + B, and at time t + Δt:

  π(t + Δt) = α S(t + Δt) + R B = { α u S + R B =: π_u^Δt   with prob. q
                                   { α d S + R B =: π_d^Δt   with prob. 1 − q.

Call option value at time t + Δt (expiration time):

  C(t + Δt) = { (u S − X)^+ =: C_u^Δt   with prob. q
              { (d S − X)^+ =: C_d^Δt   with prob. 1 − q.

To replicate the call option, we must have π_u^Δt = C_u^Δt and π_d^Δt = C_d^Δt. Hence

  α u S + R B = C_u^Δt              α = (C_u^Δt − C_d^Δt) / (u S − d S)
                            ⇐⇒
  α d S + R B = C_d^Δt              B = (u C_d^Δt − d C_u^Δt) / (u R − d R).

This gives the fair price for the call option at time t: π(t) = α S + B.

Replicating Portfolio. The value of the call option at time t is

  C(t) = π(t) = α S + B = (C_u^Δt − C_d^Δt)/(u S − d S) · S + (u C_d^Δt − d C_u^Δt)/(u R − d R)
       = (1/R) [ (R − d)/(u − d) · C_u^Δt + (u − R)/(u − d) · C_d^Δt ]
       = (1/R) [ p C_u^Δt + (1 − p) C_d^Δt ],      where p = (R − d)/(u − d).

No arbitrage argument: If C(t) < π(t), then buy the call option at price C(t) and sell the portfolio at π(t) (that is, short sell α units of the asset and lend B in cash to the bank). This gives an instantaneous profit of π(t) − C(t) > 0. And we know that at time t + Δt the payoff from our call option will exactly compensate for the value of the portfolio, as C(t + Δt) = π(t + Δt). A similar argument for the case C(t) > π(t) yields that C(t) = π(t).

4. Risk Neutral Option Price. The current option price is given by

  V(S, t) = (1/R) [ p V(u S, t + Δt) + (1 − p) V(d S, t + Δt) ],   (7)

where p = (R − d)/(u − d). Note that

1. the real probability q does not appear in the option pricing formula,
2. p is a probability (0 < p < 1) since d < R < u, and
3. p (u S) + (1 − p) (d S) = R S,

so p is called the risk neutral probability and the process of finding V is called risk neutral valuation.

5. Riskless Hedging Principle. We can also derive the option pricing formula (7) as follows: form a portfolio with a long position of one unit of the option and a short position of α units of the underlying asset. By choosing an appropriate α we can ensure that such a portfolio is riskless.

Exercise. Find α (called the delta) and derive the pricing formula (7).

6. Asset Price Process. In a risk neutral continuous time model, the asset price S is assumed to follow a lognormal process (as discussed in detail in Stochastic Processes I, omitted here). Over a period [t, t + τ] the asset price S(t + τ) can be expressed in terms of S(t) = S as

  S(t + τ) = S e^{(r − σ²/2) τ + σ √τ Z},

where r is the riskless interest rate, σ the volatility, and Z ∼ N(0, 1) a standard normal random variable.
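[Added aside, not in the original notes.] The two moments quoted in the next paragraph follow directly from the normal moment generating function E[e^{a Z}] = e^{a²/2} for Z ∼ N(0, 1):

  E[S(t + τ)]  = S e^{(r − σ²/2) τ} E[e^{σ √τ Z}]  = S e^{(r − σ²/2) τ} e^{σ² τ / 2} = S e^{r τ},
  E[S(t + τ)²] = S² e^{(2r − σ²) τ} E[e^{2 σ √τ Z}] = S² e^{(2r − σ²) τ} e^{2 σ² τ}  = S² e^{(2r + σ²) τ}.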
S(t + τ ) has the first moment S er τ and the second moment S 2 e(2 r+σ 132 2) τ . 7. Relations Between u, d and p. By equating the first and second moments of the asset price S(t + Δt) in both continuous and discrete time models, we obtain the relation S p u + (1 − p) d = S er Δt, 2 2 2 2 S p u + (1 − p) d = S 2 e(2 r+σ ) Δt, or equivalently, p u + (1 − p) d = er Δt, p u2 + (1 − p) d2 = e(2 r+σ (8) 2 ) Δt . (9) There are two equations and three unknowns u, d, p. An extra condition is needed to uniquely determine a solution. R−d , where R = er Δt as before.] [Note that (8) implies p = u−d 133 8. Cox–Ross–Rubinstein Model. The extra condition is u d = 1. The solutions to (8), (9), (10) are p 1 2 u = σ̂ + 1 + (σ̂ 2 + 1)2 − 4 R2 , 2R p 1 2 σ̂ + 1 − (σ̂ 2 + 1)2 − 4 R2 , d = 2R where 2 σ̂ 2 = e(2 r+σ ) Δt. 3 If the higher order term O((Δt) 2 ) is ignored in u, d, then √ √ R−d σ Δt −σ Δt u=e , d=e , p= . u−d These are the parameters chosen by Cox–Ross–Rubinstein for their model. Note that with this choice (9) is satisfied up to O((Δt)2). 134 (10) Proof. σ̂ 2 = e(2 r+σ ⇒ 2 ) Δt = p (u2 − d2) + d2 R−d (u + d) (u − d) + d2 = u−d = Ru − du + Rd R = Ru−1+ u R u2 − (1 + σ̂ 2) u + R = 0 . As u > d, we get the unique solutions s p 2 2 2 2 2 2 2 1 + σ̂ + (1 + σ̂ ) − 4 R 1 + σ̂ 1 + σ̂ utrue = −1 = + 2R 2R 2R and s p 2 2 2 2 2 2 2 1 1 + σ̂ − (1 + σ̂ ) − 4 R 1 + σ̂ 1 + σ̂ − 1. dtrue = = = − u 2R 2R 2R 135 Moreover and 1 + σ̂ 2 1 (2 r+σ2) Δt e + 1 e−r Δt = 2 2R 1 2 2 2 = 2 + (2 r + σ ) Δt + O((Δt) ) 1 − r Δt + O((Δt) ) 2 1 2 2 = 2 + σ Δt + O((Δt) ) 2 = 1 + 12 σ 2 Δt + O((Δt)2), s σ̂ 2 1+ 2R ⇒ 2 −1= utrue q 1+ p 1 2 σ 2 Δt + 2 2 O((Δt) ) −1 1 + σ 2 Δt + O((Δt)2) − 1 √ p = σ Δt 1 + O(Δt) √ = σ Δt (1 + O(Δt)) √ 3 = σ Δt + O((Δt) 2 ) . √ 3 1 2 = 1 + 2 σ Δt + σ Δt + O((Δt) 2 ) . = 136 The Cox–Ross–Rubinstein (CRR) model uses √ √ 3 σ Δt 1 2 = 1 + σ Δt + 2 σ Δt + O((Δt) 2 ) . u=e Hence the first three terms in the Taylor series of the CRR value u and the true value utrue match. So 3 u = utrue + O((Δt) 2 ) . Estimating the error in (9) by this choice of u gives 2 2 Error = p u + (1 − p) d − e (2 r+σ 2 ) Δt 2 = R (u + d) − 1 − e(2 r+σ ) Δt √ √ r Δt σ Δt −σ Δt (2 r+σ 2 ) Δt =e e +e −1−e 2 2 2 = 1 + r Δt + O((Δt) ) 2 + σ Δt + O((Δt) ) − 1 2 2 − 1 + (2 r + σ ) Δt + O((Δt) ) = O((Δt)2) . 137 9. Jarrow–Rudd Model. The extra condition is p= 1 . 2 (11) Exercise. Show that the solutions to (8), (9), (11) are p 2 u = R 1 + eσ Δt − 1 , p d = R 1 − eσ2Δt − 1 . 3 Show that if O((Δt) 2 ) is ignored then 2 √ 2 √ u = e (r− σ2 ) Δt+σ d = e (r− σ2 ) Δt−σ Δt Δt , . (These are the parameters chosen by Jarrow–Rudd for their model.) Also show that (8) and (9) are satisfied up to O((Δt)2). 138 10. Tian Model. The extra condition is p u3 + (1 − p) d3 = e3 r Δt+3 σ 2 Δt . Exercise. Show that the solutions to (8), (9), (12) are p RQ Q + 1 + Q2 + 2 Q − 3 , u = 2 p RQ Q + 1 − Q2 + 2 Q − 3 , d = 2 R−d , p = u−d where R = er Δt and Q = eσ 2 Δt . 3 Also show that if O((Δt) 2 ) is ignored, then 1 3 √ p = − σ Δt. 2 4 139 (12) (Note that u d = R2 Q2 instead of u d = 1 and that the binomial tree loses its symmetry about S whenever u d 6= 1.) 140 11. Black–Scholes Equation (BSE). As the time interval Δt tends to zero, the one period option pricing formula, (7) with (8) and (9), tends to the Black–Scholes Equation ∂V 1 2 2 ∂ 2V ∂V +rS + σ S − r V = 0. (13) ∂t ∂S 2 ∂S 2 Proof. Two variable Taylor expansion at (S, t) gives V (u S, t + Δt) = V + VS (u S − S) + Vt Δt + 1 2 2 2 VSS (u S − S) + 2 VSt (u S − S) Δt + Vtt (Δt) + higher order terms . 
Similarly, V (d S, t + Δt) = V + VS (d S − S) + Vt Δt + 1 2 2 2 VSS (d S − S) + 2 VSt (d S − S) Δt + Vtt (Δt) + higher order terms . 141 Substituting into the right hand side of (7) gives V = {V + S VS [p (u − 1) + (1 − p) (d − 1)] + Vt Δt 2 2 2 1 + 2 S VSS p (u − 1) + (1 − p) (d − 1) +2 S VSt (p (u − 1) + (1 − p) (d − 1)) Δt −r Δt 2 +Vtt (Δt) + higher order terms e . Now it follows from (8) that p (u − 1) + (1 − p) (d − 1) = p u + (1 − p) d − 1 = er Δt − 1 = r Δt + O((Δt)2) . Similarly, (8) and (9) give that p (u − 1)2 + (1 − p) (d − 1)2 = p u2 + (1 − p) d2 − 2 (p u + (1 − p) d) + 1 =e (2 r+σ 2 ) Δt 142 − 2 er Δt + 1 = σ 2 Δt + O((Δt)2) . So, for the RHS of (7) we get 2 V = V + S VS r Δt + O((Δt) ) + Vt Δt 2 2 2 2 1 + 2 S VSS σ Δt + O((Δt) ) + 2 VSt r Δt + O((Δt) ) Δt −r Δt 2 +O((Δt) ) e 2 2 1 2 2 = V + r S VS + Vt + 2 σ S VSS Δt + O((Δt) ) 1 − r Δt + O((Δt) ) 1 2 2 = V + −r V + r S VS + Vt + 2 σ S VSS Δt + O((Δt)2) . Cancelling V and dividing by Δt yields −r V + r S VS + Vt + 12 σ 2 S 2 VSS + O(Δt) = 0 . Ignoring the O(Δt) term, we get the BSE. ⇒ The binomial model approximates BSE to 1st order accuracy. 143 12. n-Period Option Price Formula. For a multiplicative n-period binomial process, the call value is n X n j C = R−n p (1 − p)n−j max(uj dn−j S − X, 0) , j j=0 ! n n! is the binomial coefficient. where = j!(n − j)! j (14) Define k to be the smallest non-negative integer such that uk dn−k S ≥ X. The call pricing formula can be simplified as C = S Ψ(n, k, p0) − X R−n Ψ(n, k, p) , (15) up where p0 = and Ψ is the complementary binomial distribution function defined R by ! n X n Ψ(n, k, p) = pj (1 − p)n−j . j j=k 144 u2 S uS Asset: S recombined tree udS dS d2 S CuΔt Call: C CdΔt ↑ t ↑ t + Δt 2Δt = (u2 S − X)+ Cuu 2Δt Cud = (u d S − X)+ 2Δt Cdd = (d2 S − X)+ ↑ t + 2 Δt 145 Call price at time t + Δt: 1 1 Δt 2Δt 2Δt Δt 2Δt 2Δt Cu = p Cuu + (1 − p) Cud , p Cud + (1 − p) Cdd . Cd = R R Call price at time t: 1 1 Δt Δt 2 2Δt 2Δt 2 2Δt p Cu + (1 − p) Cd = 2 p Cuu + 2 p (1 − p) Cud + (1 − p) Cdd . C= R R Proof of (14) by induction: Assume (14) holds for n ≤ k − 1, then for n = k, there are k − 1 periods between time t + Δt and t + k Δt. So k−1 X + k−1 j 1 Δt k−1−j j k−1−j p (1 − p) (u S) − X u d Cu = k−1 j R j=0 k X + k − 1 j−1 1 k−j j k−j = k−1 p (1 − p) u d S−X . R j−1 j=1 146 ⇒ p CuΔt = 1 Rk−1 and similarly (1 − p) CdΔt = 1 Rk−1 k X k−1 j=1 j−1 k−1 X k−1 j=0 j j k−j j u d k−j j k−j j k−j p (1 − p) p (1 − p) u d S−X + S−X + , . Combining gives 1 Δt Δt C= p Cu + (1 − p) Cd R 1 k k = k p (u S − X)+ + (1 − p)k (dk S − X)+ R k−1 X k−1 k−1 j k−j j k−j + + + p (1 − p) (u d S − X) j−1 j j=1 | {z } k = j This proves (14). 147 Let k be the smallest non-negative integer such that u k X X u k n−k u d S ≥ X ⇐⇒ ≥ ⇐⇒ k ≥ ln / ln . d S dn S dn d Then 0 j<k j n−j + (u d S − X) = . uj dn−j S − X j ≥ k Hence n X 1 n j C= n p (1 − p)n−j (uj dn−j S − X) R j j=k j n−j n n X X n pu (1 − p) d n j −n −XR p (1 − p)n−j . =S j j R R j=k j=k pu On setting p0 = , it holds that R R − d u −R d + u d u − R d (1 − p) d 0 1−p =1− = = = . u−d R R (u − d) u−d R R This proves (15). 148 13. Black–Scholes Call Price Formula. As the time interval Δt tends to zero, the n-period call price formula (15) tends to the Black–Scholes call option pricing formula c = S Φ(d1) − X e−r (T −t) Φ(d2) where d1 = and Proof. ln S X σ2 2 ) (T + (r + √ σ T −t (16) − t) 2 √ ln XS + (r − σ2 ) (T − t) √ d2 = ≡ d1 − σ T − t. σ T −t The n-period call price (15) is given by C = S Ψ(n, k, p0) − X R−n Ψ(n, k, p) = S Ψ(n, k, p0) − X e|−r{zn Δt} Ψ(n, k, p) . 
e−r (T −t) 149 Hence we need to prove Ψ(n, k, p0) → Φ(d1) and Ψ(n, k, p) → Φ(d2) 150 as Δt → 0. Let j be the binomial RV of the number of upward moves in n periods. Then u j j n−j Sn = S u d =S dn , d where Sn is the asset price at time T = t + n Δt. Sn u = j ln + n ln d = j ln u + (n − j) ln d , ⇒ ln S d E(j) = n p , Var(j) = n p (1 − p) , u 2 u Sn Sn = n p ln + n ln d , = n p (1 − p) ln . Var ln E ln S S d d Moreover, 1 − Ψ(n, k, p) = P (j ≤ k − 1) = P Sn u ln ≤ (k − 1) ln + n ln d d S . As k is the smallest integer such that k ≥ ln SXdn / ln ud , there exists an α ∈ (0, 1] such that X u −α, k − 1 = ln / ln n Sd d 151 and so 1 − Ψ(n, k, p) = P Sn X u ln ≤ ln − α ln S d S 152 . Define ln u with prob. p Xin := ln d with prob 1 − p i = 1, . . . , n , and Zn := X1n + . . . + Xnn. Then, for each n, {X1n, . . . , Xnn} are independent and Zn ∼ ln SSn . Hence 1 − Ψ(n, k, p) = P (Zn ≤ βn) , X u βn := ln − α ln . S d Assuming e.g. the CRR model, we get σ2 where μ := r − , 2 E(Zn) = n p ln u + n (1 − p) ln d → μ (T − t), 2u → σ 2 (T − t) , Var(Zn) = n p (1 − p) ln d X u X βn = ln − α ln → ln , S d√ S √ σ T −t n √ as n → ∞ . ≡ yn → 0 and |Xi | ≤ |σ Δt| = n 153 Hence Theorem 5.10. implies that D Zn −→ Z ∼ N (μ (T − t), σ 2 (T − t)) . 154 So, as n → ∞ 1 1 − Ψ(n, k, p) = P (Zn ≤ βn) → p 2 π σ 2 (T On letting y := we have 1 1 − Ψ(n, k, p) → √ 2π Hence =Φ ln S X ln X S e − (s−μ (T −t))2 2 σ 2 (T −t) ds . −∞ s − μ (T − t) √ , σ T −t ln X S ! ! ln X S ! + (r − − t) √ σ T −t = Φ(d2) . Z (T −t) ln X S −μ √ σ T −t 2 e − y2 dy = Φ −∞ ln Ψ(n, k, p) → 1 − Φ − t) Z X S − μ (T − t) √ σ T −t σ2 2 ) (T Similarly one can show Ψ(n, k, p0) → Φ(d1). 155 =Φ − ! − μ (T − t) √ σ T −t − μ (T − t) √ σ T −t . 14. Implementation. Choose the number of time steps that is required to march down until the expiration T −t time T to be n, so that Δt = . n At time ti = t + i Δt there are i + 1 nodes. Denote these nodes as (i, j) with i = 0, 1, . . . , n and j = 0, 1, . . . , i, where the first component i refers to time ti and the second component j refers to the asset price S uj di−j . Compute the option prices Vjn = V (S uj dn−j , T ), j = 0, 1, . . . , n, at expiration time tn = T over n + 1 nodes. Then go backwards one step and compute option prices at time tn−1 = T − Δt using the one period binomial option pricing formula over n nodes. Continue until reaching the current time t0 = t. 156 The recursive formula is 1 i i+1 i+1 Vj = p Vj+1 + (1 − p) Vj , j = 0, . . . , i and i = n − 1, . . . , 0 , R where R−d . p= u−d The value V00 = V (S, t) is the option price at the current time t. C++ Exercise: Write a program to implement the Cox–Ross–Rubinstein model to price options with a non-dividend paying underlying asset. The inputs are the asset price S, the strike price X, the riskless interest rate r, the current time t and the maturity time T , the volatility σ, and the number of steps n. The outputs are the European call and put option prices. 157 15. Greeks. Assume the current time is t = 0. Some common sensitivity measures of a European call option price c (the rate of change of the call price with respect to the underlying factors) are delta, gamma, vega, rho, theta, defined by ∂c ∂ 2c ∂c ∂c ∂c δ= , γ= , ρ = , θ = − . , v = 2 ∂S ∂S ∂σ ∂r ∂T Exercise. Derive the following relations from the call price (16). δ = Φ(d1) , Φ0(d1) √ , γ = S√ σ T v = S T Φ0(d1) , ρ = X T e−r T Φ(d2) , S Φ0(d1) σ √ θ = − − r X e−r T Φ(d2) . 2 T 158 16. Dynamic Hedging. Greeks can be used to hedge an option position against changes of underlying factors. 
For example, to hedge a short position of a European call against changes of the asset price, one should keep a long position of the underlying asset with δ units. This strategy requires continuous trading and is prohibitively expensive if there are any transaction costs. 159 17. Dividends. (a) If the asset pays a continuous dividend yield at a rate q, then use r − q instead of r in building the asset price tree, but still use r in discounting the option values in the backward calculation. In the risk-neutral model, the asset has a rate of return of r − q. So (8) becomes (9) becomes p u + (1 − p) d = e(r−q) Δt , p u2 + (1 − p) d2 = e(2 r−2 q+σ 1 However, the discount factor remains the same: R 1 = e−r Δt . R 160 2 ) Δt . (b) If the asset pays a discrete proportional dividend β S between ti−1 and ti, then the asset price at node (i, j) drops to (1 − β) uj di−j S instead of the usual uj di−j S. The tree is still recombined. 161 S• ◦ • • • (1 − β) u2 S • • (1 − β) u d S ◦ • ◦ • (1 − β) d2 S • • • (1 − β) u4 S • • • • (1 − β) d4 S (c) If the asset pays a discrete cash dividend D between ti−1 and ti, then the asset price at node (i, j) drops to uj di−j S −D instead of the usual uj di−j S. There are i + 1 new recombined trees emanating from the nodes at time ti, but the original tree is no longer recombined. 162 Suppose a cash dividend D is paid between ti−1 and ti, then at time ti+m there are (i + 1) (m + 1) nodes, instead of i + m + 1 in a recombined tree. If there are several cash dividends, then the number of nodes is much larger than in a normal tree. Example. Assume there is a cash dividend D between periods 1 and 2. Then the possible asset prices at time n = 2 are {u2 S − D, u d S − D, d2 S − D}. At n = 3 there are 6 nodes: u ∙ {u2 S − D, u d S − D, d2 S − D} and d ∙ {u2 S − D, u d S − D, d2 S − D}, instead of 4. [Reason: d (u2 S − D) 6= u (u d S − D)]. 163 • u2 (u2 S − D) ◦ • S• • ◦ • • u2 S − D • • • udS − D ◦ • • • d2 S − D • • • • • • • • • 164 d2 (d2 S − D) 18. Trinomial Model. Suppose the asset price is S at time t. At time t + Δt the asset price can be u S (up-state) with probability pu, m S (middle-state) with probability pm, and d S (down-state) with probability pd. Then the risk neutral one period option pricing formula is 1 V (S, t) = (pu V (u S, t + Δt) + pm V (m S, t + Δt) + pd V (d S, t + Δt)) . R The trinomial tree approach is equivalent to the explicit finite difference method. By equating the first and second moments of the asset price S(t + Δt) in both continuous and discrete models, we obtain the following equations: pu + pm + pd = 1 , pu u + pm m + pd d = er Δt , pu u2 + pm m2 + pd d2 = e(2 r+σ 2 ) Δt . There are three equations and six variables u, m, d and pu, pm, pd. We need three more equations to uniquely determine these variables. 165 19. Boyle Model. The three additional conditions are m = 1, u d = 1, u=e λσ √ Δt , where λ is a free parameter. Exercise. Show that the risk-neutral probabilities are given by (W − R) u − (R − 1) , pu = (u − 1) (u2 − 1) (W − R) u2 − (R − 1) u3 pd = , 2 (u − 1) (u − 1) where R = er Δt and W = e(2 r+σ 2 ) Δt p m = 1 − pu − pd , . Note that if λ = 1, which corresponds to the choice of u as in the CRR model, certain sets of parameters can lead to pm < 0. To rectify this, choose λ > 1. 166 Boyle claimed that if pu ≈ pm ≈ pd ≈ 13 , then the trinomial scheme with 5 steps is comparable to the CRR binomial scheme with 20 steps. 167 20. Hull–White Model. This is the same as the Boyle model with λ = √ 3. Exercise. 
Show that the risk-neutral probabilities are r Δt 1 1 2 r − pd = − σ , 2 6 12 σ 2 2 pm = , 3 r 1 1 2 Δt σ , r − pu = + 2 6 12 σ 2 if terms of order O(Δt) are ignored. 168 21. Kamrad–Ritchken Model. If S follows a lognormal process, then we can write ln S(t + Δt) = ln S(t) + Z, where Z is a normal random variable with mean (r − 12 σ 2) Δt and variance σ 2 Δt. KR suggested to approximate Z by a discrete random variable Z a as follows: Z a = Δx with probability pu, 0 with probability pm, and −Δx with probability pd, where √ Δx = λ σ Δt and λ ≥ 1. The corresponding u, m, d in the trinomial tree are u = eΔx, m = 1, d = e−Δx. By omitting the higher order term O((Δt)2), they showed that the risk-neutral probabilities are 1 1 1 2 √ + Δt , r− σ pu = 2 2λ 2λσ 2 1 pm = 1 − 2 , λ 1 1 1 2 √ pd = − r− σ Δt . 2 2λ 2λσ 2 Note that if λ = 1 then pm = 0 and the trinomial scheme is reduced to the binomial 169 scheme. KR claimed that if pu = pm = pd = 13 , then the trinomial scheme with 15 steps is comparable to the binomial scheme (pm = 0) with 55 steps. They also discussed the trinomial scheme with two correlated state variables. 170 and the equations become with prob. pu Δx Za = 0 with prob. pm −Δx with prob. pd pu + pm + pd = 1 , E(Z a) = Δx (pu − pd) = (r − 12 σ 2) Δt , (1) (2) Var(Z a) = (Δx)2 (pu + pd) − (Δx)2 (pu − pd)2 = σ 2 Δt . | {z } O((Δt)2 ) Dropping the O((Δt)2) term leads to (Δx)2 (pu + pd) = σ 2 Δt . 171 (3) Hence ⇒ 1 Δt 2 Δt 1 2 1 Δt 2 Δt 1 2 pu = σ + σ − (r − σ ) , pd = (r − σ ) . 2 (Δx)2 Δx 2 2 (Δx)2 Δx 2 1 1 1 1 2 √ 1 1 1 1 2 √ (r − σ ) Δt , pd = (r − σ ) Δt . pu = + − 2 2 2 λ λσ 2 2 λ λσ 2 172 8. Finite Difference Methods 1. Diffusion Equations of One State Variable. 2 ∂u ∂ u 2 , (x, t) ∈ D , (17) =c ∂t ∂x2 where t is a time variable, x is a state variable, and u(x, t) is an unknown function satisfying the equation. To find a well-defined solution, we need to impose the initial condition u(x, 0) = u0(x) (18) and, if D = [a, b] × [0, ∞), the boundary conditions u(a, t) = ga(t) and where u0, ga, gb are continuous functions. 173 u(b, t) = gb(t) , (19) If D = (−∞, ∞) × (0, ∞), we need to impose the boundary conditions lim u(x, t) e |x|→∞ −a x2 =0 for any a > 0. (20) implies u(x, t) does not grow too fast as |x| → ∞. (20) The diffusion equation (17) with the initial condition (18) and the boundary conditions (19) is well-posed, i.e. there exists a unique solution that depends continuously on u0, ga and gb. 174 2. Grid Points. To find a numerical solution to equation (17) with finite difference methods, we first need to define a set of grid points in the domain D as follows: Choose a state step size Δx = b−a N (N is an integer) and a time step size Δt, draw a set of horizontal and vertical lines across D, and get all intersection points (xj , tn), or simply (j, n), where xj = a + j Δx, j = 0, . . . , N , and tn = n Δt, n = 0, 1, . . .. If D = [a, b] × [0, T ] then choose Δt = 0, . . . , M . 175 T M (M is an integer) and tn = n Δt, n = tM = T .. . t3 t2 t1 t0 = 0 x0 = a x 1 x2 ... 176 xN = b 3. Finite Differences. The partial derivatives ∂u ∂ 2u ux := and uxx := 2 ∂x ∂x are always approximated by central difference quotients, i.e. unj+1 − unj−1 unj+1 − 2 unj + unj−1 ux ≈ and uxx ≈ 2 Δx (Δx)2 (21) at a grid point (j, n). Here unj = u(xj , tn). Depending on how ut is approximated, we have three basic schemes: explicit, implicit, and Crank–Nicolson schemes. 177 4. Explicit Scheme. 
If u_t is approximated by a forward difference quotient

  u_t ≈ (u_j^{n+1} − u_j^n) / Δt

at (j, n), then the corresponding difference equation to (17) at grid point (j, n) is

  w_j^{n+1} = λ w_{j+1}^n + (1 − 2 λ) w_j^n + λ w_{j−1}^n ,   (22)

where λ = c² Δt / (Δx)². The initial condition is w_j^0 = u_0(x_j), j = 0, . . . , N, and the boundary conditions are w_0^n = g_a(t_n) and w_N^n = g_b(t_n), n = 0, 1, . . .. The difference equations (22), j = 1, . . . , N − 1, can be solved explicitly.

5. Implicit Scheme. If u_t is approximated by a backward difference quotient

  u_t ≈ (u_j^{n+1} − u_j^n) / Δt

at (j, n + 1), then the corresponding difference equation to (17) at grid point (j, n + 1) is

  −λ w_{j+1}^{n+1} + (1 + 2 λ) w_j^{n+1} − λ w_{j−1}^{n+1} = w_j^n .   (23)

The difference equations (23), j = 1, . . . , N − 1, together with the initial and boundary conditions as before, can be solved using the Crout algorithm or the SOR algorithm.

Explicit Method.

  (w_j^{n+1} − w_j^n) / Δt = c² (w_{j+1}^n − 2 w_j^n + w_{j−1}^n) / (Δx)²   (†)

Letting λ := c² Δt / (Δx)² gives (22).

Implicit Method.

  (w_j^{n+1} − w_j^n) / Δt = c² (w_{j+1}^{n+1} − 2 w_j^{n+1} + w_{j−1}^{n+1}) / (Δx)²   (‡)

Letting λ := c² Δt / (Δx)² gives (23). In matrix form this reads

  ( 1 + 2λ    −λ                    )
  (   −λ    1 + 2λ    −λ            )  w = b.
  (           ...     ...      −λ   )
  (                   −λ    1 + 2λ  )

The matrix is tridiagonal and diagonally dominant  ⇒  Crout / SOR.

6. Crank–Nicolson Scheme. The Crank–Nicolson scheme is the average of the explicit scheme at (j, n) and the implicit scheme at (j, n + 1). The resulting difference equation is

  −(λ/2) w_{j−1}^{n+1} + (1 + λ) w_j^{n+1} − (λ/2) w_{j+1}^{n+1} = (λ/2) w_{j−1}^n + (1 − λ) w_j^n + (λ/2) w_{j+1}^n .   (24)

The difference equations (24), j = 1, . . . , N − 1, together with the initial and boundary conditions as before, can be solved using the Crout algorithm or the SOR algorithm.

Crank–Nicolson.  ½ [(†) + (‡)] gives

  (w_j^{n+1} − w_j^n) / Δt = ½ c² (w_{j−1}^n − 2 w_j^n + w_{j+1}^n) / (Δx)² + ½ c² (w_{j−1}^{n+1} − 2 w_j^{n+1} + w_{j+1}^{n+1}) / (Δx)² .

Letting μ := ½ c² Δt / (Δx)² = λ/2 gives

  −μ w_{j+1}^{n+1} + (1 + 2 μ) w_j^{n+1} − μ w_{j−1}^{n+1} = ŵ_j^{n+1}   with   ŵ_j^{n+1} = μ w_{j+1}^n + (1 − 2 μ) w_j^n + μ w_{j−1}^n .

This can be interpreted as

  ŵ_j^{n+1} — predictor (explicit method)
  w_j^{n+1} — corrector (implicit method)

7. Local Truncation Errors. These are measures of the error by which the exact solution of a differential equation does not satisfy the difference equation at the grid points, and are obtained by substituting the exact solution of the continuous problem into the numerical scheme. A necessary condition for the convergence of the numerical solutions to the continuous solution is that the local truncation error tends to zero as the step size goes to zero. In this case the method is said to be consistent. It can be shown that all three methods are consistent. The explicit and implicit schemes have local truncation errors O(Δt, (Δx)²), while that of the Crank–Nicolson scheme is O((Δt)², (Δx)²).

Local Truncation Error. For the explicit scheme we get for the LTE at (j, n)

  E_j^n = [u(x_j, t_{n+1}) − u(x_j, t_n)] / Δt − c² [u(x_{j−1}, t_n) − 2 u(x_j, t_n) + u(x_{j+1}, t_n)] / (Δx)² .

With the help of a Taylor expansion at (x_j, t_n) we find that

  [u(x_j, t_{n+1}) − u(x_j, t_n)] / Δt = u_t(x_j, t_n) + O(Δt) ,
  [u(x_{j−1}, t_n) − 2 u(x_j, t_n) + u(x_{j+1}, t_n)] / (Δx)² = u_xx(x_j, t_n) + O((Δx)²) .

Hence

  E_j^n = u_t(x_j, t_n) − c² u_xx(x_j, t_n) + O(Δt) + O((Δx)²) = O(Δt) + O((Δx)²) ,

since u_t − c² u_xx = 0 by (17).

8. Numerical Stability. Consistency is only a necessary but not a sufficient condition for convergence. Roundoff errors incurred during calculations may lead to a blow up of the solution or erode the whole computation. A scheme is stable if roundoff errors are not amplified in the calculations.
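Since the implicit and Crank–Nicolson systems above are tridiagonal and diagonally dominant, they can be solved in O(N) operations per time step. The following is a minimal, illustrative C++ sketch of such a forward-elimination/back-substitution (Thomas-type) solve for (23); it is not the course's reference Crout implementation, and the function and variable names are invented here. The boundary values w_0^{n+1} and w_N^{n+1} are assumed to have been added to the first and last entries of the right-hand side.

```cpp
#include <vector>

// Illustrative sketch: solve the tridiagonal system
//   -lambda*w[j-1] + (1+2*lambda)*w[j] - lambda*w[j+1] = d[j],  j = 1..N-1,
// by forward elimination and back substitution (Thomas-type solve).
// Boundary contributions are assumed to be folded into d already.
std::vector<double> solveImplicitStep(const std::vector<double>& d, double lambda)
{
    const int m = static_cast<int>(d.size());      // m = N-1 unknowns
    std::vector<double> c(m), rhs(m), w(m);
    const double a = 1.0 + 2.0 * lambda;           // diagonal entry
    const double b = -lambda;                      // sub- and super-diagonal entry

    // forward sweep
    c[0] = b / a;
    rhs[0] = d[0] / a;
    for (int j = 1; j < m; ++j) {
        double denom = a - b * c[j - 1];
        c[j] = b / denom;
        rhs[j] = (d[j] - b * rhs[j - 1]) / denom;
    }
    // back substitution
    w[m - 1] = rhs[m - 1];
    for (int j = m - 2; j >= 0; --j)
        w[j] = rhs[j] - c[j] * w[j + 1];
    return w;
}
```

Because 1 + 2λ > 2|−λ| for every λ > 0, the system is strictly diagonally dominant and no pivoting is needed.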
The Fourier method can be used to check if a scheme is stable. Assume that a numerical scheme admits a solution of the form vjn = a(n)(ω) ei j ω Δx , √ where ω is the wave number and i = −1. 186 (25) Define a(n+1)(ω) G(ω) = (n) , a (ω) where G(ω) is an amplification factor, which governs the growth of the Fourier component a(ω). The von Neumann stability condition is given by |G(ω)| ≤ 1 for 0 ≤ ωΔx ≤ π. It can be shown that the explicit scheme is stable if and only if λ ≤ 12 , called conditionally stable, and the implicit and Crank–Nicolson schemes are stable for any values of λ, called unconditionally stable. 187 Stability Analysis. For the explicit scheme we get on substituting (25) into (22) that a(n+1)(ω) ei j ω Δx = λ a(n)(ω) ei (j+1) ω Δx + (1 − 2 λ) a(n)(ω) ei j ω Δx + λ a(n)(ω) ei (j−1) ω Δx ⇒ a(n+1)(ω) G(ω) = (n) = λ ei ω Δx + (1 − 2 λ) + λ e−i ω Δx . a (ω) The von Neumann stability condition then is |G(ω)| ≤ 1 ⇐⇒ ⇐⇒ ⇐⇒ ⇐⇒ ⇐⇒ for all 0 ≤ ω Δx ≤ π. |λ ei ω Δx + (1 − 2 λ) + λ e−i ω Δx| ≤ 1 |(1 − 2 λ) + 2λ cos(ω Δx)| ≤ 1 2 ω Δx |≤1 |1 − 4 λ sin 2 ω Δx 0 ≤ 4 λ sin2 ≤2 2 1 0≤λ≤ 2 ω Δx 2 sin 2 188 [cos 2 α = 1 − 2 sin2 α] This is equivalent to 0 ≤ λ ≤ 12 . 189 Remark. The explicit method is stable, if and only if (Δx)2 Δt ≤ . (†) 2 2c (†) is a strong restriction on the time step size Δt. If Δx is reduced to 12 Δx, then Δt must be reduced to 14 Δt. So the total computational work increases by a factor 8. Example. ut = uxx Take Δx = 0.01. Then (x, t) ∈ [0, 1] × [0, 1] 1 ⇒ Δt ≤ 0.00005 λ≤ 2 I.e. the number of grid points is equal to 1 1 = 100 × 20, 000 = 2 × 106 . Δx Δt 190 Remark. In vector notation, the explicit scheme can be written as wn+1 = A wn + bn , n T N −1 where wn = (w1n, . . . , wN ) ∈ R and −1 1 − 2λ λ λ 1 − 2λ λ ∈ R(N −1)×(N −1), A= ... ... λ 1 − 2λ For the implicit method we get B wn+1 = wn + bn+1 , λ w0n 0 n N −1 . b = . . ∈ R 0 n λ wN 1 + 2 λ −λ −λ 1 + 2 λ −λ . where B = ... ... −λ 1 + 2 λ 191 Remark. Forward diffusion equation: Backward diffusion equation: ut − c2 uxx = 0 ut + c2 uxx = 0 u(x, T ) = uT (x) u(a, t) = ga(t), u(b, t) = gb(t) t ≥ 0. t≤T ∀x ∀t [as before] [Note: We could use the transformation v(x, t) := u(x, T − t) in order to transform this into a standard forward diffusion problem.] We can solve the backward diffusion equation directly by starting at t = T and solving “backwards”, i.e. given wn+1, find wn. e wn + b̃n Implicit: wn+1 = A e wn+1 = wn + b̃n Explicit: B The von Neumann stability condition for the backward problem then becomes n a (ω) e |G(ω)| = n+1 ≤ 1 . a (ω) 192 Stability of the Binomial Model. The binomial model is an explicit method for a backward equation. 1 1 n+1 n+1 n+1 n+1 Vjn = (p Vj+1 + (1 − p) Vj−1 ) = (p Vj+1 + 0 Vjn+1 + (1 − p) Vj−1 ) R R for j = −n, −n + 2, . . . , n − 2, n and n = N − 1, . . . , 1, 0. N N N N , V−N , . . . , V , V Here the initial values V−N +2 N −2 N are given. Now let Vjn = a(n)(ω) ei j ω Δx, then 1 a(n)(ω) ei j ω Δx = p a(n+1)(ω) ei (j+1) ω Δx + (1 − p) a(n+1)(ω) ei (j−1) ω Δx R −r Δt i ω Δx −i ω Δx e ⇒ G(ω) = p e + (1 − p) e e −r Δt = cos(ω Δx) + (2 p − 1) i sin(ω Δx) e | {z } = q ⇒ 2 e |G(ω)| = (cos2(ω Δx) + q 2 sin2(ω Δx)) e−2 r Δt = (1 + (q 2 − 1) sin2(ω Δx)) e−2 r Δt ≤ e−2 r Δt ≤ 1 if q 2 ≤ 1 ⇐⇒ −1 ≤ q ≤ 1 ⇐⇒ p ∈ [0, 1]. Hence the binomial model is stable. 193 Stability of the CRR Model. We know that the binomial model is stable if p ∈ (0, 1). For the CRR model we have that u=e σ √ Δt , d=e −σ √ Δt , so p ∈ (0, 1) is equivalent to u > R > d. 
Clearly, for Δt small, we can ensure that σ e √ Δt p= R−d , u−d > er Δt . Hence the CRR model is stable, if Δt is sufficiently small, i.e. if Δt < σ2 . r2 Alternatively, one can argue (less rigorously) as follows. Since Δx = u S − S = √ √ σ Δt S (e − 1) ≈ S σ Δt and as the BSE can be written as 1 ut + σ 2 S 2 uSS + . . . = 0 , |2 {z } c2 194 it follows that Δt 1 2 2 Δt 1 = λ=c = σ S 2 2 (Δx)2 2 S σ Δt 2 2 195 ⇒ CRR is stable. 9. Simplification of the BSE. Assume V (S, t) is the price of a European option at time t. Then V satisfies the Black–Scholes equation (13) with appropriate initial and boundary conditions. Define τ = T − t, w(τ, x) = eα x+β τ V (t, S) , x = ln S, where α and β are parameters. Then the Black–Scholes equation can be transformed into a basic diffusion equation: ∂w 1 2 ∂ 2w = σ ∂τ 2 ∂x2 with a new set of initial and boundary conditions. Finite difference methods can be used to solve the corresponding difference equations and hence to derive option values at grid points. 196 Transformation of the BSE. Consider a call option. ∂V Let τ = T −t be the remaining time to maturity. Set u(x, τ ) = V (x, t). Then ∂u = − ∂τ ∂t and the BSE (13) is equivalent to 1 uτ = σ 2 S 2 uSS + r S uS − r u , (†) 2 (IC) u(S, 0) = V (S, T ) = (S − X)+ , u(0, τ ) = V (0, t) = 0 , u(S, τ ) = V (S, t) ≈ S as S → ∞. (BC) Let x = ln S ( ⇐⇒ S = ex). Set ũ(x, τ ) = u(S, τ ). Then ũx = uS ex = S uS , ũxx = S uS + S 2 uSS and (†) becomes 1 2 1 2 ũτ = σ ũxx + (r − σ ) ũx − r ũ , 2 2 ũ(x, 0) = u(S, 0) = (ex − X)+ , ũ(x, τ ) = u(0, τ ) = 0 as x → −∞ , (‡) (IC) ũ(x, τ ) = u(ex, τ ) ≈ ex as x → ∞. 197 (BC) −a x2 Note that the growth condition (20), lim|x|→∞ ũ(x, τ ) e = 0 for any a > 0, is satisfied. Hence (‡) is well defined. Let w(x, τ ) = eα x+β τ ũ(x, τ ) ⇐⇒ ũ(x, τ ) = e−α x−β τ w(x, τ ) =: C w(x, τ ). Then ũτ = C (−β w + wτ ) ũx = C (−α w + wx) ũxx = C (−α (−α w + wx) + (−α wx + wxx)) = C (α2 w − 2 α wx + wxx) . So (‡) is equivalent to 1 2 1 σ C (α2 w − 2 α wx + wxx) + (r − σ 2) C (−α w + wx) − r C w . 2 2 In order to cancel the w and wx terms we need to have −β = 1 σ 2 α2 − (r − 1 σ 2) α − r , α = 1 (r − 1 σ 2) , 2 2 2 σ2 ⇐⇒ 0 = 1 σ 2 (−2 α) + r − 1 σ 2 . β = 1 2 (r − 1 σ 2)2 + r . C (−β w + wτ ) = 2 2 2σ 198 2 With this choice of α and β the equation (‡) is equivalent to 1 2 wτ = σ wxx , 2 w(x, 0) = eα x ũ(x, 0) = eα x (ex − X)+ , w(x, τ ) = 0 as x → −∞ , (]) (IC) w(x, τ ) ≈ eα x+β τ ex as x → ∞. (BC) Note that the growth condition (20) is satisfied. Hence (]) is well defined. Implementation. 1. Choose a truncated interval [a, b] to approximate (−∞, ∞). e−8 = 0.0003, e8 = 2981 ⇒ [a, b] = [−8, 8] serves all practical purposes. 2. Choose integers N , M to get the step sizes Δx = b−a N and Δτ = T −t M . Grid points (xj , τn): xj = a + j Δx, j = 0, 1, . . . , N and τn = n Δτ , n = 0, 1, . . . , M . Note: x0, xN and τ0 represent the boundary of the grid with known values. 199 3. Solve (]) with w(x, 0) = eα x (ex − X)+ , e(α+1) b+β τ w(a, τ ) = 0 , w(b, τ ) = eα b (eb − X) eβ τ (IC) or (a better choice) . (BC) Note: If the explicit method is used, N and M need to be chosen such that 1 2 Δτ 1 σ ≤ 2 (Δx)2 2 σ 2 (T − t) 2 M≥ N . (b − a)2 ⇐⇒ If the implicit or Crank–Nicolson scheme is used, there are no restrictions on N , M . Use Crout or SOR to solve. 4. Assume w(xj , τM ), j = 0, 1, . . . , N are the solutions from step 3, then the call option price at time t is V (Sj , t) = e−α xj −β (T −t) w(xj , T − }t) | {z τM where Sj = exj and T − t ≡ τM . 200 j = 0, 1, . . . , N , Note: The Sj are not equally spaced. 
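To make steps 1–4 above concrete, here is a minimal, illustrative C++ sketch (not the course's reference code) of the explicit scheme applied to (]) with the call data, followed by the back-transformation of step 4. All names are invented for this sketch, and M must be chosen large enough that λ = ½ σ² Δτ/(Δx)² ≤ ½ for stability.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative sketch: explicit finite differences for the transformed equation
// w_tau = (1/2) sigma^2 w_xx on [a,b] x [0,tau], tau = T - t, with the call
// initial/boundary data from (]), then V = e^{-alpha x - beta tau} w.
std::vector<double> callByHeatEquation(double X, double r, double sigma,
                                       double tau,          // remaining time T - t
                                       double a, double b,  // truncated x-interval
                                       int N, int M,
                                       std::vector<double>& Sgrid)
{
    const double dx = (b - a) / N, dt = tau / M;
    const double alpha = (r - 0.5 * sigma * sigma) / (sigma * sigma);
    const double beta  = std::pow(r - 0.5 * sigma * sigma, 2) / (2.0 * sigma * sigma) + r;
    const double lambda = 0.5 * sigma * sigma * dt / (dx * dx);  // stability needs lambda <= 1/2

    std::vector<double> w(N + 1), wnew(N + 1);
    for (int j = 0; j <= N; ++j) {                               // initial condition of (])
        double x = a + j * dx;
        w[j] = std::exp(alpha * x) * std::max(std::exp(x) - X, 0.0);
    }
    for (int n = 1; n <= M; ++n) {                               // explicit scheme, c^2 = sigma^2 / 2
        for (int j = 1; j < N; ++j)
            wnew[j] = lambda * w[j + 1] + (1.0 - 2.0 * lambda) * w[j] + lambda * w[j - 1];
        wnew[0] = 0.0;                                           // boundary as x -> -infinity
        wnew[N] = std::exp((alpha + 1.0) * b + beta * n * dt);   // "better choice" boundary at x = b
        w.swap(wnew);
    }
    Sgrid.resize(N + 1);                                         // transform back to option prices
    std::vector<double> V(N + 1);
    for (int j = 0; j <= N; ++j) {
        double x = a + j * dx;
        Sgrid[j] = std::exp(x);
        V[j] = std::exp(-alpha * x - beta * tau) * w[j];
    }
    return V;
}
```

With the implicit or Crank–Nicolson scheme, the inner loop would instead assemble and solve a tridiagonal system (Crout or SOR), and the restriction on M disappears.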
201 10. Direct Discretization of the BSE. Exercise: Apply the Crank–Nicolson scheme directly to the BSE (13), i.e. there is no transformation of variables, and write out the resulting difference equations and do a stability analysis. C++ Exercise: Write a program to solve the BSE (13) using the result of the previous exercise and the Crout algorithm. The inputs are the interest rate r, the volatility σ, the current time t, the expiry time T , the strike price X, the maximum price Smax, the number of intervals N in [0, Smax], and the number of subintervals M in [t, T ]. The output are the asset prices Sj , j = 0, 1, . . . , N , at time t, and their corresponding European call and put prices (with the same strike price X). 202 11. Greeks. Assume that the asset prices Sj and option values Vj , j = 0, 1, . . . , N , are known at time t. The sensitivities of V at Sj , j = 1, . . . , N − 1, are computed as follows: ∂V Vj+1 − Vj−1 |S=Sj ≈ δj = , ∂S Sj+1 − Sj−1 which is Vj+1 − Vj−1 , if S is equally spaced. 2 ΔS 2 γj = ∂ V |S=Sj ≈ 2 ∂S Vj+1 −Vj Sj+1 −Sj Sj − Sj−1 Vj+1 − 2 Vj + Vj−1 , if S is equally spaced. which is (ΔS)2 203 V −V − Sjj −Sj−1 j−1 , 12. Diffusion Equations of Two State Variables. 2 2 ∂u ∂ u ∂ u = α2 + 2 , (x, y, t) ∈ [a, b] × [c, d] × [0, ∞). 2 ∂t ∂x ∂y (26) The initial conditions are ∀ (x, y) ∈ [a, b] × [c, d] , u(x, y, 0) = u0(x, y) and the boundary conditions are u(a, y, t) = ga(y, t), u(b, y, t) = gb(y, t) ∀ y ∈ [c, d], t ≥ 0 , u(x, c, t) = gc(x, t), u(x, d, t) = gd(x, t) ∀ x ∈ [a, b], t ≥ 0 . and Here we assume that all the functions involved are consistent, in the sense that they have the same value at common points, e.g. ga(c, t) = gc(a, t) for all t ≥ 0. 204 13. Grid Points. (xi, yj , tn), where b−a , i = 0, . . . , I , xi = a + i Δx, Δx = I d−c yj = c + j Δy, Δy = , j = 0, . . . , J , J T tn = n Δt, Δt = , n = 0, . . . , N N and I, J, N are integers. Recalling the finite differences (21), we have uxx uni+1,j − 2 uni,j + uni−1,j ≈ (Δx)2 and uyy at a grid point (i, j, n). 205 uni,j+1 − 2 uni,j + uni,j−1 ≈ . (Δy)2 Depending on how ut is approximated, we have three basic schemes: explicit, implicit, and Crank–Nicolson schemes. 206 14. Explicit Scheme. If ut is approximated by a forward difference quotient n un+1 i,j − ui,j ut ≈ Δt at (i, j, n), then the corresponding difference equation at grid point (i, j, n) is n+1 n n n n n wi,j = (1 − 2 λ − 2 μ) wi,j + λ wi+1,j + λ wi−1,j + μ wi,j+1 + μ wi,j−1 (27) for i = 1, . . . , I − 1 and j = 1, . . . , J − 1, where Δt λ=α (Δx)2 2 and Δt μ=α . (Δy)2 2 (27) can be solved explicitly. It has local truncation error O(Δt, (Δx)2, (Δy)2), but is only conditionally stable. 207 15. Implicit Scheme. If ut is approximated by a backward difference quotient ut ≈ then the difference equation at grid point (i, j, n + 1) is n un+1 i,j −ui,j Δt at (i, j, n + 1), n+1 n+1 n+1 n+1 n+1 n − λ wi+1,j − λ wi−1,j − μ wi,j+1 − μ wi,j−1 = wi,j (1 + 2 λ + 2 μ) wi,j (28) for i = 1, . . . , I − 1 and j = 1, . . . , J − 1. For fixed n, there are (I − 1) (J − 1) unknowns and equations. (28) can be solved by relabeling the grid points and using the SOR algorithm. (28) is unconditionally stable with local truncation error O(Δt, (Δx)2, (Δy)2), but is more difficult to solve, as it is no longer tridiagonal, so the Crout algorithm cannot be applied. 16. Crank–Nicolson Scheme. It is the average of the explicit scheme at (i, j, n) and the implicit scheme at (i, j, n + 1). 
It is similar to the implicit scheme but with the improved local truncation error O((Δt)2, (Δx)2, (Δy)2). 208 Solving the Implicit Scheme. n+1 n+1 n+1 n+1 n+1 n (1 + 2 λ + 2 μ) wi,j − λ wi+1,j − λ wi−1,j − μ wi,j+1 − μ wi,j−1 = wi,j With SOR for ω ∈ (0, 2). For each n = 0, 1, . . . , N 1. Set wn+1,0 := wn and n+1,0 n+1,0 n+1,0 n+1,0 fill in the boundary conditions w0,j , wI,j , wi,0 , wi,J for all i, j. 2. For k = 0, 1, . . . For i = 1, . . . I − 1, j = 1, . . . , J − 1 n+1,k n+1,k+1 n+1,k n+1,k+1 k+1 1 n = 1+2 λ+2 + λ w + λ w + μ w + μ w w w bi,j i,j i+1,j i−1,j i,j+1 i,j−1 μ n+1,k+1 n+1,k k+1 = (1 − ω) wi,j +ωw bi,j wi,j until kwn+1,k+1 − wn+1,k k < ε. 3. Set wn+1 = wn+1,k+1. 209 With Block Jacobi/Gauss–Seidel. w1,j w2,j ∈ RI−1, j = 1, . . . , J − 1, Denote w ej = .. wI−1,j 210 w e1 w e2 (I−1) (J−1) w= . . ∈ R . w eJ−1 On setting c = 1 + 2 λ + 2 μ, we have from (28) for j fixed w1,j+1 w1,j−1 c −λ w1,j −μ −μ −λ c −λ w2,j w2,j+1 w2,j−1 ... . + .. + .. .. .. ... −μ −μ −λ c wI−1,j wI−1,j+1 wI−1,j−1 n+1 n w1,j + λ w0,j n w2,j n+1 n+1 n+1 n . + B w e + B w e = d ⇐⇒ A w e = . j j j+1 j−1 n wI−2,j n+1 n wI−1,j + λ wI,j 211 Rewriting as n+1 n+1 ej+1 +Bw ej−1 = dnj Aw ejn+1 + B w j = 1, . . . , J − 1 den1 dn1 − B A B n d2 B A B dn2 .. =: .. , . . . . . . . . . .. = . B A B . dnJ−2 dnJ−2 n+1 B A w eJ−1 denJ−1 dnJ−1 − B w eJn+1 w e1n+1 n+1 w e 2 w e0n+1 where w e0n+1 and w eJn+1 represent boundary points, leads to the following Block Jacobi 212 iteration: For k = 0, 1, . . . Aw e1n+1,k+1 = −B w e2n+1,k + den1 Aw e2n+1,k+1 = −B w e1n+1,k − B w e3n+1,k + dn2 .. n+1,k+1 n+1,k n+1,k Aw eJ−2 = −B w eJ−3 −Bw eJ−1 + dnJ−2 Aw en+1,k+1 = −B w en+1,k + den J−1 J−1 J−2 Similarly, the Block Gauss–Seidel iteration is given by: For k = 0, 1, . . . Aw e1n+1,k+1 = −B w e2n+1,k + den1 Aw e2n+1,k+1 = −B w e1n+1,k+1 − B w e3n+1,k + dn2 .. n+1,k+1 n+1,k+1 n+1,k Aw eJ−2 = −B w eJ−3 −Bw eJ−1 + dnJ−2 Aw en+1,k+1 = −B w en+1,k + den J−1 J−2 213 J−1 In each case, use the Crout algorithm to solve for w ejn+1,k+1, j = 1, . . . , J − 1. Note on Stability. (n+1) (ω) Recall that in 1d a scheme was stable if |G(ω)| = aa(n)(ω) ≤ 1, where vjn = a(n)(ω) ei j ω Δx. In 2d, this is adapted to n vi,j (n) √ = a (ω) e √ −1 i ω Δx+ −1 j ω Δy 214 . 17. Alternating Direction Implicit (ADI) Method. An alternative finite difference method is the ADI scheme, which is unconditionally stable while the difference equations are still tridiagonal and diagonally dominant. The ADI algorithm can be used to efficiently solve the Black–Scholes two asset pricing equation: Vt + 12 σ12 S12 VS1S1 + 12 σ22 S22 VS2S2 + ρ σ1 σ2 S1 S2VS1S2 + r S1VS1 + r S2VS2 − r V = 0. (29) See Clewlow and Strickland (1998) for details on how to transform the Black–Scholes equation (29) into the basic diffusion equation (26) and then to solve it with the ADI scheme. 215 ADI scheme Implicit method at (i, j, n + 1): n+1 n+1 n+1 n+1 n n n n wi,j − wi,j w − 2 w + w w − 2 w + w i,j i,j−1 i+1,j i,j i−1,j 2 i,j+1 = α2 +α Δt (Δx)2 (Δy)2 | {z } approx. uxx using (i, j, n) data Implicit method at (i, j, n + 2): n+2 n+1 n+2 n+2 n+2 n+1 n+1 n+1 wi,j − wi,j w − 2 w + w w − 2 w + w i+1,j i,j i−1,j i,j+1 i,j i,j−1 2 + α = α2 Δt (Δx)2 (Δy)2 | {z } approx. uyy using (i, j, n + 1) data We can write the two equations as follows: n+1 n+1 n+1 n n n −μ wi,j+1 + (1 + 2 μ) wi,j − μ wi,j−1 = λ wi+1,j + (1 − 2 λ) wi,j + λ wi−1,j n+2 n+2 n+2 n+1 n+1 n+1 −λ wi+1,j + (1 + 2 λ) wi,j − λ wi−1,j = μ wi,j+1 + (1 − 2 μ) wi,j + μ wi,j−1 216 (†) (‡) n+1 To solve (†), fix i = 1, . . . 
, I − 1 and solve a tridiagonal system to get wi,j for j = 1, . . . , J − 1. This can be done with e.g. the Crout algorithm. n+2 for i = To solve (‡), fix j = 1, . . . , J − 1 and solve a tridiagonal system to get wi,j 1, . . . , I − 1. Currently the method works on the interval [tn, tn+2] and has features of an explicit method. In order to obtain an (unconditionally stable) implicit method, we need to n adapt the method so that it works on the interval [tn, tn+1] and hence gives values wi,j for all n = 1, . . . , N . n+ 12 1 Introduce the intermediate time point n + 2 . Then (†) generates wi,j (not used) and n+1 . (‡) generates wi,j μ n+ 12 μ n+ 12 λ n λ n n+ 12 n − wi,j+1 + (1 + μ) wi,j − wi,j−1 = wi+1,j + (1 − λ) wi,j + wi−1,j 2 2 2 2 1 1 λ n+1 λ n+1 μ n+ 2 μ n+ 12 n+ 2 n+1 − wi+1,j + (1 + λ) wi,j − wi−1,j = wi,j+1 + (1 − μ) wi,j + wi,j−1 2 2 2 2 217 (†) (‡) 9. Simulation 1. Uniform Random Number Generators. In order to use simulation techniques, we first need to generate independent samples from some given distribution functions. The simplest and the most important distribution in simulation is the uniform distribution U [0, 1]. Note that if X is a U [0, 1] random variable, then Y = a + (b − a)X is a U [a, b] random variable. We focus on how to generate U [0, 1] random variables. Examples. • 2 outcomes, p1 = p2 = 1 2 (coin) • 6 outcomes, p1 = . . . = p6 = 1 6 (dice) • 3 outcomes, p1 = p2 = 0.49, p3 = 0.05 218 (roulette wheel) 2. Linear Congruential Generator. A common technique is to generate a sequence of integers ni, defined recursively by ni = (a ni−1) mod m (30) for i = 1, 2, . . . , N , where n0 (6= 0) is called the seed, a > 0 and m > 0 are integers such that a and m have no common factors. (30) generates a sequence of numbers in the set {1, 2, . . . , m − 1}. Note that ni are periodic with period ≤ m − 1, this is because there are not m different ni and two in {n0, . . . , nm−1} must be equal: ni = ni+p with p ≤ m − 1. If the period is m − 1 then (30) is said to have full period. The condition of full period is that m is a prime, am−1 − 1 is divisible by m, and aj − 1 is not divisible by m for j = 1, . . . , m − 2. Example. n0 = 35, a = 13, m = 100. Then the sequence is {35, 55, 15, 95, 35, . . .}. So p = 4. 219 3. Pseudo–Uniform Random Numbers. If we define ni m then xi is a sequence of numbers in the interval (0, 1). xi = If (30) has full period then these xi’s are called pseudo-U [0, 1] random numbers. In view of the periodic property, the number m should be as large as possible, because a small set of numbers makes the outcome easier to predict – a contrast to randomness. The main drawback of linear congruential generators is that consecutive points obtained lie on parallel hyperplanes, which implies that the unit cube cannot be uniformly filled with these points. x i+1 For example, if a = 6 and m = 11, then the ten distinct points generated lie on just two parallel lines in the unit square. xi 220 4. Choice of Parameters. A good choice of a and m is given by a = 16807 and m = 2147483647 = 231 − 1. The seed n0 can be any positive integer and can be chosen manually. This allows us to repeatedly generate the same set of numbers, which may be useful when we want to compare different simulation techniques. In general, we let the computer choose n0 for us, a common choice is the computer’s internal clock. For details on the implementation of LCG, see Press et al. (1992). 
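As an illustration of (30) with the constants just quoted, here is a minimal C++ sketch of a linear congruential generator (the Park–Miller "minimal standard" parameters). It uses 64-bit arithmetic to avoid overflow, whereas Press et al. (1992) achieve the same with Schrage's trick; the class name is invented here.

```cpp
#include <cstdint>

// Illustrative sketch of the LCG (30): n_i = (a * n_{i-1}) mod m,
// with a = 16807 and m = 2^31 - 1, returning x_i = n_i / m in (0,1).
class Lcg {
public:
    explicit Lcg(std::uint32_t seed) : n_(seed % m_) {
        if (n_ == 0) n_ = 1;                        // the seed must not be 0
    }
    double next() {
        n_ = static_cast<std::uint32_t>((a_ * static_cast<std::uint64_t>(n_)) % m_);
        return static_cast<double>(n_) / static_cast<double>(m_);
    }
private:
    static constexpr std::uint64_t a_ = 16807;
    static constexpr std::uint32_t m_ = 2147483647u;  // 2^31 - 1, a Mersenne prime
    std::uint32_t n_;                                 // current state n_i
};
```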
For state of the art random number generators (linear and nonlinear), see the pLab website at http://random.mat.sbg.ac.at/. It includes extensive information on random number generation, as well as links to free software in a variety of computing languages. 221 5. Normal Random Number Generators Once we have generated a set of pseudo-U [0, 1] random numbers, we can generate pseudo-N (0, 1) random numbers. Again there are several methods to generate a sequence of independent N (0, 1) random numbers. Note that if X is a N (0, 1) random variable, then Y = μ + σ X is a N (μ, σ 2) random variable. 222 6. Convolution Method. This method is based on the Central Limit Theorem. We know that if Xi are independent identically distributed random variables, with finite mean μ and finite variance σ 2, then Pn i=1 Xi − n μ √ Zn = (31) σ n converges in distribution to a N (0, 1) random variable, i.e. P (Zn ≤ x) → Φ(x), If X is U [0, 1] distributed then its mean is 1 2 n → ∞. and its variance is 1 12 . If we choose n = 12 then (31) is simplified to Zn = 12 X i=1 223 Xi − 6. (32) Note that Zn generated in this way is only an approximate normal random number. This is due to the fact that the Central Limit Theorem only gives convergence in distribution, not almost surely, and n should tend to infinity. In practice there are hardly any differences between pseudo-N (0, 1) random numbers generated by (32) and those generated by other techniques. The disadvantage is that we need to generate 12 uniform random numbers to generate 1 normal random number, and this does not seem efficient. 224 7. Box–Muller Method. This is a direct method as follows: Generate two independent U [0, 1] random numbers X1 and X2, define Z = h(X) where X = (X1, X2) and Z = (Z1, Z2) are R2 vectors and h : [0, 1]2 → R2 is a vector function defined by p p −2 ln x1 cos(2 π x2), −2 ln x1 sin(2 π x2) . (z1, z2) = h(x1, x2) = Then Z1 and Z2 are two independent N (0, 1) random numbers. 225 This result can be easily proved with the transformation method. Specifically, we can find the inverse function h−1 by 1 2 1 z2 2 . (x1, x2) = h−1(z1, z2) = e− 2 (z1 +z2 ), arctan z1 2π The absolute value of the determinant of the Jacobian matrix is ∂(x1, x2) = √1 e− 12 z12 ∙ √1 e− 12 z22 . ∂(z1, z2) 2π 2π (33) From the transformation theorem of random variables and (33) we know that if X1 and X2 are independent U [0, 1] random variables, then Z1 and Z2 are independent N (0, 1) random variables. 226 8. Correlated Normal Distributions. Assume X ∼ N (μ, Σ), where Σ ∈ Rn×n is symmetric positive definite. To generate a normal vector X do the following: (a) Calculate the Cholesky decomposition Σ = LLT . (b) Generate n independent N (0, 1) random numbers Zi and let Z = (Z1, . . . , Zn). (c) Set X = μ + L Z. Example. To generate 2 correlated N (μ, σ 2) RVs with correlation coefficient ρ, let p (Y1, Y2) = (Z1, ρ Z1 + 1 − ρ2 Z2) and then set (X1, X2) = (μ + σ Y1, μ + σ Y2) . 227 9. Inverse Transformation Method. To generate a random variable X with known cdf F (x), the inverse transform method sets X = F −1(U ) , where U is a U [0, 1] random variable (i.e. X satisfies F (X) = U ∼ U [0, 1])). The inverse of F is well defined if F is strictly increasing and continuous, otherwise we set F −1(u) = inf{x : F (x) ≥ u}. Exercise. Use the inversion method to generate X from a Rayleigh distribution, i.e. 2 −x F (x) = 1 − exp . 2 2σ 228 Inverse Transformation Method. X = F −1(U ) ⇒ P (X ≤ x) = P (F −1(U ) ≤ x) = P (U ≤ F (x)) Z F (x) = 1 du 0 = F (x) ⇒ X∼F Examples. 
• Exponential with parameter a > 0: F (x) = 1 − e−a x, x ≥ 0. Let U ∼ U [0, 1] and F (X) = U ⇐⇒ 1 − e−a X = U 1 1 ⇐⇒ X = − ln(1 − U ) = − ln U . a a 229 Hence X ∼ Exp(a). • Arcsin law: Let Bt be a Brownian motion on the time interval [0, 1]. Let T = arg max0≤t≤1 Bt. Then, for any t ∈ [0, 1] we have that √ 2 P (T ≤ t) = arcsin t =: F (t) . π Similarly, let L = sup{t ≤ 1 : Bt = 0}. Then, for any s ∈ [0, 1] we have that √ 2 P (L ≤ s) = π arcsin s = F (s). How to generate from this distribution? Let U ∼ U [0, 1] and √ 2 arcsin X = U F (X) = U ⇐⇒ π 2 1 1 πU = − cos(π U ). ⇐⇒ X = sin 2 2 2 Hence X ∼ F . 230 10. Acceptance-Rejection Method. Assume X has pdf f (x) on a set S, g is another pdf on S from which we know how to generate samples, and f is dominated by g on S, i.e. there exists a constant c such that f (x) ≤ c g(x) ∀ x ∈ S . The acceptance-rejection method generates a sample X from g and a sample U from U [0, 1], and f (X) accepts X as a sample from f if U ≤ and c g(X) rejects X otherwise, and repeats the process. Exercise. Suppose the pdf f is defined over a finite interval [a, b] and is bounded by M . Use the acceptance-rejection method to generate X from f . 231 Acceptance-Rejection Method. Let Y be a sample returned by the algorithm. Then Y has the same distribution of X conditional on U ≤ So for any measurable subset A ⊂ S, we have P (Y ∈ A) = P (X ∈ A | U ≤ f (X) )= c g(X) f (X) c g(X) . P (X ∈ A & U ≤ P (U ≤ f (X) c g(X) ) f (X) c g(X) ) . Now, as U ∼ U [0, 1], Z Z Z f (X) f (x) f (x) 1 1 P (U ≤ f (x) dx = , ) = P (U ≤ ) g(x) dx = g(x) dx = c g(X) c g(x) c g(x) c c S S S which yields Z Z f (x) f (X) f (x) dx . )=c g(x) dx = P (Y ∈ A) = c P (X ∈ A & U ≤ c g(X) A c g(x) A As A was chosen arbitrarily, we have that Y ∼f. 232 11. Monte Carlo Method for Option Pricing. A wide class of derivative pricing problems come down to the evaluation of the following expectation E [f (Z(T ; t, z))] , where Z denotes the stochastic process that describes the price evolution of one or more underlying financial variables such as asset prices and interest rates, under the respective risk neutral probability distributions. The process Z has the initial value z at time t. The function f specifies the value of the derivative at the expiration time T . Monte Carlo simulation is a powerful and versatile technique for estimating the expected value of a random variable. Examples. • E[e −r (T −t) + (ST − X) ], where ST = St e (r− 12 σ 2 ) (T −t)+σ • E[e−r (T −t) (maxt≤τ ≤T Sτ − X)+], where Sτ as before. 233 √ T −t Z , Z ∼ N (0, 1). 12. Simulation Procedure. (a) Generate sample paths of the underlying state variables (asset prices, interest rates, etc.) according to risk neutral probability distributions. (b) For each simulated sample path, evaluate discounted cash flows of the derivative. (c) Take the sample average of the discounted cash flows over all sample paths. C++ Exercise: Write a program to simulate paths of stock prices S satisfying the discretized SDE: √ ΔS = r S Δt + σ S Δt Z where Z ∼ N (0, 1). The inputs are the initial asset price S, the time horizon T , the number of partitions n (time-step Δt = Tn ), the interest rate r, the volatility σ, and the number of simulations M . The output is a graph of paths of the stock price (or the data needed to generate the graph). 234 13. Main Advantage and Drawback. With the Monte Carlo approach it is easy to price complicated terminal payoff function such as path-dependent options. 
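Pricing such path-dependent payoffs requires simulating the whole path, as in the C++ exercise above. A minimal, illustrative sketch of that path loop follows; the function name is invented here, and the standard library's std::normal_distribution is used in place of a hand-written Box–Muller generator.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Illustrative sketch: Euler steps DeltaS = r*S*Dt + sigma*S*sqrt(Dt)*Z with
// Z ~ N(0,1).  Returns M simulated paths, each with n+1 points S_0, ..., S_T.
std::vector<std::vector<double>> simulatePaths(double S0, double T, int n,
                                               double r, double sigma, int M,
                                               unsigned seed = 12345)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> Z(0.0, 1.0);
    const double dt = T / n, sqdt = std::sqrt(dt);

    std::vector<std::vector<double>> paths(M, std::vector<double>(n + 1, S0));
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < n; ++k) {
            double S = paths[i][k];
            paths[i][k + 1] = S + r * S * dt + sigma * S * sqdt * Z(gen);
        }
    return paths;
}
```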
Practitioners often use the brute force Monte Carlo simulation to study newly invented options. However, the method requires a large number of simulation trials to achieve a high level of accuracy, which makes it less competitive compared to the binomial and finite difference methods, when analytic properties of an option pricing model are better known and formulated. [Note: The CLT tells us that the Monte Carlo estimate is correct to order O( √1M ). So to incease the accuracy by a factor of 10, we need to compute 100 times more paths.] 235 14. Computation Efficiency. Suppose Wtotal is the total amount of computational work units available to generate an estimate of the value of the option V . Assume there are two methods for generating MC estimates, requiring W1 and W2 units of computational work for each run. Assume Wtotal is divisible by both W1 and W2. Denote by V1i and V2i the samples for the estimator of V using method 1 and 2, and σ1 and σ2 their standard deviation. The sample means for estimating V using Wtotal amount of work are, respectively, N1 N2 X X 1 1 V1i and V2i N1 i=1 N2 i=1 where Ni = WWtotal , i = 1, 2. The law of large numbers tells us that the above two i estimators are approximately normal with mean V and variances σ22 σ12 and . N1 N2 Hence, method 1 is preferred over method 2 provided that σ12 W1 ≤ σ22 W2. 236 15. Antithetic Variate Method. Suppose {εi} are independent samples from N (0, 1) for the simulation of asset prices, so that √ 2 i (r− σ2 ) (T −t)+σ T −t εi ST = St e for i = 1, . . . , M , where M is the total number of simulation runs. An unbiased estimator of a European call option price is given by M M 1 X i 1 X −r (T −t) c = e max(STi − X, 0). ĉ = M i=1 M i=1 We observe that {−εi} are also independent samples from N (0, 1), and therefore the simulated price σ 2 ) (T −t)−σ √T −t ε i (r− i Se = St e 2 T for i = 1, . . . , M , is a valid sample of the terminal asset price. 237 A new unbiased estimator is given by M M 1 X i 1 X −r (T −t) c̃ = c̃ = e max(SeTi − X, 0). M i=1 M i=1 Normally, we expect ci and c̃i to be negatively correlated, i.e. if one estimate overshoots the true value, the other estimate undershoots the true value. The antithetic variate estimate is defined to be ĉ + c̃ cˉ = . 2 The antithetic variate method improves the computation efficiency provided that cov(ci, c̃i) ≤ 0. [Reason: Var(X + Y ) = Var(X) + Var(Y ) + 2 cov(X, Y ). ⇒ Var(ˉc) = 12 Var(ĉ) + 12 cov(ĉ, c̃)] 238 16. Control Variate Method. Suppose A and B are two similar options and the analytic price formula for option B is available. Let VA and VbA be the true and estimated values of option A, VB and VbB are similar notations for option B. The control variate estimate of option A is defined to be VeA = VbA + (VB − VbB ), where the error VB − VbB is used as an adjustment in the estimation of VA. The control variate method reduces the variance of the estimator of VA when the options A and B are strongly (positively) correlated. 239 The general control variate estimate of option A is defined to be Ve β = VbA + β (VB − VbB ), A where β ∈ R is a parameter. The minimum variance of VeAβ is achieved when cov(VbA, VbB ) ∗ β = . b Var(VB ) Unfortunately, cov(VbA, VbB ) is in general not available. One may estimate β ∗ using the regression technique from the simulated option values VAi and VBi , i = 1, . . . , M . Note: E(X + β (E(Y ) − Y )) = E(X) + β (E(Y ) − E(Y )) = E(X) and Var(X + β(E(Y ) − Y )) = Var(X − β Y ) = Var(X) + β 2 Var(Y ) − 2 β cov(X, Y ) . 240 17. 
Other Variance Reduction Procedures. Other methods include • importance sampling that modifies distribution functions to make sampling more efficient, • stratified sampling that divides the distribution regions and takes samples from these regions according to their probabilities, • moment matching that adjusts the samples such that their moments are the same as those of distribution functions, 1 • low-discrepancy sequence that leads to estimating error proportional to rather M 1 than √ , where M is the sample size. M For details of these methods see Hull (2005) or Glasserman (2004). 241 18. Pricing European Options. If the payoff is a function of the underlying asset at one specific time, then we can simplify the above Monte Carlo procedure. For example, a European call option has a payoff max(S(T ) − X, 0) at expiry. If we assume that S follows a lognormal process, then (r− 12 σ 2 ) T +σ S(T ) = S0 e √ TZ (34) where Z is a standard N (0, 1) random variable. To value the call option, we only need to simulate Z to get samples of S(T ). There is no need to simulate the whole path of S. The computational work load will be considerably reduced. Of course, if we want to price path-dependent European options, we will have to simulate the whole path of the asset process S. 242 C++ Exercise: Write a program to price European call and put options using the Monte Carlo simulation with the antithetic variate method. The standard normal random variables can be generated by first generating U [0, 1] random variables and then using the Box–Muller method. The terminal asset prices can be generated by (34). The inputs are the current asset price S0, the exercise price X, the volatility σ, the interest rate r, the exercise time T , and the number of simulations M . The outputs are European call and put prices at time t = 0 (same strike price). 243 10. American Option Pricing 1. Formulation. A general class of American option pricing problems can be formulated by specifying a Markov process {X(t), 0 ≤ t ≤ T } representing relevant financial variables such as underlying asset prices, an option payoff h(X(t)) at time t, an instantaneous short rate process {r(t), 0 ≤ t ≤ T }, and a class of admissible stopping times T with values in [0, T ]. The American option pricing problem is to find the optimal expected discounted payoff R − 0τ r(u) du h(X(τ ))]. sup E[e τ ∈T It is implicit that the expectation is taken with respect to the risk-neutral measure. In this course we assume that the short rate r(t) = r, a non-negative constant for 0 ≤ t ≤ T. 244 For example, if the option can only be exercised at times 0 < t1 < t2 < ∙ ∙ ∙ < tm = T (this type of option is often called the Bermudan option), then the value of an American put can be written as sup E[e−r ti (K − Si)+] , i=1,...,m where K is the exercise price, Si the underlying asset price S(ti), and r the risk-free interest rate. 245 American Option In general, the value of an American option is higher than that of the corresponding European option. However, an American call for a non-dividend paying asset has the same value as the European call. This follows from the fact that in this case the optimal exercise time is always the expiration time T , as can be seen from the following argument. Suppose the holder wants to exercise the call at time t < T , when S(t) > X. Exercising would give a profit of S(t) − X. 
Instead, one could keep the option and sell the asset short at time t, then purchase the asset at time T by either (a) exercising the option at time t = T or (b) buying at the market price at time T . Hence the holder gained an amount S(t) > X at time t and paid out min(S(T ), X) ≤ X at time T . This is better than just S(t) − X at time t. 246 2. Dynamic Programming Formulation. Let hi denote the payoff function for exercise at time ti, Vi(x) the value of the option at ti given Xi = x, assuming the option has not previously been exercised. We are interested in V0(X0). This value is determined recursively as follows: Vm(x) = hm(x) , Vi−1(x) = max{hi−1(x), Ci−1(x)}, i = m, . . . , 1 , where hi(x) is the immediate exercise value at time ti, Ci−1(x) = E[Di Vi(Xi) | Xi−1 = x] is the expected discounted continuation value at time ti−1, and Di = e−r (ti−ti−1) is the discount factor over [ti−1, ti]. 247 3. Stopping Rule. The optimal stopping time τ is determined by τ = min{ti : hi(Xi) ≥ Ci(Xi), i = 1, . . . , m}. bi(x) is the approximation to Ci(x), then the sub-optimal stopping time τ̂ is If C determined by bi(Xi), i = 1, . . . , m}. τ̂ = min{ti : hi(Xi) ≥ C 248 4. Binomial Method. At the terminal exercise time tm = T , the American option value at node j is given i i , j = 0, 1, . . . , m, where h = h(S by Vjm = hm j j j ) is the intrinsic value at (i, j), e.g. h(s) = (K − s)+. At time ti, i = m − 1, . . . , 0, the option value at node j is given by 1 i+1 + (1 − p) Vji+1] . Vji = max hij , [p Vj+1 R Dynamic programming is used to find the option value at time t0. R−d and the values of p, u, d are determined as usual. In (35) we have p = u−d 249 (35) 5. Finite Difference Method. The dynamic programming approach cannot be applied with the implicit or Crank– Nicolson scheme. Suppose the difference equation with an implicit scheme has the form aj−1 Vj−1 + aj Vj + aj+1 Vj+1 = dj (36) for j = 1, . . . , N − 1, where the superscript “n + 1” is omitted for brevity and dj represents the known quantities. Recall the SOR algorithm to solve (36) is given by ω (k) (k−1) (k) (k−1) (k−1) + dj − aj−1 Vj−1 − aj Vj − aj+1 Vj+1 Vj = V j aj for j = 1, . . . , N − 1 and k = 1, 2, . . .. 250 Let e.g. hj = (Sjn+1 − K)+ be the intrinsic value of the American option at node (j, n + 1). The solution procedure to find Vjn+1 is then ω (k) (k−1) (k) (k−1) (k−1) Vj = max hj , Vj + (dj − aj−1 Vj−1 − aj Vj − aj+1 Vj+1 ) aj for j = 1, . . . , N − 1 and k = 1, 2, . . .. The procedure (37) is called the projected SOR scheme. 251 (37) 6. Random Tree Method. This is a Monte Carlo method based on simulating a tree of paths of the underlying Markov chain X0, X1, . . . , Xm. Fix a branching parameter b ≥ 2. From the initial state X0 simulate b independent successor states X11, . . . , X1b all having the law of X1. From each X1i simulate b independent successors X2i1, . . . , X2ib from the conditional law of X2 given X1 = X1i . From each X2i1i2 generate b successors X3i1i21, . . . , X3i1i2b, and so on. A generic node in the tree at time step i is denoted by Xij1j2∙∙∙ji . At the mth time step there are bm nodes and the computational cost has exponential growth. (e.g. if b = 5 and m = 5, there are 3125 nodes; if b = 5 and m = 10, there are about 10 million nodes.) 252 7. High Estimator. Write Vbij1j2∙∙∙ji for the value of the high estimator at node Xij1j2∙∙∙ji . At the terminal nodes we set Working backwards we get Vbij1j2∙∙∙ji j1 j2 ∙∙∙jm Vbmj1j2∙∙∙jm = hm(Xm ). b X j1 j2 ∙∙∙ji 1 j1 j2 ∙∙∙ji j = max hi(Xi ), Di+1Vbi+1 . 
6. Random Tree Method. This is a Monte Carlo method based on simulating a tree of paths of the underlying Markov chain X0, X1, . . . , Xm. Fix a branching parameter b ≥ 2. From the initial state X0 simulate b independent successor states X1^1, . . . , X1^b, all having the law of X1. From each X1^{i} simulate b independent successors X2^{i1}, . . . , X2^{ib} from the conditional law of X2 given X1 = X1^{i}. From each X2^{i1 i2} generate b successors X3^{i1 i2 1}, . . . , X3^{i1 i2 b}, and so on. A generic node in the tree at time step i is denoted by Xi^{j1 j2 ∙∙∙ ji}. At the m-th time step there are b^m nodes, so the computational cost grows exponentially. (E.g. if b = 5 and m = 5, there are 3125 nodes; if b = 5 and m = 10, there are about 10 million nodes.)

7. High Estimator. Write V̂_i^{j1 j2 ∙∙∙ ji} for the value of the high estimator at node Xi^{j1 j2 ∙∙∙ ji}. At the terminal nodes we set
V̂_m^{j1 j2 ∙∙∙ jm} = hm(Xm^{j1 j2 ∙∙∙ jm}).
Working backwards we get
V̂_i^{j1 j2 ∙∙∙ ji} = max{ hi(Xi^{j1 j2 ∙∙∙ ji}), (1/b) Σ_{j=1}^{b} D_{i+1} V̂_{i+1}^{j1 j2 ∙∙∙ ji j} }.
V̂_0 is biased high in the sense that E[V̂_0] ≥ V0(X0), and it converges in probability and in norm to the true value V0(X0) as b → ∞.

8. Low Estimator. Write v̂_i^{j1 j2 ∙∙∙ ji} for the value of the low estimator at node Xi^{j1 j2 ∙∙∙ ji}. At the terminal nodes we set
v̂_m^{j1 j2 ∙∙∙ jm} = hm(Xm^{j1 j2 ∙∙∙ jm}).
Working backwards, for k = 1, . . . , b, we set
v̂_{i,k}^{j1 j2 ∙∙∙ ji} = hi(Xi^{j1 j2 ∙∙∙ ji})  if  (1/(b−1)) Σ_{j=1, j≠k}^{b} D_{i+1} v̂_{i+1}^{j1 j2 ∙∙∙ ji j} ≤ hi(Xi^{j1 j2 ∙∙∙ ji}),
v̂_{i,k}^{j1 j2 ∙∙∙ ji} = D_{i+1} v̂_{i+1}^{j1 j2 ∙∙∙ ji k}  otherwise,
and then set
v̂_i^{j1 j2 ∙∙∙ ji} = (1/b) Σ_{k=1}^{b} v̂_{i,k}^{j1 j2 ∙∙∙ ji}.
v̂_0 is biased low and converges in probability and in norm to V0(X0) as b → ∞.

Example. [Figure: a small random tree for a call option with exercise price K = 100 and interest rate r = 0, showing at each node the asset price with the corresponding call value in parentheses; evaluating the same tree with the high estimator gives a root value of 9 and with the low estimator a root value of 8 (root asset price 100).]

9. Confidence Interval. Let V̄0(M) denote the sample mean of M replications of V̂_0, sV(M) the sample standard deviation, and z_{δ/2} the 1 − δ/2 quantile of the standard normal distribution. Then
V̄0(M) ± z_{δ/2} sV(M)/√M
provides an asymptotically valid (for large M) 1 − δ confidence interval for E[V̂_0]. Similarly, we can obtain a confidence interval for E[v̂_0], with sample mean v̄0(M) and sample standard deviation sv(M). The interval
( v̄0(M) − z_{δ/2} sv(M)/√M , V̄0(M) + z_{δ/2} sV(M)/√M )
contains the unknown value V0(X0) with probability at least 1 − δ.

10. Regression Method. Assume that the continuation value can be expressed as
E[ Vi+1(Xi+1) | Xi = x ] = Σ_{r=1}^{n} βir ψr(x) = βi^T ψ(x),   (38)
where βi = (βi1, . . . , βin)^T, ψ(x) = (ψ1(x), . . . , ψn(x))^T, and ψr, r = 1, . . . , n, are basis functions (e.g. the polynomials 1, x, x², . . .). Then the vector βi is given by
βi = Bψ^{−1} bψV,
where Bψ = E[ψ(Xi) ψ(Xi)^T] is an n × n matrix (assumed to be non-singular), bψV = E[ψ(Xi) Vi+1(Xi+1)] is an n-vector, and the expectation is over the joint distribution of (Xi, Xi+1).

The regression method can be used to estimate βi as follows:
• generate b independent paths (X1j, . . . , Xmj), j = 1, . . . , b, and
• suppose that the values Vi+1(Xi+1,j) are known; then
β̂i = B̂ψ^{−1} b̂ψV,
where B̂ψ is an n × n matrix with (q, r) entry
(1/b) Σ_{j=1}^{b} ψq(Xij) ψr(Xij)
and b̂ψV is an n-vector with r-th entry
(1/b) Σ_{j=1}^{b} ψr(Xij) Vi+1(Xi+1,j).
However, Vi+1 is unknown and must be replaced by some estimated value V̂i+1.

11. Pricing Algorithm.
(a) Generate b independent paths {X0j, X1j, . . . , Xmj}, j = 1, . . . , b, of the Markov chain, where X0j = X0, j = 1, . . . , b.
(b) Set V̂mj = hm(Xmj), j = 1, . . . , b.
(c) For i = m − 1, . . . , 0, and given estimated values V̂_{i+1,j}, j = 1, . . . , b, use the regression method to calculate β̂i and set
V̂ij = max{ hi(Xij), D_{i+1} β̂i^T ψ(Xij) }.
(d) Set V̂0 = (1/b) Σ_{j=1}^{b} V̂0j.
The regression algorithm is biased high, and V̂0 converges to V0(X0) as b → ∞ if the relation (38) holds for all i = 0, . . . , m − 1.

12. Longstaff–Schwartz Algorithm. Same as 10.11, except in computing V̂ij:
V̂ij = hi(Xij)  if  D_{i+1} β̂i^T ψ(Xij) ≤ hi(Xij),
V̂ij = D_{i+1} V̂_{i+1,j}  otherwise.
The LS algorithm is biased low, and V̂0 converges to V0(X0) as b → ∞ if the relation (38) holds for all i = 0, . . . , m − 1. (A sketch of the regression step common to 10.10–10.12 is given below, after 10.13.)

13. Other Methods. Parametric approximation, state-space partitioning, the stochastic mesh method, duality algorithms, and obstacle (free boundary-value) problem formulations. (See Glasserman (2004) and Seydel (2006) for details.)
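A minimal C++ sketch of the regression step referred to above, assuming a one-dimensional state and the illustrative basis ψ(x) = (1, x, x²)^T, so n = 3 in (38); the small normal-equation system B̂ψ β̂ = b̂ψV is solved by plain Gaussian elimination with partial pivoting (the function names are illustrative).

```cpp
#include <array>
#include <cmath>
#include <utility>
#include <vector>

// Basis functions psi(x) = (1, x, x^2)^T, an illustrative choice of n = 3 in (38).
std::array<double, 3> psi(double x) { return {1.0, x, x * x}; }

// Estimate beta_i from the cross-sectional samples X_{i,j} and the (estimated)
// next-step values Vhat_{i+1,j}: solve  Bpsi_hat * beta = bpsiV_hat.
std::array<double, 3> regressBeta(const std::vector<double>& Xi,
                                  const std::vector<double>& Vnext) {
    const int n = 3;
    double B[n][n] = {};                        // (1/b) sum psi_q(X) psi_r(X)
    double rhs[n] = {};                         // (1/b) sum psi_r(X) Vnext
    const double b = static_cast<double>(Xi.size());
    for (std::size_t j = 0; j < Xi.size(); ++j) {
        auto p = psi(Xi[j]);
        for (int q = 0; q < n; ++q) {
            rhs[q] += p[q] * Vnext[j] / b;
            for (int r = 0; r < n; ++r) B[q][r] += p[q] * p[r] / b;
        }
    }
    // Gaussian elimination with partial pivoting on the 3x3 system.
    for (int k = 0; k < n; ++k) {
        int piv = k;
        for (int q = k + 1; q < n; ++q)
            if (std::fabs(B[q][k]) > std::fabs(B[piv][k])) piv = q;
        std::swap(rhs[k], rhs[piv]);
        for (int r = 0; r < n; ++r) std::swap(B[k][r], B[piv][r]);
        for (int q = k + 1; q < n; ++q) {
            double f = B[q][k] / B[k][k];
            for (int r = k; r < n; ++r) B[q][r] -= f * B[k][r];
            rhs[q] -= f * rhs[k];
        }
    }
    std::array<double, 3> beta{};
    for (int k = n - 1; k >= 0; --k) {          // back substitution
        double s = rhs[k];
        for (int r = k + 1; r < n; ++r) s -= B[k][r] * beta[r];
        beta[k] = s / B[k][k];
    }
    return beta;   // continuation value estimate at x is then D_{i+1} * beta^T psi(x)
}
```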
11. Exotic Option Pricing

1. Introduction. Derivatives with more complicated payoffs than the standard European or American calls and puts are called exotic options; many of them are so-called path-dependent options. Some common exotic options are
• Asian option: average price call and put, average strike call and put. The average can be arithmetic or geometric.
• Barrier option: up/down-and-in/out call/put. Double barriers, partial barriers and other time-dependent features are possible.
• Range forward contract: a portfolio of a long call with a higher strike price and a short put with a lower strike price.
• Bermudan option: a nonstandard American option in which early exercise is restricted to certain fixed times.
• Compound option: an option on an option. There are four basic forms: call-on-call, call-on-put, put-on-call, put-on-put.
• Chooser option: the option holder has the right to choose whether the option is a call or a put.
• Binary (digital) option: the terminal payoff is a fixed amount of cash if the underlying asset price is above a strike price, and nothing otherwise.
• Lookback option: the terminal payoff depends on the maximum or minimum realized asset price over the life of the option.

Examples.

Range forward contract: long call and short put. Assume you have a long position in a call with strike price X2 and a short position in a put with strike price X1 < X2. Then the terminal payoff is given by
(S − X2)+ − (X1 − S)+ = S − X1 if S < X1,  0 if X1 ≤ S ≤ X2,  S − X2 if S > X2.
[Figure: payoff diagram of the range forward contract as a function of the terminal asset price S.]
Buy this option if you do not think that the asset price will go down. ⇒ You can generate cash from a price increase.
Option price = Call price − Put price.

Compound option: call on a call option. The underlying call has payoff C̃(T2) = (S(T2) − X2)+ at time T2 and value C̃(T1) ≡ C̃(S(T1), X2, T1, . . .) at time T1 < T2. The compound option has payoff C(T1) = (C̃(T1) − X1)+ at time T1, and we seek its value C(t) for t < T1.

Binary (digital) option:
Price = E_Q[ e^{−r T} 1_{S_T ≥ X} ] = e^{−r T} E_Q[ 1_{S_T ≥ X} ] = e^{−r T} Q(S_T ≥ X),
where Q denotes the risk-neutral measure. The price can easily be computed, since S_T is lognormally distributed.

Lookback option. Let m = min_{t ≤ u ≤ T} S(u) and M = max_{t ≤ u ≤ T} S(u). Then the terminal payoffs are
• (floating strike) call: (S_T − m)+,
• (floating strike) put: (M − S_T)+,
• fixed strike call: (M − X)+,
• fixed strike put: (X − m)+.

2. Barrier Options. Whether a barrier option still exists at the expiration date depends on whether the underlying asset price has crossed certain values (barriers). There are four basic forms: down-and-out, up-and-out, down-and-in, up-and-in. If there is only one barrier, analytic formulas can be derived for standard European barrier options. However, there are many situations (two barriers, American options, etc.) in which we have to rely on numerical procedures to find the option values. The standard binomial and trinomial schemes can be used for this purpose, but their convergence is very slow, because the barrier assumed by the tree differs from the true barrier.
[Figure: barriers assumed by binomial trees — the tree nodes define an inner and an outer barrier that bracket the true barrier; the usual tree calculations implicitly assume that the outer barrier is the true barrier.]
[Figure: barriers assumed by trinomial trees.]

3. Positioning Nodes on Barriers. Assume that there are two barriers H1 and H2 with H1 < H2, and that the trinomial scheme is used for the option pricing with middle multiplier m = 1 and d = 1/u. Choose u such that nodes lie on both barriers; u must satisfy H2 = H1 u^N (and hence H1 = H2 d^N), or equivalently
ln H2 = ln H1 + N ln u,
with N an integer. It is known that u = e^{σ √(3 Δt)} is a good choice in the standard trinomial tree (recall the Hull–White model (7.20)). A good choice of N is therefore
N = int[ (ln H2 − ln H1) / (σ √(3 Δt)) + 0.5 ],
and u is then determined by
u = exp( (1/N) ln(H2/H1) ).
Normally, the trinomial tree is constructed so that the central node is the initial asset price S. In this case, the asset price at the first node is the initial asset price. After that, we choose the central node of the tree (beginning with the first period) to be H1 u^M, where M is an integer that makes this quantity as close as possible to S, i.e.
M = int[ (ln S − ln H1) / ln u + 0.5 ].
[Figure: trinomial tree with nodes lying on each of the two barriers H1 and H2.]
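A small C++ sketch of this node-positioning rule (the barrier values, volatility, time step and spot price are illustrative inputs):

```cpp
#include <cmath>
#include <iostream>

int main() {
    // illustrative inputs: barriers H1 < H2, volatility, time step, spot price
    double H1 = 90.0, H2 = 110.0, sigma = 0.2, dt = 0.01, S = 100.0;

    // N = int[ (ln H2 - ln H1) / (sigma * sqrt(3 dt)) + 0.5 ]
    int N = static_cast<int>(std::log(H2 / H1) / (sigma * std::sqrt(3.0 * dt)) + 0.5);

    // u chosen so that H2 = H1 * u^N, i.e. u = exp( ln(H2/H1) / N )
    double u = std::exp(std::log(H2 / H1) / N);

    // central node H1 * u^M as close as possible to the initial asset price S
    int M = static_cast<int>(std::log(S / H1) / std::log(u) + 0.5);

    std::cout << "N = " << N << ", u = " << u
              << ", central node = " << H1 * std::pow(u, M) << std::endl;
    return 0;
}
```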
4. Adaptive Mesh Model. Computational efficiency can be greatly improved by projecting a high-resolution tree onto a low-resolution tree, so that the asset price is modelled in more detail in selected regions of the tree. E.g., to price a standard American option it is useful to have a high-resolution tree near maturity around the strike price; to price a barrier option it is useful to have a high-resolution tree near the barriers. See Hull (2005) for details.

5. Asian Options. The terminal payoff depends on some form of averaging of the underlying asset prices over a part or the whole of the life of the option. There are two types of terminal payoffs:
• max(A − X, 0) (average price call) and
• max(S_T − A, 0) (average strike call),
where A is the average of the asset prices, and there are many different ways of computing A. From now on we focus on pricing average price calls; the results for average strike calls can be obtained similarly. We also assume that the asset price S follows a lognormal process in a risk-neutral world.

6. Continuous Arithmetic Average. The average A is computed as
A = (1/T) I(T),  where  I(t) = ∫_0^t S(u) du.
Assume that V(S, I, t) is the option value at time t. Then V satisfies the diffusion equation
∂V/∂t + ½ σ² S² ∂²V/∂S² + r S ∂V/∂S + S ∂V/∂I − r V = 0   (39)
with suitable boundary and terminal conditions. If finite difference methods are used to solve (39), the schemes are prone to serious oscillations and implicit schemes perform poorly, because the diffusion term in the I-direction is missing in this two-dimensional PDE.

7. Continuous Geometric Average. The average A is computed as
A = exp( (1/T) I(T) ),  where  I(t) = ∫_0^t ln S(u) du.
The option value V satisfies the diffusion equation
∂V/∂t + ½ σ² S² ∂²V/∂S² + r S ∂V/∂S + ln S ∂V/∂I − r V = 0.   (40)
Exercise. Show that if we define a new variable
y = ( I + (T − t) ln S ) / T
and seek a solution of the form V(S, I, t) = F(y, t), then equation (40) can be reduced to the one-state-variable PDE
∂F/∂t + (r − ½ σ²) ((T − t)/T) ∂F/∂y + ½ σ² ((T − t)/T)² ∂²F/∂y² − r F = 0.

8. Discrete Geometric Average: Case t ≥ T0. Assume that the averaging period is [T0, T] and that the sampling points are ti = T0 + i Δt, i = 1, 2, . . . , n, with Δt = (T − T0)/n. The average A is computed as
A = ( ∏_{i=1}^{n} S(ti) )^{1/n}.
Assume t ≥ T0, i.e. the current time is already within the averaging period, and write t = tk + α Δt for some integer k with 0 ≤ k ≤ n − 1 and 0 ≤ α < 1. Then ln A is a normal random variable with mean ln S̃(t) + μA and variance σA², where
S̃(t) = [S(t1) ∙∙∙ S(tk)]^{1/n} S(t)^{(n−k)/n},
μA = (r − σ²/2) (T − T0) [ (n−k)/n² (1 − α) + (n−k−1)(n−k)/(2 n²) ],
σA² = σ² (T − T0) [ (n−k)²/n³ (1 − α) + (n−k−1)(n−k)(2n−2k−1)/(6 n³) ].
The European call price at time t is
c(S, t) = e^{−r (T−t)} [ S̃(t) e^{μA + σA²/2} Φ(d1) − X Φ(d2) ],
where
d1 = (1/σA) [ ln(S̃(t)/X) + μA ] + σA  and  d2 = d1 − σA.
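Before turning to the proof, here is a C++ sketch of this closed-form price for the case t ≥ T0 (function and variable names are illustrative; Φ is evaluated via the complementary error function):

```cpp
#include <cmath>
#include <vector>

// standard normal cdf
double Phi(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Discrete geometric average price call, current time t = t_k + alpha*dt inside
// the averaging period [T0, T] with n sampling dates; "fixings" holds the already
// observed prices S(t_1), ..., S(t_k).
double geomAsianCall(double S, const std::vector<double>& fixings, int n,
                     double X, double r, double sigma,
                     double T0, double T, double t) {
    const int k = static_cast<int>(fixings.size());
    const double dt = (T - T0) / n;
    const double alpha = (t - (T0 + k * dt)) / dt;        // 0 <= alpha < 1

    // ln Stilde(t) = (1/n)[ln S(t_1)+...+ln S(t_k)] + ((n-k)/n) ln S(t)
    double lnStilde = (double(n) - k) / n * std::log(S);
    for (double Si : fixings) lnStilde += std::log(Si) / n;

    const double nk  = double(n - k);
    const double muA = (r - 0.5 * sigma * sigma) * (T - T0) *
                       (nk / (double(n) * n) * (1.0 - alpha) +
                        (nk - 1.0) * nk / (2.0 * double(n) * n));
    const double sA2 = sigma * sigma * (T - T0) *
                       (nk * nk / (double(n) * n * n) * (1.0 - alpha) +
                        (nk - 1.0) * nk * (2.0 * nk - 1.0) / (6.0 * double(n) * n * n));
    const double sA  = std::sqrt(sA2);

    const double d1 = (lnStilde - std::log(X) + muA) / sA + sA;
    const double d2 = d1 - sA;
    return std::exp(-r * (T - t)) *
           (std::exp(lnStilde + muA + 0.5 * sA2) * Phi(d1) - X * Phi(d2));
}
```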
Proof. Recall t = tk + α Δt, Δt = (T − T0)/n, 0 ≤ k ≤ n − 1, 0 ≤ α < 1. Write
A = [S(t1) ∙∙∙ S(tn)]^{1/n}
  = [ S(t1) ∙∙∙ S(tk) S(t)^{n−k} (S(tn)/S(tn−1)) (S(tn−1)/S(tn−2))² ∙∙∙ (S(tk+2)/S(tk+1))^{n−k−1} (S(tk+1)/S(t))^{n−k} ]^{1/n}
  = S̃(t) [ Rn Rn−1² ∙∙∙ Rk+2^{n−k−1} Rt^{n−k} ]^{1/n},
where S̃(t) = [S(t1) ∙∙∙ S(tk)]^{1/n} S(t)^{(n−k)/n} is known at time t, Ri = S(ti)/S(ti−1) for i = k + 2, . . . , n, and Rt = S(tk+1)/S(t). Hence
ln A = ln S̃(t) + (1/n) [ ln Rn + 2 ln Rn−1 + . . . + (n−k−1) ln Rk+2 + (n−k) ln Rt ].
Since S is a lognormal process (dS = r S dt + σ S dW), the increments ln Rn, . . . , ln Rk+2, ln Rt are independent and normally distributed with
ln Rn, . . . , ln Rk+2 ∼ N(μ Δt, σ² Δt),  ln Rt ∼ N(μ (tk+1 − t), σ² (tk+1 − t)) = N(μ (1 − α) Δt, σ² (1 − α) Δt),
where μ = r − σ²/2.

Hence ln A is normal with mean
E[ln A] = ln S̃(t) + (1/n) [ μ Δt + 2 μ Δt + . . . + (n−k−1) μ Δt + (n−k) μ (1 − α) Δt ]
        = ln S̃(t) + (1/n) μ Δt [ 1 + 2 + . . . + (n−k−1) + (n−k)(1 − α) ]
        = ln S̃(t) + μ Δt [ (n−k−1)(n−k)/(2n) + (n−k)(1 − α)/n ] = ln S̃(t) + μA.
Moreover, the variance of ln A is given by
Var(ln A) = (1/n²) [ Var(ln Rn) + 2² Var(ln Rn−1) + . . . + (n−k−1)² Var(ln Rk+2) + (n−k)² Var(ln Rt) ]
          = (1/n²) σ² Δt [ 1² + 2² + . . . + (n−k−1)² + (n−k)² (1 − α) ]
          = σ² Δt [ (n−k−1)(n−k)(2n−2k−1)/(6 n²) + (n−k)² (1 − α)/n² ] = σA².
[Used: Σ_{i=1}^{n} i = n(n+1)/2 and Σ_{i=1}^{n} i² = n(n+1)(2n+1)/6.]

As A = e^{ln A}, A is lognormally distributed, and the important formula in 5.5 tells us that
E[(A − X)+] = E(A) Φ(d1) − X Φ(d2),
where d1 = (1/s) ln( E(A)/X ) + s/2 and d2 = d1 − s, with s² = Var(ln A). Now
ln A ∼ N( ln S̃(t) + μA, σA² ),
and so, on recalling 5.4,
E(A) = S̃(t) e^{μA + σA²/2}.
Combining gives
E[(A − X)+] = S̃(t) e^{μA + σA²/2} Φ(d1) − X Φ(d2),
where
d1 = (1/σA) [ ln(S̃(t)/X) + μA + σA²/2 ] + σA/2 = (1/σA) [ ln(S̃(t)/X) + μA ] + σA
and d2 = d1 − σA. Finally, the Asian (European) call price is e^{−r (T−t)} E[(A − X)+].

Aside:  Σ_{i=1}^{n} i = n(n+1)/2  and  Σ_{i=1}^{n} i² = n(n+1)(2n+1)/6.
Let f(x) = x. Then
1/2 = ∫_0^1 f(x) dx = Σ_{i=1}^{n} ∫_{(i−1)/n}^{i/n} x dx = Σ_{i=1}^{n} ½ ( i²/n² − (i−1)²/n² ) = Σ_{i=1}^{n} (1/n²) ( i − ½ ) = (1/n²) Σ_{i=1}^{n} i − 1/(2n).
Hence
(1/n²) Σ_{i=1}^{n} i = 1/2 + 1/(2n) = (n+1)/(2n)  ⇒  Σ_{i=1}^{n} i = n(n+1)/2.
Similarly, for f(x) = x² we obtain
1/3 = ∫_0^1 f(x) dx = Σ_{i=1}^{n} ∫_{(i−1)/n}^{i/n} x² dx = . . .,
which leads to Σ_{i=1}^{n} i² = n(n+1)(2n+1)/6.

9. Discrete Geometric Average: Case t < T0. Exercise. Assume t < T0, i.e. the current time is before the averaging period. Show that ln A is a normal random variable with mean ln S(t) + μA and variance σA², where
μA = (r − σ²/2) [ (T0 − t) + (n+1)/(2n) (T − T0) ],
σA² = σ² [ (T0 − t) + (n+1)(2n+1)/(6 n²) (T − T0) ].
The European call price at time t is
c(S, t) = e^{−r (T−t)} [ S e^{μA + σA²/2} Φ(d1) − X Φ(d2) ],
where d1 = (1/σA) [ ln(S/X) + μA ] + σA and d2 = d1 − σA.
10. Limiting Case n = 1. If n = 1 then A = S(T). Furthermore,
μA = (r − σ²/2) (T − t)  and  σA² = σ² (T − t),
and the call price reduces to the Black–Scholes call price.

Indeed, A = ( ∏_{i=1}^{1} S(ti) )^{1} = S(t1) = S(T), independently of the choice of T0. Hence choose e.g. T0 > t. Then
μA = (r − σ²/2) [ (T0 − t) + (T − T0) ] = (r − σ²/2) (T − t),
σA² = σ² [ (T0 − t) + (T − T0) ] = σ² (T − t).
Call price:
c = e^{−r (T−t)} [ S e^{(r − σ²/2)(T−t) + ½ σ² (T−t)} Φ(d1) − X Φ(d2) ] = S Φ(d1) − X e^{−r (T−t)} Φ(d2),
where
d1 = (1/σA) [ ln(S/X) + μA ] + σA
   = (1/(σ √(T−t))) [ ln(S/X) + (r − σ²/2)(T − t) ] + σ √(T−t)
   = (1/(σ √(T−t))) [ ln(S/X) + r (T − t) ] + ½ σ √(T−t)
and d2 = d1 − σA = (1/(σ √(T−t))) [ ln(S/X) + r (T − t) ] − ½ σ √(T−t).
This is just the Black–Scholes pricing formula (16), see 7.13.

11. Limiting Case n = ∞. If n → ∞ then
A → exp( (1/(T − T0)) ∫_{T0}^{T} ln S(u) du ).
Furthermore, if t < T0 then
μA → (r − σ²/2) ( T/2 + T0/2 − t ),
σA² → σ² ( T/3 + 2 T0/3 − t ).
If t ≥ T0 then
S̃(t) → S(t)^{(T−t)/(T−T0)} exp( (1/(T − T0)) ∫_{T0}^{t} ln S(u) du ),
μA → (r − σ²/2) (T − t)² / (2 (T − T0)),
σA² → σ² (T − t)³ / (3 (T − T0)²).
The above values, together with the discrete average price formulas, then yield pricing formulas for the continuous geometric mean.

Proof. There are two cases: t < T0 and t ≥ T0. Here we look at the latter. We have
μA = (r − σ²/2) (T − T0) [ (1 − k/n)(1 − α)/n + ½ (1 − 1/n − k/n)(1 − k/n) ],
σA² = σ² (T − T0) [ (1 − k/n)² (1 − α)/n + (1/6) (1 − 1/n − k/n)(1 − k/n)(2 − 2k/n − 1/n) ].
Since t = tk + α Δt = T0 + k Δt + α Δt = T0 + (k + α)(T − T0)/n, we have
(t − T0)/(T − T0) = k/n + α/n  ⇒  k/n → (t − T0)/(T − T0)  ⇐⇒  1 − k/n → (T − t)/(T − T0).
Hence, as n → ∞, we have
μA → (r − σ²/2) (T − T0) ∙ ½ (T − t)²/(T − T0)² = (r − σ²/2) (T − t)² / (2 (T − T0)),
σA² → σ² (T − T0) ∙ (1/3) (T − t)³/(T − T0)³ = σ² (T − t)³ / (3 (T − T0)²).
Moreover,
S̃(t) = [S(t1) ∙∙∙ S(tk)]^{1/n} S(t)^{(n−k)/n}
⇒ ln S̃(t) = (1/n) [ ln S(t1) + . . . + ln S(tk) ] + ((n−k)/n) ln S(t)
           = (1/(T − T0)) Δt [ ln S(t1) + . . . + ln S(tk) ] + (1 − k/n) ln S(t).
Since Δt [ ln S(t1) + . . . + ln S(tk) ] is a Riemann sum and tk = t − α Δt → t as n → ∞, we obtain
ln S̃(t) → (1/(T − T0)) ∫_{T0}^{t} ln S(u) du + ((T − t)/(T − T0)) ln S(t)
⇒ S̃(t) → S(t)^{(T−t)/(T−T0)} exp( (1/(T − T0)) ∫_{T0}^{t} ln S(u) du ).
Note that the average A = [S(t1) ∙∙∙ S(tn)]^{1/n} converges to
exp( (1/(T − T0)) ∫_{T0}^{T} ln S(u) du )  as n → ∞,
which is just the continuous geometric average.

12. Discrete Arithmetic Average. The average A is computed as
A = (1/n) Σ_{i=1}^{n} S(ti).
The option value is difficult to compute because A is not lognormal. The binomial method can be used, but great care must be taken so that the number of nodes does not grow exponentially. Monte Carlo simulation is possibly the most efficient method. Since an option on a discrete arithmetic average is very similar to one on a discrete geometric average, and since the latter has an analytic pricing formula, the control variate method (with the geometric call price as control variate) can be used to reduce the variance of the Monte Carlo simulation.
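A minimal C++ sketch of this control variate approach, assuming t = T0 = 0, equally spaced sampling dates, and a control coefficient of one; the analytic geometric price uses the formula of 11.9 with T0 = t = 0, and all numerical inputs are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>

double Phi(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

int main() {
    // illustrative inputs
    double S0 = 100.0, X = 100.0, r = 0.05, sigma = 0.2, T = 1.0;
    int n = 12;                  // sampling dates t_i = i*T/n
    int M = 100000;              // number of simulated paths

    // analytic price of the geometric average price call (11.9 with T0 = t = 0)
    double muA = (r - 0.5 * sigma * sigma) * (n + 1.0) / (2.0 * n) * T;
    double sA2 = sigma * sigma * (n + 1.0) * (2.0 * n + 1.0) / (6.0 * n * n) * T;
    double sA  = std::sqrt(sA2);
    double d1  = (std::log(S0 / X) + muA) / sA + sA, d2 = d1 - sA;
    double geoAnalytic = std::exp(-r * T) *
                         (S0 * std::exp(muA + 0.5 * sA2) * Phi(d1) - X * Phi(d2));

    std::mt19937 gen(4711);
    std::normal_distribution<double> Z(0.0, 1.0);
    double dt = T / n, drift = (r - 0.5 * sigma * sigma) * dt, vol = sigma * std::sqrt(dt);

    double sumArith = 0.0, sumGeo = 0.0;
    for (int j = 0; j < M; ++j) {
        double S = S0, sumS = 0.0, sumLnS = 0.0;
        for (int i = 0; i < n; ++i) {            // simulate the whole path
            S *= std::exp(drift + vol * Z(gen));
            sumS   += S;
            sumLnS += std::log(S);
        }
        double arithA = sumS / n;                 // arithmetic average
        double geoA   = std::exp(sumLnS / n);     // geometric average (control)
        sumArith += std::max(arithA - X, 0.0);
        sumGeo   += std::max(geoA   - X, 0.0);
    }
    double mcArith = std::exp(-r * T) * sumArith / M;
    double mcGeo   = std::exp(-r * T) * sumGeo   / M;

    // control variate estimate: correct the arithmetic MC price by the known
    // error of the geometric MC price (control coefficient taken as 1)
    double price = mcArith + (geoAnalytic - mcGeo);
    std::cout << "arithmetic Asian call (control variate) = " << price << std::endl;
    return 0;
}
```

Both payoffs are computed on the same simulated paths, so their errors are highly correlated and largely cancel in the corrected estimate; a sample-based estimate of the optimal control coefficient could be used instead of the value one taken here.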