Second-Order Differentiation
Yi Heng
Bommerholz, 14.08.2006, Summer School 2006

Outline
• Background
  - What are derivatives?
  - Where do we need derivatives?
  - How to compute derivatives?
• Basics of Automatic Differentiation
  - Introduction
  - Forward mode strategy
  - Reverse mode strategy
• Second-Order Automatic Differentiation Module
  - Introduction
  - Forward mode strategy
  - Taylor series strategy
  - Hessian performance
• An Application in Optimal Control Problems
• Summary

Background: What are derivatives?

Jacobian matrix: the differential of f: R^m -> R^n, x -> f(x), is described by the Jacobian matrix

J = \frac{\partial f}{\partial x} =
\begin{pmatrix}
  \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_m \\
  \vdots & & \vdots \\
  \partial f_n/\partial x_1 & \cdots & \partial f_n/\partial x_m
\end{pmatrix}.

Tangents (directional derivatives): dY = J\, dX.
Gradients (adjoints): \bar{X}^* = \bar{Y}^* J.

Hessian matrix: the second-order partial derivatives of a function f: R^m -> R constitute its Hessian matrix

H(f) = \left( \frac{\partial^2 f}{\partial x_i\, \partial x_j} \right)_{i,j = 1,\ldots,m}.

Background: Where do we need derivatives?
• Linear approximation
• Bending and acceleration (second derivatives)
• Solution of algebraic and differential equations
• Curve fitting
• Optimization problems
• Sensitivity analysis
• Inverse problems (data assimilation)
• Parameter identification

Background: How to compute derivatives?

Symbolic differentiation
• Derivatives can be computed to machine precision.
• The computational work is expensive.
• For complicated functions, the representation of the final expression may be an unaffordable overhead.

Divided differences
• Easy to implement (the definition of the derivative is used directly).
• Only the original computer program is required (no formula is necessary).
• The approximation contains truncation error.

Automatic differentiation
• Machine-precision derivatives can be obtained.
• The computational work is cheaper.
• Only the original computer program is required.

Basics of Automatic Differentiation: Introduction

Automatic differentiation ...
• is also known as computational differentiation, algorithmic differentiation, and differentiation of algorithms;
• is a systematic application of the familiar rules of calculus to computer programs, yielding programs for the propagation of numerical values of first, second, or higher order derivatives;
• traverses the code list (or computational graph) in the forward mode, the reverse mode, or a combination of the two;
• typically is implemented by using either source code transformation or operator overloading;
• is a process for evaluating derivatives which depends only on an algorithmic specification of the function to be differentiated.

Rules of arithmetic operations for gradient vectors (u, v are scalar functions of m independent input variables):

\nabla(u \pm v) = \nabla u \pm \nabla v,
\nabla(uv) = u \nabla v + v \nabla u,
\nabla(u/v) = (\nabla u - (u/v)\nabla v)/v, \quad v \neq 0,
\nabla(\varphi(u)) = \varphi'(u)\, \nabla u,

for differentiable functions \varphi (such as the standard functions) with known derivatives.

Basics of Automatic Differentiation: Forward mode and Reverse mode

An example: given the function f(x, y, z) = (xy + cos z)(x^2 + 2y^2 + 3z^2), the partial derivatives are

\frac{\partial f}{\partial x} = y\,(x^2 + 2y^2 + 3z^2) + (xy + \cos z)\, 2x = 3x^2 y + 2y^3 + 3yz^2 + 2x \cos z,
\frac{\partial f}{\partial y} = x\,(x^2 + 2y^2 + 3z^2) + (xy + \cos z)\, 4y = x^3 + 6xy^2 + 3xz^2 + 4y \cos z,
\frac{\partial f}{\partial z} = -\sin z\,(x^2 + 2y^2 + 3z^2) + (xy + \cos z)\, 6z = -x^2 \sin z - 2y^2 \sin z - 3z^2 \sin z + 6xyz + 6z \cos z,

and the gradient is \nabla f = (\partial f/\partial x,\ \partial f/\partial y,\ \partial f/\partial z)^T.
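The gradient rules above can be applied mechanically by a program. The next slides propagate them through a code list for this example by hand; as a companion, here is a minimal operator-overloading sketch (hypothetical Python, not part of the original slides; the names GVar and seed are illustrative) that performs the same forward-mode propagation.

```python
# Hypothetical minimal forward-mode AD sketch: each value carries its gradient,
# and the arithmetic rules above are applied by operator overloading.
import math

class GVar:
    """A value together with its gradient with respect to the inputs."""
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad                      # list of partial derivatives

    def __add__(self, other):
        # grad(u + v) = grad(u) + grad(v)
        return GVar(self.value + other.value,
                    [a + b for a, b in zip(self.grad, other.grad)])

    def __mul__(self, other):
        # grad(u * v) = u * grad(v) + v * grad(u)
        return GVar(self.value * other.value,
                    [self.value * b + other.value * a
                     for a, b in zip(self.grad, other.grad)])

def cos(u):
    # grad(cos u) = -sin(u) * grad(u)
    return GVar(math.cos(u.value), [-math.sin(u.value) * a for a in u.grad])

def seed(values):
    """Create independent variables with unit gradient seeds."""
    n = len(values)
    return [GVar(v, [1.0 if i == j else 0.0 for j in range(n)])
            for i, v in enumerate(values)]

# f(x, y, z) = (x*y + cos z) * (x^2 + 2y^2 + 3z^2)
x, y, z = seed([1.0, 2.0, 3.0])
two, three = GVar(2.0, [0.0] * 3), GVar(3.0, [0.0] * 3)
f = (x * y + cos(z)) * (x * x + two * y * y + three * z * z)
print(f.value, f.grad)   # value and [df/dx, df/dy, df/dz]
```

Running it at (x, y, z) = (1, 2, 3) reproduces the analytic partial derivatives above; supporting further operations only requires adding the corresponding gradient rule.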
Basics of Automatic Differentiation: Forward mode

Code list:
u1 = x, u2 = y, u3 = z,
u4 = u1 u2,
u5 = cos u3,
u6 = u4 + u5,
u7 = u1^2,
u8 = 2 u2^2,
u9 = 3 u3^2,
u10 = u7 + u8 + u9,
u11 = u6 u10.

Gradient entries:
∇u1 = [1, 0, 0], ∇u2 = [0, 1, 0], ∇u3 = [0, 0, 1],
∇u4 = u1 ∇u2 + u2 ∇u1 = [0, u1, 0] + [u2, 0, 0] = [u2, u1, 0],
∇u5 = (-sin u3) ∇u3 = [0, 0, -sin u3],
∇u6 = ∇u4 + ∇u5 = [u2, u1, -sin u3],
∇u7 = 2 u1 ∇u1 = [2u1, 0, 0],
∇u8 = 4 u2 ∇u2 = [0, 4u2, 0],
∇u9 = 6 u3 ∇u3 = [0, 0, 6u3],
∇u10 = ∇u7 + ∇u8 + ∇u9 = [2u1, 4u2, 6u3],
∇u11 = u6 ∇u10 + u10 ∇u6 = [2u6 u1 + u10 u2, 4u6 u2 + u10 u1, 6u6 u3 - u10 sin u3].

Hence
∇f(x, y, z) = ∇u11 = [3x^2 y + 2x cos z + 2y^3 + 3yz^2,
                      6xy^2 + 4y cos z + x^3 + 3xz^2,
                      6xyz + 6z cos z - x^2 sin z - 2y^2 sin z - 3z^2 sin z].

Basics of Automatic Differentiation: Reverse mode

Code list (same as above): u1 = x, u2 = y, u3 = z, u4 = u1 u2, u5 = cos u3, u6 = u4 + u5, u7 = u1^2, u8 = 2 u2^2, u9 = 3 u3^2, u10 = u7 + u8 + u9, u11 = u6 u10.

Adjoints (propagated from the output back to the inputs):
∂u11/∂u11 = 1,
∂u11/∂u10 = u6,  ∂u11/∂u6 = u10,
∂u11/∂u9 = (∂u11/∂u10)(∂u10/∂u9) = u6,
∂u11/∂u8 = (∂u11/∂u10)(∂u10/∂u8) = u6,
∂u11/∂u7 = (∂u11/∂u10)(∂u10/∂u7) = u6,
∂u11/∂u5 = (∂u11/∂u6)(∂u6/∂u5) = u10,
∂u11/∂u4 = (∂u11/∂u6)(∂u6/∂u4) = u10,
∂u11/∂u3 = (∂u11/∂u9)(∂u9/∂u3) + (∂u11/∂u5)(∂u5/∂u3) = 6u6 u3 - u10 sin u3,
∂u11/∂u2 = (∂u11/∂u4)(∂u4/∂u2) + (∂u11/∂u8)(∂u8/∂u2) = u10 u1 + 4u6 u2,
∂u11/∂u1 = (∂u11/∂u4)(∂u4/∂u1) + (∂u11/∂u7)(∂u7/∂u1) = u10 u2 + 2u6 u1.

Hence
∇f(x, y, z) = [∂u11/∂u1, ∂u11/∂u2, ∂u11/∂u3]
            = [3x^2 y + 2x cos z + 2y^3 + 3yz^2,
               6xy^2 + 4y cos z + x^3 + 3xz^2,
               6xyz + 6z cos z - x^2 sin z - 2y^2 sin z - 3z^2 sin z].

Second-Order AD Module: Introduction

Divided differences.

First-order differentiation:

Forward difference:
\frac{\partial f}{\partial x_m} = \frac{f(x_1, \ldots, x_m + h, \ldots, x_n) - f(x_1, \ldots, x_m, \ldots, x_n)}{h} + O(h)

Backward difference:
\frac{\partial f}{\partial x_m} = \frac{f(x_1, \ldots, x_m, \ldots, x_n) - f(x_1, \ldots, x_m - h, \ldots, x_n)}{h} + O(h)

Centered difference:
\frac{\partial f}{\partial x_m} = \frac{f(x_1, \ldots, x_m + h, \ldots, x_n) - f(x_1, \ldots, x_m - h, \ldots, x_n)}{2h} + O(h^2)

Second-order differentiation:
\frac{\partial^2 f}{\partial x_m^2} = \frac{f(x_1, \ldots, x_m + h, \ldots, x_n) - 2 f(x_1, \ldots, x_m, \ldots, x_n) + f(x_1, \ldots, x_m - h, \ldots, x_n)}{h^2} + O(h^2)

Rules of arithmetic operations for Hessian matrices:

H(u \pm v) = H(u) \pm H(v),
H(uv) = u H(v) + \nabla u^T \nabla v + \nabla v^T \nabla u + v H(u),
H(u/v) = \big( H(u) - \nabla(u/v)^T \nabla v - \nabla v^T \nabla(u/v) - (u/v) H(v) \big)/v, \quad v \neq 0,
H(\varphi(u)) = \varphi''(u)\, \nabla u^T \nabla u + \varphi'(u)\, H(u),

for twice differentiable functions \varphi (such as the standard functions); the gradients are row vectors, so \nabla u^T \nabla v is an m-by-m outer product.

Second-Order AD Module: Forward Mode Strategy

An example: given the function f(x, y) = x^2 + 2y^2 + sin(xy), the second-order partial derivatives are

\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\big(2x + y\cos(xy)\big) = 2 - y^2 \sin(xy),
\frac{\partial^2 f}{\partial x\, \partial y} = \frac{\partial}{\partial y}\big(2x + y\cos(xy)\big) = \cos(xy) - xy\sin(xy),
\frac{\partial^2 f}{\partial y\, \partial x} = \frac{\partial}{\partial x}\big(4y + x\cos(xy)\big) = \cos(xy) - xy\sin(xy),
\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\big(4y + x\cos(xy)\big) = 4 - x^2 \sin(xy).

The Hessian matrix is

H(f) = \begin{pmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x\, \partial y \\ \partial^2 f/\partial y\, \partial x & \partial^2 f/\partial y^2 \end{pmatrix}.

Code list:
u1 = x, u2 = y,
u3 = u1^2,
u4 = 2 u2^2,
u5 = sin(u1 u2),
u6 = u3 + u4 + u5.

Gradient entries:
∇u1 = [1, 0], ∇u2 = [0, 1],
∇u3 = 2 u1 ∇u1 = [2u1, 0],
∇u4 = 4 u2 ∇u2 = [0, 4u2],
∇u5 = cos(u1 u2)(u2 ∇u1 + u1 ∇u2) = [u2 cos(u1 u2), u1 cos(u1 u2)],
∇u6 = ∇u3 + ∇u4 + ∇u5 = [2u1 + u2 cos(u1 u2), 4u2 + u1 cos(u1 u2)].

Hessian matrix entries:

H(u1) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad H(u2) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},

H(u3) = H(u1 u1) = u1 H(u1) + \nabla u1^T \nabla u1 + \nabla u1^T \nabla u1 + u1 H(u1) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix},

H(u4) = H(2 u2^2) = 2\big( u2 H(u2) + \nabla u2^T \nabla u2 + \nabla u2^T \nabla u2 + u2 H(u2) \big) = \begin{pmatrix} 0 & 0 \\ 0 & 4 \end{pmatrix},

H(u5) = H(\sin(u1 u2)) = -\sin(u1 u2)\, \nabla(u1 u2)^T \nabla(u1 u2) + \cos(u1 u2)\, H(u1 u2)
      = -\sin(u1 u2) \begin{pmatrix} u2^2 & u1 u2 \\ u1 u2 & u1^2 \end{pmatrix} + \cos(u1 u2) \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},

H(u6) = H(u3) + H(u4) + H(u5)
      = \begin{pmatrix} 2 - u2^2 \sin(u1 u2) & \cos(u1 u2) - u1 u2 \sin(u1 u2) \\ \cos(u1 u2) - u1 u2 \sin(u1 u2) & 4 - u1^2 \sin(u1 u2) \end{pmatrix},

which is exactly the Hessian H(f) derived above.
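The same hand propagation can be automated. Below is a minimal sketch of the forward-mode strategy for second derivatives (hypothetical Python, not part of the original slides; HVar and seed are illustrative names): every intermediate variable carries its value, gradient and Hessian, and each operation applies the corresponding Hessian rule from the introduction. Evaluated for f(x, y) = x^2 + 2y^2 + sin(xy), it reproduces H(u6) above.

```python
# Hypothetical sketch of the forward-mode strategy for second derivatives:
# every intermediate u_i carries (value, gradient, Hessian), updated with the
# Hessian rules listed above.
import math

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def madd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

class HVar:
    def __init__(self, value, grad, hess):
        self.value, self.grad, self.hess = value, grad, hess

    def __add__(self, other):
        # H(u + v) = H(u) + H(v)
        return HVar(self.value + other.value,
                    [a + b for a, b in zip(self.grad, other.grad)],
                    madd(self.hess, other.hess))

    def __mul__(self, other):
        # H(uv) = u H(v) + grad(u)^T grad(v) + grad(v)^T grad(u) + v H(u)
        u, v = self, other
        grad = [u.value * gv + v.value * gu for gu, gv in zip(u.grad, v.grad)]
        hess = madd(madd([[u.value * h for h in row] for row in v.hess],
                         [[v.value * h for h in row] for row in u.hess]),
                    madd(outer(u.grad, v.grad), outer(v.grad, u.grad)))
        return HVar(u.value * v.value, grad, hess)

def sin(u):
    # H(sin u) = -sin(u) * grad(u)^T grad(u) + cos(u) * H(u)
    s, c = math.sin(u.value), math.cos(u.value)
    grad = [c * g for g in u.grad]
    hess = madd([[-s * gi * gj for gj in u.grad] for gi in u.grad],
                [[c * h for h in row] for row in u.hess])
    return HVar(s, grad, hess)

def seed(values):
    """Independent variables: unit gradient seeds and zero Hessians."""
    n = len(values)
    zero = [[0.0] * n for _ in range(n)]
    return [HVar(v, [1.0 if i == j else 0.0 for j in range(n)],
                 [row[:] for row in zero]) for i, v in enumerate(values)]

# f(x, y) = x^2 + 2y^2 + sin(xy) at (x, y) = (1.0, 0.5)
x, y = seed([1.0, 0.5])
two = HVar(2.0, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
f = x * x + two * y * y + sin(x * y)
print(f.grad)
print(f.hess)   # [[2 - y^2 sin(xy), cos(xy) - xy sin(xy)], [sym., 4 - x^2 sin(xy)]]
```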
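As a cross-check against the divided-difference formulas quoted in the introduction, the (1,1) entry of this Hessian can also be approximated with the centered second-difference formula; the short example below (evaluation point (1.0, 0.5) chosen arbitrarily) shows the truncation error shrinking roughly like O(h^2).

```python
# Hypothetical cross-check of one Hessian entry with the centered
# divided-difference formula; the O(h^2) truncation error is visible as h shrinks.
import math

def f(x, y):
    return x**2 + 2*y**2 + math.sin(x*y)

x0, y0 = 1.0, 0.5
exact = 2.0 - y0**2 * math.sin(x0*y0)          # d2f/dx2 = 2 - y^2 sin(xy)
for h in (1e-1, 1e-2, 1e-3):
    approx = (f(x0 + h, y0) - 2*f(x0, y0) + f(x0 - h, y0)) / h**2
    print(h, approx, abs(approx - exact))
```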
Second-Order AD Module: Hessian Performance

Cost of Hessian computations (H(f) is an n-by-n matrix, V an n-by-n_v matrix, W an n-by-n_w matrix):

Hessian type     Cost
H(f)             O(n^2)
H(f) V           O(n n_v)
V^T H(f) V       O(n_v^2)
V^T H(f) W       O(n_v n_w)

Second-Order AD Module: Taylor Series Strategy

We consider f as a scalar function f(x^0 + t u) of t. Its Taylor series, up to second order, is

f(x^0 + t u) = f(x^0) + t \left.\frac{\partial f}{\partial t}\right|_{t=0} + \frac{1}{2} t^2 \left.\frac{\partial^2 f}{\partial t^2}\right|_{t=0} + \cdots = f + f_t\, t + f_{tt}\, t^2 + \cdots,

where f_t and f_{tt} are the first- and second-order Taylor coefficients. The uniqueness of the Taylor series implies that for u = e_i, the i-th basis vector, we obtain

f_t = \left.\frac{\partial f}{\partial x_i}\right|_{x = x^0}, \qquad f_{tt} = \frac{1}{2} \left.\frac{\partial^2 f}{\partial x_i^2}\right|_{x = x^0}.

To compute the (i, j) off-diagonal entry of the Hessian, we set u = e_i + e_j. The uniqueness of the Taylor expansion implies

f_t = \left.\frac{\partial f}{\partial x_i}\right|_{x = x^0} + \left.\frac{\partial f}{\partial x_j}\right|_{x = x^0},
f_{tt} = \frac{1}{2} \left.\frac{\partial^2 f}{\partial x_i^2}\right|_{x = x^0} + \left.\frac{\partial^2 f}{\partial x_i\, \partial x_j}\right|_{x = x^0} + \frac{1}{2} \left.\frac{\partial^2 f}{\partial x_j^2}\right|_{x = x^0},

so the mixed derivative is recovered by subtracting from f_{tt}(e_i + e_j) the two diagonal coefficients f_{tt}(e_i) and f_{tt}(e_j).

Second-Order AD Module: Hessian Performance

Implementations compared:
• Twice ADIFOR: first produces a gradient code with ADIFOR 2.0, and then runs the gradient code through ADIFOR again.
• Forward: implements the forward mode.
• Adaptive Forward: uses the forward mode, with preaccumulation at the statement level where deemed appropriate.
• Sparse Taylor Series: uses the Taylor series mode to compute the needed entries.

An Application in OCPs: Problem Definition and Theoretical Analysis

Consider the following system

f(\dot{x}, x, u, v) = 0_n, \quad t \in [t_0, t_f]   (1a)

with the following set of consistent and non-redundant initial conditions:

x(t_0) = x^0(v)   (1b)

where x, \dot{x} \in R^n are the state (output) variables and their time derivatives, u are the control (input) variables, and v are time-invariant parameters. Depending on the implementation, the control variables may be approximated by some type of discretization which involves some (or all) of the parameters in the set v:

u = u(v)   (1c)

An Application in OCPs: The first-order sensitivity equations

\frac{\partial f}{\partial \dot{x}} \frac{\partial \dot{x}}{\partial v} + \frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial u} \frac{\partial u}{\partial v} + \frac{\partial f}{\partial v} = 0_n, \quad t \in [t_0, t_f]   (2a)

with the initial conditions

\frac{\partial x}{\partial v}(t_0) = \frac{\partial x^0(v)}{\partial v}.   (2b)

An Application in OCPs: The second-order sensitivity equations

Differentiating (2a) once more with respect to v gives the second-order sensitivity equations; written compactly,

\frac{\partial}{\partial v} \left( \frac{\partial f}{\partial \dot{x}} \frac{\partial \dot{x}}{\partial v} + \frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial u} \frac{\partial u}{\partial v} + \frac{\partial f}{\partial v} \right) = 0_n, \quad t \in [t_0, t_f].   (3a)

Expanded by the chain rule, (3a) contains the second-order sensitivities \partial^2 \dot{x}/\partial v^2, \partial^2 x/\partial v^2 and \partial^2 u/\partial v^2 multiplied by \partial f/\partial \dot{x}, \partial f/\partial x and \partial f/\partial u, together with all second partial derivatives of f with respect to (\dot{x}, x, u, v) contracted with the first-order sensitivities. The initial conditions are given by

\frac{\partial^2 x}{\partial v^2}(t_0) = \frac{\partial^2 x^0(v)}{\partial v^2}.   (3b)

An Application in OCPs: The second-order sensitivity equations (continued)

The result of Eq. (3a) is post-multiplied by a vector p, obtaining

\{\,\cdot\,\}\, p = 0_{n \times 1}.   (4)

By comparing terms, the equivalent form is derived:

\frac{\partial f}{\partial \dot{x}}\, \dot{Z} + \frac{\partial f}{\partial x}\, Z + A(\dot{x}, x, u, v) = 0_n   (5a)

with Z, \dot{Z} being the matrices whose columns are respectively given by the matrix-vector products

z_i = \frac{\partial^2 x_i}{\partial v^2}\, p, \qquad \dot{z}_i = \frac{\partial^2 \dot{x}_i}{\partial v^2}\, p, \qquad i = 1, 2, \ldots, n, \quad t \in [t_0, t_f].   (5b)

Finally, the set of initial conditions for these equations is

Z(t_0) = \left[ \frac{\partial^2 x^0_1(v)}{\partial v^2}\, p, \; \ldots, \; \frac{\partial^2 x^0_n(v)}{\partial v^2}\, p \right].   (5c)

An Application in OCPs: Optimal control problem

Find the control vector u(t) over t \in [t_0, t_f] to minimize (or maximize) a performance index J:

J(x, u) = \Phi(x(t_f))   (6)

subject to a set of ordinary differential equations

\frac{dx}{dt} = f(x(t), u(t), t)   (7)

where x is the vector of state variables, with initial conditions x(t_0) = x_0. An additional set of inequality constraints are the lower and upper bounds on the control variables:

u_L \leq u(t) \leq u_U.   (8)
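Before turning to the solution strategy, here is a toy sketch (hypothetical Python using SciPy, not part of the original slides; all problem data are made up) of how control vector parameterization reduces the OCP (6)-(8) to a bound-constrained NLP: the control is taken piecewise constant on N intervals, so the decision variables are the parameters v of Eq. (1c), and each evaluation of J integrates the ODE (7). For brevity the gradient is left to the optimizer's internal finite differences rather than being computed from the sensitivity system (2a), which is what the implementation described next does.

```python
# Hypothetical CVP sketch: piecewise-constant control on N intervals turns the
# OCP into a bound-constrained NLP in the parameters v.  Dynamics, target and
# bounds are invented for illustration.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

t0, tf, N = 0.0, 2.0, 5                  # time horizon and control intervals
edges = np.linspace(t0, tf, N + 1)

def u_of_t(t, v):
    """Piecewise-constant control: u(t) = v_k on the k-th interval."""
    k = min(np.searchsorted(edges, t, side='right') - 1, N - 1)
    return v[k]

def J(v):
    """Performance index Phi(x(tf)) for the toy dynamics dx/dt = -x + u."""
    sol = solve_ivp(lambda t, x: [-x[0] + u_of_t(t, v)], (t0, tf), [1.0],
                    max_step=(tf - t0) / (10 * N))   # resolve control switches
    return (sol.y[0, -1] - 0.5) ** 2                 # Phi(x(tf))

v0 = np.zeros(N)                                     # initial guess for v
res = minimize(J, v0, method='TNC', bounds=[(-1.0, 1.0)] * N)
print(res.x, res.fun)                                # optimized v and J value
```

SciPy's 'TNC' method is itself a bound-constrained truncated Newton code, so it stands in here for the outer NLP solver discussed on the following slides.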
An Application in OCPs: Truncated Newton method for the solution of the NLP

The truncated Newton method uses an iterative scheme, usually a conjugate gradient method, to solve the Newton equations of the optimization problem approximately:

H(x)\, p = -g(x)   (9)

where H(x) is the Hessian matrix, p is the search direction and g(x) is the gradient vector. (A small illustrative sketch of this inner iteration is given after the references.)

An Application in OCPs: Implementation Details

Step 1
• Automatic derivation of the first- and second-order sensitivity equations to construct a full augmented IVP.
• Creation of the corresponding program subroutines in a format suitable for a standard IVP solver.

Step 2
• Numerical solution of the outer NLP using a truncated Newton method that handles bound-constrained problems.

An Application in OCPs: Two approaches with the TN method

TN algorithm with finite difference scheme
• Gradient evaluation requires the solution of the first-order sensitivity system.
• The gradient information is used to approximate the Hessian-vector product with a finite difference scheme.

TN algorithm with exact Hessian-vector product calculation
• Uses the second-order sensitivity equations defined in Eq. (5a) to obtain the exact Hessian-vector product. (Earlier methods of the CVP type were based on first-order sensitivities only, i.e. mostly gradient-based algorithms.)
• This approach has been shown to be more robust and reliable due to the use of exact second-order information.

Summary
• Basics of derivatives
  - Definition of derivatives
  - Applications of derivatives
  - Methods to compute derivatives
• Basics of AD
  - Computing first-order derivatives with the forward mode
  - Computing first-order derivatives with the reverse mode
• Second-order differentiation
  - Computing second-order derivatives with the forward mode strategy
  - Computing second-order derivatives with the Taylor series strategy
  - Hessian performance
• An Application in Optimal Control Problems
  - First- and second-order sensitivity equations of DAE systems
  - Solving optimal control problems with the CVP method
  - Solving nonlinear programming problems with the truncated Newton method
  - Truncated Newton method with exact Hessian-vector product calculation

References
• Abate, Bischof, Roh, Carle: "Algorithms and Design for a Second-Order Automatic Differentiation Module".
• Eva Balsa-Canto, Julio R. Banga, Antonio A. Alonso, Vassilios S. Vassiliadis: "Restricted second order information for the solution of optimal control problems using control vector parameterization".
• Louis B. Rall, George F. Corliss: "An Introduction to Automatic Differentiation".
• Andreas Griewank: "Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation".
• Stephen G. Nash: "A Survey of Truncated-Newton Methods".
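Appendix: a minimal sketch (hypothetical Python/NumPy, not part of the original slides) of the truncated Newton inner iteration referenced above. The Newton equations H(x) p = -g(x) are solved approximately by conjugate gradients, and each Hessian-vector product is approximated by a central difference of two gradient evaluations, mirroring the finite-difference variant of the TN algorithm; in the exact variant this product would instead come from the second-order sensitivity system (5a). The test function and its gradient are made up for illustration.

```python
# Hypothetical sketch of the truncated Newton inner iteration: solve
# H(x) p = -g(x) approximately by conjugate gradients, with H(x)*d obtained
# from two extra gradient evaluations instead of forming H(x) explicitly.
import numpy as np

def grad(x):
    # gradient of the made-up test function sum(100*(x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return g

def hessvec_fd(x, d, eps=1e-6):
    """Hessian-vector product H(x) d from a central difference of gradients."""
    return (grad(x + eps * d) - grad(x - eps * d)) / (2.0 * eps)

def truncated_newton_step(x, max_cg=20, tol=1e-8):
    """Approximately solve H(x) p = -g(x) with (truncated) conjugate gradients."""
    g = grad(x)
    p, r = np.zeros_like(x), -g.copy()      # residual r = -g - H p
    d = r.copy()
    for _ in range(max_cg):
        Hd = hessvec_fd(x, d)
        curv = d @ Hd
        if curv <= 0.0:                     # negative curvature: truncate
            break
        alpha = (r @ r) / curv
        p += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p if p.any() else -g             # fall back to steepest descent

x = np.array([-1.2, 1.0])
print(truncated_newton_step(x))             # search direction for the first TN iteration
```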