Fast reverse-mode automatic differentiation using expression templates in C++
Robin Hogan
University of Reading
Overview
• Spaceborne radar and lidar
• Adjoint coding
• Automatic differentiation
• New approach
• Testing with lidar multiple-scattering forward models
Spaceborne radar, lidar and radiometers
• The A-Train
– NASA
– 700-km orbit
– CloudSat 94-GHz radar (launch 2006)
– Calipso 532/1064-nm depolarization lidar
– MODIS multi-wavelength radiometer
– CERES broad-band radiometer
– AMSR-E microwave radiometer
• EarthCARE: launch 2015(?)
– ESA+JAXA
– 400-km orbit: more sensitive
– 94-GHz Doppler radar
– 355-nm HSRL/depolarization lidar
– Multispectral imager
– Broad-band radiometer
– Heart-warming name
What do CloudSat and Calipso see?
[Figure: CloudSat radar and CALIPSO lidar curtains]
• Radar: ~D⁶; detects the whole profile; the surface echo provides an integral constraint
• Lidar: ~D²; more sensitive to thin cirrus and liquid, but attenuated
• Radar-lidar ratio provides size D
Target classification
[Figure: radar-lidar target classification; categories: insects, aerosol, rain, supercooled liquid cloud, warm liquid cloud, ice and supercooled liquid, ice, clear, no ice/rain but possibly liquid, ground]
Delanoë and Hogan (2008, 2010)
Unified retrieval
[Flowchart; colour key: ingredients developed / implement previous work / not yet developed]
1. New ray of data: define state vector
– Use classification to specify variables describing each species at each gate
– Ice: extinction coefficient, N0’, lidar extinction-to-backscatter ratio
– Liquid: extinction coefficient and number concentration
– Rain: rain rate, drop diameter and melting ice
– Aerosol: extinction coefficient, particle size and lidar ratio
2. Convert state vector to radar-lidar resolution
– Often the state vector will contain a low-resolution description of the profile
3. Forward model
– 3a. Radar model, including surface return and multiple scattering
– 3b. Lidar model, including HSRL channels and multiple scattering
– 3c. Radiance model: solar and IR channels
4. Compare to observations
– Check for convergence
6. Iteration method (if not converged)
– Derive a new state vector using the adjoint of the full forward model and a quasi-Newton scheme
7. Calculate retrieval error (once converged)
– Error covariances and averaging kernel
– Proceed to next ray of data
Unified retrieval: Forward model
• From state vector x to forward modelled observations H(x)…
[Flowchart, forward direction: the state vector x (ice & snow, liquid cloud, rain, aerosol) passes through lookup tables to obtain profiles of extinction, scattering and backscatter coefficients and asymmetry factor for each constituent/instrument pair (ice/radar, ice/lidar, ice/radiometer; liquid/radar, liquid/lidar, liquid/radiometer; rain/radar, rain/lidar, rain/radiometer; aerosol/lidar, aerosol/radiometer). The contributions from each constituent are summed to give the radar, lidar and radiometer scattering profiles, which the radiative transfer models convert into the forward modelled radar, lidar and radiometer observations H(x).]
[Flowchart, adjoint direction: starting from $\nabla_{\mathbf y} J = \mathbf{R}^{-1}[\mathbf{y}-\mathbf{H}(\mathbf{x})]$, the adjoints of the radiative transfer models and of the radar, lidar and radiometer models propagate vectors back through the same chain to give the gradient of the cost function $\nabla_{\mathbf x} J = \mathbf{H}^{\mathrm T}\mathbf{R}^{-1}[\mathbf{y}-\mathbf{H}(\mathbf{x})]$. These vector-matrix multiplications cost around the same as the original forward operations.]
Radiative transfer models

| Observation | Model | Speed | Status |
|---|---|---|---|
| Radar reflectivity factor | Multiscatter: single scattering option | N | OK |
| Radar reflectivity factor in deep convection | Multiscatter: single scattering plus TDTS MS model (Hogan and Battaglia 2008) | N² | OK |
| Radar Doppler velocity | Single scattering OK if no NUBF; a fast MS model with Doppler does not exist | N² | Not available for MS |
| HSRL lidar in ice and aerosol | Multiscatter: PVC model (Hogan 2008) | N | OK |
| HSRL lidar in liquid cloud | Multiscatter: PVC plus TDTS models | N² | OK |
| Lidar depolarization | Multiscatter: under development | N² | In progress |
| Infrared radiances | Delanoë and Hogan (2008) two-stream source function method | N | No adjoint |
| Infrared radiances | RTTOV (EUMETSAT license) | N | |
| Solar radiances | LIDORT (permissive license) | N | Disappointing accuracy for clouds |
Testing
• After much pain, I have hand-coded an adjoint for the multiscatter model (in C), but still need an adjoint for all the rest of the algorithm (in C++)
Adjoint and Jacobian coding
• Variational retrieval methods are posed as:
– “find the vector x that minimises the cost function J(x)”
• Two common minimization methods:
– The quasi-Newton method requires “adjoint code” to compute the gradient ∂J/∂x for any x
– The Gauss-Newton method writes the observational part of the cost function as the sum of the squared deviations of the observations y from their forward modelled counterparts H(x), and requires code to compute the Jacobian matrix H = ∂y/∂x
• Since J(x) is complicated (containing all of our radiative transfer models), the code to generate ∂J/∂x or ∂y/∂x is even more complicated
– Can it be generated automatically?
Approaches to adjoint coding
• Do it by hand (e.g. ECMWF)
– Painful and time-consuming to debug
– Generates the most efficient code
• Do it numerically: perturb each element of x one by one
– Inefficient and infeasible for large x
– Subject to round-off error
– What I’m using at the moment with the Unified Algorithm
• Automatic differentiation 1: use a source-to-source compiler
– E.g. TAPENADE, TAF and TAC++ generate an adjoint source file from the algorithm source file, producing quite efficient code
– Commercial: 5k/year for a TAF/TAC++ academic license, and permission is needed to distribute the generated source code
– TAPENADE requires you to upload your files to a server
– Limited support for C++ classes and no support for C++ templates
• Automatic differentiation 2: use an operator-overloading technique
– E.g. CppAD and ADOL-C; in principle can work with any language features
– Typically 25 times slower than a hand-coded adjoint!
– Can we do better?
Simple example
• Consider the simple algorithm $y(x_0, x_1) = 4\sin(2x_0 + 3x_1^2)$, contrived for didactic purposes
• Implemented in C or Fortran 90 as:

double algorithm(const double x[2]) {
  double y = 4.0;
  double s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

function algorithm(x) result(y)
  implicit none
  real, intent(in) :: x(2)
  real             :: y, s
  y = 4.0
  s = 2.0*x(1) + 3.0*x(2)*x(2)
  y = y * sin(s)
end function

• Task: given ∂J/∂y, we want to compute ∂J/∂x0 and ∂J/∂x1
Creating the adjoint code 1
• Differentiate the algorithm
• Write each statement in matrix form
– Consider dy as the derivative of y with respect to something
• Transpose the matrix to get the equivalent adjoint statement
– Consider d*y as dJ/dy
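For the example above, these steps can be written out explicitly (a reconstruction from the algorithm; the statement y *= sin(s) is the interesting one because y appears on both sides):

$$ds = 2\,dx_0 + 6x_1\,dx_1, \qquad dy \leftarrow \sin(s)\,dy + y\cos(s)\,ds$$

In matrix form, the second statement is

$$\begin{pmatrix} ds \\ dy \end{pmatrix} \leftarrow \begin{pmatrix} 1 & 0 \\ y\cos s & \sin s \end{pmatrix} \begin{pmatrix} ds \\ dy \end{pmatrix}$$

and transposing the matrix gives the equivalent adjoint statement

$$\begin{pmatrix} d^*s \\ d^*y \end{pmatrix} \leftarrow \begin{pmatrix} 1 & y\cos s \\ 0 & \sin s \end{pmatrix} \begin{pmatrix} d^*s \\ d^*y \end{pmatrix},$$

i.e. $d^*s \leftarrow d^*s + y\cos(s)\,d^*y$ followed by $d^*y \leftarrow \sin(s)\,d^*y$.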
Creating the adjoint code 2
• Apply the adjoint statements in reverse order:

double algorithm_AD(const double x[2], double y_AD[1], double x_AD[2]) {
  /* Forward pass: */
  double y = 4.0;
  double s = 2.0*x[0] + 3.0*x[1]*x[1];
  double y_orig = y; /* store intermediate value needed by the reverse pass */
  y *= sin(s);
  /* Adjoint part, in reverse order: */
  double s_AD = 0.0;
  s_AD += y_orig * cos(s) * y_AD[0]; /* adjoint of y *= sin(s) */
  y_AD[0] = sin(s) * y_AD[0];
  x_AD[0] += 2.0 * s_AD;             /* adjoint of s = 2*x0 + 3*x1*x1 */
  x_AD[1] += 6.0 * x[1] * s_AD;
  s_AD = 0.0;
  y_AD[0] = 0.0;                     /* adjoint of y = 4.0 */
  return y;
}

• Note: we need to store intermediate values (here y_orig) for the reverse pass
• Hand-coding is time-consuming and error-prone for large codes
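A hypothetical call, seeding dJ/dy = 1 so that x_AD returns the two partial derivatives of y itself:

double x[2]    = {2.0, 3.0};
double x_AD[2] = {0.0, 0.0};
double y_AD[1] = {1.0};                 /* set dJ/dy = 1 */
double y = algorithm_AD(x, y_AD, x_AD); /* x_AD now holds dy/dx0 and dy/dx1 */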
Automatic differentiation
• We want something like this (now in C++):

adouble algorithm(const adouble x[2]) {
  adouble y = 4.0;
  adouble s = 2.0*x[0] + 3.0*x[1]*x[1];
  y *= sin(s);
  return y;
}

– Simple change: label “active” variables with a new type

// Main code
Stack stack;                   // Object where info will be stored
adouble x[2] = {…, …};         // Set algorithm inputs
adouble y = algorithm(x);      // Run algorithm and store info in stack
y.set_gradient(y_AD);          // Set dJ/dy
stack.reverse();               // Run adjoint code from stored info
x_AD[0] = x[0].get_gradient(); // Save resulting values of dJ/dx0
x_AD[1] = x[1].get_gradient(); // ... and dJ/dx1

• Operators (e.g. + − * /) and functions (e.g. sin, exp, log) applied to adouble objects are overloaded not only to return the result of the operation, but also to store the gradient information in stack
• The libraries CppAD, SACADO and ADOL-C do this, but the result is around 25 times slower than a hand-coded adjoint… why?
Minimum necessary storage
• What is the minimum necessary storage for these statements?
• If we label each gradient by an integer (since the values are unknown in the forward pass) then we need two stacks that can be added to as the algorithm progresses:

Statement stack:
| # | Index to LHS gradient | Index to first operation |
|---|---|---|
| 0 | 2 (dy) | 0 |
| 1 | 3 (ds) | 0 |
| 2 | 2 (dy) | 2 |
| 3 | … | 4 |

Operation stack:
| # | Multiplier | Index to RHS gradient |
|---|---|---|
| 0 | 2.0 | 0 (dx0) |
| 1 | 6.0·x1 | 1 (dx1) |
| 2 | sin(s) | 2 (dy) |
| 3 | y·cos(s) | 3 (ds) |
| 4 | … | … |

• We can then run backwards through the stacks to compute the adjoints
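A minimal pair of records matching these two stacks might look like this in C++ (an illustrative sketch; not necessarily Adept’s actual layout):

// One entry per differential statement d(LHS) = sum_i m_i * d(RHS_i)
struct Statement {
  unsigned int gradient_index;  // Index to LHS gradient
  unsigned int first_operation; // Index of this statement's first operation
};
// One entry per term on the right-hand side of a statement
struct Operation {
  double multiplier;            // e.g. 6.0*x1 or y*cos(s), computed in the forward pass
  unsigned int gradient_index;  // Index to RHS gradient
};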
Adjoint algorithm is simple
• We need to cope with three different types of differential statement
• General differential statement:
$$dy = \sum_{i=0}^{n} m_i\,dx_i$$
• Equivalent adjoint statements:
$$a \leftarrow d^*y; \qquad d^*y \leftarrow 0; \qquad d^*x_i \leftarrow d^*x_i + m_i\,a \;\; \text{for } i = 0 \text{ to } n$$
• …which can be coded as follows:
1. Loop over the differential statements in reverse order
2. Save the gradient of the left-hand side
3. Skip if the gradient equals 0 (big optimization)
4. Loop over the operations
5. Update an adjoint
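In C++ this reverse sweep might look like the following sketch (assuming the Statement and Operation records above, a gradient array indexed by the integer labels, and a sentinel statement at index n_statements whose first_operation marks the end of the operation stack):

// 1. Loop over differential statements in reverse order
for (int st = n_statements - 1; st >= 0; --st) {
  // 2. Save the gradient of the left-hand side, and zero it
  double a = gradient[statement[st].gradient_index];
  gradient[statement[st].gradient_index] = 0.0;
  // 3. Skip if the gradient equals 0 (big optimization)
  if (a != 0.0) {
    // 4. Loop over the operations belonging to this statement
    for (unsigned int op = statement[st].first_operation;
         op < statement[st+1].first_operation; ++op) {
      // 5. Update an adjoint
      gradient[operation[op].gradient_index] += operation[op].multiplier * a;
    }
  }
}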
• This does the right thing in our three cases:
– Zero on RHS
– One or more gradients on RHS
– Same gradient on LHS and RHS
“Dual numbers” approach
• How can these stacks be created?
• Consider what happens when the compiler sees this line:

y = y * sin(s)

• The compiler splits this into two parts with a temporary t:

adouble t = sin(s);
y = operator*(y, t);

• We could define adouble as a “dual number” [x, dx] (invented by Clifford in 1873) and then overload sin and operator*:

sin([s, ds]) = [sin(s), cos(s)·ds]
[y, dy] * [t, dt] = [y·t, t·dy + y·dt]

• This would correctly apply the chain rule – but only if the gradient terms on the right-hand side are known!
• This is not useful for the reverse (adjoint) mode, where we want to store a symbolic representation of the gradient on the forward sweep, to be filled in on the reverse sweep
– Dual numbers are used in some forward-mode-only (tangent-linear) automatic differentiation tools
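To make the forward-mode idea concrete, a minimal dual-number type could be written as follows (a sketch for illustration; not taken from any of the libraries discussed):

#include <cmath>

// A dual number carries a value and its derivative with respect to some input
struct Dual {
  double val;   // x
  double deriv; // dx
};

// Overloads propagate derivatives alongside values (forward mode)
inline Dual sin(Dual a) {
  return Dual{std::sin(a.val), std::cos(a.val) * a.deriv};
}
inline Dual operator*(Dual a, Dual b) {
  return Dual{a.val * b.val, a.deriv * b.val + a.val * b.deriv};
}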
So how do CppAD & ADOL-C work?
• In the forward pass they store the whole algorithm symbolically, not just its derivative form!
• This means every operator and function must be stored symbolically (e.g. 0 for plus, 1 for minus, 42 for atan, etc.)
• The stored algorithm can then be analysed to generate an adjoint function
• This all happens behind the scenes, so it is easy to use, but it is not surprising that it is 25 times slower than a hand-coded adjoint
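For concreteness, such a symbolic “tape” might be stored along these lines (an illustrative sketch, not the actual CppAD or ADOL-C format):

// Each operator/function is recorded as an opcode plus variable indices,
// so the whole algorithm can be replayed or differentiated later
enum Opcode { OP_PLUS = 0, OP_MINUS = 1, OP_TIMES = 2, OP_SIN = 3 /* ... */ };
struct TapeEntry {
  Opcode op;      // Which operation was performed
  int result;     // Index of the variable receiving the result
  int arg1, arg2; // Indices of the argument variables (arg2 unused for unary ops)
};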
Computational graphs
• The basic problem is that standard operator overloading can only pass information from the most nested operation outwards
[Diagram: expression tree for y = y·sin(s). The node s passes the value of sin(s) up to operator*, which passes y·sin(s) up to become the new y. Implementing the chain rule requires differentiating the multiply operator and the sine function.]
Computational graph 2
• Clearly differentiation most naturally involves passing information in the opposite sense
• Each node representing an arbitrary function or operator y(a) needs to be able to take a real number w and pass w·dy/da down the chain
• A binary function or operator y(a,b) would pass w·dy/da to one argument and w·dy/db to the other
• At the end of the chain, store the result on the stack
[Diagram: for y = y·sin(s), operator* passes sin(s) down to the y node, which adds sin(s)·dy to the stack, and passes y down to the sin node; sin passes y·cos(s) down to the s node, which adds y·cos(s)·ds to the stack.]
• But how do we implement this?
What is a template?
• Templates are a key ingredient of generic programming in C++
• Imagine we have a function like this:

double cube(const double x) {
  double y = x*x*x;
  return y;
}

• We want it to work with any numerical type (single precision, complex numbers etc.) but don’t want to laboriously define a new overloaded function for each possible type
• We can use a function template:

template <typename Type>
Type cube(Type x) {
  Type y = x*x*x;
  return y;
}

double a = 1.0;
double b = cube(a);          // compiler creates function cube<double>
complex<double> c(1.0, 2.0); // c = 1 + 2i
complex<double> d = cube(c); // compiler creates function cube<complex<double> >
What is an expression template?
• C++ also supports class templates
– Veldhuizen (1995) used this feature to introduce the idea of expression templates, to optimize array operations and make C++ as fast as Fortran 90 for array-wise operations
• We use it as a way to pass information in both directions through the expression tree:
– sin(A) for an argument of arbitrary type A is overloaded to return an object of type Sin<A>
– operator*(A,B) for arguments of arbitrary types A and B is overloaded to return an object of type Multiply<A,B>
• Now when we compile the statement “y = y*sin(s)”:
– The right-hand side resolves to an object “RHS” of type Multiply<adouble,Sin<adouble> >
– The overloaded assignment operator first calls RHS.value() to get y
– It then calls RHS.calc_gradient() to add entries to the operation stack
– Multiply and Sin are defined with member functions so that they can correctly pass information up and down the expression tree
New approach
• The following types are passed up the chain at compile time: s and y are adouble, sin(s) is Sin<adouble>, and y*sin(s) is Multiply<adouble,Sin<adouble> >
• Each function and operator y(a) implements a function calc_gradient that takes a real number w and passes w·dy/da down the chain at run time
[Diagram: operator* passes sin(s) down to the y node, which adds sin(s)·dy to the stack, and passes y down to the sin node; sin passes y·cos(s) down to the s node, which adds y·cos(s)·ds to the stack.]
Implementation of Sin<A>

// Definition of Sin class
template <class A>
class Sin : public Expression<Sin<A> > {
public:
  // Constructor: store reference to a and its numerical value
  Sin(const Expression<A>& a)
    : a_(a), a_value_(a.value()) { }
  // Return the value
  double value() const
  { return sin(a_value_); }
  // Compute derivative and pass to a
  void calc_gradient(Stack& stack, double multiplier) const
  { a_.calc_gradient(stack, cos(a_value_)*multiplier); }
private:
  const A& a_;     // A reference to the object
  double a_value_; // The numerical value of the object
};

// Overload the sin function: it returns a Sin<A> object
template <class A>
inline Sin<A> sin(const Expression<A>& a)
{ return Sin<A>(a); }

• …the Adept library has done this for all operators and functions
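By analogy, the binary Multiply<A,B> mentioned earlier might be implemented along these lines (a sketch in the same spirit, not Adept’s exact source):

// Definition of Multiply class: passes a multiplier down to both arguments
template <class A, class B>
class Multiply : public Expression<Multiply<A,B> > {
public:
  // Downcast to the concrete types (CRTP) and store values
  Multiply(const Expression<A>& a, const Expression<B>& b)
    : a_(static_cast<const A&>(a)), b_(static_cast<const B&>(b)),
      a_value_(a.value()), b_value_(b.value()) { }
  double value() const
  { return a_value_ * b_value_; }
  void calc_gradient(Stack& stack, double multiplier) const {
    a_.calc_gradient(stack, b_value_ * multiplier); // w·d(ab)/da = w·b
    b_.calc_gradient(stack, a_value_ * multiplier); // w·d(ab)/db = w·a
  }
private:
  const A& a_;     // References to the two argument objects
  const B& b_;
  double a_value_; // Their numerical values
  double b_value_;
};

// Overload operator*: it returns a Multiply<A,B> object
template <class A, class B>
inline Multiply<A,B> operator*(const Expression<A>& a, const Expression<B>& b)
{ return Multiply<A,B>(a, b); }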
Optimizations
• Why are expression templates fast?
– Compound types representing complex expressions are known at compile time
– C++ automatically inlines the function calls between objects in an expression, leaving little more than the operations you would put in a hand-coded application of the chain rule
• Further optimizations:
– The Stack object keeps memory allocated between calls, to avoid time spent incrementally allocating more memory
– If the Jacobian is computed, it is done in strips to exploit vectorization (SSE/SSE2 on Intel) and loop unrolling
– The current stack is accessed via a global but thread-local variable, rather than storing a link to the stack in every adouble object (as in CppAD and ADOL-C)
Testing using lidar multiple-scattering models
• Photon Variance-Covariance (PVC) method for small-angle multiple scattering
– Hogan (JAS 2008)
– Somewhat similar to a monochromatic radiance model
– Four coupled ODEs are integrated forward in space
– Several variables at N gates give N output signals
– Computational cost proportional to N
• Time-Dependent Two-Stream (TDTS) method for wide-angle multiple scattering
– Hogan and Battaglia (JAS 2008)
– Similar to a time-dependent 1D advection model
– Four coupled PDEs are integrated forward in time
– Several variables at N gates give N output signals
– Computational cost proportional to N²
Simulation of 3D photon transport
• Animation of the scalar flux (I⁺ + I⁻)
– Colour scale is logarithmic
– Represents 5 orders of magnitude
• Domain properties:
– 500 m thick, 2 km wide
– Optical depth of 20
– No absorption
• In this simulation the lateral distribution is Gaussian at each height and each time
Benchmark results
• Time relative to the original code; gcc-4.4, Pentium 2.5 GHz, 2 MB cache

| Adjoint | PVC N=50 | TDTS N=50 |
|---|---|---|
| Hand-coded adjoint | 3.0 (1.0+2.0) | 3.6 (1.0+2.6) |
| New C++ library: Adept | 3.5 (2.7+0.8) | 3.8 (2.6+1.2) |
| ADOL-C | 25 (18+7) | 20 (15+5) |
| CppAD | 29 (15+7+7) | 34 (17+8+9) |

– Adept is only 5-20% slower than the hand-coded adjoint, and 5-9 times faster than the leading libraries providing the same functionality

| Full Jacobian (50×350) | PVC N=50 | TDTS N=50 |
|---|---|---|
| New C++ library: Adept | 20 | 20 |
| ADOL-C | 83 | 69 |
| CppAD | 352 | 470 |

– Adept is 4-20 times faster for the 50×350 Jacobian
Outlook
• The new library Adept (Automatic Differentiation using Expression Templates) produces an adjoint with minimum difficulty for the user
– No knowledge of templates is required by the user at all
– Simple and efficient to compute the Jacobian matrix as well
– Freely available at http://www.met.reading.ac.uk/clouds/adept/
• Typically 5-20% slower than hand-coded adjoints
– But immeasurably faster in terms of programmer time
• The code is complete for applying to any C code with real numbers
• Further development desirable:
– Complex numbers
– Use within C++ matrix/vector libraries, particularly those that already use expression templates (like the one I use for the Unified Algorithm)
– Easily facilitate checkpointing so that large codes don’t exhaust memory
– Automatically compute higher-order derivatives (e.g. the Hessian matrix)
• Potential for student projects to get small data assimilation systems up and running, and efficient, quickly
• Impossible to apply in Fortran: no template capability!
Minimizing the cost function

$$J = \tfrac{1}{2}\,[\mathbf{y} - \mathbf{H}(\mathbf{x})]^{\mathrm T}\, \mathbf{R}^{-1}\, [\mathbf{y} - \mathbf{H}(\mathbf{x})] + \tfrac{1}{2}\,(\mathbf{x} - \mathbf{a})^{\mathrm T}\, \mathbf{B}^{-1}\, (\mathbf{x} - \mathbf{a})$$

Gradient of cost function (a vector):
$$\nabla_{\mathbf{x}} J = -\mathbf{H}^{\mathrm T} \mathbf{R}^{-1} [\mathbf{y} - \mathbf{H}(\mathbf{x})] + \mathbf{B}^{-1} (\mathbf{x} - \mathbf{a})$$

and 2nd derivative (the Hessian matrix):
$$\nabla^2_{\mathbf{x}} J = \mathbf{H}^{\mathrm T} \mathbf{R}^{-1} \mathbf{H} + \mathbf{B}^{-1}$$

Gauss-Newton method:
$$\mathbf{x}_{i+1} = \mathbf{x}_i - \left(\nabla^2_{\mathbf{x}} J\right)^{-1} \nabla_{\mathbf{x}} J$$
– Rapid convergence (instant for linear problems)
– Get the solution error covariance “for free” at the end
– Levenberg-Marquardt is a small modification to ensure convergence
– Needs the Jacobian matrix H of every forward model: can be expensive for larger problems, as the forward model may need to be rerun with each element of the state vector perturbed

Gradient-descent methods:
$$\mathbf{x}_{i+1} = \mathbf{x}_i - \mathbf{A}\, \nabla_{\mathbf{x}} J$$
– A fast adjoint method to calculate ∇xJ means there is no need to calculate the Jacobian
– Disadvantage: more iterations are needed, since we don’t know the curvature of J(x)
– A quasi-Newton method provides the search direction (e.g. L-BFGS, used by ECMWF): it builds up an approximate inverse Hessian A for improved convergence
– Scales well for large x
– Poorer estimate of the error at the end
Time-dependent 2-stream approx.
• Describe the diffuse flux in terms of an outgoing stream I⁺ and an incoming stream I⁻, and numerically integrate the following coupled PDEs:

$$\frac{1}{c}\frac{\partial I^+}{\partial t} + \frac{\partial I^+}{\partial r} = -\gamma_1 I^+ + \gamma_2 I^- + S^+$$
$$\frac{1}{c}\frac{\partial I^-}{\partial t} - \frac{\partial I^-}{\partial r} = -\gamma_1 I^- + \gamma_2 I^+ + S^-$$

– Time derivative: remove this and we have the time-independent two-stream approximation
– Spatial derivative: transport of radiation from upstream
– Loss term (−γ₁I±): loss by absorption or scattering; some of the lost radiation will enter the other stream
– Gain term (γ₂I∓): gain by scattering of radiation from the other stream
– Source term (S±): scattering from the quasi-direct beam into each of the streams

• These can be discretized quite simply in time and space (no implicit methods or matrix inversion required); a sketch follows
Hogan and Battaglia (2008, J. Atmos. Sci.)
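A minimal explicit upwind discretization of these two equations might look like this (a sketch assuming a uniform grid of spacing dr, timestep dt, and hypothetical coefficient profiles gamma1, gamma2, splus and sminus; not the actual TDTS code):

// Advance both streams by one timestep on an n-point grid (interior points only;
// I+ propagates towards +r, so use a backward difference; I- the reverse)
void step(int n, double c, double dt, double dr,
          const double* gamma1, const double* gamma2,
          const double* splus, const double* sminus,
          const double* Iplus, const double* Iminus,
          double* Iplus_new, double* Iminus_new) {
  for (int i = 1; i < n-1; ++i) {
    Iplus_new[i]  = Iplus[i]  + c*dt * ( -(Iplus[i] - Iplus[i-1])/dr
                      - gamma1[i]*Iplus[i]  + gamma2[i]*Iminus[i] + splus[i] );
    Iminus_new[i] = Iminus[i] + c*dt * (  (Iminus[i+1] - Iminus[i])/dr
                      - gamma1[i]*Iminus[i] + gamma2[i]*Iplus[i]  + sminus[i] );
  }
}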