Some open problems in applications of
holonomic gradient method to statistics
Akimichi Takemura, Univ. Tokyo
August 26, 2014, Prague
Outline
1. Advertisement and overview of my research on
holonomic gradient method (HGM)
2. My first example: Airy-like function
3. Definition of holonomic functions
4. Connection of HGM to Markov bases (MB)
5. Second example: incomplete gamma function
6. Wishart distribution and hypergeometric
function of a matrix argument
A one-page advertisement of HGM
• HGM is a common ground, where statistics,
algebra and numerical analysis meet.
• HGM is a general method and can be applied
when functions are holonomic.
• HGM already has some success stories.
• It is numerically very accurate. With a proper
mix of symbolic and numerical computations,
it is often faster than existing methods.
• Connection to MB: i) Gröbner bases for
D-modules, ii) non-central hypergeometric
distribution.
Overview of my research
• Beginning: a problem session by Takayama in
September, 2009.
• HGM proposed in “Holonomic gradient
descent and its application to the
Fisher-Bingham integral”, Advances in Applied
Mathematics, 47 (2011), 639–658, by N³OST²
(Nakayama, Nishiyama, Noro, Ohara, Sei,
Takayama, Takemura).
• My coauthors on HGM: H.Hashiguchi,
J.Hayakawa, T.Koyama, S.Kuriki, N.Marumo,
H.Nakayama, K.Nishiyama, M.Noro, Y.Numata,
K.Ohara, T.Sei, C.Siriteanu, N.Takayama.
• The following site lists 15 manuscripts so far.
http://www.math.kobe-u.ac.jp/OpenXM/Math/hgm/ref-hgm.html
• Wishart discussed in arXiv:1201.0472v3.
Published in Journal of Multivariate Analysis,
doi:10.1016/j.jmva.2013.03.011.
• “Estimation of exponential-polynomial
distribution by holonomic gradient descent”, J.
Hayakawa and A. Takemura. arXiv:1403.7852
References on HGM
• Ch.6 of “Dojo” (Hibi ed.) in English.
• Tutorial slide by Takayama
“takayama-hgm-v3.pdf ”
• Saito-Sturmfels-Takayama book (2000).
Ex.1: Airy-like function
An exercise problem by Nobuki Takayama on
Sep.16, 2009, during “Kobe Gröbner School”.
Question: Let
A(x) = ∫_0^∞ e^{−t−xt^3} dt,   x > 0.
Derive a differential equation satisfied by A(x).
Answer:
1 = (27x^3 ∂_x^2 + 54x^2 ∂_x + 6x + 1)A(x)
  = 27x^3 A′′(x) + 54x^2 A′(x) + (6x + 1)A(x).
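As a sanity check, differentiation under the integral sign gives
A′(x) = −∫_0^∞ t^3 e^{−t−xt^3} dt and A′′(x) = ∫_0^∞ t^6 e^{−t−xt^3} dt,
so the claimed relation can be tested numerically. A minimal Python
sketch (assuming NumPy and SciPy; not part of the original exercise):

import numpy as np
from scipy.integrate import quad

x = 0.7  # arbitrary test point x > 0
A  = quad(lambda t: np.exp(-t - x*t**3), 0, np.inf)[0]
A1 = quad(lambda t: -t**3 * np.exp(-t - x*t**3), 0, np.inf)[0]  # A'(x)
A2 = quad(lambda t: t**6 * np.exp(-t - x*t**3), 0, np.inf)[0]   # A''(x)
print(27*x**3*A2 + 54*x**2*A1 + (6*x + 1)*A)  # ≈ 1 up to quadrature error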
• This question is pretty hard, even if you are
told the answer.
• Actually Prof. Takayama did this by computer
(GB computation for D-modules), while asking
us to do it by hand(!)
• I was struggling with this problem, wasting
lots of paper and wondering why I was doing
this exercise.
• After one hour, I suddenly realized that this is
indeed an important problem in statistics.
Is this exercise related to statistics?
• Change the notation and let
A(θ) = ∫_0^∞ e^{−x−θx^3} dx.
• Let
f(x; θ) = (1/A(θ)) e^{−x−θx^3},   x, θ > 0.
• This is an exponential family with the
sufficient statistic T (x) = x3 .
• Therefore we are evaluating the normalizing
constant of an exponential family and its
derivatives.
• Is ODE useful? We now know
1 = 27θ^3 A′′(θ) + 54θ^2 A′(θ) + (6θ + 1)A(θ).
• Hence the Fisher information A′′ (θ) is
automatically obtained from A(θ) and A′ (θ).
• Can we numerically evaluate A(θ) and A′ (θ)?
(Also A′′ (θ) for Newton-Raphson?)
• For illustration, we use a simple linear
approximation (the Euler method):
A(θ + ∆θ) ≈ A(θ) + ∆θ A′(θ),
A′(θ + ∆θ) ≈ A′(θ) + ∆θ A′′(θ).
• But from the differential equation we know
A′′(θ) = (1/(27θ^3)) (1 − (6θ + 1)A(θ) − 54θ^2 A′(θ)).
• Punch line: if you keep numerical values of
A(θ), A′ (θ) at one point θ, then you can compute
these values at nearby θ + ∆θ.
• At each point, higher-order derivatives A′′ (θ),
A′′′ (θ), . . . , can be computed as needed.
• Hence, by numerically solving the ODE, you can
compute A(θ) and its derivatives at any point
→ “Holonomic Gradient Method”
• For explanation we used the Euler method, but
in our actual implementation we use the
Runge-Kutta method to solve the ODE.
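A minimal Python sketch of this iteration (the Euler variant, for
clarity; function names are mine, and the initial values are obtained
here by quadrature rather than by the methods of our implementation):

import numpy as np
from scipy.integrate import quad

def initial_values(theta):
    # A(theta) and A'(theta) by direct numerical integration (done once)
    A  = quad(lambda x: np.exp(-x - theta*x**3), 0, np.inf)[0]
    dA = quad(lambda x: -x**3 * np.exp(-x - theta*x**3), 0, np.inf)[0]
    return A, dA

def hgm_euler(theta0, theta1, n_steps=100000):
    # propagate (A, A') from theta0 to theta1 using
    # A'' = (1 - (6t+1)A - 54 t^2 A') / (27 t^3)
    A, dA = initial_values(theta0)
    h = (theta1 - theta0) / n_steps
    t = theta0
    for _ in range(n_steps):
        d2A = (1.0 - (6*t + 1)*A - 54*t**2*dA) / (27*t**3)
        A, dA = A + h*dA, dA + h*d2A
        t += h
    return A, dA

The output of hgm_euler(1.0, 2.0) can be compared against
initial_values(2.0) computed directly.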
Definition of holonomic functions
(from Takayama’s tutorial)
• Let f (x) = f (x1 , . . . , xn ) be a smooth function
defined on an open set U of Rn .
• f is called a holonomic function if f is
annihilated by n differential operators Li ,
i = 1, . . . , n, of the form
L_i = a_{i,m_i}(x) ∂_i^{m_i} + a_{i,m_i−1}(x) ∂_i^{m_i−1} + · · · + a_{i,0}(x),
where the a_{ij}(x) are rational functions in x.
• The set of holonomic functions is closed under
addition and multiplication (but not under
division).
• It is also closed under integration
(marginalization in statistics):
f holonomic ⇒ ∫ f(x) dx_n holonomic.
• Examples:
– Rational functions are holonomic.
– exp(f(x)) is holonomic if f is a rational
function. This is important for statistics,
because the density of the normal
distribution is holonomic.
– |x| is not a holonomic function, but it is a
holonomic “distribution” (generalized function).
– The indicator function of a region defined by
polynomial inequalities is a holonomic
distribution. This is important for statistics.
• An example by Oaku (J. Symbolic Comp., 2013)
– Consider the probability P(X^3 ≥ Y^2)
under the bivariate normal distribution:
v(t) = ∫_{x^3 ≥ y^2} e^{−t(x^2+y^2)} dx dy.
– v(t) satisfies
(216t^4 ∂_t^4 + (32t^4 + 1836t^3)∂_t^3 + (224t^3 + 3594t^2)∂_t^2
+ (326t^2 + 1371t)∂_t + 70t + 15) v(t) = 0.
Connection of HGM to Markov
bases (MB)
I would like to emphasize that HGM and MB are
closely related.
• First, the algorithms for HGM use Gröbner
bases for D-modules.
• But from a statistical viewpoint there is a more
important connection, through the non-central
(or generalized) hypergeometric distribution.
• Let A : d × n be a configuration matrix and let
F_b = {x ∈ N^n | b = Ax}
be a fiber.
• Usually in discussing MB, we consider the
hypergeometric distribution over the fiber Fb
p(x) ∝ 1/x! = 1 / ∏_{i=1}^n x_i!.
• This distribution corresponds to the null
hypothesis.
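For this null distribution, a Markov basis gives a simple MCMC
sampler. For the 2-allele Hardy-Weinberg configuration
A = (2 1 0; 0 1 2) appearing later in this talk, the Markov basis
consists of the single move ±(1, −2, 1); a minimal Metropolis sketch
in Python (illustration only, not our implementation):

import random
from math import factorial

def weight(x):
    # unnormalized hypergeometric weight 1/x!
    return 1.0 / (factorial(x[0]) * factorial(x[1]) * factorial(x[2]))

def metropolis_step(x):
    s = random.choice([1, -1])            # direction of the move +/-(1, -2, 1)
    y = (x[0] + s, x[1] - 2*s, x[2] + s)
    if min(y) < 0:
        return x                          # move leaves the fiber: stay put
    return y if random.random() < min(1.0, weight(y) / weight(x)) else x

# e.g. start from (10, 0, 10) in the fiber b = (20, 20) and iterate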
• Under alternative hypotheses we want to
consider the “generalized hypergeometric
distribution”, which is of the form
p(x) ∝ p^x / x! = ∏_{i=1}^n p_i^{x_i} / ∏_{i=1}^n x_i!,
i.e.   p(x) = (1/Z(p)) · p^x / x!.
• The non-central distribution enables us to
construct exact confidence intervals (open).
• The normalizing constant Z(p) is difficult to
evaluate. But it is an “A-hypergeometric
function”, which is holonomic, and the system
of differential equations it satisfies has already
been well studied (implementation open).
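For small fibers, Z(p) can of course be computed by brute-force
enumeration; the point of the A-hypergeometric structure is to replace
this when the fiber is large. A toy Python sketch (configuration and
fiber chosen only for illustration):

from math import factorial

A = ((2, 1, 0), (0, 1, 2))  # Veronese / Hardy-Weinberg configuration
b = (20, 20)

def fiber(A, b, bound=20):
    # naive enumeration of {x in N^3 : Ax = b}
    for x1 in range(bound + 1):
        for x2 in range(bound + 1):
            for x3 in range(bound + 1):
                x = (x1, x2, x3)
                if all(sum(row[i]*x[i] for i in range(3)) == bj
                       for row, bj in zip(A, b)):
                    yield x

def Z(p):
    # normalizing constant: sum over the fiber of p^x / x!
    return sum(p[0]**x[0] * p[1]**x[1] * p[2]**x[2]
               / (factorial(x[0]) * factorial(x[1]) * factorial(x[2]))
               for x in fiber(A, b))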
• In a problem session by B. Sturmfels at NIMS,
Korea, last month (July 2014), his last
question was the following:
Determine all polynomials f (x, y, z) that are
solutions of the following holonomic system of
linear partial differential equations:
∂^2 f/∂x∂z = ∂^2 f/∂y^2   and
2x ∂f/∂x + y ∂f/∂y = 2z ∂f/∂z + y ∂f/∂y = 20 f.
Discuss the statistical interpretation of your
polynomial f (x, y, z).
• The polynomial solution is the normalizing
constant of the non-central hypergeometric
distribution for the fiber b = (20, 20) of the
following configuration:
A = ( 2 1 0
      0 1 2 )
(Veronese configuration. Hardy-Weinberg
model.)
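The polynomial solution can be checked directly by computer algebra.
On this fiber x1 = x3 = (20 − k)/2 and x2 = k for even k, so the
normalizing constant is a finite sum; a SymPy sketch of my own
verification (not from the problem session):

import sympy as sp

x, y, z = sp.symbols('x y z')

# normalizing constant of the fiber b = (20, 20)
f = sum(x**((20 - k)//2) * y**k * z**((20 - k)//2)
        / (sp.factorial((20 - k)//2)**2 * sp.factorial(k))
        for k in range(0, 21, 2))

print(sp.simplify(sp.diff(f, x, z) - sp.diff(f, y, 2)))         # 0
print(sp.simplify(2*x*sp.diff(f, x) + y*sp.diff(f, y) - 20*f))  # 0
print(sp.simplify(2*z*sp.diff(f, z) + y*sp.diff(f, y) - 20*f))  # 0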
• There is another, non-polynomial solution,
which I could later figure out by asking Bernd
in Kobe after the NIMS meeting.
• The other solution for a general fiber.
– Let m1, m2 be non-negative integers such that
m1 + m2 is even. Let f(x, y, z) satisfy
∂^2 f/∂x∂z = ∂^2 f/∂y^2,
2x ∂f/∂x + y ∂f/∂y = m1 f,   2z ∂f/∂z + y ∂f/∂y = m2 f.
Write
x^k = x(x − 1) · · · (x − k + 1).
– The other answer is given as follows.
– m1, m2: even
∑_{i=0}^∞ [ ((m1 − 1)/2)_i ((m2 − 1)/2)_i / (2i + 1)! ] x^{(m1−1)/2 − i} z^{(m2−1)/2 − i} y^{2i+1}
– m1, m2: odd
∑_{i=0}^∞ [ (m1/2)_i (m2/2)_i / (2i + 1)! ] x^{m1/2 − i} z^{m2/2 − i} y^{2i}
(with x^k the falling factorial just defined and (·)_i the Pochhammer symbol).
• For a general configuration matrix A, the rank
(the number of independent solutions) of the
A-hypergeometric system is known and is
related to the volume of the convex hull of A.
• Construction of solutions is discussed in the
Saito-Sturmfels-Takayama book.
• Do solutions other than the normalizing
constant of the non-central hypergeometric
distribution have statistical meaning?
Ex.2: Incomplete Gamma function
• Consider the incomplete gamma function
G(x) = ∫_0^x y^{α−1} e^{−y} dy,   α, x > 0.
• G(x) can be written as
G(x) = (1/α) x^α e^{−x} 1F1(1; α + 1; x),
where 1F1 is the confluent hypergeometric function
1F1(a; c; x) = ∑_{k=0}^∞ [(a)_k / ((c)_k k!)] x^k.
• The differential equation (ODE) satisfied by
F = 1F1:
xF′′(x) + (c − x)F′(x) − aF(x) = 0.
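HGM for G(x) then amounts to integrating this ODE from series initial
values at a small x_0. A minimal Python sketch (SciPy assumed; the
function names are mine):

import numpy as np
from scipy.integrate import solve_ivp

def f11_series(a, c, x, terms=40):
    # truncated series for 1F1(a; c; x) and its derivative (small x)
    F, dF, term = 1.0, 0.0, 1.0
    for k in range(1, terms):
        term *= (a + k - 1) / ((c + k - 1) * k) * x
        F += term
        dF += term * k / x
    return F, dF

def f11_hgm(a, c, x0, x1):
    # x F'' + (c - x) F' - a F = 0 as a first-order system in (F, F')
    rhs = lambda x, v: [v[1], (a*v[0] - (c - x)*v[1]) / x]
    sol = solve_ivp(rhs, (x0, x1), f11_series(a, c, x0),
                    rtol=1e-10, atol=1e-12)
    return sol.y[0, -1]

# then G(x) = x**alpha * np.exp(-x) * f11_hgm(1, alpha + 1, 0.1, x) / alpha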
Wishart distribution and
hypergeometric function of a
matrix argument (a success story)
• W : m × m symmetric positive definite (W > 0)
• Density of Wishart distribution with d.f. n and
covariance matrix Σ > 0:
f(W) = C × (|W|^{(n−m−1)/2} / |Σ|^{n/2}) exp(−(1/2) tr W Σ^{−1})
• C is known (containing gamma functions).
• ℓ1 : the largest root of W
• We want to evaluate the probability Pr(ℓ1 < x).
ℓ1 < x ⇔ W < xI_m,
where I_m is the m × m identity matrix.
• Hence the probability is given in the
incomplete gamma form:
Pr(ℓ1 < x) = C ∫_{0<W<xI_m} (|W|^{(n−m−1)/2} / |Σ|^{n/2}) exp(−(1/2) tr W Σ^{−1}) dW
• From general theory Pr(ℓ1 < x) is holonomic.
• Just as in dim=1, Pr(ℓ1 < x) is written as
Pr(ℓ1 < x) = C′ exp(−(x/2) tr Σ^{−1}) x^{nm/2} 1F1((m + 1)/2; (n + m + 1)/2; (x/2)Σ^{−1})
• Hypergeometric function of a matrix argument
(Herz(1955)):
1F1(a; c; Y) = [Γ_m(c) / (Γ_m(a) Γ_m(c − a))] ∫_{0<X<I_m} exp(tr XY)
× |X|^{a−(m+1)/2} |I_m − X|^{c−a−(m+1)/2} dX,
where
Γ_m(a) = π^{m(m−1)/4} ∏_{i=1}^m Γ(a − (i − 1)/2).
• 1F1(a; c; Y) is a symmetric function of the
characteristic roots of Y ⇒ its series expression
is written in terms of symmetric polynomials.
• Zonal polynomials (A. T. James): C_κ(Y), κ ⊢ k,
a homogeneous symmetric polynomial of degree
k in the characteristic roots of Y.
• Series expansion of 1F1 (Constantine (1963)):
1F1(a; c; Y) = ∑_{k=0}^∞ ∑_{κ⊢k} [(a)_κ / (c)_κ] C_κ(Y) / k!.
• This is a beautiful mathematical result.
However, for numerical computation zonal
polynomials pose enormous combinatorial
difficulties, and statisticians pretty much forgot
about them.
• The partial differential equations satisfied by
F(Y) = 1F1(a; c; y1, . . . , ym) were obtained by
Muirhead (1970):
g_i F = 0,   i = 1, . . . , m,
where
g_i = y_i ∂_i^2 + (c − y_i)∂_i + (1/2) ∑_{j≠i} [y_j/(y_i − y_j)](∂_i − ∂_j) − a.
• Can we use this PDE for numerical
computation? (Nobody had tried this for 40
years.)
• It works! It works very well up to dimension
m = 10 (as of three years ago).
• Takayama claims that with a computer with
256GB of memory, he can now handle up to
dimension m = 20.
HGM for dimension two
• Two partial differential equations
g1 F = [ y1 ∂_1^2 + (c − y1)∂_1 + (1/2) · y2/(y1 − y2) · (∂_1 − ∂_2) − a ] F = 0,
g2 F = [ y2 ∂_2^2 + (c − y2)∂_2 + (1/2) · y1/(y2 − y1) · (∂_2 − ∂_1) − a ] F = 0.
• Let us compute higher-order derivatives from
these equations.
• Divide the second equation by y2 and write
∂_1^{n1} ∂_2^{n2} F = ∂_1^{n1} ∂_2^{n2−2} ( −(c/y2)∂_2 + ∂_2
− (1/2) · y1/(y2(y2 − y1)) · (∂_2 − ∂_1) + a/y2 ) F.
y2
• The RHS becomes messy, but an important
fact is that the number of differentiations is
reduced by 1.
• We can keep reducing the number of
differentiations as long as some variable is
differentiated more than once.
• This implies that all higher-order derivatives
can be written as a rational function
combination of the following 4 square-free
mixed derivatives:
F (Y ), ∂1 F (Y ), ∂2 F (Y ), ∂1 ∂2 F (Y ).
• Hence we only keep
F (Y ), ∂1 F (Y ), ∂2 F (Y ), ∂1 ∂2 F (Y )
in memory. We can always compute
higher-order derivatives from these 4 values.
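The rewriting step itself is elementary; for instance, solving
g2 F = 0 for ∂_2^2 F in terms of the square-free derivatives can be
scripted. A SymPy sketch (variable names are mine):

import sympy as sp

y1, y2, c, a = sp.symbols('y1 y2 c a')
F, F1, F2, F22 = sp.symbols('F F1 F2 F22')  # F, d1 F, d2 F, d2^2 F

# g2 F = 0: y2*F22 + (c - y2)*F2 + (1/2)*y1/(y2 - y1)*(F2 - F1) - a*F = 0
g2 = y2*F22 + (c - y2)*F2 + sp.Rational(1, 2)*y1/(y2 - y1)*(F2 - F1) - a*F
print(sp.solve(sp.Eq(g2, 0), F22)[0])  # d2^2 F via F, d1 F, d2 F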
• For dimension m, we need to keep 2^m
square-free mixed derivatives in memory. This
is the limitation of the current method.
• The problem of initial values is also difficult in
general dimension. N.Takayama generalized
the algorithm of Koev-Edelman(2006) to
handle partial derivatives.
Numerical experiments
• Statisticians need good numerical performance
(recall zonal polynomials).
• We were not sure whether it would work up to
dimension m = 10.
• Three years ago, with an Intel Core i7
machine: the computation of the initial value
at x0 = 0.2 took 20 seconds. Then, with step
size 0.001, we solved the PDE up to x = 30,
which took 75 seconds. Output:
Pr[ℓ1 < 30] = 0.999545
• This accuracy is somewhat amazing, if we
consider that we updated a 1024-dimensional
vector 30,000 times.
• As I indicated above, it is now working up to
dimension m = 20 with a bigger machine with
256GB memory.
• Plot of the cumulative distribution Pr[ℓ1 < x]
computed by HGM (legend “by hg”; vertical axis
0–1.2, horizontal axis 0–20). [figure]
Comparison with existing methods (m = 2)
[Figure: HGM compared with the Laplace approximation and with
truncation of the zonal polynomial series at k = 50; vertical axis
0.2–1.2, horizontal axis 5–30.]
Restriction to diagonal region
• We have so far assumed the non-diagonal
region y_i ≠ y_j.
• On the diagonal yi = yj , the PDE is singular:
g_i = y_i ∂_i^2 + (c − y_i)∂_i + (1/2) ∑_{j≠i} [y_j/(y_i − y_j)](∂_i − ∂_j) − a.
• Let m = 2. Consider letting y1 → y2 in
[ y1 ∂_1^2 + (c − y1)∂_1 + (1/2) · y2/(y1 − y2) · (∂_1 − ∂_2) − a ] F = 0.
• We apply l’Hospital’s rule to
(∂_1 − ∂_2) / (y1 − y2).
• L’Hospital’s rule results in
lim_{y1→y2=y} (∂_1 − ∂_2)/(y1 − y2) = ∂_1^2 − ∂_1∂_2.
• After applying L’Hospital’s rule several times,
we can show that f (y) = F (y, y) satisfies the
following ODE:
(y/8) f′′′(y) + (c − 1 − y) ( (3/8) f′′(y) + ((c − y)/(4y)) f′(y) − (a/(2y)) f(y) )
+ (1/4) f′′(y) − (a/2) f′(y) = 0.
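This form can be checked against the asir output below by clearing
denominators; a SymPy sketch of the verification (my own check):

import sympy as sp

y, c, a = sp.symbols('y c a')
f = sp.Function('f')
ode = (y/8 * f(y).diff(y, 3)
       + (c - 1 - y)*(sp.Rational(3, 8)*f(y).diff(y, 2)
                      + (c - y)/(4*y)*f(y).diff(y) - a/(2*y)*f(y))
       + sp.Rational(1, 4)*f(y).diff(y, 2) - a/2*f(y).diff(y))
print(sp.expand(-8*y*ode))  # reproduces the operator computed by asir below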
• Actually this computation can be performed
by Oaku’s restriction algorithm (1997) for
holonomic ideals.
• The following asir program
import("names.rr")$
import("nk_restriction.rr")$
dp_gr_print(1)$ dp_ord(0)$
G1=y1*dy1^2 + (c-y1)*dy1+(1/2)*(y2/(y1-y2))*(dy1-dy2)-a; G1=red((y1-y2)*G1);
G2=base_replace(G1,[[y1,y2],[y2,y1],[dy1,dy2],[dy2,dy1]]);
F=base_replace([G1,G2],[[y1,y],[y2,y+z2],[dy1,dy-dz2],[dy2,dz2]]);
A=nk_restriction.restriction_ideal(F,[z2,y],[dz2,dy],[1,0] | param=[a,c]);
end$
outputs the following, which coincides with the
by-hand computation!
-y^2*dy^3+(3*y^2+(-3*c+1)*y)*dy^2+(-2*y^2+(4*a+4*c-2)*y
-2*c^2+2*c)*dy-4*a*y+(4*c-4)*a
• In hindsight, this program (Oaku’s algorithm)
worked only for m = 2, 3.
• For m = 4, the computation did not finish in
one month.
• Clear the denominators and consider
g̃_i = ∏_{j≠i} (y_i − y_j) × g_i,   i = 1, . . . , m.
• Conjecture: g̃1 , . . . , g̃m generate a holonomic
ideal in C⟨y1 , . . . , ym , ∂1 , . . . , ∂m ⟩.
• True for m ≤ 3, but not true for m ≥ 4.
• Differential equations for the diagonal cases
(multiple eigenvalues) have been obtained by
Noro for m ≤ 8, and some further cases have
been obtained by Manuel Kauers recently.
(The general case is an open question.)
Current summary on HGM
• The holonomic gradient method is practical if
we implement it efficiently.
• Our approach brought a totally new viewpoint
to a longstanding problem in statistics.
• The holonomic gradient method is general and
can be applied to many problems.
• We stand at the beginning of applications of
D-module theory to statistics!
• The problem of singularity seems to be hard
and interesting.