Some open problems in applications of holonomic gradient method to statistics

Akimichi Takemura, Univ. Tokyo
August 26, 2014, Prague

Outline

1. Advertisement and overview of my research on holonomic gradient method (HGM)
2. My first example: Airy-like function
3. Definition of holonomic functions
4. Connection of HGM to Markov bases (MB)
5. Second example: incomplete gamma function
6. Wishart distribution and hypergeometric function of a matrix argument

A one-page advertisement of HGM

• HGM is a common ground, where statistics, algebra and numerical analysis meet.
• HGM is a general method and can be applied when functions are holonomic.
• HGM already has some success stories.
• It is numerically very accurate. With a proper mix of symbolic and numerical computations, it is often faster than existing methods.
• Connection to MB: i) Gröbner bases for D-modules, ii) non-central hypergeometric distribution.

Overview of my research

• Beginning: a problem session by Takayama in September 2009.
• HGM was proposed in “Holonomic gradient descent and its application to the Fisher-Bingham integral”, Advances in Applied Mathematics, 47, 639–658, by N³OST² (2011).
• My coauthors on HGM: H.Hashiguchi, J.Hayakawa, T.Koyama, S.Kuriki, N.Marumo, H.Nakayama, K.Nishiyama, M.Noro, Y.Numata, K.Ohara, T.Sei, C.Siriteanu, N.Takayama.
• The following site lists 15 manuscripts so far. http://www.math.kobe-u.ac.jp/OpenXM/Math/hgm/ref-hgm.html
• Wishart discussed in arXiv:1201.0472v3. Published in Journal of Multivariate Analysis, doi:10.1016/j.jmva.2013.03.011.
• “Estimation of exponential-polynomial distribution by holonomic gradient descent”, J. Hayakawa and A. Takemura. arXiv:1403.7852

References on HGM

• Ch.6 of “Dojo” (Hibi ed.) in English.
• Tutorial slide by Takayama “takayama-hgm-v3.pdf”
• Saito-Sturmfels-Takayama book (2000).

Ex.1: Airy-like function

An exercise problem by Nobuki Takayama on Sep. 16, 2009, during “Kobe Gröbner School”.
Question: Let
$$A(x) = \int_0^\infty e^{-t - x t^3}\,dt, \qquad x > 0.$$
Derive a differential equation satisfied by $A(x)$.

Answer:
$$1 = (27x^3\partial_x^2 + 54x^2\partial_x + 6x + 1)A(x) = 27x^3 A''(x) + 54x^2 A'(x) + (6x + 1)A(x).$$

• This question is pretty hard, even if you are told the answer.
• Actually Prof. Takayama did this by computer (GB computation for D-modules), while asking us to do it by hand(!)
• I was struggling with this problem, wasting lots of paper and wondering why I was doing this exercise.
• After one hour, I suddenly realized that this is indeed an important problem in statistics.

Is this exercise related to statistics?

• Change the notation and let
$$A(\theta) = \int_0^\infty e^{-x - \theta x^3}\,dx.$$
• Let
$$f(x;\theta) = \frac{1}{A(\theta)}\, e^{-x - \theta x^3}, \qquad x, \theta > 0.$$
• This is an exponential family with the sufficient statistic $T(x) = x^3$.
• Therefore we are evaluating the normalizing constant and its derivatives of an exponential family.
• Is ODE useful? We now know
$$1 = 27\theta^3 A''(\theta) + 54\theta^2 A'(\theta) + (6\theta + 1)A(\theta).$$
• Hence $A''(\theta)$, and with it the Fisher information $A''/A - (A'/A)^2$, is automatically obtained from $A(\theta)$ and $A'(\theta)$.
• Can we numerically evaluate $A(\theta)$ and $A'(\theta)$? (Also $A''(\theta)$ for Newton-Raphson?)
• For illustration, we use simple linear approximation (Euler method):
$$A(\theta + \Delta\theta) \doteq A(\theta) + \Delta\theta\, A'(\theta), \qquad A'(\theta + \Delta\theta) \doteq A'(\theta) + \Delta\theta\, A''(\theta).$$
• But from the differential equation we know
$$A''(\theta) = \frac{1}{27\theta^3}\bigl(1 - (6\theta + 1)A(\theta) - 54\theta^2 A'(\theta)\bigr).$$
• Punch line: if you keep numerical values of $A(\theta)$, $A'(\theta)$ at one point $\theta$, then you can compute these values at nearby $\theta + \Delta\theta$.
• At each point, higher-order derivatives $A''(\theta), A'''(\theta), \dots$ can be computed as needed.
• Hence by numerically solving the ODE, you can compute $A(\theta)$ and its derivatives at any point → “Holonomic Gradient Method”
• For explanation we used the Euler method, but in our actual implementation we use the Runge-Kutta method to solve the ODE.

Definition of holonomic functions (from Takayama’s tutorial)

• Let $f(x) = f(x_1, \dots, x_n)$ be a smooth function defined on an open set $U$ of $\mathbb{R}^n$.
• $f$ is called a holonomic function if $f$ is annihilated by $n$ differential operators $L_i$, $i = 1, \dots, n$, of the form
$$L_i = a_{i m_i}(x)\,\partial_i^{m_i} + a_{i, m_i - 1}(x)\,\partial_i^{m_i - 1} + \cdots + a_{i0}(x),$$
where $a_{ij}(x)$ is a rational function in $x$.
• The set of holonomic functions is closed under addition and multiplication (but not under division).
• It is also closed under integration (marginalization in statistics):
$$f \text{ holonomic} \;\Rightarrow\; \int f(x)\,dx_n \text{ holonomic}.$$
• Examples:
  – Rational functions are holonomic.
  – $\exp(f(x))$ is holonomic if $f$ is a rational function. This is important to statistics, because the density of the normal distribution is holonomic.
  – $|x|$ is not a holonomic function. But:
  – $|x|$ is a holonomic “distribution” (generalized function).
  – The indicator function of a region defined by polynomial inequalities is a holonomic distribution. This is important for statistics.
• An example by Oaku (J. Symbolic Comp., 2013)
  – Consider the probability $P(X^3 \ge Y^2)$ under the bivariate normal distribution:
$$v(t) = \int_{x^3 \ge y^2} e^{-t(x^2 + y^2)}\,dx\,dy.$$
  – $v(t)$ satisfies
$$\bigl(216t^4\partial_t^4 + (32t^4 + 1836t^3)\partial_t^3 + (224t^3 + 3594t^2)\partial_t^2 + (326t^2 + 1371t)\partial_t + 70t + 15\bigr)v(t) = 0.$$

Connection of HGM to Markov bases (MB)

I would like to emphasize that HGM and MB are closely related.

• First, the algorithms for HGM use Gröbner bases for D-modules.
• But from a statistical viewpoint there is a more important connection through the non-central (or generalized) hypergeometric distribution.
• Let $A : d \times n$ be a configuration matrix and let $F_b = \{x \in \mathbb{N}^n \mid b = Ax\}$ be a fiber.
• Usually in discussing MB, we consider the hypergeometric distribution over the fiber $F_b$:
$$p(x) \propto \frac{1}{x!} = \frac{1}{\prod_{i=1}^n x_i!}.$$
• This distribution corresponds to the null hypothesis.
• Under alternative hypotheses we want to consider the “generalized hypergeometric distribution”, which is of the form
$$p(x) \propto \frac{p^x}{x!} = \frac{\prod_{i=1}^n p_i^{x_i}}{\prod_{i=1}^n x_i!}, \qquad p(x) = \frac{1}{Z(p)}\,\frac{p^x}{x!}.$$
• Non-central distribution enables us to construct exact confidence intervals (open).
• The normalizing constant $Z(p)$ is difficult. But it is an “A-hypergeometric function”, which is holonomic, and the system of differential equations for it has already been well studied (implementation open).
• In a problem session by B.Sturmfels at NIMS, Korea, last month (July 2014), the last question by him was the following:
  Determine all polynomials $f(x, y, z)$ that are solutions of the following holonomic system of linear partial differential equations:
$$\frac{\partial^2 f}{\partial x\,\partial z} = \frac{\partial^2 f}{\partial y^2} \quad\text{and}\quad 2x\frac{\partial f}{\partial x} + y\frac{\partial f}{\partial y} = 2z\frac{\partial f}{\partial z} + y\frac{\partial f}{\partial y} = 20 \times f.$$
  Discuss the statistical interpretation of your polynomial $f(x, y, z)$.
• The polynomial solution is the normalizing constant of the non-central hypergeometric distribution for the fiber $b = (20, 20)$ of the following configuration
$$A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 1 & 2 \end{pmatrix}$$
  (Veronese configuration. Hardy-Weinberg model.)
• There is another, non-polynomial solution, which I could later figure out by asking Bernd in Kobe after the NIMS meeting.
• The other solution for a general fiber.
  – Let $m_1, m_2$ be non-negative integers such that $m_1 + m_2$ is even. Let $f(x, y, z)$ satisfy
$$\frac{\partial^2 f}{\partial x\,\partial z} = \frac{\partial^2 f}{\partial y^2}, \qquad 2x\frac{\partial f}{\partial x} + y\frac{\partial f}{\partial y} = m_1 f, \qquad 2z\frac{\partial f}{\partial z} + y\frac{\partial f}{\partial y} = m_2 f.$$
    Write $x^{\underline{k}} = x(x - 1)\cdots(x - k + 1)$.
  – The other answer is given as follows.
  – $m_1, m_2$: even
$$\sum_{i=0}^{\infty} \frac{\left(\frac{m_1 - 1}{2}\right)^{\underline{i}} \left(\frac{m_2 - 1}{2}\right)^{\underline{i}}}{(2i + 1)!}\; x^{\frac{m_1 - 1}{2} - i}\, z^{\frac{m_2 - 1}{2} - i}\, y^{2i + 1}$$
  – $m_1, m_2$: odd
$$\sum_{i=0}^{\infty} \frac{\left(\frac{m_1}{2}\right)^{\underline{i}} \left(\frac{m_2}{2}\right)^{\underline{i}}}{(2i)!}\; x^{\frac{m_1}{2} - i}\, z^{\frac{m_2}{2} - i}\, y^{2i}$$
• For a general configuration matrix $A$, the rank (the number of independent solutions) of the A-hypergeometric system is known and is related to the volume of the convex hull of $A$.
• Construction of solutions is discussed in the Saito-Sturmfels-Takayama book.
• Do solutions other than the normalizing constant of the non-central hypergeometric distribution have statistical meaning?

Ex.2: Incomplete Gamma function

• Consider the incomplete Gamma function
$$G(x) = \int_0^x y^{\alpha - 1} e^{-y}\,dy, \qquad \alpha, x > 0.$$
• $G(x)$ can be written as
$$G(x) = \frac{1}{\alpha}\, x^{\alpha} e^{-x}\; {}_1F_1(1; \alpha + 1; x),$$
  where ${}_1F_1$ is the confluent hypergeometric function
$${}_1F_1(a; c; x) = \sum_{k=0}^{\infty} \frac{(a)_k}{(c)_k}\,\frac{x^k}{k!}.$$
• Differential equation (ODE) satisfied by $F = {}_1F_1$:
$$xF''(x) + (c - x)F'(x) - aF(x) = 0.$$

Wishart distribution and hypergeometric function of a matrix argument (a success story)

• $W$: $m \times m$ symmetric positive definite ($W > 0$)
• Density of the Wishart distribution with d.f. $n$ and covariance matrix $\Sigma > 0$:
$$f(W) = C \times \frac{|W|^{\frac{n - m - 1}{2}}}{|\Sigma|^{\frac{n}{2}}} \exp\!\left(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\right)$$
• $C$ is known (containing gamma functions).
• $\ell_1$: the largest root of $W$
• We want to evaluate the probability $\Pr(\ell_1 < x)$. Note that $\ell_1 < x \Leftrightarrow W < xI_m$, where $I_m$ is the $m \times m$ identity matrix.
• Hence the probability is given in the incomplete gamma form:
$$\Pr(\ell_1 < x) = C \int_{0 < W < xI_m} \frac{|W|^{\frac{n - m - 1}{2}}}{|\Sigma|^{\frac{n}{2}}} \exp\!\left(-\frac{1}{2}\,\mathrm{tr}\,W\Sigma^{-1}\right) dW$$
• From general theory $\Pr(\ell_1 < x)$ is holonomic.
• Just as in dim=1, $\Pr(\ell_1 < x)$ is written as
$$C' \exp\!\left(-\frac{x}{2}\,\mathrm{tr}\,\Sigma^{-1}\right) x^{\frac{1}{2}nm}\; {}_1F_1\!\left(\frac{m + 1}{2}; \frac{n + m + 1}{2}; \frac{x}{2}\Sigma^{-1}\right)$$
• Hypergeometric function of a matrix argument (Herz (1955)):
$${}_1F_1(a; c; Y) = \frac{\Gamma_m(c)}{\Gamma_m(a)\Gamma_m(c - a)} \int_{0 < X < I_m} \exp(\mathrm{tr}\,XY)\, |X|^{a - (m + 1)/2}\, |I_m - X|^{c - a - (m + 1)/2}\,dX,$$
  where
$$\Gamma_m(a) = \pi^{\frac{m(m - 1)}{4}} \prod_{i=1}^{m} \Gamma\!\left(a - \frac{i - 1}{2}\right).$$
• ${}_1F_1(a; c; Y)$ is a symmetric function of the characteristic roots of $Y$ ⇒ its series expression is written in terms of symmetric polynomials.
• Zonal polynomials (A.T.James) $C_\kappa(Y)$, $\kappa \vdash k$: homogeneous symmetric polynomials of degree $k$ in the characteristic roots of $Y$.
• Series expansion of ${}_1F_1$ (Constantine (1963)):
$${}_1F_1(a; c; Y) = \sum_{k=0}^{\infty} \sum_{\kappa \vdash k} \frac{(a)_\kappa}{(c)_\kappa}\,\frac{C_\kappa(Y)}{k!}.$$
• This is a beautiful mathematical result. However, for numerical computation, zonal polynomials have enormous combinatorial difficulties and statisticians pretty much forgot zonal polynomials.
• The partial differential equation satisfied by $F(Y) = {}_1F_1(a; c; y_1, \dots, y_m)$ was obtained by Muirhead (1970): $g_i F = 0$, $i = 1, \dots, m$, where
$$g_i = y_i\partial_i^2 + (c - y_i)\partial_i + \frac{1}{2}\sum_{j \ne i} \frac{y_j}{y_i - y_j}(\partial_i - \partial_j) - a.$$
• Can we use this PDE for numerical computation? (People never tried this for 40 years.)
• Works! It works very well up to dimension $m = 10$ (three years ago).
• Takayama claims that with a computer with 256GB of memory, he can now handle up to dimension $m = 20$.

HGM for dimension two

• Two partial differential equations:
$$g_1 F = \left[y_1\partial_1^2 + (c - y_1)\partial_1 + \frac{1}{2}\,\frac{y_2}{y_1 - y_2}(\partial_1 - \partial_2) - a\right]F = 0,$$
$$g_2 F = \left[y_2\partial_2^2 + (c - y_2)\partial_2 + \frac{1}{2}\,\frac{y_1}{y_2 - y_1}(\partial_2 - \partial_1) - a\right]F = 0.$$
• Let us compute higher-order derivatives from these equations.
• Divide the second equation by $y_2$ and write
$$\partial_1^{n_1}\partial_2^{n_2}F = \partial_1^{n_1}\partial_2^{n_2 - 2}\left(-\frac{c}{y_2}\partial_2 + \partial_2 - \frac{1}{2}\,\frac{y_1}{y_2(y_2 - y_1)}(\partial_2 - \partial_1) + \frac{a}{y_2}\right)F.$$
• The RHS becomes messy, but an important fact is that the number of differentiations is reduced by 1.
• We can keep reducing the number of differentiations as long as some variable is differentiated more than once.
• This implies that all higher-order derivatives can be written as a combination, with rational function coefficients, of the following 4 square-free mixed derivatives: $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$.
• Hence we only keep $F(Y)$, $\partial_1 F(Y)$, $\partial_2 F(Y)$, $\partial_1\partial_2 F(Y)$ in memory. We can always compute higher-order derivatives from these 4 values.
• For dimension $m$, we need to keep $2^m$ square-free mixed derivatives in memory. This is the limitation of the current method.
• The problem of initial values is also difficult in general dimension. N.Takayama generalized the algorithm of Koev-Edelman (2006) to handle partial derivatives.

Numerical experiments

• Statisticians need good numerical performance (recall zonal polynomials).
• We were not sure whether it works up to dimension $m = 10$.
• 3 years ago, with an Intel Core i7 machine: the computation of the initial value at $x_0 = 0.2$ takes 20 seconds. Then with the step size 0.001, we solve the PDE up to $x = 30$, which takes 75 seconds.
  Output: $\Pr[\ell_1 < 30] = 0.999545$
• This accuracy is somewhat amazing, if we consider that we updated a 1024-dimensional vector 30,000 times.
• As I indicated above, it is now working up to dimension $m = 20$ with a bigger machine with 256GB memory.
• Plot of the cumulative distribution
  [Figure: cumulative distribution curve, labelled “by hg”.]

Comparison with existing methods (m = 2)

[Figure: curves labelled “Laplace app.”, “HGM” and “Truncation of series k = 50”.]

Restriction to diagonal region

• We have so far assumed the non-diagonal region $y_i \ne y_j$.
• On the diagonal $y_i = y_j$, the PDE is singular:
$$g_i = y_i\partial_i^2 + (c - y_i)\partial_i + \frac{1}{2}\sum_{j \ne i} \frac{y_j}{y_i - y_j}(\partial_i - \partial_j) - a.$$
• Let $m = 2$. Consider letting $y_1 \to y_2$ in
$$\left[y_1\partial_1^2 + (c - y_1)\partial_1 + \frac{1}{2}\,\frac{y_2}{y_1 - y_2}(\partial_1 - \partial_2) - a\right]F = 0.$$
• We apply l’Hospital’s rule to
$$\frac{\partial_1 - \partial_2}{y_1 - y_2}.$$
• L’Hospital’s rule results in
$$\lim_{y_1 \to y_2 = y} \frac{\partial_1 - \partial_2}{y_1 - y} = \partial_1^2 - \partial_1\partial_2.$$
• After applying l’Hospital’s rule several times, we can show that $f(y) = F(y, y)$ satisfies the following ODE:
$$\frac{y}{8} f'''(y) + (c - 1 - y)\left(\frac{3}{8} f''(y) + \frac{c - y}{4y} f'(y) - \frac{a}{2y} f(y)\right) + \frac{1}{4} f''(y) - \frac{a}{2} f'(y) = 0.$$
• Actually this computation can be performed by Oaku’s restriction algorithm (1997) for a holonomic ideal.
• The following asir program

    import("names.rr")$
    import("nk_restriction.rr")$
    dp_gr_print(1)$
    dp_ord(0)$
    G1=y1*dy1^2 + (c-y1)*dy1+(1/2)*(y2/(y1-y2))*(dy1-dy2)-a;
    G1=red((y1-y2)*G1);
    G2=base_replace(G1,[[y1,y2],[y2,y1],[dy1,dy2],[dy2,dy1]]);
    F=base_replace([G1,G2],[[y1,y],[y2,y+z2],[dy1,dy-dz2],[dy2,dz2]]);
    A=nk_restriction.restriction_ideal(F,[z2,y],[dz2,dy],[1,0] | param=[a,c]);
    end$

  outputs the following, which coincides with the by-hand computation!

    -y^2*dy^3+(3*y^2+(-3*c+1)*y)*dy^2+(-2*y^2+(4*a+4*c-2)*y-2*c^2+2*c)*dy-4*a*y+(4*c-4)*a

• In hindsight, this program (Oaku’s algorithm) worked only for $m = 2, 3$.
• For $m = 4$, the computation did not finish in one month.
• Clear the denominator and consider
$$\tilde{g}_i = \prod_{j \ne i}(y_i - y_j) \times g_i, \qquad i = 1, \dots, m.$$
• Conjecture: $\tilde{g}_1, \dots, \tilde{g}_m$ generate a holonomic ideal in $\mathbb{C}\langle y_1, \dots, y_m, \partial_1, \dots, \partial_m\rangle$.
• True for $m \le 3$, but not true for $m \ge 4$.
• Differential equations for diagonal cases (multiple eigenvalues) have been obtained by Noro for $m \le 8$, and some further ones have been obtained by Manuel Kauers recently. (The general case is an open question.)

Current summary on HGM

• The holonomic gradient method is practical if we implement it efficiently.
• Our work brought a totally new approach to a longstanding problem in statistics.
• The holonomic gradient method is general and can be applied to many problems.
• We stand at the beginning of applications of D-module theory to statistics!
• The problem of singularity seems to be hard and interesting.
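
As a concrete illustration of the recipe in Ex.1, here is a minimal sketch in Python (not the Asir used for the restriction computation, and not the implementation behind the reported results). It assumes the initial values A(θ0) and A′(θ0) come from a plain Simpson rule on a truncated interval, and then advances the pair (A, A′) with a classical fourth-order Runge-Kutta scheme using the ODE 27θ³A″ + 54θ²A′ + (6θ + 1)A = 1. The function names, the truncation point 10 (adequate only for θ of order 1 or larger), and the step counts are illustrative assumptions.

```python
import math


def quad_moment(theta, power, upper=10.0, n=20000):
    """Simpson approximation of the integral of x**power * exp(-x - theta*x**3)
    over [0, upper]. Assumes the tail beyond `upper` is negligible
    (true for theta >= 1 with upper = 10) and that n is even."""
    h = upper / n

    def g(x):
        return x ** power * math.exp(-x - theta * x ** 3)

    total = g(0.0) + g(upper)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(k * h)
    return total * h / 3.0


def rhs(theta, state):
    """Right-hand side of the first-order system for (A, A')."""
    a, da = state
    # A'' is read off from 27 t^3 A'' + 54 t^2 A' + (6 t + 1) A = 1.
    dda = (1.0 - (6.0 * theta + 1.0) * a - 54.0 * theta ** 2 * da) / (27.0 * theta ** 3)
    return (da, dda)


def rk4_step(theta, state, h):
    """One classical Runge-Kutta step of size h."""
    k1 = rhs(theta, state)
    k2 = rhs(theta + h / 2, tuple(s + h / 2 * k for s, k in zip(state, k1)))
    k3 = rhs(theta + h / 2, tuple(s + h / 2 * k for s, k in zip(state, k2)))
    k4 = rhs(theta + h, tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(
        s + h / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def hgm_airy_like(theta0, theta1, steps=1000):
    """Return (A(theta1), A'(theta1)), starting from quadrature values at theta0."""
    state = (quad_moment(theta0, 0), -quad_moment(theta0, 3))  # A' = -E-type moment of x^3
    h = (theta1 - theta0) / steps
    for i in range(steps):
        state = rk4_step(theta0 + i * h, state, h)
    return state


if __name__ == "__main__":
    a, da = hgm_airy_like(1.0, 2.0)
    # Compare the ODE-propagated values with direct quadrature at theta = 2.
    print(a, quad_moment(2.0, 0))
    print(da, -quad_moment(2.0, 3))
```

Only the two numbers A and A′ are carried along the path; A″ is recomputed from the ODE whenever it is needed, which is the one-dimensional analogue of keeping the 2^m square-free mixed derivatives in the Wishart case.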