Information Geometry: an overview
Paul Marriott, Department of Statistics and Actuarial Science, University of Waterloo
WOGAS III, 5th April 2011

Overview
• An overview of Information Geometry (IG) in a broad sense.
• Models: full, curved and extended exponential families, and 'universal' families.
• Geometries: expected/observed dual affine spaces, information and entropy, mixture geometry.
• Methods: higher order asymptotics, tensor calculus, curvature, dimension reduction and spectral techniques.
• Notation: [1] refers to the References, while [S:7] refers to WOGAS III session number 7.

Information Geometry
• How do we connect two probability density or mass functions f(x) and g(x) in some space of models? (A numerical sketch of both connections follows the figure below.)
  −1: ρ f(x) + (1 − ρ) g(x)
  +1: f(x)^ρ g(x)^{1−ρ} / C(ρ)
• Two different affine structures are used simultaneously:
  −1: mixture affine geometry on unit measures;
  +1: exponential affine geometry on positive measures.
• The Fisher information plays two roles:
  • it measures angles and lengths;
  • it maps between the +1 and −1 representations of tangent vectors, [3], [4], [18].

Visualising IG: extended trinomial example
[Figure: four panels showing (a) −1-geodesics in the −1-simplex, (b) −1-geodesics in the +1-simplex, (c) +1-geodesics in the −1-simplex, (d) +1-geodesics in the +1-simplex.]
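A minimal numerical sketch of the two connections; the two normal densities used for f and g and the value ρ = 0.3 are arbitrary illustrative choices, not taken from the slides. It checks that both the −1 (mixture) and +1 (exponential) paths are genuine densities, the latter after dividing by the normalising constant C(ρ).

```python
import numpy as np
from scipy import integrate, stats

# Two illustrative densities to connect (arbitrary choices).
f = stats.norm(0, 1).pdf
g = stats.norm(2, 1).pdf

def mixture_path(rho):
    """-1 connection: the convex combination, automatically a density."""
    return lambda x: rho * f(x) + (1 - rho) * g(x)

def exponential_path(rho):
    """+1 connection: f^rho * g^(1-rho), renormalised by C(rho)."""
    C, _ = integrate.quad(lambda u: f(u)**rho * g(u)**(1 - rho),
                          -np.inf, np.inf)
    return lambda x: f(x)**rho * g(x)**(1 - rho) / C

# Both paths integrate to one, e.g. at rho = 0.3.
for name, path in [("-1 (mixture)", mixture_path(0.3)),
                   ("+1 (exponential)", exponential_path(0.3))]:
    total, _ = integrate.quad(path, -np.inf, np.inf)
    print(name, "integrates to", round(total, 6))
```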
Duality
• There exists a mixed parameterisation [6], obtained as the solution of a differential equation.
[Figure: the mixed parameterisation of the trinomial simplex.]
• −1-geodesics are Fisher orthogonal to +1-geodesics.
• Limits of the mixed parameters give the extended exponential family.
• This is key to the structural theorem of [3] and to the idea of inferential cuts.

Asymptotic expansions
• Can apply Edgeworth, saddlepoint or Laplace expansions [29].
• There are strong links between IG and higher order asymptotic expansions [7].
[Figure: sampling densities of the MLE of the mean and of the rate.]
• Flexible and tractable given IG, with clear invariance properties [3].

Example: survival times
• Censored survival times in leukaemia patients, adapted from [15].
• Censored exponential model, [18], [24]:
  exp{ λ1 x + λ2 y − log[ (e^{λ2 t} − 1)/λ2 + e^{λ1 + λ2 t} ] },
  a full exponential family inside which the censored exponential sits as a curved exponential family (λ1(μ), λ2(μ)).
• The bias of the MLE is given by the information geometric formula
  −(1/2n) [ Γ^{(−1)a}_{cd} g^{cd} + h^{(−1)a}_{κλ} g^{κλ} ] + o(1/n).
• This formula is 'not difficult' in the sense that it only uses sums and partial derivatives.
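The O(1/n) behaviour that the bias formula describes is easy to see by simulation. In the sketch below the rate, censoring time, sample sizes and replication count are illustrative assumptions; the closed-form estimator used (number of deaths divided by total time at risk) is the standard censored-exponential MLE, not a formula taken from the slides.

```python
import numpy as np

# Monte Carlo look at the O(1/n) bias of the MLE in a censored exponential
# model with rate lam and a fixed censoring time t (values are assumptions).
rng = np.random.default_rng(1)
lam, t = 0.5, 3.0

def mle_rate(n):
    s = rng.exponential(1.0 / lam, size=n)   # latent survival times
    y = np.minimum(s, t)                     # observed, possibly censored, times
    d = (s < t).sum()                        # number of observed deaths
    return d / y.sum()                       # deaths / total time at risk

for n in (20, 80):
    est = np.array([mle_rate(n) for _ in range(20000)])
    print(f"n = {n:3d}: estimated bias of the MLE = {est.mean() - lam:+.4f}")
```

Quadrupling the sample size should roughly quarter the estimated bias, consistent with the leading 1/n term.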
Asymptotic expansions
• High dimensional calculus is handled through tensor analysis, McCullagh [27].
• Many terms need to be computed in high dimensional problems [S:7].
• Language issue.
• Singularity of the Fisher information matters.
[Figure: eigenvalues of a Fisher information matrix plotted against rank, decaying rapidly towards zero.]
• The Fisher information can be singular or infinite [22].

Embedding curvature and affine approximation
• Curvature(s) are a key part of differential geometry.
• The tangent space gives the best linear approximation.
• The tangent space and curvature together give the best two dimensional affine embedding space.

Curvature and affine approximation
• Different affine geometries give different approximating spaces.
• Low dimensional +1-affine spaces give approximate sufficient statistics [26].
• Low dimensional −1 approximations give limits to identification and computation in mixture models [25], [2], [S:3], [S:6].

Example: survival times
• Example: censored survival times in leukaemia patients, adapted from [15].
• Use the censored exponential distribution; a numerical sketch follows the figure below.
[Figure: the log-likelihood of the curved model over μ, the same model embedded in the log-likelihood of the full exponential family (θ1, θ2), and the sampling distribution of the MLE of μ.]
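The sketch below profiles the full-family log-likelihood along the curved subfamily. It relies on the reconstruction of the cumulant function displayed earlier, together with the assumptions that y is the observed time, x the censoring indicator, and that the curved subfamily sits at (λ1, λ2) = (−log(rate), −rate); the censoring time, true rate and sample size are also illustrative. Read it as one plausible reading of the slide's parameterisation, not the definitive one.

```python
import numpy as np

t = 5.0                                    # assumed censoring time

def K(l1, l2):
    """Cumulant function of the reconstructed full family (requires l2 < 0)."""
    return np.log((np.exp(l2 * t) - 1.0) / l2 + np.exp(l1 + l2 * t))

def loglik(l1, l2, x, y):
    """Full-family log-likelihood; x = censoring indicators, y = observed times."""
    return l1 * x.sum() + l2 * y.sum() - x.size * K(l1, l2)

# Simulated data from the curved subfamily with an assumed rate.
rng = np.random.default_rng(0)
true_rate = 0.4
s = rng.exponential(1.0 / true_rate, size=40)  # latent survival times
y = np.minimum(s, t)                           # observed times
x = (s >= t).astype(float)                     # 1 = censored at t

# Profile the log-likelihood along the curve (l1, l2) = (-log(rate), -rate).
rates = np.linspace(0.05, 2.0, 400)
profile = [loglik(-np.log(r), -r, x, y) for r in rates]
print("MLE of the rate along the curve ~", rates[np.argmax(profile)])
print("closed form deaths / time at risk:", (x.size - x.sum()) / y.sum())
```

The grid maximum should agree with the closed-form MLE, confirming that under this reading the curved subfamily reproduces the usual censored-exponential analysis.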
Mixture Geometry
• There is enough structure to compute the non-parametric maximum likelihood estimate, [21].
• Lindsay [23] embeds the problem in a finite dimensional affine space determined by the sample size [21].
[Figure: a two-dimensional slice of the mixture geometry and the directional derivative plotted against θ.]
• The directional derivative is the key tool for maximising the likelihood over a −1-convex hull [S:3].

Mixture geometries
• Lindsay's geometry is a finite dimensional version of Amari's −1 geometry.
• We need to work in a convex hull inside a finite dimensional −1-affine space.
• Low dimensional −1 approximations give limits to identification and computation in mixture models, [2].
• Information geometry can give efficient approximations of high dimensional convex hulls by polytopes.

Example: mixture of binomials
• 'simple one-parameter binomial [...] models generally provide poor fits to this type of binary data'
• A toxicological experiment [20] studied the frequencies of dead implants in rats.
[Figure: the observed data and fitted mixture, the fitted mixing proportions over the support points 0 to 7, and the directional derivative over μ.]
• IG gives ways to explore the convex hull efficiently.

Information and entropy
• Given a set of statistics s_i(x), construct the model which maximises entropy subject to fixed E(s_i(X)), [S:1], [S:2]. (A sketch of this construction is given below.)
• Models which are orthogonal to the level sets of E(s_i(X)) are called least informative models.
[Figure: the mixed parameterisation of the trinomial simplex.]
• Pythagorean results: minimise the KL divergence by orthogonal projection, [S:6].
• Links with decision theory [14] and with non-parametric methods such as the bootstrap and empirical likelihood, [S:5].
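A minimal sketch of the maximum-entropy construction on a small finite sample space, solved through its convex dual; the sample space {0, ..., 9}, the statistic s(x) = x and the target expectation c = 3 are all illustrative assumptions. Minimising the log-partition function ψ(η) minus ηc picks out the exponential family member, the 'least informative' model, whose expectation matches the constraint.

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-entropy model on {0,...,9} subject to E[s(X)] = c, via the dual.
xs = np.arange(10)
s = xs.astype(float)      # constraining statistic: s(x) = x (assumed)
c = 3.0                   # target expectation (assumed)

def neg_dual(eta):
    # psi(eta) - eta * c, where psi is the log-partition function of the
    # exponential family p_eta(x) proportional to exp(eta * s(x)).
    psi = np.log(np.sum(np.exp(eta[0] * s)))
    return psi - eta[0] * c

eta_hat = minimize(neg_dual, x0=[0.0]).x[0]
p = np.exp(eta_hat * s)
p /= p.sum()
print("fitted E[s(X)] =", float((p * s).sum()))   # matches the target c = 3.0
```

The same dual picture underlies the Pythagorean projection results: the fitted model is the orthogonal projection, in the KL sense, of the uniform base measure onto the constraint set.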
Network Models
• High dimensional (curved) exponential family examples based on network models, [16], [S:1].
• Binary indicator functions Y_ij take the value 1 if an edge exists from node i to node j and zero otherwise.
[Figure: a small graph on seven nodes with the edge y_12 = 1 marked.]
• Build the 'least informative model'
  P_η(Y = y) = exp{ η^T g(y, X) } / κ(η),
  where the sufficient statistics g are graph statistics.

Example: New England Lawyers
• [19] looks at working relations among 36 partners in a New England law firm.
• Computing κ(η) is typically intractable since it is a sum over 2^630 terms.
• Approaches include [S:8]:
  • pseudo-likelihood [30];
  • simulated moments [28];
  • MCMC [17].

Graphical models: FEF
• Consider the example from [13] of the cyclic graph of order 4 with binary values at each node, [S:1], [S:4].
[Figure: the cyclic graph on nodes 1, 2, 3, 4.]
• The models lie in the 15-dimensional simplex, but with constraints imposed by conditional independence.
• The constraints are linear in the +1-affine parameters, η_i + η_j = η_k + η_l.
• So we get a full exponential family.

Graphical models: CEF
• Given an ordered set of binary random variables X_i, i = 1, 2, 3, [S:1], [S:4].
• The simplex which describes their joint distribution is 2^3 − 1 = 7 dimensional.
• A DAG defines the dependences; for example the simple graph 1 → 2 → 3 encodes P(X_3 | X_2, X_1) = P(X_3 | X_2).
• These constraints are non-linear in the +1-affine parameters (for example, η_001 becomes the logarithm of a ratio of products of the conditional probabilities π; see [13] for the explicit form), giving curved exponential families [13].

DAG with hidden variables
• In multinomials, independence is expressible as a finite set of polynomial equalities, [S:4].
• Add hidden variables.
[Figure: a DAG with a hidden node H and observed nodes 1, 2 and 3.]
• The example lies in the 7 dimensional simplex and mixes over a 3 dimensional CEF.
• The model space is not a manifold but a variety, a union of manifolds of different dimensions: an extended exponential family.

Infinite dimensional issues
• There exists a geometry of the infinite simplex [1].
• Different 'faces' of the infinite simplex have different supports and different moment structures.
• The information geometry of infinite dimensional families, [12] and [11], uses Hilbert or Banach space structures.
• There still exist ±1 geodesics between distributions, but there are boundaries. (A small finite dimensional illustration follows below.)
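A finite dimensional caricature of the support and boundary issue, using the trinomial simplex from the earlier visualisation; the two distributions are arbitrary choices. When one distribution lies on a face of the simplex (a zero cell), the −1 path moves into the interior for every 0 < ρ < 1, while the +1 path keeps the zero cell at zero for every ρ > 0 and so stays on the smaller-support face.

```python
import numpy as np

# p lies on a face of the trinomial simplex (a zero cell); q is interior.
p = np.array([0.5, 0.5, 0.0])
q = np.array([1/3, 1/3, 1/3])

def minus_one(rho):
    """-1 (mixture) path: interior of the simplex for 0 < rho < 1."""
    return rho * p + (1 - rho) * q

def plus_one(rho):
    """+1 (exponential) path: the zero cell stays zero for every rho > 0."""
    w = p**rho * q**(1 - rho)
    return w / w.sum()

for rho in (0.0, 0.5, 1.0):
    print(rho, minus_one(rho).round(3), plus_one(rho).round(3))
```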
Neighbourhood of Model
• Look at the +1-geodesic joining the standard normal and the Cauchy distribution, [8].
• It is given by f(x)^ρ g(x)^{1−ρ} / C(ρ).
[Figure: the normalising function C(ρ) along the path connecting the normal and Cauchy densities, plotted against ρ.]
• Infinite Fisher information is possible [22].

Neighbourhood of Model
• Study the sensitivity of inference to model assumptions by understanding the 'neighbourhood of the model', [S:3].
• Links to non- and semi-parametrics.
• Mixture of normal and Cauchy: the −1-geodesic (1 − ρ) f(x) + ρ g(x).
• If ρ << 1/n the models are very 'close' by some measures,
• but very different by other measures.
• Asymmetry of the KL divergence. (A numerical sketch is given below.)
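A rough numerical sketch of that asymmetry for the contaminated model h = (1 − ρ) f + ρ g, with f the standard normal density, g the standard Cauchy density and ρ = 0.01; the value of ρ and the integration window are assumptions. The integrals are truncated to a finite window, so the numbers are only indicative; KL(h || f) in fact grows without bound as the window widens, which is exactly the 'very different by other measures' point.

```python
import numpy as np
from scipy import stats

f = stats.norm(0, 1).pdf
g = stats.cauchy(0, 1).pdf
rho = 0.01
h = lambda x: (1 - rho) * f(x) + rho * g(x)

def kl(a, b, lo=-30.0, hi=30.0, m=200_001):
    """Crude Riemann-sum approximation to KL(a || b) on a finite window."""
    x = np.linspace(lo, hi, m)
    dx = x[1] - x[0]
    return float(np.sum(a(x) * np.log(a(x) / b(x))) * dx)

print("KL(f || h) =", kl(f, h))   # tiny: the contamination is hard to detect
print("KL(h || f) =", kl(h, f))   # much larger: the heavy Cauchy tail shows up
```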
Summary
• An overview of information geometry in a broad sense.
• Models: full, curved and extended exponential, and 'universal' families.
• Geometries: expected/observed dual flat manifolds, information theory, mixture geometry.
• Methods: higher order asymptotics, tensor calculus, curvature, dimension reduction and spectral techniques.

References
[1] Anaya-Izquierdo, K., Critchley, F., Marriott, P. and Vos, P. (2010). Towards information geometry on the space of all distributions. Preprint.
[2] Anaya-Izquierdo, K. and Marriott, P. (2007). Local mixtures of exponential families. Bernoulli, 13(3), 623-640.
[3] Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, No. 28. New York: Springer.
[4] Amari, S.-I. and Nagaoka, H. (2000). Methods of Information Geometry. Providence, Rhode Island: American Mathematical Society.
[5] Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. London: John Wiley & Sons.
[6] Barndorff-Nielsen, O. E. and Blaesild, P. (1983). Exponential models with affine dual foliations. Annals of Statistics, 11(3), 753-769.
[7] Barndorff-Nielsen, O. E. and Cox, D. R. (1994). Inference and Asymptotics. London: Chapman & Hall.
[8] Brown, L. D. (1986). Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics.
[9] Csiszar, I. and Matus, F. (2005). Closures of exponential families. The Annals of Probability, 33(2), 582-600.
[10] Efron, B. (1975). Defining the curvature of a statistical problem (with applications to second order efficiency). The Annals of Statistics, 3(6), 1189-1242.
[11] Fukumizu, K. (2005). Infinite dimensional exponential families by reproducing kernel Hilbert spaces. Proceedings of the 2nd International Symposium on Information Geometry and its Applications, 324-333.
[12] Gibilisco, P. and Pistone, G. (1998). Connections on non-parametric statistical manifolds by Orlicz space geometry. Infinite Dimensional Analysis, Quantum Probability and Related Topics, 1(2), 325-347.
[13] Geiger, D., Heckerman, D., King, H. and Meek, C. (2001). Stratified exponential families: graphical models and model selection. Annals of Statistics, 29(2), 505-529.
[14] Grunwald, P. D. and Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 32(4), 1267-1433.
[15] Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994). A Handbook of Small Data Sets. London: Chapman & Hall.
[16] Hunter, D. (2007). Curved exponential family models for social networks. Social Networks, 29, 216-230.
[17] Hunter, D. and Handcock, M. (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15, 565-583.
[18] Kass, R. E. and Vos, P. W. (1997). Geometrical Foundations of Asymptotic Inference. New York: Wiley.
[19] Lazega, E. and Pattison, P. E. (1999). Multiplexity, generalized exchange and cooperation in organizations: a case study. Social Networks, 21, 67-90.
[20] Kupper, L. L. and Haseman, J. K. (1978). The use of a correlated binomial model for the analysis of certain toxicological experiments. Biometrics, 34(1), 69-76.
[21] Lesperance, M. L. and Kalbfleisch, J. D. (1992). An algorithm for computing the nonparametric MLE of a mixing distribution. Journal of the American Statistical Association, 87(417), 120-126.
[22] Li, P., Chen, J. and Marriott, P. (2009). Non-finite Fisher information and homogeneity: the EM approach. Biometrika, 96(2), 411-426.
[23] Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications. Hayward, CA: Institute of Mathematical Statistics.
[24] Marriott, P. and West, S. (2002). On the geometry of censored models. Calcutta Statistical Association Bulletin, 52, 235-250.
[25] Marriott, P. (2002). On the local geometry of mixture models. Biometrika, 89(1), 77-89.
[26] Marriott, P. and Vos, P. (2004). On the global geometry of parametric models and information recovery. Bernoulli, 10(2), 1-11.
[27] McCullagh, P. (1987). Tensor Methods in Statistics. London: Chapman and Hall.
[28] Snijders, T. A. B. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3.
[29] Small, C. G. (2010). Expansions and Asymptotics for Statistics. Chapman and Hall.
[30] Strauss, D. and Ikeda, M. (1990). Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85, 204-212.