Probabilistic Inference in PRISM
Taisuke Sato
Tokyo Institute of Technology

Problem
• Statistical machine learning is a labor-intensive process: a trial-and-error cycle of {modeling, learning, evaluation}*
• Pain of deriving and implementing model-specific learning algorithms and model-specific probabilistic inference
[Figure: Model 1, Model 2, ..., Model n, each paired with its own model-specific learning algorithm (EM1, EM2, ..., EMn, VB, MCMC, ...)]

Our solution
• Develop a high-level modeling language that offers universal learning and inference methods applicable to every model
[Figure: Model 1, Model 2, ..., Model n all served by the modeling language's generic methods (EM, VB, MCMC, ...)]
• The user concentrates on modeling; the rest (learning and inference) is taken care of by the system

PRISM (http://sato-www.cs.titech.ac.jp/prism/)
• Logic-based high-level modeling language
[Figure: probabilistic models (Bayesian networks, HMMs, PCFGs, new models, ...) handled by the PRISM system's learning methods (EM/MAP, VT, VB, VB-VT, MCMC)]
• Its generic inference/learning methods subsume standard algorithms such as forward-backward (FB) for HMMs and belief propagation (BP) for Bayesian networks

Basic ideas
• Semantics
  • program = Turing machine + probabilistic choice + Dirichlet prior
  • denotation = a probability measure over possible worlds
• Propositionalized probability computation (PPC)
  • programs are written at the predicate logic level
  • probabilities are computed at the propositional logic level
• Dynamic programming for PPC
  • proof search generates a directed graph (explanation graph)
  • probabilities are computed from bottom to top in the graph (a plain-Prolog sketch of this computation appears after the Learning section below)
• Discriminative use
  • generatively define a model by a PRISM program and discriminatively use it for better prediction performance

ABO blood type program

  values(gene,[a,b,o],[0.5,0.2,0.3]).  % msw(gene,a) is true with prob. 0.5

  btype(X) :-                          % X is the child's blood type
      gtype(Gf,Gm),
      pg_table(X,[Gf,Gm]).
  gtype(Gf,Gm) :-                      % probabilistic primitives simulate gene
      msw(gene,Gf),                    % inheritance from the father (left)
      msw(gene,Gm).                    % and the mother (right)
  pg_table(X,GT) :-                    % genotype-to-phenotype table
      ( (X=a ; X=b), (GT=[X,o] ; GT=[o,X] ; GT=[X,X])
      ; X=o,  GT=[o,o]
      ; X=ab, (GT=[a,b] ; GT=[b,a]) ).

[Figure: gene inheritance; e.g. gene a from the father and gene b from the mother give a child of blood type AB]

Propositionalized probability computation
• Explanation graph for btype(a), explaining how btype(a) is proved via the probabilistic choices made by msw atoms:

  btype(a)   <=> gtype(a,a) v gtype(a,o) v gtype(o,a)
  gtype(a,a) <=> msw(gene,a) & msw(gene,a)
  gtype(a,o) <=> msw(gene,a) & msw(gene,o)
  gtype(o,a) <=> msw(gene,o) & msw(gene,a)

• Sum-product computation of probabilities in a bottom-up manner, using the probabilities assigned to msw atoms: P(gtype(a,a)) = 0.5 x 0.5 = 0.25, P(gtype(a,o)) = P(gtype(o,a)) = 0.5 x 0.3 = 0.15, hence P(btype(a)) = 0.25 + 0.15 + 0.15 = 0.55
• The explanation graph is acyclic, so dynamic programming (DP) is possible
• PPC + DP subsumes forward-backward, belief propagation and inside-outside computation

Learning
• A program defines a joint distribution P(x,y|θ), where x is hidden and y is observed
  • e.g. P(msw(gene,a),...,btype(a),... | θ_a,θ_b,θ_o) where θ_a + θ_b + θ_o = 1
• θ is learned from observed data y by maximizing
  • MLE/MAP: maximize P(y|θ)
  • VT (Viterbi training): maximize P(x*,y|θ), where x* = argmax_x P(x,y|θ)
• From a Bayesian point of view, a program defines the marginal likelihood ∫ P(x,y|θ,α) dθ
• We wish to compute
  • the predictive distribution ∫ P(x|y,θ,α) dθ
  • the marginal likelihood P(y|α) = Σ_x ∫ P(x,y|θ,α) dθ
• Both need approximation
  • variational Bayes: VB, VB-VT
  • MCMC: Metropolis-Hastings
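The bottom-up sum-product computation over the explanation graph can be pictured in a few lines of plain Prolog. The sketch below is not PRISM code: it hard-codes the switch probabilities and the explanation graph of the blood type example, and the predicate names (sw_prob/2, node/2, graph_prob/2) are illustrative assumptions. In PRISM itself this computation happens internally when prob/2 or probf/1 is called.

  % Plain-Prolog sketch of bottom-up sum-product over the explanation
  % graph for btype(a) (illustrative; not part of the PRISM system).
  sw_prob(msw(gene,a),0.5).              % probabilities of the msw atoms
  sw_prob(msw(gene,b),0.2).
  sw_prob(msw(gene,o),0.3).

  % node(Head,Expls): Head is true iff some explanation in Expls holds;
  % each explanation is a conjunction (list) of sub-goals and msw atoms.
  node(btype(a),  [[gtype(a,a)], [gtype(a,o)], [gtype(o,a)]]).
  node(gtype(X,Y),[[msw(gene,X), msw(gene,Y)]]).

  graph_prob(A,P) :- sw_prob(A,P), !.    % leaves: msw probabilities
  graph_prob(A,P) :- node(A,Es), sum_expls(Es,P).

  sum_expls([],0.0).                     % sum over alternative explanations
  sum_expls([E|Es],P) :-
      prod_atoms(E,P1), sum_expls(Es,P2), P is P1 + P2.

  prod_atoms([],1.0).                    % product within one explanation
  prod_atoms([A|As],P) :-
      graph_prob(A,P1), prod_atoms(As,P2), P is P1 * P2.

  % ?- graph_prob(btype(a),P).   gives P = 0.55 (= 0.25 + 0.15 + 0.15)

This naive sketch recomputes shared sub-goals; the dynamic programming part consists in computing each node's probability once, bottom-up over the acyclic graph (equivalently, memoizing graph_prob/2), which is what makes forward-backward, belief propagation and inside-outside computation special cases.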
Sample session 1 - Explanation graph and probability computation by built-in predicates

  | ?- prism(blood)
  loading::blood.psm.out
  | ?- show_sw
  Switch gene: unfixed_p: a (p: 0.500000000) b (p: 0.200000000) o (p: 0.300000000)
  | ?- probf(btype(a))
  btype(a) <=> gtype(a,a) v gtype(a,o) v gtype(o,a)
  gtype(a,a) <=> msw(gene,a) & msw(gene,a)
  gtype(a,o) <=> msw(gene,a) & msw(gene,o)
  gtype(o,a) <=> msw(gene,o) & msw(gene,a)
  | ?- prob(btype(a),P)
  P = 0.55

Sample session 2 - MLE and Viterbi inference

  | ?- D=[btype(a),btype(a),btype(ab),btype(o)], learn(D)
  Exporting switch information to the EM routine ... done
  #em-iters: 0(4) (Converged: -4.965121886)
  Statistics on learning:
    Graph size: 18
    Number of switches: 1
    Number of switch instances: 3
    Number of iterations: 4
    Final log likelihood: -4.965121886
  | ?- prob(btype(a),P)
  P = 0.598211
  | ?- viterbif(btype(a))
  btype(a) <= gtype(a,a)
  gtype(a,a) <= msw(gene,a) & msw(gene,a)

Sample session 3 - Bayesian inference by MCMC

  | ?- D=[btype(a),btype(a),btype(ab),btype(o)],
       marg_mcmc_full(D,[burn_in(1000),end(10000),skip(5)],[VFE,ELM]),
       marg_exact(D,LogM)
  VFE = -5.54836
  ELM = -5.48608
  LogM = -5.48578
  | ?- D=[btype(a),btype(a),btype(ab),btype(o)],
       predict_mcmc_full(D,[btype(a)],[[_,E,_]]),
       print_graph(E,[lr('<=')])
  btype(a) <= gtype(a,a)
  gtype(a,a) <= msw(gene,a) & msw(gene,a)

Summary
• PRISM = Probabilistic Prolog for statistical machine learning
  • Forward sampling
  • Exact probability computation
  • Parameter learning: MLE/MAP, VT
  • Bayesian inference: VB, VB-VT, MCMC
  • Viterbi inference
  • Model scores (BIC, Cheeseman-Stutz, VFE)
  • Smoothing
• Current version: 2.1
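The built-in predicates used in these sessions (learn/1, prob/2, viterbif/1) apply unchanged to sequence models, where PRISM's generic probability computation and EM learning specialize to the forward-backward algorithm, as noted above. The following is a minimal sketch in PRISM syntax along the lines of the standard HMM example; the switch names (init, tr/1, out/1), the two states, the alphabet {a,b} and the fixed string length 5 are illustrative assumptions, not part of this poster.

  % Two-state HMM emitting symbols a/b over strings of fixed length 5
  values(init,[s0,s1]).      % choice of the initial state
  values(tr(_),[s0,s1]).     % state transition, one switch per state
  values(out(_),[a,b]).      % symbol emission, one switch per state

  hmm(Os) :-                 % Os: an output string of length 5
      msw(init,S),
      hmm(1,5,S,Os).
  hmm(T,N,_,[]) :- T > N, !.
  hmm(T,N,S,[O|Os]) :-
      msw(out(S),O),         % emit a symbol in state S
      msw(tr(S),Next),       % choose the next state
      T1 is T + 1,
      hmm(T1,N,Next,Os).

With such a program, learn([hmm([a,b,b,a,a]),hmm([b,a,a,b,b])]) would estimate the transition and emission parameters by EM, prob(hmm([a,b,b,a,a]),P) would compute a string probability, and viterbif(hmm([a,b,b,a,a])) would return the most probable state sequence, exactly as in the blood type sessions above.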