Extracting Relations from Traditional Chinese Medicine Literature via Heterogeneous Entity Networks

Online Supplement

Huaiyu Wan1,4, Marie-Francine Moens2, Walter Luyten3, Xuezhong Zhou4, Qiaozhu Mei5, Lu Liu6, Jie Tang1*

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Department of Computer Science, KU Leuven, Belgium
3 Department of Biology, KU Leuven, Belgium
4 School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
5 School of Information, University of Michigan, Ann Arbor, Michigan, USA
6 Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, Illinois, USA
* E-mail: jietang@tsinghua.edu.cn

Appendix A: Learning the Heterogeneous Factor Graph Model

Probabilistic graphical models[1, 2] use a graph-based representation to compactly encode a complex distribution over a high-dimensional space. In this representation, a node corresponds to a random variable and an edge corresponds to a direct probabilistic dependence between two variables; the graph structure defines the factorization of the joint distribution over all variables. A factor graph[3] is a particular type of graphical model that uses a bipartite graph to represent the factorization of a joint distribution and enables efficient computation of marginal distributions via the sum-product algorithm[4], which is widely used for inference in graphical models such as Bayesian networks and Markov random fields.

We now explain how to estimate the free parameters of the model. Note that, although the same forms of feature functions are defined for the different types of edges, we train a separate set of weights for each type; similarly, the different types of triadic closures share the same forms of feature functions but have their own weights.

One challenge in learning the HFGM of Equation (4) is that the input data are only partially labeled. To calculate the partition function $Z$, we need to sum the likelihood over the possible states of all relations, including the unlabeled ones. To deal with this problem, we use the labeled data to infer the unknown labels. Let $Y^L$ denote the known labels, and let $Y|Y^L$ denote a label configuration $Y$ that is consistent with $Y^L$. According to Equation (4), we can define the following log-likelihood objective function $O(\theta)$:

\[
\begin{aligned}
O(\theta) &= \log P_\theta(Y^L \mid X, G) = \log \sum_{Y|Y^L} P_\theta(Y \mid X, G) \\
&= \log \sum_{Y|Y^L} \frac{1}{Z} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} \\
&= \log \sum_{Y|Y^L} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} - \log Z \\
&= \log \sum_{Y|Y^L} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} - \log \sum_{Y} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\}
\end{aligned}
\tag{A1}
\]

Learning the HFGM then amounts to estimating the parameter configuration $\theta = (\lambda, \alpha)$ that maximizes the objective function $O(\theta)$, i.e.,

\[
\theta^{*} = \arg\max_{\theta} O(\theta)
\tag{A2}
\]

We employ a gradient-based method to maximize the objective function.
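To make Equations (A1) and (A2) concrete, the sketch below (our illustration, not the authors' implementation) evaluates $O(\theta)$ by brute-force enumeration on a toy network with binary edge labels. The weight layout (one weight vector per edge label, one scalar per triad label configuration) and all names such as `objective`, `edge_feats`, `triads`, and `labeled` are assumptions made for this example only; in the full model the sums run over the typed feature functions $f_{em}$ and $g_{cn}$.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): brute-force
# evaluation of the partially labeled log-likelihood O(theta) of Equation (A1)
# for a toy factor graph with binary edge labels.
import itertools
import numpy as np

def objective(lmbda, alpha, edge_feats, triads, labeled):
    """O(theta) = log sum_{Y|Y^L} exp{...} - log sum_Y exp{...} (Equation A1)."""

    def score(y):
        # lambda^T f: attribute factors on individual edge labels
        s = sum(lmbda[y_e] @ x_e for x_e, y_e in zip(edge_feats, y))
        # alpha^T g: triadic-closure factors on triples of edge labels
        s += sum(alpha[y[i], y[j], y[k]] for i, j, k in triads)
        return s

    def log_sum_exp(scores):
        m = max(scores)
        return m + np.log(sum(np.exp(s - m) for s in scores))

    all_configs = list(itertools.product([0, 1], repeat=len(edge_feats)))
    # label configurations Y that are consistent with the known labels Y^L
    consistent = [y for y in all_configs
                  if all(y[i] == v for i, v in labeled.items())]
    return (log_sum_exp([score(y) for y in consistent])
            - log_sum_exp([score(y) for y in all_configs]))

# Toy instance: three edges with 2-dimensional features, one triadic closure,
# and edge 0 labeled positive (Y^L = {y_0 = 1}).
lmbda = np.array([[0.0, 0.0], [1.0, -0.5]])   # one weight vector per edge label
alpha = np.full((2, 2, 2), -0.1)
alpha[0, 0, 0] = alpha[1, 1, 1] = 0.3          # reward label-consistent triads
edge_feats = [np.array([1.0, 0.2]), np.array([0.5, 1.0]), np.array([0.1, 0.1])]
print(objective(lmbda, alpha, edge_feats, triads=[(0, 1, 2)], labeled={0: 1}))
```

Enumeration is only feasible for such toy cases; Equation (A2) then corresponds to searching over $\theta$ for the largest value of this function, which the gradient procedure below carries out.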
Specifically, taking the parameters $\lambda$ as an example, we can calculate the gradients as:

\[
\begin{aligned}
\frac{\partial O(\theta)}{\partial \lambda}
&= \frac{\partial \Big( \log \sum_{Y|Y^L} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} - \log \sum_{Y} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} \Big)}{\partial \lambda} \\
&= \frac{\sum_{Y|Y^L} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} \cdot \mathbf{f}}{\sum_{Y|Y^L} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\}}
 - \frac{\sum_{Y} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\} \cdot \mathbf{f}}{\sum_{Y} \exp\{\lambda^{\mathrm T}\mathbf{f}\}\exp\{\alpha^{\mathrm T}\mathbf{g}\}} \\
&= \mathbb{E}_{P_\theta(Y|Y^L, X, G)}[\mathbf{f}] - \mathbb{E}_{P_\theta(Y|X, G)}[\mathbf{f}]
\end{aligned}
\tag{A3}
\]

For a specific parameter $\lambda_{em}$, we have

\[
\frac{\partial O(\theta)}{\partial \lambda_{em}}
= \mathbb{E}_{P_\theta(y_{e_i}|Y^L, X, G)}\big[f_{em}(x_{e_i m}, y_{e_i})\big]
- \mathbb{E}_{P_\theta(y_{e_i}|X, G)}\big[f_{em}(x_{e_i m}, y_{e_i})\big]
\tag{A4}
\]

where $\mathbb{E}_{P_\theta(y_{e_i}|Y^L, X, G)}[f_{em}(x_{e_i m}, y_{e_i})]$ is the expectation of the feature function $f_{em}$ under the distribution $P_\theta(y_{e_i}|Y^L, X, G)$ estimated by the learned model given the partially labeled data $Y^L$, and $\mathbb{E}_{P_\theta(y_{e_i}|X, G)}[f_{em}(x_{e_i m}, y_{e_i})]$ is the expectation of $f_{em}$ under the distribution $P_\theta(y_{e_i}|X, G)$ estimated by the learned model without any known labels. Similarly,

\[
\frac{\partial O(\theta)}{\partial \alpha_{cn}}
= \mathbb{E}_{P_\theta(Y_{c_j}|Y^L, X, G)}\big[g_{cn}(Y_{c_j})\big]
- \mathbb{E}_{P_\theta(Y_{c_j}|X, G)}\big[g_{cn}(Y_{c_j})\big]
\tag{A5}
\]

where $\mathbb{E}_{P_\theta(Y_{c_j}|Y^L, X, G)}[g_{cn}(Y_{c_j})]$ is the expectation of the feature function $g_{cn}$ under the distribution $P_\theta(Y_{c_j}|Y^L, X, G)$ estimated by the learned model given the partially labeled data $Y^L$, and $\mathbb{E}_{P_\theta(Y_{c_j}|X, G)}[g_{cn}(Y_{c_j})]$ is the expectation of $g_{cn}$ under the distribution $P_\theta(Y_{c_j}|X, G)$ estimated by the learned model without any known labels.

It is intractable to compute the marginal probabilities in Equations (A4) and (A5) exactly, because the graph structure of the HFGM can be arbitrary and may contain cycles. Several methods exist to solve this problem approximately; in this work we chose loopy belief propagation (LBP)[5] for its ease of implementation and its effectiveness. Algorithm A1 shows the details of the semi-supervised learning algorithm of the model.

Algorithm A1: Semi-supervised learning of the HFGM
Input: heterogeneous TCM network $G = (V, E, X)$; labeled edges $Y^L \subseteq Y$; learning rate $\eta$
Output: estimated parameters $\theta^{*}$
Initialize $\theta \leftarrow \mathbf{0}$;
repeat
    Perform LBP to calculate the marginal distributions $P_\theta(y_{e_i}|Y^L, X, G)$ and $P_\theta(Y_{c_j}|Y^L, X, G)$;
    Calculate $\mathbb{E}_{P_\theta(y_{e_i}|Y^L, X, G)}[f_{em}(x_{e_i m}, y_{e_i})]$ under $P_\theta(y_{e_i}|Y^L, X, G)$;
    Calculate $\mathbb{E}_{P_\theta(Y_{c_j}|Y^L, X, G)}[g_{cn}(Y_{c_j})]$ under $P_\theta(Y_{c_j}|Y^L, X, G)$;
    Perform LBP to calculate the marginal distributions $P_\theta(y_{e_i}|X, G)$ and $P_\theta(Y_{c_j}|X, G)$;
    Calculate $\mathbb{E}_{P_\theta(y_{e_i}|X, G)}[f_{em}(x_{e_i m}, y_{e_i})]$ under $P_\theta(y_{e_i}|X, G)$;
    Calculate $\mathbb{E}_{P_\theta(Y_{c_j}|X, G)}[g_{cn}(Y_{c_j})]$ under $P_\theta(Y_{c_j}|X, G)$;
    Calculate the gradients with respect to $\lambda_{em}$ and $\alpha_{cn}$ according to Equations (A4) and (A5);
    Update $\lambda_{em} \leftarrow \lambda_{em} + \eta \cdot \frac{\partial O(\theta)}{\partial \lambda_{em}}$ and $\alpha_{cn} \leftarrow \alpha_{cn} + \eta \cdot \frac{\partial O(\theta)}{\partial \alpha_{cn}}$;
until convergence;

Appendix B: Inferring Unknown Relations

Based on the estimated parameters $\theta^{*}$, we predict the labels of the unknown edges by finding the label configuration that maximizes the joint probability, i.e.,

\[
Y^{*} = \arg\max_{Y} P(Y|Y^L, X, G)
\tag{B1}
\]

Again, we use LBP to calculate the marginal distribution $P(y_{e_i}|Y^L, X, G)$ and then assign each edge the label with the largest marginal probability. This marginal probability is taken as the prediction confidence.
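As an illustration of how Equation (A4), the update loop of Algorithm A1, and the inference rule of Appendix B fit together, the following sketch (again our own, with illustrative names such as `marginals`, `train_step`, and `infer`) continues the toy example above. Because the toy graph is small, exact marginals computed by enumeration stand in for the LBP approximation, and only the $\lambda$ update is shown; the $\alpha_{cn}$ update of Equation (A5) is analogous but uses the triad marginals $P_\theta(Y_{c_j} \mid \cdot)$.

```python
# Minimal sketch (illustrative, not the authors' code) of one gradient step of
# Algorithm A1 (Equation A4) and the inference rule of Appendix B. Exact
# marginals by enumeration replace LBP here, which is feasible only for toy graphs.
import itertools
import numpy as np

def marginals(lmbda, alpha, edge_feats, triads, clamp):
    """Exact edge-label marginals P_theta(y_e | clamp, X, G) by enumeration."""
    n = len(edge_feats)
    marg = np.zeros((n, 2))
    for y in itertools.product([0, 1], repeat=n):
        if any(y[i] != v for i, v in clamp.items()):
            continue  # skip configurations inconsistent with the known labels
        w = np.exp(sum(lmbda[y_e] @ x_e for x_e, y_e in zip(edge_feats, y))
                   + sum(alpha[y[i], y[j], y[k]] for i, j, k in triads))
        for e in range(n):
            marg[e, y[e]] += w
    return marg / marg.sum(axis=1, keepdims=True)

def train_step(lmbda, alpha, edge_feats, triads, labeled, eta=0.1):
    """One ascent step on O(theta); in the full model LBP supplies the marginals."""
    m_clamped = marginals(lmbda, alpha, edge_feats, triads, labeled)  # given Y^L
    m_free = marginals(lmbda, alpha, edge_feats, triads, {})          # no labels
    grad = np.zeros_like(lmbda)
    for x, p_c, p_f in zip(edge_feats, m_clamped, m_free):
        for y in (0, 1):
            grad[y] += (p_c[y] - p_f[y]) * x   # Equation (A4): E_clamped - E_free
    # (the alpha_cn update of Equation (A5) is analogous, using triad marginals)
    return lmbda + eta * grad                  # update as in Algorithm A1

def infer(lmbda, alpha, edge_feats, triads, labeled):
    """Appendix B: give each unknown edge its max-marginal label and confidence."""
    m = marginals(lmbda, alpha, edge_feats, triads, labeled)
    return {e: (int(p.argmax()), float(p.max()))
            for e, p in enumerate(m) if e not in labeled}
```

On the toy instance above, repeatedly applying `train_step` and then calling `infer` returns, for each unlabeled edge, the predicted label together with its marginal probability, which serves as the confidence score.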
References

1. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann, 1988.
2. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: The MIT Press, 2009.
3. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Trans Inf Theory 2001;47:498‒519.
4. Mooij JM, Kappen HJ. Sufficient conditions for convergence of the sum-product algorithm. IEEE Trans Inf Theory 2007;53:4422‒37.
5. Murphy KP, Weiss Y, Jordan MI. Loopy belief propagation for approximate inference: an empirical study. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI'99). 1999:467‒75.