Extracting Relations from Traditional Chinese Medicine Literature via Heterogeneous Entity Networks
Online Supplement
Huaiyu Wan1,4, Marie-Francine Moens2, Walter Luyten3, Xuezhong Zhou4, Qiaozhu Mei5, Lu Liu6, Jie Tang1*
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Department of Computer Science, KU Leuven, Belgium
3 Department of Biology, KU Leuven, Belgium
4 School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
5 School of Information, University of Michigan, Ann Arbor, Michigan, USA
6 Department of Electronic Engineering and Computer Science, Northwestern University, Evanston, Illinois, USA
* E-mail: jietang@tsinghua.edu.cn
Appendix A: Learning the Heterogeneous Factor Graph Model
Probabilistic graphical models[1, 2] use a graph-based representation as the foundation for
compactly encoding a complex distribution over a high-dimensional space. In this graphical
representation, a node corresponds to a random variable while an edge corresponds to the direct
probabilistic dependence between two variables. The graph structure defines the factorization of
a joint probabilistic distribution over all variables.
A factor graph[3] is a particular type of graphical model that uses a bipartite graph to represent the factorization of a joint distribution. It enables efficient computation of marginal distributions through the sum-product algorithm[4], which is widely used for performing inference on graphical models such as Bayesian networks and Markov random fields.
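To make this factorization concrete, the following toy sketch (added for illustration; it is not taken from the paper and uses made-up factor tables) defines a factor graph over three binary variables and computes a marginal by brute-force enumeration. The sum-product algorithm recovers the same marginals by passing messages between variable and factor nodes instead of enumerating every configuration.

# A minimal illustration (not from the paper): a factor graph over three binary
# variables y1, y2, y3 whose joint distribution factorizes as
#   p(y1, y2, y3) = (1/Z) * phi12(y1, y2) * phi23(y2, y3).
from itertools import product

phi12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # factor on (y1, y2)
phi23 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}   # factor on (y2, y3)

def unnormalized(y1, y2, y3):
    return phi12[(y1, y2)] * phi23[(y2, y3)]

Z = sum(unnormalized(*y) for y in product((0, 1), repeat=3))    # partition function
marginal_y2 = {v: sum(unnormalized(y1, v, y3)
                      for y1 in (0, 1) for y3 in (0, 1)) / Z
               for v in (0, 1)}                                 # marginal distribution of y2
print(Z, marginal_y2)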
We now explain how to estimate the free parameters of the model. Note that, although the same forms of feature functions are defined for different types of edges, we train a separate set of weights for each type. Similarly, different types of triadic closures share the same forms of feature functions but have different weights.
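A minimal sketch of this parameterization (with hypothetical type names and feature counts; this is not the authors' code) might keep one weight vector per edge type and one per triadic-closure type:

# Per-type weights for shared feature-function forms (illustrative structure only).
NUM_EDGE_FEATURES = 10      # assumed number of edge feature functions f_em
NUM_TRIAD_FEATURES = 4      # assumed number of triadic-closure feature functions g_cn

edge_types = ["herb-disease", "herb-gene", "gene-disease"]      # illustrative edge types
triad_types = ["herb-gene-disease"]                             # illustrative closure type

alpha = {t: [0.0] * NUM_EDGE_FEATURES for t in edge_types}      # one weight vector per edge type
beta = {t: [0.0] * NUM_TRIAD_FEATURES for t in triad_types}     # one weight vector per closure type

def edge_score(edge_type, features):
    """Compute the linear score for an edge of the given type; the feature vector
    has the same form for every type, but the weights differ per type."""
    return sum(a * f for a, f in zip(alpha[edge_type], features))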
One challenge in learning the HFGM of Equation (4) is that the input data is only partially labeled. To calculate the partition function Z, we need to sum the likelihood over the possible states of all relations, including the unlabeled ones. To deal with this problem, we use the labeled data to infer the unknown labels. Let Y^L denote the known labels; according to Equation (4), we can define the following log-likelihood objective function O(θ):
\[
\begin{aligned}
O(\theta) &= \log P_\theta(Y^L \mid X, G)\\
&= \log \sum_{Y \mid Y^L} P_\theta(Y \mid X, G)\\
&= \log \sum_{Y \mid Y^L} \frac{1}{Z}\exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\}\\
&= \log \sum_{Y \mid Y^L} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} - \log Z\\
&= \log \sum_{Y \mid Y^L} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} - \log \sum_{Y} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\}
\end{aligned}
\tag{A1}
\]
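As a concrete (and deliberately tiny) illustration of Equation (A1), the sketch below uses made-up weights and features that are not the paper's: three binary edge labels with one weight each, so α^T f(Y) reduces to a weighted sum of the labels, and the β^T g term is omitted for brevity. The first log-sum runs only over label configurations Y that agree with the observed labels Y^L; the second runs over all configurations.

# Toy computation of the log-likelihood objective O(θ) in Equation (A1).
import math
from itertools import product

alpha = [0.5, -0.2, 0.8]              # illustrative edge-feature weights
observed = {0: 1}                     # Y^L: edge 0 is labeled 1; edges 1 and 2 are unlabeled

def score(Y):                         # α^T f(Y) with the toy features f_i(Y) = y_i
    return sum(a * y for a, y in zip(alpha, Y))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

all_Y = list(product((0, 1), repeat=3))
consistent = [Y for Y in all_Y if all(Y[e] == v for e, v in observed.items())]

objective = (log_sum_exp([score(Y) for Y in consistent])   # sum over Y consistent with Y^L
             - log_sum_exp([score(Y) for Y in all_Y]))     # minus log of the partition function
print(objective)                                           # log P(Y^L | X, G)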
Then learning the HFGM amounts to estimating a parameter configuration θ = (α, β) that maximizes the objective function O(θ), i.e.,
\[
\theta^{*} = \arg\max_{\theta} O(\theta)
\tag{A2}
\]
We employ a gradient descent method to optimize the objective function. Specifically, taking the parameters α as an example, we can calculate the gradients as:
\[
\begin{aligned}
\frac{\partial O(\theta)}{\partial \alpha}
&= \frac{\partial \left( \log \sum_{Y \mid Y^L} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} - \log \sum_{Y} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} \right)}{\partial \alpha}\\
&= \frac{\sum_{Y \mid Y^L} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} \cdot \mathbf{f}}{\sum_{Y \mid Y^L} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\}} - \frac{\sum_{Y} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\} \cdot \mathbf{f}}{\sum_{Y} \exp\{\alpha^{T}\mathbf{f}\}\exp\{\beta^{T}\mathbf{g}\}}\\
&= \mathbb{E}_{P_\theta(Y \mid Y^L, X, G)}[\mathbf{f}] - \mathbb{E}_{P_\theta(Y \mid X, G)}[\mathbf{f}]
\end{aligned}
\tag{A3}
\]
For a specific parameter α_em, we have
\[
\frac{\partial O(\theta)}{\partial \alpha_{em}} = \mathbb{E}_{P_\theta(y_{e_i} \mid Y^L, X, G)}[f_{em}(x_{e_i m}, y_{e_i})] - \mathbb{E}_{P_\theta(y_{e_i} \mid X, G)}[f_{em}(x_{e_i m}, y_{e_i})]
\tag{A4}
\]
where E_{P_θ(y_ei | Y^L, X, G)}[f_em(x_eim, y_ei)] is the expectation of the feature function f_em under the distribution P_θ(y_ei | Y^L, X, G) estimated by the learned model given the partially labeled data Y^L, and E_{P_θ(y_ei | X, G)}[f_em(x_eim, y_ei)] is the expectation of f_em under the distribution P_θ(y_ei | X, G) estimated by the learned model without any known information.
Similarly,
\[
\frac{\partial O(\theta)}{\partial \beta_{cn}} = \mathbb{E}_{P_\theta(Y_{c_j} \mid Y^L, X, G)}[g_{cn}(Y_{c_j})] - \mathbb{E}_{P_\theta(Y_{c_j} \mid X, G)}[g_{cn}(Y_{c_j})]
\tag{A5}
\]
where E_{P_θ(Y_cj | Y^L, X, G)}[g_cn(Y_cj)] is the expectation of the feature function g_cn under the distribution P_θ(Y_cj | Y^L, X, G) estimated by the learned model given the partially labeled data Y^L, and E_{P_θ(Y_cj | X, G)}[g_cn(Y_cj)] is the expectation of g_cn under the distribution P_θ(Y_cj | X, G) estimated by the learned model without any known information.
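The sketch below illustrates Equations (A3)-(A4) on the same kind of toy model used above (illustrative weights and features, not the paper's): the gradient for each weight is the feature expectation under the label-constrained distribution minus the feature expectation under the unconstrained one. The expectations are computed here by exact enumeration, which is only feasible because the toy graph is tiny.

# Gradient as a difference of two feature expectations (Equations (A3)-(A4)).
import math
from itertools import product

alpha = [0.5, -0.2, 0.8]                       # current weights
observed = {0: 1}                              # Y^L

def features(Y):                               # toy features: f_i(Y) = y_i
    return list(Y)

def distribution(configs):                     # normalized P over the given configurations
    scores = [sum(a * f for a, f in zip(alpha, features(Y))) for Y in configs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    Z = sum(weights)
    return {Y: w / Z for Y, w in zip(configs, weights)}

all_Y = list(product((0, 1), repeat=3))
consistent = [Y for Y in all_Y if all(Y[e] == v for e, v in observed.items())]

def expected_features(p):
    return [sum(prob * features(Y)[i] for Y, prob in p.items()) for i in range(len(alpha))]

gradient = [c - u for c, u in zip(expected_features(distribution(consistent)),  # E[f | Y^L, X, G]
                                  expected_features(distribution(all_Y)))]      # E[f | X, G]
print(gradient)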
It is intractable to compute the marginal probabilities in Equations (A4) and (A5) exactly, because the graphical structure of the HFGM can be arbitrary and may contain cycles. Several methods exist for solving the problem approximately; in this work we chose loopy belief propagation (LBP)[5] for its ease of implementation and effectiveness. Algorithm A1 shows the details of the semi-supervised learning algorithm of the model.
Algorithm A1 Semi-supervised learning of the HFGM
Input: heterogeneous TCM network G = (V, E, X);
       labeled edges Y^L ⊆ Y;
       learning rate η;
Output: estimated parameters θ*;
Initialize θ ← 0;
repeat
    Perform LBP to calculate the marginal distributions P_θ(y_ei | Y^L, X, G) and P_θ(Y_cj | Y^L, X, G);
    Calculate E_{P_θ(y_ei | Y^L, X, G)}[f_em(x_eim, y_ei)] under P_θ(y_ei | Y^L, X, G);
    Calculate E_{P_θ(Y_cj | Y^L, X, G)}[g_cn(Y_cj)] under P_θ(Y_cj | Y^L, X, G);
    Perform LBP to calculate the marginal distributions P_θ(y_ei | X, G) and P_θ(Y_cj | X, G);
    Calculate E_{P_θ(y_ei | X, G)}[f_em(x_eim, y_ei)] under P_θ(y_ei | X, G);
    Calculate E_{P_θ(Y_cj | X, G)}[g_cn(Y_cj)] under P_θ(Y_cj | X, G);
    Calculate the gradients of α_em and β_cn according to Equations (A4) and (A5);
    α_em ← α_em + η · ∂O(θ)/∂α_em;
    β_cn ← β_cn + η · ∂O(θ)/∂β_cn;
until convergence;
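A compact sketch of this loop for the toy model used above is given below (an illustration only, not the authors' implementation): exact enumeration stands in for LBP, and each iteration forms the gradient from the two feature expectations and takes a gradient step with learning rate η. On a real, loopy HFGM the two enumeration-based expectations would be replaced by LBP marginals.

# Minimal sketch of the learning loop in Algorithm A1 on a toy model.
import math
from itertools import product

observed = {0: 1}                                        # labeled edges Y^L
all_Y = list(product((0, 1), repeat=3))
consistent = [Y for Y in all_Y if all(Y[e] == v for e, v in observed.items())]

def expected_features(alpha, configs):
    scores = [sum(a * y for a, y in zip(alpha, Y)) for Y in configs]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    Z = sum(weights)
    return [sum(w / Z * Y[i] for w, Y in zip(weights, configs)) for i in range(len(alpha))]

alpha, eta = [0.0, 0.0, 0.0], 0.1                        # initialize θ = 0; learning rate η
for step in range(200):                                  # "repeat ... until convergence" (capped here)
    grad = [c - u for c, u in zip(expected_features(alpha, consistent),
                                  expected_features(alpha, all_Y))]
    alpha = [a + eta * g for a, g in zip(alpha, grad)]   # α ← α + η · ∂O(θ)/∂α
    if max(abs(g) for g in grad) < 1e-3:                 # simple convergence check
        break
print(step, alpha)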
Appendix B: Inferring Unknown Relations
Based on the estimated parameters θ*, we can predict the labels of the unknown edges by finding a label configuration that maximizes the joint probability, i.e.,
\[
Y^{*} = \arg\max_{Y} P(Y \mid Y^L, X, G)
\tag{9}
\]
Again, we use LBP to calculate the marginal distribution P(y_ei | Y^L, X, G), and then assign each edge the label with the largest marginal probability. This marginal probability is taken as the prediction confidence.
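The prediction step can be sketched as follows for the toy model above (illustrative learned weights; enumeration again stands in for LBP): compute the marginal of every unlabeled edge conditioned on Y^L, assign the label with the largest marginal, and report that marginal as the confidence.

# Toy version of the inference step: marginal argmax per unlabeled edge.
import math
from itertools import product

alpha = [2.0, -0.4, 1.1]                       # hypothetical learned weights θ*
observed = {0: 1}                              # Y^L
configs = [Y for Y in product((0, 1), repeat=3)
           if all(Y[e] == v for e, v in observed.items())]

scores = [sum(a * y for a, y in zip(alpha, Y)) for Y in configs]
m = max(scores)
weights = [math.exp(s - m) for s in scores]
Z = sum(weights)

for e in (1, 2):                               # the unlabeled edges
    marginal = {v: sum(w for w, Y in zip(weights, configs) if Y[e] == v) / Z
                for v in (0, 1)}               # P(y_e | Y^L, X, G)
    label, confidence = max(marginal.items(), key=lambda kv: kv[1])
    print(f"edge {e}: label {label}, confidence {confidence:.3f}")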
References
1. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann, 1988.
2. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: The MIT Press, 2009.
3. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Trans Inf Theory 2001;47:498–519.
4. Mooij JM, Kappen HJ. Sufficient conditions for convergence of the sum-product algorithm. IEEE Trans Inf Theory 2007;53:4422–37.
5. Murphy KP, Weiss Y, Jordan MI. Loopy belief propagation for approximate inference: an empirical study. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI'99). 1999:467–75.