First Order Random Walk Learning
Ni Lao, 12.1.2011

Abstract

1. Introduction

We want to extend the PRA model with some of the expressive power of first order logic, which can represent a wide range of human knowledge, e.g.:

Unlexicalized path:
  HasFather(X, A), IsA(A, B) => IsA(X, B)
Existential quantifier:
  ∃A, Write(X, A) => IsA(X, writer)
  ¬∃A, MarriedTo(X, A) => IsA(X, single)
Lexicalized path:
  Write(X, A), IsA(A, novel) => IsA(X, novelist)
Conjunction:
  Write(X, A), IsA(A, novel) & IsA(X, American) => IsA(X, US_novelist)
Disjunction (or clustering?):
  Write(X, A) & (IsA(A, novel) || IsA(A, poem)) => IsA(X, writer)

2. Design

2.1. Propositionalization (feature construction)

We cast reasoning as a classification problem: given a query-target node pair (x, y), predict whether or not they have relationship R. We extract a set of real-valued features for (x, y) and use a trained classifier to perform this binary prediction.

We define basic random walk features as path-constrained random walk probabilities between x, y, or certain constant nodes z in the graph. We use π to denote a path, which is a sequence of relations, and X, Y to represent the placeholders of the query and target nodes.

Figure: Random walks used to generate features.

These basic features fall into two groups.

Target dependent basic features:
  P(X→Y; π) -- the probability of reaching Y from X following path π.
  P(z→Y; π) -- the probability of reaching Y from z following path π (this generalizes I(Y=y), which is 1 if the target node is y and 0 otherwise).

Target independent basic features:
  P(X→z; π) -- the probability of reaching constant node z from query node X following path π.
  P(z→X; π) -- the probability of reaching query node X from constant node z following path π.
  P(X→?; π) -- the probability of reaching any node from query node X following path π.
  I(X→φ; π) -- 1 if a random walk from query node X following path π reaches no node, else 0.

A conjunction feature f1 ∧ f2 ∧ ... ∧ fK is the conjunction of K basic features, at least one of which is target dependent. For efficiency reasons, we start random walks either from X or from z, but not from Y.

The proposed model is a generalization of PRA and its two extensions -- query-independent paths and popular-entity biases (Lao & Cohen, 2010).

2.2. Backward Random Walks

Although paths are initially constrained to have maximum length L, we introduce longer paths by concatenating forward and backward paths. Starting from node Y (or X) we can perform backward random walks to calculate P(Y←n; π) or P(X←n; π) -- the probability of reaching Y (or X) from any node n. These are useful for estimating the quality of longer path candidates:

  P(X→Y; π1π2) = Σ_n P(X→n; π1) P(Y←n; π2),

where π1π2 is the concatenation of paths π1 and π2.

Figure: Random walks produced at different stages of learning (induction vs. train/test).

2.3. The training process

The overall training process is:
1) find paths starting from X with length L<=2 and support M>=5
2) find backward paths starting from X and Y with length L<=2 and support M>=5
3) add X-started forward paths with accuracy > 0.1 as initial features
4) train the model
5) estimate the gradient (or gain) of long paths by concatenating forward X-paths (or z-paths) with backward Y-paths
6) estimate the gradient (or gain) of candidate conjunction features (including lexicalized and unary features, e.g. P(Y←writer; φ) or P(X→male; IsA), where φ is the empty path)
7) add the top features/paths to the feature set
8) go back to 4), or terminate training

2.4. Conjunction Feature Induction

Conjunction is the key element for expressing first order logic semantics. To develop an efficient procedure for inducing conjunction features, we first define a few concepts. We are given a set of n examples {(x^i, y^i)}, where x^i = [x^i_1, x^i_2, ..., x^i_m] is a feature vector and y^i is a binary label.
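Before turning to the feature-induction machinery, the basic forward features of Section 2.1 and the forward/backward concatenation of Section 2.2 can be sketched in code. This is a toy illustration under our own assumptions: the graph contents and function names are hypothetical, not from the paper, and edges are followed uniformly at random.

```python
from collections import defaultdict

# Toy knowledge graph, (source, relation) -> target nodes.
# All entity and relation names are hypothetical illustrations.
EDGES = {
    ("tolstoy", "Write"): ["war_and_peace", "anna_karenina"],
    ("war_and_peace", "IsA"): ["novel"],
    ("anna_karenina", "IsA"): ["novel"],
}
INV = defaultdict(list)  # inverse edge index, used for backward walks
for (src, rel), tgts in EDGES.items():
    for t in tgts:
        INV[(t, rel)].append(src)

def walk(start, path, edges=EDGES):
    """Forward feature: distribution over end nodes after following the
    relation sequence `path` from `start`, choosing edges uniformly."""
    dist = {start: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            targets = edges.get((node, rel), [])
            for t in targets:
                nxt[t] += p / len(targets)
        dist = dict(nxt)
    return dist

def reach(target, path, edges=EDGES, inv=INV):
    """Backward walk: P(n -> target; path) for every node n, computed by
    sweeping from `target` over inverse edges."""
    score = {target: 1.0}
    for rel in reversed(path):
        prev = defaultdict(float)
        for node, p in score.items():
            for src in inv.get((node, rel), []):
                prev[src] += p / len(edges[(src, rel)])
        score = dict(prev)
    return score

# P(X -> Y; pi) computed directly over the full path ...
direct = walk("tolstoy", ["Write", "IsA"])["novel"]
# ... and as sum_n P(X -> n; pi1) * P(Y <- n; pi2).
back = reach("novel", ["IsA"])
est = sum(p * back.get(n, 0.0)
          for n, p in walk("tolstoy", ["Write"]).items())
assert abs(direct - est) < 1e-12  # the two computations agree
```

Decomposing a long path this way lets the long-path candidates of training step 5 be scored from short forward and backward walks, without ever enumerating the full path from scratch.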
Definition 1: the mean of feature X_j is X̄_j = E[X_j] = (1/n) Σ_i x^i_j.

Definition 2: the signal of feature X_j on sample (x^i, y^i) is s^i_j = x^i_j − X̄_j.

Definition 3: the error of sample i is e^i = p^i − y^i, where p^i = σ(θ^T x^i) is the prediction of the logistic regression model.

Conjunction Feature Gradient Theorem: assume a logistic regression model has been trained on the existing features, and consider adding a new conjunction feature X_j X_k with parameter θ_{j,k}. The gradient of the objective function w.r.t. θ_{j,k} is

  ∂L(θ)/∂θ_{j,k} = Σ_i e^i s^i_j s^i_k.    (1)

Proof:
  ∂L(θ)/∂θ_{j,k} = Σ_i e^i x^i_j x^i_k
    = Σ_i e^i (s^i_j + X̄_j)(s^i_k + X̄_k)
    = Σ_i e^i s^i_j s^i_k + X̄_j Σ_i e^i x^i_k + X̄_k Σ_i e^i x^i_j − X̄_j X̄_k Σ_i e^i
    = Σ_i e^i s^i_j s^i_k + X̄_j ∂L(θ)/∂θ_k + X̄_k ∂L(θ)/∂θ_j − X̄_j X̄_k ∂L(θ)/∂θ_bias  (*)
    = Σ_i e^i s^i_j s^i_k

(*) We can drop the last three terms if the model already contains features j and k and a bias term, and has been trained to convergence, since their gradients are then zero.

If we ignore errors and signals with magnitudes smaller than thresholds t_err and t_sig, we can effectively ignore many terms in Eq. (1). At each training iteration, the CFI algorithm (Algorithm 1) estimates a sparse matrix h_{j,k} as an approximation of ∂L(θ)/∂θ_{j,k}, and adds the top J features to the model.

Algorithm 1: Conjunction Feature Induction
  set h_{j,k} = 0 for every pair of features (j, k)
  for each sample (x^i, y^i) with |e^i| > t_err:
    find the set of features X_j, X_k with |s^i_j| > t_sig and |s^i_k| > t_sig
    h_{j,k} += e^i s^i_j s^i_k

3. Experiment

Table 1: Comparison to other methods.

4. Conclusion

References
[1] Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5, 239-266.
[2] Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1), 53-67.

Appendix