First Order Random Walk Learning
Ni Lao, 12.1.2011
Abstract
1. Introduction
We want to extend the PRA model to have some of the expressive power of first-order logic, which can
represent a wide range of human knowledge, e.g.
Unlexicalized path:
HasFather(X, A), IsA(A, B) => IsA(X, B)
Existential quantifier:
∃A, Write(X, A) => IsA(X, writer)
¬∃A, MarriedTo(X, A) => IsA(X, single)
Lexicalized path:
Write(X, A), IsA(A, novel) => IsA(X, novelist)
Conjunction:
Write(X, A), IsA(A, novel) & IsA(X, American) => IsA(X, US_novelist)
Disjunction (or clustering?):
Write(X, A) & (IsA(A, novel) || IsA(A, poem)) => IsA(X, writer)
2. Design
2.1. Propositionalization (feature construction)
We cast reasoning as a classification problem: given a query-target node pair (x, y), predict
whether or not they have relationship R. We extract a set of real-valued features for (x, y) and
use a trained classifier to perform this binary prediction.
We define basic random walk features as path-constrained random walk probabilities between
x and y, or between one of them and a certain constant node z in the graph. We use π to denote a path,
which is a sequence of relations. We use X and Y to represent placeholders for the query and target nodes.
[Figure: random walks used to generate features, between query node X, target node Y, and a constant node z]
These basic features are categorized into two groups.
Target dependent basic features
P(X→Y; π) -- the probability of reaching Y from X following path π.
P(z→Y; π) -- the probability of reaching Y from z following path π.
(this generalizes I(Y=y) -- 1 if the target node is y, or else 0)
Target independent basic features
P(X→z; π) -- the probability of reaching node z from query node X following path π
P(z→X; π) -- the probability of reaching query node X from node z following path π
P(X→?; π) -- the probability of reaching any node from query node X following path π
I(X→φ; π) -- 1 if random walk reaches no node from query node X following path π, else 0
A conjunction feature f1 ∧ f2 ∧ ... ∧ fK is the conjunction of K basic features, at least one of which
is target dependent.
For efficiency reasons, we start random walks either from X or from z, but not from Y.
The proposed model is a generalization of PRA and its two extensions: query-independent paths
and popular-entity biases (Lao & Cohen, 2010).
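To make these feature definitions concrete, here is a minimal sketch (not the actual PRA implementation) of computing a path-constrained random walk probability P(X→Y; π) on a toy edge-labeled graph; the entities, relations, and helper names are invented for illustration.

```python
from collections import defaultdict

# Toy edge-labeled graph: graph[relation][node] -> list of neighbor nodes.
# The entities and relations below are made up for illustration.
graph = defaultdict(lambda: defaultdict(list))

def add_edge(rel, src, dst):
    graph[rel][src].append(dst)

add_edge("Write", "melville", "moby_dick")
add_edge("Write", "melville", "typee")
add_edge("IsA", "moby_dick", "novel")
add_edge("IsA", "typee", "novel")

def walk(start, path):
    """Distribution over nodes reached from `start` by following the
    relation sequence `path`, choosing each outgoing edge uniformly."""
    dist = {start: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            neighbors = graph[rel][node]
            if not neighbors:        # the walk dies here; its mass is dropped
                continue
            for nb in neighbors:
                nxt[nb] += p / len(neighbors)
        dist = dict(nxt)
    return dist

dist = walk("melville", ["Write", "IsA"])
print(dist.get("novel", 0.0))   # target-dependent feature P(X -> novel; <Write, IsA>)
print(sum(dist.values()))       # target-independent feature P(X -> ?; <Write, IsA>)
```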
2.2. Backward Random Walks
Although the paths are initially constrained to have maximum length L, we will introduce
longer paths by concatenating forward and backward paths. Starting from node Y (or X), we can perform
backward random walks to calculate P(Y←n; π) or P(X←n; π), the probability of reaching Y
(or X) from any node n. These quantities are useful for estimating the properties of longer path candidates:
P(X→Y; π1π2) = Σ_n P(X→n; π1) · P(Y←n; π2)
where π1π2 is the concatenation of paths π1 and π2.
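As a sanity check of this identity, backward random walks can be computed by dynamic programming from the end node toward its predecessors. The sketch below continues the toy graph and `walk` helper from the previous snippet; `reach_prob_to` is an illustrative helper name, not from this note.

```python
def reach_prob_to(target, path):
    """P(n -> target; path) for every node n, computed backwards from
    `target` (a backward random walk over the same toy graph)."""
    prob = defaultdict(float, {target: 1.0})
    for rel in reversed(path):
        prev = defaultdict(float)
        for node, neighbors in list(graph[rel].items()):
            if neighbors:
                prev[node] = sum(prob[nb] for nb in neighbors) / len(neighbors)
        prob = prev
    return prob

# Check P(X -> Y; pi1 pi2) = sum_n P(X -> n; pi1) * P(Y <- n; pi2)
pi1, pi2 = ["Write"], ["IsA"]
lhs  = walk("melville", pi1 + pi2).get("novel", 0.0)
fwd  = walk("melville", pi1)                 # P(X -> n; pi1)
back = reach_prob_to("novel", pi2)           # P(Y <- n; pi2)
rhs  = sum(p * back[n] for n, p in fwd.items())
print(lhs, rhs)                              # identical, as the identity predicts
```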
[Figure: random walks produced at different stages of learning (induction vs. train/test), involving X, Y, and z]
2.3. The training process
The overall training process is:
1) find paths starting from X with length L <= 2 and support M >= 5
2) find backward paths starting from X and Y with length L <= 2 and support M >= 5
3) add X-started forward paths with accuracy > 0.1 as initial features
4) train the model
5) estimate the gradient (or gain) of longer paths obtained by concatenating forward X-paths (or z-paths)
with backward Y-paths
6) estimate the gradient (or gain) of adding conjunction features (including lexicalized and
unary features, e.g. P(Y←writer; φ) or P(X→male; IsA), where φ is the empty path)
7) add the top features/paths to the feature set
8) go back to step 4), or terminate training
2.4. Conjunction Feature Induction
Conjunction is the key element for expressing first-order logic semantics. In order to develop an
efficient procedure for inducing conjunction features, let's first define a few concepts.
We are given a set of N examples {(x^i, y^i)}, i = 1, ..., N, where x^i = [x^i_1, x^i_2, ..., x^i_m] is a feature vector and y^i is a binary label.
Definition 1: the mean of feature X_j is X̄_j = E[X_j] = (1/N) Σ_i x^i_j.
Definition 2: the signal of feature X_j for sample (x^i, y^i) is s^i_j = x^i_j − X̄_j.
Definition 3: the error of sample i is e^i = p^i − y^i, where p^i = σ(θ^T x^i) is the prediction of the
logistic regression model and σ is the sigmoid function.
Conjunctional Feature Gradient Theorem: assume a logistic regression model has been trained
on the existing features, and we consider adding a new conjunction feature X_j X_k with
parameter θ_{j,k}. The gradient of the objective function with respect to θ_{j,k} is

∂L(θ)/∂θ_{j,k} ≈ Σ_i e^i s^i_j s^i_k      (1)

Proof:
∂L(θ)/∂θ_{j,k} = Σ_i e^i x^i_j x^i_k   (the conjunction feature takes value x^i_j x^i_k on sample i)
 = Σ_i e^i (s^i_j + X̄_j)(s^i_k + X̄_k)
 = Σ_i e^i s^i_j s^i_k + X̄_j Σ_i e^i x^i_k + X̄_k Σ_i e^i x^i_j − X̄_j X̄_k Σ_i e^i
 = Σ_i e^i s^i_j s^i_k + X̄_j ∂L(θ)/∂θ_k + X̄_k ∂L(θ)/∂θ_j − X̄_j X̄_k ∂L(θ)/∂θ_bias *
 ≈ Σ_i e^i s^i_j s^i_k
*We can drop the last three terms if the model already contains feature k, j, bias and is already trained to convergence.
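As a quick numerical check of the theorem, the exact gradient Σ_i e^i x^i_j x^i_k and the approximation Σ_i e^i s^i_j s^i_k coincide once a model containing features j, k and the bias has been trained to convergence. The sketch below assumes L(θ) is the logistic loss with e^i = p^i − y^i as in Definition 3; the synthetic data and the use of Newton's method are illustrative choices, not from this note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 6
X = rng.random((n, m))                            # synthetic feature values in [0, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))).astype(float)

# Fit logistic regression (with a bias column) to convergence using Newton's method.
Xb = np.hstack([X, np.ones((n, 1))])
theta = np.zeros(m + 1)
for _ in range(25):
    p = 1 / (1 + np.exp(-Xb @ theta))
    g = Xb.T @ (p - y)                            # gradient of the logistic loss
    H = (Xb.T * (p * (1 - p))) @ Xb               # Hessian
    theta -= np.linalg.solve(H, g)

e = 1 / (1 + np.exp(-Xb @ theta)) - y             # errors e^i = p^i - y^i
s = X - X.mean(axis=0)                            # signals s^i_j = x^i_j - mean_j
j, k = 2, 3                                       # a candidate conjunction X_j X_k
exact  = np.sum(e * X[:, j] * X[:, k])            # dL/d(theta_{j,k}) for the new feature
approx = np.sum(e * s[:, j] * s[:, k])            # the theorem's approximation
print(exact, approx)                              # nearly equal once training has converged
```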
If we ignore errors and signals with magnitudes smaller than the thresholds t_err and t_sig respectively, we can
effectively ignore many of the terms in Eq. (1). At each training iteration, the CFI algorithm (outlined in
Algorithm 1) estimates a sparse vector h_{j,k} as an approximation of ∂L(θ)/∂θ_{j,k}, and adds the top J
features to the model.
Algorithm 1: Conjunctional Feature Induction
set h_{j,k} = 0 for every pair of features j, k
for each sample (x^i, y^i) with |e^i| > t_err:
    find the set of features X_j, X_k with |s^i_j| > t_sig and |s^i_k| > t_sig
    for each such pair (j, k): h_{j,k} += e^i s^i_j s^i_k
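A compact numpy version of Algorithm 1 might look like the following sketch; the threshold values, the feature-matrix layout, and the omission of the bias term are illustrative assumptions, not this note's implementation.

```python
import numpy as np

def conjunction_feature_induction(X, y, theta, t_err=0.01, t_sig=0.01, J=10):
    """Sketch of Algorithm 1: score every candidate conjunction (j, k) by
    h[j, k] = sum_i e^i * s^i_j * s^i_k, restricted to samples with large
    error and features with large signal.

    X: (n_samples, n_features) matrix of basic random-walk feature values
    y: (n_samples,) binary labels
    theta: current logistic-regression weights (bias omitted for brevity)
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))      # predictions p^i
    e = p - y                                  # errors e^i = p^i - y^i
    s = X - X.mean(axis=0)                     # signals s^i_j = x^i_j - mean_j

    h = np.zeros((X.shape[1], X.shape[1]))
    for i in np.nonzero(np.abs(e) > t_err)[0]:         # samples with |e^i| > t_err
        active = np.nonzero(np.abs(s[i]) > t_sig)[0]   # features with |s^i_j| > t_sig
        for j in active:
            for k in active:
                h[j, k] += e[i] * s[i, j] * s[i, k]

    # Return the top-J feature pairs (j < k) by estimated gradient magnitude.
    order = np.argsort(-np.abs(h), axis=None)
    pairs = [np.unravel_index(o, h.shape) for o in order]
    return [(int(j), int(k)) for j, k in pairs if j < k][:J]
```

Adding the returned pairs as product features X_j · X_k to the feature set and retraining would correspond to one pass through steps 6) and 7) of the training process.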
3. Experiment
Table 1: Comparison to other methods.
4. Conclusion
References
[1] Quinlan, J. R. (1990). Learning Logical Definitions from Relations. Machine Learning, 5, 239-266.
Appendix