A Kernel-based Approach to Learning
Semantic Parsers
Rohit J. Kate
Doctoral Dissertation Proposal
Supervisor: Raymond J. Mooney
Machine Learning Group
Department of Computer Sciences
University of Texas at Austin
November 21, 2005
Outline
• Semantic Parsing
• Related Work
• Background on Kernel-based Methods
• Completed Research
• Proposed Research
• Conclusions
2
Semantic Parsing
• Semantic Parsing: Transforming natural language
(NL) sentences into computer executable complete
meaning representations (MRs)
• Importance of Semantic Parsing
– Natural language communication with computers
– Insights into human language acquisition
• Example application domains
– CLang: Robocup Coach Language
– Geoquery: A Database Query Application
3
CLang: RoboCup Coach Language
• In RoboCup Coach competition teams compete to
coach simulated players
• The coaching instructions are given in a formal
language called CLang
Coach (NL): "If our player 4 has the ball, our player 4 should shoot."
[Simulated soccer field]
        ↓ Semantic Parsing
CLang: ((bowner our {4}) (do our {4} shoot))
4
Geoquery: A Database Query Application
• Query application for U.S. geography database
containing about 800 facts [Zelle & Mooney, 1996]
User (NL): "Which rivers run through the states bordering Texas?"
        ↓ Semantic Parsing
Query: answer(traverse_2(next_to(stateid(‘texas’))))
5
Learning Semantic Parsers
• Assume meaning representation languages (MRLs) have
deterministic context free grammars
– true for almost all computer languages
– MRs can be parsed unambiguously
6
NL: Which rivers run through the states bordering Texas?
MR: answer(traverse_2(next_to(stateid(‘texas’))))
Parse tree of MR:

ANSWER
  answer
  RIVER
    TRAVERSE_2
      traverse_2
    STATE
      NEXT_TO
        next_to
      STATE
        STATEID
          stateid
          ‘texas’

Non-terminals: ANSWER, RIVER, TRAVERSE_2, STATE, NEXT_TO, STATEID
Terminals: answer, traverse_2, next_to, stateid, ‘texas’
Productions: ANSWER → answer(RIVER), RIVER → TRAVERSE_2(STATE),
STATE → NEXT_TO(STATE), STATE → STATEID,
TRAVERSE_2 → traverse_2, NEXT_TO → next_to, STATEID → ‘texas’
7
Learning Semantic Parsers
• Assume meaning representation languages (MRLs) have
deterministic context free grammars
– true for almost all computer languages
– MRs can be parsed unambiguously
• Training data consists of NL sentences paired with their
MRs
• Induce a semantic parser which can map novel NL
sentences to their correct MRs
• Learning problem differs from that of syntactic parsing
where training data has trees annotated over the NL
sentences
8
Outline
• Semantic Parsing
• Related Work
• Background on Kernel-based Methods
• Completed Research
• Proposed Research
• Conclusions
9
Related Work: CHILL
[Zelle & Mooney, 1996]
• Uses Inductive Logic Programming (ILP) to
induce a semantic parser
• Learns rules to control actions of a deterministic
shift-reduce parser
• Processes the sentence one word at a time, making a hard
parsing decision each time
• Brittle, and ILP techniques do not scale to large corpora
10
Related Work: SILT
[Kate, Wong & Mooney, 2005]
• Transformation rules associate NL patterns with
MRL templates
NL pattern:    our left [3] penalty area
MRL template:  AREA → (left (penalty-area our))
• NL patterns matched in the sentence are replaced
by the MRL templates
• By the end of parsing, NL sentence gets
transformed into its MR
• Two versions: string patterns and syntactic tree
patterns
11
Related Work: SILT contd.
Weaknesses of SILT:
• Hard-matching transformation rules are brittle:
– e.g., for the NL pattern our left [3] penalty area:
“our left penalty area”
“our left side of penalty area ”
“left of our penalty area”
“our ah.. left penalty area”
• Parsing is done deterministically which is less
robust than probabilistic parsing
12
Related Work: WASP [Wong, 2005]
• Based on Synchronous Context-free Grammars
• Uses Machine Translation technique of statistical
word alignment to find good transformation rules
• Builds a maximum entropy model for parsing
• The transformation rules are hard-matching
13
Related Work: SCISSOR [Ge & Mooney, 2005]
• Based on a fairly standard approach to compositional
semantics [Jurafsky and Martin, 2000]
• A statistical parser is used to generate a semantically
augmented parse tree (SAPT)
– Augment Collins’ head-driven model 2 (Bikel’s implementation, 2004)
to incorporate semantic labels
• Translate a complete SAPT into a formal meaning representation

Example SAPT for “our player 2 has the ball”:
  S-bowner
    NP-player
      PRP$-team  our
      NN-player  player
      CD-unum    2
    VP-bowner
      VB-bowner  has
      NP-null
        DT-null  the
        NN-null  ball
14
Related Work: Zettlemoyer & Collins [2005]
• Uses Combinatorial Categorial Grammar (CCG)
formalism to learn a statistical semantic parser
• Generates CCG lexicon relating NL words to
semantic types through general hand-built
template rules
• Uses maximum entropy model for compacting this
lexicon and doing probabilistic CCG parsing
15
Outline
• Semantic Parsing
• Related Work
• Background on Kernel-based Methods
• Completed Research
• Proposed Research
• Conclusions
16
Traditional Machine Learning with Structured
Data
Examples
  → Feature Engineering (with information loss)
  → Feature Vectors
  → Machine Learning Algorithm
17
Kernel-based Machine Learning with
Structured Data
Examples
  → Implicit mapping to a potentially infinite number of features
  → Kernel Computations
  → Kernelized Machine Learning Algorithm
18
Kernel Functions
• A kernel K is a similarity function over domain X
which maps any two objects x, y in X to their
similarity score K(x,y)
• If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi, xj))ij is
symmetric and positive-semidefinite, then the kernel function computes
the dot-product of the implicit feature vectors in some high-dimensional
feature space
• Machine learning algorithms which use the data
only to compute similarity can be kernelized (e.g.
Support Vector Machines, Nearest Neighbor etc.)
19
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
K(s,t) = ?
20
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
u = left
K(s,t) = 1+?
21
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
u = our
K(s,t) = 2+?
22
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
u = penalty
K(s,t) = 3+?
23
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
u = area
K(s,t) = 4+?
24
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
u = left penalty
K(s,t) = 5+?
25
String Subsequence Kernel
• Define kernel between two strings as the number
of common subsequences between them [Lodhi et
al., 2002]
• All possible subsequences become the implicit
feature vectors and the kernel computes their
dot-products
s = “left side of our penalty area”
t = “our left penalty area”
K(s,t) = 11
26
Normalized String Subsequence Kernel
• Normalize the kernel (range [0,1]) to remove any bias due
to different string lengths
Knormalized(s, t) = K(s, t) / √( K(s, s) · K(t, t) )
• Lodhi et al. [2002] give an O(n|s||t|) dynamic-programming algorithm for
computing the string subsequence kernel
• Used for Text Categorization [Lodhi et al, 2002] and
Information Extraction [Bunescu & Mooney, 2005b]
27
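A minimal code sketch of the computation illustrated on the preceding slides: it counts the distinct word subsequences two strings share and normalizes the result. This is illustrative only; the actual kernel of Lodhi et al. [2002] is computed with the O(n|s||t|) dynamic program and downweights gapped subsequences (see the extra slides at the end), whereas this brute-force version enumerates subsequences and is only feasible for short strings.

```python
from itertools import combinations

def subsequences(words):
    """All non-empty word subsequences (order preserved, gaps allowed)."""
    subs = set()
    for r in range(1, len(words) + 1):
        for idxs in combinations(range(len(words)), r):
            subs.add(tuple(words[i] for i in idxs))
    return subs

def K(s, t):
    """Number of distinct word subsequences common to s and t."""
    return len(subsequences(s.split()) & subsequences(t.split()))

def K_normalized(s, t):
    return K(s, t) / (K(s, s) * K(t, t)) ** 0.5

s = "left side of our penalty area"
t = "our left penalty area"
print(K(s, t))                       # 11, as in the example above
print(round(K_normalized(s, t), 3))  # length-normalized similarity in [0, 1]
```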
Support Vector Machines
• Mapping data to high-dimensional feature spaces
can lead to overfitting of training data (“curse of
dimensionality”)
• Support Vector Machines (SVMs) are known to be
resistant to this overfitting
28
SVMs: Maximum Margin
• Given positive and negative examples, SVMs find a separating
hyperplane such that the margin ρ between the closest examples
is maximized
• Maximizing the margin is good according to intuition and PAC
theory
(figure: separating hyperplane with margin ρ between the closest examples)
29
SVMs: Probability Estimates
• Probability estimate of a point belonging to a class
can be obtained using its distance from the
hyperplane [Platt, 1999]
30
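A rough sketch of the idea behind Platt's [1999] probability estimates: fit a sigmoid P(y=1|f) = 1 / (1 + exp(A·f + B)) that maps an SVM's decision value f (its signed distance from the hyperplane) to a probability. Platt fits A and B with a regularized Newton procedure on held-out data; the plain gradient-descent fit and the toy decision values below are only stand-ins for that.

```python
import math

def fit_sigmoid(decision_values, labels, lr=0.1, epochs=2000):
    """Fit A, B in P(y=1|f) = 1 / (1 + exp(A*f + B)) by gradient descent
    on the log loss (a simplified stand-in for Platt's procedure)."""
    A, B = 0.0, 0.0
    for _ in range(epochs):
        gA = gB = 0.0
        for f, y in zip(decision_values, labels):
            p = 1.0 / (1.0 + math.exp(A * f + B))
            gA += (p - y) * (-f)    # d(log loss)/dA
            gB += (p - y) * (-1.0)  # d(log loss)/dB
        A -= lr * gA / len(labels)
        B -= lr * gB / len(labels)
    return A, B

# toy decision values (signed distances from the hyperplane) and true labels
fs = [-2.0, -1.0, -0.5, 0.3, 1.2, 2.5]
ys = [0, 0, 0, 1, 1, 1]
A, B = fit_sigmoid(fs, ys)
print(round(1.0 / (1.0 + math.exp(A * 1.2 + B)), 2))  # probability for f = 1.2
```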
Why Kernel-based Approach to Learning
Semantic Parsers?
• Natural language sentences are structured
• Natural languages are flexible, various ways to
express the same semantic concept
CLang MR: (left (penalty-area our))
NL: our left penalty area
our left side of penalty area
left side of our penalty area
left of our penalty area
our penalty area towards the left side
our ah.. left penalty area
31
Why Kernel-based Approach to Learning
Semantic Parsers?
right side of our penalty area
left of our penalty area
opponent’s right penalty area
our left side of penalty area
our ah.. left penalty area
our right midfield
our left penalty area
left side of our penalty area
our penalty area towards the left side
Kernel methods can robustly capture the range of NL contexts.
32
Outline
• Semantic Parsing
• Related Work
• Background on Kernel-based Methods
• Completed Research
• Proposed Research
• Conclusions
33
KRISP: Kernel-based Robust Interpretation
by Semantic Parsing
• Learns a semantic parser from NL sentences paired
with their respective MRs, given the MRL grammar
• Productions of MRL are treated like semantic
concepts
• SVM classifier is trained for each production with
string subsequence kernel
• These classifiers are used to compositionally build
MRs of the sentences
34
Overview of KRISP
Training:
  MRL grammar + NL sentences with MRs
    → Collect positive and negative examples
    → Train string-kernel-based SVM classifiers
    → Semantic Parser, whose best semantic derivations (correct and
      incorrect) feed back into example collection
Testing:
  Novel NL sentences → Semantic Parser → Best MRs
35
Overview of KRISP’s Semantic Parsing
• We first define Semantic Derivation of an NL
sentence
• We define probability of a semantic derivation
• Semantic parsing of an NL sentence involves
finding its most probable semantic derivation
• Straightforward to obtain MR from a semantic
derivation
37
Semantic Derivation of an NL Sentence
MR parse with non-terminals on the nodes:
ANSWER
  answer
  RIVER
    TRAVERSE_2
      traverse_2
    STATE
      NEXT_TO
        next_to
      STATE
        STATEID
          stateid
          ‘texas’
Which rivers run through the states bordering Texas?
38
Semantic Derivation of an NL Sentence
MR parse with productions on the nodes:
ANSWER → answer(RIVER)
  RIVER → TRAVERSE_2(STATE)
    TRAVERSE_2 → traverse_2
    STATE → NEXT_TO(STATE)
      NEXT_TO → next_to
      STATE → STATEID
        STATEID → ‘texas’
Which rivers run through the states bordering Texas?
39
Semantic Derivation of an NL Sentence
Semantic Derivation: Each node covers an NL substring:
ANSWER → answer(RIVER)
  RIVER → TRAVERSE_2(STATE)
    TRAVERSE_2 → traverse_2
    STATE → NEXT_TO(STATE)
      NEXT_TO → next_to
      STATE → STATEID
        STATEID → ‘texas’
Which rivers run through the states bordering Texas?
40
Semantic Derivation of an NL Sentence
Semantic Derivation: Each node contains a production
and the substring of NL sentence it covers:
(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..4])
    (STATE → NEXT_TO(STATE), [5..9])
      (NEXT_TO → next_to, [5..7])
      (STATE → STATEID, [8..9])
        (STATEID → ‘texas’, [8..9])

Which(1) rivers(2) run(3) through(4) the(5) states(6) bordering(7) Texas(8) ?(9)
41
Semantic Derivation of an NL Sentence
Substrings in NL sentence may be in a different order:
ANSWER → answer(RIVER)
  RIVER → TRAVERSE_2(STATE)
    TRAVERSE_2 → traverse_2
    STATE → NEXT_TO(STATE)
      NEXT_TO → next_to
      STATE → STATEID
        STATEID → ‘texas’
Through the states that border Texas which rivers run?
42
Semantic Derivation of an NL Sentence
Nodes are allowed to permute the children productions
from the original MR parse
(ANSWER → answer(RIVER), [1..10])
  (RIVER → TRAVERSE_2(STATE), [1..10])
    (STATE → NEXT_TO(STATE), [1..6])
      (NEXT_TO → next_to, [1..5])
      (STATE → STATEID, [6..6])
        (STATEID → ‘texas’, [6..6])
    (TRAVERSE_2 → traverse_2, [7..10])

Through(1) the(2) states(3) that(4) border(5) Texas(6) which(7) rivers(8) run(9) ?(10)
43
Probability of a Semantic Derivation
• Let Pπ(s[i..j]) be the probability that production π covers the
substring s[i..j] of sentence s
• e.g., PNEXT_TO → next_to(“the states bordering”):
   (NEXT_TO → next_to, [5..7])
   the(5) states(6) bordering(7)
• Obtained from the string-kernel-based SVM classifiers
trained for each production π
• Probability of a semantic derivation D:

   P(D) = ∏ over (π, [i..j]) ∈ D of Pπ(s[i..j])
44
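The product above, written out directly. The per-node probabilities below are hypothetical numbers standing in for the SVM classifiers' outputs; only the way they are combined follows the definition on this slide.

```python
from math import prod   # Python 3.8+

def derivation_probability(derivation, node_prob):
    """P(D) = product over nodes (pi, [i..j]) in D of P_pi(s[i..j])."""
    return prod(node_prob[node] for node in derivation)

# hypothetical classifier outputs for the derivation of the example sentence
node_prob = {
    ("ANSWER → answer(RIVER)",    (1, 9)): 0.95,
    ("RIVER → TRAVERSE_2(STATE)", (1, 9)): 0.90,
    ("TRAVERSE_2 → traverse_2",   (1, 4)): 0.85,
    ("STATE → NEXT_TO(STATE)",    (5, 9)): 0.80,
    ("NEXT_TO → next_to",         (5, 7)): 0.75,
    ("STATE → STATEID",           (8, 9)): 0.90,
    ("STATEID → ‘texas’",         (8, 9)): 0.99,
}
print(round(derivation_probability(list(node_prob), node_prob), 4))
```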
Computing the Most Probable Semantic
Derivation
• Task of semantic parsing is to find the most probable
semantic derivation
• Let En,s[i..j], a partial derivation, denote any subtree of a derivation
tree with n as the LHS non-terminal of the root production, covering
sentence s from index i to j
• Example of ESTATE,s[5..9]:
   (STATE → NEXT_TO(STATE), [5..9])
     (NEXT_TO → next_to, [5..7])
     (STATE → STATEID, [8..9])
       (STATEID → ‘texas’, [8..9])
   the(5) states(6) bordering(7) Texas(8) ?(9)
• The full derivation D is then EANSWER,s[1..|s|]
45
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
46
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[i..j]    E*STATE,s[i..j]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
47
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[5..5]    E*STATE,s[6..9]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
48
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[5..6]    E*STATE,s[7..9]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
49
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[5..7]    E*STATE,s[8..9]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
50
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[5..8]    E*STATE,s[9..9]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G of ( Pπ(s[i..j]) … ) )
51
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*NEXT_TO,s[i..j]    E*STATE,s[i..j]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G and partitions
(p1,…,pt) ∈ partition(s[i..j], t) of ( Pπ(s[i..j]) · ∏ k=1..t P(E*nk,pk) ) )
52
Computing the Most Probable Semantic
Derivation contd.
• Let E*STATE,s[5..9] denote the most probable partial derivation
among all ESTATE,s[5..9]
• This is computed recursively as follows:

E*STATE,s[5..9]
  (STATE → NEXT_TO(STATE), [5..9])
    E*STATE,s[i..j]    E*NEXT_TO,s[i..j]

the(5) states(6) bordering(7) Texas(8) ?(9)

E*n,s[i..j] = makeTree( argmax over productions π = n → n1…nt in G and partitions
(p1,…,pt) ∈ partition(s[i..j], t) of ( Pπ(s[i..j]) · ∏ k=1..t P(E*nk,pk) ) )
53
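The recursion above can be written down directly. The sketch below is illustrative only and is not KRISP's extended Earley parser: it assumes every production has at most two RHS non-terminals, considers only contiguous left-to-right partitions (no permuted children), does no beam (ω) or threshold (θ) pruning, and uses a stub classifier_prob(production, i, j) in place of the trained SVM estimates.

```python
from functools import lru_cache

# MRL grammar: non-terminal -> list of (production name, RHS non-terminals)
GRAMMAR = {
    "ANSWER":     [("ANSWER → answer(RIVER)", ["RIVER"])],
    "RIVER":      [("RIVER → TRAVERSE_2(STATE)", ["TRAVERSE_2", "STATE"])],
    "STATE":      [("STATE → NEXT_TO(STATE)", ["NEXT_TO", "STATE"]),
                   ("STATE → STATEID", ["STATEID"])],
    "TRAVERSE_2": [("TRAVERSE_2 → traverse_2", [])],
    "NEXT_TO":    [("NEXT_TO → next_to", [])],
    "STATEID":    [("STATEID → ‘texas’", [])],
}

def best_derivation(n, i, j, classifier_prob):
    """Most probable partial derivation E*_{n, s[i..j]} as (probability, tree);
    word positions i..j are inclusive, as on the slides."""

    @lru_cache(maxsize=None)
    def search(nt, lo, hi):
        best = (0.0, None)
        for production, children in GRAMMAR[nt]:
            p_here = classifier_prob(production, lo, hi)
            if not children:                     # no RHS non-terminals
                candidates = [(p_here, (production, (lo, hi), []))]
            elif len(children) == 1:             # child covers the whole substring
                p_c, t_c = search(children[0], lo, hi)
                candidates = [(p_here * p_c, (production, (lo, hi), [t_c]))]
            else:                                # two children: try every split point
                candidates = []
                for k in range(lo, hi):
                    p1, t1 = search(children[0], lo, k)
                    p2, t2 = search(children[1], k + 1, hi)
                    candidates.append((p_here * p1 * p2,
                                       (production, (lo, hi), [t1, t2])))
            for cand in candidates:
                if cand[0] > best[0]:
                    best = cand
        return best

    return search(n, i, j)

# toy run: a uniform stub probability for every production and span
probability, tree = best_derivation("ANSWER", 1, 9, lambda prod, lo, hi: 0.5)
print(round(probability, 4))
```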
Computing the Most Probable Semantic
Derivation contd.
• Implemented by extending Earley’s [1970]
context-free grammar parsing algorithm
• Predicts subtrees top-down and completes them
bottom-up
• Dynamic programming algorithm which generates
and compactly stores each subtree once
• Extended because:
– Probability of a production depends on which substring
of the sentence it covers
– Leaves are not terminals but substrings of words
54
Computing the Most Probable Semantic
Derivation contd.
• Does a greedy approximation search with beam width ω and
returns the ω most probable derivations it finds
• Uses a threshold θ to prune low probability trees
55
Overview of KRISP
Training:
  MRL grammar + NL sentences with MRs
    → Collect positive and negative examples
    → Train string-kernel-based SVM classifiers  →  Pπ(s[i..j])
    → Semantic Parser, whose best semantic derivations (correct and
      incorrect) feed back into example collection
Testing:
  Novel NL sentences → Semantic Parser → Best MRs
56
KRISP’s Training Algorithm
• Takes NL sentences paired with their respective
MRs as input
• Obtains MR parses
• Proceeds in iterations
• In the first iteration, for every production π:
– Call those sentences positives whose MR parses use
that production
– Call the remaining sentences negatives
57
KRISP’s Training Algorithm contd.
First Iteration
STATE → NEXT_TO(STATE)

Positives:
•which rivers run through the states bordering texas?
•what is the most populated state bordering oklahoma ?
•what is the largest city in states that border california ?
•…

Negatives:
•what state has the highest population ?
•which states have cities named austin ?
•what states does the delaware river run through ?
•what is the lowest point of the state with the largest area ?
•…

String-kernel-based
SVM classifier
PSTATE→NEXT_TO(STATE)(s[i..j])
58
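A small sketch of the first-iteration example collection just illustrated. mr_productions is an assumed helper that returns the set of productions used in an MR's parse (unambiguous, since the MRL grammar is deterministic); it is not an existing API.

```python
def collect_first_iteration(corpus, all_productions, mr_productions):
    """corpus: list of (nl_sentence, mr) pairs.
    Returns {production: (positive sentences, negative sentences)}."""
    examples = {}
    for pi in all_productions:
        positives = [nl for nl, mr in corpus if pi in mr_productions(mr)]
        negatives = [nl for nl, mr in corpus if pi not in mr_productions(mr)]
        examples[pi] = (positives, negatives)
    return examples
```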
KRISP’s Training Algorithm contd.
• Using these classifiers Pπ(s[i..j]), obtain the ω best
semantic derivations of each training sentence
• Some of these derivations will give the correct
MR, called correct derivations, some will give
incorrect MRs, called incorrect derivations
• For the next iteration, collect positives from the most
probable correct derivation
• Collect negatives from incorrect derivations with
higher probability than the most probable correct
derivation
59
KRISP’s Training Algorithm contd.
Most probable correct derivation:
(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..4])
    (STATE → NEXT_TO(STATE), [5..9])
      (NEXT_TO → next_to, [5..7])
      (STATE → STATEID, [8..9])
        (STATEID → ‘texas’, [8..9])

Which(1) rivers(2) run(3) through(4) the(5) states(6) bordering(7) Texas(8) ?(9)
60
KRISP’s Training Algorithm contd.
Most probable correct derivation: Collect positive
examples
(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..4])
    (STATE → NEXT_TO(STATE), [5..9])
      (NEXT_TO → next_to, [5..7])
      (STATE → STATEID, [8..9])
        (STATEID → ‘texas’, [8..9])

Which(1) rivers(2) run(3) through(4) the(5) states(6) bordering(7) Texas(8) ?(9)
61
KRISP’s Training Algorithm contd.
Incorrect derivation with probability greater than the
most probable correct derivation:
(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..7])
    (STATE → STATEID, [8..9])
      (STATEID → ‘texas’, [8..9])

Which(1) rivers(2) run(3) through(4) the(5) states(6) bordering(7) Texas(8) ?(9)

Incorrect MR: answer(traverse_2(stateid(‘texas’)))
62
KRISP’s Training Algorithm contd.
Most probable correct derivation:

(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..4])
    (STATE → NEXT_TO(STATE), [5..9])
      (NEXT_TO → next_to, [5..7])
      (STATE → STATEID, [8..9])
        (STATEID → ‘texas’, [8..9])

Which rivers run through the states bordering Texas?

Incorrect derivation:

(ANSWER → answer(RIVER), [1..9])
  (RIVER → TRAVERSE_2(STATE), [1..9])
    (TRAVERSE_2 → traverse_2, [1..7])
    (STATE → STATEID, [8..9])
      (STATEID → ‘texas’, [8..9])

Which rivers run through the states bordering Texas?

Traverse both trees in breadth-first order till the first nodes where
their productions differ are found.
63
KRISP’s Training Algorithm contd.
(The most probable correct derivation and the higher-probability incorrect
derivation from the previous slide, with the first nodes where their
productions differ, and the words they cover, highlighted.)
Mark the words under these nodes.
68
KRISP’s Training Algorithm contd.
(The most probable correct derivation and the higher-probability incorrect
derivation as before, with the marked words highlighted.)
Consider all the productions covering the marked words.
Collect negatives for productions which cover any marked word
in incorrect derivation but not in the correct derivation.
70
KRISP’s Training Algorithm contd.
Next Iteration
STATE → NEXT_TO(STATE)

Positives:
•the states bordering texas?
•state bordering oklahoma ?
•states that border california ?
•states which share border
•next to state of iowa
•…

Negatives:
•what state has the highest population ?
•what states does the delaware river run through ?
•which states have cities named austin ?
•what is the lowest point of the state with the largest area ?
•which rivers run through states bordering
•…

String-kernel-based
SVM classifier
PSTATE→NEXT_TO(STATE)(s[i..j])
72
KRISP’s Training Algorithm contd.
• In the next iteration, SVM classifiers are trained
with the new positive examples and the
accumulated negative examples
• Iterate specified number of times
73
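A high-level sketch of the iterative loop described on the last few slides. Every KRISP-internal step (the SVM trainer, the ω-best parser, extracting positives from the best correct derivation and negatives from higher-scoring incorrect derivations) is passed in as a callable, because those components are not reproduced here; derivation objects are assumed to carry a probability attribute.

```python
def train_iteratively(corpus, productions, first_examples, train, parse,
                      mr_of, positives_of, negatives_of, iterations):
    """corpus: list of (sentence, mr).  first_examples: {pi: (pos, neg)} from
    the first iteration.  train(pos, neg) -> classifier.
    parse(sentence, classifiers) -> omega-best derivations.
    mr_of(d) -> the MR a derivation yields.
    positives_of(best_correct) and negatives_of(incorrect, best_correct)
    -> lists of (production, substring)."""
    classifiers = {pi: train(pos, neg) for pi, (pos, neg) in first_examples.items()}
    negatives = {pi: list(first_examples[pi][1]) for pi in productions}
    for _ in range(iterations - 1):
        positives = {pi: [] for pi in productions}
        for sentence, mr in corpus:
            derivations = parse(sentence, classifiers)
            correct = [d for d in derivations if mr_of(d) == mr]
            if not correct:
                continue       # KRISP can force a correct derivation (see Extras)
            best = max(correct, key=lambda d: d.probability)
            for pi, substring in positives_of(best):
                positives[pi].append(substring)
            for d in derivations:
                if mr_of(d) != mr and d.probability > best.probability:
                    for pi, substring in negatives_of(d, best):
                        negatives[pi].append(substring)
        classifiers = {pi: train(positives[pi], negatives[pi])
                       for pi in productions}
    return classifiers
```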
Experimental Corpora
• CLang
– 300 randomly selected pieces of coaching advice from the log files
of the 2003 RoboCup Coach Competition
– 22.52 words on average in NL sentences
– 13.42 tokens on average in MRs
• Geo250 [Zelle & Mooney, 1996]
– 250 queries for the given U.S. geography database
– 6.76 words on average in NL sentences
– 6.20 tokens on average in MRs
• Geo880 [Tang & Mooney, 2001]
– Superset of Geo250 with 880 queries
– 7.48 words on average in NL sentences
– 6.47 tokens on average in MRs
74
Experimental Methodology
• Evaluated using standard 10-fold cross validation
• Correctness
– CLang: output exactly matches the correct
representation
– Geoquery: the resulting query retrieves the same
answer as the correct representation
• Metrics
Precision = (Number of correct MRs) / (Number of test sentences with complete output MRs)

Recall = (Number of correct MRs) / (Number of test sentences)
75
Experimental Methodology contd.
• Compared Systems:
– SILT [Kate, Wong & Mooney, 2005]
– WASP [Wong, 2005]
– SCISSOR [Ge & Mooney, 2005]
– CHILL
• COCKTAIL ILP algorithm [Tang & Mooney, 2001]
– Zettlemoyer & Collins (2005)
• Different Experimental Setup (600 training, 280 testing
examples)
• Results available only for Geo880 corpus
– Geobase
• Hand-built NL interface [Borland International, 1988]
• Results available only for Geo250
76
Experimental Methodology contd.
• KRISP gives probabilities for its semantic
derivations, which are taken as the confidences of the
MRs
• We plot precision-recall curves by first sorting the
best MR for each sentence by confidences and
then finding precision for every recall value
• WASP and SCISSOR also output confidences so
we show their precision-recall curves
• Results of other systems shown as points on
precision-recall graphs
77
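A sketch of the curve construction just described: the best MR of each test sentence is scored with the parser's confidence, the list is sorted, and precision is computed at each recall level as the confidence threshold is lowered. This is a generic reconstruction of the standard procedure, not code from the compared systems.

```python
def precision_recall_curve(scored, num_test_sentences):
    """scored: (confidence, is_correct) for every test sentence that received a
    complete output MR.  Returns a list of (recall, precision) points."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, correct = [], 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        points.append((correct / num_test_sentences, correct / k))
    return points

# toy usage with made-up confidences and correctness flags
print(precision_recall_curve([(0.9, True), (0.7, False), (0.6, True)], 4))
```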
Results on CLang
(Precision-recall curves for CLang; graph not reproduced here.)
SCISSOR requires more annotation on the corpus.
CHILL gives 49.2% precision and 12.67% recall with 160 examples and cannot be run beyond that.
78
Results on Geo250
79
Results on Geo880
80
Results on Multilingual Geo250
• We have the Geo250 corpus translated into Japanese,
Spanish and Turkish
• KRISP is directly applicable to other languages
81
Results on Multilingual Geo250
82
Outline
• Semantic Parsing
• Related Work
• Background on Kernel-based Methods
• Completed Research
• Proposed Research
– Short-term
– Long term
• Conclusions
83
Short Term: Exploiting Natural Language
Syntax
• KRISP currently uses only word order of the
sentence
• Semantic interpretation depends largely on NL
syntax, exploiting it should help semantic parsing
• We already have syntactic annotations on our
corpora, used in SILT-tree and SCISSOR
• Existing syntactic parsers can be trained on our
corpora in addition to WSJ [Bikel, 2004]
84
Exploiting Natural Language Syntax contd.
• Most natural extension of KRISP is to use
syntactic-tree kernel instead of string kernel
• Syntactic-tree kernel
– Introduced by Collins & Duffy [2001]
– K(x,y) = Number of subtrees common between x & y
Example: parse trees of “left side of our penalty area” and
“left side of the midfield” (trees not reproduced here)

K(x,y) = ?
85
Exploiting Natural Language Syntax contd.
• Most natural extension of KRISP is to use
syntactic-tree kernel instead of string kernel
• Syntactic-tree kernel
– Introduced by Collins & Duffy [2001]
– K(x,y) = Number of subtrees common between x & y
Example: parse trees of “left side of our penalty area” and
“left side of the midfield” (trees not reproduced here)

K(x,y) = 8
89
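A sketch of the subtree-counting recursion from Collins & Duffy [2001]: C(n1, n2) is zero unless the two nodes expand with the same production, is 1 for matching pre-terminals, and otherwise multiplies (1 + C) over aligned children; the kernel sums C over all node pairs. The tiny trees below are made up for illustration and are not meant to reproduce the running count in the figure above, which depends on exactly which subtrees are admitted.

```python
def nodes(tree):
    """Yield every node of a (label, [children]) tree; a word leaf is (word, [])."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def production(node):
    return (node[0], tuple(child[0] for child in node[1]))

def C(n1, n2):
    if production(n1) != production(n2):
        return 0
    if all(not child[1] for child in n1[1]):     # pre-terminal over a word
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):
        result *= 1 + C(c1, c2)
    return result

def tree_kernel(x, y):
    return sum(C(a, b) for a in nodes(x) for b in nodes(y) if a[1] and b[1])

t1 = ("NP", [("JJ", [("left", [])]), ("NN", [("side", [])])])
t2 = ("NP", [("JJ", [("left", [])]), ("NN", [("area", [])])])
# shared subtrees: (JJ left), (NP JJ NN), (NP (JJ left) NN)  ->  kernel value 3
print(tree_kernel(t1, t2))
```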
Exploiting Natural Language Syntax contd.
• Often the syntactic information needed is present
in dependency trees, full syntactic trees not
necessary
• Dependency trees capture most important
functional relationship between words
• Various dependency tree kernels have been used
successfully for doing Information Extraction
[Zelenko, Aone & Richardella, 2003], [Cumby &
Roth, 2003], [Culotta & Sorenson, 2004],
[Bunescu & Mooney, 2005a]
90
Short Term: Noisy NL Sentences
• If users are interacting with the semantic parser through speech,
noise can be present in many ways [Zue & Glass, 2000]
– Speech recognition errors
– Interjections (um’s and ah’s)
– Environment noise (door slams, phone rings etc.)
– Out-of-domain words and ill-formed utterances
• In KRISP, the presence of extra or corrupted words may decrease
kernel values but will not cause a hard parsing failure
• KRISP is hence more robust to noise compared to systems
with hard-matching rules like SILT and WASP, or systems
doing complete syntactic-semantic parsing like SCISSOR
91
Noisy NL Sentences contd.
• We plan to do preliminary experiments by
artificially corrupting our existing corpora
• Then we plan to obtain and experiment on a real-world
noisy corpus
92
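A toy sketch of the planned artificial corruption: randomly drop words and inject filler interjections into an NL sentence. The corruption rates and the filler list are made-up parameters, not a model of real speech-recognition noise.

```python
import random

def corrupt(sentence, drop_prob=0.1, filler_prob=0.1,
            fillers=("um", "ah"), rng=random):
    """Return a noisy copy of the sentence."""
    noisy = []
    for word in sentence.split():
        if rng.random() < filler_prob:
            noisy.append(rng.choice(fillers))      # interjection
        if rng.random() >= drop_prob:              # keep word with prob 1 - drop_prob
            noisy.append(word)
    return " ".join(noisy)

print(corrupt("our left penalty area"))
```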
Short Term: Committees of Semantic Parsers
System               Correct CLang MRs (out of 300)
KRISP                178
WASP                 185
SCISSOR              232

Committee            Upper bound on correct MRs
KRISP+WASP           223
KRISP+SCISSOR        253
WASP+SCISSOR         246
KRISP+WASP+SCISSOR   259
• Good indication that forming their committee will improve
performance.
93
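How the upper bounds in the table can be computed: a committee is credited with a test sentence if at least one member produces its correct MR. The sketch below assumes per-sentence correctness flags for each system are available.

```python
def committee_upper_bound(correct_flags, members):
    """correct_flags: {system name: [True/False per test sentence]}."""
    per_sentence = zip(*(correct_flags[m] for m in members))
    return sum(any(flags) for flags in per_sentence)

# toy usage with three test sentences
flags = {"KRISP": [True, False, False], "WASP": [False, True, False]}
print(committee_upper_bound(flags, ["KRISP", "WASP"]))   # 2
```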
Committees of Semantic Parsers contd.
Two general approaches to combine parse trees
[Henderson & Brill, 1999]
• Parser Switching: Learn which parser works best
on which types of sentences
• Parse Hybridization: Look into output MRs and
combine their best components
– Particularly useful when none of the parsers generates a
complete MR
Prior work is specific to combining syntactic parses.
We plan to explore these two general approaches for
combining MRs.
94
Long Term: Non-parallel Training Corpus
• Training data contained NL sentences aligned with their
respective MRs
• In some domains many NL sentences and semantic MRs
may be available but not aligned
– e.g., in the RoboCup commentary task [Binsted et al., 2000], NL
sentences and symbolic descriptions of events are available but not
aligned
• Referential ambiguity: Which NL description refers to
which symbolic description?
• In our present work we resolve which portion of the
sentence refers to which production of MR parse
• Same approach could be extended to one level higher
95
Non-parallel Training Corpus contd.
• Let training corpus be {(Mi, Si)|i=1..N} where each Mi is a
set of MRs and each Si is a set of NL sentences
• Align every MR in Mi to every NL sentence in Si for
i=1..N
• Use KRISP’s training algorithm to learn classifiers
• Find best alignment for MRs and NL sentences in (Mi, Si)
by semantic parsing using these classifiers
• Repeat till alignments don’t change
• We plan to first do preliminary experiments by artificially
making our corpus non-parallel and extracting the alignments,
and then test on a real-world corpus
96
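A sketch of the proposed alignment loop. The two parser-dependent steps (training KRISP's classifiers from aligned pairs, and picking the best MR/sentence alignment within a block by semantic parsing) are passed in as callables; they stand in for KRISP components and are not an existing API.

```python
def align_and_train(blocks, train_classifiers, best_alignment, max_rounds=10):
    """blocks: list of (MR set M_i, sentence set S_i) pairs."""
    # start fully ambiguous: every MR in M_i paired with every sentence in S_i
    alignment = [[(mr, nl) for mr in mrs for nl in nls] for mrs, nls in blocks]
    classifiers = None
    for _ in range(max_rounds):
        pairs = [pair for block in alignment for pair in block]
        classifiers = train_classifiers(pairs)
        new_alignment = [best_alignment(mrs, nls, classifiers)
                         for mrs, nls in blocks]
        if new_alignment == alignment:     # repeat till the alignments don't change
            break
        alignment = new_alignment
    return classifiers, alignment
```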
Long Term: Complex Relation Extraction
• Bunescu & Mooney [2005b] use string-based
kernel to extract binary relation “protein-protein
interaction” from text
• This can be viewed as learning for an MRL
grammar with only one production
INTERACTION → PROTEIN PROTEIN
• Complex relation is an n-ary relation among n
typed entities [McDonald et al., 2005]
– For example, (person, job, company)
NL sentence: John Smith is the CEO of Inc. Corp.
Extraction: (John Smith, CEO, Inc. Corp.)
97
Complex Relation Extraction contd.
(person, job, company)
  (person, job)        (job, company)
John Smith is the CEO of Inc. Corp.

KRISP should be applicable to extracting complex relations by treating a
complex relation like a higher-level production composed of lower-level
productions.
98
Conclusions
• KRISP: A new kernel-based approach to learning
semantic parsers
• String-kernel-based SVM classifiers trained for
each MRL production
• Classifiers used to compositionally build complete
MRs of NL sentences
• Evaluated on two real-world corpora
– Performs better than deterministic rule-based systems
– Performs comparably to recent statistical systems
• Proposed work: exploit NL syntax, form
committees and broaden application domains
99
Thank You!
Questions??
100
Extra: Dealing with Constants
• MRL grammar may contain productions
corresponding to constants in the domain:
STATEID → ‘new york’    RIVERID → ‘colorado’
NUM → ‘2’    STRING → ‘DR4C10’
• User can specify these as constant productions
giving their NL substrings
• Classifiers are not learned for these productions
• Matching substring’s probability is taken as 1
• If n constant productions have same substring then
each gets probability of 1/n
STATEID → ‘colorado’    RIVERID → ‘colorado’
101
Extra: Better String Subsequence Kernel
• Subsequences with gaps should be downweighted
• Decay factor λ in the range of (0,1] penalizes gaps
• All subsequences are the implicit features and
penalties are the feature values
s = “left side of our penalty area”
t = “our left penalty area”
u = left penalty
K(s,t) = 4+?
102
Extra: Better String Subsequence Kernel
• Subsequences with gaps should be downweighted
• Decay factor λ in the range of (0,1] penalizes gaps
• All subsequences are the implicit features and
penalties are the feature values
Gap of 3 => λ³
s = “left side of our penalty area”
Gap of 0 => λ⁰
t = “our left penalty area”
u = left penalty
K(s,t) = 4 + λ³·λ⁰ + ?
103
Extra: Better String Subsequence Kernel
• Subsequences with gaps should be downweighted
• Decay factor λ in the range of (0,1] penalizes gaps
• All subsequences are the implicit features and
penalties are the feature values
s = “left side of our penalty area”
t = “our left penalty area”
K(s,t) = 4 + 3λ + 3λ³ + λ⁵
104
Extra: KRISP’s Training Algorithm contd.
• What if none of the ω most probable derivations
of a sentence is correct?
• Extended Earley’s algorithm can be forced to
derive only the correct derivations by making sure
all subtrees it generates exist in the correct MR
parse
105
Extra: N-best MRs for Geo880
106
Extra: KRISP’s Average Running Times
Corpus
Average Training
Time (minutes)
Average Testing
Time (minutes)
Geo250
1.44
0.05
Geo880
18.1
0.65
CLang
58.85
3.18
Average running times per fold in minutes taken by KRISP.
107
Extra: KRISP’s Learning PR Curves on
CLang
108
Extra: KRISP’s Learning PR Curves on
Geo250
109
Extra: KRISP’s Learning PR Curves on
Geo880
110
Extra: Experimental Methodology
• Correctness
– CLang: output exactly matches the correct
representation
– Geoquery: the resulting query retrieves the same
answer as the correct representation
If the ball is in our penalty area, all our players
except player 4 should stay in our half.
Correct: ((bpos (penalty-area our))
          (do (player-except our{4}) (pos (half our))))
Output:  ((bpos (penalty-area opp))
          (do (player-except our{4}) (pos (half our))))
111
Extra: Formal Language Grammar
NL:
If our player 4 has the ball, our player 4 should shoot.
CLang: ((bowner our {4}) (do our {4} shoot))
CLang Parse:
RULE
  CONDITION
    bowner   TEAM (our)   UNUM (4)
  DIRECTIVE
    do   TEAM (our)   UNUM (4)   ACTION (shoot)
• Non-terminals: RULE, CONDITION, ACTION…
• Terminals: bowner, our, 4…
• Productions: RULE → CONDITION DIRECTIVE
DIRECTIVE → do TEAM UNUM ACTION
ACTION → shoot
112