Learning for Semantic Parsing of Natural Language

Raymond J. Mooney
Ruifang Ge, Rohit Kate, Yuk Wah Wong
John Zelle, Cynthia Thompson

Machine Learning Group
Department of Computer Sciences
University of Texas at Austin

December 19, 2005
Syntactic Natural Language Learning
• Most computational research in natural-language learning has addressed "low-level" syntactic processing.
– Morphology (e.g., past-tense generation)
– Part-of-speech tagging
– Shallow syntactic parsing (chunking)
– Syntactic parsing
Semantic Natural Language Learning
• Learning for semantic analysis has been restricted to relatively "shallow" meaning representations.
– Word sense disambiguation (e.g., SENSEVAL)
– Semantic role assignment (determining agent, patient, instrument, etc.; e.g., FrameNet, PropBank)
– Information extraction
Semantic Parsing
• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: a logical form or meaning representation (MR).
• For many applications, the desired output is immediately executable by another program.
• Two application domains:
– CLang: RoboCup Coach Language
– GeoQuery: a database query application
CLang: RoboCup Coach Language
• In the RoboCup Coach competition, teams compete to coach simulated soccer players.
• The coaching instructions are given in a formal language called CLang.

Coach: "If the ball is in our penalty area, then all our players except player 4 should stay in our half."
        ↓ Semantic Parsing
CLang: ((bpos (penalty-area our)) (do (player-except our{4}) (pos (half our))))
[Figure: simulated soccer field]
GeoQuery: A Database Query Application
• Query application for a U.S. geography database containing about 800 facts [Zelle & Mooney, 1996]

User: "How many cities are there in the US?"
        ↓ Semantic Parsing
Query: answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))
Learning Semantic Parsers
• Manually programming robust semantic parsers is difficult due to the complexity of the task.
• Semantic parsers can be learned automatically from sentences paired with their logical forms.

NL-LF training examples → Semantic-Parser Learner → Semantic Parser (natural language in, logical form out)
Engineering Motivation
• Most computational language-learning research strives for broad coverage while sacrificing depth.
– "Scaling up by dumbing down"
• Realistic semantic parsing currently entails domain dependence.
• Domain-dependent natural-language interfaces have a large potential market.
• Learning makes developing specific applications more tractable.
• Training corpora can be easily developed by tagging existing corpora of formal statements with natural-language glosses.
Cognitive Science Motivation
• Most natural-language learning methods require supervised training data that is not available to a child.
– General lack of negative feedback on grammar.
– No POS-tagged or treebank data.
• Assuming a child can infer the likely meaning of an utterance from context, NL-LF pairs are more cognitively plausible training data.
Our Semantic-Parser Learners
• CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003)
– Separates parser learning and semantic-lexicon learning.
– Learns a deterministic parser using ILP techniques.
• COCKTAIL (Tang & Mooney, 2001)
– Improved ILP algorithm for CHILL.
• SILT (Kate, Wong & Mooney, 2005)
– Learns symbolic transformation rules for mapping directly from NL to LF.
• SCISSOR (Ge & Mooney, 2005)
– Integrates semantic interpretation into Collins' statistical syntactic parser.
• WASP (Wong & Mooney, in preparation)
– Uses syntax-based statistical machine translation methods.
• KRISP (Kate & Mooney, in preparation)
– Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.
SCISSOR: Semantic Composition that Integrates Syntax and Semantics to get Optimal Representations
• Based on a fairly standard approach to compositional semantics [Jurafsky and Martin, 2000]
• A statistical parser is used to generate a semantically augmented parse tree (SAPT)
– Augments Collins' head-driven model 2 to incorporate semantic labels
• Translates a complete SAPT into a formal meaning representation (MR)

SAPT for "our player 2 has the ball":
(S-bowner (NP-player (PRP$-team our) (NN-player player) (CD-unum 2))
          (VP-bowner (VB-bowner has) (NP-null (DT-null the) (NN-null ball))))
MR: bowner(player(our,2))
Overview of SCISSOR
Training: NL sentences annotated with SAPTs are fed to the learner, which produces an integrated semantic parser.
Testing: an NL sentence is parsed into a SAPT, and ComposeMR then derives the MR from it.
SCISSOR SAPT Parser Implementation
• Semantic labels are added to Bikel's (2004) open-source version of the Collins statistical parser.
• The head-driven derivation of production rules is augmented to also generate semantic labels.
• Parameter estimation during training employs an augmented smoothing technique to account for the additional data sparsity created by semantic labels.
• Test sentences are parsed to find the most probable SAPT using a beam-search-constrained version of the standard CKY chart-parsing algorithm.
ComposeMR
MRs are composed bottom-up over the SAPT's semantic labels (for "our player 2 has the ball"):
• Leaves: our → team; player → player(_,_); 2 → unum; has → bowner(_); the, ball → null.
• The MR templates specify argument types: player(team,unum) and bowner(player).
• At the NP, the team and unum siblings fill the slots of player(_,_), giving player(our,2).
• At the VP, bowner(_) finds no argument of type player (its object NP is null), so its slot stays open.
• At the S, the subject's player(our,2) fills the open slot, giving bowner(player(our,2)).
A sketch of this composition follows.
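Below is a minimal, illustrative Python sketch of this bottom-up composition. The Node class, the SIGNATURES table mapping each predicate to its argument types, and the slot-filling strategy are simplifications invented for this example, not SCISSOR's actual data structures.

# Hypothetical argument-type table; in SCISSOR this information
# comes from the meaning-representation language's definition.
SIGNATURES = {"bowner": ["player"], "player": ["team", "unum"]}

class Node:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word

def compose(node):
    """Return (semantic type, MR term) for a subtree, or None for 'null'."""
    if node.label == "null":
        return None
    if not node.children:                       # leaf word
        if node.label in SIGNATURES:            # word introduces a predicate
            return (node.label, [node.label, [None] * len(SIGNATURES[node.label])])
        return (node.label, node.word)          # word denotes a constant
    results = [r for r in map(compose, node.children) if r is not None]
    # The child sharing this node's semantic label is the head; the other
    # children fill the head's open argument slots, matched by type.
    head = next(r for r in results if r[0] == node.label)
    pred, args = head[1]
    for i, slot_type in enumerate(SIGNATURES[pred]):
        for typ, mr in results:
            if typ == slot_type and args[i] is None:
                args[i] = mr
    return (node.label, head[1])

def render(mr):
    """Print an MR term, leaving unfilled slots as '_'."""
    if isinstance(mr, str):
        return mr
    pred, args = mr
    return "%s(%s)" % (pred, ",".join("_" if a is None else render(a) for a in args))

sapt = Node("bowner", [
    Node("player", [Node("team", word="our"),
                    Node("player", word="player"),
                    Node("unum", word="2")]),
    Node("bowner", [Node("bowner", word="has"),
                    Node("null", [Node("null", word="the"),
                                  Node("null", word="ball")])])])
print(render(compose(sapt)[1]))                 # -> bowner(player(our,2))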
WASP: A Machine Translation Approach to Semantic Parsing
• Based on a semantic grammar of the natural language.
• Uses machine translation techniques:
– Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005)
– Word alignments (Brown et al., 1993; Och & Ney, 2003)
• Hence the name: Word Alignment-based Semantic Parsing
Synchronous Context-Free Grammars (SCFG)
• Developed by Aho & Ullman (1972) as a theory of compilers that combines syntax analysis and code generation in a single phase
• Generates a pair of strings in a single derivation
Compiling, Machine Translation, and Semantic Parsing
• SCFG: formal language to formal language (compiling)
• Alignment models: natural language to natural language (machine translation)
• WASP: natural language to formal language (semantic parsing)
Context-Free Semantic Grammar
QUERY → What is CITY
CITY → the capital CITY
CITY → of STATE
STATE → Ohio

Parse of "What is the capital of Ohio":
(QUERY What is (CITY the capital (CITY of (STATE Ohio))))
Productions of Synchronous Context-Free Grammars
Each production pairs an NL pattern with an MR template:
QUERY → What is CITY / answer(CITY)
        (pattern)      (template)
• Referred to as transformation rules in Kate, Wong & Mooney (2005)
Synchronous Context-Free Grammars
A single synchronous derivation generates the NL sentence and its MR in parallel:
QUERY → What is CITY / answer(CITY)
CITY → the capital CITY / capital(CITY)
CITY → of STATE / loc_2(STATE)
STATE → Ohio / stateid('ohio')

What is the capital of Ohio
answer(capital(loc_2(stateid('ohio'))))

A toy implementation sketch follows.
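The following Python sketch, assuming just the four-rule toy grammar above, shows how applying each synchronous production rewrites the left-most occurrence of its non-terminal in the NL string and the MR string at once. It is illustrative only, not WASP's actual representation.

# Each synchronous production: (non-terminal, NL pattern, MR template).
RULES = [
    ("QUERY", "What is CITY",     "answer(CITY)"),
    ("CITY",  "the capital CITY", "capital(CITY)"),
    ("CITY",  "of STATE",         "loc_2(STATE)"),
    ("STATE", "Ohio",             "stateid('ohio')"),
]

def derive(rules, start="QUERY"):
    """Rewrite the left-most occurrence of each rule's non-terminal in
    the NL side and the MR side simultaneously (one synchronous derivation)."""
    nl = mr = start
    for lhs, pattern, template in rules:
        nl = nl.replace(lhs, pattern, 1)
        mr = mr.replace(lhs, template, 1)
    return nl, mr

nl, mr = derive(RULES)
print(nl)   # What is the capital of Ohio
print(mr)   # answer(capital(loc_2(stateid('ohio'))))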
Parsing Model of WASP
• N (non-terminals) = {QUERY, CITY, STATE, …}
• S (start symbol) = QUERY
• Tm (MRL terminals) = {answer, capital, loc_2, (, ), …}
• Tn (NL words) = {What, is, the, capital, of, Ohio, …}
• L (lexicon):
QUERY → What is CITY / answer(CITY)
CITY → the capital CITY / capital(CITY)
CITY → of STATE / loc_2(STATE)
STATE → Ohio / stateid('ohio')
• λ (parameters of probabilistic model) = ?
Probabilistic Parsing Model
Derivation d1 of "capital of Ohio", yielding capital(loc_2(stateid('ohio'))):
CITY → capital CITY / capital(CITY)
CITY → of STATE / loc_2(STATE)
STATE → Ohio / stateid('ohio')
Probabilistic Parsing Model
Derivation d2 of "capital of Ohio", yielding capital(loc_2(riverid('ohio'))):
CITY → capital CITY / capital(CITY)
CITY → of RIVER / loc_2(RIVER)
RIVER → Ohio / riverid('ohio')
Probabilistic Parsing Model
Each rule has a weight λ; a derivation is scored by the sum of the weights of the rules it uses (a worked sketch follows):

d1: CITY → capital CITY / capital(CITY)   0.5
    CITY → of STATE / loc_2(STATE)        0.3
    STATE → Ohio / stateid('ohio')        0.5
Pr(d1 | capital of Ohio) = exp(1.3) / Z

d2: CITY → capital CITY / capital(CITY)   0.5
    CITY → of RIVER / loc_2(RIVER)        0.05
    RIVER → Ohio / riverid('ohio')        0.5
Pr(d2 | capital of Ohio) = exp(1.05) / Z

Z is a normalization constant.
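A small Python sketch of this arithmetic, using the toy weights from the slide rather than learned parameters: each derivation's summed rule weights are exponentiated and normalized over the candidate derivations of the same phrase.

import math

# Toy rule weights (lambda) from the slide.
WEIGHTS = {
    "CITY -> capital CITY / capital(CITY)": 0.5,
    "CITY -> of STATE / loc_2(STATE)":      0.3,
    "STATE -> Ohio / stateid('ohio')":      0.5,
    "CITY -> of RIVER / loc_2(RIVER)":      0.05,
    "RIVER -> Ohio / riverid('ohio')":      0.5,
}

d1 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of STATE / loc_2(STATE)",
      "STATE -> Ohio / stateid('ohio')"]
d2 = ["CITY -> capital CITY / capital(CITY)",
      "CITY -> of RIVER / loc_2(RIVER)",
      "RIVER -> Ohio / riverid('ohio')"]

def prob(d, candidates):
    """Pr(d | e) = exp(sum of rule weights in d) / Z, where Z sums over
    all candidate derivations of the same sentence e."""
    score = lambda deriv: math.exp(sum(WEIGHTS[r] for r in deriv))
    return score(d) / sum(score(c) for c in candidates)

print(prob(d1, [d1, d2]))   # exp(1.3) / (exp(1.3) + exp(1.05)) ~= 0.56
print(prob(d2, [d1, d2]))   # ~= 0.44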
Parsing Model of WASP
• N (non-terminals) = {QUERY, CITY, STATE, …}
• S (start symbol) = QUERY
• Tm (MRL terminals) = {answer, capital, loc_2, (, ), …}
• Tn (NL words) = {What, is, the, capital, of, Ohio, …}
• L (lexicon):
QUERY → What is CITY / answer(CITY)
CITY → the capital CITY / capital(CITY)
CITY → of STATE / loc_2(STATE)
STATE → Ohio / stateid('ohio')
• λ (parameters of probabilistic model)
Overview of WASP
Training: from an unambiguous CFG of the MRL and a training set {(e, f)}, lexical acquisition produces the lexicon L, and parameter estimation then yields a parsing model parameterized by λ.
Testing: semantic parsing maps an input sentence e′ to an output MR f′.
Lexical Acquisition
• Transformation rules are extracted from word alignments between an NL sentence, e, and its correct MR, f, for each training example, (e, f).
Word Alignments
Le programme a été mis en application
And the program has been implemented
• A mapping from French words to their meanings expressed in English
Lexical Acquisition
• Train a statistical word alignment model (IBM Model 5) on the training set
• Obtain the most probable n-to-1 word alignments for each training example
• Extract transformation rules from these word alignments
• The lexicon L consists of all extracted transformation rules
Word Alignment for Semantic Parsing
The goalie should always stay in our half
( ( true ) ( do our { 1 } ( pos ( half our ) ) ) )
• How to introduce syntactic tokens such as parens?
Use of MRL Grammar
NL words are aligned n-to-1 to the productions of a top-down, left-most derivation of the MR under the unambiguous MRL CFG:

The goalie should always stay in our half
↕ (n-to-1 alignment)
RULE → (CONDITION DIRECTIVE)
CONDITION → (true)
DIRECTIVE → (do TEAM {UNUM} ACTION)
TEAM → our
UNUM → 1
ACTION → (pos REGION)
REGION → (half TEAM)
TEAM → our
Extracting Transformation Rules
Rules are extracted bottom-up from the alignment, replacing each covered NL span with its non-terminal:
• "our" is aligned to TEAM → our, giving: TEAM → our / our
• "TEAM half" is then covered by REGION → (half TEAM), giving: REGION → TEAM half / (half TEAM)
• "stay in REGION" is covered by ACTION → (pos REGION), giving: ACTION → stay in REGION / (pos REGION)
Probabilistic Parsing Model
• Based on a maximum-entropy model:

$$\Pr_\lambda(d \mid e) = \frac{1}{Z_\lambda(e)} \exp \sum_i \lambda_i f_i(d)$$

• The features f_i(d) are the number of times each transformation rule is used in the derivation d
• The output translation is the MR yield of the most probable derivation:

$$f^* = m\Big(\arg\max_d \Pr_\lambda(d \mid e)\Big)$$

where m(d) denotes the MR yielded by derivation d.
Parameter Estimation
• Maximum conditional log-likelihood criterion:

$$\lambda^* = \arg\max_\lambda \sum_{(e,f)} \log \Pr_\lambda(f \mid e)$$

• Since correct derivations are not included in the training data, the parameters λ* are learned in an unsupervised manner
• EM algorithm combined with improved iterative scaling, where the hidden variables are the correct derivations (Riezler et al., 2000)
KRISP: Kernel-based Robust Interpretation by Semantic Parsing
• Learns a semantic parser from NL sentences paired with their respective MRs, given the MRL grammar
• Productions of the MRL are treated like semantic concepts
• An SVM classifier is trained for each production with a string subsequence kernel
• These classifiers are used to compositionally build MRs of the sentences
Kernel Functions
• A kernel K is a similarity function over a domain X that maps any two objects x, y in X to their similarity score K(x, y)
• If, for any x1, x2, …, xn in X, the n-by-n matrix (K(xi, xj))ij is symmetric and positive semidefinite, then the kernel computes the dot product of implicit feature vectors in some high-dimensional feature space
• Machine learning algorithms that use the data only to compute similarities can be kernelized (e.g., support vector machines, nearest neighbor, etc.)
String Subsequence Kernel
• Define the kernel between two strings as the number of common subsequences between them [Lodhi et al., 2002]
• All possible subsequences become the implicit feature vectors, and the kernel computes their dot product

s = "left side of our penalty area"
t = "our left penalty area"
Counting the common word subsequences one at a time (left, our, penalty, area, left penalty, …) gives K(s, t) = 11.
Normalized String Subsequence Kernel
• Normalize the kernel (to the range [0, 1]) to remove any bias due to different string lengths:

$$K_{\text{normalized}}(s,t) = \frac{K(s,t)}{\sqrt{K(s,s) \cdot K(t,t)}}$$

• Lodhi et al. [2002] give an O(n|s||t|) algorithm for computing the string subsequence kernel
• Used for text categorization [Lodhi et al., 2002] and information extraction [Bunescu & Mooney, 2005b]
A sketch of the computation follows.
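Below is a minimal Python sketch of the unweighted word-subsequence kernel from the running example: a simple dynamic program that counts pairs of identical subsequences, one drawn from each string. Lodhi et al.'s (2002) kernel additionally decays each match by its gap length; that decay is omitted here so the counts match the slides.

import math

def subseq_kernel(s, t):
    """Number of common (word) subsequences of s and t, counted per
    pair of occurrences."""
    s, t = s.split(), t.split()
    # C[i][j] = common-subsequence pairs of s[:i] and t[:j],
    # counting the empty subsequence once (hence the border of 1s).
    C = [[1] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            C[i][j] = C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                C[i][j] += C[i - 1][j - 1]
    return C[-1][-1] - 1          # drop the empty subsequence

def normalized_kernel(s, t):
    return subseq_kernel(s, t) / math.sqrt(
        subseq_kernel(s, s) * subseq_kernel(t, t))

s = "left side of our penalty area"
t = "our left penalty area"
print(subseq_kernel(s, t))                  # 11, as in the slides
print(round(normalized_kernel(s, t), 3))    # ~0.358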
Support Vector Machines
• SVMs are classifiers that learn linear separators that maximize the margin between the data and the classification boundary.
• Kernels allow SVMs to learn non-linear separators by implicitly mapping the data to a higher-dimensional feature space.
Overview of KRISP
Training: given the MRL grammar and NL sentences with their MRs, collect positive and negative examples and train the string-kernel-based SVM classifiers; the best semantic derivations (correct and incorrect) found by the resulting parser supply the examples for the next training iteration.
Testing: the trained semantic parser maps novel NL sentences to their best MRs.
Overview of KRISP's Semantic Parsing
• We first define the semantic derivation of an NL sentence
• We then define the probability of a semantic derivation
• Semantic parsing of an NL sentence amounts to finding its most probable semantic derivation
• Obtaining the MR from a semantic derivation is straightforward
Semantic Derivation of an NL Sentence
MR parse with non-terminals on the nodes, for "Which rivers run through the states bordering Texas?":
(ANSWER answer (RIVER (TRAVERSE_2 traverse_2) (STATE (NEXT_TO next_to) (STATE (STATEID stateid 'texas')))))
Semantic Derivation of an NL Sentence
The same MR parse with productions on the nodes:
ANSWER → answer(RIVER)
RIVER → TRAVERSE_2(STATE)
TRAVERSE_2 → traverse_2
STATE → NEXT_TO(STATE)
NEXT_TO → next_to
STATE → STATEID
STATEID → 'texas'
Which rivers run through the states bordering Texas?
Semantic Derivation of an NL Sentence
Semantic derivation: each node of the MR parse covers a substring of the NL sentence (shown with explicit word spans on the next slide).
Semantic Derivation of an NL Sentence
Each node contains a production and the span of NL words it covers (word positions 1-9 of "Which rivers run through the states bordering Texas ?"):
(ANSWER → answer(RIVER), [1..9])
(RIVER → TRAVERSE_2(STATE), [1..9])
(TRAVERSE_2 → traverse_2, [1..4])
(STATE → NEXT_TO(STATE), [5..9])
(NEXT_TO → next_to, [5..7])
(STATE → STATEID, [8..9])
(STATEID → 'texas', [8..9])
Probability of a Semantic Derivation
• Let Pπ(s[i..j]) be the probability that production π covers the substring s[i..j]; e.g., P_{NEXT_TO → next_to}("the states bordering") for the node (NEXT_TO → next_to, [5..7])
• These probabilities are obtained from the string-kernel-based SVM classifier trained for each production π
• Probability of a semantic derivation D (a minimal sketch follows):

$$P(D) = \prod_{(\pi,[i..j]) \in D} P_\pi(s[i..j])$$
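A trivial Python sketch of this product; the per-node probabilities below are made-up numbers standing in for the SVM classifier outputs.

def derivation_probability(derivation):
    """derivation: list of (production, P_pi(covered substring)) pairs."""
    p = 1.0
    for production, p_cover in derivation:
        p *= p_cover
    return p

d = [("ANSWER -> answer(RIVER)",    0.98),
     ("RIVER -> TRAVERSE_2(STATE)", 0.91),
     ("TRAVERSE_2 -> traverse_2",   0.85),  # covers "Which rivers run through"
     ("STATE -> NEXT_TO(STATE)",    0.88),
     ("NEXT_TO -> next_to",         0.79),  # covers "the states bordering"
     ("STATEID -> 'texas'",         0.99)]  # covers "Texas ?"
print(derivation_probability(d))            # ~0.52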
Computing the Most Probable Semantic Derivation
• Implemented by extending Earley's [1970] context-free grammar parsing algorithm
• A dynamic programming algorithm that generates and compactly stores each subtree once
• Performs a greedy approximate search with beam width ω and returns the ω most probable derivations it finds
KRISP's Training Algorithm
• Takes NL sentences paired with their respective MRs as input
• Obtains MR parses
• Proceeds in iterations
• In the first iteration, for every production π (see the sketch below):
– Call those sentences positive whose MR parses use that production
– Call the remaining sentences negative
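A minimal sketch of this first-iteration labeling, assuming each training pair's MR parse has already been reduced to its set of productions (a simplification for illustration; the second corpus entry's production is invented):

def first_iteration_examples(production, corpus):
    """corpus: iterable of (sentence, set of productions in its MR parse).
    A sentence is positive for `production` iff its MR parse uses it."""
    positives, negatives = [], []
    for sentence, mr_productions in corpus:
        if production in mr_productions:
            positives.append(sentence)
        else:
            negatives.append(sentence)
    return positives, negatives

corpus = [
    ("which rivers run through the states bordering texas ?",
     {"STATE -> NEXT_TO(STATE)", "NEXT_TO -> next_to"}),
    ("what state has the highest population ?",
     {"STATE -> largest(POPULATION, STATE)"}),   # hypothetical production
]
pos, neg = first_iteration_examples("STATE -> NEXT_TO(STATE)", corpus)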
KRISP's Training Algorithm contd.: First Iteration
For the production STATE → NEXT_TO(STATE), a string-kernel-based SVM classifier P_{STATE → NEXT_TO(STATE)}(s[i..j]) is trained on:

Positives:
• which rivers run through the states bordering texas ?
• what is the most populated state bordering oklahoma ?
• what is the largest city in states that border california ?
…

Negatives:
• what state has the highest population ?
• which states have cities named austin ?
• what states does the delaware river run through ?
• what is the lowest point of the state with the largest area ?
…
KRISP's Training Algorithm contd.
• Using these classifiers Pπ(s[i..j]), obtain the ω best semantic derivations of each training sentence
• Some of these derivations give the correct MR (correct derivations); others give incorrect MRs (incorrect derivations)
• For the next iteration, collect positives from the most probable correct derivation
• Collect negatives from incorrect derivations with higher probability than the most probable correct derivation
KRISP's Training Algorithm contd.: Next Iteration
For STATE → NEXT_TO(STATE), positives are now substrings covered by the production in the most probable correct derivations, and negatives come from higher-scoring incorrect derivations:

Positives:
• the states bordering texas ?
• state bordering oklahoma ?
• states that border california ?
• states which share border
• next to state of iowa
…

Negatives:
• what state has the highest population ?
• what states does the delaware river run through ?
• which states have cities named austin ?
• what is the lowest point of the state with the largest area ?
• which rivers run through states bordering
…

These examples retrain the string-kernel-based SVM classifier P_{STATE → NEXT_TO(STATE)}(s[i..j]).
Experimental Corpora
• CLang
– 300 randomly selected pieces of coaching advice from the log files of the 2003 RoboCup Coach Competition
– 22.52 words on average in the NL sentences
– 14.24 tokens on average in the formal expressions
• GeoQuery [Zelle & Mooney, 1996]
– 250 queries for the given U.S. geography database
– 6.87 words on average in the NL sentences
– 5.32 tokens on average in the formal expressions
Experimental Methodology
• Evaluated using standard 10-fold cross validation
• Correctness
– CLang: the output exactly matches the correct representation
– GeoQuery: the resulting query retrieves the same answer as the correct representation
• Metrics (computed in the sketch below):

$$\text{Precision} = \frac{|\text{Correct Completed Parses}|}{|\text{Completed Parses}|} \qquad \text{Recall} = \frac{|\text{Correct Completed Parses}|}{|\text{Sentences}|}$$
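As a concrete reading of these metrics, here is a tiny sketch with invented counts; note that precision divides by completed parses only, while recall divides by all test sentences.

def precision_recall(correct, completed, sentences):
    """correct: correct completed parses; completed: all completed parses;
    sentences: all test sentences. Counts below are illustrative."""
    return correct / completed, correct / sentences

p, r = precision_recall(correct=80, completed=90, sentences=100)
print(p, r)   # 0.888..., 0.8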
Precision Learning Curve for CLang [figure]
Recall Learning Curve for CLang [figure]
Precision Learning Curve for GeoQuery [figure]
Recall Learning Curve for GeoQuery [figure]
Future Work
• Explore methods that can automatically generate SAPTs to minimize the annotation effort for SCISSOR.
• Learn semantic parsers just from sentences paired with "perceptual context."
Conclusions
• Learning semantic parsers is an important and challenging problem in natural-language learning.
• We have obtained promising results on several applications using a variety of approaches with different strengths and weaknesses.
• Not many others have explored this problem; I would encourage others to consider it.
• More and larger corpora are needed for training and testing semantic parser induction.
Thank You!
Our papers on learning semantic parsers are on-line at:
http://www.cs.utexas.edu/~ml/publication/lsp.html
Our corpora can be downloaded from:
http://www.cs.utexas.edu/~ml/nldata.html
Questions??