Deep Learning for Question Answering

Jordan Boyd-Graber
University of Colorado Boulder
19 November 2014
Plan
Why Deep Learning
Review of Logistic Regression
Can’t Somebody else do it? (Feature Engineering)
Deep Learning from Data
Tricks and Toolkits
Toolkits for Deep Learning
Quiz Bowl
Deep Quiz Bowl
Why Deep Learning
Deep Learning was once known as “Neural Networks”
But it came back . . .

• More data
  [Screenshot: "Political Ideology Detection Using Recursive Neural Networks" by Iyyer, Enns, Boyd-Graber, and Resnik (Computer Science, Linguistics, iSchool, and UMIACS, University of Maryland)]
• Better tricks (regularization)
• Faster computers
And companies are investing . . .
Review of Logistic Regression
Map inputs to output

Input: a vector x_1 . . . x_d, with inputs encoded as real numbers.

Output: multiply the inputs by weights, add a bias, and pass the result through a nonlinear sigmoid:

  f\left( \sum_i W_i x_i + b \right), \qquad f(z) \equiv \frac{1}{1 + \exp(-z)}
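In code, the whole model is one line of arithmetic. A minimal NumPy sketch (not from the slides; all names illustrative):

```python
import numpy as np

def sigmoid(z):
    # Nonlinear squashing function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, W, b):
    # Logistic regression: weighted sum of inputs plus bias,
    # passed through the sigmoid
    return sigmoid(np.dot(W, x) + b)

x = np.array([1.0, 0.5, -2.0])   # example input vector x_1 ... x_d
W = np.array([0.2, -0.1, 0.4])   # one weight per input
b = 0.1                          # bias
print(predict(x, W, b))
```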
What’s a sigmoid?

[Figure: plot of the sigmoid function f(z) = 1/(1 + exp(-z))]
In the shallow end

• This is still logistic regression
• Engineering features x is difficult (and requires expertise)
• Can we learn how to represent the inputs that feed the final decision?
Can’t Somebody else do it? (Feature Engineering)
Learn the features and the function

Each hidden unit is itself a small logistic regression over the inputs:

  a_1^{(2)} = f\left( W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)} \right)
  a_2^{(2)} = f\left( W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)} \right)
  a_3^{(2)} = f\left( W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)} \right)

The output layer then combines the learned hidden activations:

  h_{W,b}(x) = a_1^{(3)} = f\left( W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)} \right)
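A minimal sketch of the corresponding forward pass, assuming the same sigmoid f and one hidden layer (illustrative code, not the slides'):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: each row of W1 defines one learned feature a_i^(2)
    a = sigmoid(W1 @ x + b1)
    # Output layer: a logistic regression over the learned features
    return sigmoid(W2 @ a + b2)

x = np.array([1.0, 0.5, -2.0])
W1 = np.random.randn(3, 3) * 0.01  # layer-1 weights W^(1)
b1 = np.zeros(3)                   # layer-1 bias b^(1)
W2 = np.random.randn(1, 3) * 0.01  # layer-2 weights W^(2)
b2 = np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```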
Objective Function

• For every example x, y of our supervised training set, we want the label y to match the prediction h_{W,b}(x):

  J(W, b; x, y) \equiv \frac{1}{2} \| h_{W,b}(x) - y \|^2    (1)

• We want this value, summed over all of the examples, to be as small as possible
• We also want the weights not to be too large:

  \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2    (2)

  (the sums run over all layers l, all source nodes i, and all destination nodes j)
Objective Function

Putting it all together:

  J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \| h_{W,b}(x^{(i)}) - y^{(i)} \|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2    (3)

• Our goal is to minimize J(W, b) as a function of W and b
• Initialize W and b to small random values near zero
• Adjust parameters to optimize J
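As a sketch, equation (3) translates directly into a loss function; `forward`, `W_list`, and `b_list` are assumed helpers here, not part of the slides:

```python
import numpy as np

def objective(W_list, b_list, X, Y, lam, forward):
    # Equation (3): average squared error plus weight decay.
    # `forward` runs the network on one example; `lam` is the
    # weight-decay strength (lambda on the slide).
    m = len(X)
    data_term = sum(0.5 * np.sum((forward(x, W_list, b_list) - y) ** 2)
                    for x, y in zip(X, Y)) / m
    decay_term = 0.5 * lam * sum(np.sum(W ** 2) for W in W_list)
    return data_term + decay_term
```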
Deep Learning from Data
Gradient Descent

Goal: Optimize J with respect to the variables W and b.

[Figure: starting from an initial parameter value (step 0), successive steps (1, 2, 3) move downhill on the objective toward the minimum in the still-unexplored "Undiscovered Country"]

Each weight is nudged against its gradient:

  W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}
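A toy sketch of the descent loop on a one-dimensional objective (purely illustrative):

```python
# Gradient descent on the toy objective J(w) = (w - 3)^2,
# whose gradient is 2 * (w - 3); alpha is the step size.
w, alpha = 0.0, 0.1
for step in range(100):
    grad = 2 * (w - 3)       # dJ/dw
    w = w - alpha * grad     # move against the gradient
print(w)                     # approaches the minimum at w = 3
```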
Backpropagation

• For convenience, write the input to the sigmoid as

  z_i^{(l)} = \sum_{j=1}^{n} W_{ij}^{(l-1)} x_j + b_i^{(l-1)}    (4)

• The gradient is a function of a node’s error \delta_i^{(l)}
• For output nodes, the error is obvious:

  \delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \| y - h_{W,b}(x) \|^2 = -\left( y_i - a_i^{(n_l)} \right) \cdot f'\left( z_i^{(n_l)} \right)    (5)

• Other nodes must “backpropagate” downstream error based on connection strength (the chain rule):

  \delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l+1)} \delta_j^{(l+1)} \right) f'\left( z_i^{(l)} \right)    (6)
Partial Derivatives

• For weights, the partial derivatives are

  \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)}    (7)

• For the bias terms, the partial derivatives are

  \frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}    (8)

• But this is just for a single example . . .
Full Gradient Descent Algorithm

1. Initialize U^{(l)} and V^{(l)} as zero
2. For each example i = 1 . . . m
   2.1 Use backpropagation to compute \nabla_W J and \nabla_b J
   2.2 Update weight shifts U^{(l)} = U^{(l)} + \nabla_{W^{(l)}} J(W, b; x, y)
   2.3 Update bias shifts V^{(l)} = V^{(l)} + \nabla_{b^{(l)}} J(W, b; x, y)
3. Update the parameters

  W^{(l)} = W^{(l)} - \alpha \left[ \frac{1}{m} U^{(l)} + \lambda W^{(l)} \right]    (9)
  b^{(l)} = b^{(l)} - \alpha \left[ \frac{1}{m} V^{(l)} \right]    (10)

4. Repeat until weights stop changing
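A self-contained sketch of the full algorithm, implementing equations (4) through (10) for one hidden layer (a minimal illustration, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, sizes=(3, 3, 1), alpha=0.5, lam=1e-4, epochs=200):
    rng = np.random.default_rng(0)
    # Small random initial weights, zero biases (initialize near zero)
    W = [rng.normal(0.0, 0.01, (sizes[l + 1], sizes[l])) for l in range(2)]
    b = [np.zeros(sizes[l + 1]) for l in range(2)]
    m = len(X)
    for _ in range(epochs):
        U = [np.zeros_like(w) for w in W]  # accumulated weight shifts U^(l)
        V = [np.zeros_like(v) for v in b]  # accumulated bias shifts V^(l)
        for x, y in zip(X, Y):
            # Forward pass, keeping the activations a^(l) (eq. 4)
            a = [x]
            for w, bias in zip(W, b):
                a.append(sigmoid(w @ a[-1] + bias))
            # Output error (eq. 5); for the sigmoid, f'(z) = a * (1 - a)
            delta = -(y - a[-1]) * a[-1] * (1 - a[-1])
            for l in (1, 0):
                U[l] += np.outer(delta, a[l])  # eq. 7
                V[l] += delta                  # eq. 8
                if l > 0:                      # backpropagate (eq. 6)
                    delta = (W[l].T @ delta) * a[l] * (1 - a[l])
        # Parameter updates (eqs. 9 and 10)
        for l in range(2):
            W[l] -= alpha * (U[l] / m + lam * W[l])
            b[l] -= alpha * (V[l] / m)
    return W, b

X = [np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0])]
Y = [np.array([0.0]), np.array([1.0])]
W, b = train(X, Y)
```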
Tricks and Toolkits
Tricks

• Stochastic gradient: compute the gradient from a few examples at a time
• Hardware: do matrix computations on GPUs
• Dropout: randomly set some inputs to zero (sketched below)
• Initialization: using an autoencoder can help the representation
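A sketch of the dropout trick; the 1/(1-p) rescaling ("inverted dropout") is a common variant, assumed here rather than taken from the slides:

```python
import numpy as np

def dropout(a, p=0.5, rng=np.random.default_rng()):
    # Randomly zero each activation with probability p; scaling the
    # survivors by 1/(1-p) keeps the expected value unchanged.
    # Applied at training time only.
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

h = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(h))
```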
Toolkits for Deep Learning
• Theano: Python package (Yoshua Bengio)
• Torch7: Lua package (Yann LeCun)
• ConvNetJS: Javascript package (Andrej Karpathy)
• All of these automatically compute gradients and include numerical optimization (see the Theano sketch below)
• Working group this summer at UMD
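For example, automatic differentiation in Theano might look like this minimal logistic-regression sketch (names illustrative; a sketch of the Theano API, not code from the talk):

```python
import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
y = T.scalar('y')
w = theano.shared(np.zeros(3), name='w')
b = theano.shared(0.0, name='b')

p = T.nnet.sigmoid(T.dot(w, x) + b)            # model prediction
loss = -y * T.log(p) - (1 - y) * T.log(1 - p)  # logistic loss

# Theano derives the gradients symbolically: no hand-written backprop
gw, gb = T.grad(loss, [w, b])
train = theano.function(
    inputs=[x, y], outputs=loss,
    updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])

train(np.array([1.0, 0.5, -2.0]), 1.0)
```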
Quiz Bowl
Humans doing Incremental Classification

• Game called “quiz bowl”
• Two teams play each other
    Moderator reads a question
    When a team knows the answer, they signal (“buzz” in)
    If right, they get points; otherwise, the rest of the question is read to the other team
• Hundreds of teams in the US alone
• Example . . .
Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no moving parts. He did not take interaction with neighbors into account when formulating his theory of heat capacity, so Debye adjusted the theory for low temperatures. His summation convention automatically sums repeated indices in tensor products. His name is attached to the A and B coefficients for spontaneous and stimulated emission, the subject of one of his multiple groundbreaking 1905 papers. He further developed the model of statistics sent to him by Bose to describe particles with integer spin. For 10 points, who is this German physicist best known for formulating the special and general theories of relativity?

Albert Einstein

Faster = Smarter
1. Colorado School of Mines
2. Brigham Young University
3. California Institute of Technology
4. Harvey Mudd College
5. University of Colorado
Humans doing Incremental Classification

• This is not Jeopardy (Watson)
• There are buzzers, but players can only buzz at the end of a question
• Doesn’t discriminate knowledge
• Quiz bowl questions are pyramidal
Humans doing Incremental Classification

• Thousands of questions are written every year
• Large question databases
• Teams practice on these questions (some online, e.g., IRC)
• How can we learn from this?
System for Incremental Classifiers

• Treat this as an MDP
• Action: buzz now or wait

1. Content Model is constantly generating guesses
2. Oracle provides examples where it is correct
3. The Policy generalizes to test data
4. Features represent our state

[Pipeline: content model → oracle → policy → features]
Content Model

• Bayesian generative model with answers as latent state
• Unambiguous Wikipedia pages
• Unigram term weightings (naïve Bayes, BM25)
• Maintains posterior distribution over guesses
• Always has a guess of what it should answer; the policy will tell us when to trust it
Vector Space Model

[Figure: terms such as “arabian”, “persian”, “gulf”, “kingdom”, and “expatriates” define dimensions of a vector space; answers like Qatar and Bahrain fall close together, while unrelated answers like Gummibears and Pokemon fall far away; a new question “?” is matched to its nearest answers]
Oracle

As the question is revealed, the content model’s guess changes (correct answer: Lyndon Johnson):

• “The National Endowment for the Arts, War o...” → Martha Graham
• “The National Endowment for the Arts, War on Poverty, and Medicare were established by, for 10 points, what Texan” → George Bush
• “The National Endowment for the Arts, War on Poverty, and Medicare were established by, for 10 points, what Texan who defeated Barry Goldwater, promoted the Great Society, and succeeded John F. Kennedy?” → Lyndon Johnson

• As each token is revealed, look at the content model’s guess
• If it’s right, positive instance; otherwise negative
• Nearly optimal policy to buzz whenever correct (upper bound)
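A sketch of generating the oracle labels (the `content_model.guess` interface is hypothetical):

```python
def oracle_labels(tokens, content_model, answer):
    # Replay the question token by token and label each position with
    # the oracle action: buzz as soon as the guess is correct.
    labels = []
    for t in range(1, len(tokens) + 1):
        guess = content_model.guess(tokens[:t])
        labels.append("buzz" if guess == answer else "wait")
    return labels
```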
Policy

• Mapping: state ↦ action
• Use oracle as example actions
• Learned as classifier (Langford et al., 2005)
• At test time, use the same features as for training:
    Question text (so far)
    Guess
    Posterior distribution
    Change in posterior
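A hedged sketch of learning the policy as a classifier, with scikit-learn standing in for the actual learner (feature values and names are illustrative; see the worked example on the next slide):

```python
from sklearn.linear_model import LogisticRegression

# Each state is a feature vector: top posterior values plus the
# token index, as in the example states below.
states = [[0.02, 0.01, 0.01, 5],    # early, uncertain -> wait
          [0.11, 0.09, 0.08, 21],   # still ambiguous  -> wait
          [0.89, 0.02, 0.01, 55]]   # confident        -> buzz
actions = ["wait", "wait", "buzz"]  # labels from the oracle

policy = LogisticRegression().fit(states, actions)
print(policy.predict([[0.85, 0.03, 0.01, 50]]))
```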
Features (by example)

Step 1: Observation: “This man won the Battle”
    Guesses: 0.02 Tokugawa, 0.01 Erwin Rommel, 0.01 Joan of Arc, 0.01 Stephen Crane
    State: idx: 05, ftp: f, gss: tokugawa, top_1: 0.02, top_2: 0.01, top_3: 0.01, this_man: 01, won: 01, battle: 01
    Action: Wait

Step 2: Observation: “This man won the Battle of Zela over Pontus. He wrote about his victory at Alesia in his Commentaries on the”
    Guesses: 0.11 Mithridates, 0.09 Julius Caesar, 0.08 Alexander the Great, 0.07 Sulla
    State: idx: 21, ftp: f, gss: mithridates, top_1: 0.11, top_2: 0.09, top_3: 0.08, new: true, delta: +0.09, this_man: 01, commentaries: 01, pontus: 01, …
    Action: Wait

Step 3: Observation: “This man won the Battle of Zela over Pontus. He wrote about his victory at Alesia in his Commentaries on the Gallic Wars. FTP, name this Roman”
    Guesses: 0.89 Julius Caesar, 0.02 Augustus, 0.01 Sulla, 0.01 Pompey
    State: idx: 55, ftp: t, gss: j_caesar, top_1: 0.89, top_2: 0.02, top_3: 0.01, new: true, delta: +0.78, this_man: 01, this_roman: 01, gallic: 01, …
    Action: Buzz

Answer: Julius Caesar
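A sketch of building such a state representation (hypothetical helper; feature names follow the example above):

```python
def state_features(tokens, posterior):
    # Guess posteriors plus bag-of-words indicators from revealed text.
    top = sorted(posterior.items(), key=lambda kv: -kv[1])[:3]
    features = {
        "idx": len(tokens),                            # tokens revealed
        "ftp": "for 10 points" in " ".join(tokens).lower(),
        "gss": top[0][0],                              # current best guess
    }
    for rank, (_, prob) in enumerate(top, 1):
        features["top_%d" % rank] = prob               # top_1, top_2, top_3
    for token in tokens:                               # text indicators
        features[token.lower()] = 1
    return features
```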
Simulating a Game

• Present tokens incrementally to the algorithm, see where it buzzes
• Compare to where humans buzzed in
• Payoff matrix (w.r.t. the computer), sketched in code below:

     Computer             Human                Payoff
  1  first and wrong      right                 -15
  2  —                    first and correct     -10
  3  first and wrong      wrong                  -5
  4  first and correct    —                     +10
  5  wrong                first and wrong        +5
  6  right                first and wrong       +15
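The payoff matrix as a lookup table (a minimal sketch; `None` stands for the “—” cells):

```python
# Payoff from the computer's side, keyed by (computer, human) outcome.
PAYOFF = {
    ("first and wrong", "right"): -15,
    (None, "first and correct"): -10,
    ("first and wrong", "wrong"): -5,
    ("first and correct", None): +10,
    ("wrong", "first and wrong"): +5,
    ("right", "first and wrong"): +15,
}

print(PAYOFF[("first and correct", None)])  # computer buzzes first, correctly
```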
Interface

• Users could “veto” categories or tournaments
• Questions presented in canonical order
• Approximate string matching (w/ override)
Interface

• Started on Amazon Mechanical Turk
• 7000 questions were answered in the first day
• Over 43000 questions were answered in the space of two weeks
• Total of 461 unique users
• Leaderboard to encourage users
Error Analysis

[Figure: posterior probability of guesses vs. tokens revealed for a question on Maurice Ravel. The opponent buzzes early; the system’s “maurice ravel” posterior rises only after late clues (“this_french”, “bolero”, “orchestrated”, “composer”, “pictures”), and a wrong prediction (“pictures at an exhibition”) competes with the answer]

What goes wrong:
• Too slow
• Coreference and correct question type
• Not enough information / not weighting later clues higher

[Figures: the same analysis for an Enrico Fermi question (clues: “this_man”, “zero”, “magnetic”, “magnetic field”, “this_italian”, “paradox”, “neutrino”) and for a George Washington question (wrong guesses “charlemagne” and “yasunari kawabata”; clues: “generals”, “language”, “chill”, “mistress”)]
How can we do better?

• Use the order of words in a sentence: “this man shot Lee Harvey Oswald” is very different from “Lee Harvey Oswald shot this man”
• Use relationships between questions (“China” and “Taiwan”)
• Use learned features and dimensions, not the words we start with
• Recursive Neural Networks (Socher et al., 2012)
• First time a learned representation has been applied to question answering
Deep Quiz Bowl
Using Compositionality

[Figure: a dependency tree over the words “this”, “chinese”, “admiral”, “attacked”, “kotte”, “in”, “ceylon”; node vectors are composed bottom-up from the leaves, with “?” marking the node currently being computed]
Using Compositionality

[Excerpt from the accompanying paper:]

The composition equation for any node n with children K(n) and word vector x_w is

  h_n = f\left( W_v \cdot x_w + b + \sum_{k \in K(n)} W_{R(n,k)} \cdot h_k \right),    (4)

where R(n, k) is the dependency relation between node n and child node k. Repeating this process up to the root gives, for example,

  h_{depended} = f\left( W_{NSUBJ} \cdot h_{economy} + W_{PREP} \cdot h_{on} + W_v \cdot x_{depended} + b \right).    (3)

Training: Our goal is to map questions to their corresponding answer entities. Because there are a limited number of possible answers, we can view this as a multi-class classification task. While a softmax layer over every node in the tree could predict answers (Socher et al., 2011; Iyyer et al., 2014), this method overlooks that most answers are themselves words (features) in other questions (e.g., a question on World …). Intuitively, we want to encourage the vectors of question sentences to be near their correct answers and far away from incorrect answers. We accomplish this goal by using a contrastive max-margin objective function; while we are not interested in obtaining a ranked list of answers, we observe better performance by adding the weighted approximate-rank pairwise (warp) loss proposed in Weston et al. (2011) to our objective function. Given a sentence paired with its correct answer c, we randomly select j incorrect answers from the set of all incorrect answers and denote this subset as Z. Since c is part of the vocabulary, it has a vector x_c ∈ W_e; an incorrect answer z ∈ Z is also associated with a vector x_z ∈ W_e. We define S to be the set of all nodes in the sentence’s dependency tree, where an individual node s ∈ S is associated with …

(Footnotes: questions never contain their own answer as part of the text; in quiz bowl, all wrong guesses are equally detrimental to a team’s score, no matter how “close” a guess is to the correct answer.)
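A recursive sketch of equation (4), assuming a parsed tree whose nodes carry a word, children, and the dependency relation to their parent (all names illustrative, not the paper's code):

```python
import numpy as np

def compose(node, W_v, W_rel, b, embeddings, f=np.tanh):
    # Equation (4): a node's vector combines its own word vector with
    # its children's composed vectors, each transformed by a matrix
    # specific to the dependency relation R(n, k).
    total = W_v @ embeddings[node.word] + b
    for child in node.children:
        total += W_rel[child.rel] @ compose(child, W_v, W_rel, b,
                                            embeddings, f)
    return f(total)
```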
Training

• Initialize embeddings from word2vec
• Randomly initialize composition matrices
• Update using warp (a contrastive sketch follows below):
    Randomly choose an instance
    Look where it lands
    It has a correct answer
    Wrong answers may be closer
    Push away wrong answers
    Bring correct answers closer
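A hedged sketch of one contrastive step; the actual objective is a max-margin loss with warp weighting (which ranks violations by approximate rank), simplified here:

```python
import numpy as np

def contrastive_update(sent_vec, correct_vec, wrong_vecs,
                       lr=0.05, margin=1.0):
    # If a wrong answer scores within `margin` of the correct answer,
    # pull the correct answer toward the sentence vector and push the
    # wrong one away (updates happen in place).
    s_correct = sent_vec @ correct_vec
    for z in wrong_vecs:
        if margin - s_correct + sent_vec @ z > 0:  # margin violated
            correct_vec += lr * sent_vec           # bring correct closer
            z -= lr * sent_vec                     # push wrong away
```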
Training

• We use rnn information from parsed questions
• And bag-of-words features from Wikipedia
• Combine both features together
Embedding

[Figure: visualization of the learned embedding]
Comparing rnn to bow

                     History                    Literature
  Model       Sent 1   Sent 2   Full     Sent 1   Sent 2   Full
  bow-qb       37.5     65.9    71.4      27.4     54.0    61.9
  rnn          47.1     72.1    73.7      36.4     68.2    69.1
  bow-wiki     53.7     76.6    77.5      41.8     74.0    73.3
  combined     59.8     81.8    82.3      44.7     78.7    76.6

Percentage accuracy of different vector-space models.
Now we’re able to beat humans
Examining vectors

[Figure: nearby vectors: Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka]
Examining vectors

[Figure: nearby vectors: Akbar, Muhammad, Shah Jahan, Ghana, Babur]
Future Steps

• Using Wikipedia: transforming articles into question-like sentences
• Incorporating additional information: story plots
• New network structures: capturing coreference
• Exhibition
[Figures: (a) Buzzes over all Questions; (b) Wuthering Heights Question Text; (c) Buzzes on Wuthering Heights]
Accuracy vs. Speed

[Figure: accuracy (0.4 to 1.0) vs. number of tokens revealed (40 to 100), with one curve per total number of questions (500 to 3000)]
Download