Deep Learning for Question Answering
Jordan Boyd-Graber, University of Colorado Boulder
19 November 2014

Plan
• Why Deep Learning
• Review of Logistic Regression
• Can't Somebody Else Do It? (Feature Engineering)
• Deep Learning from Data
• Tricks and Toolkits
• Toolkits for Deep Learning
• Quiz Bowl
• Deep Quiz Bowl

Why Deep Learning

Deep Learning was once known as "Neural Networks"

But it came back . . .
• More data (e.g., "Political Ideology Detection Using Recursive Neural Networks", Iyyer, Enns, Boyd-Graber, and Resnik, University of Maryland; the figure shows a recursive network over the sentence "They dubbed it the 'death tax' and created a big lie about its adverse effects on small businesses")
• Better tricks (regularization)
• Faster computers

And companies are investing . . .

Review of Logistic Regression

Map inputs to output
• Input vector: \( x_1 \ldots x_d \), inputs encoded as real numbers
• Output: multiply the inputs by weights and add a bias, \( \sum_i W_i x_i + b \)
• Activation: pass the result through a nonlinear sigmoid, \( f(z) \equiv \frac{1}{1 + \exp(-z)} \)
(figure: the sigmoid curve)

In the shallow end
• This is still logistic regression
• Engineering the features x is difficult (and requires expertise)
• Can we instead learn how to represent the inputs to the final decision?
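As a minimal sketch in numpy (variable names and values here are illustrative, not from the slides), the whole model is a couple of lines:

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z)) squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(x, weights, bias):
    # Multiply inputs by weights, add the bias, pass through the sigmoid
    return sigmoid(np.dot(weights, x) + bias)

x = np.array([0.5, -1.2, 3.0])   # inputs encoded as real numbers
w = np.array([0.1, 0.4, -0.2])   # one weight per input
print(logistic_regression(x, w, bias=0.05))
```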
Can't Somebody Else Do It? (Feature Engineering)

Learn the features and the function
\( a_1^{(2)} = f\big( W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)} \big) \)
\( a_2^{(2)} = f\big( W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)} \big) \)
\( a_3^{(2)} = f\big( W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)} \big) \)
\( h_{W,b}(x) = a_1^{(3)} = f\big( W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)} \big) \)
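A sketch of this three-input, three-hidden-unit network in numpy (the random initialization and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: a^(2) = f(W^(1) x + b^(1))
    a2 = sigmoid(W1 @ x + b1)
    # Output: h_{W,b}(x) = a^(3) = f(W^(2) a^(2) + b^(2))
    return sigmoid(W2 @ a2 + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(scale=0.01, size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.01, size=(1, 3)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```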
Objective Function
• For every example (x, y) of our supervised training set, we want the label y to match the prediction \( h_{W,b}(x) \):
\( J(W, b; x, y) \equiv \frac{1}{2} \lVert h_{W,b}(x) - y \rVert^2 \)  (1)
• We want this value, summed over all of the examples, to be as small as possible
• We also want the weights not to be too large:
\( \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( W_{ji}^{(l)} \big)^2 \)  (2)
(summing over all layers, all source nodes, and all destination nodes)

Putting it all together:
\( J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \lVert h_{W,b}(x^{(i)}) - y^{(i)} \rVert^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( W_{ji}^{(l)} \big)^2 \)  (3)
• Our goal is to minimize J(W, b) as a function of W and b
• Initialize W and b to small random values near zero
• Adjust the parameters to optimize J
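Equation (3) translates directly into code; a hedged numpy sketch (the list-of-layers parameter layout is an assumption for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W_list, b_list, examples, lam):
    """J(W, b) from equation (3): mean squared error plus weight decay."""
    def forward(x):
        a = x
        for W, b in zip(W_list, b_list):
            a = sigmoid(W @ a + b)
        return a
    data_term = np.mean([0.5 * np.sum((forward(x) - y) ** 2)
                         for x, y in examples])
    # (lambda / 2) times the squared weights, over all layers,
    # sources, and destinations
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in W_list)
    return data_term + penalty
```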
Deep Learning from Data

Gradient Descent
Goal: optimize J with respect to the variables W and b
(figure: the objective plotted against a parameter; steps 0, 1, 2, 3 walk downhill toward the minimum, beyond which lies "undiscovered country")
Each step moves a weight against its gradient:
\( W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J \)

Backpropagation
• For convenience, write the input to the sigmoid as
\( z_i^{(l)} = \sum_{j=1}^{n} W_{ij}^{(l-1)} x_j + b_i^{(l-1)} \)  (4)
• The gradient is a function of a node's error \( \delta_i^{(l)} \)
• For output nodes, the error is obvious:
\( \delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \, \frac{1}{2} \lVert y - h_{W,b}(x) \rVert^2 = -\big( y_i - a_i^{(n_l)} \big) \cdot f'\big( z_i^{(n_l)} \big) \)  (5)
• Other nodes must "backpropagate" downstream error based on connection strength (the chain rule):
\( \delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l+1)} \delta_j^{(l+1)} \right) f'\big( z_i^{(l)} \big) \)  (6)
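A sketch of equations (5) and (6) for the two-layer network above, exploiting the sigmoid identity f'(z) = f(z)(1 - f(z)); the returned gradients correspond to equations (7) and (8) on the next slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    # Forward pass, keeping activations so we can evaluate f'(z)
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)
    fprime = lambda a: a * (1.0 - a)          # f'(z) = f(z)(1 - f(z))
    delta3 = -(y - a3) * fprime(a3)           # output error, equation (5)
    delta2 = (W2.T @ delta3) * fprime(a2)     # backpropagated error, equation (6)
    # Gradients, anticipating equations (7) and (8)
    return {"W2": np.outer(delta3, a2), "b2": delta3,
            "W1": np.outer(delta2, x), "b1": delta2}
```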
Partial Derivatives
• For the weights, the partial derivatives are
\( \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)} \delta_i^{(l+1)} \)  (7)
• For the bias terms, the partial derivatives are
\( \frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)} \)  (8)
• But this is just for a single example . . .

Full Gradient Descent Algorithm
1. Initialize the accumulators \( U^{(l)} \) and \( V^{(l)} \) as zero
2. For each example \( i = 1 \ldots m \):
   2.1 Use backpropagation to compute \( \nabla_W J \) and \( \nabla_b J \)
   2.2 Update the weight shifts \( U^{(l)} = U^{(l)} + \nabla_{W^{(l)}} J(W, b; x, y) \)
   2.3 Update the bias shifts \( V^{(l)} = V^{(l)} + \nabla_{b^{(l)}} J(W, b; x, y) \)
3. Update the parameters
\( W^{(l)} = W^{(l)} - \alpha \left( \frac{1}{m} U^{(l)} + \lambda W^{(l)} \right) \)  (9)
\( b^{(l)} = b^{(l)} - \alpha \frac{1}{m} V^{(l)} \)  (10)
4. Repeat until the weights stop changing

Tricks and Toolkits

Tricks
• Stochastic gradient: compute the gradient from a few examples at a time
• Hardware: do matrix computations on GPUs
• Dropout: randomly set some inputs to zero
• Initialization: pretraining with an autoencoder can help the representation

Toolkits for Deep Learning
• Theano: Python package (Yoshua Bengio)
• Torch7: Lua package (Yann LeCun)
• ConvNetJS: JavaScript package (Andrej Karpathy)
• All of them automatically compute gradients and provide numerical optimization
• Working group this summer at UMD
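For instance, Theano builds the gradient for you from a symbolic expression; this is the canonical introductory example (exact API details may differ across versions):

```python
import theano
import theano.tensor as T

x = T.dscalar('x')            # symbolic scalar input
y = x ** 2                    # symbolic expression built from x
gy = T.grad(y, x)             # Theano derives dy/dx = 2x symbolically
f = theano.function([x], gy)  # compile to a callable
print(f(4.0))                 # prints 8.0
```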
Quiz Bowl

Humans doing Incremental Classification
• A game called "quiz bowl"
• Two teams play each other: a moderator reads a question; when a team knows the answer, they signal ("buzz" in); if right, they get points; otherwise, the rest of the question is read to the other team
• Hundreds of teams in the US alone
• Example . . .

Sample Question
With Leo Szilard, he invented a doubly-eponymous refrigerator with no moving parts. He did not take interaction with neighbors into account when formulating his theory of heat capacity, so Debye adjusted the theory for low temperatures. His summation convention automatically sums repeated indices in tensor products. His name is attached to the A and B coefficients for spontaneous and stimulated emission, the subject of one of his multiple groundbreaking 1905 papers. He further developed the model of statistics sent to him by Bose to describe particles with integer spin.
For 10 points, who is this German physicist best known for formulating the special and general theories of relativity?
Answer: Albert Einstein

Faster = Smarter
1. Colorado School of Mines
2. Brigham Young University
3. California Institute of Technology
4. Harvey Mudd College
5. University of Colorado

Humans doing Incremental Classification
• This is not Jeopardy (Watson): Jeopardy has buzzers too, but players can only buzz at the end of a question, which doesn't discriminate knowledge
• Quiz bowl questions are pyramidal: the hardest clues come first
• Thousands of questions are written every year, collected in large question databases
• Teams practice on these questions (some online, e.g., over IRC)
• How can we learn from this?
System for Incremental Classifiers
• Treat this as a Markov decision process (MDP)
• Action: buzz now or wait
1. The Content Model constantly generates guesses
2. The Oracle provides examples of where the content model is correct
3. The Policy generalizes to test data
4. Features represent our state

Content Model
• A Bayesian generative model with answers as the latent state
• Trained on unambiguous Wikipedia pages
• Unigram term weightings (naïve Bayes, BM25)
• Maintains a posterior distribution over guesses
• Always has a guess of what it should answer; the policy will tell us when to trust it

Vector Space Model
(figure: question words such as "arabian", "persian", "gulf", "kingdom", and "expatriates" place a question in a space where nearby answers like Qatar and Bahrain are plausible and distant ones like Gummibears and Pokemon are not; the task is deciding where a new question lands)
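A minimal sketch of such a content model's posterior, assuming naive Bayes (the data-structure names and the unseen-word floor are illustrative, not from the slides):

```python
import math

def posterior_over_answers(tokens, log_prior, log_likelihood, floor=-12.0):
    """Naive Bayes posterior over candidate answers given revealed tokens.

    `log_prior[answer]` and `log_likelihood[answer][word]` are assumed to
    be estimated from Wikipedia pages and past questions.
    """
    scores = {}
    for answer, lp in log_prior.items():
        ll = sum(log_likelihood[answer].get(w, floor) for w in tokens)
        scores[answer] = lp + ll
    # Normalize with log-sum-exp to get a proper distribution
    z = max(scores.values())
    norm = z + math.log(sum(math.exp(s - z) for s in scores.values()))
    return {a: math.exp(s - norm) for a, s in scores.items()}
```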
Oracle

Revealed text and the content model's guess (the correct answer is Lyndon Johnson throughout):
• "The National Endowment for the Arts, War o" → Martha Graham
• "The National Endowment for the Arts, War on Poverty, and Medicare were established by, for 10 points, what Texan" → George Bush
• ". . . what Texan who defeated Barry Goldwater, promoted the Great Society, and succeeded John F. Kennedy?" → Lyndon Johnson

• As each token is revealed, look at the content model's guess
• If it's right, positive instance; otherwise negative
• Buzzing whenever the guess is correct is a nearly optimal policy (an upper bound)

Policy
• A mapping from state to action
• Use the oracle's actions as example actions
• Learned as a classifier (Langford et al., 2005)
• At test time, use the same features as for training: the question text (so far), the guess, the posterior distribution, and the change in posterior
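Generating the oracle's training actions is mechanical; a sketch (`best_guess` stands in for the content model's top answer on a prefix; the name is invented for illustration):

```python
def oracle_labels(tokens, correct_answer, best_guess):
    """Generate BUZZ/WAIT training actions from the content model.

    `best_guess(prefix)` is assumed to return the content model's top
    answer for the question text revealed so far.
    """
    examples = []
    for i in range(1, len(tokens) + 1):
        prefix = tokens[:i]
        action = "BUZZ" if best_guess(prefix) == correct_answer else "WAIT"
        examples.append((prefix, action))
    return examples
```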
Features (by example)

Observation: "This man won the Battle"
• Posterior: Tokugawa 0.02, Erwin Rommel 0.01, Joan of Arc 0.01, Stephen Crane 0.01
• State representation: idx: 05, ftp: f, gss: tokugawa, top_1: 0.02, top_2: 0.01, top_3: 0.01, this_man: 01, won: 01, battle: 01
• Action: Wait

Observation: "This man won the Battle of Zela over Pontus. He wrote about his victory at Alesia in his Commentaries on the"
• Posterior: Mithridates 0.11, Julius Caesar 0.09, Alexander the Great 0.08, Sulla 0.07
• State representation: idx: 21, ftp: f, gss: mithridates, top_1: 0.11, top_2: 0.09, top_3: 0.08, new: true, delta: +0.09, this_man: 01, commentaries: 01, pontus: 01, . . .
• Action: Wait

Observation: ". . . Commentaries on the Gallic Wars. FTP, name this Roman"
• Posterior: Julius Caesar 0.89, Augustus 0.02, Sulla 0.01, Pompey 0.01
• State representation: idx: 55, ftp: t, gss: j_caesar, top_1: 0.89, top_2: 0.02, top_3: 0.01, new: true, delta: +0.78, this_man: 01, this_roman: 01, gallic: 01, . . .
• Action: Buzz. Answer: Julius Caesar
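A sketch of how such a state representation might be assembled (the feature names follow the slide; the exact encoding is an assumption, and at least three candidate answers are assumed):

```python
def state_features(tokens, posterior, prev_posterior):
    """Assemble the policy's state representation for one time step."""
    top = sorted(posterior, key=posterior.get, reverse=True)[:3]
    prev_top = max(prev_posterior, key=prev_posterior.get) if prev_posterior else None
    feats = {
        "idx": len(tokens),                               # tokens revealed
        "ftp": "for 10 points" in " ".join(tokens).lower(),
        "gss": top[0],                                    # current best guess
        "top_1": posterior[top[0]],
        "top_2": posterior[top[1]],
        "top_3": posterior[top[2]],
        "new": top[0] != prev_top,                        # guess just changed
        "delta": posterior[top[0]] - prev_posterior.get(top[0], 0.0),
    }
    feats.update({w.lower(): 1 for w in tokens})          # unigram indicators
    return feats
```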
Simulating a Game
• Present tokens incrementally to the algorithm and see where it buzzes
• Compare to where humans buzzed in
• Payoff matrix (with respect to the computer):

  Computer             Human                Payoff
  first and wrong      right                -15
  (no buzz)            first and correct    -10
  first and wrong      wrong                -5
  first and correct    (no buzz)            +10
  wrong                first and wrong      +5
  right                first and wrong      +15

Interface
• Users could "veto" categories or tournaments
• Questions presented in canonical order
• Approximate string matching (with manual override)
• Started on Amazon Mechanical Turk
• 7,000 questions were answered on the first day; over 43,000 in the space of two weeks
• A total of 461 unique users
• A leaderboard to encourage users

Error Analysis
(plots: the model's posterior over answers as tokens are revealed, with its buzz and the opponent's buzz marked)
• Too slow: on a Maurice Ravel question, "maurice ravel" overtakes "pictures at an exhibition" only after clues like "bolero", "orchestrated", and "composer", after the opponent has buzzed
• Coreference and getting the question type right: an Enrico Fermi question with clues such as "paradox", "magnetic field", and "neutrino" attached to "this man" / "this italian"
• Not enough information, and later clues not weighted higher: a George Washington question where guesses drift through "charlemagne" and "yasunari kawabata"
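The payoff matrix as a scoring function; a sketch whose outcome logic mirrors the table above (it assumes at least one side buzzes):

```python
def payoff(computer_buzz, human_buzz, computer_right, human_right):
    """Payoff with respect to the computer for one simulated game.

    Buzz positions are token indices (None means never buzzed).
    """
    computer_first = computer_buzz is not None and (
        human_buzz is None or computer_buzz < human_buzz)
    if computer_first:
        if computer_right:
            return +10                        # computer first and correct
        return -15 if human_right else -5     # computer first and wrong
    if human_right:
        return -10                            # human first and correct
    return +15 if computer_right else +5      # human first and wrong
```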
How can we do better?
• Use the order of words in a sentence: "this man shot Lee Harvey Oswald" is very different from "Lee Harvey Oswald shot this man"
• Use relationships between questions ("China" and "Taiwan")
• Use learned features and dimensions, not the words we start with
• Recursive neural networks (Socher et al., 2012)
• The first time a learned representation has been applied to question answering

Deep Quiz Bowl

Using Compositionality
(figure: a dependency parse with nodes "attacked", "this", "admiral", "kotte", "chinese", "in", and "ceylon"; vectors are combined bottom-up through the tree)

From the accompanying paper excerpt: we repeat this process up to the root, which is
\( h_{\text{depended}} = f( W_{\text{NSUBJ}} \cdot h_{\text{economy}} + W_{\text{PREP}} \cdot h_{\text{on}} + W_v \cdot x_{\text{depended}} + b ) \)  (3)
The composition equation for any node n with children K(n) and word vector \( x_w \) is
\( h_n = f\Big( W_v \cdot x_w + b + \sum_{k \in K(n)} W_{R(n,k)} \cdot h_k \Big) \)  (4)
where R(n, k) is the dependency relation between node n and child node k.

Training. Our goal is to map questions to their corresponding answer entities. Because there are a limited number of possible answers, we can view this as a multi-class classification task. While a softmax layer over every node in the tree could predict answers (Socher et al., 2011; Iyyer et al., 2014), this method overlooks that most answers are themselves words (features) in other questions (e.g., a question on World War II). Intuitively, we want to encourage the vectors of question sentences to be near their correct answers and far away from incorrect answers. We accomplish this goal by using a contrastive max-margin objective function described below. While we are not interested in obtaining a ranked list of answers (in quiz bowl, all wrong guesses are equally detrimental to a team's score, no matter how "close" a guess is to the correct answer), we observe better performance by adding the weighted approximate-rank pairwise (WARP) loss proposed in Weston et al. (2011) to our objective function.
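A sketch of the recursive composition in equation (4) above, with invented data structures (this is not the authors' implementation):

```python
import numpy as np

def compose(node, W_v, W_rel, b, embeddings, f=np.tanh):
    """Bottom-up composition over a dependency tree (equation 4).

    `node` is assumed to be (word, [(relation, child_node), ...]);
    `W_rel` maps each dependency relation to its matrix.
    """
    word, children = node
    total = W_v @ embeddings[word] + b
    for relation, child in children:
        total += W_rel[relation] @ compose(child, W_v, W_rel, b, embeddings, f)
    return f(total)

# Toy usage mirroring equation (3): "economy depended on ..."
d = 4
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=d) for w in ["depended", "economy", "on"]}
W_v = rng.normal(scale=0.1, size=(d, d))
W_rel = {r: rng.normal(scale=0.1, size=(d, d)) for r in ["NSUBJ", "PREP"]}
tree = ("depended", [("NSUBJ", ("economy", [])), ("PREP", ("on", []))])
print(compose(tree, W_v, W_rel, rng.normal(size=d), emb))
```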
Given a sentence paired with its correct answer c, we randomly select j incorrect answers from the set of all incorrect answers and denote this subset as Z. Since c is part of the vocabulary, it has a vector \( x_c \in W_e \) (questions never contain their own answer as part of the text). An incorrect answer \( z \in Z \) is also associated with a vector \( x_z \in W_e \). We define S to be the set of all nodes in the sentence's dependency tree, where an individual node \( s \in S \) is associated with its hidden vector \( h_s \).
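A hedged sketch of the contrastive max-margin idea for one node (the margin value and dot-product similarity are assumptions, not taken from the excerpt):

```python
import numpy as np

def contrastive_loss(h_s, x_c, wrong_answer_vecs, margin=1.0):
    """Penalize wrong answers z in Z that score within `margin` of the
    correct answer c, relative to the node's hidden vector h_s."""
    return sum(max(0.0, margin - float(h_s @ x_c) + float(h_s @ x_z))
               for x_z in wrong_answer_vecs)

rng = np.random.default_rng(0)
h = rng.normal(size=4)                       # a node's hidden vector
c = rng.normal(size=4)                       # correct answer vector
Z = [rng.normal(size=4) for _ in range(3)]   # sampled wrong answers
print(contrastive_loss(h, c, Z))
```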
Training
• Initialize embeddings from word2vec
• Randomly initialize the composition matrices
• Update using WARP:
  - Randomly choose an instance and look where its vector lands
  - It has a correct answer, but wrong answers may be closer
  - Push away the wrong answers; bring the correct answer closer
• We use rnn representations from parsed questions
• And bag-of-words features from Wikipedia
• Combine both sets of features together

Embedding
(figure: questions and answers embedded in a shared vector space)

Comparing rnn to bow

Model       History                     Literature
            Sent 1   Sent 2   Full      Sent 1   Sent 2   Full
bow-qb      37.5     65.9     71.4      27.4     54.0     61.9
rnn         47.1     72.1     73.7      36.4     68.2     69.1
bow-wiki    53.7     76.6     77.5      41.8     74.0     73.3
combined    59.8     81.8     82.3      44.7     78.7     76.6

Percentage accuracy of different vector-space models.
Now we're able to beat humans

Examining vectors
• One neighborhood of literature answers: Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka
• One neighborhood of history answers: Akbar, Muhammad, Shah Jahan, Ghana, Babur

Future Steps
• Using Wikipedia: transforming its sentences into question-like sentences
• Incorporating additional information: story plots
• New network structures: capturing coreference
• Exhibition

(figures: (a) buzzes over all questions; (b) the Wuthering Heights question text; (c) buzzes on the Wuthering Heights question)

Accuracy vs. Speed
(plot: accuracy against the number of tokens revealed, with curves labeled 500 through 3000 and Total; accuracy rises as more of the question is revealed)
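Neighborhoods like the ones above can be read off the learned embeddings with cosine similarity; a sketch (`answer_vecs`, a map from answer names to vectors, is an illustrative name):

```python
import numpy as np

def nearest_answers(query, answer_vecs, k=5):
    """Return the k nearest answers to `query` by cosine similarity."""
    q = answer_vecs[query]
    def cos(v):
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    ranked = sorted(answer_vecs, key=lambda a: cos(answer_vecs[a]), reverse=True)
    return [a for a in ranked if a != query][:k]
```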