Deep Learning in Natural Language
Wei-Yun Ma (馬偉雲)
Assistant Research Fellow, Institute of Information Science, Academia Sinica
ma@iis.sinica.edu.tw
2016/12/21

Outline
• Introduction to applications of DL on NLP
• Word representations
• Named entity recognition using a window-based NN
• Language modeling using a recurrent NN
• Machine translation using LSTM
• Syntactic parsing using a recursive NN
• Sentiment analysis using a recursive NN
• Sentence classification using a convolutional NN
• Information retrieval using DSSM
• Knowledge base completion using knowledge base embedding
• Concluding remarks

Word Representations
• Dictionary-based word representation (discrete word representation).
• Are there other ways to represent a word?
(Adapted slides from "Deep Learning for NLP (without Magic)" by Socher and Manning)

Continuous Word Representations
• A word is represented as a dense vector, e.g.
  蝴蝶 ("butterfly") = (0.234, 0.283, −0.435, 0.485, −0.934, −0.384, 0.234, 0.548, −0.834, 0.437, 0.483)
• A word embedding captures the word's meaning and projects it into a semantic vector space.
(Adapted slides from "Deep Learning for NLP (without Magic)" by Socher and Manning)

Why are Continuous Word Representations Useful?
• They are trainable features: they can be trained directly through supervised training on labeled data.
• Alternatively, they can be pre-trained through unsupervised training on raw data and then fine-tuned through supervised training on labeled data.
(Adapted slides from the NAACL 2015 tutorial by Wen-tau Yih, Xiaodong He, and Jianfeng Gao)
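Once words are dense vectors, semantic relatedness can be measured with cosine similarity, which is what makes continuous representations useful as features. A minimal sketch, assuming toy 4-dimensional vectors invented for illustration (not trained embeddings):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity: dot product of the L2-normalized vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical embeddings; values are made up for illustration
butterfly = np.array([0.23, 0.28, -0.43, 0.48])
moth      = np.array([0.21, 0.30, -0.40, 0.45])
factory   = np.array([-0.83, 0.43, 0.48, -0.93])

# semantically close words should score higher than unrelated ones
assert cosine(butterfly, moth) > cosine(butterfly, factory)
```

With discrete (dictionary-based) one-hot representations every pair of distinct words has similarity zero; the dense vectors above are what lets "butterfly" and "moth" end up close.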
Latent Semantic Analysis
(Adapted slides from the NAACL 2015 tutorial by Wen-tau Yih, Xiaodong He, and Jianfeng Gao)

Word2vec
• Two architectures (Mikolov et al., Workshop at ICLR 2013): Continuous Bag-of-Words (CBOW) and Skip-gram.
• Example sentence: 彩色 的 蝴蝶 翩翩 起舞 ("a colorful butterfly flutters about").
• CBOW predicts the center word 蝴蝶 from its context words (彩色, 的, 翩翩, 起舞); Skip-gram predicts each context word from the center word 蝴蝶.

Review: Softmax
• A softmax layer is used as the output layer. It turns scores z_j into probabilities:
  y_j = e^{z_j} / Σ_k e^{z_k}
• Example: z = (3, 1, −3) gives e^z ≈ (20, 2.7, 0.05), hence y ≈ (0.88, 0.12, ≈0).
• The outputs form a probability distribution: 0 < y_j < 1 and Σ_j y_j = 1.
(Adapted slides of Hung-yi Lee)

Skip-gram Using Softmax
• The objective of the Skip-gram model is to maximize the average log probability
  (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
• The basic Skip-gram formulation defines p(w_{t+j} | w_t) using the softmax function:
  p(w_O | w_I) = exp(v′_{w_O} · v_{w_I}) / Σ_{w=1..W} exp(v′_w · v_{w_I})
• This probability is also used during training, but it requires so many calculations: the normalization runs over the entire vocabulary W.
(Mikolov et al., NIPS 2013)

Skip-gram Using Negative Sampling
• Negative sampling (NEG) is defined by the objective
  log σ(v′_{w_O} · v_{w_I}) + Σ_{i=1..k} E_{w_i ∼ P_n(w)} [log σ(−v′_{w_i} · v_{w_I})]
  where k is the number of negative samples and σ(x) = 1/(1 + e^{−x}) is the sigmoid function.
• This objective is also used during training, and it requires very few calculations compared with the full softmax.
(Mikolov et al., NIPS 2013; Mikolov et al., Workshop at ICLR 2013)

Named Entity Recognition Using a Window-based NN
• NER tags each word with a BIO label:
  I/O live/O in/O New/B-LOC York/I-LOC last/O year/O
• Label set: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG.
• A neural network classifies the center word of a fixed-size window from its surrounding context (e.g., "New" with context "I live in … York last year").

Language Modeling Using a Recurrent NN
• Review of recurrent NNs: the same network, with input weights W_i, recurrent weights W_h, and output weights W_o, is used again and again at every time step (inputs x_1, x_2, x_3, …, outputs y_1, y_2, y_3, …).
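The negative-sampling objective from the word2vec slides can be evaluated in a few lines for a single (center, context) pair. A minimal numpy sketch, assuming toy random vectors and k = 2 negative samples (illustrative values, not trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Negative-sampling objective for one (center, context) pair:
    log sigma(v'_pos . v_in) + sum_i log sigma(-v'_neg_i . v_in),
    returned negated so it can be minimized as a loss."""
    pos = np.log(sigmoid(np.dot(v_out_pos, v_in)))
    neg = sum(np.log(sigmoid(-np.dot(v_n, v_in))) for v_n in v_out_negs)
    return -(pos + neg)

rng = np.random.default_rng(0)
d = 8
v_center  = rng.normal(size=d)                    # input vector of the center word
v_context = rng.normal(size=d)                    # output vector of a true context word
v_negs    = [rng.normal(size=d) for _ in range(2)]  # k = 2 sampled negatives

loss = neg_sampling_loss(v_center, v_context, v_negs)
assert loss > 0.0  # -log of probabilities in (0, 1) is positive
```

The point of the formulation is visible in the code: only k + 1 dot products are needed per training pair, instead of one per vocabulary word as in the full softmax.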
• Output y_i depends on the entire history x_1, x_2, …, x_i.
(Adapted slides of Hung-yi Lee)

Language Modeling Using a Recurrent NN
• A language model estimates the probability of a given word sequence.
• Problems of the traditional n-gram LM: it has no memory beyond the n-gram window, and zero counts require smoothing.
  Training: P(彩色 的 蝴蝶) = P(彩色) · P(的 | 彩色) · P(蝴蝶 | 彩色 的), with
  P(蝴蝶 | 彩色 的) = count(彩色 的 蝴蝶) / count(彩色 的) = 3/789.
  Testing: P(彩色 的 鳳蝶) = P(彩色) · P(的 | 彩色) · P(鳳蝶 | 彩色 的), but
  P(鳳蝶 | 彩色 的) = count(彩色 的 鳳蝶) / count(彩色 的) = 0/789, so the unseen sequence gets probability zero.
• A recurrent NN has memory, and it provides smoothing naturally.
• Training: at each time step, the network reads the 1-of-N encoding of the current word (START, 彩色, 的, 蝴蝶) and is trained to output P(next word is 彩色), P(next word is 的), P(next word is 蝴蝶), P(next word is END).
• Testing: the same network scores an unseen sequence such as 彩色 的 鳳蝶 step by step, assigning it a nonzero probability.
(Adapted slides of Hung-yi Lee)

Machine Translation Using LSTM
• Review of LSTM: an LSTM memory cell is a special neuron with 4 inputs and 1 output. Besides the cell input, three signals from other parts of the network control the input gate, the forget gate, and the output gate of the memory cell.
(Adapted slides of Hung-yi Lee)
• Sequence-to-sequence translation: an encoder LSTM reads the source-language sentence; v, the fixed-dimensional representation of the input sequence, is given by the last hidden state of the LSTM; a decoder LSTM with a softmax output layer then generates the target-language sentence.
• Model analysis.
(Sutskever et al., NIPS 2014)
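The "special neuron with 4 inputs and 1 output" reviewed above can be written out directly: the three gate signals and the cell input are each computed from the current input and the previous output. A minimal single-cell sketch with toy random weights (an untrained illustration, not the translation model itself):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [x; h_prev] to the 4 pre-activations:
    input gate i, forget gate f, output gate o, and cell input g."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    i = sigmoid(z[0:H])        # input gate: how much of g to write
    f = sigmoid(z[H:2*H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2*H:3*H])    # output gate: how much memory to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell input
    c = f * c_prev + i * g     # memory cell update
    h = o * np.tanh(c)         # the single output of the cell
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                                     # toy input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, D + H))  # all 4 blocks stacked
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):               # run 5 time steps
    h, c = lstm_step(x, h, c, W, b)

assert np.all(np.abs(h) < 1.0)  # h = o * tanh(c) is bounded in (-1, 1)
```

The forget gate is what lets the cell carry information across long spans of the source sentence before the decoder starts generating.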
Syntactic Parsing Using a Recursive NN
• What is parsing? Short answer: structure prediction.
(Adapted slides from Socher's deep learning course)

Sentiment Analysis Using a Recursive NN
• What is sentiment analysis? Deciding whether a piece of text is negative or positive.
• Most methods start with a bag of words plus linguistic features/processing/lexicons, but such methods have trouble distinguishing cases where word order and composition flip the sentiment.
• Sentiment over 5 classes (−−, −, 0, +, ++); the class at each node is decided by a softmax classifier.
(Socher et al., ICML 2011; adapted slides from Socher's deep learning course)

Sentence Classification Using a Convolutional NN
(Kim, EMNLP 2014; adapted slides from Socher's deep learning course)

Model Comparison
(Adapted slides from "Deep Learning for NLP (without Magic)" by Socher and Manning)

Information Retrieval Using DSSM
• Goal: map a word string (document or query), represented as a bag of words, into a semantic vector space. (Source: http://www.cs.toronto.edu/~hinton/science.pdf)
• How to achieve that? There is no explicit target for training.
• Click-through data provides supervision:
  q1 → d1: +, d2: −
  q2 → d3: −, d4: +
• Training: pull the query vector close to its clicked documents (d1 for q1, d4 for q2) and push it far apart from the unclicked ones (d2, d3).
(Adapted slides of Hung-yi Lee)
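The click-through training above amounts to a softmax over the cosine similarities between a query and its candidate documents, maximizing the posterior of the clicked one. A minimal sketch, assuming hypothetical fixed semantic vectors standing in for the outputs of the learned query/document networks:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical semantic vectors (in reality produced by the DSSM towers)
q     = np.array([0.9, 0.1, 0.3])
d_pos = np.array([0.8, 0.2, 0.25])   # clicked document: close to q
d_neg = np.array([-0.5, 0.9, 0.1])   # unclicked document: far from q

gamma = 10.0  # smoothing factor of the softmax over candidates
sims = np.array([cosine(q, d_pos), cosine(q, d_neg)])
post = np.exp(gamma * sims) / np.exp(gamma * sims).sum()

# training maximizes the posterior of the clicked document,
# i.e., minimizes -log post[clicked]
assert post[0] > post[1]
loss = -np.log(post[0])
```

Minimizing this loss is exactly the "close / far apart" objective: it raises cos(q, d+) and lowers cos(q, d−) through the shared networks.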
DSSM vs. a Typical DNN
• A typical DNN takes the query q and document d+ together as input and maps them to a reference output; DSSM instead maps q and d+ separately into the same semantic space and compares them there.
• How to do retrieval? Given the click-through data used for training (q1 → d1: +, d2: −; q2 → d3: −, d4: +), a new query q′ is mapped into the semantic space and the closest documents (e.g., d1, d2) are retrieved.
(Adapted slides of Hung-yi Lee)

Knowledge Base Completion Using Knowledge Base Embedding
• Knowledge base applications
• Reasoning with a knowledge base
• Knowledge base embedding
• Knowledge base representation as a tensor
• Tensor decomposition objective
• Measuring the degree of a relationship
(Adapted slides from the NAACL 2015 tutorial by Wen-tau Yih, Xiaodong He, and Jianfeng Gao)

Concluding Remarks
• Pre-train through unsupervised training on raw data, then train through supervised training on labeled data.
• The physical meaning of each model is crucial.
• Novel applications are everywhere.
• Collaboration between traditional NLP/resources and DNN-based NLP/resources.

My Current Work – Lexical Knowledge Base Embedding
• E-HowNet (http://ehownet.iis.sinica.edu.tw/index.php)
• Example: 工廠 ("factory") is defined in E-HowNet as a 場所 (location) whose telic role is 製造 (manufacture); such structured definitions are embedded as dense vectors, e.g. (0.234, 0.283, −0.435, 0.485, −0.934, −0.384, 0.234, 0.548, −0.834, 0.437, 0.483).

My Current Work – Dependency Chinese Word Embedding
• Example sentence: 大學 兼任 助理 納保 爭議 未 歇 ("the dispute over insurance enrollment for part-time university assistants has not subsided").
• The CKIP Chinese Parser produces a dependency parse of the sentence (relations such as Head, theme, property, negation, apposition, and range, with POS tags such as Nac, VG2, VA12), and these dependency contexts are used to train the word embeddings.

Demo
• http://dep2.ckip.cc

Reference Tutorials
• "Deep Learning for NLP (without Magic)" by Socher and Manning
• NAACL 2015 tutorial by Wen-tau Yih, Xiaodong He, and Jianfeng Gao
• UFLDL Tutorial
• Socher's deep learning course, 2015
• Hung-yi Lee's deep learning course, 2015

Thank you for your attention!