Character-Aware Neural Language Models

Presenters: Congyao Zheng, Charles Ouazana

Yoon Kim (1), Yacine Jernite (2), David Sontag (2), Alexander M. Rush (1)
(1) School of Engineering and Applied Sciences, Harvard University
(2) Courant Institute of Mathematical Sciences, New York University

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence

Table of Contents
1 Introduction
2 Model
3 Experiment Setup
4 Training sets
5 Architecture and Optimization
6 Results
7 Conclusion

Introduction

Context: NLMs are blind to subword information (e.g. morphemes). For example, eventful, eventfully, uneventful, and uneventfully should (a priori) have structurally related embeddings in the vector space.

The paper proposes a simple neural language model that relies only on character-level inputs, built from:
- a convolutional neural network (CNN) over characters
- a highway network over characters
- a long short-term memory (LSTM) recurrent neural network language model (RNN-LM)

Model

Highway Networks: using gate units allows
- unimpeded information flow across several layers on "information highways"
- training of deep networks by adaptively carrying some dimensions of the input directly to the output

Experiment Setup

Test the model on corpora of varying languages and sizes.

Evaluation metric: perplexity (PPL), derived from the negative log-likelihood (NLL) of the test set (see the sketch after this section):
- NLL = - sum_{t=1}^{T} log Pr(w_t | w_{1:t-1})
- PPL = exp(NLL / T)

Dataset
- Optimal hyperparameters tuned on PTB
- Morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic

Optimization (summary; details in Section 5)
- Truncated backpropagation through 35 time steps
- Dropout of 0.5 on the LSTM input-to-hidden layers and the hidden-to-output softmax layer
- Norm of the gradients constrained to be <= 5
- Hierarchical softmax for the large datasets
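The evaluation metric above can be made concrete with a small sketch. This is a minimal illustration, assuming the language model exposes per-token probabilities Pr(w_t | w_{1:t-1}); the function name and the toy probability values are made up for the example.

```python
import math

def nll_and_ppl(token_probs):
    """Negative log-likelihood of a test sequence and the corresponding perplexity."""
    nll = -sum(math.log(p) for p in token_probs)   # NLL = -sum_t log Pr(w_t | w_{1:t-1})
    ppl = math.exp(nll / len(token_probs))         # PPL = exp(NLL / T)
    return nll, ppl

# Toy example: four tokens with illustrative model probabilities; lower PPL is better.
print(nll_and_ppl([0.2, 0.05, 0.5, 0.1]))
```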
Training sets

The article uses three different training sets, covering several languages, to evaluate the model:

PTB Dataset
- 1m tokens and |V| = 10k
- Standard splits: training (sections 0-20), validation (21-22), and test (23-24)

2013 ACL Workshop on Machine Translation Dataset
- Two settings: DATA-S, with 1m tokens per language, and the large datasets (DATA-L)
- Preprocessing of the data follows Botha and Blunsom (2014)
- Arabic data: News-Commentary corpus

Training dataset summary (|V| = word vocabulary size, |C| = character vocabulary size, T = number of tokens):

Language        DATA-S |V|  |C|   T      DATA-L |V|  |C|   T
English (EN)    10k         51    1m     60k         197   20m
Czech (CS)      46k         101   1m     206k        195   17m
German (DE)     37k         74    1m     339k        260   51m
Spanish (ES)    27k         72    1m     152k        222   56m
French (FR)     25k         76    1m     137k        225   57m
Russian (RU)    62k         62    1m     497k        111   25m
Arabic (AR)     86k         132   4m     -           -     -

Neural Network Architecture

Figure: d = dimensionality of character embeddings; w = filter widths; h = number of filter matrices; f, g = nonlinearity functions; l = number of layers; m = number of hidden units.
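To make the figure concrete, the following is a minimal PyTorch sketch of the CharCNN -> highway -> LSTM pipeline. The class names (HighwayLayer, CharAwareLM) are invented for the example, and the defaults are only meant to echo the small configuration (d = 15, filter widths 1-6, 25·w filters per width, m = 300, 2 LSTM layers); treat them as illustrative. Dropout and the hierarchical softmax are omitted.

```python
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with transform gate t = sigmoid(W_T y + b_T)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, y):
        t = torch.sigmoid(self.gate(y))                          # transform gate t
        return t * torch.relu(self.transform(y)) + (1 - t) * y   # carry (1 - t) of the input


class CharAwareLM(nn.Module):
    """CharCNN over each word -> highway layer -> word-level LSTM -> softmax over words."""

    def __init__(self, num_chars, vocab_size, d=15, widths=(1, 2, 3, 4, 5, 6),
                 feats_per_width=25, m=300, lstm_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, d)               # character embeddings (d)
        # One filter bank per width w, with h = feats_per_width * w filters each
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, feats_per_width * w, kernel_size=w) for w in widths])
        cnn_dim = sum(feats_per_width * w for w in widths)
        self.highway = HighwayLayer(cnn_dim)
        self.lstm = nn.LSTM(cnn_dim, m, num_layers=lstm_layers, batch_first=True)
        self.proj = nn.Linear(m, vocab_size)                     # hidden-to-output softmax layer

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, max_word_len) indices of the characters of each word
        b, s, n = char_ids.shape
        x = self.char_emb(char_ids.view(b * s, n)).transpose(1, 2)   # (b*s, d, n)
        # tanh convolution responses, max-over-time pooled, concatenated over widths
        feats = torch.cat(
            [torch.tanh(conv(x)).max(dim=2).values for conv in self.convs], dim=1)
        word_repr = self.highway(feats).view(b, s, -1)           # one vector per word
        out, _ = self.lstm(word_repr)
        return self.proj(out)                                    # logits over the next word


# Toy forward pass: batch of 2 sentences, 5 words each, words padded to 10 characters.
model = CharAwareLM(num_chars=51, vocab_size=10000)
logits = model(torch.randint(0, 51, (2, 5, 10)))
print(logits.shape)                                              # torch.Size([2, 5, 10000])
```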
Optimization

Truncated backpropagation
- Backpropagate through 35 time steps using stochastic gradient descent, with a learning rate of 1.0 that is halved if the validation perplexity does not decrease by more than 1.0 per epoch

Regularization
- Dropout with probability 0.5 on the LSTM input-to-hidden layers and the hidden-to-output softmax layer
- Gradient norm constrained to be below 5

Hierarchical softmax for the large datasets
- Random split of V into mutually exclusive and collectively exhaustive subsets V_1, ..., V_c
- Pr(w_{t+1} = j | w_{1:t}) = [ exp(h_t · s^r + t^r) / sum_{r'=1}^{c} exp(h_t · s^{r'} + t^{r'}) ] × [ exp(h_t · p_j^r + q_j^r) / sum_{j' ∈ V_r} exp(h_t · p_{j'}^r + q_{j'}^r) ]
  where V_r is the cluster containing word j: the first factor is the probability of picking cluster r, and the second is the probability of picking word j within cluster r (a code sketch of this factorization is given at the end of the deck)

Results

Figure: PTB results.

Effect of the highway layers on the learned word representations:
Figure: Nearest neighbor words (based on cosine similarity).

Conclusion

Character CNN versus word-level model
- The speed lost by running the CNN over characters can be compensated for by GPU optimization and the smaller size of the models used
- The model outperforms, or performs similarly to, word/morpheme-embedding-based models with fewer parameters

Related work
- Character-only models: both the inputs and the outputs of the model are characters
- Representation of a word as a combination of its word embedding and its character-level CNN output
- Deep character-level CNNs have been shown to perform well for text classification
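Appendix: a minimal numpy sketch of the two-level hierarchical softmax factorization referenced in the Optimization section. The random, equally sized cluster assignment, the parameter names (S, t_bias, P, q), and the sizes below are illustrative assumptions; in the model, h_t would be the LSTM hidden state at time t.

```python
import numpy as np

rng = np.random.default_rng(0)
V, c, m = 10000, 100, 300                # vocab size, number of clusters, hidden size
words_per_cluster = V // c
cluster_of = rng.permutation(V) // words_per_cluster   # random split of V into V_1, ..., V_c
index_in_cluster = np.zeros(V, dtype=int)
for r in range(c):
    members = np.flatnonzero(cluster_of == r)
    index_in_cluster[members] = np.arange(len(members))

S = rng.normal(size=(c, m))                            # cluster embeddings s^r
t_bias = np.zeros(c)                                   # cluster biases t^r
P = rng.normal(size=(c, words_per_cluster, m))         # within-cluster word embeddings p_j^r
q = np.zeros((c, words_per_cluster))                   # within-cluster word biases q_j^r

def log_prob(h_t, j):
    """log Pr(w_{t+1} = j | w_{1:t}) given the hidden state h_t."""
    r = cluster_of[j]
    # First factor: probability of picking cluster r
    cluster_scores = S @ h_t + t_bias
    log_p_cluster = cluster_scores[r] - np.logaddexp.reduce(cluster_scores)
    # Second factor: probability of picking word j within cluster r
    word_scores = P[r] @ h_t + q[r]
    log_p_word = word_scores[index_in_cluster[j]] - np.logaddexp.reduce(word_scores)
    return log_p_cluster + log_p_word    # only O(c + |V_r|) scores instead of O(|V|)

print(log_prob(rng.normal(size=m), j=42))
```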