
NLP Presentation TP3-2

Character-Aware Neural Language Models
Presenters: Congyao Zheng, Charles Ouazana
Yoon Kim (1), Yacine Jernite (2), David Sontag (2), Alexander M. Rush (1)
(1) School of Engineering and Applied Sciences, Harvard University
(2) Courant Institute of Mathematical Sciences, New York University
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence
Table of Contents
1. Introduction
2. Model
3. Experiment Setup
4. Training sets
5. Architecture and Optimization
6. Results
7. Conclusion
Introduction
Context: word-level NLMs are blind to subword information (e.g. morphemes). For example, eventful, eventfully, uneventful, and uneventfully should a priori have structurally related embeddings in the vector space.
Proposal: a simple neural language model that relies only on character-level inputs, composed of:
- a character-level convolutional neural network (CNN)
- a highway network over the character-level representations
- a long short-term memory (LSTM) recurrent neural network language model (RNN-LM)
Model
Highway Networks:
Using gate units allows:
- unimpeded information flow across several layers on "information highways"
- training of deep networks by adaptively carrying some dimensions of the input directly to the output (a minimal sketch follows below).
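As a concrete illustration of this gating, here is a minimal PyTorch sketch of a single highway layer (an illustrative reimplementation, not the authors' code; ReLU is assumed for the nonlinearity g):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with transform gate t = sigmoid(W_T y + b_T)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # W_H, b_H
        self.gate = nn.Linear(dim, dim)       # W_T, b_T

    def forward(self, y):
        t = torch.sigmoid(self.gate(y))       # how much of the transformed signal to use
        g = torch.relu(self.transform(y))     # nonlinear transform (ReLU assumed here)
        # The carry gate (1 - t) passes the corresponding dimensions of the input unchanged.
        return t * g + (1.0 - t) * y
```

When t is close to 0 for a dimension, that dimension of the input is carried straight to the output, which is what makes deep stacks of such layers easy to train.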
Experiment Setup
Test the model on corpora of varying languages and sizes
Evaluation metric: perplexity (PPL) of a model over a sequence, computed from the negative log-likelihood (NLL) of the test set.
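For reference, these are the standard definitions (restated here; T is the number of words in the test sequence):

NLL = -\sum_{t=1}^{T} \log \Pr(w_t \mid w_{1:t-1}), \qquad PPL = \exp(NLL / T)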
Experiment Setup
Dataset
- Optimal hyperparameters tuned on PTB
- Morphologically rich languages: Czech, German, French, Spanish, Russian, and Arabic
Optimization
- Backpropagate through 35 time steps
- Dropout of 0.5 on the LSTM input-to-hidden layers and the hidden-to-output softmax layer
- Norm of the gradients constrained to be ≤ 5
- Hierarchical softmax for the large datasets (a configuration sketch follows below)
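Collected in one place, these settings might look like the following (an illustrative configuration sketch; the key names are my own, not from the paper):

```python
# Values taken from the slide above; key names are hypothetical.
training_config = {
    "bptt_steps": 35,              # truncated backpropagation length
    "dropout": 0.5,                # on LSTM input-to-hidden and hidden-to-output layers
    "max_grad_norm": 5.0,          # gradient norm clipping threshold
    "hierarchical_softmax": True,  # used for the large datasets
}
```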
Training sets
The paper uses training sets of varying languages and sizes to evaluate the model:
PTB Dataset
- 1m tokens and |V| = 10k
- standard splits: training (sections 0-20), validation (21-22), and test (23-24)
2013 ACL Workshop on Machine Translation Dataset
- two versions: DATA-S with 1m tokens per language, and the larger DATA-L
- data preprocessing follows Botha and Blunsom (2014)
- Arabic data comes from the News-Commentary corpus
Training dataset summary
(|V| = word vocabulary size; |C| = character vocabulary size; T = number of tokens in the training set)

                  DATA-S               DATA-L
Language          |V|    |C|   T       |V|     |C|   T
English (EN)      10k    51    1m      60k     197   20m
Czech (CS)        46k    101   1m      206k    195   17m
German (DE)       37k    74    1m      339k    260   51m
Spanish (ES)      27k    72    1m      152k    222   56m
French (FR)       25k    76    1m      137k    225   57m
Russian (RU)      62k    62    1m      497k    111   25m
Arabic (AR)       86k    132   4m      -       -     -
Neural Network Architecture
Figure: d = dimensionality of character embeddings; w = filter widths; h = number of filter matrices; f, g = nonlinearity functions; l = number of layers; m = number of hidden units.
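To make the figure's notation concrete, here is a minimal PyTorch sketch of the overall architecture (an illustrative reimplementation under the notation above, not the authors' code; the default hyperparameter values are placeholders, each filter width is given the same number h of filters as a simplification, and HighwayLayer is the class from the earlier sketch):

```python
import torch
import torch.nn as nn

class CharAwareLM(nn.Module):
    """Character CNN -> highway -> LSTM language model (illustrative sketch)."""
    def __init__(self, n_chars, n_words, d=15, widths=(1, 2, 3, 4, 5, 6),
                 h=100, l=2, m=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d)       # d-dimensional character embeddings
        # One 1-D convolution per filter width w, each with h feature maps (simplification).
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, h, kernel_size=w) for w in widths])
        cnn_dim = h * len(widths)
        self.highway = HighwayLayer(cnn_dim)           # class from the earlier sketch
        self.lstm = nn.LSTM(cnn_dim, m, num_layers=l, batch_first=True)
        self.proj = nn.Linear(m, n_words)              # logits over the word vocabulary

    def forward(self, chars):
        # chars: (batch, seq_len, max_word_len) indices of the characters of each word
        b, s, n = chars.shape
        x = self.char_emb(chars.reshape(b * s, n)).transpose(1, 2)   # (b*s, d, n)
        # Apply nonlinearity f (tanh) and max-over-time pooling to each convolution.
        feats = [torch.tanh(conv(x)).max(dim=2).values for conv in self.convs]
        word_emb = self.highway(torch.cat(feats, dim=1)).reshape(b, s, -1)
        out, _ = self.lstm(word_emb)                   # hidden states over the word sequence
        return self.proj(out)                          # next-word logits at each position
```

Each word's representation is thus built entirely from its characters before entering the word-level LSTM.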
Optimization
Truncated backpropagation
- Backpropagate using stochastic gradient descent with a learning rate of 1.0, halved if the validation perplexity does not decrease by more than 1.0 (a training-loop sketch follows this slide)
Regularization
- Dropout with probability 0.5 on the LSTM input-to-hidden layers
- Gradient norm constrained to be below 5
Hierarchical softmax for the large datasets
- Random split of V into mutually exclusive and collectively exhaustive subsets V_1, ..., V_c; the next-word probability factorizes as:
\Pr(w_{t+1} = j \mid w_{1:t}) =
  \frac{\exp(h_t \cdot s^r + t^r)}{\sum_{r'=1}^{c} \exp(h_t \cdot s^{r'} + t^{r'})}
  \times
  \frac{\exp(h_t \cdot p^r_j + q^r_j)}{\sum_{j' \in V_r} \exp(h_t \cdot p^r_{j'} + q^r_{j'})}
where r is the index of the cluster V_r that contains word j.
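A minimal sketch of how these optimization choices could be wired together in PyTorch (illustrative only; `evaluate_ppl` and the batch iterators are assumed helpers, not from the paper, and a standard full softmax loss is used here rather than the hierarchical one):

```python
import torch

def train(model, train_batches, val_batches, epochs, lr=1.0, max_grad_norm=5.0):
    # Dropout (p = 0.5) is assumed to be applied inside the model itself.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    prev_ppl = float("inf")
    for _ in range(epochs):
        model.train()
        for chars, targets in train_batches:   # each batch spans 35 time steps (truncated BPTT)
            optimizer.zero_grad()
            logits = model(chars)              # (batch, seq_len, |V|)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            # Constrain the gradient norm to be below 5.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
        ppl = evaluate_ppl(model, val_batches)    # assumed helper returning validation perplexity
        if prev_ppl - ppl <= 1.0:                 # halve the learning rate if perplexity
            for group in optimizer.param_groups:  # did not decrease by more than 1.0
                group["lr"] /= 2.0
        prev_ppl = ppl
```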
Results
Figure: PTB Results
Effect of the highway layer on the learned word representations
Figure: Nearest neighbor words (based on cosine similarity)
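For context, a nearest-neighbor lookup of this kind can be reproduced generically as follows (a sketch, not the authors' evaluation code; `vocab` and `embeddings` are assumed to be the word list and the corresponding matrix of learned representations):

```python
import numpy as np

def nearest_neighbors(word, vocab, embeddings, k=5):
    """Return the k words whose vectors are most cosine-similar to `word`."""
    idx = vocab.index(word)
    # Row-normalize so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)                  # most similar first
    return [vocab[i] for i in order if i != idx][:k]
```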
Conclusion
Character CNN versus word-level model
- The speed lost by running the CNN over characters can be compensated by GPU optimization and by the smaller size of the resulting models
- The model outperforms, or performs similarly to, word/morpheme-embedding based models while using fewer parameters
Related work
- Character-only models: both the inputs and the outputs of the model are characters
- Representation of a word as a combination of its word embedding and its character-level CNN output
- Deep CNNs at the character level have been shown to perform well for text classification