Generating Text with Recurrent Neural Networks

Ilya Sutskever, James Martens, and Geoffrey Hinton, ICML 2011
2013-04-01
Institute of Electronics, NCTU
Advisor: 王聖智 S. J. Wang
Student: 陳冠廷 K. T. Chen
Outline
• Introduction
• Motivation
• What is an RNN?
• Why do we choose an RNN for this problem?
• How to train an RNN?
• Contribution
• Character-Level language modeling
• The multiplicative RNN
• The Experiments
• Discussion
Motivation
• Read a few sentences and then try to predict the next character.
• Prompt: "Easter is a Christian festival and holiday celebrating the resurrection of Jesus Christ ...?"
• Full text: "Easter is a Christian festival and holiday celebrating the resurrection of Jesus Christ on the third day after his crucifixion at Calvary as described in the New Testament."
Recurrent neural networks
• A recurrent neural network (RNN) is a class of neural networks in which connections between units form a directed cycle.
[Figure: a feed-forward network (input → hidden → output) next to a recurrent network, whose hidden layer also feeds back into itself]
Why do we choose RNNs?
• RNNs are well suited to sequential data: the hidden state acts as a memory of what has been seen so far.
• An RNN unrolled over time is "a neural network in time", i.e. a deep feed-forward network with weights shared across time steps.
[Figure: RNN unrolled over time steps t-1, t, t+1; inputs feed the hiddens through W_vh, hiddens feed the next hiddens through W_hh, and hiddens feed the predictions through W_ho]
How to train an RNN?
• Backpropagation through time (BPTT): unroll the network over the time steps and apply ordinary backpropagation (a minimal sketch follows the figure).
• The gradient is then easy to compute.
• RNNs learn by minimizing the training error.
[Figure: the same unrolled network; the error signal flows backwards through W_ho, W_hh and W_vh across time steps]
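A minimal sketch of BPTT for a plain RNN (illustrative sizes and names, not the authors' code): the forward pass unrolls the recurrence over a subsequence, and the backward pass propagates the error through every time step.

    import numpy as np

    V, H = 86, 100                      # vocabulary size, hidden size (illustrative)
    rng = np.random.default_rng(0)
    W_vh = rng.normal(0, 0.01, (H, V))  # input-to-hidden
    W_hh = rng.normal(0, 0.01, (H, H))  # hidden-to-hidden
    W_ho = rng.normal(0, 0.01, (V, H))  # hidden-to-output
    b_h, b_o = np.zeros(H), np.zeros(V)

    def bptt(inputs, targets, h0):
        """inputs/targets: lists of character indices; returns loss and gradients."""
        xs, hs, ps, loss = {}, {-1: h0}, {}, 0.0
        for t, (i, j) in enumerate(zip(inputs, targets)):       # forward, unrolled in time
            xs[t] = np.zeros(V); xs[t][i] = 1.0                 # 1-of-V encoding
            hs[t] = np.tanh(W_vh @ xs[t] + W_hh @ hs[t - 1] + b_h)
            o = W_ho @ hs[t] + b_o
            ps[t] = np.exp(o - o.max()); ps[t] /= ps[t].sum()   # softmax
            loss -= np.log(ps[t][j])                            # cross-entropy
        dW_vh, dW_hh, dW_ho = np.zeros_like(W_vh), np.zeros_like(W_hh), np.zeros_like(W_ho)
        db_h, db_o, dh_next = np.zeros_like(b_h), np.zeros_like(b_o), np.zeros(H)
        for t in reversed(range(len(inputs))):                  # backward through time
            do = ps[t].copy(); do[targets[t]] -= 1.0            # dloss/dlogits
            dW_ho += np.outer(do, hs[t]); db_o += do
            dh = W_ho.T @ do + dh_next                          # error from output and from step t+1
            dz = (1.0 - hs[t] ** 2) * dh                        # back through tanh
            dW_vh += np.outer(dz, xs[t]); dW_hh += np.outer(dz, hs[t - 1]); db_h += dz
            dh_next = W_hh.T @ dz                               # error passed on to step t-1
        return loss, (dW_vh, dW_hh, dW_ho, db_h, db_o)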
RNNs are hard to train
• They can be volatile and can exhibit long-range sensitivity to small parameter perturbations (the "butterfly effect").
• The "vanishing gradient problem" makes gradient descent ineffective: the gradient shrinks as it is propagated back through many time steps.
[Figure: unrolled RNN in which the learning signal fades as it flows backwards from the outputs through the hidden states towards the early inputs]
How to overcome vanishing gradients?
• Long short-term memory (LSTM): modify the architecture of the network so that data is explicitly written to, kept in, and read from gated memory cells.
• Hessian-Free optimizer (James Martens et al., 2011): based on Newton's method combined with the conjugate-gradient algorithm.
• Echo State Networks: only the hidden-to-output weights are learned (a minimal sketch follows).
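For the echo-state idea above, a minimal sketch under simple assumptions (a fixed random reservoir with spectral radius below 1, a toy one-dimensional signal, and a ridge-regression readout; none of this is the original implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res, T = 1, 300, 2000
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W_res = rng.normal(0, 1, (n_res, n_res))
    W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius < 1

    u = np.sin(np.arange(T) / 10.0).reshape(-1, 1)            # toy input sequence
    y = np.roll(u, -1, axis=0)                                # target: the next value

    X, h = np.zeros((T, n_res)), np.zeros(n_res)
    for t in range(T):                                        # run the fixed random reservoir
        h = np.tanh(W_in @ u[t] + W_res @ h)
        X[t] = h

    lam = 1e-6                                                # only the readout is learned
    W_out = np.linalg.solve(X.T @ X + lam * np.eye(n_res), X.T @ y)
    pred = X @ W_out                                          # one-step-ahead predictions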
Character-Level language modeling
• The RNN observes a sequence of characters.
• The target output at each time step is the input character at the next time step (a small example follows).
• Example "Hello": after reading "H" the target is "e", after "He" it is "l", after "Hel" it is "l", and after "Hell" it is "o".
• The hidden state stores the relevant information about the sequence seen so far.
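A tiny illustration (Python, using the "Hello" example above; not the authors' code) of how the input/target pairs are formed by shifting the text one character:

    text = "Hello"
    inputs, targets = text[:-1], text[1:]      # "Hell" -> "ello"
    for x, y in zip(inputs, targets):
        print(f"after seeing {x!r} the target is {y!r}")
    # after seeing 'H' the target is 'e'
    # after seeing 'e' the target is 'l'
    # after seeing 'l' the target is 'l'
    # after seeing 'l' the target is 'o'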
The standard RNN
• The input is the current character, encoded 1-of-86; a softmax output layer predicts the distribution for the next character.
• Hidden state:  h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
• Output:  o_t = W_oh h_t + b_o
• The current input x_t is transformed via the visible-to-hidden weight matrix W_hx, and then contributes additively to the input for the current hidden state (a one-step sketch follows).
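A minimal NumPy sketch of one step of this standard RNN, assuming the 1-of-86 encoding and illustrative sizes (not the authors' implementation):

    import numpy as np

    V, H = 86, 1500
    rng = np.random.default_rng(0)
    W_hx = rng.normal(0, 0.01, (H, V))   # visible-to-hidden
    W_hh = rng.normal(0, 0.01, (H, H))   # hidden-to-hidden
    W_oh = rng.normal(0, 0.01, (V, H))   # hidden-to-output
    b_h, b_o = np.zeros(H), np.zeros(V)

    def rnn_step(char_index, h_prev):
        """Consume one character, update the hidden state, and return the
        predicted distribution over the next character."""
        x = np.zeros(V); x[char_index] = 1.0
        h = np.tanh(W_hx @ x + W_hh @ h_prev + b_h)
        o = W_oh @ h + b_o
        p = np.exp(o - o.max()); p /= p.sum()       # softmax over the 86 characters
        return h, p

    h = np.zeros(H)
    for c in [7, 4, 11, 11]:                        # indices standing in for "h","e","l","l"
        h, p = rnn_step(c, h)
    # p is now the predicted distribution for the character after "hell"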
Some motivation from modeling a tree
[Figure: a tree of contexts; from the node "..fix" the characters i, e, n lead to the new nodes "..fixi", "..fixe", ".fixin"]
• Each node is a hidden-state vector. The next character must transform it into a new node.
• The next hidden state therefore needs to depend on the conjunction of the current character and the current hidden representation.
The Multiplicative RNN
• They tried several neural network architectures and found the "Multiplicative RNN" (MRNN) to be more effective than the regular RNN.
• The hidden-to-hidden weight matrix is chosen by the current input character x_t (a naive sketch of this idea follows):
  h_t = tanh(W_hx x_t + W_hh^(x_t) h_{t-1} + b_h)
  o_t = W_oh h_t + b_o
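To make "the weight matrix is chosen by the current character" concrete, here is an illustrative sketch of the naive version with one hidden-to-hidden matrix per character (tiny sizes, not the actual model):

    import numpy as np

    V, H = 86, 10
    rng = np.random.default_rng(0)
    W_hx = rng.normal(0, 0.01, (H, V))
    W_hh_per_char = rng.normal(0, 0.01, (V, H, H))   # one full matrix per character
    b_h = np.zeros(H)

    def naive_mrnn_step(char_index, h_prev):
        x = np.zeros(V); x[char_index] = 1.0
        W_hh = W_hh_per_char[char_index]             # weight matrix selected by x_t
        return np.tanh(W_hx @ x + W_hh @ h_prev + b_h)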
The Multiplicative RNN
• Naïve implementation: assign a separate hidden-to-hidden matrix to each character.
• This requires a lot of parameters (86 × 1500 × 1500), could make the net overfit, and is difficult to parallelize on a GPU.
• Instead, factorize the per-character matrices: fewer parameters and easier to parallelize (a rough parameter count follows).
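A back-of-the-envelope count (my own arithmetic, assuming 86 characters, 1500 hidden units and 1500 factors) of the character-dependent hidden-to-hidden parameters:

    chars, hidden, factors = 86, 1500, 1500
    naive = chars * hidden * hidden                      # one full matrix per character
    factored = factors * chars + 2 * factors * hidden    # W_fx plus W_hf and W_fh
    print(naive)      # 193,500,000
    print(factored)   # 4,629,000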
The Multiplicative RNN
• We can get groups a and b to interact multiplicatively by using "factors".
• Each factor f has weight vectors u_f (from group a), w_f (from group b) and v_f (to group c).
• Output of factor f:  c_f = (b^T w_f)(a^T u_f) v_f = (b^T w_f)(v_f u_f^T) a,
  where b^T w_f is a scalar coefficient and v_f u_f^T is a rank-1 (outer-product) transition matrix.
• Summing over the factors:  c = Σ_f (b^T w_f) v_f u_f^T a
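A small numerical check (illustrative) that the per-factor form and the summed rank-one-matrix form above give the same c:

    import numpy as np

    rng = np.random.default_rng(0)
    na, nb, nc, nf = 5, 6, 7, 4                 # sizes of groups a, b, c and number of factors
    a, b = rng.normal(size=na), rng.normal(size=nb)
    U = rng.normal(size=(nf, na))               # rows are the u_f
    W = rng.normal(size=(nf, nb))               # rows are the w_f
    V = rng.normal(size=(nf, nc))               # rows are the v_f

    # per-factor form: c = sum_f (b.w_f)(a.u_f) v_f
    c1 = sum((b @ W[f]) * (a @ U[f]) * V[f] for f in range(nf))

    # matrix form: c = [sum_f (b.w_f) v_f u_f^T] a
    M = sum((b @ W[f]) * np.outer(V[f], U[f]) for f in range(nf))
    c2 = M @ a

    assert np.allclose(c1, c2)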
The Multiplicative RNN
• In the network, each factor f defines a rank-one matrix v_f u_f^T between the 1500 hidden units at the previous time step and the 1500 hidden units at the current one; the current character (1-of-86) gates each factor, and the new hidden state predicts the distribution for the next character (a one-step sketch follows).
[Figure: the factored hidden-to-hidden connection, c = Σ_f (b^T w_f) v_f u_f^T a, drawn as two layers of 1500 hidden units linked through the factors f, with the character input attached to the factors]
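Putting the pieces together, a minimal sketch of one factored MRNN step, assuming the update h_t = tanh(W_hx x_t + W_hf diag(W_fx x_t) W_fh h_{t-1} + b_h), with illustrative sizes (not the authors' GPU implementation):

    import numpy as np

    V, H, F = 86, 1500, 1500
    rng = np.random.default_rng(0)
    W_hx = rng.normal(0, 0.01, (H, V))   # visible-to-hidden
    W_fx = rng.normal(0, 0.01, (F, V))   # character -> per-factor gains
    W_fh = rng.normal(0, 0.01, (F, H))   # previous hidden -> factors
    W_hf = rng.normal(0, 0.01, (H, F))   # factors -> new hidden
    W_oh = rng.normal(0, 0.01, (V, H))   # hidden -> output
    b_h, b_o = np.zeros(H), np.zeros(V)

    def mrnn_step(char_index, h_prev):
        x = np.zeros(V); x[char_index] = 1.0
        f = (W_fx @ x) * (W_fh @ h_prev)          # factor outputs, gated by the character
        h = np.tanh(W_hx @ x + W_hf @ f + b_h)    # equals W_hf diag(W_fx x) W_fh h_prev + ...
        o = W_oh @ h + b_o
        p = np.exp(o - o.max()); p /= p.sum()     # distribution over the next character
        return h, p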
The Multiplicative RNN
[Figure: the MRNN unrolled over time steps t-1, t, t+1, t+2; the input characters drive the factors through W_vf and the hiddens through W_vh, the factors connect consecutive hidden states through W_fh and W_hf, and the hiddens produce the outputs through W_ho]
The Multiplicative RNN: key advantages
• The MRNN combines the conjunction of context and character more easily: after "fix" it can predict "i", "e" or "_", and after "fixi" it predicts "n".
• The MRNN has two nonlinearities per time step, which makes its dynamics even richer and more powerful.
The Experiments
• Training on three large datasets:
• ~1 GB of the English Wikipedia
• ~1 GB of articles from the New York Times
• ~100 MB of JMLR and AISTATS papers
• Compared with the Sequence Memoizer (Wood et al.) and PAQ (Mahoney et al.)
Training on subsequences
• The training text is millions of characters long, e.g. "This is an extremely long string of text ...".
• Cut it into overlapping subsequences of length 250:
  "This is an extre...", "his is an extrem...", "is is an extreme...", "s is an extremel...", " is an extremely...", "is an extremely ...", "s an extremely l...", ...
• Compute the gradient and the curvature on a subset of the subsequences, and use a different subset at each iteration (a small sketch follows).
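An illustrative sketch of this scheme (the helper names are my own, not the authors' code):

    import numpy as np

    def make_subsequences(text, length=250, stride=1):
        """Cut the long training string into overlapping length-250 pieces."""
        return [text[i:i + length] for i in range(0, len(text) - length, stride)]

    def sample_minibatch(subsequences, batch_size, rng):
        """Pick a different random subset of subsequences at each iteration."""
        idx = rng.choice(len(subsequences), size=batch_size, replace=False)
        return [subsequences[i] for i in idx]

    rng = np.random.default_rng(0)
    subs = make_subsequences("This is an extremely long string of text. " * 200)
    batch = sample_minibatch(subs, batch_size=32, rng=rng)
    # the gradient and curvature are then estimated on this subset only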
Parallelization
• Use the HF optimizer to evaluate the gradient and the curvature on large minibatches of data, with the work spread over many GPUs (a schematic sketch follows the figure).
[Figure: the data is split across eight GPUs; each GPU computes its share, and the partial results are summed into the total gradient and curvature]
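A schematic of the data-parallel sum (not the authors' implementation; a toy least-squares model stands in for the MRNN so the example stays self-contained):

    import numpy as np

    def chunk_gradient(w, X, y):
        # gradient of 0.5*||Xw - y||^2 on one chunk (one GPU's share of the work)
        return X.T @ (X @ w - y)

    def minibatch_gradient(w, X, y, n_devices=8):
        Xs, ys = np.array_split(X, n_devices), np.array_split(y, n_devices)
        partials = [chunk_gradient(w, Xi, yi) for Xi, yi in zip(Xs, ys)]  # run in parallel
        return np.sum(partials, axis=0)        # reduce: the chunk gradients simply add up

    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(256, 10)), rng.normal(size=256), np.zeros(10)
    g = minibatch_gradient(w, X, y)
    assert np.allclose(g, X.T @ (X @ w - y))   # same as computing it on the whole batch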
The architecture of the model
• 1500 hidden units and 1500 multiplicative factors, trained on subsequences of length 250.
[Figure: the unrolled network; at each of the 250 time steps an input character feeds a layer of 1500 hidden units, which produces a prediction]
• Arguably the largest and deepest neural network trained at the time.
Demo
• The MRNN extracts "higher level information", stores it for many time steps, and uses it to make a prediction.
• Parentheses sensitivity: sample continuations generated by the model:
• (Wlching et al. 2005) the latter has received numerical testimony without much deeply grow
• (Wlching, Wulungching, Alching, Blching, Clching et al." 2076) and Jill Abbas, The Scriptures reported that Achsia and employed a
• the sequence memoizer (Wood et al McWhitt), now called "The Fair Savings.'"" interpreted a critic. In t
• Wlching ethics, like virtual locations. The signature tolerator is necessary to en
• Wlching et al., or Australia Christi and an undergraduate work in over knowledge, inc
• They often talk about examples as of January 19, . The "Hall Your Way" (NRF film) and OSCIP
• Her image was fixed through an archipelago's go after Carol^^'s first century, but simply to
Discussion
• The text generated by the MRNN contains very few non-words (e.g., "cryptoliation", "homosomalist"), and working at the character level lets the MRNN deal with real words that it did not see in the training set.
• With more computational power, they could train much bigger MRNNs with millions of units and billions of connections.
References
• Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating Text with Recurrent Neural Networks. ICML 2011.
• Graham W. Taylor and Geoffrey E. Hinton. Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style.
• Geoffrey Hinton. Neural Networks for Machine Learning (Coursera).
• http://www.cs.toronto.edu/~ilya/rnn.html