
TOA Research Report

Research Paper: Neural Turing Machines
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Published: 10 December 2014, Google DeepMind, London, United Kingdom
Report by: Musadiq Gilal (20k-1681)
Introduction and Insight:
This research paper comes from a study at Google DeepMind. It draws an intelligent analogy between Turing machines and the human brain: by coupling a neural network with a Turing-machine-like external memory, it forms a Neural Turing Machine, or NTM. Thanks to this combination, the NTM is capable of learning simple algorithms such as copying, sorting and associative recall.
Computer programs rely on three fundamental mechanisms: basic operations (such as arithmetic), logical flow control (branching), and external memory, which can be written to and read from during computation. Despite its widespread success in modelling complex data, contemporary machine learning has largely ignored external memory and logical flow control. Recurrent neural networks (RNNs) stand out from other machine learning techniques for their capacity to learn and perform complex transformations of data over long intervals of time; RNNs are also known to be Turing-complete. However, what is feasible in theory is not always straightforward in practice. The NTM paper therefore enriches conventional recurrent networks so that they can more readily solve such computational tasks.
The paper refers to this device as a "Neural Turing Machine" (NTM) because the enrichment comes primarily from a large, addressable memory, much as Turing enriched finite-state machines with an infinite memory tape. According to my understanding of the paper, an NTM, unlike a Turing machine, is a differentiable computer that can be trained by gradient descent, providing a practical method for learning programs.
What are Recurrent Neural Networks?
Recurrent neural networks are machines with dynamic state, that is, state whose evolution depends both on the system's input and on its current state. In contrast to hidden Markov models, which also have dynamic state, RNNs have a distributed state, which makes them substantially more powerful in terms of memory and processing. Dynamic state matters because it allows context-dependent computation: a signal entering at one point can change the network's behaviour at a later point.
LSTM addresses the vanishing and exploding gradient problem by embedding perfect integrators for memory storage in the network. The simplest illustration of a perfect integrator is the equation x(t + 1) = x(t) + i(t), where i(t) is an input to the system. The implicit identity matrix (the state is carried forward as I x(t)) means that signals do not dynamically vanish or explode. If we attach a mechanism to this integrator that lets an enclosing network choose when the integrator listens to its inputs, we obtain an equation of the form x(t + 1) = x(t) + g(context) i(t). We can now selectively retain information for an indefinite length of time.
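As a toy illustration (not code from the paper), the gated integrator above can be sketched in a few lines of Python; in practice the gate value g would be produced by the enclosing network from the current context:
import numpy as np

def gated_integrator_step(x, i, g):
    # x(t + 1) = x(t) + g(context) * i(t): the gate g in [0, 1] decides how much
    # of the current input is written into the integrator's state.
    return x + g * i

x = np.zeros(3)                                                  # integrator state
x = gated_integrator_step(x, np.array([1.0, 2.0, 3.0]), g=1.0)   # gate open: listen
x = gated_integrator_step(x, np.array([9.0, 9.0, 9.0]), g=0.0)   # gate closed: ignore
# x is still [1. 2. 3.]: the stored information persists until the gate opens again.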
Working / Architecture:
Figure 1 of the paper gives a high-level view of the architecture: during each update cycle, the controller network receives inputs from the outside world and emits outputs in response. Through a set of parallel read and write heads, it can also read from and write to a memory matrix. A dashed line in the figure marks the boundary between the NTM circuit and the outside world.
Neural Turing Machines:
A neural network controller and a memory bank are the two fundamental parts of the Neural Turing Machine (NTM) design; a high-level diagram of the architecture is shown in Figure 1. Like most neural networks, the controller communicates with the outside environment through input and output vectors. Unlike a conventional network, however, it also interacts with a memory matrix using read and write operations. By analogy with the Turing machine, the authors call the network outputs that parametrize these operations "heads".
Importantly, every component of the architecture is differentiable. Read and write operations are "blurry": rather than addressing a single element, they interact to varying degrees with all the elements in memory. Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.
Read Operation:
The authors let M_t denote the contents of the N x M memory matrix at time t, where N is the number of memory locations and M is the size of the vector stored at each location. They then let w_t denote a vector of weightings over the N locations emitted by a read head at time t. Since all weightings are normalised, the N elements w_t(i) obey the following constraints:
sum over i of w_t(i) = 1, with 0 <= w_t(i) <= 1 for every i.
The length-M read vector r_t returned by the head is defined as a convex combination of the row-vectors M_t(i) in memory:
r_t = sum over i of w_t(i) M_t(i),
which is clearly differentiable with respect to both the memory and the weighting.
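The read is simply a weighted sum over memory rows. Below is a minimal NumPy sketch of this operation (an illustrative rendering with my own variable names, not the authors' code):
import numpy as np

def read(memory, w):
    # memory: the (N, M) matrix M_t; w: length-N weighting with w >= 0 and sum(w) = 1.
    # Returns r_t = sum over i of w_t(i) * M_t(i), a convex combination of memory rows
    # that is differentiable in both the memory and the weighting.
    return w @ memory   # shape (M,)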
Write Operation:
Drawing inspiration from the input and forget gates in LSTM, the paper divides each write into two parts: an erase followed by an add. Given a weighting w_t emitted by a write head at time t, together with an erase vector e_t whose M elements all lie in the range (0, 1), the memory vectors M_{t-1}(i) from the previous time-step are updated as follows:
M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t],
where the multiplication against the memory location acts point-wise and 1 is a row-vector of all ones. The elements of a memory location are therefore reset to zero only where both the weighting at that location and the erase element are one; otherwise the memory is left unchanged. When there are several write heads, the erasures can be carried out in any order because multiplication is commutative.
After the erase step, each write head also produces a length-M add vector a_t, which is added to the memory:
M_t(i) = M~_t(i) + w_t(i) a_t.
Once again, the order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time t. Since both erase and add are differentiable, the composite write operation is differentiable too. Note that both the erase and add vectors have M independent components, allowing fine-grained control over which elements in each memory location are modified.
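Under the same conventions as the read sketch, the erase-then-add write can be expressed as follows (again an illustrative sketch with my own variable names, not the authors' code):
import numpy as np

def write(memory, w, erase, add):
    # memory: (N, M) matrix M_{t-1}; w: length-N weighting; erase, add: length-M vectors.
    # Erase step: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t), applied point-wise.
    erased = memory * (1.0 - np.outer(w, erase))
    # Add step:   M_t(i) = M~_t(i) + w_t(i) * a_t.
    return erased + np.outer(w, add)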
Addressing Mechanisms:
Although the authors have now presented the equations for reading and writing, they have not yet explained how the weightings themselves are produced. The weightings arise from combining two addressing mechanisms with complementary facilities. The first mechanism, "content-based addressing", focuses attention on locations whose current values are similar to values emitted by the controller. This is related to the content-addressing of Hopfield networks (Hopfield, 1982). The advantage of content-based addressing is that retrieval is simple: the controller only needs to produce an approximation of part of the stored data, which is then compared to memory to yield the exact stored value.
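The report does not reproduce the formula, but in the paper content-based addressing compares a key vector emitted by the controller against each memory row using cosine similarity, then normalises the similarities with a softmax sharpened by a key-strength scalar. A minimal sketch under those assumptions:
import numpy as np

def content_addressing(memory, key, beta):
    # Cosine similarity K(k_t, M_t(i)) between the key and every memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    # Softmax sharpened by the key strength beta yields the content-based weighting.
    e = np.exp(beta * sims)
    return e / e.sum()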
Not all problems, however, lend themselves to content-based addressing. In some tasks the content of a variable is arbitrary, yet the variable still needs a recognisable name or address. For example, the arithmetic function f(x, y) = x * y must be well defined even though x and y can each take arbitrary values. A controller could take the values of the variables x and y, store them at different addresses, then retrieve them and apply a multiplication algorithm. In this case the variables are addressed by location rather than by content; this form of addressing is called "location-based addressing". Since the content of a memory location may itself include location information, content-based addressing is strictly more general than location-based addressing. In their experiments, however, providing location-based addressing as a primitive operation turned out to be crucial for certain forms of generalisation, so the authors use both mechanisms concurrently.
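The report does not describe the location-based mechanism in detail; in the paper it is realised by rotating the previous weighting with a learned shift distribution (after interpolating it with the content-based weighting). A minimal sketch of the rotational shift alone, assuming a shift distribution s over the offsets -1, 0 and +1:
import numpy as np

def shift_weighting(w, s, offsets=(-1, 0, 1)):
    # Circular convolution of the weighting w with the shift distribution s,
    # letting attention move to neighbouring memory locations purely by position.
    shifted = np.zeros_like(w, dtype=float)
    for prob, offset in zip(s, offsets):
        shifted += prob * np.roll(w, offset)
    return shifted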
Controller Network:
The size of the memory, the number of read and write heads, and the range of allowed location shifts are some of the free parameters of the NTM architecture described above. Perhaps the most significant architectural choice, however, is the type of neural network used as the controller. In particular, one must decide between a feedforward and a recurrent network. A recurrent controller such as LSTM has internal memory of its own that can complement the larger memory matrix. If the controller is compared to the central processing unit of a digital computer (albeit one with adaptive rather than predetermined instructions), then the memory matrix is analogous to RAM, and the hidden activations of a recurrent controller are analogous to the processor's registers, allowing the controller to mix information across multiple time steps of operation. A feedforward controller, in contrast, can mimic a recurrent network by repeatedly reading from and writing to the same location in memory.
Experiments:
This section presents preliminary experiments on a set of simple algorithmic tasks, including copying and sorting data sequences. The goal was to demonstrate not only that the NTM can solve these problems, but that it can do so by learning compact internal programs. Such solutions stand out because they generalise well beyond the range of the training data. For instance, the authors were interested in whether a network trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training.
For every experiment, three architectures were compared: a standard LSTM network, an NTM with an LSTM controller, and an NTM with a feedforward controller. All of the tasks were episodic, so the authors reset the networks' dynamic state at the start of each input sequence. For the LSTM networks this meant setting the previous hidden state to a learned bias vector. For the NTM, the previous read vectors, the controller's previous state and the contents of the memory were all reset to bias values. All of the tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with a cross-entropy objective. Sequence prediction errors are reported in bits per sequence.
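Given binary targets and sigmoid outputs, the bits-per-sequence error is simply the cross-entropy of one sequence measured in base 2; a small sketch of that metric (my own rendering, not the authors' code):
import numpy as np

def bits_per_sequence(targets, predictions, eps=1e-9):
    # targets: binary array; predictions: logistic sigmoid outputs in (0, 1).
    # The base-2 cross-entropy of the whole sequence gives its error in bits.
    p = np.clip(predictions, eps, 1 - eps)
    return -np.sum(targets * np.log2(p) + (1 - targets) * np.log2(1 - p))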
Example Algorithm:
initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while
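The pseudocode above corresponds to the copy procedure discussed in the experiments, and it can be mirrored in plain Python to make the control flow concrete (an illustrative simulation using an ordinary list as the memory, not the learned network itself):
def copy_task(inputs):
    # inputs: a list of input vectors terminated by the delimiter None.
    memory, head = [], 0
    for vector in inputs:
        if vector is None:            # input delimiter seen
            break
        memory.append(vector)         # write input to head location
        head += 1                     # increment head location by 1
    head = 0                          # return head to start location
    outputs = []
    while head < len(memory):         # the NTM loops forever; here we stop at the end
        outputs.append(memory[head])  # read output vector from head location
        head += 1                     # increment head location by 1
    return outputs                    # the emitted outputs reproduce the input sequence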
Associative Recall:
The previous tasks show that the NTM can apply algorithms to relatively simple, linear data structures. The next level of complexity in data organisation arises from "indirection", that is, when one data item refers to another. The authors test the NTM's capacity to learn an instance of this more interesting class by constructing a list of items such that querying with one of the items requires the network to return the subsequent item. More specifically, they define an item as a sequence of binary vectors bounded on the left and right by delimiter symbols.
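As a concrete picture of how such an episode might be assembled (an illustrative construction; the vector widths and item counts used in the actual experiments may differ), each item is a short sequence of random binary vectors, and the target for a query is the item that follows it:
import numpy as np

def build_episode(num_items=3, item_len=2, width=6, seed=0):
    # Each item is a short sequence of binary vectors; when fed to the network,
    # items are bounded on the left and right by delimiter symbols.
    rng = np.random.default_rng(seed)
    items = [rng.integers(0, 2, size=(item_len, width)) for _ in range(num_items)]
    query_index = rng.integers(0, num_items - 1)       # never query the last item
    query, target = items[query_index], items[query_index + 1]
    return items, query, target   # the network must answer the query with the next item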
Conclusion:
The authors presented the Neural Turing Machine, a neural network architecture inspired both by biological models of working memory and by the design of digital computers. Like a conventional neural network, the architecture is differentiable end-to-end and can be trained by gradient descent. Their experiments show that it is capable of learning simple algorithms from example data and of generalising those algorithms well beyond its training regime.