Research Paper: Neural Turing Machines
Authors: Alex Graves, Greg Wayne, Ivo Danihelka
Published: 10 December 2014
Google DeepMind, London, United Kingdom
Report by: Musadiq Gilal (20k-1681)

Introduction and Insight:
This research paper comes from Google DeepMind. It draws an intelligent analogy between Turing machines and the human brain: by coupling a neural network with a Turing-machine-style external memory, the authors form a Neural Turing Machine, or NTM. Thanks to this combination, the NTM is capable of learning simple algorithms such as copying, sorting, and associative recall.

Computer programs rely on three fundamental mechanisms: elementary operations (such as arithmetic), logical flow control (branching), and external memory that can be written to and read from during computation. Despite its widespread success in modelling complex data, contemporary machine learning has largely neglected external memory and logical flow control.

Recurrent neural networks (RNNs) stand out among machine learning methods for their capacity to learn and carry out complex transformations of data over extended periods of time. RNNs are also known to be Turing-complete; however, what is feasible in theory is not always straightforward in practice. The NTM therefore enriches the capabilities of standard recurrent networks so as to simplify the solution of algorithmic tasks. The paper calls the device a "Neural Turing Machine" (NTM) because this enrichment comes chiefly from a large, addressable memory, by analogy with Turing's enrichment of finite-state machines with an infinite memory tape. According to my understanding of the paper, an NTM, unlike a Turing machine, is a differentiable computer that can be trained by gradient descent, which yields a practical mechanism for learning programs.

What are Recurrent Neural Networks?
Recurrent neural networks are machines with dynamic state, that is, state whose evolution depends both on the input to the system and on the current state. In contrast to hidden Markov models, which also have dynamic state, RNNs have a distributed state and are therefore substantially more powerful in terms of memory and computation. Dynamic state is important because it allows context-dependent computation: a signal entering at one moment can change the network's behaviour at a much later moment.

Standard RNNs, however, struggle to preserve such signals over long intervals because they fade away or blow up; LSTM solves this issue by embedding perfect integrators for memory storage in the network. The simplest example of a perfect integrator is x(t + 1) = x(t) + i(t), where i(t) is an input to the system. Because of the implicit identity matrix acting on x(t), signals neither vanish nor explode dynamically. If we attach to this integrator a mechanism that allows an enclosing network to choose when the integrator listens to its input, we obtain an equation of the form x(t + 1) = x(t) + g(context) i(t). We can now store information selectively for an indefinite length of time.
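To make the gating idea concrete, the following is a minimal sketch of that gated integrator in Python with NumPy. The gate values are hand-set for illustration, whereas in an LSTM they would come from learned sigmoid units conditioned on context; none of the names or numbers below come from the paper.

```python
import numpy as np

# Minimal sketch of the gated perfect integrator x(t + 1) = x(t) + g(context) * i(t).
# The gate g is hand-set here; in an LSTM it would be a learned, context-dependent sigmoid.
x = np.zeros(3)                                   # integrator state
inputs = [np.array([1.0, 0.0, 2.0]),
          np.array([5.0, 5.0, 5.0]),
          np.array([0.0, 1.0, 0.0])]
gates = [1.0, 0.0, 1.0]                           # "listen" at steps 0 and 2, ignore step 1

for i_t, g_t in zip(inputs, gates):
    x = x + g_t * i_t                             # identity carry of x(t): no vanishing or exploding

print(x)                                          # -> [1. 1. 2.]; the gated-out step left no trace
```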
Working / Architecture:

Neural Turing Machines:
A Neural Turing Machine (NTM) has two fundamental components: a neural network controller and a memory bank. Figure 1 of the paper shows a high-level diagram of the architecture. During each update cycle, the controller network receives input from the outside world and responds by emitting output. Through a set of parallel read and write heads it can also read from and write to a memory matrix; in the figure, a dashed line marks the boundary between the NTM circuit and the outside world.

Like most neural networks, the controller communicates with the external environment through vectors of inputs and outputs. Unlike a standard network, however, it also interacts with a memory matrix via selective read and write operations. By analogy with the Turing machine, the authors call the network outputs that parametrise these operations "heads".

Importantly, every component of the architecture is differentiable, which makes it straightforward to train with gradient descent. This is achieved by read and write operations that interact, to a greater or lesser degree, with all the elements in memory rather than addressing a single element, as a conventional Turing machine or digital computer would. Each weighting, one per read or write head, defines the degree to which the head reads or writes at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.

Read Operation:
Let M_t denote the contents of the N x M memory matrix at time t, where N is the number of memory locations and M is the size of the vector stored at each location, and let w_t be the vector of weightings over the N locations emitted by a read head at time t. Since all weightings are normalised, the N elements w_t(i) obey the constraints

    sum_i w_t(i) = 1,    0 <= w_t(i) <= 1 for all i.

The length-M read vector r_t returned by the head is defined as a convex combination of the row vectors M_t(i) in memory:

    r_t = sum_i w_t(i) M_t(i),

which is clearly differentiable with respect to both the memory and the weighting.

Write Operation:
Taking inspiration from the input and forget gates of LSTM, the paper decomposes each write into two parts: an erase followed by an add. Given a weighting w_t emitted by a write head at time t, together with an erase vector e_t whose M elements all lie in the range (0, 1), the memory vectors M_{t-1}(i) from the previous time step are updated as

    M~_t(i) = M_{t-1}(i) [1 - w_t(i) e_t],

where the multiplication against the memory location acts point-wise and 1 is a row vector of all ones. A memory location's elements are therefore reset to zero only when both the weighting at that location and the erase element are one; otherwise the memory is left unchanged. When there are multiple write heads, the erasures can be performed in any order, because multiplication is commutative.

After the erase step, each write head also produces a length-M add vector a_t, which is added to the memory:

    M_t(i) = M~_t(i) + w_t(i) a_t.

Once again, the order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time t, and since both erase and add are differentiable, the composite write operation is differentiable too. Note that both the erase and add vectors have M independent components, allowing fine-grained control over which elements of each memory location are modified.

Addressing Mechanisms:
The equations above specify how reading and writing are carried out, but not how the weightings themselves are produced. These weightings arise by combining two addressing mechanisms with complementary facilities. The first mechanism, "content-based addressing", focuses attention on locations whose current values are similar to values emitted by the controller; this is related to the content-addressing of Hopfield networks (Hopfield, 1982).
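As an illustration of the read and erase/add write operations above, here is a small NumPy sketch. The function names, shapes, and toy values are mine rather than the paper's; in the actual NTM the weightings, erase vectors, and add vectors are produced by the controller.

```python
import numpy as np

def read(memory, w):
    # Read vector r_t = sum_i w_t(i) * M_t(i): a convex combination of memory rows.
    # memory: (N, M) matrix M_t; w: (N,) weighting, non-negative and summing to 1.
    return w @ memory

def write(memory, w, erase, add):
    # Erase-then-add update: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t),
    # followed by M_t(i) = M~_t(i) + w_t(i) * a_t.
    # w: (N,) weighting; erase: (M,) vector with elements in (0, 1); add: (M,) vector.
    erased = memory * (1.0 - np.outer(w, erase))   # point-wise erase at each location
    return erased + np.outer(w, add)               # then add the new content

# Tiny demonstration with N = 4 locations, each storing a vector of size M = 3.
memory = np.zeros((4, 3))
w = np.array([0.0, 1.0, 0.0, 0.0])                 # head focused sharply on location 1
memory = write(memory, w, erase=np.zeros(3), add=np.array([1.0, 2.0, 3.0]))
print(read(memory, w))                             # -> [1. 2. 3.]
```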
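For the content-based scheme just introduced (and developed further below), the paper compares a key vector emitted by the controller against each memory row using cosine similarity, sharpens the comparison with a key-strength parameter, and normalises the result with a softmax. The sketch below follows that scheme with illustrative values of my own choosing.

```python
import numpy as np

def content_addressing(memory, key, beta=1.0):
    # Content-based weighting: cosine similarity between the controller's key and
    # every memory row, scaled by the key strength beta and normalised with a softmax.
    # memory: (N, M) matrix; key: (M,) vector; returns an (N,) weighting summing to 1.
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = (memory @ key) / norms            # cosine similarity per location
    scores = beta * similarity
    scores = scores - scores.max()                 # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()

memory = np.array([[0.0, 0.0, 1.0],
                   [1.0, 2.0, 3.0],
                   [5.0, 0.0, 0.0]])
w = content_addressing(memory, key=np.array([1.0, 2.0, 3.0]), beta=10.0)
print(w.round(3))                                  # peaked on the second row, which matches the key
```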
The benefit of content-based addressing is that retrieval is simple: the controller only needs to produce an approximation of part of the stored data, which is then compared against memory to yield the exact stored value. Not all problems, however, are well suited to content-based addressing. In some tasks the content of a variable is arbitrary, yet the variable still needs a recognisable name or address. For example, in an arithmetic problem the variables x and y can take on any two values, but the procedure f(x, y) = x * y must still be defined. A controller for such a task could take the values of x and y, store them at separate addresses, and later retrieve them and apply a multiplication algorithm. Here the variables are addressed by location rather than by content, a scheme referred to as "location-based addressing". Since the content of a memory location may itself include location information, content-based addressing is strictly more general than location-based addressing. In the authors' experiments, however, providing location-based addressing as a primitive operation proved essential for certain forms of generalisation, so both mechanisms are used concurrently.

Controller Network:
The NTM architecture described above has several free parameters, including the size of the memory, the number of read and write heads, and the range of allowed location shifts. The architectural choice with perhaps the greatest influence, however, is the type of neural network used as the controller; in particular, one must decide between a feedforward and a recurrent network. A recurrent controller such as LSTM has internal memory of its own, which can complement the larger memory in the matrix. If the controller is compared to the central processing unit of a digital computer (albeit one whose instructions are learned rather than predetermined), then the memory matrix plays the role of RAM, and the hidden activations of a recurrent controller are akin to the processor's registers: they allow the controller to mix information across multiple time steps of operation. A feedforward controller, in contrast, can mimic a recurrent network by repeatedly reading from and writing to the same memory location.

Experiments:
This section of the paper presents preliminary experiments on a set of simple algorithmic tasks such as copying and sorting data sequences. The goal was to show that the NTM can solve these problems, and that it does so by learning compact internal programs. The hallmark of such solutions is that they generalise well beyond the range of the training data. For instance, the authors wanted to see whether a network trained to copy sequences of up to 20 elements could copy a sequence of length 100 with no further training. For every experiment, three architectures were compared: a standard LSTM network, an NTM with an LSTM controller, and an NTM with a feedforward controller. All tasks were episodic, so the networks' dynamic state was reset at the start of each input sequence; for the LSTM networks this meant setting the previous hidden state to a learned bias vector, while for the NTM the previous read vectors, the controller's previous state, and the contents of memory were all reset to bias values. All tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with a cross-entropy objective. Sequence prediction errors are reported in bits per sequence.

Example Algorithm:
For the copy task, the paper summarises the behaviour the NTM learns with pseudocode of the following form:

    initialise: move head to start location
    while input delimiter not seen do
        receive input vector
        write input to head location
        increment head location by 1
    end while
    return head to start location
    while true do
        read output vector from head location
        emit output
        increment head location by 1
    end while
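Ignoring the attention machinery and the learning process entirely, a literal Python rendering of that pseudocode over a plain list of memory slots might look as follows; the function name and data layout are illustrative only, not part of the paper.

```python
def copy_task(input_stream):
    # Non-neural rendering of the copy pseudocode: an explicit head pointer moves over
    # a list of slots. The real NTM realises each step with differentiable, weighted
    # reads and writes rather than hard pointer arithmetic.
    memory = []
    head = 0                                  # initialise: move head to start location
    for vector in input_stream:
        if vector is None:                    # input delimiter seen
            break
        memory.append(vector)                 # write input to head location
        head += 1                             # increment head location by 1
    head = 0                                  # return head to start location
    outputs = []
    while head < len(memory):                 # the learned loop runs until the sequence ends
        outputs.append(memory[head])          # read output vector from head location and emit it
        head += 1                             # increment head location by 1
    return outputs

print(copy_task([[1, 0], [0, 1], [1, 1], None]))   # -> [[1, 0], [0, 1], [1, 1]]
```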
Associative Recall:
The preceding tasks show that the NTM can apply algorithms to relatively simple, linear data structures. The next level of complexity in organising data arises from "indirection", that is, when one data item points to another. The authors test the NTM's ability to learn an instance of this more interesting class by constructing a list of items such that querying with one of the items requires the network to return the subsequent item. More specifically, they define an item as a sequence of binary vectors bounded on the left and right by delimiter symbols.

Conclusion:
The authors presented the Neural Turing Machine, a neural network architecture inspired both by biological models of working memory and by the design of digital computers. Like conventional neural networks, the architecture is differentiable end-to-end and can be trained with gradient descent. Their experiments show that it is capable of learning simple algorithms from example data and of using those algorithms to generalise well beyond its training regime.