[DR011] Neural Turing Machine

Posted on July 08, 2017

This popular work combines a traditional neural network with an external memory. By controlling read heads and write heads, the network, or controller, is able to store information in the memory and retrieve it later. The memory access resembles a Turing machine reading from and writing to a paper tape.


Although researchers were astonished at its advent, I think the model itself is just an extended version of LSTM if you treat the memory module as a special gate. However, training such a meticulously designed model to handle data properly is a very admirable effort. The model enables researchers to explore what is learned and to express the trained model in the form of pseudo-code. Apart from the design, the experiments (copying and sorting sequences) are totally different from traditional tasks.

If you are familiar with RNN/LSTM and attention models (check my previous post), you may interpret the Neural Turing Machine (NTM) as an improved soft itemwise attention model that operates on memory chunks.


So basically, NTM can be summarized by the first and second figures introduced in this post.

NTM is composed of a controller and a memory. The controller receives the input, communicates with the memory, and produces the output.

The memory at time step t is an N \times M matrix M_t, where N is the number of slots in the memory and M is the size of each memory slot. To address a location in the memory, they use a normalized weighting w_t such that:

\sum_{i} w_t(i) = 1, \qquad 0 \leq w_t(i) \leq 1

The memory reading r_t, given the address vector w_t, is a weighted sum over all rows:

r_t = \sum_{i} w_t(i) M_t(i)
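The read operation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the names ntm_read, M_t, w_t are my own.

```python
import numpy as np

# Sketch of the NTM read: r_t = sum_i w_t(i) * M_t(i), assuming a
# memory of shape (N, M) and a normalized address vector of shape (N,).
def ntm_read(memory, w):
    return w @ memory  # weighted sum over the N memory rows

# Toy example: N = 3 slots, M = 4 elements per slot.
M_t = np.arange(12, dtype=float).reshape(3, 4)
w_t = np.array([0.0, 1.0, 0.0])  # attend entirely to slot 1
r_t = ntm_read(M_t, w_t)         # equals row M_t[1]
```

With a one-hot w_t the read collapses to a single row; a softer w_t blends several slots.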

The writing procedure is like the forget gate and input gate in a traditional LSTM. In their terms, an erase vector e_t and an add vector a_t are learned to control the memory decay and addition respectively:

\tilde{M}_t(i) = M_{t-1}(i) \left[ \mathbf{1} - w_t(i) e_t \right]

M_t(i) = \tilde{M}_t(i) + w_t(i) a_t
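The erase-then-add write can be sketched as follows; a minimal sketch with illustrative names, assuming e_t has entries in [0, 1].

```python
import numpy as np

# Sketch of the NTM write, assuming memory of shape (N, M), address w of
# shape (N,), and erase/add vectors e, a of shape (M,).
def ntm_write(memory, w, e, a):
    # Decay: each row i is scaled elementwise by (1 - w(i) * e).
    erased = memory * (1.0 - np.outer(w, e))
    # Addition: each row i receives w(i) * a.
    return erased + np.outer(w, a)

M_prev = np.ones((3, 4))
w_t = np.array([0.0, 1.0, 0.0])  # write head focused on slot 1
e_t = np.ones(4)                 # fully erase the addressed slot
a_t = np.full(4, 5.0)            # then write 5s into it
M_t = ntm_write(M_prev, w_t, e_t, a_t)
```

Slots with w_t(i) = 0 are untouched, which is exactly what lets the attention weighting double as a write protection mask.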

Stop here and you will find that the NTM is highly similar to LSTM, except that in LSTM the memory is updated by fully connecting with its previous copy and the current input, while in NTM the memory update is controlled by an attention mechanism.


Now, how to generate the addressing vector w_t becomes the crucial problem.

Let’s dive into it step by step.
During Content Addressing, they learn w_t^c by calculating:

w_t^c(i) = \frac{\exp\left( \beta_t K[k_t, M_t(i)] \right)}{\sum_{j} \exp\left( \beta_t K[k_t, M_t(j)] \right)}

where K[\cdot, \cdot] is a measure of similarity (cosine similarity in the paper). I called their addressing procedure "an improved soft itemwise attention" for a reason: this part is virtually a standard soft-attention mechanism.

However, the use of K[\cdot, \cdot] still surprises me. To calculate traditional soft attention, we usually just learn a vector and normalize it over positions. NTM instead learns a key, or probe, k_t that distributes larger weight to similar memories. It is quite contradictory. On one hand, if NTM wants to locate a memory slot i_0 precisely, it has to recall and generate a k_t that looks like M_t(i_0). On the other hand, NTM can locate a slot by recalling a fuzzy key k_t and then retrieve the precise version of that memory. The authors relate it to Hopfield networks, while I think it is more like a Self-Organizing Map or a Winner-Takes-All model.
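Content addressing can be sketched as a softmax over key-memory similarities. A sketch under the assumption that K is cosine similarity, as in the paper; the function name and toy values are mine.

```python
import numpy as np

# Sketch of NTM content addressing: w_c(i) ∝ exp(beta * cosine(key, M(i))).
def content_address(memory, key, beta):
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    logits = beta * sims
    logits -= logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

M_t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
k_t = np.array([0.0, 1.0])                 # fuzzy key resembling slot 1
w_c = content_address(M_t, k_t, beta=10.0)  # concentrates on slot 1
```

The key strength \beta_t controls how sharply the fuzzy key commits to its best match: a small beta spreads the weighting, a large beta approaches a hard lookup.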


During Interpolation, a standard forget-style gate g_t \in (0, 1) is employed to blend the content-based weighting with the previous weighting:

w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}


During Convolutional Shift, a shift vector s_t (a distribution over allowed integer shifts) is trained to circularly shift w:

\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j) \, s_t(i - j)

where the index i - j is taken modulo N.

Before Convolutional Shift, the address w_t^{\cdot} is a soft-attention vector, which means nothing would change if you shuffled the memory locations. Shifting in addressing allows the controller to traverse the whole memory simply by shifting with unit steps, which proves to be an essential tool for arranging memory.
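The circular shift can be sketched with np.roll. This sketch assumes s_t is a normalized distribution over the offsets {-1, 0, +1}, the common choice in the paper's experiments; names are illustrative.

```python
import numpy as np

# Sketch of the circular convolutional shift:
# w_tilde(i) = sum_k s(k) * w(i - offset_k mod N).
def conv_shift(w, s, offsets=(-1, 0, 1)):
    out = np.zeros(len(w))
    for prob, off in zip(s, offsets):
        out += prob * np.roll(w, off)  # np.roll rotates the attention circularly
    return out

w_g = np.array([0.0, 1.0, 0.0, 0.0])  # attention on slot 1
s_t = np.array([0.0, 0.0, 1.0])       # all shift mass on +1
w_shifted = conv_shift(w_g, s_t)      # attention moves to slot 2
```

Because the shift is a convolution with a soft distribution, a spread-out s_t blurs the address, which is exactly what the later Sharpening step compensates for.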


During Sharpening, w_t is re-normalized into a sparser vector by:

w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_{j} \tilde{w}_t(j)^{\gamma_t}}

where \gamma_t \geq 1.
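Sharpening is a two-line operation; a minimal sketch with illustrative names.

```python
import numpy as np

# Sketch of sharpening: raise each weight to gamma >= 1 and re-normalize,
# which concentrates mass on the largest entries and undoes shift blurring.
def sharpen(w, gamma):
    p = w ** gamma
    return p / p.sum()

w_tilde = np.array([0.1, 0.6, 0.3])
w_sharp = sharpen(w_tilde, gamma=3.0)  # peak at index 1 grows sharper
```

With gamma = 1 the weighting is unchanged; as gamma grows, the address approaches a one-hot vector.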


I strongly recommend that you read the experiment section yourself. It is very interesting, innovative, and incisive.



[DR011]: There are many details I want to comment on:

(1) The use of K[\cdot, \cdot] is novel. It allows the controller to retrieve a precise memory by recalling a fuzzy one, quite like how humans search for information. I wonder what would happen if we changed the cosine metric to other metrics, and what would happen if it were replaced with a standard soft-attention vector.

(2) The design of shifting plays a significant part in handling sequences. Moving relatively rather than absolutely along the tape is a basic operation of a Turing machine. What would happen if we added the shifting mechanism to LSTM? Maybe it would work as well as NTM.

(3) As impressive as their experiments are, they lack ablation studies to examine how much each component of the addressing mechanism contributes.

(4) Another great benefit of having an external memory is that a program can set up the environment before receiving inputs, like setting up global variables before the major loops in many algorithms. However, in the experiments NTM does not have that spare time to set up the memory: it is required to output immediately after the input terminates.

(5) NTM fails to establish feasible loops on its own. It can loop as the input sequence comes in; however, it fails to create a general loop counter (see Section 4.2). I think introducing some integer registers or discrete operations might solve this problem.

(6) Disentangling and complicating neural networks is always a hot topic. It is natural that researchers find the detailed model much easier to train than the black-box one, as with NTM versus LSTM. The extra bonus is that we can visualize and understand how the neural network accomplishes the mission. However, although the features are learned by the neural network itself, we actually introduce many hand-crafted operations into the model design. It is ironic that NTM, which aims to learn more generally than LSTM, fails to generalize in the design of the model. We have already outshined hand-crafted programs via machine learning, surpassed hand-crafted features via deep learning, and are going to transcend hand-crafted models via some kind of general learning. (Maybe genetic algorithms? :D) Or perhaps a model with such generalizing ability does not exist, according to the No Free Lunch Theorem.