# Attention Model 101 (Seminar slide)

Posted on June 10, 2017

This PDF was made in July 2016, so some more recent work is not included. I will try my best to explain the slides as much as I can. The slides are mostly based on "Survey on the attention based RNN model and its applications in computer vision". If you are new to attention models, I strongly recommend going through this survey first.

OK! Let's begin our journey!

Other papers to be mentioned are (but not limited to):

• Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
• [NIPS2014] Recurrent Models of Visual Attention
• [NIPS2015] Spatial Transformer Networks
• Action Recognition using Visual Attention

They are not the most advanced works for now, but some of them are quite classical and worth reading.

This sharing is presented in the following order:

First we will talk about RNN:

RNN is short for recurrent neural network.

Assuming the input $x$ at each time step is a vector of length D, and there are T time steps, a single sample input has shape (T, D). A neuron will receive input and produce output T times to process a single sample. The update formula is listed as:
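The slide image with the formula is not reproduced in this post; a common form of the vanilla-RNN update it refers to (my notation, with input-to-hidden weights $U$, recurrent weights $W$, hidden-to-output weights $V$, and biases $b$, $c$) is:

$$s_t = \tanh(U x_t + W s_{t-1} + b), \qquad o_t = g(V s_t + c)$$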

Here is a temporally expanded illustration of how an RNN neuron handles an input sample. Keep in mind that the circles in the red rectangle are the same neuron but at different time steps.

If you are still not familiar with RNNs, you can regard one as a black box that takes a sequence of inputs and produces a sequence of outputs.
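As a concrete sketch of that black box, here is a minimal NumPy forward pass; the weight matrices are illustrative random values, not anything from the slides:

```python
import numpy as np

# A minimal vanilla-RNN "black box": takes a (T, D) input sequence
# and produces a (T, H) output sequence, one output per time step.
def rnn_forward(x, U, W, V, b, c):
    T, D = x.shape
    H = W.shape[0]
    s = np.zeros(H)                          # initial hidden state s_0
    outputs = []
    for t in range(T):
        s = np.tanh(U @ x[t] + W @ s + b)    # update hidden state s_t
        outputs.append(V @ s + c)            # emit output o_t
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4
U = rng.standard_normal((H, D))
W = rng.standard_normal((H, H))
V = rng.standard_normal((H, H))
out = rnn_forward(rng.standard_normal((T, D)), U, W, V, np.zeros(H), np.zeros(H))
print(out.shape)  # (5, 4)
```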

Attention is a basic human instinct. You use it every waking moment even though you do not notice it. Whether limited by the perceptive area of your eyes, or trying to separate your friend's voice in a clamorous club, your brain subconsciously selects what you should see or hear. This is called attention.

This is my favorite picture for explaining the attention mechanism. In this experiment, all participants were asked to observe the picture in the top-left corner, but they might be asked to describe different characteristics of that image. There were 7 different tasks, and the corresponding lines track where the participants looked. As you can see, 1) different tasks resulted in different attention patterns, and 2) no one could take in the picture in a single glimpse; instead, their focus traveled from place to place.

The following picture is an illustration of machine-learned attention. It tries to describe the top-left image in natural language, and the masks show where the machine is “looking” when composing the sentence.

According to the survey, there are 4 kinds of attention models:

But all of them share a similar RNN-based implementation:

1> Note that $x$ represents the raw/normal/unprocessed input, and $x'$ is the masked input. For example, we can crop a part of the image from $x$ to generate $x'$.

2> The RNN takes the input $x'_{t-1}$, updates its hidden state $s_{t-1}$, and produces the output $o_{t-1}$.

3> The hidden state $s_{t-1}$ is used to calculate the next mask/cropping area/position of interest, which is applied to the next input $x_t$ to generate the network input $x'_t$.

4> The same procedure repeats for the next input.
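The four steps above can be sketched as a toy loop; `rnn_step` and `compute_mask` are hypothetical placeholders, not the actual components of any model from the survey:

```python
import numpy as np

# Toy sketch of the generic attention-RNN loop (steps 1-4 above).
def rnn_step(x_masked, s):
    # step 2: update the hidden state and emit an output
    s_new = np.tanh(x_masked + s)
    return s_new, s_new.sum()

def compute_mask(s, x_next):
    # step 3: turn the state into a soft mask for the next input
    w = np.exp(s) / np.exp(s).sum()
    return w * x_next                    # x'_t, the masked input

T, D = 4, 3
xs = np.ones((T, D))                     # a (T, D) input sequence
s = np.zeros(D)
x_masked = xs[0]                         # first input, unmasked for simplicity
for t in range(1, T):
    s, o = rnn_step(x_masked, s)         # step 2
    x_masked = compute_mask(s, xs[t])    # step 3: mask the next input
print(x_masked.shape)  # (3,)
```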

If the input is a static image, it is expanded along the time dimension. That means the RNN can take multiple glimpses at different places of the same image.

There are four categories of attention:

Item-wise means the input is a vector, or something that can be flattened into a vector; there is no (spatial) connection between different neurons. Location-wise means the input is spatially related and we choose a connected area to attend to.

Soft attention implements the attention mechanism by weighting neurons differently, e.g. multiplying a weight vector with the raw input or multiplying a mask with an image. All neurons are taken into the calculation in soft attention. On the contrary, hard attention only chooses several neurons to forward: it produces a probability for each neuron and samples based on those probabilities. Neurons that are not sampled are excluded from the next computation. Since sampling is a non-differentiable process, it is usually trained with reinforcement learning.

The following slides show the most classical way to implement these 4 kinds of attention:

1> Item-wise Soft attention:
$e_t$ is the raw mask
$\alpha_{tj}$ is the normalized mask of the neuron at position $j$
$x'_t$ is the input with attention
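A minimal sketch of item-wise soft attention: the raw mask $e_t$ (here just random scores) is softmax-normalized into $\alpha$, and every item of the input is reweighted by it. The score values are illustrative; in a real model $e_t$ would come from the hidden state.

```python
import numpy as np

# Item-wise soft attention: all neurons participate, each scaled
# by its normalized mask value.
def soft_attention(x, e):
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()      # normalized mask, sums to 1
    return alpha * x                 # x'_t: every item kept, reweighted

rng = np.random.default_rng(1)
x = rng.standard_normal(6)           # 6 input items
e = rng.standard_normal(6)           # raw mask scores e_t
x_att = soft_attention(x, e)
print(x_att.shape)  # (6,)
```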

2> Item-wise Hard Attention:
$e_t$ is the raw probability
$\alpha_{tj}$ is the normalized probability of the neuron at position $j$
$\mathcal{L}$ is the set of sampled indices, $\mathcal{C}$ is the categorical distribution with probabilities $\alpha_{tj}$
$x'_t$ is the input with attention; to keep the input dimension consistent, one can fill the unchosen neurons with 0 (or make the input dimension equal to the number of sampled neurons).
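A matching sketch of item-wise hard attention, using the zero-filling variant mentioned above; the scores are again illustrative random values:

```python
import numpy as np

# Item-wise hard attention: sample k indices from the categorical
# distribution alpha; unsampled neurons are zero-filled so the
# input dimension stays consistent.
def hard_attention(x, e, k, rng):
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()      # sampling probabilities
    idx = rng.choice(len(x), size=k, replace=False, p=alpha)
    x_att = np.zeros_like(x)
    x_att[idx] = x[idx]              # only sampled neurons pass through
    return x_att

rng = np.random.default_rng(2)
x = rng.standard_normal(6)
e = rng.standard_normal(6)
x_att = hard_attention(x, e, k=2, rng=rng)
print(np.count_nonzero(x_att))  # 2
```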

3> Location-wise Soft Attention:
similar to item-wise soft attention, except that the mask is two-dimensional
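A sketch of the 2-D version: the mask covers the whole image, is softmax-normalized over all pixels, and weights the image elementwise. The random image and scores are illustrative only.

```python
import numpy as np

# Location-wise soft attention: a 2-D mask over the image, applied
# elementwise after softmax normalization over all pixels.
def soft_attention_2d(img, e):
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()      # 2-D mask, sums to 1
    return alpha * img

rng = np.random.default_rng(3)
img = rng.standard_normal((8, 8))
e = rng.standard_normal((8, 8))      # raw 2-D mask scores
out = soft_attention_2d(img, e)
print(out.shape)  # (8, 8)
```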

4> Location-wise Hard Attention
There are many ways to implement location-wise hard attention.
The following one first produces a coordinate $[X, Y]$,
then samples a cropping center $[X', Y']$ around $[X, Y]$,
and then crops $x'_t$ as the attention area.
Since the cropping center is randomly selected around $[X, Y]$, a well-trained model should produce $[X, Y]$ close to the efficacious area.

5> Another method to implement location-wise hard attention is the STN
It first learns a $2\times 3$ transformation matrix $A_j$, which defines the rotation, scaling, and shifting parameters.

It might be a little bit confusing if you are not familiar with Computer Graphics or rendering.
Here is a sampling procedure to make it easier:

1. We map the coordinates of the picture $X_{in}$ into real numbers between $(-1, 1)$. For instance, if an image is 32x32, then the coordinates of its pixels are denoted as something like $(-1, -1)$ (the bottom-left pixel) or $(-1+\frac{2}{32-1}*3, -1+\frac{2}{32-1}*4)$ (the pixel at the third row of the fourth column).
2. The coordinates of the new/cropped/resampled/attention-area image $X_{out}$ are also denoted as real numbers between $(-1, 1)$. For instance, if it is 8x6, then $(-1+\frac{2}{8-1}*2, -1+\frac{2}{6-1}*3)$ represents the pixel at the second row of the third column.
3. To find out which pixel in the original $X_{in}$ (the source $S$) a pixel in $X_{out}$ corresponds to, we multiply the transformation matrix $A_j$ with $(x_i^{X_{out}}, y_i^{X_{out}}, 1)^T$ and get $(x_i^S, y_i^S)$. That is to say, we should copy pixel $(x_i^S, y_i^S)$ in $X_{in}$ to pixel $(x_i^{X_{out}}, y_i^{X_{out}})$ in $X_{out}$. If $(x_i^S, y_i^S)$ lies between pixels, we can use an interpolation method to calculate the pixel value.
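The coordinate mapping in the steps above can be sketched like this (a toy version of the STN grid generator; function names are my own):

```python
import numpy as np

# For each output pixel, apply the 2x3 affine matrix A to its
# normalized coordinate to get the sampling point in the source.
def norm_coords(n):
    # map pixel indices 0..n-1 to real numbers in [-1, 1]
    return -1.0 + 2.0 * np.arange(n) / (n - 1)

def source_grid(A, h_out, w_out):
    xs = norm_coords(w_out)
    ys = norm_coords(h_out)
    gx, gy = np.meshgrid(xs, ys)                 # output coordinates
    coords = np.stack([gx.ravel(), gy.ravel(),
                       np.ones(h_out * w_out)])  # homogeneous (x, y, 1)
    src = A @ coords                             # source (x, y) per pixel
    return src.reshape(2, h_out, w_out)

# Identity transform: every output pixel samples its own location.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
grid = source_grid(A, 6, 8)
print(grid.shape)  # (2, 6, 8)
```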

Interpolation:
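A sketch of the bilinear case (the slide's figure is not reproduced): when the mapped source point falls between pixels, blend the four surrounding pixel values by their distances.

```python
import numpy as np

# Bilinear interpolation at a normalized source coordinate (x, y)
# in [-1, 1]: weight the four neighboring pixels by proximity.
def bilinear_sample(img, x, y):
    h, w = img.shape
    # back from normalized [-1, 1] coordinates to pixel indices
    fx = (x + 1.0) * (w - 1) / 2.0
    fy = (y + 1.0) * (h - 1) / 2.0
    x0, y0 = int(np.floor(fx)), int(np.floor(fy))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = fx - x0, fy - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.arange(16, dtype=float).reshape(4, 4)
# sampling at the image center averages the four central pixels
print(bilinear_sample(img, x=0.0, y=0.0))  # 7.5
```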

Results of attention models. I implemented the left one, where the digits are rotated and cropped properly.

Applications:

Given a static image, taking multiple glimpses can increase recognition accuracy and explicitly show how the neural network does the job. That mimics the human way of observing an object.

When captioning an image, we have to describe multiple objects. Attention model allows us to focus on different objects at different time steps/different words.

Video is a natural source of sequential data:

When training an attention model, the most frustrating part is that you cannot tell whether the attention area is erroneously selected or the recognition part simply works poorly.

Although it has been a while since I last read attention-model papers, you are still welcome to discuss them with me. The comment component might be blocked by the Great Firewall :(

Thanks for your patience and support!
