[DR012] Attention is all you need

Google never fails to impress everyone with their amazing designs and sensational titles. This work replaces traditional structure (i.e. RNN and CNN) in Translation tasks with an attention-based model. They achieve better performance with less computation. I am not familiar with Natual Language Processing, but still, I believe it is a great breakthrough.

RNN or CNN, which takes sequences as input, has to handle sentences word by word, which is unable to parallelize. Furthermore, if the sentence is too long, the model might forget the content in a distance or mix it with the following content. By encoding the position and applying the attention mechanism, the proposed model can relate two distant words, which can be parallelized to accelerate the training.

The proposed network “Transformer” can be described in the following figure:

, where the left part is the encoder and the right part is the decoder. The gray boxes will be stacked multiple layers to form a deep network.

The Multi-head attention is a combination of different Scaled Dot-Production Attention with different dimensions.

And the Scaled Dot-Product Attention is defined as:

The attention is a function of three components: $Q$ the query, $K$ a set of keys and $V$ a set of values. It is a soft-itemwise attention on $V$ . With this mechanism, a word can relate to a previous word by paying a high attention to it, while in RNN the length of the linkage is linear to the distance.

To employ the information of word, they also introduce a method to encode the position:

Visualization of the attention:

[DR012]: To be honest, although I understand the self-attention mechanism, I do not fully understand how the data or word pairs are fed into the network. The superiority of this model is that it can find the relationship between words more directly, while the RNN and CNN have to go through the whole sentences. The former is more consistent with human’s logic.

The multi-head attention and the position embedding is strange but proved effective. But adding the position vector with an embedded word vector is still somehow unnatural.

The attention defined by the authors is “An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”

However, it is quite different from the traditional concept of attention. I think it is more like learning a relationship network between words rather than attention. Psychologically speaking, attention should be a dynamic distribution of resources to accomplish a complex task. From my perspective, not all methods that combine several items with weights can be tagged as attention. And attention is achieved in sequence rather than in O(1).