[DR019] Efficient Estimation of Word Representations in Vector Space

Posted on September 25, 2017

“Word2Vec” is a more popular name for this paper. However, this paper didn’t contain too many details (i.e. equations) when describing the model. One has to read its source code to figure out the accurate design. I strongly recommend you to read this post (Chinese) or this document to get a better understanding. Since I am not authorized to use their material, I will leave the detail to you.


To assess whether an embedding really captures the latent relationship of words, they designed some questions to evaluate the linear regularities. Some example of tasks is like:

For instance, suppose V() is the word to vector function and we should have V(Chicago) – V(Illinois) + V(Stockton) = V(California). It is like the old version of GRE verbal questions – “What is the word that is similar to lucky in the same sense as easiest is similar to easy?”. 


To tackle word embedding problem, they designed two kinds of models: Continuous Bag-of-Words Model (CBOW) and Continuous Skip-gram Model (Skip-gram).

I think one innovation is that they use a Hierachical Softmax design to handle the output, which reduces the complexity by a large margin (i.e. from linear to log). This improvement allows them to train on a much larger dataset with a larger dimension of embedding.

Normally, we will output a N-length vector to indicate the probability of the output, where N is the total number of words. However, N can be extremly large when handling a large dataset. Therefore, they used a Huffman binary tree to represent a word. At each branch of the tree, deciding whether to go left or right becomes a binary classification problem. The eventual probability is the production of all probabilities of binary classification on the path. Please refer to two posts mentioned in the beginning to understand the details of these two models.

In the following work, they used negative sampling to replace the complex hierarchical softmax. The crux of negative sampling is to combine the correct answer with some wrong answers to form a binary classification problem.


[DR019]: They proposed a very interesting task to evaluate the embedding space. However, is the linear regularity the most accurate metric to evaluate an embedding? Maybe some embeddings capture the relationship between words in another metric space. Linearity can preserve the meaning in a simple way, which can be easily applied by other following models. But mapping relationship on a linear space limits the expression ability of the embedding. Non-linear relationship requires proper kernel function or mapping when being employed, but it might exploit the more complex relationship between words. To balance these ends, one might embed the words in a slightly more complex space, like quadratic space. 

Another interesting extension is to separate the encoding or to enforce some regularization between a different dimension of embedding vectors. Reducing the redundancy between vectors might enable us to have a more compact code.