Posted on May 08, 2017

## Introduction

The last time I wrote notes was last April (on my old blog).

I spent last summer holiday preparing my application and focused on attention models. At my tutor's suggestion, I have been researching video-based Re-ID since then. From September to January, I devoted myself to Generative Adversarial Networks (GAN) and related works and brought the idea of adversarial training into our laboratory. Building on it, we came up with an unsupervised domain adaptation model, which has been submitted to ICCV 2017. My reading list stops before the ICLR 2017 acceptance results were released, and this post will keep updating.

Familiarity with the basic idea of GAN is assumed before reading this post.

## Basic GAN

GAN [1] is a highly innovative idea. It was proposed to generate high-quality images that look like samples from the real dataset. To achieve that, an auxiliary discriminator is trained to distinguish the real images from the generated (fake) ones, while the generator tries to produce images that confuse the discriminator. The two networks are updated alternately.

There are many explanations of how GAN works. My understanding is: the discriminator is trained to locate the differences between two distributions (i.e. the real data distribution and the generated (fake) distribution), while the generator tries to minimize them. Since we cannot change the real data distribution, we "pull" the generated distribution towards the real one. To that end, we define a loss so that the differences located by the discriminator can be back-propagated to the generator; most frequently, the discriminator's classification loss is used. Normally, the generator's loss opposes the discriminator's loss, which explains the adjective "adversarial". I believe the idea of adversarial training/learning is more valuable than GAN itself.
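This alternating game can be made concrete with a toy numerical sketch. Everything below is hypothetical for illustration: a 1-D "real" dataset, a linear "generator", and a hand-picked logistic "discriminator", just to show the two opposing losses.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_loss(d_real, d_fake):
    # the discriminator wants D(real) -> 1 and D(fake) -> 0
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss(d_fake):
    # the generator wants D(fake) -> 1 (the non-saturating form)
    return -np.mean(np.log(d_fake))

rng = np.random.default_rng(0)
real = rng.normal(4.0, 0.5, size=64)   # samples from the real distribution
z = rng.normal(size=64)                # noise input
fake = 0.5 * z                         # an untrained linear "generator"

# a toy scalar discriminator: logistic regression, hand-picked to separate the two
w, b = 2.0, -4.0
d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
print(d_loss(d_real, d_fake), g_loss(d_fake))  # D is doing well, G is losing
```

In the real algorithm, the parameters of both networks would be updated by gradient steps on these two losses in alternation, each loss back-propagating the differences the discriminator found.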

## Co-evolution

During my sophomore year (before GAN was invented), I was learning algorithms of Computational Intelligence (CI). CI is a sub-field of Artificial Intelligence mainly concerned with nature-inspired algorithms like Genetic Algorithms, Swarm Optimization and Ant Colony Optimization. In books about CI I read about a training method called "Co-evolution", which I think is the prototype of GAN (unfortunately, no one has mentioned that, even after GAN became extremely popular).

In Co-evolution, there are two or more agents. They compete or cooperate with each other to optimize some function. For instance, in nature, rabbits and wolves might once have run much slower than they do now. But under natural selection, wolves that run faster are preferred and the slower ones are eliminated. The same happens with rabbits. After many iterations, both wolves and rabbits run faster than before. The most brilliant essence is that, unlike a fixed optimization problem, the loss function here is dynamic and is decided by the environment, which usually involves multiple external agents.

This idea highly resembles GAN, and I prefer to regard GAN as a differentiable, neural-network version of co-evolution, because in earlier days co-evolution was implemented with Genetic Algorithms, which are completely non-differentiable. I have been to a seminar on CI where the relationship between CI and ML was frequently brought up. GAN can be regarded as a very successful example of transferring an idea from CI to ML, and this might be a feasible way to develop new ML algorithms.
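A toy sketch of competitive co-evolution makes the "moving target" point concrete. The populations, fitness rule, and constants below are all made up for illustration (truncation selection plus Gaussian mutation, a crude stand-in for a Genetic Algorithm):

```python
import numpy as np

def evolve(pop, rivals, rng):
    # fitness is the fraction of rivals an individual can outrun: the objective
    # is dynamic because it is defined by the *other*, co-evolving population
    fitness = (pop[:, None] > rivals[None, :]).mean(axis=1)
    order = np.lexsort((pop, fitness))       # sort by fitness, break ties by speed
    parents = pop[order[-10:]]               # truncation selection: keep the fitter half
    children = parents + rng.normal(0.0, 0.1, size=10)  # Gaussian mutation
    return np.concatenate([parents, children])

rng = np.random.default_rng(1)
wolves = rng.normal(5.0, 0.1, size=20)   # running speeds
rabbits = rng.normal(5.0, 0.1, size=20)
w0, r0 = wolves.mean(), rabbits.mean()

for _ in range(50):                      # the arms race
    wolves = evolve(wolves, rabbits, rng)
    rabbits = evolve(rabbits, wolves, rng)
print(wolves.mean() - w0, rabbits.mean() - r0)  # both populations got faster
```

Neither population has a fixed loss function; each only has to beat the other, yet both improve on the absolute scale, which is exactly the dynamic the generator and discriminator share.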

Check these slides for more details on Co-evolution.

## Extensions of GAN

From my perspective, the extensions of GAN fall into 3 classes: improvement/application of the generation process, new designs of the loss function, and mixtures of GAN with other models. Several of these modifications may appear together in one work.

Improving the quality of generated images is one of the major contributions of GAN. Follow-up works either enhance the generation process itself or apply it to interesting fields.

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN) [2]:

DCGAN is a classic work that brings the convolution operation into GAN. Although it is not mathematically innovative, it shows some astonishing artificial images. My favorite results are the one that applies linear operations (addition and subtraction) in the noise space and the one that interpolates between two noise vectors. These results disclose a visual relationship between the noise space and the image space.
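The noise-space operations themselves are trivial to write down. A sketch with NumPy, where the generator G and the semantic labels attached to the codes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
z1, z2, z3 = (rng.normal(size=dim) for _ in range(3))

# linear interpolation between two noise vectors, as in the DCGAN figures;
# decoding each point with the generator yields a smooth morph between two images
path = [(1 - t) * z1 + t * z2 for t in np.linspace(0.0, 1.0, 9)]

# vector arithmetic in noise space, e.g. (the labels are hypothetical):
# z("man with glasses") - z("man") + z("woman")
z_new = z1 - z2 + z3
# images = [G(p) for p in path]   # G = trained generator, not shown here
print(len(path), np.allclose(path[0], z1), np.allclose(path[-1], z2))
```

In the paper the arithmetic is actually performed on averages of several codes per concept, which makes the resulting images more stable.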

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (L-GAN) [3]:

It is hard to generate every pixel at once. Even a human painter has to draw the outline before diving into details. Hence L-GAN proposes that an image should first be rendered at a small scale and then extended to larger scales gradually. During the expansion, more details are added to the generated image. A Laplacian pyramid is built to implement that idea:

Similar to the generation, the discrimination is executed at multiple scales, but in the reversed direction.

This idea of multi-scale processing is frequently proposed in recent works on segmentation, pose estimation, generation and so on. It grants the neural network access to different scales, which might help recover the information lost in convolution.
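A minimal Laplacian pyramid can be sketched as follows, using 2x2 average pooling and nearest-neighbor upsampling as simple stand-ins for the blur/subsample operators of the real pyramid:

```python
import numpy as np

def down(img):
    # 2x2 average pooling as a stand-in for blur + subsample
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(img):
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian(img, levels):
    pyramid = []
    for _ in range(levels):
        small = down(img)
        pyramid.append(img - up(small))  # band-pass residual: the "details" at this scale
        img = small
    pyramid.append(img)                  # coarsest image
    return pyramid

def reconstruct(pyramid):
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = up(img) + residual         # L-GAN trains a generator to produce this residual
    return img

x = np.random.default_rng(0).normal(size=(32, 32))
pyr = build_laplacian(x, 3)              # scales: 32 -> 16 -> 8 -> 4
print(np.allclose(reconstruct(pyr), x))  # the pyramid is invertible
```

The key property is that the pyramid is invertible: generating a tiny image plus a residual at every scale is enough to recover a full-resolution image, which is exactly the structure L-GAN exploits.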

Generating images with recurrent adversarial networks [4]:

This work is a combination of DRAW [5] and GAN. Actually, I am more interested in DRAW, a model that can read and write/generate an image multiple times with an attention mechanism. It shares with [3] the same inspiration that an image should be generated in more than one step. In DRAW [5], they show how a neural network "reads" and "writes" an image in steps, which matches human intuition. Since this is a note about GAN, I will not include more discussion of attention models here.

Generative Adversarial Text to Image Synthesis [6]:

There are many applications of GAN and this particular one is one of my favorites. It combines natural language processing with computer vision and shows some amazing results. The basic methodology is to add the encoded features of sentences to the GAN, like Cond-GAN [8] does (described later).

After training, the model can generate an image related to the topic of a sentence. Apart from that, it can even transfer styles between pictures:

This work shows that if sentences/words are embedded in a meaningful space, it is feasible to take those features to other fields like computer vision. With the development of NLP, there are actually many pretrained models competent to encode sentences/words efficiently.

Another line of work designs new loss functions using adversarial training or proposes explanations of how GAN works.

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets [7]

The original noise space z in GAN has equal significance in every dimension. However, InfoGAN states that the coding space can be disentangled into two parts: (i) c, the latent representation, and (ii) z, the incompressible noise. For instance, c might capture significant features like shape and color, while z represents irrelevant details of the image. They are used jointly (concatenated) as the code to generate images.

In order to strengthen the relationship between latent code c and the image, they propose that the Mutual Information between them should be as large as possible. In information theory, Mutual Information reflects the dependency between two random variables. The final optimization objective is

$\min_G \max_D V_I(D, G)=V(D,G)-\lambda I(c;G(z,c))$

where $V(D,G)$ is the standard GAN loss and $I(c;G(z,c))$ is the mutual information between the latent code c and the generated image. Notice that c is a concatenated vector whose components are independent of each other.

By imposing a mutual information regularization, the generated image becomes highly related to c. Manipulating an individual component of the code c results in a conspicuous change of a certain characteristic of the image:

What we can learn from InfoGAN: (1) the code can be separated into two parts, one for major features and another for insignificant intra-class variation; (2) mutual information can be used as a measure of dependency.
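A sketch of the code construction and of the mutual-information regularizer, assuming a 10-way categorical c (as in the paper's MNIST setup); the Q-network output here is just random logits standing in for a real auxiliary network:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
batch, n_cat, n_noise = 8, 10, 62        # 10-way categorical c + 62-d noise

c = np.eye(n_cat)[rng.integers(0, n_cat, size=batch)]  # one-hot latent code c
z = rng.normal(size=(batch, n_noise))                  # incompressible noise z
code = np.concatenate([c, z], axis=1)                  # the generator input for G(z, c)

# Q(c | G(z, c)): the auxiliary network's predicted distribution over c,
# here random logits standing in for a real Q-network on the generated image
q = softmax(rng.normal(size=(batch, n_cat)))

# the variational lower bound on I(c; G(z, c)) reduces (up to the constant H(c))
# to a negative cross-entropy, so the regularizer added to the GAN loss is:
mi_loss = -np.mean(np.log(np.sum(q * c, axis=1)))
print(code.shape, mi_loss)
```

Maximizing the mutual information thus amounts to training Q to recover c from the generated image, which forces the generator to actually use c.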

Conditional Generative Adversarial Nets (Cond-GAN) [8]

Cond-GAN is a really simple modification of GAN that allows us to attach supervised side information to the generation process. More specifically, you just concatenate the side information with the code z during generation and with the image x (G(z)) during discrimination.

One benefit is that the conditional side information is added during training, which forces the generated image to be closely related to the side information. By manipulating the side information during generation, we can get the desired pattern.

Cond-GAN is different from InfoGAN. In InfoGAN, you can decide how many components the latent code has, but you cannot decide the meaning of each one. In Cond-GAN, by providing the side information for each image, you are actually labeling an attribute at every component.
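The concatenation trick itself is a one-liner on each side of the game. A sketch with hypothetical shapes (one-hot labels as the side information, a flattened 784-d vector standing in for an image):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_noise, n_classes = 4, 100, 10

z = rng.normal(size=(batch, n_noise))
y = np.eye(n_classes)[np.array([3, 1, 4, 1])]    # one-hot labels as side information

gen_input = np.concatenate([z, y], axis=1)        # the generator sees [z, y]

x = rng.normal(size=(batch, 784))                 # stand-in for an image (real or G([z, y]))
disc_input = np.concatenate([x, y], axis=1)       # the discriminator sees [x, y]
print(gen_input.shape, disc_input.shape)
```

Because the discriminator judges image-label pairs, the generator is penalized for producing an image that contradicts its label, which is what ties the generation to the side information.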

Energy-based Generative Adversarial Network (EBGAN) [9]

EBGAN develops GAN from an energy-based perspective. The generator generates images with minimal energy, while the discriminator is replaced with an energy estimator that assigns high energy to fake images and low energy to real ones. In EBGAN, the energy of an image is defined as its reconstruction loss under an AutoEncoder (AE). That is to say, real images should be successfully reconstructed while fake/generated ones cannot be reconstructed by the same AE.

Mathematically, the generator wants to minimize the energy (loss of reconstruction) of the generated image:

$f_G(z)=\|D(G(z))\| = \|Dec(Enc(G(z)))-G(z)\|$,

where Dec is the decoder of AE and Enc is the encoder of AE. The objective of AE is to assign high energy to fake images and low energy to real images:

$f_D(x,z) = D(x)+[m-D(G(z))]^+ = \|Dec(Enc(x))-x\|+[m-\|Dec(Enc(G(z)))-G(z)\|]^+$

where $[\cdot]^+ = \max(0, \cdot)$ is the hinge function and $m$ is a positive margin. A repelling loss is also introduced to prevent the generated images from clustering in a few modes.

There are many advantages to using an energy-based model. First, you have a wide choice of energy functions. Second, the energy can easily be translated into a probability via the Gibbs distribution.
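A numerical sketch of the two EBGAN objectives, using a random linear autoencoder as a stand-in for the real Dec(Enc(x)) energy function (all shapes and constants are illustrative):

```python
import numpy as np

def energy(x, enc_w, dec_w):
    # EBGAN's energy D(x): per-sample reconstruction error of an autoencoder,
    # here a random linear one standing in for Dec(Enc(x))
    return np.linalg.norm(x @ enc_w @ dec_w - x, axis=1)

rng = np.random.default_rng(0)
enc_w = rng.normal(size=(16, 4)) * 0.5   # "encoder"
dec_w = rng.normal(size=(4, 16)) * 0.5   # "decoder"
m = 10.0                                 # positive margin

x_real = rng.normal(size=(8, 16))
g_z = rng.normal(size=(8, 16))           # stand-in for generated samples G(z)

d_real, d_fake = energy(x_real, enc_w, dec_w), energy(g_z, enc_w, dec_w)
f_D = np.mean(d_real + np.maximum(0.0, m - d_fake))  # push real energy down, fake energy up to m
f_G = np.mean(d_fake)                                # generator: minimize energy of its samples
print(f_D, f_G)
```

The hinge means the discriminator stops pushing a fake sample once its energy already exceeds the margin m, so its effort concentrates on fakes that still look too "reconstructable".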

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization [10]

Since many new concepts are introduced in f-GAN, I am not sure whether I have captured its whole spirit. According to [10], the "generative-adversarial approach is a special case of an existing more general variational divergence estimation approach".

An f-divergence is a function $D_f(P\|Q)$ that measures the difference between two probability distributions, and there are many concrete forms of it. To estimate the divergence, they use the Fenchel conjugate to derive a variational lower bound of the original divergence.

The discriminator now acts as a "divergence estimator" with parameters $\omega$:

And the original GAN is actually a special case that uses a JS-like divergence to measure the difference between the real and fake distributions.

To avoid confusion, I will update the notes on this paper after I fully understand it. If you like math, you might enjoy digging into the concepts in this paper.

Wasserstein GAN (WGAN) [11]

In [11], the authors analyze why the traditional GAN is hard to train. The reason lies in the divergence (JS-divergence) used by GAN: JS-divergence cannot accurately measure the distance between two distributions when their overlap is negligible. Thus, WGAN uses the Wasserstein distance to measure the difference between the real distribution and the fake distribution. However, the Wasserstein distance is hard to compute directly, so they use its dual form as the objective.

Many comprehensive blogs (e.g. "Read-through: Wasserstein GAN") have more details on WGAN. The final engineering modification is so simple that it shocked many people.
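A heavily simplified sketch of the critic update with weight clipping, using a linear critic on toy Gaussian "real" and "fake" data; the learning rate is illustrative, while the 0.01 clipping threshold matches the paper's default:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16) * 0.1            # a linear "critic" f(x) = w . x
c = 0.01                                 # clipping threshold (the paper uses 0.01)

x_real = rng.normal(4.0, 1.0, size=(64, 16))
x_fake = rng.normal(0.0, 1.0, size=(64, 16))

for _ in range(100):
    # the critic ascends E[f(real)] - E[f(fake)]; for a linear f the gradient
    # w.r.t. w is just the difference of the sample means
    grad = x_real.mean(axis=0) - x_fake.mean(axis=0)
    w = np.clip(w + 1e-3 * grad, -c, c)  # weight clipping keeps f (crudely) Lipschitz

w_estimate = (x_real.mean(axis=0) - x_fake.mean(axis=0)) @ w  # distance estimate
print(w_estimate)
```

The clipped critic's value E[f(real)] - E[f(fake)] approximates the Wasserstein distance (up to a constant factor), and unlike the JS-based loss it stays informative even when the two distributions barely overlap.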

Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities (LSGAN) [12]

Like WGAN, LSGAN tries to fix GAN's notorious training difficulty by limiting the model's capacity. In order to do that, LSGAN stops pushing the real distribution and the fake distribution apart once a certain margin is met.

$\Delta(\cdot, \cdot)$ is a distance function, and in [12] the pixel-wise L1 distance is used. The final objective is

Equ. (8) is designed to minimize the score $L_{\theta}(x)$ of a real image x and to push up the score of a generated image $G_{\phi^*}(z)$ until the margin $\Delta(x, G_{\phi^*}(z))$ is met. In Equ. (9), the objective is to train $G_{\phi}$ so that the score of the generated image $G_{\phi^*}(z)$ is minimized. Here S refers to the critic/discriminator objective and T to the generator objective.

[12] proves that both LSGAN and WGAN are special cases of a Generalized LSGAN. It also contains many mathematical deductions and other interesting theorems about LSGAN.
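As I read the two objectives, they can be sketched numerically as follows; treat this as my interpretation, not the paper's code. The linear score function, the data, and the shapes are all toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))             # real samples (paired with the fakes below)
g_z = rng.normal(size=(8, 16))           # generated samples

w = rng.normal(size=16) * 0.1            # toy linear score function L_theta(x) = x . w

delta = np.abs(x - g_z).sum(axis=1)      # pixel-wise L1 margin Delta(x, G(z))

# Equ. (8): a real sample should score lower than a fake by at least the margin;
# the violation is penalized with a hinge, so well-separated pairs stop contributing
critic_loss = np.mean(x @ w) + np.mean(np.maximum(0.0, delta + x @ w - g_z @ w))
# Equ. (9): the generator simply minimizes the score of its own samples
gen_loss = np.mean(g_z @ w)
print(critic_loss, gen_loss)
```

The data-dependent margin is what distinguishes this from a plain hinge loss: fakes that are already close to a real image (small Delta) are pushed less.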

The last type of extension mixes GAN with other models, especially the AutoEncoder. The inspiration is clear: GAN can only generate an image from a noise input z, and that image is sent to the discriminator afterward. If we want to solve traditional problems like classification, what we need is the ability to encode an image into a feature, which is the exact opposite of GAN. AutoEncoders, however, are designed to do exactly that.

By observing the data flow of GAN and AE, we can find that they are connected closely:

As you can see, the Encoder and the Discriminator share the same structure and the Generator and the Decoder share the same structure.

Autoencoding beyond pixels using a learned similarity metric (VAE/GAN) [13]

[13] proposes a model that combines a Variational AE (VAE) with a GAN, in which the Decoder of the VAE and the Generator of the GAN share the same network:

The losses are the standard VAE and GAN losses, and notice that the Discriminator has 3 inputs: x (real data), $\tilde{x}$ (reconstructed data), and $x_p$ (generated from sampled noise). The algorithm is elegant:

Although both the VAE and the GAN are trained, [13] pays more attention to generation than encoding. The results show that VAE/GAN is able to generate a targeted image from a given input by manipulating the coding space.

AAE (Adversarial AutoEncoder) also combines GAN with an AE. But unlike VAE/GAN, the battlefield of adversarial training is located in the coding space:

And it focuses on the encoding rather than the generation/reconstruction. Compared with VAE, AAE is capable of filling the entire coding space:
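A sketch of where the adversarial game lives in AAE: the discriminator sees codes, not images. The linear encoder and the code-space discriminator below are random stand-ins for real networks:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))            # data
enc_w = rng.normal(size=(16, 2)) * 0.5   # a random linear "encoder"

codes = x @ enc_w                        # Enc(x): the "fake" samples of the AAE game
prior = rng.normal(size=(32, 2))         # samples from the imposed prior p(z): the "real" ones

# a toy discriminator on 2-D codes; in AAE the adversarial game happens here,
# on the coding space, not on images
w_d = rng.normal(size=2)
d_prior, d_codes = sigmoid(prior @ w_d), sigmoid(codes @ w_d)
d_loss = -np.mean(np.log(d_prior) + np.log(1.0 - d_codes))
enc_loss = -np.mean(np.log(d_codes))     # the encoder tries to make Enc(x) look like the prior
print(d_loss, enc_loss)
```

When the encoder wins this game, the aggregate distribution of codes matches the prior everywhere, which is why AAE fills the entire coding space instead of leaving holes in it.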

## References:

[1] Generative Adversarial Nets

[2] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

[3] Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

[4] Generating images with recurrent adversarial networks

[5] DRAW: A Recurrent Neural Network For Image Generation

[6] Generative Adversarial Text to Image Synthesis

[7] InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

[8] Conditional Generative Adversarial Nets

[9] Energy-based Generative Adversarial Network

[10] f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

[11] Wasserstein GAN

[12] Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities

[13] Autoencoding beyond pixels using a learned similarity metric