Notes on Unsupervised Image Translation

Do you think a machine can translate images the way we translate languages? Check out this astonishing GIF!

Can you guess which one is the real video?

Transfer learning has traditionally stayed at the feature level, for instance by aligning features between the source domain and the target domain with maximum mean discrepancy (MMD). However, with the development of image generation models, translating images at the pixel level has become feasible. One of the most famous branches is Neural Art, which learns the “painting style” of one picture and renders another picture in the learned style.
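To make the feature-level alignment mentioned above a bit more concrete, here is a minimal sketch of a biased squared-MMD estimate between two sets of feature vectors. The RBF kernel, the `gamma` bandwidth, and the function names are assumptions chosen for illustration; they are not taken from any of the papers discussed in this post.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy features: two clusters that are far apart (hypothetical data).
source_features = [[0.0, 0.0], [0.0, 1.0]]
target_features = [[5.0, 5.0], [5.0, 6.0]]
gap = mmd_squared(source_features, target_features)
```

A value near zero suggests the two feature sets look alike under the chosen kernel, so a feature-level transfer method would typically use a quantity like this as a loss to minimize.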

A demonstration of Neural Art from jayanthkoushik/neural-style.

Recently, thanks to GANs, another type of image translation has been introduced. It aims to translate a given image into another modality, for example between aerial photos and Google Maps, winter and summer landscapes, or night and day views.

In image translation, paired data is essential for supervised learning but usually scarce. With paired data from both domains (or modalities, or modes), we can simply learn a mapping from one to the other with a supervised model. However, it is much more challenging to learn this translation when paired data is very scarce or missing altogether.

To that end, we have to capture the key difference between the two modes and switch it during translation. This can be done by embedding both modes into a high-level, meaningful space. For example, the difference between horses and zebras is their skin texture. If we can separate shape from texture in a latent space and recover the image from that space, then it is easy to switch between images of horses and zebras.

Adversarial learning is employed in GANs to guide the generator toward fake images that look like real ones. It is also used in image translation to locate the difference between domains.

The idea of dual learning is also very helpful in translation. Dual learning can be simply described as “learning a task and its dual task simultaneously”. For instance, “translating English to French” is the dual task of “translating French to English”. By learning them simultaneously in a closed loop, each task can improve the other iteratively. Another classical example is the autoencoder, in which encoding and decoding are dual tasks of each other.

Four related papers will be discussed in this post:

Dual Learning for Machine Translation (by Microsoft)
Learning from Simulated and Unsupervised Images through Adversarial Training (by Apple)
Image-to-Image Translation with Conditional Adversarial Networks (by UCB)
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (by UCB)

Dual Learning for Machine Translation

This paper translates sentences between languages using only a limited amount of paired data (a bilingual corpus). It introduces the idea of dual learning, although the idea resembles some classical methods such as the autoencoder.

Here is the overview of their model:

Person A knows English (that is, can judge whether an English sentence is correct), while Person B knows French.
A model $P(\cdot | s_A;\Theta_{AB})$ translates an English sentence $s_A$ into a French sentence $s_{mid}$. Person B then rates the sentence $s_{mid}$, and the rating is fed back to $\Theta_{AB}$.
Another model $P(\cdot | s_{mid};\Theta_{BA})$ translates $s_{mid}$ back into an English sentence, which is compared with the original sentence $s_A$. The difference is propagated to $\Theta_{BA}$.

It really looks like an autoencoder, except that the encoding space is a complex one (sentences in another language) rather than a compact one. In the steps above, Person A and Person B can each be trained on a monolingual corpus, so no bilingual corpus is required for the loop itself.
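The loop described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the callables `translate_ab`, `lm_score_b`, and `log_prob_ba`, the weighting `alpha`, and the dictionary-based stand-ins are all hypothetical and stand in for the neural models used in the paper.

```python
def dual_learning_step(s_a, translate_ab, lm_score_b, log_prob_ba, alpha=0.5):
    """One pass through the A -> B -> A loop for a single sentence.

    translate_ab(s_a)       -> s_mid, a candidate translation
    lm_score_b(s_mid)       -> how natural s_mid looks to "Person B"
    log_prob_ba(s_a, s_mid) -> log-probability of recovering s_a from s_mid
    """
    s_mid = translate_ab(s_a)
    fluency_reward = lm_score_b(s_mid)
    reconstruction_reward = log_prob_ba(s_a, s_mid)
    total = alpha * fluency_reward + (1.0 - alpha) * reconstruction_reward
    return s_mid, total


# Toy stand-ins so the sketch runs without any trained model.
toy_dictionary = {"hello": "bonjour", "world": "monde"}


def toy_translate_ab(sentence):
    return " ".join(toy_dictionary.get(w, w) for w in sentence.split())


def toy_lm_score_b(sentence):
    # Fraction of words that "Person B" recognizes as French.
    words = sentence.split()
    known = set(toy_dictionary.values())
    return sum(w in known for w in words) / len(words)


def toy_log_prob_ba(original, s_mid):
    # 0.0 (probability 1) if the back-translation recovers the original.
    reverse = {v: k for k, v in toy_dictionary.items()}
    back = " ".join(reverse.get(w, w) for w in s_mid.split())
    return 0.0 if back == original else -1.0


s_mid, reward = dual_learning_step(
    "hello world", toy_translate_ab, toy_lm_score_b, toy_log_prob_ba
)
```

Both reward terms can be computed from monolingual data alone, which is the point of the closed loop: the fluency term needs only a French language model, and the reconstruction term needs only the original English sentence.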

Note that during translation, given a sentence $s$, $P(\cdot | s;\Theta_{AB})$ and $P(\cdot | s;\Theta_{BA})$ only estimate the probability of words in the target language. To assemble a sentence, we have to sample from this distribution to get the final translation. Because the output is produced by sampling, which is not differentiable, reinforcement learning (the policy gradient theorem) is used to maximize the expected reward, that is, to raise the probability of correct translations.
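As a rough illustration of the policy gradient idea, the sketch below applies the REINFORCE estimator to a toy categorical "policy" over three candidate outputs. The function names, the fixed reward of 1.0, and the step size 0.1 are assumptions for this example; the paper applies the same principle to whole sampled sentences rather than to a single choice.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Sample an index from a categorical distribution."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding


def reinforce_gradient(logits, action, reward):
    """Gradient of reward * log p(action) with respect to the logits."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
action = sample_index(softmax(logits), rng)         # sample a "translation"
grad = reinforce_gradient(logits, action, reward=1.0)
logits = [w + 0.1 * g for w, g in zip(logits, grad)]  # gradient ascent step
```

A positive reward pushes up the logit of the sampled output and pushes down the others, so well-rated samples become more likely in later rounds.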

They achieve higher performance than baseline models trained on the bilingual corpus. The idea of dual learning is very meaningful and is already applied implicitly in many other fields.

Learning from Simulated and Unsupervised Images through Adversarial Training

This is the first AI paper published by Apple. It aims to make synthetic images more realistic. This could in principle be done by improving rendering techniques, which has been a topic in computer graphics for years, but that approach has some disadvantages. For example, it is hard to manually introduce the random background noise found in reality into a simulation system. Moreover, poor image quality (e.g., fuzziness or blur) is sometimes a characteristic of real images, which somewhat contradicts what we want from an accurate simulation.

Generating images from noise (as in a plain GAN) is hard. To make things easier, they start from a synthetic image, which is still distinguishable from real ones, and then refine it with adversarial learning. However, the content of the image has to be preserved during refinement. Therefore, an extra L1 loss between the refined image and the original synthetic image is introduced to keep them consistent.
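A minimal sketch of such a refiner objective is given below, assuming images are flat lists of pixel values and that `prob_real` is the discriminator's estimated probability that the refined image is real. The weight `lam`, the function names, and this discriminator convention are assumptions of the sketch and may differ from the paper's exact formulation.

```python
import math


def l1_distance(img_a, img_b):
    """Mean absolute pixel difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)


def refiner_loss(synthetic, refined, prob_real, lam=0.1):
    """Adversarial term plus an L1 term that preserves the synthetic content."""
    adversarial = -math.log(prob_real)            # small when D is fooled
    self_regularization = l1_distance(refined, synthetic)
    return adversarial + lam * self_regularization


# Hypothetical 2-pixel example: the refiner changed one pixel by 0.5.
loss = refiner_loss(synthetic=[0.0, 1.0], refined=[0.5, 1.0],
                    prob_real=1.0, lam=2.0)
```

The two terms pull in opposite directions: the adversarial term rewards any change that fools the discriminator, while the L1 term penalizes drifting away from the synthetic input, so annotations such as gaze direction remain valid.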

The benefit of this translation goes beyond making images look real. With this method, they can produce as much training data as they want for other tasks. They augmented the data for both gaze estimation and hand pose estimation. Although they show that a model trained on refined images performs better on real data than one trained on raw synthetic images, they did not compare with state-of-the-art methods. Besides, these synthetic images are simple.

Actually, I am trying the same thing on a Re-ID (person re-identification) dataset, and the results are really frustrating. There are some possible reasons/issues that can be worked on:

Images in Re-ID are very different from those in other recognition tasks. The number of people is very large and the number of samples per person is very small. In comparison, MNIST has 10 classes and several thousand images per class, while VIPeR (a Re-ID dataset) has 632 identities and only 2 images per identity. This kind of sparsity makes it hard for the discriminator/generator to bridge the domains because the data manifold is very irregular.
Pose variation can introduce a huge difference in a person's appearance. Therefore, different images of the same person can be mapped to totally different places.
Re-ID images contain occlusion, camera shake, and other distractors, since the data is usually collected in real-world scenarios. It is hard to generate those from scratch.
The clothing of synthetic people is too different from that of real people. Simply adding a pixel-wise loss to maintain color/clothing consistency is useless unless we can describe the clothing at a very high level.

Image-to-Image Translation with Conditional Adversarial Networks

This work translates images between domains using paired images. The idea is quite simple, but I am sure the engineering work was demanding. It is based on the conditional adversarial network (check my last post for a description), where the conditional input is the original image.

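We denote a pair of corresponding images as $(x, y)\sim p_{data}(x,y)$ and the generated image as $G(x, z)$, where $z$ is the noise input. Apart from the standard (conditional) GAN loss,

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{(x,y)\sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{x\sim p_{data}(x),\, z\sim p_z(z)}[\log(1 - D(x, G(x, z)))],$$

they added an L1 loss between $G(x, z)$ and $y$ to enforce consistency:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{(x,y)\sim p_{data}(x,y),\, z\sim p_z(z)}[\| y - G(x, z) \|_1].$$

The final objective is a weighted sum:

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G).$$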

Quite elegant, isn't it? The paper also introduces many practical GAN techniques, such as PatchGAN, the reason for choosing L1 loss over L2, how to evaluate the results, and so on. It builds a great foundation for the next work.
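The weighted sum of an adversarial term and an L1 term described in this section can be estimated from a batch roughly as follows. This is a sketch under assumptions: `d_real` and `d_fake` are taken to be discriminator outputs in (0, 1) for real pairs $(x, y)$ and generated pairs $(x, G(x, z))$, images are flat lists of pixel values, and the function names and the default weight `lam` are chosen here for illustration.

```python
import math


def cgan_value(d_real, d_fake):
    """Batch estimate of the conditional GAN value: D maximizes it, G minimizes it."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def l1_loss(target, generated):
    """Mean absolute difference between the target y and the output G(x, z)."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)


def generator_objective(d_real, d_fake, target, generated, lam=100.0):
    """Weighted sum minimized by the generator; lam is an assumed default."""
    return cgan_value(d_real, d_fake) + lam * l1_loss(target, generated)


# Hypothetical numbers: an undecided discriminator and a close reconstruction.
objective = generator_objective(d_real=[0.5], d_fake=[0.5],
                                target=[1.0, 0.0], generated=[0.9, 0.0])
```

With a large weight on the L1 term, the generator is first pushed to get the overall content right, and the adversarial term then mainly sharpens details that a pure L1 loss would blur.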

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

CycleGAN brings image translation to another level, where no paired data is needed. Unlike in language, it is much more common for images from different domains not to come in pairs. And generating paired images manually seems infeasible (you cannot ask Vincent van Gogh to paint another painting, and you cannot photograph exactly the scene depicted in his painting).

CycleGAN achieves unsupervised translation by keeping the generation in a loop, that is, by training the forward and backward translation models simultaneously.
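The loop constraint can be written as a cycle-consistency loss: translating to the other domain and back should return the original image. Below is a minimal sketch with toy translators; `g_xy`, `f_yx`, and the brighten/darken functions are hypothetical stand-ins for the two generator networks, and images are flat lists of pixel values.

```python
def l1_mean(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(batch_x, batch_y, g_xy, f_yx):
    """L1 error of X -> Y -> X plus L1 error of Y -> X -> Y."""
    forward = sum(l1_mean(f_yx(g_xy(x)), x) for x in batch_x) / len(batch_x)
    backward = sum(l1_mean(g_xy(f_yx(y)), y) for y in batch_y) / len(batch_y)
    return forward + backward


# Toy translators: "domain Y" is simply one unit brighter than "domain X".
def brighten(img):
    return [p + 1.0 for p in img]


def darken(img):
    return [p - 1.0 for p in img]


def identity(img):
    return list(img)


consistent = cycle_consistency_loss([[0.0, 1.0]], [[2.0, 3.0]], brighten, darken)
broken = cycle_consistency_loss([[0.0, 1.0]], [[2.0, 3.0]], brighten, identity)
```

Here `consistent` is zero because the two translators invert each other, while `broken` is positive because the backward model fails to undo the forward one. In training, this term is added to the adversarial losses of both directions.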

Although the idea is straightforward, it does help keep the generated images meaningful, while the adversarial loss is what accomplishes the translation itself. The results of their extensive experiments are astonishing:

Before ending these notes on CycleGAN, I strongly recommend a blog post (in Chinese) I read today. It includes a comprehensive discussion of CycleGAN and of models that resemble or are related to it. Those papers are:

1. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks by BAIR

2. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks by SK T-Brain

3. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation by Memorial University of Newfoundland & Simon Fraser University

4. Unsupervised Cross-Domain Image Generation by FAIR

5. Coupled Generative Adversarial Networks by MERL

6. Adversarially Learned Inference by MILA

7. Adversarial Feature Learning by UC Berkeley

8. Mode Regularized Generative Adversarial Networks by MILA & PolyU

9. Energy-Based Generative Adversarial Networks by NYU

If you find CycleGAN worth researching, you might have to go through most of them.

What I Learned:

Adversarial learning and dual learning will be popular methodologies in future research on machine learning, computer vision, and natural language processing. Unsupervised and reinforcement learning tasks might also be discussed more frequently than before.
Training a GAN can sometimes “embed” the data distribution into the generation/discrimination process.
Introducing new components (the discriminator in a GAN, the backward translation in dual learning) in a reasonable way can help the main task.