[DR007] Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

Posted on June 26, 2017

I have mixed feelings about this paper, since it shares exactly the same idea as my own effort from September to January. However, I was working on person re-identification, where the images are much more complex than those in their experiments. Still, I hope that someone can finish this job and push this method further.


Let’s focus on the paper itself. Previous domain adaptation methods focused on learning a domain-invariant representation for both the source and the target domain. However, this paper achieves adaptation by transferring images from the source domain to the target domain. The advantages are five-fold:

  • Decoupling from the Task-Specific Architecture: The adaptation module only transfers the image style, so it is independent of the downstream task and remains usable on its own.
  • Generalization Across Label Spaces: Most related papers assume that the target and the source domain share the same categories, so that the aligned feature space is meaningful; this paper instead transfers at the pixel level, which allows the adapted images to carry labels unseen during adaptation.
  • Training Stability: Multiple losses are proposed to stabilize the training, whereas previous methods that rely on adversarial training suffer from instability caused by random initialization.
  • Data Augmentation: The model can be used to generate a theoretically unlimited number of training images.
  • Interpretability: The adapted result is in the form of an image rather than a feature, which is more interpretable.


Actually, this paper is closely related to the paper published by Apple and to other unsupervised image translation works; please check this post for more discussion. All of them try to translate images from one domain to another. Nevertheless, the other works stop at translating images and focus on the quality of the generated images, while this paper explores the potential of the adapted images in transfer tasks.

The structure, shown in the paper's overview figure, is quite self-explanatory.

There are three components: $G$, the pixel-level transferer; $T$, the classifier; and $D$, the discriminator. They are optimized with the following objective:

$$\min_{\theta_G,\theta_T}\max_{\theta_D}\;\alpha\,\mathcal{L}_d(D,G)+\beta\,\mathcal{L}_t(G,T)+\gamma\,\mathcal{L}_c(G)$$

where

$$\mathcal{L}_d(D,G)=\mathbb{E}_{\mathbf{x}^t}\!\left[\log D(\mathbf{x}^t)\right]+\mathbb{E}_{\mathbf{x}^s,\mathbf{z}}\!\left[\log\left(1-D(G(\mathbf{x}^s,\mathbf{z}))\right)\right]$$

is a standard GAN loss,

$$\mathcal{L}_t(G,T)=\mathbb{E}_{\mathbf{x}^s,\mathbf{y}^s,\mathbf{z}}\!\left[-{\mathbf{y}^s}^\top\log T(G(\mathbf{x}^s,\mathbf{z}))-{\mathbf{y}^s}^\top\log T(\mathbf{x}^s)\right]$$

is a classification loss on both the source images and the adapted images, which, according to the authors, helps stabilize the training, and

$$\mathcal{L}_c(G)=\mathbb{E}_{\mathbf{x}^s,\mathbf{z}}\!\left[\frac{1}{k}\left\|\left(\mathbf{x}^s-G(\mathbf{x}^s,\mathbf{z})\right)\circ\mathbf{m}\right\|_2^2-\frac{1}{k^2}\left(\left(\mathbf{x}^s-G(\mathbf{x}^s,\mathbf{z})\right)^\top\mathbf{m}\right)^2\right]$$

is a consistency loss. Note that $\mathbf{m}$ is a binary mask of the foreground with $k$ valid entries; only the inconsistency inside the area tagged by $\mathbf{m}$ is penalized. This is the scale-invariant loss I introduced in [DR005]. It lets the adaptation change the appearance of the object in a consistent direction.
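To make the consistency term concrete, here is a minimal PyTorch sketch of the masked scale-invariant (pairwise mean squared error) loss. The batched tensor layout, the function name, and the clamping are my own assumptions for illustration; this is a sketch of the formula above, not the authors' code.

```python
import torch

def masked_pmse_loss(source, adapted, mask):
    """Masked pairwise mean squared error (the scale-invariant loss L_c).

    source, adapted: (B, C, H, W) source images and their adapted versions G(x, z)
    mask:            (B, 1, H, W) binary foreground mask m
    """
    diff = (source - adapted) * mask               # only foreground differences survive
    diff = diff.flatten(start_dim=1)               # (B, C*H*W)
    # k = number of valid (foreground) entries per sample; clamp avoids 0-division
    k = mask.expand_as(source).flatten(start_dim=1).sum(dim=1).clamp(min=1.0)
    mse_term = diff.pow(2).sum(dim=1) / k          # (1/k) ||(x - G(x,z)) o m||_2^2
    bias_term = diff.sum(dim=1).pow(2) / k.pow(2)  # (1/k^2) ((x - G(x,z))^T m)^2
    return (mse_term - bias_term).mean()
```

Subtracting the second term makes the loss invariant to a uniform intensity shift inside the mask, which is precisely what allows $G$ to change the overall appearance of the foreground as long as it does so consistently.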


Although the adaptation scenarios here are much easier than Re-ID, which I was trying to tackle, the paper still has very convincing experiments to support its contribution.

If you are working on domain adaptation, please check their astonishing performance. In the quantitative comparisons reported in the paper, the adapted result is very close to, or even better than, the Target-only performance, which means they can reach supervised-level performance with unsupervised learning. Provided this method can be employed on more complex and practical problems, it will be a promising way to connect lab datasets with real-world data.


[DR007]: It is a pity that I am always one step behind others, but it is also an affirmation that my idea is feasible. Although this paper has pioneered the use of unsupervised image translation to adapt between domains, it still takes effort to move from simple datasets like MNIST to complex datasets like those in Re-ID.

In this paper, the difference between the foreground, which is the content, and the background, which is essentially meaningless, is exploited. I also tried this on Re-ID, and it brought a certain improvement to the adapted images. Still, there is other information this work does not address. For instance, the shape of the object might be altered by a camera shift: in the Office dataset, the camera angle differs between domains. If we could extract the skeleton and allow a certain amount of deformation during adaptation, the adapted images might be more realistic.

Other traditional post-processing effects, such as blurring, fuzziness, and changes of focus, can also help align images from different domains. As powerful as neural networks are, we can introduce such non-linear transformations manually rather than learn them, especially when data is sparse.
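As a rough sketch of this idea (my own illustration, not anything from the paper), such hand-crafted perturbations can be applied as an ordinary augmentation pipeline, for example with torchvision; the specific transforms and parameter ranges here are guesses.

```python
import torchvision.transforms as T

# Hand-crafted photometric perturbations that mimic cross-domain nuisances
# (defocus blur, exposure drift) instead of asking the network to learn
# them, which can help when training data is sparse.
align_transform = T.Compose([
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # random defocus / fuzziness
    T.ColorJitter(brightness=0.3, contrast=0.3),      # exposure / contrast drift
])

# Usage: aligned = align_transform(image)  # PIL image or (C, H, W) tensor
```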
