[DR003] Domain Separation Networks

Posted on June 21, 2017

Today I am going to introduce a paper whose idea is very similar to one that my tutor and I have been wanting to verify. Although it is pleasant to see that our idea and research direction are feasible, it is still a pity that we were not the ones to implement it first.

When it comes to transfer learning, we usually want to learn a model general enough to map inputs from both the source and the target domain into one aligned space. However, this expectation is too idealistic, even with labeled data from both domains. Therefore, we should accept that a part of the coding is domain-specific, or in this paper's phrasing, private to each domain.

The essence of this paper can be summarized in one self-explanatory figure:

The caption does most of the explaining, and the idea itself is not that hard to come up with. Nevertheless, the details of the design are worth reading.

To keep the features from different domains aligned, two kinds of \mathcal{L}_{similarity} are compared: (1) a classical MMD loss with RBF kernels:
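Roughly, for the shared codes $h_c^s$ and $h_c^t$ of the two domains, the MMD term has the form (written from my own recollection of the paper, so take the exact indexing with a grain of salt):

$$\mathcal{L}^{\text{MMD}}_{similarity} = \frac{1}{(N^s)^2}\sum_{i,j} \kappa(h^s_{ci}, h^s_{cj}) \;-\; \frac{2}{N^s N^t}\sum_{i,j} \kappa(h^s_{ci}, h^t_{cj}) \;+\; \frac{1}{(N^t)^2}\sum_{i,j} \kappa(h^t_{ci}, h^t_{cj}),$$

where $\kappa$ is a linear combination of several RBF kernels, $\kappa(x_i, x_j) = \sum_n \eta_n \exp\{-\tfrac{1}{2\sigma_n}\|x_i - x_j\|^2\}$, so that the distribution match does not hinge on a single bandwidth.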

and (2) a more advanced loss based on adversarial training:
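This is the domain-adversarial (DANN-style) objective: a small domain classifier is trained on top of $h_c$ to predict which domain each sample comes from, while the shared encoder is trained through a gradient reversal layer to fool it. As I remember it, the loss is a binary cross-entropy over domain labels $d_i$:

$$\mathcal{L}^{\text{DANN}}_{similarity} = \sum_{i=0}^{N^s+N^t} \left\{ d_i \log \hat{d}_i + (1 - d_i)\log(1 - \hat{d}_i) \right\},$$

where $\hat{d}_i$ is the domain classifier's prediction for sample $i$.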

These two similarity losses represent the two mainstream approaches to domain adaptation.


They do not use the normal L_2 loss as the reconstruction error. Instead, a scale-invariant mean squared error is employed:
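From my reading of the paper (reconstructed from memory), for an input $x$ and reconstruction $\hat{x}$ with $k$ pixels it is

$$\mathcal{L}^{\text{si-mse}}_{recon}(x, \hat{x}) = \frac{1}{k}\|x - \hat{x}\|_2^2 - \frac{1}{k^2}\left([x - \hat{x}]\cdot \mathbf{1}_k\right)^2,$$

i.e. the usual per-pixel squared error minus the squared mean error. Subtracting the mean term makes the loss insensitive to a global shift in intensity, so the decoder is rewarded for reproducing the structure of the input rather than its absolute brightness.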

[DR003] This reconstruction term outperforms the normal L_2 loss. It is a very interesting term that I cannot explain at present, so I will add it to the future DR reading list. The paper behind this term has already been reported in [DR005].


To ensure the shared coding h_c and the domain-private coding h_p are mutually exclusive, an orthogonality loss is introduced:
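Concretely (again written from memory), if $H^s_c$ and $H^s_p$ are matrices whose rows are the shared and private codes of a batch of source samples, and likewise $H^t_c$, $H^t_p$ for the target, the loss is the squared Frobenius norm of their cross-correlation:

$$\mathcal{L}_{difference} = \left\|{H^s_c}^{\top} H^s_p\right\|_F^2 + \left\|{H^t_c}^{\top} H^t_p\right\|_F^2,$$

which pushes the shared and private codes toward orthogonal subspaces.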

In our design, we use a sparsity regularizer (an L1 loss) to keep h_p sparse and simple, and a subtraction is used to eliminate the collision between the two codings. I tried replacing the subtraction with masking, but the performance was not satisfying. [DR003] Maybe this model can be adapted into an attention model, where the domain-private coding h_p is replaced by a selective mask. From another point of view, if we treat h_c and h_p as two separate spaces (meaning we concatenate them rather than sum them in the following processing), it is unnecessary to maintain this exclusiveness restriction. A rough sketch of the two competing penalties follows.
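For concreteness, here is a minimal PyTorch-style sketch of the two penalties being compared: the paper's orthogonality term and the L1 sparsity regularizer from our design. The (batch, dim) layout and the unit-normalization of the codes are my own assumptions, not details taken from either implementation.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(h_c: torch.Tensor, h_p: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of H_c^T H_p for one domain.

    h_c, h_p: (batch, dim) shared and private codes.
    Normalizing each code to unit length is an assumption I add here,
    so the penalty cannot be satisfied by simply shrinking the codes.
    """
    h_c = F.normalize(h_c, dim=1)
    h_p = F.normalize(h_p, dim=1)
    correlation = h_c.t() @ h_p          # (dim, dim) cross-correlation over the batch
    return (correlation ** 2).sum()      # ||H_c^T H_p||_F^2

def sparsity_loss(h_p: torch.Tensor) -> torch.Tensor:
    """L1 regularizer: our alternative, keeping the private code sparse."""
    return h_p.abs().mean()

if __name__ == "__main__":
    h_c = torch.randn(32, 100)
    h_p = torch.randn(32, 100)
    print(orthogonality_loss(h_c, h_p).item(), sparsity_loss(h_p).item())
```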


In the experiment part, they solve four asymmetric domain adaptation tasks and choose hyper-parameters with a small set of labeled data. These settings are somewhat questionable if the reviewers are strict about details. But the paper gives a very thorough explanation of why they excluded the popular benchmarks (Office, Caltech-256) and of how they validate. This is what I should elucidate in my own future writing.


Brief summary of [DR003]: a wonderful job. Separate the codings rather than learn one super-general coding. We are one step too late. Fluent writing and detailed supplementary material.
