[DR003] Domain Separation Networks

Posted on June 21, 2017

Today I am going to introduce a paper built on an idea similar to one that my tutor and I wanted to verify. Although it is pleasant to see that our idea and research direction are feasible, we regret not having implemented it first.

When it comes to transfer learning, we usually want to learn a model so general that it maps inputs from both the source and target domains into an aligned space. However, this expectation is too idealistic, even with labeled data from both domains. Therefore, we should accept that part of the coding is domain-specific or, in this paper's terms, private to each domain.

The essence of this paper can be summarized in one self-explanatory figure:

The caption does most of the explaining, and the idea is not that hard to come up with. Nevertheless, the details of the design are worth reading.

To keep features from different domains aligned, two kinds of \mathcal{L}_{similarity} are compared: (1) a classical MMD formula with RBF kernels:

and (2) a more advanced loss based on adversarial training:

These two similarity losses represent two mainstream approaches to domain adaptation.

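To make option (1) concrete, here is a minimal NumPy sketch of a biased squared-MMD estimate between the shared codings of a source batch and a target batch. It is only an illustration under my own assumptions: it uses a single RBF kernel with a hand-picked bandwidth `sigma`, and the names `rbf_kernel`, `mmd_loss`, `h_s` and `h_t` are mine, not the paper's. The adversarial variant (2) is not covered here.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Guard against tiny negative values caused by floating-point error.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd_loss(h_s, h_t, sigma=1.0):
    """Biased estimate of the squared MMD between two batches of codings.

    h_s: (N_s, d) shared codings of source samples.
    h_t: (N_t, d) shared codings of target samples.
    sigma: kernel bandwidth (an assumed value; single kernel for simplicity).
    """
    k_ss = rbf_kernel(h_s, h_s, sigma).mean()
    k_st = rbf_kernel(h_s, h_t, sigma).mean()
    k_tt = rbf_kernel(h_t, h_t, sigma).mean()
    return k_ss - 2.0 * k_st + k_tt


rng = np.random.default_rng(0)
h_source = rng.normal(size=(50, 4))
h_target_near = rng.normal(size=(50, 4))   # same distribution as the source
h_target_far = h_target_near + 3.0         # shifted distribution

print(mmd_loss(h_source, h_target_near))
print(mmd_loss(h_source, h_target_far))
```

The shifted batch should give a clearly larger value than the batch drawn from the same distribution, which is the behaviour a similarity loss needs: minimizing it pulls the two sets of shared codings together.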

They do not use the usual L_2 loss as the reconstruction error. Instead, a scale-invariant mean squared error is employed:

[DR003] This reconstruction term outperforms the usual L_2 loss. It is a very interesting term that I cannot explain at present. I will add it to the future DR reading list. This paper has already been reported on in [DR005].

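For reference, here is a small NumPy sketch of a scale-invariant MSE as I understand it: the ordinary mean squared error minus the squared mean of the element-wise differences. The function name `si_mse` and the flattening of inputs are my own choices, so treat this as an illustration rather than the paper's implementation.

```python
import numpy as np


def si_mse(x, x_hat):
    """Scale-invariant MSE between an input x and its reconstruction x_hat.

    Computes (1/k) * sum(d^2) - (1/k^2) * (sum(d))^2 with d = x - x_hat
    and k the number of elements. This equals the variance of d, so a
    constant offset between x and x_hat is not penalized.
    """
    x = np.asarray(x, dtype=float).ravel()
    x_hat = np.asarray(x_hat, dtype=float).ravel()
    diff = x - x_hat
    k = diff.size
    return np.sum(diff ** 2) / k - (np.sum(diff) ** 2) / k ** 2


x = np.array([1.0, 2.0, 3.0])

# A reconstruction that is uniformly brighter by 1: plain MSE is 1.0,
# but the scale-invariant version ignores the constant offset.
print(si_mse(x, x + 1.0))

# A reconstruction whose errors differ per element is still penalized.
print(si_mse([0.0, 0.0], [1.0, -1.0]))
```

Written this way, the loss only looks at how the errors differ from one element to another, not at their common offset. That at least hints at why it may suit image reconstruction, where getting the overall shape right matters more than the absolute intensity.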

To ensure that the shared coding h_c and the domain-private coding h_p are exclusive, an orthogonality loss is introduced:

In our design, we use a sparsity regularizer (L1 loss) to keep h_p sparse and simple, and a subtraction to eliminate the collision between codings. I tried replacing the subtraction with masking, but the performance was not satisfying. [DR003] Maybe this model can be adapted into an attention model in which the domain-private coding h_p is replaced by a selective mask. From another point of view, if we treat h_c and h_p as two separate spaces (which means we concatenate them rather than sum them in the following processing), it is unnecessary to maintain this exclusivity restriction.

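As a reference point for the orthogonality loss mentioned above, here is a minimal NumPy sketch. As far as I understand the paper, the loss is the squared Frobenius norm of H_c^T H_p, where the rows of H_c and H_p are the shared and private codings of the samples in a batch, added up over the source and target domains. The function names and argument layout are my own assumptions.

```python
import numpy as np


def difference_loss(h_c, h_p):
    """Squared Frobenius norm of H_c^T H_p for one domain.

    h_c: (N, d_c) shared codings, one row per sample.
    h_p: (N, d_p) private codings of the same samples.
    The loss is zero when every shared dimension is orthogonal
    (across the batch) to every private dimension.
    """
    correlation = h_c.T @ h_p          # shape (d_c, d_p)
    return float(np.sum(correlation ** 2))


def total_difference_loss(h_c_s, h_p_s, h_c_t, h_p_t):
    """Sum of the per-domain terms for the source (s) and target (t) batches."""
    return difference_loss(h_c_s, h_p_s) + difference_loss(h_c_t, h_p_t)


# Two samples, two dimensions per coding.
h_shared = np.array([[1.0, 0.0],
                     [0.0, 0.0]])
h_private = np.array([[0.0, 0.0],
                      [0.0, 1.0]])

print(difference_loss(h_shared, h_private))  # codings do not overlap
print(difference_loss(h_shared, h_shared))   # a coding overlaps with itself
```

Unlike a subtraction or a hard mask, this term is a soft penalty: it pushes the two codings to carry different information without fixing in advance which dimensions belong to which part.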

In the experiment part, they solve four asymmetric domain adaptation tasks and choose hyper-parameters with a small amount of labelled data. These settings are somewhat questionable if the reviewers are strict about details. But the paper gives a very thorough explanation of why they exclude popular tasks (Office, Caltech-256) and of how they validate. This is something I should spell out in my own future writing.


Brief summary of [DR003]: A wonderful job. Separate the coding rather than learn a super-general coding. We are one step late. Fluent writing and detailed supplementary material.