# [DR013] Distral: Robust Multitask Reinforcement Learning

Posted on July 15, 2017

In this work, the authors use multi-task learning (MTL) to speed up the training of reinforcement learning (RL) policies. Applying MTL directly to RL can suffer from several problems: 1) the gradients from different tasks may conflict with each other, and 2) an easy task may dominate training because it obtains reward more easily and earlier than the other tasks. To use MTL properly, they learn a distilled (averaged) policy $\pi_0$ that regularizes each task-specific policy $\pi_i$.

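More specifically, the objective is:

$$
\begin{aligned}
J(\pi_0, \{\pi_i\}_{i=1}^n) &= \sum_i \mathbb{E}_{\pi_i}\Big[\sum_{t\ge 0} \gamma^t R_i(a_t, s_t) - c_{KL}\gamma^t \log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t)} - c_{Ent}\gamma^t \log\pi_i(a_t|s_t)\Big] \\
&= \sum_i \mathbb{E}_{\pi_i}\Big[\sum_{t\ge 0} \gamma^t R_i(a_t, s_t) + \frac{\gamma^t\alpha}{\beta}\log\pi_0(a_t|s_t) - \frac{\gamma^t}{\beta}\log\pi_i(a_t|s_t)\Big]
\end{aligned}
\tag{1}
$$

where $\gamma$ is the discount factor, and $c_{KL}$ and $c_{Ent}$ are hyperparameters that balance the two regularization terms: the KL term $\log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t)}$, which asks each task-specific policy to stay consistent with the distilled policy, and the entropy term $\log\pi_i(a_t|s_t)$, which prevents the policy from converging to a greedy policy that only locally maximizes expected return. The second line is just a rewriting of the first, with $\alpha = \frac{c_{KL}}{c_{KL}+c_{Ent}}, \beta = \frac{1}{c_{KL}+c_{Ent}}$.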
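As a quick sanity check on this change of variables, here is a minimal sketch in plain Python. It assumes the per-step term has the form $R - c_{KL}\log\frac{\pi_i}{\pi_0} - c_{Ent}\log\pi_i$ (the common factor $\gamma^t$ is dropped), and the function names and numbers are made up for illustration only.

```python
import math

def alpha_beta(c_kl, c_ent):
    """Map the two regularization costs to (alpha, beta) as defined in the text."""
    total = c_kl + c_ent
    return c_kl / total, 1.0 / total

def step_term_costs(r, log_pi_i, log_pi_0, c_kl, c_ent):
    """Per-step term written with c_KL and c_Ent (assumed first form)."""
    return r - c_kl * (log_pi_i - log_pi_0) - c_ent * log_pi_i

def step_term_alpha_beta(r, log_pi_i, log_pi_0, alpha, beta):
    """The same per-step term written with alpha and beta (second form)."""
    return r + (alpha / beta) * log_pi_0 - (1.0 / beta) * log_pi_i

# Illustrative values, not taken from the paper.
c_kl, c_ent = 0.3, 0.1
alpha, beta = alpha_beta(c_kl, c_ent)
r, log_pi_i, log_pi_0 = 1.0, math.log(0.6), math.log(0.25)

first = step_term_costs(r, log_pi_i, log_pi_0, c_kl, c_ent)
second = step_term_alpha_beta(r, log_pi_i, log_pi_0, alpha, beta)
print(alpha, beta, first, second)  # the last two values should agree
```

The two forms agree because $\alpha/\beta = c_{KL}$ and $1/\beta = c_{KL} + c_{Ent}$, so $\alpha$ controls how much of the regularization pulls toward $\pi_0$ and $\beta$ acts as an inverse temperature.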

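Rather than only optimizing equation (1) alternately over $\pi_0$ and $\pi_i$, which resembles EM, they propose a parameterization that trains $\pi_0$ and $\pi_i$ jointly. The framework is built on the optimal Boltzmann policy. Suppose the estimated distilled policy is parameterized by $\theta_0$ as

$$\hat{\pi}_0(a_t|s_t) = \frac{\exp(h_{\theta_0}(a_t|s_t))}{\sum_{a'}\exp(h_{\theta_0}(a'|s_t))}$$

and the estimated soft advantage on task $i$ is

$$\hat{A}_i(a_t|s_t) = f_{\theta_i}(a_t|s_t) - \frac{1}{\beta}\log\sum_{a}\hat{\pi}_0^{\alpha}(a|s_t)\exp(\beta f_{\theta_i}(a|s_t)),$$

then the estimated policy $\hat{\pi}_i$ is

$$\hat{\pi}_i(a_t|s_t) = \hat{\pi}_0^{\alpha}(a_t|s_t)\exp(\beta\hat{A}_i(a_t|s_t)) = \frac{\exp(\alpha h_{\theta_0}(a_t|s_t) + \beta f_{\theta_i}(a_t|s_t))}{\sum_{a'}\exp(\alpha h_{\theta_0}(a'|s_t) + \beta f_{\theta_i}(a'|s_t))}.$$

This two-column design resembles traditional multi-task methods, where a shared component is adjusted by a task-specific component. Diving into the gradient of $J$ w.r.t. $\theta_0$,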

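$$
\nabla_{\theta_0} J = \sum_i \mathbb{E}_{\hat{\pi}_i}\Big[\sum_{t\ge 0}\nabla_{\theta_0}\log\hat{\pi}_i(a_t|s_t)\Big(\sum_{u\ge t}\gamma^u R_i^{reg}(a_u, s_u)\Big)\Big] + \frac{\alpha}{\beta}\sum_i \mathbb{E}_{\hat{\pi}_i}\Big[\sum_{t\ge 0}\gamma^t\sum_{a'_t}\big(\hat{\pi}_i(a'_t|s_t) - \hat{\pi}_0(a'_t|s_t)\big)\nabla_{\theta_0}h_{\theta_0}(a'_t|s_t)\Big],
$$

where $R_i^{reg}(a, s) = R_i(a, s) + \frac{\alpha}{\beta}\log\hat{\pi}_0(a|s) - \frac{1}{\beta}\log\hat{\pi}_i(a|s)$ is the regularized reward, we find that the second term pulls $\hat{\pi}_0$ toward the task policies, so that, as far as this term is concerned, the optimal solution for $\pi_0$ is the centroid of all task policies $\hat{\pi}_0^*(a'_t|s_t) = \frac{1}{n}\sum_i{\hat{\pi}_i}(a'_t|s_t)$. Compared with other multi-task methods, which transfer knowledge in the space of parameters (by sharing or regularizing parameters), the proposed method operates in the space of policies.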
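To make the two-column idea and the centroid claim concrete, here is a minimal sketch for a single state with three actions. It assumes the distilled policy is a softmax over shared logits `h` and each task policy is a softmax over `alpha * h + beta * f`. The logits, the helper names, and the simplification of holding the task policies fixed when checking the centroid are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def task_policy(h, f, alpha, beta):
    """Two-column policy: shared logits h adjusted by task-specific logits f."""
    return softmax([alpha * hj + beta * fj for hj, fj in zip(h, f)])

def soft_advantage(h, f, alpha, beta):
    """f minus the soft (log-sum-exp) value under the distilled prior."""
    pi0 = softmax(h)
    log_z = math.log(sum(p ** alpha * math.exp(beta * fj) for p, fj in zip(pi0, f)))
    return [fj - log_z / beta for fj in f]

def centroid(policies):
    """Action-wise mean of several policies at the same state."""
    n = len(policies)
    return [sum(p[a] for p in policies) / n for a in range(len(policies[0]))]

def total_kl(policies, q):
    """Sum over tasks of KL(pi_i || q), the distillation cost of using q as pi_0."""
    return sum(sum(pa * math.log(pa / qa) for pa, qa in zip(p, q)) for p in policies)

# Illustrative numbers only.
alpha, beta = 0.75, 2.5
h = [0.2, -0.1, 0.5]                             # shared column
f_tasks = [[1.0, 0.0, -1.0], [-0.5, 0.8, 0.1]]   # one task column per task

pi0 = softmax(h)
task_pis = [task_policy(h, f, alpha, beta) for f in f_tasks]
adv0 = soft_advantage(h, f_tasks[0], alpha, beta)
pi_bar = centroid(task_pis)

print("pi_0:", pi0)
print("pi_i:", task_pis)
print("centroid:", pi_bar)
print("total KL at centroid vs at pi_0:", total_kl(task_pis, pi_bar), total_kl(task_pis, pi0))
```

With the task policies held fixed, the mean minimizes $\sum_i \mathrm{KL}(\pi_i \,\|\, \pi_0)$, which is the sense in which the distilled policy is a centroid. In the full method the task policies also depend on $\theta_0$ and the states are visited with task-dependent frequencies, so this is only a per-state illustration.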

Compared with the baselines, DisTraL converges faster and reaches better final performance. The ablation experiment is comprehensive and the comparison is threefold: 1) KL regularization vs. entropy regularization, 2) alternating vs. joint optimization, and 3) separate policies vs. the two-column distilled policy.

[DR013]: Directions for future research are already given in the paper:

1. Combining Distral with techniques which use auxiliary losses [12, 15, 14]
2. Exploring use of multiple distilled policies or latent variables in the distilled policy to allow for more diversity of behaviours
3. Exploring settings for continual learning where tasks are encountered sequentially
4. Exploring ways to adaptively adjust the KL and entropy costs to better control the amounts of transfer and exploration

Apart from those, I think there are several other, more detailed directions we can investigate:

1. The DisTraL model distills the policy for every pair $(a, t)$, which is sub-optimal. We should pay greater attention to the action choices agreed on by more than $\tau$ tasks, where $\tau$ is a threshold. This is because, on states where the task policies disagree with each other, the optimal $\pi_0^*$ is a misleading policy that fails to reconcile the diverse policies.
2. The distilled policy $\pi_0(a_t|s_t)$ could interact not only with the current task-specific policy $\pi_i(a_t|s_t)$ but also with the context, i.e. the history of that policy, giving $\pi_0(a_t|s_t, s_{t-1}, ...)$. I don’t know the best way to formulate it. One possible idea is to change the regularization term from
$\mathbb{E}_{\pi_i}[\sum_{t\ge 0}\gamma^t \log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t)}]$ to
$\mathbb{E}_{\pi_i}[\sum_{t\ge 1}\gamma^t \sum_{s_{t-1}, a_{t-1}} p_i(s_t|s_{t-1}, a_{t-1}) \pi_i(a_{t-1}|s_{t-1}) \log\frac{\pi_i(a_t|s_t)}{\pi_0(a_t|s_t, s_{t-1})}]$. This resembles an n-gram model in NLP, in that the shared policy is built over multiple steps.
One way to achieve this is to include the previous position/action in the definition of “state”.
3. Transferring across different environments is more meaningful and convincing than transferring between different tasks in the same environment. In the latter setting, some of the improvement might simply come from multiple agents exploring the same map simultaneously. It would be more general to design experiments where the input spaces differ across tasks (while the action space may stay the same). The shared policy could then take an intermediate feature as its input.
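The remark in item 2 about folding the previous position/action into the state can be sketched as a small wrapper. This is only one hypothetical way to do it: the class name `HistoryState`, the window length `k`, and the tuple layout are assumptions for illustration and do not come from the paper.

```python
from collections import deque

class HistoryState:
    """Builds an augmented state from the current state plus the last k (state, action) pairs."""

    def __init__(self, k, pad=None):
        self.k = k
        self.pad = pad  # placeholder used before k steps have been taken
        self.buffer = deque([pad] * k, maxlen=k)

    def reset(self, state):
        """Clear the history at the start of an episode and return the augmented state."""
        self.buffer = deque([self.pad] * self.k, maxlen=self.k)
        return self._augment(state)

    def step(self, state, action, next_state):
        """Record the transition just taken and return the augmented next state."""
        self.buffer.appendleft((state, action))  # the oldest entry is dropped automatically
        return self._augment(next_state)

    def _augment(self, state):
        # Most recent history entry comes first, right after the current state.
        return (state,) + tuple(self.buffer)

# Usage with symbolic states and actions.
tracker = HistoryState(k=2)
print(tracker.reset("s0"))             # ('s0', None, None)
print(tracker.step("s0", "a0", "s1"))  # ('s1', ('s0', 'a0'), None)
print(tracker.step("s1", "a1", "s2"))  # ('s2', ('s1', 'a1'), ('s0', 'a0'))
```

Feeding the augmented tuple to both $\pi_0$ and $\pi_i$ leaves the objective unchanged in form, since the history simply becomes part of what the policies condition on.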