[DR010] Relative Entropy Policy Search

Relative Entropy Policy Search (REPS) proposes a new policy search approach to reinforcement learning. The update of policy always results in the loss of information, or, in other words, makes the policy forget the learned experience. To avoid that, this method limits the KL-divergence between the sampled data from new policy and that from the preivous policy under a user-defined constant.

I saw this paper because both [DR006] and another paper use REPS to keep the policies between two consecutive iterations “close” to each other. However, the core design is pretty short and most pages of this paper are about how to solve the designed equation group and how to extend it to policy iteration setting. There are many mathematical deductions in this paper.

The motivation is very straightforward. By directly optimizing the expected return, we might disconnect between finding an optimal policy based on current observation and staying close to the observed data and the learned policy. It also results in a loss of data as an improved policy needs
to forget experience to avoid the mistakes of the past and to aim on the observed successes. Additionally, if we update the policy purely based on the return, we are more likely to get a policy that eliminates states in which only bad actions have been tried. This problem is called “optimization bias” and may appear due to under-sampling, model errors, or even policy update itself.

This work is to bound the information loss during updates. Standard notation of policy learning is used:

The observed data collected in history is $q(s,a)$ , and the policy update wants to choose a new policy $p^{\pi}(s,a)=\mu^{\pi}(s)\pi(a|s)$ . Directly maximizing Equ. (2) can lead to information loss. Therefore, they want to limit that loss by restricting the KL-divergence between two aforementioned distributions:

where $\varepsilon$ is user-defined boundary. For instance, if only bad actions have been tested on a state $s_-$ , then direct optimization might choose a policy that totally avoids $s_-$ . But this obviation might lead to sub-optimal solution. However, total forgetting is limited in Equ. (4), which renders the policy more tolerant to bad actions before jumping to conclusions. Just like direct descent versus simulated annealing.

With this limitation, our problem becomes:

Note that Equ. (6) is the major difference between REPS and other methods. If you are not a fan of mathematical deduction, please feel free to ignore the next image:

This deduction tells us how to extract a policy without violating the inequation in Equ. (5)-(8). Our final solution (REPS) looks like:

REPS introduces the notion of observed data rather than ignore them, but also gives a higher probability to actions with a higher reward. An extension of REPS under the on-policy setting is also designed in the paper, I will skip that in this post.

[DR010]: This work is a very elegant work that has strong motivation, introduces a comprehensible loss, and gives solid mathematical deduction. In short, this paper wants to make use of the observed data in the future learning. Admittedly, we cannot explore all possible track in most reinforcement learning tasks in one batch. Therefore, omitting previously observed data and focusing maximizing the reward based on current observation favors might have an innegligible bias towards fresh memory over previous experience.

Experience replay is more usually employed since it does not change the sampling method. Besides, most current reinforcement agents are implemented with the neural network, which is of high capacity and has redundant neurons to remember previous data.

Since it is a classical ML/RL paper with a lot of maths to digest and I am still confused on some details, I currently have no inspiration to extend this paper. But reading classical ML papers like this one always serves as a refreshment for me. Some papers about NN structure design are more like engineer project reports with everything auto-differentiated.