[DR004] Asynchronous Methods for Deep Reinforcement Learning

Posted on June 22, 2017

Today I am going to share one of the milestones in reinforcement learning. This paper is mostly known for A3C (Asynchronous Advantage Actor-Critic), but it actually proposes a general framework in which four reinforcement learning methods are accelerated by asynchronous parallel training.

The main idea is to run several copies of the environment in parallel, with each copy of the model (the policy and/or value network) interacting with its own copy of the environment. After a certain number of steps, the gradients accumulated by each worker are sent to the shared global model to update it, and copying the latest weights from the global model back to each worker is also done asynchronously.
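To make the data flow concrete, here is a minimal sketch of that pull-accumulate-push pattern. It is not the paper's code: shared_params, SYNC_EVERY and fake_gradient are names I made up, and the gradient is a random stand-in so the example runs on its own.

```python
import threading
import numpy as np

# Shared "global" parameters that every worker reads and updates without locks,
# in the lock-free style the paper borrows from Hogwild!-type training.
shared_params = np.zeros(8)
SYNC_EVERY = 5        # steps between gradient pushes (t_max in the paper)
LEARNING_RATE = 0.01

def fake_gradient(params):
    """Stand-in for the real RL gradient; only the data flow matters here."""
    return np.random.randn(*params.shape)

def worker(n_updates):
    for _ in range(n_updates):
        # 1) Pull the latest weights from the shared model.
        local_params = shared_params.copy()
        # 2) Interact with this worker's own environment copy for a few steps,
        #    accumulating gradients locally.
        grad = np.zeros_like(local_params)
        for _ in range(SYNC_EVERY):
            grad += fake_gradient(local_params)
        # 3) Push the accumulated gradient to the shared model asynchronously.
        shared_params[:] = shared_params - LEARNING_RATE * grad

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the paper the accumulated gradients go through RMSProp (with statistics shared across threads in the best-performing variant) rather than plain SGD, but the pull-accumulate-push flow is the same.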

It is natural that parallelism accelerates learning. Beyond that, there are two significant bonuses: (1) By letting each worker choose actions differently (for example, with its own ε-greedy exploration rate, as in the sketch below), the framework explores the environment faster. (2) Previous deep RL methods rely on experience replay to stabilize training, which costs memory and computation to record many transitions before updating, and restricts the choice to off-policy algorithms that can learn from experience generated by an older policy. In contrast, because the multiple workers decorrelate the data by exploring different parts of the environment at the same time, the proposed asynchronous framework stabilizes training without storing experience at all.
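For point (1), the diversity can be as simple as giving each worker thread its own exploration rate, drawn once at start-up. A small sketch, with candidate values that are illustrative rather than the exact ones used in the paper:

```python
import random

# Each worker draws its own final exploration rate once at start-up, so
# different threads behave differently in otherwise identical environments.
EPSILON_CANDIDATES = [0.5, 0.1, 0.01]   # illustrative values

def sample_worker_epsilon():
    return random.choice(EPSILON_CANDIDATES)

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```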

Four reinforcement learning methods are implemented under the proposed framework: asynchronous one-step Q-learning, asynchronous one-step SARSA, asynchronous n-step Q-learning, and asynchronous advantage actor-critic (A3C). A3C is the most famous of them, mainly because it outperforms the other methods on most tasks.

The pseudocode of A3C is not hard to follow: each worker repeatedly copies the latest weights from the global model, collects up to t_max steps of experience with its local policy, computes n-step returns and advantages, and pushes the accumulated policy and value gradients back to the global model.
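I won't reproduce the algorithm box verbatim; instead, here is a rough Python transcription of one worker's cycle under my own naming (env, policy, value_fn and apply_gradients are assumed interfaces, not anything from the paper's code), with the entropy regularization term of the policy loss left out for brevity:

```python
GAMMA = 0.99   # discount factor
T_MAX = 5      # maximum number of steps between asynchronous updates

def a3c_worker_cycle(env, state, policy, value_fn, apply_gradients):
    """One cycle of an A3C worker: act for up to T_MAX steps, then push
    the accumulated policy and value gradients to the shared model."""
    states, actions, rewards = [], [], []
    done = False
    for _ in range(T_MAX):
        action = policy.sample(state)                  # a ~ pi(a | s; theta')
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break

    # Bootstrap from the critic unless the episode actually terminated.
    R = 0.0 if done else value_fn(state)

    policy_grad = 0.0
    value_grad = 0.0
    for s, a, r in reversed(list(zip(states, actions, rewards))):
        R = r + GAMMA * R                              # n-step return
        advantage = R - value_fn(s)
        # Ascent direction for the policy objective (entropy bonus omitted).
        policy_grad += advantage * policy.grad_log_prob(s, a)
        # Descent direction for the value loss 0.5 * (V(s) - R)^2.
        value_grad += (value_fn(s) - R) * value_fn.grad(s)

    # Push the accumulated gradients to the shared model asynchronously.
    apply_gradients(policy_grad, value_grad)
    return env.reset() if done else state
```

In the actual algorithm each thread also keeps its own copy of the parameters and resynchronizes it with the global parameters at the start of every cycle, which is exactly the pull step in the earlier sketch.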

Since multiple workers run in parallel, the amount of exploration per unit of wall-clock time should scale roughly linearly with the number of workers (threads in the paper).


[DR004]: From my perspective, this paper offers more engineering insight than theoretical innovation. If you are interested in reimplementing it, I strongly recommend reading this blog. Aside from the acceleration, the paper shows that parallelism can also stabilize training by introducing diversity. If we treat each trajectory as a sample, then updating an RL method on a single trajectory is like training a model with a batch size of 1, which is clearly unreasonable because the last few samples can distort what was learned before. Both experience replay and asynchronous training can be interpreted as ways of increasing the effective batch size.

Therefore, asynchronous training might also be applied to other fields that benefit from sample diversity but cannot easily draw large batches quickly. Off the top of my head, tasks like translation or video generation could be good candidates for an asynchronous framework.
