[DR001] Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

Daily Reading (DR) is a new column where I will read and summarize a paper every few days (not necessarily daily :P). I hope cultivating this habit will spur me to read more papers.

Today I am going to share the paper Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. This paper proposed a framework that combines deep Q-learning (DQN) and hierarchical reinforcement learning (HRL). The motivation is straightforward: when reward feedback is temporally sparse, common reinforcement learning methods struggle to learn how to interact with the environment. To resolve this problem, HRL breaks the task into several intrinsic goals. Those goals, designed with hand-crafted or unsupervised feedback, are used to train a controller to choose actions, while a meta-controller is trained to pick goals.

We can treat the reward from the intrinsic goal as an ordinary reward in classical RL. It is used to train the controller's Q-function, from which we pick an action. The only difference is that the Q-function is conditioned on the goal.
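To make the goal conditioning concrete, here is a minimal tabular sketch of such a controller. It is not the paper's implementation (which uses a deep network); the class name `GoalConditionedQ` and the hyperparameters `alpha`, `gamma`, and `epsilon` are my own assumptions for illustration.

```python
import random
from collections import defaultdict


class GoalConditionedQ:
    """Tabular stand-in for the controller's Q1(s, a; g). Hypothetical sketch."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        # The table is keyed by (state, goal, action), so every goal
        # effectively gets its own Q-function.
        self.q = defaultdict(float)

    def best_action(self, state, goal):
        # Greedy action for this state under this goal (ties go to the lowest index).
        return max(range(self.n_actions), key=lambda a: self.q[(state, goal, a)])

    def act(self, state, goal, epsilon=0.1):
        # Epsilon-greedy exploration on top of the greedy choice.
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        return self.best_action(state, goal)

    def update(self, state, goal, action, intrinsic_reward, next_state, done):
        # Ordinary Q-learning update, driven by the intrinsic reward only.
        if done:
            target = intrinsic_reward
        else:
            best_next = max(self.q[(next_state, goal, a)] for a in range(self.n_actions))
            target = intrinsic_reward + self.gamma * best_next
        key = (state, goal, action)
        self.q[key] += self.alpha * (target - self.q[key])
```

The same state can map to different actions depending on which goal is active, which is all that "goal-conditioned" means here.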

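The meta-controller only updates when the goal is achieved or the episode terminates. It uses an ordinary RL model with temporal gaps: each of its transitions spans all the steps the controller took while pursuing one goal.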
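A rough tabular sketch of that update follows. The idea is to sum the extrinsic rewards collected while one goal was active and treat the whole stretch as a single transition. The class name `MetaController` and the values of `alpha` and `gamma` are assumptions of mine, not taken from the paper.

```python
from collections import defaultdict


class MetaController:
    """Tabular stand-in for the meta-controller's Q2(s, g). Hypothetical sketch."""

    def __init__(self, goals, alpha=0.1, gamma=0.99):
        self.goals = list(goals)
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = defaultdict(float)  # keyed by (state, goal)

    def pick_goal(self, state):
        # Greedy goal selection; exploration is left out to keep this short.
        return max(self.goals, key=lambda g: self.q[(state, g)])

    def update(self, start_state, goal, extrinsic_rewards, end_state, episode_over):
        # One update per goal: every extrinsic reward collected while the
        # goal was active is summed into a single reward.
        total = sum(extrinsic_rewards)
        if episode_over:
            target = total
        else:
            best_next = max(self.q[(end_state, g)] for g in self.goals)
            target = total + self.gamma * best_next
        key = (start_state, goal)
        self.q[key] += self.alpha * (target - self.q[key])
```

Here `extrinsic_rewards` would be the list of environment rewards gathered between picking the goal and reaching it (or the episode ending).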

The updates are illustrated as follows (note that the $Q_2(s_t, a_t, \theta_1, g_t)$ in the caption should be $Q_1(s_t, a_t, \theta_1, g_t)$):
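For reference, the two Q-learning targets can be written roughly as below. This is my own sketch in the caption's notation ($Q_1$, $\theta_1$ for the controller and $Q_2$, $\theta_2$ for the meta-controller). As assumed symbols, $r_t$ is the intrinsic reward, $f_{t'}$ the extrinsic reward, $N$ the number of steps the goal $g_t$ stays active, and $\gamma$ the discount factor.

$$
\begin{aligned}
y_1 &= r_t + \gamma \max_{a'} Q_1(s_{t+1}, a', \theta_1, g_t) \\
y_2 &= \sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_2(s_{t+N}, g', \theta_2)
\end{aligned}
$$

The controller's target $y_1$ is computed at every step, while the meta-controller's target $y_2$ is computed once per goal.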

The remaining difficulty lies in how to design intrinsic goals for real tasks. Their experiments demonstrated a few ways to do it:

1. Use states as goals.

If the task can be abstracted as a graph, then the nodes/states of the graph can be used as goals. This is reasonable because an agent that takes an action meant to move it from A to B will sometimes end up in C. The intrinsic goal makes sure the movement is actually carried out. With this guarantee, the meta-controller can focus on modeling the big picture.

This is a game they designed. The agent starts at s2 and the episode ends at s1. Taking the action “right” moves the agent either right or left with a 50/50 chance, while taking the action “left” always moves the agent to the state on its left. A reward of 1 is given if the agent has visited s6; otherwise the reward is 1/100.
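The game is small enough to sketch in a few lines. The version below is my own reading of the description above, not the authors' code. It assumes six states s1..s6, that the reward is paid when the agent reaches s1, and that a successful “right” at s6 leaves the agent at s6. The class name `StochasticChain` is made up.

```python
import random


class StochasticChain:
    """Six-state chain s1..s6; start at s2, terminate at s1. Hypothetical sketch."""

    N_STATES = 6
    LEFT, RIGHT = 0, 1

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.reset()

    def reset(self):
        self.state = 2
        self.visited_s6 = False
        return self.state

    def step(self, action):
        # "right" succeeds only half of the time; otherwise the agent slips left.
        if action == self.RIGHT and self.rng.random() < 0.5:
            # Assumption: moving right at s6 keeps the agent at s6.
            self.state = min(self.state + 1, self.N_STATES)
        else:
            self.state -= 1
        if self.state == self.N_STATES:
            self.visited_s6 = True
        done = self.state == 1
        reward = 0.0
        if done:
            # Assumption: the reward is paid once, on reaching s1.
            reward = 1.0 if self.visited_s6 else 0.01
        return self.state, reward, done
```

Call `reset()` before each episode. The sparse-reward problem is easy to see here: going straight left ends the episode immediately with 0.01, and the large reward requires a long run of lucky “right” moves first.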

2. Use objects as goals.

The second example is an ATARI 2600 game called ‘Montezuma’s Revenge’, in which the agent has to get the key before it can open the doors.

In this task, they first trained an unsupervised object detector to locate plausible candidate objects. The internal critic is then computed from the relationship between the objects and the agent.
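One simple way to picture such a critic is a check on whether the agent has reached the goal object. The sketch below is purely illustrative. It assumes the detector yields axis-aligned bounding boxes and that "reached" means the boxes overlap; the names `Box`, `boxes_overlap`, and `intrinsic_reward` are hypothetical and not from the paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (assumed detector output format)."""
    x: int
    y: int
    w: int
    h: int


def boxes_overlap(a: Box, b: Box) -> bool:
    # Two boxes overlap when they intersect on both the x and the y axis.
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def intrinsic_reward(agent: Box, goal_object: Box) -> float:
    # Assumption: the goal counts as achieved when the agent touches the object.
    return 1.0 if boxes_overlap(agent, goal_object) else 0.0
```

For example, with the key chosen as the goal, the controller would receive an intrinsic reward of 1 on the frame where the agent's box first overlaps the key's box.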

Comments [DR001]: Dividing a task into several sub-level goals lets the model focus on different levels of the hierarchy separately. To make use of this, it is critical to design the goals without extra labels. Generally, we can separate the task temporally (into states) or spatially (into objects).