[DR009] Learning to learn by gradient descent by gradient descent

Posted on June 29, 2017

Choosing the update step size, or learning rate, is always one of the most taxing jobs when training a neural network. An overly large update prevents training from converging, while an overly small update makes training inefficient. Previous hand-crafted update policies that control the gradient, such as momentum, Rprop, Adagrad, RMSprop and ADAM, are tailored to specific classes of problems. In this paper, the authors instead propose to replace hand-designed update rules with a learned update rule.

Generally speaking, the goal of a task with objective function f(\theta) is to find an optimal \theta^* = \arg\min_\theta f(\theta). If f is differentiable and its optimum has no closed-form solution, we use gradient descent to approximate a (local) optimum:

\theta_{t+1} = \theta_{t} -\alpha_t \nabla f(\theta_t),

where \alpha_t is a key component to control the update scale.
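
As a trivial concrete illustration (my own toy example, not from the paper), vanilla gradient descent with a hand-picked, fixed step size looks like this:

```python
import numpy as np

# Toy objective f(theta) = ||theta - 3||^2 and its gradient (illustrative choice).
def f(theta):
    return np.sum((theta - 3.0) ** 2)

def grad_f(theta):
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
alpha = 0.1  # the hand-crafted step size alpha_t, kept constant here

for t in range(100):
    theta = theta - alpha * grad_f(theta)  # theta_{t+1} = theta_t - alpha_t * grad f(theta_t)

print(theta, f(theta))  # approaches the minimizer theta* = [3, 3]
```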

Instead of hand-crafting \alpha_t, this paper proposes a learning-based method that produces the update itself, by introducing an auxiliary function g parameterized by \phi.

They call g the optimizer and f the optimizee.
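
Concretely, the hand-designed rule above is replaced by (Equ. (1) in the paper, if I transcribe it correctly):

\theta_{t+1} = \theta_{t} + g_t\big(\nabla f(\theta_t), \phi\big),

so the optimizer outputs the update step itself rather than just a scalar step size.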

 

We know that the quality of \theta is measured by f. To evaluate \phi, its loss is defined as (Equ. (2) in the paper):

\mathcal{L}(\phi) = \mathbb{E}_f\big[ f(\theta^*(f, \phi)) \big],

where \theta^* is the final parameter obtained by training with the gradients of the optimizee f and the updates produced by the optimizer g parameterized by \phi. When taking the expectation, the loss function f usually keeps the same form but is evaluated on different inputs/batches.

 

Since it takes multiple steps to reach or approximate \theta^*, we can regard the training-iteration index as a temporal axis. Suppose T steps are taken during training, and denote the parameter after the update at iteration t as \theta_t. The paper then lets \theta_T = \theta^*, assuming the final parameter is the best.

However, they state that if the loss is only applied at the last iteration, the gradient becomes vanishingly small once backpropagated through time (BPTT) to the first iterations. To resolve this, they add the loss f(\theta_t), weighted by w_t, at every iteration t.
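
This gives the training objective actually optimized (Equ. (3) in the paper, if I transcribe it correctly):

\mathcal{L}(\phi) = \mathbb{E}_f\Big[ \sum_{t=1}^{T} w_t\, f(\theta_t) \Big], \quad \text{where } \theta_{t+1} = \theta_t + g_t, \quad [g_t; h_{t+1}] = m(\nabla_t, h_t, \phi),

with m the recurrent optimizer, h_t its hidden state, and \nabla_t = \nabla_\theta f(\theta_t).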

The original loss in Equ. (2) equals Equ. (3) if and only if the weights are w_t = \textbf{1}[t=T]. In practice, they use w_t = 1 at every step for simplicity.

 

There is another detail about the update. Notice that the update to \theta depends on \phi, so \theta_t at later steps is entangled with \phi, and the gradient \nabla_t computed at \theta_t depends on \phi as well. To avoid computing second derivatives of f, they let \frac{\partial \nabla_t}{\partial \phi} = 0, as shown in Figure 2.
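
In an autodiff framework, this trick simply means treating \nabla_t as a constant with respect to \phi. Below is a minimal PyTorch-style sketch of one unrolled training segment; the toy least-squares optimizee and the feedforward stand-in for g are my own illustrative choices, not the paper's code:

```python
import torch
import torch.nn as nn

# Toy optimizee f (illustrative choice): a small least-squares problem.
W, y = torch.randn(10, 3), torch.randn(10)
def f(theta):
    return ((W @ theta - y) ** 2).sum()

# Hypothetical stand-in for the learned optimizer g (the paper uses a coordinate-wise LSTM).
g = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 3))

theta = torch.zeros(3, requires_grad=True)
meta_loss = 0.0
for t in range(5):                          # T = 5 unrolled steps
    loss = f(theta)
    meta_loss = meta_loss + loss            # w_t = 1 at every step, as in Equ. (3)
    # create_graph=False, so grad is treated as a constant w.r.t. phi: d(grad)/d(phi) = 0
    grad, = torch.autograd.grad(loss, theta, retain_graph=True)
    theta = theta + g(grad)                 # theta_{t+1}; phi enters only through g_t

meta_loss.backward()  # BPTT through the unrolled steps, first-order gradients w.r.t. phi
# An outer optimizer, e.g. torch.optim.Adam(g.parameters()), would now take a step on phi.
```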

Another design I find interesting is the coordinate-wise architecture. Since there might be thousands of parameters in \theta, it is hardly feasible to train an individual set of \phi for each parameter. Therefore, in the optimizer g they use the same set of network parameters \phi but a separate hidden state for each coordinate.
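
A rough sketch of how such a coordinate-wise optimizer could be written in PyTorch (my own simplification; the paper uses a two-layer LSTM with 20 hidden units per layer plus gradient preprocessing):

```python
import torch.nn as nn

class CoordinateWiseOptimizer(nn.Module):
    """Shared LSTM weights (phi), one hidden state per parameter coordinate."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)  # shared phi
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad, state=None):
        # grad: flat gradient of shape (n,); each coordinate is treated as a "batch" element.
        x = grad.view(-1, 1)
        if state is None:
            zeros = x.new_zeros(x.size(0), self.cell.hidden_size)
            state = (zeros, zeros)
        h, c = self.cell(x, state)       # same weights for every coordinate
        update = self.out(h).view(-1)    # g_t: one scalar update per coordinate
        return update, (h, c)
```

Each coordinate thus sees the same weights \phi but carries its own (h, c) state across iterations, so the memory grows with the number of parameters while \phi itself stays tiny.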

 

The first experiment compares against existing methods that control the step size or adjust the gradient for each update. Although the proposed method seems to outperform the competitors, the experiment is still problematic:

  • The reported loss is not necessarily related to the accuracy, especially when the loss is extremely small.
  • The network is too simple (an MLP with a single 20-unit hidden layer).
  • The extra computation of the optimizer g is not reported. The proposed method might cost much more time per iteration than the other methods.

They also test transferability across different network architectures:

The method fails to generalize across different activation functions, which is understandable, but generalization from a 1-layer network to a 2-layer network is feasible. Still, this experiment is not convincing enough: the parameter matrix of a single layer is sometimes low-rank and can be decomposed into the product of two matrices (two layers), so the two architectures may not differ that much.

The extensions in the appendix are very diverse, such as using Global Averaging Cells (GAC) to connect the different hidden states, or combining their method with L-BFGS and a Neural Turing Machine (NTM). The latter is complex enough to be developed into a separate piece of work.

 

 

[DR009]: First of all, the idea of learning an optimizer via learning is quite novel. There are many directions one could work on to improve this work:

  • Transferability: One advantage of previous hand-crafted optimizers is that they can be computed with limited resources (memory and computation) regardless of the network structure or the task. However, the proposed optimizer seems closely tied to the network structure (not sufficiently refuted by the experiments) and to the task (not discussed in the paper, which is plausible given the No Free Lunch theorems).
    • Differences in structure should not be limited to the number of layers. If the method could transfer between feedforward networks, residual networks, highway networks, RNNs and other possible “routes”, then we could confidently claim that this optimizer is all-around for different structures.
    • Although learning an optimizer that is omnipotent on all tasks is unrealistic, as stated in the introduction, there might be shared patterns between different optimization processes. For instance, the learning rate should shrink as the loss converges.
  • Extra computation cost: If we cannot reuse the same optimizer across tasks, then we should reduce the extra cost of training the optimizer. For instance, we could lower the update frequency of the optimizer as the loss converges, or only update the optimizer when the loss plateaus.
  • Training of the optimizer: the optimizer g is trained to train the optimizee f, yet the optimizer itself is still trained via classical methods. It would be interesting to find ways to automate the training of g as well. Here are some crazy thoughts:
    • Train the optimizer with unsupervised learning or reinforcement learning (with a hand-crafted target).
    • Use two optimizers to train each other (dual learning).