[DR006] Learning Step Size Controllers for Robust Neural Network Training

Posted on June 25, 2017

This paper proposes a method to adaptively learn the learning step by inspecting hand-crafted features of the training procedure, which generalizes to different tasks (not limited to classification) and different architectures.

 

The problem of training a network can be described as:

\min_\omega F(\omega; X)

where \omega is the weights of the network, F is the target (loss) function and X is the input. Under most circumstances, the loss function F is an average of multiple individual losses f:

F(\omega; X) = \frac{1}{N} \sum_{i=1}^{N} f(\omega; x_i)

When updating \omega, we usually set

\omega_{t+1} = \omega_t - \xi \nabla_\omega F(\omega_t; X)

So how to choose the open parameter (the learning step) \xi is the major contribution of this paper. They want to learn a policy \pi(\xi|\phi) that chooses the learning step by observing a feature \phi of the learning procedure.

Note that the word “feature” here does not refer to features of the input computed by the network, but to features of the learning procedure itself, which represent the current training status.
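To make the setting concrete, here is a rough sketch of my own (a toy linear-regression problem with a placeholder `controller`; an illustration of the idea, not the paper's algorithm): the step size \xi is produced at every iteration by a controller that observes features of the current training state.

```python
import numpy as np

# Toy problem: linear regression, so F(w; X) = (1/N) * sum_i f(w; x_i)
# with f(w; x_i) = 0.5 * (x_i . w - y_i)^2.  Purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10)
w = np.zeros(10)

def controller(phi):
    """Placeholder step-size controller: maps features of the current
    training state to a learning step xi (the paper *learns* this mapping)."""
    return 0.1 / (1.0 + np.log1p(phi["grad_disagreement"]))

for t in range(200):
    residuals = X @ w - y                # per-sample errors
    grads = residuals[:, None] * X       # per-sample gradients nabla f_i(w)
    grad_F = grads.mean(axis=0)          # gradient of the average loss F

    # Features describe the learning procedure itself, not the inputs.
    phi = {"grad_disagreement": float(grads.var(axis=0).sum())}

    xi = controller(phi)                 # learning step chosen from the features
    w = w - xi * grad_F                  # the usual gradient step, with a learned xi
```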

 

Before designing the feature, they set several ground rules that the feature should obey:

  • the computational complexity of generating informative features should not exceed the complexity of the training algorithm
  • the proposed features will also need to be conservative in their memory requirements

 

To estimate the improvement brought by a certain update \Delta\omega, they use a first-order Taylor expansion of each individual loss:

f_i(\omega + \Delta\omega) \approx f_i(\omega) + \nabla f_i(\omega)^T \Delta\omega
They note that the gradient of the average loss \nabla F does not improve each individual function f_i by the same amount as it improves F. Thus, evaluating the agreement of the individual gradients is likely to yield informative features. The first feature is the Predictive change in function value

Aside from the variance of the gradients, the Disagreement of function values also reflects characteristics of the current training state.
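The exact feature formulas are given in the paper (the equation images are missing here), but the raw ingredients are statistics of the per-sample losses and gradients. A hedged numpy sketch of such statistics, assuming we have the individual losses `f_i`, the individual gradients `g_i`, and a candidate update `delta_w` (all names are my own, not the paper's):

```python
import numpy as np

def training_state_features(f_i, g_i, delta_w, eps=1e-12):
    """Statistics in the spirit of the paper's features (not the exact formulas).

    f_i     : (N,)   individual loss values f_i(w) on the current mini-batch
    g_i     : (N, D) individual gradients nabla f_i(w)
    delta_w : (D,)   proposed parameter update, e.g. -xi * g_i.mean(axis=0)
    """
    # First-order Taylor prediction of how each individual loss would change:
    #   f_i(w + delta_w) - f_i(w)  ~  nabla f_i(w) . delta_w
    predicted_changes = g_i @ delta_w                      # shape (N,)

    features = {
        # Predicted change of the average loss under the update ...
        "mean_predicted_change": predicted_changes.mean(),
        # ... and how much the individual gradients disagree about it.
        "var_predicted_change": predicted_changes.var(),
        # Disagreement of function values across the mini-batch.
        "var_function_values": f_i.var(),
    }
    # Log-scaling keeps the features in a comparable numeric range.
    return {k: float(np.log(eps + abs(v))) for k, v in features.items()}
```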

To stabilize the features during training, a momentum policy is used, i.e., each feature is smoothed with an exponential moving average across iterations:

In the following discussion, we use \phi to denote \hat\phi. Apart from this, for each feature an extra feature

is designed to show how violently the feature changes across iterations. This extra feature, called the Uncertainty Estimate, is also smoothed with the momentum policy. The momentum policy is employed to (1) stabilize the fluctuations between different mini-batches and (2) let the feature direct the learning over multiple iterations.
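As a small illustration of my own (the coefficients and the paper's exact Uncertainty Estimate formula may differ), the smoothing plus the fluctuation estimate could look like this:

```python
import numpy as np

class SmoothedFeature:
    """Exponential-moving-average smoothing of a raw feature stream, plus a
    running estimate of how much it fluctuates (my stand-in for the paper's
    momentum policy and Uncertainty Estimate)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None          # smoothed feature \hat{\phi}
        self.uncertainty = 0.0    # running estimate of the fluctuation

    def update(self, raw_value):
        m = self.momentum
        if self.mean is None:
            self.mean = raw_value
        else:
            # Smooth the feature across mini-batches ...
            self.mean = m * self.mean + (1.0 - m) * raw_value
            # ... and track how violently it moves between iterations.
            self.uncertainty = m * self.uncertainty + (1.0 - m) * (raw_value - self.mean) ** 2
        return self.mean, self.uncertainty

# Usage: feed in the raw feature value computed at every iteration.
smoother = SmoothedFeature(momentum=0.9)
for raw in np.random.default_rng(1).normal(size=10):
    phi_hat, phi_uncertainty = smoother.update(raw)
```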

 

 

With those features (Predictive change in function value, Disagreement of function values, and their Uncertainty Estimates) designed, the second ingredient of the controller is the policy \pi(\xi|\phi), which is used to choose a learning step by observing the features. They simply let \xi = g(\phi;\theta) = \exp(\theta^T\phi); the optimal policy then corresponds to the best choice of the parameters \theta, which is found with policy search.
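Given only the stated form \xi = g(\phi;\theta) = \exp(\theta^T\phi), the controller itself is a one-liner; the sign of each component of \theta determines how \xi reacts to the corresponding feature (e.g., a negative weight on a disagreement feature shrinks the step when gradients disagree). The feature vector and weights below are hand-picked purely for illustration:

```python
import numpy as np

def step_size(phi, theta):
    """xi = exp(theta^T phi): always positive and multiplicative in the features."""
    return float(np.exp(theta @ phi))

# Hypothetical 3-dimensional feature vector (a constant bias plus two of the
# smoothed features from above) and hand-picked controller weights.
phi = np.array([1.0, -2.3, 0.7])
theta = np.array([np.log(1e-2), 0.0, -0.5])   # the bias term sets a base step of 1e-2

xi = step_size(phi, theta)   # the step used for the next parameter update
```

The paper does not hand-pick \theta; it is learned by policy search, which is where the reward function below comes in.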

The remaining component is the reward function r(\cdot, \cdot). However, only one related formula is given

but I fail to find the definition of E. Maybe it is closely related to another work they frequently refer to, Relative Entropy Policy Search (REPS) (Peters, Mülling, and Altun 2010); I might check that paper in the future. Compared with ordinary policy learning methods, REPS keeps consecutive policy updates `close` to each other by constraining the KL-divergence between (samples from) the new policy and the previous policy. Information about REPS is summarized in [DR010], but it still has no information about the reward E in Equ. (14). Please leave a comment if you know the full definition of this reward.
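For reference, the core REPS formulation (written from my memory of Peters, Mülling, and Altun 2010, so the notation may not match Equ. (14) of this paper) maximizes the expected reward while bounding the relative entropy to the previously observed sample distribution q:

\max_\pi \; \mathbb{E}_\pi[r] \quad \text{s.t.} \quad \mathrm{KL}(\pi \,\|\, q) \le \epsilon

This bound is exactly what keeps consecutive policy updates close to each other.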

 

[DR006]: Learning the learning step so as to minimize the loss is very practical when training a neural network. We can see that some hand-crafted features of the training procedure can reflect the training status and hence can be used to direct the training. From another perspective, those unsupervised features themselves, or regularizers based on them, might be useful for boosting the performance of the network, or could serve as an intrinsic loss in transfer learning/one-shot learning/zero-shot learning.

Another, more ambitious idea is to analyze what statistical characteristics the activations of neurons have in a well-trained neural network. We could train a meta-network to generalize those patterns and use it as a regularizer to direct other networks.

 

 

 
