[DR008] Self-Normalizing Neural Networks

This paper is quite quirky because its appendix takes 92 pages while the main content is only 9 pages long. Some pages in the appendix look like:

And:

Although this paper is heavy on mathematical proofs, its motivation is very interesting. It observes that although RNNs and CNNs excel at sequence processing and vision tasks respectively, standard FNNs (feed-forward neural networks, i.e. fully-connected neural networks) fail to compete with traditional methods. According to the paper, RNNs and CNNs can stabilize learning via weight sharing and are therefore less prone to the perturbations introduced by SGD, stochastic regularization (like dropout), and the estimation of the normalization parameters. In contrast, FNNs trained with normalization techniques suffer from these perturbations and show high variance in the training error (see Figure 1).

Therefore, they propose a method that induces self-normalizing properties such as variance stabilization, which in turn avoids exploding and vanishing gradients. More specifically, they design an activation function called SELU (scaled exponential linear unit) to achieve that.

First, let’s define some basic notations:
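As a reminder of the symbols used in the rest of this post, here is my rough summary of the paper's setup for two consecutive layers, with lower-layer activations $x$, a weight vector $w$ for one unit of the higher layer, and an activation function $f$:

$$
\mu = \mathbb{E}[x_i], \quad \nu = \mathrm{Var}[x_i], \quad \omega = \sum_i w_i, \quad \tau = \sum_i w_i^2, \quad z = w^\top x, \quad y = f(z)
$$

$$
g: (\mu, \nu) \mapsto (\tilde{\mu}, \tilde{\nu}), \quad \tilde{\mu} = \mathbb{E}[y], \quad \tilde{\nu} = \mathrm{Var}[y]
$$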

To define the word “stable” or “self-normalizing”, they suggest that

The short version: if the mean and variance of the input lie in a certain range, then the mean and variance of the output should (1) also lie in that range and (2) converge to a fixed point when the activation function is applied iteratively, layer after layer.

To achieve that, a suitable activation function needs several properties:

Both negative and positive outputs exist, to control the mean.
Saturation regions (derivatives approaching zero) exist, to dampen the variance if it is too large in the previous layer.
A region with slope larger than one exists, to increase the variance if it is too small in the previous layer.
The curve is continuous. This ensures a fixed point, where variance damping is balanced by variance increase.

The proposed SELU satisfies all of these:
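For reference, SELU is an ELU scaled by $\lambda$: $\mathrm{selu}(x) = \lambda x$ for $x > 0$ and $\mathrm{selu}(x) = \lambda\alpha(e^{x} - 1)$ for $x \le 0$. A minimal NumPy sketch of that definition follows. The function name `selu` and the constant names are my own choices for illustration, and the default values are the commonly quoted $\alpha_{01}$ and $\lambda_{01}$ written out to more digits.

```python
import numpy as np

# Commonly quoted values of alpha_01 and lambda_01 (about 1.6733 and 1.0507).
ALPHA_01 = 1.6732632423543772
LAMBDA_01 = 1.0507009873554805


def selu(x, alpha=ALPHA_01, lam=LAMBDA_01):
    """Scaled exponential linear unit, applied element-wise."""
    x = np.asarray(x, dtype=float)
    # Clamp to <= 0 before exponentiating so large positive inputs
    # cannot overflow in the branch that np.where later discards.
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0))
    return lam * np.where(x > 0, x, negative_branch)
```

With these constants the function saturates at $-\lambda\alpha \approx -1.758$ for very negative inputs (the damping region) and has slope $\lambda > 1$ for positive inputs (the variance-increasing region), which is how it covers the properties listed above.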

So now we have a plausible activation function with two hyper-parameters, $\lambda$ and $\alpha$. Engineers might choose them via cross-validation, whereas this work derives them mathematically. In the following description, I try my best to avoid writing formulas, which some of you might find intimidating.

First, they show that if the weights are normalized ($\omega=0, \tau=1$) and we want the fixed point to be ($\mu=0, \nu=1$), then $\lambda$ and $\alpha$ can be solved for explicitly ($\alpha_{01}\approx 1.6733, ~\lambda_{01}\approx 1.0507$). The following image visualizes the SELU mapping for these values. The start of each arrow is the distribution of the input, given as $(\mu, \nu)$ (mean, variance), and the end of that arrow is the distribution of the mapped output. Here the “input” and “output” are those of the activation layer, not of the densely-connected layer.
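The fixed point can be checked numerically. The sketch below is my own illustration, not code from the paper. It assumes the pre-activation $z$ is Gaussian with mean $\mu\omega$ and variance $\nu\tau$ (the central-limit-style assumption), for which the first two moments of $\mathrm{selu}(z)$ have closed forms in terms of the normal CDF and PDF. The names `selu_moment_map` and `iterate` are hypothetical helpers.

```python
import math

# Commonly quoted values of alpha_01 and lambda_01, to more digits.
ALPHA_01 = 1.6732632423543772
LAMBDA_01 = 1.0507009873554805


def norm_cdf(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def norm_pdf(t):
    """Standard normal PDF."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def selu_moment_map(mu, nu, omega=0.0, tau=1.0,
                    alpha=ALPHA_01, lam=LAMBDA_01):
    """Map (mean, variance) of one layer's activations to the next layer's.

    Assumption: z ~ Normal(mu * omega, nu * tau).
    """
    m = mu * omega
    s2 = nu * tau
    s = math.sqrt(s2)
    # Contributions from the linear branch (z > 0).
    pos_mass = norm_cdf(m / s)
    e_z_pos = m * pos_mass + s * norm_pdf(m / s)
    e_z2_pos = (m * m + s2) * pos_mass + m * s * norm_pdf(m / s)
    # Contributions from the exponential branch (z <= 0).
    neg_mass = norm_cdf(-m / s)
    e_exp_neg = math.exp(m + 0.5 * s2) * norm_cdf(-(m + s2) / s)
    e_exp2_neg = math.exp(2.0 * m + 2.0 * s2) * norm_cdf(-(m + 2.0 * s2) / s)
    mean = lam * (e_z_pos + alpha * (e_exp_neg - neg_mass))
    second_moment = lam ** 2 * (
        e_z2_pos + alpha ** 2 * (e_exp2_neg - 2.0 * e_exp_neg + neg_mass)
    )
    return mean, second_moment - mean ** 2


def iterate(mu, nu, steps=100, **kwargs):
    """Apply the moment map repeatedly, as if stacking `steps` layers."""
    for _ in range(steps):
        mu, nu = selu_moment_map(mu, nu, **kwargs)
    return mu, nu
```

Calling `selu_moment_map(0.0, 1.0)` should return a pair very close to `(0, 1)`, and `iterate(0.1, 1.4)` should drift toward the same point, which is what each arrow in the visualization depicts for a single step. The `omega` and `tau` arguments are there so the same sketch can be tried with weights that are not exactly normalized.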

However, the weights of the network cannot stay normalized throughout training. Therefore, they relax this restriction in Theorem 1: if the weights ($\omega, \tau$) and the input ($\mu, \nu$) stay within a restricted range, a fixed point for SELU still exists, and it lies in the same range.

Another likely violation is that the variance of the input is too large or too small. Theorem 2 and Theorem 3 prove that SELU does not misbehave for inputs with either very high or very low variance. It simply pushes the variance (and the mean) back into a controlled range.
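This behavior can be observed empirically. The sketch below is an illustration under my own assumptions, not the exact setting of the theorems: a stack of random fully-connected layers whose weights are drawn with mean 0 and variance `1 / width`, fed with Gaussian inputs whose variance is deliberately too high or too low. The function name `layer_variances` and all sizes are hypothetical choices.

```python
import numpy as np

# Commonly quoted values of alpha_01 and lambda_01, to more digits.
ALPHA_01 = 1.6732632423543772
LAMBDA_01 = 1.0507009873554805


def selu(x, alpha=ALPHA_01, lam=LAMBDA_01):
    """Scaled exponential linear unit, applied element-wise."""
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0))
    return lam * np.where(x > 0, x, negative_branch)


def layer_variances(input_std, depth=32, width=512, batch=256, seed=0):
    """Track the activation variance through a random SELU network.

    Returns a list with the variance of the input followed by the
    variance of the activations after each of the `depth` layers.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, input_std, size=(batch, width))
    variances = [float(x.var())]
    for _ in range(depth):
        # Zero-mean weights with variance 1/width, so that the sum of
        # squared weights per unit (tau) is close to 1.
        w = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        x = selu(x @ w)
        variances.append(float(x.var()))
    return variances


if __name__ == "__main__":
    for std in (3.0, 0.2):
        v = layer_variances(std)
        print(f"input variance {v[0]:.3f} -> after 32 layers {v[-1]:.3f}")
```

Starting from an input variance of about 9 or about 0.04, the variance in the last layer should end up near 1 in both cases, shrinking in the first run and growing in the second.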

So basically their idea is very straightforward: they propose a self-normalizing activation function whose output distribution (mean and variance) is proved to converge to a fixed point.

Some details like the assumption of the distribution of the activation, initialization method, and corresponding dropout technique are omitted in this post. If you are interested in them, please refer to the original paper. Its writing is quite fluent and comprehensible, at least for the first 9 pages.

[DR008]: Most engineers are intimidated by the lengthy proofs, which reminds me of the advent of WGAN and LSGAN. This work’s purpose is to stabilize the distribution of output in the neural network so that it can be stacked and properly trained.

However, it is unproven that the distribution $\mu=0, \nu=1$ works best. We usually believe that a neural network can learn an effective representation of the input to solve a task. SELU undermines that belief, because it forces the feature space to keep a certain pattern. If we can find a compromise between the stability of training and the autonomy of the feature pattern, maybe the neural network can perform better.