The motivation of this paper is to apply a sub-linear time hashing method to find out the most activated neurons, which are the only neurons selected to be activated and updated. This selective training method can reduce the computation (multiplication) with negligible hashing time. Since I am not familiar with hashing, I will skip the details of the hashing methods.
It is well-known that the neuron network, especially the one with deep layers, consumes a large quantity of computation. Recent research found that by selectively training (forward and backward) the most activated neurons the performance can remain the same or even be improved. However, those methods (e.g. Adaptive Dropout, Winner Take-All) are inefficient when selecting those speical neurons, and hence introduce extra computation loss.
In this paper, they proposed a method that capitalized a randomized sub-linear hashing algorithm to perform adaptive dropout. They denoted the AS (active set) to be the top k% nodes having significant activations (k=5 by default). For each layer, (1) they first built a hashing table based on the weights. (2) Then the activation from the previous layer was used as a query to get the AS of the next layer (without calculation). (3) Only the neurons in AS will be calculated.(4) Duiring back propagation, only neurons in AS will be updated.
The hashing part of this paper can be condensed into:
, where K and L was the hyper-parameter of the hashing algorithm. To sum up, the more activated a neuron was, the more possible it got chosen.
The overall framework can be described as :
where I believe there is a typo: should be since all the input neurons should be considered.
Although this paper was well motivated, it lacked convincing experiments to support its claims.
The author stated in the introduction that deep networks like ResNet or Inception requires massive computation on complex tasks. And the proposed methods are natually expectedly to reduce computation on complex tasks like CIFAR or ImageNet. However, these 4 tested datasets are somehow simple. It will be more convincing to (1) test on more complex datasets and (2) to show that this hashing can be applied to any or most types of network structure.
The first experiment showed that their proposed method (LSH) outperformed standard dropout and the performance decline compared with standard network is insignificant. My question is that the standard/vanilla dropout was proposed to solve the overfitting problem. And it is well-known that (1) training a network with dropout requires more computation time but there should be a obvious boosting of performance compared to the standard NN, which we cannot see in the experiments, and (2) best dropout ratios are bettdifferent in different layers, which are set to the same as comparison.
The second one showed that the proposed method also performed the same as other advanced dropout methods. Other methods required full computation while theirs used hashing to alleviate the cost of selecting neurons.
The following experiments showed that their method can be accelerated with parallel computation. In brief, they achieved x31 speed up using Asynchronous stochastic gradient descent (ASGD).
My comments [DR002]: I think the motivation is well expressed and the combination between hashing and dropout is just seamless. However, the paper, from my perspective, dispersed over several topcis (hashing, neural network, parallel computation). And the experiments are not convincing enough. Most importantly, it claimed that the computation was reduced to only 5% of the original one, but this complicated dropout disabled neuron network to be trained continuously. Reducing computation is a good thing, but it backfired when the training is unable to be trained on GPU.
Our team surmise that chosing the most activated one might not be the optimal policy. Maybe the chosing procedure can be learnt unsupervisedly or with reinforcement learning. However, it is hard to design a low-cost but effcacious neuron-chooser.