Papers about Binarized Neural Networks

Deep Neural Network recently dominates the field of computer vision. However, one of its critical disadvantages is that the trained network can contain numerous parameters and hence, it requires much more computation time to forward an image than traditional feature extractors. This is a barrier between deep neural network and devices with small memory and limited computation ability, such as mobile devices or embedded machines.

Binarized Neural Network is a possible method to alleviate this problem. By binarized the weights, activations and the even the inputs, the total volume of a network can be squeeze by a large factor (from float32 to 1-bit). Besides, we can further take the advantage of bit computation (e.g. xnor) to accelerate the computation.

Today I am going to share with you the following papers related to binarized neural network:

Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
~~Embedded Binarized Neural Networks~~

I believe there still some works on this topic that I fail to include. Please comment below if you have some suggestions.

1> Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1

This paper is very engineering. Many details are missing and they just put almost everything in the pseudo-code. But it is earliest and feasible work on the binarized neural network that I can find. The experiment is pretty amazing because they can achieve almost the same performance on MNIST, SVHN, and CIFAR10 as network with real weights. And they implemented a binary computation kernel on GPU, which is about 23 times faster than the baseline kernel, 3.4 times faster than cuBLAS and 7 times faster with the XNOR kernel than with the baseline kernel, without suffering any loss in classication accuracy. It was later extended to a journal version, which is next paper I am going to introduce.

2> Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activation

Performing well on small datasets like MNIST, SVHN and CIFAR10 are still far away from recognizing an object in the real world. They wanted to conquer ImageNet as well. In this work, the limitation of “binarized” is not that strict. They changed it with work “quantized”, which means weights might have one or more than one bits. BNN is a very extreme case of QNN. In the abstract, they claimed that on ImageNet, the quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy.

There are two ways to binarize a real variable, one is deterministic:

Another one is stochastic:

However, generating random number on a hardware is more time-consuming. Therefore the stochastic method is only used when binarizing the activation at train-time for some special cases.

To calculate the gradient, they stated that the update magnitude of SGD is very small at echo step. The need to accumulate small gradients and to average out the noisy requires the gradient to have high-resolution rather than just binary.

Even we allow the gradients to be real numbers, you will find out the gradient for those weights are always 0 because of the $sign(\cdot)$ function. To train discrete neurons, the “straight-through estimator” method has been introduced in Hinton’s lectures. They followed a similar approach and let the gradient to be

$\frac{\partial C}{\partial r} = \frac{\partial C}{\partial q}1_{|r|\leq 1},$

where $q=Sign(r)$ is the activation and $C$ is the loss function. If you want to have a better on this method, please refer to this discussion. To limit the growing of real-value weights, they clipped the weights into [-1, 1] during training. And they used the binarized weights to forward.

Another big issue is that the BatchNormalization. BN is proved to be very helpful during training. But BN requires much computation, especially with many expensive multiplications. They invented a shift-based BN to replace vanilla BN without loss in accuracy.

By applying the similar idea, a shift-based AdaMax was introduced to replace traditional ADAM

The final algorithm looks like:

During testing time, the procedure is:

Other details about how to quantize neural network in more than 1-bit were also introduced. I will leave these parts for you.

On the ImageNet challenge, their QNN or BNN are very competitive.

If the GoogleNet framework was applied, their network can perform much better:

Extensive experiments were done, including both CNN and RNN.As stated in the previous paper, they could achieve a 7 times acceleration with XNOR-GPU kernel.

My deepest impression about this paper is the engineering tricks and comparison it made. Building such neural network requires more engineering skills than theoretical deductions.

3> XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

The related work research in this paper is quite comprehensive. Maybe I will extend this post and include more papers by recursively reading the references.

This work can be summarized in one picture:

Before diving into the network, we first look at some definition:

Two variations of binary CNN are proposed: Binary-Weight-Network. where elements in $\mathcal{W}$ are binary values; XNOR-Network, where elements in both $\mathcal{W}$ and $\mathcal{I}$ are binary values.

First, let’s take a look of Binary-Weight-Network. If we replace filter $\textbf{W}$ with a binary filter $\textbf{B} \in \{ +1, -1\}^{c\times w \times h}$ and a scaling factor $\alpha \in \mathbb{R}^+$ such that $\textbf{W} \approx \alpha \textbf{B}$ .Then the convolution operation is:

$I \ast \textbf{W} \approx \alpha (\textbf{I} \oplus \textbf{B})$

where $\oplus$ indicates a binary convolution. To find the best estimation for $\alpha^*$ and $\textbf{B}^*$ from a given $\textbf{W}$ . We have to solve the following optimization problem:

$\begin{align*} J(\textbf{B}, &\alpha) = \| \textbf{W} - \alpha \textbf{B} \|^2 \\ \alpha^*, \textbf{B}^* &= \arg\min_{\alpha,\textbf{B}} J(\textbf{B}, \alpha) \end{align*}$

By expanding this formula into

$J(\textbf{B}, \alpha) = \alpha^2\textbf{B}^T\textbf{B} - 2 \alpha\textbf{W}^T\textbf{B}+\textbf{W}^T\textbf{W}$

we can find that $\textbf{B}^T\textbf{B}$ and $\textbf{W}^T\textbf{W}$ are both constants. Therefore, only considering $\textbf{B}$ we have to maximize $\textbf{W}^T\textbf{B}$ . We can simply set $\textbf{B}^* = \text{sign}(\textbf{W})$ . Also, by setting the derivative of $J$ w.r.t. $\alpha$ to 0, we can solve $\alpha^*=\frac{1}{n}\|\textbf{W}\|_{L1}$ , where $n$ is the dimension of $\textbf{B}$ . The gradients back-propagate in the same way with last paper. The pseudo-code is here:

Next, let’s us move to XNOR-Networks, where both the input and weights are binarized. First, we consider the dot-production layer or densely-connected layer. We denote $\textbf{H}$ as the binarized variable of input $\textbf{I}$ and $\beta$ is the scaling factor. Similar to the last paper, we want $\textbf{X}^T\textbf{W}\approx\beta\textbf{H}^T\alpha\textbf{B}$ and we can find the optimal approximation by optimizing:

$\alpha^*, \textbf{B}^*,\beta^*,\textbf{H}^* &= \arg\min_{\alpha,\textbf{B},\beta,\textbf{H}} \|\textbf{X} \odot\textbf{H} - \beta\alpha\textbf{H}\odot\textbf{B}\|$

It can be written as: $\gamma^*, \textbf{C}^* &= \arg\min_{\gamma,\textbf{C}} \| \textbf{Y}-\gamma\textbf{C} \|$ , where $\textbf{Y} = \textbf{X} \odot \textbf{W}$ , $\textbf{C} = \textbf{H} \odot \textbf{B}$ and $\gamma = \beta\alpha$ . The solution is $\textbf{C}^* = \text{sign}(\textbf{X})\odot \text{sign}(\textbf{W})$ and $\gamma^* = \beta^*\alpha^*$ .

As for the binary convolution layer, the calculation method is a little bit tricky. The multiplication of each patch and filter is the same as in the densely-connected layer. However, when calculating the scaling coefficients, we have to calculate the L1 norm for each patch, which is very duplicated because each pixel will be involved in multiple patches. Therefore, we can first calculate the L1 norm for all pixels and use a kernel to average them. Detailed information can be found here.

We can use one figure to illustrate the procedure of binarization:

Another thing that is different from the ordinary network is the arrangement inside a single convolutional block:

The initiative of this order is quite vague in the paper:

Applying pooling on binary input results in significant loss of information. For example, max-pooling on binary input returns a tensor that most of its elements are equal to
+1. Therefore, we put the pooling layer after the convolution. To further decrease the information loss due to binarization, we normalize the input before binarization. This ensures the data to hold zero mean, therefore, thresholding at zero leads to less quantization error.

I still don’t know why putting pooling immediately after convolution can decrease the information loss. A slide of this paper shows a different version of arrangement:

, which is very confusing.

In the comparison

the proposed method is much better than other methods and very close to the full-precision network.

The method also works on large networks:

Besides, an acceleration of 58x speedup can be achieved in one convolution, which is pretty promising when we want to transfer deep NN to small devices working in real-time scenarios.

~~4> Embedded Binarized Neural Networks~~

~~A summary from the engineering perspective.~~

[]:

BNN”s original gradient is 0, gradients in hard attention model is 0. why not use RL to train a BNN?
Quantize from a real-value network directly rather than define-and-retrain a binary network. Hashing, Hessian, Maxtrix factorization and Compressed Coding have been employed to tackle the storage or computation limitation. Maybe we can learn to compress a network rather than compress with hand-craft criterion.
The BN layers and Pooling layers in the binary network should have a different design. The initiative of BN is to maintain the distribution of output space so that the already learned mapping in the next layer is still reasonable. BN is based on the assumption that the output should be in a normal distribution (mean equals 0 and variance equals 1). However, in the binary space, whether the normal distribution is optimal is still unproven.