Binarized Neural Networks: History
Deep neural networks (DNNs) are becoming more powerful. However, as DNN models become larger they require more storage and computational power. Edge devices in IoT systems, small mobile devices, and other power- and resource-constrained platforms have limits that restrict the use of cutting-edge DNNs. Various solutions have been proposed to address this problem. Binarized Neural Networks (BNNs) are one solution that aims to reduce the memory and computational requirements of DNNs while still offering capabilities similar to those of full-precision DNN models.
There are various types of networks that use binary values. In this paper we focus on networks based on the BNN methodology first proposed by Courbariaux et al. in [1], where both weights and activations take only binary values, and these binary values are used during both inference and backpropagation training. From this original idea, various works have explored how to improve their accuracy and how to implement them on low-power and resource-constrained platforms. In this paper we give an overview of how BNNs work and review extensions to the original method.
In this paper we explain the basics of BNNs and review recent developments in this growing area. Most work in this area has focused on the advantages gained at inference time. Unless otherwise stated, when the advantages of BNNs are mentioned in this work, we assume these advantages apply mainly to inference. However, we will look at the advantages of BNNs during training as well. Since BNNs have received substantial attention from the digital design community, we also focus on various implementations of BNNs on FPGAs.
 

Terminology

Before diving into the details of BNNs and how they work, we want to clarify some of the terminology that will be used throughout this review. Some terms are used interchangeably in the literature and can be ambiguous.
Weights: Learned values that are used in a dot product with activation values from previous layers. In BNNs, there are real valued weights which are learned and binary versions of those weights which are used in the dot product with binary activations.
Activations: The outputs from an activation function that are used in a dot product with the weights from the next layer. Sometimes the term “input” is used instead of activation. We use the term “input” to refer to input to the network itself and not just the inputs to an individual layer. In BNNs, the output of the activation function is a binary value and the activation function is the sign function.
Dot product: The multiply accumulate operation that occurs in the “neurons” of a neural network. The term “multiply accumulate” is used at times in the literature, but we use the term dot product instead (see the sketch following this list for how these terms fit together).
Parameters: All values that are learned by the network through backpropagation. This includes weights, biases, gains and other values.
Bias: An additive scalar value that is usually learned. Found in batch normalization layers and specific BNN techniques that will be discussed later.
Gain: A scaling factor that is usually learned (but sometimes extracted from statistics, see Section 5.2). Similar to bias. A gain is applied after a dot product between weights and activations. The term scaling factor is used at times in the literature, but we use gain here to emphasize its correlation with bias.
Topology: The specific arrangement of layers in a network. The term “architecture” is used frequently in the DNN community. However, the digital design and FPGA community also use the term architecture to refer to the arrangement of hardware components. For this reason we use topology to refer to the layout of the DNN model.
Architecture: The connection and layout of digital hardware. Not to be confused with the topology of the DNN models themselves.
Fully Connected Layer: As a clarification, we use the term fully connected layer instead of dense layer, which appears in some of the literature reviewed in this paper.
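As a concrete illustration of these terms, the following minimal sketch (in NumPy, with arbitrary sizes and values chosen only for illustration) shows how a single BNN neuron combines them: real-valued learned weights are binarized with the sign function, a dot product is taken with binary activations from the previous layer, and a gain and bias are applied before the next sign activation.

```python
import numpy as np

def sign(x):
    # Binarize to {-1, +1}; zero is mapped to +1 by convention.
    return np.where(x >= 0, 1.0, -1.0)

# Hypothetical real-valued (learned) weights and incoming binary activations.
real_weights = np.random.randn(256)               # learned, kept in full precision
binary_weights = sign(real_weights)               # binary version used in the dot product
binary_activations = sign(np.random.randn(256))   # outputs of the previous layer's sign activation

gain = 0.8   # learned (or statistics-derived) scaling factor
bias = 0.1   # learned additive term (e.g., from batch normalization)

pre_activation = gain * np.dot(binary_weights, binary_activations) + bias
next_activation = sign(pre_activation)            # binary activation passed to the next layer
```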

Background

Various methods have been proposed to make DNNs smaller and faster without sacrificing too much accuracy. Howard et al. proposed depthwise separable convolutions as a way to reduce the total number of weights in a convolutional layer [2]. Other low-rank and weight-sharing methods have also been explored [3,4]. These methods do not reduce the data width of the network, but instead use fewer parameters for convolutional layers while maintaining the same number of channels and kernel size.
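For example (with illustrative numbers only), a standard convolutional layer with 3 × 3 kernels, 128 input channels, and 128 output channels uses 3 · 3 · 128 · 128 = 147,456 weights, whereas a depthwise separable replacement uses 3 · 3 · 128 = 1152 depthwise weights plus 1 · 1 · 128 · 128 = 16,384 pointwise weights, roughly an 8.4× reduction in this case.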
SqueezeNet is an example of a network topology that was designed specifically to reduce the number of parameters used [5]. SqueezeNet requires fewer parameters by using more 1×1 kernels for convolutional layers in place of some 3×3 kernels. They also reduce the number of channels in the convolutional layers to reduce the number of parameters even further.
Most DNN models are overparameterized, and network pruning can help reduce size and computation [6,7,8]. Neurons that do not contribute much to the network can be identified and removed. This leads to sparse matrices and potentially smaller networks with fewer calculations.

Network Quantization Techniques

Rather than reducing the total number of parameters and activations to be processed in a DNN, quantization reduces the bit width of the values used. Traditionally, 32-bit floating point values have been used in deep learning. Quantization techniques use data types smaller than 32 bits and tend to focus on fixed point calculations rather than floating point. Using smaller data types can offer a reduction in total model size. In theory, arithmetic with smaller data types can be quicker to compute, and fixed point operations can be more efficient than floating point. Gupta et al. show that reducing data type precision in a DNN offers reduced model size with limited reduction in accuracy [9].
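As a simple illustration of the idea (a sketch only, not a reproduction of any specific scheme in [9,10]), the following quantizes a tensor of 32-bit floats to a signed fixed-point grid with a chosen number of integer and fractional bits.

```python
import numpy as np

def to_fixed_point(x, integer_bits=3, fractional_bits=4):
    """Quantize floats to a signed fixed-point grid with the given bit budget.

    Illustrative only; real quantization schemes differ in rounding, clipping,
    and how the format is chosen per layer.
    """
    scale = 2.0 ** fractional_bits
    # Largest and smallest representable values for the signed format.
    max_val = 2.0 ** integer_bits - 1.0 / scale
    min_val = -2.0 ** integer_bits
    quantized = np.round(x * scale) / scale       # snap to the fixed-point grid
    return np.clip(quantized, min_val, max_val)   # saturate out-of-range values

weights = np.random.randn(4, 4).astype(np.float32)   # made-up example tensor
print(to_fixed_point(weights, integer_bits=2, fractional_bits=5))
```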
We note, however, that 32-bit floating point arithmetic operations have been highly optimized in GPUs and most CPUs. Performing fixed point operations on hardware with highly optimized floating point units may not achieve the kinds of execution speed advantages that oversimplified speedup calculations might suggest.
Courbariaux et al. compare accuracies of trained DNNs using various sizes of fixed and floating point values for weights and activations [10]. They even examine the effect of a hybrid dynamic fixed point data type and show how comparable accuracy can be obtained with sub 32-bit precision.
Using quantized values for gradients has also been explored in an effort to reduce training time. Zhou et al. experiment with several low bit widths for gradients [11]. They test various combinations of low bit-widths for activations, gradients and weights. They observe that using higher precision is more useful in gradients than in activations, and using higher precision in activations is more useful than in weights.
 

Early Binarization

The most extreme form of network quantization is binarization. Binarization is a 1-bit quantization where data can take only two possible values. Generally −1 and +1 have been used for these two values (or −γ and +γ when scaling is considered, see Section 6.1). We point out that quantized networks that use the values −1, 0, and +1 are not binary, but ternary, a confusion in some of the literature [12,13,14,15]. Ternary networks exhibit a high level of compression and simple arithmetic, but do not benefit from the single-bit simplicity of BNNs since they require 2 bits of precision.
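To make the distinction concrete, the following sketch (with an arbitrarily chosen threshold) contrasts a binary quantizer, which maps every value to −1 or +1 with the sign function, with a ternary quantizer, which additionally maps small values to 0 and therefore needs two bits per value.

```python
import numpy as np

def binarize(x):
    # Two possible values: -1 or +1 (1 bit per value).
    return np.where(x >= 0, 1.0, -1.0)

def ternarize(x, threshold=0.05):
    # Three possible values: -1, 0, or +1 (requires 2 bits per value).
    return np.where(np.abs(x) < threshold, 0.0, np.sign(x))

w = np.array([-0.7, -0.02, 0.01, 0.4])
print(binarize(w))   # [-1. -1.  1.  1.]
print(ternarize(w))  # [-1.  0.  0.  1.]
```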
The idea of using binary weights predates the current boom in deep learning [16]. Early networks with binary values contained only a single hidden layer [16,17]. These early works point out that backpropagation (BP) and stochastic gradient descent (SGD) cannot be directly applied to such networks since the weights cannot be updated in small increments. As an alternative, early works with binary values used variations of Bayesian inference. More recently, Soudry et al. [18] applied a similar method, Expectation Backpropagation, to train deep networks with binary values.
Courbariaux et al. claim to be the first to train a DNN from start to finish using binary weights and BP with their BinaryConnect method [19]. They use real-valued weights which are binarized before being used by the network. During backpropagation, the gradient is applied to the real-valued weights using the Straight-Through Estimator (STE), which is explained in Section 4.1.
While binary values are used for the weights, Courbariaux et al. retain full precision activations in BinaryConnect. This eliminates the need for full precision multiplications, but still requires full precision accumulations. BinaryConnect is named in reference to DropConnect [20], but connections are binarized instead of being dropped.
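The following is a minimal sketch of this scheme under our reading of [19] (layer size, learning rate, and the single-layer setup are illustrative only): the forward pass binarizes the stored real-valued weights, while the gradient computed with respect to the binary weights is applied directly to the real-valued weights via the straight-through estimator, which are then clipped to [−1, 1] as in [19].

```python
import numpy as np

def sign(x):
    # Binarize to {-1, +1}; zero is mapped to +1 by convention.
    return np.where(x >= 0, 1.0, -1.0)

# Stored real-valued weights for one fully connected layer; the optimizer updates these.
W_real = np.random.randn(16, 8) * 0.1
lr = 0.01

def train_step(x, grad_y):
    """One illustrative BinaryConnect-style step for a single layer.

    x: input activations, shape (16,); grad_y: gradient of the loss w.r.t. the
    layer output, shape (8,). Both would normally come from the rest of the network.
    """
    # Forward pass: binarize the stored weights before using them.
    W_bin = sign(W_real)
    y = x @ W_bin  # activations stay real-valued in BinaryConnect

    # Backward pass: the gradient w.r.t. the binary weights ...
    grad_W = np.outer(x, grad_y)
    # ... is applied to the real-valued weights (straight-through estimator),
    # which are clipped to [-1, 1] after the update.
    W_real[:] = np.clip(W_real - lr * grad_W, -1.0, 1.0)
    return y
```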
These early works in binary neural networks are certainly binary in a general sense. However, this paper defines BNNs as networks that use binary values for both weights and activations, allowing for bitwise operations instead of multiply-accumulate operations. Soudry et al. were among the first research groups to focus on DNNs with binary weights and activations, using Bayesian learning to get around the problems of learning with binary values [18]. However, Courbariaux et al. are able to use binary weights and activations during training with backpropagation techniques and take advantage of bitwise operations [1,21]. Their BNN method is the basis for most binary networks that have come since (with some notable exceptions in [22,23]).
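The bitwise replacement for the dot product can be illustrated as follows: if −1 is encoded as bit 0 and +1 as bit 1, an element-wise XNOR followed by a population count gives the number of matching positions, from which the ±1 dot product is recovered as 2 · popcount − n. The sketch below (illustrative only, not a packed-bit implementation) simply checks this equivalence on random vectors.

```python
import numpy as np

n = 64
a = np.random.choice([-1, 1], size=n)   # binary activations
w = np.random.choice([-1, 1], size=n)   # binary weights

# Encode -1 -> 0 and +1 -> 1.
a_bits = (a > 0).astype(np.uint8)
w_bits = (w > 0).astype(np.uint8)

# XNOR marks positions where the signs agree; popcount sums them.
xnor = np.logical_not(np.logical_xor(a_bits, w_bits))
popcount = int(np.sum(xnor))

# Each agreement contributes +1 and each disagreement -1 to the +-1 dot product.
assert 2 * popcount - n == int(np.dot(a, w))
```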

References

  1. Courbariaux, M.; Bengio, Y. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv 2016, arXiv:1602.02830. 
  2. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. 
  3. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv 2014, arXiv:1405.3866. 
  4. Chen, Y.; Wang, N.; Zhang, Z. DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. 
  5. Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <1 MB model size. arXiv 2016, arXiv:1602.07360. 
  6. Hanson, S.J.; Pratt, L. Comparing Biases for Minimal Network Construction with Back-propagation. In Advances in Neural Information Processing Systems 1; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1989; pp. 177–185. 
  7. Cun, Y.L.; Denker, J.S.; Solla, S.A. Optimal Brain Damage. In Advances in Neural Information Processing Systems 2; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1990; pp. 598–605. 
  8. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. arXiv 2015, arXiv:1510.00149.
  9. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep Learning with Limited Numerical Precision. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. 
  10. Courbariaux, M.; Bengio, Y.; David, J.P. Training deep neural networks with low precision multiplications. arXiv 2014, arXiv:1412.7024.
  11. Zhou, S.; Ni, Z.; Zhou, X.; Wen, H.; Wu, Y.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160. 
  12. Seo, J.; Yu, J.; Lee, J.; Choi, K. A new approach to binarizing neural networks. In Proceedings of the 2016 International SoC Design Conference (ISOCC), Jeju, Korea, 23–26 October 2016; pp. 77–78. 
  13. Yonekawa, H.; Sato, S.; Nakahara, H. A Ternary Weight Binary Input Convolutional Neural Network: Realization on the Embedded Processor. In Proceedings of the 2018 IEEE 48th International Symposium on Multiple-Valued Logic (ISMVL), Linz, Austria, 16–18 May 2018; pp. 174–179.
  14. Hwang, K.; Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems (SiPS), Belfast, UK, 20–22 October 2014; pp. 1–6.
  15. Prost-Boucle, A.; Bourge, A.; Pétrot, F. High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression. ACM Trans. Reconfigurable Technol. Syst. 2018, 11, 1–24.
  16. Saad, D.; Marom, E. Training feed forward nets with binary weights via a modified CHIR algorithm. Complex Syst. 1990, 4, 573–586.
  17. Baldassi, C.; Braunstein, A.; Brunel, N.; Zecchina, R. Efficient supervised learning in networks with binary synapses. Proc. Natl. Acad. Sci. USA 2007, 104, 11079–11084.
  18. Soudry, D.; Hubara, I.; Meir, R. Expectation Backpropagation: Parameter-free training of multilayer neural networks with real and discrete weights. Adv. Neural Inf. Process. Syst. 2014, 2, 963–971.
  19. Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 7–10 December 2015; pp. 3123–3131. 
  20. Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; Fergus, R. Regularization of Neural Networks using DropConnect. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013.
  21. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–8 December 2016; pp. 1–9.
  22. Ding, R.; Liu, Z.; Shi, R.; Marculescu, D.; Blanton, R.S. LightNN. In Proceedings of the Great Lakes Symposium on VLSI 2017 (GLSVLSI), Banff, AB, Canada, 10–12 September 2017; ACM Press: New York, NY, USA, 2017; pp. 35–40.
  23. Ding, R.; Liu, Z.; Blanton, R.D.S.; Marculescu, D. Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs. ACM Trans. Reconfigurable Technol. Syst. 2018, 11, 1–24.