Activation-Based Pruning of Neural Networks

A novel technique called activation-based pruning is presented for effectively pruning fully connected feedforward neural networks for multi-object classification. The technique is based on the number of times each neuron is activated during model training. Further analysis demonstrates that activation-based pruning can be considered a dimensionality reduction technique, as it leads to a sparse low-rank matrix approximation for each hidden layer of the neural network. The rank-reduced neural network generated using activation-based pruning has better accuracy than a rank-reduced network obtained using principal component analysis. Empirical results show that, after each successive pruning, the reduction in the magnitude of the singular values of each matrix representing a hidden layer is equivalent to introducing the sum of the singular values of the hidden layers as a regularization parameter in the objective function.

  • machine learning
  • network pruning
  • dimensionality reduction
  • computer vision

1. Introduction

Deep neural networks are used to solve real-world problems in various domains such as image classification, text classification, and speech recognition. These networks often require millions of parameters and billions of floating-point operations to make accurate predictions. Network pruning has emerged as an important technique for improving the efficiency of deep neural networks by removing redundant structures. Pruning reduces the number of parameters of a neural network, resulting in a reduction of the computational resources required to run the network. Some of the most popular pruning methods are magnitude-based pruning, structured pruning, pruning based on the lottery ticket hypothesis, and dynamic pruning. Of these methods, magnitude-based pruning has proven successful at producing compact models and has seen widespread adoption. However, prior work on magnitude-based pruning has certain deficiencies. Magnitude-based pruning does not inherently induce a low-rank structure in the hidden layers of neural networks; more rigorous constraints are needed to drive rank reduction, and integrating it with structured pruning or low-rank regularizers is likely necessary to fully exploit its compression potential.

2. Methodology

Activation-based pruning is based on the number of times each neuron is activated during the forward pass of the training phase of a neural network. In the forward pass, data are processed at each layer and passed on to the next layer until they reach the output layer. In a fully connected network, each neuron is connected by incoming weights to all the neurons of the previous layer. The output of a neuron is calculated as σ(WᵀX), where W is the vector of incoming weights, X is the vector of processed data from the previous layer, and σ is the non-linear activation function. A neuron is said to be activated when its output is nonzero, i.e., σ(WᵀX) ≠ 0. The number of times a neuron is activated is stored in a parameter called the activation counter; each neuron has its own activation counter. During training, a random set of data was chosen as the input to the network before each pruning cycle, which resulted in each neuron either being activated (σ(WᵀX) ≠ 0) or not activated (σ(WᵀX) = 0). Every time a neuron is activated, its activation counter is incremented by 1. The activation counter values of all neurons are then used to decide which neurons to prune. After every pruning cycle, the network suffers some loss in training accuracy; hence, the network was retrained to a predetermined training accuracy before the next pruning cycle was initiated. The random set of data used to compute the activation counters can belong to the training set or can be set aside before the start of the training phase. The data can be labeled or unlabeled, provided they come from the same distribution used to train the neural network. An empirical comparison was carried out between two variants of activation-based pruning. The first was global activation-based pruning, where at each pruning stage neurons from all hidden layers of the network are considered when deciding which neurons to prune. The second was local activation-based pruning, where at each pruning stage only one hidden layer is pruned.
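The following minimal sketch illustrates the idea of activation counters and counter-based pruning for a single hidden layer. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the helper names (count_activations, prune_layer), the ReLU activation, the omission of bias terms, the pruning fraction, and the choice to prune the least-activated neurons are all assumptions.

```python
# Minimal sketch of local activation-based pruning for one hidden layer (NumPy).
# Assumed names and hyperparameters; not the authors' reference code.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def count_activations(W, X):
    """Count how often each neuron is activated (output != 0) over a batch.

    W: (n_in, n_out) weight matrix of the layer.
    X: (n_samples, n_in) random subset drawn from the training distribution.
    Returns one integer activation counter per neuron, shape (n_out,).
    """
    A = relu(X @ W)                          # forward pass through the layer
    return np.count_nonzero(A > 0, axis=0)

def prune_layer(W, counters, prune_fraction=0.1):
    """Zero the incoming weights of the least-activated neurons."""
    n_out = W.shape[1]
    k = int(prune_fraction * n_out)
    least_active = np.argsort(counters)[:k]  # neurons activated least often
    W_pruned = W.copy()
    W_pruned[:, least_active] = 0.0          # remove their incoming weights
    return W_pruned

# One pruning cycle on a random stand-in layer (784 inputs, 256 neurons).
rng = np.random.default_rng(0)
W = rng.normal(size=(784, 256))
X = rng.normal(size=(512, 784))              # random subset used for counting
counters = count_activations(W, X)
W = prune_layer(W, counters, prune_fraction=0.1)
```

In the actual procedure, a cycle like this would be followed by retraining until the predetermined training accuracy is recovered, before the next cycle begins.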

3. Results and Discussion

While activation-based pruning prolonged the network training duration, it consistently led to better validation and test accuracy in almost all scenarios. Pruning resulted in a loss of accuracy, which was recovered by retraining the network. The pruned model can be used for prediction in place of the original model. A limitation of global activation-based pruning is its memory- and time-intensive nature, which may lead to poor scalability for larger networks. Local activation-based pruning delivered equivalent accuracy and pruning percentages to global activation-based pruning, while significantly enhancing processing time efficiency and reducing memory consumption. 

The empirical findings led to the hypothesis that activation-based pruning selectively targets weights that contribute minimally to the network's training process. This pruning strategy resulted in the formation of sparse, low-rank matrix approximations, effectively reducing the full-rank matrices of a trained network. When such pruned networks underwent retraining, they tended to preserve the low-rank characteristics of these matrices, especially when previously pruned (zeroed) weights were allowed to be reutilized. In contrast, the observations with magnitude-based pruning indicated a different behavior: during the retraining phase, this method tended to engage nearly the entire spectrum of weights within the network. This distinction highlights the unique impact of activation-based pruning on the network's weight optimization during retraining.

3.1 Comparative Analysis

| Paper | Compression Rate | Sparsity (%) | Remaining Weights (%) | Required FLOPS (%) |
|---|---|---|---|---|
| Activation-based | 94× | 98.94 | 1.06 | 1.06 |
| Han et al. [1] | 12× | 92 | 8 | 8 |
| Han et al. [2] | 40× | 92 | 8 | 8 |
| Guo et al. [3] | 56× | 98.2 | 1.8 | 1.8 |
| Frankle and Carbin [4] | 7× | 86.5 | 13.5 | 13.5 |
| Molchanov et al. [5] | 7× | 86.03 | 13.97 | 13.97 |

Han et al. [1] presented an unstructured, iterative pruning method that uses magnitude-based pruning to achieve a 12-fold reduction in model size. Han et al. [2] employed an iterative pruning technique, where pruning occurred after training, and achieved a 40-fold reduction in model size. Guo et al. [3] presented an iterative pruning method where pruning occurred before training and achieved a 56-fold reduction in model size. Frankle and Carbin [4] used the lottery ticket hypothesis in the context of pruning small fully connected networks on MNIST. It is a structured pruning technique with importance scoring based on the weight magnitude, and it achieved a seven-fold reduction in model size. Molchanov et al. [5] presented an unstructured pruning method, where pruning occurred during training; it achieved a seven-fold compression rate. Activation-based pruning is an unstructured, iterative pruning method. The results demonstrated that activation-based pruning achieved an impressive 94-fold reduction in model size.
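As a rough sanity check on the table, the compression rate for the pruning-only rows is consistent with the inverse of the remaining-weight fraction; this assumes that is how the rates were computed, which the entry does not state explicitly. (The 40× figure of Han et al. [2] also includes trained quantization and Huffman coding, so it is not reproduced by this formula.)

```python
# Assumed consistency check: compression rate ≈ 100 / remaining-weight %.
remaining_pct = {
    "Activation-based":     1.06,   # -> ~94x
    "Han et al. [1]":       8.0,    # -> ~12x
    "Guo et al. [3]":       1.8,    # -> ~56x
    "Molchanov et al. [5]": 13.97,  # -> ~7x
}
for name, pct in remaining_pct.items():
    print(f"{name}: ~{100.0 / pct:.1f}x compression")
```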

3.2 Principal Component Analysis of Hidden Layers

Activation-based pruning generates a low-rank approximation of each matrix representing the hidden layers of the neural network. The hypothesis is that activation-based pruning offers a better rank-k approximation of the network layers while preserving significant levels of training, validation, and test accuracy. To substantiate this hypothesis, the ranks of each pruned model were ascertained, and PCA was applied to a fully trained model to create rank-reduced models. The test accuracy of the PCA-generated rank-reduced models was then compared with that of the models derived from activation-based pruning. Models subjected to activation-based pruning demonstrated better test accuracy than those whose hidden layers were rank-reduced through PCA.
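The sketch below shows one way such a rank-reduced baseline can be built, using a truncated SVD of a layer's weight matrix as a stand-in for the PCA-based reduction (PCA additionally involves mean-centering, which is omitted here). The matrix sizes and the target rank k are illustrative assumptions; in the actual comparison, k would be taken from the corresponding activation-pruned layer.

```python
# Truncated-SVD stand-in for PCA-style rank-k reduction of a trained layer.
import numpy as np

def rank_k_approximation(W, k):
    """Best rank-k approximation of W in the least-squares sense."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W_trained = rng.normal(size=(784, 256))   # stand-in for a trained layer
k = 40                                    # rank observed after pruning (assumed)
W_reduced = rank_k_approximation(W_trained, k)
print(np.linalg.matrix_rank(W_reduced))   # -> 40
```

Each hidden layer of the fully trained model would be replaced by its rank-k approximation, and the resulting model's test accuracy compared against the activation-pruned model of the same rank.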

3.3 Analysis of Singular Value Changes in Each Layer

Every hidden layer of a feedforward neural network can be represented in the form of a matrix. The singular values of a matrix can be computed using singular-value decomposition (SVD). The magnitudes of the singular values capture information about the data: the larger the singular value, the more information is captured by the corresponding basis vector. The change in the singular values of each hidden layer's weight matrix was observed before and after pruning. The singular values of each hidden layer's weight matrix were reduced by a weighted amount. During the initial phase of pruning, the singular values with smaller magnitudes experienced a greater reduction in value than those with larger magnitudes. This corresponds to the notion that activation-based pruning removes the basis vectors that contain relatively little information about the data. As pruning progresses, the basis vectors that hold the least amount of information are removed successively. After a while, only basis vectors that contain comparably important amounts of information remain. At this stage, the rate of reduction in the magnitude of the singular values shifts from a weighted reduction to a uniform reduction. This is equivalent to stating that the reduction in the magnitude of the singular values shifts from a weighted nuclear norm minimization (WNNM) to a nuclear norm minimization (NNM). To the best of the researchers' knowledge, this work is the first of its kind to analyze the effect of pruning on the singular values of hidden layers, and it provides empirical data to support the hypothesis that activation-based pruning results in a weighted nuclear norm minimization of the hidden layers. The researchers hypothesize that, in a feedforward neural network using SGD as the optimization algorithm and ReLU as the activation function, activation-based pruning is equivalent to introducing the weighted nuclear norm as a regularization parameter in the original cost function.
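A minimal version of this diagnostic is sketched below: compute the spectrum of a layer's weight matrix before and after one pruning step and compare the per-index change. Random weights stand in for a trained layer, and pruning is approximated by zeroing whole columns (as in the earlier illustrative sketch), so the numbers are only indicative of the procedure, not of the reported results.

```python
# Singular-value diagnostic: compare a layer's spectrum before and after pruning.
import numpy as np

rng = np.random.default_rng(0)
W_before = rng.normal(size=(784, 256))     # stand-in for a trained layer

# Stand-in for one activation-based pruning cycle: zero ~10% of the columns.
W_after = W_before.copy()
W_after[:, rng.choice(256, size=26, replace=False)] = 0.0

s_before = np.linalg.svd(W_before, compute_uv=False)   # descending order
s_after = np.linalg.svd(W_after, compute_uv=False)

# Per-index change in the spectrum; the entry reports that early pruning
# cycles shrink the small (tail) singular values by proportionally more.
reduction = s_before - s_after
print("largest singular values shrink by:", reduction[:3])
print("smallest singular values shrink by:", reduction[-3:])
```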

4. Conclusion

The researchers demonstrated the effectiveness of activation-based pruning in reducing the size of a feedforward network. This pruning technique can be applied with either labeled or unlabeled data, as long as the data are drawn from the same distribution used to train the original feedforward network. Consequently, the researchers hypothesized that activation-based pruning is adaptable to supervised, semi-supervised, or unsupervised learning algorithms. Furthermore, the results showed that each layer of the pruned network serves as a sparse low-rank matrix representation of the corresponding layer of the fully trained original network. The researchers provided empirical evidence supporting the hypothesis that activation-based pruning can be interpreted as introducing a regularization parameter based on the weighted nuclear norm of the hidden layers. Additionally, considering the architectural and implementation characteristics of activation-based pruning, the researchers proposed that this technique has the potential to be applied to various types of neural networks. The study focused specifically on image classification tasks using smaller datasets, namely FashionMNIST and MNIST, and compact network architectures, such as the fully connected feedforward model. This initial scope was chosen to facilitate a controlled analysis of the methods involved.

References

  1. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
  2. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016, arXiv:1510.00149.
  3. Guo, Y.; Yao, A.; Chen, Y. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29.
  4. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  5. Molchanov, D.; Ashukha, A.; Vetrov, D. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Proceedings of Machine Learning Research. Volume 70, pp. 2498–2507.