Activation-Based Pruning of Neural Networks

A novel pruning technique, called activation-based pruning, is presented for effectively pruning fully connected feedforward neural networks for multi-object classification. The technique is based on the number of times each neuron is activated during model training. Further analysis demonstrated that activation-based pruning can be considered a dimensionality reduction technique, as it leads to a sparse low-rank matrix approximation for each hidden layer of the neural network. The rank-reduced neural network generated using activation-based pruning has better accuracy than a rank-reduced network generated using principal component analysis. After each successive pruning, the reduction in the magnitude of the singular values of each matrix representing the hidden layers of the network is equivalent to introducing the sum of singular values of the hidden layers as a regularization parameter in the objective function.

  • machine learning
  • network pruning
  • dimensionality reduction
  • computer vision

1. Introduction

Deep neural networks are used to solve real-world problems in various domains such as image classification, text classification, and speech recognition. These networks often require millions of parameters and billions of floating-point operations to make accurate predictions. Network pruning has emerged as an important technique for improving the efficiency of deep neural networks by removing redundant structures. Pruning reduces the number of parameters of a neural network, resulting in a reduction of the computational resources required to run the network. Some of the most popular pruning methods are magnitude-based pruning, structured pruning, pruning based on the lottery ticket hypothesis, and dynamic pruning. Of these methods, magnitude-based pruning has proven successful at producing compact models and has seen widespread adoption. However, prior work on magnitude-based pruning has certain deficiencies. Magnitude-based pruning does not inherently induce a low-rank structure in the hidden layers of neural networks; more rigorous constraints are needed to drive rank reduction, and integrating it with structured pruning or low-rank regularizers is likely necessary to fully exploit its compression potential.
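
For contrast with the activation-based approach described below, the following is a minimal sketch of plain magnitude-based pruning on a single weight matrix, with a check of the resulting matrix rank. The layer size, sparsity level, and NumPy implementation are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

# Minimal sketch of magnitude-based (unstructured) pruning on one weight matrix,
# included for contrast with activation-based pruning. Layer size and sparsity
# level are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))                  # weights of one hidden layer

sparsity = 0.9                                   # prune 90% of the weights
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)

print("remaining weights:", np.count_nonzero(W_pruned) / W.size)
print("rank before pruning:", np.linalg.matrix_rank(W))
print("rank after pruning :", np.linalg.matrix_rank(W_pruned))
# For generic dense weights the pruned matrix typically remains (near) full rank,
# illustrating the point above: magnitude pruning alone does not induce a
# low-rank structure in the hidden layers.
```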

2. Methodology

Activation-based pruning is based on the number of times each neuron is activated during the forward pass of the training phase of a neural network. In the forward pass, data are processed at each layer and passed on to the next layer until the processed data reach the output layer. In a fully connected network, each neuron is connected with incoming weights from all the neurons of the previous layer. The output of a neuron is calculated as σ(WᵀX), where W is the vector of incoming weights, X is the vector of processed data from the previous layer, and σ is the non-linear activation function. Activation of a neuron means that its output is not zero, i.e., σ(WᵀX) ≠ 0. The number of times a neuron is activated is stored as a parameter called the activation counter; each neuron has a corresponding activation counter. During training, the researchers chose a random set of data as the input to the network before each cycle of pruning, which resulted in the neurons either being activated (σ(WᵀX) ≠ 0) or not activated (σ(WᵀX) = 0). Every time a neuron is activated, the researchers increment the value of its activation counter by 1. The values of the activation counters of all the neurons are used to perform pruning. After every pruning cycle, the network suffers some loss in training accuracy. Hence, the researchers trained the network to reach a predetermined training accuracy before initiating the next pruning cycle. The random set of data used to compute the activation counters can belong either to the training set or can be separated before the start of the training phase. The data can be either labeled or unlabeled, provided they belong to the same distribution used to train the neural network. The researchers carried out an empirical comparison of two ways of pruning the network using activation-based pruning. The first was global activation-based pruning, where at each stage of pruning, the researchers consider neurons from all hidden layers of the network to decide which neurons to prune.

The second was local activation-based pruning, where at each stage of pruning, the researchers prune one hidden layer.
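
The following is a minimal NumPy sketch of the procedure described above: counting activations over a random batch and pruning the least-activated neurons globally. The layer sizes, scoring batch, pruning fraction, and in-place zeroing of incoming weights are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

# Minimal sketch of activation-based pruning for a fully connected network with
# ReLU activations. Layer sizes, the scoring batch, and the pruning fraction are
# illustrative assumptions, not the authors' exact configuration.
rng = np.random.default_rng(0)
layer_sizes = [784, 300, 100, 10]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def activation_counters(X, weights):
    """Forward pass that counts, per hidden neuron, how often sigma(W^T x) != 0."""
    counters = [np.zeros(W.shape[1], dtype=int) for W in weights[:-1]]
    h = X
    for i, W in enumerate(weights[:-1]):
        h = np.maximum(h @ W, 0.0)              # ReLU output of this hidden layer
        counters[i] += (h != 0).sum(axis=0)     # increment activation counters
    return counters

X_score = rng.normal(size=(512, 784))           # random batch used for scoring
counters = activation_counters(X_score, weights)

# Global activation-based pruning: rank all hidden neurons together and zero the
# incoming weights of the least-activated fraction.
prune_frac = 0.2
cutoff = np.quantile(np.concatenate(counters), prune_frac)
for W, c in zip(weights[:-1], counters):
    W[:, c <= cutoff] = 0.0                     # prune whole neurons (columns)

# Local activation-based pruning would instead apply the cutoff per layer, using
# each layer's own counter values.
```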

3. Results and Discussion

While activation-based pruning prolonged the network training duration, it consistently led to better validation and testing accuracy in almost all scenarios. Pruning resulted in a loss of accuracy, which was recovered by retraining the network. The pruned model can be used for prediction in place of the original model. A limitation of global activation-based pruning is its memory- and time-intensive nature, which may lead to poor scalability for larger networks. Local activation-based pruning delivered accuracy and pruning percentages equivalent to global activation-based pruning, while significantly improving processing time and reducing memory consumption.

Empirical findings led the researchers to hypothesize that activation-based pruning selectively targets weights that contributed minimally to the network's training process. This pruning strategy resulted in the formation of sparse, low-rank matrix approximations, effectively reducing the full-rank matrices of a trained network. When such pruned networks underwent retraining, they tended to preserve the low-rank characteristics of these matrices, especially when previously pruned (zeroed) weights were allowed to be reutilized. In contrast, the observations with magnitude-based pruning indicated a different behavior: during the retraining phase, this method tended to engage nearly the entire spectrum of weights within the network. This distinction highlighted the unique impact of activation-based pruning on the network's weight optimization during retraining.

3.1 Comparative Analysis

Paper | Compression Rate | Sparsity % | Remaining Weights % | Required FLOPS %
Activation-based | 94× | 98.94 | 1.06 | 1.06
Han et al. [1] | 12× | 92 | 8 | 8
Han et al. [2] | 40× | 92 | 8 | 8
Guo et al. [3] | 56× | 98.2 | 1.8 | 1.8
Frankle and Carbin [4] | 7× | 86.5 | 13.5 | 13.5
Molchanov et al. [5] | 7× | 86.03 | 13.97 | 13.97

Han et al. [1] presented an unstructured, iterative pruning method that uses magnitude-based pruning to achieve a 12-fold reduction in model size. Han et al. [2] employed an iterative pruning technique, where pruning occurred after training, and achieved a 40-fold reduction in model size. Guo et al. [3] presented an iterative pruning method where pruning occurred before training and achieved a 56-fold reduction in model size. Frankle and Carbin [4] proposed the lottery ticket hypothesis in the context of pruning small fully connected networks on MNIST. It is a pruning technique with importance scoring based on weight magnitude, and it achieved a 7-fold reduction in model size. Molchanov et al. [5] presented an unstructured pruning method where pruning occurred during training; it achieved a 7-fold compression rate. Activation-based pruning is an unstructured, iterative pruning method. The results demonstrated that activation-based pruning achieved an impressive 94-fold reduction in model size.
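
As a worked example of how the table's columns relate, the short snippet below computes the compression rate, sparsity, and remaining-weight percentage from parameter counts. The counts are hypothetical, chosen only to roughly reproduce the activation-based row.

```python
# Worked example of the metrics reported in the table: compression rate,
# sparsity, and remaining weights. The parameter counts are hypothetical,
# chosen only to roughly reproduce the activation-based row.
total_params = 266_200                         # dense parameter count (assumed)
nonzero_params = 2_830                         # parameters left after pruning (assumed)

compression_rate = total_params / nonzero_params        # ~94x
remaining_pct = 100 * nonzero_params / total_params     # ~1.06%
sparsity_pct = 100 - remaining_pct                      # ~98.94%

print(f"compression {compression_rate:.0f}x, "
      f"sparsity {sparsity_pct:.2f}%, remaining {remaining_pct:.2f}%")
# For fully connected layers, multiply-accumulate operations scale with the
# number of nonzero weights, so "Required FLOPS %" tracks "Remaining Weights %".
```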

3.2 Principal Component Analysis of Hidden Layers

Activation-based pruning generates a low-rank matrix approximation by assigning scores to each neuron to identify significant and insignificant neurons of each matrix representing the hidden layers of the neural network. The method avoids the computationally expensive process of decomposing matrices using SVD, making it advantageous for large matrices.

The researchers' hypothesis posits that activation-based pruning offers a better rank-k approximation of network layers while preserving significant levels of training, validation, and test accuracy. To substantiate this hypothesis, the researchers ascertained the ranks of each pruned model and applied PCA to a fully trained model, resulting in the creation of rank-reduced models. Subsequently, the researchers compared the test accuracy of the PCA-generated rank-reduced models with those derived from activation-based pruning. Models subjected to activation-based pruning demonstrated better test accuracy compared to those whose hidden layers were rank-reduced through PCA.
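
Below is a minimal sketch of the PCA-style baseline, approximated here by replacing a trained layer's weight matrix with its best rank-k approximation via truncated SVD. The matrix, the target rank k, and the NumPy implementation are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

# Minimal sketch of the PCA-style baseline, approximated here by replacing a
# trained layer's weight matrix with its best rank-k approximation via truncated
# SVD. The matrix and the target rank k are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.normal(size=(300, 100))                # stand-in for trained layer weights
k = 20                                         # rank matched to the pruned model

U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_rank_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print("rank of rank-reduced layer:", np.linalg.matrix_rank(W_rank_k))
# The entry reports that activation-pruned models reached better test accuracy
# than models whose hidden layers were rank-reduced in this manner.
```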

3.3 Analysis of Singular Value Changes in Each Layer

Every hidden layer of a feedforward neural network can be represented in the form of a matrix. The singular values of a matrix can be computed using singular-value decomposition (SVD). The magnitudes of the singular values capture information about the data: the larger the singular value, the higher the amount of information captured by the corresponding basis vector. The researchers observed the change in the singular values of the weight matrix of each hidden layer before and after pruning and found that the singular values of each hidden layer were reduced by a weighted amount. They also observed that, during the initial phase of pruning, the singular values with smaller magnitudes experience a greater reduction in value than the singular values with larger magnitudes. This corresponds to the notion that activation-based pruning removes the basis vectors that contain relatively little information about the data.

As pruning progresses, the basis vectors that hold the least amount of information are removed successively. After a while, the researchers are left with basis vectors that contain equally important amounts of information. At this stage, the rate of reduction in the magnitude of the singular values shifts from a weighted reduction to a uniform reduction. This is equivalent to stating that the reduction in the magnitude of the singular values shifts from weighted nuclear norm minimization (WNNM) to nuclear norm minimization (NNM). The work is the first of its kind to analyze the effect of pruning on the singular values of hidden layers and provides empirical data to support the hypothesis that activation-based pruning results in a weighted nuclear norm minimization of the hidden layers. The researchers hypothesize that, in a feedforward neural network using SGD as the optimization algorithm and ReLU as the activation function, activation-based pruning is equivalent to introducing the weighted nuclear norm as a regularization parameter to the original cost function.
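
The following sketch illustrates the kind of measurement behind this analysis: comparing a hidden layer's singular-value spectrum and nuclear norm (sum of singular values) before and after pruning. The matrices and the column-zeroing stand-in for pruning are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

# Minimal sketch of the singular-value analysis: compare a hidden layer's
# spectrum and nuclear norm (sum of singular values) before and after pruning.
# The matrix and the column-zeroing stand-in for pruning are illustrative.
rng = np.random.default_rng(0)
W_before = rng.normal(size=(300, 100))
W_after = W_before.copy()
pruned_cols = rng.choice(100, size=60, replace=False)
W_after[:, pruned_cols] = 0.0                  # zero incoming weights of pruned neurons

s_before = np.linalg.svd(W_before, compute_uv=False)
s_after = np.linalg.svd(W_after, compute_uv=False)

print("nuclear norm before:", s_before.sum())
print("nuclear norm after :", s_after.sum())
print("reduction of the five largest singular values:", (s_before - s_after)[:5])
# In the reported experiments, smaller singular values shrink more than larger
# ones during the early pruning cycles, which is what motivates the weighted
# nuclear norm interpretation above.
```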

4. Conclusion

The researchers demonstrated the effectiveness of activation-based pruning in successfully reducing the size of a feedforward network. This pruning technique can be applied with either labeled or unlabeled data, as long as the data are drawn from the same distribution used to train the original feedforward network. Consequently, the researchers hypothesized that activation-based pruning is adaptable to supervised, semi-supervised, or unsupervised learning algorithms. Furthermore, the results showed that each layer of the pruned network serves as a sparse low-rank matrix representation of the fully trained original network. The researchers provided empirical evidence supporting the hypothesis that activation-based pruning can be interpreted as introducing a regularization parameter of the weighted nuclear norm of the hidden layers. Additionally, considering the architectural and implementation characteristics of activation-based pruning, the researchers proposed that this technique has the potential to be applied to various types of neural networks. In the study, the focus was specifically on image classification tasks utilizing smaller datasets, namely FashionMNIST and MNIST, and employing compact network architectures, such as the fully connected feedforward model. This initial scope was chosen to facilitate a controlled analysis of the methods involved.

References

  1. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
  2. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016, arXiv:1510.00149.
  3. Guo, Y.; Yao, A.; Chen, Y. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29.
  4. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  5. Molchanov, D.; Ashukha, A.; Vetrov, D. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Proceedings of Machine Learning Research. Volume 70, pp. 2498–2507.