Activation-Based Pruning of Neural Networks: Comparison

A novel pruning technique called activation-based pruning is presented for effectively pruning fully connected feedforward neural networks for multi-object classification. The technique is based on the number of times each neuron is activated during model training. Further analysis demonstrated that activation-based pruning can be considered a dimensionality reduction technique, as it leads to a sparse low-rank matrix approximation for each hidden layer of the neural network. The rank-reduced neural network generated using activation-based pruning has better accuracy than a rank-reduced network generated using principal component analysis. After each successive pruning, the reduction in the magnitude of the singular values of each matrix representing a hidden layer of the network is equivalent to introducing the sum of the singular values of the hidden layers as a regularization parameter in the objective function.

  • machine learning
  • network pruning
  • dimensionality reduction
  • computer vision

1. Introduction


Deep neural networks are used to solve real-world problems in various domains such as image classification, text classification, and speech recognition. These networks often require millions of parameters and billions of floating-point operations to make accurate predictions. Network pruning has emerged as an important technique for improving the efficiency of deep neural networks by removing redundant structures. Pruning reduces the number of parameters of a neural network, resulting in a reduction of the computational resources required to run the network. Some of the most popular pruning methods are magnitude-based pruning, structured pruning, pruning based on the lottery ticket hypothesis, and dynamic pruning. Of these methods, magnitude-based pruning has proven successful at producing compact models and has seen widespread adoption. However, prior work on magnitude-based pruning contains certain deficiencies: it does not inherently induce a low-rank structure in the hidden layers of neural networks, more rigorous constraints are needed to drive rank reduction, and integrating it with structured pruning or low-rank regularizers is likely necessary to fully exploit the compression property. These limitations are addressed by activation-based pruning [6], which achieves results comparable to magnitude-based pruning while also functioning as a dimensionality reduction technique that reduces the hidden layers of the network to sparse low-rank matrix approximations.

2. Methodology

Activation-based pruning is based on the number of times each neuron is activated during the forward pass of the training phase of a neural network. In the forward pass, data are processed at each layer and passed on to the next layer until the processed data reach the output layer. In a fully connected network, each neuron is connected by incoming weights to all the neurons of the previous layer. The output of a neuron is calculated as σ(WᵀX), where W is the vector of incoming weights, X is the vector of processed data from the previous layer, and σ is the non-linear activation function. Activation of a neuron means that its output is not zero, i.e., σ(WᵀX) ≠ 0. The number of times a neuron is activated is stored in a parameter called the activation counter; each neuron has a corresponding activation counter. During training, the researchers chose a random set of data as the input to the network before each cycle of pruning, which resulted in each neuron either being activated (σ(WᵀX) ≠ 0) or not activated (σ(WᵀX) = 0). Every time a neuron is activated, its activation counter is incremented by 1. The values of the activation counters of all the neurons are used to perform pruning. After every pruning cycle, the network suffers some loss in training accuracy; hence, the researchers trained the network to reach a predetermined training accuracy before initiating the next pruning cycle. The random set of data used to compute the activation counters can belong to the training set or can be separated before the start of the training phase, and the data can be either labeled or unlabeled, provided they belong to the same distribution used to train the network. The researchers carried out an empirical comparison of two ways of pruning the network using activation-based pruning. The first was global activation-based pruning, where at each stage of pruning, neurons from all hidden layers of the network are considered when deciding which neurons to prune.

The second was local activation-based pruning, where at each stage of pruning, the researchers prune one hidden layer at a time.
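
The counting-and-pruning cycle described above can be summarized in a short sketch. The snippet below is a minimal NumPy illustration and not the authors' implementation; the layer sizes, the helper names (count_activations, prune_lowest), and the pruning fraction are assumptions made for the example.

```python
# Minimal sketch of activation-based pruning for one hidden layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 128))              # incoming weights of one hidden layer
counters = np.zeros(W.shape[1], dtype=np.int64)  # one activation counter per neuron

def relu(z):
    return np.maximum(z, 0.0)

def count_activations(X, W, counters):
    """Forward pass over a batch; a neuron's counter is incremented each time
    its output sigma(W^T x) is nonzero."""
    A = relu(X @ W)                          # shape: (batch, neurons)
    return counters + (A != 0).sum(axis=0)

def prune_lowest(W, counters, fraction=0.1):
    """Zero out the incoming weights of the least-activated neurons."""
    k = int(fraction * W.shape[1])
    idx = np.argsort(counters)[:k]           # neurons with the fewest activations
    W[:, idx] = 0.0                          # pruning = sparsifying their columns
    return W

X = rng.normal(size=(256, 784))              # random subset of data from the training distribution
counters = count_activations(X, W, counters)
W = prune_lowest(W, counters, fraction=0.1)
# In the full procedure, this cycle (count -> prune -> retrain to a target accuracy)
# is repeated. Global pruning ranks the counters of all hidden layers together;
# local pruning ranks the counters of one layer at a time.
```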

3. Results and Discussion

While activation-based pruning prolonged the network's training duration, it consistently led to better training, validation, and testing accuracy in almost all scenarios. Pruning resulted in a loss of accuracy, which was recovered by retraining the network. The pruned model can be used for prediction in place of the original model. A limitation of global activation-based pruning is its memory- and time-intensive nature, which may lead to poor scalability for larger networks. Local activation-based pruning delivered accuracy and pruning percentages equivalent to global activation-based pruning, while significantly improving processing time and reducing memory consumption.

The empirical findings led the researchers to hypothesize that activation-based pruning selectively targets weights that contributed minimally to the network's training process. This pruning strategy resulted in the formation of sparse, low-rank matrix approximations, effectively reducing the full-rank matrices of a trained network. When such pruned networks underwent retraining, they tended to preserve the low-rank characteristics of these matrices, especially when previously pruned (zeroed) weights were allowed to be reutilized. In contrast, observations with magnitude-based pruning indicated a different behavior: during the retraining phase, this method tended to engage nearly the entire spectrum of weights within the network. This distinction highlights the unique impact of activation-based pruning on the network's weight optimization during retraining.

3.1 Comparative Analysis

Paper                  | Compression Rate | Sparsity % | Remaining Weights % | Required FLOPS %
Activation-based       | 94×              | 98.94      | 1.06                | 1.06
Han et al. [1]         | 12×              | 92         | 8                   | 8
Han et al. [2]         | 40×              | 92         | 8                   | 8
Guo et al. [3]         | 56×              | 98.2       | 1.8                 | 1.8
Frankle and Carbin [4] | 7×               | 86.5       | 13.5                | 13.5
Molchanov et al. [5]   | 7×               | 86.03      | 13.97               | 13.97

Han et al. [1] presented an unstructured, iterative pruning method that uses magnitude-based pruning to achieve a 12-fold reduction in model size. Han et al. [2] employed an iterative pruning technique, where pruning occurred after training, and achieved a 40-fold reduction in model size. Guo et al. [3] presented an iterative pruning method where pruning occurred before training and achieved a 56-fold reduction in model size. Frankle and Carbin [4] used the lottery ticket hypothesis in the context of pruning small fully connected networks on MNIST; it is a structured pruning technique with importance scoring based on the weight magnitude and achieved a seven-fold reduction in model size. Molchanov et al. [5] presented an unstructured pruning method where pruning occurred during training and achieved a seven-fold compression rate. Activation-based pruning is an unstructured, iterative pruning method, and the results demonstrated that it achieved a 94-fold reduction in model size.

3.2 Principal Component Analysis of Hidden Layers

Activation-based pruning generates a sparse low-rank approximation of each matrix representing the hidden layers of the neural network. The researchers' hypothesis posits that activation-based pruning offers a better rank-k approximation of the network layers while preserving significant levels of training, validation, and test accuracy. To substantiate this hypothesis, the researchers ascertained the ranks of each pruned model and applied PCA to a fully trained model, resulting in the creation of rank-reduced models. Subsequently, the researchers compared the test accuracy of the PCA-generated rank-reduced models with those derived from activation-based pruning. Models subjected to activation-based pruning demonstrated better test accuracy than those whose hidden layers were rank-reduced through PCA.
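
As an illustration of the comparison described above, the sketch below builds a rank-k approximation of a stand-in trained weight matrix using truncated SVD, the linear-algebra core of PCA-based rank reduction, with k matched to the rank of an activation-pruned layer. The random matrices and the helper name rank_k_approximation are assumptions for the example, not the study's code.

```python
# Illustrative comparison setup: PCA/SVD rank reduction versus an activation-pruned layer.
import numpy as np

def rank_k_approximation(W, k):
    """Best rank-k approximation of W (in the least-squares sense) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W_trained = rng.normal(size=(784, 128))             # stands in for a fully trained hidden layer
W_pruned = W_trained.copy()
W_pruned[:, rng.choice(128, size=96, replace=False)] = 0.0   # activation-style neuron pruning

k = np.linalg.matrix_rank(W_pruned)                 # rank of the pruned layer
W_pca = rank_k_approximation(W_trained, k)          # PCA/SVD rank-reduced counterpart

# The experiment in the text replaces hidden layers with W_pca (PCA route) or
# W_pruned (activation-based route) and compares test accuracy at equal rank.
print(k, np.linalg.matrix_rank(W_pca))
```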

3.3 Analysis of Singular Value Changes in Each Layer

Every hidden layer of a feedforward neural network can be represented in the form of a matrix, and the singular values of a matrix can be computed using singular-value decomposition (SVD). The magnitudes of the singular values capture information about the data: the larger the singular value, the more information is captured by the corresponding basis vector. The researchers observed the change in the singular values of the weight matrix of each hidden layer before and after activation-based pruning. The singular values of each hidden layer's weight matrix were reduced by a weighted amount. Moreover, during the initial phase of pruning, the singular values with smaller magnitudes experienced a greater reduction in value than the singular values with larger magnitudes. This corresponds to the notion that activation-based pruning removes the basis vectors that contain relatively little information about the data. As pruning progresses, the basis vectors that hold the least information are removed successively; after a while, the remaining basis vectors contain roughly equally important amounts of information. At this stage, the rate of reduction in the magnitude of the singular values shifts from a weighted reduction to a uniform reduction, which is equivalent to stating that the reduction shifts from weighted nuclear norm minimization (WNNM) to nuclear norm minimization (NNM). The work is the first of its kind to analyze the effect of pruning on the singular values of hidden layers and provides empirical data to support the hypothesis that activation-based pruning results in a weighted nuclear norm minimization of the hidden layers. The researchers hypothesize that, in a feedforward neural network using SGD as the optimization algorithm and ReLU as the activation function, activation-based pruning is equivalent to introducing the weighted nuclear norm as a regularization parameter in the original cost function.
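
The sketch below shows how the quantities discussed here can be tracked empirically: the singular values of a layer's weight matrix before and after a pruning step, and the weighted nuclear norm Σᵢ wᵢσᵢ that the analysis relates to activation-based pruning. The random weights standing in for trained layers and the choice of per-singular-value weights are illustrative assumptions, not the authors' experimental code.

```python
# Track singular values and the (weighted) nuclear norm across one pruning step.
import numpy as np

def singular_values(W):
    return np.linalg.svd(W, compute_uv=False)    # returned in descending order

def weighted_nuclear_norm(W, weights=None):
    s = singular_values(W)
    if weights is None:
        weights = np.ones_like(s)                # uniform weights give the plain nuclear norm
    return float(np.sum(weights * s))

rng = np.random.default_rng(0)
W_before = rng.normal(size=(256, 128))           # stand-in for a trained hidden layer
W_after = W_before.copy()
W_after[:, rng.choice(128, size=32, replace=False)] = 0.0   # one pruning step (zeroed neurons)

change = singular_values(W_before) - singular_values(W_after)
print("per-index change in singular values (largest to smallest):", change[:3], "...", change[-3:])
print("nuclear norm before -> after:",
      weighted_nuclear_norm(W_before), "->", weighted_nuclear_norm(W_after))
```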

4. Overview

4.1 Overview of Low-Rank Matrix Approximation

Prior work on compressing neural networks has resulted in sparse matrices, low-rank matrix approximations, or a hybrid of both. Han et al. [1] compressed deep neural networks by simultaneously pruning network weights and connections to reduce computational cost and memory usage, inducing a regularization term that encourages sparsity during training; weights with small magnitudes are pruned based on a specified threshold, which results in sparse weight matrices. Swaminathan et al. proposed a method called sparse low rank (SLR) to compress the dense layers of deep neural networks by improving upon truncated singular-value decomposition (SVD); the key idea is to induce structured sparsity into the matrices obtained from SVD based on the significance of the input/output neurons, where neuron significance is estimated from absolute weights, activations, or the change in cost when the neuron is removed. Yang et al. proposed a method called SVD training, which first decomposes each layer into the form of its full-rank SVD and then performs training on the decomposed weights; low rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer, and singular-value pruning is applied at the end to explicitly reach a low-rank model. Activation-based pruning achieves sparsity and a low-rank matrix approximation by assigning a score to each neuron to identify significant and insignificant neurons; it avoids the computationally expensive process of decomposing matrices with SVD, which is advantageous for large matrices.

4.2 Overview of Structured/Unstructured Pruning

Pruning can be classified into structured [4], unstructured [1][5], and semi-structured pruning. Structured pruning removes filters, channels, or layers to induce structured sparsity patterns; it is commonly used in convolutional neural networks (CNNs), where entire filters or channels (groups of neurons) can be pruned. In unstructured pruning, individual weights are considered for removal without structural constraints; it can be applied to any layer of a neural network, including fully connected and convolutional layers. Semi-structured pruning is a hybrid approach that combines aspects of both: entire structures, such as filters or channels, are removed, but within those structures individual weights may also be pruned. Activation-based pruning is unstructured, where individual neurons are assigned a score for pruning.

4.3 Overview of Importance-Based Pruning

Pruning can also be categorized based on the importance assigned to the weights, filters, and neurons of the network. These techniques induce sparsity by removing connections or filters based on criteria such as weight magnitudes [4] or sensitivity scores; the resulting sparse architectures are then retrained to regain accuracy. Zhu and Gupta explored model pruning as a means of model compression by implementing magnitude-based pruning, where the weights with the smallest absolute values are pruned. In activation-based pruning, a score is assigned to each neuron, which guides the decision of which neurons to prune.
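
A small sketch may help make the structured/unstructured distinction from the previous subsection and the per-neuron scoring described here concrete. The masks, thresholds, and score vector below are illustrative assumptions, not code from any of the cited works.

```python
# Illustrative pruning masks for one layer: unstructured, structured, and score-driven.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                         # weights of one layer (inputs x neurons)

# Unstructured: prune individual weights with the smallest magnitudes.
threshold = np.quantile(np.abs(W), 0.9)
unstructured_mask = np.abs(W) >= threshold            # keeps ~10% of weights, anywhere in W

# Structured: prune whole neurons/filters (here: entire columns) at once.
column_norms = np.linalg.norm(W, axis=0)
keep_cols = np.argsort(column_norms)[-8:]             # keep the 8 strongest neurons
structured_mask = np.zeros_like(W, dtype=bool)
structured_mask[:, keep_cols] = True

# Score per neuron (as in activation-based pruning): an external score, e.g., an
# activation counter, decides which columns survive.
scores = rng.integers(0, 1000, size=W.shape[1])       # stand-in for activation counters
score_mask = np.zeros_like(W, dtype=bool)
score_mask[:, np.argsort(scores)[-8:]] = True

print(unstructured_mask.mean(), structured_mask.mean(), score_mask.mean())
```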

4.4 Overview of Iterative/One-Shot Pruning

Pruning can also be categorized as either iterative [1][2][3] or one-shot. Iterative pruning is a multi-step process of assigning a score, pruning the network, and retraining. Han et al. [1] proposed a three-step pipeline to prune redundant connections in neural networks without affecting accuracy: they first trained the dense network, then pruned low-weight connections below a threshold to obtain a sparse network, and finally retrained the sparse network to learn the weight parameters. Guo et al. [3] introduced a two-step process called pruning and splicing, where weight connections can be removed and added back based on solving a constrained optimization problem for each layer. Han et al. [2] used a three-pronged approach of pruning, quantization, and Huffman coding to achieve substantial compression of the network. Yuan et al. demonstrated how networks can be grown and pruned dynamically during the training phase using structured continuous sparsification; growing and pruning of the network are often performed by introducing a regularization parameter in the cost function and relaxing the initial optimization problem. Liu et al. implemented one-shot pruning, removing weights in a single step. Activation-based pruning is an iterative method in which a score is assigned to each neuron and the neurons with the lowest scores are pruned; afterward, the network is retrained, and this cycle is repeated multiple times.

4.5 Overview of When to Prune

Pruning can also be classified based on when the pruning occurs: before [1][3], during [4][5], or after training [2]. The motivation for pruning before training is to eliminate the cost of pretraining. Pruning during training iteratively prunes and retrains the network to induce sparsity by updating the weight magnitudes or removing filters and channels, while pruning after training generally takes a pretrained network and subsequently prunes and retrains it multiple times. Activation-based pruning occurs during training.
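
To make the iterative cycle that occurs during training concrete, the loop skeleton below sketches a generic score-prune-retrain schedule; the train, score, and prune callables are placeholders (assumptions), and one-shot pruning corresponds to running the loop body a single time.

```python
# Skeleton of an iterative, during-training pruning schedule; all callables are placeholders.
from typing import Any, Callable

def iterative_pruning(model: Any,
                      train: Callable[[Any, float], Any],       # retrain until a target accuracy
                      score: Callable[[Any], Any],              # e.g., activation counters per neuron
                      prune: Callable[[Any, Any, float], Any],  # remove the lowest-scoring neurons
                      cycles: int = 5,
                      fraction: float = 0.1,
                      target_accuracy: float = 0.95) -> Any:
    """Assign scores, prune, and retrain, repeated for a fixed number of cycles.
    One-shot pruning is the special case cycles=1."""
    model = train(model, target_accuracy)        # initial training of the dense network
    for _ in range(cycles):
        s = score(model)                         # e.g., run a random batch and count activations
        model = prune(model, s, fraction)
        model = train(model, target_accuracy)    # recover the accuracy lost to pruning
    return model
```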

5. Conclusion

The researchers demonstrated the effectiveness of activation-based pruning in reducing the size of a feedforward network. The technique can be applied with either labeled or unlabeled data, as long as the data are drawn from the same distribution used to train the original network; consequently, the researchers hypothesized that activation-based pruning is adaptable to supervised, semi-supervised, and unsupervised learning algorithms. Furthermore, the results showed that each layer of the pruned network serves as a sparse low-rank matrix approximation of the fully trained original network. The researchers provided empirical evidence supporting the hypothesis that activation-based pruning can be interpreted as introducing the weighted nuclear norm of the hidden layers as a regularization parameter. Additionally, considering the architectural and implementation characteristics of activation-based pruning, the researchers proposed that this technique has the potential to be applied to various types of neural networks. The study focused specifically on image classification tasks using smaller datasets, namely FashionMNIST and MNIST, and compact network architectures, such as a fully connected feedforward model; this initial scope was chosen to facilitate a controlled analysis of the methods involved.

References

  1. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
  2. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016, arXiv:1510.00149.
  3. Guo, Y.; Yao, A.; Chen, Y. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29.
  4. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  5. Molchanov, D.; Ashukha, A.; Vetrov, D. Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Proceedings of Machine Learning Research. Volume 70, pp. 2498–2507.
  6. Ganguli, T.; Chong, E.K.P. Activation-Based Pruning of Neural Networks. Algorithms 2024, 17, 1.