Activation-Based Pruning of Neural Networks

We present a novel technique for pruning called activation-based pruning to effectively prune fully connected feedforward neural networks for multi-object classification. Our technique is based on the number of times each neuron is activated during model training. Further analysis demonstrated that activation-based pruning can be considered a dimensionality reduction technique, as it leads to a sparse low-rank matrix approximation for each hidden layer of the neural network. We also demonstrate that the rank-reduced neural network generated using activation-based pruning has better accuracy than a rank-reduced network generated using principal component analysis. We provide empirical results to show that, after each successive pruning, the amount of reduction in the magnitude of the singular values of each matrix representing the hidden layers of the network is equivalent to introducing the sum of the singular values of the hidden layers as a regularization parameter to the objective function.

  • machine learning
  • network pruning
  • dimensionality reduction
  • computer vision

1. Introduction

Deep neural networks are used to solve real-world problems in various domains such as image classification, text classification, and speech recognition. These networks often require millions of parameters and billions of floating-point operations to make accurate predictions. Network pruning has emerged as an important technique for improving the efficiency of deep neural networks by removing redundant structures. Pruning reduces the number of parameters of a neural network, resulting in a reduction of the computational resources required to run the network. Some of the most popular pruning methods are magnitude-based pruning, structured pruning, pruning based on the lottery ticket hypothesis, and dynamic pruning. Of these methods, magnitude-based pruning has proven successful at producing compact models and has seen widespread adoption. However, prior work on magnitude-based pruning contains certain deficiencies. Magnitude-based pruning does not inherently induce a low-rank structure in the hidden layers of neural networks. More rigorous constraints are needed to drive rank reduction, and integrating it with structured pruning or low-rank regularizers is likely necessary to fully exploit the compression property.

2. Methodology

Activation-based pruning is based on the number of times each neuron is activated during the forward pass of the training phase of a neural network. In the forward pass, data are processed at each layer and passed on to the next layer until the processed data reach the output layer. In a fully connected network, each neuron is connected with incoming weights from all the neurons of the previous layer. The output of a neuron is calculated as σ(WᵀX), where W is the vector of incoming weights, X is the vector of processed data from the previous layer, and σ is the non-linear activation function. Activation of a neuron means that its output is not zero, i.e., σ(WᵀX) ≠ 0. The number of times a neuron is activated is stored as a parameter called the activation counter; each neuron has a corresponding activation counter. During training, we chose a random set of data as the input to the network before each cycle of pruning, which resulted in the neurons either being activated (σ(WᵀX) ≠ 0) or not activated (σ(WᵀX) = 0). Every time a neuron is activated, we increment the value of its activation counter by 1. The values of the activation counters of all the neurons are used to perform pruning. After every pruning cycle, the network suffers some loss in training accuracy; hence, we trained the network to reach a predetermined training accuracy before we initiated the next pruning cycle. The random set of data used to compute the activation counters can either belong to the training set or be separated before the start of the training phase. The data can be labeled or unlabeled, provided they belong to the same distribution used to train the network. We carried out an empirical comparison of two ways of pruning the network using activation-based pruning. The first was global activation-based pruning, where, at each stage of pruning, we consider neurons from all hidden layers of the network to decide which neurons to prune.

The second was local activation-based pruning, where, at each stage of pruning, we prune one hidden layer at a time.
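
To make the mechanism concrete, the following sketch shows one possible implementation of the activation counters and of the global versus local selection of neurons to prune. It is an illustrative sketch in PyTorch, not the authors' reference code: the class and function names, the layer sizes, and the 5% pruning fraction are our own assumptions.

import torch
import torch.nn as nn

class MLP(nn.Module):
    # Fully connected feedforward network with ReLU hidden layers.
    def __init__(self, sizes=(784, 300, 100, 10)):
        super().__init__()
        self.hidden = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(sizes[:-2], sizes[1:-1])])
        self.out = nn.Linear(sizes[-2], sizes[-1])

    def forward(self, x):
        acts = []                              # post-ReLU outputs of each hidden layer
        for layer in self.hidden:
            x = torch.relu(layer(x))
            acts.append(x)
        return self.out(x), acts

def activation_counters(model, data):
    # Count, per hidden neuron, how often sigma(W^T X) != 0 on a random batch.
    with torch.no_grad():
        _, acts = model(data)
    return [(a != 0).sum(dim=0) for a in acts]  # one count vector per hidden layer

def prune_step(model, counters, frac=0.05, scope="global"):
    # Zero the incoming weights of the least-activated neurons.
    if scope == "global":                      # rank neurons across all hidden layers
        scores = torch.cat(counters).float()
        k = int(frac * scores.numel())
        threshold = scores.kthvalue(k).values if k > 0 else -1.0
    for layer, cnt in zip(model.hidden, counters):
        if scope == "local":                   # rank neurons within this layer only
            k = int(frac * cnt.numel())
            threshold = cnt.float().kthvalue(k).values if k > 0 else -1.0
        mask = cnt.float() <= threshold        # neurons selected for pruning
        layer.weight.data[mask, :] = 0.0
        layer.bias.data[mask] = 0.0

# One pruning cycle: count activations on a random batch, prune, then retrain
# the network to the predetermined training accuracy before the next cycle.
model = MLP()
batch = torch.rand(256, 784)
prune_step(model, activation_counters(model, batch), frac=0.05, scope="global")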

3. Results and Discussion

While activation-based pruning prolonged the network training duration, it consistently led to better validation and testing accuracy in almost all scenarios. Pruning resulted in a loss of accuracy, which was recovered by retraining the network. The pruned model can be used for prediction in place of the original model. A limitation of global activation-based pruning is its memory- and time-intensive nature, which may lead to poor scalability for larger networks. Local activation-based pruning delivered accuracy and pruning percentages equivalent to global activation-based pruning, while significantly improving processing time and reducing memory consumption.

Our empirical findings led us to hypothesize that activation-based pruning selectively targets weights that contributed minimally to the network's training process. This pruning strategy resulted in the formation of sparse, low-rank matrix approximations, effectively reducing the full-rank matrices of a trained network. When such pruned networks underwent retraining, they tended to preserve the low-rank characteristics of these matrices, especially when previously pruned (zeroed) weights were allowed to be reutilized. In contrast, our observations with magnitude-based pruning indicated a different behavior: during the retraining phase, this method tended to engage nearly the entire spectrum of weights within the network. This distinction highlighted the unique impact of activation-based pruning on the network's weight optimization during retraining.

3.1 Comparative Analysis

Paper                     | Compression Rate | Sparsity % | Remaining Weights % | Required FLOPS %
Activation-based          | 94×              | 98.94      | 1.06                | 1.06
Han et al. [2]            | 12×              | 92         | 8                   | 8
Han et al. [7]            | 40×              | 92         | 8                   | 8
Guo et al. [6]            | 56×              | 98.2       | 1.8                 | 1.8
Frankle and Carbin [5]    | 7×               | 86.5       | 13.5                | 13.5
Molchanov et al. [17]     | 7×               | 86.03      | 13.97               | 13.97

Han et al. [2] used magnitude-based pruning to achieve a 12-fold reduction in model size. Han et al. [7] employed an iterative pruning technique, where pruning occurred after training, and achieved a 40-fold reduction in model size. Guo et al. [6] presented an iterative pruning method, where pruning occurred before training, and achieved a 56-fold reduction in model size. Frankle and Carbin [5] used the lottery ticket hypothesis in the context of pruning small fully connected networks on MNIST; it is a pruning technique with importance scoring based on weight magnitude and achieved a seven-fold reduction in model size. Molchanov et al. [17] presented an unstructured pruning method, where pruning occurred during training, and achieved a seven-fold compression rate. Activation-based pruning is an unstructured, iterative pruning method. The results demonstrated that activation-based pruning achieved an impressive 94-fold reduction in model size.
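
The columns of the table above are related by simple arithmetic: the remaining-weights percentage is 100 minus the sparsity percentage, the compression rate is the reciprocal of the remaining fraction, and, for fully connected layers, the required floating-point operations scale roughly with the number of surviving weights. The short sketch below illustrates this relationship with made-up weight counts (they are not the parameter counts used in the study); for example, 1.06% remaining weights corresponds to roughly a 94-fold compression.

def compression_stats(total_weights, nonzero_weights):
    # Percentage of weights removed, percentage kept, and compression rate.
    remaining_pct = 100.0 * nonzero_weights / total_weights
    sparsity_pct = 100.0 - remaining_pct
    compression_rate = total_weights / nonzero_weights
    return sparsity_pct, remaining_pct, compression_rate

# Hypothetical counts: 1,000,000 weights of which 10,600 survive pruning.
print(compression_stats(1_000_000, 10_600))   # -> (98.94, 1.06, ~94.3)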

3.2 Principal Component Analysis of Hidden Layers

Activation-based pruning generates a low-rank matrix approximation of each matrix representing the hidden layers of the neural network. It does so by assigning a score to each neuron to identify significant and insignificant neurons, and it avoids the computationally expensive process of decomposing matrices using SVD, making it advantageous for large matrices. Our hypothesis posits that activation-based pruning offers a better rank-k approximation of the network layers while preserving significant levels of training, validation, and test accuracy. To substantiate this hypothesis, we ascertained the ranks of each pruned model and applied PCA to a fully trained model, resulting in the creation of rank-reduced models. Subsequently, we compared the test accuracy of the PCA-generated rank-reduced models with those derived from activation-based pruning. Models subjected to activation-based pruning demonstrated better test accuracy compared to those whose hidden layers were rank-reduced through PCA.
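
A minimal sketch of how such a rank-matched comparison can be set up is given below. It is our own illustration in NumPy, not the authors' code: the layer shape and the stand-in matrices are arbitrary, and truncated SVD is used here as the linear-algebra core of the PCA-based rank reduction.

import numpy as np

def truncated_svd_approx(W, k):
    # Best rank-k approximation of W in the least-squares sense.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

W_trained = np.random.randn(300, 784)   # stand-in for a fully trained hidden layer
W_pruned = W_trained.copy()
W_pruned[200:, :] = 0.0                 # stand-in: incoming weights of pruned neurons zeroed

k = np.linalg.matrix_rank(W_pruned)     # rank reached by the pruned layer (about 200 here)
W_rank_reduced = truncated_svd_approx(W_trained, k)

# In the comparison described above, the model whose hidden layers come from
# activation-based pruning and the model whose hidden layers are replaced by
# rank-matched approximations like W_rank_reduced are evaluated on the same test set.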

3.3 Analysis of Singular Value Changes in Each Layer

Every hidden layer of a feedforward neural network can be represented in the form of a matrix. The singular values of a matrix can be computed using singular-value decomposition (SVD). The magnitudes of the singular values capture information about the data: the larger the singular value, the higher the amount of information captured by the corresponding basis vector. We observed the change in the singular values of the weight matrix of each hidden layer before and after pruning. The singular values of the weight matrix of each hidden layer were reduced by a weighted amount. We also observed that, during the initial phase of pruning the network, the singular values with smaller magnitudes experienced a greater reduction in value than the singular values with larger magnitudes. This corresponds to the notion that activation-based pruning removes the basis vectors that contain relatively little information about the data. As pruning progresses, the basis vectors that hold the least amount of information are removed successively. After a while, we are left with basis vectors that contain equally important amounts of information. At this stage, the rate of reduction in the magnitude of the singular values shifts from a weighted reduction to a uniform reduction. This is equivalent to stating that the reduction in the magnitude of the singular values shifts from weighted nuclear norm minimization (WNNM) to nuclear norm minimization (NNM).

To the best of our knowledge, our work is the first of its kind to analyze the effect of pruning on the singular values of the hidden layers and to provide empirical data to support the hypothesis that activation-based pruning results in a weighted nuclear norm minimization of the hidden layers. We hypothesize that, in a feedforward neural network using SGD as the optimization algorithm and ReLU as the activation function, activation-based pruning is equivalent to introducing the weighted nuclear norm as a regularization parameter to the original cost function.
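
The hypothesized regularizer is the (weighted) nuclear norm, i.e., the (weighted) sum of the singular values of each hidden layer's weight matrix, so the empirical check amounts to tracking each layer's singular value spectrum across pruning cycles. The sketch below shows the bookkeeping involved; it is our own illustration in NumPy with random stand-in matrices, not the measurement code used in the study.

import numpy as np

def singular_values(W):
    return np.linalg.svd(W, compute_uv=False)    # returned in descending order

# Stand-ins for the hidden-layer weight matrices before a pruning cycle.
layers_before = [np.random.randn(300, 784), np.random.randn(100, 300)]

# Stand-in for one pruning cycle: zero the incoming weights of some neurons.
layers_after = []
for W in layers_before:
    Wp = W.copy()
    Wp[-30:, :] = 0.0
    layers_after.append(Wp)

for i, (Wb, Wa) in enumerate(zip(layers_before, layers_after)):
    drop = singular_values(Wb) - singular_values(Wa)   # per-singular-value reduction
    third = len(drop) // 3
    print(f"layer {i}: mean drop over largest third = {drop[:third].mean():.3f}, "
          f"over smallest third = {drop[-third:].mean():.3f}")
# A weighted (non-uniform) drop across the spectrum is the signature of WNNM-like
# behavior; a uniform drop corresponds to NNM-like behavior.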

4. Conclusion

We demonstrated the effectiveness of activation-based pruning in successfully reducing the size of feedforward networks. This pruning technique can be applied with either labeled or unlabeled data, as long as the data are drawn from the same distribution used to train the original feedforward network. Consequently, we hypothesized that activation-based pruning is adaptable to supervised, semi-supervised, and unsupervised learning algorithms. Furthermore, our results showed that each layer of the pruned network serves as a sparse low-rank matrix representation of the fully trained original network. We provided empirical evidence supporting the hypothesis that activation-based pruning can be interpreted as introducing a regularization parameter of the weighted nuclear norm of the hidden layers. Additionally, considering the architectural and implementation characteristics of activation-based pruning, we proposed that this technique has the potential to be applied to various types of neural networks. In our study, the focus was specifically on image classification tasks utilizing smaller datasets, namely FashionMNIST and MNIST, and employing compact network architectures, such as the fully connected feedforward model. This initial scope was chosen to facilitate a controlled analysis of the methods involved.
