Neural network computational methods have evolved over the past half-century. In 1943, McCulloch and Pitts designed the first model, known as the linear threshold gate. Hebb later developed the Hebbian learning rule for training neural networks. However, the Hebbian rule is effective only when the input patterns are orthogonal; orthogonality of the input vectors is a crucial requirement for the rule to work. To address this limitation, a more effective learning rule, known as the delta rule, was established. Building on the delta rule, backpropagation was later developed as a more sophisticated learning approach. Backpropagation can train networks with many layers, which can approximate any continuous function. A feedforward neural network (FFNN) is most often trained using backpropagation.
1. Novel Pooling Methods
1.1. Compact Bilinear Pooling
Bilinear methods have been shown to perform well on several visual tasks, including semantic segmentation, fine-grained classification, and facial detection. The compact bilinear pooling technique is trained with end-to-end backpropagation and yields a low-dimensional yet highly discriminative image representation. This pooling approach is also employed in ^{[1]}^{[2]}.
Applied to the last convolutional feature map, this strategy is intended to achieve global heterogeneity and rich representations, and it attained cutting-edge performance on several datasets. However, since computing the pairwise interactions between channels is expensive, dimension reduction methods, such as low-rank bilinear pooling, have been presented. Figure 1 shows a schematic representation of compact bilinear pooling.
Figure 1. Image identification using the compact bilinear pooling method.
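To make the cost of pairwise channel interactions concrete, the following minimal sketch contrasts full bilinear pooling with a compact approximation based on Tensor Sketch (two count sketches combined through the FFT), which is one known way to reduce the dimension. The function names, the projection size d, and the random hashes are illustrative assumptions, not the configuration used in the cited works.

```python
import numpy as np

def full_bilinear_pool(features):
    """Sum of outer products over all locations: (N, C) -> (C*C,)."""
    return (features.T @ features).ravel()

def make_sketch_params(c, d, seed=0):
    """Two random hash/sign pairs for Tensor Sketch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, d, size=(2, c))        # hash buckets in [0, d)
    s = rng.choice([-1.0, 1.0], size=(2, c))   # random signs
    return h, s

def count_sketch(x, h, s, d):
    """Project x (C,) to d dimensions: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear_pool(features, h, s, d):
    """Tensor Sketch of each local descriptor, sum-pooled over locations."""
    pooled = np.zeros(d)
    for x in features:
        f1 = np.fft.fft(count_sketch(x, h[0], s[0], d))
        f2 = np.fft.fft(count_sketch(x, h[1], s[1], d))
        # A product in the frequency domain is a circular convolution of the
        # two sketches, which equals a count sketch of the outer product.
        pooled += np.real(np.fft.ifft(f1 * f2))
    return pooled

# Assumed toy input: 7 x 7 = 49 locations with 64 channels each.
features = np.random.default_rng(1).normal(size=(49, 64))
h, s = make_sketch_params(c=64, d=256)
print(full_bilinear_pool(features).shape)                 # (4096,)
print(compact_bilinear_pool(features, h, s, 256).shape)   # (256,)
```

In this toy setting the full bilinear descriptor grows quadratically with the number of channels (64 × 64 = 4096), whereas the compact descriptor has a freely chosen size d.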
1.2. Spectral Pooling
Rippel et al. ^{[3]} proposed a novel pooling approach that performs dimension reduction by truncating the frequency domain representation of the data. Let h × w be the desired output feature map dimensions and let x ∈ R^{m×m} be the given input map. The input map is first transformed with a discrete Fourier transform (DFT), after which an h × w submatrix of the frequency representation is cropped from the center. Finally, the inverse DFT is used to convert the h × w submatrix back into the spatial domain. By implementing a threshold-based filtering methodology, spectral pooling retains more information than max pooling for the same output dimension. It also addresses the problem of the output map's dimensions being reduced too sharply.
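The three steps (DFT, central crop, inverse DFT) can be sketched in NumPy as follows. This is a minimal single-channel illustration under stated assumptions: the function name `spectral_pool` is hypothetical, the output is rescaled so that the mean intensity is preserved, and the imaginary part that an asymmetric crop can leave is simply discarded rather than handled by explicit symmetry correction.

```python
import numpy as np

def spectral_pool(x, h, w):
    """Downsample a 2-D map x (m x n) to (h x w) by cropping its centered spectrum."""
    m, n = x.shape
    # Step 1: DFT, with the zero frequency shifted to the center of the array.
    spectrum = np.fft.fftshift(np.fft.fft2(x))
    # Step 2: crop an h x w window that keeps the zero frequency at its center.
    top, left = m // 2 - h // 2, n // 2 - w // 2
    cropped = spectrum[top:top + h, left:left + w]
    # Step 3: inverse DFT back to the spatial domain.
    y = np.fft.ifft2(np.fft.ifftshift(cropped))
    # Assumption: rescale to preserve the mean and keep only the real part.
    return np.real(y) * (h * w) / (m * n)

x = np.random.default_rng(0).normal(size=(8, 8))
print(spectral_pool(x, 4, 4).shape)  # (4, 4)
```

Because only the size of the cropped window changes, any output size h × w up to the input size can be requested, which is what allows a gentler reduction than a fixed-stride pooling window.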
1.3. Per-Pixel Pyramid Pooling
To obtain the requisite receptive field size, a large pooling window can be used instead of a small pooling window with a stride. With a single large pooling window, however, finer details may be lost. Consequently, multiple pooling operations with different window sizes are performed, and their outputs are concatenated to construct new feature maps. The resulting feature maps contain information from coarse to fine scales. The multiscale pooling operation is performed for every pixel, without strides. Per-pixel pyramid pooling (4P) is formally defined as follows ^{[4]}:

$$P^{4P}(F, s) = [P(F, s_{1}), \ldots, P(F, s_{M})]$$

where F denotes the input feature maps, P(F, s_{i}) is a pooling operation with a size of s_{i} and a stride of 1, and s is a vector with M elements. For clarity, only one channel of the extracted features is shown in Figure 2 to demonstrate the pooling process; the other channels yield similar results.
Figure 2. Representation of the 4P module with the pooling size vector s = [5, 3, 1].
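A minimal NumPy sketch of this module on a single channel is given below, using the pooling size vector s = [5, 3, 1] from Figure 2. It assumes max pooling as the pooling operation, odd window sizes, and border padding with negative infinity so that every output keeps the input resolution; the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_stride1(f, size):
    """Max pooling with a size x size window and stride 1; output has the shape of f."""
    f = np.asarray(f, dtype=float)
    pad = size // 2  # assumes an odd window size
    padded = np.pad(f, pad, mode="constant", constant_values=-np.inf)
    windows = sliding_window_view(padded, (size, size))
    return windows.max(axis=(2, 3))

def pyramid_pool_4p(f, sizes=(5, 3, 1)):
    """Concatenate stride-1 poolings of one channel, from coarse to fine scale."""
    return np.stack([max_pool_stride1(f, s) for s in sizes], axis=0)

f = np.arange(36).reshape(6, 6)
print(pyramid_pool_4p(f).shape)  # (3, 6, 6): one map per pooling size
```

Since the stride is 1, no spatial resolution is lost; the channel pooled with size 1 is the input itself, and the larger windows add progressively coarser context for every pixel.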
1.4. Rank-Based Average Pooling
Average pooling takes the mean of all activations in a region, including near-zero and negative ones, which can cause the loss of discriminative information by downplaying higher activation levels. Likewise, in max pooling, non-maximum activations are discarded, leading to information loss. A rank-based average pooling (RAP) layer can overcome the information loss imposed by both max pooling and average pooling layers ^{[5]}. The output of RAP can be stated as Equation (8):

$$s_{j} = \frac{1}{t} \sum_{i \in R_{j},\; r_{i} \le t} a_{i} \qquad (8)$$

The rank threshold, which determines which activations are used during averaging, is represented by t. R_{j} stands for pooling region j in the feature maps, and i stands for the index of each activation inside it. r_{i} and a_{i}, in this order, denote the rank of activation i and the value of activation i. When t = 1, RAP reduces to max pooling. According to Shi et al. ^{[6]}, setting t to a median value achieves good performance and a good balance between max pooling and average pooling. RAP therefore has better discriminative power than traditional pooling methods and is a good compromise between max and average pooling. Figure 3 illustrates rank-based average pooling in operation.
Figure 3. Rank-based average pooling: rankings are presented in ascending order, and activations for a pooling area are listed in descending order. The pooling output is calculated by averaging the four largest activations, since t = 4.
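The averaging over the t highest-ranked activations can be sketched in NumPy as follows. This is a minimal illustration that assumes non-overlapping k × k pooling regions and a feature map whose sides are divisible by k; the function name and the toy input are hypothetical.

```python
import numpy as np

def rank_based_average_pool(x, k, t):
    """Average the t largest activations in each non-overlapping k x k region."""
    x = np.asarray(x, dtype=float)
    h, w = x.shape  # assumes h and w are divisible by k
    # Gather each k x k region into the last axis: (h//k, w//k, k*k).
    regions = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    regions = regions.reshape(h // k, w // k, k * k)
    ranked = -np.sort(-regions, axis=-1)   # descending order: rank 1 is the largest
    return ranked[..., :t].mean(axis=-1)   # keep ranks 1..t and average them

x = np.arange(16).reshape(4, 4)
print(rank_based_average_pool(x, k=2, t=1))  # identical to max pooling
print(rank_based_average_pool(x, k=2, t=4))  # identical to average pooling
```

Intermediate values of t interpolate between the two extremes: with t = 2 in a 2 × 2 region, the output is the mean of the two largest activations.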
1.5. Max-Out Fractional Pooling
Fractional pooling is a modification of max pooling in which the pooling factor (α) can take non-integer values, for example between 1 and 2. The location and composition of the pooling regions are chosen randomly, and this randomness is the source of the uncertainty introduced by fractional max pooling. The pooling regions can be generated randomly or pseudorandomly, and can be overlapping or disjoint, and the method can be combined with dropout and training data augmentation. According to Graham ^{[7]}, fractional max pooling with overlapping pooling regions performs better than with disjoint ones. Furthermore, pseudorandom selection of the pooling regions combined with data augmentation gave better results than random selection.
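One way to generate pseudorandom pooling regions for a factor 1 < α < 2 is to take boundaries of the form a_i = ceil(α(i + u)) with a random offset u, so that consecutive boundaries differ by 1 or 2. The NumPy sketch below follows that construction on a single channel; the function names, the seed handling, and the toy sizes (25 to 18) are illustrative assumptions.

```python
import numpy as np

def pseudorandom_boundaries(n_in, n_out, u):
    """Region boundaries from a_i = ceil(alpha * (i + u)), alpha = n_in / n_out."""
    if not n_out < n_in < 2 * n_out:
        raise ValueError("requires 1 < n_in / n_out < 2")
    alpha = n_in / n_out
    a = np.ceil(alpha * (np.arange(n_out + 1) + u)).astype(int)
    steps = np.diff(a)  # every step is 1 or 2 because 1 < alpha < 2
    return np.concatenate(([0], np.cumsum(steps)))

def fractional_max_pool(x, n_out, overlap=True, seed=0):
    """Max-pool a 2-D map to n_out x n_out using pseudorandom regions."""
    rng = np.random.default_rng(seed)
    rows = pseudorandom_boundaries(x.shape[0], n_out, rng.random())
    cols = pseudorandom_boundaries(x.shape[1], n_out, rng.random())
    extra = 1 if overlap else 0  # overlapping regions share one border row/column
    out = np.empty((n_out, n_out), dtype=x.dtype)
    for i in range(n_out):
        for j in range(n_out):
            # Slices that run past the border are clipped by NumPy.
            region = x[rows[i]:rows[i + 1] + extra, cols[j]:cols[j + 1] + extra]
            out[i, j] = region.max()
    return out

x = np.random.default_rng(1).normal(size=(25, 25))
print(fractional_max_pool(x, 18).shape)  # (18, 18), i.e. alpha = 25/18
```

Drawing a new offset u for each forward pass changes which rows and columns fall into regions of size 1 and which into regions of size 2, which is where the regularizing randomness comes from.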
1.6. S3Pooling
In 2017, Zhai et al. presented the S3Pool method, a novel approach to pooling ^{[8]}. Under this scheme, the pooling process is performed in two steps. In step 1, max pooling with stride 1 is performed on each feature map obtained from the convolutional layer. In step 2, the outcome of step 1 is downsampled by a stochastic procedure, which first partitions the feature map of size x × y into a preset number of horizontal (h) and vertical (v) strips, where h = x/g, v = y/g, and g is the grid size; rows and columns are then chosen at random within the strips. Figure 4 illustrates the working of S3Pooling.
Figure 4. Working of the S3Pooling mechanism. (a) The feature map in this example is 4 × 4, with x = y = 4. The max pooling operation in step 1 uses stride 1, and there is no padding at the border. In step 2, the grid size and stride are both 2, which gives two horizontal (h) and two vertical (v) strips; stochastic downsampling then builds the output feature map from randomly chosen rows and columns. (b,c) The grid size in step 2 can be changed in order to control the distortion or stochasticity.
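The two steps can be sketched in NumPy as follows. This is a minimal single-channel illustration under stated assumptions: the step 1 window size of 2 is not given in the text, the parameter names `window`, `grid`, and `stride` are hypothetical, and a 5 × 5 input is used so that the unpadded stride-1 map is 4 × 4, as in the example of Figure 4.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def s3pool(x, window=2, grid=2, stride=2, seed=0):
    """Stride-1 max pooling followed by stochastic row/column selection per strip."""
    rng = np.random.default_rng(seed)
    # Step 1: max pooling with stride 1 and no padding at the border.
    dense = sliding_window_view(x, (window, window)).max(axis=(2, 3))
    n_rows, n_cols = dense.shape
    if n_rows % grid or n_cols % grid or grid % stride:
        raise ValueError("map size must be divisible by grid, and grid by stride")
    keep = grid // stride  # rows (or columns) kept from each strip

    def pick(n):
        # Step 2: split the axis into strips of length `grid` and choose
        # `keep` indices uniformly at random inside each strip.
        chosen = [np.sort(rng.choice(np.arange(s, s + grid), size=keep, replace=False))
                  for s in range(0, n, grid)]
        return np.concatenate(chosen)

    rows, cols = pick(n_rows), pick(n_cols)
    return dense[np.ix_(rows, cols)], rows, cols

x = np.arange(25, dtype=float).reshape(5, 5)
out, rows, cols = s3pool(x)
print(out.shape)  # (2, 2)
```

A larger grid size lets the selected rows and columns wander further from a regular lattice, which increases the distortion; setting the grid size equal to the stride, as here, keeps exactly one row and one column per strip.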
Xu et al. ^{[9]} ran tests on the CIFAR-10, CIFAR-100, and SIT datasets using both network in network (NIN) and residual network (ResNet) architectures to test the effectiveness of S3Pool in comparison to other pooling techniques. According to the experimental observations, S3Pool showed better performance than NIN and ResNet with dropout and stochastic pooling, even when flipping and cropping were used as data augmentation techniques during the testing phase.
1.7. Methods to Preserve Critical Information When Pooling
Improper pooling techniques can lead to information loss, especially in the early stages of the network. This loss of information can limit learning and reduce model quality ^{[10]}^{[11]}. Detail-preserving pooling (DPP) ^{[12]} and local importance-based pooling (LIP) ^{[13]} minimize potential information loss by preserving key features during pooling operations. These approaches are also known as soft approaches. Large networks require a lot of memory and cannot be deployed on devices with limited resources. One way to solve this problem is to downsample quickly in order to reduce the number of layers in the network. However, the large and rapid reduction of the feature maps can cause information loss and, as a result, poor performance. RNNPool ^{[14]}^{[15]} attempts to solve this problem using a recurrent downsampling network. The first recurrent network highlights feature maps, and the second recurrent network summarizes its results as the pooling output.
2. Advantages and Disadvantages of Pooling Approaches
The advantages and disadvantages of pooling operations in numerous CNN-based architectures are discussed in Table 1, which should help researchers understand the trade-offs and make their choices accordingly. Max pooling has been applied by many researchers owing to its simplicity of use and effectiveness. A detailed analysis was performed for further clarification of the topic.
Table 1. Advantages and disadvantages of different pooling approaches in CNNs.
Performance Evaluation of Popular Pooling Methods
This section systematically investigates the performance of the most recent pooling methods for image classification. It should be emphasized that the aim is to fairly assess the influence of the pooling strategies in CNNs, not to establish the optimum classification architecture. Table 2 evaluates the effectiveness of different pooling approaches on standard datasets, including MNIST, CIFAR-10, and CIFAR-100, and lists the architectures and activation functions used to implement these techniques. Table 2 shows that, for the MNIST dataset, average pooling performed the worst, with an error rate of 0.83%. In comparison to other pooling methods, gated pooling, which responsively combines average and max pooling, was a significant improvement. Gated pooling was followed, in order, by mixed pooling (at a difference of 0.01%), tree max-average pooling, and fractional max pooling. The effectiveness of these pooling strategies supports their strong regularization and generalization capabilities. The NIN and maxout networks also showed strong performance, with error rates of 0.45% and 0.47%, respectively. However, their performance still fell short of what the pooling methods achieved. For the MNIST dataset, using the same network with ReLU activation, rank-based stochastic pooling (RSP) gave a higher error rate than stochastic pooling; across activation functions, the RSP error rates ranged from 0.42% to 0.59%.
Table 2. Comparing performance of various pooling methods on different standard datasets.
| Pooling Methods | Architecture | Activation Function | MNIST Error (%) | CIFAR-10 Error (%) | CIFAR-100 Error (%) | Accuracy | Reference |
|---|---|---|---|---|---|---|---|
| Gated Method | 6 Convolutional Layers | ReLU | 0.29 | 7.90 | 33.22 | 88% (Rotation Angle) | ^{[32]} |
| Mixed Pooling | 6 Convolutional Layers | ReLU | 0.30 | 8.01 | 33.35 | 90% (Translation Angle) | ^{[32]} |
| Max Pooling | 6 Convolutional Layers | ReLU | 0.32 | 7.68 | 32.41 | 93.75% (Scale Multiplier) | ^{[32]} |
| Max + Tree Pooling | 6 Convolutional Layers | ReLU | 0.39 | 9.28 | 34.75 | 93.75% (Scale Multiplier) | ^{[32]} |
| Mixed Pooling | 6 Convolutional Layers (Without Data Augmentation) | ReLU | 10.41 | 12.61 | 37.20 | 91.5% | ^{[33]} |
| Stochastic Pooling | 3 Convolutional Layers | ReLU | 0.47 | 15.26 | 42.58 | - | ^{[31]} |
| Average Pooling | 6 Convolutional Layers | ReLU | 0.83 | 19.38 | 47.18 | - | ^{[31]} |
| Rank-Based Average Pooling (RAP) | 3 Convolutional Layers | ReLU | 0.56 | 18.28 | 46.24 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | 3 Convolutional Layers | ReLU | 0.56 | 19.28 | 48.54 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | 3 Convolutional Layers | ReLU | 0.59 | 17.85 | 45.48 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) | 3 Convolutional Layers | ReLU (Parametric) | 0.56 | 18.58 | 45.86 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | 3 Convolutional Layers | ReLU (Parametric) | 0.53 | 18.96 | 47.09 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | 3 Convolutional Layers | ReLU (Parametric) | 0.42 | 14.26 | 44.97 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) | 3 Convolutional Layers | Leaky ReLU | 0.58 | 17.97 | 45.64 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | 3 Convolutional Layers | Leaky ReLU | 0.56 | 19.86 | 48.26 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | 3 Convolutional Layers | Leaky ReLU | 0.47 | 13.48 | 43.39 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) | Network in Network (NIN) | Leaky ReLU | - | 9.48 | 32.18 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | Network in Network (NIN) | Leaky ReLU | - | 9.34 | 32.47 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | Network in Network (NIN) | Leaky ReLU | - | 9.84 | 32.16 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) | Network in Network (NIN) | ReLU | - | 9.84 | 34.85 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | Network in Network (NIN) | ReLU | - | 10.62 | 35.62 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | Network in Network (NIN) | ReLU | - | 9.48 | 36.18 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) | Network in Network (NIN) | ReLU (Parametric) | - | 8.75 | 34.86 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) | Network in Network (NIN) | ReLU (Parametric) | - | 8.94 | 37.48 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) | Network in Network (NIN) | ReLU (Parametric) | - | 8.62 | 34.36 | - | ^{[6]} |
| Rank-Based Average Pooling (RAP) (Includes Data Augmentation) | Network in Network (NIN) | ReLU | - | 8.67 | 30.48 | - | ^{[6]} |
| Rank-Based Weighted Pooling (RWP) (Includes Data Augmentation) | Network in Network (NIN) | Leaky ReLU | - | 8.58 | 30.41 | - | ^{[6]} |
| Rank-Based Stochastic Pooling (RSP) (Includes Data Augmentation) | Network in Network (NIN) | ReLU (Parametric) | - | 7.74 | 33.67 | - | ^{[6]} |
| - | Network in Network | ReLU | 0.49 | 10.74 | 35.86 | - | ^{[6]} |
| - | Supervised Network | ReLU | - | 9.55 | 34.24 | - | ^{[6]} |
| - | Max out Network | ReLU | 0.47 | 11.48 | - | - | ^{[6]} |
| Mixed Pooling | Network in Network (NIN) | ReLU | 16.01 | 8.80 | 35.68 | 92.5% | ^{[17]} |
| Mixed Pooling | VGG (GOFs Learned Filter) | ReLU | 10.08 | 6.23 | 28.64 | 92.5% | ^{[17]} |
| Fused Random Pooling | 10 Convolutional Layers | ReLU | - | 4.15 | 17.96 | 87.3% | ^{[1]} |
| Fractional Max Pooling | 11 Convolutional Layers | Leaky ReLU | 0.50 | - | 26.49 | - | ^{[2]} |
| Fractional Max Pooling | Convolutional Layer Network (Sparse) | Leaky ReLU | 0.23 | 3.48 | 26.89 | - | ^{[2]} |
| S3Pooling | Network in Network (NIN) (Addition to Dropout) | ReLU | - | 7.70 | 30.98 | 92.3% | ^{[8]} |
| S3Pooling | Network in Network (NIN) (Addition to Dropout) | ReLU | - | 9.84 | 32.48 | 92.3% | ^{[8]} |
| S3Pooling | ResNet | ReLU | - | 7.08 | 29.38 | 84.5% | ^{[30]} |
| S3Pooling (Flip + Crop) | ResNet | ReLU | - | 7.74 | 30.86 | 84.5% | ^{[30]} |
| S3Pooling (Flip + Crop) | CNN With Data Augmentation | ReLU | - | 7.35 | - | 84.5% | ^{[30]} |
| S3Pooling (Flip + Crop) | CNN in Absence of Data Augmenting | ReLU | - | 9.80 | 32.71 | 84.5% | ^{[30]} |
| Wavelet Pooling | Network in Network | ReLU | - | 10.41 | 35.70 | 81.04% (CIFAR-100) | ^{[34]} |
| Wavelet Pooling | ALL-CNN | ReLU | - | 9.09 | - | - | ^{[34]} |
| Wavelet Pooling | ResNet | ReLU | - | 13.76 | 27.30 | 96.87% (CIFAR-10) | ^{[34]} |
| Wavelet Pooling | DenseNet | ReLU | - | 7.00 | 27.95 | - | ^{[34]} |
| Wavelet Pooling | AlphaMaxDenseNet | ReLU | - | 6.56 | 27.45 | - | ^{[34]} |
| Temporal Pooling | Global Pooling Layer | Softmax | - | - | - | 91.5% | ^{[35]} |
| Spectral Pooling | Attention-Based CNN, 2 Convolutional Layers | ReLU | 0.605 | 8.87 | - | Improved accuracy reported; percentage not stated | ^{[36]} |
| Mixed Pooling | 3 Convolutional Layers (Without Data Augmentation) | MBA (Multi Bias Nonlinear Activation) | - | 6.75 | 26.14 | - | ^{[37]} |
| Mixed Pooling | 3 Convolutional Layers (With Data Augmentation) | MBA (Multi Bias Nonlinear Activation) | - | 5.37 | 24.2 | - | ^{[37]} |
| Wavelet Pooling | 3 Convolutional Layers | ReLU | - | - | - | 99% (MNIST); 74.42 (CIFAR-10); 80.28 (CIFAR-100) | ^{[38]} |