1. Introduction
In recent years, deep learning
[1] has achieved remarkable success in various computer vision (CV) tasks, such as image classification
[2], object detection
[3,4], and semantic segmentation
[5]. However, deep learning (DL) models often suffer from heavy computational burdens due to large numbers of parameters and high-dimensional input data, limiting their practical applications
[6]. In particular, the proliferation of smart devices and IoT (Internet of Things) sensors has given rise to a pressing need for edge computing
[7], which enables computation near data sources or things. To deploy DL models on resource-limited edge devices, reducing model complexity has become a priority
[8,9]. Various techniques have been proposed for reducing the complexity of DL models, among which downsampling plays a crucial role
[10]. However, most existing downsampling methods tend to lose some detailed information
[11]. Thus, it remains a challenging problem to design a lightweight and efficient downsampling component that retains more semantic and detailed information at lower algorithmic complexity.
In many computer vision tasks, neural network downsampling is a crucial technique that is used to reduce the spatial resolution of the feature map
[11,12]. It can effectively reduce the computational and memory requirements of the network and expand the receptive field while retaining important information for subsequent processing
[13]. It reduces the spatial resolution by proportionally scaling down the width and height of feature maps, which can be achieved by selecting a subset of features or by aggregating the features in local regions. Downsampling can help to regularize the network and prevent overfitting by reducing the number of parameters and introducing some degree of spatial invariance
[14]. It can improve the efficiency of the neural network in processing large-scale complex data, such as remote sensing images and videos, and enable the DL models to operate on resource-limited devices
Pooling or subsampling of feature maps, such as Max Pooling or Strided Convolution, is a common downsampling operation in neural networks
[16]. However, most of these methods condense regional features into a single output, which suffers from several challenges, such as information loss and spatial bias
[11]. For instance, Max Pooling only retains the most distinguishable features
[17], and subsampling picks a portion of features randomly or according to rules
[18,19], while the slicing adopted in this work utilizes the full information in the input feature map, as shown in
Figure 1. Therefore, research on neural network downsampling remains an active area with room for optimization, and more efficient methods need to be developed to better retain feature information.
Figure 1.
Feature information retained by different downsampling methods.
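To make the contrast in Figure 1 concrete, the following sketch (assuming a PyTorch tensor layout; the shapes are arbitrary examples) compares Max Pooling, which keeps only one value per 2 × 2 window, with slicing (space-to-depth), which moves every value into the channel dimension and therefore discards nothing. It is an illustration of the general idea, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)  # (batch, channels, height, width)

# Max Pooling: the spatial size halves, but 3 of every 4 values are discarded.
pooled = F.max_pool2d(x, kernel_size=2, stride=2)          # -> (1, 3, 4, 4)

# Slicing (space-to-depth): the four pixels of every 2x2 block are stacked
# along the channel axis, so all input values are retained.
sliced = torch.cat([x[..., 0::2, 0::2], x[..., 1::2, 0::2],
                    x[..., 0::2, 1::2], x[..., 1::2, 1::2]], dim=1)  # -> (1, 12, 4, 4)

print(pooled.shape, sliced.shape)
```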
Upsampling also plays an important role in neural networks. It is often used for image super-resolution
[20], segmentation
[21], and generation
[22] tasks via the reconstruction of high-resolution feature maps during the decoding stage in the neural network
[23]. The main upsampling methods include interpolation-based upsampling such as the Nearest Neighbor, Bilinear, and Bicubic Interpolation methods
[24] and the Transposed Convolution
[25] and Sub-Pixel Convolutional
[26] methods. The simplest and fastest algorithm is Nearest Neighbor sampling, in which each pixel is copied four times to fill a 2 × 2 neighborhood; however, jagged edges are often introduced
[27]. The Bilinear and Bicubic Interpolation methods calculate new pixel values via the weighted averaging of the nearest pixels in the original image, providing smoother results than Nearest Neighbor Upsampling yet still introducing some blurring
[28]. Transposed Convolution, also known as deconvolution or fractionally strided convolution, is the reverse operation of convolution. It produces high-quality results but incurs an expensive computational burden
[29]. Sub-Pixel Convolutional Upsampling rearranges the feature maps via a periodic shuffling operator to increase the spatial resolution
[26]. It is fast and computationally efficient; however, it may introduce some artifacts. Many existing downsampling techniques are combined with the above upsampling methods because it is difficult to implement an inverse transform for the generated low-dimensional spatial features
[11].
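For reference, the sketch below (assuming PyTorch; the channel counts are arbitrary examples) shows the upsampling operators discussed above, each enlarging an 8 × 8 feature map to 16 × 16: interpolation-based upsampling, Transposed Convolution, and Sub-Pixel Convolution via PixelShuffle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 16, 8, 8)  # low-resolution feature map

# Interpolation-based upsampling (no learnable parameters).
nearest = F.interpolate(x, scale_factor=2, mode="nearest")       # fast, jagged edges
bilinear = F.interpolate(x, scale_factor=2, mode="bilinear",
                         align_corners=False)                    # smoother, slightly blurred

# Transposed Convolution: learnable, higher quality, higher computational cost.
deconv = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)
y_deconv = deconv(x)                                             # -> (1, 16, 16, 16)

# Sub-Pixel Convolution: a convolution expands channels by r^2 (r = 2 here),
# then PixelShuffle rearranges them into space.
subpixel = nn.Sequential(nn.Conv2d(16, 16 * 4, kernel_size=1), nn.PixelShuffle(2))
y_subpixel = subpixel(x)                                         # -> (1, 16, 16, 16)

print(nearest.shape, bilinear.shape, y_deconv.shape, y_subpixel.shape)
```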
2. Downsampling in Neural Networks
To reduce model complexity, researchers have developed various downsampling methods that are tailored to specific tasks and architectures, including pooling-based
[17,19], subsampling-based
[16,30], patch-based
[31,32], and learnable pooling
[19,33] methods. In the early stages of neural network development, Maximum Pooling or Average Pooling were commonly adopted to achieve downsampling by taking the maximum or average value within a local window. These methods are fast and memory-efficient, yet room for improvement remains in terms of information retention
[11]. Some methods that combine Max and Average Pooling, such as Mixed Pooling
[34], exhibit better performance compared to a single method. Unlike Maximum Pooling, Average Pooling, and their variants, SoftPool exponentially weights the activations using Softmax (normalized exponential function) kernels to retain feature information
[12]. Wu et al.
[35] proposed pyramid pooling for the transformer architecture, which applies average pooling layers at different scales to generate pyramid feature maps, thus capturing powerful contextual features. There are also pooling methods that are designed to enhance the generalization of a model. For example, Fractional Pooling
[36], S3Pool
[13], and Stochastic Pooling
[30] can prevent overfitting by taking random samples in the pooling region. However, most pooling-based methods are hand-crafted nonlinear mappings which usually employ fixed, unlearnable prior knowledge
[37].
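For illustration, the following is a simplified sketch of the exponential weighting idea behind SoftPool, expressed with standard PyTorch pooling primitives rather than the authors' reference implementation: every activation in a window contributes in proportion to its softmax weight, instead of being kept or discarded outright.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    # exp(x) serves as the unnormalized weight of each activation.
    weights = torch.exp(x)
    # sum(w * x) / sum(w) over each window; the 1/N factors of avg_pool2d cancel.
    weighted = F.avg_pool2d(x * weights, kernel_size, stride)
    normalizer = F.avg_pool2d(weights, kernel_size, stride)
    return weighted / normalizer

x = torch.randn(1, 3, 8, 8)
print(soft_pool2d(x).shape)  # -> (1, 3, 4, 4)
```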
Nonlinear mapping can also be generated by overlaying complex convolutional layers and activation functions in a deep neural network (DNN)
[16]. When the network is shallow, pooling has some advantages; when the network goes deeper, multi-layer stacked convolutions can learn better nonlinear mappings than pooling and achieve better results
[38]. Therefore, Strided Convolution, which reduces spatial dimensionality by adjusting the stride to skip some pixels in the feature map, is generally used for downsampling in convolutional neural networks at present
[16]. Pooling and Strided Convolution have the advantage of extracting stronger semantic features, although at the cost of losing some detailed information
[39]. In contrast, the features extracted via passthrough downsampling
[40] have less semantic information but retain more detailed information. In transformer-based networks, patch-based downsampling is generally adopted
[31,32]. Patch merging is a method of reducing the number of tokens in transformer architectures that concatenates the features of each group of 2 × 2 neighboring patches and projects them with a linear layer
[32]. Patch-based methods perform poorly at capturing fine spatial structures and details, like edges and texture
[41]. Li et al.
[42] stacked the results of the Discrete Wavelet Transform in the channel dimension instead of directly stacking patches to prevent spatial domain distortion. Moreover, Lu et al.
[43] proposed a Robust Feature Downsampling Module by combining various techniques such as slicing, Max Pooling, and group convolution, achieving satisfactory results in remote sensing visual tasks.
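The sketch below (assuming PyTorch; the embedding dimension is an arbitrary example and the usual layer normalization is omitted) illustrates patch merging as described above: the features of every 2 × 2 group of neighboring patches are concatenated and reduced with a linear layer, halving the spatial resolution while doubling the channel dimension.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Projects the concatenated 4*dim features of each 2x2 group down to 2*dim.
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (batch, H, W, dim)
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (batch, H/2, W/2, 4*dim)
        return self.reduction(x)                 # (batch, H/2, W/2, 2*dim)

tokens = torch.randn(1, 8, 8, 96)
print(PatchMerging(96)(tokens).shape)            # -> (1, 4, 4, 192)
```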
In recent years, learnable weights have gradually been introduced into some advanced downsampling methods. Saeedan et al.
[19] proposed Detail Preserving Pooling methods that use learnable weights to emphasize spatial changes and preserve edges and texture details. Gao et al.
[33] proposed Local Importance-Based Pooling to retain important features based on weights learned by a local attention mechanism. Ma and Gu et al.
[44] proposed spatial attention pooling to learn feature weights and refine local features. Hesse et al.
[45] introduced a Content-Adaptive Downsampling method that downsamples only the non-critical regions learned by a network, effectively preserving detailed information in the regions of interest.
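As a rough illustration of such learned-weight downsampling (a simplified stand-in, assuming PyTorch, rather than any of the published implementations), the sketch below lets a 1 × 1 convolution predict per-pixel importance logits and computes an importance-weighted average over each 2 × 2 window.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedImportancePool(nn.Module):
    """Downsampling by an importance-weighted average with learnable importance."""

    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride
        # A 1x1 convolution learns an importance logit for every spatial position.
        self.logit = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.exp(self.logit(x))              # positive importance weights
        num = F.avg_pool2d(x * w, self.k, self.s) # weighted feature sum (scaled)
        den = F.avg_pool2d(w, self.k, self.s)     # weight sum (same scaling)
        return num / den                          # importance-weighted average

x = torch.randn(1, 16, 8, 8)
print(LearnedImportancePool(16)(x).shape)         # -> (1, 16, 4, 4)
```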
Recently, other studies proposed bi-directional pooling operations that can support both downsampling and upsampling operations, such as Liftpool
[46] and AdaPool
[11]. Liftpool decomposes the input into multiple sub-bands carrying different frequency information during downsampling and enables inverse recovery during upsampling
[46]. AdaPool uses two groups of pooling kernels to better retain the details of the original feature, and its learned weights can be used as prior knowledge for upsampling
[11]. These improved downsampling methods have demonstrated good performance gains, yet most of them still inevitably lose some feature information in the downsampling process.
3. Depthwise Separable Convolution
Depthwise separable convolution (DSConv)
[47] has gained significant attention in recent years due to its effectiveness at reducing the computational cost of convolutional layers in neural networks
[48,49]. An early work was proposed by Chollet in the Xception model
[44]. It replaced traditional Inception modules with DSConv and showed advanced performance on the ImageNet dataset with fewer parameters. Another remarkable work on DSConv is MobileNet, which builds faster and more efficient lightweight DNNs for mobile and embedded vision applications using DSConv
[50]. In addition, several studies integrated and improved DSConv. Drossos et al.
[51] combined DSConv and dilated convolutions for sound event detection. ShuffleNet uses channel shuffle to reduce computational costs in DSConv while maintaining or improving accuracy
[52]. Recently, a depthwise separable convolution attention module was proposed to focus on important information and capture the relationships of channels and spatial positions
[53]. Compared to standard convolutional layers, DSConv can significantly reduce the computational cost and memory requirements of a network while maintaining competitive model performance
[54,55]. This makes it a popular choice in modern neural network architectures, especially for mobile and embedded devices with limited computational resources
[56]. Overall, the above studies demonstrate the effectiveness of DSConv in terms of saving computational resources, as well as the potential ability to further improve model performance by combining DSConv with other techniques.
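The sketch below (assuming PyTorch; the channel counts are arbitrary examples) illustrates DSConv as a depthwise 3 × 3 convolution followed by a pointwise 1 × 1 convolution and compares its parameter count with that of a standard 3 × 3 convolution.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: a 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def n_params(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(64, 128)
# Standard 3x3 conv: 64*128*9 + 128 = 73,856 parameters;
# depthwise separable: (64*9 + 64) + (64*128 + 128) = 8,960 parameters.
print(n_params(standard), n_params(separable))
```

In this configuration the parameter count drops by roughly a factor of eight, which is the source of the computational savings reported in the studies above.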