Object detection in remote sensing images plays a pivotal role in airborne and satellite remote sensing, with invaluable applications. Remote sensing technology has witnessed remarkable progress, enabling the capture of copious details that reflect the contours, hues, textures, and other distinctive attributes of terrestrial targets; it has emerged as an indispensable avenue for acquiring comprehensive knowledge about the Earth’s surface. The primary objective of remote sensing image object detection is to precisely identify and locate objects of interest within the vast expanse of remote sensing images. This task finds extensive application across significant domains, including military reconnaissance, urban planning, environmental monitoring, soil science, and maritime vessel surveillance. With the continual advancement of observational techniques, the availability of high-quality remote sensing image datasets, encompassing richer and more intricate information, has unlocked immense developmental potential for the ongoing pursuit of remote sensing image object detection.
1. Introduction
In the past decade, deep learning has undergone rapid advancements and progressively found applications in diverse fields, including speech recognition, natural language processing, and computer vision. Computer vision technology has been widely implemented in intelligent security, autonomous driving, remote sensing monitoring, healthcare and pharmaceuticals, agriculture, intelligent transportation, and information security [8,9,10,11,12,13,14]. Within computer vision, tasks can be classified into image classification [15], object detection [16], and image segmentation [17]. Notably, object detection, a pivotal branch of computer vision, has made remarkable strides during this period, largely attributed to the availability of extensive object detection datasets. Datasets such as MS COCO [18], PASCAL VOC [19], and VisDrone [20,21] have played a crucial role in facilitating breakthroughs in object detection tasks.
Nevertheless, in the realm of optical remote sensing imagery, current object detection algorithms still encounter numerous formidable challenges. These difficulties arise from disparities between the acquisition methods used for optical remote sensing imagery and those employed for natural images. Remote sensing imagery relies on sensors such as optical, microwave, or laser devices to capture Earth’s surface information by detecting and recording radiation or reflection across different spectral ranges. Conversely, natural images are captured using electronic devices (e.g., cameras) or sensors to record visible light, infrared radiation, and other forms of radiation present in the natural environment, thereby acquiring everyday image data. Unlike natural images captured horizontally by ground cameras, satellite images taken from an aerial perspective provide extensive imaging coverage and comprehensive information. In complex landscapes and urban environments, elevated structures and an uneven distribution of background information can pose additional challenges [22]. Furthermore, owing to their imaging method, remote sensing images encompass a wealth of information about diverse target objects. Consequently, these images frequently exhibit numerous overlapping targets of varying scales, such as ships and ports, which often appear at arbitrary orientations [23]. This requires models designed for detecting remote sensing targets to possess highly perceptive localization ability [24] while also remaining sensitive to informative details during detection. Additionally, the prevalence of small target instances in remote sensing images, some of which may consist of only a few pixels, poses significant challenges for the model’s feature extraction [25], resulting in performance degradation. Moreover, certain target instances in remote sensing images, such as flyovers and bridges, share strikingly similar features, intensifying the difficulty of feature extraction [26] and consequently leading to false or missed detections. Target instances with extreme aspect ratios [27], such as highways and sea-crossing bridges, further exacerbate the challenges faced by the detector. Lastly, the complex background information within remote sensing images often causes target regions to be occluded by irrelevant backgrounds, making it difficult for the detector to extract target-specific features [28]. The imaging process is also subject to environmental conditions on Earth’s surface [29], including atmospheric interference, cloud cover, and vegetation obstruction, which may cause target occlusion and overlap, impeding the detector’s ability to accurately delineate object contours [30] and consequently compromising the precise localization of target information. As a consequence, remote sensing images require calibration and preprocessing [31]. Furthermore, many advanced detectors currently achieve exceptional performance in remote sensing object detection by enlarging the depth and width of their neural network models. However, this achievement comes at the cost of a substantial increase in model parameters. Remote sensing platforms such as unmanned aerial vehicles and satellites, for instance, cannot practically be equipped with hardware possessing the computational power that such models demand. As a result, lightweight design for remote sensing object detection lags behind its progress in the natural image domain. Hence, effectively balancing model detection performance against lightweight design becomes an immensely valuable research question.
Deep learning-based object detection algorithms can be broadly classified into two categories. The first category consists of two-stage object detection algorithms that rely on candidate regions. These algorithms generate potential regions [32,33] and then perform classification and position regression [34,35], achieving high-precision object detection. Representative algorithms in this category include R-CNN [36], Faster R-CNN [37], Mask R-CNN [38], and Sparse R-CNN [39]. While these algorithms achieve high accuracy, their slower speed prevents real-time detection on all devices. The second category comprises single-stage object detection networks based on regression. These algorithms directly predict the position and class of objects from input images using a single network, avoiding the complex process of generating candidate regions and achieving faster detection speeds. The main representative networks in this category include SSD [40] and the YOLO series [41,42,43,44,45,46]. Among them, the YOLO series of single-stage detection algorithms is widely used; currently, YOLOv5 strikes a balanced performance within the series.
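To make the two-stage paradigm concrete, the following minimal sketch runs torchvision’s pretrained Faster R-CNN on a single image. It assumes a recent PyTorch/torchvision installation, and the image file name is hypothetical; a single-stage model such as SSD or YOLO would replace the internal propose-then-refine pipeline with one dense prediction pass.

```python
# Minimal two-stage detection sketch using torchvision's pretrained Faster R-CNN.
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode

img = convert_image_dtype(read_image("scene.jpg"), torch.float)  # hypothetical file
with torch.no_grad():
    # Stage 1 (inside the model): the region proposal network generates candidate boxes.
    # Stage 2: RoI heads classify each candidate and regress its coordinates.
    predictions = model([img])[0]

keep = predictions["scores"] > 0.5  # simple confidence threshold
print(predictions["boxes"][keep], predictions["labels"][keep])
```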
The YOLO object detection model, proposed by Redmon et al. [47], achieves high-precision object detection while ensuring real-time inference. However, the individual training of each module in the original YOLO model compromises inference speed; the concept of joint training was therefore introduced in YOLOv2 [48] to enhance it. The Darknet-53 backbone architecture, first introduced in YOLOv3 [49], combines the strengths of ResNet to ensure highly expressive feature representation while avoiding the gradient issues caused by excessive network depth. Additionally, multi-scale prediction techniques were employed to better adapt to objects of various sizes and shapes. In YOLOv4 [50], the CSPDarknet53 feature extraction backbone integrated a cross-stage partial (CSP) network architecture, effectively addressing information redundancy within the backbone and significantly reducing the model’s parameter count, thereby improving overall inference speed. Moreover, the Spatial Pyramid Pooling (SPP) module introduced in YOLOv4 helps expand the receptive field of the feature maps, further enhancing detection accuracy. YOLOv5, in turn, strikes a balance in detection performance within the YOLO series. It employs CSPDarknet as the backbone for feature extraction and adopts the FPN (Feature Pyramid Network) [51] approach for semantic transmission in the neck, incorporating multiple feature layers of different resolutions at the top of the backbone; convolutional and upsampling operations fuse the feature maps and align their scales, while the PANet (Path Aggregation Network) [52] adds a bottom-up path that strengthens localization. The YOLOv5 model has achieved favorable outcomes in natural image object detection tasks, but its effectiveness diminishes when applied to remote sensing satellite imagery, where it struggles to meet both real-time and accuracy requirements.
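The top-down fusion described above can be sketched in a few lines. The following is a minimal, classical FPN-style step in PyTorch; the channel sizes, nearest-neighbor upsampling, and additive fusion are illustrative assumptions, not YOLOv5’s exact neck, which uses its own CSP-based fusion blocks with concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """One FPN-style top-down step: upsample the coarse map and fuse it
    with the finer lateral map (channel counts here are illustrative)."""
    def __init__(self, top_channels, lateral_channels, out_channels=256):
        super().__init__()
        self.top_proj = nn.Conv2d(top_channels, out_channels, kernel_size=1)
        self.lat_proj = nn.Conv2d(lateral_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, top, lateral):
        # Align scales by upsampling, align channels with 1x1 convs, then merge.
        top = F.interpolate(self.top_proj(top), size=lateral.shape[-2:], mode="nearest")
        return self.smooth(top + self.lat_proj(lateral))

# Example: fuse a 20x20 semantically strong map into a 40x40 mid-level map.
p5 = torch.randn(1, 512, 20, 20)
c4 = torch.randn(1, 256, 40, 40)
p4 = TopDownFusion(512, 256)(p5, c4)
print(p4.shape)  # torch.Size([1, 256, 40, 40])
```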
2. Traditional Object Detection in Remote Sensing Images
In the initial stages, object detection algorithms relied heavily on manually designed features, given the absence of effective learned image representations. Due to the limitations of image encoding, these methods required intricate feature representation schemes alongside various optimization techniques to accommodate the constraints of available computational resources. Early approaches followed a common pipeline: pre-processing the target images, selecting relevant areas of interest [55], extracting distinctive attributes [56], and applying classifiers for categorization [57]. First, superfluous details irrelevant to the detection task were filtered out through image pre-processing, streamlining the data by retaining only the most essential visual elements. To localize potential regions where objects might be present, the sliding window technique was employed. By applying the Histogram of Oriented Gradients (HOG) algorithm [58], a diverse set of features including color, texture [59], shape [60], and spatial relationships [61] was extracted from these regions. Finally, the extracted features were transformed into vector representations and classified with an appropriate classifier. However, the large number of candidate regions involved in feature extraction significantly increased computational complexity and produced redundant calculations. Moreover, manually engineered features demonstrated limited robustness and proved inadequate in complex and dynamic environments. Consequently, for object detection in remote sensing imagery, traditional machine learning-based methods have gradually been superseded by more efficient deep learning approaches, which are now the primary choice.
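A minimal sketch of this classical pipeline is shown below, using scikit-image’s HOG descriptor and a linear SVM from scikit-learn. The 64-pixel window, 16-pixel stride, and the already-trained classifier `clf` are hypothetical assumptions for illustration.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def sliding_window_detect(image, clf, win=64, stride=16, threshold=0.0):
    """Classical detection: slide a window over a grayscale image, extract
    HOG features, and score each patch with a linear SVM. `clf` is assumed
    to be a LinearSVC already trained on HOG vectors of win x win patches."""
    detections = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2))
            score = clf.decision_function(feat.reshape(1, -1))[0]
            if score > threshold:
                detections.append((x, y, win, win, score))
    return detections  # every surviving window is a candidate detection
```

The nested loops make the cost obvious: the HOG descriptor is recomputed for every window position, which is precisely the redundant computation that motivated the move to learned, shared convolutional features.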
3. Object Detection Based on Deep Learning Method in Remote Sensing Images
The field of deep learning has propelled neural networks to become integral components of modern target detection methods. Leveraging the powerful feature extraction capabilities of neural networks, deep learning-based algorithms have found widespread application in remote sensing imagery. However, traditional target detection algorithms face challenges in achieving optimal performance due to complex backgrounds, varying target sizes, object overlap and occlusion, and the prevalence of small-scale targets in remote sensing images [62]. To address these complexities, researchers have introduced innovative techniques. MashFormer [63] presents a hybrid detector that integrates a multi-scale perception convolutional neural network (CNN) with Transformers. This integration captures relationships between long-range features, enhancing expressiveness in complex background scenarios and improving target detection across different scales. Considering the diverse orientations of objects in remote sensing images, Li et al. [64] propose an adaptive point learning method. By utilizing adaptive points as fine-grained representations, this method effectively captures the geometric key features of objects aligned in any direction, even amidst clutter and non-axis-aligned circumstances. Addressing the discontinuity of object boundary detection, Yang et al. [65] introduce a novel regression loss based on the Gaussian Wasserstein distance (GWD). This loss aligns the training objective with detection accuracy, enabling efficient model learning through backpropagation. For the problem of detecting small targets, Zhao et al. [66] suggest incorporating detection heads specifically designed for such targets. They also propose a cross-layer asymmetric Transformer module that leverages minute pathways to enrich the features of small objects, improving the effectiveness of small target detection while reducing model complexity. To combat the specific image degradation induced by remote sensing imaging techniques, Niu et al. [67] propose an effective feature enhancement (EFE) block. This block integrates a non-local means filtering method to address issues such as weak target energy and low image signal-to-noise ratio, enhancing feature quality. Yan et al. [68] devised a detection network called LssDet, which not only ensures accurate target detection but also reduces model complexity, enhancing feature extraction specifically for small targets. Furthermore, CenterNet [69] and CornerNet [70] improve target detection speed through methodologies that detect center points and corner points, respectively.
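To make the GWD idea concrete: a rotated box is modeled as a 2-D Gaussian, and the regression penalty is the 2-Wasserstein distance between the predicted and ground-truth Gaussians. The NumPy/SciPy sketch below computes that distance in its simplified squared form; the published loss additionally applies a nonlinear transformation of this distance, which is omitted here.

```python
import numpy as np
from scipy.linalg import sqrtm

def box_to_gaussian(cx, cy, w, h, theta):
    """Map a rotated box (center, size, angle in radians) to a 2-D Gaussian."""
    mu = np.array([cx, cy], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    # Half-extents along the box axes become the principal standard deviations.
    sigma = R @ np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2]) @ R.T
    return mu, sigma

def gwd2(box_pred, box_gt):
    """Squared 2-Wasserstein distance between the Gaussians of two rotated boxes."""
    mu1, s1 = box_to_gaussian(*box_pred)
    mu2, s2 = box_to_gaussian(*box_gt)
    s1_half = np.real(sqrtm(s1))            # sqrtm may carry tiny imaginary noise
    cross = np.real(sqrtm(s1_half @ s2 @ s1_half))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2.0 * cross))

# Example: same box rotated by 30 degrees -> small positive distance.
print(gwd2((0, 0, 4, 2, 0.0), (0, 0, 4, 2, np.pi / 6)))
```

Because the distance varies smoothly with the angle, the loss avoids the abrupt jumps that plain angle regression suffers at boundary orientations.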
On the whole, these advancements contribute to the ongoing improvement of target detection in remote sensing imagery. However, the significant gains in detection accuracy achieved by existing methods come at the cost of substantial computation and parameter counts. Current approaches therefore struggle to strike a harmonious balance between lightweight design and detection performance, and their efficacy diminishes when deployed for real-time detection or on mobile devices. It thus becomes crucial to address the pressing issue of reconciling remote sensing detection performance with the need for lightweight models.
4. The Attention Mechanism
The attention mechanism is a widely employed technique in deep learning that plays a role similar to human attention: it focuses on the most important and relevant parts of the information during processing. By mimicking human visual or cognitive attention, this mechanism helps models emphasize crucial information, enabling neural networks to adapt perceptively to visual tasks and dynamically adjust their focus on inputs. Attention mechanisms currently find extensive application in various tasks, including image classification [71], image semantic segmentation [72], object detection [73], natural language processing [74], medical image processing [75], and image generation [76]. The Recurrent Attention Model (RAM) [77] was the first to apply attention mechanisms to deep neural networks. Attention mechanisms can be categorized into several types: channel attention, spatial attention, hybrid attention, temporal attention, and branch attention. Channel attention automatically learns an attention weight for each channel and adjusts channel responses accordingly. SENet [78] was the pioneering work on channel attention, collecting global information through squeeze-and-excitation to capture channel-wise statistics and enhance feature representation and discrimination. Spatial attention automatically learns the importance of each spatial position within an image and adjusts positional weights accordingly. The Spatial Transformer Network (STN) [79] is a representative method that spatially transforms various deformable data and automatically captures features from important regions. GENet [80] implicitly utilizes sub-networks to predict soft masks for selecting significant regions. Hybrid attention combines channel and spatial attention. Notable algorithms include DANet [81], which introduces both channel and spatial attention to capture global and contextual information by adaptively learning channel and spatial weights. Woo et al. [82] propose a lightweight attention mechanism, the Convolutional Block Attention Module (CBAM), which decouples spatial and channel attention to improve computational efficiency. The tremendous success of the Transformer model [83] in natural language processing (NLP) has brought attention to self-attention mechanisms, which have since been introduced into computer vision. Vision Transformers [84] and Swin Transformers [85], built on attention mechanisms, achieve excellent detection accuracy and speed without convolutional operations, showcasing the enormous potential of pure attention-based models in computer vision. However, due to the window-based approach Transformers employ for image processing, their computational complexity remains high, and performance on small targets remains unsatisfactory [86].
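For concreteness, the squeeze-and-excitation operation at the heart of SENet fits in a few lines. The PyTorch sketch below squeezes global spatial information into per-channel statistics and then excites each channel with a learned weight in (0, 1); the reduction ratio of 16 is a commonly used default, not a requirement.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (reduction=16 is a common default)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: global average pool -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # reweight each channel of the input

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

Spatial and hybrid attention variants such as CBAM follow the same pattern but additionally (or instead) learn a weight map over spatial positions.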
This entry is adapted from the peer-reviewed paper 10.3390/rs15204974