Object Detection of Remote Sensing Image

Object detection in remote sensing imagery plays a pivotal role in the analysis of airborne and satellite images and underpins a range of invaluable applications. Remote sensing technology has witnessed remarkable progress, enabling the capture of copious detail that inherently reflects the contours, hues, textures, and other distinctive attributes of terrestrial targets; it has emerged as an indispensable avenue for acquiring comprehensive knowledge about the Earth’s surface. The primary objective of remote sensing image object detection is to precisely identify and locate objects of interest within the vast expanse of remote sensing images. The task finds extensive application across significant domains, including military reconnaissance, urban planning, environmental monitoring, soil science, and maritime vessel surveillance. With the incessant advancement of observational techniques, the availability of high-quality remote sensing image datasets, encompassing richer and more intricate information, has unlocked immense developmental potential for remote sensing image object detection.

Keywords: object detection; remote sensing image; deep learning; computer vision; neural networks; attention mechanism

1. Introduction

In the past decade, deep learning has undergone rapid advancement and has progressively found applications in diverse fields, including speech recognition, natural language processing, and computer vision. Computer vision technology has been widely deployed in intelligent security, autonomous driving, remote sensing monitoring, healthcare and pharmaceuticals, agriculture, intelligent transportation, and information security [1][2][3][4][5][6][7]. Within computer vision, tasks can be classified into image classification [8], object detection [9], and image segmentation [10]. Notably, object detection, a pivotal branch of computer vision, has made remarkable strides during this period, largely owing to the availability of extensive object detection datasets: MS COCO [11], PASCAL VOC [12], and VisDrone [13][14] have played a crucial role in facilitating breakthroughs in object detection tasks.
Nevertheless, in the realm of optical remote sensing imagery, current object detection algorithms still encounter numerous formidable challenges. These difficulties arise from disparities between the acquisition methods used for optical remote sensing imagery and those employed for natural images. Remote sensing imagery relies on sensors such as optical, microwave, or laser devices to capture information about the Earth’s surface by detecting and recording radiation or reflection across different spectral ranges. Natural images, by contrast, are captured using electronic devices (e.g., cameras) or sensors that record visible light, infrared radiation, and other forms of radiation present in the natural environment, thereby acquiring everyday image data. Unlike natural images captured horizontally by ground-level cameras, satellite images taken from an aerial perspective provide extensive imaging coverage and comprehensive information.
In complex landscapes and urban environments, elaborate structures and an uneven distribution of background information pose additional challenges [15]. Furthermore, because of their imaging method, remote sensing images encompass a wealth of information about diverse target objects. Consequently, these images frequently contain numerous overlapping targets of varying scales, such as ships and ports, which are often arranged at arbitrary orientations rather than along a fixed direction [16]. This requires models for remote sensing target detection to possess strong localization ability [17] while remaining sensitive to informative details during detection. Additionally, the prevalence of small target instances, some of which may occupy only a few pixels, poses significant challenges for feature extraction [18], resulting in performance degradation. Moreover, certain target instances, such as flyovers and bridges, share strikingly similar features, which intensifies the difficulty of feature extraction [19] and leads to false or missed detections. Target instances with extreme aspect ratios [20], such as highways and sea-crossing bridges, further exacerbate the challenges faced by the detector. The complex background information within remote sensing images also often causes target regions to be occluded by irrelevant backgrounds, making it difficult for the detector to extract target-specific features [21]. In addition, remote sensing imaging is subject to environmental conditions on the Earth’s surface [22], including atmospheric interference, cloud cover, and vegetation obstruction, which may result in target occlusion and overlap, impeding the detector’s ability to accurately delineate object contours [23] and consequently compromising the precise localization of target information. As a consequence, remote sensing images require calibration and preprocessing [24].
Furthermore, at the current stage, many advanced detectors have achieved exceptional performance in remote sensing object detection by increasing the depth and width of their neural network models. However, this achievement comes at the cost of a substantial increase in model parameters.
For instance, remote sensing platforms such as unmanned aerial vehicles and remote sensing satellites cannot practically be equipped with mobile hardware whose computational power matches what such large models demand. As a result, the lightweight design of remote sensing object detectors lags behind its counterpart in the natural image domain. Hence, effectively balancing detection performance against lightweight design becomes an immensely valuable research question.
Deep learning-based object detection algorithms can be broadly classified into two categories. The first consists of two-stage algorithms that rely on candidate regions: they first generate potential regions [25][26] and then perform classification and position regression on them [27][28], achieving high-precision detection. Representative algorithms in this category include R-CNN [29], Faster R-CNN [30], Mask R-CNN [31], and Sparse R-CNN [32]. While these algorithms achieve high accuracy, their slower speed prevents real-time detection on many devices. The second category comprises single-stage, regression-based detection networks. These algorithms directly predict the position and class of objects from the input image with a single network, avoiding the costly process of generating candidate regions and achieving faster detection speeds. The main representatives include SSD [33] and the YOLO [34][35][36][37][38][39] series. Among them, the YOLO series of single-stage detectors is widely used; currently, YOLOv5 offers the most balanced performance in the series.
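As an illustration of how a two-stage detector of the first category is used in practice, the following minimal sketch runs a pretrained Faster R-CNN [30] from the torchvision library; the pretrained weights, score threshold, and input image path are illustrative assumptions rather than a setup drawn from the works cited above.

```python
# Minimal inference sketch with a two-stage detector (Faster R-CNN).
# The region proposal network first generates candidate regions; the
# second stage classifies them and refines their coordinates.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode

image = to_tensor(Image.open("scene.jpg").convert("RGB"))  # hypothetical input
with torch.no_grad():
    pred = model([image])[0]  # dict with "boxes", "labels", "scores"

keep = pred["scores"] > 0.5   # assumed confidence threshold
print(pred["boxes"][keep], pred["labels"][keep])
```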
The YOLO object detection model, proposed by Redmon et al. [40], achieves high-precision object detection while ensuring real-time inference. However, the separate training of individual modules in the original YOLO model limited its inference speed, so the concept of joint training was introduced in YOLOv2 [41] to improve it. The Darknet-53 backbone architecture, first introduced in YOLOv3 [42], combines the strengths of ResNet to ensure highly expressive feature representation while avoiding the gradient problems caused by excessive network depth; multi-scale prediction was also employed to better adapt to objects of various sizes and shapes. In YOLOv4 [43], the CSPDarknet53 feature extraction backbone integrated a cross-stage partial (CSP) network architecture, effectively addressing information redundancy within the backbone and significantly reducing the model’s parameter count, thereby improving overall inference speed. Moreover, the Spatial Pyramid Pooling (SPP) module introduced in YOLOv4 expands the receptive field of the feature maps, further enhancing detection accuracy (a sketch of such a block follows this paragraph). As for YOLOv5, it strikes a balance in detection performance within the YOLO series. It employs CSPDarknet as the backbone for feature extraction and adopts the FPN (Feature Pyramid Network) [44] approach for semantic transmission in the neck, incorporating multiple feature layers of different resolutions at the top of the backbone; convolutional and upsampling operations fuse the feature maps and align their scales, while the PANet (Path Aggregation Network) [45] adds a bottom-up path that strengthens the transmission of localization information. The YOLOv5 model has achieved favorable results in natural image object detection tasks, but its effectiveness diminishes on remote sensing satellite imagery, where it struggles to meet both real-time and accuracy requirements.
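To make the SPP idea concrete, here is a minimal PyTorch sketch of a YOLOv4/v5-style Spatial Pyramid Pooling block, assuming the commonly used kernel sizes (5, 9, 13); the channel widths and layer names are illustrative, not taken from any specific YOLO implementation.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial-Pyramid-Pooling sketch: parallel max-pooling at several
    kernel sizes (stride 1, 'same' padding) enlarges the receptive field
    without changing spatial resolution; the pooled maps are concatenated
    with the input and fused by a 1x1 convolution."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        self.fuse = nn.Conv2d(c_mid * (len(kernels) + 1), c_out,
                              kernel_size=1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

# Shape check: SPP(1024, 1024)(torch.randn(1, 1024, 20, 20)) keeps (1, 1024, 20, 20)
```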

2. Traditional Object Detection in Remote Sensing Images

In the initial stages, object detection algorithms relied heavily on manual feature design, given the absence of effective learned image representations. Because of the limitations of image encoding, these methods required intricate feature representation schemes along with various optimization techniques to fit within the available computational resources. Early approaches followed a common pipeline: pre-process the target image, select relevant regions of interest [46], extract distinctive attributes [47], and apply a classifier for categorization [48]. First, superfluous details irrelevant to the detection task were filtered out through image pre-processing, streamlining the data so that only the most essential visual elements were retained. To localize regions where objects might be present, the sliding window technique was employed. Features such as Histograms of Oriented Gradients (HOG) [49], color, texture [50], shape [51], and spatial relationships [52] were then extracted from these regions. Finally, the extracted features were transformed into vector representations and classified with an appropriate classifier (a sketch of this pipeline follows this paragraph). However, the large number of candidate windows involved in feature extraction significantly increased computational complexity and introduced redundant calculations. Moreover, manually engineered features showed limited robustness and proved inadequate in complex, dynamic environments. Consequently, for object detection in remote sensing imagery, traditional machine learning-based methods have gradually been superseded by more efficient deep learning approaches, which are now the primary choice.
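For concreteness, the classical pipeline just described can be sketched in a few lines of Python: sliding windows over the image are scored with HOG features [49] and a linear classifier. The window size, stride, and the pre-trained classifier `svm` are illustrative assumptions.

```python
from skimage.feature import hog   # gradient-orientation histograms
from sklearn.svm import LinearSVC

def sliding_window_detect(image, svm, win=(64, 64), step=16):
    """Score every window of a grayscale image with HOG + a linear SVM.

    `svm` is assumed to be a LinearSVC already fitted on HOG vectors of
    positive/negative example patches of size `win`.
    """
    detections = []
    h, w = image.shape
    for y in range(0, h - win[0] + 1, step):
        for x in range(0, w - win[1] + 1, step):
            patch = image[y:y + win[0], x:x + win[1]]
            feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2))
            if svm.decision_function(feat.reshape(1, -1))[0] > 0:
                detections.append((x, y, win[1], win[0]))  # (x, y, width, height)
    return detections  # every window is scored: the redundancy noted above
```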

3. Object Detection Based on Deep Learning Method in Remote Sensing Images

The field of deep learning has propelled neural networks to become integral components of modern target detection methods. Leveraging the powerful feature extraction capabilities of neural networks, deep learning-based algorithms have found widespread application in remote sensing imagery. However, target detection algorithms face challenges in achieving optimal performance due to complex backgrounds, varying target sizes, object overlap and occlusion, and the prevalence of small-scale targets in remote sensing images [53]. To address these complexities, researchers have introduced innovative techniques. MashFormer [54] presents a hybrid detector that integrates a multi-scale perception convolutional neural network (CNN) with Transformers; this integration captures relationships among long-range features, enhancing expressiveness in complex background scenarios and improving detection across scales. Considering the diverse orientations of objects in remote sensing images, Li et al. [55] propose an adaptive point learning method: by using adaptive points as a fine-grained representation, it captures the geometric key features of objects aligned in any direction, even amid clutter and non-axis-aligned circumstances. Addressing discontinuities in the regression of object boundaries, Yang et al. [56] introduce a novel regression loss based on the Gaussian Wasserstein distance (GWD), which aligns the loss with detection accuracy and enables efficient model learning through backpropagation (a sketch of the underlying distance follows this paragraph). For detecting small targets, Zhao et al. [57] add detection heads dedicated to such targets and propose a cross-layer asymmetric Transformer module that enriches the features of small objects, improving small-target detection while reducing model complexity. To combat the image degradation induced by remote sensing imaging techniques, Niu et al. [58] propose an effective feature enhancement (EFE) block that integrates non-local means filtering to address weak target energy and low image signal-to-noise ratio, enhancing feature quality. Yan et al. [59] devised a detection network called LssDet that ensures accurate target detection while reducing model complexity and strengthens feature extraction for small targets. Furthermore, CenterNet [60] and CornerNet [61] improve detection speed by detecting center points and corner points, respectively.
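As a concrete illustration of the GWD idea [56], the sketch below converts a rotated box (cx, cy, w, h, θ) into a 2-D Gaussian and evaluates the closed-form Wasserstein distance between two such Gaussians; for 2×2 covariances the coupling term reduces to Tr((Σ₁^{1/2}Σ₂Σ₁^{1/2})^{1/2}) = √(Tr(Σ₁Σ₂) + 2√(det Σ₁ det Σ₂)). This shows the distance itself under those assumptions, not the full loss formulation of the paper.

```python
import torch

def box_to_gaussian(box):
    """box: (..., 5) tensor of (cx, cy, w, h, theta in radians)."""
    cx, cy, w, h, t = box.unbind(-1)
    cos, sin = torch.cos(t), torch.sin(t)
    R = torch.stack([cos, -sin, sin, cos], dim=-1).reshape(*box.shape[:-1], 2, 2)
    S = torch.diag_embed(torch.stack([w, h], dim=-1) ** 2 / 4)  # axis variances
    return torch.stack([cx, cy], dim=-1), R @ S @ R.transpose(-1, -2)

def gwd(box1, box2, eps=1e-7):
    m1, S1 = box_to_gaussian(box1)
    m2, S2 = box_to_gaussian(box2)
    loc = (m1 - m2).pow(2).sum(-1)                        # squared centre distance
    tr = lambda M: M.diagonal(dim1=-2, dim2=-1).sum(-1)   # batched trace
    # 2x2 closed form for the coupling term (see lead-in above)
    cross = tr(S1 @ S2) + 2 * (torch.det(S1) * torch.det(S2)).clamp(min=eps).sqrt()
    d2 = loc + tr(S1) + tr(S2) - 2 * cross.clamp(min=eps).sqrt()
    return d2.clamp(min=eps).sqrt()

# Identical boxes give distance ~0; a rotation mismatch increases it smoothly:
# gwd(torch.tensor([0., 0., 4., 2., 0.]), torch.tensor([0., 0., 4., 2., 0.3]))
```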
On the whole, these advancements contribute to the continuing improvement of target detection in remote sensing imagery. However, despite the significant gains in detection accuracy achieved by existing methods, they come at the cost of substantial computation and parameterization, and current approaches struggle to strike a harmonious balance between lightweight design and detection performance. Consequently, their efficacy diminishes when deployed for real-time detection or on mobile devices. It therefore becomes crucial to address the pressing issue of reconciling remote sensing image detection performance with the need for lightweight models.

4. The Attention Mechanism

The attention mechanism is a widely employed technique in deep learning that plays a role similar to human attention: it focuses processing on the most important and relevant parts of the information. By mimicking human visual attention, this mechanism helps models emphasize crucial information, enabling neural networks to adapt perceptively to visual tasks and dynamically adjust their focus on the input. Attention mechanisms now find extensive application in tasks including image classification [62], image semantic segmentation [63], object detection [64], natural language processing [65], medical image processing [66], and image generation [67]. The Recurrent Attention Model (RAM) [68] was the first to apply attention mechanisms to deep neural networks. Attention mechanisms can be categorized into several types: channel attention, spatial attention, hybrid attention, temporal attention, and branch attention. Channel attention automatically learns an attention weight for each channel and adjusts channel responses accordingly; SENet [69] pioneered this idea, collecting global information through squeeze-and-excitation to capture channel-wise statistics and enhance feature representation and discrimination (a sketch follows this paragraph). Spatial attention learns the importance of each spatial position within an image and adjusts positional weights accordingly; the Spatial Transformer Network (STN) [70] is a representative method that spatially transforms deformable data and automatically attends to features from important regions, while GENet [71] uses sub-networks to predict soft masks that select significant regions. Hybrid attention combines channel and spatial attention; notable examples include DANet [72], which introduces both channel and spatial attention to capture global contextual information by adaptively learning channel and spatial weights, and the lightweight Convolutional Block Attention Module (CBAM) of Woo et al. [73], which decouples spatial and channel attention to improve computational efficiency. The tremendous success of the Transformer model [74] in natural language processing (NLP) has brought self-attention into computer vision: Vision Transformers [75] and Swin Transformers [76] achieve excellent detection accuracy and speed without convolutional operations, showcasing the enormous potential of pure attention-based models. However, owing to the window-based processing that Transformers employ for images, their computational complexity remains high, and performance on small targets is often unsatisfactory [77].
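To ground the channel-attention idea, the following minimal PyTorch sketch reproduces the squeeze-and-excitation structure of SENet [69]; the reduction ratio r = 16 follows the original paper, while the layer names are our own.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                 # per-channel weights in (0, 1)
        )

    def forward(self, x):                 # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (N, C)
        w = self.fc(s)                    # excite: learn channel importance
        return x * w[:, :, None, None]    # rescale the feature map channel-wise
```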

References

  1. Alem, A.; Kumar, S. Deep Learning Models Performance Evaluations for Remote Sensed Image Classification. IEEE Access 2022, 10, 111784–111793.
  2. Lu, F.; Han, M. Hyperspectral remote sensing image classification based on deep extreme learning machine. J. Dalian Univ. Technol. 2018, 58, 166–173.
  3. Guo, C.; Li, K.; Li, H.; Tong, X.; Wang, X. Deep Convolution Neural Network Method for Remote Sensing Image Quality Level Classification. Geomat. Inf. Sci. Wuhan Univ. 2022, 47, 1279–1286.
  4. Gu, Y.; Wang, Y.; Li, Y. A Survey on Deep Learning-Driven Remote Sensing Image Scene Understanding: Scene Classification, Scene Retrieval and Scene-Guided Object Detection. Appl. Sci. 2019, 9, 2110.
  5. Liu, D.; Han, L.; Han, X. High Spatial Resolution Remote Sensing Image Classification Based on Deep Learning. Acta Opt. Sin. 2016, 36, 0428001.
  6. Sun, X.; Wang, B.; Wang, Z.; Li, H.; Li, H.; Fu, K. Research Progress on Few-Shot Learning for Remote Sensing Image Interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2387–2402.
  7. Chen, Z.; Wang, Y.; Han, W.; Feng, R.; Chen, J. An Improved Pretraining Strategy-Based Scene Classification With Deep Learning. IEEE Geosci. Remote Sens. Lett. 2020, 17, 844–848.
  8. Aggarwal, A.; Kumar, V.; Gupta, R. Object Detection Based Approaches in Image Classification: A Brief Overview. In Proceedings of the 2023 IEEE Guwahati Subsection Conference (GCON), Guwahati, India, 23–25 June 2023; pp. 1–6.
  9. Liu, B.; Huang, J. Global-Local Attention Mechanism Based Small Object Detection. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, 12–14 May 2023; pp. 1439–1443.
  10. Shen, T.; Xu, H. Medical Image Segmentation Based on Transformer and HarDNet Structures. IEEE Access 2023, 11, 16621–16630.
  11. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context; Springer: Cham, Switzerland, 2014; pp. 740–755.
  12. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
  13. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399.
  14. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Wu, H.; Nie, Q.; Cheng, H.; Liu, C.; et al. VisDrone-VDT2018: The Vision Meets Drone Video Detection and Tracking Challenge Results. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 496–518.
  15. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861.
  16. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. Isprs J. Photogramm. Remote Sens. 2022, 184, 116–130.
  17. You, Y.; Ran, B.; Meng, G.; Li, Z.; Liu, F.; Li, Z. OPD-Net: Prow Detection Based on Feature Enhancement and Improved Regression Model in Optical Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6121–6137.
  18. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature Split-Merge-Enhancement Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616217.
  19. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Syst. Appl. 2023, 211, 118665.
  20. Cheng, G.; Si, Y.; Hong, H.; Yao, X.; Guo, L. Cross-Scale Feature Fusion for Object Detection in Optical Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 431–435.
  21. Dou, Z.; Gao, K.; Zhang, X.; Wang, H.; Wang, J. Improving Performance and Adaptivity of Anchor-Based Detector Using Differentiable Anchoring With Efficient Target Generation. IEEE Trans. Image Process. 2021, 30, 712–724.
  22. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
  23. Liu, X.; Li, Z.; Fu, X.; Yin, Z.; Liu, M.; Yin, L.; Zheng, W. Monitoring House Vacancy Dynamics in The Pearl River Delta Region: A Method Based on NPP-VIIRS Night-Time Light Remote Sensing Images. Land 2023, 12, 831.
  24. Ju, M.; Niu, B.; Jin, S.; Liu, Z. SuperDet: An Efficient Single-Shot Network for Vehicle Detection in Remote Sensing Images. Electronics 2023, 12, 1312.
  25. Yan, B.; Wang, D.; Lu, H.; Yang, X. Cooling-Shrinking Attack: Blinding the Tracker with Imperceptible Noises. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 987–996.
  26. Ji, L.; Yu-Xiao, N. Method of Insulator Detection Based on Improved Faster R-CNN. In Proceedings of the 2023 6th International Conference on Electronics Technology (ICET), Chengdu, China, 12–15 May 2023; pp. 1127–1133.
  27. Zhaowei, C.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
  28. Tsung-Yi, L.; Goyal, P.; Girshick, R.; Kaiming, H.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
  29. Cai, C.; Chen, L.; Zhang, X.; Gao, Z. End-to-End Optimized ROI Image Compression. IEEE Trans. Image Process. 2020, 29, 3442–3457.
  30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015.
  31. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
  32. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Yuan, Z.; Luo, P. Sparse R-CNN: An End-to-End Framework for Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023. Early Access.
  33. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
  34. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A Report on Real-Time Object Detection Design. arXiv 2022, arXiv:2211.15444.
  35. Adarsh, P.; Rathi, P.; Kumar, M. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model. In Proceedings of the 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March 2020; pp. 687–694.
  36. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
  37. Wang, C.-Y.; Bochkovskiy, A.; Mark Liao, H.-Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
  39. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-level Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; p. 13034.
  40. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  41. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  42. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  43. Bochkovskiy, A.; Wang, C.-Y.; Mark Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  44. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  45. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid Attention Network for Semantic Segmentation. arXiv 2018, arXiv:1805.10180.
  46. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
  47. Mikolajczyk, K.; Schmid, C. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1615–1630.
  48. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; pp. 511–518.
  49. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
  50. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  51. Yu, N.; Ren, H.; Deng, T.; Fan, X. Stepwise Locating Bidirectional Pyramid Network for Object Detection in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  52. Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
  53. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385.
  54. Wang, K.; Bai, F.; Li, J.; Liu, Y.; Li, Y. MashFormer: A Novel Multiscale Aware Hybrid Detector for Remote Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2753–2763.
  55. Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2844–2853.
  56. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021.
  57. Zhao, Q.; Liu, B.; Lyu, S.; Wang, C.; Zhang, H. TPH-YOLOv5++: Boosting Object Detection on Drone-Captured Scenarios with Cross-Layer Asymmetric Transformer. Remote Sens. 2023, 15, 1687.
  58. Niu, R.; Zhi, X.; Jiang, S.; Gong, J.; Zhang, W.; Yu, L. Aircraft Target Detection in Low Signal-to-Noise Ratio Visible Remote Sensing Images. Remote Sens. 2023, 15, 1971.
  59. Yan, G.; Chen, Z.; Wang, Y.; Cai, Y.; Shuai, S. LssDet: A Lightweight Deep Learning Detector for SAR Ship Detection in High-Resolution SAR Images. Remote Sens. 2022, 14, 5148.
  60. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577.
  61. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 765–781.
  62. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519.
  63. Luo, Z.; Zhou, C.; Zhang, G.; Lu, S. DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention. arXiv 2022, arXiv:2212.07849.
  64. Feng, X.; Han, J.; Yao, X.; Cheng, G. TCANet: Triple Context-Aware Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6946–6955.
  65. Oh, B.-D.; Schuler, W. Entropy- and Distance-Based Predictors From GPT-2 Attention Patterns Predict Reading Times Over and Above GPT-2 Surprisal. arXiv 2022, arXiv:2212.11185.
  66. Illium, S.; Mueller, R.; Sedlmeier, A.; Popien, C.-L. Visual Transformers for Primates Classification and Covid Detection. In Proceedings of the Interspeech Conference, Brno, Czech Republic, 30 August–3 September 2021; pp. 451–455.
  67. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 2048–2057.
  68. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014.
  69. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
  70. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015.
  71. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018.
  72. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149.
  73. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  74. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
  75. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  76. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
  77. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318.