Encoder–Decoder Architecture and Kernel-Sharing Mechanism

As unmanned aerial vehicles (UAVs) are deployed ever more widely, accidental crashes in everyday use increasingly cause personal injury, property damage, and the loss or destruction of the UAVs themselves. To reduce such accidents, a UAV must be able to autonomously select a safe landing area in an emergency, and the key to this capability is on-board, real-time semantic segmentation.

Keywords: UAVs; semantic segmentation; emergency landing; scale change; real-time processing

1. Introduction

Unmanned aerial vehicles (UAVs) are now widely used in fields such as remote sensing, geological exploration, search and rescue, agriculture and forestry, and film and media production [1], and both their range of applications and their frequency of use continue to grow. When a UAV encounters a sudden in-flight emergency such as low battery power or signal loss, it must be able to perform an accurate autonomous emergency landing.
In recent decades, with the rapid development of visual navigation technology [2][3][4], researchers have turned to vision-based autonomous landing techniques. Generally, UAV landing based on visual navigation requires optical cameras that capture high-resolution images of the surroundings, allowing the navigation system to accurately identify ground features, obstacles, and other important navigational reference points. Alam et al. [5] proposed a method for detecting landmark obstacles in small-scale safety detection by combining different image processing and safe landing area detection (SLAD) algorithms. However, the method assumes only a small number of obstacles and is unreliable for visual guidance in complex terrain and scenes. Respall et al. [6] proposed an autonomous visual detection, tracking, and landing system that can detect and track unmanned ground vehicles (UGVs) online without artificial markers. However, the UAV must operate in the vicinity of a moving UGV, and in practice it is difficult for UAVs to communicate with ground stations in congested areas or disaster zones. Symeonidis et al. [7] proposed a UAV navigation method for safe landings that relies on a lightweight computer vision module able to run within the limited computational resources of the UAV. However, the model suffers from overfitting and limited generalization, and it is unsuitable for complex scenes or for landings with large altitude drops: the greater the drop in landing altitude, the greater the scale change of ground targets, which leads to misjudgment during landing.
Traditional UAV landing methods rely mainly on the UAV's own sensors and controllers to perceive fixed landing markers and execute an emergency landing; however, identifying and spatially localizing fixed ground markers from a high flight altitude places strict demands on small-target recognition accuracy.

2. Encoder–Decoder Architecture

Semantic segmentation is one of the three basic tasks of computer vision: it assigns a label to each pixel in an image [8][9]. In recent years, with the development of deep learning, many semantic segmentation algorithms have been applied in fields including smart cities, industrial production, remote sensing image processing, and medical image processing [10][11][12][13][14]. Existing methods usually rely on a convolutional encoder–decoder architecture, in which the encoder produces low-resolution image features and the decoder up-samples those features and maps them to per-pixel class scores. U-Net [15] builds contraction and expansion paths on the encoder–decoder framework and fuses low-level and high-level semantic information through skip connections. DeepLabV3+ [16] uses atrous (dilated) convolution and larger convolution kernels to enlarge the receptive field. The Pyramid Scene Parsing Network (PSPNet) [17] adds a pyramid pooling module on top of a dilated backbone to capture local and global context. SegNet [18] uses an encoder–decoder architecture to recover high-resolution feature maps. MKANet [19], also a typical encoder–decoder design, uses a parallel shallow architecture to improve inference speed while supporting large image-block inputs. In addition, following the great success of transformers in Natural Language Processing (NLP), researchers have begun to explore their application to computer vision tasks. Vision Transformer (ViT) [20] adopts a fully transformer-based design for image classification. SegFormer [21] combines a transformer with a lightweight multilayer perceptron (MLP) decoder to efficiently fuse locally and globally focused information, giving it strong feature representation capability.
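To make the encoder–decoder pattern concrete, the following is a minimal PyTorch sketch in the spirit of U-Net [15]: the encoder downsamples to low-resolution features, the decoder upsamples them, and a skip connection fuses low-level detail with high-level semantics before per-pixel class scores are produced. The class name, channel widths, and depth are illustrative assumptions, not taken from any of the cited models.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution + BatchNorm + ReLU: the basic unit of both paths.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyEncoderDecoder(nn.Module):
    """Minimal U-Net-style encoder-decoder for semantic segmentation."""
    def __init__(self, num_classes):
        super().__init__()
        self.enc1 = conv_block(3, 32)              # high-res, low-level
        self.down = nn.MaxPool2d(2)
        self.enc2 = conv_block(32, 64)             # low-res, high-level
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)              # 64 = 32 (skip) + 32 (up)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        s1 = self.enc1(x)
        bottleneck = self.enc2(self.down(s1))
        up = self.up(bottleneck)
        fused = torch.cat([up, s1], dim=1)         # skip connection
        return self.head(self.dec(fused))

# e.g. TinyEncoderDecoder(num_classes=6)(torch.randn(1, 3, 64, 64))
# returns a (1, 6, 64, 64) map of per-pixel class scores.
```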
However, when embedding semantic segmentation algorithms into terminal devices such as UAV chips, the model size and computational cost must be kept small to meet the demand for fast interaction. Because UAV images have high resolution and feature scales vary greatly, the above methods struggle to segment landing scenes both quickly and accurately. Therefore, a dedicated semantic segmentation model is still needed for emergency UAV landing.

3. Kernel-Sharing Mechanism

Atrous convolution [22] has been widely used to enlarge the receptive field without increasing the convolution kernel size or the computational cost. Because atrous convolution requires no change to the model structure and introduces no extra parameters, it can be embedded seamlessly into any network model. Building on this, Huang et al. [23] proposed kernel-sharing atrous convolution (KSAC), which improves the network's ability to generalize and to represent information at different scales while reducing the computational cost. Che et al. [24] proposed a hybrid convolutional attention (MCA) module, which lets the segmentation model obtain richer feature information at shallow layers, enlarges the captured receptive field, and improves the ability to distinguish objects of different sizes and shapes. Li et al. [25] applied KSAC and multi-receptive-field convolution to the input feature maps to explore multi-scale contextual information, capturing clear target boundaries by gradually recovering spatial information.
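A minimal sketch of the kernel-sharing idea behind KSAC [23] follows: one 3x3 weight tensor is applied at several dilation rates, so each branch sees a different receptive field while adding no parameters. The dilation rates (1, 6, 12) and the sum-fusion of branches are illustrative assumptions, not the exact configuration of [23].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelSharingAtrousConv(nn.Module):
    """One shared 3x3 kernel applied at multiple dilation rates."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        self.rates = rates
        # A single weight tensor shared by every atrous branch.
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Same kernel, different dilation; padding=r keeps the spatial
        # size constant for a 3x3 kernel dilated by r.
        out = sum(
            F.conv2d(x, self.weight, padding=r, dilation=r)
            for r in self.rates
        )
        return F.relu(self.bn(out))
```

Compared with independent atrous branches, the shared kernel is trained on each pattern at several scales simultaneously, which is the source of the scale-generalization benefit claimed for KSAC.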
These studies show that the kernel-sharing mechanism provides an effective way to control receptive fields and to find the best trade-off between local and global information extraction. First, the mechanism dilates deep network features at different dilation rates, captures the contextual information of multi-scale features with a spatial pyramid pooling module, and adds a complementary global average pooling module to refine the context. Second, the input feature map is dilated and grouped according to the dilation rate, and features at different scales are extracted with differently dilated kernels, which captures interactions between layers and strengthens the expression of global semantic information. Aggregating coarse shallow information with fine deep information also helps to mitigate the loss of feature information caused by resolution differences among images from different data sources. As a result, problems such as the loss of edge details on multi-scale targets and the omission or misclassification of small targets can be alleviated.
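The context-aggregation pattern just described can be sketched as parallel atrous branches (spatial-pyramid style) plus a global-average-pooling branch whose output is broadcast back to the feature-map size. This is an illustration of the general pattern, not the exact module of any cited paper; the rates and channel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAggregator(nn.Module):
    """Parallel atrous branches plus a global-average-pooling branch."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        # One atrous branch per dilation rate (the 'pyramid').
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        )
        # Image-level context via global average pooling.
        self.gap = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        # Broadcast the pooled global context back to (h, w).
        g = F.interpolate(self.gap(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))
```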

References

  1. Hall, O.; Wahab, I. The use of drones in the spatial social sciences. Drones 2021, 5, 112.
  2. Meng, Y.; Wang, W.; Han, H.; Ban, J. A visual/inertial integrated landing guidance method for UAV landing on the ship. Aerosp. Sci. Technol. 2019, 85, 474–480.
  3. Falanga, D.; Zanchettin, A.; Simovic, A.; Delmerico, J.; Scaramuzza, D. Vision-based Autonomous Quadrotor Landing on a Moving Platform. In Proceedings of the 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR), Shanghai, China, 11–13 October 2017.
  4. Zhang, H.T.; Hu, B.B.; Xu, Z.; Cai, Z.; Liu, B.; Wang, X.; Geng, T.; Zhong, S.; Zhao, J. Visual Navigation and Landing Control of an Unmanned Aerial Vehicle on a Moving Autonomous Surface Vehicle via Adaptive Learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5345–5355.
  5. Alam, M.S.; Oluoch, J. A survey of safe landing zone detection techniques for autonomous unmanned aerial vehicles (UAVs). Expert Syst. Appl. 2021, 179, 115091.
  6. Respall, V.M.; Sellami, S.; Afanasyev, I. Implementation of Autonomous Visual Detection, Tracking and Landing for AR. Drone 2.0 Quadcopter. In Proceedings of the 2019 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, 7–10 October 2019; pp. 477–482.
  7. Symeonidis, C.; Kakaletsis, E.; Mademlis, I.; Nikolaidis, N.; Tefas, A.; Pitas, I. Vision-based UAV safe landing exploiting lightweight deep neural networks. In Proceedings of the 2021 4th International Conference on Image and Graphics Processing, Sanya, China, 1–3 January 2021; pp. 13–19.
  8. Zhang, T.; Lin, G.; Cai, J.; Shen, T.; Shen, C.; Kot, A.C. Decoupled spatial neural attention for weakly supervised semantic segmentation. IEEE Trans. Multimed. 2019, 21, 2930–2941.
  9. Gao, G.; Xu, G.; Yu, Y.; Xie, J.; Yang, J.; Yue, D. Mscfnet: A lightweight network with multi-scale context fusion for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2021, 23, 25489–25499.
  10. Xing, Y.; Zhong, L.; Zhong, X. An encoder-decoder network based FCN architecture for semantic segmentation. Wirel. Commun. Mob. Comput. 2020, 2020, 8861886.
  11. Saiz, F.A.; Alfaro, G.; Barandiaran, I.; Graña, M. Generative adversarial networks to improve the robustness of visual defect segmentation by semantic networks in manufacturing components. Appl. Sci. 2021, 11, 6368.
  12. Xiang, S.; Xie, Q.; Wang, M. Semantic Segmentation for Remote Sensing Images Based on Adaptive Feature Selection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8006705.
  13. Wang, M.; Dong, Z.; Cheng, Y.; Li, D. Optimal Segmentation of High-Resolution Remote Sensing Image by Combining Superpixels with the Minimum Spanning Tree. IEEE Trans. Geosci. Remote Sens. 2018, 56, 228–238.
  14. Khan, M.Z.; Gajendran, M.K.; Lee, Y.; Khan, M.A. Deep neural architectures for medical image semantic segmentation. IEEE Access 2021, 9, 83002–83024.
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Part III, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
  17. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  19. Zhang, Z.; Lu, W.; Cao, J.; Xie, G. MKANet: An efficient network with Sobel boundary loss for land-cover classification of satellite remote sensing imagery. Remote Sens. 2022, 14, 4514.
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  21. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
  22. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
  23. Huang, Y.; Wang, Q.; Jia, W.; Lu, Y.; Li, Y.; He, X. See more than once: Kernel-sharing atrous convolution for semantic segmentation. Neurocomputing 2021, 443, 26–34.
  24. Che, Z.; Shen, L.; Huo, L.; Hu, C.; Wang, Y.; Lu, Y.; Bi, F. MAFF-HRNet: Multi-Attention Feature Fusion HRNet for Building Segmentation in Remote Sensing Images. Remote Sens. 2023, 15, 1382.
  25. Li, M.; Rui, J.; Yang, S.; Liu, Z.; Ren, L.; Ma, L.; Li, Q.; Su, X.; Zuo, X. Method of Building Detection in Optical Remote Sensing Images Based on SegFormer. Sensors 2023, 23, 1258.