MF-DCMANet for PolSAR Target Recognition: Comparison
Please note this is a comparison between Version 1 by Chaoqi Zhang and Version 2 by Camila Xu.

Multi-polarization SAR data offers an advantage over single-polarization SAR data in that it not only provides amplitude (intensity) information but also records backward scattering information of the target under different polarization states, which can be represented through the polarimetric scattering matrix.

  • PolSAR target
  • deep learning
  • feature fusion
  • transformer

1. Introduction

It is well established that PolSAR target recognition has become increasingly significant in battlefield surveillance, air and missile defense, and strategic early warning, providing an important guarantee for battlefield situation awareness and intelligence generation [1]. Multi-polarization SAR data offers an advantage over single-polarization SAR data in that it not only provides amplitude (intensity) information but also records the backward scattering of the target under different polarization states, which can be represented through the polarimetric scattering matrix [2]. The polarimetric scattering matrix unifies the energy, phase, and polarization characteristics of target scattering, which depend strongly on the target’s shape, size, structure, and other factors [3]; it thus provides a relatively complete description of the target’s electromagnetic scattering properties. It is therefore essential to make reasonable use of raw or further processed polarization information to enhance target recognition capability. In most studies, however, polarization information is applied to terrain classification tasks, which assign semantic class labels to individual pixels in the image. Zhou et al. [4] extracted a six-dimensional real-valued feature vector from the polarization covariance matrix and then fed the six-channel real images into a deep network to learn hierarchical polarimetric spatial features, achieving satisfactory results in classifying 15 terrain classes in the Flevoland data. Zhang et al. [5] applied polarimetric decomposition to crops in PolSAR scenes and then fed the resulting polarization tensors into a tensor decomposition network for dimension reduction, achieving better classification accuracy. However, these pixel-scale terrain classification methods cannot be directly applied to image-scale target recognition tasks.
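As an illustration of this kind of pixel-wise polarimetric feature extraction, the following minimal sketch maps a 3×3 Hermitian covariance matrix to a six-dimensional real vector. It uses one common choice of features (diagonal intensities plus off-diagonal correlation magnitudes); the exact six features used in [4] may differ.

```python
import numpy as np

def covariance_to_real_features(C):
    """Map a 3x3 Hermitian polarimetric covariance matrix to a 6-D
    real-valued feature vector: the three diagonal intensities plus
    the magnitudes of the three off-diagonal correlations.
    (One common choice; not necessarily the exact features of [4].)"""
    C = np.asarray(C, dtype=complex)
    diag = np.real(np.diag(C))                 # C11, C22, C33
    off = np.abs([C[0, 1], C[0, 2], C[1, 2]])  # |C12|, |C13|, |C23|
    return np.concatenate([diag, off])

# Example: a Hermitian covariance matrix for one pixel
C = np.array([[2.0,       0.5 + 0.5j, 0.1j],
              [0.5 - 0.5j, 1.0,       0.2 ],
              [-0.1j,      0.2,       0.5 ]])
feat = covariance_to_real_features(C)  # shape (6,)
```

Stacking this vector over every pixel yields the six-channel real image that is fed to the deep network.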
Therefore, for PolSAR target recognition tasks, methods that exploit polarization information at the image scale should be developed.
Despite the promising results of terrain classification based on polarization features, the efficacy of a single feature for identifying targets in a complex and dynamic battlefield environment is limited [6][7]. A single feature portrays target characteristics from only one aspect, making it difficult to describe all the information embedded in a polarimetric target. Multi-feature fusion recognition methods allow comprehensive exploitation of the diverse information contained in multi-polarization SAR data, effectively addressing the insufficient robustness of a single feature in complex scenarios [8][9][10]. Drawing on human perception and accumulated experience, researchers have designed many distinctive features from the intensity map of PolSAR targets, which generally have specific physical meanings. Various features have been developed for target recognition tasks, such as monogenic signals [11], computer vision features [12], and electromagnetic scattering features [13]. Feature extraction based on the monogenic signal is rotation- and scale-invariant and has been widely investigated in the domain of PolSAR target recognition. Dong et al. [14][15][16] and Li et al. [10] introduced monogenic signal analysis into SAR target recognition, systematically analyzing the advantages of the monogenic signal in describing SAR target characteristics, and designed multiple feasible classification strategies to improve recognition performance. Such handcrafted features have strong discriminative ability and are not restricted by the amount of data, making them well suited to PolSAR target recognition, where labeled samples are scarce; however, they struggle to mine deeper image features and lack generality.
Moreover, the distinctive imaging mechanism of PolSAR, coupled with the diversity of target categories and the challenge of adapting to different datasets, makes it difficult to fully exploit the discriminative properties of SAR data. Artificial feature design therefore remains a challenging task.
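The monogenic signal mentioned above can be sketched in a few lines: it augments an image with its two Riesz-transform components, from which rotation-invariant local amplitude, phase, and orientation are derived. This is a minimal frequency-domain sketch; practical pipelines typically band-pass filter first (e.g., with log-Gabor kernels).

```python
import numpy as np

def monogenic_signal(img):
    """Monogenic signal of a 2-D image via the Riesz transform,
    computed in the frequency domain (minimal sketch, no band-pass
    filtering)."""
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)[None, :]
    v = np.fft.fftfreq(rows)[:, None]
    radius = np.sqrt(u**2 + v**2)
    radius[0, 0] = 1.0                     # avoid division by zero at DC
    H1, H2 = 1j * u / radius, 1j * v / radius
    F = np.fft.fft2(img)
    r1 = np.real(np.fft.ifft2(H1 * F))     # first Riesz component
    r2 = np.real(np.fft.ifft2(H2 * F))     # second Riesz component
    amplitude = np.sqrt(img**2 + r1**2 + r2**2)
    phase = np.arctan2(np.sqrt(r1**2 + r2**2), img)
    orientation = np.arctan2(r2, r1)
    return amplitude, phase, orientation
```

The (amplitude, phase, orientation) triple is the kind of local descriptor the cited monogenic-signal methods build their classifiers on.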
Recently, deep learning has greatly advanced the field of computer vision [17][18]. By leveraging neural networks to automatically discover more abstract features from input data, deep learning reduces the incompleteness of handcrafted features, leading to more competitive performance than traditional methods. Chen et al. [19] designed a network specifically for SAR images, called A-ConvNet, whose average accuracy on the ten-class MSTAR classification task reaches 99%. The CV-CNN proposed in [20] uses complex-valued parameters and variables to extract and classify features from PolSAR data, effectively utilizing phase information. The convolution operation in CNNs facilitates the learning and extraction of visual features, but it also introduces an inductive bias that limits the receptive field of the learned features. As a result, CNNs are adept at extracting effective local information but struggle to capture and store long-range dependencies. The recently developed Vision Transformer (ViT) [21][22] effectively addresses this problem: it models global dependencies between input and output through the self-attention mechanism, yielding more interpretable models. As a result, ViT has found applications in the field of PolSAR recognition.

2. CNN-Based Multi-Feature Target Recognition

The CNN-based multi-feature target recognition methods can mainly be divided into two categories: one combines deep features with handcrafted features, while the other combines deep features learned from different layers of the network for classification. In work combining deep and handcrafted features, Xing et al. [8] fused scattering center features and CNN features through discriminant correlation analysis and achieved satisfactory results under the extended operating conditions of the MSTAR dataset. Zhang et al. concatenated HOG features with multi-scale deep features for better SAR ship classification [23]. Zhou et al. [24] automatically extracted semantic features from attributed scattering centers and SAR images through a network and then simply concatenated the features for target recognition. Note that in the above fusion methods, the features are extracted independently, and the classification information they contain is only brought together in the fusion stage. Zhang et al. [25][26] utilized polarimetric features as expert knowledge for SAR ship classification, performed effective feature fusion through deep neural networks, and achieved advanced classification performance on the OpenSARShip dataset. Furthermore, Zhang et al. [27] analyzed how integrating handcrafted features at different layers of a deep neural network affects recognition rates and introduced several effective feature concatenation techniques. To make effective use of the features learned by different layers of the network, Guo et al. [28] used convolution kernels of different scales to extract features of different levels from SAR images. Ai et al. [29] used convolutional kernels of different sizes to extract features from images and then combined them through weighted fusion, with the weights learned by the neural network, achieving good recognition results on the MSTAR dataset. Zeng et al. [30] introduced a multi-stream structure combined with an attention mechanism to obtain rich target features and achieved better recognition performance on the MSTAR dataset. Zhai et al. [31] introduced an attention module into the CNN architecture to connect features extracted from different layers and used transfer learning to reduce the number of training samples required. The multi-feature fusion methods described above primarily combine features by concatenation, which may not effectively merge features with different attributes and can lead to weak fusion generalization.
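The concatenation strategy these methods rely on can be sketched generically as follows. This is not any specific cited method; per-branch L2 normalization is one common way to keep either feature's scale from dominating the fused vector.

```python
import numpy as np

def concat_fusion(deep_feat, handcrafted_feat, eps=1e-8):
    """Concatenation fusion of a deep feature vector and a handcrafted
    feature vector. Each branch is L2-normalized first so that neither
    feature's scale dominates the fused representation.
    (Generic sketch, not a specific method from the cited works.)"""
    d = deep_feat / (np.linalg.norm(deep_feat) + eps)
    h = handcrafted_feat / (np.linalg.norm(handcrafted_feat) + eps)
    return np.concatenate([d, h])

rng = np.random.default_rng(0)
deep = rng.standard_normal(256)   # e.g., a CNN embedding
hog = rng.standard_normal(144)    # e.g., a HOG descriptor
fused = concat_fusion(deep, hog)  # shape (400,)
```

The fused vector is then passed to a classifier; the weakness noted above is that the two branches interact only at this final concatenation step.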

3. Transformer in Target Recognition

CNNs have a clear advantage in extracting low-level features and visual structure. However, their receptive field is usually small, which is not conducive to capturing global features [21]. In contrast, the multi-head attention mechanism of the transformer handles dependencies between long-range features more naturally and effectively. Dosovitskiy et al. [22] successfully applied the transformer to the visual field (ViT). ViT treats the input image as a sequence of patches: each patch is flattened across all its pixels and channels into a single vector and then linearly projected to the desired input dimension. Zhao et al. [32] applied the transformer to the few-shot recognition problem in SAR recognition, constructing a support set and query set from the original MSTAR data and then computing the attention weight between them as cosine similarity in Euclidean space. Wang et al. [33] developed a method combining CNN and transformer, making full use of the local perception capability of the CNN and the global modeling capability of the transformer. Li et al. [34] constructed a multi-aspect SAR sequence dataset from the MSTAR data, using a convolutional autoencoder as the basic feature extractor and mining the dependence between sequences through the transformer; the method has good noise robustness and achieves higher recognition accuracy.
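The ViT patch-embedding step described above can be sketched in numpy. This is an illustrative sketch only: a full ViT additionally prepends a class token, adds position embeddings, and learns the projection matrix, none of which are shown here.

```python
import numpy as np

def patch_embed(img, patch=4, dim=32, seed=0):
    """ViT-style patch embedding: split an image (H, W, C) into
    non-overlapping patches, flatten each patch across all pixels and
    channels, and linearly project it to the model dimension.
    (Sketch with a random, untrained projection matrix.)"""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> grid-major -> one flat vector per patch
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))
    rng = np.random.default_rng(seed)
    W_proj = rng.standard_normal((patch * patch * C, dim))
    return patches @ W_proj          # (num_patches, dim)

img = np.random.default_rng(1).standard_normal((16, 16, 3))
tokens = patch_embed(img)            # 4x4 grid of patches -> (16, 32)
```

The resulting token sequence is what the transformer's self-attention layers operate on.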