PVTv2 for Deep Hash Remote Sensing Image Retrieval

Please note this is a comparison between Version 2 by Rita Xu and Version 1 by kunlin wu.

For high-resolution remote sensing image retrieval tasks, single-scale features cannot fully express the complexity of the image information. Due to the large volume of remote sensing images, retrieval requires extensive memory and time.

remote sensing image retrieval
PVTv2
multi-scale feature fusion

1. Introduction

With the rapid advancement of Earth observation technology, the number of remote sensing satellites has increased significantly, resulting in a rapid growth in the volume of remote sensing images ^[1]. Effectively locating and retrieving the desired remote sensing images from massive databases, as well as efficiently managing and utilizing the remote sensing image data, pose formidable challenges ^[2]. Remote sensing image retrieval (RSIR) aims to retrieve the required remote sensing images accurately and efficiently from extensive databases and can be categorized into text-based RSIR and content-based RSIR ^[3]. Text-based RSIR retrieves tagged images from the remote sensing database based on query keywords or labels, but it requires extensive manual annotation of each image in the dataset during the initial phase. Content-based RSIR, on the other hand, performs image retrieval by searching for images in the database that closely resemble the query image. This approach closely aligns with human visual perception and is currently the dominant retrieval method. Due to the complex scene and rich background information of remote sensing images, it is difficult to extract effective retrieval features and accurately measure the similarity of features, which is a problem that needs to be solved.

Content-based RSIR (CBRSIR) comprises three main components: feature extraction, feature dimensionality reduction, and similarity calculation. Early CBRSIR features were limited to basic image patterns such as lines, shapes, and textures. These manually designed features are known as low-level features; typical examples are SIFT ^[4], LBP ^[5], and HOG ^[6]. Low-level features describe local image content and are aggregated into mid-level features using descriptor aggregation techniques such as BoW ^[7], VLAD ^[8], FK ^[9], and EMK ^[10]. With the development of deep learning and its introduction into image retrieval, convolutional neural networks (CNNs) ^[11] are typically used as feature extractors to obtain abstract features of remote sensing images ^[12], referred to as high-level features. However, the high dimensionality of these deep features leads to high computational and storage costs. Dimensionality reduction techniques are therefore necessary to improve retrieval speed and reduce memory usage. Various studies have shown that encoding or pooling methods can reduce the dimensionality of the features. One such technique is hashing, which encodes features as binary hash codes and significantly reduces retrieval time and memory use.
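To make the hashing idea concrete, the following is a minimal sketch, assuming the simplest possible scheme: each feature dimension is thresholded at zero to give one bit, and database images are ranked by Hamming distance to the query code. The function names, the image names, and the four-dimensional feature vectors are hypothetical and chosen only for illustration; learned deep hashing models produce their codes differently.

```python
def binarize(features):
    """Pack the signs of a real-valued feature vector into an integer bit code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(code_a, code_b):
    """Count the bit positions in which two codes differ."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query, database, top_k=2):
    """Return the names of the top_k database entries closest to the query."""
    query_code = binarize(query)
    ranked = sorted(
        database,
        key=lambda name: hamming(query_code, binarize(database[name])),
    )
    return ranked[:top_k]


# Hypothetical 4-dimensional features for three database images.
database = {
    "harbor_01": [0.9, -0.2, 0.4, -0.7],
    "forest_01": [-0.8, 0.5, -0.3, 0.6],
    "harbor_02": [0.7, -0.1, 0.2, 0.3],
}
result = retrieve([0.8, -0.3, 0.5, -0.6], database)
print(result)
```

Because each code fits in a few machine words and the distance is an XOR followed by a bit count, both storage and comparison cost are far lower than for floating-point feature vectors.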

The primary challenge in CBRSIR is the vast area covered by remote sensing images, which depict multiple object categories and complex background information. Retrieval accuracy is affected by the high similarity between images of different categories, significant differences between images of the same category, and diversity in the orientation of image targets. Remote sensing images can be represented from various perspectives using features at different scales. Multi-scale feature fusion methods have been applied in multiple domains ^[13][14][15], such as hyperspectral image classification ^[14] and pedestrian detection ^[15], and have demonstrated significant effectiveness. Inspired by this, some studies in CBRSIR use a feature fusion technique to overcome the limitations of single-feature expression capability by extracting multiple features from the same or different models. Nonetheless, in these methods, the feature fusion and feature extraction processes are usually separated, making it difficult to uniformly learn features at varying scales and perform an end-to-end multi-feature fusion.

Recently, Transformer models have gained significant attention in the field of computer vision. Dosovitskiy et al. ^[16] proposed the Vision Transformer (ViT) model, which employs a pure Transformer-based approach and is suitable for image classification tasks. After being trained on large datasets, ViT outperformed traditional convolutional neural network (CNN) models and demonstrated stronger generalization capability. However, ViT generates feature maps at only a single resolution, and its computational complexity is high because self-attention is computed globally over all tokens. To address these issues, Liu et al. ^[17] proposed the Swin Transformer, which adopts a hierarchical structure similar to CNNs and can produce multi-scale feature maps. Moreover, it computes attention within local windows that are shifted between layers, reducing the computational complexity from quadratic in image size, as in ViT, to linear. Wang et al. ^[18] proposed the Pyramid Vision Transformer (PVT), a Transformer-based architecture built around a feature pyramid. PVT features a progressive shrinking pyramid structure and a spatial reduction attention (SRA) mechanism. Compared to ViT, PVT significantly reduces computational complexity. PVTv2 ^[19] further improves the original PVT by introducing overlapping patch embeddings and a linear spatial reduction attention mechanism, making the feature pyramid Transformer architecture a viable backbone network for visual tasks. Beyond image classification, Transformer models have demonstrated stronger feature extraction capabilities than CNNs in fields such as object detection, semantic segmentation, and image processing.
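The complexity argument can be checked with simple arithmetic. The sketch below counts query-key pairs, a rough proxy for attention cost, under assumed settings that are not stated in the text: a patch size of 4, a 7 × 7 attention window, and a spatial reduction ratio of 8. All function names are hypothetical.

```python
def token_count(image_size, patch_size):
    """Tokens produced when a square image is split into square patches."""
    side = image_size // patch_size
    return side * side


def global_pairs(n):
    """Global self-attention: every token attends to every token."""
    return n * n


def window_pairs(n, window_side=7):
    """Window attention: every token attends only to tokens in its window."""
    return n * window_side * window_side


def spatial_reduction_pairs(n, ratio=8):
    """Spatial reduction attention: keys/values are downsampled by ratio per side."""
    return n * (n // (ratio * ratio))


small = token_count(224, 4)   # tokens for a 224 x 224 input
large = token_count(448, 4)   # tokens after doubling the image side

# Growth in attention cost when the token count grows fourfold.
global_growth = global_pairs(large) // global_pairs(small)
window_growth = window_pairs(large) // window_pairs(small)
print(small, large, global_growth, window_growth)
```

Doubling the image side multiplies the token count by 4; the global attention cost then grows by a factor of 16, while the window attention cost grows only by a factor of 4, which is the quadratic versus linear behaviour described above.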

2. CBRSIR Based on CNN Features

Deep features extracted from CNNs have been increasingly utilized in CBRSIR. For instance, Li et al. ^[20] designed four unsupervised convolutional neural networks that generate four types of deep features at different layers. By combining these deep features with traditional handcrafted features, they provided more effective features for CBRSIR. Imbriaco et al. ^[21] extracted deep local convolutional features from fine-tuned CNN models and aggregated the local convolutional features into global descriptors using the vector of locally aggregated descriptors (VLAD). They utilized multiplicative and additive attention mechanisms to suppress interference from irrelevant background. Hou et al. ^[22] fine-tuned the MobileNet model to extract deep convolutional features and obtained low-dimensional feature representations by changing the dimension of the final fully connected layer. They compared the retrieval accuracy with that of dimensionality reduction by principal component analysis (PCA). In cross-dataset remote sensing image retrieval, Wang et al. ^[23] proposed a learnable joint spatial and spectral transformation (JSST) model to correct spatial and spectral distortions in images. This model embedded the spatially and spectrally modified inputs at the front end of the ResNet34 network, thereby improving generalization and adaptability. Wu et al. ^[24] proposed two rotation-aware networks, namely the feature-map-transformation-based rotation-aware network (FMT-RAN) and the spatial-transformer-based rotation-aware network (ST-RAN), to address the issue of images appearing at arbitrary rotation angles. However, the aforementioned methods extract deep features from CNNs for retrieval without utilizing the features of Transformer models. In contrast to CNN models, Transformer models can perform global context modeling and better comprehend the semantic relationships of the entire input sequence. Therefore, they can capture global contextual information and extract richer features.

3. CBRSIR Based on Deep Hashing Features

Hashing has been widely used in large-scale remote sensing image retrieval due to its prominent advantages in storage and retrieval speed. Li et al. ^[25] proposed the deep hashing neural network (DHNN), which utilizes deep feature learning neural networks to learn high-dimensional embedding features and hash learning neural networks to learn low-dimensional hashing features. This model can be optimized end-to-end. To address the overfitting issue caused by a limited number of labeled images in remote sensing datasets, Roy et al. ^[26] proposed a deep hashing network based on metric learning. Liu et al. ^[27] introduced a deep supervised hashing model using a loss function composed of classification, similarity, and bit entropy terms based on the framework of generative adversarial networks (GANs) to learn compact and effective hash codes. Cheng et al. ^[28] proposed a semantic-preserving deep hashing model, which applies deep hashing to multi-label remote sensing image retrieval. It introduces a pairwise label similarity loss that fully utilizes multi-label information, demonstrating the effectiveness of hashing methods in multi-label remote sensing image retrieval. Tan et al. ^[29] proposed deep contrastive self-supervised hashing for remote sensing image retrieval, which utilizes unlabeled images for training. This method assumes that hash codes generated from different views of the same image should be similar, while those generated from different images should be dissimilar. They designed a loss function to preserve the similarity of hash codes. Jing et al. ^[30] presented a deep unsupervised weighted hashing model that utilizes a pretrained Swin Transformer to extract feature representations. This model uses an adaptive weight-based loss function that assigns weights adaptively to positive and negative samples and combines it with quantization loss, resulting in improved model performance. Although these deep hashing methods have achieved good retrieval results, they extract single-layer features without employing methods for fusing multiple features. Single-feature extraction is insufficient to fully express the rich detailed information and semantic information of remote sensing images. The adoption of multi-feature fusion in hashing methods has the potential to improve the accuracy of remote sensing image retrieval.
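Several of the methods above mention a quantization loss. The sketch below shows one common form, offered here as an assumption rather than the formulation of any cited work: the network outputs are squashed with tanh during training, the final code is obtained with a sign function, and the loss penalizes outputs that are far from ±1. The function names and logit values are hypothetical.

```python
import math


def relaxed_code(logits):
    """Continuous relaxation used during training: values lie in (-1, 1)."""
    return [math.tanh(z) for z in logits]


def binary_code(logits):
    """Final hash code used at retrieval time: each bit is +1 or -1."""
    return [1 if z >= 0 else -1 for z in logits]


def quantization_loss(relaxed):
    """Mean squared gap between each relaxed value's magnitude and 1."""
    return sum((abs(h) - 1.0) ** 2 for h in relaxed) / len(relaxed)


weak = relaxed_code([0.2, -0.1, 0.3, -0.4])    # outputs far from +/-1
strong = relaxed_code([3.0, -2.5, 4.0, -3.5])  # outputs close to +/-1
print(quantization_loss(weak), quantization_loss(strong))
print(binary_code([0.2, -0.1, 0.3, -0.4]))
```

Minimizing this term alongside a similarity-preserving loss pushes the continuous outputs toward binary values, so that little information is lost when the sign function is applied at retrieval time.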

4. Methods Based on Multi-Feature Fusion

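Currently, some studies focus on the limitation that a single feature representation cannot fully express the visual and semantic information of an image. These studies adopt feature fusion techniques to enhance feature discrimination. For instance, Yang et al. ^[31] combined the advantages of high-level CNN features by fusing convolutional and fully connected layer features, improving retrieval performance by exploiting global and local image information simultaneously. Li et al. ^[32] extracted high-level features from the ResNet50 and VGG16 networks and concatenated them, which improved feature representation capability and, by leveraging the learned parameters of both networks, the classification performance of remote sensing images. Yin et al. ^[33] introduced an average-max pooling weighted fusion technique to merge high-level features, effectively improving retrieval performance by enhancing feature representation. Li et al. ^[20] designed four CNN models with different feature layers and developed collaborative affinity metric fusion (CAMF) to merge features from different layers and improve retrieval performance. Alhichri et al. ^[34] used three pretrained SqueezeNet models that take input images at various scales and fused the outputs of the three CNN models in a cascaded manner. Minakshi et al. ^[35] devised a fused CNN architecture that merges features extracted from VGG16, VGG19, and ResNet to obtain efficient and accurate features. In addition, they proposed an optimal feature selection model based on joint MI_RFO to select the best features and improve retrieval accuracy. However, many current fusion methods rely on simple summation or concatenation, and most studies focus on merging CNN features. Moreover, existing methods separate multi-scale feature extraction from feature fusion, which hinders the automatic adjustment and merging of multi-scale features according to the needs of remote sensing image retrieval and limits the representation capability of the retrieval features.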
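Two of the simple fusion strategies mentioned above can be sketched in a few lines. The example assumes a feature map stored as nested lists with shape channels × height × width, a hypothetical mixing weight `alpha` for combining average-pooled and max-pooled descriptors, and plain concatenation for merging vectors from different sources. It illustrates the general idea only and is not the exact scheme of any cited work.

```python
def global_avg_pool(feature_map):
    """Average each channel's spatial grid down to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def global_max_pool(feature_map):
    """Take each channel's maximum activation."""
    return [max(max(row) for row in channel) for channel in feature_map]


def weighted_fusion(feature_map, alpha=0.5):
    """Blend average-pooled and max-pooled descriptors (alpha is hypothetical)."""
    avg = global_avg_pool(feature_map)
    mx = global_max_pool(feature_map)
    return [alpha * a + (1.0 - alpha) * m for a, m in zip(avg, mx)]


def concat_fusion(*vectors):
    """Concatenate descriptors from several layers or networks."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


# Toy feature map with 2 channels of size 2 x 2.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = weighted_fusion(feature_map)
combined = concat_fusion(pooled, [0.1, 0.2])
print(pooled, combined)
```

Both operations are fixed rules applied after feature extraction, which is the separation between extraction and fusion that the paragraph above identifies as a limitation.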

5. Methods Based on Deep Metric Learning

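Currently, deep metric learning (DML) methods are widely used to enhance the retrieval capability of networks. Contrastive loss ^[36] is an early metric learning method that measures the distance between two samples, narrowing the gap between paired samples of the same class and widening the gap between samples of different classes. Triplet loss ^[37] selects one sample as the anchor and two additional samples as the positive and the negative. It requires the distance between same-class samples to become more compact and the distance between different-class samples to increase. N-pair loss ^[38] measures the correlation between samples using cosine similarity and matches each anchor with one positive sample and multiple negative samples. Proxy-NCA ^[39] is an early proxy-based loss that associates each sample with proxies assigned to each class. It aims to pull samples closer to the proxy of their own class and push them away from the proxies of other classes. SoftTriple loss ^[40] extends the softmax loss by assigning multiple proxies to each class, effectively capturing the hidden distribution of samples and preserving a wider intra-class spread. Liu et al. ^[41] replaced the hinge function with a softmax function to obtain global optimization, overcoming the local optimization problem of triplet loss. Xue et al. ^[42] proposed a hash retrieval method that combines proxy-based metric learning with hash code learning to increase retrieval speed while maintaining accuracy and minimizing storage. However, these methods have clear limitations. For example, pair-based metric losses form a rapidly growing number of sample pairs as the number of training samples increases, which leads to extra computation and longer network convergence times. Proxy-based losses can address the issues of convergence rate and time complexity. However, they do not make full use of sample information, and the number of proxies assigned to each class is fixed and cannot be allocated adaptively, which results in a lack of generalization ability.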
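As an illustration of the pair-based losses described above, the following sketch writes out commonly used textbook forms of the contrastive loss and the triplet loss for single examples, using Euclidean distance. The margin values and the toy two-dimensional embeddings are assumptions made for the example, and the cited works may use different variants.

```python
import math


def contrastive_loss(x1, x2, same_class, margin=2.0):
    """Pull same-class pairs together; push different-class pairs past the margin."""
    distance = math.dist(x1, x2)
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Require the negative to be farther from the anchor than the positive by a margin."""
    gap = math.dist(anchor, positive) - math.dist(anchor, negative)
    return max(0.0, gap + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy_negative, hard_negative = [3.0, 4.0], [1.0, 1.0]

print(contrastive_loss(anchor, positive, same_class=True))
print(contrastive_loss(anchor, easy_negative, same_class=False))
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative already lies beyond the margin and contributes zero loss, while the hard negative does not. With N training samples there are on the order of N² pairs and N³ triplets to choose from, which is the computational burden noted above for pair-based losses.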

6. Transformer Network (PVTv2)

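The proposed method improves on the b2 version of PVTv2 ^[19], which is pretrained on ImageNet-1K. PVTv2 is an enhanced version of PVT, a Transformer network with a pyramid structure. It adopts overlapping patch embedding to tokenize the image, which preserves local image continuity. A second modification replaces the fixed positional encoding in PVT with a positional encoding mechanism based on zero padding, enabling the network to handle images of various sizes more effectively. In addition, a linear spatial reduction attention mechanism replaces the original spatial reduction attention mechanism, which reduces computational cost and keeps computational complexity linear. The PVTv2 network is a multi-stage structure with four stages, each consisting of a patch embedding layer and a Transformer encoder. These stages produce feature maps at four different scales. As network depth increases, the resolution of the feature maps gradually decreases and the channel dimension of the features gradually increases. The Transformer encoder mainly comprises layer normalization, an MLP, and linear spatial reduction attention. The output features of the four stages differ in scale and channel count. Specifically, the output feature of Stage 1 is 64 × 56 × 56, that of Stage 2 is 128 × 28 × 28, that of Stage 3 is 320 × 14 × 14, and that of Stage 4 is 512 × 7 × 7.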
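The four stage shapes listed above can be reproduced with the standard convolution output-size formula. The sketch assumes a 224 × 224 input and overlapping patch embeddings implemented as strided convolutions with kernel 7, stride 4, padding 3 in Stage 1 and kernel 3, stride 2, padding 1 in the later stages. These settings are not given in the text and are taken as assumptions; the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a strided convolution (floor division, as in common frameworks)."""
    return (size + 2 * padding - kernel) // stride + 1


# (channels, kernel, stride, padding) of each stage's patch embedding (assumed values).
STAGES = [
    (64, 7, 4, 3),
    (128, 3, 2, 1),
    (320, 3, 2, 1),
    (512, 3, 2, 1),
]


def stage_shapes(image_size=224):
    """Return the (channels, height, width) of each stage's output feature map."""
    shapes = []
    size = image_size
    for channels, kernel, stride, padding in STAGES:
        size = conv_output_size(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes


for index, shape in enumerate(stage_shapes(), start=1):
    print(f"Stage {index}: {shape[0]} x {shape[1]} x {shape[2]}")
```

Because the kernel is larger than the stride in every stage, neighbouring patches overlap, which is how the overlapping patch embedding preserves local continuity while still halving the resolution from one stage to the next.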

References

Tang, X.; Yang, Y.; Ma, J.; Cheung, Y.M.; Liu, C.; Liu, F.; Zhang, X.; Jiao, L. Meta-Hashing for Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615419.
Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv 2022, arXiv:2204.09868.
Ye, F.; Luo, W.; Dong, M.; He, H.; Min, W. SAR Image retrieval based on unsupervised domain adaptation and clustering. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1482–1486.
Sumbul, G.; Ravanbakhsh, M.; Demir, B. Informative and Representative Triplet Selection for Multilabel Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405811.
Zhuo, Z.; Zhou, Z. Remote Sensing Image Retrieval with Gabor-CA-ResNet and Split-Based Deep Feature Transform Network. Remote Sens. 2021, 13, 869.
Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and application. Math. Probl. Eng. 2022, 2022, 5880959.
Ma, J.; Shi, D.; Tang, X.; Zhang, X.; Jiao, L. Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval. Remote Sens. 2022, 14, 1319.
Shabbir, A.; Ali, N.; Ahmed, J.; Zafar, B.; Rasheed, A.; Sajid, M.; Ahmed, A.; Dar, S.H. Satellite and scene image classification based on transfer learning and fine tuning of ResNet50. Math. Probl. Eng. 2021, 2021, 5843816.
Wang, Y.; Ji, S.; Lu, M.; Zhang, Y. Attention boosted bilinear pooling for remote sensing image retrieval. Int. J. Remote Sens. 2020, 41, 2704–2724.
Bo, L.; Sminchisescu, C. Efficient match kernel between sets of features for visual recognition. Adv. Neural Inf. Process. Syst. 2009, 22, 135–143.
Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote Sensing Image Registration Using Convolutional Neural Network Features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236.
Ye, F.; Luo, W.; Dong, M.; Li, D.; Min, W. Content-based Remote Sensing Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8002505.
Kumar, A.; Yadav, D.P.; Kumar, D.; Pant, M.; Pant, G. Multi-scale feature fusion-based lightweight dual stream transformer for detection of paddy leaf disease. Environ. Monit. Assess. 2023, 195, 1020.
Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Tariq, A.; Qin, S. Multiscale Dual-Branch Residual Spectral-Spatial Network With Attention for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5455–5467.
Chen, H.; Guo, X. Multi-scale feature fusion pedestrian detection algorithm based on Transformer. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 536–540.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122.
Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424.
Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-Based High-Resolution Remote Sensing Image Retrieval via Unsupervised Feature Learning and Collaborative Affinity Metric Fusion. Remote Sens. 2016, 8, 709.
Imbriaco, R.; Sebastian, C.; Bondarev, E. Aggregated Deep Local Features for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 493.
Hou, D.; Miao, Z.; Xing, H.; Wu, H. Exploiting low dimensional features from the MobileNets for remote sensing image retrieval. Earth Sci. Inform. 2020, 13, 1437–1443.
Wang, Y.; Ji, S.; Zhang, Y. A learnable joint spatial and spectral transformation for high resolution remote sensing image retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8100–8112.
Wu, Z.; Zou, C.; Wang, Y.; Tan, M.; Weise, T. Rotation-Aware Representation Learning for Remote Sensing Image Retrieval. Inf. Sci. 2021, 572, 404–423.
Li, Y.; Zhang, Y.; Xin, H.; Hu, Z.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 56, 950–965.
Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images; Cornell University: Ithaca, NY, USA, 2019.
Liu, C.; Ma, J.; Tang, X.; Zhang, X.; Jiao, L. Adversarial hash-code learning for remote sensing image retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 4324–4327.
Cheng, Q.; Huang, H.; Ye, L.; Fu, P.; Gan, D.; Zhou, Y. A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 4965.
Tan, X.; Zou, Y.; Guo, Z.; Zhou, K.; Yuan, Q. Deep Contrastive Self-Supervised Hashing for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 3643.
Jing, W.; Xu, Z.; Li, L.; Wang, J.; He, Y.; Chen, G. Deep Unsupervised Weighted Hashing for Remote Sensing Image Retrieval. J. Database Manag. (JDM) 2022, 33, 1–19.