PVTv2 for Deep Hash Remote Sensing Image Retrieval: Comparison
Please note this is a comparison between Version 3 by Rita Xu and Version 2 by Rita Xu.

For high-resolution remote sensing image retrieval tasks, single-scale features cannot fully express the complexity of the image information, and the large volume of remote sensing images makes retrieval costly in memory and time. To address both problems, researchers have proposed an end-to-end deep hash remote sensing image retrieval model (PVTA_MSF) that fuses multi-scale features based on the Pyramid Vision Transformer network (PVTv2).

  • remote sensing image retrieval
  • PVTv2
  • multi-scale feature fusion

1. Introduction

With the rapid advancement of Earth observation technology, the number of remote sensing satellites has increased significantly, resulting in rapid growth in the volume of remote sensing images [1]. Effectively locating and retrieving the desired remote sensing images from massive databases, as well as efficiently managing and utilizing remote sensing image data, poses formidable challenges [2]. Remote sensing image retrieval (RSIR) aims to retrieve the required remote sensing images accurately and efficiently from extensive databases and can be categorized into text-based RSIR and content-based RSIR (CBRSIR) [3]. Text-based RSIR retrieves tagged images from the remote sensing database based on query keywords or labels, but it requires extensive manual annotation of each image in the dataset during the initial phase. CBRSIR, on the other hand, retrieves images from the database that closely resemble the query image. This approach aligns closely with human visual perception and is currently the dominant retrieval method. Because remote sensing images contain complex scenes and rich background information, extracting effective retrieval features and accurately measuring feature similarity remain open problems.
CBRSIR comprises three main components: feature extraction, feature dimensionality reduction, and similarity calculation. Early CBRSIR features were limited to basic image patterns such as lines, shapes, and textures. These manually designed features are known as low-level features; typical examples include SIFT [4], LBP [5], and HOG [6]. Low-level features describe local image content and are aggregated into mid-level features using descriptor aggregation techniques such as BoW [7], VLAD [8], FK [9], and EMK [10]. With the development of deep learning and its introduction into image retrieval, convolutional neural networks (CNNs) [11] are typically used as feature extractors to obtain abstract features of remote sensing images [12], referred to as high-level features. However, the high dimensionality of these deep features leads to high computational and storage costs. Therefore, dimensionality reduction techniques are necessary to improve retrieval speed and minimize memory usage. Various studies have shown that encoding or pooling methods can reduce the dimensionality of the features. One such technique is hashing, which encodes features as binary hash codes, significantly reducing retrieval time and memory use.
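To make the storage and speed argument concrete, the following is a minimal sketch of hash-based retrieval, assuming the simplest possible binarization (the sign of each feature dimension) and ranking by Hamming distance. The feature values and the helper names `binarize`, `hamming_distance`, and `rank_by_hamming` are hypothetical and are not taken from any of the cited works.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the sign bit of each feature dimension into one integer hash code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bits in which two hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: Sequence[int]) -> List[int]:
    """Return database indices ordered by increasing Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Toy 8-dimensional "deep features" (made-up values, for illustration only).
db_features = [
    [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8],
    [-0.6, 0.5, -0.1, 0.2, -0.9, -0.3, 0.7, -0.4],
    [0.8, -0.1, 0.5, -0.6, -0.2, 0.4, -0.3, 0.9],
]
query_features = [0.7, -0.3, 0.6, -0.8, 0.2, 0.1, -0.4, 0.5]

db_codes = [binarize(f) for f in db_features]
query_code = binarize(query_features)
ranking = rank_by_hamming(query_code, db_codes)
print(ranking)
```

Each image is stored as a single 8-bit code instead of eight floating-point values, and comparing two images reduces to an XOR followed by a bit count. Learned hashing methods replace the fixed sign rule with a trained hash layer, but the retrieval step is typically of this form.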
The primary challenge in CBRSIR is that remote sensing images cover vast areas and depict multiple object categories against complex backgrounds. Retrieval accuracy is affected by the high similarity between images of different categories, significant differences between images of the same category, and the diverse orientations of image targets. Features at different scales can represent remote sensing images from different perspectives. Multi-scale feature fusion methods have been applied in multiple domains [13][14][15], such as hyperspectral image classification [14] and pedestrian detection [15], and have demonstrated significant effectiveness. Inspired by this, some CBRSIR studies use feature fusion to overcome the limited expressive capability of a single feature by extracting multiple features from the same or different models. Nonetheless, in these methods the feature extraction and feature fusion processes are usually separated, making it difficult to learn features at varying scales in a unified way and to perform end-to-end multi-feature fusion.
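One common way to fuse features from several scales is to pool each feature map to a vector, normalize it, and concatenate the results. The sketch below illustrates that generic recipe only; it is an assumption for illustration and not the fusion module of PVTA_MSF or of any cited method. Feature maps are plain nested lists in `[channels][height][width]` layout, and all names and values are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # layout: [channels][height][width]


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over its spatial positions."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in fmap]


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to (approximately) unit length so no scale dominates."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def fuse_multiscale(fmaps: List[FeatureMap]) -> List[float]:
    """Pool, normalize, and concatenate feature maps taken from different stages."""
    fused: List[float] = []
    for fmap in fmaps:
        fused.extend(l2_normalize(global_average_pool(fmap)))
    return fused


def constant_map(values: List[float], size: int) -> FeatureMap:
    """Build a toy feature map whose channels are constant (illustration only)."""
    return [[[v] * size for _ in range(size)] for v in values]


shallow = constant_map([1.0, 2.0], size=4)    # high resolution, few channels
deep = constant_map([3.0, 0.0, 4.0], size=2)  # low resolution, more channels
fused = fuse_multiscale([shallow, deep])
print(len(fused))
```

Because the pooling and concatenation are ordinary differentiable operations, a fusion step of this kind can sit inside the network and be trained together with the feature extractor, which is the end-to-end property the separated pipelines lack.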
Recently, Transformer models have gained significant attention in the field of computer vision. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT), a pure Transformer-based model suited to image classification tasks. After being trained on large datasets, ViT outperformed traditional CNN models and demonstrated stronger generalization capability. However, ViT generates feature maps at only a single resolution, and its computational complexity is high because it computes global self-attention. To address these issues, Liu et al. [17] proposed the Swin Transformer, which adopts a hierarchical structure similar to that of CNNs and can produce multi-scale feature maps. Moreover, it computes self-attention within local windows that are shifted between layers, reducing the computational complexity from quadratic in image size, as in ViT, to linear. Wang et al. [18] proposed the Pyramid Vision Transformer (PVT), which is the first Transformer-based architecture using a feature pyramid. PVT features a progressive shrinking pyramid structure and a spatial reduction attention (SRA) mechanism. Compared to ViT, PVT significantly reduces computational complexity. PVTv2 [19] further improves the original PVT by introducing overlapping patch embeddings and a linear spatial reduction attention mechanism, making the feature pyramid Transformer architecture a viable backbone network for visual tasks. Beyond image classification, Transformer models have demonstrated stronger feature extraction capabilities than CNNs in fields such as object detection, semantic segmentation, and image processing.
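The complexity differences can be illustrated by counting query-key pairs, which is the dominant term in self-attention cost. The sketch below is a rough estimate under stated assumptions: a 56 x 56 token grid, a window size of 7, and a spatial reduction ratio of 8 are chosen for illustration, and the function names are hypothetical. It ignores projection layers and other costs.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Every token attends to every token: N * N pairs, with N = height * width."""
    n = height * width
    return n * n


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Each token attends only to the window * window tokens in its own window."""
    assert height % window == 0 and width % window == 0
    return (height * width) * (window * window)


def spatial_reduction_pairs(height: int, width: int, reduction: int) -> int:
    """Keys and values are downsampled by `reduction` per side before attention."""
    assert height % reduction == 0 and width % reduction == 0
    n_queries = height * width
    n_keys = (height // reduction) * (width // reduction)
    return n_queries * n_keys


# Assumed token grid and hyperparameters, for illustration only.
h = w = 56
global_pairs = global_attention_pairs(h, w)
window_pairs = window_attention_pairs(h, w, window=7)
sra_pairs = spatial_reduction_pairs(h, w, reduction=8)
print(global_pairs, window_pairs, sra_pairs)
```

With a fixed window size, the windowed count grows linearly in the number of tokens. With a fixed reduction ratio R, the spatially reduced count is still quadratic in the number of tokens but smaller than the global count by a factor of R squared.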

2. CBRSIR Based on CNN Features

Deep features extracted from CNNs have been increasingly utilized in CBRSIR. For instance, Li et al. [20] designed four unsupervised convolutional neural networks that generate four types of deep features at different layers. By combining these deep features with traditional handcrafted features, they provided more effective features for CBRSIR. Imbriaco et al. [21] extracted deep local convolutional features from fine-tuned CNN models and aggregated them into global descriptors using the vector of locally aggregated descriptors (VLAD). They utilized multiplicative and additive attention mechanisms to suppress interference from irrelevant background. Hou et al. [22] fine-tuned the MobileNet model to extract deep convolutional features and obtained a low-dimensional feature representation by changing the dimension of the final fully connected layer. They compared the resulting retrieval accuracy with that of dimensionality reduction by principal component analysis (PCA). For cross-dataset remote sensing image retrieval, Wang et al. [23] proposed a learnable joint spatial and spectral transformation (JSST) model to correct spatial and spectral distortions in images. This model embedded the spatially and spectrally modified inputs at the front end of the ResNet34 network, thereby improving generalization and adaptability. Wu et al. [24] proposed two rotation-aware networks, namely the feature-map-transformation-based rotation-aware network (FMT-RAN) and the spatial-transformer-based rotation-aware network (ST-RAN), to address the issue of images appearing at arbitrary rotation angles. However, the aforementioned methods extract deep features from CNNs for retrieval without utilizing the features of Transformer models. In contrast to CNN models, Transformer models can perform global context modeling and better comprehend the semantic relationships of the entire input sequence. Therefore, they can capture global contextual information and extract richer features.
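The VLAD aggregation step mentioned above can be sketched as follows: each local descriptor is assigned to its nearest codebook centroid, the residuals are summed per centroid, and the concatenated result is L2-normalized. This is a minimal illustration of the standard VLAD recipe with a hypothetical two-centroid codebook and made-up descriptors; it is not the pipeline of [21], which also involves attention and a learned codebook.

```python
import math
from typing import List


def vlad(descriptors: List[List[float]], centroids: List[List[float]]) -> List[float]:
    """Aggregate local descriptors into one global VLAD vector of length k * d."""
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for desc in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean).
        nearest = min(
            range(k),
            key=lambda j: sum((x - c) ** 2 for x, c in zip(desc, centroids[j])),
        )
        # Accumulate the residual between the descriptor and that centroid.
        for i in range(d):
            residuals[nearest][i] += desc[i] - centroids[nearest][i]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical codebook and local descriptors, for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
global_descriptor = vlad(descriptors, centroids)
print(global_descriptor)
```

The output length depends only on the codebook size and descriptor dimension, so images with different numbers of local features map to vectors of the same size.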

3. CBRSIR Based on Deep Hashing Features

Hashing has been widely used in large-scale remote sensing image retrieval due to its prominent advantages in storage and retrieval speed. Li et al. [25] proposed the deep hashing neural network (DHNN), which utilizes deep feature learning neural networks to learn high-dimensional embedding features and hash learning neural networks to learn low-dimensional hashing features. This model can be optimized end-to-end. To address the overfitting caused by the limited number of labeled images in remote sensing datasets, Roy et al. [26] proposed a deep hashing network based on metric learning. Liu et al. [27] introduced a deep supervised hashing model, built on the framework of generative adversarial networks (GANs), that learns compact and effective hash codes using a loss function composed of classification, similarity, and bit entropy terms. Cheng et al. [28] proposed a semantic-preserving deep hashing model, which applies deep hashing to multi-label remote sensing image retrieval. It introduces a pairwise label similarity loss that fully utilizes multi-label information, demonstrating the effectiveness of hashing methods in multi-label remote sensing image retrieval. Tan et al. [29] proposed deep contrastive self-supervised hashing for remote sensing image retrieval, which utilizes unlabeled images for training. This method assumes that hash codes generated from different views of the same image should be similar, while those generated from different images should be dissimilar, and it uses a loss function designed to preserve this similarity of hash codes. Jing et al. [30] presented a deep unsupervised weighted hashing model that utilizes a pretrained Swin Transformer to extract feature representations. This model uses a loss function that adaptively assigns weights to positive and negative samples and combines it with a quantization loss, resulting in improved model performance.
Although these deep hashing methods have achieved good retrieval results, they extract single-layer features without employing methods for fusing multiple features. Single-feature extraction is insufficient to fully express the rich detailed information and semantic information of remote sensing images. The adoption of multi-feature fusion in hashing methods has the potential to improve the accuracy of remote sensing image retrieval.
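Several of the methods in this section combine a term that preserves similarity between pairs of hash codes with a quantization term that pushes relaxed codes toward binary values. The sketch below shows one generic form of each term, assuming relaxed codes in [-1, 1] (for example, tanh outputs). The formulas and function names are illustrative assumptions and do not reproduce the loss of any specific cited paper.

```python
from typing import Sequence


def similarity_loss(u: Sequence[float], v: Sequence[float], similar: bool) -> float:
    """Squared gap between the normalized inner product and its target (+1 or -1)."""
    agreement = sum(a * b for a, b in zip(u, v)) / len(u)
    target = 1.0 if similar else -1.0
    return (agreement - target) ** 2


def quantization_loss(u: Sequence[float]) -> float:
    """Mean squared distance of each relaxed bit from the nearest of -1 and +1."""
    return sum((abs(a) - 1.0) ** 2 for a in u) / len(u)


def total_loss(u: Sequence[float], v: Sequence[float], similar: bool,
               weight: float = 0.1) -> float:
    """Similarity term plus weighted quantization terms (the weight is an assumed value)."""
    return similarity_loss(u, v, similar) + weight * (
        quantization_loss(u) + quantization_loss(v))


# Hypothetical relaxed codes for a pair of images from the same class.
code_a = [0.9, -0.8, 0.7, -0.9]
code_b = [0.8, -0.9, 0.6, -0.7]
loss_same = total_loss(code_a, code_b, similar=True)
loss_diff = total_loss(code_a, code_b, similar=False)
print(loss_same < loss_diff)
```

For this nearly identical pair, the loss is small when the pair is labeled similar and large when it is labeled dissimilar, which is the behavior a similarity-preserving hashing loss is meant to have.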

References

  1. Tang, X.; Yang, Y.; Ma, J.; Cheung, Y.M.; Liu, C.; Liu, F.; Zhang, X.; Jiao, L. Meta-Hashing for Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615419.
  2. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv 2022, arXiv:2204.09868.
  3. Ye, F.; Luo, W.; Dong, M.; He, H.; Min, W. SAR Image retrieval based on unsupervised domain adaptation and clustering. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1482–1486.
  4. Sumbul, G.; Ravanbakhsh, M.; Demir, B. Informative and Representative Triplet Selection for Multilabel Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405811.
  5. Zhuo, Z.; Zhou, Z. Remote Sensing Image Retrieval with Gabor-CA-ResNet and Split-Based Deep Feature Transform Network. Remote Sens. 2021, 13, 869.
  6. Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and application. Math. Probl. Eng. 2022, 2022, 5880959.
  7. Ma, J.; Shi, D.; Tang, X.; Zhang, X.; Jiao, L. Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval. Remote Sens. 2022, 14, 1319.
  8. Shabbir, A.; Ali, N.; Ahmed, J.; Zafar, B.; Rasheed, A.; Sajid, M.; Ahmed, A.; Dar, S.H. Satellite and scene image classification based on transfer learning and fine tuning of ResNet50. Math. Probl. Eng. 2021, 2021, 5843816.
  9. Wang, Y.; Ji, S.; Lu, M.; Zhang, Y. Attention boosted bilinear pooling for remote sensing image retrieval. Int. J. Remote Sens. 2020, 41, 2704–2724.
  10. Bo, L.; Sminchisescu, C. Efficient match kernel between sets of features for visual recognition. Adv. Neural Inf. Process. Syst. 2009, 22, 135–143.
  11. Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote Sensing Image Registration Using Convolutional Neural Network Features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236.
  12. Ye, F.; Luo, W.; Dong, M.; Li, D.; Min, W. Content-based Remote Sensing Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8002505.
  13. Kumar, A.; Yadav, D.P.; Kumar, D.; Pant, M.; Pant, G. Multi-scale feature fusion-based lightweight dual stream transformer for detection of paddy leaf disease. Environ. Monit. Assess. 2023, 195, 1020.
  14. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Tariq, A.; Qin, S. Multiscale Dual-Branch Residual Spectral-Spatial Network With Attention for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5455–5467.
  15. Chen, H.; Guo, X. Multi-scale feature fusion pedestrian detection algorithm based on Transformer. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 536–540.
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
  18. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122.
  19. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424.
  20. Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-Based High-Resolution Remote Sensing Image Retrieval via Unsupervised Feature Learning and Collaborative Affinity Metric Fusion. Remote Sens. 2016, 8, 709.
  21. Imbriaco, R.; Sebastian, C.; Bondarev, E. Aggregated Deep Local Features for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 493.
  22. Hou, D.; Miao, Z.; Xing, H.; Wu, H. Exploiting low dimensional features from the MobileNets for remote sensing image retrieval. Earth Sci. Inform. 2020, 13, 1437–1443.
  23. Wang, Y.; Ji, S.; Zhang, Y. A learnable joint spatial and spectral transformation for high resolution remote sensing image retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8100–8112.
  24. Wu, Z.; Zou, C.; Wang, Y.; Tan, M.; Weise, T. Rotation-Aware Representation Learning for Remote Sensing Image Retrieval. Inf. Sci. 2021, 572, 404–423.
  25. Li, Y.; Zhang, Y.; Xin, H.; Hu, Z.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 56, 950–965.
  26. Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images; Cornell University: Ithaca, NY, USA, 2019.
  27. Liu, C.; Ma, J.; Tang, X.; Zhang, X.; Jiao, L. Adversarial hash-code learning for remote sensing image retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 4324–4327.
  28. Cheng, Q.; Huang, H.; Ye, L.; Fu, P.; Gan, D.; Zhou, Y. A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 4965.
  29. Tan, X.; Zou, Y.; Guo, Z.; Zhou, K.; Yuan, Q. Deep Contrastive Self-Supervised Hashing for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 3643.
  30. Jing, W.; Xu, Z.; Li, L.; Wang, J.; He, Y.; Chen, G. Deep Unsupervised Weighted Hashing for Remote Sensing Image Retrieval. J. Database Manag. (JDM) 2022, 33, 1–19.