In high-resolution remote sensing image retrieval, single-scale features cannot fully capture the complexity of the image content, and the sheer volume of remote sensing imagery makes retrieval costly in both memory and time. To address these issues, researchers have proposed PVTA_MSF, an end-to-end deep hashing retrieval model that fuses multi-scale features on top of the Pyramid Vision Transformer backbone (PVTv2).
1. Introduction
With the rapid advancement of Earth observation technology, the number of remote sensing satellites has increased significantly, resulting in rapid growth in the volume of remote sensing images [1]. Effectively locating and retrieving the desired remote sensing images from massive databases, as well as efficiently managing and utilizing remote sensing image data, pose formidable challenges [2]. Remote sensing image retrieval (RSIR) aims to retrieve the required remote sensing images accurately and efficiently from extensive databases and can be categorized into text-based RSIR and content-based RSIR (CBRSIR) [3]. Text-based RSIR retrieves tagged images from the remote sensing database based on query keywords or labels, but it requires extensive manual annotation of every image in the dataset beforehand. CBRSIR, on the other hand, retrieves images from the database that closely resemble a query image. This approach aligns closely with human visual perception and is currently the dominant retrieval paradigm. However, because remote sensing images contain complex scenes and rich background information, extracting effective retrieval features and accurately measuring feature similarity remain open problems.
CBRSIR comprises three main components: feature extraction, feature dimensionality reduction, and similarity calculation. Early CBRSIR features were limited to basic patterns in the images, such as lines, shapes, and textures. These manually designed features are known as low-level features; SIFT [4], LBP [5], and HOG [6] are typical examples. Low-level features describe local image content and can be aggregated into mid-level features using descriptor aggregation techniques such as BoW [7], VLAD [8], FK [9], and EMK [10]. With the development of deep learning and its introduction to image retrieval, convolutional neural networks (CNNs) [11] became the standard feature extractors for obtaining abstract representations of remote sensing images [12], referred to as high-level features. However, the high dimensionality of these deep features incurs substantial computational and storage costs, so dimensionality reduction is needed to improve retrieval speed and reduce memory usage. Various studies have shown that encoding or pooling methods can reduce feature dimensionality. One such technique is hashing, which encodes features into binary hash codes, significantly reducing retrieval time and memory use.
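To make the storage and speed argument concrete, the following minimal sketch (plain NumPy; the random-projection hasher and the 64-bit code length are illustrative assumptions, not the method proposed in this entry) turns a high-dimensional deep feature into a compact binary code and compares codes by Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hasher(feat_dim: int, n_bits: int = 64):
    """Return a simple LSH-style hasher: sign of a random projection."""
    proj = rng.standard_normal((feat_dim, n_bits))
    def hash_fn(feature: np.ndarray) -> np.ndarray:
        # Binary code: 1 where the projected value is positive, else 0.
        return (feature @ proj > 0).astype(np.uint8)
    return hash_fn

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    # XOR-style comparison counts differing bits; far cheaper than
    # computing float similarities over the full feature vectors.
    return int(np.count_nonzero(a != b))

# A 2048-d deep feature (e.g., from a CNN) becomes a 64-bit code:
hasher = make_hasher(feat_dim=2048, n_bits=64)
query_code = hasher(rng.standard_normal(2048))
db_code = hasher(rng.standard_normal(2048))
print(hamming_distance(query_code, db_code))
```

Comparing 64-bit codes by Hamming distance is orders of magnitude cheaper than computing floating-point similarities over 2048-dimensional vectors, which is the core appeal of hashing-based retrieval.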
The primary challenge in CBRSIR is the vast area covered by remote sensing images, which depict multiple object categories and complex background information. Retrieval accuracy suffers from high similarity between images of different categories, significant differences between images of the same category, and the diverse orientations of image targets. Remote sensing images can be represented from various perspectives using features at different scales. Multi-scale feature fusion methods have been applied in multiple domains [13][14][15], such as hyperspectral image classification [14] and pedestrian detection [15], and have demonstrated significant effectiveness. Inspired by this, some CBRSIR studies use feature fusion to overcome the limited expressive power of a single feature by extracting multiple features from the same or different models. Nonetheless, in these methods feature extraction and feature fusion are usually separate stages, making it difficult to jointly learn features at different scales and to perform end-to-end multi-feature fusion.
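As a concrete contrast to such separated pipelines, the sketch below (PyTorch; the stage widths mirror PVTv2-like values, but the module itself is an illustrative assumption rather than the PVTA_MSF design) fuses pooled features from several backbone stages inside a single trainable module, so extraction and fusion are learned jointly end-to-end:

```python
import torch
import torch.nn as nn

class MultiScaleFusionHead(nn.Module):
    """Fuse pooled feature maps from several backbone stages, end-to-end."""

    def __init__(self, stage_dims=(64, 128, 320, 512), out_dim=256):
        super().__init__()
        # One linear projection per stage maps all scales to a common width.
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in stage_dims])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, stage_maps):
        # stage_maps: list of tensors, each of shape (B, C_i, H_i, W_i).
        pooled = [self.pool(f).flatten(1) for f in stage_maps]  # each (B, C_i)
        fused = torch.stack([p(v) for p, v in zip(self.proj, pooled)]).sum(0)
        return fused  # (B, out_dim): one descriptor combining all scales

# Fake pyramid outputs with PVTv2-like channel widths and resolutions:
maps = [torch.randn(2, c, s, s) for c, s in [(64, 56), (128, 28), (320, 14), (512, 7)]]
print(MultiScaleFusionHead()(maps).shape)  # torch.Size([2, 256])
```

Because the fusion head is just another differentiable module, gradients from the retrieval loss flow back into every backbone stage, which is what "end-to-end multi-feature fusion" means in practice.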
Recently, Transformer models have gained significant attention in computer vision. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT), a pure Transformer-based model for image classification. After training on large datasets, ViT outperformed traditional convolutional neural network (CNN) models and demonstrated stronger generalization capability. However, ViT only generates feature maps of a single resolution, and its global self-attention incurs high computational complexity. To address these issues, Liu et al. [17] proposed the Swin Transformer, which adopts a hierarchical structure similar to CNNs and can process multi-scale images. Moreover, it employs a shifted-window operation to compute attention within local windows, reducing the computational complexity from quadratic (as in ViT) to linear in image size. Wang et al. [18] proposed the Pyramid Vision Transformer (PVT), the first Transformer-based architecture built around a feature pyramid. PVT features a progressive shrinking pyramid structure and a spatial-reduction attention (SRA) mechanism, which significantly reduces computational complexity compared with ViT. PVTv2 [19] further improves on PVT by introducing overlapping patch embeddings and a linear spatial-reduction attention mechanism, making the feature pyramid Transformer a viable backbone network for visual tasks. Beyond image classification, Transformer models have demonstrated stronger feature extraction capabilities than CNNs in fields such as object detection, semantic segmentation, and image processing.
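The core idea of SRA is that queries keep full resolution while keys and values are computed from a spatially downsampled feature map, shrinking the attention matrix by the square of the reduction ratio. Below is a minimal single-head sketch (PyTorch; the embedding dimension and reduction ratio are illustrative assumptions based on the published PVT design, not a faithful reimplementation):

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Single-head sketch of PVT-style spatial-reduction attention (SRA)."""

    def __init__(self, dim=64, sr_ratio=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # A strided conv downsamples the map that keys/values come from.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.scale = dim ** -0.5

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W.
        B, N, C = x.shape
        q = self.q(x)                                       # (B, N, C)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, N/r^2, C)
        k, v = self.kv(x_).chunk(2, dim=-1)
        attn = (q @ k.transpose(1, 2)) * self.scale         # (B, N, N/r^2)
        return attn.softmax(dim=-1) @ v                     # (B, N, C)

x = torch.randn(2, 56 * 56, 64)
print(SpatialReductionAttention()(x, 56, 56).shape)  # torch.Size([2, 3136, 64])
```

With a reduction ratio of 4, the attention matrix has N x N/16 entries instead of N x N, which is what makes early high-resolution stages affordable.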
2. CBRSIR Based on CNN Features
Deep features extracted from CNNs have been increasingly utilized in CBRSIR. For instance, Li et al. [20] designed four unsupervised convolutional neural networks that generate four types of deep features at different layers; by combining these deep features with traditional handcrafted features, they provided more effective features for CBRSIR. Raffaele et al. [21] extracted deep local convolutional features from fine-tuned CNN models and aggregated them into global descriptors using the vector of locally aggregated descriptors (VLAD), employing multiplicative and additive attention mechanisms to suppress irrelevant background interference. Hou et al. [22] fine-tuned the MobileNet model to extract deep convolutional features and obtained low-dimensional feature representations by changing the dimension of the final fully connected layer, comparing the resulting retrieval accuracy against principal component analysis (PCA) dimensionality reduction. For cross-dataset remote sensing image retrieval, Wang et al. [23] proposed a learnable joint spatial and spectral transformation (JSST) model to correct spatial and spectral distortions in images; embedding the spatially and spectrally corrected inputs at the front end of a ResNet34 network improved generalization and adaptability. Wu et al. [24] proposed two rotation-aware networks, the feature-map-transformation-based rotation-aware network (FMT-RAN) and the spatial-transformer-based rotation-aware network (ST-RAN), to address images appearing at arbitrary rotation angles.
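Because VLAD aggregation recurs throughout this line of work, a compact sketch may help. The snippet below (NumPy; the codebook size and descriptor dimension are illustrative assumptions, and the power/intra-normalization steps often used in practice are omitted) accumulates residuals between local descriptors and their nearest codebook centers into one global descriptor:

```python
import numpy as np

def vlad(local_feats: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Aggregate local descriptors (N, D) over a codebook (K, D) into (K*D,)."""
    # Assign each local descriptor to its nearest cluster center.
    d2 = ((local_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                  # (N,) cluster indices
    agg = np.zeros_like(centers)                # (K, D) residual accumulators
    for k in range(len(centers)):
        members = local_feats[assign == k]
        if len(members):
            # Sum of residuals between descriptors and their center.
            agg[k] = (members - centers[k]).sum(axis=0)
    v = agg.ravel()
    v /= np.linalg.norm(v) + 1e-12              # L2-normalize the descriptor
    return v

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 128))         # e.g., 500 local CNN features
centers = rng.standard_normal((16, 128))        # K = 16 visual words
print(vlad(feats, centers).shape)               # (2048,)
```

The resulting fixed-length vector can be compared with ordinary cosine or Euclidean distance regardless of how many local features each image produced.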
However, the aforementioned methods extract deep features from CNNs for retrieval without exploiting the features of Transformer models. Unlike CNNs, Transformers perform global context modeling and better capture the semantic relationships across the entire input sequence, allowing them to extract richer, globally informed features.
3. CBRSIR Based on Deep Hashing Features
Hashing has been widely used in large-scale remote sensing image retrieval due to its prominent advantages in storage and retrieval speed. Li et al. [25] proposed the deep hashing neural network (DHNN), which uses a deep feature learning network to learn high-dimensional embedding features and a hash learning network to learn low-dimensional hash features; the model can be optimized end-to-end. To address overfitting caused by the limited number of labeled images in remote sensing datasets, Roy et al. [26] proposed a deep hashing network based on metric learning. Liu et al. [27] introduced a deep supervised hashing model built on the generative adversarial network (GAN) framework, with a loss function composed of classification, similarity, and bit-entropy terms, to learn compact and effective hash codes. Cheng et al. [28] proposed a semantic-consistency deep hashing model that applies deep hashing to multi-label remote sensing image retrieval; its pairwise label-similarity loss fully exploits multi-label information, demonstrating the effectiveness of hashing methods in that setting. Tan et al. [29] proposed deep contrastive self-supervised hashing for remote sensing image retrieval, which trains on unlabeled images under the assumption that hash codes generated from different views of the same image should be similar while codes from different images should be dissimilar, and designed a loss function that preserves this similarity structure. Jing et al. [30] presented a deep unsupervised weighted hashing model that uses a pretrained Swin Transformer to extract feature representations and an adaptive weighting loss that assigns weights to positive and negative samples, combined with a quantization loss, improving model performance. Although these deep hashing methods achieve good retrieval results, they extract single-layer features and do not fuse multiple features; a single feature is insufficient to fully express the rich detail and semantics of remote sensing images. Adopting multi-feature fusion in hashing methods therefore has the potential to improve the accuracy of remote sensing image retrieval.
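To ground the loss terms that recur in these works (pairwise similarity plus quantization), the following minimal sketch (PyTorch; the inner-product formulation and the weighting constant are illustrative assumptions, not the loss of any specific paper above) shows a supervised deep hashing objective over relaxed, tanh-activated codes:

```python
import torch
import torch.nn.functional as F

def deep_hash_loss(codes: torch.Tensor, labels: torch.Tensor, lam: float = 0.1):
    """Pairwise similarity loss + quantization loss over relaxed hash codes.

    codes:  (B, L) network outputs in (-1, 1), e.g., after tanh.
    labels: (B,)   integer class labels defining pairwise similarity.
    """
    L = codes.size(1)
    # S_ij = 1 if the pair shares a label, else 0.
    sim = (labels[:, None] == labels[None, :]).float()
    # Inner products of codes, rescaled to [0, 1]; similar pairs should score high.
    ip = (codes @ codes.t()) / (2 * L) + 0.5
    pair_loss = F.binary_cross_entropy(ip.clamp(1e-6, 1 - 1e-6), sim)
    # Quantization loss: push relaxed codes toward the binary values {-1, +1}.
    quant_loss = (codes.abs() - 1.0).pow(2).mean()
    return pair_loss + lam * quant_loss

codes = torch.tanh(torch.randn(8, 64, requires_grad=True))
labels = torch.randint(0, 3, (8,))
loss = deep_hash_loss(codes, labels)
loss.backward()   # trainable end-to-end with the feature extractor
print(loss.item())
```

The quantization term pulls the relaxed codes toward plus or minus one, so binarizing with a sign function at retrieval time loses little of the similarity structure learned during training.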
This entry is adapted from the peer-reviewed paper 10.3390/rs15194729