Edge-Guided Multimodal Transformers Change Detection: History
Please note this is an old version of this entry, which may differ significantly from the current revision.

Change detection from heterogeneous satellite and aerial images plays an increasingly important role in many fields, including disaster assessment, urban construction, and land use monitoring. Researchers have mainly devoted their attention to change detection using homologous image pairs and have achieved many remarkable results. However, it is sometimes necessary to use heterogeneous images for change detection in practical scenarios due to missing images, emergency situations, and cloud and fog occlusion.

  • change detection
  • remote sensing
  • transformer

1. Introduction

Remote sensing change detection refers to detecting the changes between a pair of images of the same geographical area on the Earth that were obtained at different times [1]. Accurate monitoring of the Earth’s surface changes is important for understanding the relationship between humans and the natural environment. With the advancement of aerospace remote sensing (RS) technology, massive multi-temporal remote sensing images provide ample data support for change detection (CD) research and promote the vigorous development of change detection applications. Change detection is a promising research topic in the field of remote sensing. As an advanced method for monitoring land cover conditions, CD has played a major role in important fields such as land monitoring [2], urban management [3], geological disasters [4], and emergency support [5].
With the diversification of remote sensing methods, the refined and integrated monitoring of satellite and aerial data has become a new development trend. Aerial remote sensing has the characteristics of strong mobility, sub-meter resolution, and rapid data acquisition, but in CD tasks it is constrained by the lack of pre-temporal historical data and a narrow coverage range. Therefore, it is necessary to complement it with satellite images to form a change monitoring system. According to whether a pair of CD images is obtained using the same RS platform or sensor, change detection algorithms can be divided into homologous change detection and heterogeneous change detection [6]. Traditional satellite image CD algorithms require multitemporal images of the same area from identical sensors under fairly strict conditions. Due to limitations such as weather (e.g., fog), the orbital revisit period, and the payload swath width, this approach cannot fully meet the complex and diverse application needs of the real world. Thus, it is necessary to use satellite and aerial images together for heterogeneous change detection.
In many practical applications, satellite–aerial change detection (SACD) has played an important role. Especially in emergency disaster evaluation and rescue, fast, flexible, and accurate methods are needed for timely assessment. With the rise and rapid development of aerial remote sensing technology, its high maneuverability, high spatial resolution, and timely data capture make it well suited to such scenarios. The pre-event image is usually a satellite image, owing to the abundance of historical data and wide coverage, while the post-event image is obtained through direct aircraft flights, which is the fastest acquisition route and provides higher resolution and more accurate information [7]. Furthermore, SACD also plays a significant role in land resource monitoring. Currently, land resource monitoring and urban management mainly rely on satellite RS image monitoring. However, satellite monitoring still falls short in mobility, resolution, and timeliness over cloudy and foggy areas, as it is easily constrained by weather conditions. Aerial remote sensing offers high spatial resolution, high revisit frequency, and high cost-effectiveness. At the same time, it avoids the coverage and resolution limitations that arise under rain and fog, and thus complements the capabilities of satellite remote sensing.
However, CD between satellite and aerial images remains highly challenging. The main challenges are as follows:
(1) Huge difference in resolution between satellite and aerial images. Because satellites and aircraft have different shooting heights and sensors, a satellite image’s resolution is usually lower than an aerial image’s. A high-resolution (HR) satellite image has a resolution of approximately 0.5–2 m [8], while an aerial image’s resolution is usually finer than 0.5 m [9] and can even reach the centimeter level. Aligning the resolutions of satellite and aerial image pairs through interpolation, convolution, or pooling is a direct solution, but it causes the image to lose a large amount of detail and introduces accumulated errors and speckle noise.
(2) Blurred edges caused by complex terrain scenes and interference from the satellite–aerial image gap. Dense building clusters are often affected by shadow occlusion, similar ground objects, and intraclass differences caused by very different materials, resulting in blurred edges. Moreover, parallax and interference from the lower resolution of satellite images relative to aerial images further increase the difficulty of building change detection.

2. Edge-Guided Multimodal Transformers Change Detection from Satellite and Aerial Images

2.1. Different Resolution for Change Detection

To address the issue of different resolutions in change detection, existing methods typically resample the images so that homologous CD methods become suitable for SACD tasks. Statistics-based interpolation is the most direct and convenient way to bridge the resolution differences between SACD image pairs. However, the ability of image interpolation to restore information is limited. More specifically, interpolation methods such as bilinear and bicubic interpolation perform poorly in the face of large resolution differences, producing more background noise and blurry edges, which increases the difficulty of feature alignment and generates many pseudo changes [12].
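To make the resampling baseline concrete, the sketch below implements plain bilinear upsampling with NumPy; the patch sizes and the 4× scale factor are illustrative, not taken from the entry.

```python
import numpy as np

def bilinear_upsample(img, scale):
    """Upsample an (H, W) image by an integer `scale` with bilinear interpolation."""
    h, w = img.shape
    new_h, new_w = h * scale, w * scale
    # Map target pixel centres back into source coordinates.
    ys = (np.arange(new_h) + 0.5) / scale - 0.5
    xs = (np.arange(new_w) + 0.5) / scale - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Hypothetical example: a 4x4 "satellite" patch upsampled 4x to match
# a 16x16 "aerial" patch before pixel-wise comparison.
lr = np.arange(16, dtype=float).reshape(4, 4)
hr = bilinear_upsample(lr, 4)
print(hr.shape)  # (16, 16)
```

Note that the interpolated values stay within the range of the input; no new detail is created, which is exactly the limitation the text describes.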
Beyond the simplest interpolation methods, sub-pixel-based methods have been studied most widely. Considering the superior ability of sub-pixel convolution to obtain high-resolution feature maps from low-resolution images [13,14,15], Ling et al. [16] first introduced sub-pixel convolution into CD to address the gap caused by the different resolutions of heterogeneous images. They adopted the principle of spatial correlation and designed a new land cover change pattern to obtain changes with sub-pixel convolution. Later, Wang et al. [17] proposed a Hopfield neural network with sub-pixel convolution to bridge the resolution gap between Landsat and MODIS images. Overall, compared to interpolation, sub-pixel-based methods, which employ a cleverly designed learnable up-sampling module, can better reconstruct LR images. However, they are largely restricted by the accuracy of the preceding feature map and focus solely on shallow feature reconstruction without utilizing deep semantic information, resulting in the accumulation of redundant errors.
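The core of sub-pixel convolution is a depth-to-space rearrangement (often called pixel shuffle) that turns a (C·r², H, W) feature map into a (C, rH, rW) one; the convolution itself runs on the cheap low-resolution grid. A minimal NumPy sketch of that rearrangement follows, with illustrative array sizes (this shows only the shuffle step, not a trained network):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    This depth-to-space step is what lets a convolution computed on the
    low-resolution grid emit a high-resolution output.
    """
    cr2, h, w = x.shape
    c = cr2 // (r * r)
    # Split channels into (C, r, r), then interleave each r x r group
    # of channels across the spatial dimensions.
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)        # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# Illustrative: 4 channels at 8x8 become 1 channel at 16x16 (r = 2).
feat = np.random.rand(4, 8, 8)
hr = pixel_shuffle(feat, 2)
print(hr.shape)  # (1, 16, 16)
```

Because the up-sampling weights live in the convolution that produces the C·r² channels, the module is learnable, which is the advantage over fixed interpolation noted above.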
Furthermore, super resolution (SR) is an independent task aimed at recovering high-resolution images from low-resolution (LR) inputs [12]. Li et al. [18] introduced an iterative super-resolution CD method for Landsat–MODIS CD that combines end-member estimation, spectral unmixing, and sub-pixel-based methods. Wu et al. [19] designed a back-propagation network to obtain sub-pixel LCC maps from the soft-classification results of LR images [12]. However, SR is not flexible enough and may be limited to fixed scaling factors in image recovery.

2.2. Deep Learning for Change Detection

Deep learning has been widely applied in the field of remote sensing vision [21,22,23]. In CD tasks, deep learning methods have demonstrated superiority and good generalization ability [24]. At present, most deep learning CD models are based on the Siamese network [25], which has two identical branches. Whether the two branches share weights depends on whether the task is homologous or heterogeneous change detection. Earlier research [26,27] used a Siamese network as the encoder to extract features and computed changes by concatenating the features directly. Subsequent researchers improved the regional accuracy of change detection by designing various attention modules, including dense attention [28], spatial attention [29], spatial–temporal attention [14], and others. However, existing CD methods have strived for regional change accuracy through attention mechanisms without recognizing the importance of edge information.
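Schematically, a Siamese CD pipeline applies the same encoder (or, for heterogeneous inputs, separately parameterised copies) to both images and then fuses the resulting features. The toy "encoder" below is a hypothetical stand-in for a real convolutional backbone, shown only to make the weight-sharing and fusion choices concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(img, weights):
    """Toy stand-in for a convolutional backbone: one linear map + nonlinearity."""
    return np.tanh(img.reshape(-1) @ weights)

# One shared weight set models homologous CD; heterogeneous CD would
# instead instantiate a second, independent weight set per branch.
w_shared = rng.normal(size=(64, 16))

img_t1 = rng.random((8, 8))   # pre-change image (illustrative size)
img_t2 = rng.random((8, 8))   # post-change image

f1 = encoder(img_t1, w_shared)
f2 = encoder(img_t2, w_shared)

# Two common fusion choices: concatenation (fed to a classification
# head) or absolute difference (large values hint at change).
fused = np.concatenate([f1, f2])
diff = np.abs(f1 - f2)
print(fused.shape, diff.shape)  # (32,) (16,)
```

The attention modules cited above refine the features f1 and f2 before fusion; the overall two-branch structure stays the same.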
Many remote sensing objects have their own unique and clear edge features, especially buildings [30]. However, most existing deep learning CD methods design various attention modules to improve regional accuracy without utilizing building edge information. Ignoring edge information leads to poor change detection performance in some cases, especially in heterogeneous SACD. In particular, dense building communities are often affected by shadow and interference from similar objects such as buildings and roads, resulting in blurred edges that interfere with change detection [31]. In SACD, the lower resolution of the satellite image compared to the aerial one can worsen this situation.
In building segmentation tasks, utilizing edge information as prior knowledge can help networks attend to both semantic and boundary features [28,32,33]. Reference [34] designed an edge detection module and fused segmentation masks, with the loss function also incorporating edge optimization. Reference [35] used an edge refinement module together with channel and location attention modules to enhance the network’s ability in CD tasks. Researchers in a previous study [7] fused and aligned satellite and aerial images in high-dimensional feature space through convolutional networks, and used the Hough method to obtain building edges as extra information to help the model focus on building contours and spatial positions. However, existing methods only use edge information as prior knowledge; they do not let it interact with deep semantic information or fully integrate edge features as a learnable part of the whole network.

This entry is adapted from the peer-reviewed paper 10.3390/rs16010086
