BCD Datasets and SSL in Remote Sensing CD: Comparison
Please note this is a comparison between Version 1 by Wenqing Feng and Version 2 by Jessie Wu.

The detection of building changes (hereafter ‘building change detection’, BCD) is a critical task in remote sensing analysis. Accurate BCD faces challenges such as complex scenes, radiometric differences between bi-temporal images, and a shortage of labelled samples. Traditional supervised deep learning requires abundant labelled data, which is expensive to obtain for BCD; by contrast, unlabelled remote sensing imagery is plentiful. Self-supervised learning (SSL) offers a solution, enabling models to learn from unlabelled data without explicit labels. Inspired by SSL, the researchers employed the SimSiam algorithm to acquire domain-specific knowledge from remote sensing data. The well-initialised weight parameters were then transferred to BCD tasks, achieving optimal accuracy. On this basis, a novel BCD framework was developed that combines self-supervised contrastive pre-training with historical geographic information system (GIS) vector maps (HGVMs). 
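As a rough illustration of the pre-training objective behind SimSiam, the sketch below implements its symmetrised negative cosine similarity loss in NumPy. Names and shapes are illustrative only; in practice the p* inputs come from a predictor head, the z* inputs from the encoder/projector, and the stop-gradient is enforced by the training framework rather than by a comment.

```python
import numpy as np

def negative_cosine(p, z):
    """SimSiam similarity term: -cos(p, z). In the real algorithm, z is
    wrapped in a stop-gradient so no gradients flow through it."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)  # conceptually: stopgrad(z)
    return -float(np.dot(p, z))

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrised SimSiam loss over two augmented views of one image:
    L = D(p1, stopgrad(z2))/2 + D(p2, stopgrad(z1))/2."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

# Toy example: feature vectors from two augmentations of one image patch
# (here the predictor is taken as the identity for simplicity).
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=128), rng.normal(size=128)
loss = simsiam_loss(z1, z1, z2, z2)
```

Minimising this loss pulls the two views of the same scene together in feature space, which is what lets the pre-trained encoder transfer to downstream BCD.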

  • self-supervised learning
  • building change detection
  • pre-training
  • remote sensing

1. Brief Overview of Building Change Detection Datasets and Methods

The rise of deep learning has revolutionised building change detection (BCD) by employing deep convolutional neural networks (DCNNs) for end-to-end dense prediction in remote sensing imagery. In high-resolution remote sensing images, deep learning techniques enable the segmentation and labelling of building objects, facilitating the extraction of specific building information. Image semantic segmentation methods merge traditional image segmentation techniques with object recognition, dividing images into distinctive regions with unique characteristics and thereby addressing the problem of precise pixel-level prediction in remote sensing imagery. A range of open-source datasets for building extraction and BCD has emerged, such as the Massachusetts Building Dataset [1], the Inria Aerial Image Labelling Dataset [2], the WHU Aerial Building Dataset [3], the Aerial Imagery for Roof Segmentation (AIRS) dataset [4], LEVIR-CD [5], WHU BCD [6], the Google Data Set [7], S2Looking [8], DSIFN [9], and 3DCD [10]. In addition, semantic segmentation methods, predominantly utilising fully convolutional networks (FCNs), have become widely used for building extraction tasks. Noteworthy networks in this field include SegNet [11], UNet [12], UNet++ [13], PSPNet [14], HRNet [15], ResUNet [16], and Deeplab V3+ [17]. The availability of these open-source datasets has significantly accelerated the progress of building extraction and BCD techniques rooted in deep learning.
Currently, within the realm of BCD, a notable supervised technique is the Fully Convolutional Siamese Network (FCSN) [18]. The FCSN typically adopts a dual-branch structure with shared weight parameters and takes bi-temporal remote sensing images as inputs; specific modules then calculate the similarity between the bi-temporal images. The first FCSN, proposed by Daudt et al. [18], includes three typical structures: FC-EF, FC-Siam-conc, and FC-Siam-diff. These models fuse the differential and concatenated features of multi-temporal remote sensing images during training to achieve fast and accurate CD maps. Zhang et al. [9] proposed the DSIFN model, which uses the VGG network [19] to extract deep features of bi-temporal remote sensing images and applies spatial and channel attention modules in the decoder to fuse multi-layer features. Fang et al. [20] proposed the SNUNet model, which is based on NestedUNet and Siamese networks and uses channel attention modules to enhance image features, solving the problem of positional loss of change information in deep networks through dense connections. Chen et al. [21] proposed the DASNet model, which mainly utilises attention mechanisms to capture the long-range correlations of bi-temporal images and obtain the feature representation of the final change map. Shi et al. [22] proposed the DSAMNet model, which introduces a metric module to learn change features and integrates convolutional block attention modules (CBAMs) to provide more discriminative features. Liu et al. (2021) proposed a super-resolution-based CD network (SRCDNet) with a stacked attention module (SAM) to help detect changes and overcome the resolution difference between bi-temporal images. Papadomanolaki et al. [23] proposed the BiDateNet model, which integrates LSTM blocks into the skip connections of UNet to help detect changes between multi-temporal Sentinel-2 data. Song et al. [24] proposed the SUACDNet model, which uses residual structures and three types of attention modules to optimise the network, making it more sensitive to change regions while filtering out background noise. Lee et al. [25] proposed a local similarity Siamese network for handling CD problems in complex urban areas. Subsequently, Yin et al. [26] proposed a unique attention-guided Siamese network (SAGNet) to address the challenges of edge uncertainty and small-target omission in the BCD process. Zheng et al. [27] proposed the CLNet model, which uses a special cross-layer block (CLB) to integrate contextual information and multi-scale image features from different stages; the CLB reuses extracted features and captures pixel-level variation in complex scenes. In general, to improve the accuracy of CD, the aforementioned methods emphasise the design of an effective FCSN architecture and adopt common parameter initialisation methods such as random values or ImageNet pre-trained models. However, because prior knowledge is lacking in the CD process, the performance of these methods can be limited by the chosen parameter initialisation, particularly when labelled sample data are insufficient.
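To make the FC-Siam-diff idea concrete, the following minimal NumPy sketch stands in for the weight-sharing encoder with a simple per-pixel linear map (a 1 × 1 convolution) and forms the absolute feature difference that the decoder would consume. All names, shapes, and the toy "encoder" are hypothetical, not the original implementation.

```python
import numpy as np

def shared_encoder(img, W):
    """Stand-in for one weight-sharing Siamese branch: a 1x1 convolution
    implemented as a per-pixel linear map from C_in to C_out channels."""
    h, w, c = img.shape
    return img.reshape(-1, c) @ W  # -> (h*w, C_out)

def fc_siam_diff_features(img_t1, img_t2, W):
    """FC-Siam-diff idea: run both dates through the SAME encoder
    (shared weights W), then pass |f(t1) - f(t2)| to the decoder."""
    f1 = shared_encoder(img_t1, W)
    f2 = shared_encoder(img_t2, W)
    return np.abs(f1 - f2)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 16))   # shared weights used by both branches
t1 = rng.random((32, 32, 3))   # hypothetical date-1 RGB patch
t2 = t1.copy()
t2[8:16, 8:16] += 0.5          # simulated new building footprint
diff = fc_siam_diff_features(t1, t2, W)
```

Unchanged pixels yield (near-)zero difference features, while the simulated building footprint produces a strong response, which is what the decoder of an FCSN learns to turn into a change map.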

2. Use of Self-Supervised Learning in Remote Sensing Change Detection (CD)

Self-supervised learning (SSL) methods can acquire universal feature representations that exhibit remarkable generalisation across various downstream tasks [28][29][30][31][32][33][34]. Among these approaches, contrastive learning has recently gained substantial attention in the academic community, demonstrating impressive performance. Currently, self-supervised learning network models based on pre-training methods fall into three primary categories. The first category encompasses contrastive learning methods, which pair similar samples as positive pairs and dissimilar samples as negative pairs. These models are trained using the InfoNCE loss to maximise the similarity between positive pairs while increasing the dissimilarity between negative pairs [30]. For example, Chen et al. [35] proposed a self-supervised approach to pixel-level CD in bi-temporal remote sensing images, as well as a self-supervised CD method based on an unlabelled multi-view setting, which can handle multi-temporal remote sensing image data from different sources and times [36]. The second category includes knowledge distillation methods, such as BYOL [33], SimSiam [34], and DINO [37]. These techniques train a student network to predict the representations of a teacher network; the teacher network’s weights are updated as a moving average of the student’s rather than by backpropagation. For example, Yan et al. [38] introduced a domain knowledge-guided self-supervised learning method that selects high-similarity feature vectors output by the mean teacher and student networks using cosine similarity, implementing a hard negative sampling strategy that effectively improves CD performance. The third category involves masked image modelling (MIM) methods [39][40], in which specific regions of an image are randomly masked and the model is trained to reconstruct the masked portions. This approach has the advantage of reduced reliance on large annotated datasets: by utilising a large number of unlabelled images, it is possible to train highly capable models that can discern and interpret image content. For example, Sun et al. [41] presented RingMo, a foundational model framework for remote sensing that integrates the Patch Incomplete Mask (PIMask) strategy. The framework demonstrated state-of-the-art (SOTA) performance across various tasks, including image classification, object detection, semantic segmentation, and CD.
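The InfoNCE objective that underpins the contrastive methods above can be sketched as follows: each anchor is contrasted against one positive (the matching row of a second view batch) and the remaining rows as negatives. This is a NumPy toy version; batch size, feature dimensionality, and temperature are illustrative choices.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: row i of `positives` is the positive for
    row i of `anchors`; all other rows serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))      # positives on diagonal

rng = np.random.default_rng(2)
views = rng.normal(size=(8, 64))
# Well-aligned positives (two close augmentations) vs. unrelated ones.
loss_aligned = info_nce(views, views + 0.01 * rng.normal(size=(8, 64)))
loss_random = info_nce(views, rng.normal(size=(8, 64)))
```

When the positive pairs are genuine augmentations of the same sample, the loss is far lower than for unrelated pairs, which is exactly the signal the encoder is trained on.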
Self-supervised remote sensing pre-training can learn meaningful feature representations from large amounts of unlabelled remote sensing image data. These representations can improve the performance of various downstream CD tasks, which has drawn the attention of many researchers. Saha et al. [42] proposed a method for multi-sensor CD that uses only unlabelled target bi-temporal images to train a network combining deep clustering and SSL. Dong et al. [43] proposed a self-supervised representation learning method based on temporal prediction for CD in remote sensing images. This method transforms bi-temporal images into more consistent feature representations through self-supervision, thereby avoiding semantic supervision or any additional computation; based on the transformed feature representations, it obtains better difference images and reduces the propagation error of difference images in CD. Ou et al. [44] used multi-temporal hyperspectral remote sensing images to propose a hyperspectral image CD framework with an SSL pre-trained model. All of the aforementioned studies apply self-supervised learning directly to downstream small-scale CD datasets to extract seasonally invariant features for unsupervised CD. Similarly, Ramkumar et al. [45][46] proposed a self-supervised pre-training method for natural image scene CD tasks. Jiang et al. [47] proposed a self-supervised global–local contrastive learning (GLCL) framework that extends instance discrimination to pixel-level CD tasks. Through GLCL, features from the same instance under different views are pulled closer together while features from different instances are pushed apart, enhancing the discriminative feature representation from both global and local perspectives for downstream CD tasks. Wang et al. [48] proposed a supervised contrastive pre-training and fine-tuning CD (SCPFCD) framework, which comprises two cascaded stages: supervised contrastive pre-training and fine-tuning. The SCPFCD framework aims to train a Siamese network for CD tasks on top of an encoder with good parameter initialisation. Chen et al. [49] proposed the SaDL method based on contrastive learning, which requires labels and image augmentation to obtain the multi-view positive samples used to pre-train the encoder for CD tasks. Compared to other pre-training methods, SaDL achieves the best CD results but requires additional single-temporal images manually labelled by human experts for pre-training, which is extremely expensive.
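As a minimal, hypothetical illustration of how consistent bi-temporal feature representations are turned into a difference image (a generic sketch, not the method of any specific study above), one can compute a per-pixel cosine distance between the two feature maps and threshold it:

```python
import numpy as np

def cosine_change_map(feat_t1, feat_t2):
    """Per-pixel cosine distance between two (H, W, C) feature maps.
    0 means identical features; larger values suggest change."""
    n1 = feat_t1 / (np.linalg.norm(feat_t1, axis=-1, keepdims=True) + 1e-8)
    n2 = feat_t2 / (np.linalg.norm(feat_t2, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - np.sum(n1 * n2, axis=-1)

rng = np.random.default_rng(3)
f1 = rng.normal(size=(16, 16, 32))          # hypothetical encoder output, date 1
f2 = f1.copy()
f2[4:8, 4:8] = rng.normal(size=(4, 4, 32))  # simulated changed region
dmap = cosine_change_map(f1, f2)
change_mask = dmap > 0.5                    # simple global threshold
```

The better aligned the bi-temporal features are (e.g. after SSL pre-training), the cleaner this difference image becomes, which is why pre-training quality propagates directly into CD accuracy.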