1. Scene-Level Change Detection
With the increasing availability of high-resolution images covering the Earth's surface, the contextual and textural information of landscapes is becoming more abundant and detailed. It is now possible to achieve land use analysis at the scene level, such as scene classification [1][2][3], scene segmentation [4][5][6], and scene change detection [7][8][9].
Although single-image classification has been extensively explored due to its wide applicability, especially for natural images, few studies have focused on scene change detection in multitemporal images. For multitemporal images, most work instead targets change detection at the pixel and object level, or the further identification of change types.
Nevertheless, pixel-level and object-level change detection methods are not well suited to land use variation analysis. The main reason is that changes to individual objects in a scene, such as vegetation growth or the demolition/construction of single buildings, do not necessarily change the land-use category, for example, from a residential area to an industrial area. Therefore, it is crucial to develop change detection methods at the scene scale. Detecting scene changes with multitemporal images and identifying land-use transitions ("from–to") at the scene scale is a new area of interest for urban development analysis and monitoring [10]. For example, the appearance of residential and commercial areas can indicate the development of a city [9].
Scene-level change detection (SLCD) by remote sensing, i.e., scene change detection in remote sensing images, seeks to analyze and identify land use changes in any given multitemporal remote sensing image of the same area from a semantic point of view. Here, "scene" refers to an image cropped from a large-scale remote sensing image that includes unique land-cover information [3].
In the SLCD method, the features of the two input images are used to generate a difference image (DI) map, and a decision method, such as threshold segmentation or a decision network, then classifies the input patch into two classes (change and no change [11]). The decision method carries out change detection as a binary classification task with two outputs: change or no change. Thus, the two challenges of SLCD are finding an effective method to extract distinguishing features from the images and seeking an optimal feature transformation space in which to explore temporal correlations.
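To make this pipeline concrete, the following is a minimal sketch of the threshold-segmentation variant: a shared encoder maps each scene to a feature vector, their distance acts as a scalar DI statistic, and a fixed threshold yields the binary decision. The encoder architecture, feature dimension, and threshold value are illustrative assumptions, not a published configuration.

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Toy shared encoder; real SLCD methods use deeper CNN backbones."""
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

def detect_scene_change(img_t1, img_t2, encoder, threshold=1.0):
    """Return True where the scene pair is predicted as 'changed'."""
    with torch.no_grad():
        f1, f2 = encoder(img_t1), encoder(img_t2)
        # The feature distance plays the role of the DI statistic.
        di = torch.norm(f1 - f2, dim=1)
    return di > threshold  # threshold segmentation as the decision method

encoder = SceneEncoder()
t1 = torch.rand(1, 3, 128, 128)  # scene patch at time 1
t2 = torch.rand(1, 3, 128, 128)  # scene patch at time 2
print(detect_scene_change(t1, t2, encoder))
```

A decision network would replace the fixed threshold with a small trainable classifier, as sketched later in this section.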
As with most computer vision tasks, extracting the discriminative features of a multitemporal image is an important and challenging step. Before deep learning was developed, significant efforts were made to derive discriminative visual features, including hand-crafted local features, e.g., the scale-invariant feature transform (SIFT) [12], and the encoding of local features using bag-of-visual-words (BoVW) [13]. Given the weaknesses of handcrafted features, several researchers turned to unsupervised learning techniques, e.g., sparse coding [14]. Features automatically learned from unlabeled data were successfully applied in scene classification and were then introduced to change detection. However, representative features of image scenes based on unsupervised learning have not been exploited adequately, limiting their ability to discriminate between different scene classes in remote sensing images. As regards the decision method for classifying scenes from the extracted features, the support vector machine (SVM) [15] is the most common and effective classifier [16].
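As an illustration of this classic pipeline, the sketch below quantizes local descriptors against a learned visual vocabulary, turns each scene into a word-frequency histogram, and classifies the histograms with a linear SVM. The random descriptors stand in for real SIFT features, and the vocabulary size, scene labels, and sample counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def extract_descriptors(image_id, n=50, dim=128):
    """Placeholder for SIFT: returns n local descriptors per scene."""
    return rng.normal(loc=image_id % 3, size=(n, dim))

# 1. Build the visual vocabulary from descriptors of training scenes.
train_ids = range(30)
all_desc = np.vstack([extract_descriptors(i) for i in train_ids])
vocab = KMeans(n_clusters=16, n_init=10, random_state=0).fit(all_desc)

def bovw_histogram(desc):
    """Quantize descriptors to visual words and count occurrences."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

# 2. Train an SVM on the BoVW histograms (labels here are synthetic).
X = np.array([bovw_histogram(extract_descriptors(i)) for i in train_ids])
y = np.array([i % 3 for i in train_ids])  # three toy scene classes
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:5]))
```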
With the collection of many annotated samples, developments in machine learning theory, and the growth of computational power, deep learning models (e.g., autoencoders, CNNs, and GANs) have demonstrated their power in learning expressive features [3]. This decisive advantage has spread to scene classification and change detection from remote sensing images [17].
The simplest SLCD method, i.e., the post-classification method, treats scene change detection as two independent classifications, ignoring the temporal correlation of the multitemporal images and thus suffering from error accumulation. Some researchers have begun to consider the temporal correlation between multitemporal image scenes, developing deep canonical correlation analysis (DCCA) regularization [8] and an improved method called Soft DCCA [18]. However, these only focus on learning correlated features from the two inputs and cannot be optimized to improve feature representation capabilities. A learned fully connected layer can be used to model the similarity between bi-temporal scenes and improve the reliability of the feature representation [19].
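A minimal sketch of such a learned similarity head is shown below: instead of a fixed distance metric, two fully connected layers score the bi-temporal scene features jointly, so gradients can also refine the upstream encoder. The feature dimension, hidden width, and labels are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits: [no change, change]
        )

    def forward(self, f1, f2):
        # Concatenating bi-temporal features lets the layers learn
        # temporal correlations rather than assume a fixed metric.
        return self.net(torch.cat([f1, f2], dim=1))

head = SimilarityHead()
f1, f2 = torch.rand(4, 64), torch.rand(4, 64)  # encoder outputs
logits = head(f1, f2)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()  # in practice, gradients also reach the encoder
print(logits.shape)  # torch.Size([4, 2])
```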
2. Region-Level Change Detection
In the change detection task, pixels and objects are the two main units of analysis. Region-level methods, including pixel-based and object-based ones, have been studied, wherein change detection is regarded as a dense binary segmentation task [20]. After the images have been prepared, a fully convolutional network (FCN) can classify each region (pixel or object) as change or no change. A general approach is to assign a change score to each region, whereby changed regions receive higher scores than unchanged ones. To some extent, this type of method allows for end-to-end change detection and avoids the accumulation of errors. Moreover, it offers a tremendous benefit in detection speed, which is helpful for large-scale data processing. Since most change detection applications involve identifying changes in specific regions or targets among multitemporal remote sensing images, region-level change detection methods are more popular than scene-level ones.
2.1. Patch/Super-Pixel-Based Change Detection
Patches and super-pixels are the most common detection units in remote sensing image processing applications. A patch is a regular image grid cell, while a super-pixel is an irregular cluster of adjacent pixels. Patches or super-pixels are first constructed; the DI map is then generated by voting and used as a pseudo-training set from which to learn the change type of the center pixel [11][21][22].
Patch/super-pixel-based change detection (PBCD) determines only whether an input pair has changed as a whole. After concatenating the features of the network channels in a multilevel Siamese network, the generated vector is used to train a two-layer decision network. This network handles change detection as a binary classification task with two outputs: 1/0 for change and no change, respectively [23]. Each patch or super-pixel passes through a convolutional network to generate a fixed-dimensional representation. Owing to their irregular shape, the features of super-pixels must be transformed into 1-D features, resulting in a loss of spatial information. Besides this, excessive interference information within the rectangular bounding box also seriously degrades the classification result; one way to keep such box interference out of the descriptor is masked pooling, as sketched below.
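The sketch below turns an irregular super-pixel into a 1-D feature by masked average pooling over a CNN feature map, so pixels outside the super-pixel do not pollute the descriptor. The rectangular mask, feature extractor, and shapes are illustrative stand-ins; a real mask would be irregular and come from a segmentation step.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 32, 3, padding=1)   # stand-in feature extractor
image = torch.rand(1, 3, 64, 64)
fmap = conv(image)                       # (1, 32, 64, 64)

# Boolean mask marking which pixels belong to this super-pixel.
mask = torch.zeros(1, 1, 64, 64)
mask[..., 10:30, 15:40] = 1              # irregular in practice

# Average only over pixels inside the super-pixel, not its bounding
# box, so surrounding "interference" pixels are excluded. Note that
# the super-pixel's internal 2-D layout is still lost in the pooling.
feat_1d = (fmap * mask).sum(dim=(2, 3)) / mask.sum()
print(feat_1d.shape)  # torch.Size([1, 32])
```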
To tackle this problem, a patch-based deep learning framework is used, in which each pixel and its neighbors form a patch with a fixed size of 3 × 3, 5 × 5, 7 × 7, etc. The method flattens the patch into a vector and uses it as model input to predict the change category of the patch's center pixel from its neighborhood values, following the principle of spatial proximity. For instance, the deep belief network (DBN) [21] and the multilayer perceptron (MLP) are, relatively speaking, the simplest 1-D neural network models used in patch-based change detection. In these methods, the patch is flattened to a 1-D vector as input; the weights are then initialized using greedy layer-wise pre-training and fine-tuned with labeled patch samples. However, to their detriment, these two architectures with fully connected layers involve a large number of learnable parameters, while only a limited number of annotated training examples are available for change detection, leading to overfitting and an increased computational cost. Furthermore, another drawback of the aforementioned networks is that they squeeze spatial features into 1-D vectors, so the 2-D spatial properties of the imagery are neglected.
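A minimal MLP sketch of this scheme follows: a 5 × 5 bi-temporal patch pair is flattened to a 1-D vector, discarding its 2-D layout, and the network predicts the center pixel's change label. The patch size, band count, and layer widths are assumptions, and the parameter-count problem described above is visible in the first linear layer.

```python
import torch
import torch.nn as nn

patch_size, bands = 5, 3
in_dim = 2 * bands * patch_size * patch_size  # both dates, flattened

mlp = nn.Sequential(
    nn.Linear(in_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 2),                         # change / no change
)

patch_t1 = torch.rand(8, bands, patch_size, patch_size)
patch_t2 = torch.rand(8, bands, patch_size, patch_size)
# Flattening destroys the 2-D spatial structure of the patch pair.
x = torch.cat([patch_t1, patch_t2], dim=1).flatten(1)  # (8, 150)
print(mlp(x).shape)  # torch.Size([8, 2])
```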
Another factor that can affect the model's performance is the patch size, which determines the size of the receptive field. It is usually challenging to find the size that gives the best performance. If the patch is too small, the limited receptive field provides insufficient contextual information and may limit change detection performance: the network cannot thoroughly learn the change information and its surrounding context, and thus fails to detect changes correctly. In addition, the method reduces computational efficiency and increases memory consumption due to the significant overlap between neighboring patches.
Moreover, the patch-based attention mechanism can effectively reduce the uncertainty of the categories predicted by PBCD methods without losing spatial information [24]. However, remote sensing images are significantly larger than natural images, while the objects of each class within them are smaller. Almost every image contains various object categories, so it is not easy to classify the scene information at the global level. In other words, typical attention-based procedures are not well suited to semantic learning on large-scale remote sensing images, because patch descriptors provide only limited information about the local context.
Thus, the main limitations of PBCD methods can be listed as follows: first, it is difficult to find an appropriate patch size, which significantly affects DNN performance; second, redundant information in pixel patches leads to overfitting and increases the computational cost.
2.2. Pixel-Based Change Detection
Generally, pixel-based change detection methods extract features from individual pixels and their surroundings and predict binary masks in which each pixel is classified as changed or unchanged. It is noteworthy that encoder–decoder architectures are becoming increasingly popular in pixel-based change detection due to their high flexibility and strong performance. As mentioned earlier, spectral–spatial information is important for change detection. However, most algorithms compare the spectral or textural values of single pixels without considering the relationships between neighboring pixels, ignoring the spatial context.
FCNs [25] replace the fully connected layers of CNNs with convolutional layers to produce pixel-wise predictions. FCNs and their variants have since provided effective methods for fine-grained change detection, such as FC-EF, FC-Siam-conc, and FC-Siam-diff [26], and W-Net [27]. SegNet, the most widely used encoder–decoder CNN, which builds on VGG16, is often used for the semantic segmentation of images. However, when directly applied to change detection, it achieves low accuracy without skip connections. Although a simple skip connection can help recover lost spatial information, it remains difficult to meet the needs of change detection tasks, especially for objects of various sizes. Therefore, UNet++ [20] employs a series of nested and dense skip connections to achieve multiscale feature extraction and reduce the pseudo-changes induced by scale variance. Exploiting the potential of UNet++ for the pixel-level segmentation of remote sensing images is a promising avenue, as it has the advantage of capturing fine-grained details. To fully exploit the spatial–temporal dependence between multitemporal remote sensing images, BiDateNet [28] was proposed to better distinguish spatial and temporal features; in BiDateNet, convolutional LSTM blocks are added to the skip connections of a U-Net architecture to detect temporal patterns between bi-temporal remote sensing images. In addition, some studies [29][30] employed atrous spatial pyramid pooling (ASPP) [31] to extract multiscale features and thereby improve change detection.
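The sketch below is written in the spirit of FC-Siam-diff: a weight-shared encoder processes both dates, absolute feature differences are passed through the skip connection, and a small decoder predicts a per-pixel change mask. The layer widths and depth are assumptions, far shallower than the published architecture.

```python
import torch
import torch.nn as nn

class TinySiamDiff(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))  # change logits

    def encode(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        return f1, f2

    def forward(self, xa, xb):
        a1, a2 = self.encode(xa)   # weights are shared (Siamese)
        b1, b2 = self.encode(xb)
        d2 = torch.abs(a2 - b2)    # deep feature difference
        d1 = torch.abs(a1 - b1)    # skip-connection difference
        up = self.up(d2)
        return self.dec(torch.cat([up, d1], dim=1))

net = TinySiamDiff()
mask = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```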
Moreover, the attention mechanism improves on the average or maximum pooling used in CNN models and enables the models to weigh the influence of features at different locations and ranges. Attention mechanisms have been used in computer vision research for years, so it is no surprise that numerous publications apply them to change detection [32]. The convolutional block attention module (CBAM) [33] is used to make features from different phases more distinguishable in both the channel and spatial dimensions [34]. Self-attention [35] is a mechanism that links different locations within a sequence to estimate a feature for each location in that sequence; it can model long-range correlations between bi-temporal remote sensing data. Non-local neural networks [36] have applied self-attention to various tasks, such as video classification and object recognition.
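A minimal non-local/self-attention sketch follows: every spatial position attends to all others, so the block can relate distant locations in the concatenated bi-temporal features. The channel counts, the fusion by concatenation, and the scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels=32, inner=16):
        super().__init__()
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)  # (b, hw, inner)
        k = self.k(x).flatten(2)                  # (b, inner, hw)
        v = self.v(x).flatten(2).transpose(1, 2)  # (b, hw, inner)
        # Each position attends to all others: long-range correlations.
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                    # residual connection

# Bi-temporal feature maps fused along channels, then refined.
f_t1, f_t2 = torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)
refined = NonLocalBlock(32)(torch.cat([f_t1, f_t2], dim=1))
print(refined.shape)  # torch.Size([1, 32, 32, 32])
```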
2.3. Object-Based Change Detection
Object-based methods take objects instead of pixels as the unit of analysis. An object is a local cluster of pixels in which all pixels are assigned the same classification label. Object-based methods effectively exploit the homogeneous information in images and mitigate the effects of image noise, boundaries [37], and misalignments. Because of these benefits, object-based methods are prevalent in land-cover mapping and, in various publications, have achieved better performance than pixel-based methods. This success has led to their general use in object-level investigations, e.g., object detection and instance segmentation. In recent years, object-based change detection techniques have also been developed for detecting changed objects. In theory, this approach can reduce the number of falsely detected changes that often appear in the predictions of pixel-based methods. It generates object-level predictions, e.g., the masks or bounding boxes of changed objects. The methods fall broadly into two categories: the first performs super-pixel-based change detection and outputs masks; the second builds on an object detection framework and localizes each changed object with a bounding box. Both categories can use the post-classification comparison method, which treats change detection as the classification of pairs of images/boxes: the land-cover classes of two classified images/boxes are compared, and regions whose classes differ are marked as changed.
Super-pixel-based. This method works with homogeneous pixel groups obtained by image segmentation, utilizing spectral, textural, and geometric features, e.g., pattern and area. Using super-pixel objects eliminates some of the "salt-and-pepper" noise in the change detection results. Sometimes, super-pixels generated by multiresolution segmentation are used to refine the results to the object level [38]. Nevertheless, regardless of how the super-pixels are formed, an inappropriate, hand-tuned scale setting introduces extra errors. For instance, the homogeneity of the objects decreases as the segmentation scale grows, while computational effort and the small observation field are the two main limiting factors of extreme segmentation (which approximates pixel-based methods). Therefore, the focus of object-level change detection is to break through the constraints of prior parameters and obtain adaptive objects. Objects produced in this way are not all the same size; consequently, over-segmentation and under-segmentation lead to worse change detection results [39].
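The scale sensitivity described above can be seen with SLIC from scikit-image, where the requested number of segments acts as the hand-set scale parameter; the random image is a stand-in for a remote sensing tile, and the parameter values are arbitrary.

```python
import numpy as np
from skimage.segmentation import slic

image = np.random.rand(128, 128, 3)   # stand-in for a remote sensing tile

# Fewer segments -> coarser, less homogeneous objects; more segments
# approaches pixel-level analysis at higher computational cost.
for n_segments in (50, 200, 800):
    labels = slic(image, n_segments=n_segments, compactness=10,
                  start_label=0)
    print(n_segments, "requested ->", labels.max() + 1, "super-pixels")
```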
Bounding box candidates. In this method, changed objects are treated as targets for object detection (OD). Common OD methods, such as SSD [40], Faster R-CNN [41], and YOLOv1–v5 [42][43][44][45][46], have potential for use in change detection.
This approach treats the "changed area" in remote sensing images as the detection target and the "unchanged area" as the background. OD methods have been applied to change detection in high-resolution remote sensing images [47]. Detection yields a set of rectangular regions, and overlapping regions with the same change type are then merged. The feature extraction network can be a single-branch or a dual-branch network. In a single-branch network, the multitemporal images are first merged or subtracted, and the result is fed into the OD network to determine the change [47]. A dual-branch network extracts basic, representative features from each image separately and then fuses the features [48] or the proposal regions [49] of the two branches to predict class scores and the confidence of difference. In addition, object-based instance segmentation, such as that performed with Mask R-CNN, can produce initialized object instances as a basis for detecting changes [50]. In fact, acquiring the objects' locations is the first step in localizing a changed object.
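A hedged sketch of the single-branch scheme follows: the bi-temporal images are fused by subtraction into one 3-channel difference image, which an off-the-shelf detector then scans for "changed area" boxes against an "unchanged" background. The choice of Faster R-CNN, the two-class setup, and the `weights=None` argument (torchvision ≥ 0.13; older versions use `pretrained=False`) are assumptions, and the untrained model here would need fine-tuning on change annotations.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two classes: background ("unchanged area") and "changed area".
detector = fasterrcnn_resnet50_fpn(weights=None, num_classes=2).eval()

img_t1 = torch.rand(3, 256, 256)
img_t2 = torch.rand(3, 256, 256)
diff = torch.abs(img_t1 - img_t2)   # early fusion by subtraction

with torch.no_grad():
    preds = detector([diff])        # list of images in, dicts out
print(preds[0]["boxes"].shape, preds[0]["scores"].shape)
```

A dual-branch variant would instead run a shared backbone on each image and fuse the feature maps or proposals before the detection head.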