Forest cover mapping is of paramount importance for environmental monitoring, biodiversity assessment, and forest resource management. Significant advances in forest cover mapping have been made by leveraging fully supervised semantic segmentation models. However, acquiring the substantial quantity of pixel-level labelled data that these models require is time-consuming and labour-intensive.
1. Introduction
Forests play a vital role in the land ecosystem of the Earth. They are indispensable for conserving biodiversity, protecting watersheds, capturing carbon, mitigating the effects of climate change [1,2], maintaining ecological balance, regulating rainfall patterns, and ensuring the stability of large-scale climate systems [3,4]. As a result, the timely and precise monitoring and mapping of forest cover have emerged as a vital aspect of sustainable forest management and the monitoring of ecosystem transformations [5].
Traditionally, the monitoring and mapping of forest cover have relied primarily on field surveys and photo-interpretation, methods that are limited by the extensive manpower they require. With advances in remote sensing (RS) technology, large-scale, high-resolution forest imagery can now be acquired without physical contact and without harming the forest environment. Taking advantage of RS imagery, numerous studies have proposed methods for forest cover mapping, including decision trees [6], regression trees [7], maximum likelihood classifiers [8], random forest classification algorithms [9,10], support vector machines, spatio-temporal Markov random-field super-resolution mapping [11], and multi-scale spectral–spatial–temporal super-resolution mapping [12].
Recently, with the growing prevalence of deep Convolutional Neural Networks (CNNs) [13] and semantic segmentation [14,15,16], the research community has shifted notably towards applying these techniques to forest cover mapping with RS imagery. CNNs have emerged as powerful tools for analysing two-dimensional images, employing multi-layered convolution operations to capture low-level spatial patterns (such as edges, textures, and shapes) and extract high-level semantic information. Meanwhile, semantic segmentation enables the precise identification and extraction of different objects/regions in an image by classifying each pixel into a specific semantic category, achieving pixel-level segmentation. In the context of forest cover mapping, several existing methods have demonstrated the effectiveness of semantic segmentation techniques. Bragagnolo et al.
[17] integrated an attention block into a basic UNet to segment forest areas in satellite imagery of South America. Flood et al. [18] also proposed a UNet-based network and achieved promising results in mapping the presence or absence of trees and shrubs in Queensland, Australia. Isaienkov et al. [19] directly employed the baseline U-Net model with Sentinel-2 satellite data to detect changes in Ukrainian forests. However, all these methods rely on fully supervised semantic segmentation, which requires a substantial amount of pixel-level labels and thus incurs a significant labelling cost. Semi-supervised learning [20,21,22] has emerged as a promising approach to this challenge: models are trained on a combination of limited labelled data and a substantial amount of unlabelled data, reducing the need for manual annotation while still improving model performance. Several studies [23,24] have explored semi-supervised semantic segmentation for land-cover-mapping tasks in RS. While these studies assess forest segmentation to some extent, they focus predominantly on forests in urban or semi-natural areas, which limits their performance in densely forested natural areas. Moreover, utilising satellite RS imagery specifically for forest cover mapping poses unique challenges:
Challenge 1: As illustrated in Figure 1a, satellite RS forest images often suffer from variations in scene illumination and atmospheric interference during acquisition. These factors can cause colour deviations and distortions, resulting in poor colour fidelity and low contrast. It therefore becomes essential to employ image enhancement techniques to improve visualisation and reveal more detail, which in turn helps CNNs capture spatial patterns effectively.
Figure 1. Visualisation of images after application of different processing methods: (a) original image; and (b) image after contrast enhancement.
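The specific enhancement method behind Figure 1b is not detailed here, so the following is only an illustrative sketch of one common choice for low-contrast satellite imagery: a percentile-based linear contrast stretch applied to a single band. The `stretch_contrast` function and its parameters are hypothetical, not taken from the works cited above.

```python
import numpy as np

def stretch_contrast(band, low_pct=2, high_pct=98):
    """Linearly stretch a single image band between two percentiles.

    Pixel values below/above the percentile cut-offs are clipped, which
    expands the narrow dynamic range typical of hazy, low-contrast
    satellite imagery into the full [0, 1] range.
    """
    lo, hi = np.percentile(band, [low_pct, high_pct])
    if hi <= lo:  # flat band: nothing to stretch
        return np.zeros_like(band, dtype=np.float32)
    stretched = (band.astype(np.float32) - lo) / (hi - lo)
    return np.clip(stretched, 0.0, 1.0).astype(np.float32)

# Example: a dim band whose values occupy only a narrow range (90-109)
band = np.random.default_rng(0).integers(90, 110, size=(64, 64))
enhanced = stretch_contrast(band)
```

In practice the stretch would be applied per spectral band before the imagery is fed to a CNN; more elaborate alternatives such as histogram equalisation follow the same pattern.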
Challenge 2: Due to the high density of natural forest cover and the similar reflectance characteristics between forest targets and other non-forest targets, e.g., grass and the shadow of vegetation, the boundaries between forest and non-forest areas often become unclear. As a result, it becomes challenging to accurately distinguish and delineate the details and edges of the regions of interest, as depicted in Figure 2a.
Figure 2. Visualisation of balanced and imbalanced samples (where black represents non-forest, and white represents forest): (a) balanced sample and (b) imbalanced sample.
Challenge 3: An unknown forest distribution may be either imbalanced (as illustrated in Figure 2b) or balanced. Current methods struggle to handle datasets with such differing class distributions effectively, resulting in poor model generalisation.
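One standard remedy for the imbalanced case is to reweight the per-pixel loss by inverse class frequency, so that the rare class (e.g., sparse forest pixels in Figure 2b) is not swamped by the majority class. The sketch below is illustrative only and is not drawn from the methods discussed in this article; the function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=2, eps=1e-6):
    """Per-class weights inversely proportional to pixel frequency,
    normalised to sum to 1, so rare classes weigh more in the loss."""
    counts = np.bincount(labels.ravel(), minlength=n_classes).astype(np.float64)
    freqs = counts / counts.sum()
    weights = 1.0 / (freqs + eps)
    return weights / weights.sum()

def weighted_pixel_ce(probs, labels, weights):
    """Class-weighted cross-entropy over pixels; probs has shape (H, W, C)."""
    h, w = labels.shape
    # Advanced indexing: pick the predicted probability of the true class
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    per_pixel = -np.log(np.clip(p_true, 1e-12, 1.0))
    return float(np.mean(weights[labels] * per_pixel))

# Imbalanced mask: roughly 5% forest (class 1), 95% non-forest (class 0)
rng = np.random.default_rng(1)
mask = (rng.random((32, 32)) < 0.05).astype(np.int64)
w = inverse_frequency_weights(mask)

# With uniform predictions, the loss is dominated by the rare class weight
probs = np.full((32, 32, 2), 0.5)
loss = weighted_pixel_ce(probs, mask, w)
```

The rare forest class receives the larger weight, counteracting the tendency of an unweighted loss to collapse to the majority class on imbalanced scenes.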
2. Semi-Supervised Semantic Segmentation
Consistency regularisation and pseudo-labelling are two main categories of methods in the field of semi-supervised semantic segmentation
[21,22].
Consistency regularisation methods aim to improve model performance by promoting agreement among diverse predictions for the same image, achieved by introducing perturbations to either the input images or the models themselves. For instance, the CCT approach [25] utilises an auxiliary decoder structure that incorporates multiple robust perturbations at both the feature level and the decoder output stage, enforcing predictions that remain consistent even in the presence of these perturbations. Furthermore, Liu et al. proposed PS-MT [26], which extends the mean-teacher (MT) model by introducing an auxiliary teacher and replacing the MT's mean square error (MSE) loss with a more stringent confidence-weighted cross-entropy (Conf-CE) loss, substantially improving the accuracy and consistency of the MT model's predictions. Building upon these advancements, Abulikemu Abuduweili et al. [27] proposed a method that leverages adaptive consistency regularisation to effectively combine pre-trained models and unlabelled data, improving model performance. However, consistency regularisation typically relies on perturbation techniques that impose requirements on dataset quality and distribution, and it involves numerous hyperparameters that are challenging to fine-tune.
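Stripped of the architectural details of methods such as CCT or PS-MT, the shared idea behind consistency regularisation reduces to a simple unsupervised objective: penalise disagreement between the predictions for a clean input and a perturbed view of it. A minimal NumPy sketch, with toy per-pixel logits standing in for a real segmentation network's outputs, might look like this (the function names are hypothetical):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_clean, logits_perturbed):
    """MSE between the class-probability maps of two views of the
    same image; no ground-truth labels are required."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return float(np.mean((p - q) ** 2))

rng = np.random.default_rng(2)
logits = rng.normal(size=(8, 8, 2))  # toy per-pixel logits (H, W, classes)

# A small feature-level perturbation should barely change the prediction,
# so its consistency loss is much smaller than that of a wildly different view.
perturbed = logits + rng.normal(scale=0.1, size=logits.shape)
loss_near = consistency_loss(logits, perturbed)
loss_far = consistency_loss(logits, -logits)
```

During semi-supervised training this term is computed on unlabelled images and added to the supervised loss on the labelled ones; variants differ mainly in where the perturbation is injected and in the distance function used.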
In contrast to consistency regularisation, pseudo-labelling techniques, exemplified by self-training [28], use predictions on unlabelled data to generate pseudo-labels, which are then incorporated into the training process, effectively expanding the training set and enriching the model with more information. In this context, Yi Zhu et al. [29] proposed a self-training framework for semantic segmentation that utilises pseudo-labels from unlabelled data, addressing data imbalance through centroid sampling and focusing on computational efficiency. However, while pseudo-labelling methods offer more stable training, they often achieve only modest improvements in model performance, leaving the potential of unlabelled data under-utilised.
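A common ingredient of such pseudo-labelling pipelines, sketched here generically rather than as the specific procedure of any cited work, is confidence thresholding: only pixels whose maximum predicted class probability exceeds a threshold receive pseudo-labels, and the remaining pixels are masked out of the loss.

```python
import numpy as np

def make_pseudo_labels(probs, threshold=0.9):
    """From per-pixel class probabilities of shape (H, W, C), keep only
    confident pixels: return hard labels plus a boolean validity mask."""
    confidence = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    keep = confidence >= threshold
    return labels, keep

# Toy per-pixel probabilities for a 2-class (forest / non-forest) map
rng = np.random.default_rng(3)
raw = rng.random((16, 16, 2))
probs = raw / raw.sum(axis=-1, keepdims=True)

labels, keep = make_pseudo_labels(probs, threshold=0.9)
```

Only the pixels flagged by `keep` would enter the cross-entropy loss on the next self-training round; the threshold trades label quantity against label noise.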
3. Semi-Supervised Semantic Segmentation in RS
Several studies have investigated the application of semi-supervised learning in RS. Lucas et al. [23] proposed Sourcerer, a deep-learning-based technique for semi-supervised domain adaptation in land cover mapping from satellite image time-series data. Sourcerer surpasses existing methods by leveraging labelled data from a source domain and adapting the model to the target domain using a novel regulariser, even with limited labelled target data. Chen et al. [30] introduced SemiRoadExNet, a semi-supervised road extraction network based on a Generative Adversarial Network (GAN) that exploits both labelled and unlabelled data by generating road segmentation results and entropy maps. Zou et al. [31] introduced a pseudo-labelling approach for semantic segmentation that improves training with unlabelled or weakly labelled data; by combining diverse sources with strong data augmentation, their strategy demonstrates effective consistency training on data of both low and high density. On the other hand, Zhang et al. [32] proposed a semi-supervised deep learning framework for the semantic segmentation of high-resolution RS images, utilising transformation consistency regularisation to make the most of limited labelled samples and abundant unlabelled data. However, the application of semi-supervised semantic segmentation to forest cover mapping has not been explored.