The use of remote sensing imagery has significantly enhanced the efficiency of building extraction; however, the precise estimation of building height remains a formidable challenge. In light of ongoing advancements in computer vision, numerous techniques leveraging convolutional neural networks and Transformers have been applied to remote sensing imagery, yielding promising outcomes.
1. Introduction
Buildings play a pivotal role in urban areas, and the analysis of their distribution holds substantial value for a variety of applications, including assessments of urban livability [1,2] and urban planning [3]. Consequently, continuous monitoring of building changes remains an essential task. Furthermore, the precise determination of relative building heights is of paramount importance in urban planning and development.
Traditional methods of updating building data are burdened with substantial labor and resource costs, rendering comprehensive coverage and standardized information a challenging endeavor [4]. Fortunately, remote sensing technology provides a highly accurate means of obtaining a wide range of data related to building heights. These data can be effectively harnessed for the formulation of comprehensive urban planning schemes, the evaluation of urban volume and floor area ratios [5], and as fundamental data for urban disaster prevention and mitigation [6]. In practice, the increasing utilization of multi-source high-resolution satellite data offers a promising avenue for efficiently extracting building information over expansive areas through remote sensing techniques [7].
The remote sensing data used for estimating the height of surface objects can be broadly categorized into three groups: optical images [8,9], synthetic aperture radar (SAR) images [10,11,12,13,14], and the fusion of these two data sources [15,16].
Optical remote sensing images offer a rich source of visual information, encompassing attributes such as building size, shape, and relative positioning. By analyzing visual cues such as perspective relationships, shadows, and textures, and applying image measurement principles and feature extraction algorithms, it becomes possible to deduce relative height differences between buildings [17,18]. SAR is an active ground detection technique whose signals can penetrate clouds, smoke, and vegetation, yielding valuable information about terrain and ground objects [19]. Height estimation from SAR images relies on the analysis of phase information, particularly the phase differences between adjacent pixels, and SAR is therefore widely utilized for the height estimation of ground objects [11,12,13]. The fusion of SAR and optical images for building height extraction capitalizes on the distinctive imaging characteristics of both modalities; by combining their respective strengths through image fusion, more accurate building height data can be extracted [20].
The rapid progress in computer vision has enabled the estimation of relative height from a single image. This is realized through data-driven methods that learn implicit mapping relationships [21] rather than explicit mathematical models. Unlike conventional mathematical modeling approaches, data-driven methods do not require precise modeling of physical parameters such as depth of field or the camera's intrinsic and extrinsic parameters. Instead, they leverage extensive image datasets for training, enabling the acquisition of more intricate representations of height-related features. Consequently, significant advancements have been made in monocular depth estimation (MDE) tasks [22,23,24,25]. MDE involves estimating object depths in a scene from a single 2D image, a task closely related to building height estimation. Several methods based on Vision Transformers (ViTs) [25,26] have been introduced. ViT offers superior feature extraction capabilities, robustness, interpretability, and generalization abilities compared to convolutional neural networks (CNNs). It can adapt to images of various sizes and shapes, allowing the learning of comprehensive feature representations from extensive image data.
ViT [27] has made significant strides in the past three years and has found extensive applications in semantic segmentation [28,29,30] and depth estimation [25,26]. In semantic segmentation, ViT restores the feature map to the original image size and conducts pixel-wise classification by incorporating an upsampling layer or a transposed convolutional layer into the network architecture. This approach allows for efficient processing and precise prediction of large-scale image data, providing robust support for a variety of computer vision tasks. In depth estimation, ViT facilitates the reconstruction of 3D scenes by estimating depth information from a single image. This data-driven approach learns implicit mapping relationships, enabling the prediction of scene depth information from the image.
Currently, there is a paucity of research on height estimation using multi-source remote sensing images, especially within the context of multi-task learning with semantic constraints to enhance height estimation. Existing studies primarily concentrate on analyzing remote sensing mechanisms or utilizing multi-view remote sensing images for relative height estimation through dense matching [17,31,32]. Recent endeavors have explored the use of SAR or optical remote sensing data for multi-task learning [7,33,34,35]. Additionally, some studies have integrated ground object height and RGB images to perform semantic segmentation tasks [36]. These studies have shown promising results, signifying that the joint processing of SAR and high-resolution remote sensing data can bolster the accuracy of building extraction and height estimation, and they underscore the intrinsic relationship between semantic information and ground object height, highlighting the effectiveness and necessity of simultaneously conducting semantic segmentation and height estimation. In recent years, deep learning methods have been employed for relative height estimation through generative techniques [37,38,39] as well as end-to-end approaches [40,41]. Regressing heights over building areas with a height estimation model requires separating buildings from the background while estimating their height, and the continuity of the regression output makes this foreground–background distinction difficult. Traditionally, a threshold is applied in post-processing, whereas a semantic segmentation task learns to differentiate foreground from background directly and can therefore assist the height estimation task considerably (see the sketch after this paragraph).
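The contrast drawn above, thresholding a regressed height map versus masking it with a learned segmentation branch, can be sketched in PyTorch as follows. The 1 m threshold, the tensor shapes, and the assumption that class 1 denotes buildings are illustrative choices, not values from the cited works.

```python
import torch

def buildings_by_threshold(height_map, min_height=1.0):
    """Post-process a regressed height map: pixels above a fixed
    threshold (assumed 1 m here) are treated as building."""
    return height_map > min_height

def buildings_by_segmentation(height_map, seg_logits):
    """Use a learned segmentation branch instead: mask the regressed
    heights with the predicted building class, avoiding a hand-tuned
    threshold on the continuous regression output."""
    building_mask = seg_logits.argmax(dim=1, keepdim=True) == 1  # class 1 = building (assumed)
    return height_map * building_mask

# Dummy predictions: (batch, 1, H, W) heights and (batch, 2, H, W) seg logits.
h = torch.rand(1, 1, 64, 64) * 30.0
s = torch.randn(1, 2, 64, 64)
mask_from_threshold = buildings_by_threshold(h)
masked_heights = buildings_by_segmentation(h, s)
```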
2. MDE
Estimating building height is conceptually similar to MDE, a well-explored field in computer vision. MDE focuses on estimating the depth of objects within a scene from a single 2D image [24]. This task shares common challenges with the estimation of ground object height from remote sensing images: both involve recovering depth information from a 2D projection of a 3D scene, where depth information is inherently lost and its retrieval from a single image is challenging. MDE has diverse applications, including 3D reconstruction [42], autonomous navigation [43], augmented reality [24], and virtual reality [24]. Recent years have witnessed significant progress in MDE, primarily driven by advancements in deep learning techniques and the availability of extensive datasets for training depth estimation models. The prevalent approach in MDE is to train deep neural networks to directly predict depth maps from 2D images. These networks are typically trained on large-scale image datasets that include corresponding depth maps, employing techniques such as supervised learning [25,26,44,45], unsupervised learning [46], or self-supervised learning [47].
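As an example of such a supervised objective, the sketch below implements a generic scale-invariant logarithmic loss commonly used for depth and height regression. The variance weighting of 0.85 and the convention that invalid pixels are marked with zeros are assumptions made for illustration, not values taken from the cited papers.

```python
import torch

def silog_loss(pred_depth, gt_depth, variance_weight=0.85, eps=1e-6):
    """Scale-invariant logarithmic loss for supervised depth/height regression.

    pred_depth, gt_depth: tensors of shape (batch, 1, H, W). Only pixels with
    a positive ground-truth value are used (assumed convention: invalid
    pixels are stored as zeros).
    """
    valid = gt_depth > 0
    diff = torch.log(pred_depth[valid] + eps) - torch.log(gt_depth[valid] + eps)
    return torch.sqrt((diff ** 2).mean() - variance_weight * diff.mean() ** 2 + eps)

# Example with dummy data.
pred = torch.rand(2, 1, 64, 64) * 50 + 1.0
gt = torch.rand(2, 1, 64, 64) * 50
loss = silog_loss(pred, gt)
```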
3. Semantic Segmentation
Semantic segmentation is a pixel-level classification task, and many semantic segmentation models adopt the encoder–decoder architecture, exemplified by models such as Unet [48,49], LinkNet [50,51], and PSPNET [52]. Various Unet-based approaches have been instrumental in automatically extracting buildings from remote sensing imagery [53,54]. Recently, there has been a surge of interest in directly integrating semantic segmentation with height estimation from a single remote sensing image [55,56,57]. These studies have consistently demonstrated that incorporating semantic information can significantly enhance the accuracy of height estimation. Nonetheless, the manual annotation of semantic labels is a cumbersome process, which motivates methods that streamline the labeling procedure; in particular, it is worth investigating the feasibility and efficacy of using building labels alone for this purpose.
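As a concrete example of this encoder–decoder style, a building-extraction model can be instantiated with the third-party segmentation_models_pytorch package (assumed to be available here); the ResNet-34 encoder and the single building class are illustrative choices, not a setup prescribed by the cited studies.

```python
import torch
import segmentation_models_pytorch as smp  # assumed third-party package

# U-Net-style encoder-decoder with a ResNet-34 encoder (illustrative choices).
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights=None,   # or "imagenet" if pretrained weights are desired
    in_channels=3,          # RGB input
    classes=1,              # single "building" class
)

# One forward pass on a dummy 256x256 RGB tile -> per-pixel building logits.
logits = model(torch.randn(1, 3, 256, 256))  # shape: (1, 1, 256, 256)
```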
4. ViT
The advent of the Vision Transformer (ViT) [27] has captured the interest of computer vision researchers. However, pure Transformers exhibit high computational complexity and involve a substantial number of parameters, demanding extensive optimization effort. A promising development in this regard is the Swin Transformer [29], a hierarchical Transformer that offers a versatile backbone for various computer vision tasks. By computing self-attention within shifted, non-overlapping local windows while still allowing cross-window connections, it achieves enhanced computational efficiency. This hierarchical architecture excels at modeling across different scales and maintains linear computational complexity with respect to image size. The Swin Transformer has found wide application in remote sensing, including hyperspectral classification [58], where a multi-scale mixed spectral attention model based on the Swin Transformer achieved top-class performance across multiple datasets. Additionally, Wang et al. [28] introduced BuildFormer, a novel Vision Transformer featuring a dual-path structure. This design accommodates a large window for capturing global context, substantially enhancing its capability for processing extensive remote sensing imagery.
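The shifted-window computation described above can be illustrated with the standard window-partition operation: the token map is optionally rolled by half a window before being split into non-overlapping windows, so that successive blocks attend across the previous window boundaries. The feature-map size, channel count, and window size below are assumptions for illustration only.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of
    shape (num_windows*B, window_size, window_size, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, c)

# Regular windows in one block, shifted windows in the next.
feat = torch.randn(1, 8, 8, 96)          # assumed 8x8 token map with 96 channels
win = 4                                   # assumed window size
regular = window_partition(feat, win)     # (4, 4, 4, 96)
shifted = window_partition(
    torch.roll(feat, shifts=(-win // 2, -win // 2), dims=(1, 2)), win
)                                         # windows now straddle the old boundaries
```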
5. Multi-Modal Fusion and Joint Learning for Remote Sensing
SAR offers the capability to retrieve height information of ground objects by analyzing the phase and amplitude information of radar echoes. However, the accurate retrieval of height information from SAR data is a complex process, as it is influenced by various factors, including terrain, vegetation, and buildings, and typically involves intricate signal processing and data analysis. Nevertheless, deep learning has emerged as a promising approach to simplify the height extraction process and enable end-to-end elevation information extraction [40,41]. However, most existing research in this domain focuses on single data sources or single-task height information extraction, which may not generalize well to multi-source remote sensing data or multi-task joint learning. Researchers are actively exploring methods such as multi-modal fusion and multi-task learning to enhance the accuracy and efficiency of height extraction from SAR data. Multi-task learning using both optical and SAR data is a complex endeavor that involves intricate processing and analysis, and acquiring suitable datasets containing high-resolution optical and SAR data to support such tasks is itself challenging. Recent studies have begun to investigate the use of SAR or optical remote sensing data for multi-task learning [33,34,35], demonstrating the potential of multi-task learning in remote sensing. However, numerous challenges remain, such as integrating multi-source data and developing effective algorithms for joint learning, and further research is needed to fully exploit the potential of multi-task learning in remote sensing applications. In recent remote sensing research, there is also growing interest in combining ground object height and RGB images for semantic segmentation. For example, Xiong et al. [36] demonstrated a strong correlation between the geometric information in the normalized digital surface model (nDSM) and the semantic category of land cover; jointly utilizing the RGB and nDSM (height) modalities can significantly improve segmentation performance, underlining the reliability of Transformer-based networks for multimodal fusion. This research highlights the interplay between semantic information and object height. Additionally, recent studies have investigated the use of RGB images for joint height estimation and semantic segmentation in deep learning for remote sensing.
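A minimal sketch of the kind of two-modality fusion discussed here is given below: separate convolutional stems for the RGB image and a single-channel nDSM (or SAR) input are fused by channel concatenation before a prediction head. The small stems merely stand in for the Transformer backbones mentioned above, and all layer sizes are assumed.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Minimal two-stream fusion: separate stems for RGB and a single-channel
    modality (e.g. nDSM or SAR amplitude), fused by concatenation."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.aux_stem = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(64, num_classes, 1)  # fuse by channel concatenation

    def forward(self, rgb, aux):
        fused = torch.cat([self.rgb_stem(rgb), self.aux_stem(aux)], dim=1)
        return self.head(fused)

# Example: segmentation logits from an RGB tile and a matching nDSM tile.
model = TwoStreamFusion()
logits = model(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
```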
6. Multi-Task Learning
Previous studies [15,36] have yielded promising results, underscoring that joint processing of SAR and high-resolution remote sensing data can significantly enhance the accuracy of building extraction and height estimation tasks. These investigations have emphasized the connection between semantic and height information of ground objects, highlighting the effectiveness and necessity of simultaneously performing semantic segmentation and height estimation tasks. Currently, many deep learning applications rely predominantly on single-task learning, yet multi-task learning, which allows the simultaneous learning of multiple related tasks and the sharing of information between them, offers superior generalization abilities compared to single-task learning [59].
Srivastava et al. [60] employed joint height estimation and semantic labeling on monocular aerial images, using a single decoder with a fully connected layer to perform both height estimation and semantic segmentation. In contrast, Carvalho et al. [61] proposed a framework for joint semantics and local height, processing the two tasks separately in the middle part of the decoder. Gao et al. [62] harnessed contrastive learning with a shared-parameter encoder, alongside a cross-task contrast loss and a cross-pixel contrast loss for height estimation and semantic segmentation; the decoder employed contrastive learning to encourage the model to learn detailed features. Lu et al. [63] introduced a unified deep learning architecture that generates both estimated relative height maps and semantically segmented maps from RGB images, allowing end-to-end training while accomplishing relative height estimation and semantic segmentation simultaneously; however, it does not consider the independent relationship between building texture details and building semantic information. Building on the correlation between semantic segmentation and height estimation, Zhao et al. [64] proposed a semantic-aware unsupervised domain adaptation method for height estimation and found that incorporating semantic supervision improves the accuracy of height estimation for single-view orthophotos under unsupervised domain adaptation.
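Most of the works summarized above share an encoder between a height-regression head and a segmentation head and optimize a weighted sum of the two losses. The sketch below shows only that generic pattern; the toy backbone, heads, and loss weights are chosen here for illustration rather than taken from any specific paper.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with two task-specific heads: height regression
    and building segmentation (a generic sketch, not a specific method)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.height_head = nn.Conv2d(64, 1, 1)           # continuous height map
        self.seg_head = nn.Conv2d(64, num_classes, 1)    # per-pixel class logits

    def forward(self, x):
        feat = self.encoder(x)
        return self.height_head(feat), self.seg_head(feat)

model = MultiTaskModel()
img = torch.randn(2, 3, 64, 64)
gt_height = torch.rand(2, 1, 64, 64) * 30
gt_mask = torch.randint(0, 2, (2, 64, 64))

pred_height, seg_logits = model(img)
# Weighted joint objective; the 1.0 / 0.5 weights are illustrative.
loss = 1.0 * nn.functional.l1_loss(pred_height, gt_height) \
     + 0.5 * nn.functional.cross_entropy(seg_logits, gt_mask)
loss.backward()
```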
This entry is adapted from the peer-reviewed paper 10.3390/rs15235552