Urban Building Height Estimation Using Multiple Source Images

The use of remote sensing imagery has significantly enhanced the efficiency of building extraction; however, the precise estimation of building height remains a formidable challenge. In light of ongoing advancements in computer vision, numerous techniques leveraging convolutional neural networks and Transformers have been applied to remote sensing imagery, yielding promising outcomes.

  • height estimation
  • multi-task learning
  • Vision Transformer

1. Introduction

Buildings play a pivotal role in urban areas, and the analysis of their distribution holds substantial value for a variety of applications, including the assessment of urban livability [1,2] and urban planning [3]. Consequently, continuous monitoring of building changes remains an essential task. Furthermore, the precise determination of relative building heights is of paramount importance in the domains of urban planning and development.
Traditional methods of updating building data are burdened with substantial costs in terms of labor and resources, rendering comprehensive coverage and standardized information a challenging endeavor [4]. Fortunately, remote sensing technology provides a highly accurate means of obtaining a wide range of data related to building heights. These data can be effectively harnessed for the formulation of comprehensive urban planning schemes, the evaluation of urban volume and floor area ratios [5], and use as fundamental data for urban disaster prevention and mitigation [6]. In practice, the increasing utilization of multi-source high-resolution satellite data offers a promising avenue for efficiently extracting building information over expansive areas through remote sensing techniques [7].
The remote sensing data used for estimating the height of surface objects can be broadly categorized into three groups: optical images [8,9], synthetic aperture radar (SAR) images [10,11,12,13,14], and the fusion of these two data sources [15,16].
Optical remote sensing images offer a rich source of visual information, encompassing attributes such as building size, shape, and relative positioning. By integrating the analysis of visual cues like perspective relationships, shadows, and textures, along with the application of image measurement principles and feature extraction algorithms, it becomes possible to deduce relative height differences between buildings [17,18]. SAR is an active ground detection technique with strong penetration capability, allowing it to see through clouds, smoke, and vegetation and thereby yielding valuable information about terrain and ground objects [19]. The estimation of the height of ground objects in SAR images relies on the analysis of phase information, particularly the phase differences between adjacent pixels. Consequently, SAR is widely utilized for the height estimation of ground objects [11,12,13]. The fusion of SAR and optical images for building height extraction capitalizes on the distinctive imaging characteristics of both modalities. By combining the respective strengths of SAR and optical data through image fusion, more accurate building height data can be extracted [20].
The rapid progress in computer vision has enabled the estimation of relative height from a single image. This achievement is realized through data-driven methods that learn implicit mapping relationships [21], which are not explicitly derived from mathematical modeling. Unlike conventional mathematical modeling approaches, these data-driven methods do not require precise modeling of physical parameters such as depth of field or the camera's intrinsic and extrinsic parameters. Instead, they leverage extensive image datasets for training, facilitating the acquisition of more intricate representations of height-related features. Consequently, significant advancements have been made in monocular depth estimation (MDE) tasks [22,23,24,25]. MDE involves the estimation of object depths in a scene from a single 2D image, a task closely related to building height estimation. Several methods based on Vision Transformers (ViTs) [25,26] have been introduced. ViT offers superior feature extraction capabilities, robustness, interpretability, and generalization in comparison to convolutional neural networks (CNNs). It can adapt to images of various sizes and shapes, allowing the learning of comprehensive feature representations from extensive image data.
ViT [27] has made significant strides in the past three years and has found extensive applications in semantic segmentation [28,29,30] and depth estimation [25,26]. In the realm of semantic segmentation, ViT restores the feature map to the original image size and conducts pixel-wise classification by incorporating an upsampling layer or a transposed convolutional layer into the network architecture. This approach allows for efficient processing and precise prediction of large-scale image data, providing robust support for a variety of computer vision tasks. In the context of depth estimation, ViT facilitates the reconstruction of 3D scenes by estimating depth information from a single image. This data-driven approach learns implicit mapping relationships, enabling the prediction of scene depth information from the image.
Currently, there is a paucity of research on height estimation using multi-source remote sensing images, especially within the context of multi-task learning with semantic constraints to enhance height estimation. Existing studies primarily concentrate on analyzing remote sensing mechanisms or utilizing multi-view remote sensing images for relative height estimation through dense matching [17,31,32]. Recent endeavors have explored the utilization of SAR or optical remote sensing data for multi-task learning [7,33,34,35]. Additionally, some studies have integrated ground object height and RGB images to perform semantic segmentation tasks [36]. These studies have showcased promising results, signifying that the joint processing of SAR and high-resolution remote sensing data can bolster the accuracy of building extraction and height estimation tasks. Moreover, they underscore the intrinsic relationship between semantic information and ground object height, highlighting the effectiveness and necessity of simultaneously conducting semantic segmentation and height estimation tasks. In recent years, deep learning methods have been employed for relative height estimation through generative techniques [37,38,39], as well as end-to-end approaches [40,41]. A height estimation model that regresses heights over building areas must also separate buildings from the background, yet the continuous output of a regression model makes this foreground/background distinction difficult. Traditionally, a threshold is applied in post-processing, whereas a semantic segmentation task learns to differentiate foreground from background directly and can therefore offer significant assistance, as illustrated in the sketch below.
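To make this concrete, the following is a minimal Python/NumPy sketch (with purely illustrative values and a hypothetical 2 m threshold) contrasting threshold-based post-processing of a regressed height map with masking by a building-probability map from a segmentation head; it is not drawn from any of the cited methods.

```python
import numpy as np

def mask_heights_with_threshold(height_map: np.ndarray, min_height: float = 2.0) -> np.ndarray:
    """Naive post-processing: treat every pixel above a fixed height as 'building'."""
    return np.where(height_map > min_height, height_map, 0.0)

def mask_heights_with_segmentation(height_map: np.ndarray, building_prob: np.ndarray,
                                   prob_threshold: float = 0.5) -> np.ndarray:
    """Keep regressed heights only inside the building mask predicted by a
    semantic segmentation head, separating foreground buildings from background."""
    building_mask = building_prob > prob_threshold
    return np.where(building_mask, height_map, 0.0)

# Toy example with hypothetical predictions (values are illustrative only).
heights = np.array([[0.5, 12.0], [3.0, 0.2]])         # regressed relative heights (m)
probs = np.array([[0.1, 0.9], [0.2, 0.05]])           # predicted building probabilities
print(mask_heights_with_threshold(heights))            # a 3 m non-building pixel leaks through
print(mask_heights_with_segmentation(heights, probs))  # only the predicted building pixel survives
```

With a fixed height threshold, tall non-building objects leak into the building mask, whereas the segmentation-guided mask retains heights only where buildings are predicted.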

2. MDE

Estimating building height is conceptually similar to MDE, a well-explored field in computer vision. MDE focuses on estimating the depth of objects within a scene from a single 2D image [24]. This task shares common challenges with the estimation of ground object height from remote sensing images. Both involve the complexity of recovering depth information from a 2D image projection of a 3D scene, where depth information is inherently lost, and its retrieval from a single image is challenging. MDE has diverse applications, including 3D reconstruction [42], autonomous navigation [43], augmented reality [24], and virtual reality [24]. Recent years have witnessed significant progress in MDE, primarily driven by advancements in deep learning techniques and the availability of extensive datasets for training depth estimation models. The prevalent approach in MDE is to train deep neural networks to directly predict depth maps from 2D images. These networks are typically trained on large-scale image datasets that include corresponding depth maps, employing techniques such as supervised learning [25,26,44,45], unsupervised learning [46], or self-supervised learning [47].
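As a concrete example of the supervised setting, the snippet below gives a minimal PyTorch sketch of the scale-invariant logarithmic loss in the spirit of Eigen et al. [22], which is commonly used to train depth (or relative height) regression networks; the backbone, data pipeline, and any auxiliary losses are omitted, and the tensor shapes and values are illustrative assumptions.

```python
import torch

def scale_invariant_log_loss(pred: torch.Tensor, target: torch.Tensor,
                             lam: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Scale-invariant logarithmic loss widely used for supervised monocular depth
    (or relative height) regression; pixels with non-positive ground truth are ignored."""
    valid = target > 0
    d = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2

# Toy usage: random tensors stand in for a network's height map and the ground truth.
pred = (torch.rand(2, 1, 64, 64) * 30 + 1.0).requires_grad_()
gt = torch.rand(2, 1, 64, 64) * 30
loss = scale_invariant_log_loss(pred, gt)
loss.backward()  # in real training, gradients flow back into the prediction network
```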

3. Semantic Segmentation

Semantic segmentation is a pixel-level classification task, and many semantic segmentation models adopt the encoder–decoder architecture, exemplified by models such as U-Net [48,49], LinkNet [50,51], PSPNet [52], and others. Various U-Net-based studies have been instrumental in automatically extracting buildings from remote sensing imagery [53,54]. Recently, there has been a surge of interest in directly integrating semantic segmentation with height estimation from a single remote sensing image [55,56,57]. These studies have consistently demonstrated that incorporating semantic information can significantly enhance the accuracy of height estimation. Nonetheless, manual annotation of semantic labels is a cumbersome process, so methods that streamline the labeling procedure merit exploration; in particular, it is important to investigate the feasibility and efficacy of using building labels alone for this purpose.
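For illustration, the following is a heavily simplified PyTorch sketch of the encoder–decoder pattern with a skip connection that models such as U-Net and LinkNet build on; the layer widths, depth, and input size are arbitrary assumptions and do not reproduce any cited architecture.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, the basic building block of U-Net-style models."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A minimal encoder-decoder with one downsampling stage and one skip connection,
    illustrating the architecture family rather than any cited model."""
    def __init__(self, in_ch: int = 3, num_classes: int = 2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)             # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)  # per-pixel class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)              # full-resolution features
        e2 = self.enc2(self.pool(e1))  # half-resolution features
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)           # (B, num_classes, H, W)

logits = TinyUNet()(torch.rand(1, 3, 128, 128))  # -> shape (1, 2, 128, 128)
```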

4. ViT

The advent of the Vision Transformer (ViT) [27] has captured the interest of computer vision researchers. However, pure Transformers exhibit high computational complexity and involve a substantial number of model parameters, demanding extensive optimization effort. A promising development in this regard is the Swin Transformer [29], a hierarchical Transformer that offers a versatile backbone for various computer vision tasks. By implementing shifted window computations, self-attention is constrained within non-overlapping local windows while also allowing for cross-window connections, leading to enhanced computational efficiency. This layered architecture excels in modeling across different scales and maintains linear computational complexity with respect to image size. The Swin Transformer has found wide applications in remote sensing, including hyperspectral classification [58], where a multi-scale mixed spectral attention model based on the Swin Transformer achieved top-class performance across multiple datasets. Additionally, the work of Wang et al. [28] introduced BuildFormer, a novel Vision Transformer featuring a dual-path structure. This innovative design accommodates the use of a large window for capturing global context, substantially enhancing its capabilities for processing extensive remote sensing imagery.
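The snippet below sketches, in PyTorch, the two core operations behind this design: partitioning a feature map into non-overlapping windows and cyclically shifting it so that subsequent window attention connects neighboring windows. The attention computation itself and the masking of cross-boundary interactions after the shift are omitted, and the feature-map and window sizes are illustrative assumptions.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows*B, window_size, window_size, C); self-attention is then computed
    independently inside each window, keeping complexity linear in image size."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

# In the next block, the feature map is cyclically rolled by half a window so that
# tokens near window borders can attend across the previous partition boundaries.
feat = torch.rand(1, 8, 8, 96)                   # (B, H, W, C) toy feature map
windows = window_partition(feat, window_size=4)  # -> (4, 4, 4, 96)
shifted = torch.roll(feat, shifts=(-2, -2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=4)
```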

5. Multi-Modal Fusion and Joint Learning for Remote Sensing

SAR offers the capability to retrieve height information of ground objects by analyzing the phase and amplitude information of radar echoes. However, the accurate retrieval of height information using SAR data is a complex process, as it is influenced by various factors, including terrain, vegetation, and buildings. This extraction process typically involves intricate signal processing and data analysis techniques. Nevertheless, deep learning has emerged as a promising approach to simplify the height extraction process and enable end-to-end elevation information extraction [40,41]. However, most existing research in this domain focuses on single data sources or single-task-based high-level information extraction, which may not generalize well to multi-source remote sensing data or multi-task joint learning. Researchers are actively exploring various methods, such as multi-modal fusion and multi-task learning, to enhance the accuracy and efficiency of height extraction from SAR data. Multi-task learning using both optical and SAR data is a complex endeavor that involves intricate processing and analysis. Acquiring suitable datasets that contain high-resolution optical and SAR data to support such tasks is also a challenging issue. Recent studies have started to investigate the use of SAR or optical remote sensing data for multi-task learning [33,34,35], demonstrating the potential of multi-task learning in remote sensing. Nonetheless, numerous challenges remain, such as integrating multi-source data and developing effective algorithms for joint learning. Further research is essential to address these challenges and fully exploit the potential of multi-task learning in remote sensing applications. In recent remote sensing research, there is growing interest in utilizing combined ground object height and RGB images for semantic segmentation tasks. For example, Xiong et al. [36] demonstrated a strong correlation between the geometric information in the normalized digital surface model (nDSM) and the semantic category of land cover. Jointly utilizing the two modalities, RGB and nDSM (height), has the potential to significantly improve segmentation performance, underlining the reliability of Transformer-based networks for multimodal fusion. This research highlights the interplay between semantic information and feature height information. Additionally, recent studies have investigated the use of RGB images for joint height estimation and semantic segmentation tasks in deep learning for remote sensing.
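As a schematic illustration of such fusion, the following PyTorch sketch uses a hypothetical two-branch encoder that processes co-registered optical and SAR patches separately and fuses their features by channel concatenation; it is a generic example of multi-modal fusion, not an implementation of any cited method.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """A hypothetical two-branch encoder: optical (RGB, 3 channels) and SAR
    (1 channel) inputs are encoded separately, then fused by concatenating
    their feature maps and mixing them with a 1x1 convolution."""
    def __init__(self, fused_ch: int = 64):
        super().__init__()
        self.opt_branch = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.sar_branch = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(64, fused_ch, 1)  # mix the concatenated modalities

    def forward(self, optical: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
        f_opt = self.opt_branch(optical)  # (B, 32, H, W)
        f_sar = self.sar_branch(sar)      # (B, 32, H, W), assumes co-registered inputs
        return self.fuse(torch.cat([f_opt, f_sar], dim=1))

fused = DualBranchFusion()(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128))
```

The fused features could then feed any downstream head (height regression, building extraction), which is where the multi-task designs discussed next come in.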

6. Multi-Task Learning

Previous studies [15,36] have yielded promising results, underscoring that joint processing of SAR and high-resolution remote sensing data can significantly enhance the accuracy of building extraction and height estimation tasks. These investigations have emphasized the connection between semantic and height information of ground objects, highlighting the effectiveness and necessity of simultaneously performing semantic segmentation and height estimation tasks. Currently, many deep learning tasks predominantly rely on single-task learning, yet multi-task learning, which allows the simultaneous learning of multiple related tasks and the sharing of information between them, offers superior generalization abilities compared to single-task learning [59]. Srivastava et al. [60] employed joint height estimation and semantic labeling on monocular aerial images, utilizing a single decoder with a fully connected layer to perform both height estimation and semantic segmentation tasks. In contrast, Carvalho et al. [61] proposed a framework for joint semantics and local height, processing the two tasks separately in the middle part of the decoder. Gao et al. [62] harnessed contrastive learning with an encoder featuring shared parameters, alongside cross-task contrast loss and cross-pixel contrast loss for height estimation and semantic segmentation. The decoder employed contrastive learning to encourage the model to learn detailed features. Lu et al. [63] introduced a unified deep learning architecture that can generate both estimated relative height maps and semantically segmented maps from RGB images, allowing for end-to-end training while accomplishing relative height estimation and semantic segmentation simultaneously. However, they failed to consider the independent relationship between building texture details and building semantic information. Based on the correlation between semantic segmentation and height estimation, Zhao et al. [64] investigated and proposed a semantic-aware unsupervised domain adaptation method for height estimation. They found that incorporating semantic supervision improves the accuracy of height estimation for single-view orthophotos under unsupervised domain adaptation.
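The following PyTorch sketch illustrates the basic hard-parameter-sharing pattern common to these works: a shared encoder feeds a height-regression head and a semantic-segmentation head, and the two task losses are combined with tunable weights. It is a deliberately shallow toy model under assumed shapes and loss choices, not a reimplementation of any cited architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderMultiTask(nn.Module):
    """Simplified multi-task network: one shared encoder, two task-specific heads.
    The cited works use far deeper backbones (CNNs or Vision Transformers) and
    full decoders; this toy model only shows the parameter-sharing structure."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True))
        self.height_head = nn.Conv2d(64, 1, 1)         # per-pixel relative height
        self.seg_head = nn.Conv2d(64, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        feats = self.encoder(x)                        # shared representation
        return self.height_head(feats), self.seg_head(feats)

def multitask_loss(height_pred, height_gt, seg_logits, seg_gt, w_height=1.0, w_seg=1.0):
    """Weighted sum of a regression loss (height) and a classification loss (semantics);
    the weights are hyperparameters balancing the two tasks."""
    return w_height * F.l1_loss(height_pred, height_gt) + w_seg * F.cross_entropy(seg_logits, seg_gt)

model = SharedEncoderMultiTask()
img = torch.rand(2, 3, 64, 64)
h_gt = torch.rand(2, 1, 64, 64) * 30           # toy relative-height ground truth (m)
s_gt = torch.randint(0, 2, (2, 64, 64))        # toy building/background labels
h_pred, s_logits = model(img)
loss = multitask_loss(h_pred, h_gt, s_logits, s_gt)
loss.backward()  # gradients from both tasks update the shared encoder
```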

References

  1. Skalicky, V.; Čerpes, I. Comprehensive assessment methodology for liveable residential environment. Cities 2019, 94, 44–54.
  2. Chi, Y.L.; Mak, H.W.L. From comparative and statistical assessments of liveability and health conditions of districts in Hong Kong towards future city development. Sustainability 2021, 13, 8781.
  3. Dabous, S.A.; Shanableh, A.; Al-Ruzouq, R.; Hosny, F.; Khalil, M.A. A spatio-temporal framework for sustainable planning of buildings based on carbon emissions at the city scale. Sustain. Cities Soc. 2022, 82, 103890.
  4. Li, Z.; Shi, W.; Wang, Q.; Miao, Z. Extracting man-made objects from high spatial resolution remote sensing images via fast level set evolutions. IEEE Trans. Geosci. Remote Sens. 2014, 53, 883–899.
  5. Han, K.; Bao, S.; She, M.; Pan, Q.; Liu, Y.; Chen, B. Exploration of intelligent building planning for urban renewal. Sustainability 2023, 15, 4565.
  6. Cao, Y.; Xu, C.; Aziz, N.M.; Kamaruzzaman, S.N. BIM–GIS integrated utilization in urban disaster management: The contributions, challenges, and future directions. Remote Sens. 2023, 15, 1331.
  7. Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4287–4306.
  8. Lee, T.; Kim, T. Automatic building height extraction by volumetric shadow analysis of monoscopic imagery. Int. J. Remote Sens. 2013, 34, 5834–5850.
  9. Licciardi, G.A.; Villa, A.; Dalla Mura, M.; Bruzzone, L.; Chanussot, J.; Benediktsson, J.A. Retrieval of the height of buildings from WorldView-2 multi-angular imagery using attribute filters and geometric invariant moments. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 71–79.
  10. Brunner, D.; Lemoine, G.; Bruzzone, L.; Greidanus, H. Building height retrieval from VHR SAR imagery based on an iterative simulation and matching technique. IEEE Trans. Geosci. Remote Sens. 2009, 48, 1487–1504.
  11. Elkhrachy, I. Flash flood water depth estimation using SAR images, digital elevation models, and machine learning algorithms. Remote Sens. 2022, 14, 440.
  12. Moya, L.; Mas, E.; Koshimura, S. Sparse representation-based inundation depth estimation using sAR data and digital elevation model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9062–9072.
  13. Parida, B.R.; Tripathi, G.; Pandey, A.C.; Kumar, A. Estimating floodwater depth using SAR-derived flood inundation maps and geomorphic model in kosi river basin (India). Geocarto Int. 2022, 37, 4336–4360.
  14. Li, X.; Zhou, Y.; Gong, P.; Seto, K.C.; Clinton, N. Developing a method to estimate building height from Sentinel-1 data. Remote Sens. Environ. 2020, 240, 111705.
  15. Fieuzal, R.; Baup, F. Estimation of leaf area index and crop height of sunflowers using multi-temporal optical and SAR satellite data. Int. J. Remote Sens. 2016, 37, 2780–2809.
  16. Sportouche, H.; Tupin, F.; Denise, L. Building detection by fusion of optical and SAR features in metric resolution data. In Proceedings of the 2009 IEEE International Geoscience and Remote Sensing Symposium, Cape Town, South Africa, 12–17 July 2009; IEEE: Piscataway, NJ, USA, 2009; Volume 4.
  17. Liasis, G.; Stavrou, S. Satellite images analysis for shadow detection and building height estimation. ISPRS J. Photogramm. Remote Sens. 2016, 119, 437–450.
  18. Qi, F.; Zhai, J.Z.; Dang, G. Building height estimation using Google Earth. Energy Build. 2016, 118, 123–132.
  19. Kulkarni, S.C.; Rege, P.P. Pixel level fusion techniques for SAR and optical images: A review. Inf. Fusion 2020, 59, 13–29.
  20. Sportouche, H.; Tupin, F.; Denise, L. Extraction and three-dimensional reconstruction of isolated buildings in urban scenes from high-resolution optical and SAR spaceborne images. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3932–3946.
  21. Gao, J.; O’Neill, B.C. Mapping global urban land for the 21st century with data-driven simulations and Shared Socioeconomic Pathways. Nat. Commun. 2020, 11, 2302.
  22. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
  23. Xu, D.; Ricci, E.; Ouyang, W.; Wang, X.; Sebe, N. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5354–5362.
  24. Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33.
  25. Agarwal, A.; Arora, C. Depthformer: Multiscale vision transformer for monocular depth estimation with global local information fusion. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3873–3877.
  26. Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 5861–5870.
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  28. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  30. Chen, Y.; Liu, P.; Zhao, J.; Huang, K.; Yan, Q. Shallow-Guided Transformer for Semantic Segmentation of Hyperspectral Remote Sensing Imagery. Remote Sens. 2023, 15, 3366.
  31. Xie, Y.; Feng, D.; Xiong, S.; Zhu, J.; Liu, Y. Multi-scene building height estimation method based on shadow in high resolution imagery. Remote Sens. 2021, 13, 2862.
  32. Sun, Y.; Shahzad, M.; Zhu, X.X. Building height estimation in single SAR image using OSM building footprints. In Proceedings of the 2017 Joint Urban Remote Sensing Event (JURSE), Dubai, United Arab Emirates, 6–8 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4.
  33. Wang, C.; Pei, J.; Wang, Z.; Huang, Y.; Wu, J.; Yang, H.; Yang, J. When deep learning meets multi-task learning in SAR ATR: Simultaneous target recognition and segmentation. Remote Sens. 2020, 12, 3863.
  34. Ma, X.; Ji, K.; Zhang, L.; Feng, S.; Xiong, B.; Kuang, G. An open set recognition method for SAR targets based on multitask learning. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
  35. Heiselberg, P.; Sørensen, K.; Heiselberg, H. Ship velocity estimation in SAR images using multitask deep learning. Remote Sens. Environ. 2023, 288, 113492.
  36. Xiong, Z.; Chen, S.; Wang, Y.; Mou, L.; Zhu, X.X. GAMUS: A geometry-aware multi-modal semantic segmentation benchmark for remote sensing data. arXiv 2023, arXiv:2305.14914.
  37. Hambarde, P.; Dudhane, A.; Patil, P.W.; Murala, S.; Dhall, A. Depth estimation from single image and semantic prior. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1441–1445.
  38. Hambarde, P.; Murala, S.; Dhall, A. UW-GAN: Single-image depth estimation and image enhancement for underwater images. IEEE Trans. Instrum. Meas. 2021, 70, 1–12.
  39. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
  40. Atteia, G.; Collins, M.J.; Algarni, A.D.; Samee, N.A. Deep-Learning-Based Feature Extraction Approach for Significant Wave Height Prediction in SAR Mode Altimeter Data. Remote Sens. 2022, 14, 5569.
  41. Sun, Y.; Hua, Y.; Mou, L.; Zhu, X.X. Large-scale building height estimation from single VHR SAR image using fully convolutional network and GIS building footprints. In Proceedings of the 2019 Joint Urban Remote Sensing Event (JURSE), Vannes, France, 22–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
  42. Ding, Y.; Lin, L.; Wang, L.; Zhang, M.; Li, D. Digging into the multi-scale structure for a more refined depth map and 3D reconstruction. Neural Comput. Appl. 2020, 32, 11217–11228.
  43. Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards real-time monocular depth estimation for robotics: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961.
  44. Yuan, W.; Gu, X.; Dai, Z.; Zhu, S.; Tan, P. NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation. arXiv 2022, arXiv:2203.01502.
  45. Kim, D.; Ka, W.; Ahn, P.; Joo, D.; Chun, S.; Kim, J. Global-local path networks for monocular depth estimation with vertical cutdepth. arXiv 2022, arXiv:2201.07436.
  46. Chen, P.Y.; Liu, A.H.; Liu, Y.C.; Wang, Y.C.F. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2624–2632.
  47. Petrovai, A.; Nedevschi, S. Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1578–1588.
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241.
  49. Chen, Y.; Yan, Q. Vision Transformer is required for hyperspectral semantic segmentation. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 36–40.
  50. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4.
  51. Yan, Q.; Chen, Y.; Jin, S.; Liu, S.; Jia, Y.; Zhen, Y.; Chen, T.; Huang, W. Inland water mapping based on GA-LinkNet from CyGNSS data. IEEE Geosci. Remote Sens. Lett. 2022, 20, 1–5.
  52. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  53. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images. Remote Sens. 2020, 12, 1050.
  54. Deng, W.; Shi, Q.; Li, J. Attention-gate-based encoder–decoder network for automatical building extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2611–2620.
  55. Zheng, Z.; Zhong, Y.; Wang, J. Pop-Net: Encoder-dual decoder for semantic segmentation and single-view height estimation. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4963–4966.
  56. Xing, S.; Dong, Q.; Hu, Z. SCE-Net: Self-and cross-enhancement network for single-view height estimation and semantic segmentation. Remote Sens. 2022, 14, 2252.
  57. Zhang, B.; Wan, Y.; Zhang, Y.; Li, Y. JSH-Net: Joint semantic segmentation and height estimation using deep convolutional networks from single high-resolution remote sensing imagery. Int. J. Remote Sens. 2022, 43, 6307–6332.
  58. Chen, Y.; Wang, B.; Yan, Q.; Huang, B.; Jia, T.; Xue, B. Hyperspectral Remote-Sensing Classification Combining Transformer and Multiscale Residual Mechanisms. Laser Optoelectron. Prog. 2023, 60, 1228002.
  59. Bhattacharjee, D.; Zhang, T.; Süsstrunk, S.; Salzmann, M. Mult: An end-to-end multitask learning transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12031–12041.
  60. Srivastava, S.; Volpi, M.; Tuia, D. Joint height estimation and semantic labeling of monocular aerial images with CNNs. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5173–5176.
  61. Carvalho, M.; Le Saux, B.; Trouvé-Peloux, P.; Champagnat, F.; Almansa, A. Multitask learning of height and semantics from aerial images. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1391–1395.
  62. Gao, Z.; Sun, W.; Lu, Y.; Zhang, Y.; Song, W.; Zhang, Y.; Zhai, R. Joint learning of semantic segmentation and height estimation for remote sensing image leveraging contrastive learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614015.
  63. Lu, M.; Liu, J.; Wang, F.; Xiang, Y. Multi-Task learning of relative height estimation and semantic segmentation from single airborne rgb images. Remote Sens. 2022, 14, 3450.
  64. Zhao, W.; Persello, C.; Stein, A. Semantic-aware unsupervised domain adaptation for height estimation from single-view aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 372–385.