Although progress has been made in building footprint extraction in recent years, the diversity of remote-sensing image sources and the complexity of the environment still pose many challenges for this task.
In recent years, deep learning methods represented by the convolutional neural network (CNN) have shown great potential in the fields of computer vision
[11,12] and remote-sensing image interpretation
[13,14]. With their powerful ability to extract high-level features, CNN-based building footprint extraction methods alleviate the above-mentioned problems to a certain extent. Most of these methods adopt a fully convolutional encoder–decoder architecture. For example, Ji et al. proposed a Siamese U-shaped network named SiU-Net for building extraction, which improves robustness to buildings of different scales by simultaneously processing original images and their downsampled low-resolution counterparts
[15]. The method proposed by Sun et al. improves the detection accuracy of building edges by combining a CNN with an active contour model
[16]. Yuan et al. designed a CNN with a simple structure that integrates activations from multiple layers for pixel-level prediction and introduces a signed distance function of building boundaries as the output representation, which has a stronger representation ability
[17,18]. In addition, BRRNet, proposed by Shao et al., uses atrous convolutions with different dilation rates to extract more global features by gradually enlarging the receptive field during feature extraction, together with a residual refinement module that further refines the residual between the output of the prediction module and the ground truth
[19]. However, existing approaches still have limitations. Most of the methods above extend general end-to-end semantic segmentation methods; they do not analyze the characteristics of buildings themselves and do not filter noise effectively.
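To illustrate how atrous convolutions with growing dilation rates enlarge the receptive field, the following minimal sketch computes the per-axis receptive field of a stack of stride-1 convolutions. The function name `receptive_field` and the two dilation schedules are hypothetical and chosen only for illustration; they are not the configuration of BRRNet or of any other cited network.

```python
def receptive_field(kernel_size, dilations):
    """Per-axis receptive field of stacked stride-1 convolutions.

    A layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical schedules with the same depth and the same number of weights.
plain = receptive_field(3, [1, 1, 1, 1])   # four ordinary 3x3 convolutions
atrous = receptive_field(3, [1, 2, 4, 8])  # dilation doubled at each layer
print(plain, atrous)  # 9 31
```

Under these assumptions, the dilated stack covers a 31-pixel span instead of 9 without adding parameters, which is the sense in which atrous convolution captures more global context.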
2. Building Footprint Extraction Methods
Remote-sensing imagery provides effective data support for human activities that reshape the natural environment, and it has been widely used in Earth observation
[20,21,22]. With the rapid development of satellite and aerial imaging technology, high-resolution remote-sensing images make it possible to observe detailed ground targets such as buildings, roads, and vehicles. In particular, building footprint extraction is of great significance for urban development planning and urban disaster prevention and mitigation, since buildings are among the main man-made structures through which humans transform the Earth’s surface
[23,24,25,26]. Building footprint extraction has been a constant concern of scholars, and many methods have been proposed in the past decade. These methods can be grouped into two categories: conventional building footprint extraction methods and deep-learning-based building footprint extraction methods.
2.1. Conventional Building Footprint Extraction Methods
Building footprint extraction plays an important role in the interpretation and application of remote-sensing images
[27]. Early work extracted building footprints using different mathematical models or by combining multiple types of data. For instance, Reference
[28] designed a fully automatic building footprint extraction approach based on the differential morphological profile of high-resolution satellite imagery. In Reference
[29], a Bayesian approach is proposed to extract building footprints from aerial LiDAR data. This method employs a shortest-path algorithm and maximizes the posterior probability using linear optimization to automatically obtain building footprints. Sahar et al. used vector parcel geometries and their attributes to extract building footprints by integrating aerial imagery with geographic information system (GIS) data
[23]. These methods often require several types of supporting data to achieve building footprint extraction, and their results are not sufficiently reliable
[30,31]. In addition, scholars have devoted themselves to designing various hand-crafted features to automatically extract building footprints from high-resolution remote-sensing images. Zhang et al. devised a pixel shape index to extract buildings by classifying the shape and contour information of pixels
[32]. Huang et al. proposed a morphological building index for automatic building extraction in
[33]. Similarly, Huang et al. also developed a morphological shadow index for building extraction from high-resolution remote-sensing images
[34]. Moreover, some methods use morphological attributes to achieve building footprint extraction
[35,36]. In summary, these conventional approaches have been exploited to extract building footprints from high-resolution remote-sensing images.
2.2. Deep-Learning-Based Building Footprint Extraction Methods
Computational intelligence (CI) is a biology- and linguistics-driven computational paradigm
[37,38]. In recent years, deep learning, as a main pillar of CI, has been widely used in remote-sensing image interpretation thanks to its powerful layer-by-layer learning and nonlinear fitting capabilities, in tasks such as change detection
[14], scene classification
[39], semantic segmentation
[40], object detection
[41,42], etc. In this context, building footprint extraction methods based on deep learning have attracted the attention of many scholars. The building footprint extraction task can be treated as a semantic segmentation task with a single target class
[43]. Therefore, a straightforward idea is to use a deep-learning-based semantic segmentation network for building footprint extraction, which can exploit mainstream deep neural networks (such as VGGNet
[44], ResNet
[45], etc.) to mine deep semantic features for recognizing buildings. For example, compared with conventional methods, semantic segmentation networks such as the fully convolutional network (FCN)
[46] and U-Net
[47] based on VGGNet can achieve a substantial improvement in the performance of building footprint extraction
[17]. These methods have promoted research on deep-learning-based building footprint extraction. Building on them, many deep-learning-based approaches have recently been proposed for building footprint extraction from high-resolution remote-sensing images in an end-to-end manner
[43]. These recent methods can be broadly reviewed as follows.
As the spatial resolution of images continues to increase, the differences among buildings in material, color, texture, shape, scale, and distribution become more pronounced, which makes it difficult to accurately extract pixel-wise building footprints with conventional semantic segmentation networks
[48]. To overcome the above challenges, many novel networks based on multi-scale and attention structures have been proposed for building footprint extraction. For example, Ji et al. proposed a Siamese U-Net (SiU-Net) for multi-source building extraction
[15]. SiU-Net
[15] feeds down-sampled counterparts of the input images into a second Siamese branch to enhance the multi-scale perception ability of the network and improve the performance of building extraction. In
[49], a novel network with an encoder–decoder structure, named building residual refine network (BRRNet), is devised for building extraction, which introduces a residual refinement module to enlarge the receptive field of the network, thus improving the extraction of buildings at various scales. Chen et al. proposed a context feature enhancement network (CFENet) to extract building footprints
[50], which builds a spatial fusion module and focus enhancement module for enhancing multi-scale feature representation. Other similar networks can be found in
[51,52]. In addition to these networks with multi-scale structures, attention-based networks can also enhance multi-scale feature representation, thus effectively improving building footprint extraction accuracy. For instance, Guo et al. developed a U-Net with an attention block for building extraction in
[53]. In Reference
[54], a scene-driven multitask parallel attention convolutional network is proposed for building extraction from high-resolution remote-sensing images. An attention-gate-based and pyramid network (AGPNet) with an encoder–decoder structure is designed for building extraction in
[55], which integrates a grid-based attention gate and an atrous spatial pyramid pooling module to enhance multi-scale features. Other attention-based building footprint extraction methods are available in
[56,57,58,59].
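The multi-scale input idea described for SiU-Net can be sketched as follows: one branch receives the original tile and the other receives a down-sampled counterpart of the same tile. This is only an illustrative sketch. The helper names `downsample2x` and `siamese_inputs`, the factor of 2, and the use of average pooling are assumptions made here, not settings taken from the cited paper, and plain Python lists stand in for image arrays to keep the example dependency-free.

```python
def downsample2x(image):
    """Average-pool a 2-D grid by a factor of 2 (even height and width assumed)."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def siamese_inputs(image):
    """Return the (full-resolution, low-resolution) pair for the two branches."""
    return image, downsample2x(image)


# Hypothetical 4x4 single-band tile.
tile = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 2, 2],
    [8, 8, 2, 2],
]
full, low = siamese_inputs(tile)
print(low)  # [[0.0, 4.0], [8.0, 2.0]]
```

In a Siamese design, both inputs would then pass through branches that share weights, so that a large building in the full-resolution tile and the same building in the coarser tile are seen by the same filters at different apparent scales.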
Recently, some methods have introduced edge information and frequency information to improve building recognition
[48,60]. For instance, Zhu et al. proposed an edge-detail network for building extraction
[61], which exploits the edge information of the images to improve the identification of building footprints. In
[62], a multi-task frequency–spatial learning network is proposed for building extraction. Zhao et al. adopted a multi-scale attention-guided UNet++ with edge constraint to achieve accurate building footprint segmentation in
[63]. For other related papers, one can refer to the following studies
[64,65,66]. In addition, advanced transformer-based networks have also received attention for building extraction, such as References
[57,67,68]. These methods have largely contributed to the development of building footprint extraction.
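As a concrete picture of what "edge information" can mean for a footprint mask, the sketch below derives a boundary map from a binary building mask by marking building pixels that touch background in 4-connectivity. This is one common way to obtain an edge target and is given here only as an assumption for illustration; the function name `boundary_map` is hypothetical, and the cited edge-aware networks may define or learn edges differently.

```python
def boundary_map(mask):
    """Mark building pixels (1) that touch background (0) or the image border."""
    height, width = len(mask), len(mask[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1:
                continue  # background pixels are never edge pixels
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or mask[nr][nc] == 0:
                    edges[r][c] = 1
                    break
    return edges


# Hypothetical 5x5 mask containing one 3x3 building.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edges = boundary_map(mask)
print(sum(map(sum, edges)))  # 8
```

Only the interior pixel of the 3x3 block is left unmarked, so a loss or constraint defined on such a map concentrates on the building outline rather than on its interior.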