2. Handcrafted Feature-Based Models
Before the advent of deep learning, most of the best-performing algorithms relied on handcrafted features such as Histograms of Oriented Gradients (HOG) [15], Scale Invariant Feature Transform (SIFT) [16], Local Binary Patterns (LBP) [17] and the Gray Level Co-occurrence Matrix (GLCM) [18]. These features are used to train a statistical classifier that produces a semantic segmentation map by classifying the pixels of the input image.
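The generic pipeline shared by these models can be illustrated with a minimal sketch: handcrafted descriptors (here HOG and LBP via scikit-image, chosen as stand-ins for the features listed above) are computed per image patch, and a conventional classifier assigns each patch, and hence its pixels, a semantic label. This is an illustrative sketch under simplified assumptions, not the implementation of any specific cited model.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.svm import SVC

def patch_descriptor(patch):
    """Concatenate HOG and uniform-LBP histograms for one grayscale patch."""
    hog_vec = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), feature_vector=True)
    lbp = local_binary_pattern(patch, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([hog_vec, lbp_hist])

def segment(image, classifier, patch=32):
    """Label every non-overlapping patch; all pixels in a patch share its label."""
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            feat = patch_descriptor(image[y:y + patch, x:x + patch])
            labels[y:y + patch, x:x + patch] = classifier.predict([feat])[0]
    return labels

# Toy training data (random patches with synthetic labels), only to make the sketch runnable.
rng = np.random.default_rng(0)
train_patches = rng.random((40, 32, 32))
train_labels = rng.integers(0, 3, size=40)
clf = SVC(kernel="rbf").fit([patch_descriptor(p) for p in train_patches], train_labels)
seg_map = segment(rng.random((128, 128)), clf)
```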
Low-level features, namely semantic textons, are proposed in [19] and combined with decision trees to classify image pixels. The authors of [20] combine appearance and motion features and employ a probabilistic model based on a conditional random field (CRF) for semantic segmentation of road scenes. A Markov Random Field (MRF) is employed in [21] to segment objects in street-scene images. In [22], color and texture descriptors are computed for superpixels, and two separate KNN-based classifiers are trained to classify the superpixels and generate the segmentation map. Similarly, in [23], color and texture features are extracted from different regions of the image and an SVM model is trained to classify the pixels.
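A sketch of this superpixel-classification idea is given below: SLIC superpixels are computed, each superpixel is described by simple per-channel color statistics, and a trained KNN classifier labels every superpixel. The descriptors and settings here are placeholders rather than those used in the cited works.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.neighbors import KNeighborsClassifier

def superpixel_features(image, segments):
    """Per-superpixel mean and standard deviation of each RGB channel (crude color cue)."""
    feats = []
    for sp in np.unique(segments):
        region = image[segments == sp]
        feats.append(np.concatenate([region.mean(axis=0), region.std(axis=0)]))
    return np.array(feats)

def segment_with_knn(image, knn, n_segments=200):
    """Assign every superpixel the class predicted by a trained KNN classifier."""
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    feats = superpixel_features(image, segments)
    preds = knn.predict(feats)
    label_map = np.zeros(segments.shape, dtype=int)
    for sp, cls in zip(np.unique(segments), preds):
        label_map[segments == sp] = cls
    return label_map

# Toy usage with synthetic data, only to show the data flow.
rng = np.random.default_rng(1)
train_feats = rng.random((60, 6))            # 60 labelled superpixels, 6-dim descriptors
train_labels = rng.integers(0, 4, size=60)
knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels)
seg_map = segment_with_knn(rng.random((96, 96, 3)), knn)
```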
LBP features extracted from each region of the image are combined with spectral features in [24] for the segmentation of high-resolution satellite images. An entropy-based technique is proposed in [25] for the automatic segmentation of color aerial images; the authors also evaluated the model on grey aerial images and concluded that it performed better on color images. An unsupervised multicomponent aerial image segmentation model is proposed in [26] that employs a self-organizing map (SOM) and a hybrid genetic algorithm (HGA): the SOM extracts discriminating features from the image, and based on these features the HGA clusters different parts of the image into homogeneous regions. A land cover segmentation model is proposed in [27] that employs Structured Support Vector Machines (SSVM) to learn appearance features and local class interactions. An adaptive mean-shift clustering algorithm is employed in [28] for semantic segmentation of satellite images; the model first extracts color and texture features from different areas of the image and then applies mean-shift clustering to merge homogeneous regions.
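The mean-shift idea can be sketched as follows: every pixel is represented by color and (weighted) position features and the feature vectors are clustered with mean shift, after which each cluster acts as a homogeneous region. The feature choice here is a simplification of the color/texture descriptors described above, not the cited model's exact design.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def mean_shift_regions(image, spatial_weight=0.5):
    """Cluster pixels into homogeneous regions using color + weighted position features."""
    h, w, _ = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Each pixel becomes [R, G, B, weighted y, weighted x].
    feats = np.column_stack([
        image.reshape(-1, 3),
        spatial_weight * yy.ravel() / h,
        spatial_weight * xx.ravel() / w,
    ])
    bandwidth = estimate_bandwidth(feats, quantile=0.2, n_samples=500)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels.reshape(h, w)

regions = mean_shift_regions(np.random.default_rng(2).random((48, 48, 3)))
```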
A semantic segmentation model for urban aerial images is proposed in [29]. The model embeds geographic context in a pairwise CRF and trains a random forest on multiple descriptors to obtain class likelihoods for superpixels.
Although these handcrafted feature-based models perform well on simple semantic segmentation tasks, they exhibit poor performance in complex scenes. This may be attributed to the following reasons: (1) these models rely on the manual computation of complex features, which increases the computational cost; (2) handcrafted features are not robust and are prone to noise and illumination changes; and (3) these models lack global context and multi-scale features, because of which they tend to confuse different patterns, leading to misclassification.
3. Deep Learning Models
Deep learning models have achieved tremendous success in various visual tasks, including object detection [30], image recognition [31] and semantic segmentation [12]. With this success on natural images, researchers have explored and applied various deep learning models to aerial image analysis to extract meaningful information for scene understanding.
Generally, semantic segmentation of aerial images can be divided into the following categories: (1) road extraction, (2) building extraction and (3) land-cover segmentation.
Road extraction from satellite images provides crucial information for intelligent traffic monitoring. This information can be used to detect newly constructed roads and automatically update maps accordingly. For this reason, a significant amount of work [32,33,34,35,36,37] has been reported in the literature on road extraction from satellite images. A detailed survey of road extraction from satellite images is given in [38].
Building extraction from satellite images has a wide range of applications in urban planning [39], disaster management [40,41] and population estimation [42]. Although several models [43,44,45,46,47] have been proposed in recent years for the automatic extraction of building footprints from satellite images, these models suffer from a scale problem: because buildings come in very different sizes, it is challenging to extract their footprints precisely. For example, the MFBI model is proposed in [48] to address the problem of multiple scales. For multiple region extraction, an attention module with a multi-scale guidance framework is proposed in [49]. A multi-scale encoder–decoder framework is reported in [50] to extract local and global features and thus model the complex and diverse shapes of buildings in satellite images.
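The scale problem is commonly tackled by extracting features at several receptive-field sizes in parallel and fusing them. Below is a minimal PyTorch sketch of such a multi-scale block using parallel branches with different dilation rates; it is a generic illustration of the idea, not the architecture of any cited model.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation rates, fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        # Concatenate the per-scale responses along the channel axis, then fuse.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# e.g. features = MultiScaleBlock(64, 128)(torch.randn(1, 64, 64, 64))
```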
Land cover segmentation provides high-level semantic information about land classified into forests, vegetation, grasslands and barren lands. Such information is useful for land use management [51] and precision agriculture [52]. Owing to the immense benefits of land cover segmentation, several researchers have developed deep learning models [53,54,55,56,57] for the automatic segmentation of land cover types from high-resolution satellite images.
In addition to the methods above, several approaches have been reported that extract high-level semantic information for other tasks, including slum segmentation [58], farmland segmentation [59,60] and segmentation of residential solar panels [61,62]. A fully convolutional network (FCN) is proposed in [63] to identify slums in satellite images. Similarly, a deep fully convolutional network, DeepUNet, is proposed in [64] for sea–land segmentation in satellite images. The network follows a pipeline similar to that of the popular U-Net [10] (initially introduced for biomedical image segmentation); however, instead of plain convolutional layers in the encoder and decoder, DeepUNet introduces DownBlocks in the encoder and UpBlocks in the decoder. These blocks are connected via U-connections and Plus connections to obtain more precise segmentation results.
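One way to read this design is sketched below in PyTorch: each DownBlock keeps an additive ("Plus") shortcut around its convolutions before downsampling, and each UpBlock fuses the upsampled features with the matching encoder output (the "U-connection") before its own Plus shortcut. This is an interpretation for illustration only, not the authors' exact DeepUNet implementation.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Two convolutions with an additive (Plus) shortcut, followed by 2x downsampling."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.conv(x) + x          # Plus connection
        return self.pool(skip), skip     # skip feeds the U-connection

class UpBlock(nn.Module):
    """Upsample, fuse with the encoder skip (U-connection), refine with a Plus shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.reduce = nn.Conv2d(2 * ch, ch, 1)
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x, skip):
        x = self.reduce(torch.cat([self.up(x), skip], dim=1))  # U-connection
        return self.conv(x) + x                                 # Plus connection

# e.g. pooled, skip = DownBlock(32)(torch.randn(1, 32, 64, 64)); out = UpBlock(32)(pooled, skip)
```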
TreeUNet [65] extends DeepUNet by introducing skip connections to discriminate pixels of apparently similar classes for land cover segmentation in satellite images. Similarly, a deep learning framework, ResUNet-a, is proposed in [66] that integrates atrous convolution layers, pyramid scene parsing and residual connections with UNet to identify the boundaries of different patterns. Recently, attention mechanisms have been introduced into deep learning networks to model long-range dependencies and further refine the feature maps; with this strategy, the network focuses more on the object of interest and pays little attention to the background. A channel attention mechanism integrated with an FCN is proposed in [67] for the semantic segmentation of aerial images. Similarly, a hybrid attention mechanism is introduced in [68] to capture global relationships for a better representation of features.
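A minimal channel-attention block of the kind described here can be sketched as follows: global average pooling produces per-channel statistics, a small bottleneck maps them to per-channel weights, and the weights rescale the feature map (squeeze-and-excitation style). This is a generic illustration rather than the exact module of either cited work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweight feature-map channels using globally pooled statistics (SE-style)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pool per channel
        return x * weights.view(b, c, 1, 1)      # excite: rescale each channel

# e.g. refined = ChannelAttention(64)(torch.randn(1, 64, 32, 32))
```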