2.3.1. Hybrid Semantic Segmentation
We use the word “hybrid” in this subcategory because the crack segmentation procedure builds on one of the two aforementioned settings (i.e., either IC or OR). In other words, as the first step, an IC setting (or an OR setting) is used to find crack patches (or crack candidate regions), and then a technique is employed to segment the crack pixels in the identified crack patches. The employed technique could be either an IPT or a shallow fully convolutional network (FCN). For the former case, several IPTs, such as structured Random Forest edge detection [75], tubularity flow [76], Otsu’s thresholding [77], and fast block-wise segmentation [78], have been used. For the latter case, mainly the Mask R-CNN [79] framework, which is an extended version of Faster R-CNN, has been used. Mask R-CNN adds a shallow FCN that performs SS on the bounding boxes which contain cracks [75,80,81].
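The two-stage procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of any reviewed study: `hybrid_segment`, `dark_patch_classifier`, and `fixed_threshold_segmenter` are hypothetical names, and the two stand-in functions replace what would in practice be a CNN classifier (or region detector) and an IPT or FCN.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1]
Mask = List[List[int]]     # 1 = crack pixel, 0 = background


def hybrid_segment(
    image: Image,
    patch: int,
    is_crack_patch: Callable[[Image], bool],
    segment_patch: Callable[[Image], Mask],
) -> Mask:
    """Stage 1 flags crack patches; stage 2 segments only the flagged patches."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            if not is_crack_patch(tile):
                continue  # background patches are never segmented
            for dy, mask_row in enumerate(segment_patch(tile)):
                for dx, value in enumerate(mask_row):
                    mask[top + dy][left + dx] = value
    return mask


# Hypothetical stand-ins: a real system would use a CNN and an IPT or FCN here.
def dark_patch_classifier(tile: Image) -> bool:
    return min(min(row) for row in tile) < 0.3


def fixed_threshold_segmenter(tile: Image) -> Mask:
    return [[1 if v < 0.3 else 0 for v in row] for row in tile]


image = [
    [0.9, 0.9, 0.8, 0.9],
    [0.9, 0.1, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.9],
]
mask = hybrid_segment(image, 2, dark_patch_classifier, fixed_threshold_segmenter)
```

Because the second stage runs only on patches accepted by the first, a false negative at the patch level cannot be recovered at the pixel level, which is one reason the quality of the first stage matters in hybrid approaches.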
Several studies have first used classification architectures to find crack patches, and then different IPTs to extract the crack map at the pixel level. Ni et al. [77] used GoogLeNet [82] and ResNet [33] classifiers to detect crack patches. Then, Otsu’s thresholding was employed to segment the identified crack patches, followed by median filtering and the Hessian matrix, which were used to eliminate the influence of illumination and to enhance the crack structures, respectively. In [78], transfer learning was employed at the crack patch identification step using an architecture pre-trained on the ImageNet data set. Then, fast block-wise segmentation and tensor voting curve detection were employed to create the crack mask and improve the accuracy of crack localization, respectively.
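To make the thresholding step concrete, the following is a minimal sketch of Otsu’s method applied to a single crack patch. It assumes 8-bit intensities and cracks darker than the background; the function name `otsu_threshold` and the toy patch are illustrative and are not taken from the reviewed studies.

```python
from typing import List


def otsu_threshold(pixels: List[int], levels: int = 256) -> int:
    """Return the threshold t maximising the between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low, sum_low = 0, 0
    for t in range(levels):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, variance undefined
        mean_low = sum_low / count_low
        mean_high = (total_sum - sum_low) / count_high
        # Proportional to the between-class variance (constant factor dropped).
        between_var = count_low * count_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Toy 8-bit crack patch: dark crack pixels (~30) on bright concrete (~200).
patch = [
    [200, 205, 30, 198],
    [202, 28, 32, 201],
    [199, 31, 203, 204],
]
flat = [p for row in patch for p in row]
t = otsu_threshold(flat)
crack_mask = [[1 if p <= t else 0 for p in row] for row in patch]
```

The threshold is chosen per patch, so it adapts to local brightness, but it still assumes a roughly bimodal histogram; this is consistent with the additional filtering steps reported in [77].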
Faster R-CNN has also been used to identify crack regions as the first step, combined with IPTs to extract the crack mask. In [83], a Bayesian fusion algorithm was employed to suppress false alarms based on the orientation of the identified crack regions. Then, a set of IPTs, namely morphological operations, was used to extract the crack mask. Kang et al. [76] proposed to perform crack segmentation in the same manner. However, in their approach, after crack region identification by Faster R-CNN, a modified tubularity flow field, an IPT, was employed to perform crack segmentation on concrete images taken in both indoor and outdoor environments.
Among the approaches in which crack patches or regions are identified first and an FCN is then used to perform crack segmentation at the pixel level, Mask R-CNN has been the most widely employed [75,80,81,84]. It is noteworthy that Tan et al. [84] defined a new threshold resulting in a more desirable segmentation of irregular long-thin cracks. Wei et al. [80] employed Mask R-CNN to detect concrete bugholes at the pixel level. In [75], a comprehensive comparison was performed between Mask R-CNN and a framework in which Faster R-CNN was combined with structured random forest edge detection for the task of crack segmentation. The authors showed that Mask R-CNN achieves higher accuracy than that combination.
Three more hybrid SS approaches were found in the literature, in which deep architectures are utilised after patch detection to output the crack segmentation map. In [85], GoogLeNet [82] was employed for crack patch detection. Then, the detected patches were fed into a feature fusion module and a set of consecutive convolution layers to perform the crack segmentation. Zhang et al. [86] proposed a Sobel-edge adaptive sliding window technique to extract crack patches, which is computationally more efficient than the standard sliding window technique. Moreover, in the crack patch extraction step, a non-maximum image patch suppression strategy was proposed to further reduce the overall processing time by suppressing identified crack patches while keeping those with significant local edge texture. Once the crack patches are detected, the SegNet [87] architecture is employed to output the crack mask. In [88], AlexNet pre-trained on the ImageNet data set is first used to classify the image patches into the crack, sealed crack, and background classes. Then, the knowledge acquired by the classification network is transferred into an FCN equipped with dilated convolution to perform crack classification at the pixel level.
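The idea of using edge responses to decide which windows are worth passing to the segmentation network can be illustrated as follows. This is a simplified sketch under stated assumptions, not the adaptive technique or the suppression strategy of [86]: it scores non-overlapping windows by their mean Sobel response and keeps those above a threshold. The function names, the L1 gradient magnitude, and the `min_score` parameter are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image: Image) -> Image:
    """Gradient magnitude |gx| + |gy|; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            out[y][x] = abs(gx) + abs(gy)
    return out


def candidate_windows(image: Image, size: int, min_score: float) -> List[Tuple[int, int]]:
    """Top-left corners of non-overlapping windows whose mean edge response is >= min_score."""
    edges = sobel_magnitude(image)
    h, w = len(image), len(image[0])
    keep = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            total = sum(
                edges[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size)
            )
            if total / (size * size) >= min_score:
                keep.append((top, left))
    return keep


# 8x8 toy image: uniform background with a dark vertical line in column 2.
image = [[1.0] * 8 for _ in range(8)]
for row in image:
    row[2] = 0.0
windows = candidate_windows(image, size=4, min_score=0.5)
```

Only the two windows that contain the line are kept, so the segmentation network would process half of the image in this toy case.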
It is worth mentioning that, since crack detection is performed at the pixel level in the SS setting, crack quantification has frequently been considered in studies in this subcategory via different methods [76,77,80,81]. To name a few, the distance transform method, the Zernike moment operator, and Otsu’s method with the Canny edge detector have been considered in [76,77,81], respectively, to estimate the width and length of the detected cracks. A summary of the deep crack segmentation studies based on the hybrid SS setting is shown in Table 3.
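As an illustration of how a distance transform can be used for crack quantification, the following sketch estimates the maximum crack width from a binary mask. It is a brute-force toy version under stated assumptions, not the modified method of [76]: the function names are hypothetical, the mask must contain at least one background pixel, and the rule "width = 2 × largest distance − 1" is only exact for straight cracks with an odd width in pixels.

```python
import math
from typing import List

Mask = List[List[int]]  # 1 = crack pixel, 0 = background


def distance_transform(mask: Mask) -> List[List[float]]:
    """Euclidean distance from each crack pixel to the nearest background pixel."""
    background = [
        (y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 0
    ]
    dist = [[0.0] * len(mask[0]) for _ in mask]
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v == 1:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in background)
    return dist


def max_width_estimate(mask: Mask) -> float:
    """Rough maximum crack width in pixels: 2 * (largest distance) - 1."""
    dist = distance_transform(mask)
    return 2 * max(max(row) for row in dist) - 1


# 7x7 mask with a horizontal crack that is three pixels thick (rows 2 to 4).
mask = [[1 if 2 <= y <= 4 else 0 for _ in range(7)] for y in range(7)]
width = max_width_estimate(mask)
```

Converting the pixel width into a physical width additionally requires the image scale (e.g., millimetres per pixel), which is outside the scope of this sketch.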
Table 3. Summary of deep crack segmentation approaches based on the hybrid SS setting.
Ref | Novelty/Novelties | Method | Core Architecture
[78] | 1. Application of transfer learning to identify crack and sealed crack patches. 2. Applying fast block-wise segmentation based on linear regression to segment the identified patches. 3. Application of tensor voting curve detection to extract the detected crack curves. | IC + IP | -
[88] | 1. Application of pre-trained AlexNet on ImageNet data set to classify road images into the crack, sealed crack, and background patches. 2. Application of an FCN combined with dilated convolution to segment the detected crack patches. | IC + FCN | AlexNet
[85] | Proposing a crack delineation network including a generic pre-trained CNN model (GoogLeNet) combined with a feature pyramid network to achieve feature-map fusion. | IC + FCN | GoogLeNet
[77] | 1. Proposing a dual-scale CNN to perform crack patch classification and segmentation. 2. Application of Zernike moment operator for quantitative crack width estimation. | IC + IP | GoogLeNet
[80] | Application of Mask R-CNN for bughole segmentation in concrete surface images and quantification of the segmented bugholes. | OR + FCN | Mask R-CNN
[84] | Application of Mask R-CNN to detect cracks in pavement image data sets. | OR + FCN | Mask R-CNN
[83] | Application of Faster R-CNN combined with a Bayesian probability algorithm to suppress false detection and a set of IPTs to perform crack segmentation. | OR + IP | Faster R-CNN
[81] | 1. Application of Mask R-CNN to output crack masks. 2. Applying IPTs to quantify the detected masks. | OR + FCN | Mask R-CNN
[86] | Proposing a new hybrid crack segmentation approach based on the global non-overlapping sliding windows and Sobel-edge detector to identify crack patches combined with a deep encoder–decoder architecture (SegNet) to perform crack segmentation. | IC + FCN | SegNet
[75] | Performing a comparative study between two crack segmentation frameworks of (i) Faster R-CNN combined with structured Random Forest edge detection and (ii) Mask R-CNN. | OR + IP and OR + FCN | Faster R-CNN, Mask R-CNN
[76] | 1. Performing crack segmentation using Faster R-CNN combined with a modified tubularity flow field. 2. Performing crack quantification using a modified distance transform method. | OR + IP | Faster R-CNN
2.3.2. Pure Semantic Segmentation
Pure crack segmentation is performed with no crack patch or crack candidate region identification beforehand. Therefore, no IC or OR approach is involved in this procedure. In the literature, several approaches have been considered for performing pure SS for crack detection. Pure SS can be done with the typical architecture utilised for IC if the fully connected layers are replaced with convolutional layers, resulting in an FCN [89,90] (aka an encoder–decoder structure). This opens the door to the different FCN structures used for SS in computer vision. In the area of crack detection, pure crack segmentation is mostly performed with encoder–decoder structures. However, several studies have employed other techniques to perform crack segmentation at the pixel level. These approaches are reviewed first, followed by the encoder–decoder approaches.
Crack segmentation at the pixel level was done with an IC architecture in [91,92]. In both of these studies, once the patches are extracted from the pavement images, crack patches (i.e., positive patches) are defined based on whether the centre pixels are crack pixels or not. After training and classification of the positive patches, the crack mask at the pixel level is obtained by stacking the positive pixels together. The difference between these studies is that in [91], more than one pixel in the centre defines whether the patch is positive or not, resulting in thicker crack masks after detection. On the other hand, in [92], only the centre pixel is considered as the criterion. The same idea of crack structure prediction as in [92] was employed by Fan et al. [93] to perform crack detection with high accuracy. It is worth noting that they removed the pooling layers from the main structure to avoid losing spatial information, and that crack quantification was also considered [93]. CrackNet II [94] and CrackNet-V [95] were proposed as improvements to CrackNet [96] in terms of both learning capability and the required processing time. It is worth mentioning that the CrackNet framework is based on both handcrafted features and deep learning, whereas CrackNet II and CrackNet-V are deep learning-based approaches utilised for SS of 3D crack images. Neither approach includes any pooling layers, and the image width and height are invariant through the consecutive convolution layers, resulting in a supervised classification at the pixel level. The difference between the two is that CrackNet-V, in addition to a set of consecutive convolution layers, includes a pre-processing module based on the median filter and Z-score normalization. The same research group also proposed CrackNet-R [97] as another improved version of the CrackNet architecture, based on a recurrent neural network (RNN) for the task of crack segmentation.
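The effect of the centre-pixel criterion used in [91,92] on the thickness of the resulting mask can be sketched as follows. The snippet is an illustration only: the trained CNN is replaced by a lookup in the ground-truth mask, the rule "positive if any inspected centre pixel is a crack pixel" is an assumption, and the names `patch_label` and `stack_labels` are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # 1 = crack pixel, 0 = background


def patch_label(mask: Mask, cy: int, cx: int, centre: int = 1) -> int:
    """Label of the patch centred at (cy, cx).

    centre=1 inspects only the centre pixel; centre=3 inspects the 3x3 block
    around it, and so on. The patch is positive if any inspected pixel is a
    crack pixel (assumed rule).
    """
    r = centre // 2
    h, w = len(mask), len(mask[0])
    for y in range(max(0, cy - r), min(h, cy + r + 1)):
        for x in range(max(0, cx - r), min(w, cx + r + 1)):
            if mask[y][x] == 1:
                return 1
    return 0


def stack_labels(mask: Mask, centre: int) -> Mask:
    """Classify the patch around every pixel and stack the results into a pixel-level map."""
    return [
        [patch_label(mask, y, x, centre) for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


# Ground truth with a one-pixel-wide vertical crack in column 2.
truth = [[1 if x == 2 else 0 for x in range(5)] for _ in range(5)]
thin = stack_labels(truth, centre=1)   # reproduces the one-pixel crack
thick = stack_labels(truth, centre=3)  # crack grows to three pixels wide
```

With a single centre pixel the stacked map matches the ground truth, whereas a 3 × 3 centre region widens the crack by one pixel on each side, which is the thickening effect noted above.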
In order to perform SS with an encoder–decoder structure, a backbone architecture, formed by a sequence of convolution, pooling, and activation layers, is used to extract deep features. Since the width and height of the input image decrease as it passes through the backbone architecture, a decoder module, formed by a sequence of deconvolution (aka transposed convolution or fractionally strided convolution) layers, is employed to restore the size of the features to the original image size so that the classification can be done at the pixel level [98]. The encoder–decoder structure has been widely applied to perform crack segmentation. Among the various well-known architectures proposed for SS in computer vision, U-Net [99], SegNet [87], and FC-DenseNet [100] have been the most frequently considered in the crack detection area. It must be added that to achieve higher accuracy in the SS setting, it is important to let the contextual information flow through the architecture [101]. To achieve this, various approaches have been considered in the crack detection area, as follows.
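The shrinking and restoring of the spatial size in an encoder–decoder structure can be followed with the standard output-size formulas for convolution and transposed convolution. The sketch below assumes no dilation and no output padding; the three-stage layout, the 256-pixel input, and the function names are illustrative choices rather than the configuration of any reviewed architecture.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a transposed convolution (no output padding, no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel


size = 256
for _ in range(3):
    size = conv_out(size, kernel=3, padding=1)   # 3x3 convolution keeps the size
    size = conv_out(size, kernel=2, stride=2)    # 2x2 pooling halves it
encoded = size                                    # 256 -> 128 -> 64 -> 32

for _ in range(3):
    size = deconv_out(size, kernel=2, stride=2)  # each layer doubles the size
restored = size                                   # 32 -> 64 -> 128 -> 256
```

Each pooling stage halves the resolution, so three stride-2 transposed convolutions are needed to return to the input size and produce one prediction per pixel.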
U-Net is an architecture first proposed for biomedical image segmentation [99]. To the best of the authors’ knowledge, David Jenkins et al. [102] performed the first application of the U-Net architecture in the crack detection area and showed its superiority for the task of crack segmentation over approaches based on handcrafted features combined with classical machine learning algorithms. It must be added that in the crack detection area, there is a challenge of class imbalance because background pixels far outnumber crack pixels. To address this challenge, different techniques and approaches have been proposed, which in some studies constitute one of the contributions to the field. In [103], the U-Net architecture was equipped with a new loss function based on the distance transform to deal with class imbalance. To further improve the performance of U-Net for crack segmentation, Konig et al. [104] combined the U-Net architecture with residual connections and an attention gating mechanism. An ablation study on the application of the U-Net architecture for crack segmentation was carried out in [105]: three U-Net architectures with different depths and numbers of layers were utilised and compared on the challenging CFD and Aigle-RN data sets. In another study [106], to cope with class imbalance, the focal loss function was employed for training the U-Net architecture to ensure higher generalization performance. Fan et al. [107] proposed a modified version of U-Net in which two modules are added to the architecture to boost the performance of crack detection. The added modules are a multi-dilation module equipped with dilated convolution and a hierarchical feature learning module, which are responsible for obtaining crack features of multiple context sizes and for deriving multi-scale features from high- to low-level convolution layers, respectively [107]. Zhang et al. [108] asserted that in industrial pixel-level crack detection, the class imbalance challenge results in a common problem called “All Black”. “All Black” happens when the algorithm classifies all the pixels of the pavement image as background and still achieves good accuracy. To solve this problem, they proposed an approach called CrackGAN, which is based on generative adversarial networks (GANs) [109]. In this approach, an asymmetric U-Net is employed as the backbone architecture of the generator network, providing the ability to work with images of arbitrary sizes.
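The “All Black” problem and the motivation for the focal loss can both be reproduced with a few lines of arithmetic. The sketch below uses the usual per-pixel focal loss, -α_t (1 - p_t)^γ log(p_t); the values γ = 2 and α = 0.25, the 2% crack ratio, and the predicted probabilities are illustrative assumptions and are not taken from [106] or [108].

```python
import math
from typing import List, Tuple


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross-entropy for one pixel; p is the predicted crack probability, y the label."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one pixel: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def accuracy_and_recall(pred: List[int], truth: List[int]) -> Tuple[float, float]:
    correct = sum(int(a == b) for a, b in zip(pred, truth))
    true_pos = sum(int(a == 1 and b == 1) for a, b in zip(pred, truth))
    return correct / len(truth), true_pos / max(1, sum(truth))


# 1000 pixels, only 20 of which are crack pixels (illustrative ratio).
truth = [1] * 20 + [0] * 980
all_black = [0] * 1000
acc, rec = accuracy_and_recall(all_black, truth)  # high accuracy, zero recall

easy_bg = (0.05, 0)     # background pixel predicted correctly with confidence
hard_crack = (0.05, 1)  # crack pixel that the model misses
ce_ratio = binary_cross_entropy(*hard_crack) / binary_cross_entropy(*easy_bg)
fl_ratio = focal_loss(*hard_crack) / focal_loss(*easy_bg)
```

Predicting only background reaches 98% accuracy while detecting no crack pixel at all, which is why accuracy alone is misleading here. Relative to an easy background pixel, the missed crack pixel weighs far more under the focal loss than under cross-entropy, so the many easy background pixels dominate the training signal less.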
SegNet is an encoder–decoder architecture in which the encoder is the VGG-16 [38] architecture without the fully connected layers [87]. The main feature of SegNet is that during decoding, the max-pooling indices of the corresponding encoder layer are recalled and used for up-sampling in the decoder module. This makes SegNet faster than U-Net. A pavement and bridge crack segmentation network inspired by the SegNet architecture was proposed by Chen et al. [110]. To train the end-to-end deep learning model proposed in this study, the “AdaDelta” optimizer and the cross-entropy loss function were used. Zou et al. [111] modified the SegNet architecture to perform better on line-like object segmentation in the area of crack detection. The proposed approach is called “DeepCrack”, and feature fusion at different scales was its main novelty. In particular, they fused the sparse features at smaller scales with the continuous features at larger scales to improve crack segmentation performance. It is stated in [112] that the encoder–decoder structure proposed in that study, which is inspired by SegNet [87], FCN [89], and ZFNet [113], was the first application of deep learning to find cracks in black-box images. A ResNet [33] pre-trained on the ImageNet data set was employed as the encoder module. In the decoder module, the deconvolution technique of ZFNet and SegNet, which is based on storing the location information in the max-pooling layers and boundary information in the encoder feature maps, was used in the three deconvolutional layers.
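The use of max-pooling indices for up-sampling can be illustrated on a single feature map. The following is a minimal sketch assuming 2 × 2 pooling with stride 2 and even input dimensions; the function names are hypothetical, and a real decoder would follow the unpooling step with trainable convolutions that densify the sparse map.

```python
from typing import List, Tuple

Grid = List[List[float]]
Indices = List[List[Tuple[int, int]]]


def max_pool_with_indices(x: Grid) -> Tuple[Grid, Indices]:
    """2x2 max pooling (stride 2) that also records where each maximum came from."""
    pooled, indices = [], []
    for top in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for left in range(0, len(x[0]), 2):
            cells = [
                (x[top + dy][left + dx], (top + dy, left + dx))
                for dy in (0, 1)
                for dx in (0, 1)
            ]
            value, position = max(cells, key=lambda c: c[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled: Grid, indices: Indices, height: int, width: int) -> Grid:
    """Place each pooled value back at its recorded position; all other cells stay 0."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (y, x) in zip(pooled_row, index_row):
            out[y][x] = value
    return out


features = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 0.0, 3.0, 1.0],
]
pooled, indices = max_pool_with_indices(features)
restored = max_unpool(pooled, indices, 4, 4)
```

Only one position per pooling window has to be remembered, instead of the full encoder feature map, and the restored map keeps each maximum at its original location, which helps preserve thin boundaries such as crack edges.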
DenseNet was first proposed for the task of IC as an improvement to the ResNet [33] architecture [114]. The ResNet architecture involves a large number of parameters because each layer has its own weights, and it has been shown that many layers in the ResNet architecture contribute very little. The DenseNet architecture was proposed as a solution to this challenge. Its main feature is the use of dense blocks, which alleviate the vanishing gradient problem, improve feature reuse, and significantly decrease the number of parameters. Later the same year, the FCN version of DenseNet, known as FC-DenseNet, was proposed by Jegou et al. [100] for the task of SS. To the best of the authors’ knowledge, the first application of FC-DenseNet for the task of crack segmentation was by Li et al. [115]. In this study, an FCN-based approach obtained by fine-tuning the DenseNet-121 architecture was used to perform multiple damage detection on a challenging data set. In addition to the crack images, the data set includes spalling, efflorescence, and hole images captured using a smartphone under different illumination conditions. They showed that FC-DenseNet outperforms the SegNet architecture for the task of crack segmentation. An approach called “DenseCrack” was proposed by Mei and Gül [116], in which an encoder–decoder structure based on dense blocks is combined with a depth-first search (DFS)-based algorithm as a refinement module. This study can also be considered an ablation study because three architectures, DenseCrack121, 169, and 201, were employed and compared. In [117], the “ConnCrack” approach, which is based on GANs, was proposed. The DenseNet121 architecture was employed as the generator of a conditional Wasserstein GAN (cWGAN) in this study. Another approach based on FC-DenseNet was proposed by the same research group in [74]. In this study, to further improve the accuracy of the SS, a new loss function based on the connectivity of the pixels in the crack areas is defined. The new loss function benefits the crack segmentation performance both by dealing with the class imbalance challenge and by taking into consideration the connectivity of crack pixels. The authors carried out a comprehensive comparative study between the proposed approach and different published state-of-the-art crack segmentation approaches on three different data sets.
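How dense blocks reuse features can be seen from the channel bookkeeping alone: every layer in a block receives the concatenation of all preceding feature maps and appends a fixed number of new ones (the growth rate). The sketch below assumes the configuration commonly reported for DenseNet-121, i.e., blocks of 6, 12, 24, and 16 layers, a growth rate of 32, 64 channels after the stem, and a compression factor of 0.5 in the transition layers; these values and the function names are not taken from the reviewed crack detection studies.

```python
from typing import List


def dense_block_channels(in_channels: int, num_layers: int, growth_rate: int) -> int:
    """Channels leaving a dense block: every layer appends growth_rate new feature maps."""
    return in_channels + num_layers * growth_rate


def densenet_channels(
    block_layers: List[int],
    growth_rate: int = 32,
    stem_channels: int = 64,
    compression: float = 0.5,
) -> List[int]:
    """Channel count after each dense block, with a compressing transition between blocks."""
    channels, after_blocks = stem_channels, []
    for i, layers in enumerate(block_layers):
        channels = dense_block_channels(channels, layers, growth_rate)
        after_blocks.append(channels)
        if i < len(block_layers) - 1:
            channels = int(channels * compression)  # transition layer
    return after_blocks


densenet121 = densenet_channels([6, 12, 24, 16])
```

Because each layer only has to produce 32 new feature maps and can read all earlier ones directly, individual layers stay narrow, which is where the parameter savings mentioned above come from.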
Several other studies in the crack detection area have applied encoder–decoder structures that are not inspired by well-known architectures in computer vision. In these studies, the technique employed to let the contextual information flow between the encoder and the decoder module is notable. In Yang et al. [118], the up-sampling part is the core novelty of the proposed architecture. It combines global and local information by adding specific convolutional and deconvolutional layers, enabling the proposed architecture to deal with multi-scale and multi-level images. In [119], transfer learning and dilated convolution before the decoder module are combined to perform crack detection at the pixel level. Dilated convolution (aka atrous convolution) was originally designed to aggregate contextual information at multiple scales for SS [120]. Another approach called “DeepCrack” was proposed by Liu et al. [121]. In this study, to let the contextual information flow, side-output layers are inserted after convolutional layers. After deep supervision at each side-output layer, the outputs are concatenated to form a final output layer that acquires multi-scale and multi-level features. Then, the fused prediction is refined by conditional random field (CRF) and guided filtering (GF) modules to improve the accuracy of SS. The same technique of a side network, including 1 × 1 convolution layers applied at each level and associated deconvolution layers, was employed by Yang et al. [2]. However, they noted that the side-outputs of low-level layers are messy because they lack context information. To solve this issue, the authors placed a feature pyramid module between the backbone architecture and the side network, combining multi-scale context information into low-level feature maps. They comprehensively compared the performance of their proposed approach with state-of-the-art deep learning-based crack segmentation approaches over five different data sets. On the other hand, simple encoder–decoder structures without any technique for fusing feature maps at different scales were considered in [73,122]. In [122], a comparison between three different pre-trained deep CNNs (i.e., VGG-16 [38], InceptionV3 [123], and ResNet [33]) as the backbone architectures in the encoder module was performed. The main contribution of [73], aimed at the class imbalance problem, was the development of an image generation algorithm using the Brownian motion process and a Gaussian kernel to generate simulated crack images. A summary of the deep crack segmentation studies based on the pure SS setting is presented in Table 4.
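The way dilated convolution enlarges the context seen by each output without adding weights can be shown in one dimension. The following is a minimal sketch with no padding and a stride of 1; the function names and the toy signal are illustrative assumptions.

```python
from typing import List


def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Valid (no padding) 1D cross-correlation with gaps of `dilation` between kernel taps."""
    span = effective_kernel(len(kernel), dilation)
    return [
        sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
dense = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1)   # each output sees 3 samples
atrous = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)  # same 3 weights, 5-sample span
```

With a dilation of 2, the same three weights cover five input samples, so stacking layers with increasing dilation aggregates context at several scales while the resolution of the feature map is kept.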
Table 4. Summary of deep crack segmentation approaches based on the pure SS setting.
Ref | Novelty/Novelties | Method
[91] | The first application of deep learning for the task of crack segmentation, where using ConvNet, the feature extraction is done on raw images. | Centre crack pixels in the patches
[94] | Proposing an improved version of CrackNet called CrackNet II resulting in higher performance in terms of accuracy and speed. | Consecutive conv layers with an invariant spatial size
[118] | 1. Application of an encoder–decoder structure to perform crack segmentation without employing the sliding windows technique. 2. Extracting the geometric characteristic of cracks via the application of morphological operations. | Encoder–decoder (feature fusion)
[102] | The first application of U-Net architecture in the crack detection area to cope with several limitations of applying CNN for the task of crack detection. | Encoder–decoder (U-Net)
[103] | Application of U-Net architecture equipped with a new proposed loss function based on distance transform to perform the task of crack segmentation. | Encoder–decoder (U-Net)
[92] | 1. Predicting the crack structure using CNN. 2. Proposing a strategy to deal with the class imbalance challenge. | Centre crack pixels in the patches
[97] | Proposing an improved version of CrackNet called CrackNet-R based on RNN, including a new recurrent unit. | RNN
[122] | Investigation of the performance of FCN architectures in the crack detection area for the task of crack segmentation. | Encoder–decoder
[104] | Proposing an encoder–decoder structure based on U-Net architecture combined with attention gating and residual connections to improve the performance. | Encoder–decoder (U-Net)
[105] | Investigating the effect of the depth of the U-Net architecture on the crack segmentation performance. | Encoder–decoder (U-Net)
[111] | Designing a new end-to-end trainable neural network based on SegNet architecture for robust crack detection. | Encoder–decoder (SegNet)
[115] | Proposing a deeper and more comprehensive FCN architecture to detect four concrete damages where it requires no sliding window technique. | Encoder–decoder (FC-DenseNet)
[112] | Performing a successful application of deep learning methods for detecting road cracks in black-box images. | Encoder–decoder (ResNet + SegNet, FCN, ZFNet)
[73] | 1. Proving the superiority of the SS over OR setting. 2. Proposing a method based on Gaussian kernel and Brownian motion process to generate arbitrary simulated crack images. | Encoder–decoder
[119] | Application of an FCN architecture based on dilated convolution to perform SS of crack images. | Encoder–decoder (dilated convolution)
[121] | 1. Proposing an FCN combined with conditional random field and guided filtering methods. 2. Designing a new loss function to deal with the class imbalance challenge. 3. Making the employed data set publicly available. | Encoder–decoder (feature fusion)
[106] | Application of U-Net architecture for crack segmentation where Adam algorithm and focal loss function are used as the optimizer and evaluation function, respectively. | Encoder–decoder (U-Net)
[2] | Proposing a feature pyramid and hierarchical boosting network to achieve more robust feature representation and deal with the class imbalance challenge. | Encoder–decoder (feature fusion)
[74] | 1. Proposing a novel deep architecture based on dense blocks. 2. Proposing a novel loss function based on the connectivity of crack pixels in the crack areas. | Encoder–decoder (FC-DenseNet)
[108] | Proposing a CrackGAN framework based on the application of GAN architecture where the proposed approach is capable of working with partially annotated ground truths. | Encoder–decoder (U-Net as generator)
[117] | Proposing a crack segmentation approach based on conditional Wasserstein GAN combined with connectivity map to refine the results. | Encoder–decoder (FC-DenseNet as generator)
[116] | 1. Application of skip connections in the deep architecture for feature fusion. 2. Application of a depth-first search algorithm for post-processing and improving the accuracy. | Encoder–decoder (FC-DenseNet)
[110] | Application of SegNet architecture to perform the crack segmentation where the “Adadelta” optimizer and cross-entropy loss function are used. | Encoder–decoder (SegNet)
[95] | 1. Proposing an improved version of CrackNet called CrackNet-V resulting in higher performance in terms of accuracy and speed. 2. Proposing a new activation function to improve the accuracy of crack segmentation for shallow cracks. | Consecutive conv layers with an invariant spatial size
[107] | Proposing U-hierarchical dilated network, a modified encoder–decoder architecture based on U-Net, where two modules of multi-dilation and hierarchical feature learning are added to improve the performance of crack segmentation. | Encoder–decoder (U-Net)