Deep Learning Methods for Smoke Recognition: Comparison
Please note this is a comparison between Version 2 by Rafik Ghali and Version 1 by Rafik Ghali.

Fire accidents cause alarming damage. They result in the loss of human lives, damage to property, and significant financial losses. Early fire ignition detection systems, particularly smoke detection systems, play a crucial role in enabling effective firefighting efforts. In this paper, a novel DL (Deep Learning) method, namely BoucaNet, is introduced for recognizing smoke on satellite images while addressing the associated challenging limitations. BoucaNet combines the strengths of the deep CNN EfficientNet v2 and the vision transformer EfficientFormer v2 for identifying smoke, cloud, haze, dust, land, and seaside classes. Extensive results demonstrate that BoucaNet achieved high performance, with an accuracy of 93.67%, an F1-score of 93.64%, and an inference time of 0.16 seconds compared with baseline methods. BoucaNet also showed a robust ability to overcome challenges, including complex backgrounds; detecting small smoke zones; handling varying smoke features such as size, shape, and color; and handling visual similarities between smoke, clouds, dust, and haze.

  • smoke recognition
  • satellite images
  • deep learning
  • BoucaNet

1. Introduction

Fires cause severe damage to economies, properties, ecosystems, and human lives. They destroy properties, homes, and resources, leading to considerable financial losses, and contribute to ecological imbalances. For example, since 1990, wildfires have destroyed an average of 2.5 million hectares per year in Canada [1]. In addition, over the past decade, the cost of firefighting in Canada ranged between $800 million and $1.5 billion a year [1]. Since January 2023, 260,000 hectares have already burned in the European Union [2]. Researchers have focused on developing fire ignition and early detection systems to reduce these alarming statistics and improve firefighting capabilities [3,4]. Both smoke and fire detection systems are used to provide comprehensive early warning and fire protection. Fire detection systems are used to detect the presence of flames, while smoke detection systems are adopted to identify the first signs of smoke, even before flames are visible.
Recently, smoke recognition methods made significant progress by exploiting visible features captured by vision sensors [5]. Additionally, classical machine learning methods, such as dynamic texture and optical flow, were employed to manually extract smoke features from images or videos. These extracted features were then used to identify the presence of smoke using various classifiers, such as SVM (Support Vector Machine), Random Forest, and AdaBoost. These approaches showed interesting efficiency, but suffered from false alarms and from the difficulty of identifying relevant features that accurately represent the smoke recognition problem [5].
Deep learning models were successfully employed in many fields and industries [6,7]. More specifically, they were used for fire ignition detection due to their ability to learn to automatically extract smoke features from large amounts of data. They provide diverse and informative feature maps, which are often better than manually generated features in terms of performance and robustness [8,9]. More recently, satellite remote sensing images were adopted for this task, representing a great opportunity thanks to the advantages of satellite remote sensing, including timeliness and large coverage areas [10,11].
High false-alarm rates are still present due to background complexity; the variability of smoke regarding its size, intensity, and shape; and the presence of smoke-like objects, such as haze, dust, and clouds. These objects often have very similar textures, colors, shapes, and spectral features to smoke, leading to false results in detecting smoke. Therefore, this paper presents a novel ensemble learning method, namely BoucaNet, for recognizing smoke on remote sensing satellite images, addressing these challenging limitations. BoucaNet employed a vision transformer, EfficientFormer v2 [12], and a deep CNN (Convolutional Neural Network), EfficientNet v2 [13], to extract smoke features from satellite images. It was trained and evaluated using a satellite dataset, USTC_SmokeRS [14], which comprises six classes (smoke, cloud, haze, dust, seaside, and land). This paper presents three main contributions:

1. A novel DL method, BoucaNet, is introduced to detect the presence of smoke in satellite images, thereby improving the performance of DL-based smoke classification methods.

2. BoucaNet demonstrated a robust ability to handle challenging situations such as background complexity and dynamism; detecting small smoke areas; varying characteristics of smoke regarding its air concentration, flow pattern, intensity, shape, and color; and handling its visual similarity to haze, dust, and clouds. This ability reduces false alarms, making BoucaNet a reliable solution for smoke remote sensing applications with high accuracy.

3. An optimized architecture is proposed in this study, achieving fast inference time, which is an important aspect in developing an early smoke-detection system.

2. Deep Learning Methods for Smoke Recognition

Over the years, numerous DL methods were developed to improve the performance of smoke classification in different fields of application. Among them, Tao et al. [15] suggested a simple CNN to recognize smoke in ground images, addressing challenging limitations such as varying smoke colors, shapes, and textures. The proposed CNN is a modified AlexNet [16] in which the order of the max pooling and normalization layers that follow the first and second convolutional layers is changed. The modified AlexNet was trained and evaluated using the Yuan dataset (5695 smoke images and 18,522 non-smoke images) [17], resulting in an accuracy of 96.88%. Yin et al. [18] proposed a new deep normalization CNN, namely DNCNN, to improve smoke detection performance. DNCNN incorporates batch normalization into convolutional layers to deal with overfitting and gradient dispersion. Data augmentation techniques (vertical flipping, rotation, and horizontal flipping) were also used to address the imbalance between smoke and non-smoke images (5695 smoke images and 18,522 non-smoke images [17]). Test results showed that DNCNN achieved an impressive performance with an accuracy of 98.08%, surpassing popular CNNs such as AlexNet, ZF-Net [19], and VGG-16 [20]. Khan et al. [21] studied three CNN models (AlexNet, VGG-16, and GoogleNet [22]) to identify smoke in a normal and foggy IoT environment. Experimental tests were performed using a very large dataset, comprising 18,532 smoke images, 17,474 non-smoke images, 17,474 non-smoke images with fog, and 18,532 smoke images with fog. VGG-16 obtained the highest performance with an accuracy of 97.72% compared with AlexNet, GoogleNet, and published fire models, demonstrating its ability to detect smoke in a foggy environment.
Peng and Wand [23] proposed a video smoke detection method to recognize smoke in complex environments. First, a GMM (Gaussian Mixture Model) [24] was employed as an image processing method to extract the suspected smoke areas from images collected from surveillance cameras. Then, the SqueezeNet model [25] was adopted to detect the presence of smoke. Using a large dataset (25,000 smoke images and 25,000 non-smoke images), this proposed method showed a high performance with an accuracy of 97.12% and a high prediction time compared with existing wildfire models such as AlexNet, ShuffleNet [26], Xception [27], and MobileNet [28]. Gu et al. [29] developed a DCNN (Deep Dual-Channel Neural Network) as a smoke recognition method. The DCNN is composed of two deep subnet channels, SBNN (Selective-based Batch Normalization Network) and SCNN (Skip Connection-based Neural Network). SBNN comprises six convolutional layers, four normalization layers, three max pooling layers, and three fully connected layers. SCNN includes eleven convolutional layers, seven normalization layers, three max pooling layers, and one global average pooling layer. DCNN was trained on large public learning data [17], comprising 5695 smoke images and 18,522 non-smoke images, with data augmentation techniques (rotation of 90, 180, and 270 degrees). It achieved an accuracy of 99.5%, higher than hand-crafted methods and state-of-the-art DL methods such as DNCNN [18], AlexNet, VGG, GoogLeNet, Xception, ResNet, etc. Zhang et al. [30] presented a DL method, called DC-CNN (Dual-Channel Convolutional Neural Network), for detecting smoke. DC-CNN is composed of two channels. The first channel employs a pretrained AlexNet in extracting smoke features. The second channel is a simple CNN architecture, consisting of four convolutional layers, a pooling layer, and two fully connected layers for generating more advanced characteristics.
Extensive studies were conducted using learning data, including 9794 smoke and 9794 non-smoke images, to handle the challenges related to smoke features, such as transparency properties, homogeneity, and visual similarity to clouds, steam, haze, and fog. DC-CNN obtained the highest accuracy of 99.33% compared with baseline DL models such as LeNet, AlexNet, VGG-16, and DNCNN [18]. Jia et al. [31] designed a new method for detecting smoke in videos. Firstly, GMM-based domain knowledge of smoke was adopted to segment the suspected areas of smoke. Then, three pretrained deep learning models (AlexNet, Inception v3, and ResNet50 [32]) were used to recognize smoke. ResNet50 with GMM performed best, with an F1-score of 99.32% compared with the other models using 138 smoke videos as testing data. He et al. [33] proposed a DL method for smoke detection in a foggy environment. This method combines the VGG-16 method as a backbone to extract smoke features and an attention method, which consists of channel attention and spatial attention to improve the detection of small smoke areas. It was also trained and evaluated using 33,666 images (8342 smoke images, 8522 smoke with fog images, 8401 non-smoke images, and 8401 non-smoke with fog images). It achieved an F1-score of 99.97%, outperforming the AlexNet, VGG-16, and SqueezeNet methods. Zhang et al. [34] developed an end-to-end CNN method to identify smoke. Two CNNs (a spatial stream and a temporal stream) were adopted to extract the spatial and temporal features of smoke. Each comprises five convolutional layers, three max pooling layers, and an attention module that suppresses noise, extracts salient features from the temporal and spatial feature maps, and improves detection performance. This method achieved an accuracy of 96.8%, better than state-of-the-art methods using 116 fire videos and 89 non-fire videos.
Cheng et al. [35] presented a deep convolutional network, namely PACNN, to improve the robustness of smoke recognition tasks. PACNN is a deep CNN with a PAAModule (Pixel Aware Attention Module), which integrates into the residual structure via element-wise addition and skip connection on two feature maps. Testing results showed that PACNN reached a high accuracy of 98.91% compared with popular CNNs (AlexNet, Inception v4, ResNet34, SEResNet34, DenseNet-121, and DNCNN) and vision transformers (ViT, Swin-T, and DeiT-Ti) using the Yuan dataset. Tao and Duan [36] introduced a video smoke recognition method, AFSNet, to address slow-moving smoke challenges. AFSNet is composed of three main modules: AFSM (Adaptive Frame Selection Module) for extracting multi-scale spatial and spatiotemporal features; FEM (Feature Extraction Module) for incorporating a context attention module, an enhanced dilated convolution module, and a spatiotemporal feature attention module to minimize the loss of detailed information; and RM (Recognition Module) for detecting smoke presence. AFSNet was trained on two large datasets, SRSet (14,100 smoke images and 15,380 non-smoke images) and RISE (12,567 videos). It achieved impressive F1-scores of 96.57% and 91.00% using the SRSet and RISE datasets, respectively, surpassing classical machine learning methods and existing deep learning models. Cheng et al. [37] proposed a novel vision transformer, called CViTNet (Convolution-enhanced Vision Transformer Network), for identifying smoke. CViTNet consists of three stages (s1, s2, and s3). The first stage, s1, comprises a convolutional stem and a ViT transformer encoder. Each of the s1 and s2 stages includes a ViT transformer encoder [38] and a convolutional token embedding, which was proposed to improve the multiscale feature representation of tokenization.
Using the Yuan dataset, CViTNet achieved a high accuracy of 99.20% compared with existing CNNs (AlexNet, ResNet, SEResNet, DenseNet, DNCNN, etc.) and vision transformer methods (ViT-B, DeiT-S, conViT-Ti, Swin-T, etc.) [37]. In the study conducted by Mohammed [39], a pretrained InceptionResNet v2 model [40] was employed for the detection of forest smoke and fires. Mohammed utilized a dataset comprising aerial and ground images (1102 fire images and 1102 smoke images). Data augmentation methods, including scaling and horizontal/vertical flipping, were applied during the training phase. Testing results showed that InceptionResNet v2 achieved an impressive accuracy of 99.09%. Chen et al. [41] studied the effectiveness of five DL methods (LeNet5, VGG-16, ResNet18, MobileNet v2 [42], and Xception) for wildland smoke/fire recognition on aerial images. These models were trained using a large dataset comprising a total of 53,451 images, which were divided into three categories: 25,434 fire/smoke images, 14,317 fire/no-smoke images, and 13,700 no-fire/no-smoke images. VGG-16 obtained an accuracy of 99.91%, surpassing MobileNet v2, ResNet18, LeNet5, Xception, and a traditional machine learning method (Logistic Regression) by 0.56%, 1.52%, 4.58%, 5.35%, and 9.54%, respectively. Dilshad et al. [43] proposed a fire detection model, E-FireNet, to recognize fires in a surveillance environment. E-FireNet is a modified VGG-16 obtained by deleting block 5 and adjusting the convolutional layers of block 4. The experimental setup was performed using data augmentation techniques (horizontal flipping, rotation, and scaling). E-FireNet achieved an accuracy of 98%, better than that of the pretrained MobileNet v1, VGG-19, EfficientNet-B0, VGG-16, and NASNetMobile v1 models, using the SV-Fire dataset (1500 images) [43].
Yar et al. [44] developed a modified YOLO v5 method for detecting and locating fires in smart cities. A total of 1957 images, comprising indoor fires (118 images), building fires (723 images), and vehicle fires (1116 images), were used to train and evaluate the proposed model, achieving an F1-score of 84%. Priya and Vani [45] introduced a CNN based on Inception v3 architecture [46] for the recognition of forest smoke/fires using satellite images. Their study utilized a dataset consisting of 534 satellite images, with 239 fire images and 295 no-fire images, for both training and testing purposes. Their proposed method achieved an accuracy of 98%. Ba et al. [14] also proposed a DL method, namely SmokeNet, to address the challenge of recognizing smoke on satellite data, including varying smoke features such as colors, shapes, and spectral overlaps. SmokeNet is a CNN model with channel-wise and spatial attention. A novel satellite dataset, namely USTC_SmokeRS, comprising 6225 satellite images divided into six classes (smoke, cloud, haze, dust, seaside, and land), was used in the training and testing phases. SmokeNet showed high performance with an accuracy of 92.75%.
Deep learning methods performed better in recognizing smoke than classical machine learning methods. However, several challenging limitations persist, including the complexity and dynamics of the background; the visual similarity between smoke, clouds, dust, and haze; the varying characteristics of smoke regarding its air concentration, flow pattern, and color; and detecting small smoke zones.
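Several of the studies above rely on the Yuan dataset [17], whose two classes are far from balanced, which is why some authors applied data augmentation. The short sketch below only uses the image counts quoted in the text; the "always non-smoke" classifier is a hypothetical baseline introduced here for illustration, not a method from any of the cited works. It suggests why accuracy alone should be read with care on such data.

```python
# Class counts reported for the Yuan dataset [17] in the text above.
SMOKE_IMAGES = 5695
NON_SMOKE_IMAGES = 18522

total_images = SMOKE_IMAGES + NON_SMOKE_IMAGES
imbalance_ratio = NON_SMOKE_IMAGES / SMOKE_IMAGES

# Hypothetical baseline: a classifier that always answers "non-smoke".
# Its accuracy equals the share of the majority class.
majority_accuracy = NON_SMOKE_IMAGES / total_images

# The same baseline never finds any smoke image, so its smoke recall is zero.
majority_smoke_recall = 0 / SMOKE_IMAGES

print(f"total images:            {total_images}")
print(f"non-smoke : smoke ratio: {imbalance_ratio:.2f}")
print(f"majority-class accuracy: {majority_accuracy:.2%}")
print(f"majority-class recall:   {majority_smoke_recall:.2%}")
```

Under these counts, a trivial baseline already exceeds 76% accuracy while detecting no smoke at all, so metrics such as the F1-score are a useful complement when classes are imbalanced.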

3. Materials and Methods

3.1. Proposed Method for Smoke Classification

In this paper, a new ensemble learning approach, namely BoucaNet, is introduced for recognizing smoke in satellite images and for addressing challenging limitations, including background complexity and dynamics due to the presence of dynamically changing backgrounds in input satellite images; visual similarities of smoke with clouds, dust, and haze; and varying features of smoke regarding its shape, form, color, flow pattern, and texture. BoucaNet combines the deep CNN EfficientNet v2 (EfficientNetV2M) [13] and the vision transformer EfficientFormer v2 (EfficientFormerV2L) [12]. To employ EfficientNet v2 and EfficientFormer v2 models in the specific task of smoke recognition, their classification layers (last layers), originally developed for different classification tasks, are removed. As depicted in Figure 1, the preprocessing steps start with resizing the input satellite images to 224 × 224 pixels. Next, four data augmentation techniques, including rotation, shearing, shifting, and zooming, are utilized to diversify learning data, improve the potential of BoucaNet to generalize different real-world scenarios, and avoid overfitting. Then, the input satellite images and the generated images are simultaneously fed into the EfficientNet v2 and EfficientFormer v2 models to extract complex contextual features, comprising both smoke plume patterns and background contextual information, and provide a comprehensive representation of various smoke scenarios. After concatenating the two feature maps generated by the EfficientNet v2 and EfficientFormer v2 models, the Gaussian dropout regularization technique with a rate of 0.3 is employed. This method injects random noise drawn from a Gaussian distribution into the concatenated features, improving BoucaNet’s generalization ability and avoiding overfitting.
Finally, a Softmax function generates a probability score ranging from 0 to 1, determining the appropriate class, such as smoke, cloud, haze, dust, seaside, or land, for the input satellite images.
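The fusion step described above (concatenation, Gaussian dropout with a rate of 0.3, then Softmax over six classes) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the feature vectors are random stand-ins for the backbone outputs, the vector sizes and the dense layer weights are arbitrary, and the helper names are hypothetical. The Gaussian dropout is assumed to follow the common multiplicative formulation (noise centered on 1 with standard deviation sqrt(rate / (1 - rate))), which the source does not specify.

```python
import math
import random

CLASSES = ["smoke", "cloud", "haze", "dust", "seaside", "land"]

def gaussian_dropout(features, rate, rng, training=True):
    """Multiply each feature by noise drawn from N(1, rate / (1 - rate)).

    Assumption: multiplicative formulation; identity when not training.
    """
    if not training:
        return list(features)
    std = math.sqrt(rate / (1.0 - rate))
    return [x * rng.gauss(1.0, std) for x in features]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_head(cnn_features, vit_features, weights, biases, rng, training=False):
    """Concatenate both feature vectors, regularize, and classify."""
    fused = list(cnn_features) + list(vit_features)
    fused = gaussian_dropout(fused, rate=0.3, rng=rng, training=training)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

rng = random.Random(0)
# Random stand-ins for the two backbone outputs (sizes are arbitrary).
cnn_features = [rng.uniform(-1, 1) for _ in range(8)]  # EfficientNet v2 branch
vit_features = [rng.uniform(-1, 1) for _ in range(4)]  # EfficientFormer v2 branch
# Untrained, randomly initialized classification layer (illustration only).
weights = [[rng.uniform(-0.5, 0.5) for _ in range(12)] for _ in CLASSES]
biases = [0.0] * len(CLASSES)

probs = fusion_head(cnn_features, vit_features, weights, biases, rng)
predicted = CLASSES[probs.index(max(probs))]
print(dict(zip(CLASSES, (round(p, 3) for p in probs))), "->", predicted)
```

Because the weights are random, the predicted class here is meaningless; the sketch only shows how the two branches feed a single probability vector.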

3.2. Datasets

Many large fire datasets are made available to help researchers in benchmarking and comparing DL techniques dealing with the same problem. However, this is not the case for smoke recognition problems, especially when using satellite data, thus making the evaluation of these DL methods a little challenging. To train and test the proposed smoke recognition method, BoucaNet, the available satellite data, USTC_SmokeRS [14], is utilized. This dataset is collected using MODIS (Moderate Resolution Imaging Spectroradiometer) and represents numerous smoke scenes through satellite remote sensing. It is selected from a remote sensing platform in Hefei, China, and the Level-1 and Atmosphere Archive & Distribution System (LAADS) Distributed Active Archive Center (DAAC) situated at the Goddard Space Flight Center in Greenbelt, Maryland, USA. The USTC_SmokeRS dataset comprises a total of 6225 satellite images with dimensions of 256 × 256 pixels and a spatial resolution of 1 km. It comprises six classes:

• Smoke (1016 satellite images) as the target class for wildfire detection.

• Dust (1009 satellite images) and haze (1002 satellite images) as negative classes to smoke, which share similar features (texture and spectral) with smoke.

• Cloud (1164 satellite images) as the most common class in satellite images, with similar color, shape, and spectral characteristics to smoke.

• Land (1027 satellite images) and seaside (1007 satellite images) as background classes for fire smoke scenes.
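As a quick consistency check on the class counts listed above, the following sketch sums them and reports each class's share of the dataset. Only the counts come from the text; the variable names are illustrative.

```python
# Per-class image counts of USTC_SmokeRS as listed above.
CLASS_COUNTS = {
    "smoke": 1016,
    "dust": 1009,
    "haze": 1002,
    "cloud": 1164,
    "land": 1027,
    "seaside": 1007,
}

total = sum(CLASS_COUNTS.values())
shares = {name: count / total for name, count in CLASS_COUNTS.items()}
largest = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
smallest = min(CLASS_COUNTS, key=CLASS_COUNTS.get)

# Print the classes from most to least frequent.
for name, count in sorted(CLASS_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {count:5d} {shares[name]:7.2%}")
print(f"total    {total:5d}")
print(f"largest/smallest ratio: {CLASS_COUNTS[largest] / CLASS_COUNTS[smallest]:.2f}")
```

The counts add up to the 6225 images stated in the text, and the classes are close to balanced, with cloud slightly over-represented.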

4. Results and Discussion

BoucaNet was trained using the USTC_SmokeRS satellite dataset. This dataset allowed BoucaNet to train on various classes and scenarios, thereby enabling it to recognize various aspects of smoke in satellite images. It comprises a total of 6225 satellite images, divided into six distinct classes. The evaluation of BoucaNet includes several key aspects. Firstly, its performance was compared in terms of F1-score, accuracy, and inference time with CT-Fire, which combines the EfficientFormer v2 [12] and RegNetY [51] models as the backbone; RegNetY-16GF [51]; the vision transformer EfficientFormer v2 [12]; and SmokeNet [14], the state-of-the-art smoke detection method. Next, the obtained F1-scores of these models for each class, namely smoke, cloud, dust, haze, land, and seaside, were presented. Then, the resulting confusion matrix generated by BoucaNet was illustrated and discussed. Finally, visual results of the input images predicted by these models were presented. RegNetY-16GF and EfficientFormer v2 were selected due to their excellent performance in classifying objects. CT-Fire is an ensemble learning method, which combines EfficientFormer v2 and RegNetY-16GF to extract features. Then, the Gaussian dropout regularization method and the softmax function were used to recognize the presence of smoke. BoucaNet showed a high performance during testing, achieving a loss of 0.2184, an accuracy of 93.67%, and an F1-score of 93.64%. This performance was obtained thanks to the diversity of feature maps extracted by EfficientNet v2 and EfficientFormer v2 models, including details, complexity, and local and global features (colors, shapes, textures, etc.) for the smoke, cloud, haze, seaside, land, and dust classes, thus enabling BoucaNet to distinguish between smoke and complex backgrounds and identify small areas of smoke.
In terms of F1-score, BoucaNet outperformed CT-Fire, RegNetY-16GF, and EfficientFormer v2 by 2.75%, 1.38%, and 1.50%, respectively. This proposed model also performed better than the state-of-the-art method SmokeNet, which achieved an accuracy of 92.75% using the USTC_SmokeRS dataset [14].

BoucaNet demonstrated its potential to address and overcome challenging limitations related to recognizing smoke in satellite images. These challenges include complex backgrounds, comprising various land covers and geographical features, which can make it difficult to accurately identify smoke in input satellite images. Additionally, BoucaNet handled the varying and dynamic nature of smoke in terms of its shape, color, intensity, and flow pattern features, as well as the visual similarities of smoke, including color, shape, and spectral characteristics, which are often shared with clouds, dust, and haze. On the other hand, BoucaNet achieved an efficient processing speed with an inference time of 0.16 seconds, slightly surpassing the inference times of EfficientFormer v2, CT-Fire, and RegNetY-16GF. This inference time showed BoucaNet’s suitability for real-time processing of satellite images for smoke recognition while maintaining high performance. In addition, BoucaNet achieved superior results with an F1-score of 95.58%, 91.00%, 90.82%, 95.01%, 98.76%, and 90.36% for recognizing cloud, dust, haze, land, seaside, and smoke classes, respectively, compared with CT-Fire, RegNetY-16GF, and EfficientFormer v2. It demonstrated its ability to accurately differentiate between cloud, smoke, haze, dust, land, and seaside features, thereby proving its capability to overcome challenges related to background complexity and visual similarities, including color, shape, and spectral characteristics, between smoke and other classes (cloud, dust, and haze). In conclusion, BoucaNet performed well in recognizing smoke in satellite images compared with baseline models (EfficientFormer v2, RegNetY-16GF, CT-Fire, and SmokeNet).
Notably, it demonstrated its potential to address challenging limitations, including complex backgrounds; the dynamic nature of smoke in terms of its shape, intensity, and color; detecting small areas of smoke; and distinguishing visual similarities in terms of color, shape, and spectral characteristics between smoke and other elements, including clouds, dust, and haze. Additionally, BoucaNet achieved an interesting inference time.
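To relate the per-class F1-scores quoted above to the overall F1-score of 93.64%, the sketch below computes their unweighted (macro) average and shows how a per-class F1-score follows from true positives, false positives, and false negatives. The per-class values come from the text; the averaging scheme behind the reported overall score is not stated in this section, so the macro average is an assumption, and the counts passed to the helper are made-up illustration values.

```python
# Per-class F1-scores (in %) reported for BoucaNet in the text above.
PER_CLASS_F1 = {
    "cloud": 95.58,
    "dust": 91.00,
    "haze": 90.82,
    "land": 95.01,
    "seaside": 98.76,
    "smoke": 90.36,
}
REPORTED_F1 = 93.64  # overall F1-score (in %) reported in the text

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Assumption: unweighted (macro) average over the six classes.
macro_f1 = sum(PER_CLASS_F1.values()) / len(PER_CLASS_F1)
hardest = min(PER_CLASS_F1, key=PER_CLASS_F1.get)

print(f"macro-average F1: {macro_f1:.2f}%")
print(f"gap to reported:  {abs(REPORTED_F1 - macro_f1):.2f} points")
print(f"lowest F1 class:  {hardest}")
# Made-up counts, only to show the formula in use.
print(f"example F1:       {f1_from_counts(tp=90, fp=10, fn=10):.2f}")
```

The macro average lands within about 0.05 points of the reported value; a small gap of this kind would be expected if the reported score weights classes by their number of test images, although the source does not confirm this. The smoke class has the lowest per-class score, consistent with its visual similarity to haze and dust.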

5. Conclusions

In this paper, a novel ensemble learning method, namely BoucaNet, was presented for recognizing smoke in satellite images while addressing the associated challenges. BoucaNet combines the strengths of EfficientNet v2 and EfficientFormer v2 to extract rich and diverse feature maps for smoke, cloud, haze, dust, land, and seaside classes. It demonstrated a high performance, with an accuracy of 93.67% and an F1-score of 93.64%, using the USTC_SmokeRS dataset, which consists of 6225 satellite images. Furthermore, BoucaNet outperformed existing deep learning models for object classification, specifically EfficientFormer v2 and RegNetY-16GF, as well as state-of-the-art methods, including SmokeNet. It also showed a fast processing speed, with an inference time of 0.16 s. Additionally, BoucaNet demonstrated its potential as a robust solution to the challenges of recognizing smoke in satellite images, including complex backgrounds; the dynamic nature of smoke, which can present variations in shape, intensity, and color; detecting small areas of smoke; and visual similarities between smoke and other elements, such as clouds, dust, and haze.
 