2. Deep Learning Methods for Smoke Recognition
Over the years, numerous DL methods have been developed to improve the performance of smoke classification in different fields of application. Among them, Tao et al. [15] suggested a simple CNN to recognize smoke in ground images, addressing challenging limitations such as varying smoke colors, shapes, and textures. The proposed CNN is a modified AlexNet [16] in which the order of the max pooling layers and normalization layers that follow the first and second convolutional layers is changed. The modified AlexNet was trained and evaluated using the Yuan dataset (5695 smoke images and 18,522 non-smoke images) [17], resulting in an accuracy of 96.88%. Yin et al. [18] proposed a new deep normalization CNN, namely DNCNN, to improve smoke detection performance. DNCNN incorporates batch normalization into convolutional layers to deal with overfitting and gradient dispersion. Data augmentation techniques (vertical flipping, rotation, and horizontal flipping) were also used to address the challenges of imbalanced data between smoke and non-smoke images (5695 smoke images and 18,522 non-smoke images [17]). Test results showed that DNCNN achieved an accuracy of 98.08%, surpassing popular CNNs such as AlexNet, ZF-Net [19], and VGG-16 [20]. Khan et al. [21] studied three CNN models (AlexNet, VGG-16, and GoogleNet [22]) to identify smoke in normal and foggy IoT environments. Experimental tests were performed using a very large dataset, comprising 18,532 smoke images, 17,474 non-smoke images, 17,474 non-smoke images with fog, and 18,532 smoke images with fog. VGG-16 obtained the highest performance, with an accuracy of 97.72%, compared with AlexNet, GoogleNet, and published fire models, demonstrating its ability to detect smoke in a foggy environment.
Peng and Wand [23] proposed a video smoke detection method to recognize smoke in complex environments. First, a GMM (Gaussian Mixture Model) [24] was employed as an image processing method to extract the suspected smoke areas from images collected from surveillance cameras. Then, the SqueezeNet model [25] was adopted to detect the presence of smoke. Using a large dataset (25,000 smoke images and 25,000 non-smoke images), this proposed method showed a high performance with an accuracy of 97.12% and a high prediction time compared with existing wildfire models such as AlexNet, ShuffleNet [26], Xception [27], and MobileNet [28]. Gu et al. [29] developed a DCNN (Deep Dual-Channel Neural Network) as a smoke recognition method. The DCNN is composed of two deep subnet channels, SBNN (Selective-based Batch Normalization Network) and SCNN (Skip Connection-based Neural Network). SBNN comprises six convolutional layers, four normalization layers, three max pooling layers, and three fully connected layers. SCNN includes eleven convolutional layers, seven normalization layers, three max pooling layers, and one global average pooling layer. DCNN was trained on large public learning data [17], comprising 5695 smoke images and 18,522 non-smoke images, with data augmentation techniques (rotations of 90, 180, and 270 degrees). It achieved an accuracy of 99.5%, higher than hand-crafted methods and state-of-the-art DL methods such as DNCNN [18], AlexNet, VGG, GoogLeNet, Xception, and ResNet.
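The rotation-based augmentation described above (90, 180, and 270 degrees) can be illustrated with a minimal NumPy sketch. This is not the code of any cited work; the function name `rotate_augment` and the toy array are hypothetical and serve only to show the operation.

```python
import numpy as np

def rotate_augment(image):
    """Return copies of `image` rotated by 90, 180, and 270 degrees.

    `image` is an H x W x C array; the rotation acts on the two spatial axes.
    """
    # np.rot90 rotates counterclockwise by k * 90 degrees in the given plane.
    return [np.rot90(image, k=k, axes=(0, 1)) for k in (1, 2, 3)]

# Toy 2 x 3 RGB array standing in for a smoke image (hypothetical data).
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
rotated = rotate_augment(image)
print([r.shape for r in rotated])
```

Each original image then contributes three extra training samples, which is one simple way to enlarge a minority class without collecting new data.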
Zhang et al. [30] presented a DL method, called DC-CNN (Dual-Channel Convolutional Neural Network), for detecting smoke. DC-CNN is composed of two channels. The first channel employs a pretrained AlexNet to extract smoke features. The second channel is a simple CNN architecture, consisting of four convolutional layers, a pooling layer, and two fully connected layers for generating more advanced characteristics. Extensive studies were conducted using learning data, including 9794 smoke and 9794 non-smoke images, to handle the challenges related to smoke features, such as transparency properties, homogeneity, and visual similarity to clouds, steam, haze, and fog. DC-CNN obtained the highest accuracy of 99.33% compared with baseline DL models such as LeNet, AlexNet, VGG-16, and DNCNN [18]. Jia et al. [31] designed a new method for detecting smoke in videos. Firstly, GMM-based domain knowledge of smoke was adopted to segment the suspected areas of smoke. Then, three pretrained deep learning models (AlexNet, Inception v3, and ResNet50 [32]) were used to recognize smoke. ResNet50 with GMM performed best, with an F1-score of 99.32% compared with the other models using 138 smoke videos as testing data. He et al. [33] proposed a DL method for smoke detection in a foggy environment. This method combines the VGG-16 method as a backbone to extract smoke features and an attention method, which consists of channel attention and spatial attention, to improve the detection of small smoke areas. It was trained and evaluated using 33,666 images (8342 smoke images, 8522 smoke with fog images, 8401 non-smoke images, and 8401 non-smoke with fog images). It achieved an F1-score of 99.97%, outperforming the AlexNet, VGG-16, and SqueezeNet methods.
Zhang et al. [34] developed an end-to-end CNN method to identify smoke. Two CNNs (spatial stream and temporal stream), each comprising five convolutional layers, three max pooling layers, and an attention module that suppresses noise and extracts salient features from temporal and spatial feature maps to improve detection performance, were adopted to extract the spatial and temporal features of smoke. This method achieved an accuracy of 96.8%, better than state-of-the-art methods, using 116 fire videos and 89 non-fire videos. Cheng et al. [35] presented a deep convolutional network, namely PACNN, to improve the robustness of smoke recognition tasks. PACNN is a deep CNN with a PAAModule (Pixel Aware Attention Module), which is integrated into the residual structure via element-wise addition and a skip connection on two feature maps. Testing results showed that PACNN reached a high accuracy of 98.91% compared with popular CNNs (AlexNet, Inception v4, ResNet34, SEResNet34, DenseNet-121, and DNCNN) and vision transformers (ViT, Swin-T, and DeiT-Ti) using the Yuan dataset.
Tao and Duan [36] introduced a video smoke recognition method, AFSNet, to address slow-moving smoke challenges. AFSNet is composed of three main modules: AFSM (Adaptative Frame Selection Module) for extracting multi-scale spatial and spatiotemporal features; FEM (Feature Extraction Module) for incorporating a context attention module, an enhanced dilated convolution module, and a spatiotemporal feature attention module to minimize the loss of detailed information; and RM (Recognition Module) for detecting smoke presence. AFSNet was trained on two large datasets, SRSet (14,100 smoke images and 15,380 non-smoke images) and RISE (12,567 videos). It achieved F1-scores of 96.57% and 91.00% using the SRSet and RISE datasets, respectively, surpassing classical machine learning methods and existing deep learning models. Cheng et al. [37] proposed a novel vision transformer, called CViTNet (Convolution-enhanced Vision Transformer Network), for identifying smoke. CViTNet consists of three stages (s1, s2, and s3). The first stage, s1, comprises a convolutional stem and a ViT transformer encoder. Each of the s1 and s2 stages includes a ViT transformer encoder [38] and a convolutional token embedding, which was proposed to improve the multiscale feature representation of tokenization. Using the Yuan dataset, CViTNet achieved a high accuracy of 99.20% compared with existing CNNs (AlexNet, ResNet, SEResNet, DenseNet, DNCNN, etc.) and vision transformer methods (ViT-B, DeiT-S, conViT-Ti, Swin-T, etc.) [37].
In the study conducted by Mohammed [39], a pretrained InceptionResNet v2 model [40] was employed for the detection of forest smoke and fires. Mohammed utilized a dataset comprising aerial and ground images (1102 fire images and 1102 smoke images). Data augmentation methods, including scaling and horizontal/vertical flipping, were applied during the training phase. Testing results showed that InceptionResNet v2 achieved an accuracy of 99.09%. Chen et al. [41] studied the effectiveness of five DL methods (LeNet5, VGG-16, ResNet18, MobileNet v2 [42], and Xception) for wildland smoke/fire recognition on aerial images. These models were trained using a large dataset comprising a total of 53,451 images, which were divided into three categories: 25,434 fire/smoke images, 14,317 fire/no-smoke images, and 13,700 no-fire/no-smoke images. VGG-16 obtained an accuracy of 99.91%, surpassing MobileNet v2, ResNet18, LeNet5, Xception, and a traditional machine learning method (Logistic Regression) by 0.56%, 1.52%, 4.58%, 5.35%, and 9.54%, respectively.
Dilshad et al. [43] proposed a fire detection model, E-FireNet, to recognize fires in a surveillance environment. E-FireNet is a VGG-16 modified by deleting block 5 and adjusting the convolutional layers of block 4. The experimental setup was performed using data augmentation techniques (horizontal flipping, rotation, and scaling). E-FireNet achieved an accuracy of 98%, better than that of the pretrained MobileNet v1, VGG-19, EfficientNet-B0, VGG-16, and NASNetMobile v1 models, using the SV-Fire dataset (1500 images) [43]. Yar et al. [44] developed a modified YOLO v5 method for detecting and locating fires in smart cities. A total of 1957 images, comprising indoor fires (118 images), building fires (723 images), and vehicle fires (1116 images), were used to train and evaluate the proposed model, achieving an F1-score of 84%.
Priya and Vani [45] introduced a CNN based on Inception v3 architecture [46] for the recognition of forest smoke/fires using satellite images. Their study utilized a dataset consisting of 534 satellite images, with 239 fire images and 295 no-fire images, for both training and testing purposes. Their proposed method achieved an accuracy of 98%. Ba et al. [14] also proposed a DL method, namely SmokeNet, to address the challenge of recognizing smoke on satellite data, including varying smoke features such as colors, shapes, and spectral overlaps. SmokeNet is a CNN model with channel-wise and spatial attention. A novel satellite dataset, namely USTC_SmokeRS, comprising 6225 satellite images divided into six classes (smoke, cloud, haze, dust, seaside, and land), was used in the training and testing phases. SmokeNet showed high performance with an accuracy of 92.75%.
As the studies above show, deep learning methods perform well in recognizing smoke. However, several challenging limitations persist, including the complexity and dynamics of the background; the visual similarity between smoke, clouds, dust, and haze; the varying characteristics of smoke regarding its air concentration, flow pattern, and color; and the detection of small smoke zones.
3. Materials and Methods
3.1. Proposed Method for Smoke Classification
In this paper, a new ensemble learning approach, namely BoucaNet, is introduced for recognizing smoke in satellite images and for addressing challenging limitations, including background complexity and dynamics due to the presence of dynamically changing backgrounds in input satellite images; visual similarities of smoke with clouds, dust, and haze; and varying features of smoke regarding its shape, form, color, flow pattern, and texture. BoucaNet combines the deep CNN EfficientNet v2 (EfficientNetV2M) [13] and the vision transformer EfficientFormer v2 (EfficientFormerV2L) [12].
To employ EfficientNet v2 and EfficientFormer v2 models in the specific task of smoke recognition, their classification layers (last layers), originally developed for different classification tasks, are removed. As depicted in Figure 1, the preprocessing steps start with resizing the input satellite images to 224 × 224 pixels. Next, four data augmentation techniques, including rotation, shearing, shifting, and zooming, are utilized to diversify learning data, improve the potential of BoucaNet to generalize to different real-world scenarios, and avoid overfitting. Then, the input satellite images and the generated images are simultaneously fed into the EfficientNet v2 and EfficientFormer v2 models to extract complex contextual features, comprising both smoke plume patterns and background contextual information, and provide a comprehensive representation of various smoke scenarios. After concatenating the two feature maps generated by the EfficientNet v2 and EfficientFormer v2 models, the Gaussian dropout regularization technique with a rate of 0.3 is employed.
This method adds random noise from a Gaussian distribution to the input satellite data, improving BoucaNet’s generalization ability and avoiding overfitting. Finally, a Softmax function generates a probability score ranging from 0 to 1, determining the appropriate class, such as smoke, cloud, haze, dust, seaside, or land, for the input satellite images.
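The fusion head described above (feature concatenation, Gaussian dropout with a rate of 0.3, and a Softmax over six classes) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimensions (1280 and 384), the class ordering, and the random weights are hypothetical, and the Gaussian dropout uses the common multiplicative formulation (noise with mean 1 and standard deviation sqrt(rate / (1 - rate)), as in Keras), which the text does not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(x, rate, rng, training=True):
    """Multiply x by Gaussian noise with mean 1 (assumed Keras-style convention)."""
    if not training:
        return x  # the layer is the identity at inference time
    std = np.sqrt(rate / (1.0 - rate))
    return x * rng.normal(loc=1.0, scale=std, size=x.shape)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical class order; the six names come from the USTC_SmokeRS dataset.
CLASSES = ["cloud", "dust", "haze", "land", "seaside", "smoke"]

# Stand-ins for the pooled backbone outputs of a batch of 4 images
# (hypothetical sizes; real features would come from the two backbones).
cnn_features = rng.normal(size=(4, 1280))
vit_features = rng.normal(size=(4, 384))

# 1) Concatenate the two feature vectors per image.
fused = np.concatenate([cnn_features, vit_features], axis=-1)
# 2) Regularize with Gaussian dropout (rate 0.3, training mode).
fused = gaussian_dropout(fused, rate=0.3, rng=rng)
# 3) Linear layer plus Softmax to obtain class probabilities.
weights = rng.normal(scale=0.01, size=(fused.shape[1], len(CLASSES)))
bias = np.zeros(len(CLASSES))
probs = softmax(fused @ weights + bias)
predicted = [CLASSES[i] for i in probs.argmax(axis=-1)]
print(probs.shape, predicted)
```

Because the noise has mean 1, the expected activation is unchanged, so no rescaling is needed when the layer is switched off at inference time.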
3.2. Datasets
Many large fire datasets are made available to help researchers in benchmarking and comparing DL techniques dealing with the same problem. However, this is not the case for smoke recognition problems, especially when using satellite data, thus making the evaluation of these DL methods challenging. To train and test the proposed smoke recognition method, BoucaNet, the available satellite data, USTC_SmokeRS [14], is utilized. This dataset is collected using MODIS (Moderate Resolution Imaging Spectroradiometer) and represents numerous smoke scenes through satellite remote sensing. It is selected from a remote sensing platform in Hefei, China, and the Level-1 and Atmosphere Archive & Distribution System (LAADS) Distributed Active Archive Center (DAAC) situated at the Goddard Space Flight Center in Greenbelt, Maryland, USA. The USTC_SmokeRS dataset comprises a total of 6225 satellite images with dimensions of 256 × 256 pixels and a spatial resolution of 1 km.
It comprises six classes:
• Smoke (1016 satellite images) as the target class for wildfire detection.
• Dust (1009 satellite images) and haze (1002 satellite images) as negative classes to smoke, which share similar features (texture and spectral) with smoke.
• Cloud (1164 satellite images) as the most common class in satellite images, with similar color, shape, and spectral characteristics to smoke.
• Land (1027 satellite images) and seaside (1007 satellite images) as background classes for fire smoke scenes.
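As a quick consistency check, the per-class counts listed above can be summed and converted to class shares. The short sketch below uses only the counts stated in the list; the variable names are illustrative.

```python
# Per-class image counts as listed for the USTC_SmokeRS dataset.
class_counts = {
    "smoke": 1016,
    "dust": 1009,
    "haze": 1002,
    "cloud": 1164,
    "land": 1027,
    "seaside": 1007,
}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# Print classes from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:8s} {class_counts[name]:5d} {share:6.1%}")
print("total:", total)
```

The counts add up to the stated total of 6225 images, and the classes are roughly balanced, with cloud being the largest class.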
4. Results and Discussion
BoucaNet was trained using the USTC_SmokeRS satellite dataset. This dataset exposed BoucaNet to various classes and scenarios, thereby enabling it to recognize various aspects of smoke in satellite images. It comprises a total of 6225 satellite images, divided into six distinct classes.
The evaluation of BoucaNet includes several key aspects. Firstly, its performance was compared, in terms of F1-score, accuracy, and inference time, with CT-Fire, which combines EfficientFormer v2 [12] and RegNetY [51] models as the backbone; RegNetY-16GF [51]; the vision transformer EfficientFormer v2 [12]; and SmokeNet [14], the state-of-the-art smoke detection method. Next, the obtained F1-scores of these models for each class, namely smoke, cloud, dust, haze, land, and seaside, were presented. Then, the resulting confusion matrix generated by BoucaNet was illustrated and discussed. Finally, visual results of the input images predicted by these models were presented.
RegNetY-16GF and EfficientFormer v2 were selected due to their excellent performance in classifying objects. CT-Fire is an ensemble learning method, which combines EfficientFormer v2 and RegNetY-16GF to extract features. Then, the Gaussian dropout regularization method and the softmax function were used to recognize the presence of smoke. BoucaNet showed a high performance during testing, achieving a loss of 0.2184, an accuracy of 93.67%, and an F1-score of 93.64%. This performance was obtained thanks to the diversity of feature maps extracted by EfficientNet v2 and EfficientFormer v2 models, including details, complexity, and local and global features (colors, shapes, textures, etc.) for the smoke, cloud, haze, seaside, land, and dust classes, thus enabling BoucaNet to distinguish between smoke and complex backgrounds and identify small areas of smoke. In terms of F1-score, BoucaNet outperformed CT-Fire, RegNetY-16GF, and EfficientFormer v2 by 2.75%, 1.38%, and 1.50%, respectively. This proposed model also performed better than the state-of-the-art method SmokeNet, which achieved an accuracy of 92.75% using the USTC_SmokeRS dataset [14].
It demonstrated its potential to address and overcome challenging limitations related to recognizing smoke in satellite images. These challenges include complex backgrounds, comprising various land covers and geographical features, which can make it difficult to accurately identify smoke in input satellite images. Additionally, BoucaNet handled the varying and dynamic nature of smoke in terms of its shape, color, intensity, and flow pattern features, as well as the visual similarities of smoke, including color, shape, and spectral characteristics, which are often shared with clouds, dust, and haze. Moreover, BoucaNet achieved an efficient processing speed with an inference time of 0.16 seconds, slightly surpassing the inference times of EfficientFormer v2, CT-Fire, and RegNetY-16GF. This inference time showed BoucaNet’s suitability for real-time processing of satellite images for smoke recognition while maintaining high performance.
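The text does not state how the inference time was measured. One common approach, sketched below as an assumption rather than the authors' protocol, averages the wall-clock time per input after a few warm-up calls. The helper `mean_inference_time` and the stand-in model `dummy_predict` are hypothetical.

```python
import time

def mean_inference_time(predict, inputs, warmup=2):
    """Average wall-clock seconds per call of `predict` over `inputs`."""
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for sample in inputs[:warmup]:
        predict(sample)
    start = time.perf_counter()
    for sample in inputs:
        predict(sample)
    return (time.perf_counter() - start) / len(inputs)

def dummy_predict(sample):
    """Hypothetical stand-in for a trained model's prediction function."""
    return sum(sample) > 0

# Fifty toy inputs standing in for preprocessed satellite images.
inputs = [[0.1] * 1000 for _ in range(50)]
average_seconds = mean_inference_time(dummy_predict, inputs)
print(f"mean inference time: {average_seconds:.6f} s")
```

Reported values depend on hardware, batch size, and input resolution, so comparisons between models are meaningful only when these are held fixed.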
In addition, BoucaNet achieved superior results with an F1-score of 95.58%, 91.00%, 90.82%, 95.01%, 98.76%, and 90.36% for recognizing cloud, dust, haze, land, seaside, and smoke classes, respectively, compared with CT-Fire, RegNetY-16GF, and EfficientFormer v2. It demonstrated its ability to accurately differentiate between cloud, smoke, haze, dust, land, and seaside features, thereby proving its capability to overcome challenges related to background complexity and visual similarities, including color, shape, and spectral characteristics, between smoke and other classes (cloud, dust, and haze).
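The per-class F1-scores reported above can be related to the overall F1-score with a short calculation. The sketch below takes the unweighted (macro) mean of the six class scores; whether the reported overall value of 93.64% uses macro or support-weighted averaging is not stated in the text, so the averaging scheme here is an assumption.

```python
# Per-class F1-scores (in percent) reported for BoucaNet.
per_class_f1 = {
    "cloud": 95.58,
    "dust": 91.00,
    "haze": 90.82,
    "land": 95.01,
    "seaside": 98.76,
    "smoke": 90.36,
}

# Unweighted (macro) average over the six classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
# Class with the lowest F1-score, i.e., the hardest one to recognize.
hardest = min(per_class_f1, key=per_class_f1.get)

print(f"macro F1 = {macro_f1:.2f}%")  # prints "macro F1 = 93.59%"
print("lowest per-class F1:", hardest)
```

The macro mean of about 93.59% is close to the reported overall F1-score of 93.64%; a small gap of this kind would be consistent with averaging weighted by the number of test images per class. The lowest scores belong to smoke, haze, and dust, the classes the text describes as visually similar.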
In conclusion, BoucaNet performed well in recognizing smoke in satellite images compared with baseline models (EfficientFormer v2, RegNetY-16GF, CT-Fire, and SmokeNet). Notably, it demonstrated its potential to address challenging limitations, including complex backgrounds; the dynamic nature of smoke in terms of its shape, intensity, and color; detecting small areas of smoke; and distinguishing visual similarities in terms of color, shape, and spectral characteristics between smoke and other elements, including clouds, dust, and haze. Additionally, BoucaNet achieved an efficient inference time of 0.16 s.
5. Conclusions
In this paper, a novel ensemble learning method, namely BoucaNet, was presented for recognizing smoke in satellite images while addressing the associated challenges. BoucaNet combines the strengths of EfficientNet v2 and EfficientFormer v2 to extract rich and diverse feature maps for smoke, cloud, haze, dust, land, and seaside classes. It demonstrated a high performance, with an accuracy of 93.67% and an F1-score of 93.64%, using the USTC_SmokeRS dataset, which consists of 6225 satellite images. Furthermore, BoucaNet outperformed existing deep learning models for object classification, specifically EfficientFormer v2 and RegNetY-16GF, as well as state-of-the-art methods, including SmokeNet. It also showed an efficient processing speed, with an inference time of 0.16 s. Additionally, BoucaNet demonstrated its potential as a robust solution to the challenges of recognizing smoke in satellite images, including complex backgrounds; the dynamic nature of smoke, which can present variations in shape, intensity, and color; detecting small areas of smoke; and visual similarities between smoke and other elements, such as clouds, dust, and haze.