Hybrid LSTM Model and Air Pollution Prediction

Please note this is a comparison between Version 1 by Kruna Ratković and Version 2 by Rita Xu.

Air pollution is a critical environmental concern that poses significant health risks and affects multiple aspects of human life. Machine learning (ML) algorithms provide promising results for air pollution prediction. In the existing scientific literature, Long Short-Term Memory (LSTM) predictive models, as well as their combination with other statistical and machine learning approaches, have been utilized for air pollution prediction. However, these combined algorithms may not always provide suitable results due to the stochastic nature of the factors that influence air pollution, improper hyperparameter configurations, or inadequate datasets and data characterized by great variability and extreme dispersion. The focus of this paper is applying and comparing the performance of Support Vector Machine and hybrid LSTM regression models for air pollution prediction. To identify optimal hyperparameters for the LSTM model, a hybridization with the Genetic Algorithm is proposed. To mitigate the risk of overfitting, the bagging technique is employed on the best LSTM model. The proposed predictive model aims to determine the Common Air Quality Index level for the next hour in Niksic, Montenegro. With the hybridization of the LSTM algorithm and by applying the bagging technique, the approach aims to significantly enhance the accuracy and reliability of hourly air pollution prediction. The major contribution of this paper is in the application of advanced machine learning analysis and the combination of the LSTM, Genetic Algorithm, and bagging techniques, which have not been previously employed in the analysis of air pollution in Montenegro. The proposed model will be made available to interested management structures, local governments, national entities, or other relevant institutions, empowering them to make effective pollution level predictions and take appropriate measures.

air pollution prediction
common air quality index
long short-term memory
support vector machine
genetic algorithm
bagging techniques

1. Introduction

Air pollution is a concerning global issue, with approximately 1.3 million annual deaths attributed to it, according to the World Health Organization (WHO) ^[1]. Air quality assessment plays a vital role in monitoring and managing pollution levels. WHO data reveal that air pollution exceeding the recommended limits affects nearly the entire global population (99%), with a significant impact in low- and middle-income countries. It is crucial to anticipate and prepare for fluctuations in pollution levels to effectively mitigate the adverse effects of air pollution. Improving air quality not only enhances public health but also contributes to mitigating climate change, as air quality is closely interconnected with our planet’s climate and the health of its ecosystems. By reducing air pollution, we can alleviate the burden of diseases associated with air pollution and make long-term contributions to climate change mitigation efforts.

Since 2005, the Common Air Quality Index (CAQI) has been employed in Europe as a comprehensive and standardized metric to evaluate and communicate air quality levels to the general public ^[2]. It provides a simplified and easily understandable representation of air pollution levels, making it easier for individuals to make informed decisions regarding their health and well-being. The CAQI is based on the measurement of several air pollutants that are known to have detrimental effects on human health, including particulate matter (PM₂.₅ and PM₁₀), nitrogen dioxide (NO₂), ozone (O₃), carbon monoxide (CO), and sulfur dioxide (SO₂) ^[3]. These pollutants are commonly monitored by air quality monitoring stations located in various regions. The CAQI is designed to provide a numerical value or color-coded scale that corresponds to the air quality level.

Typically, the CAQI scale ranges from 0 to 100 and it is divided into several categories, such as very low, low, medium, high, and very high ^[4], and the visual color scale is presented from green to red. To calculate the CAQI value, individual pollutant concentrations are first converted into indexes using predefined equations that are based on value interpolation. These indexes are then combined, weighted, and transformed into a single CAQI value. The weighting factors assigned to each pollutant are determined based on their relative health impacts.
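The interpolation step described above can be sketched as a piecewise-linear mapping from a pollutant concentration onto the index scale. This is only an illustration: the breakpoint values, the names `PM10_BREAKPOINTS`, `INDEX_GRID`, and `sub_index`, and the choice to cap the result at 100 are assumptions made for the example, not values taken from the text; the official CAQI tables should be consulted for real use. The subsequent combination of the per-pollutant indexes is not shown.

```python
from bisect import bisect_right

# Assumed, illustrative breakpoints (concentration in µg/m³) and the index
# values they map to. These are placeholders, not official CAQI tables.
PM10_BREAKPOINTS = [0.0, 25.0, 50.0, 90.0, 180.0]
INDEX_GRID = [0.0, 25.0, 50.0, 75.0, 100.0]


def sub_index(concentration, breakpoints=PM10_BREAKPOINTS, grid=INDEX_GRID):
    """Map a concentration onto the index scale by linear interpolation."""
    if concentration <= breakpoints[0]:
        return grid[0]
    if concentration >= breakpoints[-1]:
        return grid[-1]  # this sketch caps the index at the top of the scale
    hi = bisect_right(breakpoints, concentration)  # first breakpoint above the value
    lo = hi - 1
    frac = (concentration - breakpoints[lo]) / (breakpoints[hi] - breakpoints[lo])
    return grid[lo] + frac * (grid[hi] - grid[lo])


# A concentration halfway between two breakpoints lands halfway between
# the corresponding index values.
print(sub_index(70.0))
```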

The CAQI is a valuable tool in terms of raising awareness about air pollution and its potential health risks. It enables policymakers, environmental agencies, and the general public to monitor and address air quality issues effectively. Additionally, the CAQI facilitates the comparison of air quality between different locations and allows for long-term trend analysis, aiding in the formulation of targeted strategies for air pollution control and mitigation.

The advancement of machine learning (ML) techniques, including deep learning, has opened up new opportunities to enhance air quality research ^[5]. Among these techniques, the Support Vector Machine (SVM) has demonstrated promising outcomes in diverse domains. As a supervised learning algorithm, SVM identifies optimal hyperplanes that separate the data into classes. In the context of air pollution prediction, SVM can learn complex patterns and relationships from historical pollution data and meteorological variables ^[6]. Long Short-Term Memory (LSTM), in turn, is a type of recurrent neural network known for its effectiveness in modeling sequential data ^[7]. It can capture long-term dependencies and temporal patterns, making it suitable for time series forecasting tasks such as air pollution prediction.
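Both SVM regression and LSTM models are usually trained on fixed-length windows of past observations paired with the value to be predicted. A minimal sketch of that framing for next-hour prediction is given below; the function name `make_windows`, the `lookback` parameter, and the sample values are hypothetical and chosen only for illustration.

```python
def make_windows(series, lookback):
    """Turn a series into (inputs, target) pairs for one-step-ahead forecasting.

    Each input holds `lookback` consecutive observations and the target is
    the observation that immediately follows them.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    inputs, targets = [], []
    for end in range(lookback, len(series)):
        inputs.append(list(series[end - lookback:end]))
        targets.append(series[end])
    return inputs, targets


# Made-up hourly index values, used only to show the shape of the output.
hourly_caqi = [30, 32, 35, 40, 38, 36]
X, y = make_windows(hourly_caqi, lookback=3)
for window, target in zip(X, y):
    print(window, "->", target)
```

A longer `lookback` gives the model more history per sample but leaves fewer training pairs, which is one reason the window length is often treated as a hyperparameter.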

The hybridization of ML algorithms with other techniques yields good results, especially when it comes to metaheuristic algorithms. Hybridization allows the faster convergence of algorithms and increases the prediction accuracy of ML algorithms. There is a wide range of metaheuristic algorithms ^[8] and one of the most commonly used is the Genetic Algorithm (GA), which is inspired by the process of natural selection ^[9]. It can effectively search for optimal or suboptimal solutions in a large solution space.
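To make the idea of GA-based hyperparameter search concrete, the sketch below evolves a small population of candidate configurations using tournament selection, uniform crossover, mutation, and elitism. Everything specific in it is an assumption for illustration: the search space, the parameter names, and in particular `toy_validation_error`, which stands in for the expensive step of training an LSTM and measuring its validation error. It is not the configuration used in the paper.

```python
import random

# Hypothetical search space; names and ranges are assumptions for illustration.
SEARCH_SPACE = {
    "units": [16, 32, 64, 128],
    "lookback": [6, 12, 24, 48],
    "learning_rate": [0.0001, 0.001, 0.01],
}


def toy_validation_error(genes):
    """Stand-in for 'train a model with these genes and return its error'."""
    return (
        abs(genes["units"] - 64) / 64
        + abs(genes["lookback"] - 24) / 24
        + abs(genes["learning_rate"] - 0.001) * 100
    )


def random_individual(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {name: rng.choice((parent_a[name], parent_b[name])) for name in SEARCH_SPACE}


def mutate(genes, rng, rate=0.2):
    """Resample each gene from its allowed values with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in genes.items()
    }


def tournament(population, fitness, rng, size=3):
    """Return the lowest-error individual among `size` random contestants."""
    return min(rng.sample(population, size), key=fitness)


def genetic_search(fitness, pop_size=12, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        elite = min(population, key=fitness)  # elitism: the best survives unchanged
        history.append(fitness(elite))
        children = [elite]
        while len(children) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = genetic_search(toy_validation_error)
print(best, history[-1])
```

Because the best individual is always carried over, the recorded best error can never increase from one generation to the next. In a real setting each fitness call means training a network, so the population size and number of generations are typically kept small.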

Bagging is an ensemble learning technique that enhances predictions by consolidating multiple models trained on diverse subsets of data. By aggregating the predictions of individual models, bagging reduces overfitting and increases the stability and robustness of the algorithms. It helps to capture different patterns and relationships present in the data, increasing the model’s overall performance by enhancing accuracy, handling data noise, and increasing robustness ^[10].
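The mechanics of bagging can be sketched in a few lines: draw bootstrap samples (sampling with replacement), fit one model per sample, and average the predictions. The example below uses a simple least-squares line as the base learner purely to stay self-contained; in the setting described here the base learner would be the tuned LSTM. The names `fit_line` and `bagged_fit`, the default of 25 models, and the sample data are all hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    spread = sum((x - x_bar) ** 2 for x in xs)
    if spread == 0:
        slope = 0.0  # degenerate sample: fall back to predicting the mean
    else:
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / spread
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def bagged_fit(xs, ys, fit, n_models=25, seed=0):
    """Fit `n_models` base learners on bootstrap samples and average them."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(model(x) for model in models)


# Made-up data: hour of day against an index value, for illustration only.
hours = [0, 1, 2, 3, 4, 5, 6, 7]
values = [31.0, 30.5, 33.0, 36.5, 35.0, 39.0, 41.5, 40.0]
predict = bagged_fit(hours, values, fit_line)
print(predict(8))
```

Averaging over models trained on different resamples reduces the variance of the final prediction, which is the mechanism behind the reduced overfitting described above.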

2. Hybrid LSTM Model and Air Pollution Prediction

Recent studies have focused on sophisticated learning algorithms to enhance air quality evaluation and air pollution prediction. Drewil and Al-Bahadili ^[11] used the LSTM model in conjunction with GA to enhance the performance of air pollution prediction models. The performance of the GA-LSTM model was evaluated and compared with models employing manual criteria. The results showed a significant improvement in LSTM performance with the integration of GA. Waseem et al. ^[12] chose to perform only PM₂.₅ forecasting by applying deep learning techniques, among which the LSTM encoder–decoder variant showed promising results. In another study, Xayasouk et al. ^[13] examined methods of predicting PM levels and found that LSTM combined with deep autoencoder techniques performed slightly better than the typical LSTM model. Triana and Osowski ^[14] employed bagging and boosting techniques for PM prediction. Their experiments demonstrated significant improvements in result quality when using bagging and boosting ensembles with weak predictors. The Mean Absolute Error was reduced by more than 30% for PM₁₀ and 20% for PM₂.₅ compared to individual predictors. Liang et al. ^[15] developed multiple ML models, including adaptive boosting (AdaBoost), an artificial neural network (ANN), random forest (RF), a stacking ensemble, and SVM, for the prediction of air quality index levels over different time intervals (1 h, 8 h, and 24 h). The stacking ensemble, AdaBoost, and RF models showed the best prediction performance, although their forecasting accuracy varied across geographical regions. Madhuri et al. ^[16] used linear regression, SVM, decision tree, and RF models for air quality prediction. The RF model achieved the highest accuracy among the tested algorithms. Kumar and Pande ^[17] applied five different ML models to predict air quality. The authors showed that the strongest correlation between predicted and actual data was achieved by the XGBoost model. Sanjeev ^[18] conducted a study in which several standard classification models were applied to a dataset that included pollutant concentrations and meteorological data. Due to its robustness against overfitting, the RF classifier demonstrated superior performance compared to other classifiers.

References

1. World Health Organization. Air Pollution. Available online: https://www.who.int/health-topics/air-pollution#tab=tab_1/ (accessed on 1 June 2023).
2. van den Elshout, S.; Léger, K.; Nussio, F. Comparing urban air quality in Europe in real time, a review of existing air quality indices and the proposal of a common alternative. Environ. Int. 2008, 34, 720–726.
3. van den Elshout, S.; Léger, K.; Heich, H. CAQI Common Air Quality Index–update with PM2.5 and sensitivity analysis. Sci. Total Environ. 2014, 488, 461–468.
4. Environmental Protection Agency of Montenegro. Available online: http://www.epa.org.me/vazduh/caqi (accessed on 1 June 2023).
5. Li, Y.; Sha, Z.; Tang, A.; Goulding, K.; Liu, X. The application of machine learning to air pollution research: A bibliometric analysis. Ecotoxicol. Environ. Saf. 2023, 257, 114911.
6. Wang, W.; Men, C.; Lu, W. Online prediction model based on support vector machine. Neurocomputing 2008, 71, 550–558. Available online: https://www.sciencedirect.com/science/article/abs/pii/S0925231207002883 (accessed on 7 August 2023).
7. Gul, S.; Khan, G.M. Forecasting Hazard Level of Air Pollutants Using LSTM’s. Artif. Intell. Appl. Innov. 2020, 584, 143–153.
8. Reeves, C.R. Genetic algorithms. In Handbook of Metaheuristics; Gendreau, M., Potvin, J.Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 109–139.
9. Eiben, A.E.; Smith, J.E. Introduction to Evolutionary Computing, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2015.
10. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
11. Drewil, G.; Al-Bahadili, R. Air pollution prediction using LSTM deep learning and metaheuristics algorithms. Meas. Sensors 2022, 10, 100546.
12. Waseem, K.H.; Mushtaq, H.; Abid, F.; Abu-Mahfouz, A.M.; Shaikh, A.; Turan, M.; Rasheed, J. Forecasting of Air Quality Using an Optimized Recurrent Neural Network. Processes 2022, 10, 2117.
13. Xayasouk, T.; Lee, H.; Lee, G. Air Pollution Prediction Using Long Short-Term Memory (LSTM) and Deep Autoencoder (DAE) Models. Sustainability 2020, 12, 2570.
14. Triana, D.; Osowski, S. Bagging and boosting techniques in prediction of particulate matters. Bull. Pol. Acad. Sci. 2020, 68, 1207–1215.
15. Liang, Y.-C.; Maimury, Y.; Chen, A.H.-L.; Juarez, J.R.C. Machine Learning-Based Prediction of Air Quality. Appl. Sci. 2020, 10, 9151.
16. Madhuri, V.M.; Samyama, G.H.; Kamalapurkar, S. Air pollution prediction using machine learning supervised learning approach. Int. J. Sci. Technol. Res. 2020, 9, 118–123.
17. Kumar, K.; Pande, B.P. Air pollution prediction with machine learning: A case study of Indian cities. Int. J. Environ. Sci. Technol. 2023, 20, 5333–5348.
18. Sanjeev, D. Implementation of machine learning algorithms for analysis and prediction of air quality. Int. J. Eng. Res. Technol. 2021, 10, 533–538.