ML models can be classified into several types according to the task objective, such as regression, classification, reinforcement learning, and generative modeling. For regression prediction, the ML models in the collected research were classified into four categories: traditional convex optimization-based models (TCOB models), tree models, linear regression (LR), and modern deep-learning structure models (modern DL structure).
1. Traditional convex optimization-based models
Two main model types are included in the TCOB model group: Support Vector Machine (SVM) and artificial neural networks (ANNs). Both are trained mostly with algorithms that originate in convex optimization (e.g., stochastic gradient descent). Essentially, these two models add a nonlinear data transformation on top of a linear model. The method of transformation differs between them: SVM transforms the data by means of kernel functions, while ANNs use activation functions.
The development of SVM can be divided into two stages, non-kernel SVM and kernel SVM^{[1][2]}[67,68], the latter of which is commonly applied today. The kernel function implicitly maps input features from a low-dimensional space to a higher-dimensional one and thereby simplifies the calculations in the higher-dimensional space, because only kernel values between pairs of samples need to be evaluated. In practice, the linear, polynomial, and Radial Basis Function (RBF) kernels are three commonly used kernels. Kernel selection depends on the specific task and on model performance.
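The three kernels named above can be sketched in a few lines. This is a minimal illustration rather than any particular library's implementation: the function names and the default values of `degree`, `coef0`, and `gamma` are assumptions chosen for the example. The helper `explicit_degree2_map` is also hypothetical; it only shows, for 2-D inputs, that a degree-2 polynomial kernel equals an inner product in a 6-dimensional space that never has to be constructed explicitly.

```python
import math


def linear_kernel(x, z):
    """Inner product in the original feature space."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x.z + coef0) ** degree; degree and coef0 are illustrative defaults."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); gamma is an illustrative default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def explicit_degree2_map(x):
    """Explicit 6-D feature map matching polynomial_kernel(degree=2, coef0=1)
    for 2-D inputs (hypothetical helper, for illustration only)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = polynomial_kernel(x, z)  # computed in 2-D
k_explicit = linear_kernel(explicit_degree2_map(x), explicit_degree2_map(z))  # computed in 6-D
```

Both `k_implicit` and `k_explicit` describe the same quantity; the kernel version simply avoids building the higher-dimensional vectors.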
Multilayer Perceptron (MLP), also called Back Propagation Neural Network (BPNN)^{[3]}[69], is the simplest neural network in this model group. MLP contains three types of layers: the input layer, the hidden layer, and the output layer. The input layer is a one-dimensional layer that passes the organized independent variables into the network. The hidden layer receives data from the input layer and processes them in a feedforward pass. All parameters in the network (including the weights and biases between two adjacent layers) are optimized by a backpropagation algorithm. In the training stage, the prediction result is passed to the output layer in each epoch, and the network parameters are updated so that the predictions better fit the targets. In the validation or testing stage, the network parameters are frozen and the network makes predictions directly.
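The feedforward pass and the backpropagation update described above can be sketched for the smallest possible case: one input, one hidden layer, one output. Everything specific here is an assumption made for illustration, including the tanh activation, the squared-error loss, the per-sample update, the learning rate, the hidden size, and the toy target y = x^2.

```python
import math
import random


def init_mlp(n_hidden, seed=0):
    """Randomly initialise a 1-input, 1-output MLP with one hidden layer."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)],  # input -> hidden
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)],  # hidden -> output
        "b2": 0.0,
    }


def forward(params, x):
    """Feedforward pass: input -> tanh hidden layer -> linear output."""
    hidden = [math.tanh(w * x + b) for w, b in zip(params["w1"], params["b1"])]
    y_hat = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return hidden, y_hat


def train_epoch(params, data, lr=0.05):
    """One pass over the data; parameters are updated after every sample."""
    total_loss = 0.0
    for x, y in data:
        hidden, y_hat = forward(params, x)
        err = y_hat - y  # derivative of 0.5 * err^2 w.r.t. y_hat
        total_loss += 0.5 * err * err
        for j, h in enumerate(hidden):
            # Backpropagate through the output weight and the tanh unit.
            grad_pre = err * params["w2"][j] * (1.0 - h * h)
            params["w2"][j] -= lr * err * h
            params["w1"][j] -= lr * grad_pre * x
            params["b1"][j] -= lr * grad_pre
        params["b2"] -= lr * err
    return total_loss / len(data)


# Toy regression data: y = x^2 on [-1, 1].
data = [(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)]
params = init_mlp(n_hidden=8)
losses = [train_epoch(params, data) for _ in range(200)]

# Testing stage: parameters are frozen, only the forward pass is run.
_, prediction = forward(params, 0.5)
```

Calling `forward` alone never changes `params`, which corresponds to the frozen-parameter behaviour in the validation or testing stage.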
After MLP was proposed, many artificial neural networks (ANNs) were developed from the 1970s to the 2010s, such as the Radial Basis Function Network (RBFN)^{[4]}[70], Elman network^{[5]}[71], General Regression Neural Network (GRNN)^{[6]}[72], Nonlinear Autoregressive with Exogenous Inputs Model (NARX)^{[7]}[73], Extreme Learning Machine (ELM)^{[8]}[74], and Deep Belief Networks (DBN)^{[9]}[75]. One distinctive characteristic of these models is that they are relatively shallow, owing to the limited computing power available when they were proposed and to their hand-crafted design. For example, RBFN contains a Gaussian activation function inside the network, which is not a suitable design for a “deep” network. Furthermore, among ANNs, more layers in the network do not always mean improved prediction performance; sometimes, performance even deteriorates. Even so, ANNs are still effective tools for atmospheric pollution prediction because they are simple to apply and perform well.

2. Tree models
The development of tree models went through two stages: basic models and ensemble models. Basic models include ID3^{[10]}[76], C4.5^{[11]}[77], and CART^{[12]}[78]. They differ in the method of selecting features and in the number of branches in the tree. The algorithms are not introduced mathematically here, as they are readily available in the literature. As a further development of basic tree models, ensemble tree models are key to the maturity of this group of ML models. Two ensemble ideas emerged in the course of development: bagging and boosting. The representative bagging model is the random forest (RF)^{[13]}[79], which builds n submodels on resampled subsets of the original input data and makes a prediction by voting (or, for regression, by averaging the submodel outputs). The two main ideas in boosting are changing the sample weights and fitting the residual error according to the loss function during the training stage. AdaBoost^{[14]}[80] uses the former idea, whereas the Gradient Boosting Decision Tree (GBDT)^{[15]}[81], also called the Gradient Boosting Machine (GBM), uses the latter. GBDT has since been improved and developed into different models, such as XGBoost^{[16]}[82], LightGBM^{[17]}[83], and CatBoost^{[18]}[84], which have been widely used for classification as well as regression tasks.
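The contrast between bagging and residual-fitting boosting can be sketched with the simplest possible tree, a one-split regression stump on a single feature. This is a simplified illustration under stated assumptions, not RF or GBDT as published: the bagging part omits the random feature selection that a random forest adds, the boosting part assumes a squared-error loss (so the quantity being fitted is the plain residual), and all function names and default values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Best single-split regression stump (squared error) on 1-D inputs.
    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # no valid split: fall back to the mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Bagging: each submodel is trained on a bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Regression analogue of voting: average the submodel outputs."""
    return sum(stump_predict(m, x) for m in models) / len(models)


def boosting_fit(xs, ys, n_rounds=50, lr=0.1):
    """Boosting: each new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, lr


def boosting_predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)
```

In the bagging functions the submodels are independent and could be trained in parallel, whereas in `boosting_fit` every round depends on the predictions accumulated so far.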

3. Linear regression
This group includes multiple linear regression (MLR), the Autoregressive Integrated Moving Average model (ARIMA), ridge regression^{[19]}[85], Least Absolute Shrinkage and Selection Operator (LASSO)^{[20]}[86], Elastic Net^{[21]}[87], and Generalized Additive Model (GAM)^{[22]}[88]. These models were originally designed to solve regression tasks. From the perspective of ML, ridge regression, LASSO, and Elastic Net are regularized forms of linear regression. ARIMA is a time-series model that transforms a non-stationary time series into a stationary one for model fitting; GAM as described here refers specifically to GAM for regression, where the target variable is the sum of a series of subfunctions. The function can be expressed as follows:
y = \sum_{i=1}^{n} f_{i}(x_{i})
where y is the target variable and x_{i} is the i-th input variable. f_{i} can be any function here.
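Two of the ideas in this group are easy to show concretely: the penalty terms that distinguish ridge regression, LASSO, and Elastic Net, and the differencing step by which ARIMA removes a trend from a series. The sketch below is illustrative only. The function names and the penalty weights `l1` and `l2` are assumptions, and the scaling of the loss and penalty terms differs between textbooks and software packages.

```python
def penalized_loss(weights, residuals, l1=0.0, l2=0.0):
    """Mean squared error plus L1 and L2 penalties on the coefficients.

    l1 > 0, l2 = 0  -> LASSO-style objective
    l1 = 0, l2 > 0  -> ridge-style objective
    l1 > 0, l2 > 0  -> Elastic Net-style objective
    """
    mse = sum(r * r for r in residuals) / len(residuals)
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return mse + l1_term + l2_term


def difference(series, order=1):
    """Apply first-order differencing `order` times (the "I" in ARIMA)."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


trend = [1, 4, 9, 16, 25]            # quadratic trend, not stationary
once = difference(trend)             # [3, 5, 7, 9]
twice = difference(trend, order=2)   # [2, 2, 2]
```

After two rounds of differencing the quadratic trend is reduced to a constant series, which is the kind of series the autoregressive and moving-average parts are then fitted to.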
LR has a long history of development. However, innovation in model algorithms has stagnated since Elastic Net was proposed. One important reason is the limited nonlinear-fitting ability of this group.

4. Modern deep-learning structure models
Modern DL structure models are another important part of deep learning. They evolved from ANNs and were redesigned on the basis of MLP to suit the characteristics of the prediction tasks and input data. They mainly include the convolutional neural network (CNN)^{[23]}[89] and the recurrent neural network (RNN)^{[24]}[90]. CNN contains a feature-capturing filter module called a “kernel” that catches local spatial features, so the connections between neighboring layers are substantially sparser than the dense connections inside MLP. This design makes optimization and convergence of the network easier. Many CNN structures with innovative design concepts have been developed, such as AlexNet (network goes “deeper”)^{[25]}[14], VGG (doubles the number of layers, halves the height and width)^{[26]}[91], ResNet (skip connection)^{[27]}[92], and GoogLeNet (inception block)^{[28]}[93]. These networks can not only be applied directly to prediction tasks but also provide modern ideas for future network design.
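The local, sparse connectivity of a convolutional kernel can be illustrated with a plain sliding-window operation. The sketch below assumes stride 1 and no padding and uses the cross-correlation form that deep-learning frameworks conventionally call convolution; the function name, the toy image, and the edge-detecting kernel are all made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    Each output value depends only on the small image patch under the
    kernel, and the same kernel weights are reused at every position.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical edge between the 2nd and 3rd columns.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]  # responds where intensity increases left to right

feature_map = conv2d_valid(image, edge_kernel)
```

Here the 4x3 feature map is produced with only 2 shared weights, whereas a dense layer connecting all 16 inputs to 12 outputs would need 192 weights.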
Compared to CNN, RNN is better at capturing temporal relationships in a time series. This group of models retains historical information in a “memory” unit and passes it into the network at subsequent time steps. The classical RNN simply passes the history information from the last time step into the network along with the input data of the current time step. However, this original “memory” unit design leads to a serious problem, the vanishing gradient, which hinders the successful training of the model. Advanced RNN-based structures such as the long short-term memory network (LSTM)^{[24]}[90] and gated recurrent units (GRU)^{[29]}[94] significantly alleviate this problem through structural modifications. These advanced RNNs are now more widely applied than the original RNN.
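The vanishing gradient in a classical RNN can be made concrete with a one-unit recurrent cell. The sketch assumes a scalar hidden state, a tanh activation, and illustrative weights `w_x` and `w_h`; none of these values come from the source. By the chain rule, the sensitivity of the final state to the initial state is a product of one factor per time step, and each factor here has magnitude below 1.

```python
import math


def rnn_forward(xs, w_x=1.0, w_h=0.5, h0=0.0):
    """Classical RNN with a scalar state: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs


def grad_wrt_initial_state(hs, w_h=0.5):
    """d h_T / d h_0 as a product of per-step factors w_h * (1 - h_t^2)."""
    grad = 1.0
    for h in hs[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad


short_grad = grad_wrt_initial_state(rnn_forward([0.1] * 5))
long_grad = grad_wrt_initial_state(rnn_forward([0.1] * 50))
```

With these settings every factor is at most 0.5, so `long_grad` is many orders of magnitude smaller than `short_grad`: information from early time steps contributes almost nothing to the parameter updates. Gated structures such as LSTM and GRU add paths along which this product does not have to shrink at every step.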
During the development of modern DL structure models, several improved model components were proposed, which effectively improved the performance of both ANNs and modern DL structure models. For instance, the sigmoid activation function has been replaced by the Rectified Linear Unit (ReLU)^{[30]}[95] or LeakyReLU^{[31]}[96] in most regression tasks; the dropout method^{[32]}[97] is usually applied in the model training stage to alleviate overfitting; and Adam^{[33]}[98] and weight decay regularization^{[34]}[99] are commonly used in network optimization.
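The components listed above are small enough to sketch directly. The following is an illustration under assumptions rather than any framework's implementation: the negative slope of LeakyReLU, the "inverted" scaling used in dropout, the Adam hyperparameters, and the decoupled way in which weight decay is applied are common conventions chosen for this example, and all function names are hypothetical.

```python
import math
import random


def relu(x):
    """Rectified Linear Unit: passes positive values, zeroes the rest."""
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs."""
    return x if x > 0 else negative_slope * x


def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout (requires 0 <= p < 1).

    During training each value is zeroed with probability p and survivors
    are scaled by 1 / (1 - p); outside training the input is returned as is.
    """
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar parameter at step t (t starts at 1).

    Weight decay is applied in decoupled form here (an assumption).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v


activations = [relu(-2.0), relu(3.0), leaky_relu(-2.0), leaky_relu(3.0)]
dropped = dropout([1.0] * 10, p=0.5, rng=random.Random(0))
w_new, m_new, v_new = adam_step(w=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected Adam update has a magnitude close to the learning rate regardless of the gradient's scale, which is one reason the method is comparatively insensitive to how the loss is scaled.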