The rapid growth of e-commerce has significantly increased the demand for advanced techniques to address specific tasks in the e-commerce field.
1. Introduction
Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and models capable of automatically learning, identifying patterns, and making predictions or decisions from data ^{[1]}. The field of machine learning encompasses a broad array of methods and algorithms. Some prominent examples of supervised learning methods include linear regression, logistic regression, decision tree, random forest, support vector machine, and artificial neural network techniques ^{[2]}. On the other hand, unsupervised learning methods typically include K-means clustering, hierarchical clustering, principal component analysis, and matrix factorization ^{[3]}.
Deep learning is a specialized branch of machine learning that emphasizes training artificial neural networks with multiple hidden layers, enabling them to acquire hierarchical representations of data ^{[4]}. Exceptional accomplishments in diverse domains, including image classification, object detection, speech recognition, and language translation, have been achieved through the use of deep learning approaches ^{[5]}. The ability to automatically learn intricate features from raw data has positioned deep learning as a pivotal component in modern AI systems ^{[6]}.
E-commerce refers to the buying and selling of goods and services over the Internet, which involves online transactions, electronic payments, and digital interactions between businesses and customers ^{[7]}. E-commerce has become increasingly popular due to its convenience, wide product range, and global accessibility. It also provides a favorable environment for the application of machine learning and deep learning techniques, due to the availability of vast data sets, the need for personalized experiences, the challenges of fraud detection and security, the potential for supply chain optimization, and the importance of customer sentiment analysis ^{[8]}. By leveraging these techniques, e-commerce businesses can enhance customer satisfaction, improve operational efficiency, drive sales, and gain a competitive edge in the digital marketplace ^{[9]}.
2. The Utilized Machine Learning and Deep Learning Techniques
2.1. Machine Learning Techniques

Support vector machine (SVM) ^{[10]} is a machine learning model used for classification and regression. An SVM operates by identifying an optimal hyperplane that maximizes the margin between distinct classes, which is determined by critical data points known as support vectors. It can handle both linearly separable and non-linearly separable data through the kernel trick, using kernels such as the linear, polynomial, radial basis function, and sigmoid kernels. It is particularly effective for binary and even multiclass classification problems ^{[11]}.
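
As a toy illustration (independent of the cited works), a linear SVM can be trained with sub-gradient descent on the regularized hinge loss. The data set, learning rate, and epoch count below are invented for the sketch:

```python
def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=500):
    """Sub-gradient descent on the L2-regularized hinge loss."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # point inside the margin: hinge loss is active
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # only the regularization term contributes
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

# Two linearly separable clusters with labels -1 / +1 (invented data).
points = [(1, 1), (1.5, 1.2), (2, 1.8), (6, 6), (6.5, 6.2), (7, 5.8)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1
```

A kernelized SVM would replace the inner products above with kernel evaluations; this sketch covers only the linear case.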

Decision Tree ^{[12]} is a model used for prediction tasks, functioning by segmenting the predictor space into simple regions for analysis. It uses a tree-like structure to make decisions based on feature values. At each internal node of the tree, a decision or splitting criterion is applied to determine the best feature and threshold for splitting the data ^{[13]}. In classification tasks, each leaf node represents a class label, while in regression tasks, the leaf nodes contain the predicted continuous value for that subset.

Random Forest ^{[14]}^{[15]} is an ensemble learning method that combines multiple decision trees to make predictions. It enhances classification and regression tasks by training multiple trees on various subsamples of the data set and aggregating the predictions of individual trees to improve accuracy and prevent overfitting ^{[16]}.

Naïve Bayes ^{[17]} is based on the "naïve" assumption that features are conditionally independent of one another given the class. It utilizes the Bayes theorem to calculate the posterior probabilities of classes based on observed feature values. Depending on the assumed distribution of the features, there are Gaussian, Multinomial, and Bernoulli Naïve Bayes algorithms. Naïve Bayes is widely recognized for its simplicity and efficiency in training and prediction tasks, making it popular for various applications ^{[18]}.
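
The following sketch (with made-up data) fits a Gaussian Naïve Bayes model in plain Python: per-class priors plus a mean and variance per feature, combined through Bayes' theorem in log space:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for cls, rows in groups.items():
        n = len(rows)
        prior = n / len(X)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / n
            var = sum((v - mu) ** 2 for v in col) / n + 1e-9  # avoid zero variance
            stats.append((mu, var))
        model[cls] = (prior, stats)
    return model

def predict_nb(model, x):
    """Pick the class maximizing log prior + sum of log Gaussian likelihoods."""
    best, best_score = None, float("-inf")
    for cls, (prior, stats) in model.items():
        score = math.log(prior)
        for v, (mu, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = cls, score
    return best

# Invented two-class, two-feature data set.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 8.0], [5.5, 7.5], [4.8, 8.2]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
```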

Logistic regression ^{[19]} utilizes the logistic (sigmoid) function to estimate the probabilities of inputs belonging to different classes. This method can be extended to softmax regression or multinomial logistic regression by replacing the sigmoid function with the softmax function. Logistic and softmax regression provide straightforward and interpretable approaches to classification problems, allowing for accurate and probabilistic predictions ^{[20]}.
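
A minimal sketch of binary logistic regression trained by batch gradient descent, on invented one-dimensional data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the negative log-likelihood."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Invented one-feature data: class 0 at small values, class 1 at large values.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
prob = lambda x: sigmoid(w[0] * x + b)
```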

Principal component analysis (PCA) ^{[21]} is a linear modeling technique used to map high-dimensional input features to a lower-dimensional space, typically referred to as latent factors or principal components. PCA aims to transform the original data into a set of orthogonal components that explain the maximum variance in the data ^{[22]}.
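
For intuition, the leading principal component of centered 2-D data can be found by power iteration on the covariance matrix; the points below are invented so that the component lies near the diagonal:

```python
import math

def first_principal_component(X, iters=100):
    """Power iteration on the 2x2 covariance matrix of centered 2-D data."""
    n = len(X)
    mx = sum(p[0] for p in X) / n
    my = sum(p[1] for p in X) / n
    centered = [(p[0] - mx, p[1] - my) for p in X]
    # Covariance matrix entries.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)  # arbitrary starting direction
    for _ in range(iters):
        vx = cxx * v[0] + cxy * v[1]
        vy = cxy * v[0] + cyy * v[1]
        norm = math.hypot(vx, vy)
        v = (vx / norm, vy / norm)
    return v

# Points scattered along y = x, so the first component should be near (1,1)/sqrt(2).
X = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
v = first_principal_component(X)
```

A full PCA would compute all eigenvectors (e.g., via singular value decomposition); power iteration recovers only the dominant one.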

Matrix factorization algorithms ^{[23]}^{[24]} work by decomposing the original matrix into two or more lower-dimensional matrices that represent latent factors. These algorithms aim to find lower-rank representations of the data by uncovering the underlying structure or patterns within the matrix ^{[25]}.
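
A hedged sketch of matrix factorization via stochastic gradient descent on the observed entries of a small, invented rating matrix (R ≈ PQᵀ with rank k = 2); missing entries are marked None and can be predicted afterwards:

```python
import random

def factorize(R, k=2, lr=0.01, lam=0.02, epochs=2000, seed=0):
    """SGD on the observed entries of R, factorizing R ~ P @ Q^T."""
    rng = random.Random(seed)
    rows, cols = len(R), len(R[0])
    P = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(rows)]
    Q = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(cols)]
    observed = [(i, j, R[i][j]) for i in range(rows) for j in range(cols)
                if R[i][j] is not None]
    for _ in range(epochs):
        for i, j, r in observed:
            pred = sum(P[i][f] * Q[j][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pif, qjf = P[i][f], Q[j][f]
                P[i][f] += lr * (err * qjf - lam * pif)  # regularized update
                Q[j][f] += lr * (err * pif - lam * qjf)
    return P, Q

# A tiny invented user-item rating matrix with one missing entry (None).
R = [[5, 3, None],
     [4, 3, 1],
     [1, 1, 5]]
P, Q = factorize(R)
predict = lambda i, j: sum(P[i][f] * Q[j][f] for f in range(2))
```

`predict(0, 2)` then estimates the missing rating from the learned latent factors.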

K-nearest neighbors (KNN) ^{[26]} is a nonparametric algorithm that predicts the class label (for classification) or the target value (for regression) of a test instance based on its similarity to its K nearest neighbors in the training data. In classification, the majority vote among the neighbors determines the class label, while in regression, the average (or weighted average) of the target values is taken ^{[27]}.
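
KNN requires no training beyond storing the data; the following sketch (toy data invented for the example) classifies a point by majority vote among its three nearest neighbors under Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented training set: (point, label) pairs forming two clusters.
train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((6, 6), "blue"), ((6, 7), "blue"), ((7, 6), "blue")]
```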
2.2. Deep Learning Techniques
Deep learning approaches continue to evolve rapidly, with new architectures, algorithms, and techniques having been developed to address various challenges in different domains. Their ability to learn complex representations from data has significantly advanced the field of artificial intelligence and contributed to various groundbreaking applications ^{[28]}.

An Artificial Neural Network (ANN) ^{[29]} is a computational model inspired by the structure and functionality of biological neural networks in the human brain. It is composed of interconnected artificial neurons or nodes, organized into layers including the input layer, hidden layers, and output layer. The connections between neurons have associated weights, which are adjusted iteratively by propagating the error from the output layer back to the input layer, guided by a defined objective or loss function ^{[30]}.

A Convolutional Neural Network (CNN) ^{[31]}^{[32]} consists of convolutional layers that apply filters to extract features from input data, followed by pooling layers to reduce the spatial dimensions. CNNs have demonstrated exceptional performance in image classification, object detection, and image segmentation ^{[33]}.

The Visual Geometry Group network (VGG) ^{[34]} is a deep convolutional neural network architecture (e.g., with 16–19 weight layers) developed by the Visual Geometry Group. It showcases the effectiveness of deep convolutional neural networks in capturing complex image features and hierarchies ^{[35]}.

A Temporal Convolutional Network (TCN) ^{[36]} utilizes dilated convolutional layers to capture temporal patterns and dependencies in the input data. These dilated convolutions enable an expanded receptive field without significantly increasing the number of parameters or computational complexity.

Recurrent Neural Networks (RNNs) ^{[29]} are designed to process sequential data. Their key characteristic is their recurrent connections, which create a loop-like structure that carries information across time steps, enabling the network to maintain a form of memory or context to process and remember information from previous steps ^{[37]}.

Long Short-Term Memory (LSTM) ^{[38]} is a type of RNN architecture that excels at capturing long-term dependencies and processing sequential data. It utilizes a memory cell and a set of gates that regulate the flow of information; in particular, the memory cell retains information over time, the input gate determines which values to update in the memory cell, the forget gate decides what information to discard from the memory cell, and the output gate selects the relevant information to be output at each time step ^{[37]}.

Bidirectional Long Short-Term Memory (BiLSTM) ^{[39]} combines two LSTMs that process the input sequence in opposite directions: one LSTM processes the sequence in the forward direction, while the other processes it in the backward direction. This bidirectional processing allows the model to capture information from both past and future contexts, providing a more comprehensive understanding of the input sequence. It has demonstrated strong performance in various natural language processing tasks.

The Gated Recurrent Unit (GRU) ^{[40]} is a simplified alternative to the LSTM network, offering comparable performance with fewer parameters and less computation. In GRU, the update gate determines the amount of the previous hidden state to retain and the extent to which the new input is incorporated. The reset gate controls how much of the previous hidden state is ignored and whether the hidden state should be reset, based on the current input ^{[41]}.

The BiGRU ^{[41]}^{[42]} is an extension of the standard GRU, which processes the input sequence in both forward and backward directions simultaneously, resulting in a more comprehensive understanding of the sequence.

The attention-based BiGRU ^{[42]}^{[43]} adopts attention mechanisms to dynamically assign different weights to different time steps of the sequence, allowing the model to attend to more informative or salient parts of the input. It has demonstrated superior performance in various natural language processing tasks ^{[44]}.

Reinforcement Learning (RL) ^{[45]}^{[46]} involves an agent learning through interactions with an environment, receiving feedback in the form of rewards or punishments based on its actions, and learning a mapping from states to actions that maximize the expected cumulative reward over time ^{[47]}.
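
The reward-driven update at the heart of RL can be illustrated with tabular Q-learning (a simple value-based RL method) on an invented corridor environment: the agent moves left or right over five states and receives a reward of 1 on reaching the right end:

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning: Q[state][action], action 0 = left, 1 = right."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap episode length for safety
            # Epsilon-greedy action selection, with random tie-breaking.
            if rng.random() < eps or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Update toward the Bellman target r + gamma * max_a' Q(s', a').
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
            if s == n_states - 1:
                break
    return Q

Q = q_learning()
# Greedy action in each non-terminal state; all should be 1 ("go right").
greedy_policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(4)]
```

A DQN (next paragraph) replaces the table `Q` with a neural network trained on the same Bellman target.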

Deep Q-Networks (DQN) ^{[46]} combine reinforcement learning and deep learning, utilizing the deep neural network to approximate the Q-function and then learn optimal policies in complex environments. The Q-function—also known as the action-value or quality function—represents the expected cumulative reward an agent can achieve by taking a specific action in a given state and following a certain policy. In recent years, Deep RL has gained substantial attention and success in various domains, including robotics, game playing, and autonomous systems ^{[48]}.

A Generative Adversarial Network (GAN) ^{[49]} is composed of a generator network and a discriminator network, which engage in a competitive game. The generator aims to produce synthetic data samples, while the discriminator tries to discern between real and fake samples. Through iterative training in this adversarial process, GANs have exhibited remarkable capabilities in tasks such as image generation, image-to-image translation, and text generation ^{[50]}^{[51]}.

Transformers ^{[52]}^{[53]} are neural networks that use self-attention to capture relationships between words or tokens in a sequence. Self-attention involves calculating attention scores based on the relevance of each element to others, obtaining attention weights through the softmax function, and computing weighted sums using these attention weights. In transformers, the encoder computes representations for each element using self-attention, capturing dependencies and relationships, while the decoder uses this information to generate an output sequence ^{[54]}.
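
The scaled dot-product self-attention described above can be sketched directly (a single head, with identity projections so that queries, keys, and values all equal the invented input vectors):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three invented 2-D token vectors; each output row is a convex
# combination of the value vectors, weighted by attention.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(X, X, X)
```

A real transformer learns separate projection matrices for Q, K, and V and runs several such heads in parallel; this sketch shows only the core attention computation.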

Bidirectional Encoder Representations from Transformers (BERT) ^{[55]} is a powerful pretrained language model introduced by Google in 2018. BERT is trained in a bidirectional manner, learning to predict missing words by considering both the preceding and succeeding context, resulting in a better understanding of the overall sentence or document. BERT’s ability to capture contextual information and leverage pretraining has paved the way for advancements in understanding and generating human language ^{[56]}.

Autoencoders ^{[57]}^{[58]} are neural networks that learn to reconstruct their input data. They consist of an encoder network that maps input data to a compressed latent space and a decoder network that reconstructs the original data from the latent representation. They can be employed for tasks such as dimensionality reduction, anomaly detection, and generative modeling ^{[59]}.

A Stacked Denoising Autoencoder (SDAE) ^{[60]} is a deep neural network composed of multiple layers of denoising autoencoders. These autoencoders are designed to reconstruct the input data from a corrupted or noisy version, enabling the model to learn robust and informative representations ^{[61]}.

A Deep Belief Network (DBN) ^{[62]}^{[63]} is a type of generative deep learning model that consists of multiple stacked restricted Boltzmann machines (RBMs), trained in an unsupervised manner. An RBM is a two-layer stochastic neural network with binary nodes that learns representations by minimizing the energy between visible and hidden nodes ^{[64]}.

Graph Neural Networks (GNNs) ^{[65]}^{[66]}^{[67]} are a class of deep learning models designed to learn node representations by aggregating information from neighboring nodes in a graph. They capture and propagate information through the graph structure, enabling effective learning and prediction tasks on graph-structured data ^{[68]}.

A Directed Acyclic Graph Neural Network (DAGNN) ^{[69]} is an architecture specifically designed for directed acyclic graphs, where the nodes represent entities or features, and edges denote dependencies or relationships. DAGNNs can effectively capture complex dependencies and facilitate learning and inference in domains with intricate relationships among variables.
2.3. Optimization Techniques for Machine and Deep Learning
Optimization techniques play a crucial role in machine learning and deep learning algorithms, helping to find the optimal set of parameters that minimize a loss function or maximize a performance metric with the aim of improving the model’s accuracy and generalization ability. Some popular optimization techniques are detailed below ^{[70]}.

Gradient Descent is an iterative algorithm that updates the model’s parameters by moving in the direction of steepest descent of the loss function.
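
A minimal sketch: gradient descent on the invented one-dimensional loss f(x) = (x − 3)², whose gradient is 2(x − 3) and whose minimum is x = 3:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient of the loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2 starting from x = 0.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```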

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is particularly suitable for large-scale data sets. It is widely used in deep learning, where it updates the network parameters based on a randomly selected subset of training examples, called a mini-batch.

Adaptive Moment Estimation (Adam) is an extension of gradient descent that incorporates adaptive learning rates for different parameters. It dynamically adjusts the learning rate based on the first and second moments of the gradients.
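
A hedged sketch of the Adam update rule on the same kind of one-dimensional quadratic loss; the decay rates are the commonly cited defaults, while the learning rate, step count, and loss are invented for the example:

```python
import math

def adam(grad, x0, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    """Adam: gradient steps scaled by bias-corrected moment estimates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g        # first moment (mean of gradients)
        v = b2 * v + (1 - b2) * g * g    # second moment (mean of squared gradients)
        m_hat = m / (1 - b1 ** t)        # bias correction for zero initialization
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2 starting from x = 0.
x_min = adam(lambda x: 2 * (x - 3), x0=0.0)
```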

Root Mean Square Propagation (RMSProp) is an optimization algorithm that adapts the learning rate individually for each parameter based on a moving average of past squared gradients.

Adagrad adapts the learning rate for each parameter based on its historical gradients. It places more emphasis on infrequent features by reducing the learning rate for frequently occurring features.
Researchers and practitioners often experiment with different optimization algorithms to achieve better training outcomes.
2.4. Ensemble Techniques for Machine and Deep Learning
Ensemble techniques for machine and deep learning approaches involve combining multiple individual models to create a more powerful and accurate predictive model. By leveraging the strengths and diversity of different models, ensemble techniques often present improved performance and robustness when compared to using a single model ^{[71]}.
Some common ensemble techniques for machine and deep learning are as follows.

Bagging (Bootstrap Aggregating) ^{[72]} involves training multiple models independently on different subsets of the training data, typically using the same learning algorithm. The final prediction is obtained by averaging or voting the predictions of the individual models. Random Forest is an example of a popular ensemble method that utilizes bagging ^{[73]}.

AdaBoost (Adaptive Boosting) ^{[74]} sequentially trains multiple homogeneous weak models and adjusts the weights of the training examples to emphasize misclassified instances. The final prediction is a weighted combination of the predictions from the individual models, with more weight given to more accurate models ^{[75]}.

Gradient Boosting ^{[76]} is an advanced boosting methodology that incorporates the principles of gradient descent for optimization purposes. It assembles an ensemble of weak learners in a sequential manner. The primary objective during this iterative process is for each subsequent model to specifically address and minimize the residual errors—also referred to as gradients—with respect to a predetermined loss function ^{[77]}.
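
The residual-fitting loop can be sketched with depth-one regression trees (stumps) on invented one-dimensional data; each round fits a stump to the current residuals, which are the negative gradients of the squared loss:

```python
def fit_stump(X, residuals):
    """Find the threshold split on a 1-D feature minimizing squared error."""
    best = None
    for t in sorted(set(X)):
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(X, y, n_rounds=50, lr=0.1):
    """Each round fits a stump to the residuals and adds it with shrinkage lr."""
    base = sum(y) / len(y)          # start from the mean prediction
    pred = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, X)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Invented step-shaped regression data.
X = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
model = gradient_boost(X, y)
```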

XGBoost (Extreme Gradient Boosting) ^{[78]} is an optimized and highly efficient implementation of gradient boosting. It introduces regularization techniques to control model complexity and prevent overfitting and uses a more advanced construction to provide parallel processing capabilities to accelerate training on large data sets. It also offers builtin functionality for handling missing values, feature importance analysis, and early stopping ^{[79]}.

Stacking ^{[80]}^{[81]} enhances the predictive accuracy by integrating heterogeneous weak learners. These base models are trained in parallel to provide a range of predictions, upon which a metamodel is subsequently trained, synthesizing them into a unified final output. This not only leverages the strengths of individual models, but also reduces the risk of overfitting.
Ensemble techniques can enhance model performance by reducing overfitting, increasing model stability, and capturing diverse aspects of the data. They are widely used in various domains and have been shown to improve performance in tasks such as classification, regression, and anomaly detection.
2.5. Techniques to Prevent Overfitting and Improve Generalization
To prevent overfitting and improve the generalization capability of individual or ensemble models, besides the above-mentioned ensemble methods, several other techniques can be employed ^{[82]}, as detailed below.

Cross-validation ^{[83]}^{[84]} is a widely used technique to estimate the performance of a model on unseen data. It involves partitioning the available data into multiple subsets, training the model on some subsets, and evaluating its performance on the remaining subset; the results can guide the selection of hyperparameters and model architecture.
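
A sketch of generating k-fold index splits; when n is not divisible by k, the fold sizes differ by at most one:

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        val = folds[i]                      # one fold held out for validation
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

# Three folds over ten examples: fold sizes 4, 3, 3.
splits = list(k_fold_splits(10, 3))
```

In practice the indices are usually shuffled first; the model is then trained and scored once per split, and the k scores are averaged.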

Regularization methods ^{[85]}^{[86]}, such as L1 and L2 regularization, add a penalty term to the loss function during training. This discourages the model from fitting the training data too closely and encourages simpler and more robust models ^{[87]}.

Dropout ^{[88]} is a technique commonly used in deep learning models. It randomly deactivates a fraction of the neurons during training, effectively creating an ensemble of smaller subnetworks. This encourages the network to learn more robust and less dependent representations, reducing overfitting and improving generalization.
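
An illustrative implementation of inverted dropout, where surviving activations are rescaled by 1/(1 − p) during training so that no rescaling is needed at inference time (the function name and data are invented):

```python
import random

def dropout(activations, p=0.5, training=True, seed=None):
    """Inverted dropout: zero each unit with probability p, scale survivors."""
    if not training:
        return list(activations)  # no-op at inference time
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
dropped = dropout(acts, p=0.5, seed=0)
```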

Early stopping ^{[89]}^{[90]} involves monitoring the model’s performance on a validation set during training and stopping the training process when the performance on the validation set starts to degrade. This prevents the model from over-optimizing on the training data and helps to find an optimal point that balances training accuracy and generalization.
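
The patience-based loop can be sketched as follows; the validation-loss sequence is invented, standing in for one evaluation per epoch:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` epochs.

    Returns (best_epoch, stop_epoch)."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # restore the best checkpoint here
    return best_epoch, len(val_losses) - 1

# Validation loss improves until epoch 3, then degrades; with patience 3,
# training stops at epoch 6 and the epoch-3 checkpoint is kept.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 1.0]
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=3)
```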

Data augmentation ^{[91]} involves artificially increasing the size of the training set by applying various transformations to the existing data. This introduces diversity into the training data, reducing the risk of overfitting and helping the model to better generalize to unseen examples.
These techniques, either used individually or in combination, can help to mitigate overfitting and improve the generalization ability of machine learning and deep learning models, leading to better performance on unseen data.