The general mathematical formulation of the RL problem is a Markov Decision Process (MDP). An MDP is a mathematical model that describes the interaction between the agent and the environment. It follows the Markov Property, which states that the conditional probability distribution of future states of a stochastic process depends only on the current state and not on previous ones. This means that the history of previous states and the consequences of their actions do not determine future states. The Markov Property is the basis for the mathematical framework of Markov models, which are widely used in a variety of fields, including physics, economics and computer science. MDPs describe fully observable RL environments and consist of a finite number of discrete states. The agent’s actions cause the environment to transition from one state to the next, and the goal is to discover a policy, a mapping from states to actions, that maximizes the expected cumulative reward, or return, over time.
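As a minimal sketch of the MDP interaction described above, the transition dynamics can be encoded as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes. The states, actions and numbers below are hypothetical, chosen only to illustrate that the next state depends solely on the current state and action, per the Markov Property:

```python
import random

# Hypothetical MDP: transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "s0": {"go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"go": [(1.0, "s0", 0.0)]},
}

def step(state, action):
    """Sample the next state and reward; the outcome distribution depends
    only on (state, action), never on earlier history -- the Markov Property."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

A policy would then be any rule that picks an action from each state, and the return is the cumulative reward collected by repeatedly calling `step`.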
3. Value Functions and Bellman Equations
In the RL framework, value functions have a critical role in estimating the expected cumulative reward that an agent will receive if it follows a particular policy and starts its course of action from a specific state or a state–action pair [48,49][3][4]. A fundamental distinction between value functions and reward functions is that the latter provide immediate rewards to the agent, according to its most recent action, whereas the former estimate cumulative rewards in the long term. Thus, the value function estimates the expected sum of rewards that an agent will obtain from a specific state over the entire duration of its interaction with the environment. This makes value functions difficult to estimate, given that the value of a state or state–action pair must be estimated multiple times throughout an episode from state observations recorded during an agent’s lifetime. Bellman Equations are mathematical expressions utilized for computing the value functions in an efficient, recursive way. Specifically, the Bellman Equation expresses the value of a state or state–action pair as the sum of the immediate reward received in that state and the expected value of the next state under the policy being followed by the agent [48][3]. The Bellman Equation takes into account that the optimal value of a state depends on the optimal values of its neighboring states, and it iteratively estimates and updates the value functions and the policy followed. This iterative process is repeated until the value functions converge to their optimal values, thereby reaching the optimal policy. When the agent initiates its actions from a specific state, the value function is referred to as the ‘state-value’ function, whereas when the agent initiates from a specific state–action pair, it is called the ‘action-value’ function. The state value describes the expected performance of an agent when it follows policy 𝜋, having started from state s. The mathematical representation of a typical value function is as seen in Equation (2), while the Bellman Equation for the value function, which expresses the dependency of the value of one state on the values of future states, is seen in Equation (3).
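Equations (2) and (3) themselves are not reproduced in this excerpt; under the standard notation (discount factor 𝛾, reward r, transition probabilities P), the state-value function and its Bellman Equation usually take the form:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right]
\qquad
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr]
```

The second expression makes the recursion explicit: the value of s is the expected immediate reward plus the discounted value of the successor state s′.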
Respectively, the action value describes the expected performance of an agent when it chooses action a while in state s and follows policy 𝜋 thereafter. In that case, the mathematical representation of the value function is as in Equation (4) and its Bellman equivalent is as in Equation (5).
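Equations (4) and (5) are likewise absent from this excerpt; in the standard notation the action-value function and its Bellman Equation are usually written as:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a\right]
\qquad
Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\,\Bigl[\, r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \,\Bigr]
```

These use the same symbols as the state-value case, with the expectation additionally conditioned on the first action a.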
4. Reinforcement Learning Methods and Algorithms
RL methods can be divided into two categories: model-based and model-free methods [49][4]. In model-based methods, a model mimics the behavior of the environment: given a state and an action, it can immediately determine the agent’s reward and the future state. The agent utilizes the model to simulate different sequences of actions and predict their outcomes. These methods can achieve faster learning, as the model of the system is built beforehand, and can also provide higher sample efficiency, as the model can accurately generate additional training data to improve the agent’s performance. In model-free methods, the agent can be trained effectively in complex and unknown environments, as it learns directly from the previous experience it has gathered. A model that could be inaccurate and lead to potentially unwanted results does not exist; therefore, the agent relies solely on the trial-and-error approach. Model-free methods can, in turn, be divided into two more categories, policy optimization and Q-Learning, also known as policy-based and value-based methods, while model-based methods divide into ‘learn the model’ and ‘given the model’ approaches. This entry will focus on model-free methods and their subcategories, as the main advantages of RL lie in modeling- and environment-agnostic approaches. In conclusion of this section, a comprehensive table is presented, showcasing state-of-the-art RL algorithms. The classification of these algorithms is based on their respective approaches to policy optimization, Q-Learning, or a combination of both (see Table 1).
4.1. Policy-Optimization and Policy-Based Approaches
Policy-optimization methods directly update the agent’s policy; their objective is to discover the optimal policy that maximizes the expected cumulative reward, without explicitly learning a value function. They use stochastic policies, 𝜋𝜃(𝑎|𝑠), and attempt to optimize the policy parameters 𝜃 through various techniques. One technique is gradient ascent, which aims to find the maximum of an optimality function 𝐽(𝜋𝜃), as seen in Equation (8). Another technique is the maximization of local estimates of this function, as seen in Equation (9). Gradient-based policy-optimization algorithms are known as policy gradient methods, as they directly compute the gradient ∇𝜃𝐽(𝜋𝜃) of the expected cumulative reward with respect to the policy parameters and use this gradient to update the policy. Policy gradient algorithms are mostly on-policy methods, as every iteration uses data gathered by the most recent policy. They can learn stochastic policies, which is useful in environments with uncertainty, and they can handle continuous action spaces.
![](/media/common/202307/mceclip4-64b6374615876.png)
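The gradient-ascent update described above can be sketched with a REINFORCE-style estimator on a tiny tabular policy. Everything here is a hypothetical toy (two states, two actions, a hand-written episode), intended only to show the shape of the update 𝜃 ← 𝜃 + 𝛼∇𝜃𝐽(𝜋𝜃), not any particular algorithm from Table 1:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. the logits of state s
    for a softmax (tabular) policy."""
    probs = softmax(theta[s])
    g = np.zeros_like(theta)
    g[s] = -probs
    g[s, a] += 1.0
    return g

# Hypothetical tabular policy: theta[s, a] are logits, pi_theta(a|s) = softmax(theta[s]).
theta = np.zeros((2, 2))                   # 2 states, 2 actions
episode = [(0, 1, 1.0), (1, 0, 0.5)]       # hand-written (state, action, reward) tuples
gamma, alpha = 0.99, 0.1

# REINFORCE gradient estimate: sum_t grad log pi(a_t|s_t) * G_t
G = 0.0
grad = np.zeros_like(theta)
for s, a, r in reversed(episode):
    G = r + gamma * G                      # return from step t onward
    grad += grad_log_pi(theta, s, a) * G
theta += alpha * grad                      # gradient ascent on J(pi_theta)
```

After the update, the probability of each action taken during the episode rises in proportion to the return that followed it, which is exactly the on-policy behavior described above.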
Policy-optimization methods also include techniques that introduce a constraint to ensure policy changes are not too large, so that a certain level of performance is maintained; these are called trust-region optimization methods. Additionally, policy gradient algorithms include actor–critic methods, which consist of two components: the actor network, or policy function, responsible for choosing the agent’s action, and the critic network, or value-function estimator, responsible for estimating the value of state–action pairs. The actor receives the critic’s estimate and updates its policy with respect to the policy parameters. Actor–critic methods leverage the benefits of both value-based and policy-based methods. Some of the most well-known policy-optimization methods are Proximal Policy Optimization (PPO) [51][5], Trust Region Policy Optimization (TRPO) [52][6], Vanilla Policy Gradient (VPG) [53][7], Deep Deterministic Policy Gradient (DDPG) [54][8], Soft Actor–Critic (SAC) [55][9], Advantage Actor–Critic (A2C) [56][10] and Asynchronous Advantage Actor–Critic (A3C) [56][10].
4.2. Q-Learning and Value-Based Approaches
Q-Learning methods [57,58][11][12] are characterized as off-policy, model-free learning methods in which the agent learns an action-value function, the Q-function 𝑄𝜃(𝑠,𝑎), as an approximation of the optimal action-value function 𝑄∗(𝑠,𝑎). This function estimates the expected return of selecting action a in state s and following the optimal policy from the next state, s′, onward. The Q-function’s objective is to maximize the quality values of a state, the Q-values, by choosing the appropriate action for each given state, while simultaneously updating the Q-values of that state. The Q-function is given by Equation (10).
![](/media/common/202307/mceclip5-64b6376c0f315.png)
The best actions of the agent for a state, s, are selected through a greedy method called 𝜖-Greedy: the currently best-valued action is selected with probability 1−𝜖, while a random action is selected with probability 𝜖, where 𝜖∈(0,1).
![](/media/common/202307/mceclip6-64b6377d9d4e9.png)
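The Q-Learning update and the 𝜖-Greedy selection rule can be sketched together in tabular form. The two-state environment, reward values and hyperparameters below are hypothetical; the two essential pieces are the 𝜖-Greedy action choice and the off-policy update, which bootstraps from the greedy (max) next action regardless of which action is actually taken next:

```python
import random

random.seed(0)

# Hypothetical 2-state chain: action 1 in state 0 moves to state 1 with reward 1;
# every other (state, action) pair returns to state 0 with reward 0.
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def env_step(s, a):
    return (1, 1.0) if (s == 0 and a == 1) else (0, 0.0)

def epsilon_greedy(s):
    if random.random() < epsilon:                            # explore: probability epsilon
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])      # exploit: probability 1 - epsilon

s = 0
for _ in range(500):
    a = epsilon_greedy(s)
    s_next, r = env_step(s, a)
    # Off-policy Q-Learning update: target uses max over next actions.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next
```

After enough iterations the Q-values for the rewarded action dominate, so the greedy policy recovers the optimal behavior even though exploration occasionally picks random actions.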
The 𝜖-Greedy method is an approach to the exploitation-versus-exploration problem, in which the agent needs to find an equilibrium between exploiting previously found good actions that yield satisfying rewards and exploring new and potentially better, unknown actions. Q-Learning techniques are characterized by their simplicity and ease of implementation. They are utilized in continuous and discrete state and action spaces and have the ability to learn from experiences that were not generated by an optimal policy. Some well-known extensions and improved methods derived from Q-Learning are Double Q-Learning [59][13], Deep Q-Networks (DQNs) [60][14] and Dueling Q-Learning [61][15]. Double Q-Learning attempts to solve the overestimation of Q-values that occurs in certain scenarios by modifying the original Q-Learning algorithm to learn two Q-functions independently as estimators. The Q-function that yields the highest Q-value is the primary estimator, whose action is selected; the other Q-function is the secondary estimator used to estimate the Q-value of the selected action, which helps reduce the overestimation bias that can occur in standard Q-Learning. DQNs make use of deep neural networks, which take the current state as input and output the estimated Q-values of the Q-function; this is achieved by updating the weights of the network to minimize the mean squared error between the predicted and target Q-values. Dueling Q-Learning makes use of a state-value function, which estimates the value of a particular state, and an advantage function, which estimates the advantage of taking each action in that state. The Q-value is the sum of the state-value and advantage functions, minus the mean advantage over all actions in that state.
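The dueling aggregation just described is a single arithmetic step, sketched below with hypothetical numbers (one state, three actions); subtracting the mean advantage makes the decomposition into V and A identifiable:

```python
import numpy as np

V = 2.0                          # hypothetical state-value estimate V(s)
A = np.array([1.0, -1.0, 0.0])   # hypothetical advantage estimates A(s, a) per action
# Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')
Q = V + A - A.mean()
```

With these numbers the mean advantage is 0, so Q equals [3.0, 1.0, 2.0], and the mean of the resulting Q-values recovers V(s).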
Table 1. State-of-the-art RL algorithms.

| Policy Optimization | Q-Learning | Policy Optimization and Q-Learning-Based |
| --- | --- | --- |
| Vanilla Policy Gradient [53][7] | Double Q-Learning [59][13] | DDPG [54][8] |
| PPO [51][5] | DQN [60][14] | SAC [55][9] |
| A2C/A3C [56][10] | Dueling Q-Learning [61][15] | TD3 [62][16] |
| TRPO [52][6] | C51 [63][17] | |