2. Markov Decision Processes
The general mathematical formulation of the RL problem is the Markov Decision Process (MDP). An MDP is a mathematical model that describes the interaction between the agent and the environment. It follows the Markov Property, which states that the conditional probability distribution of future states of a stochastic process depends only on the current state and not on the previous ones. In other words, the history of past states and the consequences of past actions does not determine the future states. The Markov Property is the basis for the mathematical framework of Markov models, which are widely used in a variety of fields, including physics, economics and computer science. MDPs describe fully observable RL environments and consist of a finite number of discrete states. The agent’s actions cause the environment to transition from one state to the next, and the goal is to discover a policy, i.e., a mapping from states to actions, that maximizes the expected cumulative reward, or return, over time.
An MDP can be described by the tuple (𝒮,𝒜,𝒫,ℛ,𝜌0,𝛾,𝜋), where:

𝒮 denotes the state space, which contains all the possible states of the environment;

𝒜 denotes the action space, which contains all the possible actions an agent can take;

𝒫 denotes the transition probability of the agent, 𝒫(𝑠t+1 | 𝑠t, 𝑎t), where 𝑡+1 is the next time step and 𝑡 the current one;

ℛ(𝑠t,𝑎t) denotes the reward function, which defines the immediate reward the agent receives for taking action 𝑎t in state 𝑠t;

𝜌0 denotes the start-state distribution;

𝛾∈[0,1) denotes the discount factor, which determines how strongly future rewards influence the current decision, with values closer to 1 giving future rewards greater weight;

𝜋(𝑎 | 𝑠) is the policy, which maps states to the actions taken.
In RL algorithms, the agent interacts with its environment in timesteps 𝑡=1,2,3,… In each timestep, the agent receives an observation of the environment state, 𝑠t∈𝒮, and then chooses an action, 𝑎t∈𝒜. In timestep 𝑡+1, the agent receives a numerical reward, 𝑟t∈ℛ⊂ℝ, for the action taken. The policy in an MDP problem gives the probability of taking action a when in state s. The main objective in an MDP problem is to find an optimal policy, 𝜋*, which ultimately leads to the maximization of the cumulative reward, given by Equation (1).
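The interaction loop and the discounted return it produces can be sketched in a few lines. This is a minimal illustration, not any specific library's API; the episode rewards and the discount factor below are invented toy values.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: G = sum over t of gamma^t * r_t."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Rewards r_1, r_2, r_3 collected over one hypothetical episode:
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```

With 𝛾 closer to 1, later rewards contribute more to the return, matching the role of the discount factor described above.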
3. Value Functions and Bellman Equations
In the RL framework, value functions have a critical role in estimating the expected cumulative reward that an agent will receive if it follows a particular policy and starts its course of action from a specific state or state–action pair ^{[3]}^{[4]}. A fundamental distinction between value functions and reward functions is that the latter provide immediate rewards to the agent, according to its most recent action, whereas the former estimate cumulative rewards in the long term. Thus, the value function estimates the expected sum of rewards that an agent will obtain from a specific state over the entire duration of its interaction with the environment. This makes value functions difficult to estimate, given that the value of a state or state–action pair must be estimated multiple times throughout an episode from state observations recorded during an agent’s lifetime. Bellman Equations are mathematical expressions utilized to compute the value functions in an efficient, recursive way. Specifically, the Bellman Equation expresses the value of a state or state–action pair as the sum of the immediate reward received in that state and the expected value of the next state under the policy followed by the agent ^{[3]}. The Bellman Equation takes into account that the optimal value of a state depends on the optimal values of its neighboring states, and it iteratively estimates and updates the value functions and the policy followed. This iterative process is repeated until the value functions converge to their optimal values, thereby reaching the optimal policy. When the agent initiates its actions from a specific state, the value function is referred to as the ‘state-value’ function, whereas when the agent initiates from a specific state–action pair, it is called the ‘action-value’ function. The state value describes the expected performance of an agent when it follows policy 𝜋 and has started from state s. The mathematical representation of a typical value function is given in Equation (2), while the Bellman Equation for the value function, which expresses the dependency of the value of one state on the values of future states, is given in Equation (3).
Respectively, the action-value describes the expected performance of an agent when it chooses action a while in state s and follows policy 𝜋. The mathematical representation of this value function is given in Equation (4) and its Bellman equivalent in Equation (5).
This equation is also known as the Q-function, which calculates the Q-value, or ‘quality value’, of the RL agent. These equations have their optimal equivalents, in which the policies followed are the optimal policies of the agent. The optimal equations, given by Equations (6) and (7), are called ‘Bellman Optimality Equations’ and describe the optimal action-value function in terms of the optimal value function. In reality, these equations describe a system of equations, with one equation per state.
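The iterative process described above, repeatedly applying the Bellman Optimality Equation until the value function converges, can be sketched as value iteration on a toy MDP. The two-state transition model, rewards, discount factor and convergence threshold below are all invented for illustration.

```python
# P[s][a] = list of (probability, next_state, reward) triples.
# A made-up deterministic two-state MDP: action 1 moves to state 1,
# where taking action 1 again yields reward 2 on every step.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman Optimality backup: V(s) = max_a E[r + gamma * V(s')]
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:
        break

# Looping on action 1 in state 1 yields V(1) = 2 / (1 - gamma) = 20,
# and V(0) = 1 + gamma * V(1) = 19.
print(round(V[0], 4), round(V[1], 4))
```

Each sweep updates every state's value from its successors' values, which is exactly the recursive dependency the Bellman Equation expresses.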
4. Reinforcement Learning Methods and Algorithms
RL methods can be divided into two categories: model-based and model-free methods ^{[4]}. In model-based methods, there is a model which mimics the behavior of the environment and can, given a state and an action, immediately determine the agent’s reward and the future state. The agent utilizes the model to simulate different sequences of actions and predict their outcomes. These methods can achieve faster learning, as the model of the system is built beforehand. They can also provide higher sample efficiency, as the model can accurately generate additional training data to improve the agent’s performance. In model-free methods, the agent is trained effectively in complex and unknown environments, as it learns directly from the experience it has gathered. There is no model that could be inaccurate and lead to potentially unwanted results; the agent therefore relies solely on a trial-and-error approach. Model-free methods can, in turn, be divided into two more categories, policy optimization and Q-Learning, also known as policy-based and value-based methods, while model-based methods are divided into ‘learn the model’ and ‘given the model’ methods. This entry focuses on model-free methods and their subcategories, as the main advantage of RL lies in model- and environment-agnostic approaches. At the end of this section, a comprehensive table is presented, showcasing state-of-the-art RL algorithms. The classification of these algorithms is based on their respective approaches: policy optimization, Q-Learning, or a combination of both (see Table 1).
4.1. Policy-Optimization and Policy-Based Approaches
Policy-optimization methods directly update the agent’s policy, and their objective is to discover the optimal policy that maximizes the expected cumulative reward, without explicitly learning a value function. They use stochastic policies, 𝜋𝜃(𝑎 | 𝑠), and attempt to optimize the parameters 𝜃 of the policy through various techniques. One technique is gradient ascent, which aims to find the maximum of an optimality function 𝐽(𝜋𝜃), as seen in Equation (8). Another technique is the maximization of local estimates of this function, as seen in Equation (9). Gradient-based policy-optimization algorithms are known as policy gradient methods, as they directly compute the gradient, ∇𝜃𝐽(𝜋𝜃), of the expected cumulative reward with respect to the policy parameters and use this gradient to update the policy. Policy gradient algorithms are mostly on-policy methods, as each iteration uses data gathered from the most recent policy. They can learn stochastic policies, which is useful in uncertain environments; they can also handle continuous action spaces and learn policies that are robust to uncertainty.
Policy-optimization methods also include techniques that introduce a constraint to ensure policy changes are not too large and maintain a certain level of performance. These are called trust-region optimization methods. Additionally, policy gradient algorithms include actor–critic methods, which consist of two components: the actor network, or policy function, responsible for choosing the agent’s actions, and the critic network, or value function estimator, responsible for estimating the value of state–action pairs. The actor receives the critic’s estimate and proceeds to update its policy with respect to the policy parameters. Actor–critic methods leverage the benefits of both value-based and policy-based methods. Some of the most well-known policy-optimization methods are Proximal Policy Optimization (PPO) ^{[5]}, Trust Region Policy Optimization (TRPO) ^{[6]}, Vanilla Policy Gradient (VPG) ^{[7]}, Deep Deterministic Policy Gradient (DDPG) ^{[8]}, Soft Actor–Critic (SAC) ^{[9]}, Advantage Actor–Critic (A2C) ^{[10]} and Asynchronous Advantage Actor–Critic (A3C) ^{[10]}.
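The gradient-ascent idea behind policy gradient methods can be sketched with a REINFORCE-style update for a softmax policy over two actions in a one-step task. The rewards, learning rate and iteration count are assumed toy values, and the analytic ∇𝜃 log 𝜋𝜃 of the softmax is used in place of a real environment and autodiff library.

```python
import math
import random

theta = [0.0, 0.0]          # one logit (policy parameter) per action
reward = {0: 0.0, 1: 1.0}   # action 1 is better (invented values)
alpha = 0.1                 # learning rate (assumed)

def policy(theta):
    """Softmax policy pi_theta(a) over the two actions."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

random.seed(0)
for _ in range(2000):
    probs = policy(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample an action
    r = reward[a]
    # Gradient ascent: theta += alpha * r * grad log pi(a),
    # where d(log pi(a))/d(theta_k) = 1[k == a] - probs[k] for softmax.
    for k in range(2):
        theta[k] += alpha * r * ((1 if k == a else 0) - probs[k])

# Probability mass should have shifted onto the higher-reward action.
print(policy(theta)[1])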
4.2. Q-Learning and Value-Based Approaches
Q-Learning methods ^{[11]}^{[12]} are off-policy learning methods that do not require a model of the environment; the agent learns an action-value function, the Q-function 𝑄𝜃(𝑠,𝑎), as an approximation of the optimal action-value function, 𝑄∗(𝑠,𝑎). This function estimates the expected return of selecting action a in state s and following the optimal policy from the next state, s′, onwards. The Q-function’s objective is to maximize the quality values of a state, the Q-values, by choosing the appropriate action for each given state, while simultaneously updating the Q-values of that state. The Q-function is given by Equation (10).
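A single tabular Q-Learning update can be sketched as follows; the transition, the table entries and the hyperparameters 𝛼 and 𝛾 are assumed toy values, not taken from the original text.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-Learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 4.0},
}
# The agent took "right" in s0, received reward 1 and landed in s1:
print(q_update(Q, "s0", "right", 1.0, "s1"))  # 0.5 * (1 + 0.9*4) = 2.3
```

The max over next-state actions is what makes the method off-policy: the target assumes greedy behavior from s′ regardless of which action the agent actually takes next.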
The best actions of the agent for a state, s, are selected through a greedy method called 𝜖-Greedy. The selection of these actions happens with probability 1−𝜖, while a random action is selected with probability 𝜖, where 𝜖∈(0,1).
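The 𝜖-Greedy selection rule can be sketched in a few lines; the Q-values and the value of 𝜖 below are invented for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest Q-value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)      # exploit

q = {"left": 0.2, "right": 0.7}
random.seed(1)
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
# The greedy action "right" dominates, but random exploration still occurs.
print(picks.count("right") / len(picks))
```

Decaying 𝜖 over training is a common variant: heavy exploration early, mostly exploitation once the Q-values are reliable.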
The 𝜖-Greedy method is an approach to the exploration-versus-exploitation problem, where the agent needs to find an equilibrium between exploiting previously found good actions which yield satisfying rewards and exploring new and potentially better, unknown actions. Q-Learning techniques are characterized by their simplicity and ease of implementation. They are utilized in continuous and discrete state and action spaces and can learn from experience that was not gathered under an optimal policy. Some well-known extensions and improved methods derived from Q-Learning are Double Q-Learning ^{[13]}, Deep Q-Networks (DQNs) ^{[14]} and Dueling Q-Learning ^{[15]}. Double Q-Learning attempts to solve the problem of overestimation of Q-values in certain scenarios by modifying the original Q-Learning algorithm and learning two Q-functions independently as estimators. One Q-function acts as the primary estimator that selects the action with the highest Q-value, while the other acts as the secondary estimator used to estimate the Q-value of the selected action, which helps reduce the overestimation bias that can occur in standard Q-Learning. DQNs make use of deep neural networks which take the current state as input and output the estimated Q-values of the Q-function. The network is trained by updating its weights to minimize the mean squared error between the predicted and target Q-values. Dueling Q-Learning makes use of a state-value function, which estimates the value of a particular state, and an advantage function, which estimates the advantage of taking an action in that state. The Q-value is the sum of the advantage and state-value functions, minus the mean advantage over all actions.
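The dueling aggregation step, combining the state value and the advantages into Q-values, can be sketched directly; the value and advantage numbers below are invented, and in a real Dueling DQN both quantities would come from separate heads of a neural network.

```python
def dueling_q(state_value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a')."""
    mean_adv = sum(advantages.values()) / len(advantages)
    return {a: state_value + adv - mean_adv for a, adv in advantages.items()}

q = dueling_q(state_value=2.0, advantages={"left": -1.0, "right": 1.0})
print(q)  # {'left': 1.0, 'right': 3.0}
```

Subtracting the mean advantage makes the V/A decomposition identifiable: without it, a constant could be shifted between the two terms without changing the resulting Q-values.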
Table 1. State-of-the-art RL algorithms.

| Policy Optimization | Q-Learning | Policy Optimization and Q-Learning-Based |
|---|---|---|
| Vanilla Policy Gradient ^{[7]} | Double Q-Learning ^{[13]} | DDPG ^{[8]} |
| PPO ^{[5]} | DQN ^{[14]} | SAC ^{[9]} |
| A2C/A3C ^{[10]} | Dueling Q-Learning ^{[15]} | TD3 ^{[16]} |
| TRPO ^{[6]} | C51 ^{[17]} | |