Single-Agent Reinforcement Learning and Multi-Agent Reinforcement Learning: History

Flexible job shop scheduling (FJSP) is regarded as an effective measure to deal with the challenge of mass personalized and customized manufacturing in the era of Industry 4.0, and it has been widely extended to many real applications. Single-Agent Reinforcement Learning (SARL) refers to algorithms that contain only one agent, which makes all the decisions for a control system. Multi-Agent Reinforcement Learning (MARL) refers to algorithms comprising multiple agents that interact with the environment through their respective policies.

  • production planning and scheduling
  • multi-agent reinforcement learning
  • flexible job shop
  • path flexibility

1. SARL for Scheduling

SARL virtualizes an agent that interacts with the scheduling environment, learns a scheduling policy, and then makes decisions. Early applications of SARL to JSP can be traced back to Zhang and Dietterich (1995), who learned a heuristic evaluation function over states [18]. Subsequently, Aydin and Öztemel (2000) [19] applied reinforcement learning to choose dispatching rules depending on the current state of a production system. Since the proposal of the Deep Q-Network (DQN), using SARL to solve JSP has attracted increasing attention.

1.1. SARL with Value Iteration

Waschneck et al. (2018) [20] applied the DQN algorithm to a dynamic and flexible production problem with the objective of maximizing plant throughput. The proposed model took machine availability and processing characteristics as states and mapped these states to station selection. Luo (2020) [21] developed an optimization algorithm based on Double DQN (DDQN) for dynamic FJSP with order insertion; the algorithm selects appropriate scheduling rules according to the job state and obtains plans that outperform the general scheduling rules. Lang et al. (2020) [22] combined the DQN algorithm with discrete event simulation to solve a flexible job shop problem with process planning: two independent DQN agents are trained, one selecting operation sequences and the other assigning jobs to machines. Du et al. (2021) [6] considered an FJSP with a time-of-use electricity price constraint and the dual objectives of makespan and total electricity price, and proposed a hybrid multi-objective optimization algorithm combining an estimation of distribution algorithm with DQN. Li et al. (2022) [5] presented dynamic FJSPs with insufficient transportation resources (DFJSP-ITR) and proposed a hybrid DQN (HDQN) that incorporates double Q-learning, prioritized replay, and a soft target network update policy to minimize the makespan and total energy consumption. Gu et al. (2023) [23] integrated the DQN method into a salp swarm algorithm (SSA) framework to dynamically tune the population parameters of the SSA when solving JSP.
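To illustrate the general pattern shared by these value-based approaches, the following sketch shows a DQN agent that maps a shop-floor state vector to one of several dispatching rules. It is a minimal illustration under assumed state features, rule set, and reward; it does not reproduce the exact formulations of the cited works.

```python
# Minimal DQN sketch: map a shop-floor state vector to one of several
# dispatching rules. State features, rules, and reward are placeholders.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM = 8                           # e.g., machine utilizations, queue lengths
RULES = ["SPT", "LPT", "FIFO", "EDD"]   # hypothetical set of dispatching rules

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork(STATE_DIM, len(RULES))
target_net = QNetwork(STATE_DIM, len(RULES))     # synchronized periodically (omitted)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # (state, action, reward, next_state, done)
gamma, epsilon = 0.99, 0.1

def select_rule(state: torch.Tensor) -> int:
    """Epsilon-greedy choice of a dispatching rule at a decision point."""
    if random.random() < epsilon:
        return random.randrange(len(RULES))
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(batch_size: int = 32):
    """One DQN update from uniformly sampled transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.stack([torch.as_tensor(t[i], dtype=torch.float32)
                                      for t in batch]) for i in range(5))
    q = q_net(s).gather(1, a.long().view(-1, 1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Variants such as DDQN or prioritized replay, as used in several of the cited works, modify the target computation and the sampling of the replay buffer but leave this overall structure unchanged.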

1.2. SARL with Policy Iteration

Wang et al. (2021) [24] considered uncertainties such as mechanical failures in the job shop and proposed a dynamic scheduling method based on proximal policy optimization (PPO) to find the optimal scheduling policy. The state is defined by the job processing state matrix, the designated machine matrix, and the processing time matrix of the operations; the action is the operation selected from the candidate operation set; and the reward is related to machine utilization. The results showed that the proposed reinforcement learning approach obtained competitive solutions and achieved adaptive, real-time scheduling.
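The following sketch illustrates an MDP encoding of this kind: the state is assembled from a processing-status matrix, a machine-assignment matrix, and a processing-time matrix, actions are drawn from the candidate operation set, and the reward is tied to machine utilization. All dimensions and helper names are illustrative assumptions rather than the cited paper's exact definitions.

```python
# Sketch of the state/action/reward encoding described above.
import numpy as np

N_JOBS, N_OPS, N_MACHINES = 5, 4, 3

# State components: processing status, assigned machine, and processing time per operation.
processing_state = np.zeros((N_JOBS, N_OPS))                       # 0 = pending, 1 = finished
assigned_machine = np.random.randint(0, N_MACHINES, (N_JOBS, N_OPS))
processing_time = np.random.uniform(1.0, 10.0, (N_JOBS, N_OPS))

def state_vector() -> np.ndarray:
    """Flatten the three matrices into the observation fed to the policy network."""
    return np.concatenate([processing_state.ravel(),
                           assigned_machine.ravel(),
                           processing_time.ravel()])

def candidate_operations() -> list[tuple[int, int]]:
    """An action is the choice of one eligible (job, operation) pair:
    here, the first unfinished operation of each job."""
    candidates = []
    for j in range(N_JOBS):
        pending = np.where(processing_state[j] == 0)[0]
        if pending.size:
            candidates.append((j, int(pending[0])))
    return candidates

def reward(busy_time: np.ndarray, elapsed: float) -> float:
    """Utilization-based reward: average fraction of time the machines are busy."""
    return float(busy_time.mean() / max(elapsed, 1e-6))
```

At each decision point, the policy network scores the candidate operations and the choice is restricted (e.g., by masking) to the eligible pairs; the PPO update itself is standard and omitted here.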
The application of SARL to scheduling problems has some limitations. First, FJSP-DT occurs in an uncertain environment, and the information available to the agent is likely to be incomplete, which is difficult for SARL to handle since it depends on global information. Second, SARL does not address communication and collaboration between jobs or machines, resulting in the loss of important scheduling information [16]. Third, the action space of SARL expands with the number of jobs or machines [25], and a high action dimension poses a challenge to policy learning; research shows that the performance of policy gradient methods gradually degrades as the action dimension increases [26].

2. MARL for Scheduling

MARL aims to model complex environments in which each agent can make adaptive decisions, realizing competition and cooperation with humans and other agents, and it is attracting increasing attention in academia and industry [27].
From the perspective of the multi-agent system’s training paradigm, agent training can be broadly divided into distributed and centralized schemes:
  • Distributed Training Paradigm (DTP): In the distributed paradigm, agents learn independently of one another and do not rely on explicit information exchange.
  • Centralized Training Paradigm (CTP): The centralized paradigm allows agents to exchange additional information during training; this extra information is discarded during execution, when agents receive only locally observable information and independently determine actions according to their own policies. A minimal sketch of this centralized-training, decentralized-execution pattern is given after this list.
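The sketch below contrasts the two roles in the centralized paradigm: each actor acts from its local observation only, while a centralized critic that sees the joint observation is used during training and discarded afterwards. Network sizes and observation dimensions are illustrative assumptions.

```python
# Minimal centralized-training / decentralized-execution (CTDE) sketch in PyTorch.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 3, 10, 5

class Actor(nn.Module):
    """Acts from local observations only (used in both training and execution)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Sees the concatenation of all agents' observations; used during training only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_AGENTS * OBS_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, joint_obs):
        return self.net(joint_obs)

actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

# Execution: each agent samples an action from its own observation.
local_obs = [torch.randn(OBS_DIM) for _ in range(N_AGENTS)]
actions = [actor(o).sample() for actor, o in zip(actors, local_obs)]

# Training: the critic evaluates the joint observation, providing a common baseline.
joint_value = critic(torch.cat(local_obs))
```

The MADDPG and QMIX methods surveyed in Section 2.2 are concrete instances of this pattern.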

2.1. MARL with DTP

Regarding works on MARL with DTP, Aissani et al. (2009) [28] applied MARL to adaptive scheduling in multi-site companies. In the company’s multi-agent system, the supervisor agent sends requests for a solution to inventory agents and resource agents at different sites. The inventory agent asks the resource agent to propose a solution; the resource agent then runs its decision-making algorithm, based on SARSA (state–action–reward–state–action), using the system data (resource state, task duration, etc.) and sends back a solution. Martínez et al. (2020) [29] proposed a MARL tool for JSPs in which machines are regarded as agents. This tool allows the user either to keep the best schedule obtained by a Q-learning algorithm or to modify it by fixing some operations so as to satisfy certain constraints; the tool then optimizes the modified solution, taking the user’s preferences and the possible alternatives into account. Hameed et al. (2020) [30] presented a distributed reinforcement learning approach for JSPs, whose innovation is that the various relationships within the manufacturing environments (robot manufacturing cells) are modeled as graph neural networks (GNNs). Zhou et al. (2021) [31] proposed a new distributed architecture with multiple artificial intelligence (AI) schedulers for the online scheduling of orders in smart factories; each machine acts as a scheduler agent that collects the scheduling states of all machines as its training input and executes its own scheduling policy. Popper et al. (2021) [32] proposed a distributed MARL scheduling method for the multi-objective problem of minimizing energy consumption and delivery delay in the production process; the underlying algorithm is PPO, and the joint behavior of the agents is regulated through a common reward function. The approach can schedule any number of operations. Burggräf et al. (2022) [7] presented a deep MARL approach with a distributed actor-critic architecture to solve dynamic FJSPs, its novelty lying in the parameterization of the state and action spaces. Zhang et al. (2022) [8] constructed a multi-agent manufacturing system (MAMS) capable of online scheduling and policy optimization for the dynamic FJSP, in which the various machine tools are modeled as agents capable of environmental perception, information sharing, and autonomous decision making.
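As a minimal illustration of the distributed paradigm underlying several of these works, the sketch below gives each machine agent its own Q-table, updated solely from the agent's local state and reward, with no information exchanged between agents. States, actions, and rewards are hypothetical placeholders, not the formulations of the cited papers.

```python
# Distributed training sketch: independent tabular Q-learning per machine agent.
import random
from collections import defaultdict

N_MACHINES, N_ACTIONS = 3, 4      # e.g., four local dispatching choices per machine
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

class IndependentQAgent:
    def __init__(self):
        self.q = defaultdict(lambda: [0.0] * N_ACTIONS)   # local state -> action values

    def act(self, state) -> int:
        """Epsilon-greedy action from the agent's own Q-table only."""
        if random.random() < EPSILON:
            return random.randrange(N_ACTIONS)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update using only local experience."""
        td_target = reward + GAMMA * max(self.q[next_state])
        self.q[state][action] += ALPHA * (td_target - self.q[state][action])

agents = [IndependentQAgent() for _ in range(N_MACHINES)]

# Each agent observes, acts, and updates independently of the others.
for agent in agents:
    state = ("queue_len", 2)                       # hypothetical local state key
    action = agent.act(state)
    reward, next_state = 1.0, ("queue_len", 1)     # placeholder environment feedback
    agent.update(state, action, reward, next_state)
```

Because nothing is shared, each agent faces a non-stationary environment created by the others' learning, which is one motivation for the centralized training paradigm discussed next.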

2.2. MARL with CTP

Wang et al. (2021) [33] proposed a flexible and hybrid production scheduling problem and introduced a multi-agent deep reinforcement learning (MADRL) scheduling method based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Similarly, Wang et al. (2022) [9] introduced a decentralized partially observable Markov decision process (Dec-POMDP) model of a resource preemption working environment (PRE) and applied QMIX to solve the PRE scheduling problem, where each job is an agent that selects its action according to its current observation. Jing (2022) [10] designed a MARL scheduling framework based on graph convolutional networks (GCNs), namely a graph-based multi-agent system (GMAS), to solve the FJSP. First, a probabilistic model of the FJSP’s directed acyclic graph is constructed from the product processing network and the workshop environment; the FJSP is then modeled as a topological graph prediction process, and the scheduling policy is adjusted by predicting the connection probabilities of the edges.
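To make the value-mixing idea behind QMIX concrete, the following sketch shows a mixing network whose weights are generated from the global state by hypernetworks and constrained to be non-negative, so that the joint Q-value is monotonic in each agent's Q-value. This is a generic QMIX-style sketch with illustrative dimensions, not the configuration used in the cited work.

```python
# QMIX-style monotonic mixing of per-agent Q-values into a joint Q-value.
import torch
import torch.nn as nn

N_AGENTS, STATE_DIM, EMBED_DIM = 3, 12, 32

class QMixer(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypernetworks generate the mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(STATE_DIM, N_AGENTS * EMBED_DIM)
        self.hyper_b1 = nn.Linear(STATE_DIM, EMBED_DIM)
        self.hyper_w2 = nn.Linear(STATE_DIM, EMBED_DIM)
        self.hyper_b2 = nn.Sequential(nn.Linear(STATE_DIM, EMBED_DIM), nn.ReLU(),
                                      nn.Linear(EMBED_DIM, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        batch = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, N_AGENTS, EMBED_DIM)  # non-negative
        b1 = self.hyper_b1(state).view(batch, 1, EMBED_DIM)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, EMBED_DIM, 1)          # non-negative
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2
        return q_total.view(batch, 1)

mixer = QMixer()
agent_qs = torch.randn(4, N_AGENTS)   # chosen-action Q-values from each agent
state = torch.randn(4, STATE_DIM)     # global state, available during training only
q_tot = mixer(agent_qs, state)        # joint value used for the TD loss
```

During training, the temporal-difference loss is computed on the mixed value, while at execution time each agent simply acts greedily on its own Q-values; the monotonic mixing keeps this decentralized choice consistent with the joint value.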
In contrast to DTP, CTP shares information among agents through a centralized evaluation function, which makes learning more stable and leads to faster convergence. Consequently, as a solving method for FJSP-DT, MARL integrated with the CTP training paradigm offers great potential in terms of agility, adaptability, and accuracy, and is therefore the focus of this study.

This entry is adapted from the peer-reviewed paper 10.3390/machines12010008
