Deep Reinforcement Learning-Based BEMS per Building Type


Please note this is a comparison between Version 1 by Ayas Mahr Shaqour and Version 2 by Catherine Yang.

The field of deep reinforcement learning (DRL)-based building energy management systems (BEMS) has grown rapidly in the last five years, with numerous creative ideas and innovations for integrating advanced data-driven control methods into the development of fully enabled smart buildings. Although residential buildings are by far the largest energy consumers, other building types, such as offices and educational buildings, are also being investigated. It is therefore useful to map the different research directions, types of applications, and innovative ideas being implemented for each building type. This is particularly crucial from a data-centric perspective, because training and using data-driven methods requires large amounts of data, especially when such systems are deployed in the real world.

building energy demand
deep reinforcement learning
data-driven control
energy demand prediction
energy efficiency
residential building
office building
commercial building
data centre

1. Residential Buildings

As previously mentioned, residential buildings account for almost 22% of global energy demand, making them one of the most energy-consuming building types. Additionally, some types of commercial buildings are primarily used by employees during the day, especially in the post-COVID-19 era, and, depending on the work culture of the country, they can be more grid-friendly because their demand is better aligned with solar energy availability. Solar energy is mainly available during the day, whereas residential energy demand increases after work hours and peaks in the evening, which can be a sensitive period for grid operators, who must compensate for the change in the supply-demand balance. This could be one underlying reason why residential buildings are receiving significant attention in this field, particularly from a demand response (DR) perspective. Table 14 presents recent research conducted on DRL-based BEMS in residential buildings.

Table 14.

Recent application of DRL-based BEMS on residential buildings.

Ref	Year	Building Study Scale	BEMS	ESS	PV	DR	DRL	Estimator	Unique Objective	Real System	Energy */Cost Saving

The primary objectives of most BEMS are typically the same: maintaining comfort and reducing energy use or cost. Energy and cost are highly correlated, in that a reduction in one implies a reduction in the other, although studies report their primary improvements in terms of either energy or cost depending on whether DR is considered and, hence, whether energy prices are included in the analysis. Other secondary objectives highlighted by some studies include health factors, such as indoor CO₂ levels, and the reduction of peak demand. The reported savings usually refer to the improvement over a rule-based baseline controller or to a comparison between single- and multiple-agent methods. Hence, high energy-saving percentages do not necessarily reflect the overall energy reduction, which makes it harder to cross-compare studies based on these numbers. Nevertheless, they highlight the energy-saving advantages of utilizing DRL in residential buildings. Finally, real implementations are significantly lacking: only three of the 31 studies (<10%) validated their models outside a simulation environment, which highlights a clear research gap.
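The dependence of a reported saving on its reference can be illustrated with a short calculation. The following is a minimal sketch with hypothetical numbers that are not taken from any reviewed study; the function name `saving_percent` and the three consumption values are assumptions made only for illustration.

```python
def saving_percent(reference_kwh: float, controlled_kwh: float) -> float:
    """Percentage reduction of controlled_kwh relative to reference_kwh."""
    if reference_kwh <= 0:
        raise ValueError("reference consumption must be positive")
    return 100.0 * (reference_kwh - controlled_kwh) / reference_kwh


# Hypothetical consumption over the same evaluation period (illustration only).
rule_based_kwh = 1000.0  # rule-based baseline controller
drl_kwh = 800.0          # DRL-based controller
unmanaged_kwh = 1600.0   # no energy management at all

vs_baseline = saving_percent(rule_based_kwh, drl_kwh)
vs_unmanaged = saving_percent(unmanaged_kwh, drl_kwh)
print(f"{vs_baseline:.1f}% vs. baseline, {vs_unmanaged:.1f}% vs. unmanaged")
```

The same controller yields two different percentages depending on the reference, which is why figures from different studies are hard to compare directly.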

2. Office Buildings

Office buildings offer a limited variety of controllable appliances apart from HVAC systems and, because they are mostly located in cities and in high-rise buildings, have limited space for installing renewable energy. With these facts in mind, the recent applications of DRL-based BEMS in offices are listed in Table 36.

Table 36.

Recent applications of DRL-based BEMS in office buildings.

Ref	Year	BEMS	ESS	PV	DR	DRL	Estimator	Unique Objective	Real System
^[1][51]	2022	Single	HVAC	x	x	o	DQN	DNN	-
^[72]	-	[	19.40%
118]	2022

Finally, Beimann et al. conducted a comparative analysis of four different DRL methods for controlling the simulated HVAC system of a data centre. Their computational experiments revealed that SAC has exceptionally high sample efficiency, reaching stable performance with 10 times less data than PPO, TRPO, and TD3; hence, it is recommended for future use, particularly in noisy environments. Moreover, all models were reported to achieve an energy reduction of approximately 10% in comparison to a baseline controller ^[74][120].

5. Other Commercial Buildings

Finally, Table 69 includes commercial buildings that are not classified as educational buildings, offices, or data centres. These are described in the studies as commercial buildings, storehouses, industrial parks, or a mix of building types (retail and restaurant buildings, offices, and residential buildings) ^[77][78][123,124].

Table 69.

Recent applications of DRL-based BEMS in other commercial buildings.

Ref	Year	Scale	BEMS	ESS	PV	DR	DRL	Estimator	Unique Objective	Real System	Energy */Cost Savings
HVAC	o	x	o	PPO-Clip
78
]
2020
Multi
TES
o
o
o
SAC
DNN
Peak
-
-
^[
³¹
^]
[17]	2018	Multi	HVAC, EV, Appliances	x	o	o	DQN, DDPG	DNN	Peak	-	27.40%

o: included, x: not included, * Energy saving, ^▲ Overall energy/cost saving (others are an improvement over a baseline controller). Table Abbreviations: (ACKTR) Actor-critic using Kronecker-factored trust region; (A2C) Advantage actor-critic; (BCNN) Bayesian convolutional neural network; (CNN) Convolutional neural network; (DDQN-PER) Double deep Q-learning with prioritized experience replay; (DNN) Deep neural network; (EV) Electric vehicle; (RLMPC) Reinforcement learning model predictive control; (SAC) Soft actor-critic; (TES) Thermal energy storage; (TD3) Twin delayed DDPG; (WHP) Water heating pump.

As indicated in Table 14, DRL-based BEMS research can consider one or multiple buildings, either to measure the performance of DRL algorithms under different scenarios or to test a multiple-agent DRL approach that manages energy flow across multiple buildings or zones simultaneously ^[27][75]. Glatt et al. introduced a decentralized actor-critic reinforcement learning algorithm, MARLISA; however, they focused on integrating a centralized critic (MARLISA_DACC) to coordinate the control of energy storage systems (ESS), such as batteries and thermal energy storage (TES), between various buildings in a manner that enhances DR performance and reduces carbon footprints ^[28][76]. As the scale of residential buildings increases, multiple-agent approaches can learn to share information and act in a positively correlated manner, improving BEMS performance over single-agent approaches. Ahrarinouri et al. utilized distributed reinforcement learning energy management (DRLEM) to control the energy flow of combined heat and power (CHP) units and boilers between multiple buildings, where the connection between the multiple agents reduced heat losses and costs by 18.3% and 3.3%, respectively, and increased energy sharing at peak time by 23% ^[25][73]. Hence, distributed and multi-agent approaches will be key methods in further research on residential neighbourhoods and buildings, where renewable energy and EVs can be coordinated between different houses to reduce renewable energy curtailment and maximize profits in peer-to-peer local energy trading hubs.

The large variety of appliances and BEMS targets is a major opportunity for deploying DRL-based BEMS in residential buildings, and there is a high potential for DR because of their contribution to both morning and evening peak demands ^[32][79] and because detached houses have space for renewable energy integration. In the recent literature reviewed in Table 14, 77% of the studies considered demand response systems, where the varying electricity price was integrated into the objectives of the control logic, while 42% and 45% also considered the integration of ESS and PV renewable energy, respectively. Furthermore, while 74% of the systems were deployed to manage HVAC systems, 32% of the studies included different types of shiftable/fixable appliances, and 19% investigated the inclusion of electric vehicles (EVs). Table 14 classifies the general BEMS target systems in residential buildings, while Table 25 provides a detailed list of the appliances that were directly controlled, apart from HVAC systems and TES; notable appliances include dishwashers, washing machines, and EVs. The diversity of BEMS targets in residential buildings is considerably high, giving this building type a unique potential and research perspective. This is probably related to the fact that homeowners might have higher relative demand flexibility than office occupants, for example, owing to direct cost benefits. In offices, the operating environment tends to involve higher levels of stress, and individuals gain no direct benefit from compromising their comfort, because the benefit goes to the business owners.
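How a varying electricity price can enter the control objective may be sketched as a per-step reward that penalizes energy cost and comfort violations. This is only an illustrative form, not the reward used by any of the reviewed studies; the names `StepOutcome` and `reward`, the comfort band of 20 to 24 °C, and the weight `comfort_weight` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    energy_kwh: float     # energy drawn from the grid during this time step
    price_per_kwh: float  # time-varying electricity price (the DR signal)
    indoor_temp_c: float  # indoor temperature at the end of the step


def reward(step: StepOutcome,
           comfort_low_c: float = 20.0,
           comfort_high_c: float = 24.0,
           comfort_weight: float = 1.0) -> float:
    """Negative of (energy cost + weighted comfort violation)."""
    cost = step.energy_kwh * step.price_per_kwh
    # Degrees outside the assumed comfort band; zero when inside it.
    violation = (max(comfort_low_c - step.indoor_temp_c, 0.0)
                 + max(step.indoor_temp_c - comfort_high_c, 0.0))
    return -(cost + comfort_weight * violation)


# Same energy use at an off-peak and a peak price (hypothetical values).
off_peak = reward(StepOutcome(energy_kwh=2.0, price_per_kwh=0.10, indoor_temp_c=22.0))
peak = reward(StepOutcome(energy_kwh=2.0, price_per_kwh=0.30, indoor_temp_c=22.0))
# No energy use, but the room drops 2 degrees below the comfort band.
too_cold = reward(StepOutcome(energy_kwh=0.0, price_per_kwh=0.30, indoor_temp_c=18.0))
```

With such a reward, the agent is pushed to shift consumption toward low-price periods, while `comfort_weight` sets how much discomfort is tolerated in exchange for cost savings.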

Table 25.

Residential appliances controlled using DRL-based BEMS.

Appliance	#No.	Reference
HVAC
^[	19	^34][80]^{[2][3][4][6][7]}^[8]^[10]^[11]^[13]^[14]^[15]^[17	DNN	-	HVAC	x	x	x	SAC
^[	DNN	^79][	Operation	125^]^[¹⁸^]^[²³^]^[²⁴^]^[²⁶^[³¹^][13,17^][,52^27][,53²⁹,54,56,57,58,60,61,63,64,	-	3–5.5%	-
]	65	,67,71^],72,74,75,77]	2022	CHP, Battery, PV	9.17%
2022	Single	HVAC	x	x	x	MA-CWSC, DQN	DNN	-	-	11.10% *	^[2][52]	^35][812022	]	2022Single	HVAC	x	x	o	DQN, DDPG
Washing machine	8	DNN	-	√	25.9–32%
^[	³	^]^[5]^[7]^[8]^[10]^[15]^[16]^[26][53,55,57,58,60,65,66,74]	HVAC	x	x	x	DQN	DNN	-	-	-	^[3][53]	2022	Single	^][17,
^[36]	53	,55,57,58,60,65,66]
o	o	x	DDPG	HVAC, EV, Appliances	o	o	o	ACKTR	Kronecker-Factored	-	-	25.37% ^▲
[82]	2022	HVAC, TES	o	o	o	SAC	DNN	Self-consumption/Sufficiency	-	^[4][54]	2022	Single	HVAC	o	o	o	Clustering-DDPG	DNN	-	-	41%
^[5][55]	2022	Single	Appliances	x
Dish washer	8	^[3]^[5]^[7]^[8]	39.5–84.3%	Electric vehicle (EV)	6	^[3]^[7]^[10^[³¹^]
^[37][83]	[	17,25^][,53^15][,57^20],60,65]
2022	HVAC	x	x	x	PPO	DNN	o	o	DQN	DNN	Peak demand	-	30%
^[67][113]	2022	University	Multi	TES	o	x	^[73]o	[119]SAC	2022	HPC/AI Cluster	x	xDNN	Load-Factor	-	x	DQN6.72%
^[	DNN	^80][	Operation	126	√	40%
]	2022	Storehouse	HVAC	x	x	x	DDQN	DNN	-	-	34.20% *	^[63][109]	2022	University	Lab.	^[Ventilation	x	x	x	DQN	DNN	^74][Health	120]	2021
^[	HVAC	^81][	x	127x	x	]SAC, PPO, TD3, TRPO	DNN	Operation	-	2022	Industrial Park	HVAC	x	x	o	Dueling SAC√	2.4–43.7%	Water heating pump (WHP)	5	^[6]^[9]^[12]^[20]^[27][25,56,59,62,75]
DNN	-	-	48.97% *
10%
DNN	-	-	2.80%	^▲	^[68][114]	2022	University	Single	HVAC	x	x	x	SAC	^[DNN	⁷⁵
^[77]	-	√	-
^][121]	[1232019	HVAC	x	]x	2022	Multi	HVAC, WHP, Inverter, Batteryx	DQN	DNN	o	o	o	PPOOperation	-	-	DNN	Over/Under voltage	-	-	^[38][84]	2022	^[69][115]HVAC, PCSs	x	x	2021	University	Multi	HVACx	MAAC	-	-	0.7–4.18% *^,▲	x	x	x
^[	DDPG	⁷⁶	DNN	-	^][122	-	15.40% *	^,▲
]	2019	HVAC	x	x	x	Model-Based DRL, PPO	DNN	Operation
^[82][128]	2022	Single	-	17.1–21.8%	^[6][56]	Underfloor heating2022	^[Single	⁶⁴HVAC, WHP	^]2	[110]^[19][33][49,68]x	o	x	DDQN	DNN	Health	-	7–60% *^,▲
-	2020	Clothes dryer	2	^[3][53^[16],66]
-	Vacuum cleaner	1
^[39][85]	2022	Chiller, TES	o	x	o	SAC	DNN	SchoolDiscomfort	-	-	Single	HVAC	x	x	x	DDPG	DNN	Health	-	21% *^,▲	^[7][57]
^[	2022	⁷⁰Single	^][HVAC, EV, Appliances	o
^[40116]	2020	University	x	^][86	o	A2C	DNN	-	-	23%
]	2022	Battery, fan coil units	o	o	o	Dueling DQN	DNN	Discomfort	-	8%	Single	HVAC	^[8][58]	^[16
x	x	x	PPO	DNN	-	-	10.80% *	^[41]2022	[87]Single	^][66HVAC, Appliances	o	o	o	MDRL	DNN	]-	-	25.80%
2022	HVAC	x	x	x	A3C	DNN
^[65][	-	-	16.10% *	111]	2017	School	Single	HVAC	x	x	x	fitted Q-iteration	-	-	√	33% *^,▲	^[9][59]	2022	Single	HVAC	x	x	x	DDQN	Passive heating and cooling	1
^[42][88]	^[	²²	DNN	Health	-	23.80%	^▲
^[	¹⁰	^]^[¹⁵^]^[^16][	202270]
HVAC	x	x	x	^[10][60]	2022	Single	HVAC, EV, Appliances	o	x	o	DQN	DNN	-	Boiler	1^][73]
-	A3C	DNN	-	-	12.80% *	-	21.30%
^[	²⁵	^[11][61]	2021	LightSingle	1
^[	^[	^10]	HVAC	[60]
Ventilation	1	^[13][63]
Grinder	1	^[16][66]

For the DRL methods, the most-utilized algorithm was DQN, while DDQN and DDPG were also notable. Many studies include a comparison between different types of DRL to determine the best method for realizing the system objectives. Meanwhile, others investigated hybrid methods, such as the mixed deep reinforcement learning (MDRL) introduced by Huang et al. ^[8][58], which combines DQN and DDPG for enhanced performance, and the RLMPC implemented by Arroyo et al. ^[19][68], which combines MPC and DDQN in a manner that leverages the benefits of both methods. Two recent unique variations of DRL were also observed. First, the actor-critic approach using the Kronecker-factored trust region (ACKTR) introduced by Chu et al. ^[3][53] increased the sampling efficiency and integrated discrete and continuous action spaces, exhibiting high potential. The second algorithm is a combination of clustering and DDPG developed by Zenginis et al., which homogeneously partitions the training data using a clustering method and then trains a separate agent on each subset of the training data, achieving higher energy efficiency than a single agent ^[4][54]. While these methods are not directly related to the type of building, highlighting them can aid researchers in choosing recently advanced implementations of DRL on the basis of their application and building type. Finally, DNNs have been the most used value/policy function estimators, whereas very few studies used other methods, such as CNNs. In general, owing to the mixed types of state variables, DNNs can effectively map state–action spaces and can be considered the default estimator; however, this also indicates that there is potential for testing other methods.
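The idea of partitioning the training data by clustering before training one agent per subset can be sketched as follows. This is not the implementation of Zenginis et al.: plain k-means is swapped in as the clustering method, the daily features are hypothetical, and the DDPG training of each agent is deliberately left out.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids are initialised from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical daily features: (mean outdoor temperature in C, mean PV output in kW).
days = [(2.0, 0.5), (28.0, 4.2), (3.5, 0.8), (30.0, 4.8), (1.0, 0.3), (27.0, 3.9)]
labels = kmeans(days, k=2)

# Group the training days by cluster; one agent would be trained per subset.
subsets = {}
for day, label in zip(days, labels):
    subsets.setdefault(label, []).append(day)

for label, data in sorted(subsets.items()):
    print(f"cluster {label}: {len(data)} training days")
```

At run time, a new day would be assigned to its nearest cluster and handled by the agent trained on that subset.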

^[
⁴³
^]
[
89
]
2022
HVAC	x	x	x	BDQ	DNN	-	-	14% *	^,▲
^[44][90	x

o: included, x: not included, * Energy saving, ^▲ Overall energy/cost saving (others are an improvement over a baseline controller).

Finally, a recent innovative idea introduced by Zhou et al. combines DRL with deep learning for building energy prediction. It was not included in Table 47 because it is only indirectly related to the BEMS. They utilized DDPG to add a further learning layer to an LSTM forecaster by having the agent learn to tune the hyperparameters of the LSTM as new training data arrive. They demonstrated that when there is a high variation in the new training data, the prediction accuracy can be increased by up to 23.5% ^[71][117].

4. Data Centres

As listed in Table 58, few studies have investigated data centres. In these studies, the BEMS does not consider DR, renewable energy, or storage systems and is primarily focused on HVAC systems. In general, the main objective of the BEMS is to lower energy demand while meeting operational constraints. Whereas comfort can be slightly compromised in other building types, the operational efficiency of data centres is a more sensitive system target, because degrading it can compromise the data centre's main operation.

Table 58.

Recent applications of DRL-based BEMS in data centres.

Ref	Year	BEMS	ESS	PV	DR	DRL	Estimator	Unique Objective	Real System	Overall Energy Saving

o: included, x: not included. Table Abbreviations: (TRPO) Trust Region Policy Optimization.

One unique study implemented by Narantuya et al. utilized a multi-agent DRL (mDRL) based on a DQN to optimize computational resource allocation in high-performance computing (HPC)/AI systems. Their system was further deployed in real time, reducing the task completion time by 20% and the energy consumption by 40% ^[73][119].

HVAC

DDQN

DNN

50% *

^[83][16]

2021

Single

HVAC

MAAC

DNN

Health

56.50–75.25%

^[78][124]

2021

Multi

HVAC, TES

SAC

DNN

7% *, 4%

^[84][129]

2021

Multi

HVAC, TES

SAC

DNN

Peak

23% ^▲

^[85][130]

2020

Single

HVAC

A3C, Apex-DQN

DNN

^[86][131]

2019

Single

HVAC

DQN

DNN

]

19.48%

2022

HVAC

PPO, A2C

DNN

Discomfort

4–22%*

^[12][62]

PPO

DNN

22% *

^[2021

⁴⁵Single

WHP

^]x

DQN

DNN

[91]19–35%

2022

HVAC

DQN

DNN

Emissions

^[13][63]

2021

Single

^[46

HVAC

^][92

DDQN-PER

DNN

Health

]

3.51–8.56%

2021

HVAC

DQN

DNN

6% *

^[14][64]

2021

Single

^[47

HVAC

^][93

DDPG

DNN

]

12.7–50% ^▲

2021

HVAC

DQN

DNN

Health

^[15][65]

2021

^[48][94]

2021

Single

HVAC, EV, Appliances

HVACo

SACTD3, DQN, DPG

DNN

DNN5.93–12.45%

9.70%

^[16][66]

2020

^[49][95]

2021

Single

Appliances

HVAC, Blindo

BDQN, SAC, PPODQN

CNN

Peak demand

DNN

-11.66% ^▲

11.0–31.8%

^[17][67]

2020

^[50

Single

^][96

HVAC

DQN

DNN

43.89%

2021

PPO

DNN

62.5%

^▲

^[18][13]

2020

^[51][97]

Single

HVAC, Battery

DDPG

DNN

2021

HVAC

8.10–15.21%

^[19][68]

2022

Single

HVAC

RLMPC vs. (DDQN, MPC)

DNN

³¹

PPO

DNN

4.5–13.2%

^[52][

98]

2021

HVAC

SAC

DNN

Temperature violation

^[20][25]

2022

Single

WHP, EV

DDPG

DNN

√

30% *

^[21]

^[53][99]

2021

HVAC, Battery

DDPG

DNN

39.60%

[69]

2021

^[54]

Single

TES

REINFORCE

[

DNN

100-

50%

]

2020

HVAC

DQN

DNN

Health

15.70% *

^[22][70]

2021

Single

HVAC

REINFORCE

Monte-Carlo PG

13–64%

^[55][101]

2020

Water Heating

DDQN

DNN

5–12% ^▲

^[23][71]

2020

Single

HVAC

^56][

102

DQN

DNN

]

√

2020

21% ^▲, 30%

HVAC, Battery, EV, EWH

DQN

DNN

^[24][72]

^[57

2020

^][103Multi.

]

2020HVAC

HVAC

DQN

BCNN

53% *

DDPG

DNN

27–30%

^▲

^[25][73]

2022

⁵⁸Multi

CHP, Boiler

^]x

DRLEM

104-

3.30%

2019

HVAC, Light, Blind

BDQ

DNN

8.1–14.26%

^[26][74]

2022

⁵⁹Multi

HVAC, Appliances, Battery

^]o

A2C

105DNN

Peak

5–35%

2019

HVAC

DQN

DNN

12.4–32.2% *

^[27][75]

2022

Multi

HVAC, WHP, Appliances

SAC

^[60

DNN

^][106

]

3–7%

2019

HVAC

A3C

DNN

√

16.70% *

^[28][76]

2021

⁶¹Multi

TES, Battery

^]o

MARLISA-_DACC

107DNN

Emissions

2018

HVAC

A3C

DNN

√

16.6–18.2% *

^[29][77]

2021

Multi

HVAC

DQN

^[62

DNN

^][108

]

5–12%

2018

HVAC

A3C

DNN

√

15%

^▲

^[30][

o: included, x: not included, * Energy saving, ^▲ Overall energy/cost saving (others are an improvement over a baseline controller). Table Abbreviations: (A3C) Asynchronous advantage actor-critic; (BDQ) Branching-Dueling Q-network; (CHP) Combined heat and power; (EWH) Electric water heater; (MAAC) Multi-agent actor-critic; (PCS) Personal comfort systems.

The number of recent office building-related studies is comparable to that of residential buildings. The first difference can be noticed in the appliance categories, which are primarily related to HVAC systems. Only two studies investigated EVs, while a few other control targets were investigated, such as TES, blind control, light control, and personal comfort systems (PCSs). HVAC systems are the main energy consumers in offices and have the flexibility and potential to save energy. In addition to HVAC control, recent innovations can be found for BEMS integrated with EVs. Liang et al. included EVs in their BEMS, which utilized a safe reinforcement learning (SRL) strategy to mitigate the effect of extreme weather events and increase building resilience and proactivity ^[56][102]. Meanwhile, Mbuwir et al. used EVs as the core and only BEMS target in an office building and showed that a multi-agent DRL approach can achieve a promising saving potential of up to 62.5% ^[50][96]. Furthermore, only 24% of the studies considered DR systems, and only 21% included PV or energy storage systems.

The DRL methods utilized in office buildings are more diversified than those observed in residential buildings and include the asynchronous advantage actor-critic (A3C) and the soft actor-critic (SAC). These methods have shown improved performance over rule-based baseline controllers, although one downside is that they have not always been compared with other DRL methods. Zhang et al. introduced a branching–dueling Q-network (BDQN) and compared it to both PPO and SAC. They reported that BDQN converged to the highest reward, followed by SAC, revealing higher sample complexity than their counterpart; both performed slower than PPO but consumed less memory. Hence, this revealed a trade-off between time, RAM usage, and reward. Another comparison, between the advantage actor-critic (A2C) and PPO, was conducted by Lee et al., where A2C exhibited better performance ^[44][90]. Such comparisons are useful in guiding researchers to choose the best subset of algorithms from the current large pool of DRL algorithms.

A critical observation related to office buildings is the significance of indoor thermal comfort in realizing the high productivity of workers. This can be observed in four studies that highlighted the reduction in discomfort or temperature violations as a system objective. Because there is less DR inclusion in the BEMS, a higher number of studies have reported energy savings rather than cost savings in comparison to residential buildings. Finally, only three studies conducted by Zhang et al. implemented and validated their models in real systems ^[60][106].

3. Educational Buildings

As shown in Table 47, recent research on educational buildings mainly covers schools and university facilities and laboratories. The BEMS targets were primarily HVAC systems, while one study investigated TES control and another investigated ventilation systems by controlling windows and air cleaners. Only two recent works included demand response systems with integrated energy storage, mainly TES. As for the objectives, health was considered by An et al., who deployed DQN to control ventilation in two laboratory rooms to reduce economic loss and PM₂.₅-related health risks ^[63][109]. This is an interesting co-benefit perspective, as it quantifies not only the energy and cost reduction but also the impact on human health, and integrates the findings into the BEMS objective. Furthermore, Chemingui et al. included the reduction of indoor contamination as a core target of their BEMS. This was realized by optimizing the HVAC system managing 21 zones in a school model, achieving 44% increased thermal comfort, a 21% reduction in energy consumption, and low indoor CO₂ concentration ^[64][110]. Considering real implementations, three studies conducted real model validation: one in a laboratory setting, one in a university building, and another in a school setting. Laboratories are suitable for real-system validation, although acquiring data to train the agent can be challenging if the data do not already exist. In An et al., the approach was first to conduct an offline training phase based on an apartment model coupled with particle dynamics for PM₂.₅ modelling, after which the trained agent was tested in a laboratory room with different PM₂.₅ levels ^[63][109]. Schmidt et al. conducted a 43-day experiment in a Spanish school by deploying a BEMS utilizing fitted Q-iteration and a Bayesian regularized neural network coupled with genetic optimization. They confirmed that, when comfort levels were kept similar to those of the reference period, energy consumption decreased by almost 33%, whereas prioritizing higher comfort led to only a 5% energy increase ^[65][111].

Table 47.

Recent applications of DRL-based BEMS in educational buildings.

Ref	Year	Type	Scale	BEMS	ESS	PV	DR	DRL	Estimator	Unique Objective	Real System	Energy */Cost Savings
^[66][112]	2022	University	Single

o: included, x: not included, * Energy saving, ^▲ Overall energy/cost saving (others are an improvement over a baseline controller). Table Abbreviations: (MA-CWSC) Multi-Agent deep reinforcement learning method for the building Cooling Water System Control.

All of the studies listed in Table 69 investigated HVAC systems as the main BEMS target, while two studies included TES and one considered WHP and renewable energy inverters. DR systems were also included in seven studies, particularly in those with larger scales, such as industrial parks or multiple buildings. One notable method was the dueling SAC-based memory-augmented DRL introduced by Zhao et al. to overcome the limitation of time lag in the district heating system of an industrial park. Their novel methodology reduced the energy costs by 2.8% ^[81][127]. Furthermore, two multi-agent approaches were observed. First, Fu et al. utilized a multi-agent DRL method for cooling water system control (MA-CWSC) to control the frequency of the cooling tower and cooling water pump in many chillers. Compared with a single-agent DQN, the proposed model had faster training and a simpler action space, resulting in an 11.1% energy saving over the rule-based baseline ^[79][125]. Second, Yu et al. introduced a multi-agent actor-critic (MAAC) algorithm for a multi-zone HVAC system. Their objective was not only to minimize energy costs but also to reduce the indoor CO₂ concentration in the building ^[83][16].

In terms of secondary objectives, Pigott et al. considered voltage regulation for a simulated IEEE-33 bus connected to nine buildings. The building types are diverse and include 37 fast-food restaurants, four medium offices, five retail stores, a mall, and 145 residential houses. These models were based on the recent CityLearn framework, which is a platform dedicated to multi-agent models in smart grids and hence contains both building and power-flow models. Utilizing multiple DRL agents, their model nominally reduced the under-voltage and over-voltage occurrences by 34% ^[77][123]. Moreover, Pinto et al. considered both peak demand and peak-to-average ratio, which were reduced by 23% and 20%, respectively, by using a centralized SAC agent controlling four different building types (small/medium offices, retail, and restaurant) ^[84][129]. Finally, in terms of real system validation, none was observed.
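Peak demand and peak-to-average ratio, the two grid-oriented metrics mentioned above, can be computed from a load profile as in the following minimal sketch. The load values are hypothetical and are not taken from Pinto et al.; the function name `peak_and_par` is likewise an assumption.

```python
def peak_and_par(load_kw):
    """Return (peak demand, peak-to-average ratio) of a load profile."""
    peak = max(load_kw)
    average = sum(load_kw) / len(load_kw)
    return peak, peak / average


# Hypothetical aggregate load over four time steps, in kW.
baseline_load = [2.0, 2.0, 8.0, 4.0]
managed_load = [3.0, 3.0, 6.0, 4.0]  # same total energy, flatter profile

base_peak, base_par = peak_and_par(baseline_load)
new_peak, new_par = peak_and_par(managed_load)

peak_reduction = 100.0 * (base_peak - new_peak) / base_peak
par_reduction = 100.0 * (base_par - new_par) / base_par
print(f"peak -{peak_reduction:.0f}%, peak-to-average ratio -{par_reduction:.0f}%")
```

In this example the total energy is unchanged, so both metrics improve purely through load shifting, which is the behaviour a controller with a peak-related objective is rewarded for.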