RULE-BASED POLICY REGULARIZATION FOR REINFORCEMENT LEARNING-BASED BUILDING CONTROL

Anonymous

Abstract

Rule-based control (RBC) is widely adopted in buildings due to its stability and robustness. It resembles a behavior cloning policy refined by human expertise. However, RBC is unlikely to exceed the performance of a reinforcement learning (RL) agent, since deep RL models continually evolve and scale. In this paper, we explore how to incorporate rule-based control into reinforcement learning to learn a more robust policy in both online and offline settings with a unified approach. We start from state-of-the-art online and offline RL methods, TD3 and TD3+BC, and improve on them with a dynamically weighted actor loss function that selectively chooses which policy the RL model learns from at each training step. With experiments across various weather conditions in both deterministic and stochastic scenarios, we empirically demonstrate that our rule-based incorporated control regularization (RUBICON) method outperforms representative baseline methods by 40.7% in offline settings and by 49.7% in online settings in building-RL environments. We open-source our code, baselines, and data so that both RL and building researchers can explore the opportunity to apply offline RL in the building domain.

1. INTRODUCTION

Most buildings implement rule-based control via building management systems, adjusting the setpoints of actuators to co-optimize occupants' thermal comfort and energy efficiency. These rule-based control systems codify the problem-solving know-how of human experts, akin to a behavior cloning policy learnt from expert demonstrations without randomness or uncertainty (Hayes-Roth, 1985). While stable, such control lacks the flexibility to evolve over time. Much research has demonstrated that RL can outperform RBC in both online and offline settings. Zhang et al. (2019) developed a framework for whole-building HVAC (heating, ventilation, air-conditioning) control in online settings. Liu et al. (2022) incorporated a Kullback-Leibler (KL) divergence constraint during training of an offline RL agent for stability, and deployed the policy in a real building. We focus on improving RL algorithms for HVAC control in buildings where a rule-based policy already exists. We use established building-RL simulation environments for our experiments, in both online and offline settings (Jiménez-Raboso et al., 2021). Reinforcement learning is traditionally studied as an online paradigm. However, in real-world problems, configuring an accurate simulator or training environment dynamics models may be infeasible or time-consuming. Batch reinforcement learning (BRL, also known as offline reinforcement learning) learns policies using only historical data, without a simulator to interact with during training. RL regularization methods are typically tailored specifically to the online or offline setting. For example, online methods encourage exploration, either to improve the estimates of non-greedy actions' values or to discover a better policy (Haarnoja et al., 2018; Ziebart et al., 2008; Haarnoja et al., 2017).
On the other hand, offline methods favor exploitation, since BRL models are unlikely to accurately estimate uncharted state-action values from a static dataset (Fujimoto et al., 2019; Wu et al., 2019). In our work, we explore the questions: Can we incorporate an existing rule-based control policy into the training of a reinforcement learning policy to improve performance? Can this method be implemented in both online and offline settings? TD3+BC (Fujimoto & Gu, 2021) makes minimal changes to convert the online method TD3 (Fujimoto et al., 2018) to an offline mode with performance comparable to state-of-the-art BRL methods, simply by adding a behavior cloning term to regularize the policy. In TD3+BC, learning the behavioral policy relies on historical data. We build on this idea to regularize the RL policy using an existing RBC policy combined with the behavioral policy. Our method can be incorporated into existing actor-critic RL algorithms with minimal changes. RUBICON treats RBC as a safe policy on which RL training can improve. The actor selectively trains on either the RBC or the behavioral policy, depending on which yields a higher averaged Q-value in a mini-batch, as estimated by the critic network. Our proposed approach is distinct from prior works in three aspects: (1) We develop a unified regularization approach for both online and offline RL methods with minimal algorithmic modification. (2) The rule-based control policy is directly incorporated into the policy update step to provide stability and robustness. (3) We introduce a dynamic weighting method in actor-critic settings: the actor loss varies from time step to time step depending on the Q-value estimates of the behavioral policy and the RBC policy predicted by the value networks. We empirically demonstrate that our method outperforms state-of-the-art methods in offline settings and improves on TD3 in online training in HVAC control environments.
To our knowledge, RBC has previously been used only as hard constraints or heuristics in RL settings; we are the first to incorporate an existing reference policy directly into actor-critic algorithms.

2. RELATED WORK

Rule-based systems The rule-based system is one of the earliest artificial intelligence (AI) methods to solve many real-world problems. Planner (Hewitt, 1971) is a problem-solving language that embeds real-world knowledge in procedures. It was used in robotic control and as a semantic base for English. MYCIN (Shortliffe, 1974) is a rule-based problem-solving system that assists physicians in choosing an appropriate therapy for bacterial infections. XCON, a production-rule-based system, was used to validate the technical correctness (configurability) of customer orders and to guide the actual assembly of these orders. It had about 80K rules and achieved 95-98% accuracy. It saved $25M per year by reducing errors in orders and increased customer satisfaction (Kraft, 1984).

RL + RBC

The combination of RL and RBC has been explored in many studies, where RBCs are primarily used as auxiliary constraints or guiding mechanisms. Lee et al. (2020) propose to use two modules in their control flow, one for continuous control with an RL agent and a discrete one controlled by RBC. Wang et al. (2019) improve RL with low-level rule-based trajectory modification to achieve safe and efficient lane-change behavior. Zhu et al. (2021) incorporate RBC for generating the closed-loop trajectory and reducing the exploration space as RL pre-processing. Berenji (1992) uses a learning process to fine-tune the performance of a rule-based controller. Radaideh & Shirvan (2021) first train RL proximal policy optimization (PPO) (Schulman et al., 2017) agents to master some of the problem rules and constraints; RL is then used to inject experiences that guide various evolutionary/stochastic algorithms. Likmeta et al. (2020) learn RBC parameters via RL methods. These previous methods incorporate RBC into the flow as heuristics or as hard constraints. Instead, we directly incorporate the RBC policy into RL training algorithmically.

Online RL regularization

The online baseline we compare to in our evaluation is a state-of-the-art algorithm: TD3. It applies target policy smoothing regularization to avoid overfitting in the value estimate with deterministic policies. TRPO (Schulman et al., 2015) uses a trust-region constraint based on the KL-divergence between old and new policy distributions for robust policy updates. SAC (Haarnoja et al., 2018) uses soft policy iteration for learning optimal maximum-entropy policies. Munchausen-RL (Vieillard et al., 2020) regularizes policy updates with a KL-divergence penalty similar to TRPO, and adds a scaled entropy term to penalize a policy that is far from the uniform policy. Our method differs from these in that we incorporate an existing real-world policy into online RL training.

Offline RL regularization

Offline RL is more conservative than online methods, as it does not involve interaction with the environment. It suffers from extrapolation errors induced by selecting out-of-distribution actions. Since offline RL policies are learnt entirely from a static dataset, the value networks are unlikely to estimate values accurately in regimes without sufficient state-action visitation. Thus, regularization methods become more prominent in offline settings. Batch-constrained deep Q-learning (BCQ) (Fujimoto et al., 2019), one of the pioneers of offline RL, ascribes extrapolation errors to three main factors: absent data, model bias, and training mismatch. It mitigates the errors by deploying a variational autoencoder (VAE) to reconstruct the action given a state using the data collected by the behavioral policy. The offline baseline method we compare to in our study is TD3+BC. It starts from the online method TD3 and adds a behavior cloning term to the policy update to regularize the actor to imitate the behavioral policy and avoid selecting out-of-distribution actions.
BRAC (Wu et al., 2019) studies both value penalties and policy regularization with multiple divergence metrics (KL, maximum mean discrepancy (MMD), and Wasserstein) to regularize the actor's policy toward the behavioral policy. FisherBRC (Kostrikov et al., 2021) incorporates a gradient-penalty regularizer on the state-action value offset term and demonstrates its equivalence to Fisher divergence regularization. CQL (Kumar et al., 2020) learns a conservative, lower-bound estimate in the value network by regularizing Q-values. Model-based methods, e.g., COMBO (Yu et al., 2021), regularize the value function on out-of-support transitions generated via rollouts of environment dynamics models. All of these prior works use data collected by a behavioral policy, and do not assume access to any existing policy. The behavioral policy used in experiments is typically a random agent or an RL agent trained partially (medium agent) or to convergence (expert agent). In contrast, we assume direct access to a robust behavioral policy in the form of rule-based control. While this assumption may not hold for the robotic applications typically used to evaluate offline RL algorithms, rule-based control policies are routinely deployed in industrial control settings, such as building HVAC control. The regularization methods in both online and offline settings regularize with ensemble models, divergence from partially trained policies, or logged data. These algorithms still fall into the deadly triad of function approximation, bootstrapping, and off-policy training (Sutton & Barto, 2018). In our case, we incorporate a robust reference policy to improve RL policy performance. The behavioral policy is used to train the actor only when its averaged Q-value estimate in a mini-batch is higher than that of the rule-based control policy's actions. The rule-based control policy reduces uncertainty due to its deterministic behavior. In contrast, a deep learning model is affected by its random initialization: even when trained on the same dataset, varied initialization conditions can lead to different policies.

3. BACKGROUND

In the reinforcement learning paradigm, an agent acts in an environment, sequentially selecting actions based on its policy at every time step. The problem can be formulated as a Markov Decision Process (MDP) defined by a tuple (S, A, R, p, γ), with state space S, action space A, reward function R, transition dynamics p, and discount factor γ ∈ [0, 1). The goal is to maximize the expectation of the cumulative discounted reward, denoted by R_t = Σ_{i=t+1}^∞ γ^i r(s_i, a_i, s_{i+1}) (Sutton & Barto (2018)). The agent's behavior is determined by a policy π : S → A, which maps states to actions either deterministically or via a probability distribution. The expected return from a given state s, taking action a and then following the policy, is the action-value function Q^π(s, a) = E_π[Σ_{t=0}^∞ γ^t R_{t+1} | s_0 = s, a_0 = a]. We use the building RL environment from Jiménez-Raboso et al. (2021). The objective of the agent is to maintain a comfortable thermal environment with minimum energy use. The state consists of indoor/outdoor temperatures, time/day, occupant count, thermal comfort, and related sensor data. The action adjusts the temperature setting of the thermostat. The reward is a linear combination of occupants' thermal comfort and energy consumption. The environment is a single-floor building divided into 5 zones, with 1 interior and 4 exterior rooms. See Appendix B for details about our building RL settings.
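As a concrete illustration of the discounted-return objective above, a minimal sketch of the standard backward recursion (the function name is ours, not from the paper's code):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    via a single backward pass over a reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The backward pass avoids recomputing powers of gamma; the action-value function Q^π is the expectation of this quantity under the policy and the environment dynamics.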

4. RULE-BASED INCORPORATED CONTROL REGULARIZATION

Our goal is to improve an agent's ability to learn with the assistance of human expert's domain knowledge in both online and offline settings. In real-world problems, sometimes we have existing simulators as oracles so we can safely learn with online RL methods before deploying in the real environment, for example, in robot control (Todorov et al., 2012) , Go (Silver et al., 2017) , and video games (Mnih et al., 2013) . However, in some real-world problems, it is time-consuming and requires domain expertise to build a functional simulation environment (e.g. building thermal simulations), or it can be dangerous or risky to evaluate partially trained policy (e.g. healthcare and financial trading). Offline RL algorithms, on the other hand, rely on historical data collected by an existing but unknown behavioral policy. The objective is to learn a policy that improves on the behavioral policy measured through episodic rewards. In Fig. 1 , we illustrate the process of RL training that accommodates both the online and offline paradigms.


Our algorithm builds on the existing actor-critic algorithms TD3 and TD3+BC. We only modify the policy update strategy with the rule-based control policy, and use the critic as-is; therefore, we focus our discussion on the policy update of the algorithm. TD3 starts from DDPG (Silver et al., 2014) and mitigates function approximation error with double Q-learning and delayed policy updates. TD3+BC is an offline RL algorithm adapted from TD3, and is one of the state-of-the-art offline RL methods evaluated on the D4RL datasets (Fu et al., 2020). TD3+BC adds a behavior cloning term to the policy update step to penalize a policy that is far from the behavioral policy (Eq. 1). The blue-colored terms indicate the changes from TD3 to TD3+BC.

π = argmax_π E_{(s,a)∼D} [ λ Q(s, π(s)) − (π(s) − a)² ]   (1)

λ = α / ( (1/N) Σ_{(s_i, a_i)} |Q(s_i, a_i)| )   (2)

In Eq. 1, λ is decided by the averaged mini-batch Q-estimate and a hyperparameter α that adjusts between RL and imitation learning (Eq. 2). Our method, RUBICON, dynamically weights both TD3's and TD3+BC's policy update steps with either the RBC policy or the behavioral policy in each training iteration. In Eq. 3, we replace the actions a sampled from the buffers in Eq. 1 with π_Qmax(s) and add a hyperparameter ξ to integrate TD3 and TD3+BC into one method. Red-colored terms indicate the changes from TD3 and TD3+BC to our method. We replace the notation of sampled actions a in TD3+BC with the behavioral policy π_b(s) to avoid confusion. Details of the hyperparameter settings in our work are in Appendix D.

π = argmax_π E_{s∼D} [ λ Q(s, π(s)) − ξ (π(s) − π_Qmax(s))² ]   (3)

π_Qmax(s) = argmax_{π ∈ {π_b, π_rbc}} { Q̄(s, π_b(s)), Q̄(s, π_rbc(s)) }   (4)

Every time the policy is updated, given the states s of the sampled mini-batch, the behavioral policy π_b(s) and the RBC policy π_rbc(s) select actions in a deterministic fashion.
The critic estimates the Q-values of these state-action pairs; the averages of the Q-value estimates over the mini-batch are Q̄(s, π_b(s)) and Q̄(s, π_rbc(s)). We dynamically select the set of transitions with the higher average Q-value to regularize toward in each policy update step, i.e., the actor loss function is dynamically weighted. In online settings, the behavioral policy is the older version of the policy used to generate the buffer. We choose the average as the metric for deciding which set of transitions to learn from, rather than selecting each transition individually by its higher estimated value (which would make each batch a mixture of π_b(s) and π_rbc(s)): per-transition selection would discard the information about which state-action visitations lead to worse values, and the model would then suffer from an imbalanced-data problem. Credit/blame assignment is essential for RL convergence, and experience replay can help speed up the propagation process (Lin, 1992). Our method is given in Alg. 1; changes from the baselines are highlighted in blue. d is the policy update frequency; the noise ϵ added to the policy is sampled from a Gaussian N(0, σ) and clipped at c. In both online and offline approaches, the policy update follows Eq. 3 and 4 with different hyperparameter settings. Our rule-based control algorithm is described in Alg. 2. It is derived from the rule-based controller in Sinergym's (Jiménez-Raboso et al., 2021) examples. For computational efficiency, and to fit the batch setting of our algorithm, we vectorize the original RBC policy. The rules are simple and intuitive, and generalize well: first, we extract the datetime information we need from the states; then we obtain the seasonal comfort temperature zone for every transition.
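The dynamically weighted policy update of Eq. 3 and 4 can be sketched in PyTorch as follows. This is a simplified illustration, not the authors' implementation: `actor`, `critic`, and `rbc_policy` are placeholder callables, and the behavior-cloning term follows the TD3+BC mean-squared-error form.

```python
import torch

def rubicon_actor_loss(actor, critic, rbc_policy, states, behavioral_actions,
                       alpha=2.5, xi=1.0):
    """Sketch of RUBICON's actor loss (Eq. 3-4): regularize the actor toward
    whichever reference policy (behavioral or RBC) the critic scores higher
    on average over the mini-batch."""
    # Mini-batch-averaged Q-value of each reference policy (Eq. 4).
    q_b = critic(states, behavioral_actions).mean()
    q_rbc = critic(states, rbc_policy(states)).mean()
    ref_actions = behavioral_actions if q_b >= q_rbc else rbc_policy(states)

    pi = actor(states)
    q = critic(states, pi)
    # lambda scales the RL term by the inverse mean |Q| (Eq. 2).
    lam = alpha / q.abs().mean().detach()
    # Maximize lambda*Q minus the squared distance to the chosen reference (Eq. 3).
    bc_term = (pi - ref_actions.detach()).pow(2).mean()
    return -lam * q.mean() + xi * bc_term
```

Selecting the reference once per mini-batch, rather than per transition, mirrors the averaging argument above: the whole batch is regularized toward a single reference policy each step.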
If the indoor air temperature (IAT) is below the lower bound of the comfort zone, we set both the cooling and heating setpoints one degree higher (in degrees Celsius). Conversely, if the IAT is above the upper bound of the comfort zone, we set both the heating and cooling setpoints one degree lower than the current setpoints. Finally, we check whether the current datetime falls within office hours. If not, the setpoints are set to (18.33, 23.33) °C for energy reduction, since occupants' thermal comfort does not matter in these periods under the assumption of zero occupancy.

Algorithm 1: RUBICON
  Initialize critic networks Q_θ1, Q_θ2, actor network π_ϕ, and RBC policy π_rbc with random parameters θ1, θ2, ϕ; target networks θ'1 ← θ1, θ'2 ← θ2, ϕ' ← ϕ; initialize replay buffer B (online) or load replay buffer B (offline)
  for t = 1 to T do
    if online then
      Select action with exploration noise a ∼ π_ϕ(s) + ϵ and observe reward r and new state s'
      Store transition (s, a, r, s') in B
    Sample mini-batch of N transitions (s, a, r, s') from B
    ã ← π_ϕ'(s') + ϵ, ϵ ∼ clip(N(0, σ), −c, c)
    y ← r + γ min_{j=1,2} Q_θ'j(s', ã)
    Update critics: θj ← argmin_θj N⁻¹ Σ (y − Q_θj(s, a))²
    if t mod d then
      Update ϕ by the policy gradient following Eq. 3 and 4: calculate ∇_ϕ J(ϕ)
      Update target networks: θ'j ← τ θj + (1 − τ) θ'j, ϕ' ← τ ϕ + (1 − τ) ϕ

Algorithm 2: Vectorized RBC (core rules, for each transition i during office hours)
  if IAT_i < min(season_comfort_zone_i) then
    a_h_i ← a_h_i + 1; a_c_i ← a_c_i + 1
  else if IAT_i > max(season_comfort_zone_i) then
    a_h_i ← a_h_i − 1; a_c_i ← a_c_i − 1
  a_i ← (a_h_i, a_c_i)
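The vectorized RBC rules described above can be sketched in NumPy. This is a simplification: the one-degree nudges and the (18.33, 23.33) °C off-hours setpoints follow the text, while the function signature and argument names are our own, and seasonal comfort-zone lookup is assumed to happen upstream.

```python
import numpy as np

def rbc_policy(iat, heat_sp, cool_sp, comfort_lo, comfort_hi, office_hours):
    """Vectorized rule-based control over a batch of transitions:
    nudge both setpoints one degree toward the comfort zone during office
    hours, otherwise fall back to fixed energy-saving setpoints."""
    heat = np.where(iat < comfort_lo, heat_sp + 1.0,
           np.where(iat > comfort_hi, heat_sp - 1.0, heat_sp))
    cool = np.where(iat < comfort_lo, cool_sp + 1.0,
           np.where(iat > comfort_hi, cool_sp - 1.0, cool_sp))
    # Outside office hours, use fixed energy-saving setpoints (18.33, 23.33) degC.
    heat = np.where(office_hours, heat, 18.33)
    cool = np.where(office_hours, cool, 23.33)
    return np.stack([heat, cool], axis=-1)
```

Because every operation is elementwise, a whole mini-batch of states produces its RBC actions in one call, which is what the batched policy update requires.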

5. EXPERIMENTS

In our experiments, there are two environment types, deterministic and stochastic: in the stochastic environments, Gaussian noise with µ=0 and σ=2.5 is added to the outside temperature from episode to episode. There are also three weather types: hot, cool, and mixed. In the results, "hot-deterministic" indicates that the task is learnt and evaluated with the hot weather condition in the deterministic environment; similarly, we have all six combinations, such as "cool-stochastic" and so on. More details about the RL setup are given in Appendix B. All scores in tables and figures in this paper are normalized with the expert policy as 100 and the random policy as 0.
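The score normalization follows the D4RL convention; a minimal sketch (the function name is ours):

```python
def normalize_score(ret, random_ret, expert_ret):
    """Map an episodic return onto a scale where the random policy scores 0
    and the expert policy scores 100."""
    return 100.0 * (ret - random_ret) / (expert_ret - random_ret)
```

A policy halfway between the random and expert returns thus scores 50; scores above 100 or below 0 are possible when a policy beats the expert or underperforms the random agent.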

5.1. OFFLINE APPROACH

First, we consider the offline approach, where no simulator exists but historical data is available. We follow the standard procedure for BRL evaluation (Fu et al., 2020): (1.) Train behavioral agents for 500K time steps, comparing the most representative algorithms DDPG, TD3, and SAC (learning curves are shown in Appendix C). The online methods we compare are described below: • DDPG: Deep deterministic policy gradient combines the actor-critic approach with deep Q-networks (DQN) (Mnih et al., 2013). It handles continuous action spaces via policy gradients in a deterministic manner, which outperforms stochastic policy methods in high-dimensional tasks. • SAC: Soft actor-critic, an off-policy maximum-entropy RL algorithm that encourages exploration. The authors empirically show that SAC yields better sample efficiency than DDPG. • TD3: Twin delayed deep deterministic policy gradient; it reduces overestimation with double Q-learning, combined with target networks to limit errors from imprecise function approximation. (2.) Select the best agent as our expert agent and generate buffers with it for 500K time steps. A medium agent is trained "halfway": we take the checkpoint whose evaluation performance is closest to half that of the expert agent. A random agent, which samples actions randomly, also generates buffers. (3.) Train BRL methods for 500K time steps and evaluate the policy every 25K time steps on all the buffers from step (2.). We show the detailed learning curves in Appendix C. Normalized scores averaged across runs are shown in Table 1. The offline methods we compare with are listed below: • TD3+BC: An offline version of TD3; it adds a behavior cloning term to regularize the policy toward the behavioral policy, combined with mini-batch Q-value and buffer-state normalization for improved stability.
• CQL: Conservative Q-learning, derived from SAC; it learns a lower-bound estimate of the value function by regularizing the Q-values during training. • BCQ: Batch-constrained deep Q-learning; it implements a variational autoencoder (VAE) (Kingma & Welling, 2013) to reconstruct the action given the state, and adds a perturbation model on the policy in the actor. The degree of perturbation and the size of the mini-batch can be adjusted so that the method behaves more like traditional RL or more like imitation learning. • BC: Behavior cloning; we train a VAE to reconstruct the action given the state. It simply imitates the behavioral agent without reward signals. In Table 1, we observe that RUBICON outperforms all other benchmarks in overall score across weather types, random seeds, and environment types. Other BRL methods perform well either on specific tasks or with a specific randomly initialized configuration; overall, however, they are less stable than RUBICON. Our method provides more robust and consistent performance across all variants and demonstrates the ability to generalize across the various weather types and response modes of the tasks. Also, as we can see in Fig. 2, when learning from both the medium buffer and the RBC policy, RUBICON improves on the best of the two. Our method is stable, with the smallest standard deviation among the trained policies. We include the BRL learning curves with expert and random buffers in Appendix C.

5.1.1. DATA EFFICIENCY EXPERIMENT

We conduct experiments with buffers of only one year of data (35,040 transitions); data efficiency is a challenge for RL to yield accurate value estimation. In Table 2, we observe that our method still outperforms its baseline overall. Although it dominates with random buffers and has comparable performance with expert buffers, it does not learn well with medium buffers. The root cause is the similar quality of the actions in the medium buffers and under the RBC policy, which causes the critic to misjudge which action to pick between them. However, RUBICON still outperforms the baseline with the other two types of buffers, since the value-estimation differences between (π_b(s), s) and (π_rbc(s), s) are more prominent in these scenarios.
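The buffer size corresponds to one simulated year at a 15-minute control timestep (an assumption consistent with the full-year EnergyPlus simulation period described in Appendix B):

```python
# One year of transitions at a 15-minute control timestep:
# 365 days x 24 hours x 4 steps per hour.
steps_per_hour = 4
transitions_per_year = 365 * 24 * steps_per_hour
print(transitions_per_year)  # 35040
```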

5.1.3. TRANSFER EXPERIMENT

We consider a realistic scenario where we already have an existing buffer for one weather condition and want to use it as prior knowledge, combined with RBC, to transfer our model to another weather type for which we have no data. We experiment with the medium buffers in stochastic environments. The results, shown in Table 4, indicate that, due to the diversity of the mixed weather, transferring from it improves learning in the cool and hot weathers. On the other hand, transferring from a monotonic weather condition leads to worse returns.

5.1.4. REWARD ANALYSIS EXPERIMENT

Since Q-value estimates are often overestimated, we use the immediate reward as a reference to examine the quality of the Q-functions' predictions. We pre-train a reward model R_ψ(s, a) to predict the reward r given state s and selected action a for 200K iterations, with the buffer as our training data. At each policy-update iteration, we record the policy π(s) and the predicted rewards in each batch, r = R_ψ(s, π(s)). We plot the distributions of reward in the action space in Fig. 3. They demonstrate that RUBICON selects actions in a wider range than TD3+BC, yet with a reward distribution of higher values. For better visualization, the distribution shown uses 10% of the data randomly selected from the entire training run.
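A reward model of the kind described above can be sketched in PyTorch as follows. This is an illustrative regression setup, not the authors' architecture: the MLP width and the class/function names are our assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """MLP reward model R_psi(s, a) -> r, trained by regression on buffer data."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Concatenate state and action, predict a scalar reward per sample.
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def train_step(model, optimizer, states, actions, rewards):
    """One regression step: minimize MSE between predicted and logged rewards."""
    loss = nn.functional.mse_loss(model(states, actions), rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once trained, R_ψ(s, π(s)) serves as an immediate-reward probe for any candidate policy, independent of the (possibly overestimated) critic.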

5.2. ONLINE APPROACH

In the online approach, we assume an oracle exists for accurate simulation. In real-world applications, researchers train online models in simulation before deployment in real building environments. We compare with TD3, the baseline on which we develop our method. Experimental results comparing TD3 and our method can be found in Table 5: in five out of six tasks, our method outperforms TD3.

A EXPERIMENT DETAILS

Software

• Python: 3.9.12
• PyTorch: 1.12.1+cu113 (Paszke et al., 2019)
• Sinergym: 1.9.5 (Jiménez-Raboso et al., 2021)
• Gym: 0.21.0 (Brockman et al., 2016)
• NumPy: 1.23.1 (Van Der Walt et al., 2011)
• CUDA: 11.2

Hardware

• CPU: Intel Xeon Gold 6230 (2.10 GHz)
• GPU: NVIDIA RTX A6000

Benchmark implementations

• DDPG: We adopt the DDPG implementation from the TD3 author-provided repository
• TD3: Author-provided implementation
• SAC: We adopt the CleanRL (Huang et al., 2021) implementation due to software version conflicts with the author-provided repository
• TD3+BC: Author-provided implementation
• CQL: We adopt the d3rlpy (Seno & Imai, 2021) implementation due to software version conflicts with the author-provided repository
• BCQ: Author-provided implementation

B BUILDING RL SETTINGS

In this section, we list the details of the MDP settings in our problem.

• State: Site outdoor air dry bulb temperature, site outdoor air relative humidity, site wind speed, site wind direction, site diffuse solar radiation rate per area, site direct solar radiation rate per area, zone thermostat heating setpoint temperature, zone thermostat cooling setpoint temperature, zone air temperature, zone thermal comfort mean radiant temperature, zone air relative humidity, zone thermal comfort clothing value, zone thermal comfort Fanger model PPD (predicted percentage of dissatisfied), zone people occupant count, people air temperature, facility total HVAC electricity demand rate, current day, current month, and current hour.
• Action: Heating setpoint and cooling setpoint, in continuous settings.
• Reward: We follow the default linear reward settings, which consider the energy consumption and the absolute difference from the comfort temperature.
• Environment: A single-floor building with an area of 463.6 m² divided into 5 zones, 1 interior and 4 exterior. The HVAC system is a packaged VAV (variable air volume) unit (DX (direct expansion) cooling coil and gas heating coils) with fully auto-sized input. The control variables are the cooling and heating temperature setpoints for the interior zone, and the simulation period is a full year (Jiménez-Raboso et al., 2021).

Continuing the CQL+RUBICON experiment: we combine CQL and our RUBICON method to learn from random buffers, since we expect the most improvement in this scenario. The results in Figure 12 and Table 8 indicate that RUBICON also improves CQL. However, it does not improve CQL's performance consistently from task to task, unlike what we observe going from TD3+BC to RUBICON with random buffers (see Figure 7). Also, the improvement is limited and does not even reach the RBC policy's performance, so we did not continue exploring the combination of CQL with RUBICON.
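The default linear reward described in the MDP settings above (a trade-off between energy consumption and the absolute deviation from the comfort temperature range) can be sketched as follows. The weight and the absence of any power scaling here are illustrative assumptions, not Sinergym's exact coefficients.

```python
def linear_reward(power_w, indoor_temp, comfort_lo, comfort_hi, w_energy=0.5):
    """Linear trade-off between an energy penalty and a comfort-violation
    penalty: r = -(w * energy + (1 - w) * comfort_violation)."""
    # Absolute distance from the comfort band, zero when inside it.
    comfort_violation = max(comfort_lo - indoor_temp, 0.0) \
                      + max(indoor_temp - comfort_hi, 0.0)
    return -(w_energy * power_w + (1.0 - w_energy) * comfort_violation)
```

Both terms are penalties, so the reward is at most zero; the weight trades off energy savings against occupant comfort.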
• Learn from a mixture of the original buffer and the RBC buffer: To evaluate whether mixing the buffers (the RBC buffer and the original buffer) is equivalent to RUBICON, we conduct experiments mixing 50% of the transitions from the RBC buffer with 50% of the transitions from the original buffer. The results, shown in Figure 14 and Table 10, indicate that our selective algorithm, which dynamically decides whether to learn from the RBC policy or the behavioral policy, is necessary, rather than training on both at random.
• Learn from worsened RBC policies: We run another ablation experiment to observe how the quality of the RBC policy affects performance compared with RUBICON and the baseline TD3+BC. We design two worsened RBC policies. The first is a biased RBC where we change the setpoint adjustments (a_h_i and a_c_i) from 1 to 5 in Alg. 2; we name this method "RBC CB" in Figure 15. The other replaces the RBC with a random policy, named "RBC Random". The results in Table 11 show that even a consistently worsened RBC policy still improves on the baseline, although it is too aggressive for the models to learn a robust policy. With a random policy as the worsened RBC, it is almost equivalent to having no reference policy, and the performance is similar to our baseline TD3+BC.

All learning curves are normalized with the random policy as 0 and the expert policy as 100 and averaged over 3 random seeds; the scores shown in tables are the average and standard deviation over the last 5 evaluations unless mentioned otherwise.

D MODEL PARAMETERS

We list the hyperparameters used in this paper for reproducibility. Unless mentioned otherwise, we keep the original hyperparameter setups of the implementations listed in Sec. A, since DRL methods are sensitive to hyperparameter tuning (Henderson et al., 2018) (see Tables 12, 13, 14, and 15).



Some scores with a standard deviation of 0 are caused by rounding down of the normalized scores; these are negligible values.



Figure 2: Learning curves of RUBICON and the baseline method TD3+BC with medium buffers

Figure 3: Reward distribution in the action space of the hot-continuous environment learning from the medium buffer; from left to right: RUBICON (1.842/1.978/-0.577), TD3+BC (1.534/1.332/-0.668), and buffer (0.908/0.915/-0.799). Tuples indicate (a1 range / a2 range / reward mean).

Figure 5: Learning curves of BRL models learning from expert buffers.

Figure 6: Learning curves of BRL models learning from medium buffers.

Figure 7: Learning curves of BRL models learning from random buffers.

Figure 10: Learning curves of online RUBICON hyperparameter optimization.

Figure 12: Learning curves of CQL, CQL+RUBICON, and RUBICON learning from random buffers.

Figure 1: Our proposed method, RUBICON, incorporates RBC into RL to improve stability in building HVAC control. It can be applied in both online and offline approaches.


Table 1: BRL methods benchmark. Average normalized score over the final 5 evaluations and 3 random seeds; ± corresponds to the standard deviation over the last 5 evaluations across runs.

Table 4: Transfer experiment. The results indicate that our method is capable of transferring from one weather condition to another with comparable performance and without any hyperparameter changes; due to the diversity of the mixed weather, it improves learning in the cool and hot weathers.

The weather types are classified according to the U.S. Department of Energy (DOE) standard (Department of Energy). The weather types and their representative geographic locations, based on TMY3 datasets (National Renewable Energy Laboratory), are listed below:

• Cool marine: Washington, USA. The mean annual temperature and mean annual relative humidity are 9.3 °C and 81.1%, respectively.
• Hot dry: Arizona, USA, with a mean annual temperature of 21.7 °C and a mean annual relative humidity of 34.9%.
• Mixed humid: New York, USA, with a mean annual temperature of 12.6 °C and a mean annual relative humidity of 68.5%.

We illustrate the learning curves of BRL methods learning from buffers of different quality, for better visual comparison, in Fig. 5, 6, and 7.
• Transfer learning: The learning curves of the BRL transfer experiments are illustrated in Fig. 8.
• Behavioral agents: The behavioral agents' learning curves are shown in Fig. 9.
• Hyperparameter optimization: Learning curves of the hyperparameter experiments for online RUBICON are shown in Fig. 10.

C.2 ADDITIONAL EXPERIMENTS

• Learn from RBC buffers: To simulate a more realistic scenario, we learn from buffers resembling a real-world HVAC control dataset, consisting of transitions generated by a rule-based control policy, which is widely used in building HVAC control. The learning curves are illustrated in Figure 11. The results in Table 7 indicate that even when learning from RBC buffers generated by the RBC policy itself, RUBICON still outperforms the RBC policy due to its learning ability.
• CQL+RUBICON: CQL demonstrates better performance than the other methods except RUBICON (see Table 1).

TD3, TD3+BC, and RUBICON hyperparameters

SAC/CQL hyperparameters


RUBICON outperforms TD3 in averaged scores and with a substantially smaller standard deviation across runs; the detailed learning curves are illustrated in Fig. 4. Since we integrate TD3 and TD3+BC into one method, we need to introduce the weighting λ and its hyperparameter α (see Eq. 2 and Alg. 1) in the online settings. The TD3+BC paper notes that the value of α decides whether the model learns more like RL (α=4) or like imitation learning (α=1), with a default of α=2.5. We experiment with the values {1, 2.5, 4} to observe how α affects the performance of our models in all tasks. The results (see Table 6) show that α=1 gives the highest scores and the least variance, which indicates that the actor policy should imitate the RBC policy more in order to yield a stable and robust averaged score, rather than using a traditional RL configuration (α=4). In offline settings, we follow TD3+BC and use α=2.5.

6. DISCUSSION

In this paper, we explore how a rule-based control policy can be incorporated as regularization. Our method can be implemented on top of the baseline methods with minimal changes. We apply our method to building HVAC control simulation environments in both online and offline settings. We empirically demonstrate that it outperforms several state-of-the-art batch reinforcement learning methods and improves on its online baseline by a substantial amount in building HVAC tasks, where rule-based control is robust and a real-world standard. We hope our study encourages both domain and RL experts to explore further opportunities for combining existing policies with RL, and to extend this concept to more real-world applications.

