RULE-BASED POLICY REGULARIZATION FOR REINFORCEMENT LEARNING-BASED BUILDING CONTROL

Anonymous

Abstract

Rule-based control (RBC) is widely adopted in buildings due to its stability and robustness. It resembles a behavior cloning policy refined by human expertise. However, RBC is unlikely to exceed the performance of a reinforcement learning (RL) agent, since deep RL models continually evolve and scale. In this paper, we explore how to incorporate rule-based control into reinforcement learning to learn a more robust policy, in both online and offline settings, with a unified approach. We start with state-of-the-art online and offline RL methods, TD3 and TD3+BC, and improve on them using a dynamically weighted actor loss function that selects which reference policy the RL model learns from at each time step of training. With experiments across various weather conditions in both deterministic and stochastic scenarios, we empirically demonstrate that our rule-based incorporated control regularization (RUBICON) method outperforms representative baseline methods in building-RL environments by 40.7% in offline settings and by 49.7% in online settings. We open-source our code, baselines, and data for both RL and building researchers to explore the opportunity of applying offline RL in the building domain.

1. INTRODUCTION

Most buildings implement rule-based control via building management systems, adjusting the setpoints of actuators to co-optimize occupants' thermal comfort and energy efficiency. These rule-based control systems codify the problem-solving know-how of human experts, akin to a behavioral cloning policy learnt from expert demonstrations without randomness or uncertainty (Hayes-Roth, 1985). While stable, such control lacks the flexibility to evolve over time. Much research has demonstrated that RL can outperform RBC in both online and offline settings. Zhang et al. (2019) developed a framework for whole-building HVAC (heating, ventilation, air-conditioning) control in online settings. Liu et al. (2022) incorporated a Kullback-Leibler (KL) divergence constraint during training of an offline RL agent for stability, and deployed the policy in a real building. We focus on improving RL algorithms for HVAC control in buildings where a rule-based policy already exists. We use established building-RL simulation environments for our experiments, in both online and offline settings (Jiménez-Raboso et al., 2021).

Reinforcement learning is traditionally studied as an online paradigm. However, in real-world problems, configuring an accurate simulator or training environment dynamics models may be infeasible or time-consuming. Batch reinforcement learning (BRL, also known as offline reinforcement learning) learns policies using only historical data, without a simulator to interact with during training. RL regularization methods are typically tailored specifically to the online or offline setting. For example, online methods encourage exploration to either improve the value estimates of non-greedy actions or discover a better policy (Haarnoja et al., 2018; Ziebart et al., 2008; Haarnoja et al., 2017).
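As a concrete illustration of the rule-based control described above, a building management system typically encodes expert rules as fixed setpoint schedules conditioned on occupancy and weather. The following sketch is hypothetical: the function name, thresholds, and comfort bands are illustrative assumptions, not taken from any cited system.

```python
def rbc_setpoint(outdoor_temp_c: float, occupied: bool) -> tuple:
    """Return (heating, cooling) setpoints in Celsius from simple expert rules.

    A minimal sketch of the kind of rule-based policy a building
    management system encodes; all thresholds are illustrative.
    """
    if not occupied:
        # Relax the comfort band when the zone is vacant to save energy.
        return (16.0, 30.0)
    if outdoor_temp_c < 10.0:
        # Heating-season comfort band for occupied hours.
        return (21.0, 26.0)
    # Cooling-season comfort band for occupied hours.
    return (20.0, 24.0)
```

Because such a policy is a deterministic lookup over a handful of conditions, it is stable and auditable, but it cannot adapt its setpoints beyond the rules its authors wrote down.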
On the other hand, offline methods favor exploitation, since it is unlikely for BRL models to accurately estimate uncharted state-action values from a static dataset (Fujimoto et al., 2019; Wu et al., 2019). In our work, we explore two questions: Can we incorporate an existing rule-based control policy into the training of a reinforcement learning policy to improve performance? Can this method be implemented in both online and offline settings?

TD3+BC (Fujimoto & Gu, 2021) makes minimal changes to convert the online method TD3 (Fujimoto et al., 2018) to an offline mode with performance comparable to state-of-the-art BRL methods, simply by adding a behavior cloning term to regularize the policy. In TD3+BC, learning the behavioral policy relies on historical data. We build on this idea to regularize the RL policy using an existing RBC policy combined with the behavioral policy. Our method can be incorporated into existing actor-critic RL algorithms with minimal changes. RUBICON treats RBC as a safe policy upon which RL training can improve. The actor selectively trains on either the RBC or the behavioral policy, depending on which yields a higher average Q-value over a mini-batch, as estimated by the critic network. Our proposed approach is distinct from prior works in three aspects: (1) We develop a unified regularization approach for both online and offline RL methods with minimal algorithmic modification. (2) The rule-based control policy is directly incorporated into the policy update step to provide stability and robustness. (3) We introduce a dynamic weighting method in actor-critic settings: the actor loss varies from time step to time step depending on the Q-value estimates of the behavioral and RBC policies predicted by the value networks. We empirically demonstrate that our method outperforms state-of-the-art methods in offline settings and improves on TD3 in online training in HVAC control environments.
To our knowledge, RBC has previously been used only as hard constraints or heuristics in RL settings; we are the first to incorporate an existing reference policy directly into actor-critic algorithms.
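The dynamically weighted actor update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the network interfaces, the mean-squared-error regularizer, and the TD3+BC-style normalization constant `alpha` are our assumptions. The key step is that the critic scores both reference policies on the mini-batch, and the actor is pulled toward whichever scores higher.

```python
import torch
import torch.nn.functional as F

def rubicon_actor_loss(actor, critic, states,
                       behavioral_actions, rbc_actions, alpha=2.5):
    """Sketch of a dynamically weighted actor loss in the spirit of RUBICON.

    At each update, the actor is regularized toward whichever reference
    policy (logged behavioral actions or rule-based control actions) the
    critic currently assigns a higher average Q-value on the mini-batch.
    """
    pi = actor(states)
    q_pi = critic(states, pi)

    # Critic's average valuation of each reference policy on this batch.
    with torch.no_grad():
        q_behavioral = critic(states, behavioral_actions).mean()
        q_rbc = critic(states, rbc_actions).mean()

    # Select the higher-valued reference policy as the regularization target.
    target = rbc_actions if q_rbc > q_behavioral else behavioral_actions

    # TD3+BC-style normalization keeps the Q term and the regularization
    # term on a comparable scale throughout training.
    lam = alpha / q_pi.abs().mean().detach()
    return -lam * q_pi.mean() + F.mse_loss(pi, target)
```

The same loss applies in both settings: online, the "behavioral" actions come from the replay buffer; offline, from the static dataset, which is what makes the regularization unified across the two regimes.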

2. RELATED WORK

Rule-based systems Rule-based systems are among the first artificial intelligence (AI) methods to solve many real-world problems. Planner (Hewitt, 1971) is a problem-solving language that embeds real-world knowledge in procedures; it was used in robotic control and as a semantic base for English. MYCIN (Shortliffe, 1974) is a rule-based problem-solving system that assists physicians in selecting an appropriate therapy for bacterial infections. XCON, a production-rule-based system, was used to validate the technical correctness (configurability) of customer orders and to guide the actual assembly of these orders. It has about 80K rules, achieved 95∼98% accuracy, saved $25M/year by reducing errors in orders, and increased customer satisfaction (Kraft, 1984).

RL + RBC

The combination of RL and RBC has been explored in many studies, where RBCs are primarily used as auxiliary constraints or guiding mechanisms. Lee et al. (2020) propose two modules in their control flow: one for continuous control with an RL agent and a discrete one controlled by RBC. Wang et al. (2019) improve RL with low-level rule-based trajectory modification to achieve safe and efficient lane-change behavior. Zhu et al. (2021) incorporate RBC for generating the closed-loop trajectory and reducing the exploration space as RL pre-processing. Berenji (1992) uses a learning process to fine-tune the performance of a rule-based controller. Radaideh & Shirvan (2021) first train RL proximal policy optimization (PPO) (Schulman et al., 2017) agents to master some of the problem rules and constraints; RL is then used to inject experiences that guide various evolutionary/stochastic algorithms. Likmeta et al. (2020) learn RBC parameters via RL methods. These previous methods incorporate RBC into the control flow as heuristics or hard constraints. Instead, we directly incorporate the RBC policy into RL training in an algorithmic way.

Online RL regularization The online baseline we compare to in evaluation is a state-of-the-art algorithm, TD3. It applies target policy smoothing regularization to avoid overfitting in the value estimate with deterministic policies. TRPO (Schulman et al., 2015) uses a trust-region constraint based on the KL-divergence between the old and new policy distributions for robust policy updates. SAC (Haarnoja et al., 2018) uses soft policy iteration to learn optimal maximum-entropy policies. Munchausen-RL (Vieillard et al., 2020) regularizes policy updates with a KL-divergence penalty similar to TRPO, and adds a scaled entropy term to penalize policies that are far from the uniform policy. Our method differs from these methods in that we incorporate an existing real-world policy into online RL training.
Offline RL regularization Offline RL is more conservative than online methods, as it does not interact with the environment. It suffers from extrapolation errors induced by selecting out-of-distribution actions. Since offline RL policies are learnt entirely from a static dataset, it is unlikely for value networks to accurately estimate values in regimes without sufficient state-action visitation. Thus, regularization methods become more prominent in offline settings. Batch-constrained deep Q-learning (BCQ) (Fujimoto et al., 2019), one of the pioneering offline RL methods, ascribes extrapolation errors to three main factors: absent data, model bias, and training mismatch. It mitigates these errors by deploying a variational autoencoder (VAE) to reconstruct the action given a state, using the data collected by the behavioral policy. The offline baseline method we will

