RULE-BASED POLICY REGULARIZATION FOR REINFORCEMENT LEARNING-BASED BUILDING CONTROL

Anonymous

Abstract

Rule-based control (RBC) is widely adopted in buildings because of its stability and robustness. It resembles a behavior-cloning policy refined by human expertise. However, RBC is unlikely to exceed the performance of a reinforcement learning (RL) agent, since deep RL models continually improve and scale. In this paper, we explore how to incorporate rule-based control into reinforcement learning to learn a more robust policy, in both online and offline settings, with a unified approach. We start from state-of-the-art online and offline RL methods, TD3 and TD3+BC, and improve on them with a dynamically weighted actor loss that selectively chooses which policy the RL model learns from at each training step. With experiments across various weather conditions, in both deterministic and stochastic scenarios, we empirically demonstrate that our rule-based incorporated control regularization (RUBICON) method outperforms representative baseline methods by 40.7% in offline settings and by 49.7% in online settings for building-RL environments. We open-source our code, baselines, and data for both RL and building researchers to explore the opportunity to apply offline RL in the building domain.

1. INTRODUCTION

Most buildings implement rule-based control via building management systems, adjusting actuator setpoints to co-optimize occupants' thermal comfort and energy efficiency. These rule-based control systems codify the problem-solving know-how of human experts, akin to a behavior-cloning policy learned from expert demonstrations without randomness or uncertainty (Hayes-Roth, 1985). While stable, such control lacks the flexibility to evolve over time.

Much research has demonstrated that RL can outperform RBC in both online and offline settings. Zhang et al. (2019) developed a framework for whole-building HVAC (heating, ventilation, air-conditioning) control in online settings. Liu et al. (2022) incorporated a Kullback-Leibler (KL) divergence constraint during training of an offline RL agent for stability, and deployed the policy in a real building. We focus on improving RL algorithms for HVAC control in buildings where a rule-based policy already exists. We use established building-RL simulation environments for our experiments, in both online and offline settings (Jiménez-Raboso et al., 2021).

Reinforcement learning is traditionally studied as an online paradigm. However, in real-world problems, configuring an accurate simulator or training environment dynamics models may be infeasible or time-consuming. Batch reinforcement learning (BRL, also known as offline reinforcement learning) learns policies from historical data alone, without a simulator to interact with during training. RL regularization methods are typically tailored to either the online or the offline setting. For example, online methods encourage exploration, either to improve value estimates of non-greedy actions or to discover a better policy (Haarnoja et al., 2018; Ziebart et al., 2008; Haarnoja et al., 2017).
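To make the exploration-encouraging regularization concrete: a maximum-entropy objective in the spirit of Haarnoja et al. (2018) augments the return with a bonus for policy entropy, i.e., for low log-probability actions. The sketch below is illustrative only; the function name and the temperature `alpha` are our own choices, not part of any cited implementation.

```python
import numpy as np

def entropy_regularized_return(rewards, log_probs, alpha=0.2):
    """Illustrative maximum-entropy RL objective: the agent is credited
    for both the collected reward and the policy's entropy (-log pi),
    with `alpha` trading off exploration against exploitation."""
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    # Entropy bonus: subtracting log-probabilities rewards stochastic policies.
    return rewards.sum() - alpha * log_probs.sum()
```

For a deterministic greedy policy the log-probabilities approach zero and the bonus vanishes, so the objective reduces to the plain return.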
On the other hand, offline methods favor exploitation, since BRL models are unlikely to accurately estimate the values of uncharted state-action pairs from a static dataset (Fujimoto et al., 2019; Wu et al., 2019). In our work, we explore two questions: Can we incorporate an existing rule-based control policy into the training of a reinforcement learning policy to improve performance? Can this method be implemented in both online and offline settings? TD3+BC (Fujimoto & Gu, 2021) makes minimal changes to convert the online method TD3 (Fujimoto et al., 2018) to the offline setting, matching state-of-the-art BRL methods simply by adding a behavior-cloning term that regularizes the policy toward the actions in the dataset.
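The behavior-cloning regularization in TD3+BC amounts to an actor loss that trades off the critic's value estimate against a mean-squared-error term toward the dataset actions, with the value term rescaled so the two stay on a comparable scale. The following is a minimal numerical sketch under our own naming, not the authors' implementation; `q_values` are assumed to come from a learned critic, and `alpha` plays the role of TD3+BC's normalization hyperparameter.

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """Sketch of a TD3+BC-style actor objective:
    minimize  -lambda * mean(Q(s, pi(s))) + MSE(pi(s), a_data),
    where lambda = alpha / mean(|Q|) normalizes the value term against
    the behavior-cloning term."""
    q_values = np.asarray(q_values, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    data_actions = np.asarray(data_actions, dtype=float)
    lam = alpha / np.abs(q_values).mean()          # scale normalization
    bc_term = np.mean((policy_actions - data_actions) ** 2)  # behavior cloning
    return -lam * q_values.mean() + bc_term
```

When the policy reproduces the dataset actions exactly, the BC term is zero and the loss reduces to the normalized value term; as the policy drifts from the data, the quadratic penalty grows, which is the exploitation-favoring behavior described above.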

