REWARD LEARNING WITH TREES: METHODS AND EVALUATION

Abstract

Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.

1. INTRODUCTION

For a reinforcement learning (RL) agent to reliably achieve a goal or desired behaviour, this objective must be encoded as a reward function. However, manual reward design is widely understood to be challenging, with risks of under-, over-, and mis-specification leading to undesirable, unsafe and variable outcomes (Pan et al., 2022). For this reason, there has been growing interest in enabling RL agents to learn reward functions from normative feedback provided by humans (Leike et al., 2018). These efforts have proven successful from a technical perspective, but an oft-unquestioned aspect of the approach creates a roadblock to practical applications: reward learning typically uses black-box neural networks (NNs), which resist human scrutiny and interpretation. For advocates of explainable AI (XAI), this is a problematic state of affairs. The XAI community is vocal about the safety and accountability risks of opaque learning algorithms (Rudin, 2019), but an inability to interpret even the objective that an agent is optimising places us in yet murkier epistemic territory, in which an understanding of the causal origins of learnt behaviour, and their alignment with human preferences, becomes virtually unattainable.

Black-box reward learning could also be seen as a missed scientific opportunity. A learnt reward function is a tantalising object of study from an XAI perspective, due to its triple status as (1) an explanatory model of revealed human preferences, (2) a normative model of agent behaviour, and (3) a causal link between the two.

The approach proposed by Bewley & Lecue (2022) provides a promising way forward. Here, human preference labels over pairs of agent behaviours are used to learn tree-structured reward functions (reward trees), which are hierarchies of local rules that admit visual and textual representation and can be leveraged to monitor and debug agent learning.
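To make the core idea of the preceding paragraph concrete, the following is a minimal toy sketch of learning a tree-shaped reward from pairwise trajectory preferences. It is not the authors' algorithm: the data, the win-rate-based attribution of preferences to per-step reward targets, and all names are illustrative assumptions, and an off-the-shelf scikit-learn regression tree stands in for the paper's bespoke tree-growth procedure.

```python
# Illustrative sketch only: a simulated labeller, a crude preference-to-reward
# attribution step, and a generic regression tree in place of the paper's method.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data: 40 trajectories, 10 timesteps each, 3 state features per step.
n_traj, horizon, n_feat = 40, 10, 3
trajs = rng.normal(size=(n_traj, horizon, n_feat))

# Hidden "true" return (unknown to the learner), used only to simulate a
# human who prefers whichever trajectory of a pair has the higher return.
true_return = trajs[:, :, 0].sum(axis=1)

# Collect pairwise preference labels (winner, loser) over all pairs.
pairs = [(i, j) for i in range(n_traj) for j in range(i + 1, n_traj)]
prefs = [(i, j) if true_return[i] > true_return[j] else (j, i)
         for i, j in pairs]

# Crude attribution step (an assumption of this sketch): each trajectory's
# win rate across its comparisons is spread evenly over its timesteps to
# give a per-step reward target.
wins, counts = np.zeros(n_traj), np.zeros(n_traj)
for w, l in prefs:
    wins[w] += 1
    counts[w] += 1
    counts[l] += 1
target_per_step = (wins / counts) / horizon

X = trajs.reshape(-1, n_feat)             # one row per timestep
y = np.repeat(target_per_step, horizon)   # shared target within a trajectory

# Fit a small tree: its leaves partition the state space into regions of
# constant predicted reward, yielding a human-readable hierarchy of rules.
tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)

# A trajectory's predicted return is the sum of leaf rewards along it;
# agreement measures how often this ordering matches the preference labels.
pred_return = tree.predict(X).reshape(n_traj, horizon).sum(axis=1)
agreement = np.mean([pred_return[w] > pred_return[l] for w, l in prefs])
```

Even this naive pipeline recovers most of the preference ordering on the toy data, while the fitted tree remains small enough to read as a handful of threshold rules.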
In this paper, we adapt and extend the method (including by integrating it with model-based RL agents), and compare it to NN-based reward learning in a challenging aircraft handling domain. We find it to be broadly competitive on both quantitative metrics and qualitative assessments, with our new modification to tree growth yielding significant improvements. The resultant trees are small enough to be globally interpretable (≈ 20 leaves), and we demonstrate how they can be analysed, verified, and used to generate explanations. The primary contribution of this paper is positive empirical evidence that reward learning can be done effectively using interpretable models such as trees, even in complex, high-dimensional continuous environments. We also make secondary methodological contributions: improvements to the originally proposed learning algorithm, as well as metrics and methods for reward evaluation and interpretability that may be useful to others working in what remains a somewhat pre-paradigmatic field.

After reviewing the necessary background and related work in Sections 2 and 3, we present our refinement of reward tree learning in Section 4, and describe how we deploy it online with a model-based agent in Section 5. Section 6 contains our experiments and results, which consider both quantitative and qualitative aspects of learning performance, and an illustrative analysis of learnt tree structures. Finally, Section 7 concludes and discusses avenues for future work.

