THRESHOLDED LEXICOGRAPHIC ORDERED MULTI-OBJECTIVE REINFORCEMENT LEARNING

Abstract

Lexicographic multi-objective problems, which impose a lexicographic importance order over the objectives, arise in many real-life scenarios. Existing Reinforcement Learning work that directly addresses lexicographic tasks is scarce, and the few proposed approaches are heuristics without theoretical guarantees, as the Bellman equation does not apply to them. Additionally, the practical applicability of these prior approaches suffers from various issues, such as failing to reach the goal state. While some of these issues were known before, in this work we investigate further shortcomings and propose fixes that improve practical performance in many cases. We also present a policy optimization approach using our Lexicographic Projection Optimization (LPO) algorithm that has the potential to address these theoretical and practical concerns. Finally, we demonstrate our proposed algorithms on benchmark problems.

1. INTRODUCTION

The need for multi-objective reinforcement learning (MORL) arises in many real-life scenarios, and the setting cannot, in general, be reduced to single-objective reinforcement learning tasks Vamplew et al. (2022). However, solving multiple objectives requires overcoming certain inherent difficulties. In order to compare candidate solutions, we need to incorporate the user's preferences over the different objectives; otherwise, we are left with a set of Pareto optimal (non-inferior) solutions, in which no solution is better than another in terms of all objectives. Various methods of specifying user preferences have been proposed and evaluated along three main fronts: (a) expressive power, (b) ease of writing, and (c) the availability of methods for solving problems with such preferences. For example, preference specifications that result in a partial order of solutions instead of a total order are easier for the user to write but may not be enough to describe a unique preference. Three main motivating scenarios, differing on when the user preference becomes available or used, have been studied in the literature: (1) the user preference is known beforehand and is incorporated into the problem a priori; (2) the user preference is used a posteriori, i.e., a set of representative Pareto optimal solutions is generated first and the user preference is specified over it; (3) an interactive setting where the user preference is specified gradually during the search and the search is guided accordingly.

The most common specification method for the a priori scenario is linear scalarization, which requires the designer to assign a weight to each objective and take the weighted sum of the objectives, thus making solutions comparable Feinberg & Shwartz (1994). The main benefit of this technique is that it preserves the additivity of the reward functions and therefore allows the use of many standard off-the-shelf algorithms. However, expressing user preferences with this technique requires significant domain knowledge and preliminary work in most scenarios Li & Czarnecki (2019). While it can be the preferred method when the objectives are expressed in comparable quantities, e.g., when all objectives have a monetary value, this is rarely the case; usually, the objectives are expressed in incomparable quantities like money, time, and carbon emissions. Additionally, approximating a composite utility over the objectives with linear scalarization limits us to a subset of the Pareto optimal set.

To address these drawbacks of linear scalarization, several other approaches have been proposed and studied. Nonlinear scalarization methods such as Chebyshev scalarization Perny & Weng (2010) are more expressive and can capture all of the solutions in the Pareto optimal set; however, they do not address the user-friendliness requirement. In this paper, we focus on an alternative specification method that overcomes both limitations of linear scalarization, named Thresholded Lexicographic Ordering (TLO) Gábor et al. (1998); Li & Czarnecki (2019). In lexicographic ordering, the user determines an importance order for the objectives, and the less important objectives are considered only when two solutions are equal with respect to the more important ones. The thresholding part of the technique generalizes what it means to be equal w.r.t. an objective: the user provides a threshold for each objective except the last, and the objective values are clipped at the corresponding thresholds. This lets the user specify values beyond which they are indifferent to further optimization of an objective. There is no threshold for the last objective, as it is considered an unconstrained, open-ended objective.
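To make the ordering concrete, the following minimal sketch (ours, not taken from the cited works; the function name and the maximization convention are assumptions) compares two candidate solutions under TLO by clipping each thresholded objective and then comparing lexicographically:

    from typing import Sequence

    def tlo_compare(u: Sequence[float], v: Sequence[float],
                    thresholds: Sequence[float]) -> int:
        """Compare candidate solutions u and v under Thresholded Lexicographic
        Ordering.  Objectives are ordered from most to least important and are
        maximized; every objective except the last has a threshold at which its
        value is clipped.  Returns 1 if u is preferred, -1 if v is preferred,
        and 0 if the user is indifferent between the two."""
        for i in range(len(u)):
            if i < len(thresholds):
                # Values beyond the threshold are treated as equally good.
                ui, vi = min(u[i], thresholds[i]), min(v[i], thresholds[i])
            else:
                ui, vi = u[i], v[i]  # last objective: open-ended, no clipping
            if ui != vi:
                return 1 if ui > vi else -1
            # Clipped values are equal: defer to the less important objectives.
        return 0

    # Both solutions meet the thresholds (10 and 5) on the first two objectives,
    # so the unconstrained last objective decides: returns 1 (u is preferred).
    tlo_compare([12.0, 6.0, 3.0], [10.0, 7.0, 1.0], thresholds=[10.0, 5.0])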
Despite the strengths of this specification method, the need for a specialized algorithm to use it in reinforcement learning (RL) has prevented it from becoming a common technique. The Thresholded Lexicographic Q-Learning (TLQ) algorithm was proposed as such an algorithm and has been studied and used in several papers Li & Czarnecki (2019); Hayes et al. (2020). While it has been noted that this algorithm does not enjoy the convergence guarantees of its parent algorithm (Q-Learning), we found that its practical use is limited to an extent that has not been discussed in the literature before. In this work, we investigate such issues of TLQ further. We also present a Policy Gradient algorithm as a general solution that has the potential to address many of the shortcomings of TLQ.

Our Contributions. Our main contributions in this work are as follows: (1) We demonstrate the shortcomings of existing TLQ variants on a common control scenario where the primary objective is reaching a goal state and the secondary objectives evaluate the trajectories taken to the goal. We formulate a taxonomy of the problem space in order to give insight into TLQ's performance in different settings. (2) We propose a lexicographic projection algorithm which computes a lexicographically optimal direction that optimizes the highest-importance objective that is currently unsatisfied while preserving the values of the more important objectives using projections onto hypercones of their gradients; a simplified sketch of this projection step is given at the end of this section. Our algorithm allows adjusting how conservative the new direction is w.r.t. the preserved objectives and can be combined with first-order optimization algorithms like Gradient Descent or Adam. We also validate this algorithm on a simple optimization problem from the literature. (3) We explain how this algorithm can be applied to policy-gradient algorithms to solve Lexicographic Markov Decision Processes (LMDPs) and experimentally demonstrate the performance of a REINFORCE adaptation on the cases that were problematic for TLQ. Additionally, in Appendices C and D, we give further insights into TLQ by detailing how different TLQ variants fail in problematic scenarios, and we present both some of our failed efforts and the promising directions we identified in order to guide future research.
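As a preview of the geometric operation behind contribution (2), the sketch below (ours; the full LPO algorithm is presented later in the paper) projects a candidate update direction onto a hypercone of half-angle theta around the gradient of a more important, already-satisfied objective. Any nonzero direction inside such a cone with theta < pi/2 has a positive inner product with that gradient, so following it does not decrease the preserved objective to first order; smaller theta is more conservative.

    import numpy as np

    def project_to_hypercone(g: np.ndarray, h: np.ndarray, theta: float) -> np.ndarray:
        """Project the direction g onto the hypercone of half-angle theta
        (radians, 0 < theta < pi/2) around the preserved gradient h."""
        h_hat = h / np.linalg.norm(h)
        a = float(g @ h_hat)            # component of g along h
        p = g - a * h_hat               # component of g orthogonal to h
        r = float(np.linalg.norm(p))
        if a >= 0 and a * np.tan(theta) >= r:
            return g                    # g already lies inside the cone
        if a <= 0 and -a >= r * np.tan(theta):
            return np.zeros_like(g)     # g lies in the polar cone; nearest point is 0
        # Otherwise, project onto the cone boundary in the plane spanned by h and p.
        boundary = np.cos(theta) * h_hat + np.sin(theta) * (p / r)
        return float(g @ boundary) * boundary

In our setting, the gradient of the currently unsatisfied objective would be projected onto the cones of the more important objectives in this way before the resulting direction is handed to a first-order optimizer such as Gradient Descent or Adam.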

2. RELATED WORK

Gábor et al. (1998) was one of the first papers to investigate the use of RL in multi-objective tasks with a preference ordering, and it introduced TLQ as an RL algorithm to solve such problems. Vamplew et al. (2011) showed that TLQ significantly outperforms Linear Scalarization (LS) when the Pareto front is globally concave or when most of the solutions lie on its concave parts. However, LS performs better when the rewards are not restricted to terminal states, because TLQ cannot account for the rewards that have already been received. Later, Roijers et al. (2013) generalized this analysis by comparing more approaches within a unifying framework. To our knowledge, Vamplew et al. (2011) is the only previous work that explicitly discussed shortcomings of TLQ. However, we found that TLQ has other significant issues that occur even outside of the problematic cases they analyze.

Wray et al. (2015) introduced Lexicographic MDPs (LMDPs) and the Lexicographic Value Iteration (LVI) algorithm. LMDPs define the thresholds as slack variables that determine how much worse than the optimal value is still acceptable. While Wray et al. (2015) proved convergence to the desired policy when the slacks are chosen appropriately, such slacks are generally too tight to allow defining user preferences. This was also observed by Pineda et al. (2015), which claimed that while ignoring these slack bounds negates the theoretical guarantees, the resulting algorithm can still be useful in practice. Li & Czarnecki (2019) investigated the use of Deep TLQ for urban driving. It showed that the TLQ version proposed in Gábor et al. (1998) introduces additional bias, which is especially problematic in function approximation settings like deep learning, and that it depends on learning the true Q-function, which cannot be guaranteed. To overcome these drawbacks, it used slacks instead of static thresholds and proposed a different update function. Hayes et al. (2020) used TLQ in a multi-objective multi-agent setting and proposed a dynamic thresholding heuristic to deal with the explosion in the number of thresholds to be set.
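As a concrete illustration of the slack mechanism introduced by Wray et al. (2015), it can be summarized as restricting, at each priority level $i$, the admissible actions to those that are near-optimal for that level (notation ours; this follows the general form of the construction rather than the exact statement in Wray et al. (2015)):
$$A_{i+1}(s) = \{\, a \in A_i(s) \;:\; Q_i(s,a) \ge \max_{a' \in A_i(s)} Q_i(s,a') - \eta_i \,\},$$
where $A_1(s)$ is the full action set and $\eta_i \ge 0$ is the slack for objective $i$. The convergence guarantee holds only for appropriately small slacks, which is why the admissible $\eta_i$ are typically too tight to encode meaningful user preferences.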

