HEDGE YOUR ACTIONS: FLEXIBLE REINFORCEMENT LEARNING FOR COMPLEX ACTION SPACES

Anonymous

Abstract

Real-world decision-making often involves large and complex action representations, which can even be ill-suited to the task at hand. For instance, the items in recommender systems have generic representations that apply to each user differently, and the actuators of a household robot can be high-dimensional and noisy. Prior work in discrete and continuous action space reinforcement learning (RL) defines a retrieval-selection framework to deal with problems of scale: the retrieval agent acts in the space of action representations to retrieve a few samples for a selection critic to evaluate. However, learning such retrieval actors becomes increasingly inefficient as the complexity of the action space rises. We therefore propose to treat retrieval as a listwise RL task: proposing a list of action samples that enables the selection phase to maximize the environment reward. By hedging its action proposals, we show that our agent is more flexible and sample-efficient than conventional approaches when learning under a complex action space. Website.

1. INTRODUCTION

An essential goal of reinforcement learning (RL) is to solve tasks humans cannot solve, such as those with innumerable action spaces. For instance, recommender systems have sets of millions of discrete items, and robotic control requires acting in continuous dimensions. In such domains, it is infeasible to enumerate all actions to perform RL. Thus, actions are associated with continuous parameterizations or representations that enable generalized learning. We consider both continuous action spaces and large parameterized discrete action spaces as problems of innumerable actions. Typically, innumerable action space tasks are solved in two phases: retrieval and selection (see Fig. 1). QT-Opt (Kalashnikov et al., 2018) learns robotic manipulation by retrieving actions from a distribution fitted with the cross-entropy method and selecting the action that maximizes a learned Q-function. Similarly, Dulac-Arnold et al. (2015) perform large discrete action RL by acting in the space of action representations, retrieving the k nearest neighbors, and selecting with a Q-function. Continuous actor-critic approaches like DDPG and SAC (Lillicrap et al., 2015; Haarnoja et al., 2018) directly learn an actor whose objective is to retrieve an action that maximizes a Q-function. However, action retrieval in these approaches takes the form of a single action, a distribution, or a local neighborhood defined on the action space. While these forms of retrieval work when the selection Q-function is smooth over the actions, they become a limiting factor in complex action spaces. The action space can be imprecise, noisy, high-dimensional, or derived independently of the task (as in recommender systems), leading to a mismatch between the action representations and their effects on the task. Therefore, to enable efficient reinforcement learning in complex and innumerable action spaces, we aim to perform robust action retrieval.
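The retrieval-selection pattern described above can be sketched in a few lines. The following is a minimal illustration, not any paper's actual implementation: `action_reprs`, `q_value`, and `retrieve_then_select` are hypothetical stand-ins, and the Q-function is a toy scorer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10,000 discrete actions, each with a 16-dim representation.
action_reprs = rng.normal(size=(10_000, 16))

def q_value(state, action_repr):
    # Stand-in for a learned Q-function: any scalar scorer works here.
    return float(state @ action_repr)

def retrieve_then_select(state, proto_action, k=10):
    """Dulac-Arnold-style retrieval-selection (a sketch under toy assumptions).

    1. Retrieval: find the k action representations nearest to the
       actor's continuous "proto-action" output.
    2. Selection: evaluate only those k candidates with the Q-function
       and return the argmax.
    """
    dists = np.linalg.norm(action_reprs - proto_action, axis=1)
    candidates = np.argpartition(dists, k)[:k]  # indices of k nearest neighbors
    scores = [q_value(state, action_reprs[i]) for i in candidates]
    return int(candidates[int(np.argmax(scores))])

state = rng.normal(size=16)
proto = rng.normal(size=16)  # would come from the retrieval actor
chosen = retrieve_then_select(state, proto, k=10)
```

Note that the selection step only ever evaluates k candidates, which is what makes the scheme tractable when enumerating all actions is infeasible.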
Our critical insight is to perceive the problem of action retrieval as one of listwise reinforcement learning (Sunehag et al., 2015). Concretely, the retrieval network is treated as an RL agent with a modified action space of picking a list of k action candidates, trained to enable the selection Q-function to maximize the environment reward (Fig. 1). Listwise retrieval improves the efficiency of RL in two ways. First, retrieval lags behind selection during initial training because it is trained with an RL loss; by hedging, i.e., diversifying, the retrieved actions, the selection phase gets better candidates to maximize over, enabling directed exploration of the task. Second, listwise retrieval can learn to adapt the list composition (e.g., diverse or similar) over training because the actions are not constrained to a predefined distribution or local neighborhood. We show that this flexibility of listwise retrieval is crucial in complex action spaces. To this end, we propose a novel framework, FLAIR (Flexible Listwise ActIon Retrieval), that incrementally builds a list of k candidate actions without enumerating all possible list combinations. We extend cascaded DQN (Chen et al., 2019c; Jain et al., 2021) to continuous action spaces, resulting in a cascaded DDPG framework with k actors and k critics. Specifically, each actor outputs a candidate action for its list index while taking the state and the partially built list as input. Each cascaded actor is trained to maximize its associated critic's value, while the critics are trained to maximize the environment reward.
Overall, each actor-critic pair learns to retrieve a candidate action that optimizes the environment reward when combined with the current list of candidates. Our primary contribution is introducing the problem of complex innumerable action spaces in reinforcement learning. We make the retrieval-selection approach flexible by incorporating listwise RL. We demonstrate that our proposed method FLAIR learns to perform listwise action retrieval, enabling flexible and efficient decision-making in innumerable action space tasks such as recommender systems, a novel mine-world environment, and continuous control. We show that conventional approaches suffer in complex action spaces due to their predefined constraints on the form of action retrieval.
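The cascaded list construction described above can be sketched as follows. This is an illustrative data-flow sketch, not the authors' implementation: the actors here are fixed random linear maps (`actor_weights`) standing in for learned networks, and `build_candidate_list` is a hypothetical name. The point is only that actor i conditions on the state plus the partially built list, so later actors can hedge around earlier proposals.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, ACT_DIM, K = 8, 4, 3

# Hypothetical stand-ins for the k cascaded actors: actor i maps
# (state, flattened partial list of i actions) -> the i-th candidate action.
actor_weights = [
    rng.normal(size=(STATE_DIM + i * ACT_DIM, ACT_DIM)) * 0.1
    for i in range(K)
]

def build_candidate_list(state):
    """Sketch of cascaded listwise retrieval (illustrative only).

    The list of k candidates is built incrementally: each actor sees the
    state and all previously proposed candidates. In the full framework,
    each actor would be trained to maximize its own critic's value.
    """
    candidates = []
    for i in range(K):
        partial = np.concatenate([state] + candidates) if candidates else state
        candidates.append(np.tanh(partial @ actor_weights[i]))  # bounded action
    return candidates

state = rng.normal(size=STATE_DIM)
cand_list = build_candidate_list(state)
```

A selection Q-function would then score the k candidates and execute the argmax, as in the retrieval-selection framework.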

2. RELATED WORK

Recently, Hubert et al. (2021) proposed model-based planning in innumerable action spaces. Their insight of using a retrieval and selection framework is similar to ours: avoiding the enumeration of all possible actions. They propose an importance-sampling-like loss modification to enable learning with a subset of samples. However, unlike our work, theirs focuses on the selection network, while the retrieval network is assumed to be a specified proposal distribution and is not learned. Van de Wiele et al. (2020) learn a proposal distribution that approximates arg max Q via supervised learning on targets obtained from a stochastic maximization procedure on the Q-function. However, this stochastic
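A stochastic maximization procedure of the kind referenced above can be illustrated with a cross-entropy-method-style loop. This is a generic sketch, not Van de Wiele et al.'s exact procedure: `q_fn` is a toy Q-function with a known peak, and `stochastic_argmax_q` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)

def q_fn(action):
    # Toy Q-function over a 2-D continuous action, peaked at (0.5, -0.3).
    return -np.sum((action - np.array([0.5, -0.3])) ** 2, axis=-1)

def stochastic_argmax_q(n_samples=256, n_iters=5, elite_frac=0.1):
    """CEM-style stochastic maximization of Q (a sketch).

    Sample actions from a Gaussian proposal, refit the proposal to the
    elite (highest-Q) samples, and iterate. The resulting approximate
    maximizer could serve as a supervised target for a proposal network.
    """
    mu, sigma = np.zeros(2), np.ones(2)
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        samples = rng.normal(mu, sigma, size=(n_samples, 2))
        elite = samples[np.argsort(q_fn(samples))[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

approx_max = stochastic_argmax_q()
```

The loop concentrates the proposal around high-Q regions, so `approx_max` lands near the Q-function's peak without ever enumerating the action space.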



Figure 1: Large discrete action set and continuous action space tasks cannot be solved with a single Q-function because it is infeasible to enumerate all the actions. A common approach is to (a) retrieve k actions, and (b) select only from those actions. We posit that the retrieval task can be generally seen as listwise RL, where the retrieval agent must learn to output the list of candidates as a whole. This enables flexible learning in complex innumerable action space RL.

LARGE DISCRETE ACTION SPACE RL Continuous action space RL is a well-studied problem, with several on-policy algorithms like TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017), and off-policy algorithms like DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2018), proposed for it. Dulac-Arnold et al. (2015) and Chandak et al. (2019) address the issue of large discrete action spaces in reinforcement learning by learning an actor-critic framework in the space of action representations. Likewise, several approaches are based on the retrieval and selection framework (Tan et al., 2019; Chen et al., 2019b; Tessler et al., 2019; Narasimhan et al., 2015).

