HEDGE YOUR ACTIONS: FLEXIBLE REINFORCEMENT LEARNING FOR COMPLEX ACTION SPACES

Anonymous

Abstract

Real-world decision-making often involves large and complex action representations that can even be ill-suited to the task. For instance, the items in a recommender system have generic representations that apply differently to each user, and the actuators of a household robot can be high-dimensional and noisy. Prior work in discrete and continuous action space reinforcement learning (RL) defines a retrieval-selection framework to deal with problems of scale: a retrieval agent outputs in the space of action representations to retrieve a few samples, which a selection critic then evaluates. However, learning such retrieval actors becomes increasingly inefficient as the complexity of the action space rises. We therefore propose to treat retrieval as a listwise RL task: the agent proposes a list of action samples that enables the selection phase to maximize the environment reward. By hedging its action proposals, we show that our agent is more flexible and sample-efficient than conventional approaches when learning under a complex action space.

1. INTRODUCTION

An essential goal of reinforcement learning (RL) is to solve tasks humans cannot, such as those with innumerable action spaces. For instance, recommender systems operate over sets of millions of discrete items, and robotic control requires acting in continuous dimensions. In such domains, it is infeasible to enumerate all actions to perform RL. Actions are therefore associated with continuous parameterizations or representations that enable generalized learning. We consider continuous and large parameterized discrete action spaces as problems of innumerable actions.

Typically, innumerable action space tasks are solved in two phases: retrieval and selection (see Fig. 1). QT-Opt (Kalashnikov et al., 2018) learns robotic manipulation by retrieving actions from a distribution fitted with the cross-entropy method and selecting the action that maximizes a learned Q-function. Similarly, Dulac-Arnold et al. (2015) perform RL over large discrete action spaces by acting in the space of action representations, retrieving the k nearest neighbors, and selecting with a Q-function. Continuous actor-critic approaches like DDPG and SAC (Lillicrap et al., 2015; Haarnoja et al., 2018) directly learn an actor whose objective is to retrieve an action that maximizes a Q-function. However, action retrieval in these approaches takes the form of a single action, a distribution, or a local neighborhood defined on the action space. While these forms of retrieval work when the selection Q-function is smooth over the actions, they become a limiting factor in complex action spaces: the action space can be imprecise, noisy, high-dimensional, or derived independently of the task (as in recommender systems), leading to a mismatch between the action representations and their effects on the task. To enable efficient reinforcement learning in complex and innumerable action spaces, we therefore aim to perform robust action retrieval.
Our critical insight is to perceive action retrieval as a problem of listwise reinforcement learning (Sunehag et al., 2015). Concretely, the retrieval network is treated as an RL agent with a modified action space of picking a list of k action candidates, trained so that the selection Q-function can maximize the environment reward (Fig. 1). Listwise retrieval improves the efficiency of RL in two ways. First, retrieval lags behind selection during initial training because it is trained with an RL loss; by hedging, or diversifying, the retrieved actions, the selection phase receives better candidates to maximize over, enabling directed exploration of the task. Second, listwise retrieval can learn to adapt the list composition (e.g., diverse or similar) over training because the actions are not
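The modified action space can be sketched as follows: the retrieval policy emits all k proposals jointly (here one linear head per list slot), so the list composition is itself learnable. This is a hedged illustration under assumed shapes and a placeholder critic, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, ACT_DIM, K = 8, 4, 5  # illustrative sizes

# Hypothetical listwise retrieval policy: one linear head per list slot.
# Because the k proposals are produced jointly, training can shape their
# composition (diverse vs. similar) rather than fixing it a priori.
W = rng.normal(scale=0.1, size=(K, ACT_DIM, STATE_DIM))

def retrieve_list(state):
    # The agent's "action" is an entire list of k candidates.
    return W @ state  # shape (K, ACT_DIM)

def q_function(state, actions):
    # Placeholder critic standing in for a learned Q(s, a).
    return actions @ state[:ACT_DIM]

def act(state):
    candidates = retrieve_list(state)         # listwise retrieval phase
    q_values = q_function(state, candidates)  # selection phase
    return candidates[np.argmax(q_values)]

chosen = act(rng.normal(size=STATE_DIM))
```

In training, the environment reward earned by the selected action would back-propagate (via an RL loss) to the list-producing weights `W`, so the retriever learns which lists let the selection critic succeed.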

