JOINTLY-TRAINED STATE-ACTION EMBEDDING FOR EFFICIENT REINFORCEMENT LEARNING

Abstract

While reinforcement learning has achieved considerable success in recent years, state-of-the-art models are often still limited by the size of state and action spaces. Model-free reinforcement learning approaches use some form of state representation, and the latest work has explored embedding techniques for actions, both with the aim of achieving better generalization and applicability. However, these approaches consider only states or actions, ignoring the interaction between them when generating embedded representations. In this work, we propose a new approach for jointly learning embeddings for states and actions that combines aspects of model-free and model-based reinforcement learning and can be applied in both discrete and continuous domains. Specifically, we use a model of the environment to obtain embeddings for states and actions and present a generic architecture that uses these embeddings to learn a policy. In this way, the embedded representations obtained via our approach enable better generalization over both states and actions by capturing similarities in the embedding spaces. Evaluations on several gaming, robotic control, and recommender system tasks show that our approach significantly outperforms state-of-the-art models in both discrete and continuous domains with large state and action spaces, confirming its efficacy.

1. INTRODUCTION

Reinforcement learning (RL) has been successfully applied to a range of tasks, including challenging gaming scenarios (Mnih et al., 2015). However, the application of RL in many real-world domains is often hindered by the large number of possible states and actions these settings present. For instance, resource management in computing clusters (Mao et al., 2016; Evans & Gao, 2016), portfolio management (Jiang et al., 2017), and recommender systems (Lei & Li, 2019; Liu et al., 2018) all suffer from extremely large state/action spaces, making them challenging for RL to tackle. In this work, we investigate the efficient training of reinforcement learning agents in the presence of large state-action spaces, aiming to improve the applicability of RL to real-world domains.

Previous work attempting to address this challenge has explored the idea of learning representations (embeddings) for states or actions. For state embeddings, using machine learning to obtain meaningful features from raw state representations is a common practice in RL, e.g. through the use of convolutional neural networks for image input (Mnih et al., 2013). Ha & Schmidhuber (2018b) have explored the use of environment models, termed world models, to learn abstract state representations, and several works explore state aggregation using bisimulation metrics (Castro, 2020). For action embeddings, recent works by Tennenholtz & Mannor (2019) and Chandak et al. (2019) propose methods for learning embeddings for discrete actions that can be used directly by an RL agent and improve generalization over actions. However, these works treat state representation and action representation as isolated tasks, ignoring the underlying relationships between states and actions.
In this regard, we take a different approach and propose to jointly learn embeddings for states and actions, aiming for better generalization over both in their respective embedding spaces. To this end, we propose an architecture consisting of two models: a model of the environment that is used to generate state and action representations, and a model-free RL agent that learns a policy using the embedded states and actions. By combining these two models, our approach unites aspects of model-based and model-free reinforcement learning and effectively bridges the gap between the two. In contrast to model-based RL, however, we do not use the environment model for planning, but to learn state and action embeddings. One key benefit of this approach is that state and action representations can be learned in a supervised manner, which greatly improves sample efficiency and potentially enables their use for transfer learning. In sum, our key contributions are:

• We formulate an embedding model for states and actions, along with an internal policy π_i that leverages the learned state/action embeddings, as well as the corresponding overall policy π_o. We show the existence of an overall policy π_o that achieves optimality in the original problem domain.

• We further prove the equivalence between updates to the internal policy π_i acting in embedding space and updates to the overall policy π_o.

• We present a supervised learning algorithm for the proposed embedding model that can be combined with any policy-gradient-based RL algorithm.

• We evaluate our methodology on game-based as well as real-world tasks and find that it outperforms state-of-the-art models in both discrete and continuous domains.

The remainder of this paper is structured as follows: In Section 2, we provide background on RL. We then give an overview of related work in Section 3, before presenting our proposed methodology in Section 4, which we evaluate in Section 5.
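The two-model architecture described above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the dimensions are hypothetical, the names (embed_state, internal_policy, overall_policy) are ours, and random linear projections stand in for the learned environment-model encoders; the point is only how π_o composes the state encoder, the internal policy π_i, and an action decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: large raw state/action spaces, small embedding spaces.
STATE_DIM, ACTION_DIM = 100, 50
STATE_EMB, ACTION_EMB = 8, 4

# Stand-ins for the learned embedding models. In the proposed approach these
# would be trained in a supervised manner from environment transitions; here
# they are fixed random projections for illustration only.
W_s = rng.normal(size=(STATE_EMB, STATE_DIM)) / np.sqrt(STATE_DIM)
W_a = rng.normal(size=(ACTION_DIM, ACTION_EMB)) / np.sqrt(ACTION_EMB)

# Learnable parameters of the internal policy (initialized to zero here).
theta = np.zeros((ACTION_EMB, STATE_EMB))

def embed_state(s):
    """Map a raw state to its low-dimensional embedding."""
    return W_s @ s

def internal_policy(e_s):
    """pi_i: act entirely in embedding space, emitting an action embedding."""
    return theta @ e_s

def overall_policy(s):
    """pi_o: compose encoder, pi_i, and action decoder, so the agent still
    outputs a distribution over the original (large) action space."""
    e_a = internal_policy(embed_state(s))
    logits = W_a @ e_a                  # score each discrete action
    p = np.exp(logits - logits.max())
    return p / p.sum()

s = rng.normal(size=STATE_DIM)
probs = overall_policy(s)
print(probs.shape)
```

Because the policy-gradient update only touches theta, learning happens in the small embedding space while the agent still acts in the original action space.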

2. BACKGROUND

We consider an agent interacting with its environment over discrete time steps, where the environment is modelled as a discrete-time Markov decision process (MDP) defined by a tuple (S, A, T, R, γ). Specifically, S and A are the sets of all possible states and actions, referred to as the state space and action space, respectively. In this work, we consider both discrete and continuous state and action spaces. The transition function from one state to another, given an action, is T : S × A → S, which may be deterministic or stochastic. The agent receives a reward at each time step defined by the reward function R : S × A → ℝ, and γ ∈ [0, 1] denotes the reward discounting parameter. The state, action, and reward at time t ∈ {0, 1, 2, ...} are denoted by the random variables S_t, A_t, and R_t. The initial state of the environment is drawn from an initial state distribution d_0. Thereafter, the agent follows a policy π, defined as a conditional distribution over actions given states, i.e., π(a|s) = P(A_t = a | S_t = s). The goal of the reinforcement learning agent is to find an optimal policy π* that maximizes the expected sum of discounted future rewards for a given environment, i.e., π* ∈ arg max_π E[∑_{t=0}^{∞} γ^t R_t | π]. For any policy, we also define the state value function v^π(s) = E[∑_{k=0}^{∞} γ^k R_{t+k} | π, S_t = s] and the state-action value function Q^π(s, a) = E[∑_{k=0}^{∞} γ^k R_{t+k} | π, S_t = s, A_t = a].
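To make the definitions concrete, the state value function v^π can be computed exactly for a small finite MDP by solving the Bellman expectation equation v = r_π + γ P_π v. The MDP below is a toy example of our own construction (its transition probabilities, rewards, and the uniform policy are illustrative, not from the paper):

```python
import numpy as np

# Toy 2-state, 2-action MDP: T[s, a] is the next-state distribution,
# R[s, a] the expected immediate reward (all numbers illustrative).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5],     # pi(a|s): uniform policy
               [0.5, 0.5]])

def v_pi_exact(T, R, gamma, pi):
    """Solve v = r_pi + gamma * P_pi v for the state value function v^pi."""
    n = T.shape[0]
    P_pi = np.einsum('sa,sax->sx', pi, T)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected one-step reward
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

v = v_pi_exact(T, R, gamma, pi)
print(v)  # one value per state
```

Each v(s) matches the definition above: the expected discounted return E[∑_k γ^k R_{t+k} | π, S_t = s], bounded by max R / (1 − γ).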

3. RELATED WORK

For the application of state embeddings in reinforcement learning, there are two dominant strands of research, namely world models and state aggregation using bisimulation metrics. World model approaches train an environment model in a supervised fashion from experience collected in the environment, which is then used to generate compressed state representations (Ha & Schmidhuber, 2018a) or to train an agent inside the learned world model (Ha & Schmidhuber, 2018b; Schmidhuber, 2015). Further applications of world models, e.g. to Atari 2600 domains, show that abstract state representations learned via world models can substantially improve sample efficiency (Kaiser et al., 2019; Hafner et al., 2020). Similar to this idea, Munk et al. (2016) pre-train an environment model and use it to provide state representations for an RL agent. Furthermore, de Bruin et al. (2018) investigate additional learning objectives for learning state representations, and Francois-Lavet et al. (2019) propose the use of an environment model to generate abstract state representations, which are then used by a Q-learning agent. By using a learned model of the environment to generate abstract states, these approaches capture structure in the state space and reduce the dimensionality of its representation, an idea similar to our proposed embedding model.

