SOLVING CONTINUOUS CONTROL VIA Q-LEARNING

Abstract

While there has been substantial success in solving continuous control with actor-critic methods, simpler critic-only methods such as Q-learning find limited application in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, increased compute requirements, and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a variety of continuous control tasks.



Figure 1: Q-learning yields state-of-the-art performance on various continuous control benchmarks. Simply combining bang-bang action discretization with full value decomposition scales to high-dimensional control tasks and recovers performance competitive with recent actor-critic methods. Our Decoupled Q-Networks (DecQN) thereby constitute a concise baseline agent that highlights the power of simplicity and helps put recent advances in learning continuous control into perspective.

1. INTRODUCTION

Reinforcement learning provides a powerful framework for autonomous systems to acquire complex behaviors through interaction. Learning efficiency remains a central aspect of algorithm design, with a broad spectrum spanning sample-efficient model-based off-policy approaches (Ha & Schmidhuber, 2018; Hafner et al., 2019) at one extreme and time-efficient on-policy approaches leveraging parallel simulation at the other (Rudin et al., 2022; Xu et al., 2021). Particularly in high-dimensional domains with complicated environment dynamics and task objectives, complex trade-offs between representational capacity, exploration capabilities, and optimization accuracy commonly arise. Continuous state and action spaces yield particularly challenging exploration problems due to the vast set of potential trajectories they induce. Significant research effort has focused on improving efficiency through representation learning in the context of model-free abstraction or model-based planning (Ha & Schmidhuber, 2018; Srinivas et al., 2020; Wulfmeier et al., 2021), guided exploration via auxiliary rewards (Osband et al., 2016; Pathak et al., 2017; Sekar et al., 2020; Seyde et al., 2022b), or constrained optimization, particularly to stabilize learning with actor-critic approaches (Schulman et al., 2015; Haarnoja et al., 2018; Abdolmaleki et al., 2018). However, recent results have shown that competitive performance can be achieved with strongly reduced, discretized versions of the original action space (Tavakoli et al., 2018; Tang & Agrawal, 2020; Seyde et al., 2021). This raises the question of whether tasks with complex high-dimensional action spaces can be solved using simpler critic-only, discrete action-space algorithms instead. A potential candidate is Q-learning, which only requires learning a critic, with the policy commonly following via ε-greedy or Boltzmann exploration (Watkins & Dayan, 1992; Mnih et al., 2013).
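As a point of reference for the critic-only setup just described, tabular Q-learning with ε-greedy exploration can be sketched in a few lines. This is a minimal illustration only; the dictionary-based table and the hyperparameter values are our own assumptions, not details from this work:

```python
import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One temporal-difference update: Q(s,a) += alpha * (target - Q(s,a)).

    Q is a dict mapping (state, action) pairs to values; the bootstrap
    target maximizes over the successor state's actions.
    """
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """The policy follows from the critic: random with prob. epsilon, else greedy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])
```

Note that no explicit policy is represented anywhere: behavior is read off the critic directly, which is exactly the property that makes critic-only methods appealingly simple.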
While naive Q-learning struggles in high-dimensional action spaces due to the exponential scaling of possible action combinations, the multi-agent RL literature has shown that factored value function representations combined with centralized training can alleviate some of these challenges (Sunehag et al., 2017; Rashid et al., 2018), further inspiring transfer to single-agent control settings (Sharma et al., 2017; Tavakoli, 2021). Other methods enable application of critic-only agents to continuous action spaces but require additional, costly, sampling-based optimization (Kalashnikov et al., 2018). We build on insights at the intersection of these methods to show that a surprisingly straightforward variation of deep Q-learning (Mnih et al., 2013), within the framework of Hypergraph Q-Networks (HGQN) (Tavakoli et al., 2021), can indeed solve various state- and pixel-based single-agent continuous control problems at performance levels competitive with state-of-the-art continuous control algorithms. This is achieved by combining extreme action-space discretization with full value decomposition and extensive parameter sharing, requiring only small modifications of DQN. To summarize, this work focuses on the following key contributions:

• The DecQN agent as a simple, decoupled version of DQN combining value decomposition with bang-bang action-space discretization to achieve performance competitive with state-of-the-art continuous control actor-critic algorithms on state- and pixel-based benchmarks.

• The related discussion of which aspects are truly required for competitive performance in continuous control, as bang-bang control paired with actuator decoupling in the critic, without an explicit policy representation, appears sufficient to solve common benchmarks.
• An investigation of time-extended collaborative multi-agent bandit settings to determine how decoupled critics leverage implicit communication and the observed state distribution to resolve optimisation challenges resulting from correlations between action dimensions.
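The decoupled maximization behind these contributions can be sketched directly: each action dimension receives bang-bang options {-1, +1}, a critic outputs per-dimension utilities, and the joint value is their mean, so greedy selection costs O(2 · dims) rather than O(2^dims). The function and variable names below are our own illustration, not the paper's implementation:

```python
import numpy as np

def decoupled_greedy(utilities):
    """Greedy action under a fully decomposed bang-bang critic.

    `utilities` has shape (dims, 2): one utility per action dimension and
    per bang-bang option {-1, +1}. Because the joint value is the mean of
    per-dimension utilities, the joint argmax factorizes into independent
    per-dimension argmax operations.
    """
    options = np.array([-1.0, 1.0])
    idx = utilities.argmax(axis=1)            # independent argmax per dimension
    action = options[idx]                     # joint bang-bang action
    q_value = utilities.max(axis=1).mean()    # decomposed joint value
    return action, q_value
```

For a 20-dimensional actuator space, a naive critic would have to score 2^20 ≈ 10^6 joint actions, whereas the decomposed maximization above touches only 40 utility entries.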

2. RELATED WORKS

Discretized Control Continuous control problems are commonly solved with continuous policies (Schulman et al., 2017; Abdolmaleki et al., 2018; Haarnoja et al., 2018; Hafner et al., 2019; Yarats et al., 2021). Recently, it has been shown that even discretized policies can yield competitive performance and favorable exploration in continuous domains with acceleration-level control (Tavakoli et al., 2018; Farquhar et al., 2020; Neunert et al., 2020; Tang & Agrawal, 2020; Seyde et al., 2021; 2022a). When considering discretized policies, discrete action-space algorithms are a natural choice (Metz et al., 2017; Sharma et al., 2017; Tavakoli, 2021). Q-learning-based approaches in particular promise reduced model complexity by avoiding explicit policy representations (Watkins & Dayan, 1992), although implicit policies in the form of proposal distributions may be required for scalability (Van de Wiele et al., 2020). We build on perspectives from cooperative multi-agent RL to tackle complex single-agent continuous control tasks with a decoupled extension of Deep Q-Networks (Mnih et al., 2013) over discretized action spaces, reducing agent complexity and helping to dissect which components are required for competitive agents in continuous control applications.

Cooperative MARL Conventional Q-learning requires both representation of and maximisation over an action space that grows exponentially with the number of dimensions, and it does not scale well to the high-dimensional problems commonly encountered in continuous control of robotic systems. Significant research in multi-agent reinforcement learning (MARL) has focused on improving the scalability of Q-learning-based approaches (Watkins & Dayan, 1992). Early works considered indepen-

