SOLVING CONTINUOUS CONTROL VIA Q-LEARNING

Abstract

While there has been substantial success in solving continuous control with actor-critic methods, simpler critic-only methods such as Q-learning see limited application in the associated high-dimensional action spaces. Yet most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, compute requirements, and wider hyperparameter search spaces. We show that a simple modification of deep Q-learning largely alleviates these issues. By combining bang-bang action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL), this simple critic-only approach matches the performance of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a variety of continuous control tasks.
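
To make the construction concrete, the following is a minimal PyTorch sketch of a decoupled critic in the spirit of the approach described above. The class name, trunk depth, and layer widths are illustrative assumptions rather than the authors' exact architecture. Each action dimension keeps its own utility head over the two bang-bang actions {-1, +1}, and the joint Q-value aggregates the selected per-dimension utilities by their mean (one common linear value-decomposition choice), so the greedy joint action decomposes into independent per-dimension maximizations.

```python
# Minimal sketch of a decoupled Q-network with bang-bang actions.
# Hypothetical names and sizes; PyTorch assumed.
import torch
import torch.nn as nn


class DecoupledQNetwork(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.act_dim = act_dim
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # One utility per (action dimension, bang-bang bin) pair.
            nn.Linear(hidden, act_dim * 2),
        )

    def utilities(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-dimension utilities, shape (batch, act_dim, 2).
        return self.trunk(obs).view(-1, self.act_dim, 2)

    def q_value(self, obs: torch.Tensor, act_idx: torch.Tensor) -> torch.Tensor:
        # Joint value: mean of the chosen per-dimension utilities.
        util = self.utilities(obs)                               # (batch, act_dim, 2)
        chosen = util.gather(-1, act_idx.unsqueeze(-1)).squeeze(-1)
        return chosen.mean(dim=-1)                               # (batch,)

    def greedy_action(self, obs: torch.Tensor) -> torch.Tensor:
        # Each dimension maximizes independently, so the argmax over the
        # 2^act_dim joint actions costs only O(act_dim).
        return self.utilities(obs).argmax(dim=-1)                # bin indices
```

Under this decomposition the TD target also factorizes: the bootstrap term is the mean over dimensions of each head's independent maximum, e.g. `r + gamma * net.utilities(next_obs).max(dim=-1).values.mean(dim=-1)`. Bin indices map back to torques via `2.0 * idx.float() - 1.0`, recovering the bang-bang actions {-1, +1}.
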

1. INTRODUCTION

Reinforcement learning provides a powerful framework for autonomous systems to acquire complex behaviors through interaction. Learning efficiency remains a central aspect of algorithm design, spanning a broad spectrum from sample-efficient model-based off-policy approaches (Ha & Schmidhuber, 2018; Hafner et al., 2019) at one extreme to time-efficient on-policy approaches leveraging parallel simulation (Rudin et al., 2022; Xu et al., 2021) at the other. Particularly in high-dimensional domains with complicated environment dynamics and task objectives, complex trade-offs between representational capacity, exploration capabilities, and optimization accuracy commonly arise. Continuous state and action spaces yield particularly challenging exploration problems due to the vast set of potential trajectories they induce. Significant research effort has focused on improving



Figure 1: Q-learning yields state-of-the-art performance on various continuous control benchmarks. Simply combining bang-bang action discretization with full value decomposition scales to high-dimensional control tasks and recovers performance competitive with recent actor-critic methods. Our Decoupled Q-Networks (DecQN) thereby constitute a concise baseline agent to highlight the power of simplicity and to help put recent advances in learning continuous control into perspective.

