

Abstract

Recent advances have made Deep Reinforcement Learning (DRL) considerably more powerful, but the resulting models remain computationally complex and therefore difficult to deploy on edge devices. Compression methods such as quantization and distillation can increase the applicability of DRL models on these low-power edge devices by decreasing the necessary precision and the number of operations, respectively. Training in low precision is, however, notoriously less stable, and this instability is amplified by the decrease in representational power when limiting the number of trainable parameters. We propose Quantization-aware Policy Distillation (QPD), which overcomes this instability by providing a smoother transition from high- to low-precision network parameters. We also define a new distillation loss specifically designed for the compression of actor-critic networks, resulting in higher accuracy after compression. Our experiments show that these combined methods can effectively compress a policy network down to 0.5% of its original size without any loss in performance.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) has recently achieved super-human performance on Atari games (Mnih et al., 2015), Go (Schrittwieser et al., 2020) and Starcraft (Vinyals et al., 2019), but at the same time the underlying policy networks have grown significantly larger. Running inference for a large network such as ResNet-200 (He et al., 2016) can take half a second on a GPU and requires a large amount of memory. Deploying such a network locally, for example on a low-power Locobot (loc, 2019), therefore becomes impractical due to computation and power constraints. Model compression can counter this by reducing the size of Deep Neural Networks (DNNs) without decreasing performance. Three such methods currently show the most potential: pruning, quantization and distillation. We introduce the Quantization-aware Policy Distillation (QPD) algorithm based on the latter two techniques, with the concept and results shown in figure 1. In quantization, the precision of the DNN parameters is reduced, requiring less memory and enabling inference on simpler embedded hardware (Stanton et al., 2021). Distillation can reduce the number of DNN parameters, and therefore the number of computations, by transferring the knowledge of a larger teacher network to a student network with fewer parameters.
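To make the two mechanisms concrete, the following is a minimal NumPy sketch, not the implementation used in this paper: `quantize_uniform` illustrates standard uniform (fixed-point) weight quantization, and `distillation_kl` illustrates the standard policy-distillation objective, a temperature-softened KL divergence between teacher and student action distributions. All function names, the bit width, and the temperature are illustrative choices, and QPD's actor-critic loss differs from this generic form.

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Uniformly quantize an array to signed num_bits integers, then dequantize.

    Illustrative only: scale is taken from the max absolute weight, and the
    degenerate all-zero case is not handled.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer grid
    return q * scale  # low-precision approximation of w

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, tau=1.0):
    """Mean KL(teacher || student) over action distributions, softened by tau."""
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
```

In quantization-aware training, the student's forward pass would use the dequantized weights while gradients update the full-precision copies; the distillation loss is what the (smaller, quantized) student minimizes against the frozen teacher.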



Figure 1: The size and performance differences between the original teachers, our policy distillation methods and the QPD technique. The average returns on the Atari Breakout environment are shown for the best- and worst-performing student sizes.

