SIMPLE AUGMENTATION GOES A LONG WAY: ADRL FOR DNN QUANTIZATION

Abstract

Mixed precision quantization improves DNN performance by assigning different layers different bit-width values. Searching for the optimal bit-width for each layer, however, remains a challenge. Deep Reinforcement Learning (DRL) shows some recent promise. It suffers, however, from instability due to function approximation errors, which causes large variance in the early training stages, slow convergence, and suboptimal policies in the mixed precision quantization problem. This paper proposes augmented DRL (ADRL) as a way to alleviate these issues. This new strategy augments the neural networks in DRL with a complementary scheme to boost the performance of learning. The paper examines the effectiveness of ADRL both analytically and empirically, showing that it can produce more accurate quantized models than state-of-the-art DRL-based quantization while improving the learning speed by 4.5-64×.

1. INTRODUCTION

By reducing the number of bits needed to represent the parameters of Deep Neural Networks (DNNs), quantization (Lin et al., 2016; Park et al., 2017; Han et al., 2015; Zhou et al., 2018; Zhu et al., 2016; Hwang & Sung, 2014; Wu et al., 2016; Zhang et al., 2018; Köster et al., 2017; Ullrich et al., 2017; Hou & Kwok, 2018; Jacob et al., 2018) is an important way to reduce the size and improve the energy efficiency and speed of DNNs. Mixed precision quantization selects a proper bit-width for each layer of a DNN, offering more flexibility than fixed precision quantization. A major challenge to mixed precision quantization (Micikevicius et al., 2017; Cheng et al., 2018) is the configuration search problem, that is, how to find the appropriate bit-width for each DNN layer efficiently. The search space grows exponentially with the number of layers, and assessing each candidate configuration requires lengthy training and evaluation of the DNN. Research efforts have been devoted to mitigating this issue in order to better tap into the power of mixed precision quantization. Prior methods mainly fall into two categories: (i) automatic methods, such as reinforcement learning (RL) (Lou et al., 2019; Gong et al., 2019; Wang et al., 2018; Yazdanbakhsh et al., 2018; Cai et al., 2020) and neural architecture search (NAS) (Wu et al., 2018; Li et al., 2020), which learn from feedback signals and automatically determine the quantization configurations; (ii) heuristic methods, which reduce the search space under the guidance of metrics such as the weight loss or Hessian spectrum (Dong et al., 2019; Wu et al., 2018; Zhou et al., 2018; Park et al., 2017) of each layer. Compared to heuristic methods, automatic methods, especially Deep Reinforcement Learning (DRL), require little human effort and give state-of-the-art performance (e.g., via the actor-critic setting (Whiteson et al., 2011; Zhang et al., 2016; Henderson et al., 2018; Wang et al., 2018)).
DRL, however, suffers from overestimation bias and high variance in the estimated value, and hence slow convergence and suboptimal results. The problem fundamentally stems from the poor function approximations given by the DRL agent, especially during the early stage of the DRL learning process (Thrun & Schwartz, 1993; Anschel et al., 2017; Fujimoto et al., 2018), when the neural networks used in the DRL are of low quality. This issue prevents DRL from serving as a scalable solution to DNN quantization as DNNs become deeper and more complex.

This paper reports that simple augmentations can bring some surprising improvements to DRL for DNN quantization. We introduce augmented DRL (ADRL) as a principled way to significantly magnify the potential of DRL for DNN quantization. The principle of ADRL is to augment the neural networks in DRL with a complementary scheme (called an augmentation scheme) that compensates for the weakness of the DRL policy approximator. Analytically, we prove the effects of such a method in reducing the variance and improving the convergence rates of DRL. Empirically, we exemplify ADRL with two example augmentation schemes and test them on four popular DNNs. Comparisons with four prior DRL methods show that ADRL can shorten the quantization process by 4.5-64× while improving the model accuracy substantially.

It is worth mentioning that there is some prior work on increasing the scalability of DRL. Dulac-Arnold et al. (2016), for instance, address large discrete action spaces by embedding them into continuous spaces and leveraging nearest-neighbor search to find the closest actions. Our focus is different: we aim to enhance the learning speed of DRL by augmenting the weak policy approximator with complementary schemes.

2. BACKGROUND

Deep Deterministic Policy Gradient (DDPG). A standard reinforcement learning system consists of an agent interacting with an environment $E$. At each time step $t$, the agent receives an observation $x_t$, takes an action $a_t$, and then receives a reward $r_t$. Modeled as a Markov decision process (MDP) with a state space $S$ and an action space $A$, the agent's behavior is defined by a policy $\pi : S \rightarrow A$. When the environment is only partially observed, a state is defined as the sequence of actions and observations $s_t = (x_1, a_1, \cdots, a_{t-1}, x_t)$. For DNN quantization, the environment is assumed to be fully observable ($s_t = x_t$). The return from a state $s$ at time $t$ is defined as the future discounted return $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$ with a discount factor $\gamma$. The goal of the agent is to learn a policy that maximizes the expected return from the start state, $J(\pi) = \mathbb{E}[R_1 \mid \pi]$. An RL agent in continuous action spaces can be trained through the actor-critic algorithm and the deep deterministic policy gradient (DDPG). The parameterized actor function $\mu(s \mid \theta^\mu)$ specifies the current policy and deterministically maps a state $s$ to an action $a$. The critic network $Q(s, a)$ is a neural network function for estimating the action-value $\mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$; it is parameterized with $\theta^Q$ and is learned using the Bellman equation, as in Q-learning. The critic is updated by minimizing the loss
$$L(\theta^Q) = \mathbb{E}\big[(y_t - Q(s_t, a_t \mid \theta^Q))^2\big], \quad \text{where} \quad y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1} \mid \theta^\mu) \mid \theta^Q). \quad (1)$$
The actor is updated by applying the chain rule to the expected return $J$ with respect to its parameters:
$$\nabla_{\theta^\mu} J \approx \mathbb{E}\big[\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t}\big]. \quad (2)$$

DDPG for Mixed Precision Quantization. To apply DRL to mixed precision quantization, previous work, represented by HAQ (Wang et al., 2018), uses DDPG as the agent learning policy.
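The updates in Eqs. (1) and (2) can be illustrated with a minimal sketch. The snippet below performs one critic update and one actor update for tiny linear actor and critic functions in NumPy; all names, shapes, and the learning rate are our own illustrative choices, not from the paper, and a real DDPG agent would of course use deep networks, replay buffers, and target networks.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, GAMMA, LR = 3, 0.99, 0.01

# Tiny linear actor and critic for illustration only.
w_mu = rng.normal(size=STATE_DIM)        # actor:  mu(s)   = w_mu @ s
w_q = rng.normal(size=STATE_DIM + 1)     # critic: Q(s, a) = w_q @ [s, a]

def mu(s):
    return w_mu @ s

def q(s, a):
    return w_q @ np.append(s, a)

def ddpg_step(s, a, r, s_next):
    """One update of critic (Eq. 1) and actor (Eq. 2) on a single transition."""
    global w_mu, w_q
    # Eq. (1): Bellman target, then a gradient step on (y - Q(s, a))^2.
    y = r + GAMMA * q(s_next, mu(s_next))
    td_error = y - q(s, a)
    w_q += LR * td_error * np.append(s, a)
    # Eq. (2): deterministic policy gradient via the chain rule.
    grad_a_q = w_q[-1]          # dQ/da is the action weight for a linear critic
    w_mu += LR * grad_a_q * s   # dmu/dw_mu = s for a linear actor
    return td_error

s, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
td = ddpg_step(s, mu(s), 1.0, s_next)
```

In a full implementation the expectation in the loss is approximated by averaging over a mini-batch of transitions sampled from a replay buffer rather than a single transition as above.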
The environment is assumed to be fully observed so that $s_t = x_t$, where the observation $x_t$ is defined as $x_t = (l, c_{in}, c_{out}, s_{kernel}, s_{stride}, s_{feat}, n_{params}, i_{dw}, i_{w/a}, a_{t-1})$ for convolution layers and $x_t = (l, h_{in}, h_{out}, 1, 0, s_{feat}, n_{params}, 0, i_{w/a}, a_{t-1})$ for fully connected layers. Here, $l$ denotes the layer index, $c_{in}$ and $c_{out}$ are the numbers of input and output channels of the convolution layer, $s_{kernel}$ and $s_{stride}$ are the kernel size and stride of the convolution layer, $h_{in}$ and $h_{out}$ are the numbers of input and output hidden units of the fully connected layer, $n_{params}$ is the number of parameters, $i_{dw}$ and $i_{w/a}$ are binary indicators for depth-wise convolution and weight/activation, and $a_{t-1}$ is the action given by the agent at the previous step. At time step $t-1$, the agent gives an action $a_{t-1}$ for layer $l-1$, leading to an observation $x_t$; the agent then gives the action $a_t$ for layer $l$ at time step $t$. The agent updates the actor and critic networks after one episode, following DDPG, where an episode is a full pass over all the layers of the target neural network being quantized. The time step $t$ and the layer index $l$ are thus interchangeable in this scenario.
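The observation layout above can be sketched as two small helpers, one per layer type. The function and parameter names below are our own and the values are illustrative; they follow the field order described in the text, not HAQ's actual implementation.

```python
def conv_observation(layer_idx, c_in, c_out, kernel, stride,
                     feat_size, n_params, is_depthwise, is_weight, prev_action):
    """10-dim observation x_t for a convolution layer."""
    return [layer_idx, c_in, c_out, kernel, stride,
            feat_size, n_params, int(is_depthwise), int(is_weight), prev_action]

def fc_observation(layer_idx, h_in, h_out, feat_size,
                   n_params, is_weight, prev_action):
    """10-dim observation x_t for a fully connected layer.

    Kernel size and stride are fixed to 1 and 0, and the depth-wise
    indicator is 0, so the vector has the same shape as for conv layers.
    """
    return [layer_idx, h_in, h_out, 1, 0,
            feat_size, n_params, 0, int(is_weight), prev_action]

# Example: observations for a hypothetical conv layer and FC layer.
x_conv = conv_observation(3, 64, 128, 3, 1, 28, 73728, False, True, 0.5)
x_fc = fc_observation(10, 512, 10, 1, 5120, True, 0.25)
```

Keeping both layer types in one fixed 10-dimensional layout lets a single actor-critic pair handle every layer of the network; in practice each field is also normalized to [0, 1] before being fed to the agent.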

