FUZZY TILING ACTIVATIONS: A SIMPLE APPROACH TO LEARNING SPARSE REPRESENTATIONS ONLINE

Abstract

Recent work has shown that sparse representations-where only a small percentage of units are active-can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches that have only been used offline, in a pre-training phase. In this work, we pursue a direction that achieves sparsity by design, rather than by learning. Specifically, we design an activation function that produces sparse representations deterministically by construction, and so is more amenable to online training. The idea relies on the simple approach of binning, but overcomes its two key limitations: zero gradients almost everywhere, due to the flat regions, and lost precision-reduced discrimination-due to coarse aggregation. We introduce the Fuzzy Tiling Activation (FTA), which provides non-negligible gradients and produces overlap between bins that improves discrimination. We first show that FTA is robust under covariate shift in a synthetic online supervised learning problem, where we can vary the level of correlation and drift. We then move to the deep reinforcement learning setting and investigate both value-based and policy gradient algorithms that use neural networks with FTA, on classic discrete control and MuJoCo continuous control environments. We show that algorithms equipped with FTA learn a stable policy faster, without needing target networks, on most domains.

1. INTRODUCTION

Representation learning in online learning systems can strongly impact learning efficiency, both positively due to generalization and negatively due to interference (Liang et al., 2016; Heravi, 2019; Le et al., 2017; Liu et al., 2019; Chandak et al., 2019; Caselles-Dupré et al., 2018; Madjiheurem & Toni, 2019). Neural networks particularly suffer from interference-where updates for some inputs degrade accuracy for others-when training on temporally correlated data (McCloskey & Cohen, 1989; French, 1999; Kemker et al., 2018). Recent work (Liu et al., 2019; Ghiassian et al., 2020; Javed & White, 2019; Rafati & Noelle, 2019; Hernandez-Garcia & Sutton, 2019), as well as older work (McCloskey & Cohen, 1989; French, 1991), has shown that sparse representations can reduce interference in training parameter updates. A sparse representation is one where only a small number of features are active for each input (Cheng et al., 2013). Each update then impacts only a small number of weights, and so is less likely to interfere with many state values. Further, when constrained to be sparse, the learned feature vectors are more likely to be orthogonal (Cover, 1965), which further mitigates interference. The learned features can still be highly expressive, and even more interpretable, as only a small number of attributes are active for a given input. However, learning sparse representations online remains a relatively open problem. Some previous work has relied on representations pre-trained before learning, either with regularizers that encourage sparsity (Tibshirani, 1996; Xiang et al., 2011; Liu et al., 2019) or with meta-learning (Javed & White, 2019). Other work has trained the sparse-representation neural network online, by using sparsity regularizers with replay buffers (Hernandez-Garcia & Sutton, 2019) or by using a winner-take-all strategy where all but the top activations are set to zero (Rafati & Noelle, 2019).
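The orthogonality argument above can be checked with a small numerical experiment: random sparse feature vectors are far closer to pairwise orthogonal than random dense ones, so weight updates driven by one input overlap much less with the features of another. The sketch below is illustrative only; the dimensions and sparsity level are arbitrary choices, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(vectors):
    # Average absolute cosine similarity over all distinct pairs: a rough
    # measure of how much two feature vectors overlap, and hence how much
    # a gradient update for one input can interfere with another.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ v.T
    off_diag = sims[~np.eye(len(v), dtype=bool)]
    return float(np.abs(off_diag).mean())

d, n, k = 256, 100, 10
dense = rng.random((n, d))                  # every feature active
sparse = np.zeros((n, d))
for row in sparse:                          # only k of d features active
    row[rng.choice(d, size=k, replace=False)] = rng.random(k)
```

With only k = 10 of 256 features active, most pairs of sparse vectors have disjoint supports and are exactly orthogonal, while the dense vectors (all entries positive) are strongly aligned on average.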
Hernandez-Garcia & Sutton (2019) found that many of these sparsity regularizers were ineffective for obtaining sparse representations without high levels of dead neurons, though the regularizers did still often improve learning. The Winner-Take-All (WTA) approach is non-differentiable, and there are mixed results on its efficacy, some positive (Rafati & Noelle, 2019) and some negative (Liu et al., 2019). Finally, kernel representations can be used online and, when combined with a WTA approach, provide sparse representations. There is some evidence that using only the closest prototypes, and setting kernel values to zero for the farther prototypes, may not hurt approximation quality (Schlegel et al., 2017). However, kernel-based methods can be difficult to scale to large problems, due to computational cost and the difficulty of finding a suitable distance metric. A simpler approach to obtaining sparse representations, one that is easy to train online, would make it easier for researchers from the broad online learning community to adopt sparse representations and further explore their utility. In this work, we pursue a strategy for what we call natural sparsity: sparsity achieved by design rather than by encoding it in the loss. We introduce an activation function that facilitates sparse representation learning in an end-to-end manner, without the need for additional losses, pre-training, or manual truncation. Specifically, we introduce the Fuzzy Tiling Activation (FTA), which naturally produces sparse representations with controllable sparsity and can be used as conveniently as any other activation function in a neural network. FTA relies on the idea of designing a differentiable approximation to binning, in which inputs are aggregated into intervals. We prove that FTA guarantees sparsity by construction.
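To make the binning idea concrete, the NumPy sketch below contrasts hard binning with one plausible fuzzy variant. Hard binning one-hot encodes which tile a scalar input falls in: sparse by construction, but piecewise constant, so its gradient is zero almost everywhere. The fuzzy version replaces the hard tile boundary with a linear ramp of width η, so adjacent tiles partially activate and gradients are nonzero near boundaries. The tiling range, tile count, and this particular smoothing are illustrative choices; the paper's precise FTA construction is given in later sections.

```python
import numpy as np

def hard_binning(z, low=-1.0, high=1.0, n_tiles=8):
    # One-hot encoding of the tile containing z: sparse, but piecewise
    # constant in z, so the gradient is zero almost everywhere.
    c = np.linspace(low, high, n_tiles, endpoint=False)  # left tile edges
    delta = (high - low) / n_tiles                       # tile width
    return ((c <= z) & (z < c + delta)).astype(float)

def fuzzy_binning(z, low=-1.0, high=1.0, n_tiles=8, eta=0.1):
    # d measures how far z lies outside each tile [c_i, c_i + delta];
    # inside a tile d = 0 and the activation is 1. Rather than dropping
    # to 0 at the tile boundary, the activation ramps down linearly over
    # a width eta, giving overlap between neighboring tiles and nonzero
    # gradients in the ramp region.
    c = np.linspace(low, high, n_tiles, endpoint=False)
    delta = (high - low) / n_tiles
    d = np.maximum(c - z, 0.0) + np.maximum(z - c - delta, 0.0)
    return np.maximum(1.0 - d / eta, 0.0)
```

With η smaller than half the tile width, as here, a scalar input activates at most two of the eight tiles, so the representation is sparse deterministically, with no auxiliary loss or truncation step.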
We empirically investigate the properties of FTA in an online supervised learning problem, where we can carefully control the level of correlation. We then empirically demonstrate FTA's practical utility in a more challenging online learning setting: deep Reinforcement Learning (RL). On a variety of discrete and continuous control domains, deep RL algorithms using FTA learn more quickly and stably than those using ReLU activations, as well as several online sparse-representation learning approaches.

2. PROBLEM FORMULATION

FTA is a generic activation that can be applied in a variety of settings. A distinguishing property of FTA is that it does not need to learn to ensure sparsity; instead, it provides an immediate, deterministic sparsity guarantee. We hypothesize that this property is well suited to highly nonstationary data in an online learning setting, where the data stream is highly correlated and there is a strong need for interference reduction. We therefore explicitly formalize two motivating problems: the online supervised learning problem and the reinforcement learning (RL) problem.

Online Supervised Learning problem setting. The agent observes a temporally correlated stream of data, generated by a stochastic process {(X_t, Y_t)}_{t∈ℕ}, where the observation X_t depends on the past {X_{t-i}}_{i∈ℕ}. In our supervised setting, X_t depends only on X_{t-1}, and the target Y_t depends only on X_t according to a stationary underlying mean function f(x) = E[Y_t | X_t = x]. On each time step, the agent observes X_t, makes a prediction f_θ(X_t) with its parameterized function f_θ, receives the target Y_t, and incurs a prediction error. The goal of the agent is to approximate the function f, the ideal predictor, by learning from correlated data in an online manner, unlike standard supervised learning where data is independent and identically distributed (iid).

RL problem setting. We formalize the interaction using Markov decision processes (MDPs). An MDP consists of (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition probability kernel, R is the reward function, and γ ∈ [0, 1] is the discount factor. At each time step t = 1, 2, . . ., the agent observes a state s_t ∈ S and takes an action a_t ∈ A. The environment then transitions to the next state according to the transition probability distribution, i.e., s_{t+1} ∼ P(·|s_t, a_t), and the agent receives a scalar reward r_{t+1} ∈ ℝ according to the reward function R : S × A × S → ℝ.
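As an illustration of the online supervised setting, the snippet below generates one such stream: X_t follows an AR(1)-style recursion, so it depends only on X_{t-1}, and Y_t is a noisy observation of a fixed mean function f. Both the choice f(x) = sin(πx) and this particular recursion are hypothetical stand-ins, not the paper's exact generator; the parameter rho controls how strongly consecutive inputs correlate.

```python
import numpy as np

def correlated_stream(n_steps, rho=0.95, noise_std=0.1, seed=0):
    # X_t depends only on X_{t-1} (first-order Markov), while the mean
    # function f is stationary: only the input distribution drifts.
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)     # hypothetical target mean function
    x = rng.uniform(-1.0, 1.0)
    for _ in range(n_steps):
        x = rho * x + np.sqrt(1.0 - rho**2) * rng.uniform(-1.0, 1.0)
        yield x, f(x) + noise_std * rng.standard_normal()
```

An online learner consumes these pairs one at a time, predicting f_θ(X_t) before observing Y_t. With rho near 1 the inputs drift slowly through the input space, which is exactly the correlated regime where interference is most damaging.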
A policy is a mapping from states to action distributions, π : S × A → [0, 1]. For a given state-action pair (s, a), the action-value function under policy π is defined as Q^π(s, a) = E[G_t | S_t = s, A_t = a; A_{t+1:∞} ∼ π], where G_t := Σ_{i=0}^{∞} γ^i R(S_{t+i}, A_{t+i}, S_{t+i+1}) is the return of the sequence of transitions s_t, a_t, s_{t+1}, a_{t+1}, . . . obtained by following policy π. The goal of the agent is to find an optimal policy that obtains the maximal expected return from each state. The policy is either learned directly, as in policy gradient methods (Sutton et al., 1999; Sutton & Barto, 2018), or the action-values are learned and the policy is inferred by acting greedily with respect to them, as in Q-learning (Watkins & Dayan, 1992). In either setting, we often parameterize the policy or value function by a neural network (NN). For example, Deep Q Networks (DQN) (Mnih

Code is available at https://github.com/yannickycpan/reproduceRL.git

