ON THE DATA-EFFICIENCY WITH CONTRASTIVE IMAGE TRANSFORMATION IN REINFORCEMENT LEARNING

Abstract

Data-efficiency has always been an essential issue in pixel-based reinforcement learning (RL), as the agent must learn not only decision-making but also meaningful representations from images. The line of work on reinforcement learning with data augmentation shows significant improvements in sample-efficiency. However, it is challenging to guarantee an optimality-invariant transformation; that is, the augmented data are readily recognized as a completely different state by the agent. To this end, we propose contrastive invariant transformation (CoIT), a simple yet promising learnable data augmentation that can be combined with standard model-free algorithms to improve sample-efficiency. Concretely, the differentiable CoIT leverages original samples together with augmented samples and hastens the state encoder's learning of a contrastive invariant embedding. We evaluate our approach on the DeepMind Control Suite and Atari100K. Empirical results verify the advances of CoIT, which outperforms the previous state of the art on various tasks.

1. INTRODUCTION

Improving data-efficiency in sequential decision-making has always been a crucial problem in pixel-based reinforcement learning, because the agent has to learn an optimal policy and a meaningful abstraction of its observations in parallel. Unlike supervised representation learning, which benefits from strong supervisory signals on high-dimensional inputs, the training process in RL is fragile: inappropriate representation learning can harm training and consequently degrade performance. Hence, there is a pressing need for careful representation learning methods for visual RL. Previous works have demonstrated that introducing auxiliary loss functions such as pixel reconstruction (Yarats et al., 2019) and contrastive learning (Laskin et al., 2020b) alleviates this issue. In particular, data augmentation has already proven beneficial to data-efficiency. RAD (Laskin et al., 2020a) performs extensive experiments and broadly analyzes the impact of various data augmentation techniques. DrQ (Yarats et al., 2020) and DrQ-v2 (Yarats et al., 2021) use appropriate image augmentation with great success. Previous works have also explored the potential of data augmentation for generalization (Hansen et al., 2021; Raileanu et al., 2020; Zhang & Guo, 2021; Hansen & Wang, 2021; Fan et al., 2021). Despite these efforts, it is hard to guarantee that the augmented representations are sufficiently diverse yet semantically consistent. To this end, we explore the underlying condition for representation learning in RL. It is reasonable to hypothesize that there is an optimal transformation that enables an encoder to abstract an informative latent space.
This line of work belongs to the regime of state abstraction (Du et al., 2019; Zhang et al., 2020b; Tomar et al., 2021; Wang et al., 2022), which derives from grouping similar world-states for descriptions of the environment (Dietterich, 2000; Andre & Russell, 2002; Castro & Precup, 2010). Inspired by spatial transformer networks (STN) (Jaderberg et al., 2015), a data augmentation model in the vision domain, we consider that merging a parameterized transformation with visual RL could be beneficial. The designed transformation not only discovers the optimality of state abstraction but also produces diverse virtual samples for the agent. To do so, we enforce a learnable data augmentation that updates its parameters along with the RL objective. To understand parameterized augmentation and its relation to representation learning in RL, we focus on fundamental data manipulation by generating augmented data from a learnable Gaussian distribution. To be clear, we use the image transformation to control the margin of the augmentation under a data distribution that is friendly to RL training, since changing the data distribution while keeping it under the control of the learning algorithm is helpful in high-dimensional cases (Balestriero et al., 2021).

In light of this challenge, we present contrastive invariant transformation (CoIT), a novel contrastive learning method that improves data-efficiency for visual RL. CoIT integrates a learnable transformation into model-free methods with minimal modification to the architecture and training pipeline. Specifically, we parameterize the mean and variance of a Gaussian distribution used to transform data, and update these parameters together with the RL objective under constraints that empirically speed up convergence. As learning goes on, the agent approximates the TRANSFORM distribution that is optimal for the task at hand.
In addition, we evaluate CoIT on the DeepMind Control Suite and Atari100K, and experimental results demonstrate that the learnable transformation outperforms the current SOTA methods. Besides, our method does not require any custom architecture choices and supports reproducible end-to-end training. Based on these results, we demonstrate that a learnable transformation effectively improves data-efficiency for visual RL.

Key Contributions: (i) We present CoIT, a simple yet effective framework with a learnable image transformation that integrates invariant representations with model-free RL to improve data-efficiency. (ii) We provide a theoretical analysis of how our method can approximate a stationary distribution over the transformed data through the optimal invariant metric, thus learning better representations. (iii) We evaluate CoIT on popular benchmarks and show that our method outperforms previous state-of-the-art methods in data-efficiency and stability.

2. RELATED WORK

Several concurrent methods have been proposed for improving data-efficiency; their common ingredients, data augmentation and self-supervised learning, are reviewed below.

Data augmentation in RL. Following the success of data augmentation in computer vision (Zhong et al., 2020; DeVries & Taylor, 2017; Yun et al., 2019; Zhang et al., 2017), these methods have played a key role in improving the data-efficiency of visual RL (Mnih et al., 2013; Yarats et al., 2019; Hafner et al., 2019; Lee et al., 2019). RAD (Laskin et al., 2020a) conducted a large number of experiments and found that different data augmentations lead to entirely different results, providing a broader perspective for follow-up studies of data augmentation in RL. DrQ (Yarats et al., 2020) proposed an effective augmentation method called random shift and introduced a regularization term for Q-learning. Building on DrQ, DrQ-v2 (Yarats et al., 2021) made minimal changes and demonstrated that a simple augmentation method alone could match the state-of-the-art model-based algorithm in data-efficiency and performance.

Self-supervised learning in RL. Motivated by the breakthroughs in self-supervised learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020; Grill et al., 2020), it is natural to combine these methods with visual RL to learn rich representations. CURL (Laskin et al., 2020b) introduces a framework similar to SimCLR (Chen et al., 2020) into visual RL. CoBERL (Banino et al., 2021) also enforces consistency between positive samples through semantic-preserving data augmentation. Besides, ST-DIM (Mazoure et al., 2020) and PI-SAC (Lee et al., 2020) maximize the temporal mutual information (MI) between nearby states. SPR (Schwarzer et al., 2020) and PlayVirtual (Yu et al., 2021) follow this idea, but use a dynamics model to predict nearby states in latent space. DBC (Zhang et al., 2020b) and PSM (Agarwal et al., 2021) focus on learning task-relevant information. They use signals from the environment to learn invariant representations and thereby generalize the agent to unseen environments.

3.1. REINFORCEMENT LEARNING FROM OBSERVATIONS

Visual RL control is formulated as an infinite-horizon Markov Decision Process (MDP) (Bellman, 1957). Because a single observation cannot fully describe the underlying state, we stack multiple consecutive frames together to represent the current underlying state $s$ (Mnih et al., 2013). With this in mind, the MDP $M$ is a 5-tuple $\langle O, S, A, r, \gamma \rangle$. Here, the observation space $O$ generally consists of stacks of multiple frames. The state space $S$ is either observable or unobservable (Silver et al., 2017; Zhang et al., 2020a). The agent uses observations from $O$ to sample actions from the action space $A$. Every time the agent interacts with the environment, it obtains a reward $r$. The end goal is to train an agent to maximize the cumulative reward $R$. Policy evaluation, which estimates the performance of the policy $\pi_\phi$, is normally defined through the rewards in infinite-horizon tasks,

$$\mathbb{E}[R] = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r_t(s_t, a_t, s_{t+1}) \,\Big|\, \pi_\phi\Big], \tag{1}$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ denotes the reward at time $t$.
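As a small illustration of the discounted return inside the expectation above, the following sketch evaluates the sum for one finite reward sequence (a truncation of the infinite-horizon sum). The function name `discounted_return` is an illustrative choice, not something defined in the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

Averaging this quantity over many trajectories sampled under a policy would give a Monte Carlo estimate of the expectation.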

3.2. Q LEARNING

The state-action value function $Q_\theta$ is trained by minimizing the Bellman error to estimate the cumulative reward at the current state:

$$J_\theta(D) = \mathbb{E}_{e \sim D}\Big[\big(Q_\theta(s_t, a_t) - r_t - \gamma\, Q_{\bar\theta}(s_{t+1}, \pi_\phi(s_{t+1}))\big)^2\Big], \tag{2}$$

where $e$ is a transition from the replay buffer $D$ and $\bar\theta$ denotes an exponential moving average of $\theta$. For continuous control tasks, we use an actor-critic algorithm called Deep Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2015), which consists of the aforementioned state-action value function $Q_\theta$ and a deterministic policy $\pi_\phi$. The policy $\pi_\phi$ aims at maximizing $J_\phi(D) = \mathbb{E}_D[Q_\theta(s_t, \pi_\phi(s_t))]$. Several effective improvements have also been made to DDPG. The Q-learning process incorporates n-step returns (Watkins, 1989; Peng & Williams, 1994). The scheduled exploration noise follows a linear decay $\sigma(t)$ of the noise standard deviation, which provides different levels of exploration at different training steps: $\sigma(t) = \sigma_{init} + \min(t/T, 1)(\sigma_{final} - \sigma_{init})$, so that $\sigma(0) = \sigma_{init}$ and $\sigma(t) = \sigma_{final}$ for $t \geq T$. The initial and final values of the standard deviation are $\sigma_{init}$ and $\sigma_{final}$, and the decay horizon $T$ is related to the total number of training steps in the environment. For discrete control, we use the data-efficient Rainbow DQN (Van Hasselt et al., 2019), which applies multiple improvements on top of the original Nature DQN (Mnih et al., 2015).
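The two scalar ingredients described above, the noise schedule and the n-step target, can be sketched as follows. This is a minimal sketch that assumes the schedule interpolates linearly from the initial value at t = 0 to the final value at t ≥ T; the function names are illustrative and the target-network value is passed in as a plain number.

```python
def linear_sigma(t, T, sigma_init, sigma_final):
    """Linearly interpolate the noise scale from sigma_init (t = 0) to sigma_final (t >= T)."""
    mix = min(max(t / T, 0.0), 1.0)
    return sigma_init + mix * (sigma_final - sigma_init)


def n_step_target(rewards, gamma, bootstrap_value):
    """n-step TD target: sum_{i<n} gamma**i * r_{t+i} + gamma**n * Q_target(s_{t+n}, a_{t+n})."""
    n = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


def squared_td_error(q_value, target):
    """Squared Bellman error for a single transition."""
    return (q_value - target) ** 2


# Noise scale halfway through a 100-step decay from 1.0 to 0.1.
print(linear_sigma(50, 100, 1.0, 0.1))
# Two-step target with gamma = 0.5 and a bootstrap value of 4.0.
print(n_step_target([1.0, 1.0], 0.5, 4.0))
```

With `len(rewards) == 1` the target reduces to the one-step form used in the Bellman error above.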

3.3. STATE ABSTRACTION

While visual RL has achieved many successes in simulated tasks, it remains challenging to learn robust representations from real-world vision, where images reveal detailed scenes of a complex and unstructured world (Zhang et al., 2020b; Agarwal et al., 2021; Wang et al., 2022). Therefore, abstracting meaningful elements from the visual scene to represent the underlying state is important for visual RL. We follow the Block Markov Decision Process (BMDP) (Du et al., 2019), which describes episodic learning tasks via an unobservable latent space $S$ and an observable context space $X$. The environment generates a context by $x \sim p(\cdot|s)$. The BMDP makes a fundamental assumption: each observation $x$ uniquely determines its generating state $s$. Similarly, in model-free RL without a dynamics model, the manipulated context $x$ can be drawn conditionally on an environment transition $e$, that is, $x \sim p(\cdot|e)$.

4.1. LEARNABLE INVARIANT TRANSFORMATION

Following the motivation of smoothing training experiences to stabilize the target Q network (Mnih et al., 2013; Lillicrap et al., 2015), the transformed $x'$ is required to satisfy $x' \sim p(\cdot|e)$, where the environment transition $e$ is ideally drawn from the replay distribution $D$. Note that $e$ is a random variable. Formally, we introduce the optimal invariant metric used to reach the stationary distribution $D$ over the augmented context $x'$.

Definition 4.1. (Optimal Invariant Metric). Given a transition distribution $D$ for tuples in the replay buffer, and supposing the block structure assumption holds, the shift between an observation $x$ and its transformed context $x'$ can be measured by the conditional divergence $d(x, x'|e)$ of Eq. (3), where $d_{KL}(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence. It is the expected divergence between the distributions of $x$ and $x'$ conditioned on $e$.

One may argue that the dynamics cannot be assumed to follow a fixed distribution when it comes to new observations, especially after data manipulation. Nevertheless, experiences are generally assumed to be sampled uniformly from a replay memory (Mnih et al., 2013). Moreover, the conditioning transition is the same for the observed data and the transformed data, which makes the above definition reasonable.

Next, we show why the conditional divergence defined in Eq. (3) is an optimal invariant metric from a theoretical perspective. We apply Bayes' rule to the conditional distribution, $p(x|e) = p(e|x)p(x)/p(e)$, $\forall x \in O, e \in D$. The transition operator $p(e|x)$ can then be further factorized as $p(e|s)p(s|x)$ for any $x \in O$, if $e$ and $x$ are conditionally independent given $s$ (see the footnote on replay-buffer tuples). Eq. (3) is rewritten as

$$\mathbb{E}_{e}\, d_{KL}\big(p(x|e)\,\|\,p(x'|e)\big) = \mathbb{E}_{e|s}\, d_{KL}\big(p(s|x)p(x)\,\|\,p(s|x')p(x')\big). \tag{4}$$

Therefore, minimizing the conditional divergence leads to encoding the observation $x$ and the transformed context $x'$ into an invariant latent state space $S$.
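To make the conditional divergence of Definition 4.1 concrete, the sketch below evaluates it for small discrete distributions, where the expectation over transitions becomes a weighted sum. The tables `p_e`, `p_x` and `p_xprime_shifted` are made-up toy values, and the helper names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def conditional_divergence(p_e, p_x_given_e, p_xprime_given_e):
    """Expected KL between p(x|e) and p(x'|e), weighted by the transition probabilities p(e)."""
    return sum(
        w * kl_divergence(p, q)
        for w, p, q in zip(p_e, p_x_given_e, p_xprime_given_e)
    )


# Two possible transitions e, each inducing a distribution over two contexts.
p_e = [0.5, 0.5]
p_x = [[0.9, 0.1], [0.2, 0.8]]
p_xprime_shifted = [[0.9, 0.1], [0.5, 0.5]]

# Zero when the transformed contexts follow the same conditionals as the originals.
print(conditional_divergence(p_e, p_x, p_x))
# Positive once the transformation shifts one of the conditionals.
print(conditional_divergence(p_e, p_x, p_xprime_shifted))
```

A transformation that keeps this quantity small leaves the conditional distribution of contexts given a transition nearly unchanged, which is the sense of invariance used in the text.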
As a consequence, the learnable pixel transformation, combined with a qualified encoder, is optimality invariant. We now have the observation encoder $g : O \to S$, which maps the observation space $O$ to the latent state space $S$ through a non-trivial function $g$ such that $g(x) = p(s|x), \forall x$. Traditionally, there is only one encoder, dubbed the feature backbone, in RL models. However, the pixel transformation could drift away, e.g., $x'$ may be supported by components different from those supporting $x$ (Du et al., 2019). To enforce invariant hidden states, another state encoder $g'$ that maps $x'$ to $s$ should exist, namely $g'(x') = p(s|\nu(x, \cdot)), \forall x'$, where $\nu(x, \cdot)$ is the transformation. So far, the goal of learning the optimal transformed data and encoders boils down to minimizing the distance between the representations $g(x)$ and $g'(x')$. To tackle this, we rely on two measurements, ϵ-approximation (Definition 4.2) and β-similarity (Definition 4.3): the encoders $g$ and $g'$ are ϵ-approximate w.r.t. $(d, D)$ if $\mathbb{E}_{x \sim D}[d(g(x), g'(x))] \leq \epsilon$, and $x$ and $x'$ are β-similar if $\mathbb{E}_{x \sim D, x' \sim D'}[d(g(x), g(x'))] \leq \beta$.

Without loss of generality, the distance between the encoded states $g(x)$ and $g'(x')$ can be bounded by the triangle inequality. To obtain a metric, the Kullback-Leibler divergence is rewritten in the form of the square root of the Jensen-Shannon divergence. Therefore, we have

$$d(g(x), g'(x')) \leq \underbrace{d(g(x), g'(x))}_{\text{encoding: } \epsilon\text{-approximation}} + \underbrace{d(g'(x), g'(x'))}_{\text{augmentation: } \beta\text{-similarity}}. \tag{5}$$

From the viewpoint of invariant learning, minimizing the right side of the inequality minimizes an upper bound on our problem. The first term on the right side is the so-called ϵ-approximation, which measures the functional similarity after state abstraction, whereas the second term arises from the data augmentation procedure. Thus, we learn the encoders and the shifted data simultaneously through the upper bound.
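The triangle inequality above relies on the square root of the Jensen-Shannon divergence being a metric. The following sketch checks this numerically on three toy discrete distributions that stand in for the encoded states g(x), g′(x) and g′(x′); the distributions are invented for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors


g_x = [0.7, 0.2, 0.1]        # stands in for g(x)
gp_x = [0.6, 0.3, 0.1]       # stands in for g'(x)
gp_xprime = [0.3, 0.3, 0.4]  # stands in for g'(x')

lhs = js_distance(g_x, gp_xprime)
encoding_term = js_distance(g_x, gp_x)
augmentation_term = js_distance(gp_x, gp_xprime)
print(lhs <= encoding_term + augmentation_term)
```

Unlike the KL divergence, this distance is symmetric and satisfies the triangle inequality, which is why the decomposition into an encoding term and an augmentation term is valid.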

4.2. OPTIMAL STATE ABSTRACTION

To restrict the functional similarity term of Eq. (5) from the perspective of learning a good encoding function with consistent semantics, we use main and momentum feature learning, motivated by contrastive learning (He et al., 2020). In particular, we give the two encoding functions exactly the same architecture and use $\bar\xi_t = (1 - \tau_m)\,\bar\xi_{t-1} + \tau_m\, \xi_t$ at timestep $t$ to update the parameters of the momentum function $g_{\bar\xi}$ from those of $g_\xi$, where $\tau_m \in [0, 1]$ is the update rate. Furthermore, we design a projection $f : S \to Y$ using a ReLU network (Petersen & Voigtlaender, 2018) to upper bound the divergence by minimizing the distance in the projected space $Y$. The projection was previously proposed by Chen et al. (2020); this work explains the theoretical guarantees of the underlying mechanism with momentum updating for model-free RL. Suppose the Markov chain $O \xrightarrow{g} S \xrightarrow{f} Y$ holds. For two functions $g$ and $f$ with compatible ranges, we use $f \circ g$ to denote the function composition $f(g(\cdot))$. Before presenting the proposed data transformation method, we introduce technical lemmas that exploit the designable projection function through convexity. The momentum updating of parameters can be turned into momentum updating of features through a convex function or an equivalent of a convex function.

Lemma 4.1. Assume that $h : \mathbb{R}^{|S|} \to \mathbb{R}^{|Y|}$ can be written as $h(\xi) = f(\langle \xi, s \rangle)$ for some $s \in \mathbb{R}^{|S|}$, and $f : \mathbb{R}^{|Y|} \to \mathbb{R}^{|Y|}$ with parameter $\xi$. Then, convexity of $f$ implies the convexity of $h$.

Lemma 4.2. Given the update dynamics $\bar\xi_t = (1 - \tau_m)\,\bar\xi_{t-1} + \tau_m\, \xi_t$, by Lemma 4.1, $f_\xi = f_{\bar\xi}$ holds after convergence. As a result, the problem $\min \mathbb{E}_x[\| f_\xi \circ g_\xi(x) - f_{\bar\xi} \circ g_{\bar\xi}(x) \|]$ is equivalent to the problem $\min \mathbb{E}_x[\| g_\xi(x) - g_{\bar\xi}(x) \|]$.

To meet the requirement of a small upper bound, we state a theorem that provides some insight into why it is necessary to learn the optimal transformed data together with the encoders.
Theorem 4.1. (CoIT) Suppose that Lipschitzness holds for the functions $g_\xi$, $g_{\bar\xi}$, $f_\xi$ and $f_{\bar\xi}$, respectively. The update dynamics are $\bar\xi_t = (1 - \tau_m)\,\bar\xi_{t-1} + \tau_m\, \xi_t$, $\tau_m \in [0, 1]$. For any input $x \sim D$ and transformed $x'$ obtained via the transform operator $\nu(x, \cdot)$, optimizing the conditional divergence in Definition 4.1 means minimizing the following upper bound,

$$\mathbb{E}_x\, d\big(f_{\bar\xi} \circ g_{\bar\xi}(x),\, f_\xi \circ g_\xi(x')\big) \leq \rho\, \mathbb{E}_x \| x - x' \|, \tag{6}$$

where $\rho = L_g (C L_f + \|\xi_f\|)$, $C = \frac{1+\tau}{1-\tau}$ and $\tau = 1 - \tau_m$ are constants. $L_f$ and $L_g$ are the Lipschitz constants of the functions $f(s)$ and $g(x)$, respectively.

The upper bound on the right side measures the margin of augmentation between the original and transformed data. The left side measures the distribution change in the latent space. Since both the transformed data $x'$ and the encoder $g$ are being updated, directly incorporating augmentation cannot adequately satisfy the basic stationarity assumption on the environment. The theorem therefore suggests that the automatic transformation can be used to bound the representation learning, so that the abstracted states reinforce the stationary distribution of tuples $e$ and facilitate efficient training. Proofs are given in Appendix A. The empirical comparisons of the projection network are presented in Appendix D.
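The momentum update used in the lemmas and in Theorem 4.1 can be sketched on flat parameter lists as follows. This is an illustrative sketch: real encoders would hold tensors, and the name `momentum_update` is an assumption rather than an identifier from the paper's code.

```python
def momentum_update(momentum_params, online_params, tau_m):
    """One step of xi_bar_t = (1 - tau_m) * xi_bar_{t-1} + tau_m * xi_t, applied elementwise."""
    if not 0.0 <= tau_m <= 1.0:
        raise ValueError("tau_m must lie in [0, 1]")
    return [
        (1.0 - tau_m) * m + tau_m * o
        for m, o in zip(momentum_params, online_params)
    ]


# With the online parameters held fixed, the momentum copy converges to them
# geometrically: the gap shrinks by a factor (1 - tau_m) at every step.
momentum = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(200):
    momentum = momentum_update(momentum, online, tau_m=0.1)
print(momentum)
```

The geometric shrinkage of the gap is the intuition behind the statement that the momentum and online functions coincide after convergence.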

4.3. PARAMETERIZABLE OBSERVATION

From Theorem 4.1, we know the relation between the parameterized augmentation and the learnable latent state: the image transformation needs to be constrained by the distance between an observation and its associated augmentation. The optimal embedding can be obtained by minimizing the upper bound in Eq. (5). In particular, we parameterize the transformed data $x'$ as $\nu(x, G)$, where $G$ is a Gaussian distribution that changes dynamically along with the RL training. Algorithm 1 defines the MDP $M$ with Gaussian random variables $G_0 \sim G^{|O|}$ for initialization. The TRANSFORM operator is realized by the aforementioned pixel transformation $\nu$, a shift that is drawn from the Gaussian distribution $G_t(\mu_t, \sigma_t)$ and applied on top of data interpolation. The transformation $x' = \nu(x, G)$ is normalized according to both $x$ and the learned distribution $G$, and thereby contributes to cumulative reward maximization in an interactive way. We apply this scheme to observation sampling to obtain an optimal state abstraction. It can be regarded as sampling data from the replay buffer so as to reach a training-friendly distribution, which we dub Contrastive Invariant Transformation.
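A minimal sketch of what a Gaussian-shift TRANSFORM with bilinear interpolation could look like on a single-channel image is given below. It is an assumption-laden illustration, not the released implementation: the shift is one (dy, dx) pair per image, edges are replicated, the normalization step mentioned above is omitted, and µ and σ are fixed numbers rather than learned parameters.

```python
import math
import random


def bilinear_sample(img, y, x):
    """Bilinearly interpolate a 2D image at real-valued (y, x), replicating edge pixels."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1.0 - wx) * img[y0][x0] + wx * img[y0][x1]
    bottom = (1.0 - wx) * img[y1][x0] + wx * img[y1][x1]
    return (1.0 - wy) * top + wy * bottom


def shift_image(img, dy, dx):
    """Resample the image at coordinates offset by (dy, dx)."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, i + dy, j + dx) for j in range(w)] for i in range(h)]


def transform(img, mu, sigma, rng):
    """Hypothetical TRANSFORM: draw a 2D shift from N(mu, sigma^2) and resample the image."""
    dy = rng.gauss(mu, sigma)
    dx = rng.gauss(mu, sigma)
    return shift_image(img, dy, dx)


image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
rng = random.Random(0)
view = transform(image, mu=0.0, sigma=0.5, rng=rng)
print(view)
```

Because the resampling is differentiable in the shift, a framework with automatic differentiation could propagate gradients to µ and σ through a reparameterized sample; that part is not shown here.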

4.4. STABILIZING REWARD FUNCTION

Algorithm 1: Parameterized Transformation in RL
1: Initialization: Draw distribution $G_0 \sim G^{|O|}$ with given mean $\mu_0$ and variance $\sigma_0^2$; set cumulative reward $R = 0$.
2: Training:
3: for each timestep $t$ in $0, \dots, T$ do
4:     $x'_t = \mathrm{TRANSFORM}(x_t, G_t)$
5:     $R = R + \gamma^t r(x'_t, a_t)$
6:     Adjust to an optimal $G_t(\mu_t, \sigma_t)$.
7: end for
* Details about learning $\mu_t$ and $\sigma_t$ are given in Algorithm 2 in Appendix B.

Algorithm 1 highlights one key effect of CoIT on the RL training procedure: it parameterizes the underlying invariant optimization so as to smooth the distribution $D$ in the replay buffer. To further stabilize the reward function, we propose a mixed CoIT that samples multiple transformed data from the learned distribution $G(\mu, \sigma)$ and then mixes them into the learned observation $x'$. Similarly, we provide the invariant learning guarantee by optimizing the right-side term in Theorem 4.2.

Theorem 4.2. (Mixed CoIT) Suppose that Lipschitzness holds for the functions $g_\xi$, $g_{\bar\xi}$, $f_\xi$ and $f_{\bar\xi}$, respectively. The update dynamics are $\bar\xi_t = (1 - \tau_m)\,\bar\xi_{t-1} + \tau_m\, \xi_t$. For any input $x \sim D$ and transformed $x'$, the divergence with the mixed transformed observation can be bounded by

$$\mathbb{E}_x\, d\big(f_{\bar\xi} \circ g_{\bar\xi}(x),\, f_\xi \circ g_\xi(\mathbb{E}_{x'}[x'])\big) \leq \rho\, \mathbb{E}_x \mathbb{E}_{x'} \| x - x' \|, \tag{7}$$

where $\rho = L_g (C L_f + \|\xi_f\|)$, $C = \frac{1+\tau}{1-\tau}$ and $\tau = 1 - \tau_m$. $L_f$ and $L_g$ are the Lipschitz constants of the functions $f(s)$ and $g(x)$, respectively. The proof of Theorem 4.2 is straightforward based on Theorem 4.1, and the details are presented in the supplement.
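One way to see why the mixed bound follows from Theorem 4.1 is sketched below. This is an informal sketch rather than the proof in the supplement, and it assumes that the bound of Theorem 4.1 may be applied with the mixture $\mathbb{E}_{x'}[x']$ in place of a single transformed sample.

```latex
\begin{aligned}
\mathbb{E}_x\, d\big(f_{\bar\xi}\circ g_{\bar\xi}(x),\; f_\xi\circ g_\xi(\mathbb{E}_{x'}[x'])\big)
  &\le \rho\, \mathbb{E}_x \big\| x - \mathbb{E}_{x'}[x'] \big\|
  && \text{(Theorem 4.1 applied to } \mathbb{E}_{x'}[x']\text{)} \\
  &= \rho\, \mathbb{E}_x \big\| \mathbb{E}_{x'}[\,x - x'\,] \big\|
  && \text{(} x \text{ does not depend on } x'\text{)} \\
  &\le \rho\, \mathbb{E}_x \mathbb{E}_{x'} \| x - x' \|
  && \text{(Jensen's inequality, convexity of the norm)}
\end{aligned}
```

The last step also shows that mixing can only tighten the margin term: the distance to the averaged sample is never larger than the average distance to the individual samples.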

4.5. LEARNING CONTRASTIVE INVARIANT TRANSFORMATION

Building on the theoretical analysis of invariant transformations, we present a new framework with normalization variants that ensures the learning guarantees discussed above by optimizing parameters. We initialize a distribution $G_t(\mu, \sigma)$ and use the TRANSFORM operator to produce different views of $x_t$ (Algorithm 1). The transformed data $x'_t$ is viewed as the positive pair of $x_t$. We also use a similarity metric $d$ to learn the contrastive invariant transformation for the encoder $g_\xi(\cdot)$. First, we apply bilinear interpolation to $x_t$ and sample shift terms from $G_t$ to produce multiple positive samples $x^1_t, x^2_t, \dots, x^n_t$, which are mixed together as $x'_t$ following Theorem 4.2. To prevent the distribution of transformed data from shifting too far away from the replay buffer $D$, we borrow a similar idea from Yin et al. (2020) to regularize and smooth the distribution shift between the transformed and overall data. We use the statistics stored in the BatchNorm layers to approximate the distribution of the overall data. In this way, the distribution shift between the transformed and overall data can be estimated by

$$K_\omega(x'_t) = \sum_l \big\| \hat\mu_l(x'_t) - \mathbb{E}(\mu_l(x) \mid O) \big\|_2 + \sum_l \big\| \hat\sigma^2_l(x'_t) - \mathbb{E}(\sigma^2_l(x) \mid O) \big\|_2, \tag{8}$$

where $\hat\mu_l(x'_t)$ and $\hat\sigma^2_l(x'_t)$ are the mean and variance of the transformed data at the $l$-th layer, and $\omega$ represents the parameter collection $\{\mu_t, \sigma_t\}$ of the Gaussian distribution $G_t(\mu, \sigma)$. The expectation terms $\mathbb{E}(\mu_l(x) \mid O)$ and $\mathbb{E}(\sigma^2_l(x) \mid O)$ respectively denote the estimates of the batch-wise mean and variance for the feature map of the $l$-th convolution layer, and $O$ denotes the given observations. Second, we use the similarity metric $d$ proposed by Chen et al. (2020) to learn the encoder $g_\xi(\cdot)$, which maps high-dimensional observations to embeddings, so as to meet the invariant transformation in Eq. (5).
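The distribution-shift regularizer described above compares per-layer batch statistics of the transformed data with stored running statistics. The sketch below computes it on plain Python lists, assuming the non-squared L2 norm used by Yin et al. (2020); in practice the running statistics would be read from BatchNorm layers, and all names here are illustrative.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_mean_var(batch):
    """Per-channel mean and biased variance of a batch given as a list of channel vectors."""
    n = len(batch)
    channels = len(batch[0])
    means = [sum(sample[c] for sample in batch) / n for c in range(channels)]
    variances = [
        sum((sample[c] - means[c]) ** 2 for sample in batch) / n
        for c in range(channels)
    ]
    return means, variances


def distribution_shift(transformed_stats, running_stats):
    """Sum over layers of the mean distance plus the variance distance.

    Both arguments are lists with one (mean_vector, variance_vector) pair per layer.
    """
    total = 0.0
    for (mean_t, var_t), (mean_r, var_r) in zip(transformed_stats, running_stats):
        total += l2_distance(mean_t, mean_r) + l2_distance(var_t, var_r)
    return total


# One layer with two channels: features of a transformed batch vs. stored statistics.
layer_features = [[1.0, 2.0], [3.0, 6.0]]
transformed = [batch_mean_var(layer_features)]
running = [([2.0, 4.0], [1.0, 4.0])]
print(distribution_shift(transformed, running))
```

A value near zero indicates that the transformed batch has feature statistics close to those accumulated over the overall data.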
Given a positive observation pair $(x_t, x'_t)$, the loss is given by

$$L_{\xi, \bar\xi, \omega}(D) \triangleq \big\| f_\xi(g_\xi(x'_t)) - f_{\bar\xi}(g_{\bar\xi}(x_t)) \big\|_2^2 = 2 - 2 \cdot \frac{\big\langle f_\xi(g_\xi(x'_t)),\, f_{\bar\xi}(g_{\bar\xi}(x_t)) \big\rangle}{\big\| f_\xi(g_\xi(x'_t)) \big\|_2 \cdot \big\| f_{\bar\xi}(g_{\bar\xi}(x_t)) \big\|_2}, \tag{9}$$

where the projections inside the squared norm are $\ell_2$-normalized, which is what makes the two forms equal. Here $\bar\xi$ denotes the momentum version of the parameters $\xi$, and $f_\xi(\cdot)$ is a non-linear projection of the representations embedded by $g_\xi(\cdot)$. $D$ indicates the tuples stored in the replay buffer. Next, we update the critic network $Q_\theta$ with the transformed data $x'_t$ and $x'_{t+n}$ to minimize the TD error for n-step returns. This is regarded as a regularized Q-learning by Yarats et al. (2020), where the regularized representation learning is beneficial for optimal action taking (Zhang et al., 2020a):

$$J_Q(D; \theta, \omega, \xi) = \Big( Q_\theta\big(g_\xi(x'_t), a_t\big) - \sum_{i=0}^{n-1} \gamma^i r_{t+i} - \gamma^n Q_{\bar\theta}\big(g_\xi(x'_{t+n}), \pi(\cdot \mid g_\xi(x'_{t+n}))\big) \Big)^2. \tag{10}$$

Finally, we give the unified objective function as the full version of CoIT,

$$J_{\theta, \xi, \omega}(D) = J_Q(D) + \alpha L_{\xi, \bar\xi, \omega}(D) + \lambda K_\omega(D), \tag{11}$$

where $\alpha$ and $\lambda$ are hyper-parameters; the overall architecture is presented in Figure 2. We replace the vanilla Q-learning objective with $J_{\theta, \xi, \omega}(D)$, and the entire algorithm is presented in Algorithm 2 in Appendix B. We then evaluate CoIT on popular benchmarks to demonstrate the benefits of our method.
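The equality between the squared distance of normalized projections and the cosine form in the contrastive loss can be checked numerically. The sketch below does so on toy vectors and also shows how the three loss terms would be combined; the vectors and the weights `alpha` and `lam` are arbitrary illustrative values, not the paper's settings.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def normalized_squared_distance(u, v):
    """Squared Euclidean distance between the l2-normalized versions of u and v."""
    un, vn = l2_normalize(u), l2_normalize(v)
    return sum((a - b) ** 2 for a, b in zip(un, vn))


def cosine_form(u, v):
    """2 - 2 * cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 2.0 - 2.0 * dot / (norm_u * norm_v)


def unified_objective(j_q, contrastive_loss, shift_penalty, alpha, lam):
    """Weighted sum of the critic loss, the contrastive loss and the distribution-shift term."""
    return j_q + alpha * contrastive_loss + lam * shift_penalty


online_proj = [1.0, 2.0, 3.0]      # stands in for the online projection of x'_t
momentum_proj = [-2.0, 0.5, 4.0]   # stands in for the momentum projection of x_t
print(normalized_squared_distance(online_proj, momentum_proj))
print(cosine_form(online_proj, momentum_proj))
```

Both printed values agree up to floating-point rounding; the loss ranges from 0 for aligned projections to 4 for opposite ones.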

5. EXPERIMENTS

In this section, we benchmark our method on the DeepMind control suite and Atari100K. We compare CoIT with prior model-free methods first, then we present ablation studies to show the details of our method. Implementation details can be found in Appendix C.

5.1. ENVIRONMENTS

DMControl. The DeepMind Control Suite (Tassa et al., 2018) is a widely used benchmark with several robot control tasks. Each episode is set to 1,000 frames, and we use the total number of experienced frames to measure data-efficiency. The per-frame reward lies in the unit interval [0, 1], so each episode yields a total reward of at most 1,000. Since task difficulty varies, we use more episodes for harder tasks to allow a better evaluation.

Atari100K. A number of prior papers have benchmarked data-efficiency on the Atari 2600 Games for discrete control. Van Hasselt et al. (2019) and Kielak (2019) propose data-efficient versions of Rainbow DQN (Hessel et al., 2018), which are compared with human performance (Kaiser et al., 2019) within 100K time steps (400K frames, frame skip of 4). This sample-constrained evaluation is the so-called Atari100K, and we benchmark CoIT on all 26 Atari games.

5.2. BASELINES

DMControl. For continuous control we present several baselines, including methods that use data augmentation and contrastive learning to improve data-efficiency: (i) DrQ-v2 (Yarats et al., 2021), (ii) CURL (Laskin et al., 2020b), (iii) Pixel SAC, and (iv) Pixel DDPG, i.e., vanilla SAC and DDPG trained directly from pixels. All methods are evaluated at the same frame intervals, and each evaluation averages the returns of 10 episodes.

Atari100K. To benchmark the data-efficiency of CoIT for discrete control tasks, we compare our method to (i) DrQ (Yarats et al., 2020), (ii) CURL (Laskin et al., 2020b), (iii) SPR (Schwarzer et al., 2020), (iv) Random Agent, and (v) Human Performance (Kaiser et al., 2019). All the algorithms are evaluated within 100K time steps of interaction. We average CoIT's performance over 10 random seeds and report the best score for each game following prior works.

Figure 3: Results of complex tasks in DMControl. These tasks are chosen to offer multiple degrees of challenge, including complex dynamics, sparse rewards, hard exploration, and more.

5.3. MAIN RESULTS

DMControl. We choose 8 complex tasks from DMControl for evaluation and present the results in Figure 3 and in Table 2 in Appendix C. We also report the percentage (%) of score solved in DMControl for the baselines and CoIT at 500K and 100K steps in Figure 1. Below are the key findings: (i) CoIT outperforms vanilla DDPG and SAC by a wide margin. (ii) We also compare CoIT with DrQ-v2, a strong method for continuous control, to better demonstrate our method's data-efficiency. (iii) From the general trends of the learning curves, CoIT improves or maintains data-efficiency in a more stable manner, which is non-trivial on DMControl tasks.

Atari100K. We present results for Atari100K in Table 1.



The tuples in the replay buffer can be written as $e = (s_t, a_t, r_t, s_{t+1})$ after encoding, which makes $s$ an intermediate random variable.



Figure 1: Percentage (%) of score solved in the DMC. We set the score of DrQ-v2 as 100% and report the results of CoIT and CURL. Top: in 500K steps. Bottom: in 100K steps. A task is considered solved when the return nearly reaches the upper bound.

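$$d(x, x' \mid e) \triangleq \mathbb{E}_{e \sim D}\, d_{KL}\big(p(x \mid \mathbf{e} = e)\,\|\,p(x' \mid \mathbf{e} = e)\big) = \int_{e} d_{KL}\big(p(x \mid \mathbf{e} = e)\,\|\,p(x' \mid \mathbf{e} = e)\big)\, dp(e) \tag{3}$$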

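Definition 4.2. (ϵ-Approximation). Given a distance metric $d : S \times S \to \mathbb{R}_+$ that satisfies $d(s, s) = 0, \forall s$, let $g, g' : O \to S$ be two functions. Let $\epsilon \geq 0$. Given a distribution $D$ on $O$, $g$ and $g'$ are ϵ-approximate w.r.t. $(d, D)$ if $\mathbb{E}_{x \sim D}[d(g(x), g'(x))] \leq \epsilon$.

Definition 4.3. (β-Similarity). Given a distance metric $d : S \times S \to \mathbb{R}_+$ that satisfies $d(s, s) = 0, \forall s$, and a function $g : O \to S$, let $\beta \geq 0$. Given distributions $D$ and $D'$ on $O$, $x$ and $x'$ are β-similar if $\mathbb{E}_{x \sim D, x' \sim D'}[d(g(x), g(x'))] \leq \beta$.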

Figure 2: Overall architecture of CoIT. The observations are transformed following a Gaussian distribution $G(\mu, \sigma)$ and encoded by the state encoder $g_\xi$. The observation encoder $g_{\bar\xi}$ and projection $f_{\bar\xi}$ are exponential moving average versions of the state encoder $g_\xi$ and projection $f_\xi$.

Algorithm 1: Parameterized Transformation in RL
1: Initialization: Draw distribution $G_0 \sim G^{|O|}$ with given mean $\mu_0$ and variance $\sigma_0^2$; set cumulative reward $R = 0$.
2: Training:
3: for each timestep $t$ in $0, \dots, T$ do
(steps 4 to 7, the loop body, appear with the text of Section 4.4)

Below are the key findings on Atari100K: (i) CoIT achieves top performance on 10 of 26 games while remaining competitive on the rest. (ii) CoIT achieves superhuman performance on 6 games on the basis of Rainbow DQN. (iii) We also report the mean and std of the scores achieved by CoIT on the 10 of 26 games where it attains top performance. We present the results of two versions of CoIT (mixed and non-mixed) in Figure 6 in the appendix. From the histogram, we find that CoIT has much better stability, which is similar to the observation on DMControl.

5.4. ABLATION STUDIES

We first visualize the TRANSFORM operator to demonstrate that there exists an invariant transformation for each task. We initialize the Gaussian distribution $G_t(\mu, \sigma)$ based on the range of pixel shifts and plot the curves of the mean $\mu$ and std $\sigma$ during training on DMControl in Figure 4.

Figure 4: Visualization of the parameters of the Gaussian distribution for TRANSFORM.

Table 1: Mean episodic returns achieved by CoIT and baselines on 26 Atari games benchmarked at 100K environment steps. The results are recorded and averaged over 10 random seeds.

ACKNOWLEDGMENT

This work was supported in part by the National Key Research and Development Program of China No. 2020AAA0103400, National Key Research and Development Program of China No. 2021ZD0201504, National Natural Science Foundation of China under Grant 62273347, and CCF-Tencent Open Research Fund RAGR20220104. We thank anonymous reviewers for their discussions and feedback on the paper.

availability

//github.com/mooricAnna/CoIT.

annex

According to the curves, the mean and std converge to an interval as training goes on. These results demonstrate that the Gaussian distribution proposed in CoIT can automatically find a TRANSFORM that smooths the distribution shift between different views of the same observation, which benefits representation learning.

Then, we study the effects of the different components in Eq. (11). This objective function is composed of two parts in addition to the critic loss: $K_\omega(x'_t)$ for regularization and $L_{\xi, \bar\xi, \omega}(D)$ for the similarity metric. On this basis, we divide CoIT into 4 versions: (i) Critic. The transformation is updated only with the critic. (ii) X-stats & Critic. The transformation is updated by the critic and $K_\omega(x'_t)$ together. (iii) A variant in which the transformation is updated by the critic and $L_{\xi, \bar\xi, \omega}(D)$ together. (iv) Unified Objective. We evaluate all of these versions on 8 representative tasks from DMControl and present the results in Figure 9 in the appendix.

Comparing Critic with the other variants, we find that both components are beneficial to performance. Though Critic is data-efficient on most tasks, it may fall into trivial solutions. To solve this issue, we use the regularization in Eq. (8) together with the similarity metric in Eq. (9) to meet the invariant transformation. As a result, the Unified Objective leads on all tasks. See Appendix C for extra ablation studies.

6. CONCLUSION

This work introduces CoIT, a novel pixel transformation for model-free RL algorithms that significantly improves data-efficiency and stability on visual tasks. We theoretically analyze how the learnable transformation constrains the distribution of transformed data, and dissect its benefits to representation learning. CoIT requires no additional modifications to the backbone RL algorithm and is easy to implement. We compare CoIT to SOTA methods on popular benchmarks and confirm that it achieves promising performance with improved stability. We hope that contrastive invariant transformation can open a new branch of representation learning in RL.

