LEARNING CONTROL BY ITERATIVE INVERSION

Abstract

We formulate learning for control as an inverse problem: inverting a dynamical system to give the actions which yield desired behavior. The key challenge in this formulation is a distribution shift in the inputs to the function to be inverted: the learning agent can only observe the forward mapping (its actions' consequences) on trajectories that it can execute, yet must learn the inverse mapping for input-output pairs that correspond to a different, desired behavior. We propose a general recipe for inverse problems with a distribution shift that we term iterative inversion: learn the inverse mapping under the current input distribution (policy), then use it on the desired output samples to obtain a new input distribution, and repeat. As we show, iterative inversion can converge to the desired inverse mapping, but under rather strict conditions on the mapping itself. We next apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. We find that by constantly adding the demonstrated trajectory embeddings as input to the policy when generating trajectories to imitate, à la iterative inversion, we effectively steer the learning towards the desired trajectory distribution. To the best of our knowledge, this is the first exploration of learning control from the viewpoint of inverse problems. The main advantage of our approach is simplicity: it does not require rewards, and only employs supervised learning, which can be easily scaled to use state-of-the-art trajectory embedding techniques and policy representations. Indeed, with a VQ-VAE embedding and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks. Further, we report improved performance on imitating diverse behaviors compared to reward-based methods.

1. INTRODUCTION

The control of dynamical systems is fundamental to various disciplines, such as robotics and automation. Consider the following trajectory tracking problem. Given some deterministic but unknown actuated dynamical system,

$$s_{t+1} = f(s_t, a_t), \tag{1}$$

where $s$ is the state and $a$ is an actuation, and some reference trajectory $s_0, \dots, s_T$, we seek actions that drive the system in a similar trajectory to the reference. For systems that are 'simple' enough, e.g., linear or low dimensional, classical control theory (Bertsekas, 1995) offers principled and well-established system identification and control solutions. However, for several decades, this problem has captured the interest of the machine learning community, where the prospect is scaling up to high-dimensional systems with complex dynamics by exploiting patterns in the system (Mnih et al., 2015; Lillicrap et al., 2015; Bellemare et al., 2020).

In reinforcement learning (RL), learning is driven by a manually specified reward signal $r(s, a)$. While this paradigm has recently yielded impressive results, defining a reward signal can be difficult for certain tasks, especially when high-dimensional observations such as images are involved. An alternative to RL is inverse RL (IRL), where a reward is not manually specified. Instead, IRL algorithms learn an implicit reward function that, when plugged into an RL algorithm in an inner loop, yields a trajectory similar to the reference. The signal driving IRL algorithms is a similarity metric between trajectories, which can be manually defined, or learned (Ho & Ermon, 2016).

We propose a different approach to learning control, which requires neither explicit nor implicit reward functions, nor a similarity metric between trajectories. Our main idea is that Equation (1) prescribes a mapping $F$ from a sequence of actions to a sequence of states,

$$s_0, \dots, s_T = F(a_0, \dots, a_{T-1}). \tag{2}$$

The control learning problem can therefore be framed as finding the inverse function, $F^{-1}$, without knowing $F$, but with the possibility of evaluating $F$ on particular action sequences (a.k.a. roll-outs). Learning the inverse function $F^{-1}$ using regression can be easy if one has samples of action sequences and corresponding state sequences, and a distance measure over actions. However, in our setting, we do not know the action sequences that correspond to the desired reference trajectories. Interestingly, for some mappings $F$, an iterative regression technique can be used to find $F^{-1}$. In this scheme, which we term Iterative Inversion (IT-IN), we start from arbitrary action sequences, collect their corresponding state trajectories, and regress to learn an inverse. We then apply this inverse on the reference trajectories to obtain new action sequences, and repeat. We show that with linear regression, iterative inversion will converge under quite restrictive criteria on $F$, such as being strictly monotone and with a bounded ratio of derivatives. Nevertheless, our result shows that for some systems, a controller can be found without a reward function or a distance measure on states.

We then apply iterative inversion to several continuous control problems. In our setting, the desired behavior is expressed through a video embedding of a desired trajectory, using a VQ-VAE (Van Den Oord et al., 2017), and a deep network policy maps this embedding and a state history to the next action. The agent generates trajectories from the system using its current policy, given the desired embeddings as input, and subsequently learns to imitate its own trajectories, conditioned on their own embeddings. Interestingly, we find that when iterating this procedure, the input of the desired trajectories' embeddings steers the learning towards the desired behavior, as in iterative inversion.
Given the strict conditions for convergence of iterative inversion, there is no a-priori reason to expect that our method will work for complex non-linear systems and expressive policies. Curiously, however, we report convergence on all the scenarios we tested, and furthermore, the resulting policy generalized well to imitating trajectories that were not seen in its 'steering' training set. This surprising observation suggests that IT-IN may offer a simple supervised learning-based alternative to methods such as RL and IRL, with several potential benefits such as a reward-less formulation, and the simplicity and stability of the (iterated) supervised learning loss function. Furthermore, on experiments where the desired behaviors are abundant and diverse, we report that IT-IN outperforms reward-based methods, even with an accurate state-based reward.

2. RELATED WORK

In learning from demonstration (Argall et al., 2009), it is typically assumed that the demonstrations contain both the states and actions, and therefore supervised learning can be directly applied, either by behavioral cloning (Pomerleau, 1988) or interactive methods such as DAgger (Ross et al., 2011). In our work, we assume that only states are observed in the demonstrations, precluding straightforward supervised learning. Inverse RL is a similar problem to ours, and methods such as apprenticeship learning (Abbeel & Ng, 2004) or generative adversarial imitation learning (Ho & Ermon, 2016) simultaneously train a critic that discriminates between the data trajectories and the policy trajectories (a classification problem), and a policy that confuses the critic as best as possible (an RL problem). It is shown that this procedure will converge to a policy that visits the same states as the data. While works such as Fu et al. (2019) and Ding et al. (2019) considered a goal-conditioned IRL setting, we are not aware of IRL methods that can be conditioned on a more expressive description than a target goal state, such as a complete trajectory embedding, as we explore here. In addition, our approach avoids the need for training a critic, as in Ding et al. (2019), or training an RL agent in an inner loop. Most related to our work, Ghosh et al. (2019) proposed goal-conditioned supervised learning (GCSL). In GCSL, the agent iteratively executes random trajectories, and uses them as direct supervision for training a goal-conditioned policy, where observed states in the trajectory are substituted as goals. The desired goals are also input to the policy when generating the random trajectories. In comparison to GCSL, we do not consider tasks of only reaching goal states, but tasks where the whole trajectory is important. This significantly increases the diversity of possible tasks, and thereby increases the difficulty of the problem.
In addition, the theoretical analysis of Ghosh et al. (2019) showed convergence under an assumption that all goal states have a probability of being visited under the initial policy. The analysis we show for iterative inversion, and the fact that this assumption almost never holds in practice (except, e.g., in offline RL, where the data is already 'explorative' enough; Emmons et al., 2021), suggest that the practical success of GCSL is less obvious than the theory of Ghosh et al. (2019) predicts.

Figure 1: Learning an inverse function under a distribution shift. We wish to learn the inverse function over outputs $y_1, \dots, y_M$, using linear least squares, having matching inputs-outputs for $x_1, \dots, x_M$.

In self-supervised RL, the agent is not given reward, and uses its own experience to explore the environment, typically by training a goal-conditioned policy, and proposing to it goals that are novel in some measure (Ecoffet et al., 2019; Hazan et al., 2019; Sekar et al., 2020; Endrawis et al., 2021; Mendonca et al., 2021). The space of all trajectories is much larger than the space of all states, and we are not aware of methods that demonstrably explore such a space. For this reason, in our approach we steer the exploration towards a set of desired trajectories.

Very recently, in their work on video pretraining, Baker et al. (2022) also used a transformer to learn an inverse model conditioned on a video. Importantly, Baker et al. (2022) collected human-labelled data to train their inverse model on desired behavior trajectories. The main point in our work is a self-supervised learning paradigm that automatically steers data collection to the desired behavior.

Finally, we mention the extensive literature on deep learning based solutions for inverse problems (Lucas et al., 2018; Kamyab et al., 2021). In many studies, the forward mapping is known, and differentiated to iteratively optimize reconstruction (Xia et al., 2022).
In contrast, blind inversion methods learn the inverse mapping directly from data (Kulkarni et al., 2016) . To the best of our knowledge, the formulation of learning control as an inverse problem, and iterative inversion as a solution for the resulting distribution shift problem, are novel.

3. ITERATIVE INVERSION

In this section we describe a general problem of learning an inverse function under a distribution shift, and present the iterative inversion algorithm. We then analyse the convergence of iterative inversion in several simplified settings. In the next section, we will apply iterative inversion to learning control.

Let $F : \mathcal{X} \to \mathcal{Y}$ be a bijective function. We are given a set of $M$ desired outputs $y_1, \dots, y_M \in \mathcal{Y}$, and an arbitrary set of $M$ initial inputs $x_1, \dots, x_M \in \mathcal{X}$. We assume that $F$ is not known, but we are allowed to observe $F(x)$ for any $x \in \mathcal{X}$ that we choose during our calculations. Our goal is to find a function $G : \mathcal{Y} \to \mathcal{X}$ such that for any desired output $y_i$, we have $G(y_i) = F^{-1}(y_i)$. More specifically, we will adopt a parametric setting, and search for a parametric function $G_\theta$, where $\theta \in \Theta$ is a parameter vector, that minimizes the average loss:

$$\min_{\theta \in \Theta} \frac{1}{M} \sum_{i=1}^{M} L\big(G_\theta(y_i), F^{-1}(y_i)\big). \tag{3}$$

For example, $G_\theta$ could represent the space of linear functions $G_\theta(y) = \theta^T y + \theta_0$, and $L$ could be the squared error between inputs, $L(x, x') = (x - x')^2$. This example, which is depicted in Figure 1 for the 1-dimensional case $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, corresponds to a linear least squares fit of the inverse function. As can be seen, the challenge in this problem arises from the mismatch between the distributions of the desired outputs and initial inputs. The iterative inversion algorithm, proposed in Algorithm 1, seeks to solve problem (3) iteratively.

Algorithm 1 Iterative Inversion

Require: Desired outputs $y_1, \dots, y_M \in \mathcal{Y}$, loss function $L : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, initial parameter $\theta_0$.
1: for $n = 0, 1, 2, \dots$ do
2:   Calculate current inputs: $x_1^n, \dots, x_M^n = G_{\theta_n}(y_1), \dots, G_{\theta_n}(y_M)$
3:   Calculate current outputs: $y_1^n, \dots, y_M^n = F(x_1^n), \dots, F(x_M^n)$
4:   Regression: $\theta_{n+1} = \arg\min_{\theta \in \Theta} \frac{1}{M} \sum_{i=1}^{M} L\big(G_\theta(y_i^n), x_i^n\big)$
5: end for

We next investigate when, and why, iterative inversion should produce an effective solution for (3). We restrict ourselves to a linear function class for $G_\theta$, and the squared loss. We report convergence results for different classes of functions $F$.

Denote $X_n \equiv (x_1^n, \dots, x_M^n)^T \in \mathbb{R}^{M \times \dim(\mathcal{X})}$ and $F(X_n) \equiv (F(x_1^n), \dots, F(x_M^n))^T \in \mathbb{R}^{M \times \dim(\mathcal{Y})}$ as the input and output matrices, $\bar{X}_n \equiv \sum_{i=1}^{M} x_i^n / M \in \mathbb{R}^{\dim(\mathcal{X})}$, $\bar{Y} \equiv \sum_{i=1}^{M} y_i / M$ and $\overline{F(X_n)} \equiv \sum_{i=1}^{M} F(x_i^n) / M \in \mathbb{R}^{\dim(\mathcal{Y})}$ as the inputs, desired outputs, and current outputs means, $(\cdot)^\dagger$ the Moore-Penrose pseudoinverse operator, and $F^{-1}$ the ground-truth inverse function.

We start with the simple case of a linear $F$. As is clear from Figure 1, the distribution shift is not a problem in this case, and iterative inversion converges in a single iteration.

Theorem 1. If $F$ is a linear function and $\mathrm{rank}\big(F(X_0) - \overline{F(X_0)}\big) = \dim(\mathcal{Y})$ then Algorithm 1 converges in one iteration, i.e., $y_1^1, \dots, y_M^1 = y_1, \dots, y_M$.

Iterative Inversion can be interpreted as a variant of the classic Newton's method (Ortega & Rheinboldt, 2000), where we replace the unknown Jacobian $J$ of $F$ with a linear approximation using the current input-output pairs, and the evaluation of $F$ with the mean of the current outputs. Recall that Newton's method seeks to find the root $x^*$ of a function $r(x) = F(x) - y$ using the iterative update rule $x_{n+1} = x_n + (y - F(x_n))[J(x_n)]^{-1}$, where $[J(x_n)]^{-1}$ is the Jacobian inverse of $F$ at $x_n$.
Iterative Inversion, similarly, applies the following updating rule, as proved in Appendix A.1,

$$\bar{X}_{n+1} = \bar{X}_n + \big(\bar{Y} - \overline{F(X_n)}\big)\tilde{J}_n^{-1}, \tag{4}$$

where $\tilde{J}_n^{-1} \equiv \big(F(X_n) - \overline{F(X_n)}\big)^\dagger (X_n - \bar{X}_n)$ is the Jacobian of $G_{\theta_{n+1}}$, the linear regressor plane from $F(x)$ to $x$ at $x_1^n, \dots, x_M^n$, which can be considered to be an approximation of $[J(\bar{X}_n)]^{-1}$. When the approximations $\tilde{J}_n^{-1} \approx [J(\bar{X}_n)]^{-1}$ and $\overline{F(X_n)} \approx F(\bar{X}_n)$ are accurate, Iterative Inversion coincides with Newton's method, and enjoys similar convergence properties, as we establish next.

Assumption 1. $F : \mathbb{R}^K \to \mathbb{R}^K$ is bijective, and $F$ and $F^{-1}$ are both continuously differentiable.

Denote by $J(x)$ the Jacobian matrix of $F$ at $x \in \mathcal{X}$, and by $J^{-1}(x) \equiv [J(x)]^{-1}$ the inverse matrix of $J(x)$, which under Assumption 1 is also the Jacobian of $F^{-1}$ at $F(x) \in \mathcal{Y}$. Also denote by $\|\cdot\|$ any induced norm (Horn & Johnson, 2012). We assume that the derivatives of $F$ and $F^{-1}$ are bounded, as follows.

Assumption 2. $\|J(x_1) - J(x_2)\| \le \gamma$, $\|J(x)\| \le \zeta$ and $\|J^{-1}(x)\| \le \beta$ for all $x_1, x_2, x \in \mathbb{R}^K$.

Further assume that at every iteration $n$, the approximations $\tilde{J}_n^{-1}$ and $\overline{F(X_n)}$ are accurate enough.

Assumption 3. For all $n$: $\|\overline{F(X_n)} - F(\bar{X}_n)\| \le \lambda$ and $\tilde{J}_n^{-1} = J^{-1}(\bar{X}_n)(I + \Delta_n)$, with $\|\Delta_n\| \le \delta < 1/(\zeta\beta)$.

Assumption 3 may hold, for example, when the inputs $x_1^n, \dots, x_M^n$ are distributed densely, relative to the curvature of $F$, and evenly, such that the regression problem in Algorithm 1 is well-conditioned. The requirement $\delta < 1/(\zeta\beta)$ is set to ensure that $\tilde{J}_n^{-1}$ is non-singular.

Theorem 2. Suppose Assumptions 1, 2 and 3 hold. Let $\mu \equiv \frac{\zeta^2 \beta \delta}{1 - \zeta\beta\delta}$ and assume $\beta(1 + \delta)(\gamma + \mu) < 1$. Let $\rho \equiv \frac{2\lambda\beta(1+\delta)(\mu+\zeta)}{1 - \beta(1+\delta)(\mu+\gamma)}$. Then for every $\epsilon > 0$ there exists $k < \infty$ such that $\|\overline{F(X_k)} - \bar{Y}\| \le \rho + \epsilon$.

The proof of Theorem 2 builds on the analysis of Newton's method to show that IT-IN is an iterated contraction, and is reported in Section A.3 of the supplementary material. In Theorem 2, the term $\rho$ can be interpreted as the radius of the ball centered at $\bar{Y}$ that the sequence converges to.
To get some intuition about Theorem 2, consider the 1-dimensional case $F : \mathbb{R} \to \mathbb{R}$, where the approximations in Assumption 3 are perfect, i.e., $\lambda = \delta = \mu = \rho = 0$. Then, the condition for convergence is

$$\beta(1 + \delta)(\gamma + \mu) = \beta\gamma < 1 \implies \frac{\max |F'(x)|}{\min |F'(x)|} < 2,$$

which can be interpreted as a 'close to linear' $F$. The conditions in Theorem 2 can therefore be intuitively interpreted as $F$ being 'close to linear' globally, and the linear approximation being accurate locally. In Appendix A.4, we provide additional convergence results that use a different analysis technique for the simple case $F : \mathbb{R} \to \mathbb{R}$. These results do not require Assumption 3, but still require a condition similar to $\frac{\max |F'(x)|}{\min |F'(x)|} < 2$, and show a linear convergence rate. We further remark that a quadratic convergence rate is known for Newton's method when the initial iterate is close to optimal; we believe that similar results can be shown for IT-IN as well. Here, however, we focused on the case of an arbitrary initial iterate, similarly to the experiments we shall describe in the sequel.
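To make the 1-dimensional discussion concrete, the following is a minimal sketch of Algorithm 1 with an affine $G_\theta$ and squared loss. The test function $F(x) = x + 0.2\sin(x)$ (derivative ratio $1.2/0.8 = 1.5 < 2$), the input and output ranges, and the use of `np.polyfit` for the regression step are all illustrative assumptions, not choices taken from the paper. The sketch also starts from initial inputs rather than an initial parameter $\theta_0$, which is equivalent for this purpose.

```python
import numpy as np

def iterative_inversion(F, y_desired, x_init, n_iters=30):
    """Sketch of Algorithm 1 in 1D with an affine G(y) = theta * y + theta0."""
    x = np.asarray(x_init, dtype=float)
    history = []
    for _ in range(n_iters):
        y = F(x)                                  # current outputs (line 3)
        theta, theta0 = np.polyfit(y, x, deg=1)   # least squares fit of x on y (line 4)
        x = theta * y_desired + theta0            # new inputs for the desired outputs (line 2)
        # Track the worst-case output error on the desired outputs.
        history.append(float(np.max(np.abs(F(x) - y_desired))))
    return (theta, theta0), history

# Hypothetical monotone test function with max|F'| / min|F'| = 1.5.
F = lambda x: x + 0.2 * np.sin(x)

y_desired = np.linspace(4.0, 5.0, 20)   # desired outputs, far from the initial data
x_init = np.linspace(-1.0, 0.0, 20)     # arbitrary initial inputs

params, history = iterative_inversion(F, y_desired, x_init)
print("error after first iteration:", history[0])
print("error after last iteration: ", history[-1])
```

Because $G_\theta$ is affine while $F$ is not, the error is not expected to vanish; it should settle at a small residual, in the spirit of the radius $\rho$ in Theorem 2.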

4. ITERATIVE INVERSION FOR LEARNING CONTROL

In this section, we apply iterative inversion for learning control. We first present our problem formulation, and then propose an IT-IN algorithm. We follow a standard RL formulation. Let $\mathcal{S}$ denote the state space, $\mathcal{A}$ denote the action space, and consider the dynamical system in Equation 1. We assume, for simplicity, that the initial state $s_0$ is fixed, and that the time horizon is $T$. Given a state-action trajectory $\tau = s_0, a_0, \dots, s_{T-1}, a_{T-1}, s_T \in \Omega$, where $\Omega$ denotes the $T$-step trajectory space, we denote by $\tau_s \in \Omega_s$ its state component and by $\tau_a \in \Omega_a$ its action component, i.e., $\tau_s = s_0, \dots, s_T$, $\tau_a = a_0, \dots, a_{T-1}$, and $\Omega = \Omega_s \times \Omega_a$. We will henceforth refer to $\tau_s$ as a state trajectory and to $\tau_a$ as an action trajectory. Let $F$ denote the mapping from an action trajectory to the resulting state trajectory, as given by Equation (2). For presenting our control learning problem, we will assume that $F$ is bijective, and therefore $F^{-1}$ is well defined. We emphasize, however, that our algorithm makes no explicit use of $F^{-1}$, and our empirical results are demonstrated on problems where this assumption does not hold.

We represent a state trajectory using an embedding function $z = Z(\tau_s) \in \mathcal{Z}$, and we term $z$ the intent. Note that $z$, by definition, can contain partial information about $\tau_s$, such as the goal state (Ghosh et al., 2019). In all the experiments reported in the sequel, we generated intents by feeding a rendered video of the state trajectory into a VQ-VAE encoder, which we found to be simple and well performing. Consider a state-action trajectory $\tau$, with a corresponding intent $Z(\tau_s)$. We would like to learn a policy that reconstructs the intent into its corresponding action trajectory $\tau_a$. Let $H_t$ denote the space of $t$-length state-action histories, and let $\pi_t : \mathcal{Z} \times H_t \to \mathcal{A}$ denote a policy.
With a slight abuse of notation, we denote by $\pi(z) \in \Omega_a$ the action trajectory that is obtained when applying $\pi_t$ sequentially for $T$ time steps (i.e., a rollout). Similarly to the problem in Section 3, our goal is to learn a policy such that $\pi(Z(\tau_s)) = F^{-1}(\tau_s)$. More specifically, let $L : \Omega_a \times \Omega_a \to \mathbb{R}$ be a loss function between action trajectories, and let $P(\tau_s)$ denote a distribution over desired state trajectories. We seek a policy $\pi_\theta$ parametrized by $\theta \in \Theta$ that minimizes the average loss:

$$\min_{\theta \in \Theta} \mathbb{E}_{\tau_s \sim P}\left[ L\big(\pi_\theta(Z(\tau_s)), F^{-1}(\tau_s)\big) \right]. \tag{5}$$

In our approach we assume that $P(\tau_s)$ is not known, but we are given a set $D_{\text{steer}}$ of $M$ intents, $z_1, \dots, z_M$, where $z_i = Z(\tau_s^i)$, and $\tau_s^i$ are drawn i.i.d. from $P(\tau_s)$. Henceforth, we will refer to $D_{\text{steer}}$ as the steering dataset, as it should act to steer the learning of the inverse mapping towards the desired trajectory distribution $P(\tau_s)$.

It is worth relating Problem (5) to the general inverse problem in Section 3, and what we referred to as the distribution shift problem. Initially, the policy is not expected to be able to produce state-action trajectories that match the state trajectories in $D_{\text{steer}}$, but only trajectories that are output by the initial (typically random) policy. While these initial trajectories could be used for imitation learning, yielding an intent-conditioned policy, there is no reason to expect that this policy will be any good for intents in $D_{\text{steer}}$, which are out-of-distribution with respect to the policy's training data.

We now propose a method for solving Problem (5) based on iterative inversion, as detailed in Algorithm 2. There are four notable differences from the iterative inversion method in Algorithm 1. First, we operate on batches of size $N$ instead of on the whole steering data (of size $M$), for computational efficiency. Second, we sample a batch of intents from a mixture of the steering dataset and the intents calculated for rollouts in the previous iteration.
We found that this helps stabilize the algorithm. Third, we add random exploration noise to the policy when performing the rollouts, which we found to be necessary (see Sec. 5). Fourth, we used a replay buffer for the supervised learning part of the algorithm, also for improved stability. For $L$, we used the MSE between action trajectories, and for the optimization in line 7, we perform several epochs of gradient-based optimization using Adam (Kingma & Ba, 2014), keeping the state history input to $\pi_\theta(\hat{z})$ fixed as $\tau_s$ when computing the gradient. The size of the replay buffer was set to $K \times N$.

Algorithm 2 (excerpt; only steps 4 to 6 are recoverable)

4: Perform $N$ rollouts $\tau^1, \dots, \tau^N$ using policy $\pi_{\theta_n}$ with input intents $z_1, \dots, z_N$, adding exploration noise $\eta$
5: Compute intents for the rollouts $\hat{z}_i = Z(\tau_s^i)$, $i \in 1, \dots, N$
6: Add intents and trajectories $\{\hat{z}_i, \tau^i\}$ to Replay Buffer

[...] there are no rewards, and the loss function is routine. In the following, we provide empirical evidence that, perhaps surprisingly given the strict conditions for convergence of iterative inversion, IT-IN yields well-performing policies on several nontrivial tasks.
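The loop described in this section can be sketched on a toy problem. The following is a heavily simplified illustration, not the paper's implementation: the system is a hypothetical 1D double integrator, the intent is the position trajectory itself (an identity embedding standing in for the VQ-VAE), the policy is an open-loop affine map from intent to action trajectory (standing in for the transformer), and the supervised step is a closed-form least squares fit (standing in for Adam epochs). All names and hyperparameter values are assumptions made for the example.

```python
import numpy as np

T, DT = 8, 0.1  # horizon and time step (illustrative values)

def rollout(actions):
    """F: action trajectories (N, T) -> position trajectories (N, T), starting at rest."""
    vel = np.cumsum(actions * DT, axis=1)   # v_{t+1} = v_t + a_t * dt
    return np.cumsum(vel * DT, axis=1)      # p_{t+1} = p_t + v_{t+1} * dt

def embed(states):
    """Z: identity embedding (assumption; the paper uses a VQ-VAE video encoder)."""
    return states

def policy(theta, intents):
    """Open-loop affine policy mapping an intent to a whole action trajectory."""
    W, b = theta
    return intents @ W + b

def fit(buf_z, buf_a):
    """Supervised step: least squares regression from intents to actions."""
    feats = np.hstack([buf_z, np.ones((len(buf_z), 1))])
    sol, *_ = np.linalg.lstsq(feats, buf_a, rcond=None)
    return sol[:-1], sol[-1]

def it_in(steer, eta, n_iters=5, batch=32, buffer_batches=4, seed=0):
    rng = np.random.default_rng(seed)
    theta = (np.zeros((T, T)), np.zeros(T))
    buf_z, buf_a = np.empty((0, T)), np.empty((0, T))
    prev = steer[:0]                                    # no previous rollouts yet
    for _ in range(n_iters):
        pool = np.vstack([steer, prev])                 # steering + previous rollout intents
        z = pool[rng.integers(len(pool), size=batch)]
        actions = policy(theta, z) + eta * rng.standard_normal((batch, T))
        z_hat = embed(rollout(actions))                 # intents of the executed rollouts
        keep = buffer_batches * batch                   # bounded replay buffer
        buf_z = np.vstack([buf_z, z_hat])[-keep:]
        buf_a = np.vstack([buf_a, actions])[-keep:]
        theta = fit(buf_z, buf_a)                       # imitate own rollouts
        prev = z_hat
    return theta

def tracking_error(theta, intents):
    reached = embed(rollout(policy(theta, intents)))
    return float(np.mean(np.linalg.norm(reached - intents, axis=1)))

# Hypothetical steering set: intents of smooth pushes with random amplitude.
amp = np.random.default_rng(1).uniform(1.0, 3.0, size=(16, 1))
steer = embed(rollout(np.sin(np.linspace(0.0, np.pi, T)) * amp))

err_noise = tracking_error(it_in(steer, eta=0.5), steer)
err_no_noise = tracking_error(it_in(steer, eta=0.0), steer)
print("with exploration noise:   ", err_noise)
print("without exploration noise:", err_no_noise)
```

Since this toy system is linear and invertible, the regression recovers the inverse almost immediately once the rollouts are diverse enough, so the sketch mainly illustrates the data flow. With `eta=0.0` and a zero-initialized policy, every rollout is identical and the regression is degenerate, which loosely mirrors the role that exploration noise plays in conditioning the supervised problem.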

5. EXPERIMENTS

In this section, we evaluate IT-IN on several domains. Our investigation is aimed at studying the unique features of IT-IN and especially, the steering behavior that we expect to observe. We start by describing our evaluation domains, and implementation details that are common to all our experiments. We then present a series of experiments aimed at answering specific questions about IT-IN. To appreciate the learned behavior, we encourage the reader to view our supporting video results at: https://sites.google.com/view/iter-inver.

COMMON SETTINGS:

VQ-VAE Intents: For all our experiments, we generate intents using a VQ-VAE embedding of a rendered video of the trajectory. Rendering settings are provided next for each environment. We use VideoGPT's VQ-VAE implementation (Yan et al., 2021 ). An input video of size 64 × 64 × T (w, h, t) is encoded into a 16 × 16 × T /4 integer intent z i given a codebook of size 50. Each integer represents a float vector of length 4. The training of the VQ-VAE is not the focus of this work, and we detail the training data for each VQ-VAE separately for each domain in the supplementary material. We remark that by visually inspecting the reconstruction quality, we found that our VQ-VAEs generalized well to the trajectories seen during learning.
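As a quick illustration of the sizes quoted above, the sketch below computes the intent shape for a given video length and expands integer codes into their codebook vectors. It assumes the 64 to 16 and $T$ to $T/4$ reductions correspond to a factor-of-4 downsampling on each axis; the random codebook and random codes are placeholders, not outputs of a trained VQ-VAE.

```python
import numpy as np

CODEBOOK_SIZE, CODE_DIM = 50, 4  # values quoted in the text

def intent_shape(T, width=64, height=64):
    """Shape of the integer intent for a (width, height, T) video."""
    return (width // 4, height // 4, T // 4)

def decode_codes(intent, codebook):
    """Replace every integer code by its float codebook vector."""
    return codebook[intent]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))    # placeholder codebook
intent = rng.integers(CODEBOOK_SIZE, size=intent_shape(64))  # placeholder encoder output
vectors = decode_codes(intent, codebook)
print(intent.shape, vectors.shape)
```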

GPT-based policies and exploration noise: The policy architecture is adapted from VideoGPT (Yan et al., 2021), and consists of 8 layers, 4 heads and a hidden dimension of size 64. The model is conditioned on the intent via cross-attention. In the supplementary material, we report similar results with a GRU-based policy. Our exploration noise adds Gaussian noise of scale η to the action output.

Evaluation Protocol: While our algorithm only uses a loss on actions, a loss on the resulting trajectories is often easier to interpret for measuring performance. We measure the sum of Euclidean distances between agent state variables, accumulated over time, as a proxy for trajectory similarity; in our results, this measure is denoted as MSE. Except when explicitly noted otherwise, all our results are evaluated on test trajectories (and corresponding intents) that were not in the steering data, but were generated from the same trajectory distribution. None of the trajectories we plot or our video results are cherry picked.
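The two ingredients described here, additive Gaussian exploration noise and the trajectory distance used for evaluation, can be sketched as follows. The function names and the array layout (one row per time step) are assumptions for illustration.

```python
import numpy as np

def add_exploration_noise(action, eta, rng):
    """Add zero-mean Gaussian noise of scale eta to a policy action."""
    action = np.asarray(action, dtype=float)
    return action + eta * rng.standard_normal(action.shape)

def trajectory_distance(states_a, states_b):
    """Sum over time of Euclidean distances between two state trajectories.

    Both inputs have shape (num_steps, state_dim).
    """
    diff = np.asarray(states_a, dtype=float) - np.asarray(states_b, dtype=float)
    return float(np.linalg.norm(diff, axis=1).sum())

rng = np.random.default_rng(0)
noisy = add_exploration_noise([0.1, -0.2], eta=0.05, rng=rng)
dist = trajectory_distance([[0, 0], [0, 0]], [[3, 4], [0, 0]])
print(noisy, dist)
```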

DOMAINS

2D Particle: A particle robot is moved on a friction-less 2D plane, by applying a force F = [F X , F Y ] for a duration of ∆t. The observation space includes the positions and velocities of the particle S = [X, Y, V X , V Y ], and motion videos are rendered using Matplotlib Animation (Hunter, 2007) . While relatively simple for control, this environment allows for distinct and diverse behaviors that are easy to visualize. In particular, we experiment with 2 behavior classes, for which we procedurally created training trajectories: (1) Spline motion, and (2) Deceleration motion. Both require highly coordinated actions, and are very different from the motion that a randomly initialized policy induces. Full details about the datasets are described in Appendix B.1.1. Reacher: A 2-DoF robotic arm from OpenAI Gym's Mujoco Reacher-v2 environment (Brockman et al., 2016) . While usually in Reacher-v2 the agent is rewarded for reaching a randomly generated target, the goal in our setting is for the policy to reconstruct the whole arm motion, as given by the intent, which is encoded from a video of the motion rendered using Mujoco (Todorov et al., 2012) . We handcrafted a trajectory dataset, termed FixedJoint, which is fully described in Appendix B.2.1. Hopper: From OpenAI Gym's Mujoco Hopper-v2 environment (Brockman et al., 2016) . The dataset is from D4RL's hopper-medium-v2 (Fu et al., 2020) , and consists of mostly forward hopping behaviors. There are several challenges in this domain: (1) the dynamics are non-linear, and include a non-smooth contact with the ground; (2) the desired behavior (hopping) is very different from the behavior of an untrained policy (falling), and requires applying a very specific force exactly when making contact with the ground (a 'bottleneck' in the state space); and (3) the camera is fixed on the agent, and forward movement can only be inferred from the movement of the background.
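For readers who want a concrete picture of the simplest domain, here is a minimal sketch of the 2D particle dynamics as described: a constant force applied for a duration ∆t on a frictionless plane. The unit mass, the default time step, and the exact constant-acceleration integration are assumptions; the paper's environment may differ in these details.

```python
import numpy as np

def particle_step(state, force, dt=0.1, mass=1.0):
    """Advance a frictionless 2D particle by one step.

    state = [X, Y, VX, VY], force = [FX, FY], held constant for dt.
    """
    state = np.asarray(state, dtype=float)
    pos, vel = state[:2], state[2:]
    acc = np.asarray(force, dtype=float) / mass
    new_pos = pos + vel * dt + 0.5 * acc * dt ** 2   # exact under constant acceleration
    new_vel = vel + acc * dt
    return np.concatenate([new_pos, new_vel])

def particle_rollout(state, forces, dt=0.1, mass=1.0):
    """Apply a sequence of forces and return the visited states, initial state included."""
    states = [np.asarray(state, dtype=float)]
    for force in forces:
        states.append(particle_step(states[-1], force, dt, mass))
    return np.stack(states)

trajectory = particle_rollout([0, 0, 0, 0], [[1, 0], [0, 1], [-1, 0]], dt=1.0)
print(trajectory)
```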

STEERING EVALUATION

The first question we investigate is whether IT-IN indeed steers learning towards the desired behavior. To answer this, we consider domains where the desired behavior is very different from the behavior of the initial random policy -the Spline and Deceleration motions for the particle, and the hopping behavior for Hopper-v2. As we show in Figure 2 (for particle), and Figure 3 (for Hopper-v2), IT-IN produces a policy that can track the desired behavior with high accuracy. We further show, in Figure 5 and Figure 6 in the supplementary material, that IT-IN works well for different trajectory lengths T . Another question is whether IT-IN really steers the policy towards the desired trajectories, or perhaps improves some general properties of the policy, allowing a generally better reconstruction. We explore this question by a cross-evaluation -evaluating the performance of a policy trained with steering intents from Particle:Splines on test intents from Particle:Deceleration, which we will refer to as out-of-distribution intents, and vice versa. Interestingly, as Table 1 shows, performance on out-of-distribution intents is significantly worse than the performance that would have been obtained by training the policy with these intents as the steering dataset, and is even worse or comparable to training with no steering at all (cf. Table 2 ). We also evaluated the importance of the exploration noise. We tested Splines with T = 64 and Hopper-v2 with T = 128 with and without exploration noise, and a large D steer (2180 for Hopper-v2, 500 for Particle). As the results in Table 6 in the supplementary material show, exploration noise η is crucial for the training procedure to converge towards the desired behavior. Relating this observation to our theoretical analysis, we believe that exploration improves the conditioning of the supervised learning problem. 

STEERING DATASET SIZE AND GENERALIZATION

We next evaluate the generalization performance of IT-IN to intents that were not seen in the data, but correspond to state trajectories drawn from P (τ s ). To investigate this, we consider a domain where the desired behavior is very diverse -the Spline motions for the particle. We also report results on domains where the behavior is less diverse, such as Hopper-v2 and Deceleration motions for particle. Naturally, we expect generalization to correlate with M , the size of D steer . As our results in Table 2 show, additional steering data indeed improves generalization to unseen trajectories, albeit with diminishing returns as the amount of steering data is increased. As expected, in the more diverse distribution there was more gain to reap from additional data (significant improvement up to  |D

COMPARISON WITH RL BASELINES

We compare IT-IN with reward-driven RL baselines. We consider the Particle:Splines environment, and two reward functions: (1) STATE-MSE: MSE between desired position and current position, and (2) INTENT-MSE: a sparse reward that is the MSE between the intents of the desired trajectory and the executed trajectory, given at the end of the episode. STATE-MSE is privileged compared to IT-IN and is arguably stronger than any IRL method in this task, as the reward is dense, and exactly captures the desired behavior. Any IRL method will run RL in an inner loop, with a reward that is less precise. INTENT-MSE is motivated by the fact that IT-IN effectively learns some similarity measure in intent space, and this reward captures this idea explicitly. We used exactly the same policy architecture for all comparisons. We found that both RL methods did not train well with the GPT-based policy architecture (the difficulty of RL with transformers was discussed in Parisotto et al., 2020; Hausknecht & Wagener, 2022), therefore we report results for the GRU policy, which is described in detail in the supplementary material. We used PPO (Schulman et al., 2017) for RL training, based on the implementation of Kostrikov (2018). In Figure 4 

6. DISCUSSION

We presented a new formulation for learning control, based on an inverse problem approach, and demonstrated its application to learning deep neural network policies that can reconstruct diverse behaviors, given an embedding of the desired trajectory. We developed the fundamental theory underlying iterative inversion, and demonstrated promising results on several simple tasks. We only considered a particular trajectory embedding based on an off-the-shelf VQ-VAE, which we believe to be general and practical. Important questions for future work include characterizing the effect of the embedding on performance, and training an embedding jointly with the policy. Additionally, the exploration noise, which we found to be important, can potentially be replaced with more advanced exploration strategies from the RL literature. Another interesting question is how to generate intents from a partial description of a trajectory, such as a natural language description. Diffusion models, which have recently gained popularity for learning distributions over latent variables (Rombach et al., 2021) , are one potential approach for this. One open question that remains is the gap between the strict conditions for convergence under a linear approximation in our theory, and the generally stable performance we observed in practice with expressive policies and non-linear dynamics. Another open question is whether iterative inversion can be extended to non-deterministic systems. We believe that our work provides the fundamentals for further investigating these important questions.

A PROOFS

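A.1 PROOF OF EQUATION 4

Throughout this and the rest of the theoretical proofs, with a slight abuse of notation, when a vector $u \in \mathbb{R}^{N}$ is added to a matrix $A \in \mathbb{R}^{M \times N}$, the addition is row-wise: $A + u \equiv A + \mathbf{1}u$, where $\mathbf{1}u = (u, \dots, u)^T \in \mathbb{R}^{M \times N}$. Denote $Y = (y_1, \dots, y_M) \in \mathbb{R}^{M \times \dim(\mathcal{Y})}$. The approximated linear function is $G_{\Theta,b}(Y) = Y\Theta + b$, where $\Theta \in \mathbb{R}^{\dim(\mathcal{Y}) \times \dim(\mathcal{X})}$ and $b \in \mathbb{R}^{1 \times \dim(\mathcal{X})}$ (note the added parameter $b$ to account for the bias). At iteration $n+1$:

$$\Theta_{n+1}, b_{n+1} = \arg\min_{\Theta, b} \left\| F(X_n)\Theta + b - X_n \right\|^2$$

This is an ordinary linear least squares problem with the solution

$$\Theta_{n+1} = \left(F(X_n) - \overline{F(X_n)}\right)^{\dagger} \left(X_n - \bar{X}_n\right), \qquad b_{n+1} = \bar{X}_n - \overline{F(X_n)}\,\Theta_{n+1} \tag{6}$$

Then

$$X_{n+1} = G_{\Theta_{n+1}, b_{n+1}}(Y) = Y\Theta_{n+1} + b_{n+1} = \bar{X}_n + \left(Y - \overline{F(X_n)}\right)\Theta_{n+1}$$

and averaging over the points (denoted by an overbar) obtains the result:

$$\bar{X}_{n+1} = \bar{X}_n + \left(\bar{Y} - \overline{F(X_n)}\right)\Theta_{n+1}$$

A.2 PROOF OF THEOREM 1

We use the notation defined in Appendix A.1. Assume $F(X) = XF + h$ is a linear function with $F \in \mathbb{R}^{\dim(\mathcal{X}) \times \dim(\mathcal{Y})}$ and $h \in \mathbb{R}^{1 \times \dim(\mathcal{Y})}$. Assuming $\operatorname{rank}\left(F(X_0) - \overline{F(X_0)}\right) = \dim(\mathcal{Y})$, the matrix $\left(F(X_0) - \overline{F(X_0)}\right)^T \left(F(X_0) - \overline{F(X_0)}\right)$ is invertible and $\Theta_1$ defined in Equation 6 is well defined. Then, using the fact that $F(X_n) - \overline{F(X_n)} = (X_n - \bar{X}_n)F$:

$$\Theta_1 = \left(F(X_0) - \overline{F(X_0)}\right)^{\dagger} (X_0 - \bar{X}_0) = \left((X_0 - \bar{X}_0)F\right)^{\dagger} (X_0 - \bar{X}_0)$$

and it satisfies $\Theta_1 F = I$, where $I$ is the identity matrix. The bias term, according to Equation 6, is

$$b_1 = \bar{X}_0 - \overline{F(X_0)}\,\Theta_1 = \bar{X}_0 - (\bar{X}_0 F + h)\Theta_1$$

At the end of iteration 1, $X_1 = Y\Theta_1 + b_1$ and its matching outputs equal the desired outputs:

$$F(X_1) = X_1 F + h = Y\Theta_1 F + b_1 F + h = Y + \left(\bar{X}_0 - (\bar{X}_0 F + h)\Theta_1\right)F + h = Y + (\bar{X}_0 F - \bar{X}_0 F - h) + h = Y$$

A.3 PROOF OF THEOREM 2

For clarity of presentation, we use the following notation: $J_n^{-1} \equiv J^{-1}(\bar{X}_n)$, $J_n \equiv J(\bar{X}_n)$, $\bar{F}_n \equiv \overline{F(X_n)}$ and $F_n \equiv F(\bar{X}_n)$. We also define $H_n \equiv F_n - \bar{Y}$ and $\bar{H}_n \equiv \bar{F}_n - \bar{Y}$.

First we show that $\tilde{J}_n^{-1}$ is non-singular. Since $\delta < \frac{1}{\zeta\beta}$,

$$\rho(J_n \Delta_n J_n^{-1}) \le \|J_n \Delta_n J_n^{-1}\| \le \delta\zeta\beta < 1$$

where $\rho(A)$ denotes the spectral radius of $A$. Therefore $(I + J_n \Delta_n J_n^{-1})$ is non-singular and $\tilde{J}_n^{-1} = J_n^{-1}(I + J_n \Delta_n J_n^{-1})$ is non-singular as a product of non-singular matrices. We denote by $\tilde{J}_n \equiv \left(\tilde{J}_n^{-1}\right)^{-1}$ its inverse, and obtain the following bounds:

$$\|H_n - \bar{H}_n\| = \|F_n - \bar{F}_n\| \le \lambda \tag{7}$$

$$\|\tilde{J}_n^{-1}\| = \|(I + \Delta_n)J_n^{-1}\| \le \|J_n^{-1}\|(1 + \|\Delta_n\|) \le \beta(1 + \delta) \tag{8}$$

$$\|\tilde{J}_n - J_n\| \overset{(1)}{\le} \frac{\|J_n\|^2 \|\Delta_n J_n^{-1}\|}{1 - \|J_n \Delta_n J_n^{-1}\|} \le \frac{\|J_n\|^2 \|\Delta_n\| \|J_n^{-1}\|}{1 - \|J_n\| \|\Delta_n\| \|J_n^{-1}\|} \le \frac{\zeta^2 \delta\beta}{1 - \zeta\delta\beta} \equiv \mu \tag{9}$$

$$\|\tilde{J}_n\| \le \|\tilde{J}_n - J_n\| + \|J_n\| \le \mu + \zeta \tag{10}$$

$$\|\bar{F}_n - \bar{Y}\| = \|\bar{H}_n\| = \|(\bar{X}_{n+1} - \bar{X}_n)\tilde{J}_n\| \le \|\tilde{J}_n\| \|\bar{X}_{n+1} - \bar{X}_n\| \le (\mu + \zeta)\|\bar{X}_{n+1} - \bar{X}_n\| \tag{11}$$

Inequality (1) is developed in Horn & Johnson (2012, p. 381). Also note that the rest of the inequalities in (9) are well defined since $\delta < 1/(\zeta\beta)$.

The proof now continues similarly to the proof of Ortega & Rheinboldt (2000, 12.3.3). We set $G_n = \bar{X}_n - \bar{H}_n \tilde{J}_n^{-1} = \bar{X}_{n+1}$, and show that $G_n$ is an Iterated Contraction:

$$
\begin{aligned}
\|\bar{X}_{n+2} - \bar{X}_{n+1}\| &= \|\bar{X}_{n+1} - \bar{H}_{n+1}\tilde{J}_{n+1}^{-1} - \bar{X}_{n+1}\| = \|\bar{H}_{n+1}\tilde{J}_{n+1}^{-1}\| \\
&\overset{(2)}{\le} \beta(1+\delta)\|\bar{H}_{n+1}\| \le \beta(1+\delta)\|\bar{H}_{n+1} - \bar{H}_n - (\bar{X}_{n+1} - \bar{X}_n)\tilde{J}_n\| \\
&\overset{(3)}{\le} \beta(1+\delta)\|\bar{H}_{n+1} - \bar{H}_n - (\bar{X}_{n+1} - \bar{X}_n)J_n\| + \beta(1+\delta)\|\tilde{J}_n - J_n\| \|\bar{X}_{n+1} - \bar{X}_n\| \\
&\overset{(4)}{\le} \beta(1+\delta)\left(2\lambda + \|H_{n+1} - H_n - (\bar{X}_{n+1} - \bar{X}_n)J_n\|\right) + \beta(1+\delta)\mu\|\bar{X}_{n+1} - \bar{X}_n\| \\
&\overset{(5)}{\le} \beta(1+\delta)\left(2\lambda + \gamma\|\bar{X}_{n+1} - \bar{X}_n\|\right) + \beta(1+\delta)\mu\|\bar{X}_{n+1} - \bar{X}_n\| \\
&\le \beta(1+\delta)\left(\frac{2\lambda}{\|\bar{X}_{n+1} - \bar{X}_n\|} + \gamma + \mu\right)\|\bar{X}_{n+1} - \bar{X}_n\| \\
&\overset{(6)}{\le} \beta(1+\delta)\left(\frac{2\lambda(\mu + \zeta)}{\|\bar{F}_n - \bar{Y}\|} + \gamma + \mu\right)\|\bar{X}_{n+1} - \bar{X}_n\| = g\left(\|\bar{F}_n - \bar{Y}\|\right)\|\bar{X}_{n+1} - \bar{X}_n\|
\end{aligned}
$$

where inequality (2) holds because of Bound 8, (3) is the triangle inequality, and (4) is due to Bounds 7 and 9 and the triangle inequality. Inequality (5) is proven in Ortega & Rheinboldt (2000, 3.2.5) and inequality (6) follows from Bound 11.

Assuming $\beta(1+\delta)(\gamma + \mu) < 1$:

$$g\left(\|\bar{F}_n - \bar{Y}\|\right) = 1 \iff \|\bar{F}_n - \bar{Y}\| = \frac{2\lambda\beta(1+\delta)(\mu + \zeta)}{1 - \beta(1+\delta)(\mu + \gamma)} \equiv \rho$$

Hence $g \ge 1$ only when $\bar{F}_n$ is close to $\bar{Y}$. $g$ is a strictly decreasing function of $\|\bar{F}_n - \bar{Y}\|$, thus if $\|\bar{F}_n - \bar{Y}\| \ge \rho + \epsilon$ for some $\epsilon > 0$ then $g(\|\bar{F}_n - \bar{Y}\|) \le \alpha < 1$, where $\alpha$ is independent of $\|\bar{F}_n - \bar{Y}\|$.

The convergence of $\bar{X}_n$ follows from Ortega & Rheinboldt (2000, 12.3.2), as long as $g \le \alpha < 1$. Since the following holds

$$\|\bar{F}_n - \bar{Y}\| \overset{(7)}{\le} (\mu + \zeta)\|\bar{X}_{n+1} - \bar{X}_n\| \le \dots \le (\mu + \zeta)\alpha^n\|\bar{X}_1 - \bar{X}_0\|$$

then for every $\epsilon > 0$ there exists $k < \infty$ such that $\alpha^k \le \frac{\rho + \epsilon}{(\mu + \zeta)\|\bar{X}_1 - \bar{X}_0\|}$, and the convergence is towards the ball $\|\bar{F}_n - \bar{Y}\| \le \rho$. Inequality (7) is from Bound 11. For uniqueness, see the end of the proof of Ortega & Rheinboldt (2000, 12.3.3).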
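The least-squares update of Equation 6 and the one-iteration result of Theorem 1 (Appendix A.2) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the linear system `F_mat`, `h`, the dimensions, and the random seed are all illustrative assumptions.

```python
import numpy as np

def inversion_step(X, FX, Y):
    """One iterative-inversion step with a linear inverse model G(Y) = Y @ Theta + b.

    X:  (M, dim_x) current inputs; FX: (M, dim_y) their outputs F(X);
    Y:  (M, dim_y) desired outputs. Returns the new inputs X_{n+1}.
    """
    X_mean, FX_mean = X.mean(axis=0), FX.mean(axis=0)
    # Equation 6: Theta = (F(X) - mean F(X))^+ (X - mean X),
    #             b     = mean X - mean F(X) @ Theta
    Theta = np.linalg.pinv(FX - FX_mean) @ (X - X_mean)
    b = X_mean - FX_mean @ Theta
    return Y @ Theta + b

rng = np.random.default_rng(0)
dim_x, dim_y, M = 4, 3, 20
F_mat = rng.normal(size=(dim_x, dim_y))   # assumed linear system F(X) = X F + h
h = rng.normal(size=(1, dim_y))

def forward(X):
    return X @ F_mat + h

X0 = rng.normal(size=(M, dim_x))          # arbitrary initial inputs
Y = rng.normal(size=(M, dim_y))           # desired outputs
X1 = inversion_step(X0, forward(X0), Y)

# For a linear F with full-rank centered outputs, one step should already match Y.
residual = np.abs(forward(X1) - Y).max()
print(residual)
```

Under the rank condition of Theorem 1, `residual` should be at the level of floating-point round-off after a single step, even though `dim_x > dim_y` makes the inverse non-unique.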

A.4 CONVERGENCE RESULTS FOR 1-DIMENSIONAL F

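We restrict ourselves to the 1-dimensional case, where $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, and assume the function $F$ is strictly monotone and its maximum and minimum slopes are not too different, thus the function is "close to" linear. We then show convergence at a linear rate. Let $S_F(x_1, x_2) \equiv \frac{F(x_1) - F(x_2)}{x_1 - x_2}$ denote the slope of $F$ between $x_1$ and $x_2$, and let $\max|S_F| \equiv \max_{x_1, x_2 \in \mathcal{X}} |S_F(x_1, x_2)|$ denote the maximum absolute slope of $F$ and similarly $\min|S_F| \equiv \min_{x_1, x_2 \in \mathcal{X}} |S_F(x_1, x_2)|$ the minimum absolute slope.

Assumption 4. $F$ is continuous and strictly monotone, and $\frac{\max|S_F|}{\min|S_F|} \le 2 - \epsilon$ for some $0 < \epsilon \le 1$.

Theorem 3. Assume $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, that Assumption 4 holds, and that there are only two desired outputs, $M = 2$. Then for any $i \in \{1, 2\}$ and any iteration $n$:

$$|F(x_i^{n+1}) - y_i| \le (1 - \epsilon)|F(x_i^n) - y_i|$$

When the number of desired outputs is greater than 2, convergence for each output is generally not guaranteed.

Theorem 4. Assume $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, that Assumption 4 holds, and that at iteration $n$, $\forall i\; x_i^n < F^{-1}(\bar{Y})$ or $\forall i\; x_i^n > F^{-1}(\bar{Y})$. Then

$$\left|\bar{X}_{n+1} - F^{-1}(\bar{Y})\right| \le (1 - \epsilon)\left|\bar{X}_n - F^{-1}(\bar{Y})\right|$$

Theorem 4 guarantees that after a finite number of iterations, the output segment intersects with the desired output segment. Note that Theorems 3 and 4 do not require any kind of approximation as in Assumption 3, nor do they require $F$ to be differentiable.

A.4.1 PROOF OF THEOREM 3

Denote $S_F^{\max} \equiv \max_{x_1, x_2} S_F(x_1, x_2)$ and similarly $S_F^{\min} \equiv \min_{x_1, x_2} S_F(x_1, x_2)$. Assuming $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, the approximated linear function is $G_{\theta, b}(y) = y\theta + b$, where $\theta, b \in \mathbb{R}$ are scalars. At iteration $n+1$ and for $i \in [1, M]$:

$$x_i^{n+1} = G_{\theta_{n+1}, b_{n+1}}(y_i) = y_i\theta_{n+1} + b_{n+1} \tag{12}$$

$$\theta_{n+1}, b_{n+1} = \arg\min_{\theta, b} \sum_{i=1}^{M} \left(\theta F(x_i^n) + b - x_i^n\right)^2$$

Lemma 5. If $\mathcal{X} = \mathcal{Y} = \mathbb{R}$ then $\forall n$: $\frac{1}{S_F^{\max}} \le \theta_{n+1} \le \frac{1}{S_F^{\min}}$ if $F$ is strictly increasing, and $\frac{1}{S_F^{\min}} \le \theta_{n+1} \le \frac{1}{S_F^{\max}}$ if $F$ is strictly decreasing.

Proof. We prove the claim for strictly increasing $F$. The proof for strictly decreasing $F$ is symmetrical. W.L.O.G. assume that $X_n$ is sorted: $\forall i$: $x_i^n \le x_{i+1}^n$. Let $k > i$; then:

$$F(x_i^n) + S_F^{\min}(x_k^n - x_i^n) \le F(x_k^n) \le F(x_i^n) + S_F^{\max}(x_k^n - x_i^n)$$

$$\frac{1}{S_F^{\max}}\left(F(x_k^n) - F(x_i^n)\right) \le x_k^n - x_i^n \le \frac{1}{S_F^{\min}}\left(F(x_k^n) - F(x_i^n)\right)$$

$$
\begin{aligned}
\theta_{n+1} &= \frac{\frac{1}{M}\sum_{i=1}^{M}(x_i^n - \bar{X}_n)\left(F(x_i^n) - \overline{F(X_n)}\right)}{\frac{1}{M}\sum_{i=1}^{M}\left(F(x_i^n) - \overline{F(X_n)}\right)^2}
= \frac{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}(x_k^n - x_i^n)\left(F(x_k^n) - F(x_i^n)\right)}{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}\left(F(x_k^n) - F(x_i^n)\right)^2} \\
&\le \frac{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}\frac{1}{S_F^{\min}}\left(F(x_k^n) - F(x_i^n)\right)^2}{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}\left(F(x_k^n) - F(x_i^n)\right)^2} = \frac{1}{S_F^{\min}}
\end{aligned}
$$

and

$$\theta_{n+1} \ge \frac{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}\frac{1}{S_F^{\max}}\left(F(x_k^n) - F(x_i^n)\right)^2}{\frac{1}{M^2}\sum_{i=1}^{M-1}\sum_{k=i+1}^{M}\left(F(x_k^n) - F(x_i^n)\right)^2} = \frac{1}{S_F^{\max}}$$

When $M = 2$, the regression line passes exactly through the points $(F(x_1^n), x_1^n)$ and $(F(x_2^n), x_2^n)$, and $b_{n+1}$ also takes the following forms:

$$b_{n+1} = x_1^n - \theta_{n+1}F(x_1^n) = x_2^n - \theta_{n+1}F(x_2^n)$$

Then, substituting $b_{n+1}$ in Equation 12, we get for every $i \in \{1, 2\}$:

$$x_i^{n+1} = x_i^n + \theta_{n+1}\left(y_i - F(x_i^n)\right)$$

Denote the slope of $F$ between $x_i^{n+1}$ and $x_i^n$:

$$S_F(x_i^{n+1}, x_i^n) \equiv \frac{F(x_i^{n+1}) - F(x_i^n)}{x_i^{n+1} - x_i^n} = \frac{F(x_i^{n+1}) - F(x_i^n)}{\theta_{n+1}\left(y_i - F(x_i^n)\right)}$$

Then the following equations hold:

$$F(x_i^{n+1}) = F(x_i^n) + \theta_{n+1}S_F(x_i^{n+1}, x_i^n)\left(y_i - F(x_i^n)\right)$$

$$y_i - F(x_i^{n+1}) = \left(1 - \theta_{n+1}S_F(x_i^{n+1}, x_i^n)\right)\left(y_i - F(x_i^n)\right) \tag{13}$$

Using Lemma 5, and since $F$ is always increasing or always decreasing, $\theta_{n+1}S_F(x_i^{n+1}, x_i^n) > 0$ and

$$\frac{1}{2 - \epsilon} \le \frac{\min|S_F|}{\max|S_F|} \le \theta_{n+1}S_F(x_i^{n+1}, x_i^n) \le \frac{\max|S_F|}{\min|S_F|} \le 2 - \epsilon$$

$$\left|1 - \theta_{n+1}S_F(x_i^{n+1}, x_i^n)\right| \le \max\left\{\left|1 - \frac{1}{2 - \epsilon}\right|, |1 - \epsilon|\right\} = 1 - \epsilon \tag{14}$$

Then, substituting into Equation 13:

$$\left|y_i - F(x_i^{n+1})\right| = \left|1 - \theta_{n+1}S_F(x_i^{n+1}, x_i^n)\right| \left|y_i - F(x_i^n)\right| \le (1 - \epsilon)\left|y_i - F(x_i^n)\right|$$

Note the convergence in one iteration for the linear case, when $\epsilon = 1$.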
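The contraction stated in Theorem 3 can be illustrated with a small numerical experiment for $M = 2$. The sketch below is only an illustration under stated assumptions: the function `F(x) = x + 0.2 sin(x)` is a hypothetical example whose secant slopes lie in $[0.8, 1.2]$, so Assumption 4 holds with $\epsilon = 0.5$; the starting points and desired outputs are arbitrary.

```python
import numpy as np

def F(x):
    # Hypothetical strictly increasing system; all slopes lie in [0.8, 1.2],
    # so max|S_F| / min|S_F| <= 1.5 = 2 - eps with eps = 0.5.
    return x + 0.2 * np.sin(x)

def two_point_step(x, y):
    """One iteration for M = 2: the regression line interpolates both points,
    so theta is the inverse secant slope and x_i <- x_i + theta * (y_i - F(x_i))."""
    theta = (x[1] - x[0]) / (F(x[1]) - F(x[0]))
    return x + theta * (y - F(x))

eps = 0.5
x = np.array([-1.0, 2.0])      # initial inputs x^0 (arbitrary)
y = np.array([0.5, 3.0])       # desired outputs (arbitrary)

errors = [np.abs(F(x) - y)]
for _ in range(8):
    x = two_point_step(x, y)
    errors.append(np.abs(F(x) - y))

# Per-output contraction |F(x_i^{n+1}) - y_i| <= (1 - eps) |F(x_i^n) - y_i|,
# checked with a small tolerance for floating-point round-off.
contracts = all(
    np.all(e_next <= (1 - eps) * e_prev + 1e-12)
    for e_prev, e_next in zip(errors, errors[1:])
)
print(contracts, errors[-1])
```

The bound of Theorem 3 is a worst case; on a smooth function like this one the observed errors typically shrink much faster than the factor $1 - \epsilon$ per iteration.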

A.4.2 PROOF OF THEOREM 4

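Denote $L_n$:

$$
\begin{aligned}
L_n &\equiv \frac{\bar{Y} - \overline{F(X_n)}}{F^{-1}(\bar{Y}) - \bar{X}_n}
= \frac{\frac{1}{M}\sum_{i=1}^{M}\left(\bar{Y} - F(x_i^n)\right)}{\frac{1}{M}\sum_{k=1}^{M}\left(F^{-1}(\bar{Y}) - x_k^n\right)}
= \sum_{i=1}^{M}\frac{F^{-1}(\bar{Y}) - x_i^n}{\sum_{k=1}^{M}\left(F^{-1}(\bar{Y}) - x_k^n\right)} \cdot \frac{\bar{Y} - F(x_i^n)}{F^{-1}(\bar{Y}) - x_i^n} \\
&= \sum_{i=1}^{M} w_{n,i}\,\frac{\bar{Y} - F(x_i^n)}{F^{-1}(\bar{Y}) - x_i^n}
= \sum_{i=1}^{M} w_{n,i}\,S_F\left(F^{-1}(\bar{Y}), x_i^n\right)
\end{aligned}
$$

where $w_{n,i} = \frac{F^{-1}(\bar{Y}) - x_i^n}{\sum_{k=1}^{M}\left(F^{-1}(\bar{Y}) - x_k^n\right)}$, $\sum_{i=1}^{M} w_{n,i} = 1$ and, since we assumed $\forall i\; x_i^n < F^{-1}(\bar{Y})$ or $\forall i\; x_i^n > F^{-1}(\bar{Y})$, then $\forall i\; w_{n,i} > 0$. Therefore $L_n$ is a weighted mean of the slopes and $S_F^{\min} \le L_n \le S_F^{\max}$. From Equation 4 the following holds:

$$\bar{X}_{n+1} - \bar{X}_n = \theta_{n+1}\left(\bar{Y} - \overline{F(X_n)}\right) = \theta_{n+1}L_n\left(F^{-1}(\bar{Y}) - \bar{X}_n\right)$$

$$F^{-1}(\bar{Y}) - \bar{X}_{n+1} = (1 - \theta_{n+1}L_n)\left(F^{-1}(\bar{Y}) - \bar{X}_n\right) \tag{15}$$

Using Lemma 5 and the inequalities $S_F^{\min} \le L_n \le S_F^{\max}$, Inequality (14) from Appendix A.4.1 also applies for $L_n$, and we obtain:

$$|1 - \theta_{n+1}L_n| \le 1 - \epsilon$$

$$\left|F^{-1}(\bar{Y}) - \bar{X}_{n+1}\right| \le (1 - \epsilon)\left|F^{-1}(\bar{Y}) - \bar{X}_n\right|$$

B EXPERIMENTAL DETAILS

Table 4 contains Particle and Reacher-v2 specific hyperparameters, while Table 5 lists Hopper-v2 specific hyperparameters. We note that the minor differences in hyperparameter values between the evaluated domains are intended only to achieve slightly better MSE results per domain. We observed that the steering behavior was relatively robust to hyperparameter values.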

B.1 PARTICLE ROBOT

The 2D plane in which the robot is allowed to move is finite, with the maximum coordinates increasing for longer horizons. When rendering the videos we include the entire 2D plane, up to the maximum coordinates. When evaluating policies, a validation set of 2,000 trajectories was used, which were unseen during training of the policies.

B.1.1 DATASETS

Splines. Trajectories follow the function of a B-spline curve. The curves are of degree 2 with 5 control points, which are uniformly sampled between 0 and 1 in a 2-dimensional space.

Deceleration. Random X and Y forces are applied for the first $t_{acc}$ trajectory steps, followed by $T - t_{acc}$ steps of deceleration, where $T$ is the time horizon. Deceleration at step $j > t_{acc}$ is done by setting $F_x^j = -\frac{1}{2}\frac{V_x^{j-1}}{\Delta t}$, $F_y^j = -\frac{1}{2}\frac{V_y^{j-1}}{\Delta t}$ (assuming the mass of the particle is 1). $V$ and $\Delta t$ are defined in Section 5; here $\Delta t = 0.1$.

B.2 REACHER-V2

B.2.1 DATASETS

Fixed Joint. Trajectories were collected to represent a scenario where one of the two robot arm joints is malfunctioning and is held fixed in place. The policy can only control the other robot arm joint. When evaluating policies, a validation set of 2,000 trajectories was used, which were unseen during training of the policies.

B.3 HOPPER-V2

B.3.1 DATASETS

Hopping. The datasets of 2180 trajectories used for sequence lengths 64 and 128 were extracted from D4RL's hopper-medium-v2, and consist of mostly forward hopping behaviors. When evaluating policies, a validation set of 436 trajectories was used, which were unseen during training of the policies. Unlike in the other evaluated domains, where trajectories sampled from a random policy were used to train the VQ-VAE, in Hopper-v2 we used input videos from D4RL's hopper-medium-v2. The reason is that with the initial random policy, the trajectories terminated (the hopper fell down) before reaching the desired T. For IT-IN training, we modified Hopper-v2 slightly so that the episode does not terminate when the hopper falls, thus allowing it to reach T steps.

B.4 GPT-BASED ARCHITECTURE

The model is conditioned on the intent via cross-attention. The actor network is comprised of 2 Linear layers of size 64, with a tanh activation.

B.5 GRU-BASED ARCHITECTURE

The single-layer GRU's hidden state size is set to match the flattened intent size of 4096. The actor network is comprised of 2 hidden Linear layers of size 4096 and a tanh activation is used. 

C.4 REACHER-V2

We present sample reconstruction visualizations for random-action trajectories from Reacher-v2 on 16-step sequences in Figure 10. Sample videos for 64-step FixedJoint sequences (trained with a GPT-based policy) can be found on the project's website: https://sites.google.com/view/iter-inver.

C.5 HOPPER-V2

We show additional examples of rollouts for the Hopper-v2 domain on 128-step sequences in Figure 9.

C.6 EXPLORATION

C.8 STEERING CROSS EVALUATION

In Figure 11 we show example rollouts from the experiments described in Section 5.

C.9 GRU-BASED POLICY EXPERIMENTS

With a GRU-based policy, we report results similar to those shown in Table 2 (analyzing the effect of steering dataset size) in Table 8, and results similar to those of Table 1 (steering cross-evaluation) in Table 9.



A varying time horizon can be handled as an additional input to F.

B-spline: https://en.wikipedia.org/wiki/B-spline



Iterative Inversion for Learning Control
Require: Steering data $D_{\text{steer}}$, exploration noise parameter $\eta$, steering ratio $\alpha \in [0, 1]$, batch size $N$
1: Initialize $D_{\text{prev}} = D_{\text{steer}}$, $\theta_0$ arbitrary
2: for $n = 0, 1, 2, \dots$ do
3:   Sample $\alpha N$ intents from $D_{\text{steer}}$ and $(1 - \alpha)N$ intents from $D_{\text{prev}}$, yielding $z_1, \dots, z_N$
4:
7:   Train $\pi_{\theta_{n+1}}$ by supervised learning: $\theta_{n+1} = \arg\min_{\theta \in \Theta} \sum_{\{\hat{z}, \tau\} \in \text{Replay Buffer}} \left[L\left(\pi_\theta(\hat{z}), \tau_a\right)\right]$
8:   Set $D_{\text{prev}} = \{\hat{z}_i\}_{i=1}^{N}$
9: end for

Note the simplicity of the IT-IN algorithm: it only involves exploration and supervised learning.
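The loop above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the system `intent_of`, the linear least-squares "policy", and all numeric settings are hypothetical stand-ins for the VQ-VAE embedding and neural policy. The steps between sampling intents and training are not visible in the listing here; following the abstract, the sketch assumes they roll out the current policy with exploration noise, compute the achieved intents, and store the pairs in a replay buffer.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.3], [-0.2, 0.8]])

def intent_of(actions):
    # Hypothetical deterministic system plus embedding: maps actions to intents.
    return actions @ A + 0.1 * np.tanh(actions)

def policy(W, intents):
    # Linear stand-in for pi_theta: actions = [intents, 1] @ W.
    return np.hstack([intents, np.ones((len(intents), 1))]) @ W

def steer_error(W, D_steer):
    # Mean distance between requested intents and the intents actually achieved.
    achieved = intent_of(policy(W, D_steer))
    return float(np.mean(np.linalg.norm(achieved - D_steer, axis=1)))

def it_in(D_steer, eta=0.3, alpha=0.5, N=64, iters=20):
    W = np.zeros((3, 2))                       # theta_0 arbitrary
    D_prev = D_steer.copy()                    # step 1
    buf_z, buf_a = [], []                      # replay buffer
    n_steer = int(alpha * N)
    for _ in range(iters):
        # Step 3: mix steering intents with previously achieved intents.
        z = np.vstack([
            D_steer[rng.integers(len(D_steer), size=n_steer)],
            D_prev[rng.integers(len(D_prev), size=N - n_steer)],
        ])
        # Assumed intermediate steps: act with exploration noise,
        # observe the achieved intents, and store the pairs.
        actions = policy(W, z) + eta * rng.normal(size=(N, 2))
        z_hat = intent_of(actions)
        buf_z.append(z_hat)
        buf_a.append(actions)
        # Supervised step: regress executed actions on achieved intents.
        Z = np.vstack(buf_z)
        Z_aug = np.hstack([Z, np.ones((len(Z), 1))])
        W, *_ = np.linalg.lstsq(Z_aug, np.vstack(buf_a), rcond=None)
        D_prev = z_hat                         # step 8
    return W

D_steer = rng.normal(loc=1.0, size=(50, 2))    # hypothetical desired intents
W_init = np.zeros((3, 2))
W_final = it_in(D_steer)
print(steer_error(W_init, D_steer), steer_error(W_final, D_steer))
```

Because the supervised target is always an action that was actually executed, paired with the intent it actually produced, no reward signal is needed; the steering ratio `alpha` only controls which intents the policy is asked to reach while collecting data.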

Figure 2: Particle results on Splines (top) and Deceleration (bottom). Here T = 64 and $|D_{\text{steer}}| = 500$. All trajectories start at (0,0), marked by a blue circle. In Deceleration, the particle quickly decelerates to a stop at t = 32; note the small overshoot at the end of each reconstructed trajectory, due to imperfect reconstruction of stopping in place.

Figure 3: Trajectory reconstructions in Hopper-v2, with T = 128 and |D steer | = 500. Additional rollouts are presented in Appendix C.5 and in the supporting video results.

steer | = 50), compared with the less diverse domain (most of the improvement is achieved already with |D steer | = 10). Trajectory visualizations for Splines with different sizes of D steer are shown in Appendix C.2.

Figure 4: Comparison of IT-IN with RL Baselines. All results are MSE (lower is better) averaged over 3 random seeds. Note that IT-IN outperforms RL on test trajectories (dashed lines) for all D steer sizes.

, we report results both on a held-out test set of trajectories, and on the training trajectories. As expected, for a small D steer , STATE-MSE obtains near perfect reconstruction of training trajectories, yet high error on test trajectories, as the precise reward makes it easy for PPO to overfit. Interestingly, however, when increasing the size of D steer , it becomes more difficult to overfit with PPO, even with the STATE-MSE reward. Note that for |D steer | = 2000, the performance of STATE-MSE on training is worse than the performance of IT-IN on test! We are not aware of studies that investigated RL training of policies conditioned on very diverse contexts, and our results suggest that vanilla PPO is not well suited to this task. Importantly, on test data, IT-IN significantly outperforms both RL methods for all D steer sizes. We attribute this finding to the combination of stable supervised learning updates, and not relying on a reward. Finally, our results for INTENT-MSE do not come close to IT-IN, which we attribute to the more difficult learning from sparse reward.
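For concreteness, the two baseline reward functions compared with IT-IN (STATE-MSE and INTENT-MSE) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and using the negative MSE as the reward (so that a smaller error gives a larger reward) is an assumption about the sign convention rather than a detail stated in the text.

```python
import numpy as np

def state_mse_reward(desired_pos, current_pos):
    """Dense reward, given at every step: negative MSE between the desired
    and the current position (sign convention is an assumption)."""
    diff = np.asarray(desired_pos, dtype=float) - np.asarray(current_pos, dtype=float)
    return -float(np.mean(diff ** 2))

def intent_mse_rewards(desired_intent, executed_intent, horizon):
    """Sparse reward: zero at every step except the last one, which receives
    the negative MSE between the desired and the executed trajectory intents."""
    diff = np.asarray(desired_intent, dtype=float) - np.asarray(executed_intent, dtype=float)
    rewards = np.zeros(horizon)
    rewards[-1] = -float(np.mean(diff ** 2))
    return rewards

# Squared errors are (0, 4), so the mean is 2 and the reward is -2.0.
r_dense = state_mse_reward([1.0, 2.0], [1.0, 0.0])
# Every intent entry differs by 1, so only the final step gets -1.0.
r_sparse = intent_mse_rewards(np.ones(8), np.zeros(8), horizon=4)
```

The dense variant gives the learner feedback at every step of the trajectory, while the sparse variant provides a single scalar per episode, which is consistent with the harder learning problem reported for INTENT-MSE.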

Here $\max|S_F|$ denotes the maximum absolute slope of $F$, and similarly $\min|S_F| \equiv \min_{x_1, x_2 \in \mathcal{X}} |S_F(x_1, x_2)|$ the minimum absolute slope. Assumption 4. $F$ is continuous and strictly monotone, and $\frac{\max|S_F|}{\min|S_F|} \le 2 - \epsilon$ for some $0 < \epsilon \le 1$. Theorem 3. Assume $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, that Assumption 4 holds, and that there are only two desired outputs, $M = 2$. Then for any $i \in \{1, 2\}$ and any iteration $n$: $|F(x_i^{n+1}) - y_i| \le (1 - \epsilon)|F(x_i^n) - y_i|$.

Denote $S_F^{\max} \equiv \max_{x_1, x_2} S_F(x_1, x_2)$ and similarly $S_F^{\min} \equiv \min_{x_1, x_2} S_F(x_1, x_2)$. Assuming $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, the approximated linear function is $G_{\theta, b}(y) = y\theta + b$, where $\theta, b \in \mathbb{R}$ are scalars. At iteration $n + 1$ and for $i \in [1, M]$:

PARTICLE:SPLINES - EFFECT OF TRAJECTORY LENGTH

We tested IT-IN on multiple horizons T in the Splines domain, and found it to work well across horizons of 32, 64 and 128. We present sample visualizations with different T values in Figure 5 (showing the final reconstructed trajectories) and in Figure 6 (showing trajectory progression during an episode).

C.2 PARTICLE:SPLINES - EFFECT OF STEERING DATASET SIZE

In Figure 7 we present trajectory visualizations showcasing the effect of the size of the steering buffer $D_{\text{steer}}$ (cf. Table 2).

C.3 PARTICLE:DECELERATION - EFFECT OF STEERING DATASET SIZE

Similarly to Section C.2, in Figure 8 we showcase the effect of the size of the steering buffer $D_{\text{steer}}$ (cf. Table 2) in the Particle:Deceleration domain.

Figure 5: Example results on the Splines dataset, for different sequence lengths. In all cases shown here |D steer | = 500. To the left of each row we state the average MSE on an evaluation set of 3 policies trained with different seeds. Note the increasing scale of the plots as the sequence length increases. Also note that all trajectories start at (0,0), marked by the blue circle in each plot.

Figure 6: Visualization of trajectory progression in the Splines domain for different horizons T. Note the increasing scale of the plots as the sequence length increases. $|D_{\text{steer}}| = 500$.

Figure 7: Example results on the Particle:Splines dataset for policies trained with different sizes of D steer . Each row corresponds to a different size. Each column corresponds to a specific reference trajectory from the dataset, the intent of which was used as input to the policies. T = 64 was used in all experiments.

Figure 8: Example results on the Particle:Deceleration dataset for policies trained with different sizes of D steer . T = 64 was used in all experiments. Figure structure same as in Figure 7.

Figure 9: Examples of trajectory reconstructions in the Hopper-v2 domain, with T = 128 and |D steer | = 500.

Figure 10: Examples of trajectory reconstructions in the Reacher-v2 domain. In each plot, the red row is the reference trajectory and the blue row is the policy reconstruction. These are based on a GRU policy. For ease of viewing, we modified the dark colors of the original rendered images.

(a) Testing on trajectories from Particle:Splines (b) Testing on trajectories from Particle:Deceleration

Figure 11: Examples comparing how policies trained with steering intents from either Particle:Splines or Particle:Deceleration perform when tested on trajectories from either dataset. We can see that when a policy is trained with steering intents from one dataset, it performs well on that dataset and performs poorly on the other. In each column the reference trajectory is the same.

Steering Dataset Size and Generalization. Here T = 64, and we show MSE averaged over 3 random seeds. Note that |D steer | = 0 represents the case where no steering is used at all. In this case, we use trajectories sampled from a random policy to initialize |D prev | (see Algorithm 2). (*) For Hopper-v2, the maximal |D steer | is 1740 due to a limited amount of data in D4RL.

contains a list of common hyperparameter values that we have used for all the experiments. Table

Common hyperparameters for all experiments.

Mujoco Hopper-v2 hyperparameters.



Evaluation of policies trained with and without exploration. We show average MSE for 3 policies; due to different domains, MSEs are comparable only within each row.

is summarizing the hyperparameters used for training RL policies with PPO (Schulman et al., 2017) and a GRU policy.

Evaluation of IT-IN with a GRU policy on variable Steering Dataset size. T = 16. Note that $|D_{\text{steer}}| = 0$ represents the case where no steering is used at all. In this case, we use trajectories sampled from a random policy to initialize $|D_{\text{prev}}|$ (see Algorithm 2). Columns: $|D_{\text{steer}}|$ = 0, 10, 50, 100, 500, 5000.

Steering cross-evaluation for a GRU-policy. T = 16.

