EQUIVARIANT DESCRIPTOR FIELDS: SE(3)-EQUIVARIANT ENERGY-BASED MODELS FOR END-TO-END VISUAL ROBOTIC MANIPULATION LEARNING

Abstract

End-to-end learning for visual robotic manipulation is known to suffer from sample inefficiency, requiring large numbers of demonstrations. Spatial roto-translation equivariance, or SE(3)-equivariance, can be exploited to improve the sample efficiency of learning robotic manipulation. In this paper, we present SE(3)-equivariant models for visual robotic manipulation from point clouds that can be trained fully end-to-end. By utilizing the representation theory of the Lie group, we construct novel SE(3)-equivariant energy-based models that allow highly sample-efficient end-to-end learning. We show that our models can learn from scratch without prior knowledge and yet are highly sample efficient (5∼10 demonstrations are enough). Furthermore, we show that our models can generalize to tasks with (i) previously unseen target object poses, (ii) previously unseen target object instances of the category, and (iii) previously unseen visual distractors. We experiment with 6-DoF robotic manipulation tasks to validate our models' sample efficiency and generalizability. Code is available at: https://github.com/tomato1mule/edf

1. INTRODUCTION

Learning robotic manipulation from scratch often involves learning from mistakes, making real-world applications highly impractical (Kalashnikov et al., 2018; Levine et al., 2016; Lee & Choi, 2022). Learning from demonstration (LfD) methods (Ravichandar et al., 2020; Argall et al., 2009) are advantageous because they do not involve trial and error, but expert demonstrations are often rare and expensive to collect. Therefore, auxiliary pipelines such as pose estimation (Zeng et al., 2017; Deng et al., 2020), segmentation (Simeonov et al., 2021), or pre-trained object representations (Florence et al., 2018; Kulkarni et al., 2019) are commonly used to improve data efficiency. However, sufficient data for training such pipelines is often burdensome to collect or unavailable in practice.

Recently, roto-translation equivariance has been explored for sample-efficient robotic manipulation learning. Transporter Networks (Zeng et al., 2020) achieve high sample efficiency in end-to-end visual robotic manipulation learning by exploiting SE(2)-equivariance (planar roto-translation equivariance). However, the efficiency of Transporter Networks is limited to planar tasks due to the lack of full SE(3)-equivariance (spatial roto-translation equivariance). In contrast, Neural Descriptor Fields (NDFs) (Simeonov et al., 2021) can achieve few-shot level sample efficiency in learning highly spatial tasks by exploiting SE(3)-equivariance. Moreover, trained NDFs can generalize to previously unseen object instances (in the same category) in unseen poses. However, unlike Transporter Networks, NDFs cannot be trained end-to-end from demonstrations. The neural networks of NDFs need to be pre-trained with auxiliary self-supervised learning tasks and cannot be fine-tuned for the demonstrated tasks. Furthermore, NDFs can only be used for well-segmented point cloud inputs and fixed placement targets. These limitations make it difficult to apply NDFs when 1) no public dataset is available for pre-training on the specific target object category, 2) well-segmented object point clouds cannot be expected, or 3) the placement target is not fixed.

To overcome such limitations, we present Equivariant Descriptor Fields (EDFs), the first end-to-end trainable and SE(3)-equivariant visual robotic manipulation models. EDFs can be trained fully end-to-end to solve highly spatial tasks from only a few (5∼10) demonstrations without requiring any pre-training, object keypoint annotation, or segmentation. Like NDFs, EDFs can generalize to previously unseen target object instances in unseen poses. Furthermore, EDFs can generalize to unseen distracting objects and unseen placement poses (see Figure 1). Our contributions are as follows:

1. To enable end-to-end training, we reformulate the energy minimization problem of NDFs into a probabilistic learning framework with energy-based models on the SE(3)-manifold.
2. We generalize the invariant descriptors of NDFs into representation-theoretic equivariant descriptors. Using equivariant descriptors significantly improves generalizability owing to their orientational sensitivity.
3. We propose a novel energy function and end-to-end trainable query point models to achieve SE(3)-equivariance regarding both the target object and the placement target.
4. EDFs do not resort to non-local mechanisms to achieve SE(3)-equivariance. This specific design enables our method to work well without object segmentation pipelines.

Equivariant Robotic Manipulation. Equivariant models have emerged as a promising approach for robotic manipulation learning, with growing evidence indicating they can significantly improve both sample efficiency and generalizability (Wang & Walters, 2022; Wang et al., 2022). Transporter Networks and their variants (Zeng et al., 2020; Seita et al., 2021) are end-to-end models for visual robotic manipulation tasks that exploit planar roto-translation equivariance, or SE(2)-equivariance, for sample efficiency. Equivariant Transporter Networks (ETNs) (Huang et al., 2022) exploit the representation theory of discrete rotation groups to further improve the sample efficiency. However, the efficiency of SE(2)-equivariant models is limited to planar tasks and cannot be extended to highly spatial tasks. Neural Descriptor Fields (NDFs) (Simeonov et al., 2021) overcome this limitation by leveraging spatial roto-translation equivariance, or SE(3)-equivariance.

Energy-Based Models. Energy-based models (EBMs) are probabilistic models that are derived from energy functions. EBMs are widely used for image and video generation ( […] (2011); Davidchack et al. (2017). The Langevin dynamics and their convergence on general Riemannian manifolds have been studied by Girolami & Calderhead (2011); Gatmiry & Vempala (2022). In this paper, we propose SE(3)-equivariant EBMs on the SE(3) manifold, which should be distinguished from SE(3)-equivariant EBMs on Euclidean spaces (Jaini et al., 2021; Wu et al., 2021).

Representation Theory of Lie Groups. A representation $D$ of a group $G$ is a map from $G$ to the space of linear operators acting on a vector space $V$ that has the following property:

$$D(g)D(h) = D(gh) \quad \forall g, h \in G \tag{1}$$

Any representation $D(R)$ of the SO(3) group for $R \in SO(3)$ can be block-diagonalized into the direct sum of (real) Wigner D-matrices $D_l(R) \in \mathbb{R}^{(2l+1)\times(2l+1)}$ of degree $l \in \{0, 1, 2, \cdots\}$, which are orthogonal matrices (Aubert, 2013). The $(2l+1)$-dimensional vectors that are transformed by $D_l(R)$ are called type-$l$ (or spin-$l$) vectors. Type-0 vectors are invariant to rotations such that $D(R) = I$. Type-1 vectors are the familiar 3-dimensional space vectors with $D(R) = R$. Type-$l$ vectors are identical to themselves when rotated by $\theta = 2\pi/l$. A type-$l$ vector field $f : \mathbb{R}^3 \times \mathcal{X} \to \mathbb{R}^{2l+1}$ is SE(3)-equivariant if

$$D_l(R) f(x|X) = f(Tx \,|\, T \cdot X) \quad \forall\, T = (R, v) \in SE(3),\ X \in \mathcal{X},\ x \in \mathbb{R}^3 \tag{2}$$

where $\mathcal{X}$ is some set equipped with a group action $\cdot$ and $Tx = Rx + v$. Tensor Field Networks (TFNs) (Thomas et al., 2018) and SE(3)-Transformers (Fuchs et al., 2020) are used to implement SE(3)-equivariant vector fields in this work. We provide details of these networks in Appendix G.

3. PROBLEM FORMULATION

Let a colored point cloud with $M$ points be given by $X = \{(x_1, c_1), \cdots, (x_M, c_M)\} \in \mathcal{P}$, where $x_i \in \mathbb{R}^3$ is the position and $c_i \in \mathbb{R}^3$ is the color vector of the $i$-th point, and $\mathcal{P}$ is the set of all possible colored point clouds. The action of SE(3) on point clouds, $\cdot : SE(3) \times \mathcal{P} \to \mathcal{P}$, is then defined as

$$T \cdot X = \{(T x_1, c_1), (T x_2, c_2), \cdots, (T x_M, c_M)\} \quad \forall\, T = (R, v) \in SE(3) \tag{3}$$

Consider a problem where a robot has to learn to grasp and place objects in specific locations from human demonstrations. To solve this problem, the placement pose $T \in SE(3)$ must be inferred such that the relative pose between the grasped object and the placement target remains consistent with the demonstrations, regardless of changes in their postures. This can be achieved by requiring $T$ to be bi-equivariant, such that 1) it follows changes in the posture of the placement target, and 2) it compensates for changes in the posture of the grasp (see Appendix C for more explanation). In contrast, uni-equivariant methods such as NDFs, which can only account for changes in one object, are not well-suited for problems where both postures may change.

Now let $X$ be the point cloud of the scene (where the placement target belongs), and $Y$ be the point cloud of the end-effector (with the grasped object) observed in the end-effector frame $T \in SE(3)$. The formal definition of bi-equivariance is as follows:

Definition 1. A differential probability distribution $dP(T|X, Y)$ on SE(3) conditioned by two point clouds $X, Y \in \mathcal{P}$ is bi-equivariant if for all Borel subsets $\Omega \subseteq SE(3)$,

$$\int_{T \in \Omega} dP(T|X, Y) = \int_{T \in S\Omega} dP(T|S \cdot X, Y) = \int_{T \in \Omega S} dP(T|X, S^{-1} \cdot Y) \quad \forall S \in SE(3) \tag{4}$$

where $S\Omega = \{ST \,|\, T \in \Omega\}$, $\Omega S = \{TS \,|\, T \in \Omega\}$, and $S^{-1}$ denotes the group inverse of $S$.

Definition 2. A scalar function $f : SE(3) \times \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ is bi-equivariant if

$$f(T|X, Y) = f(ST|S \cdot X, Y) = f(TS|X, S^{-1} \cdot Y) \quad \forall S \in SE(3) \tag{5}$$

Proposition 1. A probability distribution $P(T|X, Y)\,dT$ is bi-equivariant if $dT$ is the bi-invariant volume form (see Appendix A) on the SE(3) manifold and $P(T|X, Y)$ is a bi-equivariant probability density function (PDF).

The proof of Proposition 1 can be found in Appendix F.1. The bi-equivariance condition of Definition 1 can be viewed as a generalization of Huang et al. (2022) to non-commutative groups like SE(3) and a probabilistic generalization of Ganea et al. (2021). By Proposition 1, our goal boils down to constructing a bi-equivariant PDF $P(T|X, Y)$ such that

$$P(T|X, Y) = P(ST|S \cdot X, Y) = P(TS|X, S^{-1} \cdot Y) \quad \forall S \in SE(3) \tag{6}$$

Figure 2: A) The model is globally equivariant if the grasp pose is equivariant to the transformations of the whole scene (the target object and background). B) The model is locally equivariant to the target object if the grasp pose is equivariant to the localized transformations of the target object.

However, we want our policy to be not only globally equivariant as in Equation (6) but also locally equivariant. That is, we want our models to be equivariant only to the target object, not the background. We illustrate local and global equivariance in Figure 2. To achieve local equivariance, the model should rely only on local mechanisms to achieve the equivariance. For example, Transporter Networks achieve translational equivariance by using convolutional neural networks, which are well known for their locality (Battaglia et al., 2018; Goodfellow et al., 2016). On the other hand, NDFs (Simeonov et al., 2021) rely on centroid subtraction to obtain translational equivariance, which is highly nonlocal. As a result, unlike Transporter Networks, NDFs cannot be used without object segmentation pipelines.

4. BI-EQUIVARIANT ENERGY BASED MODELS ON SE(3)

In this section, we present EDFs and the corresponding bi-equivariant energy-based models on SE(3). We also provide practical implementations of the proposed models. An overview of our method is illustrated in Figure 3.

4.1. EQUIVARIANT DESCRIPTOR FIELD

We define the EDF $\varphi(x|X)$ as a direct sum of $N$ vector fields

$$\varphi(x|X) = \bigoplus_{n=1}^{N} \varphi^{(n)}(x|X) \tag{7}$$

where $\varphi^{(n)}(x|X) : \mathbb{R}^3 \times \mathcal{P} \to \mathbb{R}^{2l_n+1}$ is an SE(3)-equivariant type-$l_n$ vector field. Therefore, the EDF $\varphi(x|X)$ transforms under a rigid body transformation $T \in SE(3)$ as

$$\varphi(Tx \,|\, T \cdot X) = D(R)\,\varphi(x|X) \quad \forall\, T = (R, v) \in SE(3) \tag{8}$$

where $D(R) = \bigoplus_{n=1}^{N} D_{l_n}(R)$ is the direct sum of the real Wigner D-matrices of degree $l_n$ in the real basis. Therefore, $D(R)$ is an orthogonal representation of the SO(3) group. Note that NDFs (Simeonov et al., 2021) only use type-0 descriptors, which are invariant to rotations such that $D(R) = I$. In contrast, EDFs also use type-1 or higher descriptors, which are highly sensitive to rotations. As a result, NDFs require at least three non-collinear query points (and many more in practice) to represent the orientation of a rigid body, whereas EDFs require only one point.
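The transformation rule of the EDF can be checked numerically on a small hand-written field. The following is a minimal sketch, not the TFN/SE(3)-Transformer implementation used in this work: `toy_edf` is a hypothetical field with one type-0 and one type-1 component built from Gaussian-weighted displacements, point colors are ignored, and $D(R) = 1 \oplus R$ follows from the convention $D_1(R) = R$ stated above.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def act(T, x):
    """Rigid-body action T x = R x + v."""
    R, v = T
    Rx = matvec(R, x)
    return [Rx[i] + v[i] for i in range(3)]

def toy_edf(x, X):
    """Hypothetical toy field: [type-0 scalar, type-1 vector] evaluated at x."""
    scalar, vector = 0.0, [0.0, 0.0, 0.0]
    for p in X:
        d = [x[i] - p[i] for i in range(3)]
        w = math.exp(-sum(c * c for c in d))
        scalar += w                                   # rotation-invariant part
        vector = [vector[i] + w * d[i] for i in range(3)]  # rotates with R
    return [scalar] + vector

def apply_D(R, f):
    """Apply D(R) = D_0(R) (+) D_1(R) = 1 (+) R to a descriptor."""
    return [f[0]] + matvec(R, f[1:])

X = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 0.8, 0.4]]  # point positions only
x = [0.3, -0.1, 0.6]
T = (matmul(rot_z(0.7), rot_x(-1.1)), [0.5, -2.0, 1.0])

lhs = toy_edf(act(T, x), [act(T, p) for p in X])   # phi(Tx | T.X)
rhs = apply_D(T[0], toy_edf(x, X))                 # D(R) phi(x | X)
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))
```

In this sketch the type-0 entry is unchanged by the transformation while the type-1 entries rotate with $R$, which is the orientational sensitivity that distinguishes EDFs from type-0-only descriptors.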

4.2. EQUIVARIANT ENERGY-BASED MODEL ON SE(3)

Naive energy minimization as in Simeonov et al. (2021) cannot be used to simultaneously train the descriptors, as this would result in all the descriptors collapsing to zero (or some other constant). Therefore, we use the EBM approach for the end-to-end training of descriptors. An energy-based model on the SE(3) manifold conditioned by $X, Y \in \mathcal{P}$ can be defined as

$$P(T|X, Y) = \frac{\exp\left[-E(T|X, Y)\right]}{\int_{SE(3)} dT\, \exp\left[-E(T|X, Y)\right]} \tag{9}$$

Figure 3: A) Query points and the query EDF are generated from the point cloud of the grasp. Query EDF values at the query points are used as the query descriptors. We visualize three type-0 descriptors as colors (RGB) and type-1 descriptors as arrows. Type-1 descriptors are only visualized in important locations, and higher-type descriptors are not visualized. B) The key descriptors are generated from the point cloud of the scene. C) The query descriptors are transformed and matched to the key descriptors to produce the energy of the pose. For simplicity, only the query descriptor for a single query point is visualized. Note that the query and key descriptors are better aligned in the low-energy case than in the high-energy case for both the type-0 and type-1 descriptors (the orange query points are near the orange region, and the black arrow is well aligned with the gray arrows).

Proposition 2. The EBM $P(T|X, Y)$ in Equation (9) is bi-equivariant if the energy function $E(T|X, Y)$ is bi-equivariant.

We prove Proposition 2 in Appendix F.2. We now propose the following energy function:

$$E(T|X, Y) = \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \left\| \varphi(Tx|X) - D(R)\,\psi(x|Y) \right\|^2 \tag{10}$$

where $\varphi(x|X)$ is the key EDF, $\psi(x|Y)$ is the query EDF, and $\rho(x|Y)$ is the query density. Note that $T = (R, v)$. The query density is an SE(3)-equivariant non-negative scalar field such that

$$\rho(x|Y) = \rho(Tx \,|\, T \cdot Y) \quad \forall T \in SE(3) \tag{11}$$

Intuitively, the energy function in Equation (10) can be thought of as a query-key matching between the key EDF and the query EDF, analogous to the approaches of Zeng et al. (2020) and Huang et al. (2022).

Proposition 3. The energy function $E(T|X, Y)$ in Equation (10) is bi-equivariant.

We prove Proposition 3 in Appendix F.3. As a result, the EBM in Equation (9) with the energy function in Equation (10) is also bi-equivariant. Lastly, we provide an important consequence of Equation (11), whose proof can be found in Appendix F.4:

Proposition 4. Non-constant query densities that satisfy Equation (11) must be grasp-dependent.

4.3. IMPLEMENTATION

Our method consists of two models, viz. the pick-model and the place-model. For the following sections, we denote all the learnable parameters as $\theta$. Therefore, all the functions with $\theta$ as a subscript are to be understood as trainable models. We visualize the key EDF of a trained pick-model in Figure 4.

Query Density. To make the integral in Equation (10) tractable, we model the query density as weighted query points by taking the weighted sum of Dirac delta functions $\delta^{(3)}(x) = \prod_{i=1}^{3} \delta(x_i)$:

$$\rho_\theta(x|Y) = \sum_{i=1}^{N_q} w_\theta\big(q_{i;\theta}(Y) \,|\, Y\big)\, \delta^{(3)}\big(x - q_{i;\theta}(Y)\big) \tag{12}$$

Figure 4: The key EDF $\varphi(x|X)$ of a trained pick-model is illustrated for scenes with a mug in A) an upright pose and B) a lying pose. Note that the colors (type-0 descriptors) are invariant to the rotation of the mug. On the other hand, the arrows (type-1 descriptors) are equivariant to the rotation. Type-1 descriptors are only visualized in important locations. Higher-type descriptors are not visualized.

where $q_{i;\theta}(Y) : \mathcal{P} \to \mathbb{R}^3$ is the $i$-th query point function and $w_\theta(x|Y) : \mathbb{R}^3 \times \mathcal{P} \to \mathbb{R}^+$ is the query weight field. These maps are SE(3)-equivariant such that

$$q_{i;\theta}(T \cdot Y) = T\, q_{i;\theta}(Y), \qquad w_\theta(Tx \,|\, T \cdot Y) = w_\theta(x|Y) \tag{13}$$

Proposition 5. The query density $\rho_\theta(x|Y)$ in Equation (12) is SE(3)-equivariant.

We prove Proposition 5 in Appendix F.5. Note that as a consequence of Proposition 4, grasp dependence is inevitable for non-constant query point models like Equation (12). In this case, the integral in Equation (10) can be written in the following tractable summation form:

$$E_\theta(T|X, Y) = \sum_{i=1}^{N_q} E_\theta\big(T \,|\, X, Y, w_\theta(q_{i;\theta}(Y)|Y),\, q_{i;\theta}(Y)\big) \tag{14}$$

$$E_\theta(T|X, Y, w, q) = w\, \left\| \varphi_\theta(Tq|X) - D(R)\,\psi_\theta(q|Y) \right\|^2 \tag{15}$$

In Appendix B, we provide practical implementations of $q_{i;\theta}(Y)$ and $w_\theta(x|Y)$ that are continuously parameterized and differentiable.

EDFs. As was argued in Section 3, only local operations should be used in our models to achieve local equivariance. We use Tensor Field Networks (TFNs) (Thomas et al., 2018) for the last layer and SE(3)-Transformers (Fuchs et al., 2020) for the other layers. The convolution operations used in these networks are highly local when their radial functions (see Appendix G) have short cutoff distances. We use simple radius clustering, which is locally SE(3)-equivariant within the clustering radius, to convert the point clouds into graphs. For computational efficiency, only the last layer is used to evaluate the field values at the query points. The outputs of all the other layers depend only on the point cloud and not on the query points. Therefore, during the MCMC steps, only the last layer (TFN) has to be recalculated, and the outputs of the other layers (SE(3)-Transformers) can be reused. We use the E3NN package (Geiger et al., 2022) to implement the equivariant layers.
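The bi-equivariance of the summation-form energy can be verified numerically with toy ingredients. The sketch below is only illustrative and makes several assumptions that are not part of our implementation: `field` is a hypothetical hand-written equivariant descriptor (type-0 plus type-1) standing in for the learned key and query EDFs, the query points are taken to be the points of $Y$ themselves, the query weight is a simple invariant density, and point colors are ignored.

```python
import math

def rot(axis, a):
    """Rotation by angle a about coordinate axis 0, 1, or 2."""
    c, s = math.cos(a), math.sin(a)
    i, j = (axis + 1) % 3, (axis + 2) % 3
    R = [[1.0 if r == q else 0.0 for q in range(3)] for r in range(3)]
    R[i][i], R[i][j], R[j][i], R[j][j] = c, -s, s, c
    return R

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def act(T, x):
    R, v = T
    Rx = matvec(R, x)
    return [Rx[i] + v[i] for i in range(3)]

def compose(A, B):
    """Group product AB, i.e. (AB)x = A(Bx)."""
    (Ra, va), (Rb, vb) = A, B
    Rv = matvec(Ra, vb)
    return (matmul(Ra, Rb), [Rv[i] + va[i] for i in range(3)])

def inverse(T):
    R, v = T
    Rt = [list(row) for row in zip(*R)]
    return (Rt, [-c for c in matvec(Rt, v)])

def field(x, cloud, scale):
    """Hypothetical equivariant descriptor [type-0, type-1] at x."""
    s, vec = 0.0, [0.0, 0.0, 0.0]
    for p in cloud:
        d = [x[i] - p[i] for i in range(3)]
        w = math.exp(-scale * sum(c * c for c in d))
        s += w
        vec = [vec[i] + w * d[i] for i in range(3)]
    return [s] + vec

def energy(T, X, Y):
    """Sum over query points q of w * || phi(Tq|X) - D(R) psi(q|Y) ||^2."""
    R, _ = T
    total = 0.0
    for q in Y:                        # assumed query points: the points of Y
        w = field(q, Y, 2.0)[0]        # invariant, positive query weight
        key = field(act(T, q), X, 1.0)
        qry = field(q, Y, 0.5)
        qry = [qry[0]] + matvec(R, qry[1:])   # D(R) = 1 (+) R
        total += w * sum((a - b) ** 2 for a, b in zip(key, qry))
    return total

X = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 0.8, 0.4], [0.3, -0.7, 0.9]]
Y = [[0.1, 0.0, 0.2], [-0.3, 0.4, 0.0], [0.5, 0.5, -0.2]]
T = (matmul(rot(2, 0.4), rot(0, 1.3)), [0.2, -0.1, 0.3])
S = (matmul(rot(1, -0.9), rot(2, 2.1)), [1.5, 0.7, -2.0])
S_inv = inverse(S)

E0 = energy(T, X, Y)
E_left = energy(compose(S, T), [act(S, p) for p in X], Y)       # E(ST | S.X, Y)
E_right = energy(compose(T, S), X, [act(S_inv, p) for p in Y])  # E(TS | X, S^-1.Y)
```

All three values agree up to floating-point error, which is the scalar bi-equivariance condition of Definition 2 applied to this toy energy.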

5. SAMPLING AND TRAINING

For sampling, we first run the Metropolis-Hastings (MH) algorithm with the $\mathcal{IG}_{SO(3)}$ distribution (Nikolayev & Savyolov, 1970; Savyolova, 1994; Leach et al., 2022) for the orientation proposal and a Gaussian distribution for the translation proposal. Next, we run Langevin dynamics on the SE(3) manifold using the samples from MH as initial seeds. The Lie derivatives (Brockett, 1997; Chirikjian, 2011) for the Langevin dynamics are calculated in the quaternion-translation parameterization as in Davidchack et al. (2017) to avoid singularity issues. We provide details in Appendix D.

For training, we estimate the gradient of the log-likelihood of Equation (9) at $T_{target}$ as

$$\nabla_\theta \log P_\theta(T_{target}|X, Y) \approx -\nabla_\theta E_\theta(T_{target}|X, Y) + \frac{1}{N} \sum_{n=1}^{N} \left[\nabla_\theta E_\theta(T_n|X, Y)\right] \tag{16}$$

where $T_n \sim P(T|X, Y)$ is the $n$-th negative sample (Carreira-Perpinan & Hinton, 2005). However, naively maximizing the log-likelihood is highly unstable. If the query EDF and the key EDF are initially very different at some important sites, the learning algorithm tends to lower the query density at these sites rather than bring the two EDFs closer. Consequently, the query points in essential locations (such as contact points) are pushed away, and the training diverges. To avoid this instability, we propose using the following surrogate query model during the early stage of training. We first decompose the EBM $P(T|X, Y)$ in Equation (14) into

$$P(T|X, Y) = \int dw\, dQ\; P(T|X, Y, w, Q)\, P(w, Q|Y) \tag{17}$$

$$P(T|X, Y, w, Q) = \frac{\exp\left[-\sum_{i=1}^{N_q} E(T|X, Y, w_i, q_i)\right]}{\int_{SE(3)} dT\, \exp\left[-\sum_{i=1}^{N_q} E(T|X, Y, w_i, q_i)\right]}$$

$$P(w, Q|Y) = \prod_{i=1}^{N_q} P_i(w_i, q_i|Y) = \prod_{i=1}^{N_q} \delta\big(w_i - w(q_i|Y)\big)\, \delta^{(3)}\big(q_i - q_i(Y)\big)$$

where $Q = (q_1, \cdots, q_{N_q})$ and $w = (w_1, \cdots, w_{N_q})$. We temporarily hide $\theta$ for brevity.

Proposition 6. The marginal EBM $P(T|X, Y)$ in Equation (17) is bi-equivariant if

$$P(w, Q|Y) = P(w, SQ \,|\, S \cdot Y) \quad \forall S \in SE(3)$$

We prove Proposition 6 in Appendix F.6. We now relax this deterministic query model into a stochastic model by adding Gaussian noise to the logits of the query weights, $l_i = \log w_i$, as follows:

$$\tilde{P}(w, Q|Y) = \prod_{i=1}^{N_q} \tilde{P}_i(w_i, q_i|Y) = \prod_{i=1}^{N_q} \frac{dl_i}{dw_i}\, \mathcal{N}\big(l_i;\, \log w(q_i|Y),\, \sigma_H\big)\, \delta^{(3)}\big(q_i - q_i(Y)\big)$$

Now we propose the following surrogate query model:

$$H(w, Q|X, Y, T) = \prod_{i=1}^{N_q} H_i(w_i, q_i|X, Y, T)$$

$$H_i(w_i, q_i|X, Y, T) = \begin{cases} \tilde{P}_i(w_i, q_i|Y) & \text{if } d_{min}(T q_i, X) < r \\ \dfrac{dl_i}{dw_i}\, \mathcal{N}(l_i;\, \alpha,\, \sigma_H)\, \delta^{(3)}\big(q_i - q_i(Y)\big) & \text{else} \end{cases} \tag{21}$$

where $\sigma_H \in \mathbb{R}^+$, $r \in \mathbb{R}^+$, and $\alpha \in \mathbb{R}$ are hyperparameters and $d_{min}(x, X) : \mathbb{R}^3 \times \mathcal{P} \to \mathbb{R}^+$ is the shortest Euclidean distance between $x$ and the points in $X$. We set $\alpha$ to be sufficiently small so that query points without neighboring points in $X$ are suppressed. To train our models using the surrogate query model in Equation (21), we maximize the following variational lower bound (Kingma & Welling, 2013) instead of the marginal log-likelihood:

$$\mathcal{L}_\theta(T|X, Y) = \mathbb{E}_{w, Q \sim H_\theta}\left[\log P_\theta(T|X, Y, w, Q)\right] - D_{KL}\left[H_\theta(w, Q|X, Y, T)\, \big\|\, \tilde{P}_\theta(w, Q|Y)\right] \tag{22}$$

Proposition 7. The variational lower bound $\mathcal{L}_\theta(T|X, Y)$ in Equation (22) is bi-equivariant.

We provide the proof of Proposition 7 in Appendix F.7. The Kullback-Leibler divergence term in Equation (22) is provided in Appendix B.2. Once the query model has been sufficiently trained, we remove the surrogate query model and return to the maximum likelihood training in Equation (16).
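The structure of the negative-sample gradient estimate used for maximum likelihood training can be illustrated on a one-dimensional toy EBM. This is a minimal sketch under strong simplifying assumptions: the energy $E_\theta(x) = (x - \theta)^2 / 2$ is hypothetical, the model distribution is then exactly $\mathcal{N}(\theta, 1)$, and negative samples are drawn exactly instead of by MCMC on SE(3).

```python
import random

def grad_energy(theta, x):
    """d/dtheta of the toy energy E_theta(x) = 0.5 * (x - theta)^2."""
    return theta - x

def loglik_grad_estimate(theta, target, n_neg, rng):
    # Negative samples from the model; for this toy energy P_theta = N(theta, 1),
    # so exact sampling replaces the MH + Langevin sampler.
    negatives = [rng.gauss(theta, 1.0) for _ in range(n_neg)]
    positive_term = -grad_energy(theta, target)
    negative_term = sum(grad_energy(theta, x) for x in negatives) / n_neg
    return positive_term + negative_term

rng = random.Random(0)
theta, target = 0.5, 2.0
estimate = loglik_grad_estimate(theta, target, 20000, rng)
exact = target - theta  # d/dtheta log N(target; theta, 1)
```

The estimate approaches the exact log-likelihood gradient as the number of negative samples grows. In the toy case the negative term averages to roughly zero, so the update simply pulls $\theta$ toward the demonstrated target.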

6. EXPERIMENTAL RESULTS

We design the experiments to assess the generalization performance of our approach (EDFs) when the number of demonstrations is very limited. Similar to Simeonov et al. (2021), we evaluate the generalization performance of models with a mug-hanging task and bowl/bottle pick-and-place tasks. In the mug-hanging task, a mug has to be picked by its rim and then hung on a hanger by its handle. In the bowl/bottle pick-and-place task, a bowl/bottle should be picked and placed on a tray. As opposed to Simeonov et al. (2021), we randomize the pose of the hanger and tray to evaluate […]

Table 1: Pick-and-place success rates in various out-of-distribution settings.

| Setting | Method | Mug Pick | Mug Place | Mug Total | Bowl Pick | Bowl Place | Bowl Total | Bottle Pick | Bottle Place | Bottle Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Unseen Instances | SE(3)-TNs (Zeng et al., 2020) | 1.00 | 0.36 | 0.36 | 0.76 | 1.00 | 0.76 | 0.20 | 1.00 | 0.20 |
| Unseen Instances | EDFs (Ours) | 1.00 | 0.97 | 0.97 | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
| Unseen Poses | SE(3)-TNs (Zeng et al., 2020) | 0.00 | N/A | 0.00 | 0.00 | N/A | 0.00 | 0.00 | N/A | 0.00 |
| Unseen Poses | EDFs (Ours) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00 | 0.95 |
| Unseen Distracting Objects | SE(3)-TNs (Zeng et al., 2020) | 1.00 | 0.63 | 0.63 | 1.00 | 1.00 | 1.00 | 0.96 | 0.92 | 0.88 |
| Unseen Distracting Objects | EDFs (Ours) | 1.00 | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 |

Unseen Instances, Arbitrary Poses & Distracting Objects: SE(3)-TNs (Zeng et al., 2020) […]

For baselines, we use SE(3)-Transporter Networks (SE(3)-TNs) (Zeng et al., 2020) as the state-of-the-art method for end-to-end visual manipulation. However, for the mug-hanging task, we find that SE(3)-TNs entirely fail due to the multimodality of the demonstrations. Therefore, we instead train SE(3)-TNs using unimodal, low-variance demonstrations only for the mug-hanging task. For a fair comparison, we provide the result for EDFs trained with the same demonstrations in Table 3 of Appendix I. The experimental results for SE(3)-TNs and EDFs are summarized in Table 1. For EDFs, we provide the learning curves for the mug task in Figure 6.

Note that we do not directly compare EDFs with NDFs because (1) NDFs require the target placement poses to be fixed, and (2) NDFs require object segmentation pipelines (the performance of NDFs without object segmentation is provided in Appendix J). Instead, we perform an ablation study by using only type-0 descriptors, as in NDFs. One may consider this the end-to-end trainable and bi-equivariant modification of NDFs (see Section 4.1 and Appendix E). We control the ablated model's number of query points such that the wall-clock inference time is similar to or slightly longer than that of EDFs. Note that the number of query points plays a role similar to the batch size of inputs. Details on the ablated model and EDFs are provided in Appendix H.

Analysis. As can be seen in Table 1, EDFs outperform SE(3)-TNs (Zeng et al., 2020) for all three tasks. EDFs generalize much better than SE(3)-TNs to unseen poses, as expected, since EDFs are SE(3)-equivariant, whereas SE(3)-TNs are only SE(2)-equivariant. We provide a qualitative example in Figure 7. EDFs also generalize much better than SE(3)-TNs to unseen instances and/or unseen distracting objects. For example, SE(3)-TNs often fail to correctly regress the z-axis of the grasp pose of an unseen instance when the object height differs substantially from that of the training objects. In contrast, EDFs rarely fail under such height differences due to the SE(3)-equivariance. For the distracting objects, we presume that EDFs perform better because the point cloud inputs provide better geometric information than the orthographic RGB-D inputs that SE(3)-TNs take.

The ablation study shows that using higher-type descriptors significantly increases generalization performance when the number of query points is highly limited due to computational constraints. As can be seen in Table 2, EDFs outperform the ablated model that only uses type-0 descriptors, as in NDFs. As was discussed in Section 4.1, we presume that this is due to the orientational insensitivity of type-0 descriptors. As type-0 descriptors alone cannot represent orientations, query points are crucial in representing orientations for the ablated model. In contrast, the higher-type descriptors of EDFs can represent the orientation without the help of query points. As a result, EDFs can maintain orientational accuracy in unseen situations in which low-quality query points are expected. We illustrate the failure case of the ablated model in Figure 7.

7. DISCUSSION AND CONCLUSION

There are several limitations to EDFs that should be resolved in future work. First, faster sampling methods are required for real-time manipulation. Cooperative learning (Xie et al., 2018a; 2021b; 2022) and amortized sampling (Wang & Liu, 2016; Xie et al., 2021c) can be applied to accelerate the MCMC sampling. In addition, EDFs are not intended for tasks with significant occlusions of the target object. Future work may also encompass 3D reconstruction methods. Lastly, EDFs cannot solve problems at the trajectory level, a limitation shared with NDFs. Future work should define an adequate equivariance condition for full trajectory-level manipulation tasks.

To summarize, we introduce EDFs and the corresponding energy-based models, which are SE(3)-equivariant end-to-end models for robotic manipulation. We propose novel bi-equivariant energy-based models, which provably allow highly sample-efficient and generalizable learning. We show by experiment that our method is highly sample efficient and generalizable to unseen object instances, unseen object poses, and unseen distracting objects. Lastly, we show by the ablation study that higher-degree equivariance (type-1 or higher) is important for generalizability.

A BI-INVARIANT VOLUME FORM

A bi-invariant volume form $dg$ of an $n$-dimensional Lie group $G$ is a differential $n$-form that satisfies

$$dg = d(hg) = d(gh) \quad \forall h \in G \tag{23}$$

such that for all Borel subsets $\Omega \subseteq G$ and for all well-behaved functions $f(g) : G \to \mathbb{R}$,

$$\int_{g \in \Omega} dg\, f(g) = \int_{hg \in h\Omega} d(hg)\, f\big(h^{-1}(hg)\big) \quad \forall h \in G \quad \text{(Left invariance)}$$

$$\int_{g \in \Omega} dg\, f(g) = \int_{gh \in \Omega h} d(gh)\, f\big((gh)h^{-1}\big) \quad \forall h \in G \quad \text{(Right invariance)}$$

where $h\Omega = \{hg \,|\, g \in \Omega\}$ and $\Omega h = \{gh \,|\, g \in \Omega\}$. Let $z = \phi(g)$ be some coordinatization of $G$ such that, for some function $J(z) : \mathbb{R}^n \to \mathbb{R}$, $dg$ can be explicitly written as

$$dg = J(z)\, d^n z$$

Let the left/right group translations in coordinates be $z^{(l)}(z) = \phi(h\, g(z))$ and $z^{(r)}(z) = \phi(g(z)\, h)$. The bi-invariance condition in Equation (23) can then be expressed in the coordinate form as

$$\begin{aligned} d(hg) &= J(z^{(l)})\, d^n z^{(l)} = J(z^{(l)}) \det\left[\frac{\partial z^{(l)}}{\partial z}\right] d^n z = J(z)\, d^n z = dg \\ d(gh) &= J(z^{(r)})\, d^n z^{(r)} = J(z^{(r)}) \det\left[\frac{\partial z^{(r)}}{\partial z}\right] d^n z = J(z)\, d^n z = dg \end{aligned} \tag{25}$$

thus leading to the following equations:

$$J(z^{(l)}) \det\left[\frac{\partial z^{(l)}}{\partial z}\right] = J(z) = J(z^{(r)}) \det\left[\frac{\partial z^{(r)}}{\partial z}\right]$$

A detailed introduction to invariant volume forms on Lie groups can be found in Chirikjian (2011) and Zee (2016). Readers interested in differential forms may find differential geometry textbooks (Spivak, 2018; Nakahara, 2018) useful.

For example, the translation group $(\mathbb{R}, +)$ admits a bi-invariant volume form $dx$ because

$$\int_{x \in \Omega} dx\, f(x) = \int_{x+\epsilon \in \Omega+\epsilon} d(x+\epsilon)\, \underbrace{\frac{dx}{d(x+\epsilon)}}_{=1}\, f\big((x+\epsilon) - \epsilon\big) = \int_{x+\epsilon \in \Omega+\epsilon} d(x+\epsilon)\, f\big((x+\epsilon) - \epsilon\big) \tag{27}$$

Note that the right invariance in Equation (27) is sufficient to prove the bi-invariance of $dx$ because $(\mathbb{R}, +)$ is commutative, that is, $x + \epsilon = \epsilon + x\ \ \forall x, \epsilon \in \mathbb{R}$. On the other hand, for the multiplicative group $(\mathbb{R}_{\neq 0}, \times)$, $dx$ is not a bi-invariant volume form:

$$\int_{x \in \Omega} dx\, f(x) = \int_{\epsilon x \in \epsilon\Omega} d(\epsilon x)\, \frac{dx}{d(\epsilon x)}\, f\big(\epsilon^{-1}(\epsilon x)\big) = \frac{1}{\epsilon} \int_{\epsilon x \in \epsilon\Omega} d(\epsilon x)\, f\big(\epsilon^{-1}(\epsilon x)\big) \neq \int_{\epsilon x \in \epsilon\Omega} d(\epsilon x)\, f\big(\epsilon^{-1}(\epsilon x)\big) \quad \forall \epsilon \in \mathbb{R}_{\neq 0}$$

However, $d(\log |x|) = dx/x$ is a bi-invariant volume form:

$$\int_{x \in \Omega} \frac{dx}{x}\, f(x) = \int_{\epsilon x \in \epsilon\Omega} \frac{d(\epsilon x)}{x}\, \frac{dx}{d(\epsilon x)}\, f\big(\epsilon^{-1}(\epsilon x)\big) = \int_{\epsilon x \in \epsilon\Omega} \frac{d(\epsilon x)}{\epsilon x}\, f\big(\epsilon^{-1}(\epsilon x)\big)$$

Again, we only show the left invariance because the multiplicative group is commutative.

Note that not every Lie group admits a bi-invariant volume form. Nevertheless, the Lie groups that we are concerned with in this paper, the SO(3) group and the SE(3) group, have bi-invariant volume forms. We reproduce here the bi-invariant volume forms of SO(3) and SE(3) in the coordinate forms provided in Chirikjian (2011). The bi-invariant volume form on SO(3) can be written in Euler angles $\alpha$, $\beta$, and $\gamma$ as

$$dR = \frac{1}{8\pi^2} \sin\beta\; d\alpha\, d\beta\, d\gamma$$

The bi-invariant volume form on SE(3) can be written in rotation-translation coordinates as

$$dT = dR\; d^3 v$$

where $v \in \mathbb{R}^3$ denotes the translation vector with respect to the space frame. Lastly, the bi-invariant volume form or bi-invariant integral measure of SE(3) should not be confused with the bi-invariant metric, which does not exist for SE(3) (Chirikjian, 2015).
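The one-dimensional examples of this appendix can be checked numerically. The sketch below, which assumes an arbitrary test function `f`, an interval $\Omega = [a, b]$ of positive numbers, and a simple midpoint quadrature, compares the integral over $\Omega$ with the integral over $\epsilon\Omega$ for both the measure $dx$ and the measure $dx/x$ on the multiplicative group.

```python
import math

def integrate(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def f(x):
    """Arbitrary smooth test function (an assumption of this sketch)."""
    return math.sin(x) + x * x

a, b, eps = 1.0, 3.0, 2.5  # Omega = [a, b]; group element eps

# Measure dx: integrating f(eps^-1 y) over eps*Omega rescales the result by eps.
plain_lhs = integrate(f, a, b)
plain_rhs = integrate(lambda y: f(y / eps), eps * a, eps * b)

# Measure dx/x: both sides agree, as expected for an invariant measure.
haar_lhs = integrate(lambda x: f(x) / x, a, b)
haar_rhs = integrate(lambda y: f(y / eps) / y, eps * a, eps * b)
```

With the plain measure the two sides differ by the factor $\epsilon$, while with $dx/x$ they coincide up to quadrature error.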

B QUERY MODELS

B.1 EQUIVARIANT QUERY POINTS

We use the Stein variational gradient descent (SVGD) method (Liu & Wang, 2016; Jaini et al., 2021) to equivariantly draw the query points $[q_{i;\theta}(Y)]_{i=1}^{N_q}$ in Equation (13) from the query weight field $w_\theta(x|Y)$. In this case, $w_\theta(x|Y)$ can be interpreted as an unnormalized probability distribution on $\mathbb{R}^3$. The SVGD equation (Liu & Wang, 2016) is given by

$$q_i^{t+1} = q_i^t + \epsilon\, \frac{1}{N_q} \sum_{j=1}^{N_q} \left[ k(q_j^t, q_i^t)\, \nabla_x \log w_\theta(x|Y)\big|_{x=q_j^t} + \nabla_x k(x, q_i^t)\big|_{x=q_j^t} \right] \tag{30}$$

$$k(x, x') = \exp\left(-\frac{1}{h} \|x - x'\|^2\right) \tag{31}$$

Note that SVGD is fully deterministic given the initial points $q_i^{t=0}$. Following Liu & Wang (2016), we use $h_t = \mathrm{med}_t^2 / \log N_q$, where $\mathrm{med}_t$ denotes the median of the distances between all the points in $q_1^t, q_2^t, \cdots, q_{N_q}^t$. We take the final output of SVGD as the query points, that is

$$q_i(Y) = q_i^{t=t_{fin}} \tag{32}$$

for some $t_{fin} \geq 1$. We take $\epsilon$ and $t_{fin}$ as hyperparameters. In our work, we use $\epsilon = 0.005$ and $t_{fin} = 100$. We now show that the query point $q_i(Y)$ in Equation (32) is SE(3)-equivariant.

Proposition 8. The query points $\{q_i(Y)\}_{i=1}^{N_q}$ are SE(3)-equivariant if the initial query points $\{q_i^{t=0}(Y)\}_{i=1}^{N_q}$ are SE(3)-equivariant, that is

$$T q_i^{t=0}(Y) = q_i^{t=0}(T \cdot Y)\ \ \forall i \quad \Rightarrow \quad T q_i(Y) = q_i(T \cdot Y)\ \ \forall i$$

To prove Proposition 8, we first explicitly denote the query points' dependence on $Y$ as $q_i^t = q_i^t(Y)$. We then propose the following lemma.

Lemma 1. If $q_i^t(T \cdot Y) = T q_i^t(Y)\ \ \forall T \in SE(3)$, the following equations hold:

$$k\big(q_j^t(Y), q_i^t(Y)\big)\, R\, \nabla_x \log w_\theta(x|Y)\big|_{x=q_j^t(Y)} = k\big(q_j^t(T \cdot Y), q_i^t(T \cdot Y)\big)\, \nabla_x \log w_\theta(x|T \cdot Y)\big|_{x=q_j^t(T \cdot Y)} \tag{34}$$

$$R\, \nabla_x k\big(x, q_i^t(Y)\big)\big|_{x=q_j^t(Y)} = \nabla_x k\big(x, q_i^t(T \cdot Y)\big)\big|_{x=q_j^t(T \cdot Y)} \tag{35}$$

Since $\|x - x'\|^2 = \|Tx - Tx'\|^2\ \ \forall T \in SE(3)$, it is straightforward to prove that

$$k(x, x') = k(Tx, Tx') \quad \forall T \in SE(3) \tag{36}$$

To prove the equivariance of the gradient terms in Equation (34) and Equation (35), we prove the following lemma.

Lemma 2. If $f(x|Y) = f(Tx \,|\, T \cdot Y)$ for some function $f : \mathbb{R}^3 \times \mathcal{P} \to \mathbb{R}$, the following holds:

$$\nabla_x f(x|T \cdot Y)\big|_{x=Tx_0} = R\, \nabla_x f(x|Y)\big|_{x=x_0}$$

Proof.

$$\begin{aligned} \nabla_x f(x|T \cdot Y)\big|_{x=Tx_0} &= \nabla_x f(T^{-1}x|Y)\big|_{x=Tx_0} && (\because f(x|Y) = f(Tx|T \cdot Y)) \\ &= \nabla_x f(x'|Y)\big|_{x=Tx_0} && (\text{change of variables } x' = T^{-1}x) \\ &= R\, \nabla_{x'} f(x'|Y)\big|_{x'=x_0} \\ &= R\, \nabla_x f(x|Y)\big|_{x=x_0} && (x' \to x) \end{aligned}$$

where we used the chain rule

$$\nabla_x = \left[\frac{\partial x'}{\partial x}\right]^T \nabla_{x'} = \left[\frac{\partial (R^{-1}x - R^{-1}v)}{\partial x}\right]^T \nabla_{x'} = (R^{-1})^T\, \nabla_{x'} = R\, \nabla_{x'}$$

in which the constant term $R^{-1}v$ does not contribute to the derivative.

One can prove Lemma 1 using Lemma 2 and Equation (36). We now prove Proposition 8 using Lemma 1. Let the normalized sum on the right-hand side of Equation (30) be $s(Y)$ such that

$$s(Y) = \frac{1}{N_q} \sum_{j=1}^{N_q} \left[ k\big(q_j^t(Y), q_i^t(Y)\big)\, \nabla_x \log w_\theta(x|Y)\big|_{x=q_j^t(Y)} + \nabla_x k\big(x, q_i^t(Y)\big)\big|_{x=q_j^t(Y)} \right]$$

We first consider the case with $t = 0$. One can prove that $s(T \cdot Y) = R\, s(Y)$ using Lemma 1 and the equivariance of the initial points $T q_i^{t=0}(Y) = q_i^{t=0}(T \cdot Y)$, which was assumed in Proposition 8. It is then straightforward to prove that $q_i^{t=1}(Y)$ is also equivariant.

Proof.

$$\begin{aligned} q_i^{t=1}(T \cdot Y) &= q_i^{t=0}(T \cdot Y) + \epsilon\, s(T \cdot Y) = T q_i^{t=0}(Y) + \epsilon R\, s(Y) \\ &= R q_i^{t=0}(Y) + v + \epsilon R\, s(Y) = R\big(q_i^{t=0}(Y) + \epsilon\, s(Y)\big) + v = T q_i^{t=1}(Y) \end{aligned} \tag{39}$$

We now recursively apply this relation for $t = 1, 2, 3, \cdots, t_{fin}$ to conclude that the final query point is also equivariant, that is, $T q_i^{t=t_{fin}}(Y) = q_i^{t=t_{fin}}(T \cdot Y)$. Therefore, the only requirement for our query points to be equivariant is the equivariance of the initial points: $T q_i^{t=0}(Y) = q_i^{t=0}(T \cdot Y)$. We provide a simple (and clearly not the best) deterministic method that we used to sample the initial points $q_i^{t=0}(Y)$ in Algorithm 1.
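The equivariance of the SVGD query points can also be observed numerically. The following is a minimal sketch and not our implementation: the weight field $w(x|Y) = \sum_j \exp(-\|x - y_j\|^2)$ is a hypothetical stand-in for the learned $w_\theta$, the initial query points are simply the first four points of $Y$ (which is equivariant by construction), and point colors are ignored. The step size and iteration count follow the values $\epsilon = 0.005$ and $t_{fin} = 100$ given above.

```python
import math
import statistics

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def act(T, x):
    R, v = T
    Rx = matvec(R, x)
    return [Rx[i] + v[i] for i in range(3)]

def grad_log_w(x, Y):
    """Gradient of log w(x|Y) for the toy field w(x|Y) = sum_j exp(-|x - y_j|^2)."""
    num, den = [0.0, 0.0, 0.0], 0.0
    for y in Y:
        d = [x[i] - y[i] for i in range(3)]
        e = math.exp(-sum(c * c for c in d))
        den += e
        num = [num[i] - 2.0 * d[i] * e for i in range(3)]
    return [c / den for c in num]

def svgd(Y, q0, eps=0.005, steps=100):
    """Deterministic SVGD iterations with the median-heuristic bandwidth."""
    q = [list(p) for p in q0]
    n = len(q)
    for _ in range(steps):
        dists = [math.dist(q[a], q[b]) for a in range(n) for b in range(a + 1, n)]
        h = statistics.median(dists) ** 2 / math.log(n)
        new_q = []
        for i in range(n):
            s = [0.0, 0.0, 0.0]
            for j in range(n):
                diff = [q[j][c] - q[i][c] for c in range(3)]
                k = math.exp(-sum(c * c for c in diff) / h)
                g = grad_log_w(q[j], Y)
                # grad_x k(x, q_i) at x = q_j equals -(2/h) (q_j - q_i) k
                s = [s[c] + k * g[c] - (2.0 / h) * diff[c] * k for c in range(3)]
            new_q.append([q[i][c] + eps * s[c] / n for c in range(3)])
        q = new_q
    return q

Y = [[0.0, 0.0, 0.0], [0.6, 0.1, -0.2], [-0.4, 0.7, 0.3],
     [0.2, -0.5, 0.8], [1.0, 0.9, 0.1], [-0.8, -0.3, -0.6]]
a = 0.9
T = ([[math.cos(a), -math.sin(a), 0.0], [math.sin(a), math.cos(a), 0.0],
      [0.0, 0.0, 1.0]], [0.4, -1.2, 2.0])

TY = [act(T, y) for y in Y]
q_from_Y = svgd(Y, Y[:4])        # q_i(Y)
q_from_TY = svgd(TY, TY[:4])     # q_i(T.Y)
max_err = max(abs(u - v_)
              for p, r in zip(q_from_TY, (act(T, q) for q in q_from_Y))
              for u, v_ in zip(p, r))
```

The query points computed from the transformed cloud coincide with the transformed query points up to floating-point error, in agreement with Proposition 8.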

Algorithm 1 Simple algorithm for deterministic and equivariant initial point sampling

Input: Y = {(y_1, c_1), ···, (y_M, c_M)}, w(x|Y), N_max, r_cluster
Output: q_1, q_2, ···
i ← 1
Q ← {y_1, ···, y_M}    ▷ Initialize set Q with the set of all the points in Y
while i ≤ N_max do
    if Q is not empty then
        q_i ← arg max_{y∈Q} w(y|Y)    ▷ Take the point with largest weight in Q
        Q ← Q − {z ∈ Q | ∥z − q_i∥_2 ≤ r_cluster}    ▷ Remove the neighbors from Q
    end if
    i ← i + 1
end while
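Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative version only: it assumes the weights $w(y|Y)$ have already been evaluated at every point of $Y$ and are passed in as an array, and the function name is hypothetical.

```python
import numpy as np

def sample_initial_points(points, weights, n_max, r_cluster):
    """Greedy, deterministic initial point selection in the spirit of Algorithm 1.

    points:  (M, 3) array of point positions y_m
    weights: (M,) array of precomputed weights w(y_m | Y) (assumed input)
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    remaining = np.ones(len(points), dtype=bool)      # membership mask of the set Q
    queries = []
    for _ in range(n_max):
        if not remaining.any():                       # Q is empty
            break
        idx = np.flatnonzero(remaining)
        best = idx[np.argmax(weights[idx])]           # point with the largest weight in Q
        queries.append(points[best])
        dist = np.linalg.norm(points - points[best], axis=1)
        remaining &= dist > r_cluster                 # remove the neighbors (and q_i itself)
    return np.array(queries)
```

Because the selection depends only on the invariant weights and on pairwise distances, the selected points transform together with the point cloud, provided there are no ties in the weights.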

B.2 SURROGATE QUERY MODEL

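Let $A(r)$ be the set of all the indices of the query points whose shortest distance to the point cloud $X$ is farther than some radius $r$, such that $A(r) = \{i \in \{1, 2, \cdots, N_q\} \,|\, d_{min}(Tq_i, X) \geq r\}$. The Kullback-Leibler divergence term in Equation (22) can be calculated as follows:

$$\begin{aligned}
D_{KL}\left(H(w, Q|X, Y, T)\,\|\,P(w, Q|Y)\right)
&= \left[\prod_{i=1}^{N_Q}\int_{\mathbb{R}^+} dw_i \int_{\mathbb{R}^3} dq_i\, H_i(w_i, q_i|X, Y, T)\right] \sum_{j=1}^{N_Q} \log\frac{H_j(w_j, q_j|X, Y, T)}{P_j(w_j, q_j|Y)} \\
&= \sum_{i \in A(r)} \int_{\mathbb{R}^+} dw_i \int_{\mathbb{R}^3} dq_i\, H_i(w_i, q_i|X, Y, T) \log\frac{H_i(w_i, q_i|X, Y, T)}{P_i(w_i, q_i|Y)} \\
&= \sum_{i \in A(r)} \int_{\mathbb{R}^+} dw_i \frac{dl_i}{dw_i} \int_{\mathbb{R}^3} dq_i\, \delta^{(3)}(q_i - q_i(Y))\, \mathcal{N}(l_i; \alpha, \sigma_H) \log\frac{\mathcal{N}(l_i; \alpha, \sigma_H)\,\delta^{(3)}(q_i - q_i(Y))}{\mathcal{N}(l_i; \log w(q_i|Y), \sigma_H)\,\delta^{(3)}(q_i - q_i(Y))} \\
&= \sum_{i \in A(r)} \int_{\mathbb{R}} dl_i\, \mathcal{N}(l_i; \alpha, \sigma_H) \log\frac{\mathcal{N}(l_i; \alpha, \sigma_H)}{\mathcal{N}(l_i; \log w(q_i(Y)|Y), \sigma_H)} \\
&= \sum_{i \in A(r)} \int_{\mathbb{R}} dl_i\, \mathcal{N}(l_i; \alpha, \sigma_H) \left(-\frac{1}{2\sigma_H^2}\right)\left[(l_i - \alpha)^2 - (l_i - \log w(q_i(Y)|Y))^2\right] \\
&= \sum_{i \in A(r)} \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\left[-\frac{1}{2\sigma_H^2}\left\{(\epsilon\sigma_H)^2 - (\epsilon\sigma_H + \alpha - \log w(q_i(Y)|Y))^2\right\}\right] \\
&= \sum_{i \in A(r)} \frac{1}{2}\left(\frac{\log w(q_i(Y)|Y) - \alpha}{\sigma_H}\right)^2
\end{aligned} \tag{40}$$

To restrict the sum to $i \in A(r)$, we used

$$\log\frac{H_j(w_j, q_j|X, Y, T)}{P_j(w_j, q_j|Y)} = \log\frac{P_j(w_j, q_j|Y)}{P_j(w_j, q_j|Y)} = 0 \quad \forall j \notin A(r)$$

In the third line of Equation (40), the delta functions in the logarithm cancel and $dw_i\,(dl_i/dw_i) = dl_i$. In the last two lines, we used the reparameterization trick for Gaussian distributions (Kingma & Welling, 2013).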
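The per-point closed form in the last line of Equation (40) is the familiar KL divergence between two Gaussians with equal standard deviation. The following short NumPy sketch compares it against a direct numerical integration of the second-to-last Gaussian integral. The function names and the integration grid are illustrative assumptions.

```python
import numpy as np

def kl_closed_form(log_w, alpha, sigma):
    """Per-query-point KL term of Equation (40): 0.5 * ((log w - alpha) / sigma)^2."""
    return 0.5 * ((log_w - alpha) / sigma) ** 2

def kl_numeric(log_w, alpha, sigma, n=200001, width=12.0):
    """Numerically integrate N(l; alpha, sigma) * log[N(l; alpha, sigma) / N(l; log_w, sigma)] over l."""
    l = np.linspace(alpha - width * sigma, alpha + width * sigma, n)

    def log_normal(x, mu):
        return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

    density = np.exp(log_normal(l, alpha))
    integrand = density * (log_normal(l, alpha) - log_normal(l, log_w))
    return np.sum(integrand) * (l[1] - l[0])          # Riemann sum on a uniform grid
```

For example, with $\log w = -1.3$, $\alpha = 0.4$ and $\sigma_H = 0.5$, both functions should return approximately $5.78$.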

B.3 QUERY ATTENTION

Due to computational limitations, it is desirable to use as few query points as possible at inference time. Therefore, instead of directly taking $w_i = w_\theta(q_{i;\theta}(Y))$, we normalize the query weights by taking

$$w_i = \frac{w_\theta(q_i(Y))}{\sum_{j=1}^{N_q} w_\theta(q_j(Y))} \tag{41}$$

such that the query points compete with each other during training. As a result of this competition, only a few query points end up with non-negligible weights. Therefore, at inference time, we can evaluate only the few query points with non-negligible weights instead of all the query points, which saves computation. Note that the normalized query weight is still equivariant because only scalar addition and division are used in Equation (41).
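As a small illustration of the normalization in Equation (41) and of keeping only the highest-weight query points at inference time, consider the following NumPy sketch. The function names and the choice of `k` are hypothetical and only serve the example.

```python
import numpy as np

def normalize_query_weights(raw_weights):
    """Normalize non-negative query weights so that they sum to one (Equation (41))."""
    raw = np.asarray(raw_weights, dtype=float)
    return raw / raw.sum()

def top_k_queries(points, weights, k):
    """Keep the k query points with the largest normalized weights."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(weights)[::-1][:k]             # indices sorted by decreasing weight
    return points[order], weights[order]
```

Since the weights are scalars that do not change under a rigid transformation of the input, both the normalization and the ranking are unaffected by such a transformation.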

C INTUITION BEHIND THE BI-EQUIVARIANCE CONDITION

To illustrate the bi-equivariance condition in Equation (6), consider an object placing task where $T_{go} \in SE(3)$ is the pose of the object (o) in the gripper frame (g) and $T_{sd} \in SE(3)$ is the desired pose (d) at which the object is to be placed in the scene frame (s). Consequently, the gripper pose in the scene frame $T_{sg}$ should satisfy the following equation

$$T_{sd} = T_{sg}T_{go} \tag{42}$$

Now, suppose that the desired pose of the object to be placed is transformed as $T_{sd} \to T'_{sd} = ST_{sd}$ for some transformation $S \in SE(3)$. In order to keep the relation in Equation (42) invariant, such that the equation $T'_{sd} = T'_{sg}T_{go}$ holds for the new gripper pose $T'_{sg}$, it should be that $T'_{sg} = ST_{sg}$. Similarly, if the pose of the grasped object is transformed as $T_{go} \to T''_{go} = ST_{go}$, the gripper pose should also be transformed as $T_{sg} \to T''_{sg} = T_{sg}S^{-1}$ to keep the relation in Equation (42) invariant. Since $T_{sd}$ and $T_{go}$ are implicitly encoded in $X$ and $Y$ respectively, one can naively replace $T_{sd}$ and $T_{go}$ with $X$ and $Y$ to get the equation

$$P(T_{sg}|X, Y) = P(T'_{sg} = ST_{sg}\,|\,SX, Y) = P(T''_{sg} = T_{sg}S\,|\,X, S^{-1}Y)$$

This is the intuition behind Equation (6). We illustrate the bi-equivariance condition in Figure 8.

D SAMPLING DETAILS

For the sampling, we use the Metropolis-Hastings (MH) algorithm (Hastings, 1970; Metropolis et al., 1953) and the Langevin algorithm on the SE(3) manifold (Brockett, 1997; Chirikjian, 2011; Davidchack et al., 2017). Unlike MH, the Langevin algorithm does not suffer from high rejection ratios and converges in far fewer iterations. However, the Langevin algorithm requires the gradient of the energy function and is thus computationally expensive per step. In addition, the time step of the Langevin algorithm cannot be made arbitrarily large if the precision of the dynamics is to be maintained. Therefore, we first run MH for rapid exploration and then run the Langevin algorithm using the MH samples as initial seeds. Note that the differential geometric aspects of the SE(3) manifold must be considered in implementing these methods.

In the following sections, we provide details of the sampling methods that we used. We first explain the proposal distributions that we used to run the MH algorithm on the SE(3) manifold. We then introduce the Langevin algorithm on SE(3). We calculate the Langevin dynamics in the quaternion-translation parameterization, following Davidchack et al. (2017), to avoid singularities while benefiting from commonly used autograd packages.

D.1 PROPOSAL DISTRIBUTION FOR MH

The Metropolis-Hastings (MH) algorithm (Hastings, 1970) is a propose-and-reject algorithm used for sampling from some probability distribution $dP(T)$. First, a proposal point $T_p$ is sampled from the proposal distribution $dQ(T_p|T_t)$. The proposed point $T_p$ is then stochastically accepted or rejected according to the acceptance ratio

$$A = \min\left[1, \frac{dP(T_p)\,dQ(T_t|T_p)}{dP(T_t)\,dQ(T_p|T_t)}\right]$$

If the proposal is accepted, the next point is the proposed point, that is $T_{t+1} = T_p$. If rejected, the point remains the same, that is $T_{t+1} = T_t$. It is known that the steady-state distribution $dP_\infty(T_\infty)$ converges to $dP(T)$.
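The accept/reject rule above can be written generically, independent of the state space. The following is a minimal Python sketch that works with log-densities for numerical stability. The callables `log_p`, `propose` and `log_q` are placeholders that a user would supply (for SE(3), densities with respect to the bi-invariant volume form); they are assumptions of this example and not part of any library.

```python
import math
import random

def acceptance_ratio(log_p_new, log_p_old, log_q_reverse, log_q_forward):
    """A = min(1, [P(T_p) Q(T_t|T_p)] / [P(T_t) Q(T_p|T_t)]) computed from log-densities."""
    log_ratio = (log_p_new + log_q_reverse) - (log_p_old + log_q_forward)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def mh_step(state, log_p, propose, log_q, rng):
    """One Metropolis-Hastings transition.

    log_p(x):        unnormalized log target density
    propose(x, rng): draws a proposal given the current state
    log_q(a, b):     log proposal density of a given b
    """
    proposal = propose(state, rng)
    a = acceptance_ratio(log_p(proposal), log_p(state),
                         log_q(state, proposal), log_q(proposal, state))
    return proposal if rng.random() < a else state
```

For a symmetric proposal the two `log_q` terms cancel, and the rule reduces to the original Metropolis criterion.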

We decompose the proposal distribution $dQ(T|T_t) = Q(T|T_t)dT$ into 1) the orientation proposal distribution $Q_R(R|R_t)dR$ and 2) the position proposal distribution $Q_v(v|v_t)d^3v$ such that $Q(T|T_t)dT = Q_R(R|R_t)dR \times Q_v(v|v_t)d^3v$, where $d^3v$ is the Euclidean volume element and $dR$ is the bi-invariant volume form of SO(3) (see Appendix A). We use a Gaussian distribution for the position proposal, that is $Q_v(v_p|v_t) = \mathcal{N}(v_p; v_t, \sigma I)$. For the orientation proposal $Q_R(R_p|R_t)$, we use $\mathcal{IG}_{SO(3)}$, which is the normal distribution on SO(3) (Nikolayev & Savyolov, 1970; Savyolova, 1994; Leach et al., 2022). Concrete calculation and sampling methods for $\mathcal{IG}_{SO(3)}$ are provided in Appendix D.2.

D.2 NORMAL DISTRIBUTION ON SO(3)

We follow the method in Leach et al. (2022) to calculate and sample from $\mathcal{IG}_{SO(3)}$, the normal distribution on SO(3) (Nikolayev & Savyolov, 1970; Savyolova, 1994). In the axis-angle parameterization, our orientation proposal distribution $Q_R(R_p|R_t)dR$, which is $\mathcal{IG}_{SO(3)}$, can be written as

$$Q_R(R_p|R_t)dR = \frac{1 - \cos\omega}{\pi}\, f_\epsilon(\omega)\, d\omega\, d\Omega$$

where $\omega \in [0, \pi)$ is the rotation angle, $d\Omega = \frac{\sin\theta}{4\pi}\, d\theta\, d\phi$ is the normalized uniform volume element over the sphere $S^2$ on which the rotation axis lies, and $f_\epsilon(\omega)$ is as follows:

$$f_\epsilon(\omega) = \sum_{l=0}^{\infty} (2l + 1)\, e^{-\epsilon l(l+1)}\, \frac{\sin\left((2l + 1)\omega/2\right)}{\sin(\omega/2)} \tag{43}$$

We approximate the infinite sum in Equation (43) by summing up to a sufficiently high $l$. This approximation is justified because the summand in Equation (43) decays exponentially in $l^2$. The rotation axis can be easily sampled by first sampling from a three-dimensional isotropic Gaussian and then normalizing the sample. To sample $\omega$, one may use numerical inverse transform sampling. As noted by Leach et al. (2022), $f_\epsilon(\omega)$ should be multiplied by the volume element $(1 - \cos\omega)/\pi$ for the inverse transform sampling.
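The sampling procedure described above can be sketched as follows: a truncated evaluation of Equation (43), inverse transform sampling of the angle on a grid, and Gaussian normalization for the axis. The truncation level `l_max`, the grid size and the function names are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def f_eps(omega, eps, l_max=200):
    """Truncated series of Equation (43), evaluated for an array of angles omega > 0."""
    omega = np.asarray(omega, dtype=float)
    l = np.arange(l_max + 1)[:, None]
    terms = (2 * l + 1) * np.exp(-eps * l * (l + 1)) * np.sin((l + 0.5) * omega) / np.sin(omega / 2)
    return terms.sum(axis=0)

def angle_pdf(omega, eps):
    """Density of the rotation angle: volume element (1 - cos w) / pi times f_eps(w)."""
    return (1.0 - np.cos(omega)) / np.pi * f_eps(omega, eps)

def sample_angles(eps, n, rng, grid=4096):
    """Numerical inverse transform sampling of the rotation angle."""
    omega = np.linspace(1e-4, np.pi, grid)            # start slightly above 0 to avoid 0/0
    cdf = np.cumsum(angle_pdf(omega, eps))
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=n), cdf, omega)

def sample_axes(n, rng):
    """Uniform rotation axes on S^2: normalize isotropic Gaussian samples."""
    g = rng.normal(size=(n, 3))
    return g / np.linalg.norm(g, axis=1, keepdims=True)
```

A quick sanity check is that `angle_pdf` integrates to approximately one over $[0, \pi]$, and that a smaller $\epsilon$ concentrates the samples at smaller rotation angles.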

D.3 LANGEVIN MCMC ON SE(3)

Let $V_i$ be the $i$-th basis of the Lie algebra of a unimodular Lie group $G$. Consider the following stochastic process $g(t) \in G$ generated by a Lie algebra element $\delta X(t) = \sum_i \delta X_i(t)V_i \in T_eG$ such that $g(t) = g(0)\exp[\delta X(0)]\exp[\delta X(dt)]\cdots\exp[\delta X(t - dt)]$. The Langevin dynamics for $G$ is then

$$\delta X_i(t) = -\mathcal{L}_{V_i}[E(g)]\,dt + \sqrt{2}\,dw_i \tag{44}$$

where $dw_i \sim \mathcal{N}(0; \sqrt{dt})$ is the standard Wiener process and $\mathcal{L}_V f = \frac{d}{ds} f(g\exp[sV])\big|_{s=0}$ is the (left) Lie derivative of a function $f$ on $G$ along $V$. It is known that this process converges to $dP_\infty(g) \propto \exp[-E(g)]\,dg$ when $t \to \infty$, where $dg$ is the (left) invariant volume form of $G$ (Brockett, 1997; Chirikjian, 2011).

Davidchack et al. (2017) provide concrete ways to calculate the Lie derivative and the Langevin dynamics on SE(3) in the quaternion-translation parameterization. This parameterization is convenient because it has no singularities. Therefore, the gradients from commonly used autograd packages can be easily used to calculate the dynamics. For SE(3), $V_i$ is the Lie algebra basis of SO(3) for $i = 1, 2, 3$ and the translation generator for $i = 4, 5, 6$. Let the quaternion-translation parameterization be $z = (q, v) \in S^3 \times \mathbb{R}^3 \subset \mathbb{R}^7$. Let $L \in \mathbb{R}^{7 \times 6}$ be the Lie derivative matrix whose $(\mu, i)$-th element is $[L]^\mu_{\ i} = \mathcal{L}_{V_i} z^\mu$. The matrix can be calculated as

$$L = \begin{bmatrix} L_{SO(3)} & 0_{4 \times 3} \\ 0_{3 \times 3} & I_{3 \times 3} \end{bmatrix}, \qquad L_{SO(3)} = \frac{1}{2}\begin{bmatrix} -q_2 & -q_3 & -q_4 \\ q_1 & -q_4 & q_3 \\ q_4 & q_1 & -q_2 \\ -q_3 & q_2 & q_1 \end{bmatrix} \tag{45}$$

where $q = q_1 + q_2 i + q_3 j + q_4 k$. Derivations of Equation (45) can be found in Davidchack et al. (2017). Since the chain rule holds for the Lie derivatives (Chirikjian, 2011), Equation (44) can be written in the parameterized form as

$$dz = \begin{bmatrix} dq \\ dv \end{bmatrix} = -G^{-1}\nabla_z E(z)\,dt + \sqrt{2}\,L\,dw \tag{46}$$

where $G^{-1} = LL^T$. We calculate the gradient of the energy $\nabla_z E(z)$ using typical autograd packages. Note that $dq$ in Equation (46) satisfies the unit-quaternion constraint $q \cdot dq = 0$ such that $q + dq \in S^3$ (Davidchack et al., 2017). In practice, however, we reproject $q + dq$ onto $S^3$ by normalization because of the inaccuracy of numerical integration.
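The discretized update of Equation (46), together with the reprojection onto $S^3$, can be sketched in NumPy as follows. This is an illustrative Euler-type step only; in this sketch the gradient $\nabla_z E$ is supplied as a user-provided callable `grad_E` (an assumption of the example) instead of being obtained from an autograd package.

```python
import numpy as np

def lie_derivative_matrix(q):
    """7x6 matrix L of Equation (45) for a unit quaternion q = (q1, q2, q3, q4)."""
    q1, q2, q3, q4 = q
    L_so3 = 0.5 * np.array([[-q2, -q3, -q4],
                            [ q1, -q4,  q3],
                            [ q4,  q1, -q2],
                            [-q3,  q2,  q1]])
    L = np.zeros((7, 6))
    L[:4, :3] = L_so3
    L[4:, 3:] = np.eye(3)
    return L

def langevin_step(z, grad_E, dt, rng):
    """One step of Equation (46) on z = (q, v), followed by renormalizing the quaternion."""
    L = lie_derivative_matrix(z[:4])
    dw = rng.normal(scale=np.sqrt(dt), size=6)        # Wiener increments with std sqrt(dt)
    dz = -(L @ L.T) @ grad_E(z) * dt + np.sqrt(2.0) * (L @ dw)
    z_new = z + dz
    z_new[:4] /= np.linalg.norm(z_new[:4])            # reproject q + dq onto S^3
    return z_new
```

Each column of $L_{SO(3)}$ is orthogonal to $q$, which is why $q \cdot dq = 0$ holds to first order and why the renormalization only corrects higher-order integration error.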

E PICK-MODEL AND THE RELATIONSHIP TO NDFS

Pick-model

For the place model, the point cloud of the end-effector $Y$ is always different because of the grasped object. On the other hand, $Y$ is always the same for the pick-model because no object has been grasped yet. Therefore, we remove the $Y$-dependence of the query EDF and the query density by taking $\psi_\theta(x|Y)$ to $\psi_\theta(x)$ and $\rho_\theta(x|Y)$ to $\rho_\theta(x)$. In this case, the energy function $E_\theta(T|X, Y)$ in Equation (14) becomes

$$E_\theta(T|X, Y) = E_\theta(T|X) = \sum_{i=1}^{N_q} w_i\, \|\varphi_\theta(Tq_i|X) - D(R)\psi_i\|^2 \tag{47}$$

where $w_i$, $q_i$ and $\psi_i$ are no longer the outputs of some functions but simply parameters that are either predefined or learned.

Relationship to NDFs

We now illustrate the relation of Equation (47) to the energy function of NDFs (Simeonov et al., 2021). Let the energy function in Simeonov et al. (2021) be

$$E_{NDF}(T|X) = \sum_{i=1}^{N_q} \|\varphi(Tq_i|X) - \psi_i\|_1, \qquad \psi_i = \frac{1}{N_{demo}}\sum_{n=1}^{N_{demo}} \varphi(T_n q_i|X_n) \tag{48}$$

where $T_n$ and $X_n$ are the grasp pose and the point cloud input of the $n$-th demonstration. Equation (48) can be understood as a special case of Equation (47) with (i) the $L_1$ error instead of the square error, (ii) all the query weights being constant, $w_i = 1$, and (iii) all the components of the feature EDF $\varphi(x|X)$ being invariant scalars (type-0 vectors) such that $D(R) = I$.

Relationship to other variants of NDFs

Other recent works derived from NDFs (Simeonov et al., 2022; Chun et al., 2023) are also closely related to EDFs. Relational Neural Descriptor Fields (R-NDFs) (Simeonov et al., 2022) extend NDFs' fixed placement target tasks to object rearrangement tasks in which the placement target is also an object with varying poses. Instead of using fixed target descriptors as in NDFs, R-NDFs utilize another descriptor field to represent the target placement object. The resulting energy function of R-NDFs is very similar to the bi-equivariant energy function of EDFs. This is a natural consequence of the bi-equivariant nature of object rearrangement tasks. On the other hand, Local Neural Descriptor Fields (L-NDFs) (Chun et al., 2023) focus on imposing locality on the descriptor fields. While EDFs and L-NDFs both try to exploit locality, their motivations are largely different. Chun et al. (2023) focus on locality to improve generalization and transferability to novel objects. In contrast, the major motivation for imposing locality on EDFs is to remove the necessity of object segmentation. Note that these studies are not mutually exclusive but complement each other. EDFs can be pre-trained with a method similar to that of NDFs and R-NDFs. The strict SE(3)-equivariance of EDFs can be relaxed by imposing the SE(3)-equivariance on the loss function, as in L-NDFs, instead of imposing it on the model itself. Conversely, these methods may benefit from the end-to-end trainability and generative nature of EDFs' energy-based model. The orientational sensitivity of higher-type (equivariant) descriptors should also be beneficial to these methods.

Irrepwise L1 Norm

For a closer analogy with the energy function of NDFs, we propose using the irrepwise $L_1$ norm. Let an equivariant vector $f$ be given by $f = \bigoplus_{n=1}^{N} f^{(n)}$, where $f^{(n)}$ is a type-$l_n$ vector. We then define the irrepwise $L_1$ norm as

$$\|f\|_{I_1} = \sum_{n=1}^{N} \|f^{(n)}\|_2$$

If we use the irrepwise $L_1$ norm in Equation (47), set $w_i = 1$, and confine all the vectors in EDFs to be of type-0, Equation (47) and Equation (48) become identical. Although we did not use the irrepwise $L_1$ norm in our work, we expect that this modification would be more robust to outliers than the square error term.
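As a concrete illustration of the irrepwise $L_1$ norm, the following NumPy sketch sums the Euclidean norms of the irreducible blocks of a concatenated feature vector. The storage layout (blocks concatenated in order, each of dimension $2l_n + 1$) and the function name are assumptions made for this example.

```python
import numpy as np

def irrepwise_l1_norm(f, degrees):
    """Sum of the L2 norms of the irreducible blocks of f.

    f:       1D array holding the concatenated blocks f^(1), ..., f^(N)
    degrees: list of the types l_n; block n has dimension 2 * l_n + 1
    """
    f = np.asarray(f, dtype=float)
    if sum(2 * l + 1 for l in degrees) != len(f):
        raise ValueError("block dimensions do not match the length of f")
    total, start = 0.0, 0
    for l in degrees:
        dim = 2 * l + 1
        total += np.linalg.norm(f[start:start + dim])  # rotation-invariant norm of one block
        start += dim
    return total
```

When every block is of type-0, each block norm is an absolute value, so the result coincides with the ordinary $L_1$ norm. Because each block is transformed by an orthogonal matrix, the norm is invariant under rotations.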

F PROOFS

F.1 PROOF OF PROPOSITION 1

We first prove the left equivariance.

Proof.

$$\begin{aligned}
\int_{T \in S\Omega} dP(T|SX, Y) &= \int_{T \in S\Omega} dT\, P(T|SX, Y) \\
&= \int_{T \in S\Omega} dT\, P(S^{-1}T|X, Y) && (\because P(ST|SX, Y) = P(T|X, Y)) \\
&= \int_{S^{-1}T \in \Omega} d(S^{-1}T)\, P(S^{-1}T|X, Y) && (\because \text{bi-invariance of } dT) \\
&= \int_{T \in \Omega} dT\, P(T|X, Y) && (S^{-1}T \to T) \\
&= \int_{T \in \Omega} dP(T|X, Y)
\end{aligned}$$

The right equivariance can be similarly proved.

Proof.

$$\begin{aligned}
\int_{T \in \Omega S} dP(T|X, S^{-1}Y) &= \int_{T \in \Omega S} dT\, P(T|X, S^{-1}Y) \\
&= \int_{T \in \Omega S} dT\, P(TS^{-1}|X, Y) && (\because P(TS|X, S^{-1}Y) = P(T|X, Y)) \\
&= \int_{TS^{-1} \in \Omega} d(TS^{-1})\, P(TS^{-1}|X, Y) && (\because \text{bi-invariance of } dT) \\
&= \int_{T \in \Omega} dT\, P(T|X, Y) && (TS^{-1} \to T) \\
&= \int_{T \in \Omega} dP(T|X, Y)
\end{aligned}$$

F.2 PROOF OF PROPOSITION 2

Let the partition function (the denominator) of Equation (9) be $Z(X, Y)$.

Lemma 3. For a bi-equivariant energy function $E(T|X, Y)$, the following equation holds.

$$Z(X, Y) = Z(SX, Y) = Z(X, S^{-1}Y) \tag{50}$$

Proof.

$$\begin{aligned}
Z(SX, Y) &= \int_{SE(3)} dT \exp[-E(T|SX, Y)] \\
&= \int_{SE(3)} dT \exp[-E(S^{-1}T|X, Y)] && (\because E(ST|SX, Y) = E(T|X, Y)) \\
&= \int_{SE(3)} d(S^{-1}T) \exp[-E(S^{-1}T|X, Y)] && (\because \text{bi-invariance of } dT) \\
&= \int_{SE(3)} dT \exp[-E(T|X, Y)] = Z(X, Y) && (S^{-1}T \to T)
\end{aligned}$$

$$\begin{aligned}
Z(X, S^{-1}Y) &= \int_{SE(3)} dT \exp[-E(T|X, S^{-1}Y)] \\
&= \int_{SE(3)} dT \exp[-E(TS^{-1}|X, Y)] && (\because E(TS|X, S^{-1}Y) = E(T|X, Y)) \\
&= \int_{SE(3)} d(TS^{-1}) \exp[-E(TS^{-1}|X, Y)] && (\because \text{bi-invariance of } dT) \\
&= \int_{SE(3)} dT \exp[-E(T|X, Y)] = Z(X, Y) && (TS^{-1} \to T)
\end{aligned}$$

Now we prove the bi-equivariance of Equation (9) using Lemma 3.

Proof.

$$\begin{aligned}
P(ST|SX, Y) &= \exp[-E(ST|SX, Y)]/Z(SX, Y) = \exp[-E(T|X, Y)]/Z(X, Y) = P(T|X, Y) \\
&= \exp[-E(TS|X, S^{-1}Y)]/Z(X, S^{-1}Y) = P(TS|X, S^{-1}Y)
\end{aligned}$$

F.3 PROOF OF PROPOSITION 3

We first show that the energy function in Equation (10) satisfies $E(ST|SX, Y) = E(T|X, Y)$, where $T = (R, v) \in SE(3)$ and $S = (R_S, v_S) \in SE(3)$.

Proof.

$$\begin{aligned}
E(ST|SX, Y) &= \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \|\varphi(STx|SX) - D(R_SR)\psi(x|Y)\|^2 \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \|D(R_S)\varphi(Tx|X) - D(R_SR)\psi(x|Y)\|^2 && (\because \text{Equation (8)}) \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \|D(R_S)\varphi(Tx|X) - D(R_S)D(R)\psi(x|Y)\|^2 && (\because \text{Equation (1)}) \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \|\varphi(Tx|X) - D(R)\psi(x|Y)\|^2 = E(T|X, Y)
\end{aligned}$$

where the orthogonality of the representation $D(R)$ is used in the last line. Note that the inner product of two vectors is invariant to orthogonal transformations. We now prove that $E(TS|X, S^{-1}Y) = E(T|X, Y)$.

Proof.

$$\begin{aligned}
E(TS|X, S^{-1}Y) &= \int_{\mathbb{R}^3} d^3x\, \rho(x|S^{-1}Y)\, \|\varphi(TSx|X) - D(RR_S)\psi(x|S^{-1}Y)\|^2 \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(x|S^{-1}Y)\, \|\varphi(TSx|X) - D(R)D(R_S)\psi(x|S^{-1}Y)\|^2 && (\because \text{Equation (1)}) \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(Sx|Y)\, \|\varphi(TSx|X) - D(R)\psi(Sx|Y)\|^2 \\
&= \int_{\mathbb{R}^3} d^3(Sx)\, \rho(Sx|Y)\, \|\varphi(TSx|X) - D(R)\psi(Sx|Y)\|^2 && (\because d^3(Tx) = d^3x \;\; \forall T \in SE(3)) \\
&= \int_{\mathbb{R}^3} d^3x\, \rho(x|Y)\, \|\varphi(Tx|X) - D(R)\psi(x|Y)\|^2 = E(T|X, Y) && (Sx \to x)
\end{aligned}$$

In the third equality, we used $\rho(Tx|TY) = \rho(x|Y)$ and $\psi(Tx|TY) = D(R)\psi(x|Y)$, which hold by the definition of the query density and the query EDF. In the fourth equality, we used the SE(3)-invariance of the Euclidean volume element $d^3x$, that is

$$d^3(Tx) = \det\left[\frac{\partial(Rx + v)}{\partial x}\right] d^3x = \det\left[\frac{\partial(Rx)}{\partial x}\right] d^3x = (\det R)\, d^3x = d^3x \quad \forall T = (R, v) \in SE(3) \tag{51}$$

since $\det R = 1$. Therefore, the energy function $E(T|X, Y)$ in Equation (10) is indeed bi-equivariant.

F.4 PROOF OF PROPOSITION 4

Proof. Let a query density satisfy Equation (11) such that $\rho(x|Y) = \rho(Tx|TY) \;\; \forall T \in SE(3)$. If this query density is grasp-independent such that $\rho(x|Y) = \rho(x)$, then $\rho(x) = \rho(Tx) \;\; \forall T \in SE(3)$ by Equation (11). Since for any $x, x' \in \mathbb{R}^3$ there always exists some $T \in SE(3)$ such that $Tx = x'$, $\rho(x)$ must be a constant function. In other words, there exists no grasp-independent and non-constant query density that satisfies Equation (11).

F.5 PROOF OF PROPOSITION 5

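Proof.

$$\begin{aligned}
\rho_\theta(Tx|TY) &= \sum_{i=1}^{N_Q} w_\theta(q_{i;\theta}(TY)|TY)\, \delta^{(3)}(Tx - q_{i;\theta}(TY)) \\
&= \sum_{i=1}^{N_Q} w_\theta(Tq_{i;\theta}(Y)|TY)\, \delta^{(3)}(Tx - Tq_{i;\theta}(Y)) \\
&= \sum_{i=1}^{N_Q} w_\theta(q_{i;\theta}(Y)|Y)\, \delta^{(3)}(Tx - Tq_{i;\theta}(Y)) \\
&= \sum_{i=1}^{N_Q} w_\theta(q_{i;\theta}(Y)|Y)\, \delta^{(3)}(x - q_{i;\theta}(Y)) = \rho_\theta(x|Y)
\end{aligned}$$

where Equation (13) was used in the second and the third lines.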

F.6 PROOF OF PROPOSITION 6

Let the query model $P(w, Q|Y)$ be SE(3)-equivariant such that

$$P(w, Q|Y) = P(w, SQ|SY) \quad \forall S \in SE(3) \tag{53}$$

We first show that $P(T|X, Y, w, Q)$ satisfies

$$P(T|X, Y, w, Q) = P(ST|SX, Y, w, Q) = P(TS|X, S^{-1}Y, w, S^{-1}Q) \quad \forall S = (R_S, v_S) \in SE(3) \tag{54}$$

To prove Equation (54), we first show that $E(T|X, Y, w, q)$ in Equation (15) satisfies the following:

$$E(ST|SX, Y, w, q) = E(T|X, Y, w, q) = E(TS|X, S^{-1}Y, w, S^{-1}q)$$

Proof. We first prove the left equivariance.

$$\begin{aligned}
E_\theta(ST|SX, Y, w, q) &= w\,\|\varphi_\theta(STq|SX) - D(R_S)D(R)\psi_\theta(q|Y)\|^2 \\
&= w\,\|D(R_S)\varphi_\theta(Tq|X) - D(R_S)D(R)\psi_\theta(q|Y)\|^2 && (\because \text{Equation (8)}) \\
&= w\,\|\varphi_\theta(Tq|X) - D(R)\psi_\theta(q|Y)\|^2 = E_\theta(T|X, Y, w, q) && (\because D(R)^T = D(R)^{-1})
\end{aligned}$$

We now prove the right equivariance.

$$\begin{aligned}
E_\theta(TS|X, S^{-1}Y, w, S^{-1}q) &= w\,\|\varphi_\theta(TSS^{-1}q|X) - D(R)D(R_S)\psi_\theta(S^{-1}q|S^{-1}Y)\|^2 \\
&= w\,\|\varphi_\theta(Tq|X) - D(R)D(R_S)D(R_S^{-1})\psi_\theta(q|Y)\|^2 && (\because \text{Equation (8)}) \\
&= w\,\|\varphi_\theta(Tq|X) - D(R)\psi_\theta(q|Y)\|^2 = E_\theta(T|X, Y, w, q) && (\because D(R^{-1}) = D(R)^{-1})
\end{aligned}$$

The right equivariance of the marginal PDF $P(T|X, Y) = \int dw\, dQ\, P(T|X, Y, w, Q)P(w, Q|Y)$ follows from Equation (53) and Equation (54):

$$\begin{aligned}
P(T|X, Y) &= \int dw\, dQ\, P(TS|X, S^{-1}Y, w, S^{-1}Q)\,P(w, S^{-1}Q|S^{-1}Y) \\
&= \int dw\, d(S^{-1}Q)\, P(TS|X, S^{-1}Y, w, S^{-1}Q)\,P(w, S^{-1}Q|S^{-1}Y) && (\because \text{Equation (51)}) \\
&= \int dw\, dQ\, P(TS|X, S^{-1}Y, w, Q)\,P(w, Q|S^{-1}Y) = P(TS|X, S^{-1}Y) && (S^{-1}Q \to Q)
\end{aligned}$$

In the second line, we used the SE(3)-invariance of the Euclidean volume element in Equation (51):

$$\int dQ = \prod_{i=1}^{N_q}\int_{\mathbb{R}^3} d^3q_i = \prod_{i=1}^{N_q}\int_{\mathbb{R}^3} d^3(Tq_i) = \int d(TQ) \quad \forall T \in SE(3) \tag{55}$$

Proof (of Lemma 4).

$$\begin{aligned}
f(ST|SX, Y) &= \int dw \int dQ\, h_1(ST, SX, Y, w, Q)\,h_2(ST, SX, Y, w, Q) \\
&= \int dw \int dQ\, h_1(T, X, Y, w, Q)\,h_2(T, X, Y, w, Q) = f(T|X, Y) && (\because \text{Equation (60)}) \\
&= \int dw \int dQ\, h_1(TS, X, S^{-1}Y, w, S^{-1}Q)\,h_2(TS, X, S^{-1}Y, w, S^{-1}Q) && (\because \text{Equation (60)}) \\
&= \int dw \int d(S^{-1}Q)\, h_1(TS, X, S^{-1}Y, w, S^{-1}Q)\,h_2(TS, X, S^{-1}Y, w, S^{-1}Q) && (\because \text{Equation (55)}) \\
&= \int dw \int dQ\, h_1(TS, X, S^{-1}Y, w, Q)\,h_2(TS, X, S^{-1}Y, w, Q) && (S^{-1}Q \to Q) \\
&= f(TS|X, S^{-1}Y)
\end{aligned}$$

We now prove the bi-equivariance of $\mathcal{L}_\theta(T|X, Y)$ in Equation (22).

Proof. We first define $h(T, X, Y, w, Q)$ as follows:

$$h(T, X, Y, w, Q) = \log P_\theta(T|X, Y, w, Q) + \log P_\theta(w, Q|Y) - \log H_\theta(w, Q|X, Y, T)$$

Using Equation (54), Equation (57) and Equation (59), one can prove that $h(T, X, Y, w, Q)$ satisfies Equation (60). In addition, $H_\theta(w, Q|X, Y, T)$ satisfies Equation (60).

G EQUIVARIANT GRAPH NEURAL NETWORKS

Graph neural networks are often used to model point cloud data (Wang et al., 2019; Te et al., 2018; Shi & Rajkumar, 2020). SE(3)-equivariant graph neural networks (Thomas et al., 2018; Fuchs et al., 2020; Liao & Smidt, 2022) exploit the roto-translation symmetry of graphs with spatial structures. In this work, we use Tensor Field Networks (TFNs) (Thomas et al., 2018) and the SE(3)-transformers (Fuchs et al., 2020) as the backbone networks for our models.

Tensor Product and Spherical Harmonics

Given two vectors $u$ and $v$ of type-$l_1$ and type-$l_2$, the tensor product $u \otimes v$ transforms under a rotation $R \in SO(3)$ as

$$u \otimes v \to (D_{l_1}(R)u) \otimes (D_{l_2}(R)v)$$

Tensor products are important because they can be used to construct new vectors of different types. By a change of basis, the tensor product $u \otimes v$ can be decomposed into the direct sum of type-$l$ vectors using the Clebsch-Gordan coefficients (Thomas et al., 2018; Zee, 2016; Griffiths & Schroeter, 2018). Let this type-$l$ vector be $(u \otimes v)^{(l)}$. The $m$-th component of this vector is calculated as:

$$(u \otimes v)^{(l)}_m = \sum_{m_1=-l_1}^{l_1}\sum_{m_2=-l_2}^{l_2} C^{(l,m)}_{(l_1,m_1)(l_2,m_2)}\, u_{m_1} v_{m_2}$$

where $C^{(l,m)}_{(l_1,m_1)(l_2,m_2)}$ are the Clebsch-Gordan coefficients in the real basis, which can be nonzero only for $|l_1 - l_2| \leq l \leq l_1 + l_2$.

The (real) spherical harmonics $Y^{(l)}_m(x/\|x\|)$ are orthonormal functions that form a complete basis of the Hilbert space on the sphere $S^2$. $l \in \{0, 1, 2, \cdots\}$ is called the degree and $m \in \{-l, \cdots, l\}$ is called the order of the spherical harmonic function. Consider the following $(2l + 1)$-dimensional vector field $Y^{(l)} = (Y^{(l)}_{m=-l}, \cdots, Y^{(l)}_{m=l})$. Under a 3-dimensional rotation $R \in SO(3)$, $Y^{(l)}$ transforms like a type-$l$ vector field such that

$$Y^{(l)}(R(x/\|x\|)) = D_l(R)\,Y^{(l)}(x/\|x\|)$$

Tensor Field Networks

Tensor field networks (TFNs) (Thomas et al., 2018) are SE(3)-equivariant models for generating representation-theoretic vector fields from a point cloud input. TFNs construct equivariant output feature vectors from equivariant input feature vectors and spherical harmonics. Spatial convolutions and tensor products are used to achieve the equivariance.

Consider a featured point cloud input with $M$ points given by $X = \{(x_1, f_1), \cdots, (x_M, f_M)\}$, where $x_i \in \mathbb{R}^3$ is the position and $f_i$ is the equivariant feature vector of the $i$-th point. Let $f_i$ be decomposed into $N$ vectors such that $f_i = \bigoplus_{n=1}^{N} f_i^{(n)}$, where $f_i^{(n)}$ is a type-$l_n$ vector, which is $(2l_n + 1)$-dimensional. We then define the action of $T = (R, v) \in SE(3)$ on $X$ as

$$TX = \{(Tx_1, D(R)f_1), \cdots, (Tx_M, D(R)f_M)\}$$

where $R \in SO(3)$, $v \in \mathbb{R}^3$ and $D(R) = \bigoplus_{n=1}^{N} D_{l_n}(R)$. Consider the following input feature field $f_{(in)}(x|X)$ generated by the point cloud input $X$:

$$f_{(in)}(x|X) = \sum_{j=1}^{M} f_j\, \delta^{(3)}(x - x_j)$$

where $\delta^{(3)}(x - y) = \prod_{\mu=1}^{3} \delta(x_\mu - y_\mu)$ is the three-dimensional Dirac delta function centered at $y$. Note that this input feature field is an SE(3)-equivariant field, that is:

$$f_{(in)}(Tx|TX) = D(R)\,f_{(in)}(x|X) \quad \forall\, T = (R, v) \in SE(3)$$

Now consider the following output feature field given by a convolution

$$f_{(out)}(x|X) = \bigoplus_{n'=1}^{N'} f^{(n')}_{(out)}(x|X) = \int d^3y\, W(x - y)\,f_{(in)}(y|X) = \sum_j W(x - x_j)\,f_j \tag{67}$$

with the convolution kernel $W(x - y) \in \mathbb{R}^{\dim(f_{(out)}) \times \dim(f_{(in)})}$ whose $(n', n)$-th block $W^{n'n}(x - y) \in \mathbb{R}^{(2l_{n'}+1) \times (2l_n+1)}$ is defined as follows:

$$\left[W^{n'n}(x)\right]^{m'}_{\ m} = \sum_{J=|l_{n'} - l_n|}^{l_{n'} + l_n} \phi^{n'n}_J(\|x\|) \sum_{k=-J}^{J} C^{(l_{n'},m')}_{(J,k)(l_n,m)}\, Y^{(J)}_k(x/\|x\|)$$

Here, $\phi^{n'n}_J(\|x\|): \mathbb{R} \to \mathbb{R}$ is some learnable radial function. The output feature field $f_{(out)}(x|X)$ in Equation (67) is proven to be SE(3)-equivariant (Thomas et al., 2018; Fuchs et al., 2020).

SE(3)-Transformers

The SE(3)-Transformers (Fuchs et al., 2020) are variants of TFNs with self-attention. Consider the case in which the output field is also a featured sum of Dirac deltas

$$f_{(out)}(x|X) = \sum_{j=1}^{M} f_{(out),j}\, \delta^{(3)}(x - x_j)$$

where $x_j$ is the same point as that of the point cloud input $X$. The SE(3)-Transformers apply type-0 (scalar) self-attention $\alpha_{ij}$ to Equation (67):

$$f_{(out),i} = \sum_{j \neq i} \alpha_{ij}\, W(x_i - x_j)\,f_j + \bigoplus_{n'=1}^{N'}\sum_{n=1}^{N} W^{n'n}_{(S)} f^{(n)}_i$$

where the $W^{n'n}_{(S)}$ term is called the self-interaction (Thomas et al., 2018). $W^{n'n}_{(S)}$ is nonzero only when $l_{n'} = l_n$. The self-interaction occurs where $i = j$ such that $W(x_i - x_j) = W(0)$. The self-interaction term is needed because $W$ is a linear combination of the spherical harmonics, which are not well defined at $x = 0$. Details about the calculation of the self-attention $\alpha_{ij}$ can be found in Fuchs et al. (2020).
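The decomposition of tensor products into irreducible types, which underlies the kernels of this appendix, can be made concrete in the simplest case of two type-1 vectors. In the Cartesian basis, where $D_1(R) = R$, the nine components of $u \otimes v$ split into a type-0 part (the dot product), a type-1 part (the cross product) and a type-2 part (the symmetric traceless matrix, with five independent components). The NumPy sketch below checks the corresponding transformation rules numerically. It is only an illustration: it omits the normalization constants of the Clebsch-Gordan coefficients and does not use the real spherical harmonics basis, and the function names are hypothetical.

```python
import numpy as np

def decompose_type1_product(u, v):
    """Split u (x) v for two Cartesian 3-vectors into type-0, type-1 and type-2 parts."""
    outer = np.outer(u, v)
    scalar = np.trace(outer)                          # type-0: u . v
    vector = np.cross(u, v)                           # type-1: antisymmetric part
    sym = 0.5 * (outer + outer.T)
    traceless = sym - scalar / 3.0 * np.eye(3)        # type-2: symmetric traceless part
    return scalar, vector, traceless

def rotation_matrix(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    A = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(angle) * A + (1 - np.cos(angle)) * A @ A
```

Rotating both inputs by $R$ leaves the type-0 part unchanged, rotates the type-1 part by $R$, and conjugates the type-2 part as $R\,(\cdot)\,R^T$, which is the $l = 0, 1, 2$ instance of the rule $|l_1 - l_2| \leq l \leq l_1 + l_2$.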

H EXPERIMENTAL DETAILS

We use the PyBullet simulator (Coumans & Bai, 2016-2021) as the test environment, with the Franka Panda manipulator and a custom end-effector. We use IKFast (Diankov, 2010) with Pybullet-Planning (Garrett, 2018) for the inverse kinematics. Three simulated depth cameras are used to observe the point cloud of the scene, and six simulated depth cameras are used to observe the point cloud of the grasp. We downsample the point clouds using a voxel filter. We illustrate the downsampled point clouds in Figure 9. Since motion planning is not in the scope of our work, we assume no collision between the environment and the robot links except for the hand link. We also allow the robot to teleport to the pre-grasp and pre-place poses to eliminate the unnecessary influence of the motion planners. However, we fully simulate the trajectories of all the task-relevant primitives (e.g., grasping, releasing, lifting).

We experiment with three tasks: the mug-hanging task, the bowl pick-and-place task, and the bottle pick-and-place task. For the mug-hanging task and the bowl pick-and-place task, we demonstrate with a single target object instance in upright poses only. For the bottle pick-and-place task, we use five object instances in upright poses only. For evaluation, we test four different unseen setups: (A) unseen instances, (B) unseen poses, (C) unseen distractors, and (D) unseen poses, instances, and distractors. Note that in the unseen poses setup (B), we only use lying poses, which were not presented during training. On the other hand, in the unseen poses, instances, and distractors setup (D), we use both lying and upright poses, with the upright poses at unseen elevations. We also use upright poses in setup (D) because the baseline model already fails completely in setup (B), so the result would be trivial if we only used lying poses in (D). Note that the upright poses are also unseen poses because of the unseen elevations.

For the inference, we run MH for 1000 steps and then run the Langevin algorithm for 300 steps. Lastly, we optimize the samples for 100 steps using Equation (46) without the noise term. We empirically find that at most three query points have significant weights after training. Therefore, we only use the three query points with the highest weights to save computation. Instead of directly taking the lowest-energy sample pose, we check the feasibility of the pose before acting. For example, if a collision is found or no inverse kinematics solution can be found for the sampled pose, we reject that pose and move to the next best sample. We provide the details in Algorithm 2.

Ten demonstrations are generated by a probabilistic oracle for each task, with the default instances in upright poses and with only the x, y, and yaw being randomized. In the unseen poses setup, the default (trained) instances are provided in unseen (lying) poses. The poses are completely randomized, including the elevation z. In the unseen instances setup, ten unseen instances of the target objects are provided in the trained (upright) poses, again with only the x, y, and yaw being randomized. In the unseen distractors setup, four unseen visual distractors are located near the target objects. We randomize the poses and the colors of these distractors. To separate out the effect of motion planning, we disable the collision between the distractors and the robot. Lastly, in the unseen instances, poses, and distractors setup, we combine all the prior setups. We experiment with ten unseen instances in unseen poses (50% upright and 50% lying, arbitrary elevation) with four randomized visual distractors. In this setup, we use supports to give arbitrary elevation to the target objects. Note that we use both upright and lying poses, unlike in the unseen poses setup. This is to test the case with unseen distractors and unseen instances for not only the lying poses but the upright poses as well. Note that the upright poses are also unseen poses because we give them arbitrary elevations.

For the mug and bowl tasks, only a single instance was used as the default instance for training. For the bottle task, five instances were used due to the high variance of the shape. All the models are trained sufficiently such that the total success rate in the trained setup (no unseen situations) exceeds 90% for all three tasks. The experimental settings for the three tasks are illustrated in Figure 10. We also illustrate the ten unseen instances of mug, bowl, and bottle in Figures 11, 12, and 13, respectively.
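The feasibility check described in this appendix, in which the sampled poses are tried in order of increasing energy until one passes the collision and inverse kinematics checks, can be sketched as below. This is a hypothetical illustration and not a reproduction of Algorithm 2: the function name and the `is_feasible` callback, which stands in for the collision and inverse kinematics checks, are assumptions.

```python
def select_feasible_pose(poses, energies, is_feasible):
    """Return the lowest-energy pose that passes the feasibility check, or None if none does.

    poses:       sequence of candidate poses (any representation)
    energies:    sequence of energies, one per pose
    is_feasible: callable returning True when the pose is collision-free and reachable
                 (placeholder for the actual collision and inverse kinematics checks)
    """
    order = sorted(range(len(poses)), key=lambda i: energies[i])  # best (lowest energy) first
    for i in order:
        if is_feasible(poses[i]):
            return poses[i]
    return None
```

Returning `None` when every candidate is rejected leaves it to the caller to decide whether to resample or to abort the attempt.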

J EXPERIMENTAL RESULTS ON NEURAL DESCRIPTOR FIELDS WITH UNSEGMENTED POINT CLOUD INPUTS

In this section, we show by experiment that object segmentation is critical to the performance of Neural Descriptor Fields (NDFs) (Simeonov et al., 2021). We compare the success rate of NDFs with unsegmented point cloud inputs to that of the same NDFs with segmented point cloud inputs. All the experiments are done using the official implementation of Simeonov et al. (2021). The results are summarized in Table 6. As can be seen in Table 6, the performance drops significantly for both tasks when NDFs are provided with unsegmented point cloud inputs. We provide qualitative examples in Figure 14. Therefore, it can be concluded that object segmentation is essential for NDFs.



Figure 1: Given few (5∼10) demonstrations of a mug pick-and-place task, EDFs can be trained fully end-to-end without requiring any pre-training, object segmentation, or pose estimation pipelines. In addition, we show that EDFs can generalize to A) unseen poses, B) unseen instances of the target object category, and C) the presence of unseen visual distractors.

Figure 5: A) Only ten demonstrations with objects in upright poses are provided during the training. B) The models are evaluated with unseen object instances in unseen poses with unseen distractors.

0.25 0.04 0.01 0.09 1.00 0.09 0.26 0.88 0.23
EDFs (Ours) 1.00 0.95 0.95 0.95 1.00 0.95 0.95 1.00 0.95

with changing target placement poses. We use ten demonstrations of upright poses generated by probabilistic oracles for the training. We then evaluate the success rate for (1) unseen instances, (2) unseen poses (lying poses), (3) unseen distracting objects, and (4) unseen instances in arbitrary poses (50% lying, 50% upright but in arbitrary elevation) with unseen distracting objects. We illustrate the experimental setups in Figure 5. Details are provided in Appendix H.

Figure 6: The success rate of EDFs for the mug-hanging task with respect to the number of demonstrations.

Figure 7: A) SE(3)-TNs fail to pick the object in an unseen pose due to the lack of SE(3)equivariance. B) Type-0 only descriptors fail to place the object in a proper orientation due to the lack of orientational sensitivity.

Figure 8: A) The end-effector should follow the transformation of the placement target to keep the relative pose between the object and the placement target invariant. B) The end-effector should transform contravariantly to compensate for the transformation of the grasped object such that the relative pose between the object and the placement target is invariant.

One may simply replace the energy function $E(T|X, Y)$ in Appendix F.2 with the new energy function $E(T|X, Y, w, Q) = \sum_{i=1}^{N_q} E(T|X, Y, w_i, q_i)$ to find that Equation (54) indeed holds. Now we show that the marginal PDF $P(T|X, Y)$ is bi-equivariant.

Proof.

$$P(ST|SX, Y) = \int dw\, dQ\, P(ST|SX, Y, w, Q)\,P(w, Q|Y) = \int dw\, dQ\, P(T|X, Y, w, Q)\,P(w, Q|Y) = P(T|X, Y) \quad (\because \text{Equation (53) and Equation (54)})$$

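Because $\mathcal{L}_\theta(T|X, Y)$ can be written as

$$\begin{aligned}
\mathcal{L}_\theta(T|X, Y) &= \mathbb{E}_{w, Q \sim H_\theta}\left[\log P_\theta(T|X, Y, w, Q)\right] - D_{KL}\left[H_\theta(w, Q|X, Y, T)\,\|\,P_\theta(w, Q|Y)\right] \\
&= \int dw \int dQ\, H_\theta(w, Q|X, Y, T)\,h(T, X, Y, w, Q)
\end{aligned} \tag{62}$$

we prove the bi-equivariance of $\mathcal{L}_\theta(T|X, Y)$ in Equation (22) using Lemma 4.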

Figure 14: A) NDFs successfully infer the pick position of a well-segmented mug point cloud. B) NDFs fail to successfully infer the pick position of an unsegmented mug point cloud. C) NDFs successfully infer the pick position of a well-segmented bottle point cloud. D) NDFs fail to successfully infer the pick position of an unsegmented bottle point cloud. The black dots are the query points attached to the gripper.

Success rate and inference time of the ablated model and EDFs. All evaluations are done in the unseen instances, poses & distracting objects setting. Measured on an Intel i9-12900k CPU (P-cores only) with an Nvidia RTX3090 GPU. Experimental results for the ablated model with more query points can be found in Table 4 of Appendix I.

Success rate of NDFs with and without object segmentation

Acknowledgement This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2B5B01002620). This work was partially supported by the Korea Institute of Science and Technology (KIST) intramural grants (2E31570).

F.7 PROOF OF PROPOSITION 7

We first show that $P_i(w_i, q_i|Y)$ in Equation (20) is SE(3)-equivariant:

Proof.

$$
\begin{aligned}
P_i(w_i, Tq_i | T\cdot Y) &= \int dl_i\, dw_i\; \mathcal{N}\left(l_i; \log w(Tq_i | T\cdot Y), \sigma_H\right) \delta^{(3)}\left(Tq_i - q_i(T\cdot Y)\right) \\
&= \int dl_i\, dw_i\; \mathcal{N}\left(l_i; \log w(q_i | Y), \sigma_H\right) \delta^{(3)}\left(Tq_i - Tq_i(Y)\right) \qquad (\because \text{Equation (13)}) \\
&= P_i(w_i, q_i | Y)
\end{aligned}
$$

As a result, $P(w, Q|Y) = \prod_{i=1}^{N_q} P_i(w_i, q_i|Y)$ in Equation (20) also satisfies

$$
P(w, TQ | T\cdot Y) = P(w, Q | Y) \quad \forall\, T \in SE(3) \tag{57}
$$

We now show that $H_i(w_i, q_i|X, Y, T)$ in Equation (21) satisfies the following equation:

Proof. […] This is because the Euclidean distance is preserved under SE(3) transformations. We used […] and Equation (56) in (B). Lastly, we used Equation (13) in (C). Therefore, […]

We now propose the following lemma.

Lemma 4. Let a scalar function $f(T|X, Y)$ be defined as follows: […]

Algorithm 2 Pick-and-place algorithm

For EDFs, we use a single query point for the pick inference and three query points for the place inference. For the ablated model, we use ten query points for the pick inference and five query points for the place inference. We use more query points for the ablated model because type-0 descriptors alone cannot encode orientations. Unlike EDFs, type-0 descriptor fields (the ablated model) require at least three non-collinear query points, and many more in practice, to determine orientations. This is in direct contrast with EDFs, which can determine orientations even with a single point. The computational benefit of using only type-0 descriptors is offset by the increased number of query points. We set the number of query points so that the inference time of the ablated model is similar to or slightly longer than that of EDFs. We run all the experiments on an Nvidia RTX3090 GPU and an Intel i9-12900k CPU with 16 GB RAM. We turned off all the E-cores of the CPU and only used P-cores with a fixed clock of 5100 MHz. We found that turning off the E-cores is crucial for the inference speed.

The models were trained for 600 steps (60 epochs) using the Adam optimizer (Kingma & Ba, 2014), where the learning rates range from 0.005 to 0.001. We randomly perturb the target pose and apply jitter to the input point clouds to augment the training data. It takes around 5.5 hours to train the pick-model and 8.5 hours to train the place-model, where most of the time is spent on MCMC sampling. We run 10000 iterations of the MH algorithm and 3000∼6000 iterations (linearly increasing as training proceeds) of the Langevin algorithm to draw negative samples for the training.

In Table 3, we list the success rates of EDFs for (1) low-variance, unimodal demonstrations and (2) highly multimodal demonstrations for the mug-hanging task. In the highly multimodal demonstrations, the mug is picked using both the rim grasp and the handle grasp. The experimental results indicate that EDFs are robust to both low-variance and high-variance demonstrations, whereas SE(3)-TNs can only be trained from low-variance ones.

I ADDITIONAL EXPERIMENTAL RESULTS

                                1.00 0.99 0.99 1.00 1.00 1.00 0.98 1.00 0.98
SE(3)-TNs (Zeng et al., 2020)   1.00 0.91 0.91 1.00 1.00 1.00 0.99 0.93 0.92
Type-0 Only (Fast)              1.00 0.98 0.98 1.00 1.00 1.00 1.00 0.97 0.97
Type-0 Only (Slow)              1.00 0.98 0.98 1.00 1.00 1.00 1.00 1.00 1.00

Lastly, in Table 5, we list the success rate of all the methods in the trained setup. Note that all the methods used in the experiments were sufficiently trained to achieve at least 91% total success rate.
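The training details state that the number of Langevin iterations used for negative sampling grows linearly from 3000 to 6000 as training proceeds over 600 steps. A minimal sketch of such a schedule is given below. The function name, the per-step (rather than per-epoch) interpolation, and the rounding rule are assumptions made for illustration and are not taken from the released code.

```python
def langevin_iterations(step, total_steps=600, start=3000, end=6000):
    """Hypothetical linear schedule for the number of Langevin iterations.

    step        : zero-based training step
    total_steps : total number of training steps (600 in the text)
    start, end  : iteration counts at the first and last step (3000 and 6000)
    """
    if total_steps < 2:
        return end
    # Clamp the step to the valid range, then interpolate linearly.
    clamped = min(max(step, 0), total_steps - 1)
    fraction = clamped / (total_steps - 1)
    return round(start + fraction * (end - start))

# Print the schedule at the first, a middle, and the last training step.
for s in (0, 300, 599):
    print(s, langevin_iterations(s))
```

Starting with fewer iterations keeps early training steps cheap, which matters here because most of the training time is reported to be spent on MCMC sampling.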

