LEARNING OBJECT AFFORDANCE WITH CONTACT AND GRASP GENERATION

Abstract

Understanding object affordance can help in designing better and more robust robotic grasping. Existing work in the computer vision community formulates object affordance understanding as a grasping pose generation problem, treating it as a black box by learning a mapping between objects and the distributions of possible grasping poses. In the robotics community, on the other hand, estimating object affordance represented by contact maps is of primary importance, as localizing possible affordance regions helps the planning of grasping actions. In this paper, we propose to formulate object affordance understanding as the generation of both contacts and grasp poses. Rather than adopting the black-box strategy, we factorize the learning task into two sequential stages: (1) we first reason about contact maps, allowing multi-modal contact generation; (2) assuming that grasping poses are fully constrained given contact maps, we learn a one-to-one mapping from contact maps to grasping poses. Further, we propose a penetration-aware partial optimization from the intermediate contacts, which combines local and global optimization to refine the partial poses of generated grasps that exhibit penetration. Extensive validation on two public datasets shows that our method outperforms state-of-the-art methods on various grasp generation metrics.

1. INTRODUCTION

Affordance describes how an object can be used by an agent. Understanding affordance can help in designing better and more robust robotic systems operating in complex and dynamic environments (Hassanin et al., 2021). For example, a cup can be grasped and handed over by a hand, and a bed can be sat or slept on by a human. Learning affordance (or affordance understanding) has wide applications such as grasping (Bohg et al., 2013), action recognition and prediction (Jain et al., 2016; Koppula et al., 2013; Koppula & Saxena, 2015), functionality understanding (Grabner et al., 2011), social scene understanding (Chuang et al., 2018), etc. In this paper, we focus on object affordance for hands, i.e., hand-object interactions. Though of great importance to many applications, only a few works on 3D grasp synthesis using deep learning (Corona et al., 2020; Taheri et al., 2020; Jiang et al., 2021; Karunratanakul et al., 2020; Zhang et al., 2021; Taheri et al., 2021) have been proposed in the computer vision community. In (Taheri et al., 2020), a dataset of humans grasping objects with annotations of full-body meshes and object meshes was collected, and a coarse-to-fine hand pose generation network based on a conditional variational autoencoder (CVAE) was proposed. In (Karunratanakul et al., 2020), a new implicit representation is proposed for hand-object interactions. The follow-up work (Taheri et al., 2021) takes a step further and learns dynamic grasping sequences, including the motion of the whole body given an object, instead of static grasping poses. All of these works define affordance as the possible grasping poses allowed by an object. However, instantiations of affordance understanding can also include affordance categorization, reasoning, semantic labeling, activity recognition, etc.
(Deng et al., 2021). Among all these, semantic labeling of contact areas between agents and objects is found to be of primary importance (Deng et al., 2021; Roy & Todorovic, 2016; Zhu et al., 2015), because localizing possible affordance regions can greatly help the planning of actions for robotic hands (Mo et al., 2021; Wu et al., 2021; Mandikal & Grauman, 2021; 2022). In the robotics community, Mo et al. (2021) and Wu et al. (2021) first estimate the contact points for parallel-jaw grippers and then plan paths to grasp the target objects. For dexterous robotic hand grasping, recent works (Mandikal & Grauman, 2021; 2022) find that leveraging contact areas from human grasps can significantly improve the grasping success rate in a reinforcement learning framework. However, they assume an object has only one grasp contact area and learn a one-to-one mapping from an object to the contact. To overcome the limitations of work in both the computer vision and robotics communities, we propose to formulate object affordance understanding as the generation of both contacts and grasp poses. Specifically, we factorize the learning task into two sequential stages, rather than using the black-box generative networks of previous work that directly map an object to possible grasping poses. 1) In the first stage, we generate multiple hypotheses of the grasping contact areas, represented by binary 3D segmentation maps. 2) In the second stage, we learn a one-to-one mapping from the contact to the grasping pose, assuming the grasping pose is fully constrained given a contact map. Different from a coarse-to-fine strategy, our decomposition not only provides intermediate semantic contact maps, but also benefits the quality of the generated poses through intermediate task learning, which has proven effective in many computer vision tasks (Tang et al., 2019; Wan et al., 2018; Tome et al., 2017; Wu et al., 2017).
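As a toy illustration of the two-stage factorization, the sketch below samples several latent codes to produce multi-modal binary contact maps (stage 1) and then maps each contact map deterministically to a grasp pose vector (stage 2). The decoder weights, the dimensions, and the linear contact-to-pose mapping are hypothetical stand-ins for the learned networks, kept self-contained with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: multi-modal contact generation (hypothetical CVAE decoder stub).
# A real model would condition a learned decoder on object features; here a
# fixed random linear decoder stands in purely for illustration.
N_POINTS, LATENT_DIM, POSE_DIM = 256, 16, 48  # POSE_DIM ~ hand pose parameters

W_dec = rng.normal(size=(LATENT_DIM, N_POINTS))

def generate_contact_map(obj_points, z, threshold=0.5):
    """Decode latent z into per-point contact probabilities, then binarize."""
    logits = z @ W_dec + obj_points[:, 2]          # toy conditioning on geometry
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs > threshold).astype(np.float32)  # binary 3D segmentation map

# Stage 2: deterministic contact-to-grasp mapping (one-to-one by assumption).
W_pose = rng.normal(size=(N_POINTS, POSE_DIM)) * 0.01

def contact_to_grasp(contact_map):
    """Map a binary contact map to a grasp pose vector (fully constrained)."""
    return contact_map @ W_pose

obj = rng.normal(size=(N_POINTS, 3))               # toy object point cloud
poses = [contact_to_grasp(generate_contact_map(obj, rng.normal(size=LATENT_DIM)))
         for _ in range(3)]                        # multi-modal: sample several z
```

Sampling several latent codes yields multiple contact hypotheses for the same object, while stage 2 stays deterministic, mirroring the assumption that a contact map fully constrains the grasp.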
In robotic grasping, it has been shown that optimizing grasping poses directly from contacts is superior to re-targeting observed grasps to the target hands (Brahmbhatt et al., 2019b), which also motivates our choice. The other benefit of the intermediate contact representation is therefore that it enables optimization from the contacts. Different from the optimization of full grasps from scratch in (Brahmbhatt et al., 2019b), we propose a penetration-aware partial optimization from the intermediate contacts. It combines local and global optimization to refine the partial poses of generated grasps that exhibit penetration. The local-global optimization constrains gradients to affect only the partial poses requiring adjustment, which results in faster convergence and better grasp quality than global optimization alone. In summary, our key contributions are: (1) we formulate object affordance understanding as contact and grasp pose synthesis; (2) we develop a novel two-stage affordance learning framework that first generates contact maps and then predicts the grasp pose constrained by the maps; (3) we propose a penetration-aware partial optimization from the intermediate contacts for grasp refinement; (4) benefiting from the two decomposed learning stages and the partial optimization, our method outperforms existing methods both quantitatively and qualitatively.
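To make the idea of penetration-aware partial updates concrete, the following minimal sketch updates only the points that penetrate the object and freezes all others, mimicking how gradients are constrained to the partial poses requiring adjustment. The unit-sphere object, the point-based hand, and the function names are illustrative assumptions, not the actual hand model or optimizer:

```python
import numpy as np

def signed_distance_sphere(points, radius=1.0):
    """Signed distance to a unit-sphere 'object' (negative = penetration)."""
    return np.linalg.norm(points, axis=1) - radius

def partial_optimize(hand_points, radius=1.0, lr=0.5, steps=50):
    """Toy penetration-aware partial optimization.

    Only penetrating points receive updates (the 'local' part); points
    already outside the object are frozen, analogous to masking gradients
    so they affect only the partial poses that need adjustment.
    """
    pts = hand_points.copy()
    for _ in range(steps):
        sd = signed_distance_sphere(pts, radius)
        mask = sd < 0                       # select penetrating points only
        if not mask.any():
            break                           # no penetration left: converged
        dirs = pts[mask] / np.linalg.norm(pts[mask], axis=1, keepdims=True)
        pts[mask] += lr * (-sd[mask])[:, None] * dirs  # push toward surface
    return pts
```

Because each step reduces the penetration depth of only the offending points, the untouched parts of the grasp keep their generated pose, which is the intuition behind the faster convergence of the local-global scheme.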

2. RELATED WORKS

Grasp Generation Human grasp generation is challenging due to the high degrees of freedom of human hands and the requirement that generated hands interact with objects in a physically plausible manner. Most methods use models such as MANO (Romero et al., 2017) to parameterize hand poses, aiming to directly learn a latent conditional distribution of the hand parameters given objects from large datasets. The distribution is usually learned with generative network models such as Conditional Variational Auto-Encoders (Sohn et al., 2015) or Generative Adversarial Networks (Arjovsky et al., 2017). To obtain finer poses, many existing works adopt a coarse-to-fine strategy, learning residuals of the grasping poses in a refinement stage. Corona et al. (2020) uses a generative adversarial network to obtain an initial grasp and an extra network to refine it, while Taheri et al. (2020) passes hand parameters to a CVAE model to output an initial grasp, followed by further refinement. In more recent work, Jiang et al. (2021) proposes to refine human grasps by leveraging the consistency of the contact map. Though it estimates hand-object contact maps, it only reasons about contact consistency to refine the generated pose, while our work exploits the contact maps as an intermediate representation for final grasp generation. On the other hand, in the area of robotic grasping, Brahmbhatt et al. (2019b) introduces an optimization loss using contact maps captured from thermal cameras (Brahmbhatt et al., 2019a; 2020) to filter and rank random grasps sampled by Graspit! (Miller & Allen, 2004). It concludes that grasping poses optimized directly from the contacts are of superior quality to those from approaches that kinematically re-target observed human grasps to the target hand model. Contact maps are also used in hand and object reconstruction: Grady et al. (2021) proposes a differentiable contact optimization to refine the hand pose reconstructed from an image.

