

ABSTRACT

Understanding what objects can afford for humans, i.e., learning object affordance, is the crux of bridging perception and action. In the vision community, prior work has primarily focused on learning object affordance with dense (e.g., per-pixel) supervision. In stark contrast, we humans learn object affordance without dense labels. As such, the fundamental question in devising a computational model is: what is the natural way to learn object affordance from geometry with humanlike weak supervision? In this work, we present the new task of part-level affordance discovery (PartAfford): given only the affordance labels for each object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework that discovers part-level representations by leveraging only affordance-set supervision and geometric primitive regularization, without dense supervision. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among over 25,000 objects. We demonstrate through extensive experiments that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.

1. INTRODUCTION

The human vision system can swiftly locate the functional part of an object when using it for a specific task (Land et al., 1999). Such a critical capability in object interaction requires fine-grained object affordance understanding. Affordance, coined and originally theorized by Gibson (Gibson & Carmichael, 1966; Gibson, 1979), characterizes how humans interact with human-made objects and environments. As such, affordance understanding of objects and scenes has a significant influence on bridging visual perception and holistic scene understanding (Huang et al., 2018b;a; Chen et al., 2019) with actionable information (Soatto, 2013; Han et al., 2022).

Object affordances have two main characteristics. First, object affordances are not defined in terms of conventional categorical labels in computer vision; instead, they are defined by the associated actions for various tasks and are naturally cross-category. For example, both a chair and a sofa can be sat on, which indicates they share the sittable affordance. Similarly, a desktop and a bookshelf share the support affordance. Second, object affordances are intrinsically part-based. We can easily associate the sittable affordance with the seats of chairs and sofas, and support with the boards of desktops and bookshelves. As such, the ability to learn part-based, cross-category affordance is essential for general object affordance understanding.

In passive affordance learning, prior literature follows the supervised learning paradigm, in which dense affordance annotations on the objects are fed as supervision signals (Deng et al., 2021). However, this line of thought depends heavily on the quality of the dense annotation, which deviates significantly from how we humans learn to understand affordance.
Humanlike supervision would instead be: "you can sit on this chair and rest your arm," or "you can open the lid and hold water with the cup." In this paper, we try to answer: how can a model distinguish each object part while recognizing the corresponding affordances from such weak and natural supervision?

To tackle this problem, we present PartAfford, a new task of part-level affordance discovery, which learns object affordance with the natural supervision of the affordance set. As shown in Fig. 1, by providing only the set of affordance labels for each object, the algorithm is tasked to decompose 3D shapes into parts and discover how each part corresponds to a certain affordance category, which is challenging and under-explored in the area of generalizable part-level object understanding and affordance learning.

To address this, we propose a novel method that discovers part-level representations with self-supervised 3D reconstruction, affordance set supervision, and primitive regularization. The proposed approach consists of two main components. The first component is an encoder with slot attention for unsupervised clustering and abstraction. Specifically, we encode the 3D object into visual features and abstract the low-level features into a set of slot variables (Locatello et al., 2020). The second component is a decoder built upon the learned slot features. It has three output branches that jointly reconstruct the 3D parts and object, predict the affordance labels, and regularize the learned part-level shapes with cuboidal primitives. Our method does not rely on dense supervision but instead learns from weak set supervision. It discovers the part-level affordance by learning the correspondence between affordance labels and abstracted 3D object parts.

Learning and evaluating PartAfford demands collections of 3D objects and their affordance labels for object parts. Prior work on visual affordance learning (Hassanin et al., 2021)
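To make the encoder's abstraction step concrete, the following is a minimal NumPy sketch of one slot-attention pass in the spirit of Locatello et al. (2020); all shapes and names are illustrative assumptions, and the learned projections and GRU/MLP slot update of the original method are simplified here to a weighted-mean update:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, dim=64, iters=3, seed=0):
    """Cluster N per-point features into `num_slots` slot vectors.

    inputs: (N, dim) array of visual features extracted from the 3D object.
    Returns a (num_slots, dim) array of slot features. Simplification: the
    key/value projections are the identity and the slot update is a plain
    weighted mean rather than a learned GRU/MLP.
    """
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, dim))   # random slot initialization
    k = v = inputs                              # identity projections (simplified)
    for _ in range(iters):
        # Key property of slot attention: the softmax normalizes over the
        # *slot* axis, so slots compete to explain each input feature.
        logits = slots @ k.T / np.sqrt(dim)     # (num_slots, N)
        attn = softmax(logits, axis=0)
        # Weighted mean of inputs assigned to each slot.
        attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = attn @ v                        # updated slot features
    return slots
```

In the full model, each resulting slot feature would feed the decoder's three branches (part reconstruction, affordance prediction, and cuboid regularization); the competition over the slot axis is what encourages each slot to bind to one object part.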



Figure 1: The proposed PartAfford: discover 3D object part affordances by learning contrast in affordance compositions. During training (left), given weak annotations (per-shape affordance label set), a learning framework is devised to ground affordance (e.g., backrest) to 3D part (e.g., sofa back) through learning cross-category, affordance-related shapes (e.g., chair, sofa) with various affordance compositions. At test time (right), the learned model decomposes the 3D object into parts and infers the part-level affordances.

either focuses on 2D objects and scenes or lacks part-based annotation (Deng et al., 2021). Hence, we construct a part-level, cross-category 3D object affordance dataset annotated with 24 affordance categories shared among over 25,000 3D objects. The 3D objects are collected from the PartNet dataset (Mo et al., 2019b) and the PartNet-Mobility dataset (Xiang et al., 2020). The 24 part affordance categories are defined in terms of adjectives (e.g., "sittable") or nouns (e.g., "armrest"); they describe how object parts afford human daily actions and activities. We annotate the part-level object affordances by manually mapping the fine-grained object parts defined in PartNet to the part affordances defined in this work.

By experimenting on this newly constructed PartAfford dataset, we empirically demonstrate that our method jointly enables the abstraction of 3D objects and part-level affordance discovery. Our model also shows strong generalizability on hard and cross-category objects. Further experiments and ablations analyze each component's contribution and point out future directions.

Our contributions are fourfold:
• We present the new PartAfford task for part-level affordance discovery. Compared to the prior densely-supervised learning paradigm, PartAfford learns visual object affordance more naturally.
• We propose a novel learning framework for tackling PartAfford, which jointly abstracts 3D objects into part-level representations and discovers affordances by learning the affordance correspondence.
• We build the benchmark for learning and evaluating PartAfford by curating a dataset consisting of 3D objects and annotating part-level affordances.
• We empirically demonstrate the efficacy and generalization capability of the proposed method and analyze each component's significance via a suite of ablation studies. Code and data will be released for research purposes.
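The per-object weak supervision described above can be derived mechanically from the part-to-affordance mapping. The sketch below illustrates this; the part names and the mapping entries are hypothetical examples, not the dataset's actual tables (the real dataset defines 24 affordance categories over PartNet's fine-grained part vocabulary):

```python
# Hypothetical mapping from fine-grained PartNet-style part labels to
# part-affordance categories (illustrative subset, not the real tables).
PART_TO_AFFORDANCE = {
    "chair_seat": "sittable",
    "chair_back": "backrest",
    "chair_arm": "armrest",
    "table_top": "support",
    "cup_body": "contain",
    "cup_lid": "openable",
}

def affordance_set(part_labels):
    """Collapse an object's part labels into its weak supervision signal:
    the *set* of affordances the object offers, with no part-level grounding
    (which part affords what is exactly what the model must discover)."""
    return {PART_TO_AFFORDANCE[p] for p in part_labels if p in PART_TO_AFFORDANCE}

# A chair with and without armrests yields different affordance sets; this
# cross-object contrast is the signal that grounds each label to a part.
chair = affordance_set(["chair_seat", "chair_back", "chair_arm"])
stool = affordance_set(["chair_seat"])
```

Here `chair` is `{"sittable", "backrest", "armrest"}` while `stool` is `{"sittable"}`; comparing shapes whose sets differ by one affordance is what lets the model attribute that affordance to the differing part.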

