

ABSTRACT

Understanding what objects could furnish for humans, i.e., learning object affordance, is the crux of bridging perception and action. In the vision community, prior work has primarily focused on learning object affordance with dense (e.g., per-pixel) supervision. In stark contrast, we humans learn object affordance without dense labels. As such, the fundamental question in devising a computational model is: What is the natural way to learn object affordance from geometry with human-like weak supervision? In this work, we present the new task of part-level affordance discovery (PartAfford): Given only the affordance labels for each object, the machine is tasked to (i) decompose 3D shapes into parts and (ii) discover how each part of the object corresponds to a certain affordance category. We propose a novel learning framework that discovers part-level representations by leveraging only the affordance set supervision and geometric primitive regularization, without dense supervision. To learn and evaluate PartAfford, we construct a part-level, cross-category 3D object affordance dataset, annotated with 24 affordance categories shared among more than 25,000 objects. We demonstrate through extensive experiments that our method enables both the abstraction of 3D objects and part-level affordance discovery, with generalizability to difficult and cross-category examples. Further ablations reveal the contribution of each component.

1. INTRODUCTION

The human vision system can swiftly locate the functional part of an object when using it for a specific task (Land et al., 1999). Such a critical capability in object interaction requires fine-grained object affordance understanding. Affordance, coined and originally theorized by Gibson (Gibson & Carmichael, 1966; Gibson, 1979), characterizes how humans interact with human-made objects and environments. As such, affordance understanding of objects and scenes has a significant influence on bridging visual perception and holistic scene understanding (Huang et al., 2018b;a; Chen et al., 2019) with actionable information (Soatto, 2013; Han et al., 2022). Object affordances have two main characteristics. First, object affordances are not defined in terms of conventional categorical labels in computer vision; instead, they are defined by the associated actions.



Figure 1: The proposed PartAfford: discover 3D object part affordances by learning contrast in affordance compositions. During training (left), given weak annotations (a per-shape affordance label set), a learning framework is devised to ground an affordance (e.g., backrest) to a 3D part (e.g., sofa back) by learning cross-category, affordance-related shapes (e.g., chair, sofa) with various affordance compositions. At test time (right), the learned model decomposes the 3D object into parts and infers the part-level affordances.
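To make the set-level weak supervision concrete, the sketch below (our illustration, not the paper's implementation) shows one natural way such supervision can drive learning: a model's per-part affordance probabilities are max-pooled into an object-level prediction, which is scored against the annotated affordance label set with a multi-label binary cross-entropy. The affordance names and probability values are hypothetical.

```python
# Illustrative sketch of affordance set supervision (not the authors' code):
# only a per-object set of affordance labels is given, yet the loss still
# constrains the per-part predictions through max-pooling aggregation.
import math

AFFORDANCES = ["sit", "backrest", "support", "contain"]  # hypothetical subset

def set_supervision_loss(part_probs, label_set):
    """part_probs: one probability vector (len = #affordances) per decomposed part.
    label_set: the set of affordance names annotated for the whole object."""
    eps = 1e-7
    # An object affords 'a' if any of its parts does: max-pool over parts.
    obj_probs = [max(p[i] for p in part_probs) for i in range(len(AFFORDANCES))]
    targets = [1.0 if a in label_set else 0.0 for a in AFFORDANCES]
    # Multi-label binary cross-entropy over affordance categories.
    return -sum(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
                for p, t in zip(obj_probs, targets)) / len(AFFORDANCES)

# A two-part "chair": one part likely a seat, the other likely a backrest.
parts = [[0.9, 0.1, 0.2, 0.0],
         [0.1, 0.8, 0.3, 0.0]]
loss = set_supervision_loss(parts, {"sit", "backrest"})
```

Because the label set carries no correspondence between labels and parts, the grounding of each affordance to a specific part must emerge from contrasting objects with different affordance compositions, as Figure 1 depicts.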

