IMAGINE THAT! LEVERAGING EMERGENT AFFORDANCES FOR 3D TOOL SYNTHESIS

Anonymous authors
Paper under double-blind review

Abstract

In this paper we explore the richness of information captured by the latent space of a vision-based generative model. The model combines unsupervised generative learning with a task-based performance predictor to learn and exploit task-relevant object affordances, given visual observations from a reaching task involving a scenario and a stick-like tool. While the learned embedding of the generative model captures factors of variation in 3D tool geometry (e.g. length, width, and shape), the performance predictor identifies sub-manifolds of the embedding that correlate with task success. Across a variety of scenarios, we demonstrate that traversing the latent space via backpropagation from the performance predictor allows us to imagine tools appropriate for the task at hand. Our results indicate that affordances, such as the utility of a tool for reaching, are encoded along smooth trajectories in latent space. Accessing these emergent affordances by considering only high-level performance criteria (such as task success) enables an agent to manipulate tool geometries in a targeted and deliberate way.

1. INTRODUCTION

The advent of deep generative models (e.g. Burgess et al., 2019; Greff et al., 2019; Engelcke et al., 2019), with their aptitude for unsupervised representation learning, casts a new light on learning affordances (Gibson, 1977). This kind of representation learning raises a tantalising question: given that generative models naturally capture factors of variation, could they also be used to expose these factors such that they can be modified in a task-driven way? We posit that a task-driven traversal of a structured latent space leads to affordances emerging naturally along trajectories in this space. This stands in stark contrast to more common approaches to affordance learning, where affordances are acquired via direct supervision or implicitly via imitation (e.g. Tikhanoff et al., 2013; Myers et al., 2015; Liu et al., 2018; Grabner et al., 2011; Do et al., 2018).

The setting we choose for our investigation is that of tool synthesis for reaching tasks, as commonly investigated in the cognitive sciences (Ambrose, 2001; Emery & Clayton, 2009). In order to demonstrate that a task-aware latent space encodes useful affordance information, we require a mechanism both to train such a model and to purposefully explore the space. To this end we propose an architecture in which a task-based performance predictor (a classifier) operates on the latent space of a generative model (see fig. 1). During training, the classifier provides an auxiliary objective which aids in shaping the latent space. Importantly, however, at test time the performance predictor is used to guide exploration of the latent space via activation maximisation (Erhan et al., 2009; Zeiler & Fergus, 2014; Simonyan et al., 2014), thus explicitly exploiting the structure of the space.
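The training setup described above can be illustrated with a minimal sketch: a variational encoder-decoder over tool observations, with a success classifier attached to the latent code and trained jointly via an auxiliary loss. All layer sizes, module names (`ToolImaginationModel`, `to_mu`, `predictor`, etc.) and loss weights here are our own illustrative assumptions; the paper's actual model encodes images and decodes 3D meshes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToolImaginationModel(nn.Module):
    """Toy VAE with a task-based performance predictor on its latent space.

    Illustrative sketch only: sizes, names, and the flat-vector observations
    are assumptions, not the architecture from the paper."""

    def __init__(self, obs_dim=64, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
        # Performance predictor: latent tool code + task observation -> success logit.
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tool_obs, task_obs):
        h = self.encoder(tool_obs)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        recon = self.decoder(z)
        success_logit = self.predictor(torch.cat([z, task_obs], dim=-1))
        return recon, mu, logvar, success_logit


def training_loss(recon, tool_obs, mu, logvar, success_logit, success_label,
                  kl_weight=1.0, aux_weight=1.0):
    """ELBO terms plus the auxiliary classification objective that helps
    shape the latent space during training."""
    rec = F.mse_loss(recon, tool_obs)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    aux = F.binary_cross_entropy_with_logits(success_logit.squeeze(-1), success_label)
    return rec + kl_weight * kl + aux_weight * aux
```

The key design point is that the classifier consumes the latent code rather than the raw observation, so its gradients later provide a direction of traversal in latent space at test time.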
While our desire to affect factors of influence is similar in spirit to the notion of disentanglement, it contrasts significantly with models such as β-VAE (Higgins et al., 2017), where the factors of influence are effectively encouraged to be axis-aligned. Our approach instead relies on a high-level auxiliary loss to discover the direction in latent space to explore. Our experiments demonstrate that artificial agents are able to imagine an appropriate tool for a variety of reaching tasks by manipulating the tool's task-relevant affordances. To the best of our knowledge, this makes us the first to demonstrate an artificial agent's ability to imagine, or synthesise, 3D meshes of tools appropriate for a given task via optimisation in a structured latent embedding. Similarly, while activation maximisation has been used to visualise modified input images before (e.g. Mordvintsev et al., 2015), we believe this work to be the first to effect deliberate manipulation of factors of influence by chaining the outcome of a task predictor to the latent space, and then decoding the latent representation back into a 3D mesh.

Figure 1: Tool synthesis for a reaching task. Our model is trained on data triplets {task observation, tool observation, success indicator}. Within a scenario, the goal is to determine whether a given tool can reach the goal (green) while avoiding barriers (blue) and remaining behind the boundary (red). If a tool cannot satisfy these constraints, our approach (via the performance predictor) imagines how one may augment it in order to solve the task. Our interest is in what these augmentations, imagined during tool synthesis, imply about the learned object representations.
Beyond the application of tool synthesis, we believe our work provides novel perspectives on affordance learning and disentanglement: it demonstrates that object affordances can be viewed as trajectories in a structured latent space, and it contributes a novel architecture adept at deliberately manipulating interpretable factors of influence.

2. RELATED WORK

The concept of an affordance, which describes a potential action to be performed on an object (e.g. a doorknob affords being turned), goes back to Gibson (1977). Because of their importance in cognitive vision, affordances are extensively studied in computer vision and robotics. Commonly, affordances are learned in a supervised fashion, where models discriminate between discrete affordance classes or predict masks for image regions which afford certain types of human interaction (e.g. Stoytchev, 2005; Kjellström et al., 2010; Tikhanoff et al., 2013; Mar et al., 2015; Myers et al., 2015; Do et al., 2018). Interestingly, most works in this domain learn from object shapes which have been given an affordance label a priori. However, the affordance of a shape is only properly defined in the context of a task. Hence, we employ a task-driven traversal of a latent space to optimise the shape of a tool by exploiting factors of variation which are conducive to task success.

Recent advances in 3D shape generation employ variational models (Girdhar et al., 2016; Wu et al., 2016) to capture complex manifolds of 3D objects. Besides their expressive capabilities, the latent spaces of such models also enable smooth interpolation between shapes. Remarkable results have been demonstrated, including 'shape algebra' (Wu et al., 2016) and the preservation of object part semantics (Kohli et al., 2020) and fine-grained shape styles (Yifan et al., 2019) during interpolation. This shows the potential of disentangling meaningful factors of variation in the latent representation of 3D shapes. Inspired by this, we investigate whether these factors can be exposed in a task-driven way. In particular, we propose an architecture in which a generative model for 3D object reconstruction (Liu et al., 2019) is paired with activation maximisation (e.g. Erhan et al., 2009; Zeiler & Fergus, 2014; Simonyan et al., 2014) of a task-driven performance predictor.
Guided by its loss signal, activation maximisation traverses the generative model's latent representations and drives an imagination process yielding a shape suitable for the task at hand.
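This test-time traversal can be sketched as a small gradient-ascent loop in latent space. The callables `encode`, `predict_success`, and `decode`, along with the step count and learning rate, are hypothetical stand-ins for the trained model's components, not names from the paper.

```python
import torch


def imagine_tool(encode, predict_success, decode, tool_obs, task_obs,
                 steps=200, lr=0.05):
    """Activation maximisation in latent space: with all network weights held
    fixed, follow the gradient of the performance predictor's success logit
    with respect to the latent code, then decode the 'imagined' tool.

    encode/predict_success/decode are assumed callables standing in for the
    trained model's encoder, performance predictor, and decoder."""
    with torch.no_grad():
        z0 = encode(tool_obs)                 # start from the given tool's code
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)        # only the latent code is optimised
    for _ in range(steps):
        opt.zero_grad()
        logit = predict_success(z, task_obs)  # predicted task success
        (-logit.mean()).backward()            # ascend the success logit
        opt.step()
    with torch.no_grad():
        return decode(z), z.detach()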

