UNDERSTANDING MENTAL REPRESENTATIONS OF OBJECTS THROUGH VERBS APPLIED TO THEM

Anonymous

Abstract

In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and of the extent to which those actions depend on, or affect, the properties of the object. This knowledge is called the object "affordance". We propose an approach for using text corpora to create an embedding of objects in an affordance space, in which each dimension corresponds to an aspect of meaning shared by many actions. This embedding makes it possible to predict which verbs will be applicable to a given object, as captured in human judgments of affordance, better than a variety of alternative approaches. Furthermore, we show that the dimensions learned are interpretable, and that they correspond to typical patterns of interaction with objects. Finally, we show that the dimensions can be used to predict a state-of-the-art mental representation of objects, derived purely from human judgments of object similarity.

1. INTRODUCTION

In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and of how those actions depend on (or affect) the properties of the object. Gibson (1979) coined the term "affordance" to describe what the environment "provides or furnishes the animal". Norman (2013) developed the term to focus on the properties of objects that determine the action possibilities. The notion of affordance thus emerges from the relationship between the properties of objects and human actions.

If we take "object" to mean anything concrete that one might interact with in the environment, there are thousands of possibilities, both animate and inanimate (see WordNet (Miller, 1998)). The same is true if we take "action" to mean any verb that might be applied to the noun naming an object (see VerbNet (Schuler, 2005)). Intuitively, only a relatively small fraction of all possible combinations of object and action will be plausible. Of those, many will also be trivial, e.g. "see" or "have" apply to almost every object. Finally, different actions might reflect a similar mode of interaction, depending on the type of object they are applied to (e.g. "chop" and "slice" are distinct actions, but both are used in food preparation).

Mental representations of objects encompass many aspects beyond function. Several studies (McRae et al., 2005; Devereux et al., 2014; Hovhannisyan et al., 2020) have asked human subjects to list binary properties for hundreds of objects, yielding thousands of answers. Properties could be taxonomic (category), functional (purpose), encyclopedic (attributes), or visual-perceptual (appearance), among other groups. While some properties were affordances in themselves (e.g. "edible"), others reflected many affordances at once (e.g. "is a vegetable" implies that an object can be planted, cooked, sliced, etc.).

More recently, Zheng et al. (2019) and Hebart et al. (2020) introduced SPoSE, a model of the mental representations of objects. The model was derived from a dataset of 1.5M Amazon Mechanical Turk (AMT) judgments of object similarity, in which subjects were shown random triplets of objects and asked which object was the odd one out. The model is an embedding of objects in which each dimension is constrained to be sparse and positive, and triplet judgments are predicted as a function of the similarity between the embedding vectors of the three objects considered. The authors showed that these dimensions were predictable as a combination of elementary properties in the Devereux et al. (2014) norm that often co-occur across many objects. Hebart et al. (2020) further showed that 1) human subjects could coherently label what the dimensions were "about", ranging from categorical (e.g. is animate, food, drink, building) to functional (e.g. container, tool) or structural (e.g. made of metal or wood, has inner structure). Subjects could also predict what dimension values new objects would
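To make the triplet-prediction setup concrete, the following is a minimal sketch (not the authors' implementation) of how an odd-one-out judgment can be scored from embedding vectors, assuming the commonly used formulation in which similarity is the dot product of nonnegative embeddings and the odd one out is chosen with probability proportional to the exponentiated similarity of the remaining pair. The function name and the example vectors are ours, for illustration only.

```python
import numpy as np

def triplet_odd_one_out_probs(x_i, x_j, x_k):
    """Probability that each of three objects is the odd one out.

    x_i, x_j, x_k: nonnegative embedding vectors (sparse, positive,
    as in SPoSE-style models). Similarity is the dot product; the
    odd one out is the object NOT in the most similar pair, so
    P(i is odd) is proportional to exp(sim(j, k)), and so on.
    """
    s_ij = x_i @ x_j
    s_ik = x_i @ x_k
    s_jk = x_j @ x_k
    logits = np.array([s_jk, s_ik, s_ij])  # odd: i, j, k respectively
    p = np.exp(logits - logits.max())      # subtract max for stability
    return p / p.sum()

# Toy example: i and j are identical, k differs, so k should be
# the most likely odd one out.
x_i = np.array([1.0, 0.0])
x_j = np.array([1.0, 0.0])
x_k = np.array([0.0, 1.0])
p = triplet_odd_one_out_probs(x_i, x_j, x_k)
```

Fitting such a model then amounts to maximizing the likelihood of the observed AMT choices over the embedding matrix, subject to the positivity constraint and a sparsity penalty on the dimensions.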

