UNDERSTANDING MENTAL REPRESENTATIONS OF OBJECTS THROUGH VERBS APPLIED TO THEM

Anonymous

Abstract

In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and the extent to which those actions depend on, or affect, the properties of the object. This knowledge is called the object "affordance". We propose an approach that uses text corpora to create an embedding of objects in an affordance space, in which each dimension corresponds to an aspect of meaning shared by many actions. This embedding makes it possible to predict which verbs are applicable to a given object, as captured in human judgments of affordance, better than a variety of alternative approaches. Furthermore, we show that the dimensions learned are interpretable and correspond to typical patterns of interaction with objects. Finally, we show that the dimensions can be used to predict a state-of-the-art mental representation of objects, derived purely from human judgments of object similarity.

1. INTRODUCTION

In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and of how those actions depend on, or affect, the properties of the object. Gibson (1979) coined the term "affordance" to describe what the environment "provides or furnishes the animal". Norman (2013) developed the term to focus on the properties of objects that determine the action possibilities. The notion of "affordance" thus emerges from the relationship between the properties of objects and human actions. If we consider "object" as meaning anything concrete that one might interact with in the environment, there will be thousands of possibilities, both animate and inanimate (see WordNet (Miller, 1998)). The same is true if we consider "action" as meaning any verb that might be applied to the noun naming an object (see VerbNet (Schuler, 2005)). Intuitively, only a relatively small fraction of all possible combinations of object and action will be plausible. Of those, many will also be trivial, e.g. "see" or "have" may apply to almost every object. Finally, different actions might reflect a similar mode of interaction, depending on the type of object they are applied to (e.g. "chop" and "slice" are distinct actions, but both are used in food preparation).

Mental representations of objects encompass many aspects beyond function. Several studies (McRae et al., 2005; Devereux et al., 2014; Hovhannisyan et al., 2020) have asked human subjects to list binary properties for hundreds of objects, yielding thousands of answers. Properties could be taxonomic (category), functional (purpose), encyclopedic (attributes), or visual-perceptual (appearance), among other groups. While some properties were affordances in themselves (e.g. "edible"), others reflected many affordances at once (e.g. "is a vegetable" implies that it could be planted, cooked, sliced, etc.).

More recently, Zheng et al. (2019) and Hebart et al. (2020) introduced SPoSE, a model of the mental representations of objects. The model was derived from a dataset of 1.5M Amazon Mechanical Turk (AMT) judgments of object similarity, where subjects were asked which of a random triplet of objects was the odd one out. The model was an embedding for objects in which each dimension was constrained to be sparse and positive, and triplet judgments were predicted as a function of the similarity between the embedding vectors of the three objects considered. The authors showed that these dimensions were predictable as a combination of elementary properties in the Devereux et al. (2014) norm that often co-occur across many objects. Hebart et al. (2020) further showed that human subjects could coherently label what the dimensions were "about", ranging from categorical (e.g. is animate, food, drink, building) to functional (e.g. container, tool) or structural (e.g. made of metal or wood, has inner structure), and that subjects could predict what dimension values new objects would have, based on knowing the dimension values for a few other objects. SPoSE is unusual in its wide coverage (1,854 objects) and in having been validated on independent behavioral data.

Our first goal is to produce an analogous affordance embedding space for objects, where each dimension groups together actions corresponding to a particular "mode of interaction". Our second goal is to understand the degree to which affordance knowledge underlies the mental representation of objects, as instantiated in SPoSE. In this paper, we introduce and evaluate an approach for achieving both of these goals. Our approach is based on the hypothesis that, if a set of verbs apply to the same objects, they apply for similar reasons. We start by identifying applications of action verbs to nouns naming objects in large text corpora.
We then use the resulting dataset to produce an embedding that represents each object as a vector in a low-dimensional space, where each dimension groups verbs that tend to be applied to similar objects. We do this for larger lists of objects and action verbs than previous studies (thousands of each). Combining the weights each dimension assigns to a verb yields a ranking over verbs for each object. We show that this ranking predicts which verbs are applicable to a given object, as captured in human judgments of affordance. Further, we show that the learned dimensions are interpretable, grouping together verbs that would all typically occur during certain complex interactions with objects. Finally, we show that they can be used to predict most dimensions of the SPoSE representation, in particular those that are categorical or functional. This suggests that affordance knowledge underlies much of the mental representation of objects, in particular semantic categorization.
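As a concrete illustration, the embedding step above can be sketched with a non-negative factorization of an object-by-verb count matrix. This is a minimal sketch, not our exact model: the synthetic counts, the log transform, the number of dimensions, and the use of scikit-learn's NMF are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical verb-object co-occurrence counts: rows are objects,
# columns are verbs (synthetic data stands in for corpus counts).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 200)).astype(float)

# Log-transform to dampen raw frequency effects (an illustrative choice).
X = np.log1p(counts)

# Non-negative factorization: each of the k dimensions is a weighted
# group of verbs (rows of H), and each object gets non-negative
# loadings on the dimensions (rows of W).
k = 10
nmf = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # objects x dimensions
H = nmf.components_        # dimensions x verbs

# Ranking verbs for one object: combine its dimension loadings with
# each dimension's verb weights, then sort.
scores = W[0] @ H                       # applicability score per verb
top_verbs = np.argsort(scores)[::-1][:5]  # indices of top-ranked verbs
```

The non-negativity constraint is what makes each dimension read as a group of verbs that co-apply to the same objects, mirroring the "mode of interaction" interpretation.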
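For reference, the SPoSE-style triplet prediction mentioned above can be sketched as a softmax over pairwise similarities of sparse, positive embedding vectors. The embedding values below are synthetic, and the dot-product similarity and softmax choice rule are stated assumptions rather than a faithful reimplementation.

```python
import numpy as np

# Hypothetical sparse, positive SPoSE-style embedding for 4 objects.
rng = np.random.default_rng(1)
E = np.abs(rng.normal(size=(4, 6)))  # objects x dimensions, non-negative

def odd_one_out_probs(i, j, k, E):
    """Probability that each of objects i, j, k is the odd one out.

    Each pair is scored by the dot product of its embeddings; the pair
    judged most similar stays together, so the remaining object is the
    odd one out. Probabilities come from a softmax over pair scores.
    """
    s_ij = E[i] @ E[j]   # high similarity of (i, j) -> k is odd one out
    s_ik = E[i] @ E[k]   # -> j is odd one out
    s_jk = E[j] @ E[k]   # -> i is odd one out
    scores = np.array([s_jk, s_ik, s_ij])   # P(odd) for i, j, k
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

p = odd_one_out_probs(0, 1, 2, E)  # three probabilities summing to 1
```

Fitting such a model consists of adjusting E so that the predicted odd-one-out probabilities match the observed AMT choices, under sparsity and positivity constraints.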

2. RELATED WORK

The problem of determining, given an action and an object, whether the action can apply to the object was defined as "affordance mining" in Chao et al. (2015). The authors proposed complementary methods for solving it by predicting a plausibility score for each combination of object and action. The best methods used word co-occurrences in two ways: n-gram counts of verb-noun pairs, or similarity between verb and noun vectors in Latent Semantic Analysis (Deerwester et al., 1990) or Word2Vec (Mikolov et al., 2013) word embeddings. For evaluation, they collected AMT judgments of plausibility ("is it possible to <verb> a <object>?") for every combination of 91 objects and 957 action verbs. The authors found they could retrieve a small number of affordances for each item, but precision dropped quickly as recall increased.

Subsequent work (Rubinstein et al., 2015; Lucy & Gauthier, 2017; Utsumi, 2020) predicted properties of objects in the norms above from word embeddings (Mikolov et al., 2013; Pennington et al., 2014), albeit without a focus on affordances. Forbes et al. (2019) extracted 50 properties (some of which were affordances) from Devereux et al. (2014), for a set of 514 objects, to generate positive and negative examples for 25,700 combinations. They used these data to train a small neural network to predict the properties. The input to the network was either the product of the vectors for object and property, if using word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Levy & Goldberg, 2014), or the representation of a synthesized sentence combining them, if using contextualized embeddings (Peters et al., 2018; Devlin et al., 2018). They found that the latter outperformed the former for property prediction, but neither allowed reliable affordance prediction.

In addition to object/action plausibility prediction, Ji et al. (2020) addressed the problem of determining whether an object1/action/object2 triple (with object2 the target of the action applied to object1) was feasible. They selected a set of 20 actions from Chao et al. (2015) and combined them with the 70 most frequent objects in ConceptNet (Speer & Havasi, 2012) into 1,400 object/action pairs, which were then labelled as plausible or not; given rater disagreements, this yielded 330 positive pairs and 1,070 negative ones. They then combined the positive pairs with other objects as "tails" (recipients of the action), yielding 3,900 triplets. They reached F1 scores of 0.81 and 0.52 on the two problems, respectively.

Other papers focus on identifying the visual features of objects that predict affordances, e.g. Myers et al. (2015); Sawatzky et al. (2017); Wang & Tarr (2020). The last of these collected affordance judgments on AMT ("what can you do with <object>?") for 500 objects and harmonized them with WordNet synsets for 334 action verbs. To validate the rankings of verb applicability predicted by our model, we will use the datasets from Chao et al. (2015) and Wang & Tarr (2020), as they are the largest available human-rated datasets.

In robotics research, affordance refers to the relation between agent, action, and environment, under the constraints of the agent's motor and sensing capabilities (Lopes et al., 2007). Affordance modeling for robotics has been studied extensively; see recent surveys (Jamone et al., 2016; Zech et al., 2017; Hassanin et al., 2018). Due to the restriction in action possibilities and the complexity of

