AN EXTENSIBLE MULTIMODAL MULTI-TASK OBJECT DATASET WITH MATERIALS

Abstract

We present EMMa, an Extensible, Multimodal dataset of Amazon product listings with rich Material annotations. EMMa contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and a position in Amazon's product-category taxonomy. We also design a comprehensive taxonomy of 182 physical materials (e.g., Plastic → Thermoplastic → Acrylic), and each object is annotated with one or more materials from this taxonomy. Building on the numerous attributes available for each object, we develop a Smart Labeling framework that quickly adds new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute can be included in either the model inputs or the model outputs, leading to combinatorially many task configurations: for example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.
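The combinatorial task configurations mentioned above can be made concrete with a short sketch. This is hypothetical illustration code, not from the paper, and the attribute names are only indicative of EMMa's schema: every split of the attribute set into disjoint, nonempty input and output subsets defines one task.

```python
from itertools import combinations

# Illustrative attribute names; EMMa's actual schema may differ.
ATTRIBUTES = ("images", "text", "mass", "price", "ratings", "category", "materials")

def task_configurations(attributes):
    """Yield every (inputs, outputs) pair formed by choosing a nonempty
    set of input attributes and a disjoint, nonempty set of outputs."""
    for k in range(1, len(attributes)):
        for inputs in combinations(attributes, k):
            remaining = [a for a in attributes if a not in inputs]
            for m in range(1, len(remaining) + 1):
                for outputs in combinations(remaining, m):
                    yield inputs, outputs

configs = list(task_configurations(ATTRIBUTES))
# Each attribute is an input, an output, or unused, minus the degenerate
# all-input / all-output / all-unused cases: 3^7 - 2 * 2^7 + 1 = 1932
# distinct configurations for 7 attributes.
print(len(configs))
# The abstract's example: predict mass and price from the listing image.
assert (("images",), ("mass", "price")) in configs
```

Even a handful of per-object attributes therefore yields thousands of candidate tasks, which is what makes EMMa unusual as a multi-task benchmark.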

1. INTRODUCTION

Perhaps the biggest problem faced by machine learning practitioners today is that of producing labeled datasets for their specific needs. Manually labeling large amounts of data is time-consuming and costly (Deng et al., 2009; Lin et al., 2014; Kuznetsova et al., 2020). Furthermore, it is often not possible to communicate to human annotators how the many ambiguous corner cases should be handled (e.g., is a hole puncher "sharp"?). Could we solve this problem with the aid of machine learning? We hypothesized that, given a rich dataset with substantial information about every instance, we could accurately add new properties to every instance in a semi-automated fashion.

Consequently, we developed EMMa, a large, object-centric, multimodal, multi-task dataset, and we show that it can easily be extended with any number of new object labels using a Smart Labeling technique we developed for large multi-task and multimodal datasets. (Multi-task datasets contain labels for more than one attribute per instance, whereas multimodal datasets contain data from more than one modality, such as images, text, audio, and tabular data.) Derived from Amazon product listings, EMMa contains images, text, and a number of useful attributes, such as materials, mass, price, product category, and product ratings. Each attribute can be used as either a model input or a model output, and models trained on these attributes could be useful to roboticists, recycling facilities, consumers, marketers, retailers, and product developers.

Furthermore, we believe that EMMa will make a strong multi-task benchmark for both computer vision and NLP. EMMa has many diverse CV tasks, such as material prediction, mass prediction, and taxonomic classification. Currently, most multi-task learning for NLP is done on corpora in which each input sentence is labeled for only a single task.
In contrast, EMMa offers unique tasks, such as predicting both product ratings and product price from the same product listing text.

One important contribution of this work is that EMMa lists each object's material composition. No existing materials dataset has more than a few dozen material types, and no other large materials dataset is object-centric. This prevents models from being able to learn important distinctions about the materials from which an object is made. In contrast, each object in EMMa is annotated with one or more materials from a hand-curated taxonomy of 182 material types (see Figure 2).

Armed with a dataset containing such rich annotations, we developed a technique for adding high-quality binary object properties to EMMa with minimal manual labeling effort by leveraging all available data for each object. Our technique employs active learning and a powerful object embedding. We envision practitioners adding the labels themselves for their own use cases, obviating the substantial work typically required to obtain and curate high-quality data labels from crowdsourcing services such as Amazon Mechanical Turk.

Our main contributions are threefold. First, we present EMMa, a large-scale, multimodal, multi-task object dataset that contains more than 2.8 million objects.
Second, our dataset is labeled in accordance with a hand-curated taxonomy of 182 material types, far larger than that of any existing materials dataset. Third, we propose a Smart Labeling pipeline that allows practitioners to easily add new binary labels of interest to the whole dataset with only hours of labeling effort.
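The Smart Labeling pipeline is only summarized here (active learning over a shared object embedding). As a rough, hedged illustration of the core loop, the sketch below uses uncertainty sampling with a simple nearest-prototype scorer; the function names, toy 2-D embeddings, and prototype classifier are our own illustrative assumptions, not the paper's actual components, which are substantially richer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def centroid(vectors):
    """Component-wise mean of a nonempty list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def smart_label(embeddings, oracle, seed_pos, seed_neg, budget):
    """Active-learning loop: repeatedly ask the oracle (a human labeler)
    about the most uncertain object, then propagate labels to the rest.
    `embeddings` maps object id -> vector; `oracle` maps id -> bool."""
    labeled = {i: True for i in seed_pos} | {i: False for i in seed_neg}
    for _ in range(budget):
        pos = centroid([embeddings[i] for i, y in labeled.items() if y])
        neg = centroid([embeddings[i] for i, y in labeled.items() if not y])
        unlabeled = [i for i in embeddings if i not in labeled]
        if not unlabeled:
            break
        # Smallest margin between the two prototypes = most uncertain.
        query = min(unlabeled, key=lambda i: abs(
            cosine(embeddings[i], pos) - cosine(embeddings[i], neg)))
        labeled[query] = oracle(query)  # one manual label per iteration
    # Propagate: label every remaining object by its nearer prototype.
    pos = centroid([embeddings[i] for i, y in labeled.items() if y])
    neg = centroid([embeddings[i] for i, y in labeled.items() if not y])
    return {i: labeled.get(i, cosine(v, pos) >= cosine(v, neg))
            for i, v in embeddings.items()}

# Toy example for a hypothetical "sharp" property: sharp objects cluster
# in one region of the embedding space, non-sharp objects in another.
embeddings = {"knife": [1.0, 0.1], "scissors": [0.9, 0.2],
              "pillow": [0.1, 1.0], "towel": [0.05, 0.9],
              "puncher": [0.7, 0.4]}
oracle = lambda i: i in {"knife", "scissors", "puncher"}
labels = smart_label(embeddings, oracle,
                     seed_pos=["knife"], seed_neg=["pillow"], budget=2)
```

With a budget of two oracle queries the loop labels the two most ambiguous objects by hand and the prototype classifier fills in the rest, which is the spirit of extending all 2.8 million objects with hours rather than months of annotation effort.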

2. RELATED WORK

Multi-Task Learning Datasets Many large computer vision multi-task datasets, such as Taskonomy (Zamir et al., 2018), 3D Scene Graph (Armeni et al., 2019), and Omnidata (Eftekhar et al., 2021), offer a large number of tasks for each image. However, unlike EMMa, these datasets are not object-centric and focus on pixel-level prediction tasks. Other multi-task datasets, such as COCO (Lin et al., 2014), NYUv2 (Nathan Silberman & Fergus, 2012), and Cityscapes (Cordts et al., 2016), are relatively small and have only a few tasks, while still others are either artificial, such as MultiMNIST (Sabour et al., 2017), or focused on a restricted domain such as human faces (Liu et al., 2015).

Learning from Amazon Data Our work is not the first to use Amazon data in a machine learning context. For example, the ABO dataset (Collins et al., 2022) provides listings for 150k objects, about 8k of which have 3D models. Unfortunately, nearly half of the 150k objects are cell phone cases, and the data provided for most objects is in the raw form provided by Amazon. Likewise, image2mass (Standley et al., 2017) curates a dataset of Amazon listings for the purpose of predicting an object's weight given its image, but the processed dataset supports only that single task. The UCSD Amazon review data (Ni et al., 2019) is quite large, containing raw data for 15.5 million products, but focuses on product reviews. We incorporate raw data from the image2mass and UCSD datasets into our dataset, alongside new data we collected from Amazon. As far as we know, we are the first to take advantage of the bulk of the information in Amazon product listings for machine learning.



Figure 1: We introduce EMMa, an object-centric, multimodal, multi-task dataset of Amazon product listings that contains over 2.8 million objects. Each object in EMMa is accompanied by images, listing text, mass, price, product ratings, position in Amazon's product-category taxonomy, etc. We also introduce a Smart Labeling technique that allows practitioners to easily extend the entire dataset with new binary properties of interest (e.g., sharpness, transparency, deformability, etc.) with only hours of labeling effort.

