UNSUPERVISED HIERARCHICAL CONCEPT LEARNING

Abstract

Concepts, or temporal abstractions, are an essential aspect of human learning. They allow for succinct representations of the experiences we gather through a variety of sensory inputs. Moreover, these concepts are arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Analogously, we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions without explicit temporal supervision. Additionally, our method produces a hierarchy of concepts that aligns more closely with ground-truth human-annotated events than several state-of-the-art supervised and unsupervised baselines in complex visual domains such as chess and cooking demonstrations. We illustrate the utility of the abstracted concepts in downstream tasks such as captioning and reasoning. Finally, we perform several ablation studies illustrating the robustness of our approach to data scarcity.

1. INTRODUCTION

Consider a video (Figure 1) that demonstrates how to cook an egg. Humans subconsciously learn concepts (such as boiling water) that describe different skills in such demonstrations Pammi et al. (2004). These learned skills can be composed and reused in different ways to learn new concepts. Discovering such concepts automatically from demonstration data is a non-trivial problem. Shankar et al. (2019) introduce a sequence-to-sequence architecture that clusters long-horizon action trajectories into shorter temporal skills. However, their approach treats skills as independent concepts. In contrast, humans organize these concepts in hierarchies where lower-level concepts can be grouped to define higher-level concepts Naim et al. (2019). We extend the architecture of Shankar et al. (2019) to simultaneously discover concepts along with their hierarchical organization, without any supervision. We propose UNHCLE, an end-to-end trainable architecture for hierarchical representation learning from demonstrations. UNHCLE takes as input a long-horizon trajectory of high-dimensional images demonstrating a complex task (in our case, chess and cooking) together with the associated textual commentary, and isolates semantically meaningful subsequences in the input trajectories. We emphasize that it does not require temporal annotations linking subsequences in the image trajectories to the free-flowing commentary; instead, it autonomously discovers this mapping. This work therefore takes a step towards unsupervised understanding of high-dimensional video data. Our contributions can be summarized as follows:

• We introduce a transformer-based architecture that learns a multi-modal hierarchical latent embedding space encoding the various concepts in long-horizon demonstration trajectories. UNHCLE abstracts these concepts (shown through qualitative visual analysis) without requiring any temporal supervision, i.e., it divides long-horizon trajectories into semantically meaningful subsequences without access to any temporal annotations that split these trajectories optimally.

• We show the quantitative effectiveness of learning high-level concepts hierarchically rather than in isolation, while outperforming several baselines on YouCook2 (Zhou et al. (2017)) and the Chess Opening dataset¹.

• We further introduce a mechanism to incorporate the commentary accompanying demonstrations into UNHCLE and show improvements in the hierarchical concepts discovered.

• We introduce Time-Warped IoU (TW-IoU), an evaluation metric that we use to compare the alignment of our discovered concepts and ground-truth events.

Existing approaches to representation learning for demonstrations or videos typically require significant supervision. Typically, sequence-to-sequence architectures are trained on datasets segmented by humans; during inference, these architectures generate timestamp proposals that segment the input trajectory into semantically meaningful sequences. Such complex sequence-to-sequence models require significant amounts of annotated data, making them costly to train. More generally, video and scene understanding is an important research area with wide-ranging applications. Most recently, Chen et al. (2019) utilize semantic awareness to perform complex depth estimation tasks, acquiring the geometric properties of 3-dimensional space from 2-dimensional images. Tosi et al. (2020) utilize similar semantic information for depth estimation, optical flow, and motion segmentation. Boggust et al. (2019) attempt to ground words in video, but apply significant supervision to synchronize them, requiring human intervention.
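To make the segment-alignment evaluation concrete, the following is a rough, hypothetical sketch of this kind of scoring: a plain temporal IoU between each ground-truth event and its best-overlapping discovered segment. It is not the exact TW-IoU computation (the time-warping is not specified in this section), and all function and variable names are our own, chosen for illustration.

```python
def interval_iou(a, b):
    """IoU of two (start, end) intervals on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_segment_iou(discovered, ground_truth):
    """For each ground-truth event, take the best-overlapping
    discovered segment and average the resulting IoUs."""
    if not ground_truth:
        return 0.0
    return sum(max(interval_iou(gt, d) for d in discovered)
               for gt in ground_truth) / len(ground_truth)

# Example: two ground-truth events, three discovered segments (seconds).
gt = [(0, 10), (10, 30)]
pred = [(0, 8), (8, 20), (20, 30)]
print(round(mean_segment_iou(pred, gt), 3))  # → 0.65
```

A warping-aware variant would additionally allow small temporal offsets between discovered and annotated boundaries before computing the overlap; the plain version above is the natural baseline against which such a relaxation can be compared.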
We attempt to learn similar embeddings, but do so in a completely unsupervised manner, without utilizing any of the available temporal labels. The field of learning from demonstrations (Nicolescu & Mataric (2003)) seeks to learn to perform tasks from a set of demonstrated behaviors. Behavioral cloning is one popular scheme (Esmaili et al. (1995)). Atkeson & Schaal (1997) and Schaal (1997) show how agents can learn simple tasks such as cartpole directly from demonstrations, while subsequent work addresses generalizing to new instances of manipulation tasks in the few-shot regime by abstracting away low-level controls. However, all of these approaches require an environment, i.e., a transition and reward function, to learn from. In contrast, humans show an ability to learn simply by watching demonstrations, which we attempt to replicate. Temporal abstraction of action sequences, or skill/primitive learning, is also a related field. Eysenbach et al. (2018) learn a large number of low-level action sequences by forcing the agent to produce skills that differ from those previously acquired. However, due to this diversity bias, the agent ends up learning many useless skills that cannot be used for any semantically meaningful task. Similarly, Sharma et al. (2019) attempt to learn skills whose transitions are almost deterministic in a given environment. These approaches also require access to an environment, whereas we learn without one.
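For context, the diversity objective of Eysenbach et al. (2018) (our paraphrase of their DIAYN objective, with skills $z$, states $s$, and actions $a$) maximizes the mutual information between skills and visited states while keeping the policy as random as possible given the skill:

$$\mathcal{F}(\theta) \;=\; I(S;Z) + \mathcal{H}[A \mid S] - I(A;Z \mid S) \;=\; \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z]$$

Because nothing in this objective ties skills to task semantics, maximizing discriminability alone can yield many distinct but semantically useless skills, which is the diversity bias noted above.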



¹ https://www.kaggle.com/residentmario/recommending-chess-openings



Figure 1: Overview of our approach. UNHCLE learns a hierarchical latent space of concepts describing long-horizon tasks like cooking and chess gameplay.

Pastor et al. (2009) also study how robots can learn from human demonstrations of tasks. Peters et al. (2013) and Kober & Peters (2009) fit a parametric model to the demonstrations. Niekum et al. (2012), Murali et al. (2016), and Meier et al. (2011) first segment trajectories into subsequences and then fit a parametric model to each subsequence. More recently, Schmeckpeper et al. (2019) show that agents can learn to maximize an external reward using a large corpus of observation data, i.e., trajectories of states, together with a relatively smaller corpus of interaction data, i.e., trajectories of state-action pairs. Hierarchical task representations have been studied as well. Instead of treating demonstrations in a flat manner, one may also infer their hierarchical structure, either directly (Xu et al. (2018); Sun et al. (2018)) or as task graphs (Huang et al. (2019)). Both Xu et al. (2018) and Huang et al. (

