UNSUPERVISED HIERARCHICAL CONCEPT LEARNING

Abstract

Concepts, or temporal abstractions, are an essential aspect of human learning. They allow for succinct representations of the experiences we acquire through a variety of sensory inputs. Moreover, these concepts are arranged hierarchically, allowing for an efficient representation of complex, long-horizon experiences. Analogously, we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions without explicit temporal supervision. Our method produces a hierarchy of concepts that aligns more closely with ground-truth human-annotated events than several state-of-the-art supervised and unsupervised baselines in complex visual domains such as chess and cooking demonstrations. We illustrate the utility of the abstracted concepts in downstream tasks such as captioning and reasoning. Finally, we perform several ablation studies illustrating the robustness of our approach to data scarcity.

1. INTRODUCTION

Consider a video (Figure 1) that demonstrates how to cook an egg. Humans subconsciously learn concepts (or skills), such as boiling water, that describe distinct segments of such demonstrations (Pammi et al., 2004). These learned skills can be composed and reused in different ways to acquire new concepts. Discovering such concepts automatically from demonstration data is a non-trivial problem. Shankar et al. (2019) introduce a sequence-to-sequence architecture that clusters long-horizon action trajectories into shorter temporal skills; however, their approach treats skills as independent concepts. In contrast, humans organize these concepts in hierarchies, where lower-level concepts can be grouped to define higher-level ones (Naim et al., 2019). We extend the architecture of Shankar et al. (2019) to simultaneously discover concepts along with their hierarchical organization without any supervision.

We propose UNHCLE, an end-to-end trainable architecture for hierarchical representation learning from demonstrations. UNHCLE takes as input a long-horizon trajectory of high-dimensional images demonstrating a complex task (in our case, chess and cooking) together with the associated textual commentary, and isolates semantically meaningful subsequences in the input trajectories. We emphasize that it does not require temporal annotations linking subsequences of images to the free-flowing commentary; instead, it discovers this mapping autonomously. This work therefore takes a step towards unsupervised video understanding of high-dimensional data.

Our contributions can be summarized as follows:

• We introduce a transformer-based architecture that learns a multi-modal hierarchical latent embedding space encoding the various concepts in long-horizon demonstration trajectories.
UNHCLE abstracts these concepts (shown through qualitative visual analysis) without requiring any temporal supervision, i.e., it divides long-horizon trajectories into semantically meaningful subsequences without access to temporal annotations specifying optimal splits.

• We quantitatively show that learning high-level concepts hierarchically is more effective than learning them in isolation, while outperforming several baselines on YouCook2 (Zhou et al. (2017)) and a Chess Opening dataset¹.

• We further introduce a mechanism to incorporate the commentary accompanying demonstrations into UNHCLE and show improvements in the hierarchical concepts discovered.
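To make the idea of unsupervised hierarchical segmentation concrete, the following is a minimal, hypothetical sketch (not the UNHCLE architecture itself): given per-frame embeddings, it greedily cuts a trajectory wherever consecutive embeddings become dissimilar, then mean-pools each segment and re-segments the pooled embeddings to obtain a coarser level of concepts. All function names and thresholds here are illustrative assumptions.

```python
import numpy as np


def segment_by_similarity(embeddings, threshold):
    """Greedily cut a sequence of embeddings wherever the cosine
    similarity between consecutive items drops below `threshold`.
    Returns a list of (start, end) index pairs, end exclusive.
    This is an illustrative stand-in for learned segmentation."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    boundaries = [0]
    for t in range(1, len(embeddings)):
        if float(norms[t - 1] @ norms[t]) < threshold:
            boundaries.append(t)
    boundaries.append(len(embeddings))
    return list(zip(boundaries[:-1], boundaries[1:]))


def build_hierarchy(frame_embeddings, thresholds):
    """Apply segmentation repeatedly: segments at one level are
    mean-pooled and re-segmented to obtain the next, coarser level,
    yielding a hierarchy of temporal abstractions."""
    levels, current = [], frame_embeddings
    for th in thresholds:
        segs = segment_by_similarity(current, th)
        levels.append(segs)
        current = np.stack([current[s:e].mean(axis=0) for s, e in segs])
    return levels
```

On a toy trajectory with three visually distinct phases, the first level recovers the three phases and the second level merges the two more similar phases into one higher-level concept; a learned model would replace the fixed similarity thresholds with trainable components.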



¹ https://www.kaggle.com/residentmario/recommending-chess-openings

