DISENTANGLING ACTION SEQUENCES: FINDING CORRELATED IMAGES

Abstract

Disentanglement is a highly desirable property of representations because of its resemblance to human understanding and reasoning. It improves interpretability, benefits downstream tasks, and enables controllable generative models. However, the field is still challenged by the abstract notion of disentanglement and by incomplete theoretical support for unsupervised disentanglement learning. We demonstrate that the data itself, such as the orientation of images, rather than the ground-truth factors, plays a crucial role in disentanglement, and that disentangled representations align the latent variables with action sequences. We further introduce the concept of disentangling action sequences, which facilitates describing the behaviours of existing disentangling approaches. An analogy for this process is discovering the commonality among things and categorizing them. Furthermore, we analyze the inductive biases on the data and find that the latent information thresholds are correlated with the significance of the actions. For the supervised and unsupervised settings, we introduce two methods, respectively, to measure these thresholds. We then propose a novel framework, the fractional variational autoencoder (FVAE), to disentangle action sequences of different significance step by step. Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.

1. INTRODUCTION

The basis of artificial intelligence is to understand and reason about the world from a limited set of observations. Unsupervised disentanglement learning is highly desirable because of its similarity to the way we as humans think. For instance, we can infer the movement of a running ball from a single glance, because the human brain is capable of disentangling positions from a set of images. It has been suggested that a disentangled representation is helpful for a large variety of downstream tasks (Schölkopf et al., 2012; Peters et al., 2017). According to Kim & Mnih (2018), a disentangled representation promotes interpretable semantic information. This brings substantial advances, including but not limited to reducing the performance gap between humans and AI approaches (Lake et al., 2017; Higgins et al., 2018). Other applications of disentangled representations include semantic image understanding and generation (Lample et al., 2017; Zhu et al., 2018; Elgammal et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b). Despite the advantages of disentangled representation approaches, two issues remain to be addressed: the abstract notion and the weak explanations.

Notion. The concept of disentangling factors of variation was first proposed in 2013. Bengio et al. (2013) claim that, for given observations, the considered factors should be explanatory and independent of each other. Explanatory factors are, however, hard to formalize and measure. An alternative is to disentangle the ground-truth factors (Ridgeway, 2016; Do & Tran, 2020). However, if the ground-truth factors are assumed to be unique, the question that arises is how to discover them among multiple equivalent representations. As the proverb goes, "one cannot make bricks without straw": Locatello et al. (2019) prove the impossibility of disentangling factors without the help of inductive biases in the unsupervised setting.

Explanation. There are mainly two types of explanations for unsupervised disentanglement: the information bottleneck and the independence assumption. The ground-truth factors affect the data

