DISENTANGLING ACTION SEQUENCES: FINDING CORRELATED IMAGES

Abstract

Disentanglement is a highly desirable property of representations because of its similarity to human understanding and reasoning. It improves interpretability, facilitates downstream tasks, and enables controllable generative models. However, the field is held back by the abstractness of the notion and by incomplete theories of unsupervised disentanglement learning. We demonstrate that the data itself, such as the orientation of images, plays a crucial role in disentanglement rather than the ground-truth factors, and that the disentangled representations align the latent variables with action sequences. We further introduce the concept of disentangling action sequences, which facilitates describing the behaviours of existing disentangling approaches. An analogy for this process is discovering the commonalities between things and categorizing them. Furthermore, we analyze the inductive biases on the data and find that the latent information thresholds are correlated with the significance of the actions. For the supervised and unsupervised settings, we respectively introduce two methods to measure these thresholds. We further propose a novel framework, the fractional variational autoencoder (FVAE), which disentangles action sequences of different significance step by step. Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.

1. INTRODUCTION

The basis of artificial intelligence is to understand and reason about the world from a limited set of observations. Unsupervised disentanglement learning is highly desirable because of its similarity to the way we as humans think. For instance, we can infer the movement of a running ball from a single glance, because the human brain is capable of disentangling position from a set of images. It has been suggested that a disentangled representation is helpful for a large variety of downstream tasks (Schölkopf et al., 2012; Peters et al., 2017). According to Kim & Mnih (2018), a disentangled representation promotes interpretable semantic information. That brings substantial advancement, including but not limited to reducing the performance gap between humans and AI approaches (Lake et al., 2017; Higgins et al., 2018). Other instances of disentangled representations include semantic image understanding and generation (Lample et al., 2017; Zhu et al., 2018; Elgammal et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b). Despite the advantages of disentangling approaches, two issues remain to be addressed: the abstract notion and the weak explanations.

Notion. The concept of disentangling factors of variation was first proposed by Bengio et al. (2013), who claim that the factors considered for an observation should be explanatory and independent of each other. Explanatory factors are, however, hard to formalize and measure. An alternative is to disentangle the ground-truth factors (Ridgeway, 2016; Do & Tran, 2020). However, if we consider the uniqueness of the ground-truth factors, the question arises of how to discover them among multiple equivalent representations. As the proverb goes, "one cannot make bricks without straw": Locatello et al. (2019) prove the impossibility of disentangling factors without the help of inductive biases in the unsupervised setting.

Explanation. There are mainly two types of explanations for unsupervised disentanglement: the information bottleneck and the independence assumption. Since the ground-truth factors affect the data independently, the disentangled representations should follow the same structure. Approaches holding the independence assumption therefore encourage independence between the latent variables (Schmidhuber, 1992; Chen et al., 2018; Kim & Mnih, 2018; Kumar et al., 2018; Lopez et al., 2018). However, real-world problems place no strict constraint on the independence assumption, and the factors may be correlated. The other explanation incorporates information theory into disentanglement. Burgess et al.; Higgins et al.; Insu Jeon et al.; Saxe et al. suggest that a limit on the capacity of the latent information channel promotes disentanglement by forcing the model to acquire the most significant latent representation. They further hypothesize that the information bottleneck enforces the model to find the most significant improvement.

In this paper, we first demonstrate that, rather than the ground-truth factors, the disentangling approaches learn actions of translating based on the orientation of the images. We then propose the concept of disentangling actions, which discovers the commonalities between the images and categorizes them into sequences. We treat disentangling action sequences as a necessary step toward disentangling factors; it can capture the internal relationships within the data and makes it possible to analyze the inductive biases from the data perspective. Furthermore, the results on a toy example show that the significance of an action is positively correlated with its threshold of latent information. We then extend that conclusion to complex problems.
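The capacity-limit idea behind the information-bottleneck explanation can be made concrete with a small sketch. The following is a minimal NumPy illustration of a capacity-annealed objective in the style of Burgess et al., where the total KL divergence of the latent channel is pushed toward a target capacity C; the function name and default values are illustrative, not taken from any particular implementation.

```python
import numpy as np

def capacity_beta_vae_loss(recon_loss, kl_per_dim, gamma=100.0, capacity=5.0):
    """Capacity-annealed beta-VAE-style objective: the total KL of the
    latent channel is penalized for deviating from a target capacity C,
    so the channel admits only the most significant latent information.

    recon_loss:  scalar reconstruction term, E[-log p(x|z)]
    kl_per_dim:  array of KL(q(z_i|x) || p(z_i)) per latent dimension
    gamma:       penalty weight; capacity: target channel capacity (nats)
    """
    total_kl = float(np.sum(kl_per_dim))
    return recon_loss + gamma * abs(total_kl - capacity)

# When the total KL matches the capacity, only the reconstruction term remains:
loss = capacity_beta_vae_loss(1.0, np.array([2.0, 3.0]), gamma=10.0, capacity=5.0)
```

Annealing `capacity` upward during training is what is meant above by gradually widening the bottleneck so that factors are acquired in order of significance.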
Our contributions are summarized as follows:

• We show that the significance of an action is related to the capacity of the learned latent information, resulting in different thresholds for the factors.

• We propose a novel framework, the fractional variational autoencoder (FVAE), which extracts explanatory action sequences step by step; at each step, it learns specific actions by blocking the information of the others.

We organize the rest of this paper as follows. Sec. 2 describes the development of unsupervised disentanglement learning and the proposed methods based on VAEs. In Sec. 3, through an example, we show that the disentangled representations are relative to the data itself, and we further introduce a novel concept, disentangling action sequences. We then investigate the inductive biases on the data and find that significant actions have high thresholds of latent information. In Sec. 4, we propose a step-by-step disentangling framework, the fractional VAE (FVAE), to disentangle action sequences. For the labelled and unlabelled tasks, we respectively introduce two methods to measure the thresholds. We then evaluate FVAE on a labelled dataset (dSprites, Matthey et al. (2017)) and an unlabelled dataset (3D Chairs, Aubry et al. (2014)). Finally, we conclude the paper and discuss future work in Sec. 5.
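As a rough illustration of the step-by-step idea in the second contribution, the sketch below partitions the latent dimensions into groups and exposes one additional group per training stage while masking the rest. This is a hypothetical simplification of how one might block the information of other actions; it is not the exact FVAE mechanism, which is described in Sec. 4.

```python
import numpy as np

def stagewise_latent_mask(n_latents, n_stages, stage):
    """Return a 0/1 mask over latent dimensions for a given training stage:
    groups learned in this or earlier stages stay active, later groups are
    blocked (zeroed). The even split into groups is an illustrative choice."""
    groups = np.array_split(np.arange(n_latents), n_stages)
    mask = np.zeros(n_latents)
    for s in range(stage + 1):
        mask[groups[s]] = 1.0
    return mask

# During stage 0, only the first latent group can carry information,
# e.g. z_masked = z * stagewise_latent_mask(10, 5, 0)
```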

2. UNSUPERVISED DISENTANGLEMENT LEARNING

We first introduce the abstract concepts and basic definitions, followed by the explanations based on information theory and other related work. This article focuses on the information-theoretic explanation and the proposed models based on VAEs.

2.1. THE CONCEPT

Disentanglement learning is fascinating and challenging because of its intrinsic similarity to human intelligence. As depicted in the seminal paper by Bengio et al., humans can understand and reason from a complex observation to the explanatory factors. A common modelling assumption of disentanglement learning is that the observed data are generated by a set of ground-truth factors. The data usually have a high number of dimensions and are therefore hard to understand, whereas the factors have a low number of dimensions and are thus simpler and easier to understand. The task of disentanglement learning is to uncover the ground-truth factors. Such factors are invisible to the training process in an unsupervised setting, and this invisibility makes disentanglement hard to define and measure (Do & Tran, 2020). Furthermore, it is shown in Locatello et al. (2019) that it is impossible to disentangle the underlying factors of arbitrary generative models in an unsupervised fashion without inductive biases. In particular, they suggest that the inductive biases on both the models and the data should be exploited. However, they do not provide a formal definition of inductive bias, and such a definition is still unavailable.
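The modelling assumption above, that a few low-dimensional factors generate a high-dimensional observation, can be illustrated with a toy generative process. The single-pixel renderer below is purely hypothetical and far simpler than real datasets such as dSprites; it only shows the direction of the mapping the learner must invert.

```python
import numpy as np

def render(factors):
    """Toy ground-truth generative process: a 2-D 'image' whose single
    bright pixel is placed by two low-dimensional factors (x, y).
    Disentanglement learning aims to recover such factors from the
    high-dimensional observations alone."""
    x, y = factors
    img = np.zeros((8, 8))
    img[y, x] = 1.0
    return img

# A 64-dimensional observation produced by just 2 explanatory factors:
obs = render((3, 5))
```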

