LEARNING BASIC INTERPRETABLE FACTORS FROM TEMPORAL SIGNALS VIA PHYSICAL SYMMETRY

Abstract

We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge, and it remains an open question how to learn interpretable and disentangled representations using inductive biases that are more general. In this study, we use physical symmetry as a self-consistency constraint on the latent space. Specifically, it requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to a certain group transformation. We show that our model can learn a linear pitch factor (one that agrees with human music perception) as well as pitch-timbre disentanglement from unlabelled monophonic music audio. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space as well as space-colour disentanglement from a simple moving object shot by a single fixed camera. Furthermore, applying physical symmetry to the prior model naturally leads to representation augmentation, a new learning technique that helps improve sample efficiency.

1. INTRODUCTION

Interpretable representation-learning models have achieved great progress for various types of time-series data. Taking the music domain as an example, tailored deep generative models (Ji et al., 2020) have been developed to learn pitch, timbre, melody contour, chord progression, accompaniment texture, etc. However, most models still rely heavily on domain-specific knowledge. For example, pitch scales or instrument labels are used to learn pitch and timbre representations (Luo et al., 2020; 2019; Engel et al., 2020; Lin et al., 2021; Esling et al., 2018), and chord and rhythm labels are used to learn higher-level representations (Akama, 2019; Yang et al., 2019; Wang et al., 2020; Wei & Xia, 2021). Such an approach is very different from human learning; even without formal music training, one can at least perceive basic factors such as pitch and timbre from the experience of listening to music. In other words, it remains an open question how to learn interpretable music representations using inductive biases that are more general. We see a similar issue in other domains. For instance, various computer-vision models (McCarthy & Ahmed, 2020; Trevithick & Yang, 2021; Mescheder et al., 2019; Riegler et al., 2017) can learn 3D representations of human faces or a particular scene by incorporating domain knowledge (e.g., labelling of meshes and voxels, or 3D-specific setups such as multi-camera rigs and 3D convolutions), but it remains a non-trivial task to trace the 3D location of a simple moving object from monocular videos in a self-supervised fashion. In this study, we explore the use of physical symmetry (i.e., symmetry of physical laws) as a weak self-consistency constraint on the learned latent space.
As indicated in Figure 1, this general inductive bias requires that after a certain transformation S (e.g., translation or rotation) in the latent space, the learned prior model R, which is the induced physical law describing the temporal flow of the latent states, should output equivariant predictions. Formally, z_{t+1} = R(z_t) if and only if z^S_{t+1} = R(z^S_t), where z^S = S(z). In other words, R and S are commutable operations on z, i.e., R(S(z)) = S(R(z)). Note that this approach is fundamentally different from most existing symmetry-informed models (Bronstein et al., 2021), in which the symmetry property is used to constrain the encoder or the decoder. Specifically, we design self-supervised learning with physical symmetry (SPS), a method that adopts an encoder-decoder framework and applies physical symmetry to the prior model. We show that with the right symmetry assumptions, our model learns a linear pitch factor that agrees with human music perception from monophonic music audio, without any domain-specific knowledge about pitch scales or signal-level regularities. If we further assume an extra global invariant latent code, the model can learn pitch-timbre disentanglement without instrument labelling. Moreover, we show that the same methodology can be applied to the computer-vision domain, learning a 3D Cartesian space as well as space-colour disentanglement from monocular videos of a bouncing ball shot from a fixed perspective.
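The commutativity constraint R(S(z)) = S(R(z)) can be turned directly into a training penalty. The following is a minimal sketch of that idea, not the paper's implementation: the function names and the toy linear prior and translation are illustrative assumptions, and in practice R would be a learned neural network rather than a fixed map.

```python
import numpy as np

def equivariance_loss(R, S, z_t):
    """Self-consistency penalty from physical symmetry: the prior R
    should commute with the symmetry transformation S on latents,
    i.e. R(S(z)) == S(R(z)). Deviation is penalised in mean-square."""
    shift_then_predict = R(S(z_t))   # transform latent, then advance in time
    predict_then_shift = S(R(z_t))   # advance in time, then transform
    return np.mean((shift_then_predict - predict_then_shift) ** 2)

# Toy example: a constant-drift prior and a translation symmetry.
# Translation commutes with drift, so the loss vanishes here.
R = lambda z: z + 0.5            # prior: predicts the next latent state
S = lambda z: z + 2.0            # symmetry: translation in latent space
z = np.array([0.0, 1.0, -1.0])   # a batch of latent states
loss = equivariance_loss(R, S, z)
```

With a learned prior, this loss term would be minimised jointly with the reconstruction loss, pushing the latent dynamics toward equivariance.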

2. INTUITION

The idea of using physical symmetry for representation learning comes from modern physics. In classical physics, scientists usually first induce physical laws from observations and then discover symmetry properties of those laws. (E.g., Newton's law of gravitation, which was induced from planetary orbits, is symmetric with respect to Galilean transformations.) In contrast, in modern physics, scientists often start from a symmetry assumption, from which they derive the corresponding law and predict the properties (representations) of fundamental particles. (E.g., general relativity was developed based on a firm assumption of symmetry with respect to Lorentz transformations.) Analogously, we use physical symmetry as an inductive bias of our machine-learning model, which helps us learn a regularised prior and an interpretable latent space. In other words, just as many physicists believe that symmetry in physical law is a main design principle of nature, we regard symmetry in physical law as a major useful inductive bias for the representation learner. Introducing physical symmetry to the learned prior model naturally leads to representation augmentation, a novel learning technique that helps improve sample efficiency. As indicated in Figure 1, representation augmentation means "imagining" z^S_t as a training sample for the prior model R. Representation augmentation can be regarded as a regularisation of the prior model, since it requires the prediction of the z sequence to be equivariant with respect to a certain group transformation S. It also constrains the encoder and decoder indirectly through the prior model, since the network is trained in an end-to-end fashion.
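Representation augmentation, as described above, amounts to training the prior not only on the encoded latent sequence but also on "imagined" copies obtained by applying sampled group transformations. A minimal sketch under assumed specifics (a 1D translation group, a uniform sampling range, and the function names below are all illustrative, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_latents(z_seq, S, n_aug=4):
    """Representation augmentation: besides the encoded sequence
    z_{1:T}, also hand the prior model 'imagined' sequences S(z_{1:T}).
    Each sampled group element yields one extra trajectory, so the
    prior sees (n_aug + 1) trajectories per encoded sequence."""
    trajectories = [z_seq]                      # the original sequence
    for _ in range(n_aug):
        g = rng.uniform(-1.0, 1.0)              # sample a group element
        trajectories.append(S(z_seq, g))        # equivariant "imagined" copy
    return np.stack(trajectories)               # shape (n_aug + 1, T, d)

translate = lambda z, g: z + g                  # example symmetry: translation
z_seq = rng.normal(size=(10, 3))                # encoded latents, T=10, d=3
augmented = augment_latents(z_seq, translate)
```

Because each encoded sequence spawns several virtual training sequences, the prior model effectively trains on more data than was observed, which is the sample-efficiency benefit noted above.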

3. METHODOLOGY

Our goal is to learn a disentangled and interpretable representation z_i of each high-dimensional sample x_i from a time series x_{1:T}. The disentanglement of z_i happens at two levels. First, z_i is divided into two factors: z_{i,s} and z_{i,c}, where z_{i,s} is the global invariant style and z_{i,c} is the content representation that changes over time. More importantly, we aim to further disentangle the spatio-temporal content factor z_{i,c} using physical symmetry, such that it is equivariant with respect to the prior model and each of its dimensions is interpretable and consistent with human perception. We focus on two specific problems in this paper. The primary problem is to learn pitch and timbre factors of music notes from music audio, where each x_i is the spectrogram of a note. Ideally, z_{i,c} is a 1D content factor representing the pitch and z_{i,s} is a style factor representing the timbre. The other problem is to learn 3D Cartesian location and colour factors of a simple moving object (a bouncing ball) from its trajectory shot by a single fixed camera. In this case, each x_i is an image. Ideally, z_{i,c} is learned to be a 3D content factor representing the location and z_{i,s} represents the global colour.
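The two-level factorisation of z_i can be illustrated with a small sketch. This is only one plausible way to realise the split, assuming the encoder emits a flat per-frame latent and the global style code is obtained by pooling over time; the function names, dimensions, and pooling choice are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def split_latent(z, content_dim):
    """Split per-frame latents z (shape (T, d)) into a time-varying
    content factor z_{i,c} (first content_dim dims, e.g. pitch or 3D
    location) and a style factor z_{i,s} (remaining dims, e.g. timbre
    or colour). Averaging the style dims over time enforces one global
    invariant style code per sequence."""
    z_c = z[:, :content_dim]                  # content: one vector per frame
    z_s = z[:, content_dim:].mean(axis=0)     # style: pooled into a global code
    return z_c, z_s

# Pitch-timbre case: 1D pitch content, remaining dims model timbre.
z = np.random.default_rng(1).normal(size=(8, 5))  # T=8 frames, d=5
z_c, z_s = split_latent(z, content_dim=1)
```

For the bouncing-ball setting, the same split would be used with content_dim=3 (the 3D location), with z_{i,s} capturing the global colour.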



Figure 1: Physical symmetry, the fundamental inductive bias of this study.

