LEARNING BASIC INTERPRETABLE FACTORS FROM TEMPORAL SIGNALS VIA PHYSICAL SYMMETRY

Abstract

We have recently seen great progress in learning interpretable music representations, ranging from basic factors, such as pitch and timbre, to high-level concepts, such as chord and texture. However, most methods rely heavily on music domain knowledge, and it remains an open question how to learn interpretable and disentangled representations using more general inductive biases. In this study, we use physical symmetry as a self-consistency constraint on the latent space. Specifically, it requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to a certain group transformation. We show that our model can learn a linear pitch factor (one that agrees with human music perception) as well as pitch-timbre disentanglement from unlabelled monophonic music audio. In addition, the same methodology can be applied to computer vision, learning the 3D Cartesian space as well as space-colour disentanglement from videos of a simple moving object shot by a single fixed camera. Furthermore, applying physical symmetry to the prior model naturally leads to representation augmentation, a new learning technique that helps improve sample efficiency.

1. INTRODUCTION

Interpretable representation-learning models have achieved great progress for various types of time-series data. Taking the music domain as an example, tailored deep generative models (Ji et al., 2020) have been developed to learn pitch, timbre, melody contour, chord progression, accompaniment texture, etc. However, most models still rely heavily on domain-specific knowledge, for example, using pitch scales or instrument labels to learn pitch and timbre representations (Luo et al., 2020; 2019; Engel et al., 2020; Lin et al., 2021; Esling et al., 2018) and using chord and rhythm labels to learn higher-level representations (Akama, 2019; Yang et al., 2019; Wang et al., 2020; Wei & Xia, 2021). Such an approach is very different from human learning; even without formal music training, one can at least perceive basic factors such as pitch and timbre from the experience of listening to music. In other words, it remains an open question how to learn interpretable music representations using more general inductive biases.

We see a similar issue in other domains. For instance, various computer-vision models (McCarthy & Ahmed, 2020; Trevithick & Yang, 2021; Mescheder et al., 2019; Riegler et al., 2017) can learn 3D representations of human faces or of a particular scene by incorporating domain knowledge (e.g., labelled meshes and voxels, or 3D-specific setups such as multiple cameras and 3D convolutions), but it remains a non-trivial task to trace the 3D location of a simple moving object from monocular videos in a self-supervised fashion.

In this study, we explore the use of physical symmetry (i.e., symmetry of physical laws) as a weak self-consistency constraint on the learned latent space z. As indicated in Figure 1, this general inductive bias requires that, after a certain transformation S (e.g., translation or rotation) in the latent space, the learned prior model R, which is the induced physical law describing the temporal flow of the latent states, should output equivariant predictions. Formally, $z_{t+1} = R(z_t)$ if and only if $z^S_{t+1} = R(z^S_t)$, where $z^S = S(z)$. In other words, R and S are commutable operations on z, i.e., $R(S(z)) = S(R(z))$. Note that this approach is fundamentally different from most existing symmetry-informed models (Bronstein et al., 2021), in which the symmetry property is used to constrain the encoder or the decoder. Specifically, we design self-supervised learning with physical symmetry (SPS), a method that adopts an encoder-decoder framework and applies physical symmetry to the prior model. We show that
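To make the constraint concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of how such a self-consistency term could be imposed on the prior R, taking translation as the group transformation S. The encoder E, the shift sampling, and the equal weighting of the two terms are illustrative assumptions.

```python
# Minimal sketch of a physical-symmetry self-consistency loss.
# Assumptions (not from the paper): encoder E maps observations to latent
# codes, prior R predicts the next latent state, and S is a random
# translation of the latent space.
import torch

def symmetry_consistency_loss(E, R, x_t, x_next, shift_scale=1.0):
    """Encourage the prior R to commute with a latent translation S."""
    z_t = E(x_t)        # latent state at time t
    z_next = E(x_next)  # latent state at time t+1

    # Prediction term: R(z_t) should match the encoded next state z_{t+1}.
    pred_loss = torch.mean((R(z_t) - z_next) ** 2)

    # Sample a random translation S(z) = z + s and apply it to both steps.
    s = shift_scale * torch.randn(1, z_t.shape[-1], device=z_t.device)

    # Equivariance term: the prior applied to the shifted state should match
    # the shifted next state, i.e. R(S(z_t)) should equal S(R(z_t)).
    equiv_loss = torch.mean((R(z_t + s) - (z_next + s)) ** 2)

    return pred_loss + equiv_loss
```

In practice such a term would be combined with the reconstruction loss of the encoder-decoder framework described above; this sketch only illustrates how applying the transformation to a latent trajectory yields an extra training signal without any labels.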

