COMPRESSED PREDICTIVE INFORMATION CODING

Abstract

Unsupervised learning plays an important role in many fields, such as machine learning, data compression, and neuroscience. Compared to static data, methods for extracting low-dimensional structure from dynamic data lag behind. We developed a novel information-theoretic framework, Compressed Predictive Information Coding (CPIC), to extract predictive latent representations from dynamic data. Predictive information quantifies the ability to predict the future of a time series from its past. CPIC selectively projects the past (input) into a low-dimensional space that is predictive of the compressed data projected from the future (output). The key insight of our framework is to learn representations by balancing the minimization of compression complexity with the maximization of predictive information in the latent space. We derive tractable variational bounds on the CPIC loss by leveraging bounds on mutual information. The CPIC loss induces the latent space to capture information that is maximally predictive of the future of the data from the past. We demonstrate that introducing stochasticity in the encoder and maximizing the predictive information in latent space contribute to learning more robust latent representations. Furthermore, our variational approaches perform better in mutual information estimation compared with estimates under the commonly used Gaussian assumption. We show numerically on synthetic data that CPIC can recover dynamical systems embedded in noisy observation data with low signal-to-noise ratios. Finally, we demonstrate that CPIC extracts features more predictive of forecasting exogenous variables, as well as auto-forecasting, in various real datasets compared with other state-of-the-art representation learning models. Together, these results indicate that CPIC will be broadly useful for extracting low-dimensional dynamic structure from high-dimensional, noisy time series data.

1. INTRODUCTION

Unsupervised methods play an important role in learning representations that provide insight into data and exploit unlabeled data to improve performance in downstream tasks in diverse application areas (Bengio et al., 2013; 2020). Prior work on unsupervised representation learning can be broadly categorized into generative models, such as variational autoencoders (VAEs) (Kingma & Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014), and discriminative models, such as dynamical components analysis (DCA) (Clark et al., 2019), contrastive predictive coding (CPC) (Oord et al., 2018), and deep autoencoding predictive components (DAPC) (Bai et al., 2020). Generative models focus on capturing the joint distribution between representations and inputs, but are usually computationally expensive. On the other hand, discriminative models emphasize capturing the dependence of data structure in the low-dimensional latent space, and are therefore easier to scale to large datasets. In the case of time series, some representation learning models take advantage of an estimate of mutual information between the encoded past (input) and the future (output) (Creutzig & Sprekeler, 2008; Creutzig et al., 2009; Oord et al., 2018). Although previous models utilizing mutual information extract low-dimensional representations, they tend to be sensitive to noise in the observation space. DCA directly makes use of the mutual information between the past and the future (i.e., the predictive information (Bialek et al., 2001)) in a latent representational space that is a linear embedding of the observation data. However, DCA operates under Gaussian assumptions for mutual information estimation. We propose a novel representation learning framework that is not only robust to noise in the observation space but also alleviates the Gaussian assumption and is thus more flexible.
We formalize our problem in terms of data generated from a stationary dynamical system and propose an information-theoretic objective function for Compressed Predictive Information Coding (CPIC). Instead of leveraging the information bottleneck (IB) objective directly as in Creutzig & Sprekeler (2008) and Creutzig et al. (2009), where the past latent representation is directly used to predict future observations, we predict the compressed future observations filtered by the encoder. This is because, in the time series setting, future observations are noisy, so treating them as labels is not informative. Specifically, our goal is to extract latent representations that better predict the future underlying dynamics. Since the compressed future observations are assumed to retain only the underlying dynamics, better compression thus contributes to extracting better dynamical representations. In addition, inspired by Clark et al. (2019) and Bai et al. (2020), we extend the prediction from a single input to a window of inputs to handle higher-order predictive information. Moreover, instead of directly estimating the objective information under a Gaussian assumption (Creutzig & Sprekeler, 2008; Creutzig et al., 2009; Clark et al., 2019; Bai et al., 2020), we develop variational bounds and a tractable end-to-end training framework based on the neural estimators of mutual information studied in Poole et al. (2019). Notably, our inference is the first to leverage variational bound techniques for self-supervised learning on time series data. Since it alleviates the Gaussian assumption, it is applicable to a much larger class of dynamical systems. In CPIC, we also demonstrate that introducing stochasticity into either a linear or nonlinear encoder robustly contributes to numerically better representations in different tasks.
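To make the variational-bound idea concrete, the sketch below computes the InfoNCE lower bound on mutual information (Oord et al., 2018; Poole et al., 2019), one of the family of bounds studied in that work. This is an illustrative NumPy sketch, not CPIC's actual training code: the critic (a scaled negative squared distance) and the toy "past"/"future" codes are our own assumptions.

```python
import numpy as np

def infonce_bound(scores):
    """InfoNCE lower bound on I(X; Y) from an (N, N) critic matrix.

    scores[i, j] is a critic f(x_i, y_j); the diagonal holds the N
    jointly sampled ("positive") pairs.  The bound is
        E[ log( e^{f(x_i, y_i)} / ((1/N) sum_j e^{f(x_i, y_j)}) ) ]
    and can never exceed log N.
    """
    n = scores.shape[0]
    row_max = scores.max(axis=1, keepdims=True)
    # numerically stable log-sum-exp over each row
    logsumexp = np.log(np.exp(scores - row_max).sum(axis=1)) + row_max[:, 0]
    return (np.diag(scores) - logsumexp).mean() + np.log(n)

rng = np.random.default_rng(0)
n = 128
x = rng.standard_normal((n, 2))             # toy "past" latent codes
y = x + 0.1 * rng.standard_normal((n, 2))   # strongly correlated "future" codes
# critic: scaled negative squared distance (an assumed, illustrative choice)
scores = -((x[:, None, :] - y[None, :, :]) ** 2).sum(-1) / (2 * 0.1 ** 2)
mi_lb = infonce_bound(scores)
```

Because the positive pairs are much closer than the negatives, the bound approaches its cap of log N = log 128; in practice the critic is a learned network and the bound is maximized by gradient ascent.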
In particular, we illustrate on synthetic data that CPIC can recover trajectories of a chaotic dynamical system embedded in high-dimensional noisy observations with low signal-to-noise ratios. Furthermore, we conduct numerical experiments on four real-world datasets with different goals. In two neuroscience datasets, monkey motor cortex (M1) and rat dorsal hippocampus (HC), we show that, compared with state-of-the-art methods, the latent representations extracted by CPIC have better forecasting accuracy for the exogenous variables: the monkey's future hand position for M1 and the rat's future position for HC. In two other real datasets, historical hourly weather temperature data (TEMP) and motion sensor data (MS), we show that latent representations extracted by CPIC have better forecasting accuracy for the future of those time series than other methods. In summary, the primary contributions of our paper are as follows:
• We developed a novel information-theoretic self-supervised learning framework, Compressed Predictive Information Coding (CPIC), which extracts low-dimensional latent representations from time series. CPIC maximizes the predictive information in the latent space while minimizing the compression complexity.
• We introduced a stochastic encoder structure that encodes inputs into stochastic representations to handle uncertainty and contribute to better representations.
• Building on prior work, we derived variational bounds on CPIC's objective function and a tractable, end-to-end training procedure. Since our inference alleviates the Gaussian assumption common to other methods, it is applicable to a much larger class of dynamical systems. Moreover, to the best of our knowledge, our inference is the first to leverage variational bound techniques for self-supervised learning on time series data.
• We demonstrated that, compared with other unsupervised methods, CPIC more robustly recovers latent dynamics in dynamical systems with low signal-to-noise ratios in synthetic experiments, and extracts more predictive features for downstream tasks in various real datasets.

2. RELATED WORK

Mutual information (MI) plays an important role in estimating the relationship between pairs of variables. It is a reparameterization-invariant measure of dependency:

I(X; Y) = E_{p(x,y)} [ log ( p(x|y) / p(x) ) ]    (1)
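For jointly Gaussian variables, the expectation in Eq. (1) has a closed form, which makes it easy to sanity-check numerically. The sketch below (our own illustrative example; variable names are assumptions) estimates Eq. (1) by Monte Carlo for a correlated Gaussian pair and compares it against the closed form I(X; Y) = -0.5 log(1 - rho^2).

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n = 200_000

# Sample (x, y) from a bivariate standard Gaussian with correlation rho
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# Known densities: x ~ N(0, 1) and x | y ~ N(rho * y, 1 - rho^2)
log_p_x = -0.5 * (x ** 2 + np.log(2 * np.pi))
log_p_x_given_y = -0.5 * ((x - rho * y) ** 2 / (1 - rho ** 2)
                          + np.log(2 * np.pi * (1 - rho ** 2)))

mi_mc = np.mean(log_p_x_given_y - log_p_x)   # Monte Carlo estimate of Eq. (1)
mi_closed = -0.5 * np.log(1 - rho ** 2)      # closed form for bivariate Gaussians
```

The two values agree to within Monte Carlo error; when the densities are unknown, as in CPIC's latent space, one must instead fall back on variational bounds of the kind derived later in the paper.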




