LEARNING TO DECOUPLE COMPLEX SYSTEM FOR SEQUENTIAL DATA

Abstract

A complex system with cluttered observations may be a coupled mixture of multiple simple sub-systems corresponding to latent entities. Such sub-systems may hold distinct dynamics in the continuous-time domain, while the complicated interactions among them also evolve over time. This setting is fairly common in the real world but has received little attention. In this paper, we propose a sequential learning approach under this setting that decouples a complex system in order to handle irregularly sampled and cluttered sequential observations. Such decoupling yields not only sub-systems describing the dynamics of each latent entity, but also a meta-system capturing the interactions between entities over time. Specifically, we argue that the meta-system of interactions is governed by a smoothed version of projected differential equations. Experimental results on synthetic and real-world datasets show the advantages of our approach over the state-of-the-art when facing complex and cluttered sequential data.

1. INTRODUCTION

Discovering hidden rules from sequential observations has long been an essential topic in machine learning, with a wide variety of applications such as physics simulation (Sanchez-Gonzalez et al., 2020), autonomous driving (Diehl et al., 2019), ECG analysis (Golany et al., 2021) and event analysis (Chen et al., 2021), to name a few. A standard scheme is to consider the sequential data at each timestamp as holistic and homogeneous under some idealized assumptions (i.e., only the temporal behavior of a single entity is involved in a sequence), so that observations are treated as a collection of time slices produced by one unified system. A series of sequential learning models fall into this category, including variants of recurrent neural networks (RNNs) (Cho et al., 2014; Hochreiter & Schmidhuber, 1997), neural differential equations (DEs) (Chen et al., 2018; Kidger et al., 2020; Rusch & Mishra, 2021; Zhu et al., 2021) and spatial/temporal attention-based approaches (Vaswani et al., 2017; Fan et al., 2019; Song et al., 2017). These variants fit well into scenarios satisfying the aforementioned assumptions, and have proved effective in relatively simple applications with clean data sources.

In the real world, however, a system may not describe a single, holistic entity, but may instead consist of several distinguishable, interacting yet simple sub-systems, each corresponding to a physical entity. For example, the movement of the solar system can be viewed as a mixture of distinguishable sub-systems for the sun and the surrounding planets, while the interactions between these celestial bodies over time are governed by the law of gravity. Centuries ago, physicists and astronomers made enormous efforts to discover the rules of celestial movement from records of individual bodies, eventually delivering the concise yet elegant differential equations (DEs) that depict the principles of moving bodies and the interactions therein.
Likewise, researchers nowadays have developed a series of machine learning models for sequential data with distinguishable partitions (Qin et al., 2017). Two widely adopted strategies for learning the interactions between sub-systems are graph neural networks (Iakovlev et al., 2021; Ha & Jeong, 2021; Kipf et al., 2018; Yıldız et al., 2022; Xhonneux et al., 2020) and attention mechanisms (Vaswani et al., 2017; Lu et al., 2020; Goyal et al., 2021), where the interactions are typically encoded as "messages" between nodes and pair-wise "attention scores", respectively. An even more difficult scenario is worth noting, in which the observations are so cluttered that they cannot readily be separated into distinct parts. This can be due either to the way data are collected (e.g., video containing multiple objects), or to the absence of explicit physical entities in the first place (e.g., weather time series). To tackle this, a reasonable assumption can be introduced: complex observations can be decoupled into several relatively independent modules in the feature space, where each module corresponds to a latent entity. Latent entities may not carry exact physical meanings, but learning procedures can greatly benefit from such decoupling, as the assumption acts as a strong regularization on the system. This assumption has been successfully incorporated into several models for learning from regularly sampled sequential data by enforcing some degree of "independence" between channels or groups in the feature space (Li et al., 2018; Yu et al., 2020; Goyal et al., 2021; Madan et al., 2021). Another successful line of work benefiting from this assumption is the Transformer (Vaswani et al., 2017), which stacks multiple layers of self-attention and point-wise feed-forward networks. In Transformers, each attention head can be viewed as a relatively independent module, and interaction happens through a re-weighting procedure that follows the attention scores.
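As a concrete illustration of interactions encoded as pair-wise "attention scores", the sketch below computes a row-stochastic interaction matrix between modules from scaled dot products. This is a generic scaled dot-product attention sketch for intuition only; the module vectors and shapes are made up and not drawn from any cited model.

```python
import math

def attention_scores(queries, keys):
    """Pairwise interaction strengths between modules: softmax over
    scaled dot products, so each row sums to one."""
    d = len(queries[0])
    scores = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(logits)                      # subtract max for stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        scores.append([e / z for e in exps])
    return scores

# Two hypothetical latent modules; row i gives how strongly module i
# attends to (interacts with) each module.
q = [[1.0, 0.0], [0.0, 1.0]]
S = attention_scores(q, q)
```

Each row of `S` is a distribution over modules, so the interaction weights are constrained to the probability simplex by construction.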
Lu et al. (2020) presented an interpretation from a dynamical point of view, regarding a basic layer of the Transformer as one step of integration governed by differential equations derived from interacting particles. Vuckovic et al. (2020) extended this interpretation with more solid mathematical support by viewing the forward pass of a Transformer as applying successive Markov kernels in a particle-based dynamical system. We note, however, that despite the ubiquity of this setting, there has been barely any previous investigation into learning from irregularly sampled and cluttered sequential data. The aforementioned works either fail to handle the irregularity (Goyal et al., 2021; Li et al., 2018) or neglect the independence/modularity assumption in the latent space (Chen et al., 2018; Kidger et al., 2020). In this paper, inspired by recent advances in neural controlled dynamics (Kidger et al., 2020) and the novel interpretation of attention mechanisms (Vuckovic et al., 2020), we take a step toward an effective approach to this problem in the dynamical setting. To this end, our approach explicitly learns to decouple a complex system into several latent sub-systems, and utilizes an additional meta-system to capture the evolution of interactions over time. Specifically, taking into account the constrained interactions analogous to the attention mechanism, we further characterize such interactions using projected differential equations (ProjDEs). Our contributions are as follows:

• We provide a novel modeling strategy for sequential data from a system-decoupling perspective;
• We propose a novel and natural interpretation of evolving interactions as a ProjDE-based meta-system under constraints;
• Our approach is parameter-insensitive and more compatible with other modules, making it flexible to integrate into various tasks.
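To build intuition for a smoothed projection keeping evolving interaction weights on the probability simplex (the constraint analogous to attention), here is a minimal, hypothetical Euler discretization: unconstrained logits evolve under a vector field and are read out through a softmax. The vector field `f`, step size, and initial state are all invented for illustration; this is not the formulation proposed in the paper.

```python
import math

def softmax(v):
    """Smoothed map onto the probability simplex (vs. a hard projection,
    which would snap mass onto vertices/faces non-smoothly)."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def smoothed_projected_euler(logits, f, dt, n_steps):
    """Toy discretization of a smoothed projected dynamic: evolve
    unconstrained logits by f, then read out simplex-valued weights."""
    for _ in range(n_steps):
        g = f(logits)
        logits = [x + dt * gi for x, gi in zip(logits, g)]
    return softmax(logits)

# A made-up constant drift favoring the first interaction weight.
w = smoothed_projected_euler([0.0, 0.0], lambda x: [1.0, -1.0],
                             dt=0.1, n_steps=10)
```

However the logits drift, the readout always lies on the simplex, which is the appeal of a smoothed projection over a hard one: the constraint is maintained while the dynamics stay differentiable.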
Extensive experiments were conducted on both regularly and irregularly sampled sequential data, covering synthetic and real-world settings. Our approach achieved prominent performance compared to the state-of-the-art across a wide spectrum of tasks.

2. RELATED WORK

Sequential learning. Traditionally, learning with sequential data can be performed using variants of recurrent neural networks (RNNs) (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Li et al., 2018) under a Markov setting. While such RNNs are generally designed for regular sampling frequencies, a more natural line of counterparts lies in the continuous-time domain, allowing irregularly sampled time series as input. A variety of RNN-based methods have thus been developed, by introducing exponential decay on observations (Che et al., 2018; Mei & Eisner, 2017), incorporating an underlying Gaussian process (Li & Marlin, 2016; Futoma et al., 2017), or integrating latent evolution under ODEs (Rubanova et al., 2019; De Brouwer et al., 2019). A seminal work interpreting the forward pass of neural networks as integration of ODEs was proposed by Chen et al. (2018), followed by a series of relevant works (Liu et al., 2019; Li et al., 2020a; Dupont et al., 2019). As integration over ODEs allows arbitrary step lengths, it is a natural model for irregularly sampled time series, and has proved powerful in many machine learning tasks (e.g., bioinformatics (Golany et al., 2021), physics (Nardini et al., 2021) and computer vision (Park et al., 2021)). Kidger et al. (2020) studied a more effective way of injecting observations into the system via a mathematical tool called controlled differential equations (CDEs), achieving state-of-the-art performance on several benchmarks. Some variants of neural ODEs have
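To illustrate why integration over ODEs naturally handles irregular sampling, the sketch below runs explicit Euler integration with step lengths equal to the gaps between observation times: no resampling to a fixed grid is needed. The dynamics and timestamps are toy values chosen for illustration, not from any cited work.

```python
def euler_integrate(f, z0, times):
    """Explicit Euler over arbitrary (possibly irregular) timestamps:
    the step length is simply the gap between consecutive observations."""
    z = z0
    traj = [z]
    for t0, t1 in zip(times, times[1:]):
        z = z + (t1 - t0) * f(t0, z)  # step size adapts to the data
        traj.append(z)
    return traj

# Toy linear dynamics dz/dt = -z with irregular sampling times;
# the exact solution would be exp(-t).
traj = euler_integrate(lambda t, z: -z, 1.0, [0.0, 0.3, 0.35, 1.0, 1.8])
```

In practice one would use an adaptive higher-order solver rather than fixed Euler steps, but the point stands: continuous-time models consume the raw timestamps directly.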

