DIFFERENTIABLE SEGMENTATION OF SEQUENCES

Abstract

Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 with Poisson regression, apply it to a change point detection task, and learn classification models with concept drift. The experiments show that our approach effectively solves all these tasks with a standard algorithm for gradient descent.

1. INTRODUCTION

Non-stationarity is a classical challenge in the analysis of sequential data. A common source of non-stationarity is the presence of change points, where the data-generating process switches its dynamics from one regime to another. In some applications, the detection of change points is of primary interest, since they may indicate important events in the data (Page, 1954; Box & Tiao, 1965; Basseville & Nikiforov, 1986; Matteson & James, 2014; Li et al., 2015; Arlot et al., 2019; Scharwächter & Müller, 2020). Other applications require models for the dynamics within each segment, which may yield more insights into the phenomenon under study and enable predictions. A plethora of segmented models for regression analysis (McGee & Carleton, 1970; Hawkins, 1976; Lerman, 1980; Bai & Perron, 2003; Muggeo, 2003; Acharya et al., 2016) and time series analysis (Hamilton, 1990; Davis et al., 2006; Aue & Horváth, 2013; Ding et al., 2016) have been proposed in the literature, where the segmentation materializes either in the data dimensions or the index set. We adhere to the latter approach and consider models of the following form. Let x = (x_1, ..., x_T) be a sequence of T observations, and let z = (z_1, ..., z_T) be an additional sequence of covariates used to predict these observations. Observations and covariates may be scalars or vector-valued. We refer to the index t = 1, ..., T as the time of observation. The data-generating process (DGP) of x given z is time-varying and follows a segmented model with K ≪ T segments on the time axis. Let τ_k denote the beginning of segment k. We assume that

x_t | z_t ~iid f_DGP(z_t, θ_k), if τ_k ≤ t < τ_{k+1},    (1)
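To make the setup concrete, the following sketch simulates data from such a segmented DGP with a Poisson regression model in each segment (one of the model classes used in the experiments). All function and variable names are illustrative and not part of the paper's code:

```python
import numpy as np

def simulate_segmented_poisson(T, taus, thetas, rng):
    """Simulate x_t ~ Poisson(exp(theta_k * z_t)), where theta_k is the
    coefficient of the segment containing t, and taus holds the segment
    start indices (tau_1 = 0). Illustrative sketch, not the paper's code."""
    z = rng.normal(size=T)
    # map each time index to its segment: largest k with taus[k] <= t
    seg = np.searchsorted(np.asarray(taus), np.arange(T), side="right") - 1
    rate = np.exp(np.asarray(thetas)[seg] * z)
    x = rng.poisson(rate)
    return x, z, seg

rng = np.random.default_rng(0)
x, z, seg = simulate_segmented_poisson(
    T=100, taus=[0, 40, 70], thetas=[0.5, -1.0, 2.0], rng=rng)
```

Here `seg` is the (hard) segmentation of the time axis that later sections relax into a differentiable form.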


where the DGP in segment k is parametrized by θ_k. This scenario is typically studied for change point detection (Truong et al., 2020; van den Burg & Williams, 2020) and modeling of non-stationary time series (Guralnik & Srivastava, 1999; Cai, 1994; Kohlmorgen & Lemm, 2001; Davis et al., 2006; Robinson & Hartemink, 2008; Saeedi et al., 2016), but also captures classification models with concept drift (Gama et al., 2013) and segmented generalized linear models (Muggeo, 2003). We express the segmentation of the time axis by a segmentation function ζ : {1, ..., T} → {1, ..., K} that maps each time point t to a segment identifier k. The segmentation function is monotonically increasing with boundary constraints ζ(1) = 1 and ζ(T) = K. We denote all segment-wise parameters by θ = (θ_1, ..., θ_K). The ultimate goal is to find a segmentation ζ as well as segment-wise parameters θ that minimize a loss function L(ζ, θ), for example, the negative log-likelihood of the observations x. Existing approaches exploit the fact that model estimation within a segment is often straightforward when the segmentation is known. These approaches decouple the search for an optimal segmentation ζ algorithmically from the estimation of the segment-wise parameters θ:

min_ζ min_θ L(ζ, θ).    (2)

Various algorithmic search strategies have been explored for the outer minimization over ζ, including grid search (Lerman, 1980), dynamic programming (Hawkins, 1976; Bai & Perron, 2003), hierarchical clustering (McGee & Carleton, 1970) and other greedy algorithms (Acharya et al., 2016), some of which come with provable optimality guarantees. These algorithms are often tailored to a specific class of models like piecewise linear regression, and do not generalize beyond it.
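As an illustration of this decoupled scheme, the sketch below estimates a single change point under a piecewise-constant mean model with squared loss: the inner minimization over θ has a closed-form solution (the segment means), and the outer minimization over the change point is an exhaustive grid search. This is a minimal illustrative example, not one of the cited algorithms:

```python
import numpy as np

def grid_search_changepoint(x):
    """Decoupled estimation of one change point: for each candidate
    boundary, solve the inner problem in closed form (segment means)
    and keep the boundary with the smallest total squared loss."""
    T = len(x)
    best_tau, best_loss = None, np.inf
    for tau in range(1, T):  # candidate start index of the second segment
        left, right = x[:tau], x[tau:]
        # inner minimization: the optimal constant per segment is its mean
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_tau, best_loss = tau, loss
    return best_tau, best_loss

x = np.concatenate([np.zeros(50), np.ones(50) * 5.0])
tau, loss = grid_search_changepoint(x)  # tau == 50, loss == 0.0
```

The outer loop over `tau` is exactly the non-differentiable discrete search that the relaxation in Section 2 replaces with gradient descent.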
Moreover, the use of non-standard optimization techniques in the outer minimization hinders the integration of such models with deep learning architectures, which usually perform joint optimization of all model parameters with gradient descent. In this work, we provide a continuous and differentiable relaxation of the segmented model from Equation 1 that allows joint optimization of all model parameters, including the segmentation function, using state-of-the-art gradient descent algorithms. Our formulation is inspired by the learnable warping functions proposed recently for sequence alignment (Lohit et al., 2019; Weber et al., 2019). In a nutshell, we replace the hard segmentation function ζ with a soft warping function γ. An optimal segmentation can be found by optimizing the parameters of the warping function. We propose a novel class of warping functions based on the two-sided power (TSP) distribution (Van Dorp & Kotz, 2002; Kotz & van Dorp, 2004) that can represent segmentation functions exactly. TSP-based warping functions are differentiable, have simple closed-form expressions that allow fast evaluation, and their parameters have a one-to-one correspondence with segment boundaries. Source code for the model and all experiments can be found in the online supplementary material at https://github.com/diozaka/diffseg.
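Under one plausible reading, a warping function of this kind can be built by summing one TSP cumulative distribution function per internal segment boundary: as the shape parameter grows, each CDF approaches a unit step, so the warp approaches a hard segmentation function while remaining differentiable in the boundary parameters for finite shape. The sketch below uses the standard TSP CDF on [0, 1]; the additive construction and all names are our assumptions, not the paper's exact definition:

```python
import numpy as np

def tsp_cdf(x, m, n):
    """CDF of the standard two-sided power distribution on [0, 1]
    with mode m in (0, 1) and shape parameter n > 0."""
    x = np.asarray(x, dtype=float)
    left = m * (x / m) ** n
    right = 1.0 - (1.0 - m) * ((1.0 - x) / (1.0 - m)) ** n
    return np.where(x <= m, left, right)

def warp(t_norm, modes, n):
    """Map normalized time t_norm in (0, 1) to a soft segment index in
    [1, K]: one TSP CDF per internal boundary (K - 1 modes in total).
    For large n each CDF is nearly a unit step at its mode."""
    return 1.0 + sum(tsp_cdf(t_norm, m, n) for m in modes)

t = np.linspace(0.01, 0.99, 99)
soft = warp(t, modes=[0.3, 0.7], n=2.0)    # smooth, trainable boundaries
hard = warp(t, modes=[0.3, 0.7], n=200.0)  # close to a hard segmentation
```

Because `tsp_cdf` is a simple closed-form expression in the mode `m`, the boundary locations can be updated by gradient descent when this is written in an autodiff framework.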

2. RELAXED SEGMENTED MODELS

We relax the model definition from Equation 1 in the following way. We assume that

x_t | z_t ~iid f_DGP(z_t, θ̂_t),    (3)

where we substitute the actual parameter θ_k of the DGP at time step t in segment k by the predictor θ̂_t. The predictor θ̂_t is a weighted sum over the individual segment parameters,

θ̂_t := Σ_{k=1}^{K} ŵ_kt θ_k,    (4)

where the alignment weight ŵ_kt is determined by the difference between the relaxed segment position ζ̂_t and the segment identifier k, for example ŵ_kt := max(0, 1 − |ζ̂_t − k|). The smaller the difference, the closer the alignment weight ŵ_kt will be to 1. If k ≤ ζ̂_t ≤ k + 1, this choice of weights leads to a linear interpolation of the parameters θ_k and θ_{k+1} in Equation 4. Higher-order interpolations can be achieved by adapting the weight function accordingly.
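A minimal sketch of this relaxation, assuming the triangular weight ŵ_kt = max(0, 1 − |ζ̂_t − k|) that yields the linear-interpolation behaviour described above (all names are illustrative):

```python
import numpy as np

def soft_parameters(zeta_hat, thetas):
    """Compute theta_hat_t = sum_k w_kt * theta_k with triangular
    alignment weights w_kt = max(0, 1 - |zeta_hat_t - k|), where
    segment identifiers run k = 1, ..., K."""
    K = len(thetas)
    ks = np.arange(1, K + 1)
    # broadcast to a (T, K) weight matrix; at most two weights per row
    # are nonzero, and they sum to 1 when 1 <= zeta_hat_t <= K
    w = np.maximum(0.0, 1.0 - np.abs(zeta_hat[:, None] - ks[None, :]))
    return w @ np.asarray(thetas, dtype=float)

zeta_hat = np.array([1.0, 1.5, 2.0])          # relaxed segment positions
theta_hat = soft_parameters(zeta_hat, [0.0, 10.0])  # -> [0.0, 5.0, 10.0]
```

Since every step is built from differentiable (almost everywhere) operations, gradients flow from the loss through θ̂_t into both the segment parameters and the relaxed segmentation.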

