DEEP DECLARATIVE DYNAMIC TIME WARPING FOR END-TO-END LEARNING OF ALIGNMENT PATHS

Abstract

This paper addresses learning end-to-end models for time series data that include a temporal alignment step via dynamic time warping (DTW). Existing approaches to differentiable DTW either differentiate through a fixed warping path or apply a differentiable relaxation to the min operator found in the recursive steps used to solve the DTW problem. We instead propose a DTW layer based on bilevel optimisation and deep declarative networks, which we name DecDTW. By formulating DTW as a continuous, inequality constrained optimisation problem, we can compute gradients for the solution of the optimal alignment (with respect to the underlying time series) using implicit differentiation. An interesting byproduct of this formulation is that DecDTW outputs the optimal warping path between two time series, as opposed to the soft approximation recoverable from Soft-DTW. We show that this property is particularly useful for applications where downstream loss functions are defined on the optimal alignment path itself. This naturally occurs, for instance, when learning to improve the accuracy of predicted alignments against ground truth alignments. We evaluate DecDTW on two such applications, namely the audio-to-score alignment task in music information retrieval and the visual place recognition task in robotics, demonstrating state-of-the-art results in both.

1. INTRODUCTION

The dynamic time warping (DTW) algorithm computes a discrepancy measure between two temporal sequences that is invariant to shifting and non-linear scaling in time. Because of this desirable invariance, DTW is ubiquitous in fields that analyse temporal sequences, such as speech recognition, motion capture, time series classification and bioinformatics (Kovar & Gleicher, 2003; Zhu & Shasha, 2003; Sakoe & Chiba, 1978; Bagnall et al., 2017; Petitjean et al., 2014; Needleman & Wunsch, 1970). The original formulation of DTW computes the minimum cost matching between elements of the two sequences, called an alignment (or warping) path, subject to temporal constraints imposed on the matches. For two sequences of length m and n, this can be computed by first constructing an m-by-n pairwise cost matrix between sequence elements and subsequently solving a dynamic program (DP) using Bellman's recursion in O(mn) time. Figure 1 illustrates the mechanics of the DTW algorithm.

There has been interest in recent years in embedding DTW within deep learning models (Cuturi & Blondel, 2017; Cai et al., 2019; Lohit et al., 2019; Chang et al., 2019; 2021), with applications across a variety of learning tasks utilising audio and video data (Garreau et al., 2014; Dvornik et al., 2021; Haresh et al., 2021; Chang et al., 2019; 2021), especially where an explicit alignment step is desired. There are several distinct approaches in the literature for differentiable DTW layers: Cuturi & Blondel (2017) (Soft-DTW) and Chang et al. (2019) use a differentiable relaxation of the min operator in the DP recursion; Cai et al. (2019) and Chang et al. (2021) observe that differentiation is possible after fixing the warping path; and others (Lohit et al., 2019; Shapira Weber et al., 2019; Grabocka & Schmidt-Thieme, 2018; Kazlauskaite et al., 2019; Abid & Zou, 2018) regress warping paths directly from sequence elements without explicitly aligning. Note that methods based on DTW differ from methods such as CTC for speech recognition (Graves & Jaitly, 2014), where word-level transcription is more important than frame-level alignment and an explicit alignment step is not required.

We propose a novel approach to differentiable DTW, which we name DecDTW, based on deep implicit layers. By adapting a continuous time formulation of the DTW problem proposed by Deriso & Boyd (2019), called GDTW, as an inequality constrained non-linear program (NLP), we can use the deep declarative networks (DDN) framework (Gould et al., 2021) to define the forward and backward passes in our DTW layer. The forward pass involves solving for the optimal (continuous time) warping path; we achieve this using a custom dynamic programming approach similar to the original DTW algorithm. The backward pass uses the identities in Gould et al. (2021) to derive gradients using implicit differentiation through the solution computed in the forward pass.

We will show that DecDTW has benefits over existing approaches based on Soft-DTW (Cuturi & Blondel, 2017; Le Guen & Thome, 2019; Blondel et al., 2021), the most important of which is that DecDTW is more effective and efficient at utilising alignment path information in an end-to-end learning setting. This is particularly useful for applications where the objective of learning is to improve the accuracy of the alignment itself and where ground truth alignments between time series pairs are provided. We show that using DecDTW yields both considerable performance gains and efficiency gains (during training) over Soft-DTW (Le Guen & Thome, 2019) in challenging real-world alignment tasks. An overview of our proposed DecDTW layer is illustrated in Figure 2.

Our Contributions First, we propose a novel, inequality constrained NLP formulation of the DTW problem, building on the approach in Deriso & Boyd (2019). Second, we use this NLP formulation to specify our novel DecDTW layer, where gradients in the backward pass are computed implicitly as in the DDN framework (Gould et al., 2021). Third, we show how the alignment path produced by DecDTW can be used to minimise discrepancies to ground truth alignments. Last, we use our method to attain state-of-the-art performance on challenging real-world alignment tasks.

Figure 1: Classic DTW (a) is a discrete optimisation problem which finds the minimum cost warping path through a pairwise cost matrix (b). DecDTW uses a continuous time variant of classic DTW (GDTW) (c) and finds an optimal time warp function between two continuous time signals (d).

2. RELATED WORKS

Differentiable DTW Layers Earlier approaches to learning with DTW alternate, at each iteration, between first computing the optimal alignment using DTW and then, given the fixed alignment, optimising the underlying features input to DTW. Zhou & De la Torre (2009); Su & Wu (2019); Zhou & De la Torre (2012) analytically solve for a linear transformation of raw observations. More recent work such as DTW-Net (Cai et al., 2019) and DP-DTW (Chang et al., 2021) instead takes a single gradient step at each iteration to optimise a non-linear feature extractor. None of these methods is able to directly use path information within a downstream loss function.

Differentiable Temporal Alignment Paths Soft-DTW (Cuturi & Blondel, 2017) is a differentiable relaxation of the classic DTW problem, obtained by replacing the min step in the DP recursion with a differentiable soft-min. The Soft-DTW discrepancy is the expected alignment cost under a Gibbs distribution over alignments, induced by the pairwise cost matrix Δ and smoothing parameter γ > 0. Path information is encoded through the expected alignment matrix, A*_γ = E_γ[A] = ∇_Δ dtw_γ(X, Y) (Cuturi & Blondel, 2017). While A*_γ is recovered during the Soft-DTW backward pass, differentiating through A*_γ involves computing the Hessian ∇²_Δ dtw_γ(X, Y); Le Guen & Thome (2019) proposed an efficient custom backward pass to achieve this. A loss can be specified over paths through a penalty matrix Ω, which, for instance, can encode the error between the expected predicted path and the ground truth. Note that at inference time the original DTW problem (i.e., γ = 0) is solved to generate predicted alignments, leaving a disconnect between the training loss and the inference task.
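For concreteness, the classic discrete DTW problem described above (pairwise cost matrix, Bellman's recursion in O(mn) time, backtracking to recover the optimal warping path) can be sketched as follows. This is an illustrative NumPy implementation for 1-D sequences with squared-difference costs, not the GDTW solver used inside DecDTW.

```python
import numpy as np

def dtw(x, y):
    """Classic DTW between 1-D sequences x (length m) and y (length n).

    Builds the m-by-n pairwise cost matrix, runs Bellman's recursion in
    O(mn) time, and backtracks to recover the optimal warping path.
    Returns (total cost, path), where path is a list of (i, j) index pairs.
    """
    m, n = len(x), len(y)
    delta = (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2  # pairwise costs
    D = np.full((m + 1, n + 1), np.inf)  # cumulative costs, padded for the recursion
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Bellman recursion: extend the cheapest of the three admissible moves
            D[i, j] = delta[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (m, n) to (1, 1) to recover the optimal alignment path
    path, i, j = [], m, n
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: D[ij])
    path.append((0, 0))
    return D[m, n], path[::-1]
```

For example, `dtw([0, 1, 2, 3], [0, 0, 1, 2, 2, 3])` returns a zero total cost, since every element of the first sequence can be matched to an equal element of the second under the monotonicity and continuity constraints.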

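The Soft-DTW relaxation discussed in Section 2 amounts to one local change to the recursion above: the hard min is replaced by the soft-min soft_min_γ(v) = -γ log Σᵢ exp(-vᵢ/γ), which recovers the hard min as γ → 0. A minimal sketch (again assuming 1-D sequences and squared-difference costs; real implementations such as the authors' baselines are batched and numerically optimised):

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft-min: -gamma * log(sum(exp(-v / gamma))).

    Computed with the max-subtraction trick for numerical stability;
    approaches the hard min as gamma -> 0.
    """
    v = np.asarray(values) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))

def soft_dtw(x, y, gamma):
    """Soft-DTW discrepancy dtw_gamma(X, Y) (Cuturi & Blondel, 2017):
    the classic DP recursion with min replaced by soft_min."""
    m, n = len(x), len(y)
    delta = (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = delta[i - 1, j - 1] + soft_min(
                [D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma)
    return D[m, n]
```

Because soft-min lower-bounds the hard min, the Soft-DTW value lies below the classic DTW cost and converges to it as γ shrinks; it is this smoothing that makes the discrepancy differentiable but also leaves the γ = 0 inference problem unsmoothed, as noted above.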

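The implicit differentiation underlying the DecDTW backward pass can be illustrated on a toy declarative node. For an unconstrained inner problem u*(x) = argmin_u f(x, u), differentiating the stationarity condition ∂f/∂u = 0 gives du*/dx = -(∂²f/∂u²)⁻¹ ∂²f/∂u∂x, with no need to unroll the solver. The example below uses a hypothetical quadratic objective chosen so the result can be checked analytically; the actual DecDTW layer handles an inequality constrained NLP via the more general identities of Gould et al. (2021).

```python
# Toy declarative node: u*(x) = argmin_u f(x, u), f(x, u) = (u - x**2)**2 + 0.1 * u**2.
# The objective and solver below are illustrative assumptions, not the paper's layer.

def f_u(x, u):
    """Stationarity residual df/du = 2*(u - x^2) + 0.2*u; zero at the optimum."""
    return 2.0 * (u - x**2) + 0.2 * u

def solve(x):
    """Forward pass: closed-form minimiser of this convex (in u) objective."""
    return 2.0 * x**2 / 2.2

def implicit_grad(x, u):
    """Backward pass: du*/dx = -(d2f/du2)^-1 * d2f/dudx, evaluated at the optimum."""
    f_uu = 2.2          # d2f/du2 (constant for this quadratic objective)
    f_ux = -4.0 * x     # d2f/dudx
    return -f_ux / f_uu

x = 1.5
u_star = solve(x)
assert abs(f_u(x, u_star)) < 1e-9              # stationarity holds at the optimum
grad = implicit_grad(x, u_star)
assert abs(grad - 4.0 * x / 2.2) < 1e-9        # matches the analytic derivative of u*(x)
```

The key property, which carries over to DecDTW, is that the gradient depends only on the solution returned by the forward pass and local derivatives of the objective (and, in the constrained case, the active constraints), not on how the solution was obtained.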