FORWARD AND BACKWARD LIFELONG LEARNING WITH TIME-DEPENDENT TASKS

Abstract

For a sequence of classification tasks that arrive over time, lifelong learning methods can boost the effective sample size of each task by leveraging information from preceding and succeeding tasks (forward and backward learning). However, backward learning is often prone to so-called catastrophic forgetting, in which a task's performance degrades while repeatedly incorporating information from succeeding tasks. In addition, current lifelong learning techniques are designed for i.i.d. tasks and cannot capture the usual higher similarities between consecutive tasks. This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs) that effectively exploit forward and backward learning and account for time-dependent tasks. In addition, we analytically characterize the increase in effective sample size provided by forward and backward learning in terms of the tasks' expected quadratic change. The experimental evaluation shows that LMRCs can result in a significant performance improvement, especially for reduced sample sizes.

1. INTRODUCTION

In practical scenarios, classification problems (tasks) often have limited sample sizes and arrive sequentially over time. Lifelong learning (also known as continual learning) can boost the effective sample size (ESS) of each task by leveraging information from preceding and succeeding tasks (forward and backward learning) (Ruvolo & Eaton, 2013; Lopez-Paz & Ranzato, 2017; Chen & Liu, 2018). The general goal of such approaches is to replicate humans' ability to continually improve the performance on each task by exploiting information acquired from other tasks.

The development of lifelong learning techniques is hindered by the continuous arrival of samples from tasks characterized by different underlying distributions. In particular, backward learning (also known as reverse transfer) is often prone to so-called catastrophic forgetting, in which a task's performance degrades while repeatedly incorporating information from the succeeding tasks (Kirkpatrick et al., 2017; Hurtado et al., 2021; Henning et al., 2021). More generally, lifelong learning methods face the so-called stability-plasticity dilemma: excessive usage of information from different tasks can result in a performance decrease, while moderate usage does not fully exploit the potential of lifelong learning (Rolnick et al., 2019; Ke et al., 2021). Most lifelong learning techniques are designed for tasks sampled i.i.d. from a task environment (Baxter, 2000; Maurer et al., 2016; Denevi et al., 2019), and current methods cannot capture the usual higher similarities between consecutive tasks. For a sequence of tasks that arrive over time, it is common that the tasks are time-dependent and consecutive tasks are significantly more similar. For instance, if each task corresponds to the classification of portraits from a specific time period (Ginosar et al., 2015), the similarity between tasks is markedly higher for consecutive tasks (see Figure 1).
In the current literature of lifelong learning, only Pentina & Lampert (2015) considers scenarios with time-dependent tasks and analyzes the feasibility of transferring information from preceding tasks. On the other hand, methods designed for concept drift adaptation (Zhao et al., 2020; Tahmasbi et al., 2021; Álvarez et al., 2022) account for time-dependent underlying distributions but only aim to learn the last task in the sequence.

This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs). The proposed techniques effectively exploit forward and backward learning and account for time-dependent tasks. Specifically, the main contributions presented in the paper are as follows.

• The presented LMRCs minimize the worst-case error probabilities over uncertainty sets obtained using information from all the tasks.
• We propose learning techniques that can effectively incorporate information from the ever-increasing sequence of tasks and provide performance guarantees for forward and backward learning.
• We analytically characterize the increase in ESS provided by forward and backward learning in terms of the expected quadratic change between consecutive tasks.
• We numerically quantify the performance improvement provided by the presented learning techniques in comparison with existing methods using multiple datasets, different sample sizes, and numbers of tasks.

Notations. Calligraphic letters represent sets; ∥·∥₁ and ∥·∥∞ denote the 1-norm and the infinity norm of their argument, respectively; ⪯ and ⪰ denote vector inequalities; I{·} denotes the indicator function; and E_p{·} and Var_p{·} denote the expectation and the variance of their argument with respect to distribution p. For a vector v, v^(i) and v^T denote the i-th component and the transpose of v, respectively. Non-linear operators acting on vectors denote component-wise operations; for instance, |v| and v² denote the vectors formed by the absolute value and the square of each component, respectively.

2. PRELIMINARIES

In the following, we denote by X the set of instances or attributes, Y the set of labels or classes, Δ(X × Y) the set of probability distributions over X × Y, and T(X, Y) the set of classification rules. A classification task is characterized by an underlying distribution p* ∈ Δ(X × Y), and supervised classification methods use a sample set D = {(x_i, y_i)}_{i=1}^n formed by n i.i.d. samples from distribution p* to find a classification rule h ∈ T(X, Y) with small expected loss ℓ(h, p*).

In lifelong learning, sample sets D_1, D_2, ... arrive over time steps 1, 2, ..., corresponding to different classification tasks characterized by underlying distributions p_1, p_2, .... At each time step k, lifelong learning methods aim to obtain classification rules h_1, h_2, ..., h_k with small expected losses ℓ(h_1, p_1), ℓ(h_2, p_2), ..., ℓ(h_k, p_k) for the current sequence of k tasks. For instance, overall performance is usually assessed by the averaged error (1/k) ∑_{i=1}^k ℓ(h_i, p_i). As depicted in Figure 1, for each j-th task with j ∈ {1, 2, ..., k}, lifelong learning methods obtain the classification rule h_j leveraging information obtained from sample sets D_1, D_2, ..., D_j (forward learning) and from sample sets D_{j+1}, D_{j+2}, ..., D_k (backward learning).

Most existing lifelong learning techniques are designed for tasks characterized by distributions p_1, p_2, ... such that the tasks' distributions p_i are independent and identically distributed (i.i.d.) random probability measures for i = 1, 2, .... In the following, we propose lifelong learning techniques designed for time-dependent tasks that are characterized by distributions p_1, p_2, ... such that the changes between consecutive distributions p_{i+1} − p_i are independent and zero-mean random signed measures for i = 1, 2, ....
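The setting above can be illustrated with a minimal sketch (hypothetical setup, not the paper's LMRC method): a sequence of k binary tasks whose class means drift slightly between consecutive time steps, where each task j is learned by an independent single-task rule h_j (a threshold fit on its own samples) and overall performance is the averaged error (1/k) ∑_j ℓ(h_j, p_j). The task construction, drift magnitude, and threshold rule are all illustrative choices.

```python
# Hypothetical single-task baseline for a sequence of time-dependent tasks:
# no forward or backward transfer, each rule uses only its own sample set D_j.
import random

random.seed(0)

def make_task(mean_gap):
    """Sample a task: class 0 ~ N(0, 1), class 1 ~ N(mean_gap, 1)."""
    data = [(random.gauss(0, 1), 0) for _ in range(30)]
    data += [(random.gauss(mean_gap, 1), 1) for _ in range(30)]
    return data

def fit_threshold(data):
    """Single-task rule: predict class 1 above the midpoint of the class means."""
    m0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    m1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    return (m0 + m1) / 2

def error(thresh, data):
    """Fraction of misclassified samples under the threshold rule."""
    return sum((x > thresh) != (y == 1) for x, y in data) / len(data)

k = 5
gap = 2.0
tasks, rules = [], []
for j in range(k):
    gap += random.gauss(0, 0.1)   # small independent change between consecutive tasks
    tasks.append(make_task(gap))
    rules.append(fit_threshold(tasks[-1]))

avg_err = sum(error(h, D) for h, D in zip(rules, tasks)) / k
print(f"averaged error over {k} tasks: {avg_err:.3f}")
```

With only 60 samples per task, each rule here is limited by its own sample size; the point of forward and backward learning is precisely to raise the ESS of each task by also exploiting the similar neighboring tasks.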
Such an assumption can account for the usual higher similarities between consecutive tasks; for instance, it implies that p_{i+t} − p_i is a zero-mean random variable with Var{p_{i+t} − p_i} = ∑_{j=1}^t Var{p_{i+j} − p_{i+j−1}}, while



Figure 1: For tasks that arrive over time, consecutive tasks are often more similar. Forward and backward learning can exploit such similarities and extract information from preceding and succeeding tasks.
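The variance identity Var{p_{i+t} − p_i} = ∑_{j=1}^t Var{p_{i+j} − p_{i+j−1}} can be checked numerically: for independent zero-mean increments, the variance of the accumulated change is the sum of the per-step variances. The sketch below (an illustrative simulation, not from the paper) tracks a scalar distribution parameter undergoing i.i.d. increments of variance σ², for which the identity reduces to t·σ².

```python
# Numerical check of variance additivity for independent zero-mean increments:
# simulate p_{i+t} - p_i as a sum of t independent N(0, sigma^2) steps and
# compare the empirical variance against the predicted t * sigma^2.
import random
from statistics import pvariance

random.seed(1)
sigma, t, runs = 0.5, 4, 20000

diffs = []
for _ in range(runs):
    p = 0.0
    for _ in range(t):
        p += random.gauss(0, sigma)   # independent zero-mean change per step
    diffs.append(p)                   # one realization of p_{i+t} - p_i

empirical = pvariance(diffs)
predicted = t * sigma**2              # sum of the t per-step variances
print(f"empirical Var: {empirical:.3f}, predicted t*sigma^2: {predicted:.3f}")
```

Linear growth of the variance in t matches the intuition the model is meant to capture: tasks that are t steps apart are progressively less similar, while consecutive tasks (t = 1) remain the most similar.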

