FORWARD AND BACKWARD LIFELONG LEARNING WITH TIME-DEPENDENT TASKS

Abstract

For a sequence of classification tasks that arrive over time, lifelong learning methods can boost the effective sample size of each task by leveraging information from preceding and succeeding tasks (forward and backward learning). However, backward learning is often prone to so-called catastrophic forgetting, in which a task's performance gets worse while trying to repeatedly incorporate information from succeeding tasks. In addition, current lifelong learning techniques are designed for i.i.d. tasks and cannot capture the usual higher similarities between consecutive tasks. This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs) that effectively exploit forward and backward learning and account for time-dependent tasks. In addition, we analytically characterize the increase in effective sample size provided by forward and backward learning in terms of the tasks' expected quadratic change. The experimental evaluation shows that LMRCs can result in a significant performance improvement, especially for reduced sample sizes.

1. INTRODUCTION

In practical scenarios, classification problems (tasks) often have limited sample sizes and arrive sequentially over time. Lifelong learning (also known as continual learning) can boost the effective sample size (ESS) of each task by leveraging information from preceding and succeeding tasks (forward and backward learning) (Ruvolo & Eaton, 2013; Lopez-Paz & Ranzato, 2017; Chen & Liu, 2018). The general goal of such approaches is to replicate the human ability to continually improve the performance on each task by exploiting information acquired from other tasks.

The development of lifelong learning techniques is hindered by the continuous arrival of samples from tasks characterized by different underlying distributions. In particular, backward learning (also known as reverse transfer) is often prone to so-called catastrophic forgetting, in which a task's performance gets worse while trying to repeatedly incorporate information from the succeeding tasks (Kirkpatrick et al., 2017; Hurtado et al., 2021; Henning et al., 2021). More generally, lifelong learning methods face a so-called stability-plasticity dilemma: excessive use of information from different tasks can decrease performance, while moderate use does not fully exploit the potential of lifelong learning (Rolnick et al., 2019; Ke et al., 2021).

Most lifelong learning techniques are designed for tasks sampled i.i.d. from a task environment (Baxter, 2000; Maurer et al., 2016; Denevi et al., 2019), and current methods cannot capture the usual higher similarities between consecutive tasks. For a sequence of tasks that arrive over time, it is common that the tasks are time-dependent and consecutive tasks are significantly more similar. For instance, if each task corresponds to the classification of portraits from a specific time period (Ginosar et al., 2015), the similarity between tasks is markedly higher for consecutive tasks (see Figure 1).

Figure 1: For tasks that arrive over time, consecutive tasks are often more similar. Forward and backward learning can exploit such similarities and extract information from preceding and succeeding tasks.

In the current literature of lifelong learning, only Pentina & Lampert (2015) consider scenarios with time-dependent tasks and analyze the feasibility of transferring information from the preceding tasks. On the other hand, methods designed for concept drift adaptation (Zhao et al., 2020; Tahmasbi et al., 2021; Álvarez et al., 2022) account for time-dependent underlying distributions but only aim to learn the last task in the sequence.

This paper presents lifelong learning methods based on minimax risk classifiers (LMRCs). The proposed techniques effectively exploit forward and backward learning and account for time-dependent tasks. Specifically, the main contributions presented in the paper are as follows.

• The presented LMRCs minimize the worst-case error probabilities over uncertainty sets obtained using information from all the tasks.
• We propose learning techniques that can effectively incorporate information from the ever-increasing sequence of tasks and provide performance guarantees for forward and backward learning.
• We analytically characterize the increase in ESS provided by forward and backward learning in terms of the expected quadratic change between consecutive tasks.
• We numerically quantify the performance improvement provided by the presented learning techniques in comparison with existing methods using multiple datasets, different sample sizes, and numbers of tasks.

Notations. Calligraphic letters represent sets; $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the 1-norm and the infinity norm of their argument, respectively; $\preceq$ and $\succeq$ denote vector inequalities; $\mathbb{I}\{\cdot\}$ denotes the indicator function; and $\mathbb{E}_p\{\cdot\}$ and $\mathrm{Var}_p\{\cdot\}$ denote the expectation and the variance of their argument with respect to distribution $p$. For a vector $v$, $v^{(i)}$ and $v^{\mathrm{T}}$ denote the $i$-th component and the transpose of $v$. Non-linear operators acting on vectors denote component-wise operations. For instance, $|v|$ and $v^2$ denote the vectors formed by the absolute value and the square of each component, respectively.

2. PRELIMINARIES

In the following, we denote by $\mathcal{X}$ the set of instances or attributes, $\mathcal{Y}$ the set of labels or classes, $\Delta(\mathcal{X} \times \mathcal{Y})$ the set of probability distributions over $\mathcal{X} \times \mathcal{Y}$, and $\mathrm{T}(\mathcal{X}, \mathcal{Y})$ the set of classification rules. A classification task is characterized by an underlying distribution $p^* \in \Delta(\mathcal{X} \times \mathcal{Y})$, and supervised classification methods use a sample set $D = \{(x_i, y_i)\}_{i=1}^{n}$ formed by $n$ i.i.d. samples from distribution $p^*$ to find a classification rule $h \in \mathrm{T}(\mathcal{X}, \mathcal{Y})$ with small expected loss $\ell(h, p^*)$.

In lifelong learning, sample sets $D_1, D_2, \ldots$ arrive over time steps $1, 2, \ldots$ corresponding with different classification tasks characterized by underlying distributions $p_1, p_2, \ldots$. At each time step $k$, lifelong learning methods aim to obtain classification rules $h_1, h_2, \ldots, h_k$ with small expected losses $\ell(h_1, p_1), \ell(h_2, p_2), \ldots, \ell(h_k, p_k)$ for the current sequence of $k$ tasks. For instance, overall performance is usually assessed by the averaged error $\frac{1}{k}\sum_{i=1}^{k} \ell(h_i, p_i)$. As depicted in Fig. 1, for each $j$-th task with $j \in \{1, 2, \ldots, k\}$, lifelong learning methods obtain the classification rule $h_j$ leveraging information obtained from sample sets $D_1, D_2, \ldots, D_j$ (forward learning) and from sample sets $D_{j+1}, D_{j+2}, \ldots, D_k$ (backward learning).

Most existing lifelong learning techniques are designed for tasks characterized by distributions $p_1, p_2, \ldots$ such that the tasks' distributions $p_i$ are independent and identically distributed (i.i.d.) random probability measures for $i = 1, 2, \ldots$. In the following, we propose lifelong learning techniques designed for time-dependent tasks that are characterized by distributions $p_1, p_2, \ldots$ such that the changes between consecutive distributions $p_{i+1} - p_i$ are independent and zero-mean random signed measures for $i = 1, 2, \ldots$.
Such an assumption can account for the usual higher similarities between consecutive tasks; for instance, it implies that $p_{i+t} - p_i$ is a zero-mean random variable with $\mathrm{Var}\{p_{i+t} - p_i\} = \sum_{j=1}^{t} \mathrm{Var}\{p_{i+j} - p_{i+j-1}\}$, while the i.i.d. case would imply that $p_{i+t} - p_i$ is a zero-mean random variable with $\mathrm{Var}\{p_{i+t} - p_i\} = \mathrm{Var}\{p_{i+1} - p_i\} = 2\mathrm{Var}\{p_1\}$ for any $t$ and $i$.

As described above, lifelong learning methods consider tasks characterized by different distributions. The methods presented below are based on the framework of minimax risk classifiers (MRCs) (Mazuelas et al., 2020; 2022a; b) since MRCs can utilize general expectation estimates obtained from samples with different distributions. MRCs learn classification rules by minimizing the worst-case expected loss against an uncertainty set that can include the underlying distribution with high probability. Such uncertainty sets are given by constraints on the expectation of a feature mapping $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^m$ as

$$\mathcal{U} = \{p \in \Delta(\mathcal{X} \times \mathcal{Y}) : |\mathbb{E}_p\{\Phi(x, y)\} - \tau| \preceq \lambda\} \tag{1}$$

where $\tau$ denotes a mean vector of expectation estimates and $\lambda$ denotes a confidence vector. Feature mappings are vector-valued functions over $\mathcal{X} \times \mathcal{Y}$, e.g., one-hot encodings of values from the last layers in a neural network (Bengio et al., 2013; Kemker & Kanan, 2018; Mohri et al., 2018; Hurtado et al., 2021).

Given the uncertainty set $\mathcal{U}$, MRC rules are solutions of the optimization problem

$$R(\mathcal{U}) = \min_{h \in \mathrm{T}(\mathcal{X}, \mathcal{Y})} \max_{p \in \mathcal{U}} \ell(h, p) \tag{2}$$

where $R(\mathcal{U})$ denotes the minimax risk and $\ell(h, p)$ denotes the expected loss of classification rule $h$ for distribution $p$. In the following, we utilize the 0-1 loss so that $\ell(h, p) = \mathbb{E}_p\{\mathbb{I}\{h(x) \neq y\}\}$ and the expected loss with respect to the underlying distribution becomes the error probability of the classification rule.
Deterministic MRCs assign each instance $x \in \mathcal{X}$ the label $h(x) \in \arg\max_{y \in \mathcal{Y}} \Phi(x, y)^{\mathrm{T}} \mu^*$ where the parameter $\mu^*$ is the solution of the convex optimization problem

$$\min_{\mu} \; 1 - \tau^{\mathrm{T}} \mu + \max_{x \in \mathcal{X},\, \mathcal{C} \subseteq \mathcal{Y}} \frac{\sum_{y \in \mathcal{C}} \Phi(x, y)^{\mathrm{T}} \mu - 1}{|\mathcal{C}|} + \lambda^{\mathrm{T}} |\mu| \tag{3}$$

given by the Lagrange dual of (2). MRCs provide bounds for the error probability with respect to the minimax risk $R(\mathcal{U})$ and the smallest minimax risk as described in Mazuelas et al. (2020; 2022a; b). The smallest minimax risk, denoted by $R^\infty$, is the minimax risk corresponding to the ideal case of knowing mean vectors exactly, that is, $R^\infty$ is the minimax risk corresponding with the uncertainty set $\mathcal{U}^\infty = \{p \in \Delta(\mathcal{X} \times \mathcal{Y}) : \mathbb{E}_p\{\Phi(x, y)\} = \mathbb{E}_{p^*}\{\Phi(x, y)\}\}$. Such minimax risk coincides with the Bayes risk if the underlying distribution is the worst-case distribution in the uncertainty set given by the true expectation of the feature mapping (Mazuelas et al., 2022a). If the mean vector of expectation estimates is obtained using a sample set $D$ formed by instance-label pairs from the same underlying distribution, then the performance bounds of MRCs are of the usual order $O(1/\sqrt{n})$ where $n$ is the sample size of $D$.

In lifelong learning, the baseline approach of single-task learning obtains a classification rule $h_j$ for each $j$-th task leveraging information only from the sample set $D_j = \{(x_{j,i}, y_{j,i})\}_{i=1}^{n_j}$ of size $n_j$. In that case, LMRCs coincide with MRCs for standard supervised classification that obtain the mean and confidence vectors as

$$\tau_j = \frac{1}{n_j} \sum_{i=1}^{n_j} \Phi(x_{j,i}, y_{j,i}), \qquad \lambda_j = \sqrt{s_j}, \qquad s_j = \frac{\sigma_j^2}{n_j} \tag{4}$$

with $\sigma_j^2$ an estimate of $\mathrm{Var}_{p_j}\{\Phi(x, y)\}$, e.g., the sample variance of the $n_j$ samples. The vector $s_j$ describes the mean squared errors (MSEs) of the mean vector components and directly gives the confidence vector $\lambda_j$ as shown in (4). In the following sections, we describe techniques that obtain the mean and MSE vectors using forward and backward learning.
Once such vectors are obtained, LMRC methods take the confidence vector $\lambda_j$ as in (4) and obtain the classifier parameter $\mu_j$ for each $j$-th task by solving the convex optimization problem in (3), which can be efficiently addressed using conventional methods (Nesterov & Shikhman, 2015; Tao et al., 2019).
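As an illustration of the single-task quantities in (4), the following minimal sketch computes the mean vector, the MSE vector, and the confidence vector from the feature vectors of one task. It assumes that the feature mapping has already been evaluated on the samples and that the sample variance (with the usual n - 1 normalization) is used as the variance estimate; the function name `single_task_vectors` is hypothetical and does not come from the paper's code.

```python
from statistics import fmean, variance


def single_task_vectors(features):
    """Sketch of (4) for one task.

    features: list of n >= 2 feature vectors Phi(x_i, y_i), each a list of m floats
              (assumed to be precomputed; this helper is illustrative only).
    Returns (tau, s, lam): mean vector, MSE vector, and confidence vector.
    """
    n = len(features)
    columns = list(zip(*features))               # one tuple per feature component
    tau = [fmean(col) for col in columns]        # sample average of each component
    sigma2 = [variance(col) for col in columns]  # sample variance of each component
    s = [v / n for v in sigma2]                  # MSE of each mean component
    lam = [v ** 0.5 for v in s]                  # confidence vector, sqrt of the MSE
    return tau, s, lam
```

For example, `single_task_vectors([[0.0, 1.0], [2.0, 1.0]])` uses two samples with two feature components each; a component that is constant over the sample gets zero MSE and hence zero confidence width.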

3. FORWARD LEARNING WITH PERFORMANCE GUARANTEES

This section presents the recursions that make it possible to obtain mean and MSE vectors for each task while retaining information from preceding tasks. In addition, it characterizes the increase in ESS provided by forward learning in terms of the tasks' expected quadratic change and the number of tasks.

3.1. FORWARD LEARNING

The proposed techniques for forward learning account for time-dependent tasks and obtain classification rules for each task leveraging information from preceding tasks. Let $\tau_j^{\rightharpoonup}$ and $s_j^{\rightharpoonup}$ denote the mean and MSE vectors for forward learning corresponding to the $j$-th task for $j \in \{1, 2, \ldots, k\}$. The following recursions make it possible to obtain $\tau_j^{\rightharpoonup}$ and $s_j^{\rightharpoonup}$ for each $j$-th task using those vectors for the preceding task $\tau_{j-1}^{\rightharpoonup}, s_{j-1}^{\rightharpoonup}$ as

$$\tau_j^{\rightharpoonup} = \tau_j + \frac{s_j}{s_{j-1}^{\rightharpoonup} + s_j + d_j^2} \left( \tau_{j-1}^{\rightharpoonup} - \tau_j \right) \tag{5}$$

$$s_j^{\rightharpoonup} = \left( \frac{1}{s_j} + \frac{1}{s_{j-1}^{\rightharpoonup} + d_j^2} \right)^{-1} \tag{6}$$

with $\tau_j$ and $s_j$ given by (4) and $\tau_1^{\rightharpoonup} = \tau_1$, $s_1^{\rightharpoonup} = s_1$. The vector $d_j^2$ assesses the expected quadratic change between consecutive tasks.

In the following, the change between consecutive tasks is described by $w_j = \tau_j^{\infty} - \tau_{j-1}^{\infty}$ for any $j \in \{2, 3, \ldots, k\}$ where $\tau_j^{\infty} = \mathbb{E}_{p_j}\{\Phi(x, y)\}$ is the expectation of the feature mapping with respect to the underlying distribution. If $p_j - p_{j-1}$ are independent and zero-mean for $j = 2, 3, \ldots$, then vectors $w_j$ are also independent and zero-mean for any feature mapping. Taking $d_i^2 = \mathbb{E}\{w_i^2\} = \mathbb{E}\{(\tau_i^{\infty} - \tau_{i-1}^{\infty})^2\}$ and $\sigma_i^2 = \mathrm{Var}_{p_i}\{\Phi(x, y)\}$ for any $i$, the recursion in (5) provides the unbiased linear estimator of the mean vector $\tau_j^{\infty}$ based on $D_1, D_2, \ldots, D_j$ that has the minimum MSE, while the recursion in (6) provides its MSE (see Appendix A for a detailed derivation).

Vectors $\sigma_i^2$ and $d_i^2$ can be estimated online using the sample sets. In particular, $\sigma_i^2$ can be estimated as the sample variance, while $d_i^2$ can be estimated using sample averages as

$$d_i^2 = \frac{1}{W} \sum_{l=1}^{W} \left( \tau_{i_l} - \tau_{i_{l-1}} \right)^2 \tag{7}$$

where $i_0, i_1, \ldots, i_W$ are the $W + 1$ closest indexes to $i$ in $\{1, 2, \ldots, k\}$.

Recursions (5)-(6) obtain mean and MSE vectors for the $j$-th task by acquiring information from the $j$-th sample set $D_j$ and retaining information from preceding tasks. Specifically, recursion (5) obtains the mean vector $\tau_j^{\rightharpoonup}$ by adding a correction to the sample average $\tau_j$. This correction is proportional to the difference between $\tau_j$ and $\tau_{j-1}^{\rightharpoonup}$ with a proportionality constant that depends on the MSE vectors $s_j$, $s_{j-1}^{\rightharpoonup}$ and the expected quadratic change $d_j^2$. In particular, if $s_j \ll s_{j-1}^{\rightharpoonup} + d_j^2$, the mean vector is given by the sample average as in single-task learning, and if $s_j \gg s_{j-1}^{\rightharpoonup} + d_j^2$, the mean vector is given by that of the preceding task.

Note that for forward learning, at each step $k$, only the vectors for the last task $\tau_k^{\rightharpoonup}$ and $s_k^{\rightharpoonup}$ need to be obtained from those of the $(k-1)$-th task. The vectors for the remaining $j$-th tasks with $j \in \{1, 2, \ldots, k-1\}$ stay the same as at step $k-1$ (see also Fig. 2 and Alg. 1 below).
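The forward recursions (5)-(6) can be sketched in a few lines. The code below is a minimal illustration, assuming that all vectors are plain Python lists, that operations are component-wise as stated in the notation, and that the single-task MSEs are strictly positive; the names `forward_step` and `forward_pass` are hypothetical and are not taken from the paper's implementation.

```python
def forward_step(tau_prev, s_prev, tau_j, s_j, d2_j):
    """One step of recursions (5)-(6), applied component-wise.

    tau_prev, s_prev: forward mean and MSE vectors of task j-1
    tau_j, s_j:       single-task mean and MSE vectors of task j (s_j > 0 assumed)
    d2_j:             expected quadratic change between tasks j-1 and j
    """
    tau_fwd, s_fwd = [], []
    for tp, sp, t, s, d2 in zip(tau_prev, s_prev, tau_j, s_j, d2_j):
        gain = s / (sp + s + d2)                          # weight of the preceding task
        tau_fwd.append(t + gain * (tp - t))               # recursion (5)
        s_fwd.append(1.0 / (1.0 / s + 1.0 / (sp + d2)))   # recursion (6)
    return tau_fwd, s_fwd


def forward_pass(taus, ss, d2s):
    """Forward vectors for tasks 1..k; d2s[j] is the change into task j (d2s[0] is unused)."""
    fwd_tau, fwd_s = [list(taus[0])], [list(ss[0])]
    for j in range(1, len(taus)):
        t, s = forward_step(fwd_tau[-1], fwd_s[-1], taus[j], ss[j], d2s[j])
        fwd_tau.append(t)
        fwd_s.append(s)
    return fwd_tau, fwd_s
```

With zero expected change and equal MSEs, the step averages the two estimates and halves the MSE; with a very large expected change, the gain vanishes and the result falls back to the single-task vectors, matching the two limiting cases described above.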

3.2. PERFORMANCE GUARANTEES AND EFFECTIVE SAMPLE SIZES

The following result provides bounds for the minimax risk for each task with respect to the smallest minimax risk. For each $j$-th task, we denote by $R_j^{\infty}$ the smallest minimax risk and by $\mu_j^{\infty}$ the classifier parameter that determines the optimal minimax rule, as described in Section 2. In addition, we denote by $R(\mathcal{U}_j^{\rightharpoonup})$ the minimax risk over uncertainty set $\mathcal{U}_j^{\rightharpoonup}$ determined as in (1) using the mean and confidence vectors $\tau_j^{\rightharpoonup}$ and $\lambda_j^{\rightharpoonup} = \sqrt{s_j^{\rightharpoonup}}$ provided by (5) and (6).

Theorem 1. Let $M$ and $\kappa$ be such that $M \geq \|\Phi(x, y)\|_\infty$ $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$ and

$$\kappa \geq \frac{\sigma(\Phi_j^{(i)})}{\sigma_j^{(i)}}, \qquad \kappa \geq \frac{\sigma(w_j^{(i)})}{d_j^{(i)}}$$

for $j = 1, 2, \ldots, k$ and $i = 1, 2, \ldots, m$, where $\Phi_j^{(i)}$ denotes the r.v. given by the $i$-th component of the feature mapping of samples from the $j$-th task, and $\sigma(z)$ denotes the sub-Gaussian parameter of a r.v. $z$, i.e., $\mathbb{E}\{e^{t(z - \mathbb{E}\{z\})}\} \leq e^{\sigma(z)^2 t^2 / 2}$ $\forall t$. For any $j \in \{1, 2, \ldots, k\}$, we have that

$$R(\mathcal{U}_j^{\rightharpoonup}) \leq R_j^{\infty} + \frac{M(\kappa + 1)\sqrt{2 \log(2m/\delta)}}{\sqrt{n_j^{\rightharpoonup}}} \|\mu_j^{\infty}\|_1 \quad \text{with prob. at least } 1 - \delta \tag{8}$$

with $n_1^{\rightharpoonup} = n_1$ and

$$n_j^{\rightharpoonup} \geq n_j + n_{j-1}^{\rightharpoonup} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j-1}^{\rightharpoonup} \|d_j^2\|_\infty} \quad \text{for } j \geq 2.$$

Proof. See Appendix B.

The excess risk in inequality (8) decreases as $O(1/\sqrt{n_j^{\rightharpoonup}})$ using the forward learning methods proposed, while such difference would decrease as $O(1/\sqrt{n_j})$ using only the information of the $j$-th task. Therefore, $n_j^{\rightharpoonup}$ in (8) is the ESS of the proposed LMRC method with forward learning. The ESS of each task is obtained by adding a fraction of the ESS for the preceding task to the sample size. In particular, if $d_j^2$ is large, the ESS is given by the sample size, while if $d_j^2$ is small, the ESS is given by the sum of the sample size and the ESS of the preceding task.

Other existing methods provide comparable performance bounds (Mohri & Medina, 2012; Pentina & Lampert, 2015). Such bounds decrease with the number of tasks and increase with the change between consecutive distributions. Specifically, bounds in Proposition 1 of Mohri & Medina (2012) and in Theorem 7 of Pentina & Lampert (2015) are proportional to the discrepancy and to the Kullback-Leibler divergence between consecutive distributions. The bound in Theorem 1 above decreases with the number of tasks and increases with the expected quadratic change $d_j^2$ between consecutive distributions. Note that the coefficient $\kappa$ in (8) can be taken to be small as long as the values used for $\sigma_j$ and $d_j$ are not much lower than the sub-Gaussian parameters of $\Phi_j$ and $w_j$, respectively. In particular, $\kappa$ is smaller than the maximum of $M / \min_{j,i}\{\sigma_j^{(i)}\}$ and $2M / \min_{j,i}\{d_j^{(i)}\}$ due to the bound for the sub-Gaussian parameter of bounded random variables (see e.g., Section 2.1.2 in Wainwright (2019)).

Theorem 1 shows the increase in ESS in terms of the ESS of the preceding task. The following result makes it possible to directly quantify the ESS in terms of the sample size and the expected quadratic change.

Theorem 2. Let $d$, $\sigma_j$ and $n$ be such that $d^2 \geq \|d_j^2\|_\infty$, $\|\sigma_j^2\|_\infty \leq 1$, and $n \leq n_j$ for $j = 1, 2, \ldots, k$. For any $j \in \{1, 2, \ldots, k\}$, we have that the ESS in (8) can be taken so that it satisfies

$$n_j^{\rightharpoonup} \geq n \left( 1 + \frac{(1 + \alpha)^{2j-1} - 1 - \alpha}{\alpha (1 + \alpha)^{2j-1} + \alpha} \right) \quad \text{with } \alpha = \frac{2}{\sqrt{1 + \frac{4}{n d^2}} - 1}. \tag{9}$$

In particular, for $j \geq 2$, we have that

$$n_j^{\rightharpoonup} \geq n \left( 1 + \frac{j - 1}{3} \right) \quad \text{if } n d^2 < \frac{1}{j^2}$$

$$n_j^{\rightharpoonup} \geq n \left( 1 + \frac{1}{5 \sqrt{n d^2}} \right) \quad \text{if } \frac{1}{j^2} \leq n d^2 < 1$$

$$n_j^{\rightharpoonup} \geq n \left( 1 + \frac{1}{3 n d^2} \right) \quad \text{if } n d^2 \geq 1.$$

Proof. See Appendix C.

The above theorem characterizes the increase in ESS provided by forward learning in terms of the tasks' expected quadratic change. Such increase grows monotonically with the number of preceding tasks $j$ as shown in (9) and becomes proportional to $j$ when the expected quadratic change is smaller than $1/(j^2 n)$. Figure 3 below further illustrates the increase in ESS with respect to the sample size ($n_j^{\rightharpoonup}/n$) due to forward learning in comparison with forward and backward learning.
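To make the ESS expressions concrete, the following sketch evaluates the recursion of Theorem 1 taken with equality and the closed-form expression of Theorem 2. It is only an illustration under simplifying assumptions that are not stated in this form in the theorems: every task has the same sample size n, the variance norm equals 1, and the expected quadratic change has the same positive value d2 for every task. The function names are hypothetical.

```python
import math


def ess_forward_recursion(n, d2, k):
    """ESS recursion of Theorem 1 taken with equality, assuming n_j = n,
    unit variance norm, and the same change d2 > 0 for every task."""
    ess = [float(n)]
    for _ in range(1, k):
        prev = ess[-1]
        ess.append(n + prev / (1.0 + prev * d2))  # add a fraction of the preceding ESS
    return ess


def ess_forward_closed_form(n, d2, j):
    """Closed-form expression (9) of Theorem 2 for the j-th task."""
    alpha = 2.0 / (math.sqrt(1.0 + 4.0 / (n * d2)) - 1.0)
    p = (1.0 + alpha) ** (2 * j - 1)
    return n * (1.0 + (p - 1.0 - alpha) / (alpha * p + alpha))
```

Under these assumptions the two functions return the same values up to floating-point error, which offers a quick sanity check of the reconstructed expression; dividing the result by n gives the relative increase in ESS discussed above.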

4. FORWARD AND BACKWARD LEARNING WITH PERFORMANCE GUARANTEES

This section presents the recursions that make it possible to obtain mean and MSE vectors for each task while retaining information from preceding tasks and acquiring information from succeeding tasks. In addition, it characterizes the increase in ESS provided by forward and backward learning in terms of the tasks' expected quadratic change and the number of tasks.

Backward learning is more challenging than forward learning since, for each task, the sequence of succeeding tasks is ever-increasing due to the continuous arrival of tasks, while the sequence of preceding tasks is always the same. The repeated use of information from the succeeding tasks can result in so-called catastrophic forgetting, in which the tasks' performance gets worse over time. The techniques proposed below for backward learning effectively increase the ESS over time by carefully accounting for the new information at each step.

4.1. FORWARD AND BACKWARD LEARNING

The proposed techniques for forward and backward learning account for time-dependent tasks and obtain classification rules for each task leveraging information from preceding and succeeding tasks. From preceding tasks, we obtain the forward mean and MSE vectors $\tau_j^{\rightharpoonup}, s_j^{\rightharpoonup}$ using recursions (5)-(6), while from succeeding tasks, we obtain the backward mean and MSE vectors $\tau_j^{\leftharpoondown k}, s_j^{\leftharpoondown k}$ using recursions (5)-(6) in retrodiction. Specifically, vectors $\tau_j^{\leftharpoondown k}$ and $s_j^{\leftharpoondown k}$ are obtained using the same recursion as for $\tau_j^{\rightharpoonup}$ and $s_j^{\rightharpoonup}$ in (5)-(6) with $s_{j+1}^{\leftharpoondown k}$, $d_{j+1}^2$, and $\tau_{j+1}^{\leftharpoondown k}$ instead of $s_{j-1}^{\rightharpoonup}$, $d_j^2$, and $\tau_{j-1}^{\rightharpoonup}$.

Let $\tau_j^{\leftrightharpoons k}$ and $s_j^{\leftrightharpoons k}$ denote the mean and MSE vectors for forward and backward learning corresponding to the $j$-th task for $j \in \{1, 2, \ldots, k\}$. The following recursions make it possible to obtain, at each step $k$, the mean and MSE vectors $\tau_j^{\leftrightharpoons k}$ and $s_j^{\leftrightharpoons k}$ for each $j$-th task using those vectors for forward learning $\tau_j^{\rightharpoonup}, s_j^{\rightharpoonup}$ and backward learning $\tau_{j+1}^{\leftharpoondown k}, s_{j+1}^{\leftharpoondown k}$ as $\tau_k^{\leftrightharpoons k} = \tau_k^{\rightharpoonup}$, $s_k^{\leftrightharpoons k} = s_k^{\rightharpoonup}$ and

$$\tau_j^{\leftrightharpoons k} = \tau_j^{\rightharpoonup} + \frac{s_j^{\rightharpoonup}}{s_j^{\rightharpoonup} + s_{j+1}^{\leftharpoondown k} + d_{j+1}^2} \left( \tau_{j+1}^{\leftharpoondown k} - \tau_j^{\rightharpoonup} \right) \tag{10}$$

$$s_j^{\leftrightharpoons k} = \left( \frac{1}{s_j^{\rightharpoonup}} + \frac{1}{s_{j+1}^{\leftharpoondown k} + d_{j+1}^2} \right)^{-1} \tag{11}$$

with $\tau_k^{\leftharpoondown k} = \tau_k$ and $s_k^{\leftharpoondown k} = s_k$.

Analogously to the case of forward learning in Section 3.1, taking $d_i^2 = \mathbb{E}\{w_i^2\}$ and $\sigma_i^2 = \mathrm{Var}_{p_i}\{\Phi(x, y)\}$ for any $i$, the recursion in (10) provides the unbiased linear estimator of the mean vector $\tau_j^{\infty}$ based on $D_1, D_2, \ldots, D_j$ and $D_{j+1}, D_{j+2}, \ldots, D_k$ that has the minimum MSE, while the recursion in (11) provides its MSE (see Appendix A for a detailed derivation).

Recursions (10)-(11) obtain at step $k$ the mean and MSE vectors for the $j$-th task by retaining information from preceding tasks and acquiring information from succeeding tasks. Specifically, recursion (10) obtains the mean vector $\tau_j^{\leftrightharpoons k}$ by adding a correction to the mean vector of the corresponding task $\tau_j^{\rightharpoonup}$ obtained for forward learning. This correction is proportional to the difference between $\tau_j^{\rightharpoonup}$ and $\tau_{j+1}^{\leftharpoondown k}$ with a proportionality constant that depends on the MSE vectors $s_j^{\rightharpoonup}$, $s_{j+1}^{\leftharpoondown k}$ and the expected quadratic change $d_{j+1}^2$. In particular, if $s_j^{\rightharpoonup} \ll s_{j+1}^{\leftharpoondown k} + d_{j+1}^2$, the mean vector is given by that of the corresponding task for forward learning, and if $s_j^{\rightharpoonup} \gg s_{j+1}^{\leftharpoondown k} + d_{j+1}^2$, the mean vector is given by that of the succeeding task for backward learning.
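Recursions (5)-(6), their use in retrodiction, and recursions (10)-(11) all share one algebraic form: an estimate is combined with a second estimate whose MSE has been inflated by the expected quadratic change. The sketch below exploits this for a single feature component (scalars); it is a minimal illustration with hypothetical names (`fuse`, `forward_backward`), assuming strictly positive MSEs, and it recomputes all backward vectors from the last task rather than limiting the pass to b backward steps as in the paper's implementation.

```python
def fuse(tau_a, s_a, tau_b, s_b):
    """Combine estimate (tau_a, s_a) with a second estimate (tau_b, s_b).

    This is the common form of (5)-(6) and (10)-(11); s_b is expected to
    already include the expected quadratic change between the two tasks.
    """
    gain = s_a / (s_a + s_b)
    return tau_a + gain * (tau_b - tau_a), 1.0 / (1.0 / s_a + 1.0 / s_b)


def forward_backward(tau, s, d2):
    """Scalar sketch for tasks 0..k-1; d2[j] is the change into task j (d2[0] is unused)."""
    k = len(tau)
    fwd = [(tau[0], s[0])]
    for j in range(1, k):                       # forward pass, recursions (5)-(6)
        t_prev, s_prev = fwd[-1]
        fwd.append(fuse(tau[j], s[j], t_prev, s_prev + d2[j]))
    bwd = [None] * k
    bwd[k - 1] = (tau[k - 1], s[k - 1])
    for j in range(k - 2, -1, -1):              # backward pass, (5)-(6) in retrodiction
        t_next, s_next = bwd[j + 1]
        bwd[j] = fuse(tau[j], s[j], t_next, s_next + d2[j + 1])
    both = []
    for j in range(k - 1):                      # combination, recursions (10)-(11)
        t_next, s_next = bwd[j + 1]
        both.append(fuse(fwd[j][0], fwd[j][1], t_next, s_next + d2[j + 1]))
    both.append(fwd[k - 1])                     # the last task has no succeeding task
    return fwd, bwd, both
```

Since the combined MSE is a harmonic-type combination of the forward MSE and a positive term, it never exceeds the forward MSE, which is the mechanism behind the increase in ESS analyzed in Section 4.3.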

4.2. IMPLEMENTATION

This section describes the implementation of the proposed LMRCs with forward and backward learning and its computational and memory complexities.

Figure 2: The forward and backward mean vectors $\tau_j^{\leftrightharpoons k}$ given by (10) are obtained from the forward mean vectors $\tau_j^{\rightharpoonup}$ and the backward mean vectors $\tau_{j+1}^{\leftharpoondown k}$. In particular, $\tau_j^{\rightharpoonup}$ provides the information from the preceding tasks $1, 2, \ldots, j$, while $\tau_{j+1}^{\leftharpoondown k}$ provides the information from the succeeding tasks $j+1, j+2, \ldots, k$.

Algorithm 1 details the implementation of the proposed LMRCs at each step. For $k$ steps, LMRCs have computational complexity $O((b + 1)Kmk)$ and memory complexity $O((b + k)m)$ where $K$ is the number of iterations used for the convex optimization problem (3), $m$ is the length of the feature vector, and $b$ is the number of backward steps. In particular, if $b = 0$, LMRC carries out only forward learning. The complexity of forward and backward learning increases proportionally to the number of backward steps, which can be taken to be rather small, as shown in the following. Even more efficient implementations can be obtained using Rauch-Tung-Striebel recursions (see e.g., Section 7.2 in Anderson & Moore (1979)) that can obtain $\tau_j^{\leftrightharpoons k}$ from $\tau_{j+1}^{\leftrightharpoons k}$ as shown in Appendix D.

Algorithm 1 LMRC at step $k$
Input: $D_k$ from new task and $\tau_j, s_j, \tau_j^{\rightharpoonup}, s_j^{\rightharpoonup}$ for $k - b \leq j < k$ from previous $b - 1$ steps
Output: $\mu_j$ for $k - b \leq j \leq k$, $\tau_k, s_k, \tau_k^{\rightharpoonup}, s_k^{\rightharpoonup}$
    Obtain sample average and MSE vectors $\tau_k^{\leftharpoondown k} = \tau_k$, $s_k^{\leftharpoondown k} = s_k$ using the sample set $D_k$   ⊲ Single-task
    Estimate the tasks' expected quadratic change $d_k^2$ using (7)
    Obtain the forward mean and MSE vectors $\tau_k^{\rightharpoonup}, s_k^{\rightharpoonup}$ using (5)-(6)   ⊲ Forward
    Take $\lambda_k^{\rightharpoonup} = \sqrt{s_k^{\rightharpoonup}}$ and obtain classifier parameter $\mu_k$ solving the optimization problem (3)
    for $j = k - 1, k - 2, \ldots, k - b$ do
        Estimate the tasks' expected quadratic change $d_j^2$ using (7)
        Obtain backward mean and MSE vectors $\tau_{j+1}^{\leftharpoondown k}, s_{j+1}^{\leftharpoondown k}$ using (5)-(6) in retrodiction   ⊲ Backward
        Obtain mean and MSE vectors $\tau_j^{\leftrightharpoons k}, s_j^{\leftrightharpoons k}$ using (10)-(11)   ⊲ Forward and backward
        Take $\lambda_j^{\leftrightharpoons k} = \sqrt{s_j^{\leftrightharpoons k}}$ and obtain classifier parameters $\mu_j$ solving the optimization problem (3)

4.3. PERFORMANCE GUARANTEES AND EFFECTIVE SAMPLE SIZES

The following result provides bounds for the minimax risk for each task with respect to the smallest minimax risk. For each $j$-th task and step $k$, we denote by $R(\mathcal{U}_j^{\leftrightharpoons k})$ the minimax risk over uncertainty set $\mathcal{U}_j^{\leftrightharpoons k}$ determined as in (1) using the mean and confidence vectors $\tau_j^{\leftrightharpoons k}$ and $\lambda_j^{\leftrightharpoons k} = \sqrt{s_j^{\leftrightharpoons k}}$ provided by (10) and (11).

Theorem 3. Let $M$, $\kappa$, and $n_j^{\rightharpoonup}$ be as in Theorem 1. For any $j \in \{1, 2, \ldots, k\}$, we have that

$$R(\mathcal{U}_j^{\leftrightharpoons k}) \leq R_j^{\infty} + \frac{M(\kappa + 1)\sqrt{2 \log(2m/\delta)}}{\sqrt{n_j^{\leftrightharpoons k}}} \|\mu_j^{\infty}\|_1 \quad \text{with prob. at least } 1 - \delta \tag{12}$$

with $n_k^{\leftrightharpoons k} = n_k^{\rightharpoonup}$ and

$$n_j^{\leftrightharpoons k} \geq n_j^{\rightharpoonup} + n_{j+1}^{\leftharpoondown k} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j+1}^{\leftharpoondown k} \|d_{j+1}^2\|_\infty} \quad \text{for } j \leq k - 1,$$

where the backward ESSs satisfy $n_k^{\leftharpoondown k} = n_k$ and

$$n_j^{\leftharpoondown k} \geq n_j + n_{j+1}^{\leftharpoondown k} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j+1}^{\leftharpoondown k} \|d_{j+1}^2\|_\infty}.$$

Proof. See Appendix E.

To the best of our knowledge, Theorem 3 provides the first performance guarantees for lifelong learning that show positive backward transfer. In particular, the bounds for forward and backward learning provided by inequality (12) are significantly lower than those for forward learning in Theorem 1. The ESS of each task is obtained by adding a fraction of the ESS for the succeeding task to the ESS of the corresponding task using forward learning. In particular, if $d_{j+1}^2$ is large, the ESS is given by that with forward learning, while if $d_{j+1}^2$ is small, the ESS is given by the sum of the ESS using forward learning and the ESS of the succeeding task.

Theorem 3 shows the increase in ESS in terms of the ESS with forward learning and the ESS of the succeeding task. The following result makes it possible to directly quantify the ESS in terms of the sample size and the expected quadratic change.

Theorem 4. Let $d$, $\sigma_j$ and $n$ be such that $d^2 \geq \|d_j^2\|_\infty$, $\|\sigma_j^2\|_\infty \leq 1$, and $n \leq n_j$ for $j = 1, 2, \ldots, k$. For any $j \in \{1, 2, \ldots, k\}$, we have that the ESS in (12) can be taken so that it satisfies

$$n_j^{\leftrightharpoons k} \geq n \left( 1 + \frac{(1 + \alpha)^{2j-1} - 1 - \alpha}{\alpha (1 + \alpha)^{2j-1} + \alpha} + \frac{(1 + \alpha)^{2(k-j)+1} - 1 - \alpha}{\alpha (1 + \alpha)^{2(k-j)+1} + \alpha} \right) \quad \text{with } \alpha = \frac{2}{\sqrt{1 + \frac{4}{n d^2}} - 1}. \tag{13}$$

In particular, for $j \geq 2$, we have that

$$n_j^{\leftrightharpoons k} \geq n_j^{\rightharpoonup} + n \frac{j(k - j)}{j + 2(k - j)} \geq n \left( 1 + \frac{j - 1}{3} + \frac{j(k - j)}{j + 2(k - j)} \right) \quad \text{if } n d^2 < \frac{1}{j^2}$$

$$n_j^{\leftrightharpoons k} \geq n_j^{\rightharpoonup} + \frac{1}{5} \sqrt{\frac{n}{d^2}} \geq n \left( 1 + \frac{2}{5 \sqrt{n d^2}} \right) \quad \text{if } \frac{1}{j^2} \leq n d^2 < 1$$

$$n_j^{\leftrightharpoons k} \geq n_j^{\rightharpoonup} + \frac{1}{3 d^2} \geq n \left( 1 + \frac{2}{3 n d^2} \right) \quad \text{if } n d^2 \geq 1.$$

Proof. See Appendix F.

The above theorem characterizes the increase in ESS provided by forward and backward learning in terms of the tasks' expected quadratic change. Such increase grows monotonically with the number of preceding tasks $j$ and with the number of succeeding tasks $k - j$ as shown in (13). In addition, it becomes proportional to the total number of tasks $k$ when the expected quadratic change is smaller than $1/(j^2 n)$ and $j \geq k/2$. Figure 3 further illustrates the increase in ESS with respect to the sample size ($n_j^{\leftrightharpoons k}/n$) due to forward and backward learning in comparison with forward learning. The figure displays the three intervals that are discussed in Theorems 2 and 4. In particular, the ESS significantly increases when $n d^2$ decreases between $1$ and $1/j^2$. Note also that in most situations, the benefits of backward learning are achieved using only $b = k - j = 3$ backward steps.
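The effect of the number of succeeding tasks on the ESS can be explored numerically with the recursions of Theorem 3 taken with equality, and compared against the closed-form expression (13). As before, this is only a sketch under assumptions introduced for illustration: equal sample sizes n, unit variance norm, and a common positive expected quadratic change d2; the function names are hypothetical.

```python
import math


def _chain(n, d2, length):
    """ESS accumulated along `length` consecutive tasks in one direction
    (recursions of Theorems 1 and 3 taken with equality, unit variance norm)."""
    ess = float(n)
    for _ in range(length - 1):
        ess = n + ess / (1.0 + ess * d2)
    return ess


def ess_forward_backward(n, d2, j, k):
    """ESS of task j at step k (1 <= j <= k) using forward and backward learning."""
    forward = _chain(n, d2, j)                 # forward ESS of task j
    if j == k:
        return forward                         # no succeeding tasks yet
    backward_next = _chain(n, d2, k - j)       # backward ESS of task j+1 at step k
    return forward + backward_next / (1.0 + backward_next * d2)


def ess_closed_form(n, d2, j, k):
    """Closed-form expression (13) of Theorem 4."""
    alpha = 2.0 / (math.sqrt(1.0 + 4.0 / (n * d2)) - 1.0)

    def gain(exponent):
        p = (1.0 + alpha) ** exponent
        return (p - 1.0 - alpha) / (alpha * p + alpha)

    return n * (1.0 + gain(2 * j - 1) + gain(2 * (k - j) + 1))
```

Evaluating `ess_forward_backward(n, d2, j, j + b)` for increasing b gives a quick way to see how the ESS of a task saturates with the number of backward steps under these assumptions.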

5. NUMERICAL RESULTS

This section first compares the classification performance of LMRCs with state-of-the-art techniques using multiple datasets, and then shows the performance improvement of the presented LMRCs due to forward and backward learning. In the supplementary materials, we provide the code for LMRCs together with additional implementation details and numerical results in Appendix G.

The proposed method is evaluated using 6 public datasets: "Yearbook" (Ginosar et al., 2015), "ImageNet noise" (Mai et al., 2022), "UTKFaces" (Zhang et al., 2017), "Rotated MNIST" (Jin et al., 2021), "DomainNet" (Peng et al., 2019), and "CLEAR" (Lin et al., 2021). These datasets are composed of time-dependent tasks (images with characteristics/quality/realism that change over time). The last two datasets are multi-class problems and the rest are binary (see further details in Appendix G). For all methods, instances are represented by pixel values in the "Rotated MNIST" dataset, and by the last layer of the ResNet18 pre-trained network (He et al., 2016) in the remaining datasets.

The proposed LMRC method is compared with 4 lifelong learning techniques: gradient episodic memory (GEM) (Lopez-Paz & Ranzato, 2017), meta-experience replay (MER) (Riemer et al., 2018), efficient lifelong learning algorithm (ELLA) (Ruvolo & Eaton, 2013), and elastic weight consolidation (EWC) (Kirkpatrick et al., 2017). The hyper-parameters in these methods are set to the default values provided by the authors. LMRCs are implemented using $b = 3$ backward steps and the expected quadratic change $d_j^2$ is estimated using $W = 2$ in (7). We use the same hyper-parameters for all the results in this section for fair comparison with the state-of-the-art and to show that the techniques presented do not require careful fine-tuning. In Appendix G, among other additional results, we study the change in classification error and processing time achieved by varying the number $b$ of backward steps.

In the first set of numerical results, we compare the performance of the proposed LMRCs with the state-of-the-art techniques for $n = 10$ and $n = 100$ samples per task. These numerical results are obtained by computing the average classification error over all the tasks in 50 random instantiations of data samples. As can be observed in Table 1, LMRCs can significantly improve performance in time-dependent tasks with respect to the state-of-the-art.

In the second set of numerical results, we analyze the contribution of forward and backward learning to the final performance of LMRCs. In particular, we show the relationship among classification error, number of tasks, and sample size for single-task, forward, and forward and backward learning. These numerical results are obtained by averaging, for each number of tasks and sample size, the classification errors achieved with 10 random instantiations of data samples in the "Yearbook" dataset (see Appendix G for further details). Figure 4a shows the classification error of the LMRC method divided by the classification error of single-task learning for different numbers of tasks with $n = 10$ and $n = 100$ sample sizes. The figure shows that forward and backward learning can significantly improve performance as tasks arrive. In addition, Figure 4b shows the classification error of the LMRC method for different sample sizes with $k = 10$ and $k = 100$ tasks. The figure shows that forward and backward learning for $k = 100$ tasks using $n = 10$ samples achieves significantly better results than single-task learning using $n = 100$ samples. In particular, the methods proposed can effectively exploit backward learning, which results in improved classification error in all the experimental results.

6. CONCLUSION

The paper proposes LMRCs that effectively perform forward and backward learning and account for time-dependent tasks. LMRCs carefully avoid the repeated use of the same information from the ever-increasing sequence of succeeding tasks. In addition, the paper analytically characterizes the increase in ESS achieved by the proposed forward and backward learning techniques in terms of the tasks' expected quadratic change and number of tasks. The numerical results assess the performance improvement of the LMRC methodology with respect to the state-of-the-art using multiple datasets, sample sizes, and numbers of tasks. The proposed methodology for lifelong learning with time-dependent tasks can lead to techniques that further approach the human ability to learn from few examples and to continuously improve on tasks that arrive over time.

A DERIVATION OF RECURSIONS (5) AND (6) FOR FORWARD LEARNING AND RECURSIONS (10) AND (11) FOR FORWARD AND BACKWARD LEARNING

This section shows how recursions in (5), (6) and recursions in (10), (11) are obtained using those for filtering and smoothing in linear dynamical systems. The mean vectors evolve over time steps through the linear dynamical system

$$\tau_j^{\infty} = \tau_{j-1}^{\infty} + w_j \tag{14}$$

where, as described in Section 3.1, vectors $w_j$ for $j \in \{2, 3, \ldots, k\}$ are independent and zero-mean because $p_j - p_{j-1}$ are independent and zero-mean. In addition, each state variable $\tau_j^{\infty}$ is observed at each step $j$ through $\tau_j$, which is the sample average of i.i.d. samples from $p_j$, so that we have

$$\tau_j = \tau_j^{\infty} + v_j \tag{15}$$

where $v_j$ for $j \in \{1, 2, \ldots, k\}$ are independent and zero-mean, and independent of $w_j$ for $j \in \{2, 3, \ldots, k\}$. Therefore, equations (14) and (15) above describe a linear dynamical system (state-space model with white noise processes) (Bishop, 2006; Anderson & Moore, 1979). For such systems, the Kalman filter recursions provide the unbiased linear estimator with minimum MSE based on samples corresponding to preceding steps $D_1, D_2, \ldots, D_j$, and fixed-lag smoother recursions provide the unbiased linear estimator with minimum MSE based on samples corresponding to preceding and succeeding steps $D_1, D_2, \ldots, D_k$ (Bishop, 2006; Anderson & Moore, 1979). Then, equations (5), (6) and equations (10), (11) are obtained after some algebra from the Kalman filter recursions and fixed-lag smoother recursions, respectively.
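The algebra that leads from the Kalman filter to (5) and (6) can be sketched as follows for a single component. This is an illustrative derivation under the assumption that each component is treated separately as a scalar state-space model, with state noise variance $d_j^2$ and observation noise variance $s_j$; the symbols $P_j$ and $K_j$ (predicted variance and Kalman gain) are introduced here for illustration and do not appear in the main text.

```latex
% Prediction step for the random-walk model (14): the mean is unchanged and the variance grows
\hat{\tau}_{j|j-1} = \tau_{j-1}^{\rightharpoonup}, \qquad
P_j = s_{j-1}^{\rightharpoonup} + d_j^2 .

% Kalman gain for the observation model (15), with observation noise variance s_j
K_j = \frac{P_j}{P_j + s_j}, \qquad
1 - K_j = \frac{s_j}{s_{j-1}^{\rightharpoonup} + s_j + d_j^2} .

% Update of the mean, rearranged around the sample average \tau_j: this is recursion (5)
\tau_j^{\rightharpoonup}
  = \tau_{j-1}^{\rightharpoonup} + K_j \left( \tau_j - \tau_{j-1}^{\rightharpoonup} \right)
  = \tau_j + (1 - K_j) \left( \tau_{j-1}^{\rightharpoonup} - \tau_j \right) .

% Update of the variance: this is recursion (6)
s_j^{\rightharpoonup}
  = (1 - K_j) P_j
  = \frac{s_j P_j}{P_j + s_j}
  = \left( \frac{1}{s_j} + \frac{1}{s_{j-1}^{\rightharpoonup} + d_j^2} \right)^{-1} .
```

The same steps applied backward in time, with $s_{j+1}^{\leftharpoondown k} + d_{j+1}^2$ playing the role of $P_j$, give the recursions used in retrodiction.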

B PROOF OF THEOREM 1

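Proof. To obtain the bound in (8), we first prove that the mean vector estimate and the MSE vector given by (5) and (6), respectively, satisfy

$$\mathbb{P}\left\{ \left|\tau_j^{\infty(i)} - \tau_j^{\rightharpoonup(i)}\right| \le \kappa \sqrt{2 s_j^{\rightharpoonup(i)} \log\frac{2m}{\delta}} \right\} \ge 1 - \delta \qquad (16)$$

for any component $i = 1, 2, \ldots, m$. Then, we prove that $\|\sqrt{s_j^{\rightharpoonup}}\|_\infty \le M / \sqrt{n_j^{\rightharpoonup}}$ for $j \in \{1, 2, \ldots, k\}$, where the ESSs satisfy $n_1^{\rightharpoonup} = n_1$ and

$$n_j^{\rightharpoonup} \ge n_j + n_{j-1}^{\rightharpoonup} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j-1}^{\rightharpoonup} \|d_j^2\|_\infty}$$

for $j \ge 2$.

To obtain inequality (16), we prove by induction that each component $i = 1, 2, \ldots, m$ of the error in the mean vector estimate $z_j^{\rightharpoonup(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightharpoonup(i)}$ is sub-Gaussian with parameter $\eta_j^{\rightharpoonup(i)} \le \kappa \sqrt{s_j^{\rightharpoonup(i)}}$. Firstly, for $j = 1$, we have that

$$z_1^{\rightharpoonup(i)} = \tau_1^{\infty(i)} - \tau_1^{\rightharpoonup(i)} = \tau_1^{\infty(i)} - \tau_1^{(i)}.$$

Since the bounded random variable $\Phi_1^{(i)}$ is sub-Gaussian with parameter $\sigma(\Phi_1^{(i)})$, the error in the mean vector estimate $z_1^{\rightharpoonup(i)}$ is sub-Gaussian with parameter that satisfies

$$\left(\eta_1^{\rightharpoonup(i)}\right)^2 = \frac{\sigma(\Phi_1^{(i)})^2}{n_1} \le \kappa^2 \frac{\sigma_1^{2(i)}}{n_1} = \kappa^2 s_1^{(i)}.$$

If $z_{j-1}^{\rightharpoonup(i)} = \tau_{j-1}^{\infty(i)} - \tau_{j-1}^{\rightharpoonup(i)}$ is sub-Gaussian with parameter $\eta_{j-1}^{\rightharpoonup(i)} \le \kappa \sqrt{s_{j-1}^{\rightharpoonup(i)}}$ for any $i = 1, 2, \ldots, m$, then using the recursions (5) and (6) we have that

$$z_j^{\rightharpoonup(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightharpoonup(i)} = \tau_{j-1}^{\infty(i)} + w_j^{(i)} - \tau_j^{(i)} - \frac{s_j^{(i)}}{s_{j-1}^{\rightharpoonup(i)} + s_j^{(i)} + d_j^{2(i)}} \left(\tau_{j-1}^{\rightharpoonup(i)} - \tau_j^{(i)}\right)$$
$$= \tau_{j-1}^{\infty(i)} + w_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)} + \left(1 - \frac{s_j^{(i)}}{s_{j-1}^{\rightharpoonup(i)} + s_j^{(i)} + d_j^{2(i)}}\right) \left(\tau_{j-1}^{\rightharpoonup(i)} - \tau_j^{(i)}\right)$$
$$= \tau_{j-1}^{\infty(i)} + w_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}} \left(\tau_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)}\right)$$

since $w_j = \tau_j^{\infty} - \tau_{j-1}^{\infty}$. If $v_j = \tau_j - \tau_j^{\infty}$, the error in the mean vector estimate is given by

$$z_j^{\rightharpoonup(i)} = \tau_{j-1}^{\infty(i)} + w_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}} \left(\tau_j^{\infty(i)} + v_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)}\right)$$
$$= \tau_{j-1}^{\infty(i)} + w_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}} \left(\tau_{j-1}^{\infty(i)} + w_j^{(i)} + v_j^{(i)} - \tau_{j-1}^{\rightharpoonup(i)}\right)$$
$$= \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right) z_{j-1}^{\rightharpoonup(i)} + \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right) w_j^{(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}} v_j^{(i)}$$

where $w_j^{(i)}$ and $v_j^{(i)}$ are sub-Gaussian with parameters $\sigma(w_j^{(i)})$ and $\sigma(\Phi_j^{(i)}) / \sqrt{n_j}$, respectively. Therefore, we have that $z_j^{\rightharpoonup(i)}$ is sub-Gaussian with parameter $\eta_j^{\rightharpoonup(i)}$ that satisfies

$$\left(\eta_j^{\rightharpoonup(i)}\right)^2 = \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \left(\eta_{j-1}^{\rightharpoonup(i)}\right)^2 + \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \sigma(w_j^{(i)})^2 + \left(\frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \frac{\sigma(\Phi_j^{(i)})^2}{n_j}$$

since $z_{j-1}^{\rightharpoonup}$, $w_j$, and $v_j$ are independent. Using that $\eta_{j-1}^{\rightharpoonup(i)} \le \kappa \sqrt{s_{j-1}^{\rightharpoonup(i)}}$ and the definition of $\kappa$, we have that

$$\left(\eta_j^{\rightharpoonup(i)}\right)^2 \le \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \kappa^2 s_{j-1}^{\rightharpoonup(i)} + \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \kappa^2 d_j^{2(i)} + \left(\frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \kappa^2 \frac{\sigma_j^{2(i)}}{n_j}$$
$$= \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right)^2 \kappa^2 \left(\frac{1}{s_j^{\rightharpoonup(i)}} - \frac{1}{s_j^{(i)}}\right)^{-1} + \frac{\left(s_j^{\rightharpoonup(i)}\right)^2}{s_j^{(i)}} \kappa^2 \qquad (17)$$
$$= \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{(i)}}\right) \kappa^2 s_j^{\rightharpoonup(i)} + \kappa^2 \frac{\left(s_j^{\rightharpoonup(i)}\right)^2}{s_j^{(i)}} = \kappa^2 s_j^{\rightharpoonup(i)}$$

where (17) is obtained using (6), which gives $s_{j-1}^{\rightharpoonup(i)} + d_j^{2(i)} = \left(1/s_j^{\rightharpoonup(i)} - 1/s_j^{(i)}\right)^{-1}$, together with $s_j^{(i)} = \sigma_j^{2(i)} / n_j$. The inequality in (16) is obtained using the union bound together with the Chernoff bound (concentration inequality) (Wainwright, 2019) for the random variables $z_j^{\rightharpoonup(i)}$ that are sub-Gaussian with parameter $\eta_j^{\rightharpoonup(i)}$.

Now, we prove by induction that, for any $j$, $\|\sqrt{s_j^{\rightharpoonup}}\|_\infty \le M / \sqrt{n_j^{\rightharpoonup}}$, where the ESSs satisfy $n_1^{\rightharpoonup} = n_1$ and $n_j^{\rightharpoonup} \ge n_j + n_{j-1}^{\rightharpoonup} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j-1}^{\rightharpoonup} \|d_j^2\|_\infty}$ for $j \ge 2$. For $j = 1$, using the definition of $s_j^{\rightharpoonup}$ in equation (6), we have that for any component $i$

$$\left(s_1^{\rightharpoonup(i)}\right)^{-1} = \left(s_1^{(i)}\right)^{-1} = \frac{n_1}{\sigma_1^{2(i)}} \ge \frac{n_1}{M^2}.$$

Then, the vector $s_1^{\rightharpoonup}$ satisfies $\|\sqrt{s_1^{\rightharpoonup}}\|_\infty \le M / \sqrt{n_1} = M / \sqrt{n_1^{\rightharpoonup}}$. If $\|\sqrt{s_{j-1}^{\rightharpoonup}}\|_\infty \le M / \sqrt{n_{j-1}^{\rightharpoonup}}$, then we have that for any component $i$

$$\left(s_j^{\rightharpoonup(i)}\right)^{-1} = \frac{1}{s_j^{(i)}} + \frac{1}{s_{j-1}^{\rightharpoonup(i)} + d_j^{2(i)}} \ge \frac{1}{s_j^{(i)}} + \frac{1}{\frac{M^2}{n_{j-1}^{\rightharpoonup}} + d_j^{2(i)}} \ge \frac{1}{M^2}\left(n_j + \frac{1}{\frac{1}{n_{j-1}^{\rightharpoonup}} + \frac{d_j^{2(i)}}{M^2}}\right) \ge \frac{1}{M^2}\left(n_j + \frac{1}{\frac{1}{n_{j-1}^{\rightharpoonup}} + \frac{\|d_j^2\|_\infty}{\|\sigma_j^2\|_\infty}}\right)$$

by using the recursion (6) and the induction hypothesis. Then, the vector $s_j^{\rightharpoonup}$ satisfies

$$\|\sqrt{s_j^{\rightharpoonup}}\|_\infty \le \frac{M}{\sqrt{n_j + n_{j-1}^{\rightharpoonup} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j-1}^{\rightharpoonup} \|d_j^2\|_\infty}}}. \qquad (18)$$

The inequality in (8) is obtained because the minimax risk is bounded by the smallest minimax risk as shown in (Mazuelas et al., 2020; 2022a; b), so that

$$R(\mathcal{U}_j^{\rightharpoonup}) \le R_j^{\infty} + \left(\|\tau_j^{\infty} - \tau_j^{\rightharpoonup}\|_\infty + \|\lambda_j^{\rightharpoonup}\|_\infty\right) \|\mu_j^{\infty}\|_1$$

which leads to (8) using (16), (18), and the fact that $1 \le \sqrt{2 \log\frac{2m}{\delta}}$.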

C PROOF OF THEOREM 2

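Proof. To obtain the bound in (9), we proceed by induction. For $j = 1$, using the expression for the ESS in (8), we have that $n_1^{\rightharpoonup} = n_1 \ge n$. If (9) holds for the $(j-1)$-th task, then for the $j$-th task we have that

$$n_j^{\rightharpoonup} \ge n_j + n_{j-1}^{\rightharpoonup} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j-1}^{\rightharpoonup} \|d_j^2\|_\infty} \ge n + n_{j-1}^{\rightharpoonup} \frac{1}{1 + n_{j-1}^{\rightharpoonup} d^2} = n\left(1 + \frac{1}{\frac{n}{n_{j-1}^{\rightharpoonup}} + n d^2}\right)$$

where the second inequality is obtained because $n_j \ge n$, $\|\sigma_j^2\|_\infty \le 1$, and $\|d_j^2\|_\infty \le d^2$. Using that

$$n_{j-1}^{\rightharpoonup} \ge n\left(1 + \frac{(1+\alpha)^{2j-3} - 1 - \alpha}{\alpha(1+\alpha)^{2j-3} + \alpha}\right),$$

the ESS of the $j$-th task satisfies

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{\frac{n}{n\left(1 + \frac{(1+\alpha)^{2j-3} - 1 - \alpha}{\alpha(1+\alpha)^{2j-3} + \alpha}\right)} + n d^2}\right) = n\left(1 + \frac{1}{\frac{\alpha(1+\alpha)^{2j-3} + \alpha}{(1+\alpha)^{2j-2} - 1} + n d^2}\right)$$
$$= n\left(1 + \frac{1}{\frac{\alpha(1+\alpha)^{2j-3} + \alpha}{(1+\alpha)^{2j-2} - 1} + \frac{\alpha^2}{\alpha+1}}\right) \qquad (19)$$
$$= n\left(1 + \frac{(1+\alpha)^{2j-1} - 1 - \alpha}{\alpha(1+\alpha)^{2j-2} + \alpha(\alpha+1) + \alpha^2(1+\alpha)^{2j-2} - \alpha^2}\right) = n\left(1 + \frac{(1+\alpha)^{2j-1} - 1 - \alpha}{\alpha(1+\alpha)^{2j-1} + \alpha}\right)$$

where (19) is obtained because $n d^2 = \frac{\alpha^2}{\alpha+1}$ since $\alpha = \frac{n d^2}{2}\left(\sqrt{1 + \frac{4}{n d^2}} + 1\right)$.

Now, we obtain bounds for the ESS depending on the value of $n d^2$. In the following, the constant $\varphi$ represents the golden ratio $\varphi = 1.618\ldots$

1. If $n d^2 < \frac{1}{j^2}$, then $\sqrt{n d^2} \le \alpha \le \sqrt{n d^2}\,\varphi \le \frac{\varphi}{j} \le 1$ by the same argument used in case 2 below, and $n_j^{\rightharpoonup}$ satisfies

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{\alpha} \frac{\alpha(2j-2)}{2 + \alpha(2j-1)}\right) = n\left(1 + \frac{2j-2}{2 + \alpha(2j-1)}\right)$$

where the first inequality follows because $(1+\alpha)^{2j-2} \ge 1 + \alpha(2j-2)$. Using $\alpha \le \frac{\varphi}{j}$, we have that

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{2j-2}{2 + \frac{\varphi}{j}(2j-1)}\right) \ge n\left(1 + \frac{2j-2}{2 + 2\varphi - \frac{\varphi}{j}}\right) \ge n\left(1 + \frac{j-1}{1+\varphi}\right).$$

2. If $\frac{1}{j^2} \le n d^2 < 1$, then $\frac{1}{j} \le \sqrt{n d^2} < \alpha < \sqrt{n d^2}\,\varphi$ because $\alpha = \frac{n d^2}{2}\left(\sqrt{1 + \frac{4}{n d^2}} + 1\right) = \sqrt{n d^2}\,\frac{\sqrt{n d^2 + 4} + \sqrt{n d^2}}{2}$, and $n_j^{\rightharpoonup}$ satisfies

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{\sqrt{n d^2}} \frac{1}{\varphi} \frac{(1+\alpha)^{2j-2} - 1}{(1+\alpha)^{2j-2} + 1}\right) \ge n\left(1 + \frac{1}{\sqrt{n d^2}} \frac{1}{\varphi} \frac{\left(1 + \frac{1}{j}\right)^{2j-2} - 1}{\left(1 + \frac{1}{j}\right)^{2j-2} + 1}\right)$$

where the first inequality follows because $\alpha < \sqrt{n d^2}\,\varphi$ and the second inequality follows because the expression is monotonically increasing in $\alpha$ and $\frac{1}{j} < \alpha$. Since $\left(1 + \frac{1}{j}\right)^{2j-2} \ge 1 + \frac{2j-2}{j}$, we have that

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{\sqrt{n d^2}} \frac{1}{\varphi} \frac{\frac{2j-2}{j}}{2 + \frac{2j-2}{j}}\right) \ge n\left(1 + \frac{1}{\sqrt{n d^2}} \frac{1}{\varphi} \frac{1}{3}\right)$$

because $j \ge 2$.

3. If $n d^2 \ge 1$, then $1 \le n d^2 \le \alpha \le n d^2 \varphi$ because $\alpha = \frac{n d^2}{2}\left(\sqrt{1 + \frac{4}{n d^2}} + 1\right)$, and $n_j^{\rightharpoonup}$ satisfies

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{n d^2 \varphi} \frac{(1+\alpha)^{2j-1} - 1 - \alpha}{(1+\alpha)^{2j-1} + 1}\right) \ge n\left(1 + \frac{1}{n d^2 \varphi} \frac{(1+\alpha)^{2j-2} - 1}{(1+\alpha)^{2j-2} + 1}\right)$$

where the first inequality follows because $\alpha \le n d^2 \varphi$ and the second inequality follows by multiplying and dividing by $1 + \alpha$ and because $1/(1+\alpha) < 1$. Since the above expression is monotonically increasing in $\alpha$ and $\alpha \ge 1$, we have that

$$n_j^{\rightharpoonup} \ge n\left(1 + \frac{1}{n d^2 \varphi} \frac{2^{2j-2} - 1}{2^{2j-2} + 1}\right) \ge n\left(1 + \frac{1}{n d^2 \varphi} \frac{3}{5}\right)$$

because $j \ge 2$.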
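The induction step in the proof above can be checked numerically. The following sketch iterates the recursion $n_j = n\left(1 + 1/(n/n_{j-1} + n d^2)\right)$ taken with equality, starting from $n_1 = n$, and compares it with the closed-form expression $n\left(1 + \frac{(1+\alpha)^{2j-1} - 1 - \alpha}{\alpha(1+\alpha)^{2j-1} + \alpha}\right)$. The function names and the values of `n` and `d2` are illustrative assumptions, and the sketch requires `d2 > 0`.

```python
import math


def alpha(n, d2):
    """alpha = (n d^2 / 2) (sqrt(1 + 4 / (n d^2)) + 1); requires d2 > 0."""
    x = n * d2
    return x / 2.0 * (math.sqrt(1.0 + 4.0 / x) + 1.0)


def ess_recursion(n, d2, j):
    """Iterate n_j = n (1 + 1 / (n / n_{j-1} + n d^2)) from n_1 = n."""
    ess = float(n)
    for _ in range(2, j + 1):
        ess = n * (1.0 + 1.0 / (n / ess + n * d2))
    return ess


def ess_closed_form(n, d2, j):
    """Closed-form lower bound on the ESS of the j-th task."""
    a = alpha(n, d2)
    power = (1.0 + a) ** (2 * j - 1)
    return n * (1.0 + (power - 1.0 - a) / (a * power + a))


if __name__ == "__main__":
    n, d2 = 10, 0.01  # hypothetical sample size and expected quadratic change
    for j in range(1, 6):
        print(j, ess_recursion(n, d2, j), ess_closed_form(n, d2, j))
```

As the number of tasks grows, both expressions approach $n(1 + 1/\alpha)$, so under these bounds the gain from forward learning saturates at a level set by $n d^2$.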

D MORE EFFICIENT RECURSIONS FOR FORWARD AND BACKWARD LEARNING

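The Rauch-Tung-Striebel smoother recursions (Bishop, 2006; Anderson & Moore, 1979) make it possible to obtain forward and backward mean and MSE vectors directly from those vectors for the succeeding task. Specifically, for each $j$-th task, the mean vector $\tau_j^{\rightleftharpoons k}$ together with the MSE vector $s_j^{\rightleftharpoons k}$ can be obtained using those vectors for the succeeding task $\tau_{j+1}^{\rightleftharpoons k}$, $s_{j+1}^{\rightleftharpoons k}$ as

$$\tau_j^{\rightleftharpoons k} = \tau_j^{\rightharpoonup} + \frac{s_j^{\rightharpoonup}}{s_j^{\rightharpoonup} + d_{j+1}^2} \left(\tau_{j+1}^{\rightleftharpoons k} - \tau_j^{\rightharpoonup}\right)$$
$$s_j^{\rightleftharpoons k} = \left(\frac{1}{s_j^{\rightharpoonup}} + \left(d_{j+1}^2 + \left(\frac{1}{s_{j+1}^{\rightleftharpoons k}} - \frac{1}{s_j^{\rightharpoonup} + d_{j+1}^2}\right)^{-1}\right)^{-1}\right)^{-1}.$$

The above recursions provide the same mean vector estimate as the recursions (10) and (11) in the paper since they are obtained using the Rauch-Tung-Striebel smoother recursions instead of the fixed-lag smoother recursions (Bishop, 2006; Anderson & Moore, 1979).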
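The equivalence stated above can be illustrated with a small sketch for a single component. The code below runs a forward pass, applies the Rauch-Tung-Striebel-style recursions of this section, and compares the result with the combination of forward and backward estimates, $\tau_j^{\rightharpoonup} + \frac{s_j^{\rightharpoonup}}{s_j^{\rightharpoonup} + s_{j+1}^{\leftharpoondown k} + d_{j+1}^2}(\tau_{j+1}^{\leftharpoondown k} - \tau_j^{\rightharpoonup})$, that appears in the proof of Theorem 3. Several points are assumptions of the sketch rather than statements from the paper: the function names, the backward pass written as the mirror image of the forward pass, the initialization of the smoothed estimates at the last task with the forward estimates, and the toy data.

```python
def forward_pass(tau, s, d2):
    """Forward estimates; d2[j] links tasks j-1 and j (d2[0] unused)."""
    tau_f, s_f = [tau[0]], [s[0]]
    for j in range(1, len(tau)):
        prior = s_f[-1] + d2[j]
        tau_f.append(tau[j] + s[j] / (prior + s[j]) * (tau_f[-1] - tau[j]))
        s_f.append(1.0 / (1.0 / s[j] + 1.0 / prior))
    return tau_f, s_f


def backward_pass(tau, s, d2):
    """Assumed mirror image of the forward pass, run from the last task."""
    k = len(tau)
    tau_b, s_b = list(tau), list(s)
    for j in range(k - 2, -1, -1):
        prior = s_b[j + 1] + d2[j + 1]
        tau_b[j] = tau[j] + s[j] / (prior + s[j]) * (tau_b[j + 1] - tau[j])
        s_b[j] = 1.0 / (1.0 / s[j] + 1.0 / prior)
    return tau_b, s_b


def rts_smoother(tau_f, s_f, d2):
    """Smoothed estimates obtained from those of the succeeding task."""
    k = len(tau_f)
    tau_s, s_s = list(tau_f), list(s_f)  # last task: forward estimate
    for j in range(k - 2, -1, -1):
        pred = s_f[j] + d2[j + 1]
        tau_s[j] = tau_f[j] + s_f[j] / pred * (tau_s[j + 1] - tau_f[j])
        from_successors = 1.0 / (1.0 / s_s[j + 1] - 1.0 / pred)
        s_s[j] = 1.0 / (1.0 / s_f[j] + 1.0 / (d2[j + 1] + from_successors))
    return tau_s, s_s


def two_filter(tau_f, s_f, tau_b, s_b, d2):
    """Combine forward estimates of task j with backward ones of task j+1."""
    tau_s, s_s = list(tau_f), list(s_f)
    for j in range(len(tau_f) - 1):
        total = s_f[j] + s_b[j + 1] + d2[j + 1]
        tau_s[j] = tau_f[j] + s_f[j] / total * (tau_b[j + 1] - tau_f[j])
        s_s[j] = 1.0 / (1.0 / s_f[j] + 1.0 / (s_b[j + 1] + d2[j + 1]))
    return tau_s, s_s


if __name__ == "__main__":
    tau, s, d2 = [0.0, 3.0, 1.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]
    tau_f, s_f = forward_pass(tau, s, d2)
    tau_b, s_b = backward_pass(tau, s, d2)
    print(rts_smoother(tau_f, s_f, d2))
    print(two_filter(tau_f, s_f, tau_b, s_b, d2))
```

The Rauch-Tung-Striebel form needs only one backward sweep over stored forward estimates, whereas the two-filter form needs a separate backward pass over the samples of succeeding tasks, which is the efficiency gain this section refers to.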

E PROOF OF THEOREM 3

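Proof. To obtain the bound in (12), we first prove that the mean vector estimate and the MSE vector given by (10) and (11), respectively, satisfy

$$\mathbb{P}\left\{ \left|\tau_j^{\infty(i)} - \tau_j^{\rightleftharpoons k(i)}\right| \le \kappa \sqrt{2 s_j^{\rightleftharpoons k(i)} \log\frac{2m}{\delta}} \right\} \ge 1 - \delta \qquad (20)$$

for any component $i = 1, 2, \ldots, m$. Then, we prove that $\|\sqrt{s_j^{\rightleftharpoons k}}\|_\infty \le M / \sqrt{n_j^{\rightleftharpoons k}}$ for $j \in \{1, 2, \ldots, k\}$, where the ESSs satisfy $n_k^{\rightleftharpoons k} = n_k^{\rightharpoonup}$ and

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n_{j+1}^{\leftharpoondown k} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j+1}^{\leftharpoondown k} \|d_{j+1}^2\|_\infty}$$

for $j \ge 2$.

To obtain inequality (20), we prove that each component $i = 1, 2, \ldots, m$ of the error in the mean vector estimate $z_j^{\rightleftharpoons k(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightleftharpoons k(i)}$ is sub-Gaussian with parameter $\eta_j^{\rightleftharpoons k(i)} \le \kappa \sqrt{s_j^{\rightleftharpoons k(i)}}$. Analogously to the proof of Theorem 1, it is proven that each component in the error of the backward mean vector $\tau_{j+1}^{\leftharpoondown k}$ is sub-Gaussian with parameter satisfying $\eta_{j+1}^{\leftharpoondown k(i)} \le \kappa \sqrt{s_{j+1}^{\leftharpoondown k(i)}}$. The error in the forward and backward mean vector estimate is given by

$$z_j^{\rightleftharpoons k(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightleftharpoons k(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \left(\tau_{j+1}^{\leftharpoondown k(i)} - \tau_j^{\rightharpoonup(i)}\right)$$

where the second equality is obtained using the recursion for $\tau_j^{\rightleftharpoons k(i)}$ in (10). Adding and subtracting $\frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \tau_{j+1}^{\infty(i)}$, we have that

$$z_j^{\rightleftharpoons k(i)} = z_j^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \left(\tau_{j+1}^{\infty(i)} - \tau_{j+1}^{\infty(i)} + \tau_{j+1}^{\leftharpoondown k(i)} - \tau_j^{\rightharpoonup(i)}\right)$$
$$= z_j^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \left(\tau_j^{\infty(i)} + w_{j+1}^{(i)} - z_{j+1}^{\leftharpoondown k(i)} - \tau_j^{\rightharpoonup(i)}\right)$$

since $w_j = \tau_j^{\infty} - \tau_{j-1}^{\infty}$ and $z_{j+1}^{\leftharpoondown k(i)} = \tau_{j+1}^{\infty(i)} - \tau_{j+1}^{\leftharpoondown k(i)}$. Then, since $z_j^{\rightharpoonup(i)} = \tau_j^{\infty(i)} - \tau_j^{\rightharpoonup(i)}$, we have that

$$z_j^{\rightleftharpoons k(i)} = z_j^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \left(z_j^{\rightharpoonup(i)} + w_{j+1}^{(i)} - z_{j+1}^{\leftharpoondown k(i)}\right) \qquad (21)$$
$$= \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right) z_j^{\rightharpoonup(i)} - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \left(w_{j+1}^{(i)} - z_{j+1}^{\leftharpoondown k(i)}\right)$$

where $z_j^{\rightharpoonup(i)}$, $z_{j+1}^{\leftharpoondown k(i)}$, and $w_{j+1}^{(i)}$ are sub-Gaussian with parameters $\eta_j^{\rightharpoonup(i)} \le \kappa \sqrt{s_j^{\rightharpoonup(i)}}$, $\eta_{j+1}^{\leftharpoondown k(i)} \le \kappa \sqrt{s_{j+1}^{\leftharpoondown k(i)}}$, and $\sigma(w_{j+1}^{(i)})$, respectively. Since $z_j^{\rightharpoonup}$, $z_{j+1}^{\leftharpoondown k}$, and $w_{j+1}$ are independent, we have that $z_j^{\rightleftharpoons k(i)}$ given by (21) is sub-Gaussian with parameter that satisfies

$$\left(\eta_j^{\rightleftharpoons k(i)}\right)^2 = \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^2 \left(\eta_j^{\rightharpoonup(i)}\right)^2 + \left(\frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^2 \left(\sigma(w_{j+1}^{(i)})^2 + \left(\eta_{j+1}^{\leftharpoondown k(i)}\right)^2\right)$$
$$\le \left(1 - \frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^2 \kappa^2 s_j^{\rightharpoonup(i)} + \left(\frac{s_j^{\rightharpoonup(i)}}{s_j^{\rightharpoonup(i)} + s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^2 \kappa^2 \left(d_{j+1}^{2(i)} + s_{j+1}^{\leftharpoondown k(i)}\right).$$

Using (11), we have that the sub-Gaussian parameter satisfies

$$\left(\eta_j^{\rightleftharpoons k(i)}\right)^2 \le \left(1 - \frac{s_j^{\rightleftharpoons k(i)}}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^2 \kappa^2 \left(\frac{1}{s_j^{\rightleftharpoons k(i)}} - \frac{1}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right)^{-1} + \frac{\left(s_j^{\rightleftharpoons k(i)}\right)^2}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \kappa^2$$
$$= \left(\frac{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)} - s_j^{\rightleftharpoons k(i)}}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}}\right) \kappa^2 s_j^{\rightleftharpoons k(i)} + \frac{\left(s_j^{\rightleftharpoons k(i)}\right)^2}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \kappa^2 = \kappa^2 s_j^{\rightleftharpoons k(i)}.$$

The inequality in (20) is obtained using the union bound together with the Chernoff bound (concentration inequality) (Wainwright, 2019) for the random variables $z_j^{\rightleftharpoons k(i)}$ that are sub-Gaussian with parameter $\eta_j^{\rightleftharpoons k(i)}$.

Now, we prove that, for any $j$, $\|\sqrt{s_j^{\rightleftharpoons k}}\|_\infty \le M / \sqrt{n_j^{\rightleftharpoons k}}$, where the ESSs satisfy $n_k^{\rightleftharpoons k} = n_k^{\rightharpoonup}$ and $n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n_{j+1}^{\leftharpoondown k} \frac{\|\sigma_j^2\|_\infty}{\|\sigma_j^2\|_\infty + n_{j+1}^{\leftharpoondown k} \|d_{j+1}^2\|_\infty}$ for $j \ge 2$. Analogously to the proof of Theorem 1, we prove that the backward MSE vector $s_{j+1}^{\leftharpoondown k}$ satisfies $\|\sqrt{s_{j+1}^{\leftharpoondown k}}\|_\infty \le M / \sqrt{n_{j+1}^{\leftharpoondown k}}$. Then, using this bound, we have that for every component $i$

$$\left(s_j^{\rightleftharpoons k(i)}\right)^{-1} = \frac{1}{s_j^{\rightharpoonup(i)}} + \frac{1}{s_{j+1}^{\leftharpoondown k(i)} + d_{j+1}^{2(i)}} \ge \frac{n_j^{\rightharpoonup}}{\sigma_j^{2(i)}} + \frac{1}{\frac{M^2}{n_{j+1}^{\leftharpoondown k}} + d_{j+1}^{2(i)}} \ge \frac{1}{M^2}\left(n_j^{\rightharpoonup} + \frac{1}{\frac{1}{n_{j+1}^{\leftharpoondown k}} + \frac{d_{j+1}^{2(i)}}{M^2}}\right) \ge \frac{1}{M^2}\left(n_j^{\rightharpoonup} + \frac{1}{\frac{1}{n_{j+1}^{\leftharpoondown k}} + \frac{\|d_{j+1}^2\|_\infty}{\|\sigma_j^2\|_\infty}}\right).$$

Then, we obtain

$$\|\sqrt{s_j^{\rightleftharpoons k}}\|_\infty \le \frac{M}{\sqrt{n_j^{\rightharpoonup} + \frac{1}{\frac{1}{n_{j+1}^{\leftharpoondown k}} + \frac{\|d_{j+1}^2\|_\infty}{\|\sigma_j^2\|_\infty}}}}. \qquad (22)$$

The inequality in (12) is obtained because the minimax risk is bounded by the smallest minimax risk as shown in (Mazuelas et al., 2020; 2022a; b), so that

$$R(\mathcal{U}_j^{\rightleftharpoons k}) \le R_j^{\infty} + \left(\|\tau_j^{\infty} - \tau_j^{\rightleftharpoons k}\|_\infty + \|\lambda_j^{\rightleftharpoons k}\|_\infty\right) \|\mu_j^{\infty}\|_1$$

which leads to (12) using (20), (22), and the fact that $1 \le \sqrt{2 \log\frac{2m}{\delta}}$.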

F PROOF OF THEOREM 4

Proof. To obtain the bound in (13), we use the ESS obtained with forward learning in Theorem 2 and the ESS obtained with backward learning. Analogously to the proof of Theorem 2, we prove that the ESS obtained with backward learning satisfies

$$n_{j+1}^{\leftharpoondown k} \ge n_{j+1} + n_{j+2}^{\leftharpoondown k} \frac{\|\sigma_{j+1}^2\|_\infty}{\|\sigma_{j+1}^2\|_\infty + n_{j+2}^{\leftharpoondown k} \|d_{j+2}^2\|_\infty} \ge n\left(1 + \frac{(1+\alpha)^{2(k-j)-1} - 1 - \alpha}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha}\right).$$

Therefore, the ESS obtained with forward and backward learning satisfies

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n\left(1 + \frac{(1+\alpha)^{2(k-j)-1} - 1 - \alpha}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha}\right) \left(1 + n\left(1 + \frac{(1+\alpha)^{2(k-j)-1} - 1 - \alpha}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha}\right) d^2\right)^{-1}$$
$$= n_j^{\rightharpoonup} + n \frac{(1+\alpha)^{2(k-j)} - 1}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha} \left(1 + \frac{\alpha^2}{\alpha+1}\left(1 + \frac{(1+\alpha)^{2(k-j)-1} - 1 - \alpha}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha}\right)\right)^{-1}$$

where the equality follows because $n d^2 = \frac{\alpha^2}{\alpha+1}$ since $\alpha = \frac{n d^2}{2}\left(\sqrt{1 + \frac{4}{n d^2}} + 1\right)$. Then, we have that

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n \frac{(1+\alpha)^{2(k-j)} - 1}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha} \left(\frac{\left((1+\alpha)^{2(k-j)-1} + 1\right)(\alpha + 1 + \alpha^2) + \alpha\left((1+\alpha)^{2(k-j)-1} - 1 - \alpha\right)}{(\alpha+1)\left((1+\alpha)^{2(k-j)-1} + 1\right)}\right)^{-1}$$
$$\ge n_j^{\rightharpoonup} + n \frac{(1+\alpha)^{2(k-j)} - 1}{\alpha(1+\alpha)^{2(k-j)-1} + \alpha} \cdot \frac{(\alpha+1)\left((1+\alpha)^{2(k-j)-1} + 1\right)}{(1+\alpha)^{2(k-j)+1} + 1}.$$

Now, we obtain bounds for the ESS depending on the value of $n d^2$. Such bounds are obtained similarly as in Theorem 2, and we also denote by $\varphi$ the golden ratio $\varphi = 1.618\ldots$

1. If $n d^2 < \frac{1}{j^2}$, then $\sqrt{n d^2} \le \alpha \le \sqrt{n d^2}\,\varphi \le \frac{\varphi}{j} \le 1$ as in Theorem 2, and $n_j^{\rightleftharpoons k}$ satisfies

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n \frac{1}{\alpha} \frac{\alpha\, 2(k-j)}{2 + \alpha\, 2(k-j)} = n_j^{\rightharpoonup} + n \frac{k-j}{1 + \alpha(k-j)} \ge n_j^{\rightharpoonup} + n \frac{k-j}{1 + \frac{\varphi}{j}(k-j)}$$

where the first inequality follows because $(1+\alpha)^{2(k-j)-1} \ge 1 + \alpha(2(k-j) - 1)$ and the second inequality is obtained using $\alpha \le \frac{\varphi}{j}$.

2. If $\frac{1}{j^2} \le n d^2 < 1$, then $\frac{1}{j} \le \sqrt{n d^2} \le \alpha \le \sqrt{n d^2}\,\varphi$ because $\alpha = n d^2\, \frac{\sqrt{1 + \frac{4}{n d^2}} + 1}{2} = \sqrt{n d^2}\,\frac{\sqrt{n d^2 + 4} + \sqrt{n d^2}}{2}$, and $n_j^{\rightleftharpoons k}$ satisfies

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + \frac{n}{\alpha} \frac{(1+\alpha)^{2(k-j)} - 1}{(1+\alpha)^{2(k-j)} + 1} \ge n_j^{\rightharpoonup} + \frac{n}{\alpha} \frac{(1+\sqrt{n d^2})^{2(k-j)} - 1}{(1+\sqrt{n d^2})^{2(k-j)} + 1}$$

where the second inequality follows because the ESS is monotonically increasing in $\alpha$ and $\alpha \ge \sqrt{n d^2}$. Since $(1+\sqrt{n d^2})^{2(k-j)} \ge 1 + 2\sqrt{n d^2}(k-j)$ and $k - j \ge 1$, we have that

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + \frac{n}{\alpha} \frac{\sqrt{n d^2}}{1 + \sqrt{n d^2}} \ge n_j^{\rightharpoonup} + n \frac{1}{\varphi} \frac{1}{1 + \sqrt{n d^2}}$$

because $\alpha \le \sqrt{n d^2}\,\varphi$.

3. If $n d^2 \ge 1$, then $1 \le n d^2 \le \alpha \le n d^2 \varphi$ because $\alpha = n d^2\, \frac{\sqrt{1 + \frac{4}{n d^2}} + 1}{2}$, and $n_j^{\rightleftharpoons k}$ satisfies

$$n_j^{\rightleftharpoons k} \ge n_j^{\rightharpoonup} + n \frac{1}{\alpha} \frac{2^{2(k-j)} - 1}{2^{2(k-j)} + 1} \ge n_j^{\rightharpoonup} + n \frac{1}{n d^2} \frac{1}{\varphi} \frac{3}{5}$$

where the first inequality follows because the ESS is monotonically increasing in $\alpha$ and $\alpha \ge 1$, and the second inequality is obtained using $k - j \ge 1$ and $\alpha \le n d^2 \varphi$.

The datasets are summarized in Table 2, which shows the number of classes, the number of samples, and the number of tasks. In the following, we further describe the tasks and the time-dependency of each dataset used.

• The "Yearbook" dataset contains portrait photographs over time, and the goal is to classify each portrait as male or female. Each task corresponds to portraits from one year from 1905 to 2013.
• The "ImageNet noise" dataset contains images with increasing noise over tasks, and the goal is to predict whether an image is a bird or a snake. The sequence of tasks corresponds to the noise factors [0.0, 0.4, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8, 3.2, 3.6] (Mai et al., 2022).
• The "DomainNet" dataset contains six different domains with decreasing realism, and the goal is to predict whether an image is an airplane, bus, ambulance, or police car. The sequence of tasks corresponds to the six domains: real, painting, infograph, clipart, sketch, and quickdraw.
• The "UTKFaces" dataset contains face images in the wild with increasing age, and the goal is to classify each face as male or female. The sequence of tasks corresponds to face images with different ages from 0 to 116 years.
• The "Rotated MNIST" dataset contains rotated images with increasing angles over tasks, and the goal is to predict whether the number in an image is greater than 5 or not. Each $j$-th task corresponds to a rotation angle randomly selected between $180(j-1)/k$ and $180j/k$ degrees, where $j \in \{1, 2, \ldots, k\}$ and $k$ is the number of tasks.
• The "CLEAR" dataset contains images with a natural temporal evolution of visual concepts in the real world, and the goal is to predict whether an image shows soccer, hockey, or racing. Each task corresponds to one year from 2004 to 2014.

The samples in each task are randomly split into 100 samples for testing and the remaining samples for training. The samples used for training in the numerical results are randomly sampled from each group of training samples in each repetition.

The classifier parameters in the numerical results are obtained using an accelerated subgradient method based on Nesterov's approach (Nesterov & Shikhman, 2015; Tao et al., 2019). Such a subgradient method applied to optimization (3) obtains at each step the classifier parameters $\mu$ from the mean and confidence vectors $\tau$, $\lambda$ using, for $l = 1, 2, \ldots, K$, the iterations

$$\bar{\mu}(l+1) = \mu(l) + a_l \left(\tau - \partial\varphi(\mu(l)) - \lambda\, \mathrm{sign}(\mu(l))\right) \qquad (23)$$
$$\mu(l+1) = \bar{\mu}(l+1) + \theta_{l+1}\left(\theta_l^{-1} - 1\right)\left(\mu(l) - \bar{\mu}(l)\right)$$

where $\mathrm{sign}(\cdot)$ denotes the sign function, $\mu(l)$ is the $l$-th iterate for $\mu$, $\theta_l = 2/(l+1)$ and $a_l = 1/(l+1)^{3/2}$ are the step sizes, and $\partial\varphi(\mu(l))$ denotes a subgradient of $\varphi(\cdot)$ at $\mu(l)$ with

$$\varphi(\mu) = \max_{x \in \mathcal{X},\, \mathcal{C} \subseteq \mathcal{Y}} \frac{\sum_{y \in \mathcal{C}} \Phi(x, y)^{\mathsf{T}} \mu - 1}{|\mathcal{C}|}.$$

In addition, the above subgradient method is implemented using $K = 2000$ iterations and a warm start that initializes the classifier parameters in (23) with the solution obtained for the closest task.

In the first set of additional results, we further compare the classification error of LMRCs with the state-of-the-art techniques. The results in Table 1 in the paper as well as Table 3 are obtained by computing the classification error 50 times for each sample size.
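Returning to the accelerated subgradient iterations in (23), the following Python sketch shows one way to organize them. It is an illustration under several assumptions, not the authors' code: the maximization over instances in $\varphi$ is restricted to a finite list of instances, the feature vectors $\Phi(x, y)$ are passed as nested lists, the product of $\lambda$ and $\mathrm{sign}(\mu)$ is taken componentwise, and the names `phi_and_subgradient`, `accelerated_subgradient`, and `mu_init` (the warm start) are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(x):
    return (x > 0) - (x < 0)


def phi_and_subgradient(mu, feature_maps):
    """Value and one subgradient of
    phi(mu) = max over x and non-empty C of (sum_{y in C} Phi(x, y)^T mu - 1) / |C|.

    feature_maps[x][y] is the feature vector Phi(x, y) for instance x and label y.
    For a fixed size |C|, the best subset takes the labels with the largest scores.
    """
    best_value, best_grad = float("-inf"), None
    for per_label in feature_maps:
        ranked = sorted(per_label, key=lambda f: dot(f, mu), reverse=True)
        total, accum = 0.0, [0.0] * len(mu)
        for size, feat in enumerate(ranked, start=1):
            total += dot(feat, mu)
            accum = [a + f for a, f in zip(accum, feat)]
            value = (total - 1.0) / size
            if value > best_value:
                best_value = value
                best_grad = [a / size for a in accum]
    return best_value, best_grad


def accelerated_subgradient(tau, lam, feature_maps, num_iters=2000, mu_init=None):
    """Iterations of the form (23) with theta_l = 2/(l+1), a_l = 1/(l+1)^(3/2)."""
    m = len(tau)
    mu = list(mu_init) if mu_init is not None else [0.0] * m  # optional warm start
    mu_bar = list(mu)
    for l in range(1, num_iters + 1):
        theta_l, theta_next = 2.0 / (l + 1), 2.0 / (l + 2)
        a_l = 1.0 / (l + 1) ** 1.5
        _, grad = phi_and_subgradient(mu, feature_maps)
        new_bar = [mu[i] + a_l * (tau[i] - grad[i] - lam[i] * sign(mu[i]))
                   for i in range(m)]
        momentum = theta_next * (1.0 / theta_l - 1.0)
        mu = [new_bar[i] + momentum * (mu[i] - mu_bar[i]) for i in range(m)]
        mu_bar = new_bar
    return mu


if __name__ == "__main__":
    # Toy problem: one instance, two labels, one-hot feature vectors.
    features = [[[1.0, 0.0], [0.0, 1.0]]]
    print(accelerated_subgradient([0.6, 0.4], [0.1, 0.1], features, num_iters=200))
```

Passing the solution of the closest task as `mu_init` mirrors the warm start described above; with a good initialization, fewer iterations are typically needed for a comparable solution.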
Table 1 in the paper shows classification errors for n = 10 and n = 100 samples, while Table 3 shows the classification error for n = 50 and n = 150 samples. As can be observed in Table 3, the performance improvement of LMRCs in comparison with the state-of-the-art techniques for n = 50 and n = 150 is similar to that shown in the paper for n = 10 and n = 100.

In the second set of additional results, we further illustrate the relationship among classification error, number of tasks, and sample size. Figure 4 in the paper as well as Figure 5 are obtained by computing the classification error over all the sequences of consecutive tasks of length k in the dataset. Then, we repeat this experiment 10 times with randomly chosen training sets of size n. Figure 5 extends the results for LMRCs using the "DomainNet" dataset, complementing those in the main paper that show the results using the "Yearbook" dataset. Figure 5a shows the classification error of the LMRC method divided by the classification error of single-task learning for different numbers of tasks with n = 10 and n = 100 sample sizes. In addition, Figure 5b shows the classification error of the LMRC method for different sample sizes with k = 10 tasks. Figures 5a and 5b show similar behavior to Figures 4a and 4b in the paper, respectively. In addition, Figure 6 shows the classification error of LMRCs per step and task with single-task learning, forward learning, and forward and backward learning using the "Yearbook" dataset.



Figure 2 depicts the flow diagram for the proposed LMRC methodology. The proposed techniques carefully avoid the repeated usage of the same information from the sequence of succeeding tasks.


Figure 2: Diagram for LMRC methodology.

Classification error per sample size in "Yearbook" dataset.

Forward and backward learning can sharply boost performance and ESS as tasks arrive.

Classification error per sample size using "DomainNet" dataset.

Figure 5: Forward and backward learning can sharply boost performance and ESS as tasks arrive.

Figure 7: Averaged partial autocorrelation of the mean vector components ± their standard deviations.

Classification error and standard deviation of the proposed LMRC method in comparison with the state-of-the-art techniques.

…      … ± .03   .17 ± .03   .39 ± .08   .13 ± .07   .69 ± .05   .53 ± .10   .12 ± .00   .12 ± .00   .36 ± .06   .28 ± .02   .57 ± .10   .09 ± .02
MER    .16 ± .03   .10 ± .01   .17 ± .03   .10 ± .01   .38 ± .04   .26 ± .04   .17 ± .09   .11 ± .01   .37 ± .09   .45 ± .10   .10 ± .03   .05 ± .02
ELLA   .45 ± .09   .43 ± .10   .48 ± .05   .47 ± .04   .67 ± .05   .67 ± .05   .19 ± .12   .17 ± .11   .48 ± .05   .47 ± .05   .61 ± .06   .60 ± .05
EWC    .47 ± .05   .27 ± .06   .47 ± .04   .46 ± .06   .75 ± .04   .74 ± .05   .12 ± .00   .12 ± .00   .48 ± .01   .40 ± .01   .65 ± .03   .62 ± .04
LMRC   .13 ± .04   .08 ± .02   .15 ± .03   .09 ± .01   .34 ± .06   .28 ± .01   .10 ± .01   .10 ± .00   .36 ± .01   .21 ± .00   .09 ± .03   .05 ± .02

Datasets characteristics.

In this section we describe the datasets used for the numerical results in Section 5, provide further details on the numerical experiments carried out, and include several additional results. Specifically, in the first set of additional results, we evaluate the classification performance of the proposed method in comparison with state-of-the-art techniques for different sample sizes; in the second set of additional results, we further show the performance improvement obtained by leveraging information from preceding and succeeding tasks with additional datasets; in the third set of additional results, we show the classification error and the running time of LMRCs for different hyper-parameter values; and in the fourth set of additional results, we evaluate the assumption that the changes between tasks are independent and zero-mean. In addition, the folder Implementation_LMRC in the supplementary materials provides the code of the proposed LMRCs with the settings used in the numerical results.

Classification error and standard deviation of the proposed LMRC method in comparison with the state-of-the-art techniques.

…      … ± .02   .16 ± .03   .16 ± .06   .12 ± .03   .65 ± .05   .49 ± .10   .12 ± .00   .12 ± .00   .29 ± .04   .27 ± .01   .08 ± .01   .08 ± .01
MER    .11 ± .01   .10 ± .01   .10 ± .01   .07 ± .01   .29 ± .02   .28 ± .02   .11 ± .09   .11 ± .01   .41 ± .13   .47 ± .05   .09 ± .04   .05 ± .02
ELLA   .43 ± .10   .43 ± .08   .47 ± .05   .47 ± .04   .67 ± .05   .67 ± .05   .18 ± .11   .17 ± .11   .48 ± .05   .47 ± .05   .60 ± .05   .60 ± .04
EWC    .38 ± .02   .22 ± .02   .47 ± .05   .45 ± .07   .75 ± .05   .74 ± .05   .12 ± .00   .12 ± .00   .44 ± .01   .38 ± .01   .64 ± .03   .60 ± .04
LMRC   .10 ± .02   .08 ± .01   .10 ± .02   .08 ± .01   .28 ± .03   .27 ± .02   .10 ± .00   .10 ± .00   .25 ± .02   .20 ± .01   .05 ± .02   .04 ± .01

Running time in seconds of LMRC method in comparison with the state-of-the-art techniques.

Figure 6: Classification error per step and task (axes: task, step) with single-task learning, forward learning, and forward and backward learning using the "Yearbook" dataset.

Figure 6 shows that forward and backward learning can improve the performance of preceding tasks, while forward learning and single-task learning maintain the same performance over time.

In the third set of additional results, we further assess the change in classification error and the running time of LMRCs when varying the hyper-parameters. Table 4 shows the classification error of LMRCs for different values of the hyper-parameters for the window size W and the number of backward steps b, complementing those in the paper that show the results for W = 2 and b = 3. As shown in the table, the proposed LMRCs do not require a careful fine-tuning of hyper-parameters, and similar performances are obtained by using different values. In addition, Table 5 shows the mean running time per task in seconds of LMRCs for b = 1, 2, . . . , 5 backward steps in comparison with the state-of-the-art techniques. The table shows that the methods proposed for backward learning do not require a significant increase in complexity and that the running time of the proposed method is similar to that of other state-of-the-art methods.

In the fourth set of additional results, we evaluate the assumption that the changes between tasks are independent and zero-mean by assessing the partial autocorrelation of the mean vectors. In particular, the partial autocorrelation at any lag would be zero if tasks were i.i.d., while the partial autocorrelation at lag 1 is larger than zero if tasks satisfy the assumption of Section 2. Figure 7 shows the averaged partial autocorrelation of the mean vector components ± their standard deviations for different lags using the "Portraits" and "UTKFaces" datasets. The figure shows a partial autocorrelation that is clearly non-zero at lag 1, which reflects the dependence between consecutive mean vectors, as described by the assumption of Section 2.
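The contrast between i.i.d. tasks and time-dependent tasks described above can be illustrated with synthetic data. The sketch below computes the sample autocorrelation at lag 1, which coincides with the partial autocorrelation at lag 1, for an i.i.d. sequence and for a sequence whose consecutive differences are independent and zero-mean. The simulated sequences, the function names, and the seeds are assumptions made for illustration only; they are not the mean vectors of the datasets used in the paper.

```python
import random


def lag1_autocorrelation(x):
    """Sample autocorrelation at lag 1 (equal to the lag-1 partial autocorrelation)."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def simulate_iid(length, seed):
    """Synthetic sequence of independent draws (i.i.d. tasks)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def simulate_independent_changes(length, seed):
    """Synthetic sequence whose consecutive changes are independent and zero-mean."""
    rng = random.Random(seed)
    level, sequence = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        sequence.append(level)
    return sequence


if __name__ == "__main__":
    print("i.i.d.:", lag1_autocorrelation(simulate_iid(2000, seed=0)))
    print("independent changes:",
          lag1_autocorrelation(simulate_independent_changes(2000, seed=0)))
```

In this synthetic setting, the i.i.d. sequence yields a value near zero while the sequence with independent zero-mean changes yields a value close to one, which is the qualitative pattern that Figure 7 reports at lag 1.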

