TOWARDS UNDERSTANDING GD WITH HARD AND CONJUGATE PSEUDO-LABELS FOR TEST-TIME ADAPTATION

Abstract

We consider a setting in which a model needs to adapt to a new domain under distribution shifts, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most of the related works is to construct pseudo-labels for the unlabeled test samples and apply gradient descent (GD) to a loss function with the pseudo-labels. Recently, Goyal et al. (2022) proposed conjugate labels, a new kind of pseudo-label for self-training at test time. They empirically show that conjugate labels outperform other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim at theoretically understanding GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to an $\epsilon$-optimal predictor under a Gaussian model for any arbitrarily small $\epsilon$, while GD with hard pseudo-labels fails in this task. We also analyze them under different loss functions for the update. Our results shed light on when and why GD with hard labels or conjugate labels works in test-time adaptation.

1. INTRODUCTION

Fully test-time adaptation is the task of adapting a model from a source domain so that it fits a new domain at test time, without access to the true labels of samples from the new domain or to the data from the source domain (Goyal et al., 2022; Wang et al., 2021a; Li et al., 2020; Rusak et al., 2021; Zhang et al., 2021a; S & Fleuret, 2021; Mummadi et al., 2021; Iwasawa & Matsuo, 2021; Liang et al., 2020; Niu et al., 2022; Thopalli et al., 2022; Wang et al., 2022b; Kurmi et al., 2021). Its setting is different from many works in domain adaptation or test-time training, where the source data or statistics of the source data are available, e.g., Xie et al. (2021); Liu et al. (2021a); Prabhu et al. (2021); Sun et al. (2020); Chen et al. (2022); Hoffman et al. (2018); Eastwood et al. (2022); Kundu et al. (2020); Liu et al. (2021b); Schneider et al. (2020); Gandelsman et al. (2022); Zhang et al. (2021b); Morerio et al. (2020); Su et al. (2022). Test-time adaptation has drawn growing interest recently, thanks to its potential in real-world applications where annotating test data from a new domain is costly and distribution shifts arise at test time due to natural factors, e.g., sensor degradation (Wang et al., 2021a), evolving road conditions (Gong et al., 2022; Kumar et al., 2020), weather conditions (Bobu et al., 2018), or changes in demographics, users, and time periods (Koh et al., 2021).

The central idea in many related works is the construction of pseudo-labels or the proposal of self-training loss functions for the unlabeled samples, see e.g., Wang et al. (2021a); Goyal et al. (2022). More precisely, at each test time $t$, one receives some unlabeled samples from a new domain, constructs pseudo-labels, and applies a GD step $w_{t+1} = w_t - \eta \nabla_w \ell^{\mathrm{self}}(w_t; x_t)$ to the corresponding self-training loss function, as summarized in Algorithm 1. Recently, Goyal et al. (2022) proposed a new type of pseudo-labels called conjugate labels, which is based on the observation that certain loss functions can be naturally connected to conjugate functions, and the pseudo-labels are obtained by exploiting a property of conjugate functions (to be elaborated soon). They provide a modular approach for constructing conjugate labels for some loss functions, e.g., square loss, cross-entropy loss, and exponential loss. An interesting finding of Goyal et al. (2022) is that a recently proposed self-training loss for test-time adaptation of Wang et al. (2021a) can be recovered from their conjugate-label framework. They also show that GD with conjugate labels empirically outperforms GD with other pseudo-labels like hard labels and robust pseudo-labels (Rusak et al., 2021) across many benchmarks, e.g., ImageNet-C (Hendrycks & Dietterich, 2019), ImageNet-R (Hendrycks et al., 2021), VISDA-C (Peng et al., 2017), and MNIST-M (Ganin & Lempitsky, 2015).

However, certain questions are left open in their work. For example, why does GD with conjugate labels work? Why can it dominate GD with other pseudo-labels? To our knowledge, while pseudo-labels are quite indispensable for self-training in the literature (Li et al., 2019; Zou et al., 2019), works that theoretically analyze the dynamics of GD with pseudo-labels are very sparse, and the only work that we are aware of is Chen et al. (2020). Chen et al. (2020) show that when data have spurious features, if projected GD is initialized with sufficiently high accuracy in a new domain, then by minimizing the exponential loss with hard labels, projected GD converges to an approximately Bayes-optimal solution under certain conditions. In this work, we study vanilla GD (without projection) for minimizing the self-training loss derived from square loss, logistic loss, and exponential loss under hard labels and conjugate labels.
We prove a performance gap between GD with conjugate labels and GD with hard labels under a simple Gaussian model (Schmidt et al., 2018; Carmon et al., 2019). Specifically, we show that GD with hard labels for minimizing square loss cannot converge to an $\epsilon$-optimal predictor (see (8) for the definition) for any arbitrarily small $\epsilon$, while GD with conjugate labels converges to an $\epsilon$-optimal predictor exponentially fast. Our theoretical result supports the conjugate labels of Goyal et al. (2022). We then analyze GD with hard and conjugate labels under logistic loss and exponential loss, and we show that under these scenarios, they converge to an optimal solution at a $\log(t)$ rate, where $t$ is the number of test-time iterations. Our results suggest that the performance of GD in test-time adaptation depends crucially on the choice of pseudo-labels and loss functions. Interestingly, the problems of minimizing the associated self-training losses of conjugate labels in this work are non-convex optimization problems. Hence, our theoretical results find an application in non-convex optimization where GD can enjoy some provable guarantees.

2. PRELIMINARIES

We now give an overview of hard labels and conjugate labels. We note that there are other proposals of pseudo-labels in the literature, and we refer the reader to Li et al. (2019); Zou et al. (2019); Rusak et al. (2021) and the references therein for details.

Hard labels: Suppose that a model $w$ outputs $h_w(x) \in \mathbb{R}^K$ and that each element of $h_w(x)$ can be viewed as the predicted score of each class for a multi-class classification problem with $K$ classes. A hard pseudo-label $y^{\mathrm{hard}}_w(x)$ is a one-hot vector which is 1 on dimension $k$ (and 0 elsewhere) if $k = \arg\max_{k'} h_w(x)[k']$, i.e., class $k$ has the largest predicted score by the model $w$ for a sample $x$ (Goyal et al., 2022). On the other hand, for a binary classification problem with a linear predictor, i.e., $h_w(x) = w^\top x$, a hard pseudo-label is simply defined as
$$y^{\mathrm{hard}}_w(x) := \mathrm{sign}(w^\top x), \tag{1}$$
see, e.g., Kumar et al. (2020), Chen et al. (2020). GD with hard labels is the case when Algorithm 1 uses a hard label to construct a gradient $\nabla_w \ell^{\mathrm{self}}(w_t; x_t)$ and update the model $w$.

Conjugate labels (Goyal et al., 2022): The approach of using conjugate labels as pseudo-labels crucially relies on the assumption that the original loss function is of the following form:
$$\ell(w; (y, x)) := f(h_w(x)) - y^\top h_w(x), \tag{2}$$
where $f(\cdot) : \mathbb{R}^K \to \mathbb{R}$ is a scalar-valued function, and $y \in \mathbb{R}^K$ is the label of $x$, which could be a one-hot encoding vector in multi-class classification. Since the true label $y$ of a sample $x$ is not available in test-time adaptation, it is natural to construct a pseudo-label $y^{\mathrm{pseudo}}_w(x)$ and consequently a self-training loss function by replacing $y$ with $y^{\mathrm{pseudo}}_w(x)$ in (2),
$$\ell^{\mathrm{conj}}(w; x) := f(h_w(x)) - y^{\mathrm{pseudo}}_w(x)^\top h_w(x). \tag{3}$$
One can then compute the gradient $\nabla \ell^{\mathrm{conj}}(w; x)$ and use GD to adapt the model $w$ at test time. Define $h_* \in \mathbb{R}^K$ as $h_* \leftarrow \arg\min_{h \in \mathbb{R}^K} f(h) - y^\top h$, where $f^*$, given by $-f^*(y) = \min_{h \in \mathbb{R}^K} f(h) - y^\top h$, is the conjugate function, see e.g., Chapter 3.3 in Boyd et al. (2004).

Table 1: Summary of {Hard, Conjugate} pseudo-labels and the resulting self-training loss functions using square loss, logistic loss, and exponential loss.

It turns out that $h_*$ satisfies $y = \nabla f(h_*)$. From this similarity, Goyal et al. (2022) propose conjugate labels:
$$y^{\mathrm{conj}}_w(x) := \nabla f(h_w(x)), \tag{4}$$
where $y^{\mathrm{conj}}_w(x)$ is possibly a real-valued vector instead of a one-hot encoding vector. Let $y^{\mathrm{pseudo}}_w(x) \leftarrow y^{\mathrm{conj}}_w(x)$ in (3). Then, we get the self-training loss function using the conjugate label:
$$\ell^{\mathrm{conj}}(w; x) := f(h_w(x)) - \nabla f(h_w(x))^\top h_w(x). \tag{5}$$
We note that GD with conjugate labels is an instance of Algorithm 1 when we let $\nabla_w \ell^{\mathrm{self}}(w_t; x_t) \leftarrow \nabla_w \ell^{\mathrm{conj}}(w_t; x_t)$ at each test time $t$. Table 1 summarizes conjugate labels and hard labels as well as their self-training loss functions using square loss, logistic loss, and exponential loss. We provide the derivation of the case using square loss below, while the rest are available in Appendix A.

(Square loss) Example of a conjugate label $y^{\mathrm{conj}}_w(x)$ and its self-training function $\ell^{\mathrm{conj}}(w; x)$: Observe that square loss $\ell(w; (x, y)) := \frac{1}{2}(y - w^\top x)^2$ is in the form of (2) up to a constant, where $f(\cdot) = \frac{1}{2}(\cdot)^2 : \mathbb{R} \to \mathbb{R}_+$. Substituting $f(\cdot) = \frac{1}{2}(\cdot)^2$ and $h_w(x) = w^\top x$ in (4) and (5), we get
$$y^{\mathrm{conj}}_w(x) = w^\top x, \quad \text{and} \quad \ell^{\mathrm{conj}}(w; x) = -\frac{1}{2}(w^\top x)^2. \tag{6}$$

3. THEORETICAL FRAMEWORK: GAUSSIAN MODEL

Our theoretical analysis considers a binary classification setting in which samples from the new domain are generated as $x \sim \mathcal{N}(y\mu, \sigma^2 I_d) \in \mathbb{R}^d$, where $\mu \in \mathbb{R}^d$ is the mean and $\sigma^2 > 0$ is the magnitude of the covariance. The label $y$ is assumed to be uniform on $\{-1, 1\}$. Therefore, we have $P(X \mid Y = y) = \mathcal{N}(y\mu, \sigma^2 I_d)$ and $P(y = -1) = P(y = 1) = \frac{1}{2}$ under the Gaussian model (Schmidt et al., 2018; Carmon et al., 2019; Kumar et al., 2020). Given a test sample $x$, a linear predictor $w \in \mathbb{R}^d$ makes a prediction of the label as $\hat{y}_w(x) = \mathrm{sign}(w^\top x)$. While a model could be self-trained under various loss functions, the natural metric to evaluate a model for classification is the expected 0-1 loss.
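Under the Gaussian model, the expected 0-1 loss enjoys a simple closed-form expression:
$$\ell^{0\text{-}1}(w) := \mathbb{E}_{(x,y)}\left[\mathbb{1}\{\hat{y}_w(x) \neq y\}\right] = P[y w^\top x < 0] = P\left[\mathcal{N}\left(\frac{\mu^\top w}{\sigma \|w\|}, 1\right) < 0\right] = \Phi\left(\frac{\mu^\top w}{\sigma \|w\|}\right), \tag{7}$$
where $\Phi(u) := \frac{1}{\sqrt{2\pi}} \int_u^\infty \exp(-z^2/2)\, dz$ is the Gaussian error function (the upper tail of the standard normal distribution). From (7), one can see that the predictors that minimize the 0-1 loss are those that align with $\mu$ in direction, and the minimum error is $\Phi\left(\frac{\|\mu\|}{\sigma}\right)$. In other words, an optimal linear predictor $w_* \in \mathbb{R}^d$ has to satisfy $\cos\left(\frac{w_*}{\|w_*\|}, \frac{\mu}{\|\mu\|}\right) = 1$.

In our theoretical analysis, we let $\mu = [\|\mu\|, 0, \ldots, 0]^\top \in \mathbb{R}^d$; namely, the first element is the only non-zero entry. Our treatment is without loss of generality, since we can rotate and change the coordinate system if necessary. For any vector $w \in \mathbb{R}^d$, its component orthogonal to $\mu$ is $\left(I_d - \frac{\mu}{\|\mu\|} \frac{\mu^\top}{\|\mu\|}\right) w$. Thanks to the assumption on $\mu$, the orthogonal space (to $\mu$) is the subspace spanned by coordinates 2 to $d$. Indeed, for any vector $w$, its orthogonal component $\left(I_d - \frac{\mu}{\|\mu\|} \frac{\mu^\top}{\|\mu\|}\right) w$ is always 0 in its first entry. Therefore, we can represent the orthogonal component of $w$ as $[w[2], \ldots, w[d]] \in \mathbb{R}^{d-1}$.

We call a model $w \in \mathbb{R}^d$ an $\epsilon$-optimal predictor under the Gaussian model if it satisfies two conditions:
$$\text{Condition 1: } \left\langle w, \frac{\mu}{\|\mu\|} \right\rangle = w[1] > 0 \quad \text{and} \quad \text{Condition 2: } \cos^2\left(\frac{w}{\|w\|}, \frac{\mu}{\|\mu\|}\right) \geq 1 - \epsilon. \tag{8}$$
Using (7) and the fact that $\Phi$ is decreasing, the expected 0-1 loss of an $\epsilon$-optimal predictor satisfies $\ell^{0\text{-}1}(w) \leq \Phi\left(\frac{\|\mu\|}{\sigma} \sqrt{1 - \epsilon}\right)$. To get an $\epsilon$-optimal predictor, we need $\langle w, \mu \rangle > 0$ and also need the ratio of the projection onto $\mu$ to the size of the orthogonal component to $\mu$ to be as large as possible, i.e., $\frac{w[1]^2}{\sum_{i \neq 1}^d w[i]^2}$ is large, which can be seen from the following equalities:
$$\cos^2\left(\frac{w}{\|w\|}, \frac{\mu}{\|\mu\|}\right) = \frac{\langle w, \mu \rangle^2}{\|w\|^2 \|\mu\|^2} = \frac{w[1]^2}{\sum_{i=1}^d w[i]^2} = \frac{1}{1 + \sum_{i \neq 1}^d \frac{w[i]^2}{w[1]^2}}.$$
The projection of $w$ onto $\mu$ has to be positive and large when the size of the orthogonal component is non-zero to get an $\epsilon$-optimal predictor, i.e., $w[1] \gg 0$.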
Finally, in our analysis we will assume that the initial point satisfies Condition 1 in (8), which means that the initial point forms an acute angle with $\mu$. This is a mild assumption, as it means that the source model is better than random guessing in the new domain.

Related works on the Gaussian model: In recent years, several works have adopted the Gaussian model to show provable guarantees on various topics. For example, Schmidt et al. (2018) and Carmon et al. (2019) study it for adversarial robustness. As another example, Kumar et al. (2020) recently show that self-training with hard labels can learn a good classifier when infinite unlabeled data are available and the distribution shifts are mild. Their theoretical result is perhaps the most relevant one to ours in the literature, in addition to Chen et al. (2020), which we discussed in the introduction. Kumar et al. (2020) consider the setting of gradual distribution shifts, so that the data distribution in each iteration $t$ is different and the update in each $t$ is a minimizer of a constrained optimization problem:
$$w_t \leftarrow \arg\min_{w \in \Theta} \mathbb{E}_{x \sim D_t}\left[L\left(y^{\mathrm{hard}}_w(x)\, w^\top x\right)\right], \quad \text{where } \Theta := \left\{w : \|w\| \leq 1, \|w - w_{t-1}\| \leq \tfrac{1}{2}\right\}. \tag{9}$$
In (9), $L(\cdot) : \mathbb{R} \to \mathbb{R}_+$ is a continuous decreasing function, $D_t$ represents the data distribution at $t$, and $y^{\mathrm{hard}}_w(x) := \mathrm{sign}(w^\top x)$ is the hard label for an unlabeled sample $x$. The main message of their result is that even though the data distribution of the target domain could be very different from that of the source domain, by using data from intermediate distributions that change gradually, a good classifier for the target domain can be obtained in the end. On the other hand, we consider analyzing GD with pseudo-labels at test-time iterations, and we do not assume that there are intermediate distributions.
Our goal is to provably show that GD with pseudo-labels can learn an optimal classifier in a new domain when only unlabeled samples are available at test time, which is different from the setup of Kumar et al. (2020) that simply assumes access to a minimizer of a certain objective.

4. (A NEGATIVE EXAMPLE) GD WITH HARD LABELS UNDER SQUARE LOSS

One of the common loss functions is square loss. Recent works have shown that even for the task of classification, a model trained under square loss can achieve competitive performance as compared to that of a model trained under certain classification losses like cross-entropy loss (Demirkaya et al., 2020; Han et al., 2022; Hui & Belkin, 2020). In this section, we analyze test-time adaptation by GD with hard pseudo-labels under square loss. Recall the definition of square loss: $\ell(w; (x, y)) = \frac{1}{2}(y - w^\top x)^2$. By using hard labels as in (1), the self-training loss function becomes
$$\ell^{\mathrm{hard}}(w; x) := \frac{1}{2}\left(y^{\mathrm{hard}}_w(x) - w^\top x\right)^2 = \frac{1}{2}\left(\mathrm{sign}(w^\top x) - w^\top x\right)^2. \tag{10}$$
Note that the derivative of $\mathrm{sign}(\cdot)$ is 0 everywhere except at the origin, where $\mathrm{sign}(\cdot)$ is not differentiable. Define $\mathrm{sign}(0) = 0$. Then, $\mathrm{sign}(w^\top x) - w^\top x = 0$ when $w^\top x = 0$, which allows us to avoid the issue of non-differentiability. Specifically, we can write the gradient as $\nabla \ell^{\mathrm{hard}}(w; x) = -\left(\mathrm{sign}(w^\top x) - w^\top x\right) x$. Using the gradient expression, we obtain the dynamic of GD with hard labels under square loss,
$$w_{t+1} = w_t - \eta \nabla \ell^{\mathrm{hard}}(w_t; x_t) = w_t + \eta \left(\mathrm{sign}(w_t^\top x_t) - w_t^\top x_t\right) x_t. \tag{11}$$
What we show in the following proposition is that the update $w_t$ of (11) does not converge to the class mean $\mu$ in direction. However, it should be noted that, depending on the setup, a perfect classifier (i.e., one that has zero 0-1 loss) does not necessarily need to align with the class mean $\mu$.

Proposition 1. GD with hard labels using square loss fails to converge to an $\epsilon$-optimal predictor for any arbitrarily small $\epsilon > 0$, even under the noiseless setting of the Gaussian model ($\sigma = 0$). More precisely, we have $\cos\left(\frac{w_t}{\|w_t\|}, \frac{\mu}{\|\mu\|}\right) \leq 1 - \bar{\epsilon}$, for some $\bar{\epsilon} > 0$ as $t \to \infty$ if $w_\infty$ exists.

Proof. In this proof, we denote $\bar{a}_t := w_t[1] = \left\langle w_t, \frac{\mu}{\|\mu\|} \right\rangle$. From (11), we have
$$\bar{a}_{t+1} = \bar{a}_t + \eta \left(\mathrm{sign}(w_t^\top x_t) - w_t^\top x_t\right) \left\langle x_t, \frac{\mu}{\|\mu\|} \right\rangle. \tag{12}$$
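Let us consider the simple noiseless setting of the Gaussian model, i.e., $\sigma = 0$, as we aim at giving a non-convergence example. Then, we have $x_t = y_t \mu$ and the dynamic (12) becomes
$$\bar{a}_{t+1} = (1 - \eta \|\mu\|^2)\, \bar{a}_t + \eta\, \mathrm{sign}(\bar{a}_t \|\mu\|)\, \|\mu\|, \tag{13}$$
where we used $y_t^2 = 1$ and $y_t\, \mathrm{sign}(y_t\, \cdot) = \mathrm{sign}(\cdot)$ because $y_t \in \{-1, +1\}$.

Case $\eta \leq \frac{1}{\|\mu\|^2}$: Given the initial condition $\bar{a}_1 > 0$, we have $\bar{a}_t > 0, \forall t$ from (13), and $\mathrm{sign}(\bar{a}_t \|\mu\|) = 1, \forall t$. Then, we can recursively expand (13) from time $t + 1$ back to time 1 and obtain
$$\bar{a}_{t+1} = (1 - \eta \|\mu\|^2)^t\, \bar{a}_1 + \eta \|\mu\| \sum_{s=0}^{t-1} (1 - \eta \|\mu\|^2)^s. \tag{14}$$
From (14), we know that $\bar{a}_t \to \frac{1}{\|\mu\|}$ as $t \to \infty$, where we used that $\sum_{s=0}^{\infty} (1 - \eta \|\mu\|^2)^s = \frac{1}{\eta \|\mu\|^2}$. On the other hand, the dynamic of each orthogonal component $i \neq 1$, $i \in [d]$, is
$$w_{t+1}[i] = w_t[i] + \eta \left(\mathrm{sign}(w_t^\top x_t) - w_t^\top x_t\right) x_t[i] = w_t[i], \tag{15}$$
where in the last equality we used that $x_t = y_t \mu$ and $\mu = [\|\mu\|, 0, \ldots, 0]^\top \in \mathbb{R}^d$, so that $x_t[i] = 0, \forall i \neq 1$. By (14) and (15), we get
$$\frac{\sum_{i \neq 1}^d w_\infty[i]^2}{w_\infty[1]^2} = \frac{\sum_{i \neq 1}^d w_1[i]^2}{1/\|\mu\|^2}.$$
That is, whenever $w_1$ has a non-zero orthogonal component, the ratio converges to a non-zero value, which implies that GD with hard labels fails to converge to an $\epsilon$-optimal predictor for any arbitrarily small $\epsilon$, i.e., $\cos\left(\frac{w_\infty}{\|w_\infty\|}, \frac{\mu}{\|\mu\|}\right) \leq 1 - \bar{\epsilon}$ for some $\bar{\epsilon} > 0$.

Case $\eta > \frac{1}{\|\mu\|^2}$: Suppose $\bar{a}_t > 0$. Then, from (13), the condition for $\bar{a}_{t+1} \geq \bar{a}_t$ is $\frac{1}{\|\mu\|} \geq \bar{a}_t$, which means that the projection onto $\mu$ is bounded and hence the model $w_t$ cannot be an $\epsilon$-optimal classifier for any arbitrarily small $\epsilon$. On the other hand, if $\bar{a}_t > \frac{1}{\|\mu\|}$, then $\bar{a}_{t+1} < \bar{a}_t$, and $\bar{a}_{t+1}$ could even be negative when $\bar{a}_t > \frac{1}{\|\mu\| - 1/(\eta \|\mu\|)}$. Moreover, if $\eta > \frac{2}{\|\mu\|^2}$ and $|\bar{a}_t| > \frac{\eta \|\mu\|}{\eta \|\mu\|^2 - 2} = \frac{1}{\|\mu\| - 2/(\eta \|\mu\|)}$, then the magnitude $|\bar{a}_t|$ is increasing and the sign of $\bar{a}_t$ is oscillating; more precisely, we will have $|\bar{a}_{t+1}| \geq |\bar{a}_t|$ and $\mathrm{sign}(\bar{a}_{t+1}) = -\mathrm{sign}(\bar{a}_t)$. Consequently, the model $w_t$ is no better than random guessing at every other iteration (recall (7)), which is not desirable for test-time adaptation.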
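The first case of the proof can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names, the dimension, the initial point, and the step size are illustrative choices that do not come from the paper. It runs update (11) on noiseless samples $x_t = y_t \mu$ with alternating labels and shows that the first coordinate settles at $1/\|\mu\|$ while the orthogonal coordinates never move, so the cosine with $\mu$ stays bounded away from 1.

```python
import numpy as np

def gd_hard_square_noiseless(w1, mu, eta, num_steps):
    """Run update (11) on noiseless samples x_t = y_t * mu with alternating labels."""
    w = np.array(w1, dtype=float)
    for t in range(1, num_steps + 1):
        y = 1.0 if t % 2 == 1 else -1.0   # the label is never shown to the algorithm
        x = y * mu
        score = w @ x
        # np.sign(0) == 0, matching the convention sign(0) = 0 used in the text
        w = w + eta * (np.sign(score) - score) * x
    return w

def cosine(w, mu):
    """Cosine of the angle between w and mu."""
    return float(w @ mu / (np.linalg.norm(w) * np.linalg.norm(mu)))

mu = np.array([2.0, 0.0, 0.0])    # ||mu|| = 2; only the first entry is non-zero
w1 = np.array([0.1, 1.0, -0.5])   # illustrative start with w1[0] > 0 (Condition 1)
eta = 0.1                         # satisfies eta <= 1 / ||mu||^2 = 0.25
w_final = gd_hard_square_noiseless(w1, mu, eta, num_steps=500)
print("final iterate:", w_final)
print("cosine with mu:", cosine(w_final, mu))
```

With these values the first coordinate approaches $1/\|\mu\| = 0.5$, and the final cosine is determined entirely by the orthogonal part of the initial point, as the ratio at the end of the first case predicts.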
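In the next section, we will provably show that GD with conjugate labels under square loss can learn an $\epsilon$-optimal predictor for any arbitrarily small $\epsilon$, which, to the best of our knowledge, is the first theoretical result in the literature that shows the advantage of conjugate labels over hard labels.

5. CONVERGENCE RESULTS OF GD WITH PSEUDO-LABELS

Recall that we have $\ell^{\mathrm{self}}(w; x) = \psi(w^\top x)$ for some scalar function $\psi(\cdot) : \mathbb{R} \to \mathbb{R}$ under the scenario of linear predictors. If $\psi(\cdot)$ is an even function, i.e., $\psi(u) = \psi(-u)$ for all $u \in \mathbb{R}$, then
$$\ell^{\mathrm{self}}(w; x) = \psi(w^\top x) = \psi\left(y\, w^\top (\mu + \sigma \xi)\right) = \psi\left(w^\top (\mu + \sigma \xi)\right), \tag{16}$$
where the second equality uses $x = y(\mu + \sigma \xi)$ with $\xi \sim \mathcal{N}(0, I_d)$ under the Gaussian model, and the last equality uses the assumption that $\psi(\cdot)$ is an even function. We emphasize that the underlying algorithm itself does not have knowledge of $\mu$, $\sigma$, or $\xi$, and the last expression simply arises from our analysis. From (16), we know that the gradient is
$$\nabla \ell^{\mathrm{self}}(w; x) = \nabla \psi(w^\top x) = \psi'\left(w^\top (\mu + \sigma \xi)\right) (\mu + \sigma \xi). \tag{17}$$
Hence, the dynamic of GD with pseudo-labels is
$$w_{t+1} = w_t - \eta \nabla \ell^{\mathrm{self}}(w_t; x_t) = w_t - \eta\, \psi'\left(w_t^\top (\mu + \sigma \xi)\right) (\mu + \sigma \xi). \tag{18}$$
Now let us analyze the population dynamics, which means that we observe infinitely many unlabeled samples, so we can take the expectation on the r.h.s. of (18). We get
$$\begin{align}
w_{t+1} &= w_t - \eta\, \mathbb{E}_\xi\left[\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] \mu - \eta\, \mathbb{E}_\xi\left[\psi'\left(w_t^\top (\mu + \sigma \xi)\right) \sigma \xi\right] \tag{19} \\
&= w_t - \eta\, \mathbb{E}_\xi\left[\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] \mu - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right] w_t \notag \\
&= \left(1 - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right]\right) w_t - \eta\, \mathbb{E}_\xi\left[\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] \mu, \tag{20}
\end{align}$$
where the second to last equality uses Stein's identity (Stein, 1981): for any differentiable function $\psi : \mathbb{R}^d \to \mathbb{R}$ and $\xi \sim \mathcal{N}(0, I_d)$, it holds that $\mathbb{E}_\xi[\xi \psi(\xi)] = \mathbb{E}_\xi[\nabla_\xi \psi(\xi)]$.

Denote by $a_t := \langle w_t, \mu \rangle$ the component of $w_t$ along $\mu$. Given the dynamic (20), it is clear that the component along $\mu$ evolves as
$$a_{t+1} = \left(1 - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right]\right) a_t - \eta\, \mathbb{E}_\xi\left[\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] \|\mu\|^2. \tag{21}$$
On the other hand, denote by $b_t := \left\|[w_t[2], \ldots, w_t[d]]\right\|$ the size of the component orthogonal to $\mu$. Then, its population dynamic evolves as
$$b_{t+1} = \left|1 - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right]\right| b_t. \tag{22}$$
We further define the ratio $r_t := \frac{a_t}{b_t}$. By (21) and (22), we have
$$r_{t+1} = \mathrm{sign}\left(1 - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right]\right) \left(r_t + \frac{\eta\, \mathbb{E}_\xi\left[-\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] \|\mu\|^2}{\left(1 - \eta \sigma^2\, \mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right]\right) b_t}\right). \tag{23}$$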
It turns out that $\cos\left(\frac{w_t}{\|w_t\|}, \frac{\mu}{\|\mu\|}\right)$ is an increasing function of $r_t$. Indeed,
$$\cos\left(\frac{w_t}{\|w_t\|}, \frac{\mu}{\|\mu\|}\right) = \frac{\langle w_t, \mu \rangle}{\|w_t\| \|\mu\|} = \frac{\langle w_t, \mu \rangle}{b_t \sqrt{\|\mu\|^2 + \langle w_t, \mu \rangle^2 / b_t^2}} = \mathrm{sign}(r_t) \frac{1}{\sqrt{1 + \|\mu\|^2 / r_t^2}}, \tag{24}$$
where we used $\|w_t\| = \sqrt{(w_t^\top \mu / \|\mu\|)^2 + b_t^2}$. A successful recovery ($\cos \to 1$) means that we would like $r_t \to \infty$.

In the rest of this paper, we will use the notations ♦+♥ or GD + ♦+♥, where ♦ ∈ {conj, hard} and ♥ ∈ {square, logistic, exp}, for brevity. For example, hard + exp represents the self-training loss based on hard labels under exponential loss, i.e., $\ell^{\mathrm{hard}}(w; x) = \exp(-|w^\top x|)$, while GD + conj + square stands for GD with conjugate labels under square loss in test-time adaptation.

5.1. (EXPONENTIAL-RATE CONVERGENCE) GD + conj + square

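Proposition 2. (GD + conj + square) The ratio of the projection onto $\mu$ to the size of the orthogonal component grows as
$$r_{t+1} = r_1 \left(1 + \frac{\eta \|\mu\|^2}{1 + \eta \sigma^2}\right)^t.$$
Furthermore, GD learns an $\epsilon$-optimal predictor after
$$t \geq \frac{1}{2} \cdot \frac{\log\left(\|\mu\|^2 / (\epsilon r_1^2)\right)}{\log\left(1 + \eta \|\mu\|^2 / (1 + \eta \sigma^2)\right)}$$
iterations.

Proof. For GD + conj + square, the self-training loss is $\ell^{\mathrm{conj}}(w; x) = -\frac{1}{2}(w^\top x)^2$ from (6). Hence, $\psi(\cdot) = -\frac{1}{2}(\cdot)^2$ in (16); moreover, $\psi'(\cdot) = -(\cdot)$ and $\psi''(\cdot) = -1$ in (23). Therefore, we have $\mathbb{E}_\xi\left[-\psi'\left(w_t^\top (\mu + \sigma \xi)\right)\right] = \mathbb{E}_\xi\left[w_t^\top (\mu + \sigma \xi)\right] = w_t^\top \mu$ since $\mathbb{E}_\xi[w_t^\top \xi] = 0$, and we also have $\mathbb{E}_\xi\left[\psi''\left(w_t^\top (\mu + \sigma \xi)\right)\right] = \mathbb{E}_\xi[-1] = -1$ in (23). Consequently, the dynamic of the ratio is
$$r_{t+1} = r_t + \frac{\eta\, w_t^\top \mu\, \|\mu\|^2}{(1 + \eta \sigma^2)\, b_t} = r_t \left(1 + \frac{\eta \|\mu\|^2}{1 + \eta \sigma^2}\right) = r_1 \left(1 + \frac{\eta \|\mu\|^2}{1 + \eta \sigma^2}\right)^t. \tag{25}$$
From (24) and (25), the cosine between $w_t$ and $\mu$ is positive and increasing, given the initial condition $a_1 > 0$ (or equivalently, $r_1 > 0$). Hence, Condition 1 in (8) holds for all $t$. By using (24), we see that to get an $\epsilon$-optimal predictor at test time $t$, it suffices to satisfy
$$\frac{\|\mu\|^2}{r_1^2 \left(1 + \frac{\eta \|\mu\|^2}{1 + \eta \sigma^2}\right)^{2t}} \leq \epsilon,$$
since $\frac{1}{1 + u} \geq 1 - u$ for all $u \geq 0$. A simple calculation shows that this holds when $t \geq \frac{1}{2} \cdot \frac{\log\left(\|\mu\|^2 / (\epsilon r_1^2)\right)}{\log\left(1 + \eta \|\mu\|^2 / (1 + \eta \sigma^2)\right)}$.

Propositions 1 and 2 together provably show a performance gap between GD + conj + square and GD + hard + square. Using conjugate labels, GD converges to the class mean $\mu$ in direction exponentially fast, while GD with hard labels fails in this task.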
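The closed form in Proposition 2 can be compared against a direct iteration of the population update. The sketch below assumes NumPy; the function name and all numerical values (dimension, initial point, $\eta$, $\sigma$, $\epsilon$) are illustrative and not taken from the paper. It specializes (20) to conj + square, where $\psi'' = -1$ and $\mathbb{E}_\xi[\psi'] = -\langle w_t, \mu \rangle$, records $r_t$, and evaluates the iteration bound.

```python
import numpy as np

def ratio_conj_square(w1, mu_norm, sigma, eta, num_steps):
    """Population GD for conj + square; returns the array [r_1, ..., r_{num_steps+1}]."""
    mu = np.zeros(len(w1))
    mu[0] = mu_norm                      # mu = [||mu||, 0, ..., 0]
    w = np.array(w1, dtype=float)
    ratios = []
    for _ in range(num_steps + 1):
        a = w @ mu                       # a_t = <w_t, mu>
        b = np.linalg.norm(w[1:])        # b_t = size of the orthogonal component
        ratios.append(a / b)
        # Update (20) with psi'' = -1 and E[psi'] = -<w_t, mu>
        w = (1.0 + eta * sigma**2) * w + eta * (w @ mu) * mu
    return np.array(ratios)

mu_norm, sigma, eta, eps = 1.0, 0.5, 0.1, 0.01
w1 = np.array([0.2, 1.0, -1.0, 0.5, 0.3])   # illustrative start with w1[0] > 0
r = ratio_conj_square(w1, mu_norm, sigma, eta, num_steps=80)

growth = 1.0 + eta * mu_norm**2 / (1.0 + eta * sigma**2)
t_min = int(np.ceil(0.5 * np.log(mu_norm**2 / (eps * r[0]**2)) / np.log(growth)))
cos_sq = 1.0 / (1.0 + mu_norm**2 / r**2)    # squared cosine from (24)
print("iterations required by the bound:", t_min)
print("squared cosine after that many iterations:", cos_sq[t_min])
```

Here `r[k]` stores $r_{k+1}$, so `cos_sq[t_min]` is the squared cosine after `t_min` updates, which is the quantity the bound in Proposition 2 controls.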

5.2. log(t)-RATE CONVERGENCE OF GD

In this subsection, we consider self-training loss functions, $\ell^{\mathrm{self}}(w; x) = \psi(w^\top x)$, that satisfy the following set of properties ♣ with parameter $(L, a_{\min})$:
(i) Even: $\psi(-a) = \psi(a)$ for all $a \in \mathbb{R}$.
(ii) There exists $0 < L < \infty$ such that $-\psi'(a) \geq e^{-La}$ for all $a \geq a_{\min}$.

Lemma 1. The following self-training loss functions $\ell^{\mathrm{self}}(w; x) = \psi(w^\top x)$ satisfy ♣. More precisely, we have:
1. hard + exp: $\psi(u) = \exp(-|u|)$ satisfies ♣ with $(L = 1, a_{\min} = 0)$.
2. hard + logistic: $\psi(u) = \log(\cosh(u)) - |u|$ satisfies ♣ with $(L = 2, a_{\min} = 0)$.
3. conj + exp: $\psi(u) = \mathrm{sech}(u)$ satisfies ♣ with $(L = 1, a_{\min} = 0.75)$.
4. conj + logistic: $\psi(u) = \log(\cosh(u)) - \tanh(u)\, u$ satisfies ♣ with $(L = 2, a_{\min} = 0.5)$.

The proof of Lemma 1 is available in Appendix C. Figure 2 plots the self-training losses listed in Lemma 1. From the figure, one can see that Property ♣ is evident for these self-training losses. We will also need the following supporting lemma to get a convergence rate.

Lemma 2. Consider the dynamic $r_{t+1} \geq r_t + c\, e^{-L r_t}$, for some $L > 0$ and $c \geq 0$. Suppose that initially $r_1 > 0$. Then, $r_{t - \tau_*} \geq \frac{1}{2L} \log c(t - 1)$, for all $t > \tau_*$, where $\tau_* = 0$ if $\nu \leq e^{L\nu}, \forall \nu \geq 0$; otherwise, $\tau_* = \nu_*^2(L)/c$, where $\nu_*(L)$ is the unique fixed point of $\nu_* = e^{L \nu_*}$ if it exists.

Proof. From the dynamic, it is clear that $r_{t+1} \geq r_t$ since $c \geq 0$. Then,
$$e^{L r_{t+1}} r_{t+1} \geq e^{L r_{t+1}} r_t + c\, e^{L(r_{t+1} - r_t)} \geq e^{L r_t} r_t + c \geq e^{L r_1} r_1 + ct \geq ct, \tag{26}$$
where the second to last step follows from unrolling the recursion $t$ times. We first analyze the case that $r_t \leq e^{L r_t}$. Since $r_t \leq e^{L r_t}$, we have $e^{2L r_t} \geq c(t - 1)$ from (26). Hence, $r_t \geq \frac{1}{2L} \log c(t - 1)$. Now let us switch to the case that $r_t \geq e^{L r_t}$. Let $\nu_*(L)$ be the unique point $\nu_*$ such that $\nu_* = e^{L \nu_*}$. If $r_t \leq \nu_*(L)$, then $r_t \geq e^{L r_t}$. Hence, we have $r_t^2 \geq r_t e^{L r_t} \geq c(t - 1)$, so $r_t \geq \sqrt{c(t - 1)}$. Note that this possibility cannot happen more than $\tau_* := \nu_*^2(L)/c$ times, since we need $r_t \leq \nu_*(L)$ to stay in this regime.
So eventually we get out of this regime after a constant number $\tau_*$ of iterations.

Now we are ready to state another main result of this paper. Proposition 3 below shows a $\log(t)$ convergence rate of GD with pseudo-labels in the noiseless setting $\sigma^2 = 0$ if the underlying self-training loss function satisfies ♣. The gap between the exponential rate of GD with conjugate labels using square loss shown in Proposition 2 and the logarithmic rate in Proposition 3 suggests that the performance of GD in test-time adaptation also crucially depends on the choice of loss functions, in addition to the choice of pseudo-labels.

Proposition 3. (Noiseless setting) Apply GD to minimizing $\ell^{\mathrm{self}}(w; x) = \psi(w^\top x)$, where $\psi(\cdot)$ satisfies ♣. If the initial point satisfies $a_1 > a_{\min}$, then the ratio of $w_t$'s component along $\mu$ to the size of its orthogonal component to $\mu$ at test time $t$, i.e., $r_t$ in (23), satisfies
$$r_{t - \tau_*} = \Omega\left(\frac{1}{L b_1} \log\left(\frac{\eta \|\mu\|^2}{b_1}\, t\right)\right), \quad \text{for all } t > \tau_*,$$
where $\tau_*$ is a constant defined in Lemma 2.

Proof. From (19) or (22), we know that the size of the orthogonal component does not change throughout the iterations when $\sigma^2 = 0$, i.e., $b_{t+1} = b_t, \forall t$. On the other hand, the component along $\mu$ in the noiseless setting has the dynamic
$$a_{t+1} \overset{(21)}{=} a_t + \eta \left(-\psi'(a_t)\right) \|\mu\|^2 \geq a_t + \eta\, e^{-L a_t} \|\mu\|^2, \quad \forall a_t \geq a_{\min}, \tag{27}$$
where we recall $a_t := \langle w_t, \mu \rangle$ and the inequality uses the property regarding $-\psi'(\cdot)$ as stated in ♣. Note that (27) implies that $a_t$ is non-decreasing, and hence the condition on the initial point, i.e., $a_1 \geq a_{\min}$, guarantees $a_t \geq a_{\min}$ for all test times $t$. By using the above results, we deduce that the dynamic of the ratio $r_t := \frac{a_t}{b_t}$ satisfies
$$r_{t+1} \geq r_t + \eta\, e^{-L a_t} \frac{\|\mu\|^2}{b_1} = r_t + \eta\, e^{-L r_t b_1} \frac{\|\mu\|^2}{b_1},$$
where we used that $b_{t+1} = b_t = b_1, \forall t$. Invoking Lemma 2 leads to the result.
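The logarithmic growth can be seen in the simplest instance, hard + exp in the noiseless setting, where $-\psi'(a) = e^{-a}$ for $a > 0$ and (27) holds with equality. The sketch below assumes NumPy; the function name and the values of $a_1$, $\|\mu\|$, $\eta$, and the horizon are illustrative. The lower bound it checks, $a_t \geq \log\left(e^{a_1} + \eta \|\mu\|^2 (t - 1)\right)$, is an elementary consequence of $e^u \geq 1 + u$ for this special case and is offered as a sanity check, not as a restatement of Lemma 2.

```python
import numpy as np

def gd_hard_exp_noiseless(a1, mu_norm, eta, num_steps):
    """Iterate a_{t+1} = a_t + eta * exp(-a_t) * ||mu||^2 (hard + exp, sigma = 0, a_t > 0)."""
    a = np.empty(num_steps)
    a[0] = a1
    for t in range(num_steps - 1):
        a[t + 1] = a[t] + eta * np.exp(-a[t]) * mu_norm**2
    return a

a1, mu_norm, eta, T = 0.5, 1.0, 0.1, 10_000
a = gd_hard_exp_noiseless(a1, mu_norm, eta, T)

# a[k] stores a_{k+1}, so steps[k] = t - 1 for the iterate a_t.
steps = np.arange(T)
lower = np.log(np.exp(a1) + eta * mu_norm**2 * steps)
print("a_T:", a[-1], "logarithmic lower bound:", lower[-1])
```

Since $b_t = b_1$ for all $t$ in the noiseless setting, dividing `a` by $b_1$ gives the ratio $r_t$, which therefore also grows logarithmically in $t$.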
of each on Figure 3, which shows that there exists a threshold $z_{\min}$ such that for all $z \geq z_{\min}$, the number $L(z)$ that corresponds to the loss function with the conjugate label is smaller than that of the hard label. This implies that the self-training loss derived from conjugate labels can have a smaller constant $L$ (for a finite $z$) compared to that of hard labels, which in turn might hint at a faster convergence of GD + conj compared to GD + hard for exponential loss and logistic loss.

Figure 4 shows the experimental results under the Gaussian model, where GD uses a received mini-batch of samples to conduct the update at each test time. The detailed setup is available in Appendix B. We find that GD with conjugate labels dominates GD with hard labels empirically, which is aligned with our theoretical result. We note that for the case of exponential loss, Goyal et al. (2022) report a similar experimental result under the Gaussian model: GD + conj + exp outperforms GD + hard + exp.

6. LIMITATIONS AND OUTLOOKS

In this paper, we analyze GD with hard and conjugate pseudo-labels for test-time adaptation under different loss functions. We study the performance of each of them under a binary classification framework, identify a scenario in which GD with hard labels cannot converge to an $\epsilon$-optimal predictor for any small $\epsilon$ while GD with conjugate labels does, and obtain some convergence results of GD with pseudo-labels. However, there are still many directions worth exploring. First, while our current analysis in the binary classification setting might be viewed as a first step towards systematically studying GD with pseudo-labels, analyzing GD with pseudo-labels in multi-class classification is left open in this work and could be a potential direction. Second, while analyzing the population dynamics has already given us some insights about GD with pseudo-labels, it might be useful to study their finite-sample dynamics. Third, theoretically understanding GD with other pseudo-labels or combined with other domain adaptation techniques like ensembling (e.g., Wortsman et al. (2022)) or others (e.g., Li et al. (2019); Schneider et al. (2020); Eastwood et al. (2022)) might be promising. Finally, analyzing momentum methods (e.g., Nesterov (2013); Wibisono et al. (2016); Wang & Abernethy (2018); Wang et al. (2022a; 2021b;c)) with pseudo-labels is another interesting direction, and one of the open questions is whether they enjoy provable guarantees of faster test-time adaptation compared to GD. Overall, we believe that the connection between optimization, domain adaptation, and machine learning under distribution shifts can be strengthened.






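Table 1

| Loss | Pseudo-label type | Pseudo-label | Self-training loss |
|---|---|---|---|
| Square loss: $\ell(w; (x, y)) := \frac{1}{2}(y - w^\top x)^2$ | Hard | $y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x)$ | $\ell^{\mathrm{hard}}(w; x) = \frac{1}{2}\left(\mathrm{sign}(w^\top x) - w^\top x\right)^2$ |
| | Conjugate | $y^{\mathrm{conj}}_w(x) = w^\top x$ | $\ell^{\mathrm{conj}}(w; x) = -\frac{1}{2}(w^\top x)^2$ |
| Logistic loss: $\ell^{\mathrm{logit}}(w; (x, y)) := \log\cosh(w^\top x) - y(w^\top x)$, where $y \in \{+1, -1\}$ | Hard | $y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x)$ | $\ell^{\mathrm{hard}}(w; x) = \log\cosh(w^\top x) - \lvert w^\top x \rvert$ |
| | Conjugate | $y^{\mathrm{conj}}_w(x) = \tanh(w^\top x)$ | $\ell^{\mathrm{conj}}(w; x) = \log\cosh(w^\top x) - \tanh(w^\top x)\, w^\top x$ |
| Exponential loss: $\ell^{\mathrm{exp}}(w; (x, y)) := \exp(-y w^\top x)$, where $y \in \{+1, -1\}$ | Hard | $y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x)$ | $\ell^{\mathrm{hard}}(w; x) = \exp(-\lvert w^\top x \rvert)$ |
| | Conjugate | $y^{\mathrm{conj}}_w(x) = \tanh(w^\top x)$ | $\ell^{\mathrm{conj}}(w; x) = \mathrm{sech}(w^\top x)$ |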

Figure 1: Expected 0-1 loss vs. test-time iteration of GD. GD with hard labels under square loss (blue solid line) cannot converge to the class mean $\mu$ in direction, while GD with conjugate labels under square loss (green dash-dot line) learns an $\epsilon$-optimal predictor. Here, "no-adaptation" means simply predicting according to the initial model without any updates. The detailed setup is described in Appendix B.

Figure 2: Plots of some self-training loss functions that satisfy the set of properties ♣.

Figure 3: We plot $L(z) := \frac{\log(-\psi'(z))}{-z}$ vs. $z$, where $\psi'(\cdot)$ is the first derivative of the underlying self-training loss. Left: $L(z)$ vs. $z$ of hard + exp and conj + exp. Right: $L(z)$ vs. $z$ of hard + logistic and conj + logistic.

Figure 4: Expected 0-1 loss $\Phi\left(\frac{\mu^\top w_t}{\sigma \|w_t\|}\right)$ vs. test-time $t$. Left: GD + hard + exp and GD + conj + exp. Right: GD + hard + logistic and GD + conj + logistic. Here "best minimal error" is $\Phi\left(\frac{\|\mu\|}{\sigma}\right)$ (recall the discussion in Section 3). Both figures show that GD with conjugate labels outperforms GD with hard labels.

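Algorithm 1: Test-time adaptation via pseudo-labeling
1: Init: $w_1 = w_S$, where $w_S$ is the model learned from a source domain.
2: Given: Access to samples from the data distribution $D_{\mathrm{test}}$ of a new domain.
3: for $t = 1, 2, \ldots, T$ do
4:   Get a sample $x_t \sim D_{\mathrm{test}}$ from the new domain.
5:   Construct a pseudo-label $y^{\mathrm{pseudo}}_{w_t}(x_t)$ and the corresponding self-training loss function $\ell^{\mathrm{self}}(w_t; x_t)$.
6:   Apply gradient descent (GD): $w_{t+1} = w_t - \eta \nabla_w \ell^{\mathrm{self}}(w_t; x_t)$.
7: end for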

ACKNOWLEDGMENTS

The authors thank Shikhar Jaiswal for spotting a minor error in a previous version of the proof of Proposition 1, which has been corrected in this version. The authors thank the reviewers for their constructive feedback and Sachin Goyal for his comments, which helped improve the quality of this paper. The authors also thank Chi-Heng Lin for valuable discussions.

A DERIVATIONS OF CONJUGATE LABELS AND THE ASSOCIATED SELF-TRAINING LOSSES IN TABLE 1

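1. (Square loss): Square loss $\ell(w; (x, y)) := \frac{1}{2}(y - w^\top x)^2$ is in the form of (2) up to a constant, where $f(\cdot) = \frac{1}{2}(\cdot)^2 : \mathbb{R} \to \mathbb{R}_+$. Substituting $f(\cdot) = \frac{1}{2}(\cdot)^2$ and $h_w(x) = w^\top x$ into (4) and (5), we get
$$y^{\mathrm{conj}}_w(x) = w^\top x, \quad \text{and} \quad \ell^{\mathrm{conj}}(w; x) = -\frac{1}{2}(w^\top x)^2. \tag{28}$$
On the other hand, let $y \leftarrow \mathrm{sign}(w^\top x)$. We have
$$y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x), \quad \text{and} \quad \ell^{\mathrm{hard}}(w; x) = \frac{1}{2}\left(\mathrm{sign}(w^\top x) - w^\top x\right)^2. \tag{29}$$

2. (Logistic loss): Recall that logistic regression predicts $P(\hat{y} = 1) = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$ and $P(\hat{y} = 0) = 1 - P(\hat{y} = 1)$, and the loss function is
$$\ell^{\mathrm{logit}}(w; (x, \hat{y})) := -\hat{y} \log P(\hat{y} = 1) - (1 - \hat{y}) \log P(\hat{y} = 0) = \log\left(1 + \exp(w^\top x)\right) - \hat{y}\, w^\top x, \tag{30}$$
where $\hat{y} \in \{0, 1\}$. Let $y = 2\hat{y} - 1 \in \{-1, 1\}$. Then, substituting $\hat{y} = \frac{1}{2} + \frac{y}{2}$ back into (30) and using the equation $\cosh(z) = \frac{\exp(z) + \exp(-z)}{2}$ for any $z \in \mathbb{R}$, we obtain an equivalent objective:
$$\ell^{\mathrm{logit}}(w; (x, y)) = \log\cosh\left(\frac{w^\top x}{2}\right) - y\, \frac{w^\top x}{2} + \log 2. \tag{31}$$
Now by renaming $\frac{w}{2} \leftarrow w$, we get
$$\ell^{\mathrm{logit}}(w; (x, y)) = \log\cosh(w^\top x) - y(w^\top x) + \log 2, \tag{32}$$
where the last term is a constant and can be dropped without affecting the training. Observe that (32) is in the form of (2), where $f(\cdot) = \log(\cosh(\cdot))$ and $h_w(x) = w^\top x$. Using (4) and (5), we get
$$y^{\mathrm{conj}}_w(x) = \tanh(w^\top x), \quad \text{and} \quad \ell^{\mathrm{conj}}(w; x) = \log\cosh(w^\top x) - \tanh(w^\top x)\, w^\top x. \tag{33}$$
On the other hand, let $y \leftarrow \mathrm{sign}(w^\top x)$ in (32). We have
$$y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x), \quad \text{and} \quad \ell^{\mathrm{hard}}(w; x) = \log\cosh(w^\top x) - |w^\top x|. \tag{34}$$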

3. (Exponential loss):

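Recall that the exponential loss is $\ell^{\exp}(w; (x, y)) := \exp(-y h_w(x)) = \exp(-y\, w^\top x)$, where $y \in \{+1, -1\}$, which can be rewritten as

$$\ell^{\exp}(w; (x, y)) = \cosh(w^\top x) - y \sinh(w^\top x). \tag{35}$$

The above function is in an expanded conjugate form (Goyal et al., 2022):

$$\ell(w; (x, y)) = f(h_w(x)) - y\, g(h_w(x)), \quad \text{with } f(\cdot) = \cosh(\cdot) \text{ and } g(\cdot) = \sinh(\cdot),$$

where Goyal et al. (2022) define the conjugate label $y^{\mathrm{conj}}_w(x)$ via the equality $\nabla f(h_w(x)) = \nabla g(h_w(x))\, y^{\mathrm{conj}}_w(x)$ for this case. Therefore, we have $y^{\mathrm{conj}}_w(x) = \tanh(w^\top x)$. By substituting $y \leftarrow y^{\mathrm{conj}}_w(x)$ in (35), we get the self-training loss function using the conjugate label: $\ell^{\mathrm{conj}}(w; x) = \cosh(w^\top x) - \tanh(w^\top x)\sinh(w^\top x) = \mathrm{sech}(w^\top x)$. To conclude, we have:

$$y^{\mathrm{conj}}_w(x) = \tanh(w^\top x), \quad \ell^{\mathrm{conj}}(w; x) = \mathrm{sech}(w^\top x), \qquad y^{\mathrm{hard}}_w(x) = \mathrm{sign}(w^\top x), \quad \ell^{\mathrm{hard}}(w; x) = \exp(-|w^\top x|).$$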
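The closed forms in this appendix are easy to check numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: it writes each self-training loss as a function of the scalar $u = w^\top x$, and the function names are hypothetical. It also encodes the identity $\exp(-yu) = \cosh(u) - y\sinh(u)$ for $y \in \{\pm 1\}$, which is assumed here to be the rewriting used for the exponential loss.

```python
import numpy as np

# Every function takes u = w @ x (a float or a NumPy array).


def conj_square(u):
    return -0.5 * u ** 2


def hard_square(u):
    return 0.5 * (np.sign(u) - u) ** 2


def conj_logistic(u):
    return np.log(np.cosh(u)) - np.tanh(u) * u


def hard_logistic(u):
    return np.log(np.cosh(u)) - np.abs(u)


def conj_exp(u):
    return 1.0 / np.cosh(u)   # sech(u)


def hard_exp(u):
    return np.exp(-np.abs(u))


def exp_loss_expanded(u, y):
    """cosh(u) - y * sinh(u); equals exp(-y * u) whenever y is +1 or -1."""
    return np.cosh(u) - y * np.sinh(u)


if __name__ == "__main__":
    u = np.linspace(-3.0, 3.0, 13)
    # Plugging the conjugate label tanh(u) into the expanded form gives sech(u).
    print(np.allclose(exp_loss_expanded(u, np.tanh(u)), conj_exp(u)))
```

Plugging the hard label $\mathrm{sign}(u)$ into the same expanded form reproduces `hard_exp`, which gives a quick consistency check between the two kinds of pseudo-labels.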

B SETUP OF THE SIMULATION IN FIGURE 1 AND FIGURE 4

Below we describe how to reproduce Figure 1 and Figure 4. We first specify the means and covariances $\mu_S$, $\mu_T$, $\Sigma_S = \sigma_S I_d$, $\Sigma_T = \sigma_T I_d$ as follows, where the subscript $S$ stands for the source domain and the subscript $T$ for the target domain.

We set $\mu_S = e_1$ and $\mu_T[1] = 0.6567$; the remaining elements of $\mu_T$ are drawn randomly from a normal distribution and normalized so that $\mu_T$ is a unit-norm vector. Then, we set $\sigma_T = 0.6567/0.8416$. This way we have $\Phi\big(\frac{\mu_T^\top w_S}{\sigma_T \|w_S\|}\big) = 0.2$, i.e., the initial model $w_1 = w_S$ has 20% expected 0-1 loss in the new domain $T$. Also, the best minimal error in the new domain is $\Phi\big(\frac{\|\mu_T\|}{\sigma_T}\big) \approx 0.1$.

In the simulation result depicted in Figure 1, a sample $x = \mu$ arrives when the test time $t$ is odd, and a sample $x = -\mu$ arrives when $t$ is even. Note that the algorithms do not know the labels.

In the simulation result depicted in Figure 4, we consider the setting of noisy data, i.e., $x_t \in \mathbb{R}^d$ is sampled as $x_t \sim N(\mu_T, \sigma_T^2 I_d)$ instead of $x_t = y\mu_T$. We search the step size $\eta$ over the grid $\{10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, 10^{-1}, 5 \times 10^{-1}, 10^{0}, 5 \times 10^{0}, 10^{1}, 5 \times 10^{1}, 10^{2}\}$ for each of GD + hard + exp, GD + conj + exp, GD + hard + logistic, and GD + conj + logistic, and report the best result for each one.
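The numbers in this setup can be sanity-checked with a short script. The sketch below makes several assumptions that are not stated in the text: $\Phi$ is taken to be the standard Gaussian upper-tail probability (the reading under which the 20% figure is reproduced), the source model is taken to be $w_S = e_1$, the dimension `d = 10` and the random seed are arbitrary, and the non-leading coordinates of $\mu_T$ are rescaled so that the whole vector has unit norm. All names are hypothetical.

```python
import math

import numpy as np


def gaussian_tail(u):
    """P(Z >= u) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def make_target_mean(d, first=0.6567, seed=0):
    """Unit-norm vector whose first coordinate is `first` (assumed construction)."""
    rng = np.random.default_rng(seed)
    rest = rng.standard_normal(d - 1)
    rest *= math.sqrt(1.0 - first ** 2) / np.linalg.norm(rest)
    return np.concatenate(([first], rest))


d = 10                               # hypothetical dimension
mu_T = make_target_mean(d)
sigma_T = 0.6567 / 0.8416
w_S = np.eye(d)[0]                   # assumption: source model aligned with mu_S = e_1

# Expected 0-1 loss of the initial model and the best achievable error.
init_err = gaussian_tail(mu_T @ w_S / (sigma_T * np.linalg.norm(w_S)))
best_err = gaussian_tail(np.linalg.norm(mu_T) / sigma_T)

# Step-size grid listed for the Figure 4 experiments.
ETA_GRID = [1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1e0, 5e0, 1e1, 5e1, 1e2]
```

Under these assumptions `init_err` comes out close to 0.2 and `best_err` close to 0.1, matching the values quoted for the initial model and the best minimal error.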

C PROOF OF LEMMA 1

Lemma 1: The following self-training loss functions $\ell^{\mathrm{self}}(w; x) = \psi(w^\top x)$ satisfy the set of properties ♣. More precisely, we have

1. hard + exp: $\psi(u) = \exp(-|u|)$ satisfies ♣ with $(L = 1, a_{\min} = 0)$.
2. hard + logistic: $\psi(u) = \log(\cosh(u)) - |u|$ satisfies ♣ with $(L = 2, a_{\min} = 0)$.
3. conj + exp: $\psi(u) = \mathrm{sech}(u)$ satisfies ♣ with $(L = 1, a_{\min} = 0.75)$.
4. conj + logistic: $\psi(u) = \log(\cosh(u)) - \tanh(u)\, u$ satisfies ♣ with $(L = 2, a_{\min} = 0.5)$.

Proof.

• For hard + exp, we have $\psi(u) = \exp(-|u|)$, $\psi'(u) = -\mathrm{sign}(u)\exp(-|u|)$, and $\psi''(u) = \exp(-|u|) + \delta_0(u)$. It is evident that $\psi(u) = \exp(-|u|)$ is an even function and that it is differentiable everywhere except at the origin. We also have $|-\psi'(u)| \le 1$ and $-\psi'(u) \ge \exp(-u)$ for all $u \ge 0$. We conclude that $\psi(u) = \exp(-|u|)$ satisfies ♣ with parameters $(L = 1, a_{\min} = 0)$.

• For hard + logistic, we have $\psi(u) = \log(\cosh(u)) - |u|$ and $\psi'(u) = \tanh(u) - \mathrm{sign}(u)$. It is evident that $\psi(u) = \log(\cosh(u)) - |u|$ is an even function and that it is differentiable everywhere except at the origin. We also have

$$|-\psi'(u)| = |\mathrm{sign}(u) - \tanh(u)| = \frac{2\exp(-|u|)}{\exp(u) + \exp(-u)} \le 1.$$

Hence, for $u > 0$,

$$-\psi'(u) = \frac{2\exp(-u)}{\exp(u) + \exp(-u)} \ge \exp(-2u) \iff \exp(2u) \ge 1,$$

and the latter is evident for $u \ge 0$. We conclude that $\psi(u) = \log(\cosh(u)) - |u|$ satisfies ♣ with parameters $(L = 2, a_{\min} = 0)$.

• For conj + exp, we have $\psi(u) = \mathrm{sech}(u)$, $\psi'(u) = -\tanh(u)\,\mathrm{sech}(u)$, and $\psi''(u) = -\mathrm{sech}^3(u) + \tanh^2(u)\,\mathrm{sech}(u)$. It is evident that $\psi(u) = \mathrm{sech}(u)$ is an even function and that it is differentiable everywhere. We also have $|-\psi'(u)| = |\tanh(u)\,\mathrm{sech}(u)| \le 1$. Moreover,

$$-\psi'(u) = \frac{2\,(\exp(u) - \exp(-u))}{(\exp(u) + \exp(-u))^2} \ge \exp(-u) \iff \exp(2u) \ge 4 + \exp(-2u),$$

which holds when $u \ge 0.75$. That is, $-\psi'(u) \ge \exp(-u)$ for all $u \ge 0.75$. We conclude that $\psi(u) = \mathrm{sech}(u)$ satisfies ♣ with parameters $(L = 1, a_{\min} = 0.75)$.

• For conj + logistic, we have $\psi(u) = \log(\cosh(u)) - \tanh(u)\, u$, $\psi'(u) = -\mathrm{sech}^2(u)\, u$, and $\psi''(u) = -\mathrm{sech}^2(u) + 2u \tanh(u)\,\mathrm{sech}^2(u)$. It is evident that $\psi(u) = \log(\cosh(u)) - \tanh(u)\, u$ is an even function and that it is differentiable everywhere. We also have $|-\psi'(u)| = \frac{4|u|}{(\exp(u) + \exp(-u))^2} \le 1$. Note that $-\psi'(u) = \mathrm{sech}^2(u)\, u = \frac{4u}{(\exp(u) + \exp(-u))^2}$. Moreover,

$$\frac{4u}{(\exp(u) + \exp(-u))^2} \ge \exp(-2u) \iff 4u \ge 1 + 2\exp(-2u) + \exp(-4u), \tag{39}$$

which holds when $u \ge 0.5$. That is, $-\psi'(u) \ge \exp(-2u)$ for all $u \ge 0.5$. We conclude that $\psi(u) = \log(\cosh(u)) - \tanh(u)\, u$ satisfies ♣ with parameters $(L = 2, a_{\min} = 0.5)$.
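The lower bounds used in this proof can also be checked on a grid. The following sketch is a numerical sanity check, not a substitute for the argument above: it assumes the relevant property in ♣ reads $-\psi'(u) \ge \exp(-Lu)$ for all $u \ge a_{\min}$, as read off from the proof, and the names `CASES` and `lower_bound_holds` as well as the grid range are hypothetical choices.

```python
import numpy as np

# Map each case to (-psi'(u) for u > 0, L, a_min), following Lemma 1.
CASES = {
    "hard+exp": (lambda u: np.exp(-u), 1, 0.0),
    # 1 - tanh(u), written as 2 / (exp(2u) + 1) to avoid cancellation for large u.
    "hard+logistic": (lambda u: 2.0 / (np.exp(2.0 * u) + 1.0), 2, 0.0),
    "conj+exp": (lambda u: np.tanh(u) / np.cosh(u), 1, 0.75),
    "conj+logistic": (lambda u: u / np.cosh(u) ** 2, 2, 0.5),
}


def lower_bound_holds(name, u_max=10.0, n=20001):
    """Check -psi'(u) >= exp(-L * u) on a grid over [a_min, u_max]."""
    neg_dpsi, L, a_min = CASES[name]
    u = np.linspace(max(a_min, 1e-9), u_max, n)
    # A tiny relative slack absorbs floating-point rounding in the equality case.
    return bool(np.all(neg_dpsi(u) >= np.exp(-L * u) * (1.0 - 1e-12)))


if __name__ == "__main__":
    for name in CASES:
        print(name, lower_bound_holds(name))
```

Evaluating the two conjugate-label cases slightly below their thresholds (for example at $u = 0.5$ for conj + exp) shows the inequality failing there, which is why a strictly positive $a_{\min}$ appears for those cases.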

