SOME PRACTICAL CONCERNS AND SOLUTIONS FOR USING PRETRAINED REPRESENTATION IN INDUSTRIAL SYSTEMS

Abstract

Deep learning has dramatically changed the way data scientists and engineers craft features: the once tedious process of measuring and constructing can now be achieved by training learnable representations. Recent work shows that pretraining can endow representations with relevant signals, and in practice they are often used as feature vectors in downstream models. In real-world production, however, we have encountered key problems that cannot be justified by existing knowledge. They raise the concern that naively using pretrained representations as feature vectors can lead to unwarranted and suboptimal solutions. Our investigation reveals critical insights into the gap of uniform convergence for analyzing pretrained representations, their stochastic nature under gradient descent optimization, what model convergence means for them, and how they might interact with downstream tasks. Inspired by our analysis, we explore a simple yet powerful approach that can refine pretrained representations in multiple ways, which we call Featurizing Pretrained Representations. Our work balances practicality and rigor, and contributes to both applied and theoretical research on representation learning.

1. INTRODUCTION

The ability of neural networks to learn predictive feature representations from data has always fascinated practitioners and researchers (Bengio et al., 2013). The learnt representations, if proved reliable, can potentially renovate the entire life cycle and workflow of industrial machine learning. Behind reliability are the three core principles for extracting information from data, namely stability, predictability, and computability (Yu, 2020). These three principles not only justify the practical value of learnt representations, but also lead to the efficiency, interpretability, and reproducibility that are cherished in real-world production. Since pretrained representations are optimized to align with the given task, intuitively, they should satisfy all three principles in a reasonable setting. However, when productionizing an automated pipeline for pretrained representations in an industrial system, we encountered key problems that cannot be justified by existing knowledge. In particular, while the daily refresh follows the same modelling and training configurations and uses essentially the same data¹, downstream model owners reported unexpectedly high fluctuations in performance when retraining their models. For illustration purposes, here we reproduce the issue using benchmark data, and take one further step where the pretraining is repeated on exactly the same data, under the same model configuration, training setup, and stopping criteria. We implement ten independent runs to essentially generate i.i.d. versions of the pretrained representation. We first visualize the dimension-wise empirical variances of the pretrained representations, provided in Figure 1a. It is surprising to find that while the pretraining losses converge to almost the same value in each run (Figure 1b), there is such a high degree of uncertainty about the exact values of each dimension.
Further, in Figure 1c, we observe that the uncertainty (empirical variance) of the pretrained representation increases as the pretraining progresses. In the downstream task where pretrained representations are used as feature vectors (see the right figure), we observe that the performance does fluctuate wildly from run to run. Since we use logistic regression as the downstream model, which we can effectively optimize to the global optimum, the fluctuation can only be caused by the instability of the pretrained representations. To demonstrate that the above phenomenon is not caused by a specific model or dataset, we also experiment with a completely different pretraining model and benchmark data from another domain. We perform the same analysis, and unfortunately the same issues persist (Figure A.1 in the Appendix). Existing deep learning theory, both the convergence and the generalization results (we discuss them further in Section 2), fails to explain why we should expect pretrained representations to work well in a downstream task when their exact values are so unstable. This is especially concerning for industrial systems, as the issue can lead to unwarranted and suboptimal downstream solutions. We experienced this issue firsthand in production, so we are motivated to crack the mysteries behind pretrained representations, and to understand if and how their stability can be improved without sacrificing predictability and computability. We summarize our contributions below.
• We provide a novel uniform convergence result for pretrained representations, which points out gaps that relate to the stability and predictability issues.
• We break down and clarify the stability issue by revealing the stochastic nature of pretrained representations, the convergence of model output, and the stable and unstable components involved.
• We investigate the interaction between pretrained representations and downstream tasks in both parametric and non-parametric settings, each revealing how predictability can benefit or suffer from stability (or instability) for particular usages of pretrained representations.
• We discuss the idea of featurizing pretrained representations, and propose a highly practical solution that has nice guarantees and balances stability, predictability, and computability. We also examine its effectiveness in real-world experiments and online testing.
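The retraining experiment described above can be sketched in a few lines of numpy. The following is a hedged, self-contained toy (a small one-layer MLP on synthetic data, with hypothetical sizes; it is not the paper's actual pipeline or benchmark), but it reproduces the qualitative effect: across runs that differ only in the random seed, the final losses are close while the learned representations carry high dimension-wise variance.

```python
import numpy as np

def pretrain(X, y, d=32, steps=2000, lr=0.1, seed=0):
    """Toy pretraining: g(x) = theta . tanh(Wx), single-sample SGD on squared loss."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, X.shape[1]))
    theta = rng.normal(0.0, 1.0 / np.sqrt(d), size=d)
    for _ in range(steps):
        i = rng.integers(len(X))              # data order also depends on the seed
        h = np.tanh(W @ X[i])
        err = theta @ h - y[i]                # derivative of 0.5 * err**2
        g_theta = err * h
        g_W = err * np.outer(theta * (1.0 - h**2), X[i])
        theta -= lr * g_theta
        W -= lr * g_W
    return W, theta

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Ten independent runs on exactly the same data: only the seed differs.
reps, losses = [], []
for s in range(10):
    W, theta = pretrain(X, y, seed=s)
    H = np.tanh(X @ W.T)                      # pretrained representations, shape (n, d)
    reps.append(H)
    losses.append(np.mean((H @ theta - y) ** 2))

dim_var = np.stack(reps).var(axis=0).mean()   # dimension-wise variance across runs
print(f"final-loss spread across runs: {np.ptp(losses):.4f}")
print(f"mean dimension-wise variance of h(x): {dim_var:.4f}")
```

The dimension-wise variance here plays the role of Figure 1a, computed on the toy model rather than the paper's benchmark.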

2. RELATED WORK

It is not until recent years that deep learning theory has seen major progress. Zhang et al. (2016) observed that the parameters of neural networks stay close to their initialization during training. At initialization, wide neural networks with random weights and biases are Gaussian processes, a phenomenon first discussed by Neal (1995) and recently refined by Lee et al. (2017); Yang (2019). However, these works do not consider the effect of optimization. The Neural Tangent Kernel provides a powerful tool to study the limiting convergence and generalization behavior of gradient descent optimization (Jacot et al., 2018; Allen-Zhu et al., 2019), but it sometimes fails to capture meaningful characteristics of practical neural networks (Woodworth et al., 2020; Fort et al., 2020). Moreover, those works require the parameters to remain close to initialization, in which case useful representation learning would not take place. Indeed, it has caught people's attention that representation learning can go beyond the neural tangent kernel regime (Yehudai & Shamir, 2019; Wei et al., 2019; Allen-Zhu & Li, 2019; Malach et al., 2021), among which one line of work connects the continuous-time training dynamics with mean-field approximation (Mei et al., 2018; Sirignano & Spiliopoulos, 2020), and another direction studies the lazy training regime (Chizat et al., 2019; Ghorbani et al., 2019) where only the last layer of a neural network is trained. Unfortunately, their assumed training schemas all deviate from practical representation learning. Still, part of our analysis in Section 4.2 can be viewed as a practical discrete-time extension of the mean-field method. Perhaps the most practical setting for studying pretrained representations is Arora et al. (2019), which analyzes contrastive representation learning under a particular data-generating mechanism. However, their results do not generalize to broader settings, and they cannot justify the stability issue of pretrained representations.

3. PRELIMINARIES

Notations. We use $x \in \mathcal{X} \subseteq \mathbb{R}^{d_0}$ and $y \in \mathbb{R}$ to denote the raw feature and outcome, uppercase letters to denote random variables and measures, and bold-font letters to denote matrices. Let $h: \mathcal{X} \to \mathbb{R}^d$ be the representation hypothesis, and $f: \mathbb{R}^d \to \mathbb{R}$ be the prediction hypothesis. The hypothesis classes are given by $\mathcal{H}$ and $\mathcal{F}$, respectively. Denote by $\circ$ the operator for function composition, and by $\ell: \mathbb{R}\times\mathbb{R} \to [0,1]$ the loss function. We assume $\ell$ is 1-Lipschitz without loss of generality. Then the risk for a pair $(h \in \mathcal{H}, f \in \mathcal{F})$ is given by: $R(h, f) := \mathbb{E}_{(X,Y)\sim P}\,\ell\big(f \circ h(X), Y\big)$, where $P$ is a measure on $(\mathcal{X}, \mathbb{R})$. We also use $P^n$ to denote the corresponding product measure for $(X_1, Y_1), \ldots, (X_n, Y_n)$. The one-layer multi-layer perceptron (MLP) is perhaps the most fundamental representation learning model, given by: $f \circ h(x) = \mathbf{\Theta}\,\sigma(\mathbf{W}x)$. Here, $\sigma$ is the activation function, and $\mathbf{W} \in \mathbb{R}^{d \times d_0}$, $\mathbf{\Theta} \in \mathbb{R}^{k \times d}$. We mention that adding the bias terms will not affect our analysis, so we drop them here for brevity. In practice, $\mathbf{\Theta}$ and $\mathbf{W}$ are often initialized as scaled i.i.d. Gaussian random variables that follow $\mathcal{N}(0, 1/d)$. We use notation such as $[\mathbf{W}]_i$ to denote the $i$-th row of a matrix. The popular contrastive representation learning can also be considered a special case of this configuration². Define the shorthand $g(x) := \mathbf{\Theta}\,\sigma(\mathbf{W}x)$. A typical pretraining process involves optimizing the risk function defined for pretraining and extracting the hidden representation. The optimization is done via stochastic gradient descent (SGD), e.g. $\mathbf{W}^{(t+1)} = \mathbf{W}^{(t)} - \alpha \nabla_{\mathbf{W}}\, \ell\big(g(x^{(t)}), y^{(t)}\big)$, where $\alpha$ is the learning rate. For convenience, we consider each mini-batch to contain one random sample, denoted by $(x^{(t)}, y^{(t)})$ for the $t$-th step. Given a representation hypothesis $h$, we define: $f_{h,n} := \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell\big(f(h(x_i)), y_i\big)$.
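For a concrete instance of $f_{h,n}$: when $\mathcal{F}$ is linear and $\ell$ is the squared loss (an assumption made here only for illustration; the setting above allows general $\mathcal{F}$ and any 1-Lipschitz $\ell$), the downstream fit on a frozen representation has a closed form, sketched below in numpy with hypothetical sizes.

```python
import numpy as np

def downstream_fit(H, y):
    """f_{h,n}: empirical risk minimizer over linear f, squared loss, frozen H = h(X)."""
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

def empirical_risk(H, y, theta):
    """Empirical downstream risk of f_{h,n} composed with h."""
    return np.mean((H @ theta - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W = rng.normal(size=(16, 5)) / np.sqrt(16)    # a fixed (frozen) representation map
H = np.tanh(X @ W.T)                          # h(x) = sigma(Wx)
y = H @ rng.normal(size=16) + 0.1 * rng.normal(size=100)

theta = downstream_fit(H, y)
print(f"empirical downstream risk = {empirical_risk(H, y, theta):.4f}")
```

Everything downstream in the paper treats $h$ as given exactly like this: the representation is computed once and only the prediction head is refit.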
In the sequel, how well $f_{h,n} \circ h$ generalizes to a new i.i.d. sample of the downstream task is measured by: $R(h) := \mathbb{E}_{(X,Y)\sim P}\,\mathbb{E}_{P^n}\,\ell\big(f_{h,n} \circ h(X), Y\big)$, where the second expectation $\mathbb{E}_{P^n}$ is taken with respect to the downstream data $\{X_i, Y_i\}_{i=1}^n$ underlying $f_{h,n}$. Its empirical version is given by $R_n(h) := \frac{1}{n}\sum_i \ell\big(f_{h,n} \circ h(X_i), Y_i\big)$.

4.1. THE GAP OF UNIFORM CONVERGENCE FOR PRETRAINED REPRESENTATION

Suppose $h$ and $f$ are optimized jointly (end-to-end) via empirical risk minimization (ERM), which amounts to solving: $\arg\min_{h\in\mathcal{H}, f\in\mathcal{F}} \frac{1}{n}\sum_i \ell\big(f\circ h(x_i), y_i\big)$. In this setting, the generalization behavior of the solution is well-studied. In particular, using the notion of Gaussian (or Rademacher) complexity³, the generalization error can be bounded by $O\big(G_n(\mathcal{F}\circ\mathcal{H})/n + \sqrt{\log(1/\delta)/n}\big)$ with probability at least $1-\delta$ (Bartlett & Mendelson, 2002). This result, known as uniform convergence, is especially appealing because it both includes problem-specific aspects and applies to all functions in the composite hypothesis class $\mathcal{F}\circ\mathcal{H} := \{f\circ h : f\in\mathcal{F}, h\in\mathcal{H}\}$. Is it possible to achieve a comparable result for pretrained representations? Perhaps the most ideal setting for uniform convergence to hold under pretrained representations is:
C1: the pretraining and downstream training use the same data $\{(X_i, Y_i)\}_{i=1}^n$, i.e. $(\hat h, \hat f) := \arg\min_{h\in\mathcal{H}, f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell\big(f\circ h(X_i), Y_i\big)$ and $f_{\hat h, n} = \arg\min_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell\big(f(\hat h(X_i)), Y_i\big)$;
C2: they rely on the same prediction function class $\mathcal{F}$.
(³We use the Gaussian complexity $G(\cdot)$ here for some of its technical conveniences, and let $G_n$ be the empirical Gaussian complexity. See Appendix A for details.)
These two conditions essentially eliminate the confounding effects of model and data mismatch. Thus, if uniform convergence cannot hold in this setting, it is unlikely to serve more general use cases. We first summarize the common intuition behind why pretrained representations might work:
• the pretraining objective, when well-designed, reasonably predicts the empirical downstream risk of $f_{h,n}$ (intuition 1);
• $f_{h,n}$'s empirical downstream risk generalizes to the true downstream risk (intuition 2).
These two intuitions have also been exemplified for contrastive representation learning in Arora et al. (2019) and its follow-up work.
Our main contribution here is to make the above intuitions rigorous, and to reveal whether they are indeed sufficient for uniform convergence in general settings. Recall that, given complete information on a downstream task, the best we can do is: $\min_{h\in\mathcal{H}, f\in\mathcal{F}} R(h,f)$. We denote the representation hypothesis achieving this minimum by $h^*$. Let $\hat h$ be given as in C1. Then the generalization error is simply: $R(\hat h) - \min_{h\in\mathcal{H},f\in\mathcal{F}} R(h,f)$. Following the standard derivation, which decomposes the generalization error and takes the supremum to upper bound each term, we run into terms that exactly characterize the above two intuitions. As we show in Appendix B, it holds that:
$R(\hat h) - \min_{h\in\mathcal{H},f\in\mathcal{F}} R(h,f) \le \sup_h \big(\mathbb{E}_{P^n} R_n(h) - R_n(h)\big) + \sup_h \mathbb{E}_{P^n}\big(\mathbb{E}_{(X,Y)\sim P}\,\ell(f_{h,n}\circ h(X), Y) - R_n(h)\big) + \text{remainder},$
where the first term $\sup_h \big(\mathbb{E}_{P^n} R_n(h) - R_n(h)\big)$ exactly seeks to match intuition 1, and the second term can be further upper bounded using:
$\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f_{h,n}\circ h(X), Y\big) - R_n(h) \le \sup_f \Big(\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f\circ h(X), Y\big) - \tfrac{1}{n}\textstyle\sum_i \ell\big(f\circ h(X_i), Y_i\big)\Big),$
which underlies intuition 2. The remainder terms can be bounded using standard concentration results. However, we also spot a critical issue with the first term, which we first expand for clarity:
$\sup_h \Big(\mathbb{E}_{P^n}\big[\tfrac{1}{n}\textstyle\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)\big] - \tfrac{1}{n}\textstyle\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)\Big).$
Notice that this is not the typical empirical process encountered in a standard generalization setting: following the same procedure as Bartlett & Mendelson (2002), its upper bound is actually given by $O\big(G_n(\mathcal{H})/\sqrt{n} + \sqrt{\log 1/\delta}\big)$. Compared with the standard generalization bound, here the slack term $\sqrt{\log 1/\delta}$ does not vanish as we increase $n$. Therefore, there exist gaps between the common intuitions and achieving uniform convergence. Before we discuss the cause of the gaps and their implications, we first present the complete result below. Proposition 1. Let $G'_n(\cdot)$ be a slightly modified Gaussian complexity term.
Under the conditions and definitions in C1 and C2, it holds with probability at least $1-\delta$ that: $R(\hat h) - \min_{h\in\mathcal{H}, f\in\mathcal{F}} R(h,f) \lesssim \frac{G_n(\mathcal{H})}{\sqrt{n}} + \frac{G'_n(\mathcal{F})\,\sup_h \mathbb{E}\|h(X)\|_2^2}{\sqrt{n}} + \sqrt{\log 1/\delta}$. The proof is deferred to Appendix B. Proposition 1 can be viewed as a "no free lunch" result for using pretrained representations: even in the most ideal setting we study here, uniform convergence cannot be expected for all representation hypotheses. The gap is that the pretraining objective cannot be predictive of $f_{h,n}$'s empirical downstream risk for every $h \in \mathcal{H}$. Imagine an entirely random $h$: both its pretraining objective and the empirical downstream risk of $f_{h,n}$ may have high variances that do not scale with $n$, so the prediction will not concentrate whatsoever. Takeaway. The implications of this gap are twofold. Firstly, it does not suffice to study only $\mathcal{H}$ and the data distribution: the statistical and algorithmic convergence properties of $\hat h(X)$ could be more relevant, as they suggest its stability. Secondly, we cannot take the performance of $f_{\hat h,n}$ for granted, at least not without understanding how $\hat h(X)$ interacts with the downstream learning and generates $f_{\hat h,n}$, which ultimately relates to its predictability. Unfortunately, we find a lack of discussion on these two issues in the existing literature, so we will investigate them in the next two sections.

4.2. STOCHASTIC NATURE OF PRETRAINED REPRESENTATION, AND THE CONVERGENCE OF PRETRAINING MODEL

In this section, we reveal two important statistical and algorithmic properties of pretrained representations. We show that while they persist as random vectors during SGD optimization (as shown in Figure 1), the output of the pretraining model can be deterministic and converge to some optimal solution. Two contributing factors are the scaled i.i.d. initialization and the inductive bias of gradient descent. Our findings provide critical insight into the stability of pretrained representations. We motivate our statistical analysis by deriving the optimization path of the one-layer MLP introduced in Section 3. For notational convenience, we introduce $\bar{\mathbf{\Theta}}$ and $\bar{\mathbf{W}}$ as the rescaled versions of $\mathbf{\Theta}$ and $\mathbf{W}$ such that $\bar{\mathbf{\Theta}}^{(0)}, \bar{\mathbf{W}}^{(0)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$. We let $\ell'\big(g(x), y\big)$ be the derivative of the loss function, and similarly for other functions. In contrast to the existing theoretical work that studies the optimization path under gradient flow or an infinitesimal learning rate, we fix the learning rate at $\alpha = 1$ to reflect real-world practice. The output dimension is also set to $k = 1$ without loss of generality. In the first forward pass, since $\sigma(\bar{\mathbf{W}}^{(0)} x^{(0)})$ has i.i.d. coordinates, as $d \to \infty$ it holds that:
$g^{(0)}(x^{(0)}) := \frac{1}{d}\sum_{i=1}^d \bar{\Theta}^{(0)}_i\, \sigma\big(\bar{\mathbf{W}}^{(0)} x^{(0)}\big)_i \overset{a.s.}{\longrightarrow} \mathbb{E}\big[\bar{\Theta}^{(0)} \sigma(\bar{W}^{(0)} x^{(0)})\big]$ (denoted by $g^{(0)}_*(x^{(0)})$),
where we use $\bar{\Theta}^{(t)}, \bar{W}^{(t)}$ to denote an i.i.d. element (or row) of $\bar{\mathbf{\Theta}}^{(t)}$ and $\bar{\mathbf{W}}^{(t)}$. As a result, $\ell'\big(g^{(0)}(x^{(0)}), y^{(0)}\big)$ also converges to the deterministic value $L^{(0)} := \ell'\big(g^{(0)}_*(x^{(0)}), y^{(0)}\big)$. Then in the first backward pass, the updated parameters follow:
$\bar{\mathbf{\Theta}}^{(1)} = \bar{\mathbf{\Theta}}^{(0)} - L^{(0)}\, \sigma\big(\bar{\mathbf{W}}^{(0)} x^{(0)}\big), \qquad \bar{\mathbf{W}}^{(1)} = \bar{\mathbf{W}}^{(0)} - L^{(0)}\, x^{(0)}\, \bar{\mathbf{\Theta}}^{(0)}\, \sigma'\big(\bar{\mathbf{W}}^{(0)} x^{(0)}\big).$
An important observation is that the updated parameters remain element-wise i.i.d. Consequently, the model output of the second forward pass will also converge to a deterministic value:
$g^{(1)}(x^{(1)}) \overset{a.s.}{\longrightarrow} \mathbb{E}\Big[\big(\bar{\Theta}^{(0)} - L^{(0)}\sigma(\bar{W}^{(0)} x^{(0)})\big)\, \sigma\big(\bar{W}^{(0)} x^{(1)} - L^{(0)}\, \bar{\Theta}^{(0)}\, \sigma'(\bar{W}^{(0)} x^{(0)})\, x^{(0)\top} x^{(1)}\big)\Big].$
As we show in the following proposition, the (statistical) convergence result holds for any $t$, and there exists a general iterative update rule for $g^{(t)}(x)$. For some intuition, suppose $\sigma(\cdot)$ is the identity function; then $\bar{\Theta}^{(t)}, \bar{W}^{(t)}$ are simply linear combinations of $\bar{\Theta}^{(0)}, \bar{W}^{(0)}$.
Proposition 2. For the one-layer MLP we consider, with the learning rate $\alpha = 1$, for any step $t > 1$, as $d \to \infty$, the model output $g^{(t)}(x)$ converges almost surely to $g^{(t)}_*(x)$ defined as follows: $g^{(t)}_*(x) = \big(C^{(t)}_1 C^{(t)}_2 + C^{(t)}_3 C^{(t)}_4\big)\, x$, with $\big(C^{(t+1)}_1, C^{(t+1)}_2, C^{(t+1)}_3, C^{(t+1)}_4\big) = \big(C^{(t)}_1, C^{(t)}_2, C^{(t)}_3, C^{(t)}_4\big) + L^{(t)} x^{(t)} \big(C^{(t)}_3, C^{(t)}_4, C^{(t)}_1, C^{(t)}_2\big)$.
As a corollary, while the hidden representations remain random vectors throughout the SGD process, as can be seen from the update rule:
$h^{(t)}(x) := \sigma\big(\mathbf{W}^{(t)} x\big) = \sigma\Big(\big(\bar{\mathbf{W}}^{(t-1)} - L^{(t-1)}\, x^{(t-1)}\, \bar{\mathbf{\Theta}}^{(t-1)}\, \sigma'(\bar{\mathbf{W}}^{(t-1)} x^{(t-1)})\big)\, x\Big),$
$\langle h^{(t)}(x), h^{(t)}(x')\rangle$ will nevertheless converge to some deterministic value as $d \to \infty$. The proof and details are deferred to Appendix C. In Figure 1d, we see that the statistical convergence of the model output is indeed evident even with moderately small $d$, and its variance is smaller by magnitudes than the variance of the hidden representation $\sigma(\mathbf{W}^{(t)} x)$ (see the x-axis of Figures 1c and 1d). On the other hand, the algorithmic convergence of the model prediction has received considerable attention. It has been shown that over-parameterized models converge to minimum-norm interpolants due to the inductive bias of gradient descent (Bartlett et al., 2021; Soudry et al., 2018). For the sake of space, here we focus on their implications and leave the details to Appendix C. Roughly speaking, among the many locally optimal solutions that interpolate the training data, gradient descent converges to the one with the smallest norm, which usually has nice properties such as smoothness. We let $g_0$ be that particular solution, such that $\lim_{t\to\infty} g^{(t)}(x) = g_0(x)$.
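The one-step statistical convergence above can be probed numerically. The following sketch (a hedged toy with squared loss and tanh activation, both assumptions for illustration) implements exactly one SGD step of the rescaled one-layer MLP and shows the run-to-run spread of $g^{(1)}(x)$ shrinking as $d$ grows, while the hidden coordinates themselves stay random.

```python
import numpy as np

def one_step_output(x0, y0, x1, d, seed):
    """g^(1)(x1) after one SGD step (alpha = 1) of the rescaled one-layer MLP."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, len(x0)))        # W-bar^(0), entries i.i.d. N(0, 1)
    theta = rng.normal(size=d)               # Theta-bar^(0)
    h0 = np.tanh(W @ x0)
    g0 = np.mean(theta * h0)                 # g^(0)(x0) = (1/d) sum_i theta_i sigma(Wx0)_i
    L0 = g0 - y0                             # squared-loss derivative at g^(0)
    theta1 = theta - L0 * h0
    W1 = W - L0 * np.outer(theta * (1.0 - h0**2), x0)
    return np.mean(theta1 * np.tanh(W1 @ x1))

rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=4), rng.normal(size=4)
stds = []
for d in (100, 1000, 10000):
    outs = [one_step_output(x0, 1.0, x1, d, seed=s) for s in range(20)]
    stds.append(np.std(outs))
    print(f"d={d:6d}  std of g^(1)(x) across independent runs: {stds[-1]:.4f}")
```

The shrinking standard deviation mirrors the almost-sure convergence claimed in Proposition 2; the rate is roughly $1/\sqrt{d}$, consistent with averaging $d$ i.i.d. terms.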
Since $\langle h^{(t)}(x), h^{(t)}(x')\rangle$ converges statistically to a deterministic value at every optimization step, we can immediately conclude that:
• if $g^{(t)}$ takes the form of $\langle h^{(t)}(x), h^{(t)}(x')\rangle$, such as in contrastive representation learning, the inner product between hidden representations also converges algorithmically to $g_0$'s prediction;
• if $g^{(t)} = \theta^\top h^{(t)}$, i.e. the last hidden layer is used as the representation, note that a necessary but not sufficient condition for $\|g^{(t)}(x) - g^{(t)}(x')\|$ to be small is that $\|h^{(t)}(x) - h^{(t)}(x')\|$ is small as well. Suppose the $h^{(t)}$ are normalized; then upon algorithmic convergence, $\langle h^{(t)}(x), h^{(t)}(x')\rangle$ is likely to be larger if $x, x'$ are close to each other under $g_0$'s prediction.
Takeaway. The stochastic nature of $\hat h := \lim_{t\to\infty} h^{(t)}$ and the (approximate) convergence of $\langle \hat h(x), \hat h(x')\rangle$ under gradient descent reveal two important properties of pretrained representations:
1. Instability of $\hat h(x)$: the exact position of $\hat h(x)$ in $\mathbb{R}^d$ is stochastic, depending on the initialization and the order in which the pretraining data is fed to SGD;
2. Stability of $\langle \hat h(x), \hat h(x')\rangle$: the pairwise inner product converges (approximately) to a value that is consistent with the minimum-norm interpolant of the pretraining task.
These results will also play a crucial role in understanding how $\hat h$ can interact with downstream learning, which we will study in the next section.
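The two takeaways can also be checked side by side. In the hedged sketch below (a toy one-layer MLP with hypothetical sizes, not the paper's experiment), we compare the run-to-run variability of the raw coordinates of $\hat h(x)$ against that of the Gram matrix of inner products, measured as the fraction of total energy attributable to run-to-run noise (close to 1 means pure noise; close to 0 means stable).

```python
import numpy as np

def train_reps(X, y, d=256, steps=1000, lr=0.05, seed=0):
    """Pretrain h(x) = tanh(Wx) through a scalar head; return row-normalized reps."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, X.shape[1]))
    theta = rng.normal(0.0, 1.0 / np.sqrt(d), size=d)
    for _ in range(steps):
        i = rng.integers(len(X))
        h = np.tanh(W @ X[i])
        err = theta @ h - y[i]
        g_theta = err * h
        g_W = err * np.outer(theta * (1.0 - h**2), X[i])
        theta -= lr * g_theta
        W -= lr * g_W
    H = np.tanh(X @ W.T)
    return H / np.linalg.norm(H, axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = np.tanh(X[:, 0] - X[:, 1])

R = np.stack([train_reps(X, y, seed=s) for s in range(5)])   # (runs, n, d)
G = np.stack([H @ H.T for H in R])                           # Gram matrix per run

coord_noise = R.var(axis=0).sum() / (R**2).mean(axis=0).sum()
gram_noise = G.var(axis=0).sum() / (G**2).mean(axis=0).sum()
print(f"coordinate noise fraction: {coord_noise:.3f}")   # instability of h-hat(x)
print(f"Gram noise fraction:       {gram_noise:.3f}")    # stability of <h, h'>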

4.3. INTERACTION WITH DOWNSTREAM TASK

To be comprehensive, we consider both parametric and non-parametric setups for the downstream task. Interestingly, they reveal different aspects of the predictability of $\hat h$. Parametric setup. To eliminate the interference of label noise, we consider the noiseless setting where the output of the downstream task is generated by: $y_i = f^*\big(\mathbb{E}[h(x_i)]\big)$, $i = 1, \ldots, n$. Because $h(x)$ might be high-dimensional, we assume there is some sparsity in $f^*$. The conditions below provide perhaps the easiest parametric setup for pretrained representations to perform well.
C3: Let $f^*(h) := \langle \theta^*, h\rangle$ with $\|\theta^*\|_0 \le q$, and let the inputs $h_i := \mathbb{E}h(x_i)$ be sampled from $\mathcal{N}(0, \sigma_h^2 \mathbf{I})$, where $\sigma_h$ is the strength of the signal. We showed previously that $\hat h$ is stochastic, so we simply set $\hat h_i := h_i + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2 \mathbf{I})$ captures the variance of the pretrained representation.
Intuitively, since the $\epsilon_i$ are i.i.d., it holds that $\mathbb{E}_\epsilon \langle \hat h(x_i), \hat h(x_j)\rangle = \langle h(x_i), h(x_j)\rangle$ for $i \neq j$, so recovering $\theta^*$ should be less challenging. However, we show that the variance will again prohibit efficient learning, and the best $f_{\hat h, n}$ can do is controlled by $\sigma_\epsilon/\sigma_h$, a notion of signal-to-noise ratio for pretrained representations. The result below takes the form of a minimax lower bound: an information-theoretic quantity that characterizes the inherent difficulty of a problem. Our proof (in Appendix D) is based on Le Cam's method, which was previously used to prove a lower-bound result under label noise (Raskutti et al., 2011), a setting very different from ours.
Proposition 3. Under C3, it holds with probability at least $1/2$ that: $\inf_{\hat\theta} \sup_{\|\theta^*\|_0 \le q} \|\hat\theta - \theta^*\|_2 \gtrsim \sqrt{\sigma_\epsilon^2/\sigma_h^2 \cdot q\, n^{-1} \log(d/q)}$, where $\inf_{\hat\theta}$ is taken with respect to any learning procedure based on $\{\hat h(x_i), y_i\}_{i=1}^n$. Takeaway.
The result in Proposition 3 is alarming because during pretraining, the variance of $h(x)$ might increase as more and more stochastic terms are added (suggested by both the derivations in Section 4.2 and the empirical result in Figure 1c). The above lower bound shows that the predictability of $\hat h(x)$ can be compromised by the variance it inherits from pretraining. This also explains the instability in downstream machine learning that we experienced in real-world production. Non-parametric setup. Among non-parametric regression estimators, the Nadaraya-Watson (NW) estimator has received considerable attention due to its simplicity and effectiveness (Nadaraya, 1964). It can be thought of as a smoothed nearest-neighbor estimator under a weighting schema:
$f_{h,n} \circ h(x) := \sum_{i=1}^n y_i\, w_h(x, x_i), \qquad w_h(x, x_i) := K\big((h(x) - h(x_i))/z\big),$
where $K: \mathbb{R}^d \to \mathbb{R}_+$ is a kernel, and $z$ is a normalizing constant. Here, we omit the bandwidth parameter for convenience. The Gaussian kernel $K(u) \propto \exp(-\|u\|_2^2)$ is a common choice, so when pretrained representations are normalized, it depends on $h$ only via $\langle h(x), h(x')\rangle$, a more stable quantity according to the previous section. We effectively denote this kernel by $K\big(\langle h(x), h(x')\rangle\big)$. It is well understood that the generalization of a kernel support vector machine is controlled by the kernel-target alignment (Cristianini et al., 2001), i.e. $\langle \vec y, \mathbf{K}\vec y\rangle$, where $\vec y = [y_1, \ldots, y_n]^\top$ and $\mathbf{K}_{i,j} = K\big(\langle h(x_i), h(x_j)\rangle\big)$. We prove that this is also the case for the NW estimator, with a simple result that does not resort to concentration arguments. The proof is in Appendix D.
Lemma 1. Under the 0-1 loss, with probability at least $1-\delta$, the risk of the NW estimator satisfies: $R(f_{h,n} \circ h) \le 1 - \sqrt{\delta}\cdot \mathbb{E}\big[\mathbb{1}[Y = Y']\, K\big(\langle h(X), h(X')\rangle\big)\big]$, where the expectation is taken with respect to $(X, Y) \sim P$, $(X', Y') \sim P$. Takeaway.
Lemma 1 shows that the predictability of $h(x)$, when expressed and measured through the more stable $\langle h(x), h(x')\rangle$, is strictly guaranteed. Therefore, using $h(x)$ in the downstream task in the form of $\vec h(x) := \big[e^{\langle h(x), h(x_1)\rangle}, \ldots, e^{\langle h(x), h(x_n)\rangle}\big]$ can be beneficial, and it can be interpreted as a representation of the weights in the NW estimator. Further, $\vec h(x)$ contains all the pairwise relationships, which can be more closely related to the pretraining objective. Note that $h(x)$ can also be viewed as a compression of $\vec h(x)$ because: $[\vec h(x_i)]_j = \exp\big(\langle h(x_i), h(x_j)\rangle\big)$. Nevertheless, $\vec h(x)$ and $h(x)$ cannot be compared directly because they have different intrinsic dimensionality. In terms of computability, $\vec h(x) \in \mathbb{R}^n$ is also no match for $h(x) \in \mathbb{R}^d$: computing $\vec h(x)$ itself can be non-trivial for large-scale applications. We aim to resolve these issues in the next section.
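A minimal sketch of the NW estimator on normalized representations follows (the representations and labels are synthetic and assumed only for illustration); note how, for unit-norm rows, the Gaussian kernel depends on $h$ only through the inner products $\langle h(x), h(x_i)\rangle$, exactly the stable quantity identified above.

```python
import numpy as np

def nw_predict(H_train, y_train, H_test, bandwidth=0.5):
    """Nadaraya-Watson estimator with a Gaussian kernel on unit-norm representations."""
    # For unit-norm rows, ||h - h_i||^2 = 2 - 2<h, h_i>, so the kernel depends on h
    # only through the pairwise inner products.
    G = H_test @ H_train.T                                # <h(x), h(x_i)>
    K = np.exp(-(2.0 - 2.0 * G) / (2.0 * bandwidth**2))   # Gaussian kernel weights
    W = K / K.sum(axis=1, keepdims=True)                  # normalized NW weights
    return W @ y_train

rng = np.random.default_rng(0)
H = rng.normal(size=(1200, 8))
H /= np.linalg.norm(H, axis=1, keepdims=True)             # normalized "pretrained" reps
y = np.sign(H[:, 0])                                      # a toy downstream label

pred = nw_predict(H[:1000], y[:1000], H[1000:])
acc = np.mean(np.sign(pred) == y[1000:])
print(f"NW test accuracy: {acc:.2f}")
```

The bandwidth here stands in for the bandwidth parameter that the text omits; in practice it would be tuned on held-out downstream data.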

5. FEATURIZING PRETRAINED REPRESENTATION

Our next goal is to build, on top of $h(x)$, features or representations that are comparable to $\vec h(x)$ in terms of stability and predictability, and have computability similar to $h(x)$. Suppose $\{h(x_i)\}_{i=1}^n$ are normalized. Then $\vec h(x_i)$ is simply the exponential of the pairwise cosine distances between $h(x_i)$ and all the pretrained representations. Notice that the angle between any pair $(h(x_i), h(x_j))$ can be decomposed into their respective angles with a baseline direction $u \in \mathbb{R}^d$, $\|u\|_2 = 1$. When the set of baseline directions is rich enough, we can recover all the pairwise cosine distances in $\vec h(x_i)$ from their angles with the baseline directions. Given $\mathbf{U} := [u_1, \ldots, u_m] \in \mathbb{R}^{d\times m}$, the set of angles between $h(x_i)$ and $\mathbf{U}$ forms a measurement of the relative location of $h(x) \in \mathbb{R}^d$. We refer to such a measurement process as featurizing the pretrained representation, as it is similar to how features are constructed by measuring experimental subjects. While featurizing $h(x)$ according to its geometric properties is an appealing solution, it is unknown how many baseline directions are needed to preserve the stability and predictability of $\vec h$, as well as what the optimal way to choose those directions is. Fortunately, Bochner's theorem (Loomis, 2013) from harmonic analysis lays a solid foundation for selecting the directions and providing approximation and learning guarantees. Also, the resulting measurements coincide with the random Fourier features (Rahimi & Recht, 2007; Liu et al., 2021) that play a critical role in many machine learning communities. For the Gaussian kernel we studied, Bochner's theorem states that there exists a measure $Q$ on $\mathbb{R}^d$ such that:
$K\big(h(x), h(x')\big) = \underbrace{\int_{\mathbb{R}^d} e^{iu^\top(h(x) - h(x'))}\, q(u)\, du}_{\text{real part}} = \mathbb{E}_{u\sim Q}\, \cos\big(u^\top(h(x) - h(x'))\big).$
Since $\cos(a - a') = \cos(a)\cos(a') + \sin(a)\sin(a')$, we can approximate the kernel value using the Monte Carlo method as below:
$K\big(h(x), h(x')\big) \approx \frac{1}{m}\sum_{i=1}^m \cos\big(u_i^\top h(x)\big)\cos\big(u_i^\top h(x')\big) + \sin\big(u_i^\top h(x)\big)\sin\big(u_i^\top h(x')\big), \qquad u_i \overset{\text{i.i.d.}}{\sim} Q.$
Let $\phi_m\big(h(x), Q\big) := \frac{1}{\sqrt{m}}\big[\cos(u_1^\top h(x)), \sin(u_1^\top h(x)), \ldots, \cos(u_m^\top h(x)), \sin(u_m^\top h(x))\big]$ be the featurization of $h(x)$ according to Bochner's theorem. Note that it amounts to measuring $h(x)$'s distances with respect to random directions drawn from $Q(u)$, and then transforming them through trigonometric functions. Furthermore, $\big\langle \phi_m\big(h(\cdot), Q\big), \phi_m\big(h(\cdot), Q\big)\big\rangle$ can approximate any entry in $\vec h$. To be more precise, Rahimi & Recht (2007) show that it only requires $m = \Omega\big(d/\epsilon^2 \log(\sigma_Q/\epsilon)\big)$ to achieve $\big|K\big(h(x), h(x')\big) - \big\langle \phi_m\big(h(x), Q\big), \phi_m\big(h(x'), Q\big)\big\rangle\big| \le \epsilon$, where $\sigma_Q^2$ is the second moment of $Q$. Therefore, when $m$ is comparable to $d$, the featurized $\phi_m\big(h(x), Q\big)$ achieves the stability and predictability of $\vec h$, as well as the computability of $h$. Converting $h(x)$ to $\phi_m\big(h(x), Q\big)$ is computationally efficient, since $u_1, \ldots, u_m$ only need to be drawn from $Q$ once and then applied to all $h(x_i)$, $i = 1, \ldots, n$. However, there is still the obstacle of finding the optimal $Q^*$. Strictly speaking, $Q^*$ is obtained from the inverse Fourier transform, but in practice the standard Gaussian distribution is often used. Indeed, computing the inverse Fourier transform and sampling from it poses another challenging task. To our knowledge, there is no existing study on whether we can safely sample $u$ from a proxy $Q$. In the following proposition, we show that using $Q$ instead of $Q^*$ will not cost stability as long as their discrepancy is bounded. In particular, we state our result in the context of Lemma 1, that is, the downstream risk is controlled by the alignment $A := \mathbb{E}\big[\mathbb{1}[Y = Y']\, K\big(\langle h(X), h(X')\rangle\big)\big]$. We use $D_s(Q, Q^*) := \int s\big(dQ/dQ^*\big)\, dQ^*$ to denote the f-divergence induced by $s(\cdot)$. Proposition 4.
Let $\mathcal{Q}(Q^*; \delta) := \{Q : D_s(Q, Q^*) \le \delta\}$ be a $D_s$-ball of radius $\delta$ centered at $Q^*$. Let $\{h(x_i), y_i\}_{i=1}^n$ be the downstream data, and $A_n(Q) := \frac{1}{n(n-1)}\sum_{i\ne j} \mathbb{1}[y_i = y_j]\,\big\langle \phi_m\big(h(x_i), Q\big), \phi_m\big(h(x_j), Q\big)\big\rangle$. It holds that:
$\Pr\Big(\sup_{Q\in\mathcal{Q}(Q^*;\delta)} \big|A_n(Q) - A_n(Q^*)\big| \ge \epsilon\Big) \lesssim \frac{\bar\sigma_Q^2}{\epsilon^2}\exp\Big(-\frac{m\epsilon^2}{16(d+2)}\Big) + \exp\Big(-\frac{n\epsilon^2}{64(1+\delta)}\Big),$
where $\bar\sigma_Q := \max_{Q\in\mathcal{Q}} \sigma_Q$. The significance of Proposition 4 is that even if the optimal $Q^*$ is not used, in the worst-case scenario the instability it causes (reflected via $\delta$) vanishes quickly as the sample size gets larger. Similarly, increasing the dimension of the featurized representation $\phi_m$ also speeds up the convergence exponentially. Together they provide the guarantee for predictability even if $Q^*$ is not used. The proof is provided in Appendix E. Takeaway. Featurizing the pretrained representation as $\phi_m(h, Q)$ offers a simple and practical solution to balance stability, predictability, and computability. We just showed that $Q$ can simply be the standard Gaussian distribution, and the dimension of $\phi_m(h)$ can be chosen to satisfy a specific approximation threshold $\epsilon$. It can also be treated as a tuning parameter in downstream tasks.
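The featurization $\phi_m(h, Q)$ with the standard-Gaussian proxy for $Q$ takes only a few lines. The sketch below (hypothetical sizes; the kernel is fixed to $K(u) = \exp(-\|u\|_2^2/2)$, for which the standard Gaussian is exactly the Bochner measure) also checks the Rahimi & Recht approximation numerically, with the error shrinking as $m$ grows.

```python
import numpy as np

def featurize(H, m, seed=0):
    """phi_m(h, Q): random Fourier features for the Gaussian kernel exp(-|u|^2 / 2),
    with Q taken as the standard Gaussian proxy discussed in the text."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(H.shape[1], m))        # m random directions u_i ~ N(0, I)
    P = H @ U                                    # projections u_i^T h(x)
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 16))
H /= np.linalg.norm(H, axis=1, keepdims=True)   # normalized pretrained representations

K_exact = np.exp(-0.5 * (2.0 - 2.0 * H @ H.T))  # Gaussian kernel for unit-norm rows
errs = []
for m in (64, 512, 4096):
    Phi = featurize(H, m)
    errs.append(np.abs(Phi @ Phi.T - K_exact).max())
    print(f"m={m:5d}  max |<phi, phi'> - K| = {errs[-1]:.3f}")
```

As in the production deployment described in Section 6, the directions `U` are drawn once and reused for every item, so featurization is a single matrix multiply plus elementwise trigonometry.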

6. BENCHMARK AND REAL-WORLD EXPERIMENTS

We conduct experiments on the benchmark dataset MovieLens-1m (ML-1m) for illustration and reproducibility purposes. The real-world production experiments took place at a major US e-commerce platform anonymized as Ecom. The detailed description of ML-1m and the introduction to Ecom's production environment are provided in Appendix F. On ML-1m. The dataset supports two types of pretraining-downstream task combinations: (a) leverage the sequences of user viewing data to pretrain movie embeddings, then use the embeddings to predict the genre of the movie (ML-1m task 1); (b) pretrain movie embeddings using the title and other descriptions, then use the embeddings for downstream sequential recommendation (ML-1m task 2). The detailed data processing, model and pretraining configurations, downstream training/testing setup, evaluation metrics, and sensitivity analysis are deferred to Appendix F. On ML-1m task 1, we use contrastive representation learning to pretrain the movie embeddings, and employ logistic regression to predict the genre using the movie embeddings as features. On ML-1m task 2, we use a bidirectional-RNN-type structure on the movies' NLP data, and extract the final hidden layer as the pretrained representation. The downstream sequential recommendation task employs a two-tower structure, and an RNN is used to aggregate the historical viewing sequence. In Table 1, we first see that $\phi_m(h)$ improves the stability of $h$ by at least 10x in both tasks. Even under the same dimension, $\phi_m(h)$ outperforms $h$, and is highly comparable to avg(h), the manually stabilized version of $h$ obtained by averaging it over ten independent runs. Note that avg(h) is almost never a good practical solution because it requires repeating the same pretraining process multiple times. Here, we use it as an analytical baseline, and show that $\phi_m(h)$ is just as good. When the dimension increases, $\phi_m(h)$ delivers far superior results.
Although changing the dimension also changes the downstream model complexity, as we discuss below, it offers welcome flexibility for real-world problems. [Table 1 caption: results are means ± standard deviations computed from ten independent runs. The subscripts of h and ϕ denote the dimension of the representation; avg(h) denotes the representation averaged over the ten independent pretrainings. Note that avg(h) cannot be used for large-scale production since it takes ten times the resources. For the Ecom A/B tests, we report the relative lift over control and the associated p-value.] On Ecom. The item representation learning pipeline serves several downstream productions: item-page recommendation (Task1), search ranking (Task2), email recommendation (Task3), and home-page marketing (Task4). They all have task-specific features and distinct, non-trivial model architectures. The pretrained item embeddings are refreshed on a daily basis, and downstream model owners may follow separate schedules to update and refresh the relevant parts of their models. In Appendix F.4, we describe our engineering solutions for deploying the featurization process on the frontend and backend. During A/B testing, we observe statistically significant lifts in click-through rate for all four downstream applications. The average revenue-per-visitor lift is also positive during the testing period. The detailed online results and analysis are provided in Appendix F. Lessons learnt. In addition to improved stability and performance, an important piece of feedback from downstream model owners is that the flexibility in choosing ϕ_m(h)'s dimension is very helpful for their tasks. Prior to our featurization technique, it was almost impossible to personalize the dimension of the pretrained representation for different applications, let alone tune it in downstream tasks.
Knowing that predictability will not vary much, experimenting with different dimensions often allows them to find a better bias-variance tradeoff for their downstream tasks.

7. DISCUSSION

The analytical results and the proposed featurization method in our work apply to a broad range of applications and research problems. Nevertheless, our results may still be rudimentary and far from providing the complete picture or the optimal practice for using pretrained representations. We hope the progress we made will lead to more advanced future research and applications. Scope and limitation. Most of our analysis is performed in basic settings: while this ensures the results hold in generality, advanced methods for pretraining representations are not considered. Also, we do not account for additional downstream features and their correlation with pretrained representations, or for connections between the pretraining objective and the downstream task. Such additional knowledge can be useful for deriving task-specific results (Arora et al., 2019). In terms of applications, our featurization technique may be less helpful if the downstream task simply uses embedding distances, as in KNN search. Optimizing the space and time complexity through techniques such as embedding quantization might be more useful for such tasks (Chen et al., 2020), which is not discussed in our paper. A future direction. While our work studies h(x) as a whole, it can be inferred from Figure 1c that the element-wise variance of ĥ(x) is bimodal, which suggests heterogeneity within h(x). Possible explanations are that a (random) subset of h(x) is responsible for overfitting the pretraining task (Bartlett et al., 2020), or that some dimensions are forced to become more independent of others so the representation matrix has nice spectral properties (Hastie et al., 2022). It is thus an important future direction to identify the internal structure of h(x) in order to better featurize pretrained representations.

A TECHNICAL BACKGROUND

In this part of the paper we provide the technical background for both the discussions in the paper and the proofs that follow. A central tool for proving uniform convergence results is the Gaussian / Rademacher complexity. For a set $A \subset \mathbb{R}^n$, it is defined as: $G(A) := \mathbb{E}_{\epsilon}\big[\sup_{a\in A}\sum_{i=1}^n \epsilon_i a_i\big]$, where the $\epsilon_i$ are i.i.d. Gaussian / Rademacher random variables. It essentially measures how well a function class can interpolate a random sign pattern assigned to a set of points. Given a function class $\mathcal{F}$ and $n$ samples $(x_1, \dots, x_n)$, the empirical Gaussian / Rademacher complexity is given by: $G_n(\mathcal{F}) := \mathbb{E}_{\epsilon}\big[\sup_{f\in\mathcal{F}}\sum_{i=1}^n \epsilon_i f(x_i)\big]$. Remark A.1. We mention that in some versions of the definition, there is a $1/n$ factor in the complexity term. Here, we explicitly pull that factor out and place it in the resulting bound. As we mentioned earlier, an important reason for using the Gaussian complexity is one of its technical properties, namely Slepian's lemma (Slepian, 1962) and its corollary, which we state below: Lemma A.1 (from Slepian's lemma). Suppose $\phi: A \to \mathbb{R}^q$ has Lipschitz constant $L$. Then it holds that: $G(\phi(A)) \le L\,G(A)$. This result can be viewed as the contraction lemma for Gaussian complexity (Ledoux & Talagrand, 1991). The implicit bias of gradient descent was characterized by Soudry et al. (2018) and their follow-up works. The key factor that contributes to the implicit bias of gradient descent is the divergence of the model parameters after separating the data, under loss functions with exponential-tail behavior. When the predictor $f \in \mathcal{F}$ parameterized by $\theta$ is over-parameterized, apart from certain degenerate cases, the data can be separated at some point if the predictor class satisfies some regularity assumptions (Lyu & Li, 2019), e.g.: • $f \in \mathcal{F}$ is homogeneous, such that $f(x; c\cdot\theta) = c^\beta f(x; \theta)$ for all $c > 0$; • $f \in \mathcal{F}$ is smooth and has a bounded Lipschitz constant. These conditions can be met by many neural network structures and activation functions.
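Returning to the complexity definitions above, for the norm-bounded linear class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le B\}$ the inner supremum has the closed form $B\,\|\sum_i \epsilon_i x_i\|_2$, which makes a direct Monte Carlo estimate easy. A toy sketch of our own, not from the paper:

```python
import numpy as np

def gaussian_complexity_linear(X, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of the (unnormalized) Gaussian complexity
    G_n(F) = E_eps sup_{||w||<=B} sum_i eps_i <w, x_i> for the linear
    class; for this class the sup equals B * ||sum_i eps_i x_i||_2."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        eps = rng.standard_normal(len(X))     # i.i.d. Gaussian variables
        total += B * np.linalg.norm(eps @ X)  # closed-form inner supremum
    return total / n_draws
```

Jensen's inequality gives the familiar upper bound $G_n(\mathcal{F}) \le B\sqrt{\sum_i \|x_i\|^2}$, which the Monte Carlo estimate stays below.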
The exponential tail of the loss function is satisfied by the common exponential loss and logistic loss (which we use throughout our discussions and experiments). To see why the norm of the model parameters will diverge, simply note that under, e.g., the exponential loss, both the risk and the gradient take the form $\sum_i c_i \exp(-y_i f(x_i; \theta))$, where the $c_i$ are lower-order terms. Since gradient descent will converge to a stationary point due to the nice properties of $\mathcal{F}$, we expect $\sum_i c_i \exp(-y_i f(x_i; \theta)) \to 0$ upon convergence. A necessary condition for this is $\exp(-y_i f(x_i; \theta)) \to 0$ for $i = 1, \dots, n$, and this condition is actually sufficient with high probability (Soudry et al., 2018). Therefore, for all $\exp(-y_i f(x_i; \theta))$ to reach 0, $\|\theta\|$ must diverge so that $|f(\cdot; \theta)| \to \infty$. With that said, since the loss function decays exponentially fast around 0, the data points with the largest margin will dominate both the gradient and the loss function. As a direct consequence, the decision boundary will share characteristics with the hard-margin SVM, given by: $\min \|\theta\|^2$ s.t. $y_i f(x_i; \theta) \ge 1$, $\forall i = 1, \dots, n$. Indeed, recent work shows that the optimization path of over-parameterized models will converge to a minimum-norm predictor: Corollary A.1 (Chizat et al. (2019); Woodworth et al. (2020), and others). Under the conditions specified in the referenced works (mostly exponential loss, scaled initialization, appropriate learning rate, and regularity conditions on the predictor class), the stationary points satisfy: $$\lim_{t\to\infty}\lim_{d\to\infty} F_{\theta^{(t)}/\|\theta^{(t)}\|} \;\to\; \arg\min \|f\|_K \;\text{ s.t. } y_i f(x_i) \ge 1, \;\forall i\in[n],$$ where $F$ is the decision boundary of $f$, $d$ is the dimension of the hidden layer(s) of $f$, and $\|\cdot\|_K$ is an appropriate RKHS norm. Note that in Section 4.2 we use $g_0$ to denote the converged result, and the above corollary guarantees its existence and uniqueness.
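The norm divergence described above is easy to observe numerically. The following toy sketch runs gradient descent with the logistic loss on a small linearly separable dataset (an illustrative setup of our own, not the paper's experimental configuration):

```python
import numpy as np

def gd_logistic_norms(X, y, steps=3000, lr=0.1):
    """Gradient descent on the logistic loss for a linear predictor on
    separable data; returns ||theta|| over time, illustrating the norm
    divergence discussed above (toy sketch)."""
    theta = np.zeros(X.shape[1])
    norms = []
    for _ in range(steps):
        margins = y * (X @ theta)
        # gradient of (1/n) sum_i log(1 + exp(-y_i <theta, x_i>))
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        theta -= lr * grad
        norms.append(float(np.linalg.norm(theta)))
    return norms
```

On separable data the norm keeps growing (roughly logarithmically in t), while the direction θ/∥θ∥ stabilizes toward the max-margin solution.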
However, one open question is which particular RKHS norm best describes the solution, as this choice directly affects the convergence of the parameters. Therefore, in our work we leave the convergence of the parameters out of the discussion. Remark A.2. It is also worth mentioning that the convergence of $\mathbb{E}[h^{(t)}(x)]$ plays no part in our arguments and results. Indeed, it does not change the stochasticity of $h^{(t)}(x)$, and (in some cases) can be implied by the convergence of $g^{(t)}(x)$ (Lyu & Li, 2019). Therefore, we do not discuss it in our work.

B PROOF OF THE RESULTS IN SECTION 4.1

We prove Proposition 1 in this part of the appendix. An important result we will be using is the Gaussian complexity bound for empirical risk minimization, in the version of Bartlett & Mendelson (2002). Lemma A.2. Let $\mathcal{F}$ be a real-valued function class from $\mathcal{X}$ to $[0,1]$, and let $(X_1, \dots, X_n)$ be i.i.d. random variables. Then for all $f\in\mathcal{F}$, it holds with probability at least $1-\delta$ that: $$\mathbb{E}f(X) \le \frac{1}{n}\sum_i f(X_i) + \frac{\sqrt{2\pi}\,G_n(\mathcal{F})}{n} + \sqrt{\frac{9\log(2/\delta)}{2n}}.$$ We now provide the proof, parts of which use Lemma A.1 and Lemma A.2. We also assume $\mathcal{F}$ has a Lipschitz constant of at most $L$. Proof. Recall that $h^*, f^* := \arg\min_{h\in\mathcal{H}, f\in\mathcal{F}} R(h, f)$. We decompose the generalization error via: $$R(\hat h) - \min_{h\in\mathcal{H}, f\in\mathcal{F}} R(h, f) = \Big[R(\hat h) - \min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ\hat h(X_i), Y_i\big)\Big] + \Big[\min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ\hat h(X_i), Y_i\big) - \min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ h^*(X_i), Y_i\big)\Big] + \Big[\min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ h^*(X_i), Y_i\big) - \mathbb{E}_{P^n}\min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ h^*(X_i), Y_i\big)\Big] + \Big[\mathbb{E}_{P^n}\min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ h^*(X_i), Y_i\big) - \min_{f\in\mathcal{F}}\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f\circ h^*(X), Y\big)\Big]. \quad (A.1)$$ We first discuss the first term, which incurred a major discussion in Section 4.1. Let $f_{h,n}$ denote the empirical risk minimizer for a given $h$. By standard practice, the first term can be bounded via: $$R(\hat h) - \min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ\hat h(X_i), Y_i\big) \le \sup_{h\in\mathcal{H}}\Big\{\underbrace{\mathbb{E}_{P^n}\Big[\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f_{h,n}\circ h(X), Y\big)\Big] - R_n(h)}_{(a)} + \underbrace{\mathbb{E}_{P^n}\Big[\frac{1}{n}\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)\Big] - \frac{1}{n}\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)}_{(b)}\Big\}.$$ Using Lemma A.2, term (b) can be bounded as: $$\sup_{h\in\mathcal{H}}\Big\{\mathbb{E}_{P^n}\Big[\frac{1}{n}\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)\Big] - \frac{1}{n}\sum_i \ell\big(f_{h,n}\circ h(X_i), Y_i\big)\Big\} \le \sqrt{2\pi}\,G_n\big(A(\mathcal{H})\big) + \sqrt{\frac{9\log(2/\delta)}{2n}},$$ where the set $A(\mathcal{H})$ is given by: $\big\{\frac{1}{n}\big(\ell(f_{h,n}\circ h(X_1), Y_1), \dots, \ell(f_{h,n}\circ h(X_n), Y_n)\big) : h\in\mathcal{H}\big\}$. It is easy to check that $A(\mathcal{H})$ admits Slepian's lemma, so we can use the contraction result from Lemma A.1 to bound it further: $G_n(A(\mathcal{H})) \le \frac{L}{\sqrt{n}}\,G_n(\mathcal{H})$. Combined, term (b) is upper bounded by: $\frac{\sqrt{2\pi}\,L\,G_n(\mathcal{H})}{\sqrt{n}} + \sqrt{\frac{9\log(2/\delta)}{2n}}$.

Now we bound term (a) as below.

Define the shorthand $\ell(\mathcal{F}(h)) := \big\{\big(\ell(f(h(X_1)), Y_1), \dots, \ell(f(h(X_n)), Y_n)\big) : f\in\mathcal{F}\big\}$. It holds that: $$\sup_{h\in\mathcal{H}}\Big\{\mathbb{E}_{P^n}\Big[\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f_{h,n}\circ h(X), Y\big)\Big] - R_n(h)\Big\} \le \sup_{h\in\mathcal{H}}\mathbb{E}_{P^n}\sup_{f\in\mathcal{F}}\Big\{\mathbb{E}_{(X,Y)\sim P}\,\ell\big(f\circ h(X), Y\big) - \frac{1}{n}\sum_i \ell\big(f\circ h(X_i), Y_i\big)\Big\}$$ $$\le \sqrt{2\pi}\sup_{h\in\mathcal{H}}\mathbb{E}_{P^n}\Big[\frac{G_n(\ell(\mathcal{F}(h)))}{n}\Big] \quad \text{(using Lemmas A.2 and A.1)}$$ $$= \sqrt{2\pi}\,n^{-1}\sup_{h\in\mathcal{H}}\mathbb{E}_{P^n}\Big[\frac{G_n(\ell(\mathcal{F}(h)))}{\|h(X)\|}\,\|h(X)\|\Big] \quad \big(\text{where } h(X) := [h(X_1), \dots, h(X_n)]\big)$$ $$\le \sqrt{2\pi}\,n^{-1}\sup_{h\in\mathcal{H}}\sqrt{\mathbb{E}\|h(X)\|^2}\cdot\sup_{A\in\mathbb{R}^{n\times d}}\frac{1}{\|A\|}\,\mathbb{E}\sup_{f\in\mathcal{F}}\sum_i \epsilon_i f\big([A]_i\big), \quad \epsilon_i \overset{i.i.d.}{\sim} N(0, 1). \quad (A.2)$$ We let $G_n'(\mathcal{F}) := \sup_{A\in\mathbb{R}^{n\times d}}\frac{1}{\|A\|}\mathbb{E}\sup_{f\in\mathcal{F}}\sum_i \epsilon_i f([A]_i)$ be the modified Gaussian complexity, so term (a) is finally bounded by: $\frac{\sqrt{2\pi}}{n}\,G_n'(\mathcal{F})\sup_{h\in\mathcal{H}}\sqrt{\mathbb{E}\|h(X)\|^2}$. Next, notice for the last term of (A.1) that: $\mathbb{E}_{P^n}\min_{f\in\mathcal{F}}\frac{1}{n}\sum_i \ell\big(f\circ h^*(X_i), Y_i\big) \le \mathbb{E}_{P^n}\frac{1}{n}\sum_i \ell\big(f^*\circ h^*(X_i), Y_i\big) = \mathbb{E}_{(X,Y)\sim P}\,\ell\big(f^*\circ h^*(X), Y\big)$. Therefore, the last term is always non-positive. Similarly, by definition, the second term is non-positive as well. Finally, as for the third term, since non-concentrating terms already appear in the bound of the first term, it does not hurt to simply bound it using Hoeffding's inequality, i.e. it will not exceed $O\big(\sqrt{\log(1/\delta)/n}\big)$ with probability at least $1-\delta$. Putting things together, we conclude the final result.

C TECHNICAL DETAILS FOR SECTION 4.2

We first restate the proposition: Proposition A.1. For the one-layer MLP we consider, with learning rate $\alpha = 1$, for any step $t > 1$, as $d \to \infty$, the model output $g^{(t)}(x)$ converges almost surely to $g_*^{(t)}(x)$ defined as follows: $g_*^{(t)}(x) = \big(C_1^{(t)} C_2^{(t)} + C_3^{(t)} C_4^{(t)}\big)x$, with $\big(C_1^{(t+1)}, C_2^{(t+1)}, C_3^{(t+1)}, C_4^{(t+1)}\big) = \big(C_1^{(t)}, C_2^{(t)}, C_3^{(t)}, C_4^{(t)}\big) + L^{(t)} x^{(t)}\big(C_3^{(t)}, C_4^{(t)}, C_1^{(t)}, C_2^{(t)}\big)$. The above iterative update can be shown by making explicit the terms in the forward and backward passes of the $t$-th gradient step. In particular, it holds that: $$g^{(t)}(x) \overset{a.s.}{\to} \mathbb{E}\big[\Theta^{(t)}\sigma(W^{(t)}x)\big] \;\big(\overset{def}{=} g_*^{(t)}(x)\big), \quad \ell'\big(g^{(t)}(x^{(t)}), y^{(t)}\big) \overset{a.s.}{\to} \ell'\big(g_*^{(t)}(x^{(t)}), y^{(t)}\big) \;\big(\overset{def}{=} L^{(t)}\big),$$ $$\Theta^{(t+1)} = \Theta^{(t)} - L^{(t)}\sigma\big(W^{(t)}x^{(t)}\big), \quad W^{(t+1)} = W^{(t)} - L^{(t)} x^{(t)}\,\Theta^{(t)}\sigma'\big(W^{(t)}x^{(t)}\big).$$ The only extra requirement for the above convergence to hold is that the activation function is well-behaved (see Yang (2019) for a detailed description). To see how the above system of equations leads to the result in Proposition A.1, imagine the activation is the identity function. In this case, $\Theta^{(t)}$ and $W^{(t)}$ are always deterministic linear combinations of $\Theta^{(0)}$ and $W^{(0)}$: the update becomes $\Theta^{(t)} = C_1\Theta^{(0)} + C_2 W^{(0)}$, $W^{(t)} = C_3\Theta^{(0)} + C_4 W^{(0)}$. We mention that, as a corollary, $W^{(t+1)}x$ is also element-wise i.i.d., so the inner product of the hidden representations satisfies $\big\langle W^{(t+1)}x, W^{(t+1)}x'\big\rangle \overset{a.s.}{\to} \mathbb{E}\big[(W^{(t)}x)(W^{(t)}x')\big]$, where $W^{(t)}$ denotes an i.i.d. row of the rescaled version of $W^{(t+1)}$.
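The concentration of the hidden-layer inner products as the width d grows can also be checked numerically. A toy sketch with a randomly initialized ReLU network (the ReLU choice is ours, not the setting of Proposition A.1):

```python
import numpy as np

def hidden_inner_product(x, xp, d, seed):
    """Scaled inner product of the hidden representations of a randomly
    initialized one-hidden-layer ReLU network of width d (toy sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, len(x)))
    h, hp = np.maximum(W @ x, 0.0), np.maximum(W @ xp, 0.0)
    return float(h @ hp) / d

def spread(x, xp, d, n_seeds=20):
    """Standard deviation of the inner product across independent
    initializations; it shrinks as the width d grows."""
    vals = [hidden_inner_product(x, xp, d, s) for s in range(n_seeds)]
    return float(np.std(vals))
```

The spread across random initializations shrinks at roughly the $1/\sqrt{d}$ rate, mirroring the almost-sure convergence stated above.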

D PROOFS OF THE RESULTS IN SECTION 4.3

Proof for Proposition 3 Proof. Proofs of lower bounds often start by converting the problem into a hypothesis testing task. Denote our parameter space by $B(k) = \{\theta \in \mathbb{R}^d : \|\theta\|_0 \le k\}$. The intuition is to suppose the data is generated by: 1. drawing $\theta$ uniformly over the parameter space; 2. conditioned on the particular $\theta$, drawing the observed data. The problem is then converted into determining, from the data, whether we can recover the underlying $\theta$, as in a canonical hypothesis testing problem. For any $\delta$-packing $\{\theta_1, \dots, \theta_M\}$ of $B(k)$, suppose $B$ is sampled uniformly from the $\delta$-packing. Then, following a standard argument of the Fano method (Wainwright, 2019), it holds that: $$\Pr\Big(\min_{\hat\theta}\sup_{\|\theta^*\|_0\le k} \|\hat\theta - \theta^*\|_2 \ge \delta/2\Big) \ge \min_{\tilde\theta} P\big(\tilde\theta \ne B\big), \quad (A.3)$$ where $\tilde\theta$ is a testing function that decides, from the data, whether an estimated $\theta$ equals an element sampled from the $\delta$-packing. The next step is to bound $\min_{\tilde\theta} P(\tilde\theta \ne B)$, where by the information-theoretic lower bound (Fano's lemma) we have: $$\min_{\tilde\theta} P\big(\tilde\theta \ne B\big) \ge 1 - \frac{I(y, B) + \log 2}{\log M}, \quad (A.4)$$ where $I(\cdot, \cdot)$ denotes the mutual information. We then only need to bound the mutual information term. Let $P_\theta$ be the distribution of $y$ (the vector consisting of the $n$ samples) given $B = \theta$. Since $y$ is distributed according to the mixture $\frac{1}{M}\sum_i P_{\theta_i}$, it holds that: $$I(y, B) = \frac{1}{M}\sum_i D_{KL}\Big(P_{\theta_i} \,\Big\|\, \frac{1}{M}\sum_j P_{\theta_j}\Big) \le \frac{1}{M^2}\sum_{i,j} D_{KL}\big(P_{\theta_i} \,\|\, P_{\theta_j}\big),$$ where $D_{KL}$ is the Kullback-Leibler divergence. The next step is to determine $M$, the size of the $\delta$-packing, and the upper bound on $D_{KL}(P_{\theta_i} \| P_{\theta_j})$ for elements $P_{\theta_i}, P_{\theta_j}$ of the $\delta$-packing. For the first part, it has been shown that there exists a 1/2-packing of $B(k)$ in $\ell_2$-norm with $\log M \ge \frac{k}{2}\log\frac{d-k}{k/2}$ (Raskutti et al., 2011). As for the bound on the KL-divergence term, note that given $\theta$, $P_\theta$ is a product distribution of the conditional Gaussian: $y \mid \epsilon \sim N\big(\theta^\top\epsilon\,\sigma_h^2/\sigma_\epsilon^2,\; \theta^\top\theta(\sigma_z^2 - \sigma_z^4/\sigma_\epsilon^2)\big)$, where $\sigma_\epsilon^2 := \sigma_h^2 + \sigma_z^2$.
Henceforth, for any $\theta_1, \theta_2 \in B(k)$, it is straightforward to compute: $$D_{KL}\big(P_{\theta_1} \| P_{\theta_2}\big) = \mathbb{E}_{P_{\theta_1}}\Big[\frac{n}{2}\log\frac{\theta_1^\top\theta_1(\sigma_z^2 - \sigma_z^4/\sigma_\epsilon^2)}{\theta_2^\top\theta_2(\sigma_z^2 - \sigma_z^4/\sigma_\epsilon^2)} + \frac{\big\|y - \theta_2^\top\epsilon\,\sigma_h^2/\sigma_\epsilon^2\big\|_2^2}{2\,\theta_2^\top\theta_2(\sigma_z^2 - \sigma_z^4/\sigma_\epsilon^2)} - \frac{\big\|y - \theta_1^\top\epsilon\,\sigma_h^2/\sigma_\epsilon^2\big\|_2^2}{2\,\theta_1^\top\theta_1(\sigma_z^2 - \sigma_z^4/\sigma_\epsilon^2)}\Big] = \frac{\sigma_z^2}{2\sigma_\epsilon^2}\,\big\|\epsilon(\theta_1 - \theta_2)\big\|_2^2,$$ where $y$ and $\epsilon$ are the vector and matrix consisting of the $n$ samples, i.e. $y \in \mathbb{R}^n$ and $\epsilon \in \mathbb{R}^{n\times d}$. Since each row of the matrix $\epsilon$ is drawn from $N(0, \sigma_\epsilon^2 I_{d\times d})$, a standard concentration result shows that with high probability, $\|\epsilon(\theta_1 - \theta_2)\|_2^2$ can be bounded by $C\|\theta_1 - \theta_2\|_2^2$ for some constant $C$. This gives the final upper bound on the KL-divergence term: $D_{KL}(P_{\theta_1}\|P_{\theta_2}) \lesssim \frac{n\sigma_z^2\delta^2}{2\sigma_\epsilon^2}$. Substituting this result into (A.4) and (A.3), choosing $\delta^2 = C\,\frac{k\sigma_\epsilon^2}{\sigma_z^2\, n}\log\frac{d-k}{k/2}$ and rearranging terms, we obtain the desired result: with probability at least 1/2, $$\inf_{\hat\theta}\sup_{\theta^*: \|\theta^*\|_0\le k} \|\hat\theta - \theta^*\|_2 \gtrsim \frac{\sigma_\epsilon^2}{\sigma_h^2}\,\frac{k}{n}\log(d/k).$$

Proof of Lemma 1

Proof. We first express the NW predictor in its expectation form: $f_\phi(X) = \mathbb{E}_{X'}\big[y' K(X, X')\big]/Z$, where $Z$ is the normalization constant. Recall that $y \in \{-1, +1\}$, and $R(\cdot)$ is the risk associated with the 0-1 classification loss. We first define, for $x \in \mathcal{X}$: $\gamma_\phi(X) := \mathbb{E}_{X'}\big[K(X, X')\big]/Z$, where the expectation is taken w.r.t. the underlying distribution. Using the Markov inequality, we immediately have $|\gamma_\phi(X)| \le 1/\sqrt{\delta}$ with probability at least $1-\delta$. It then holds, with probability $1-\delta$, that: $$1 - R(f) = P\big(Y f(X) \ge 0\big) \ge \mathbb{E}\Big[\frac{Y f(X)}{\gamma_\phi(X)}\,\mathbb{1}[Y f(X) \ge 0]\Big] \ge \mathbb{E}\Big[\frac{Y f(X)}{\gamma_\phi(X)}\Big] \ge \sqrt{\delta}\;\frac{\mathbb{E}\big[\mathbb{1}[Y = Y']\,K(X, X')\big]}{Z},$$ which concludes the proof.
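For concreteness, a minimal sketch of the Nadaraya-Watson predictor analyzed in Lemma 1, using a Gaussian kernel on the representations (the kernel choice and all names are ours):

```python
import numpy as np

def nw_predict(H_train, y_train, H_test, sigma=1.0):
    """Nadaraya-Watson classifier f(x) = sum_i y_i K(h, h_i) / Z with a
    Gaussian kernel on (featurized) representations; labels in {-1, +1}.
    A toy sketch of the predictor analyzed in Lemma 1."""
    # pairwise squared distances, test x train
    d2 = ((H_test[:, None, :] - H_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))
    return np.sign((K * y_train).sum(1) / K.sum(1))
```

The kernel weights concentrate on training points whose representations are close to the query, so the prediction is a locally weighted vote.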

E PROOF OF THE RESULT IN SECTION 5

The proof of Proposition 4 relies on two important results, which we state below. Lemma A.3 (Ben-Tal et al. (2013)). Let $c$ be any closed convex function with domain $[0, +\infty)$, and let its conjugate be $c^*(s) = \sup_{t\ge 0}\{ts - c(t)\}$. Then for any distribution $Q^*$ and any function $g: \mathbb{R}^d \to \mathbb{R}$, it holds that: $$\sup_{Q\in\mathcal{Q}(Q^*;\delta)} \int g(u)\,dQ(u) = \inf_{\lambda\ge 0,\,\eta}\Big\{\lambda \int c^*\Big(\frac{g(u) - \eta}{\lambda}\Big)\,dQ^*(u) + \delta\lambda + \eta\Big\}. \quad (A.5)$$ The next lemma is adapted from the concentration of random Fourier features in Rahimi & Recht (2007). Recall that $\phi_m(h(x), Q) := \frac{1}{\sqrt{m}}\big[\cos(u_1^\top h(x)), \sin(u_1^\top h(x)), \dots, \cos(u_m^\top h(x)), \sin(u_m^\top h(x))\big]$ comes from the Monte Carlo approximation of $K(h(x), h(x'))$. Lemma A.4. Let $A \subset \mathbb{R}^d$ have diameter $d_A$, with $h(x) \in A$ for all $x \in \mathcal{X}$. It holds that: $$\Pr\Big(\sup_{h(x), h(x')} \big|K(h(x), h(x')) - \big\langle\phi_m(h(x), Q), \phi_m(h(x'), Q)\big\rangle\big| \ge \epsilon\Big) \le 2^8\Big(\frac{\sigma_Q d_A}{\epsilon}\Big)^2 \exp\Big(-\frac{m\epsilon^2}{4(d+2)}\Big), \quad (A.6)$$ where $Q$ is given by the inverse Fourier transform of $K$, and $\sigma_Q^2$ is the second moment of $Q$. Recall that $A_n(Q) := \frac{1}{n(n-1)}\sum_{i\ne j}\mathbb{1}[y_i = y_j]\big\langle\phi_m(h(x_i), Q), \phi_m(h(x_j), Q)\big\rangle$. For notational convenience, in what follows we let $h_i := h(x_i)$, and further define $\varphi(h, U) := [\cos(U^\top h), \sin(U^\top h)]$ as the actual random Fourier feature underlying $\phi_m(h, Q)$, where $U \sim Q$. Also, we let $K(Y, Y') := \mathbb{1}[Y = Y']$ be the labelling kernel of the downstream task. Proof. Following Lemma A.3, we work with a scaled version of the f-divergence under $c(t) = \frac{1}{k}(t^k - 1)$ (because its dual function has a cleaner form). It is easy to check that $c^*(s) = \frac{1}{k'}[s]_+^{k'} + \frac{1}{k}$ with $\frac{1}{k'} + \frac{1}{k} = 1$. First note that the sampling error of the alignment $\mathbb{E}\big[K(Y_i, Y_j)\,K_Q(H_i, H_j)\big]$, i.e. replacing the expectation by the sample average, is given by: $$\Delta_n(U) := \frac{1}{n(n-1)}\sum_{i\ne j} K(y_i, y_j)\,\varphi(h_i, U)^\top\varphi(h_j, U) - \mathbb{E}\big[K(Y_i, Y_j)\,\varphi(H_i, U)^\top\varphi(H_j, U)\big].$$ We show that $\Delta_n(U)$ is sub-Gaussian.
Let $\{h_i', y_i'\}_{i=1}^n$ be an i.i.d. copy of the observations, identical except for one element. Without loss of generality, we assume the last element differs: $(h_n, y_n) \ne (h_n', y_n')$. Let $\Delta_n'(U)$ be computed by replacing $\{h_i, y_i\}_{i=1}^n$ with $\{h_i', y_i'\}_{i=1}^n$; the difference can be bounded via: $$|\Delta_n(U) - \Delta_n'(U)| = \frac{1}{n(n-1)}\Big|\sum_{i\ne j} K(y_i, y_j)\,\varphi(h_i, U)^\top\varphi(h_j, U) - K(y_i', y_j')\,\varphi(h_i', U)^\top\varphi(h_j', U)\Big|$$ $$\le \frac{1}{n(n-1)}\Big(\sum_{i<n}\big|K(y_i, y_n)\,\varphi(h_i, U)^\top\varphi(h_n, U) - K(y_i, y_n')\,\varphi(h_i, U)^\top\varphi(h_n', U)\big| + \sum_{j<n}\big|K(y_n, y_j)\,\varphi(h_n, U)^\top\varphi(h_j, U) - K(y_n', y_j)\,\varphi(h_n', U)^\top\varphi(h_j, U)\big|\Big) \le \frac{4}{n},$$ where the last inequality follows because the random Fourier features $\varphi$ and the labelling kernel $K(y, y')$ are both bounded by 1. This bounded-difference result shows that $\Delta_n(U)$ is a $\frac{4}{n}$-sub-Gaussian random variable. To bound $\Delta_n(U)$, we use: $$\sup_{Q\in\mathcal{Q}(Q^*;\delta)} \int \Delta_n(U)\,dQ \le \sup_{Q\in\mathcal{Q}(Q^*;\delta)} \int |\Delta_n(U)|\,dQ \le \inf_{\lambda\ge 0}\Big\{\frac{\lambda^{1-k'}}{k'}\,\mathbb{E}_{Q^*}|\Delta_n(U)|^{k'} + \frac{\lambda(\delta + 1)}{k}\Big\} \quad \text{(using Lemma A.3)}$$ $$= (\delta + 1)^{1/k}\big(\mathbb{E}_{Q^*}|\Delta_n(U)|^{k'}\big)^{1/k'} \quad (\text{solving for } \lambda^* \text{ above}) \;=\; \sqrt{\delta + 1}\,\big(\mathbb{E}_{Q^*}|\Delta_n(U)|^2\big)^{1/2} \quad (\text{let } k = k' = 2). \quad (A.7)$$ This means that in order to bound $\sup_{Q\in\mathcal{Q}(Q^*;\delta)} \int \Delta_n(U)\,dQ$, we only need to bound $|\Delta_n(U)|^2$. Using classical results for sub-Gaussian random variables (Boucheron et al., 2013), it holds for $\lambda \le n/8$ that: $\mathbb{E}\exp\big(\lambda\Delta_n(U)^2\big) \le \exp\big(-\frac{1}{2}\log(1 - 8\lambda/n)\big)$. We can take its integral and further upper bound the result with: $$\Pr\Big(\int \Delta_n(U)^2\,dQ \ge \frac{\epsilon^2}{\delta + 1}\Big) \le \mathbb{E}\exp\Big(\lambda\int\Delta_n(U)^2\,dQ\Big)\exp\Big(-\frac{\lambda\epsilon^2}{\delta + 1}\Big) \;\text{(Chernoff bound)} \le \exp\Big(-\frac{1}{2}\log\Big(1 - \frac{8\lambda}{n}\Big) - \frac{\lambda\epsilon^2}{\delta + 1}\Big) \;\text{(Jensen's inequality)}.$$ Hence, it holds that: $\Pr\big(\sup_{Q\in\mathcal{Q}(Q^*;\delta)} \Delta_n(U) \ge \epsilon\big) \le \exp\big(-\frac{n\epsilon^2}{16(1+\delta)}\big)$. Combining this result with the approximation error of the random Fourier features in Lemma A.4, we obtain the desired result.
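The m-dependence in Lemma A.4 can be observed numerically. The following sketch measures the worst pairwise gap between the Gaussian kernel and its m-feature Monte Carlo approximation, assuming Q = N(0, I) (a toy check of our own):

```python
import numpy as np

def max_kernel_error(H, m, seed=0):
    """Worst pairwise error between the Gaussian kernel
    K(h, h') = exp(-||h - h'||^2 / 2) and its m-feature approximation
    <phi_m(h), phi_m(h')> (toy check illustrating Lemma A.4's rate)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((H.shape[1], m))  # Q = N(0, I), inverse FT of K
    P = H @ U
    Phi = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    return float(np.abs(np.exp(-d2 / 2) - Phi @ Phi.T).max())
```

The error decays at roughly the $1/\sqrt{m}$ rate, consistent with the exponential tail bound in (A.6).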

F SUPPLEMENTARY MATERIAL FOR THE EXPERIMENTS

We provide the descriptions, details, and additional results of our experiments in this part of the appendix.

F.1 REPLICATING THE INSTABILITY ISSUE WITH IMDB DATASET

The IMDB dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative. We consider this dataset for an additional proof of concept in particular because it appears in the official TensorFlow tutorial. We directly adopt the implementation from the tutorial, including the text preprocessing pipeline and model architecture. In particular, the raw input texts are passed through a text vectorization layer, an embedding layer, a bidirectional RNN layer, and finally two dense layers to produce the final score for binary classification. We extract the output of the last hidden layer as the hidden representation. In our experiments, we set the hidden dimension to 32. The results are provided in Figure A.1, where we observe patterns highly similar to those on the ML-1m data. In particular, the pretrained embeddings have high variances in their exact values even though the pretraining objectives converge to similar loss and accuracy, and the variances get larger as the pretraining progresses. Two minor differences from the ML-1m results are that the pretraining process is less stable for IMDB (Figure A.1b), and that the variance distribution here is unimodal instead of the bimodal distribution we observed in Figure 1c.
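The quantity plotted in Figures 1a and A.1 (the dimension-wise empirical variance across independent runs) can be computed in one step once the embeddings from each run are stacked. A sketch of our own; the actual pipeline loads saved checkpoints:

```python
import numpy as np

def dimensionwise_variance(runs):
    """Given a list of embedding matrices from independent pretraining
    runs (each n_items x d), return the per-dimension variance of each
    entry across runs, averaged over items (the quantity visualized in
    Figures 1a and A.1; this is a sketch, not the paper's exact script)."""
    R = np.stack(runs)                   # (n_runs, n_items, d)
    return R.var(axis=0).mean(axis=0)    # (d,)
```

Identical runs yield zero variance in every dimension; any run-to-run stochasticity shows up as strictly positive entries.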

F.2 DETAILS OF THE BENCHMARK EXPERIMENTS

The main benchmark experiments in our paper are conducted on the MovieLens-1m dataset, a well-established public dataset for movie and user contextual analysis and for examining recommendation algorithms. The ML-1m dataset consists of 1 million movie ratings from 6,000 users on 4,000 movies, on a one-to-five rating scale. According to Harper & Konstan (2015), the data was collected in an initial stage and a follow-up stage: the initial stage mainly involves popularity-based exposure (a very small proportion involves random exposure), while in the follow-up stage, rating feedback is collected under certain deterministic recommender systems. By convention, we convert the dataset to implicit feedback, which amounts to treating all rating events as clicks. For contextual information, each movie is provided with its title and genre, in the form of English words or sentences. There are 18 genres in total.

Pretraining movie embeddings from user behavior data

We use Item2vec (Barkan & Koenigstein, 2016) to train movie embeddings from users' consecutive viewing data. Item2vec uses the same objective function as Word2vec (Mikolov et al., 2013), where the words become movies and the corpus becomes each user's viewing sequence. Movies belonging to a consecutive viewing window of size #ws are treated as positive pairs, and for each positive pair, we randomly sample #ns negative movies. Together with the embedding dimension d and the ℓ2-regularization parameter (weight decay) λ, the training schema is described by the quadruplet (#ws, #ns, d, λ). Since our goal is not to find the best pretraining schema, we fix #ws=3 and #ns=3, and focus on studying how our results may change under different d. Pretraining movie embeddings from NLP data. We use Doc2vec to pretrain the movie embeddings from the movies' textual data. Since Doc2vec is built on top of Word2vec, the training schema can also be described by the quadruplet (#ws, #ns, d, λ); we likewise fix #ws=3 and #ns=3. Using pretrained movie embeddings for downstream genre prediction. Given pretrained movie embeddings ĥ(x), we employ logistic regression to predict the score for a movie belonging to a particular genre, i.e. p(Y_i = k) ∝ exp(θ_k^⊤ ĥ(x)). Due to its simplicity, we use the logistic regression subroutine from the scikit-learn package.
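For readers unfamiliar with the objective, here is a toy skip-gram-with-negative-sampling sketch in the spirit of Item2vec. It is illustrative only: our experiments use standard optimized implementations, and all names below are our own:

```python
import numpy as np

def item2vec(sequences, n_items, d=16, ws=3, ns=3, epochs=10, lr=0.05, seed=0):
    """Toy skip-gram-with-negative-sampling sketch of the Item2vec
    objective on viewing sequences. Returns (item, context) embeddings."""
    rng = np.random.default_rng(seed)
    E = 0.1 * rng.standard_normal((n_items, d))  # item ("input") embeddings
    C = 0.1 * rng.standard_normal((n_items, d))  # context embeddings
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for seq in sequences:
            for i, a in enumerate(seq):
                for b in seq[max(0, i - ws): i + ws + 1]:
                    if b == a:
                        continue
                    g = sig(E[a] @ C[b]) - 1.0   # positive pair: pull together
                    E[a] -= lr * g * C[b]
                    C[b] -= lr * g * E[a]        # sequential updates, for simplicity
                    for v in rng.integers(0, n_items, size=ns):
                        gn = sig(E[a] @ C[v])    # random negative: push apart
                        E[a] -= lr * gn * C[v]
                        C[v] -= lr * gn * E[a]
    return E, C
```

After training, co-viewed items score much higher under the item-context inner product than items that only appear as random negatives.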

Using pretrained movie embedding for downstream sequential recommendation

We employ a two-tower model structure (Figure A.2) for the downstream sequential recommendation, which is very common in the recommendation community. In particular, we use an RNN to aggregate the past interaction sequence, so the whole model is very similar to GRU4Rec (Jannach & Ludewig, 2017). We use the sigmoid function as the activation for the dense layers. The model is trained in a seq-to-seq fashion, where for each positive target we randomly sample 3 negative targets. We fix the hidden dimension of both the RNN and the dense layers to 16.

Model Training

Besides Doc2vec and the logistic regression, all of our models are optimized using the Adam optimizer with early stopping, which stops the training if the improvement in the loss is less than 1e-4 for three consecutive epochs. For all the experiments, we set the initial learning rate to 0.001 and the weight decay to 1e-4. Our main implementation is in TensorFlow, and all computations are conducted on a 16-core Linux cluster with 128 GB of memory and two Nvidia Tesla V100 GPUs, each with 16 GB of memory. We use the Doc2vec subroutine from the Gensim package to pretrain the movie embeddings for ML-1m task 2.

Train/test split and metrics

Since the goal of our experiments is not to find the best modelling and training configuration, we do not use a validation set to tune the hyperparameters. Instead, we provide sensitivity analysis on certain parameters of interest in Appendix F.3. For ML-1m task 1, we randomly split the movies 80%-20% to construct the training and testing sets for genre classification. For evaluation, we use accuracy and the macro F1 score as metrics. For ML-1m task 2, we follow the convention of using the last user-movie interaction for testing and all previous interactions for training. For evaluation, we use Recall@5, i.e. whether the movie that the user truly viewed is among the top-5 recommendations, and NDCG@5, which further discounts the position of the viewed movie within the top-5 recommendations. In e-commerce applications, the representation of items serves as a central component of almost all machine learning algorithms (Wang et al., 2018; Xu et al., 2021). In the past few years, we have built a dedicated item representation learning pipeline that uses multiple sources of data to optimize item embeddings. Since there are billions of items on our platform Ecom, it took considerable effort to optimize the data pipeline and training routines so that the model refresh can be done on a daily basis. We point out that the daily refresh is necessary for item representation learning because the catalog of items, a major source of pretraining data, also receives minor updates on a daily basis. For example, new items can be added, and the features of items (e.g. title, price, description) can be modified by the vendors. The other major source of pretraining data is the historical customer behavior data, which is critical for revealing the relationships (e.g. similarity, complementariness, compatibility, substitutability) among items.
These relationships are relatively stable in the customer population, so the more data we use, the more likely we are to discover useful signals. Our model for pretraining item embeddings has feed-forward, recurrent-unit, and contrastive-learning components; the reason for using these different components is to effectively handle data with different structures. The pretrained item embeddings are expected to be stable: as mentioned above, the relationships among items are quite stable, and the catalog data shows very minor differences within a limited span of time. Therefore, downstream machine learning models may follow a weekly or bi-weekly refresh schedule while expecting very stable performance. The four major applications that depend on our pretrained item embeddings, first introduced in Section 6, are item-page recommendation, search ranking, email recommendation, and home-page marketing. Each of the four tasks uses both the item embeddings and task-specific features to optimize its objective. Most of them use model structures similar to the Wide and Deep network (Covington et al., 2016) to effectively combine information from different sources. Item-page recommendation aims to provide items related to the anchor item on the particular page the customer is viewing; item embeddings are used in both the recall and reranking stages. Search ranking is a large system that combines multiple components; in particular, the item embeddings are used in one of the recall stages. Email recommendation is a simpler task that aims to recommend items related to what the customers recently interacted with, or are likely to interact with again; item embeddings are used along with other features to build a model that optimizes CTR. Marketing is also a large system in Ecom, and the particular component that uses item embeddings builds the predicted click-through-rate model to support bidding and placement ranking.
Brief summary of the production environment and implementation. Our production environment is relatively standard in the e-commerce industry, with Hive/Spark supporting the offline data streaming and TensorFlow Serving supporting the online inference of deep learning models. Featurizing h(·) via ϕ(h(·), Q) can be easily implemented in production. Some practical advantages are: • the algorithm is very simple and requires no training; • it fits seamlessly into the current big-data infrastructure and frontend services; • it requires no change to the downstream models; • the overhead for both training and inference is small; • the communication can be done by simply recording the dimension and the random seed under which a particular U is generated. On the backend, featurizing the pretrained representation is engineered into a subroutine (on top of the original automated representation learning pipeline) callable by downstream applications. For instance, it can be a simple PySpark function if the endpoint of the automated representation learning pipeline is a feature store in HDFS. The dimension m and the random seed for generating the random directions U = [u_1, ..., u_m] are the two main inputs. Configuring and logging the random seed used by each experiment is important because U might be reused for deploying the model on the frontend. If the dimension and random seed are logged in the configuration, then there is no need to pass the particular U around across different infrastructures. We mentioned in Section 5 that the dimension can be chosen either by specifying a threshold on the approximation error ϵ (Proposition 4), for which we provide an implementation that takes ϵ as input, or by treating it as a tuning parameter. We find that in most cases, downstream model owners prefer to treat it as a tuning parameter.
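The backend subroutine described above can be as small as the following sketch, where regenerating U from the logged (m, seed) pair replaces shipping U itself across infrastructures (the function and argument names are ours, not the production API):

```python
import numpy as np

def featurize_with_config(H, m, seed):
    """Backend featurization subroutine: the random directions U are
    regenerated deterministically from the logged (m, seed) config, so
    only the config needs to be shared across infrastructures (a sketch
    of the deployment described above)."""
    rng = np.random.default_rng(seed)  # logged seed => reproducible U
    U = rng.standard_normal((H.shape[1], m))
    P = H @ U
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)
```

The same (m, seed) pair always reproduces the same U, so the frontend layer and the offline feature store stay consistent without sharing the matrix itself.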
For the frontend service, notice that featurizing h into ϕ_m(h, Q) amounts to simply adding an initial fully-connected layer whose parameters are given by the U ∈ R^{d×m} used in offline training; the activations of that initial layer are the sin(·) and cos(·) functions. Therefore, it fits seamlessly with the serialization and signature-building processes of the downstream model at no extra complication. In terms of inference time, it adds little overhead for large models. For smaller models and applications, on the other hand, we find that it can be more efficient to directly cache the featurized ϕ_m(h, Q) at the expense of space complexity.

A/B testing and result analysis

Deploying a unified A/B test across the different downstream applications is extremely challenging. Therefore, the A/B testing is conducted within each downstream application, but we coordinated with the different teams so that they started at approximately the same time. As shown in Table 1, all four A/B tests achieve statistically significant results at the 0.1 level. In particular, the lift in item-page recommendation CTR has a p-value less than 0.01. Furthermore, we observe a very positive change in the per-visitor gross-merchandise value (GMV) during the testing period. Although this metric can be confounded by many other factors, we nevertheless believe featurizing pretrained representations benefits both the individual machine learning tasks and the overall business metric.
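For readers who wish to reproduce this kind of significance call, a two-sided two-proportion z-test on CTR is the standard calculation. The counts below are made-up placeholders, not our production numbers:

```python
import math

def ctr_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided two-proportion z-test for a CTR difference
    between control (a) and treatment (b)."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p = (clicks_a + clicks_b) / (views_a + views_b)    # pooled click rate
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return z, p_value

# hypothetical control vs. treatment counts from equal-sized buckets
z, p = ctr_z_test(clicks_a=5200, views_a=100000, clicks_b=5500, views_b=100000)
assert p < 0.05   # a lift of this size on 100k views per bucket is significant
```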

Summary

We examine the performance of featurizing pretrained item embeddings via the method proposed in this paper. We discuss how the approach fits seamlessly with our production scenarios, give an overview of our implementation, and report the final testing results. Based on our experience, featurizing representations could be an important step toward more stable and productive systems for industrial machine learning.

G ADDITIONAL DISCUSSIONS

In this part of the appendix, we provide additional discussions that give readers a better understanding of our contributions.

Connections between our theoretical and empirical results

Proposition 1 and its subsequent discussion outline a potential problem with using pretrained representations in practice: the variance of h(x) may prevent the empirical downstream solution from concentrating to the optimal downstream solution. In Proposition 2, we provide a comprehensive result that characterizes the variance of h(x), which corroborates our empirical findings in Figure 1. In particular, it shows that although the pretraining algorithm may converge, the entries of h(x) can remain stochastic, which negatively affects downstream performance. Together, Propositions 1 and 2 suggest that directly using h(x) in downstream tasks may result in poorer performance, which is corroborated by the empirical results in Table 1. Proposition 3 and Lemma 1 analyze the potential interaction scenarios between pretrained representations and downstream tasks. They highlight the role of the embedding kernel and motivate the featurization approach we later develop. Proposition 4 provides a stability guarantee for the proposed featurization approach, which indeed delivers superior performance as shown in Table 1.

Additional relevant literature

Besides the recent developments in machine learning theory summarized in Section 2, our work is related to a number of active domains in machine learning. In terms of applications, pretrained representations are being used by a broad range of industrial machine learning tasks, including information retrieval, recommender systems, advertisement, knowledge completion, social networks, user understanding, and many others (Cheng et al., 2016; Fan et al., 2022; Xu et al., 2020a; Zhang et al., 2018). Unfortunately, no existing work investigates their effectiveness from the industry's standpoint, that is, whether they strike a good balance between stability, predictability, and computability. Parallel to the above industrial tasks are the large natural language and computer vision models (Devlin et al., 2018; He et al., 2019), whose complexity exceeds the typical settings studied in this paper. The proposed featurization approach relies on random Fourier features, a well-established and important tool of modern machine learning (Rahimi & Recht, 2007). Their properties are well understood, and they are actively applied to obtain kernel representations in many applications, such as signal processing, time series, and graphs (Liu et al., 2021; Gogineni et al., 2020; Xu et al., 2020b). Finally, compared with the recent line of research on the interpretability of deep learning models (Chakraborty et al., 2017), we focus particularly on the properties of learnt representations and how to improve them.

Experiments with NLP dataset

We further conduct experiments with the IMDB dataset described in Appendix F, where we use Word2vec to pretrain word embeddings and apply them to a downstream bi-directional RNN model for sentiment analysis. We note that large language models are not suitable for our purpose because the existing model checkpoints provided by their developers come from one-off implementations: they cannot be used to recover the variance information of the pretrained embeddings. Moreover, our analysis requires retraining those large models from scratch multiple times, which is very time-consuming and resource-intensive. The experiments are straightforward, and they clearly reveal the instability issues of pretrained embeddings and the benefits of the proposed featurization approach. In particular, we consider the window size (#ws) and the number of negative samples (#ns) as the configuration parameters of the Word2vec model. We vary their values and present the array of downstream performances. For each pretraining configuration, we conduct 10 independent runs to compute the empirical variance as a measure of h(x)'s instability in the downstream task. Our Word2vec pretraining follows the original implementation of Mikolov et al. (2013), which uses standard stochastic gradient descent with a decaying learning-rate schedule for 30 epochs. All the remaining settings are kept as default, and we observe they are sufficient for achieving a very small training loss (< 1e-6). For each set of pretrained word embeddings, we select λ from {1e-6, 1e-5, 1e-4, 1e-3, 1e-2} according to the validation accuracy. The train-validation-test split follows the standard 80%-10%-10%. We use the binary cross-entropy loss and use accuracy as the metric. We use the Adam optimizer with an initial learning rate of 1e-4 and early stopping (when the validation accuracy stops improving for five epochs). The remaining settings are kept as default.
For reference, when the embedding layer is trained jointly, the downstream accuracy following the above implementation is 86.25 (0.12). The downstream accuracies of using the different sets of pretrained embeddings are provided in Table A.1. The featurized representations ϕ_d(h) are obtained in the same way as described in Appendix F. In Table A.1, we observe almost identical patterns to our previous results in Table 1. In particular, it shows that directly using h(x) in the downstream task results in a high degree of instability under all pretraining configurations. The proposed featurization approach not only achieves better testing accuracy, but also improves the downstream stability by almost eight times on average. The results on the NLP task provide additional evidence for both the instability issue of pretrained representations and the benefits of the proposed featurization approach. Table A.1: Testing accuracy of using pretrained word embeddings for downstream sentiment classification. All results are multiplied by 100, and in the parenthesis is the standard deviation computed from 10 independent pretraining runs.



Since the pretraining uses years of history data, the proportion of new daily data is quite small.
https://www.imdb.com/interfaces/
https://www.tensorflow.org/text/tutorials/text_classification_rnn
https://grouplens.org/datasets/movielens/1m/
https://radimrehurek.com/gensim/models/doc2vec.html



Figure 1: Illustrating the stability issue of pretrained representation with MovieLens-1m. The details of the experiments are deferred to Appendix F. The empirical variances are computed from ten independent runs.

We can simply set x_i ∈ R^n as one-hot encodings, and W, Θ ∈ R^{d_0×d}, where they are allowed to coincide. Then we let h(x_i) = [W]_i or [Θ]_i depending on the context. The activation becomes the identity function, and ℓ(f(x_i), x_j) = log(1 − σ(h(x_i)^⊤ h(x_j))) (or log σ(h(x_i)^⊤ h(x_j))), with σ(•) being the sigmoid function.

.1 INDUCTIVE BIAS OF GRADIENT DESCENT

Our introduction primarily follows Soudry et al. (2018); Ji & Telgarsky (2019); Gunasekar et al.

Figure A.1: Replicating the results shown in Figure 1a, 1b, and 1c on IMDB dataset. (a). The dimension-wise variance of the pretrained embedding values of three randomly sampled reviews; (b). The convergence of the pretraining loss and accuracy. (c). The distribution of the empirical variances of the pretrained embedding values, as pretraining progresses.

Figure A.2: The two-tower architecture.

Figure A.5: Online A/B testing results, and monitoring of the per-visitor gross-merchandise value (GMV) during the testing period. App1, App2, App3, and App4 correspond to item-page recommendation, search ranking, email recommendation, and home-page marketing, respectively. The control baseline is simply the existing implementation of each downstream application, which uses the raw item embeddings h to refresh its models as usual. The treatment switches to using ϕ(h) in each implementation to obtain another version of the model. The outputs of the control model and treatment model are exposed to users pre-allocated into different buckets. The exact testing logic is quite involved and beyond the scope of our paper. The key takeaway is that the testing begins at almost the same time for all four applications, and the only treatment variable is which item embedding to use. The detailed testing results are provided in Figure A.5. For brevity, we only show the performance in terms of CTR since it is indicative for all four applications.

Figure A.6: The model architecture of the downstream sentiment analysis. In the original experiments, the embedding layer is trained jointly with the rest of the model. Here we replace it with pretrained word embeddings.

For the ML-1m experiments, the results are multiplied by 100, and in the parentheses are the standard deviations computed from ten independent pretraining runs.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2016.

Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Network representation learning: A survey. IEEE Transactions on Big Data, 6(1):3-28, 2018.

F.3 SUPPLEMENTARY RESULTS

We provide a sensitivity analysis for the featurization method. We focus on two variables: the dimension d and the variance of Q (denoted by σ²_Q). Recall that we consider Q to be a Gaussian distribution. We vary d in {16, 32, 64} and σ²_Q in {0.2, 0.4, 0.6, 0.8}. We then examine the sensitivity of the downstream performance w.r.t. Q, the sampling distribution for constructing ϕ_d(h). As stated before, we let Q be a zero-mean Gaussian distribution and vary its variance. From Figure A.4, we observe that for all the dimensions we consider, the downstream task under ϕ_d(h) is very stable under different σ_Q. This echoes Corollary 4, which shows that our approach enjoys robustness to the selection of Q. In real-world production, we have been using the standard Gaussian distribution and observed no issues.
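The robustness to Q can also be probed directly. With U ~ N(0, σ²_Q I), the random Fourier features approximate a Gaussian kernel, so the inner products ϕ(h)ᵀϕ(h′) that the downstream model sees remain well-behaved across the σ²_Q grid above. A self-contained numerical check (the grid values mirror our experiment; the helper `rff` and all other settings are illustrative):

```python
import numpy as np

def rff(H, m, sigma, seed):
    """Random Fourier features with directions drawn from N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0.0, sigma, size=(H.shape[1], m))
    Z = H @ U
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 16))                        # toy embeddings
# exact Gaussian kernel implied by Q: k(h, h') = exp(-sigma^2 ||h - h'||^2 / 2)
sq = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
for sigma2 in [0.2, 0.4, 0.6, 0.8]:
    phi = rff(H, m=256, sigma=np.sqrt(sigma2), seed=0)
    K_hat = phi @ phi.T                              # RFF kernel estimate
    K = np.exp(-sigma2 * sq / 2)
    err = np.abs(K_hat - K).max()
    # the approximation stays uniformly tight over the whole variance grid
    assert err < 0.4
```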

F.4 ONLINE DEPLOYMENT

To avoid potential conflicts of interest, we provide only an overview of our production experiments. We aim to provide enough detail for interested practitioners to draw inspiration, both for developing their own solutions and for replicating ours.

Some background introduction.

