SOME PRACTICAL CONCERNS AND SOLUTIONS FOR USING PRETRAINED REPRESENTATION IN INDUSTRIAL SYSTEMS

Abstract

Deep learning has dramatically changed the way data scientists and engineers craft features: the once tedious process of measuring and constructing can now be achieved by training learnable representations. Recent work shows that pretraining can endow representations with relevant signals, and in practice they are often used as feature vectors in downstream models. In real-world production, however, we have encountered key problems that cannot be justified by existing knowledge. They raise concerns that the naive use of pretrained representations as feature vectors could lead to unwarranted and suboptimal solutions. Our investigation reveals critical insights into the gap of uniform convergence for analyzing pretrained representations, their stochastic nature under gradient descent optimization, what model convergence means for them, and how they might interact with downstream tasks. Inspired by our analysis, we explore a simple yet powerful approach that can refine pretrained representations in multiple ways, which we call Featurizing Pretrained Representations. Our work balances practicality and rigor, and contributes to both applied and theoretical research on representation learning.

1. INTRODUCTION

The ability of neural networks to learn predictive feature representations from data has always fascinated practitioners and researchers (Bengio et al., 2013). The learnt representations, if proved reliable, can potentially renovate the entire life cycle and workflow of industrial machine learning. Behind reliability are the three core principles for extracting information from data, namely stability, predictability, and computability (Yu, 2020). These three principles not only justify the practical value of learnt representations, but also lead to the efficiency, interpretability, and reproducibility that are cherished in real-world production. Since pretrained representations are optimized to align with the given task, intuitively, they should satisfy all three principles in a reasonable setting. However, when productionizing an automated pipeline for pretrained representations in an industrial system, we encountered key problems that cannot be justified by existing knowledge. In particular, while the daily refresh follows the same modelling and training configurations and uses essentially the same data 1, downstream model owners reported unexpectedly high fluctuations in performance when retraining their models. For illustration purposes, here we reproduce the issue using benchmark data, and take one further step where the pretraining is repeated on exactly the same data, under the same model configuration, training setup, and stopping criteria. We implement ten independent runs to essentially generate i.i.d. versions of the pretrained representation. We first visualize the dimension-wise empirical variances of the pretrained representations, provided in Figure 1a. It is surprising to find that while the pretraining losses almost converge to the same value in each run (Figure 1b), there is such a high degree of uncertainty about the exact values of each dimension.
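To make the measurement concrete, the following sketch shows one way to compute the dimension-wise empirical variance across repeated pretraining runs, and the run-to-run fluctuation of a downstream logistic regression fit on each run's representation. This is our own illustration, not the paper's code: the ten runs are simulated as a shared matrix plus independent noise, and the logistic regression is a minimal gradient-descent implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for ten independent pretraining runs: each run yields
# an (n_items, d) representation matrix. In practice each Z_k would come from
# rerunning the same pretraining pipeline on the same data.
n, d, n_runs = 1000, 32, 10
base = rng.normal(size=(n, d))
runs = [base + 0.5 * rng.normal(size=(n, d)) for _ in range(n_runs)]

# Dimension-wise empirical variance across runs: for each item i and
# dimension j, the variance of Z_k[i, j] over runs k, averaged over items.
stacked = np.stack(runs)                      # (n_runs, n, d)
dim_var = stacked.var(axis=0).mean(axis=0)    # (d,)

# Downstream fluctuation: fit the same logistic regression on each run's
# representation and record test accuracy.
y = (base[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(float)
tr, te = slice(0, 800), slice(800, None)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

accs = []
for Z in runs:
    w, b = fit_logreg(Z[tr], y[tr])
    pred = (Z[te] @ w + b > 0).astype(float)
    accs.append((pred == y[te]).mean())

print("mean dim variance:", dim_var.mean())
print("accuracy std over runs:", np.std(accs))
```

In this synthetic setting the spread of `accs` is driven entirely by the representation noise, mirroring the argument that with a convex downstream model the fluctuation can only come from the instability of the pretrained representations.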
Further, in Figure 1c, we observe that the uncertainty (empirical variance) of the pretrained representation increases as the pretraining progresses. In the downstream task where pretrained representations are used as feature vectors (see the right figure), we observe that the performance does fluctuate wildly from run to run. Since we use logistic regression as the downstream model, the fluctuation can only be caused by the instability of the pretrained representations, because we can effectively optimize the downstream model to the global optimum. To demonstrate that the above phenomenon is not caused by using a specific model or data, we also experiment with a completely different pretraining model and benchmark data from another domain. We perform the same analysis, and unfortunately the same issues persist (Figure A.1 in the Appendix). Existing deep learning theory, both the convergence and generalization results (we will discuss them more in Section 2), fails to explain why we should expect pretrained representations to work well in a downstream task when their exact values are so unstable. This is especially concerning for industrial systems, as the issue can lead to unwarranted and suboptimal downstream solutions. We experienced this issue firsthand in production, so we are motivated to crack the mysteries behind pretrained representations, and understand if and how their stability can be improved without sacrificing predictability and computability. We summarize our contributions as below.
• We provide a novel uniform convergence result for pretrained representations, which points out gaps that relate to the stability and predictability issues.
• We break down and clarify the stability issue by revealing the stochastic nature of pretrained representations, the convergence of model output, and the stable and unstable components involved.
• We investigate the interaction between pretrained representations and downstream tasks in both parametric and non-parametric settings, each revealing how predictability can benefit or suffer from stability (or instability) for particular usages of pretrained representations.
• We discuss the idea of featurizing pretrained representations, and propose a highly practical solution that has nice guarantees and balances stability, predictability, and computability. We also examine its effectiveness in real-world experiments and online testing.

1 Since the pretraining uses years of history data, the proportion of new daily data is quite small.

Figure 1: Illustrating the stability issue of pretrained representation with MovieLens-1m. The details of the experiments are deferred to Appendix F. The empirical variances are computed from ten independent runs.

2. RELATED WORK

It was not until recent years that deep learning theory saw major progress. Zhang et al. (2016) observed that the parameters of neural networks stay close to initialization during training. At initialization, wide neural networks with random weights and biases are Gaussian processes, a phenomenon first discussed by Neal (1995) and recently refined by Lee et al. (2017) and Yang (2019). However, these works do not consider the effect of optimization. The Neural Tangent Kernel provides a powerful tool to study the limiting convergence and generalization behavior of gradient descent optimization (Jacot et al., 2018; Allen-Zhu et al., 2019), but it sometimes fails to capture meaningful characteristics of practical neural networks (Woodworth et al., 2020; Fort et al., 2020). Moreover, those works require the parameters to stay close to initialization, a regime in which useful representation learning would not take place.

Indeed, it has also caught people's attention that representation learning can go beyond the neural tangent kernel regime (Yehudai & Shamir, 2019; Wei et al., 2019; Allen-Zhu & Li, 2019; Malach et al., 2021), among which a line of work connects the continuous-time training dynamics with mean-field approximation (Mei et al., 2018; Sirignano & Spiliopoulos, 2020), and another direction studies the lazy training regime (Chizat et al., 2019; Ghorbani et al., 2019) where only the last layer of a neural network is trained. Unfortunately, their assumed training schemas all deviate from practical

