SOME PRACTICAL CONCERNS AND SOLUTIONS FOR USING PRETRAINED REPRESENTATIONS IN INDUSTRIAL SYSTEMS

Abstract

Deep learning has dramatically changed the way data scientists and engineers craft features: the once tedious process of measuring and constructing can now be achieved by training learnable representations. Recent work shows pretraining can endow representations with relevant signals, and in practice they are often used as feature vectors in downstream models. In real-world production, however, we have encountered key problems that cannot be explained by existing knowledge. They raise concerns that the naive use of pretrained representations as feature vectors could lead to unwarranted and suboptimal solutions. Our investigation reveals critical insights into the gap of uniform convergence for analyzing pretrained representations, their stochastic nature under gradient descent optimization, what model convergence means for them, and how they might interact with downstream tasks. Inspired by our analysis, we explore a simple yet powerful approach that can refine pretrained representations in multiple ways, which we call Featurizing Pretrained Representations. Our work balances practicality and rigor, and contributes to both applied and theoretical research on representation learning.

1. INTRODUCTION

The ability of neural networks to learn predictive feature representations from data has long fascinated practitioners and researchers (Bengio et al., 2013). The learnt representations, if proven reliable, can potentially renovate the entire life cycle and workflow of industrial machine learning. Behind reliability are the three core principles for extracting information from data, namely stability, predictability, and computability (Yu, 2020). These three principles not only justify the practical value of learnt representations, but also lead to the efficiency, interpretability, and reproducibility that are cherished in real-world production. Since pretrained representations are optimized to align with the given task, intuitively, they should satisfy all three principles in a reasonable setting. However, when productionizing an automated pipeline for pretrained representations in an industrial system, we encountered key problems that cannot be explained by existing knowledge. In particular, while the daily refresh follows the same modelling and training configurations and uses essentially the same data¹, downstream model owners reported unexpectedly high fluctuations in

¹ Since the pretraining uses years of history data, the proportion of new daily data is quite small.



Figure 1: Illustrating the stability issue of pretrained representations with MovieLens-1m. The details of the experiments are deferred to Appendix F. The empirical variances are computed from ten independent runs.
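As a toy illustration of how such run-to-run variance can arise and be measured, the sketch below pretrains a linear representation from several random initializations and computes the empirical variance of the resulting feature vectors across runs. This is a hypothetical stand-in written for this note, not the paper's actual pipeline or the MovieLens experiment; the objective, dimensions, and learning rate are all illustrative assumptions.

```python
import numpy as np

def train_representation(X, dim, seed, steps=200, lr=0.05):
    """Learn a linear map W by minimizing the reconstruction error
    ||X W W^T - X||^2 with gradient descent (a stand-in for pretraining).
    Only the random initialization depends on the seed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, dim))
    for _ in range(steps):
        R = X @ W @ W.T - X                      # reconstruction residual
        grad = 2 * (X.T @ R @ W + R.T @ X @ W)   # gradient of the loss in W
        W -= lr * grad / n
    return X @ W                                 # "pretrained" representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # fixed synthetic data

# Ten independent runs that differ only in their initialization seed,
# mirroring the repeated-refresh setting: same data, same configuration.
reps = np.stack([train_representation(X, dim=2, seed=s) for s in range(10)])

# Empirical variance of each representation entry across the ten runs.
per_entry_var = reps.var(axis=0)
print(per_entry_var.mean())
```

Even though every run sees identical data and hyperparameters, the learned subspace is only identified up to rotation and sign, so the raw feature vectors can differ substantially between runs; averaging the per-entry variance makes that instability quantifiable, in the spirit of the variances reported in Figure 1.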

