PROXIMAL VALIDATION PROTOCOL

Abstract

Modern machine learning algorithms are generally developed under a train/validation/test split protocol. In particular, when no accessible test set exists during real-world ML development, how to split out a validation set becomes crucial for reliable model evaluation, selection, etc. Concretely, under a randomized splitting setup, the split ratio of the validation set acts as a vital meta-parameter: the more data reserved for validation, the less remains for training, which costs model performance, and vice versa. Unfortunately, this implies a vexing trade-off between performance enhancement and trustworthy model evaluation. However, to date, research on this line remains scarce. We reason this could be due to a workflow gap between academia and ML production, which we attribute to a form of ML technical debt. In this article, we propose a novel scheme -dubbed Proximal Validation Protocol (PVP) -which targets this problem of validation set construction. The core of PVP is to assemble a proximal set as a substitute for the traditional validation set, thereby avoiding wasting valuable data that could otherwise be used for training. The proximal validation set is constructed with dense data augmentation followed by a novel distribution-consistent sampling algorithm. With extensive empirical findings, we show that PVP works (much) better than all existing validation protocols on three data modalities (images, text, and tabular data), demonstrating its feasibility for ML production.

1. INTRODUCTION

Most, if not all, machine learning production and research is conducted under a train/validation/test split protocol. A machine learning engineer or scientist typically first receives a labeled dataset and splits it into a training set and a validation set. The validation set is critical for robust model evaluation, selection, hyper-parameter tuning, etc. After the validation protocol, the best model selected is passed to the testing protocol; in real-world ML development, the testing set is generally not accessible until this phase. Notably, prior to splitting the labeled dataset, one needs to determine the split ratio of the validation set against the training set. This ratio can be very tricky: if fewer samples are reserved for validation, the model validation can be less reliable; conversely, a larger validation set effectively reduces the training data, which may lead to performance degradation. Today, this ratio is often set based on the experience of a human expert. The same problem, anchored at the split ratio, also appears in more complex validation schemes such as cross-validation. Indeed, the problem of setting a (sub-)optimal validation set is often, or mostly, ignored by the academic cohorts in the community. To date, as we scrutinize the related literature, very few works have touched on this line (Li et al., 2020; Moss et al., 2018; Joseph & Vakayil, 2021). In hindsight, a large portion of standardized academic benchmarks come with a prefixed validation split, such as ImageNet (Krizhevsky et al., 2012), COCO (Lin et al., 2014), and SST (Socher et al., 2013). Moreover, the testing set is often visible for the evaluation of academic research. On the one hand, this prefixed validation setup has some merit.
For instance, it effectively dedicates ML research to innovation in model architectures, optimization methods, new learning paradigms, etc. On the other hand, we argue it could contribute to the technical debt (Sculley et al., 2015) of ML. When considering ML production for real-world applications, we make the following statements: (i) few applications come with adequate or large-scale data, because data curation and annotation both cost a fortune; (ii) the testing set is often not accessible: consider CTR prediction or manufacturing defect detection, where the testing data appear only after deployment; (iii) the validation set is almost always decided by ML experts based on their expertise. In these scenarios, how to split out a validation set may sit at the center. The benign condition -where a validation set is preset and fixed -almost never holds in ML production. In this regard, we propose the Proximal Validation Protocol (dubbed PVP). With this novel validation protocol, we attempt to (fully) resolve the split problem and its trade-off. The core idea of PVP is rather simple. It first synthetically generates a validation set based on the labeled dataset without any splitting. Then, a novel distribution-consistent sampling algorithm is applied to select the most suitable synthetic data points for validation. The resulting set is dubbed the proximal validation set. Thanks to the proximal validation set, PVP (in theory) does not rely on any real labeled data point for validation, effectively leading to performance improvement. Notably, the comparison of PVP with the conventional validation protocol is graphically depicted in Figure 1. Empirically, we conduct extensive experiments on three data modalities -tabular data, image data, and text data.
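The two-step idea just described can be illustrated with a toy sketch for tabular data. This is not the paper's actual algorithm: the jitter-based augmentation and the nearest-real-neighbor selection below are illustrative stand-ins for the dense augmentation and distribution-consistent sampling steps, and all names are hypothetical.

```python
import numpy as np

def proximal_validation_set(X, y, n_val, n_aug=10, noise=0.05, seed=0):
    """Toy sketch of the PVP idea for tabular data.

    Step 1 (dense augmentation, illustrative): jitter every labeled
    point n_aug times with feature-scaled Gaussian noise.
    Step 2 (crude proxy for distribution-consistent sampling): keep
    the n_val synthetic points closest to the real data, scored by
    nearest-real-neighbor distance.
    All real points remain available for training.
    """
    rng = np.random.default_rng(seed)
    scale = X.std(axis=0, keepdims=True) + 1e-12
    # Step 1: generate n_aug jittered copies of each real point.
    cand = np.concatenate(
        [X + rng.normal(0.0, noise, X.shape) * scale for _ in range(n_aug)]
    )
    cand_y = np.tile(y, n_aug)
    # Step 2: distance from each candidate to its nearest real point.
    d = np.linalg.norm(cand[:, None, :] - X[None, :, :], axis=-1).min(axis=1)
    keep = np.argsort(d)[:n_val]
    return cand[keep], cand_y[keep]
```

Under this scheme, the model would train on all of `(X, y)` and be validated on the returned synthetic set, so no real labeled point is withheld from training.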
We compare PVP against standardized methods such as the holdout protocol and K-fold cross-validation, as well as the very limited related work such as Joseph & Vakayil (2021). Besides a series of analytical justifications, we choose three major metrics to form a fair and comprehensive comparison: performance, t-v gap, and variance. Here, performance means the test score (e.g., AUC and accuracy) of a model, variance refers to the stability of the estimated performance (on the validation set) under different random seeds, and t-v gap indicates the closeness of the estimated performance to the test performance. We empirically show that PVP achieves better performance, lower bias, and competitive variance compared with the standardized split-relied methods. With three major data modalities experimented on, we hope the ideology and instantiation of PVP can pave the way toward a more effective validation protocol for ML production on real-world applications. Finally, we summarize the contributions of this work as follows:
• We propose a novel validation framework -PVP -a stable and reliable validation protocol relying only on synthetic data while being capable of enhancing model performance.
• The decent empirical results of PVP on three major data modalities manifest its "plug-and-play" nature. Its design is very much input-data-dependent but independent of models, architectures, optimizers, and tasks. We hope PVP can shed some light on a data-saving, performance-effective, lightweight, and principled validation procedure. The code of PVP will be made public upon publication.
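The three comparison metrics above can be made concrete. A minimal sketch, assuming per-seed validation and test scores are already collected (the numbers in the usage line are illustrative, not from the paper):

```python
import statistics

def summarize(val_scores, test_scores):
    """Compute the three comparison metrics from per-seed scores.

    performance: mean test score across seeds;
    t-v gap: mean absolute difference between the validation estimate
    and the test score, i.e. how close validation is to the truth;
    variance: variance of the validation estimates across random seeds,
    i.e. how stable the estimate is.
    """
    performance = statistics.fmean(test_scores)
    tv_gap = statistics.fmean(
        abs(v - t) for v, t in zip(val_scores, test_scores)
    )
    variance = statistics.variance(val_scores)
    return performance, tv_gap, variance

# Three random seeds, illustrative AUC-style scores.
perf, gap, var = summarize([0.91, 0.93, 0.92], [0.90, 0.90, 0.91])
```

A good validation protocol should score high on `performance` while keeping both `gap` and `var` small.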

2. RELATED WORK

As we mentioned, the related literature remains scarce. Looking back, the validation framework was introduced to address the issue of overfitting (Mosteller & Tukey, 1968; Stone, 1974; Geisser, 1975), which was first noticed by Larson (1931). Due to the universality of the data-splitting heuristic, split-relied methods can be applied to almost any algorithm in almost any framework (Arlot & Celisse, 2010).
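For reference, the simplest member of this split-relied family is the randomized holdout split, sketched below; `val_ratio` is the meta-parameter whose trade-off motivates this work (function and variable names are illustrative):

```python
import numpy as np

def holdout_split(X, y, val_ratio=0.2, seed=0):
    """Randomly split a labeled dataset into training and validation sets.

    val_ratio is the split-ratio meta-parameter: a larger value gives a
    more reliable validation estimate but leaves less data for training.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(round(len(X) * val_ratio))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Example: 100 labeled points, 20% reserved for validation.
X = np.arange(100).reshape(100, 1)
y = np.arange(100) % 2
X_tr, y_tr, X_val, y_val = holdout_split(X, y, val_ratio=0.2)
print(len(X_tr), len(X_val))  # 80 20
```

K-fold cross-validation generalizes this by rotating which fold serves as the validation set, but the same data-budget trade-off persists within each fold.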



Figure 1: Traditional validation scheme (left) vs. Proximal Validation Protocol (right).

