PROXIMAL VALIDATION PROTOCOL

Abstract

Modern machine learning development is generally built upon a train/validation/test split protocol. In particular, since the test set is typically inaccessible during real-world ML development, how to carve out a validation set becomes crucial for reliable model evaluation and selection. Concretely, under a randomized splitting setup, the validation split ratio acts as a vital meta-parameter: the more data reserved for validation, the more reliable the evaluation, but model performance suffers from the reduced training data, and vice versa. This implies a vexing trade-off between performance enhancement and trustworthy model evaluation. To date, however, research on this problem remains scarce. We reason this could be due to a workflow gap between academia and ML production, which we attribute to a form of technical debt in ML. In this article, we propose a novel scheme, dubbed the Proximal Validation Protocol (PVP), which targets this problem of validation set construction. The core of PVP is to assemble a proximal set as a substitute for the traditional validation set, so that no valuable data is withheld from the training procedure. The proximal validation set is constructed via dense data augmentation followed by a novel distributionally consistent sampling algorithm. Through extensive experiments, we show that PVP works (much) better than all other existing validation protocols on three data modalities (images, text, and tabular data), demonstrating its feasibility for ML production.

1. INTRODUCTION

Most, if not all, machine learning production and research is conducted under a train/validation/test split protocol. A machine learning engineer or scientist typically receives a labeled dataset and splits it into a training set and a validation set. The validation set is critical for robust model evaluation, model selection, hyper-parameter tuning, etc. After validation, the best model is passed to the testing protocol; in real-world ML development, the test set is generally not accessible until this phase. Notably, prior to splitting the labeled dataset, one must decide the split ratio between the validation and training sets. This ratio can be tricky: if fewer samples are reserved for validation, model evaluation becomes less reliable; conversely, a larger validation set reduces the data available for training, which may degrade performance. In practice, this ratio is often set based on the experience of a human expert. The same issue arises in more elaborate validation schemes such as cross-validation.

Indeed, the problem of choosing a (sub-)optimal validation set is often, if not mostly, ignored by the academic community. To date, as we scrutinize the related literature, very few works have touched on this problem (Li et al., 2020; Moss et al., 2018; Joseph & Vakayil, 2021). In hindsight, a large portion of standardized academic benchmarks come with a predefined validation split, such as ImageNet (Krizhevsky et al., 2012), COCO (Lin et al., 2014), and SST (Socher et al., 2013). Moreover, the test set is often visible for evaluation in academic research. On the one hand, this predefined validation setup has some merit.
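The split-ratio trade-off described above can be made concrete with a minimal randomized split. The function and toy dataset below are hypothetical, for illustration only: a larger `val_ratio` yields a larger (and hence statistically more reliable) validation set, but directly shrinks the training set.

```python
import random

def split_dataset(data, val_ratio, seed=0):
    """Randomly split `data` into (train, validation) lists.

    Illustrates the trade-off from the text: every sample moved into
    the validation set is a sample removed from training.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_val = int(round(len(data) * val_ratio))
    val_idx = set(indices[:n_val])
    train = [x for i, x in enumerate(data) if i not in val_idx]
    val = [x for i, x in enumerate(data) if i in val_idx]
    return train, val

# Toy labeled dataset of 1000 samples (hypothetical).
data = list(range(1000))
for r in (0.1, 0.2, 0.5):
    train, val = split_dataset(data, r)
    print(f"val_ratio={r}: {len(train)} train / {len(val)} val")
```

As the loop shows, moving from a 10% to a 50% validation split halves the training data, which is exactly the performance-versus-evaluation tension the split ratio encodes.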
For instance, it lets ML research focus on innovations in model architecture, optimization methods, new learning paradigms, etc. On the other hand, we argue that it contributes to the technical debt (Sculley et al., 2015) of ML. When considering ML production for real-world applications, we make the following observations: (i) not many applications come with adequate or large-scale data because

