PROXIMAL VALIDATION PROTOCOL

Abstract

Modern machine learning algorithms are generally built upon a train/validation/test split protocol. Because the test set is typically inaccessible during real-world ML development, how the validation set is split out becomes crucial for reliable model evaluation and selection. Under randomized splitting, the split ratio of the validation set acts as a vital meta-parameter: allocating more data to validation costs model performance because less data remains for training, and vice versa. This implies a vexing trade-off between performance and trustworthy model evaluation. To date, however, very little research has addressed this problem. We reason that this is due to a workflow gap between academia and ML production, which can be seen as a form of technical debt in ML. In this article, we propose a novel scheme, dubbed Proximal Validation Protocol (PVP), which targets this problem of validation set construction. The core of PVP is to assemble a proximal set as a substitute for the traditional validation set, so that no valuable labeled data is withheld from training. The proximal validation set is constructed with dense data augmentation followed by a novel distributional-consistent sampling algorithm. Extensive experiments on three data modalities (images, text, and tabular data) show that PVP works better than the existing validation protocols we compare against, demonstrating its feasibility for ML production.

1. INTRODUCTION

Most, if not all, machine learning production and research is conducted under a train/validation/test split protocol. A machine learning engineer or scientist typically first receives a labeled dataset and splits it into a training set and a validation set. The validation set is critical for robust model evaluation, model selection, hyper-parameter tuning, etc. After validation, the selected model is passed to the testing stage; in real-world ML development, the test set is generally not accessible until this phase. Notably, before splitting the labeled dataset, one must determine the split ratio of the validation set against the training set. This ratio can be very tricky: if fewer samples are allotted to validation, model validation becomes less reliable. Conversely, a larger validation set shrinks the training data, which may degrade performance. Today, this ratio is often set based on the experience of a human expert. The same problem, anchored at the split ratio, also arises in more complex validation schemes such as cross-validation. Indeed, the problem of choosing a (sub-)optimal validation set is often, or mostly, ignored by the academic community. Our survey of the related literature finds that very few works have touched on it (Li et al., 2020; Moss et al., 2018; Joseph & Vakayil, 2021). In hindsight, a large portion of standardized academic benchmarks come with a predefined validation split, such as ImageNet (Krizhevsky et al., 2012), COCO (Lin et al., 2014), and SST (Socher et al., 2013). Moreover, the test set is often visible when evaluating academic research. On the one hand, this predefined validation setup has some merit.
For instance, it lets ML research concentrate on the model: innovative architectures, optimization methods, new learning paradigms, etc. On the other hand, we argue that it contributes to the technical debt (Sculley et al., 2015) of ML. For ML production in real-world applications, we make the following observations: (i) few applications come with adequate or large-scale data, because data curation and annotation are both expensive; (ii) the test set is often not accessible: consider CTR prediction or manufacturing defect detection, where test data appears only after deployment; (iii) the validation set is almost always chosen by ML experts based on their experience. In these scenarios, how to split out a validation set becomes a central question. The benign condition, in which a validation set is preset and fixed, almost never holds in ML production. To this end, we propose the Proximal Validation Protocol (PVP). With this novel validation protocol, we attempt to resolve the split problem and its trade-off. The core idea of PVP is rather simple. It first synthetically generates a pool of candidate validation data from the labeled dataset, without any splitting. Then, a novel distributional-consistent sampling algorithm selects the most suitable synthetic data points for validation. The resulting set is dubbed the proximal validation set. Because of the proximal validation set, PVP (in theory) does not rely on any real labeled data point for validation, which frees all labeled data for training and thereby improves performance. The comparison of PVP with the conventional validation protocol is depicted in Figure 1. Empirically, we conduct extensive experiments on three data modalities: tabular, image, and text data.
We compare PVP with standard methods such as the holdout protocol and K-fold cross-validation, as well as the very limited related work such as Joseph & Vakayil (2021). Alongside a series of analytical justifications, we choose three major metrics to form a fair and comprehensive comparison: performance, T-V Gap, and variance. Here, performance is the test score (e.g., AUC or accuracy) of a model, variance refers to the stability of the estimated performance (on the validation set) under different random seeds, and T-V Gap indicates how close the estimated performance is to the test performance. We empirically show that PVP achieves better performance, lower bias, and competitive variance compared with the standard split-relied methods. With three major data modalities covered, we hope that the idea and instantiation of PVP can pave the way for a more effective validation protocol for ML production in real-world applications. Finally, we summarize the contributions of this work as follows:
• We propose a novel validation scheme, PVP: a stable and reliable validation protocol that relies only on synthetic data while being capable of enhancing model performance.
• The empirical results of PVP on three major data modalities manifest its "plug-and-play" nature. Its design depends on the input data but is independent of the model, architecture, optimizer, and task.
We hope PVP can shed some light on data-saving, performance-effective, and lightweight validation procedures. The code of PVP will be made public upon publication.

2. RELATED WORK

As mentioned, the related literature is scarce. Historically, the validation framework was introduced to address overfitting (Mosteller & Tukey, 1968; Stone, 1974; Geisser, 1975), which was first noticed by Larson (1931). Owing to the universality of data-splitting heuristics, split-relied methods can be applied to almost any algorithm in almost any framework (Arlot & Celisse, 2010). Recent works aim to reduce the instability, imprecision, and time cost of the validation estimate. Specifically, instability refers to the variance across multiple training results with different random seeds on the same model (Moss et al., 2018), and imprecision means the gap between the model's evaluation results on the validation dataset and on the test dataset (Zeng & Martinez, 2000). These works can be divided into two categories. Some explore variants of traditional data-split validation frameworks such as holdout and k-fold (Moss et al., 2018; Kohavi et al., 1995; Jiang & Wang, 2017; Székely & Rizzo, 2013; Jung, 2018; Li et al., 2020; Tiittanen et al., 2021; Zeng & Martinez, 2000). Others propose better split algorithms for generating the validation dataset. Joseph & Vakayil (2021) and Budka & Gabrys (2012) propose methods that sample a specific subset from the training set to form a validation set in the tabular scenario. Joseph (2022) derives the optimal train/validation splitting ratio theoretically, but only for the linear regression model. However, most, if not all, of these works are confined to a "split-relied" framework and therefore, following the reasoning in the previous section, suffer from the train/validation split trade-off. We regard PVP as a pioneering attempt to be fully split-free.

3.1. PROBLEM SETUP

To ease the discussion of validation methods, we start with the problem setup. First, we define a source set $D_{src}$ as the compilation of the training set and the potential validation set, i.e., an unsplit (labeled) training set. Similarly, the test set is denoted $D_{te}$. Further, we follow the well-known assumption that the instances of both datasets are i.i.d. and drawn from an unknown underlying distribution $F(\mathcal{X}, \mathcal{Y})$:

$$D_{src} = \{(x_i^{src}, y_i^{src})\}_{i=1}^{N}, \qquad D_{te} = \{(x_i^{te}, y_i^{te})\}_{i=1}^{M}, \qquad (x_i, y_i) \overset{iid}{\sim} F(\mathcal{X}, \mathcal{Y}), \tag{1}$$

where $N$ and $M$ denote the numbers of labeled instances in the source set and the test set, respectively, and $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ are the input and label spaces. From the source set $D_{src}$, we further define the validation and training sets so as to formulate the traditional validation process (e.g., the holdout scheme).

Definition 3.1 (Validation Process). We define a validation set $D_{val} = \complement(D_{src})$, where $\complement$ is a (stochastic) collecting function whose outcome depends on the random seed. This in turn yields the training set $D_{tr} = D_{src} \setminus D_{val}$, obtained by removing the validation samples from the source set. We further define a mapping function $f: \mathcal{X} \to \mathcal{Y}$. Finally, we define the validation process as $\S(f, D_{val})$, where $\S$ is a scoring function that combines the inference pass with an evaluation metric, e.g., accuracy or F1-score.

We now describe the metrics used to judge the quality of a validation method. We formally define the T-V Gap and the Variance from the perspective of validation, following previous works (Arlot & Celisse, 2010; Zeng & Martinez, 2000).

Definition 3.2 (T-V Gap). We define the test-validation gap (T-V Gap) as the difference between the test score (i.e., the performance on the data space $(\mathcal{X}, \mathcal{Y})$) and the estimated score of the practically obtained function:

$$\big|\S(f, (\mathcal{X}, \mathcal{Y})) - \S(f, D_{val})\big|. \tag{2}$$

Definition 3.3 (Variance). Let $\{\complement_i\}_{i=1}^{n}$ be a set of collecting functions under different random initializations. The variance is defined as

$$\frac{1}{n} \sum_{i=1}^{n} \Big(\S(f, \complement_i(D_{src})) - \overline{\S(f, \complement(D_{src}))}\Big)^2, \tag{3}$$

where the overlined term is the mean score over the $n$ collections.

Intuitively, the T-V Gap signifies the inaccuracy of the evaluation: a small T-V Gap corresponds to a precise estimate of the true score (usually the performance in the real-world test environment). The Variance measures the stability of the evaluation: low variance means the estimated score is robust to the random seed of the collecting function $\complement$. Ideally, with near-zero T-V Gap and Variance, the evaluation is reliable and stable, and can identify the function $f$ with the highest score on the test set.

When more labeled data is allotted to validation (which is often impractical), i.e., gathered by $\complement$ or $\complement_i$, the overlap among the sets chosen under different seeds grows. On the one hand, this effectively reduces the Variance; it also brings $D_{val}$ closer to $(\mathcal{X}, \mathcal{Y})$, which reduces the T-V Gap. On the other hand, it reduces the labeled samples available for training, which may degrade the test score of the algorithm. We term this tension, which arises in the pursuit of a decent validation set, the "split trade-off". In conventional randomized cross-validation or vanilla holdout schemes, setting the ratio between the samples used for training and those used for validation is a persistent difficulty. We find that research on this question is largely absent: most work treats the split ratio as a preset meta-parameter and keeps it fixed throughout development. Given its importance, we scrutinize this problem and propose a systematic "split-free" solution.
As a result, we build a framework, PVP, for $\complement'$ and argue that it requires the following two critical components (colored red in Figure 1):
• A data generator $\mathcal{G}$ that produces candidate samples from the source set $D_{src}$, which we define as the auxiliary set $D_{aux} = \mathcal{G}(D_{src})$. When all labeled samples of $D_{src}$ are kept for training, no data is available for constructing a validation set; to overcome this obstacle, we generate a set of synthetic data. As a proof of concept, we resort to the simplest method, the data augmentation family, to implement data synthesis. The reason is twofold: (i) the augmented data can in theory be arbitrarily large, which may contribute to low variance by providing sufficient samples for validation; (ii) data augmentation methods are general and adapt to most data modalities and tasks.
• A distributional-consistent sampling algorithm $\mathcal{A}$ that selects distributionally representative samples from the auxiliary set to form a validation set. The chief challenge is how to locate suitable samples in the auxiliary set, which can be chaotic and biased due to the random nature of data augmentation. Again, we propose a frustratingly simple method that relies on an angle-based distribution approximation (see Figure 2) to pursue a small T-V Gap.
In general, $\complement'$ can be formulated as $\complement' = \mathcal{A}(\mathcal{G}(\cdot))$. The output $\complement'(D_{src})$, i.e., the proximal validation set $D_{pro}$, can replace $D_{val}$ in the validation process (Definition 3.1). With PVP, we explore the possibility of relying purely on generated samples to construct the validation process. Practically, this offers a clear advantage over the traditional split-relied validation process: it saves the labeled samples for training, which likely leads to performance gains.
In the following, in correspondence to the aforementioned bullets, we detail the instantiation of the PVP framework.

3.2.2. DATA GENERATOR

The first step in facilitating a proximal validation process is to form a candidate pool for the proximal validation set that uses no original data points from the source set. We introduce an external data generator $\mathcal{G}$ to construct this pool. $\mathcal{G}$ can be implemented by many existing methods. Specifically, we deliberately choose data augmentation approaches and adapt them to the requirements of high validation quality set out in Section 3.1. We use this module to generate a large number of synthetic candidate data points, which is possible thanks to the continuous nature of augmentation. By forming a large pool, we aim to bring down the variance metric during the final proximal validation. In addition, as pointed out by Xu et al. (2022), a data augmentation scheme, when designed and implemented properly, is capable of covering most of the data space while providing denser coverage. This relates to the T-V Gap metric (Definition 3.2), which is addressed by the following module. Formally, we define the data generator $\mathcal{G}$ and the candidate pool $D_{aux}$ (the auxiliary set) as

$$D_{aux} = \mathcal{G}(D_{src}) = \{g_0(D_{src}), g_1(D_{src}), \dots, g_Q(D_{src})\}, \tag{4}$$

where $\mathcal{G}$ consists of a set of $Q$ augmentation functions, i.e., $\mathcal{G} = \{g_0, g_1, \dots, g_Q\}$. We apply PVP to three fields; the detailed augmentation methods are listed in Appendix A.1.

There is admittedly a rich literature on data augmentation, but it mostly embeds augmentation into the training stage to feed more samples to the model, rather than using it for validation. Data generation is another feasible way to build the auxiliary set. However, given the extensibility and computational advantages of data augmentation methods, we adopt them for PVP and leave the exploration of dataset generation methods, such as generative modeling (Goodfellow et al., 2014) and dataset distillation (Wang et al., 2018), to future work.
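The data generator described above can be sketched as follows for tabular features. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the two augmentation functions (`jitter`, `intra_class_mix`), their parameters, and the `times` argument are hypothetical stand-ins for the methods listed in Appendix A.1.

```python
import numpy as np

def jitter(X, y, rng, scale=0.05):
    # Hypothetical g_0: perturb each feature with small Gaussian noise,
    # scaled by that feature's standard deviation. Labels are unchanged.
    noise = rng.normal(0.0, scale, size=X.shape) * X.std(axis=0, keepdims=True)
    return X + noise, y.copy()

def intra_class_mix(X, y, rng, a=2.0, b=2.0):
    # Hypothetical g_1: interpolate each sample with a random partner of the
    # same class using a Beta(a, b) weight, so the label stays valid.
    X_new = np.empty_like(X, dtype=float)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        partner = rng.choice(idx, size=idx.size)
        lam = rng.beta(a, b, size=(idx.size, 1))
        X_new[idx] = lam * X[idx] + (1.0 - lam) * X[partner]
    return X_new, y.copy()

def build_auxiliary_set(X_src, y_src, augmentations, times=5, seed=0):
    # D_aux is the union of the outputs of every augmentation function,
    # each applied `times` times to the whole source set.
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for g in augmentations:
        for _ in range(times):
            X_aug, y_aug = g(X_src, y_src, rng)
            X_parts.append(X_aug)
            y_parts.append(y_aug)
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

Because every function returns label-preserving copies, the source set itself is never consumed: all of $D_{src}$ remains available for training, and the pool size is controlled only by the number of functions and repetitions.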

3.2.3. DISTRIBUTIONAL-CONSISTENT SAMPLING ALGORITHM

The data augmentation module above can, in theory, produce an arbitrarily large number of samples for validation. The module that follows selects the most suitable samples to form the proximal validation set, which takes the place of the validation set in the conventional workflow. As mentioned in Section 3.1, high evaluation quality requires low T-V Gap and variance. This can be achieved if the distribution of $D_{val}$ differs only slightly from $F(\mathcal{X}, \mathcal{Y})$ and $D_{val}$ is large. However, we do not have the precise form of $F$; we only have $D_{src}$, a set of realizations from $F$. Thus, we use the empirical distribution of the source set as a substitute for $F$. Following this reasoning, we propose the distributional-consistent sampling algorithm $\mathcal{A}$ to select sufficiently many samples that are distributionally representative. As mentioned in Section 3.2.2, while the augmented data may cover the data space of the source set, we must treat it with care because of the inductive bias of the data augmentation methods. As the literature reveals (Zhang et al., 2015; He et al., 2019), the distribution of an auxiliary set produced by a set of augmentations may drift from that of the source set. If we directly employed the auxiliary set as the proximal validation set, this drift could produce a large T-V Gap in the estimated score. Therefore, a sampling algorithm is indispensable in proximal validation set construction for less biased evaluation. We propose the simplest solution: the algorithm first characterizes the empirical distribution of the source set $D_{src}$ by an explicit angular distribution (Liu et al., 2020; 2017) and then samples angles from this explicit density to locate the corresponding points in the auxiliary set (as illustrated in Figure 2).

Distribution Estimation with Angles. The first step of our algorithm is to capture the empirical distribution of the source set.
We adopt the simplest yet efficient method, the intra-class angular distribution on the feature space (Liu et al., 2020; Kobayashi, 2021; Liu et al., 2017), as a proof of concept for PVP. Unlike common collecting functions $\complement$ that only consider the inter-class distribution (e.g., stratified random sampling), our $\complement'$ additionally measures intra-class distributions so as to control the distribution of the validation set more finely and ultimately keep the T-V Gap of the evaluation low. Specifically, we model the intra-class angular distribution on the angles between the samples of $D_{src}$ and their corresponding class centers. Given a sample $x_i^y$ with category label $y$, let $z_i^y = \Phi(x_i^y)$ be the features extracted by an extractor $\Phi$; we use BERT (Devlin et al., 2018) for text data and ResNet-18 (He et al., 2016) for images. We denote the angle computation by $A$ and define the angle of $x_i^y$ relative to its class center as follows:

$$\alpha_i^y = A(\Phi(x_i^y)) = \arccos\langle z_i^y, c^y\rangle, \qquad \langle a, b\rangle = \frac{a \cdot b}{\|a\| \cdot \|b\|}, \tag{5}$$

$$c^y = \sum_{i=1}^{N_y} \frac{\exp(w_i^y)}{\sum_{j=1}^{N_y} \exp(w_j^y)}\, z_i^y, \qquad w_i^y = \frac{1}{N_y - 1} \sum_{i' \neq i}^{N_y} \langle z_{i'}^y, z_i^y\rangle,$$

where $c^y$ and $N_y$ denote the class center and the number of labeled instances of class $y$, respectively. Note that $c^y$ is a weighted average of the features within class $y$. Assuming the angles follow a Gaussian distribution, we define the angular distribution for class $y$ as $P^{(y)}(\alpha^y; \theta)$, whose parameters $\theta$ can be obtained via maximum likelihood estimation (MLE):

$$\hat{\theta} = \arg\min_{\theta} -\sum_{i=1}^{N_y} \log p(\alpha_i^y \mid \theta).$$

Notably, $P^{(y)}$ is an explicit density function. In this way, we obtain the empirical distribution of each class in $D_{src}$.
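The angle-based estimation above can be sketched in NumPy for a single class. The sketch assumes that features have already been extracted into a matrix `Z` with one row per sample (so the extractor $\Phi$ is out of scope here) and that the class has at least two samples; the function names are illustrative only. For a Gaussian, the MLE has a closed form (sample mean and biased standard deviation), which is what `fit_gaussian_mle` returns.

```python
import numpy as np

def cosine_matrix(Z):
    # Pairwise normalized inner products <z_i, z_j> between rows of Z.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def class_center(Z):
    # w_i: mean cosine similarity of sample i to the other samples of the class.
    n = Z.shape[0]
    S = cosine_matrix(Z)
    w = (S.sum(axis=1) - np.diag(S)) / (n - 1)
    # Softmax over w gives the weights of the weighted average c^y.
    e = np.exp(w - w.max())
    return (e / e.sum()) @ Z

def angles_to_center(Z, c):
    # alpha_i = arccos of the cosine between z_i and the class center.
    cos = (Z @ c) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(c))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def fit_gaussian_mle(alpha):
    # Closed-form Gaussian MLE: theta = (mean, standard deviation).
    return float(alpha.mean()), float(alpha.std())
```

Running these three steps once per class on the source-set features yields one `(mean, std)` pair per class, i.e., the explicit density $P^{(y)}$ under the Gaussian assumption.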

Sample through Distribution.

With the estimated distribution of the source set, our goal is to find points that are representative of the distribution and use them as proximal validation samples. We conduct the process class by class: for each class, we draw angles from the explicit density function and locate the corresponding samples in the auxiliary set. Specifically, for a class $y$, we generate a set of angles by sampling from $P^{(y)}$:

$$A_y = \{\alpha \mid \alpha \sim P^{(y)}\}.$$

For each angle in $A_y$, we search $D_{aux}$ for the sample with the smallest angular gap, which yields the proximal validation set $D_{pro}$:

$$D_{pro} = \Big\{(x, y) \;\Big|\; \arg\min_{(x, y) \in D_{aux}} |A(\Phi(x)) - \alpha|,\ \forall \alpha \in A_y\Big\}_{y=1}^{L},$$

where $L$ is the number of categories. At this point, we obtain a distributionally representative $D_{pro}$ with sufficient data. It can be used directly for validation while all the data in $D_{src}$ is retained for training, as shown in Figure 1. The entire PVP process is shown in Algorithm 1.
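The class-wise selection step can be sketched as below. The sketch rests on several assumptions that the text does not fix: the angle of each auxiliary sample is precomputed (here taken relative to the source class center), the per-class density is the Gaussian `(mean, std)` fitted earlier, the number of angles drawn per class (`n_per_class`) is a free parameter, and repeated hits on the same auxiliary sample are collapsed, since $D_{pro}$ is defined as a set. All names are hypothetical.

```python
import numpy as np

def sample_proximal_indices(aux_angles, aux_labels, class_params, n_per_class, seed=0):
    """Return indices into the auxiliary set that form the proximal validation set.

    aux_angles   : 1-D array, angle of each auxiliary sample to its class center
    aux_labels   : 1-D array, class label of each auxiliary sample
    class_params : dict {label: (mean, std)} of the fitted angular Gaussians
    n_per_class  : dict {label: number of angles to draw for that class}
    """
    rng = np.random.default_rng(seed)
    chosen = []
    for label, (mu, sigma) in class_params.items():
        idx = np.flatnonzero(aux_labels == label)
        # Draw target angles from the explicit class density P^(y).
        targets = rng.normal(mu, sigma, size=n_per_class[label])
        for alpha in targets:
            # Pick the auxiliary sample of this class with the smallest angular gap.
            nearest = idx[np.argmin(np.abs(aux_angles[idx] - alpha))]
            chosen.append(nearest)
    # Set semantics: an auxiliary sample selected twice appears once.
    return np.unique(np.array(chosen))
```

With a large auxiliary pool, collisions are rare and the result contains close to `sum(n_per_class.values())` samples; with a small pool, the set can be noticeably smaller than the number of angles drawn.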

4.1. DATASET AND BASELINE

To validate our approach, we experiment with three data modalities: tabular data, images, and natural language. All datasets are classification tasks; notably, two are put in a long-tailed setup. The statistics of all datasets are listed in Appendix A.3. The datasets range in size from 400 to 20k samples.

Tabular Data. We adopt 4 publicly available datasets from UCI (details in Appendix A.3), which are also used in other works (Wang et al., 2020; Bi & Zhang, 2018; Moss et al., 2018). With balanced and sufficient data, it is relatively easy to achieve stable and low-bias evaluations via cross-validation, which makes comparisons between methods less meaningful and persuasive. Therefore, we pick a series of datasets with various imbalance ratios.

Computer Vision. We choose the common and challenging long-tailed distribution as our evaluation environment. Following previous works (Park et al., 2021; Cui et al., 2019), we construct long-tailed versions of CIFAR10/100 with imbalance factor ρ = 100 and 10, respectively, named CIFAR10/100-LT. The construction is detailed in Appendix A.3.

Natural Language Processing. We use the Reuters-21578 dataset (Dua & Graff, 2017) with the exact setting of J-K-Fold (Moss et al., 2018), where only the corn and wheat categories are used.

Model. For tabular data, we use the decision-tree-based model xgboost (Chen & Guestrin, 2016) to perform classification. For CIFAR-LT and Reuters, we train ResNet-44 (He et al., 2016) and DistilBert (Sanh et al., 2019) as classifiers, respectively. Note that we suppress randomness in the model by fixing random seeds and maintaining a consistent batch feeding order, so that the dataset (train/validation set) is the only variable. More implementation details, including all training hyper-parameters, are listed in Appendix A.4.

Method. Three baselines are considered. First, holdout with a train/validation split ratio of 8:2. Although holdout is widely used, especially in deep learning, owing to its efficiency, it suffers from instability and deviation because the validation set remains relatively small (e.g., a near few-shot setup when the dataset has only hundreds of samples). Second, k-fold CV with k = 10, which is more stable than holdout. Third, repeated k-fold CV (named J-K-Fold (Moss et al., 2018)), with the number of repeats and k set to 4 and 5, respectively. Theoretically, it further enhances stability at the cost of time. All three methods are based on stratified sampling, which keeps the class percentages in the validation and training sets essentially the same. Additionally, for the tabular scenario only, we add SPlit (Joseph & Vakayil, 2021) as an extra baseline. SPlit is the latest splitting work; it uses support points (Mak & Joseph, 2018) to sample a specific subset from the training set as the validation set. We set the sample ratio of SPlit to be the same as holdout (i.e., 8:2). Notably, the hyper-parameters mentioned above (i.e., 8:2 for holdout, 10 for k-fold CV, and 4 and 5 for J-K-Fold) are found by selecting the setting with the best validation quality on the three metrics. Considering time complexity, we run the baselines and PVP 5 times on the image and text datasets, and 100 times on each tabular dataset.

Metrics. To comprehensively evaluate validation methods, we compare three metrics simultaneously: variance, T-V Gap, and scores on the test set. (i) For variance, we calculate the standard deviation of the estimated scores (scores on the validation set) over all runs. (ii) For T-V Gap, we use the mean of the absolute gap between the scores on the validation set and on the test set over all runs, as in previous work (Zeng & Martinez, 2000; Budka & Gabrys, 2012). (iii) For test score, we use the mean of the scores on the test set over all runs. A good validation method combines a high test score with small variance and small T-V Gap.
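The three reported metrics follow directly from the per-run validation and test scores and can be sketched with the standard library. Whether the standard deviation uses the population or the sample form is not stated; the population form (`pstdev`) is assumed here, and the function name is illustrative.

```python
from statistics import mean, pstdev

def validation_quality(val_scores, test_scores):
    """Summarize one validation method over repeated runs.

    val_scores[i]  : score estimated on the validation set in run i
    test_scores[i] : score of the same model on the test set in run i
    """
    gaps = [abs(v - t) for v, t in zip(val_scores, test_scores)]
    return {
        "test_score": mean(test_scores),   # (iii) mean test score
        "tv_gap": mean(gaps),              # (ii) mean absolute test-validation gap
        "variance": pstdev(val_scores),    # (i) std of the estimated scores
    }
```

For example, with validation scores `[0.80, 0.84, 0.82]` and test scores `[0.81, 0.81, 0.81]`, the per-run gaps are 0.01, 0.03, and 0.01.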

4.2. COMPARISON TO OTHER METHODS

We report our results in Table 1. From the scores, we conclude that: (i) the performance of the models obtained by PVP is superior to that of all rivals, and the improvement is significant on 4 datasets, i.e., BankMarket, PageBlocks, Diabetes, and CIFAR10-LT; (ii) the T-V Gap is considerably reduced compared with all competitors, by 32.3% on average relative to the best baselines; (iii) the variance lies between that of holdout and 10-fold CV and is closer to the latter, which has higher evaluation stability. In a nutshell, PVP improves both the test score and the T-V Gap while keeping the variance at a competitive level. These main results support the feasibility of PVP as a proximal, split-free validation setup.

4.3. WHY DOES PVP IMPROVE MODEL PERFORMANCE?

Table 1 shows that the F1 score is consistently improved on all datasets compared with all baselines. We attribute this improvement to the extra training data. Specifically, in our framework, all the given data (i.e., the source set) can be used for training, with no need to set part of it aside as a validation set. Owing to its split-free nature, PVP saves data for training and thus provides performance gains over the split-relied methods.

4.4. PROXIMAL VALIDATION SET VS TRADITIONAL VALIDATION SET

We compare the proximal validation set with other validation sets both quantitatively and qualitatively, in terms of distribution gap and visualization. These two aspects help explain why the proximal validation set can work better than the traditional one.

The proximal validation set has a smaller distribution gap. We compare the gap between the distribution of the validation set and the global distribution (approximated on the source set) under different methods. The distribution is quantified via the intra-class angular distribution (Section 3.2.3), and the distance between two distributions is computed with the Wasserstein distance (Takatsu, 2008). As shown in Table 2, the distribution gaps of PVP are consistently smaller than those of the other methods. This empirically verifies that our validation set better represents the global distribution, which can also lead to a smaller T-V Gap in evaluation, since $D_{val}$ better approximates $F$ in the T-V Gap of Definition 3.2 (Eq. 2).

The proximal validation set is well spread out. We visualize the 2D data representation produced by t-SNE in Figure 3. We use the MushRoom dataset, and different colors represent different ground-truth class labels. We contrast the t-SNE embeddings of the validation sets produced by two approaches: (a) SPlit (with split ratio 8:2), the current best split-based method, and (b) our method, PVP. Compared with SPlit, our validation points cover a wider region containing both central and marginal areas. Since PVP is not constrained by the split scheme, the proximal validation set can contain more validation samples and cover a broader region, which is presumed to lead to better evaluation quality.

We further conduct experiments using the same training set as holdout, rather than the whole (unsplit) training set, while replacing the original split validation set of holdout with our proximal validation set.
Table 3 shows that even without the extra training data, the performance can still be enhanced to some degree. This implies that the improvement in performance comes not only from additional training data but also from the high-quality validation set itself. Besides, we can also see that the T-V Gap of our method is still smaller in most cases when the validation set is the only different element. This comparison further confirms the superiority of our method in evaluation precision.
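The distribution gap used in this section (the Wasserstein distance between the angular distribution of a validation set and that of the source set) can be sketched for one class as a one-dimensional empirical Wasserstein-1 distance. This is an illustrative NumPy sketch; how the per-class distances are aggregated into the numbers of Table 2 is not specified in the text and is left out.

```python
import numpy as np

def wasserstein_1d(u, v):
    # Empirical 1-D Wasserstein-1 distance: the area between the two
    # empirical CDFs, integrated over the merged, sorted support.
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))
    widths = np.diff(grid)
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / u.size
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / v.size
    return float(np.sum(np.abs(cdf_u - cdf_v) * widths))
```

Here `u` would hold the angles of the validation samples of one class and `v` the angles of the source-set samples of the same class; the two arrays do not need to have the same length.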

4.6. ABLATION STUDY

Effects of Distributional-Consistent Sampling. In this section, we examine the effectiveness of the core algorithm in PVP. We compare PVP with a variant, PVP-random, which uses (stratified) random sampling from the auxiliary set instead of distributional-consistent sampling. From Table 4, we observe that the T-V Gap of PVP-random is larger, especially for the text and image datasets, which use deep models. Moreover, the test scores also degrade on most datasets. This indicates that our sampling algorithm is indispensable to the framework.

5. LIMITATION AND CONCLUSION

In this article, we proposed the Proximal Validation Protocol (PVP) to resolve the train/validation split trade-off. Through extensive experiments on seven public datasets covering tabular, image, and text data, we justify the validity of PVP from the perspectives of performance, robustness, and reliability. We have chosen to leave the theoretical understanding of this framework outside the scope of this paper, mostly because of a general lack of theoretical study in this specific line of research on validation set construction. Rather, we intend to demonstrate the concept of split-free proximal validation empirically, with the simplest instantiation. We hope to motivate the community to pay more attention to this research line, because we believe the problem is vital and widespread in ML production.



1. https://nlpaug.readthedocs.io/
2. https://nlpaug.readthedocs.io/
3. https://archive.ics.uci.edu/ml/datasets/bank+marketing
4. https://archive.ics.uci.edu/ml/datasets/diabetes
5. https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification
6. https://archive.ics.uci.edu/ml/datasets/mushroom
7. https://github.com/dmlc/xgboost
8. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html



Figure 1: Traditional validation scheme (left) vs. Proximal Validation Protocol (right).

3.2 PROXIMAL VALIDATION PROTOCOL

3.2.1 OVERVIEW OF FRAMEWORK

As introduced in Definition 3.1, the validation set is obtained by a collecting function ∁, i.e., D val = ∁(D src ), while the train set is formed via D tr = D src \ D val . However, to achieve a split-free solution, all data are left for training (i.e., D tr = D src ). Thus, we design a new collecting function, termed ∁ ′ . The sample set yielded by ∁ ′ is expected to have evaluation quality comparable to that of its traditional validation set counterpart, i.e., comparable T-V Gap and Variance (Definitions 3.2 and 3.3), and comparable or superior performance.

Figure 2: Illustration of distributional-consistent sampling algorithm. The steps of the algorithm are displayed from left to right. First, the distribution of D src is estimated via an intra-class angular distribution on feature space. The estimated distribution is then used to select samples from the auxiliary set to form a proximal validation set.

Figure 3: t-SNE visualization of the validation set representation on MushRoom. Different colors represent different classes. The translucent points represent the source set, while the solid-colored ones represent the validation set. See Section 4.4 for details.

Comparison between PVP and other validation methods on seven classification datasets covering all three major data modalities: tabular data (top), image (bottom left) and text (bottom right). The results are reported as F1↑ / T-V Gap↓ / Var↓, where ↑ (↓) indicates that a higher (lower) value is better. The numbers are scaled by 1e2 for straightforward comparison. Bold indicates superior results.

Comparison of PVP and other methods on the distribution gap, reported as the Wasserstein distance over 10 runs: tabular data (top), image (bottom left) and text (bottom right).

Using the proximal validation set without extra training samples. The reported improvements are obtained by replacing the original split validation set with ours. Marked entries indicate deterioration; all other entries indicate improvement.

Ablation study. The reported deterioration is obtained by using random sampling (from the auxiliary set) instead of our distributional-consistent sampling. Marked entries indicate improvement; all other entries indicate deterioration.

A APPENDIX

A.1 METHODS FOR GENERATING AUXILIARY SET

We list the detailed augmentation methods for each dataset in Table 5.

Table 5: The list of all augmentation methods used for each dataset. B(a, b) denotes a beta distribution with parameters a and b. Times is the number of times each augmentation method is applied per sample. For Reuters, we use nlpaug (footnote 2) to implement these operations.

Construction details of the long-tailed dataset. As mentioned in Section 4.1, we construct long-tailed versions of CIFAR10/100 with imbalance factor $\rho = 100$ and $10$, following Park et al. (2021) and Cui et al. (2019). Specifically, for a k-class dataset, we create a long-tailed dataset by reducing the number of examples per class according to the exponential function $n'_i = n_i \mu^i$ ($\mu \in (0, 1)$, $i = 0, 1, ..., k$), where $n_i$ is the original number of examples for class $i$ and $n'_i$ is the new number.

A.4 IMPLEMENTATION DETAILS

Tabular Data. We used xgboost (Chen & Guestrin, 2016) as the classifier for all tabular datasets. Specifically, we used the XGBClassifier (footnote 7) with the following parameters: objective='multi:softprob', booster='gbtree', early_stopping_rounds=15, max_depth=4, learning_rate=0.1, n_estimators=50. To generate the features of the tabular data, we apply two types of transformation to the original samples: (i) each categorical feature is converted to a one-hot feature; (ii) all other features are normalized. The feature dimension varies among datasets.

CIFAR10-LT and CIFAR100-LT. We used ResNet44 from He et al. (2016) as the classification model for both CIFAR10-LT and CIFAR100-LT. We followed the same training procedure (training from scratch), initialization, and hyperparameters as He et al. (2016) (i.e., weight decay of 0.0001, momentum of 0.9, minibatch size of 128, 200 total epochs, initial learning rate of 0.1, and learning rate decay at epochs 100 and 150). For the feature extractor, we used ResNet20 from He et al. (2016), trained on the source set of each dataset with the same training hyperparameters as above. The output of the last layer before classification is used as the feature, whose dimension is 72.

Reuters. We used DistilBert (Sanh et al., 2019) as the classification model. We fine-tuned the pre-trained model with the following training parameters: 15 total epochs, batch size of 8, weight decay of 0.01, learning rate of 2e-5. For the feature extractor, we used Bert from Devlin et al. (2018), trained on the source set with the same training hyperparameters as above. The output at the 'cls' position of the last layer before classification is used as the feature, whose dimension is 144.
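As an illustration of the long-tailed construction described earlier in this appendix, the sketch below computes per-class sample counts from the exponential rule $n'_i = n_i \mu^i$. It rests on stated assumptions rather than on our released code: the function name `long_tailed_counts` is hypothetical, classes are indexed $i = 0, ..., k-1$, and $\mu$ is tied to the imbalance factor by $\mu = \rho^{-1/(k-1)}$, so that the ratio between the largest and smallest class is approximately $\rho$.

```python
def long_tailed_counts(n_per_class, rho):
    """Per-class counts n'_i = n_i * mu**i for a k-class dataset.

    Assumption: mu = rho ** (-1 / (k - 1)), so that class 0 keeps all of its
    examples and the last class keeps roughly a 1/rho fraction.
    """
    k = len(n_per_class)
    mu = rho ** (-1.0 / (k - 1))
    return [int(n * mu ** i) for i, n in enumerate(n_per_class)]


# CIFAR10 has 10 classes with 5000 training images each.
counts_rho_100 = long_tailed_counts([5000] * 10, rho=100)
counts_rho_10 = long_tailed_counts([5000] * 10, rho=10)

print(counts_rho_100)
print(counts_rho_10)
```

The head class is left untouched in both settings, while a larger $\rho$ produces a steeper decay towards the tail classes.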

A.5 ANALYSIS OF TIME COST

As shown in Table 7, we provide quantitative measurements of the running time of the proximal validation set construction process. The running time can be divided into two parts: (i) the pre-processing part, which only needs to be executed once, regardless of the number of training and validation runs. This part contains the generation of the auxiliary set (Section 3.2.2) and the distribution estimation (Section 3.2.3). (ii) the sampling part, i.e., the second step of the distributional-consistent sampling algorithm (Section 3.2.3), which is executed once per training process.

The time cost of part (i) is high, mainly due to the generation process. However, the operations in part (i) need to be conducted only once, no matter how many times we perform training and validation on the same dataset. The only operation included in every training process is the sampling part, which finishes in a few seconds and therefore does not add much time to the training process.

On the other hand, we argue that the time cost of the construction process is a minor point, for two reasons: (i) the process can be made faster through engineering efforts such as multithreading; (ii) it can be viewed as a strategy that achieves a certain goal (better model performance and evaluation quality) at the expense of time, and such a strategy is common in many works, such as AutoAugment (Cubuk et al., 2019), RandAugment (Cubuk et al., 2020) and FlipDA (Zhou et al., 2021).

Here, we test two standard validation methods on five synthetic datasets (created by sklearn.datasets.make_classification (footnote 8), with a train size of 1000, a test size of 1000, 2 classes and 2 features) with increasing class imbalance in Figure 4. We observe that these methods work well in ideal scenarios with balanced data. In contrast, there is a risk of imprecise estimation in unbalanced and small-scale scenarios, where the discrepancy between the test performance and the estimated one (i.e., the height of the yellow region) is undesirably large. This case illustrates that standard validation schemes do not always work well even in a simple synthetic scenario; thus, we cannot readily rely on these methods in complex real-world scenarios.
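A synthetic setup of the kind described above could be generated roughly as follows. This is a sketch under assumptions and not the script behind Figure 4: the helper `make_imbalanced_dataset` is hypothetical, the five minority-class fractions are illustrative values that we chose for this example, and `n_redundant=0`, `flip_y=0.0` and the stratified split are assumed settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def make_imbalanced_dataset(minority_fraction, seed=0):
    """Two-class, two-feature dataset split into 1000 train / 1000 test samples."""
    X, y = make_classification(
        n_samples=2000,
        n_features=2,
        n_informative=2,
        n_redundant=0,   # required: the defaults would exceed n_features=2
        n_repeated=0,
        n_classes=2,
        weights=[1.0 - minority_fraction, minority_fraction],
        flip_y=0.0,      # assumed: no label noise, so class sizes follow weights
        random_state=seed,
    )
    return train_test_split(
        X, y, train_size=1000, test_size=1000, stratify=y, random_state=seed
    )


# Assumed, illustrative imbalance levels (fraction of the minority class).
fractions = (0.5, 0.3, 0.1, 0.05, 0.01)
datasets = {frac: make_imbalanced_dataset(frac) for frac in fractions}

for frac, (X_tr, X_te, y_tr, y_te) in datasets.items():
    print(frac, X_tr.shape, X_te.shape, int(y_tr.sum()), int(y_te.sum()))
```

At the most imbalanced level only a handful of minority samples remain in the training data, so any further hold-out split leaves very few of them for validation, which is the regime where the estimate becomes unreliable.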

