PROBABLE DATASET SEARCHING METHOD WITH UNCERTAIN DATASET INFORMATION IN ADJUSTING ARCHITECTURE HYPERPARAMETERS

Abstract

Different types of tasks with uncertain dataset information are studied because different parts of the data may be obtained with different difficulty. For example, in unsupervised learning and domain adaptation, datasets are provided without label information because of the cost of human annotation. In deep learning, adjusting architecture hyperparameters is important for model performance and is also time consuming, so we try to adjust hyperparameters under two types of uncertain dataset information: (1) dataset labels are obtained later, so hyperparameters must be adjusted without complete dataset information; (2) hyperparameters are adjusted on a subset of the training dataset, since training models on the complete training dataset is time consuming. Here, we propose several loss functions to search for probable datasets when complete dataset information is unavailable. Experiments on nine real-world datasets demonstrate the performance of our method.

1. INTRODUCTION

In deep learning, most regression data can be represented in the form (X, Y), where X is the data input and Y is the label. However, different parts of the data may be obtained with different difficulty. For example, in unsupervised learning Barlow (1989) and domain adaptation Wang & Deng (2018), the labels of a dataset are assumed hard to obtain, since labeling usually requires human annotation. These situations can be viewed as making decisions while part of the dataset information is uncertain. Another common situation is that the input samples X are obtained much earlier than the labels Y, because human annotation of Y is time consuming or the exact task target of Y is not yet determined when the input samples are collected. In such situations, we assume computing resources are abundant before the labels Y are obtained. In deep learning, the architecture hyperparameter setting is an important factor in the performance of a model. A natural question is then whether architecture hyperparameters, corresponding to different network architectures, can be compared and adjusted with the input sample information X alone. Selecting hyperparameters with X alone would save the time needed to try different hyperparameters once the labels Y are obtained. The input sample information X usually takes more memory space than the label information Y, suggesting that X carries more information than Y, so predicting the architecture comparison from X alone seems plausible. To deal with the uncertain information in a dataset, we propose a probable dataset searching method to predict architecture comparisons, where the dataset representation is inspired by the dataset definitions and assumptions in recent neural network convergence works Kohler & Langer (2021); Bauer & Kohler (2019); Schmidt-Hieber (2020); Suzuki (2018); Farrell et al. (2021).
Our method searches for probable datasets given the available dataset information, such as the input samples X. Concretely, the comparison of two hyperparameters can be predicted by searching for the existence of a probable dataset on which one architecture is better or worse than the other. Here, we use a neural network to approximate the dataset regression function and apply several loss functions to search for a probable dataset on which one trained architecture outperforms the other on the testing dataset. An assumption in our method is that the compared architectures should have competitive performance on the searched dataset. Empirically, the compared architectures are selected because former experience shows they perform well on the data domain, so we restrict the search to datasets on which at least one compared architecture can perform well. In our implementation, the neural network that approximates the regression function has the same architecture as one of the compared architectures. The probable dataset searching method can also help analyze characteristics that are irrelevant to the concrete dataset information X. For example, training models on the complete dataset sometimes costs too much time, so a small subset can be used to approximately adjust architecture hyperparameters Klein et al. (2017); Elsken et al. (2019). However, this approximation is correct sometimes and wrong at other times. When comparing two given hyperparameters, the probable dataset searching method can identify the conditions under which the approximate comparison is correct.
Formally, a sample in a regression dataset is denoted by a random variable vector (X, Y), and the relationship between X and Y can be represented by the regression function f_0(x) = E{Y | X = x}, where E denotes expectation. In recent deep neural network convergence works Kohler & Langer (2021); Bauer & Kohler (2019); Schmidt-Hieber (2020); Suzuki (2018); Farrell et al. (2021), it is proved that f_0 can be approximated by multi-layer fully connected neural networks trained with enough samples, provided f_0 satisfies certain requirements and assumptions. Here, we assume that the convergence theorem also holds for other types of deep neural networks.
We then use a neural network to approximate the regression function for two reasons: (1) as a function, a neural network also satisfies the requirements and assumptions placed on the regression function; (2) by the definition of convergence, there is always a neural network g whose difference from f_0 is smaller than any given positive value.
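To make this concrete, the following is a minimal sketch of sampling a probable regression dataset of the form Y = g(X) + ε, where g is taken to be a randomly initialized one-hidden-layer network. The network shape, scaling, and noise level are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_g(d_in, d_hidden, d_out, rng):
    """Build a randomly initialized one-hidden-layer network used as
    a candidate regression function g (an assumed, illustrative choice)."""
    W1 = rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in)
    W2 = rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_hidden)

    def g(X):
        return np.tanh(X @ W1) @ W2

    return g

# Sample a probable regression dataset Y = g(X) + eps.
g = make_g(d_in=5, d_hidden=32, d_out=1, rng=rng)
X = rng.normal(size=(1000, 5))                   # input distribution
eps = 0.05 * rng.normal(size=(1000, 1))          # zero-mean disturbance, much smaller than g
Y = g(X) + eps
```

Searching over such datasets then amounts to varying the parameters of g, the distribution of X, and the scale of ε.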

2. METHOD

As a result, a regression dataset can be expressed by Y = g(X) + ε, where the neural network g has a small enough difference from f_0 and ε is a random disturbance of the regression function with zero expectation. Three components then determine the regression dataset: the regression function g, the distribution of X, and the distribution of the random disturbance ε. In the dataset, ε should be much smaller than g so that the relation between inputs and outputs can be easily distinguished from the random disturbance. There is an additional assumption on the regression dataset: the regression function should not be so complex that no compared architecture can perform well in the comparison. We also assume that the compared architectures in the searching method are selected with empirical knowledge, so that they provide competitive performance on an uncertain dataset given enough training samples. Practically, this assumption is satisfied by giving the regression function g the same architecture as one of the compared architectures. Formally, to train a neural network g_AR-A with parameters θ_AR-A on a training dataset α_1, an error value Err is introduced as a smaller-the-better measure of the performance of a model on α_1. Training g_AR-A on α_1 then aims to find parameters θ*_AR-A(α_1) such that θ*_AR-A(α_1) = arg min_{θ_AR-A} Err(α_1, g_AR-A(·, θ_AR-A)), where a smaller error value means better model performance on the training dataset α_1. In regression tasks, the Root Mean Square Error (RMSE) is a widely used error value: RMSE = sqrt((1/K) ||ŷ − y||_2^2), where ||·||_2 is the l-2 norm, y is the true value, ŷ is the model output, and K is the dimension of y.
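The RMSE error value above can be sketched in a few lines; the function name `rmse` is our own, and averaging over all entries of the difference matches (1/K)||ŷ − y||_2^2 for a single K-dimensional output.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt((1/K) * ||y_pred - y_true||_2^2)
    for a K-dimensional output (averaged over all entries)."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, `rmse([0.0, 0.0], [3.0, 4.0])` evaluates sqrt((9 + 16) / 2).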
The difference in error value between two architectures on a testing dataset α_2 can be used to compare the performance of two architectures AR-A and AR-B: C(AR-A, AR-B, α_1, α_2) = Err(α_2, g_AR-A(·, θ*_AR-A(α_1))) − Err(α_2, g_AR-B(·, θ*_AR-B(α_1))), (3) where the models are trained on the training dataset α_1 and tested on the testing dataset α_2. In this paper, we try to compare the performance of two architectures when part of the dataset information is uncertain, i.e., to find the probable situations in which AR-A (or AR-B) performs better, corresponding to a small C(AR-A, AR-B, α_1, α_2) (or C(AR-B, AR-A, α_1, α_2)) value.
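The comparison value C in Eq. (3) can be sketched end to end. As illustrative stand-ins for two trained architectures AR-A and AR-B (the paper trains full deep networks), we use random tanh features of two different widths with a closed-form ridge readout; the dataset, widths, and noise scale are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_random_feature_model(X, Y, width, rng, lam=1e-3):
    """Stand-in 'architecture': random tanh features of a given width
    plus a ridge-regression readout. Returns the trained predictor."""
    W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])
    H = np.tanh(X @ W)
    beta = np.linalg.solve(H.T @ H + lam * np.eye(width), H.T @ Y)
    return lambda Xnew: np.tanh(Xnew @ W) @ beta

def err(model, X, Y):
    """Error value Err: RMSE of a trained model on a dataset."""
    return float(np.sqrt(np.mean((model(X) - Y) ** 2)))

# A probable dataset Y = g(X) + eps with a hidden regression function.
d = 4
W_true = rng.normal(size=(d, 1))
f = lambda X: np.sin(X @ W_true)
X1, X2 = rng.normal(size=(500, d)), rng.normal(size=(500, d))  # alpha_1, alpha_2
Y1 = f(X1) + 0.05 * rng.normal(size=(500, 1))
Y2 = f(X2) + 0.05 * rng.normal(size=(500, 1))

# C(AR-A, AR-B, a1, a2) = Err(a2, AR-A trained on a1) - Err(a2, AR-B trained on a1);
# a negative C means AR-A performs better on the testing dataset.
model_a = fit_random_feature_model(X1, Y1, width=64, rng=rng)
model_b = fit_random_feature_model(X1, Y1, width=8, rng=rng)
C = err(model_a, X2, Y2) - err(model_b, X2, Y2)
```

Searching for a probable dataset on which AR-A beats AR-B then corresponds to varying the dataset components until C becomes negative.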




