EFFICIENT HYPERPARAMETER OPTIMISATION THROUGH TENSOR COMPLETION

Abstract

Hyperparameter optimisation is a prerequisite for state-of-the-art performance in machine learning, with current strategies including Bayesian optimisation, hyperband and evolutionary methods. While such methods have been shown to improve performance, none of them is designed to explicitly take advantage of the underlying data structure. To this end, we introduce a completely different approach for hyperparameter optimisation, based on low-rank tensor completion. This is achieved by first forming a multi-dimensional tensor which comprises performance scores for different combinations of hyperparameters. Based on the realistic underlying assumption that the so-formed tensor has a low-rank structure, reliable estimates of the unobserved validation scores of combinations of hyperparameters can be obtained through tensor completion, from knowing only a fraction of the elements in the tensor. Through extensive experimentation on various datasets and learning models, the proposed method is shown to exhibit competitive or superior performance to state-of-the-art hyperparameter optimisation strategies. Distinctive advantages of the proposed method include its ability to simultaneously handle any hyperparameter type (e.g., kind of optimiser, number of neurons, number of layers), its relative simplicity compared to competing methods, as well as its ability to suggest multiple optimal combinations of hyperparameters.

1. INTRODUCTION

Machine learning (ML) applications have been steadily growing in number and scope over recent years, especially in Computer Vision and Natural Language Processing. This growth is mainly attributed to the ability of large deep learning (DL) models to learn very complex functions. Consequently, the performance of such models (but even of simpler models that perform less computationally heavy tasks) is critically dependent on the fine-tuning of their internal hyperparameters. Given the importance of hyperparameter optimisation and the ever increasing complexity of training modern ML models, efficient tuning of hyperparameters has become an area of utmost importance, not only in obtaining state-of-the-art performance, but also in alleviating the complexity of the training process. It is therefore not surprising that a large effort of the machine learning community has been focused on developing efficient hyperparameter optimisation methods. Commonly used methods include grid search, random search, methods based on Bayesian optimisation, multi-fidelity optimisation and evolutionary strategies. For a comprehensive review of currently available hyperparameter optimisation methods we refer the reader to Yu & Zhu (2020). Despite their success, current approaches are either rather heuristic or depend on strong underlying assumptions. To this end, we introduce a radically different approach based on exploiting the low-rank structure of the hyperparameter space. For example, we expect that a given optimiser with its learning rate set to 1e-3 will most likely yield similar validation loss performance to the same optimiser with a learning rate of 9.9e-4. Extending this argument to multiple dimensions, so as to reflect multiple hyperparameters, we aim to identify highly promising subspaces in the vast space of hyperparameter combinations by evaluating only a small fraction of these combinations.
More specifically, the proposed method models the set of all possible hyperparameter combinations as a multidimensional tensor. Each entry in this tensor corresponds to a score indicating the relative suitability of a particular hyperparameter combination with respect to others (e.g., validation loss). By evaluating a subset of these combinations, we construct an incomplete tensor with only a fraction of its elements known. Assuming the complete tensor is of low rank, the unknown entries can be estimated using low-rank tensor completion techniques. This makes it possible to predict the relative performance of different hyperparameter configurations without the need for their explicit evaluation. The best hyperparameter configurations can then be found by searching the so-completed tensor. To take full advantage of the tensor completion framework, we propose a sequential tensor completion algorithm which narrows down the hyperparameter search space based on promising subspaces identified in previous tensor completion cycles. Furthermore, for each cycle, we employ the Cross method (Zhang (2019)), a tensor sampling scheme which ensures that as few hyperparameter evaluations as possible are required for accurate tensor completion. We show that such an approach results in a speed of optimisation that is highly competitive with, and often surpassing, other state-of-the-art hyperparameter optimisation techniques. Comprehensive numerical results and extensive experimentation illustrate the potential of the proposed framework as a competitive and intuitive, yet physically meaningful, alternative for efficient hyperparameter optimisation. The rest of the paper is organised as follows. Section 2 discusses related work. After briefly discussing key tensor preliminary concepts in Section 3, we present a validation of the assumed low-rank property of the hyperparameter tensor in Section 4 and the proposed tensor completion algorithm for hyperparameter optimisation in Section 5.
Next, comprehensive experimental results are presented in Section 6, followed by the Conclusion in Section 7.

2. RELATED WORK

Deng & Xiao (2022) provide a meta-learning approach based on low-rank tensor completion which allows the optimal hyperparameter configuration for a new machine learning problem to be predicted based on the optimal configurations found in other related problems. Our work is crucially different, since it does not require the results of related problems or any other prior knowledge to optimise a given machine learning problem. A plethora of hyperparameter optimisation techniques exist that also do not require prior knowledge of the problem, e.g., random search or Bayesian optimisation. However, to the best of our knowledge, there is no such hyperparameter optimisation method based on low-rank tensor completion. Furthermore, we have been unable to find any work hypothesising that the performance distribution of various hyperparameter combinations has an underlying low-rank structure, even though this is a natural assumption for any physically meaningful data structure.

3. TENSOR PRELIMINARIES

Low-rank tensor completion is based on the premise that elements in a low-rank tensor exhibit certain interrelationships. This makes it possible to infer the values of unknown elements from the elements whose values are known. Research abounds with algorithms for tensor completion, e.g., Bengua et al. (2017), Song et al. (2018), Liu et al. (2014) or Acar et al. (2011); however, most of these algorithms assume there is no prior knowledge of which elements of the true tensor are known. In the strategy proposed in this paper, one is able to "choose" which elements of the tensor are known by choosing which hyperparameter combinations to evaluate. Since each hyperparameter evaluation is time-consuming, it is important to achieve accurate tensor completion with as few known elements as possible; this capability is provided by the Cross technique for efficient low-rank tensor completion from Zhang (2019). To explain this technique, it is necessary to explain two important prerequisites: the Tucker decomposition and the Tucker rank. Any N-dimensional tensor X can be expressed as a Tucker decomposition, consisting of an N-dimensional core tensor G and N factor matrices A^(n), one for each dimension. The relationship between the tensor X and its Tucker decomposition is

X = G ×_1 A^(1) ×_2 A^(2) ⋯ ×_N A^(N)    (1)

The operation ×_n denotes the mode-n product. Consider a tensor X ∈ R^(I_1 × I_2 × ... × I_N) and a matrix A ∈ R^(J × I_n), with 1 ≤ n ≤ N. The mode-n product of X and A is a tensor whose elements are

(X ×_n A)_{i_1, ..., i_{n-1}, j, i_{n+1}, ..., i_N} = Σ_{i_n = 1}^{I_n} x_{i_1, ..., i_{n-1}, i_n, i_{n+1}, ..., i_N} · a_{j, i_n}    (2)

where x and a are respectively elements of X and A with indices given by their subscripts. Based on the Tucker decomposition, the Tucker rank of a tensor X is the list containing the smallest possible dimensions of G in a Tucker decomposition that exactly reconstructs X. The Tucker rank is the definition of tensor rank used throughout this paper. For more details, refer to Cichocki et al. (2015).
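The Tucker machinery above is straightforward to express in code. The sketch below (plain NumPy; the function names are ours) implements the mode-n product of equation (2) and reconstructs a tensor from a core and factor matrices as in equation (1):

```python
import numpy as np

def mode_n_product(X, A, n):
    """Mode-n product X x_n A: contracts dimension n of X with the columns of A."""
    # Move mode n to the front, contract it with A, then move the result back.
    Xn = np.moveaxis(X, n, 0)                 # shape (I_n, ...)
    out = np.tensordot(A, Xn, axes=(1, 0))    # shape (J, ...)
    return np.moveaxis(out, 0, n)

def tucker_reconstruct(G, factors):
    """Rebuild X = G x_1 A^(1) x_2 A^(2) ... x_N A^(N)."""
    X = G
    for n, A in enumerate(factors):
        X = mode_n_product(X, A, n)
    return X

# A 4 x 5 x 6 tensor with Tucker rank at most (2, 2, 2), built from a
# random core and factor matrices:
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 2, 2))
factors = [rng.standard_normal((I, 2)) for I in (4, 5, 6)]
X = tucker_reconstruct(G, factors)
print(X.shape)  # (4, 5, 6)
```

The mode-n product is implemented by rotating mode n to the front so that a single `tensordot` performs the contraction, which keeps the code dimension-agnostic.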
Let the tensor X to be completed have dimensions I_1 × I_2 × ... × I_N. The Cross technique requires the Tucker rank of the complete tensor to be assumed beforehand; let this assumed rank be [r_1, r_2 ... r_N]. It provides a heuristic to accurately estimate the complete tensor by sampling r_1 × r_2 × ... × r_N + r_1 × (I_1 - r_1) + r_2 × (I_2 - r_2) + ... + r_N × (I_N - r_N) elements; the authors prove this is the minimum number of elements required for accurate tensor completion at a given assumed rank. The heuristic samples a body tensor of dimensions r_1 × r_2 × ... × r_N, starting from X_{0,0,...,0}, and r_n arm vectors for each n-th tensor dimension, which stretch along the entire dimension and must intersect with the body. The authors of Zhang (2019) provide two algorithms to generate a complete tensor estimate from these samples; throughout this paper we use "Noisy Tensor Completion with Cross Measurements", which accounts for the presence of "noise" that prevents the tensor from being perfectly low-rank.
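The Cross sampling budget can be computed directly from the tensor dimensions and the assumed rank; a minimal sketch (function name ours):

```python
import math

def cross_sample_count(dims, ranks):
    """Number of elements the Cross scheme samples: prod(r_n) for the body
    tensor, plus r_n * (I_n - r_n) arm elements for each dimension."""
    body = math.prod(ranks)
    arms = sum(r * (I - r) for I, r in zip(dims, ranks))
    return body + arms

dims = (10, 10, 10, 10)     # a grid of 10,000 hyperparameter combinations
print(cross_sample_count(dims, (1, 1, 1, 1)))  # 37 evaluations (< 0.4%)
```

This illustrates why the rank-[1, 1...1] assumption is so cheap: the budget grows with the sum of the dimensions rather than their product.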

4. INVESTIGATION OF THE LOW RANK PROPERTY OF HYPERPARAMETER TENSORS

Before introducing the proposed hyperparameter optimisation algorithm, it is necessary to provide validation to the assumption that a tensor constructed from validation losses of hyperparameter combinations, when constructed in a particular manner, is of low (Tucker) rank. This section aims to illustrate this through experiments on widely used machine learning setups.

4.1. MAPPING THE HYPERPARAMETER SEARCH SPACE TO THE TENSOR

The first step is to discuss how we construct the tensor T based on the sets of possible values, i.e., the search spaces, for the hyperparameters. Firstly, if there are N hyperparameters, the tensor T is N-dimensional. Each hyperparameter H corresponds to one dimension of T; there is no constraint on which dimension this can be. In experiments, we consider three kinds of hyperparameters based on their search spaces: categorical (a set of category-based values); uniform integer (a uniformly spaced sequence of integers), and uniform real (a uniformly spaced sequence of real numbers). The search space for a categorical hyperparameter, H_cat, is specified as a list; consider the activation function in neural networks as an example: ["ReLU", "tanh", "sigmoid"]. If H_cat corresponds to the l-th dimension of T then, based on the example provided, elements of T with index 0 for the l-th dimension correspond to all hyperparameter combinations with H_cat = 'ReLU'. Similarly, indices 1 and 2 correspond to 'tanh' and 'sigmoid', respectively. The search space for a uniform integer or real hyperparameter is specified through three values: start s, resolution interval r, and end e. For uniform integer hyperparameters, r is an integer, while for uniform real hyperparameters r is real. The search space is then {s, s + r, s + 2r, ..., s + ⌊(e - s)/r⌋ × r}, whose largest element never exceeds e. Let H_uni be a uniform integer or real hyperparameter on the m-th dimension of T. The elements of T whose index for the m-th dimension is i correspond to all hyperparameter combinations having H_uni = s + i × r. There is thus a mapping between hyperparameter combinations and tensor indices, generated based on the set of search spaces for the hyperparameters. Each tensor element represents the validation loss of the hyperparameter combination corresponding to that index. We hypothesize that a tensor constructed in this way is (at least approximately) of low rank.
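The index-to-combination mapping described above can be sketched as follows; the helper names and the example search space are illustrative only:

```python
import math

def build_axes(search_spaces):
    """Turn each search-space spec into the list of values along one tensor axis.
    Categorical: a list of values. Uniform: a (start, interval, end) triple."""
    axes = []
    for spec in search_spaces:
        if isinstance(spec, list):
            axes.append(spec)
        else:
            s, r, e = spec
            # Small epsilon guards the floor against floating-point error.
            n = math.floor((e - s) / r + 1e-9) + 1
            axes.append([s + i * r for i in range(n)])
    return axes

def index_to_combination(idx, axes):
    """Map a tensor index to the hyperparameter combination it represents."""
    return tuple(axis[i] for i, axis in zip(idx, axes))

# Hypothetical 3-hyperparameter space: activation, layer count, learning rate.
axes = build_axes([["ReLU", "tanh", "sigmoid"], (1, 1, 4), (0.1, 0.1, 0.5)])
print(index_to_combination((0, 2, 1), axes))  # ('ReLU', 3, 0.2)
```

The tensor of validation losses then simply has one axis per entry of `axes`, and evaluating a combination fills in the element at the corresponding index.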

4.2. SETUP FOR TENSOR COMPLETION EXPERIMENTS

Experiments were performed with the following machine learning problems: support-vector machine (SVM) with polynomial kernel in binary classification of the iris data set from Fisher (1936) (abbreviated SVM-P-iris); K-NN regression on the diabetes data set from Efron et al. (2004) (KNN-R-diab), and random forest binary classification of the wine data set (RF-wine). The wine data set of size 178 was modified for binary classification by retaining only samples from classes 0 and 1 to give a data set of size 130. For SVM-P-iris, 4 hyperparameters were used to construct the (4-dimensional) tensor: the SVM regularisation hyperparameter, and the degree, constant term and scaling factor of the polynomial kernel. For KNN-R-diab, 3 hyperparameters were optimised: the number of neighbours K of the K-NN algorithm, the order of Minkowski norm (used to calculate distances between points) and the weighting heuristic (uniform or distance-based) for values of the K neighbours when making a prediction at any data point. For RF-wine, 5 hyperparameters were optimised: the number of decision trees in the forest; maximum depth of any tree; minimum number of data samples needed to split a tree node; number of data features to consider when splitting a tree node, and a Boolean categorical hyperparameter determining whether a subset or the entire training data is used to construct each tree. The search spaces for all these hyperparameters are provided in Appendix A.1. For each of these problems p, a tensor T p was generated based on the mapping discussed in part 4.1. T p was then sampled and estimated through tensor completion based on the Cross heuristic. In all problems, validation loss was calculated using cross-validation with 5 folds predefined for each data set. The loss metric used to calculate validation loss varied across problems: hinge loss for SVM-P-iris, logcosh loss for KNN-R-diab and Kullback-Leibler divergence for RF-wine.

4.3. TENSOR COMPLETION RESULTS

Table 1 presents the result of tensor completion using an assumed Tucker rank of 1 for each dimension, i.e., r_1 = r_2 = ... = r_N = 1. Table 2 presents the best result obtained when using any Tucker rank different from the one used in Table 1. The columns represent metrics devised to compare the predicted tensor T̂_p from tensor completion with the true tensor T_p. NND stands for "normalised norm difference" and is given by

NND = ‖T̂_p - T_p‖ / ‖T_p‖    (3)

where ‖X‖ denotes the norm of a tensor X, given by the square root of the sum of squares of all its elements. A lower NND indicates better accuracy, although NND = 1 is still poor: it is equivalent to all elements of T̂_p being 0. CE10% is the percentage of tensor indices of the top 10% of elements in T̂_p, by lowest validation loss, in common with the top 10% of T_p. This metric provides an indication of whether tensor completion can identify the best hyperparameter combinations. Note that the number of elements of T_p sampled is the same in Tables 1 and 2; we found it convenient to display it only in Table 1. Due to the inherent randomness of the Cross heuristic, the set of elements of T_p sampled when the Tucker rank is not [1, 1...1] varies over trials; hence the range and mean of each metric over 10 trials are provided in Table 2. As seen in Table 1, tensor completion with the Cross technique consistently provides a degree of accuracy in approximation (NND < 1) when the Tucker rank is [1, 1...1]. This is in spite of the fact that, across all problems, the proportion of tensor elements sampled is always < 1%. However, the CE10% value is always < 15%, indicating that, while some of the best validation loss elements are identified, many are not. When higher rank values are used, as in Table 2, the NND worsens to the extent that NND ≈ 1, but the CE10% performance improves.
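The two metrics can be sketched in a few lines of NumPy (function names ours; the toy tensors below are synthetic, not experiment data):

```python
import numpy as np

def nnd(T_hat, T):
    """Normalised norm difference between the completed and true tensors."""
    return np.linalg.norm(T_hat - T) / np.linalg.norm(T)

def ce10(T_hat, T, frac=0.10):
    """Percentage of the top-`frac` lowest-loss entries of T_hat that are
    also in the top-`frac` lowest-loss entries of T."""
    k = max(1, int(frac * T.size))
    top_hat = set(np.argsort(T_hat, axis=None)[:k])
    top_true = set(np.argsort(T, axis=None)[:k])
    return 100.0 * len(top_hat & top_true) / k

rng = np.random.default_rng(1)
T = rng.random((6, 6, 6))                          # synthetic "true" losses
T_hat = T + 0.05 * rng.standard_normal(T.shape)    # a noisy estimate
print(nnd(T_hat, T) < 1.0)
```

Note that CE10% compares flat tensor indices, so it rewards locating the right cells rather than merely approximating values.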
These results suggest that a Tucker rank of 1 for each dimension is able to capture general trends in the variation of validation loss values throughout the tensor, which explains why it can consistently produce low NND values. This can only be true if the validation loss values can be roughly described by a low-rank structure of rank [1, 1...1]. However, this Tucker rank is less effective at capturing specific local variations in the values, which is why its CE10% performance is lower. On the other hand, while using a higher Tucker rank may at times capture local variations accurately (hence the higher CE10%), it tends to overfit these variations, i.e., emphasise the variations over the underlying low-rank structure. This results in more variable performance and poorer NND. For higher Tucker ranks to be able to capture the (approximate) rank [1, 1...1] structure underlying the validation loss values, a larger number of samples of T_p would be required.


For hyperparameter optimisation, we decided that while in some cases, a higher rank may prove more effective at identifying the best hyperparameter combinations, using a Tucker rank of [1, 1...1] is the best default choice as it can capture the structure of different T p , while ensuring the best time performance by evaluating the lowest number of hyperparameter combinations.

5. TENSOR COMPLETION FOR HYPERPARAMETER OPTIMISATION

In this section, we present a technique for hyperparameter optimisation based on tensor construction and completion as performed in Section 4. Throughout this paper, we refer to this technique as "Hyperparameter optimisation through tensor completion", abbreviated HOTC. From the results of Section 4, it is apparent that low-rank tensor completion can approximate the global distribution of validation losses over hyperparameter combinations, but may miss local variations. Accounting for these variations is, however, crucial to obtaining the optimal hyperparameter combination. Therefore, a suitable approach is to use tensor completion to predict the general region of the search space most likely to hold the optimal combination(s), and then focus the optimisation on this region for another round of tensor completion. This focusing of the optimisation can be applied repeatedly until the region being searched is small enough for an exhaustive, i.e., grid, search.
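The complete-then-narrow loop can be sketched as follows. This is a toy stand-in, not Algorithm 1: it substitutes uniform random sampling of the grid for Cross-based tensor completion, and the objective function and all names are hypothetical:

```python
import random

def hotc_sketch(evaluate, spaces, cycles=3, seed=0):
    """Toy version of the HOTC loop over (start, interval, end) search spaces:
    each cycle samples the grid, picks the best combination found, and halves
    every search space around it."""
    rng = random.Random(seed)
    best = None
    for _ in range(cycles):
        # Build the current grid of values (one axis per hyperparameter).
        grid = [[s + i * r for i in range(int((e - s) / r + 1e-9) + 1)]
                for s, r, e in spaces]
        # Stand-in for Cross sampling + completion: evaluate random combinations.
        combos = {tuple(rng.choice(axis) for axis in grid) for _ in range(20)}
        best = min(combos, key=evaluate)
        # Narrow each space to half its range, centred on the best value found.
        spaces = [(max(h - (e - s) / 4, s), max(r / 2, 1e-3),
                   min(h + (e - s) / 4, e))
                  for h, (s, r, e) in zip(best, spaces)]
    return best, evaluate(best)

# Hypothetical smooth objective minimised at (2.0, -1.0).
loss = lambda h: (h[0] - 2.0) ** 2 + (h[1] + 1.0) ** 2
best, val = hotc_sketch(loss, [(0.0, 0.5, 4.0), (-3.0, 0.5, 1.0)])
print(best, round(val, 3))
```

Even with random sampling in place of completion, the repeated halving of the ranges and of the resolution intervals shows how the search concentrates on a shrinking region.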

5.1. THE PROPOSED ALGORITHM

Algorithm 1 describes HOTC. Note that the search space for each hyperparameter H is as defined in part 4.1. In each tensor completion cycle, the function generate_tensor_cross_components first evaluates different hyperparameter combinations, i.e., samples the true validation loss tensor according to the Cross sampling scheme. It returns the sampled elements in the form of the Cross measurements B, J and A, which are respectively the body tensor, the array of joint matricisations and the array of arm matricisations; see Zhang (2019) for more information. The function generate_complete_tensor then applies the noisy tensor completion algorithm from Zhang (2019) to estimate the complete tensor. In this, B, J and A are used to generate the core and factor matrices of a Tucker decomposition of the rank-[r_1, r_2 ... r_N] estimate T̂ of the true tensor T. An overall illustration of how true tensor samples can form a Tucker decomposition of the estimate is given in figure 1. The hyperparameter combination h with the lowest validation loss in T̂ is then found by the function find_best_combination, which converts the tensor index to its corresponding hyperparameter combination. The function narrow_search_spaces then generates a new version of S with smaller search spaces centred around the values in the hyperparameter combination h. The search space for each hyperparameter is narrowed around its corresponding value in h; assume this value to be h_{H_uni} for a uniform integer/real hyperparameter H_uni. Let the start, resolution interval and end (see part 4.1 for definitions) of the original search space before narrowing for H_uni be s, r and e, respectively. The range of values searched in this space is G_i = ⌊(e - s)/r⌋ × r; the range of the narrowed space, G_{i+1}, is at most ⌊G_i/(2r)⌋ × r, with h_{H_uni} at the centre of the new sequence of values. The new start and end of the search space hence become max(h_{H_uni} - ⌊G_i/(4r)⌋ × r, s) and min(h_{H_uni} + ⌊G_i/(4r)⌋ × r, e), respectively.
The new resolution interval is max(r/2, r_min) when H_uni is uniform real, or max(⌊r/2⌋, 1) when it is uniform integer. Some categorical hyperparameters may take numerical category values, where closer numbers indicate a closer relationship. To narrow the search space of such a hyperparameter, H_cat, the list of numerical values representing the search space is first sorted in ascending order. The narrowed search space is then the sequence of values from the a-th to the b-th element (both inclusive) of the sorted list, where a = max(i - round(L_i/4), 1) and b = min(i + round(L_i/4), L_i). Here, L_i is the length of the original list and h_{H_cat} is the i-th element in the sorted list. The size L_{i+1} of the new search space list is at most ⌈L_i/2⌉ + 1, with h_{H_cat} roughly in the middle. The search spaces of categorical hyperparameters taking unrelated non-numerical values, e.g., ['ReLU', 'tanh', 'sigmoid'], are not narrowed down and are left as they are. Once the search spaces have been sufficiently narrowed that the tensor estimate T̂ has fewer than M elements, the function grid_search performs an exhaustive grid search by evaluating every hyperparameter combination in S to find the lowest validation loss. Note that M can be set to 0, in which case the predictions come purely from tensor completion cycles without any grid search.
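The narrowing rules above can be sketched as follows (function names are ours, and a small epsilon guards the floor operations against floating-point error in the uniform-real case):

```python
import math

EPS = 1e-9  # guards floors against floating-point error

def narrow_uniform(space, h, r_min=1e-3, integer=False):
    """Narrow a (start, interval, end) search space around the best value h:
    keep at most half the previous range, centred on h, and halve the
    resolution interval (floored at 1 for integers, r_min for reals)."""
    s, r, e = space
    G = math.floor((e - s) / r + EPS) * r          # range currently searched
    half = math.floor(G / (4 * r) + EPS) * r       # half-width of the new range
    new_s, new_e = max(h - half, s), min(h + half, e)
    new_r = max(math.floor(r / 2), 1) if integer else max(r / 2, r_min)
    return (new_s, new_r, new_e)

def narrow_categorical(values, h):
    """Narrow a numerically ordered categorical space to roughly half its
    length, centred on the best value h."""
    vals = sorted(values)
    i = vals.index(h) + 1                          # 1-based position of h
    L = len(vals)
    a = max(i - round(L / 4), 1)
    b = min(i + round(L / 4), L)
    return vals[a - 1:b]

print(narrow_uniform((1, 10, 101), 41, integer=True))   # (21, 5, 61)
print(narrow_categorical([8, 16, 32, 64, 128], 32))     # [16, 32, 64]
```

In the integer example, a space of 11 coarse values becomes a space of 9 finer ones centred on 41, matching the halving of both the range and the resolution interval.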

5.2. HEURISTICS TO ENHANCE PERFORMANCE

It is important to mention heuristics to obtain the best performance from Algorithm 1, which has numerous configuration parameters; these heuristics were followed to generate the results in Section 6. Firstly, it is advisable to define large resolution intervals for the initial search spaces in S. For example, if a uniform integer hyperparameter has a start and end of 1 and 101 respectively, a good choice for the resolution interval would be 10 or 20, so that 10 or 5 values are searched in the completion cycle. As discussed in part 4.3, a Tucker rank of 1 for every dimension is the best default choice. With these settings, the algorithm should narrow the search space, with each tensor completion cycle, onto areas likely to contain the optimal hyperparameter combinations. The resolution intervals automatically decrease from their large initial values towards r_min. Once the initial search spaces are defined, the next step is to run the algorithm for one completion cycle, keeping the grid search limit M at zero (i.e., disabling grid search), and to observe the performance. The number of cycles, C, can then be gradually increased until no improvement in performance is obtained on running the algorithm, or the maximum amount of time acceptable for optimisation is reached. At this stage, performance may be further improved by setting M > 0, enabling a grid search; at the cost of more execution time, this ensures that the best possible validation loss in the area of the search space being examined has been obtained. It is possible to predict how much time the algorithm will take if one knows the time required to evaluate one hyperparameter combination; let this be t. Consider a tensor T of dimensions I_1 × I_2 × ... × I_N. To estimate T using the Cross technique with assumed Tucker rank [r_1, r_2 ... r_N], it is necessary to sample samp = r_1 × r_2 × ... × r_N + r_1 × (I_1 - r_1) + r_2 × (I_2 - r_2) + ... + r_N × (I_N - r_N) elements of T under the Cross scheme.
In the context of Algorithm 1, for the first completion cycle, I_n is the number of values in the initial search space for the n-th hyperparameter. Estimating the completed tensor based on the sampled elements involves N n-mode tensor multiplications, which take negligible time compared to evaluating samp hyperparameter combinations. Thus, the first completion cycle takes roughly samp × t in time. In subsequent completion cycles, the tensor to be completed has smaller dimensions due to the narrowing of the search space, so samp × t is an upper bound on the time taken by these cycles. The maximum time taken by the completion cycles is thus ≈ C × samp × t. If a grid search is performed at the end, the maximum total time is ≈ (C × samp + M) × t.
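This time budget is easy to compute up front; a minimal sketch (function name and example numbers are ours):

```python
import math

def hotc_time_estimate(dims, ranks, cycles, grid_limit, t_eval):
    """Upper bound on HOTC runtime: each of the `cycles` completion cycles
    samples at most `samp` combinations, plus up to `grid_limit` combinations
    in the final grid search, each costing `t_eval` seconds."""
    samp = math.prod(ranks) + sum(r * (I - r) for I, r in zip(dims, ranks))
    return (cycles * samp + grid_limit) * t_eval

# E.g. a 10 x 10 x 10 search grid, rank [1, 1, 1], 3 cycles, a 50-point
# final grid search, and 2 s per hyperparameter evaluation:
print(hotc_time_estimate((10, 10, 10), (1, 1, 1), 3, 50, 2.0))  # 268.0 seconds
```

Since later cycles operate on smaller tensors, the true runtime is usually well below this bound.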

6. EVALUATION OF HOTC AGAINST ALTERNATIVE HYPERPARAMETER OPTIMISATION TECHNIQUES

The HOTC technique was compared with six existing hyperparameter optimisation techniques: random search (RS) (Bergstra & Bengio (2012)); Bayesian optimisation with Gaussian process (BO-GP) (Frazier (2018)); Bayesian optimisation with Tree-Parzen estimator (BO-TPE) (Feurer & Hutter (2019)); covariance matrix adaptation evolution strategy (CMA-ES) (Hansen (2016)); hyperband (HB) (Li et al. (2018)), and Bayesian optimisation hyperband (BOHB) (Falkner et al. (2018)). The techniques were compared across five benchmark machine learning problems: K-nearest neighbours (K-NN) binary classification on the wine data set (abbreviated KNN-C-Wine); K-NN regression on the California Housing data set (KNN-R-Calh); random forest binary classification on the Forest Covertype data set (RF-FC); binary classification with a 3-layer convolutional neural network (CNN) on the MNIST data set (3LC-MNIST), and transfer learning with the VGG16 CNN architecture in multi-class classification of the CIFAR10 data set (VGG-CIF10). For each hyperparameter in each problem, the limits (maximum and minimum) of the range of values to search were kept identical for each hyperparameter optimisation technique. In KNN-C-Wine, KNN-R-Calh and RF-FC, each hyperparameter combination was evaluated using cross-validation loss over 5 folds pre-defined for each data set. This ensures a consistent loss value is obtained over multiple trials. For 3LC-MNIST and VGG-CIF10, separate training and validation data sets, in size ratio 80:20, were used instead, to enable faster calculation of validation loss.

6.1. IMPLEMENTATION DETAILS

In KNN-C-Wine, 3 hyperparameters were optimised: the number of neighbours K of the K-NN algorithm, the order of the Minkowski norm (used to calculate distances between points) and the weighting heuristic (uniform or distance-based) for the values of the K neighbours when making a prediction at any data point. The wine data set of size 178 was modified for binary classification to retain only samples from classes 0 and 1; the resulting set was of size 130. The fraction of misclassified validation data samples, i.e., the misclassification loss, was the metric used to represent the validation loss. In KNN-R-Calh, 2 hyperparameters were optimised: the number of neighbours K and the order of the Minkowski norm. Only the uniform weighting heuristic could be used here, as the California Housing data set had coinciding elements that resulted in infinite distance-based weights. The logcosh metric was used to calculate validation loss. In RF-FC, 5 hyperparameters were optimised: the number of decision trees in the random forest; the maximum depth of any tree; the minimum number of data samples needed to split a tree node; the number of data features to consider when splitting a tree node, and a Boolean hyperparameter determining whether a subset or the entire training data is used to construct each tree. The Forest Covertype data set, of initial size 581,012, was too large to be processed in RAM. Hence, it was modified for binary classification by retaining the first 10,000 samples of each of classes 1 and 2. This gave an evenly balanced data set of size 20,000. The truncation also made the data set faster to train on. In 3LC-MNIST, the CNN had 2 convolutional layers and 1 fully-connected output layer. 11 hyperparameters were optimised, the most among all the benchmark problems.
These were: the numbers of neurons, activation functions, dimensions of max pooling and dimensions of stride of the first and second convolutional layers, as well as the neural network optimiser, learning rate and learning rate decay rate. To evaluate each hyperparameter combination, the CNN was trained for 5 epochs. The transfer learning problem in VGG-CIF10 involved freezing the 16 convolutional layers of VGG16 and training its 3 fully connected output layers. These layers were of size 4096, 4096 and 10 (to classify 10 classes), resulting in 18,923,530 trainable parameters. 7 hyperparameters were optimised: the optimiser, learning rate, learning rate decay, number of epochs to train the network, batch size and the dropout probabilities of the two fully connected layers of size 4096. It should be noted that the 32 × 32 CIFAR10 images were resized to 224 × 224 to be accepted by the VGG16 network. The search spaces for each of these hyperparameters are described in Appendix A.2. The computing hardware, the software implementations of the different hyperparameter optimisation techniques, and the configurations of these techniques are described in Appendix B.

6.2. NUMERICAL RESULTS

The code for HOTC and our benchmarks is publicly available on GitHub. Figure 2 illustrates the results of the experiments. The graphs were generated for RS, BO-TPE, CMA-ES, BO-GP, HB and BOHB by sampling the best obtained validation loss at the timestamp of every trial. For HOTC, this sampling was done after each completion cycle and after the final grid search, if any. For KNN-C-Wine, KNN-R-Calh and RF-FC, the HOTC approach is remarkably faster than the other techniques and also obtains the lowest value of validation loss. It is not possible to run HOTC for longer than is seen on these graphs, as the completion cycles have already converged to a very small region of the tensor that is trivial to search exhaustively. Note that in the diagram for KNN-C-Wine, the graphs of BO-GP, CMA-ES, RS and BOHB overlap each other, as do those of BO-TPE and HB. In the diagrams for 3LC-MNIST and VGG-CIF10, the ends of the completion cycles for HOTC are indicated by numbered markers. Grid search was not enabled in either problem, as it would consume too much time. In these problems, which have more hyperparameters, each combination of which takes longer to evaluate (neural networks take longer to train), HOTC does not outperform the other techniques. One has to wait ≈ 1500 seconds for the first suggestion of a hyperparameter combination from HOTC in 3LC-MNIST and ≈ 2000 seconds in VGG-CIF10; this is because a minimum number of tensor elements must be known in order for tensor completion to be possible. However, the effectiveness of the tensor completion can be seen as the suggested combination improves over subsequent completion cycles; in both cases, the final validation loss is comparable to the best results observed. Overall, while HOTC does not outperform its competitors in all situations, it is still able to consistently find optimal or near-optimal combinations across machine learning paradigms.

7. CONCLUSION

We have introduced the concept of tensor completion in the hyperparameter optimisation paradigm. Through sequential completion cycles that are able to identify the most promising subspaces, the proposed method has been shown to be highly competitive and often superior to a wide range of state-of-the-art and commonly used frameworks over a diverse set of benchmarks and algorithms. It is our hope that the presented intuitive yet efficient approach to hyperparameter optimisation will spur the interest of the machine learning community, and we envision tensor completion becoming a strong alternative to current approaches. Regarding future work, we aim to work on a procedure to automatically determine optimal values for the algorithm inputs, such as resolution of hyperparameter ranges or number of tensor completion cycles. Furthermore, we plan to investigate whether combining the current approach of focusing on optimal regions with random exploration of the space will improve the overall performance. Finally, an interesting direction we aim to pursue is the investigation of alternative tensorisation methods of the hyperparameter space. 



Figure 1: Illustration of a tensor completion approach, based on Tucker decomposition, for hyperparameter optimisation. Missing values of the incomplete tensor are designated in grey.


Figure 2: Graphs of validation loss against optimisation time for the different hyperparameter optimisation algorithms on the benchmark machine learning problems.

Tensor completion results using an assumed Tucker rank of [1, 1, ..., 1].



Table 11: Configuration parameters of HOTC in the different machine learning problems.

Table 12: Multi-fidelity budget limits used for HB and BOHB in the different machine learning problems.


Qingquan Song, Hancheng Ge, James Caverlee, and Xia Hu. Tensor completion algorithms in big data analytics, 2018.

Tong Yu and Hong Zhu. Hyper-parameter optimization: A review of algorithms and applications, 2020. URL https://arxiv.org/abs/2003.05689.

Anru Zhang. Cross: Efficient low-rank tensor completion. The Annals of Statistics, 47(2):936–964, 2019. doi: 10.1214/18-AOS1694. URL https://doi.org/10.1214/18-AOS1694.

A HYPERPARAMETER SEARCH SPACES

A.1 TENSOR COMPLETION EXPERIMENTS

In these experiments, the hyperparameter search spaces are set so as to generate a complete tensor of validation loss values. In the tables, they are presented in the format (start, resolution interval, end) for uniform integer or real hyperparameters, or [item1, item2, ...] for categorical hyperparameters.

SVM-P-iris

Refer to table 3.

Table 3: Hyperparameter search space for SVM-P-iris.

Hyperparameter                       Search Space
Regularisation Hyperparameter        (0.1, 0.1, 3.0)
Polynomial Kernel Degree             (0.0, 0.1, 3.0)
Polynomial Kernel Scaling Factor     (0.1, 0.1, 3.0)
Polynomial Kernel Constant Term      (0.0, 0.1, 3.0)
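For concreteness, the (start, resolution interval, end) triples above can be expanded into discrete axes whose lengths give the dimensions of the score tensor. The sketch below is illustrative only; the dictionary keys (`C`, `degree`, `gamma`, `coef0`) are assumed sklearn-style names, not identifiers from the paper.

```python
import numpy as np

# (start, resolution interval, end) triples from the SVM-P-iris table
search_space = {
    "C":      (0.1, 0.1, 3.0),  # regularisation hyperparameter
    "degree": (0.0, 0.1, 3.0),  # polynomial kernel degree
    "gamma":  (0.1, 0.1, 3.0),  # polynomial kernel scaling factor
    "coef0":  (0.0, 0.1, 3.0),  # polynomial kernel constant term
}

# Each axis of the score tensor enumerates one hyperparameter's grid; adding
# half a step to `end` makes np.arange include the endpoint despite float error.
axes = {name: np.arange(start, end + step / 2, step)
        for name, (start, step, end) in search_space.items()}
tensor_shape = tuple(len(v) for v in axes.values())
print(tensor_shape)  # (30, 31, 30, 31)
```

Evaluating every entry of this tensor would require 30 × 31 × 30 × 31 training runs, which is exactly the cost that tensor completion avoids.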

A.2 HYPERPARAMETER OPTIMISATION EXPERIMENTS

In these experiments, the limits of the search spaces for each hyperparameter were kept common across all of the optimisation techniques. The search space resolution interval is only defined for HOTC, so that it can construct a tensor representing the search spaces. For all other techniques, integer hyperparameters always have resolution 1 and the search spaces for real hyperparameters are continuous. This ensures these techniques have access to the entire set of possible values, although the downside is that the set of combinations that can be tested becomes larger, especially with real hyperparameters. In the tables, search spaces are presented either in the format (start, end) for real or integer hyperparameters, or [item1, item2, ...] for categorical hyperparameters. Furthermore, if the log column value is "True", the hyperparameter value varies between start and end exponentially rather than uniformly. Note that, while it was stated that HOTC distributes hyperparameter values uniformly, it is easy to accommodate exponentially varying values by treating the exponent as a uniformly varying hyperparameter over a constant base.
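The exponent trick described above can be made concrete with a short sketch. The range (1e-4, 1e-1), the base, and the axis resolution below are hypothetical values chosen for illustration, not settings from the paper.

```python
import numpy as np

# A hypothetical learning-rate range marked log=True: instead of gridding the
# values uniformly, grid the *exponent* uniformly over a constant base.
base = 10.0
start_exp, end_exp = np.log10(1e-4), np.log10(1e-1)   # -4.0 and -1.0
n_points = 7                                           # tensor-axis resolution
exponents = np.linspace(start_exp, end_exp, n_points)  # uniformly spaced axis
learning_rates = base ** exponents                     # exponentially spaced values
print(np.round(learning_rates, 6))
```

The tensor axis thus remains a uniform grid (over the exponents), while the hyperparameter itself spans the range exponentially, with a constant ratio between consecutive values.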

B SOFTWARE AND HARDWARE DETAILS

All experiments were run on Google Colab: the Google Compute Engine backend consisted of two single-core Intel Xeon CPUs with a clock frequency of 2.2 GHz, each accommodating two threads. The total RAM available was 12.68 GB. For the 3LC-MNIST and VGG-CIF10 experiments, the environment was accelerated with an NVIDIA Tesla T4 GPU.

The configuration parameters of HOTC were set for each problem according to the heuristics described in Section 5.2, and are listed in table 11. For RS, BO-GP, BO-TPE and CMA-ES, only the number of trials was varied, according to the time for which optimisation was to be performed. The optimisation time for each problem can be seen in the graphs of figure 2. For the multi-fidelity techniques (HB and BOHB), the number of trials, the minimum and maximum budget values, and the step size with which the minimum budget was increased towards the maximum were set to reasonable values based on the data to achieve good performance; these values are listed in table 12. Apart from this, all other configurations were left at the default values of their respective library implementations. For HB and BOHB, the multi-fidelity budget was the fraction of the training data used to train the machine learning model in all problems except KNN-C-Wine. Since the size of the modified wine data set (130) was small compared to the maximum searched value of K (i.e., 100), the budget was instead defined as the fraction of the total number of data features (13) used in training.
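The budget definitions above can be sketched as follows. The minimum and maximum budgets, the step size, and the training-set size are hypothetical placeholders (the actual settings are those of table 12); only the feature count of 13 for KNN-C-Wine comes from the text.

```python
import numpy as np

# Hypothetical budget settings in the style of table 12: the budget is a
# fraction, stepped linearly from the minimum up to the maximum.
min_budget, max_budget, step = 0.25, 1.0, 0.25
budgets = np.arange(min_budget, max_budget + step / 2, step)  # 0.25, 0.5, 0.75, 1.0

# For most problems the budget is the fraction of training data used.
n_train = 2000                      # hypothetical training-set size
subset_sizes = (budgets * n_train).astype(int)
print(subset_sizes)                 # [ 500 1000 1500 2000]

# For KNN-C-Wine the budget is instead the fraction of the 13 data features.
n_features = 13
feature_counts = np.maximum(1, (budgets * n_features).astype(int))
```

Low-budget rungs thus train on a small slice of the data (or a few features), letting HB and BOHB discard poor configurations cheaply before spending the full budget on the survivors.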

