LEARNING TO PREDICT PARAMETER FOR UNSEEN DATA Anonymous authors Paper under double-blind review

Abstract

Typical deep learning models depend heavily on large amounts of training data and resort to an iterative optimization algorithm (e.g., SGD or Adam) for learning network parameters, which makes the training process very time- and resource-intensive. In this paper, we propose a new training paradigm and formulate network parameter training as a prediction task: given a network architecture, we observe that there exist correlations between datasets and their corresponding optimal network parameters, and explore whether we can learn a hyper-mapping between them to capture these relations, such that we can directly predict the parameters of the network for a new dataset never seen during the training phase. To do this, we put forward a new hypernetwork that builds a mapping between datasets and their corresponding network parameters, and then predicts parameters for unseen data with only a single forward propagation of the hypernetwork. At its heart, our model benefits from a series of GRUs sharing weights to capture the dependencies of parameters among different layers in the network. Extensive experiments validate that our proposed method achieves surprisingly good efficacy. For instance, it takes 119 GPU seconds to train ResNet-18 using Adam from scratch and the network obtains a top-1 accuracy of 74.56%, while our method costs only 0.5 GPU seconds to predict the network parameters of ResNet-18 achieving comparable performance (73.33%), more than 200 times faster than the traditional training paradigm.

1. INTRODUCTION

Deep learning has yielded superior performance in a variety of fields in the past decade, such as computer vision (Kendall & Gal, 2017), natural language processing (DBL), reinforcement learning (Zheng et al., 2018; Fujimoto et al., 2018), etc. One of the keys to the success of deep learning is the huge amount of training data used to learn a deep network. In order to optimize the network, the traditional training paradigm relies on an iterative optimization algorithm (e.g., SGD) to train the model in a mini-batch manner, leading to huge time and resource consumption. For example, when training ResNet-101 (He et al., 2016) on the ImageNet (Deng et al., 2009) dataset, it often takes several days or weeks for the model to be well optimized, even with GPUs. Thus, how to accelerate the training process of the network is a pressing topic in deep learning. Many methods for accelerating the training of deep neural networks have been proposed (Kingma & Ba, 2015; Ioffe & Szegedy, 2015; Chen et al., 2018). Representative works include optimization based techniques that improve stochastic gradient descent (Kingma & Ba, 2015; Yong et al., 2020; Anil et al., 2020), normalization based techniques (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016), parallel training techniques (Chen et al., 2018; Kim et al., 2019), etc. Although these methods have shown promising potential to speed up the training of the network, they still follow the traditional iterative training paradigm. In this paper, we investigate a new training paradigm for deep neural networks.
In contrast to previous works accelerating the training of the network, we formulate the parameter training problem as a prediction task: given a network architecture, we attempt to learn a hyper-mapping between datasets and their corresponding optimal network parameters, and then leverage the hyper-mapping to directly predict the network parameters for a new dataset unseen during training.

Figure 1: Correlation coefficients between training datasets and the network parameters on the Fashion-MNIST (Xiao et al., 2017), CIFAR-100 (Krizhevsky et al., 2009), and Mini-ImageNet (Vinyals et al., 2016) datasets, respectively. 'learned' depicts correlations between training datasets and the corresponding optimal network parameters. 'random' denotes correlations between training datasets and the network parameters selected randomly from 1000 groups. (Bar chart; x-axis: Datasets (Fashion-MNIST, CIFAR-100, Mini-ImageNet); y-axis: Correlation Coefficient; series: random, learned.)

A basic assumption behind the above prediction task is that there exist correlations between datasets and their corresponding parameters of a given network. In order to demonstrate the rationality of this assumption, we perform the following experiment: for a dataset, we first randomly sample 3000 images to train a 3-layer convolutional neural network until convergence. Then we apply average pooling to the original inputs to obtain a vector representation of the training data. We repeat the above experiment 1000 times, and thus obtain 1000 groups of representations and the corresponding network parameters. Finally, we utilize Canonical Correlation Analysis (CCA) (Weenink, 2003) to evaluate the correlations between training data and the network parameters over these 1000 groups. Figure 1 shows the results, which illustrate that there are indeed correlations between training datasets and their network parameters for a given network architecture. In light of this, we propose a new hypernetwork, called PudNet, to learn a hyper-mapping between datasets and network parameters. Specifically, PudNet first summarizes the characteristics of datasets by compressing them into different vectors as their sketches. Then, PudNet extends the traditional hypernetwork (Ha et al., 2017) to predict network parameters of different layers based on these vectors.
Considering that parameters among different layers should be dependent, we incorporate the Gated Recurrent Unit (GRU) (Cho et al., 2014) into PudNet to capture the relations among them, so as to improve the performance of PudNet. Finally, it is worth noting that, for training PudNet, it is infeasible to prepare thousands of datasets and train networks on each of them to obtain the corresponding optimal parameters. Instead, we adopt a meta-learning based approach (Finn et al., 2017) to train the hypernetwork. Extensive experiments demonstrate the surprising effectiveness of our PudNet. For example, it takes around 54, 119, 140 GPU seconds to train ResNet-18 using Adam from scratch and obtain top-1 accuracies of 99.91%, 74.56%, 71.84% on Fashion-MNIST (Xiao et al., 2017), CIFAR-100 (Krizhevsky et al., 2009), and Mini-ImageNet (Vinyals et al., 2016), respectively. In contrast, our method costs only around 0.5 GPU seconds to predict the parameters of ResNet-18 and still achieves 96.24%, 73.33%, 71.57% top-1 accuracies on the three datasets respectively, at least 100 times faster than the traditional training paradigm.

Our contributions are summarized as follows:
1) We find that there are correlations between datasets and their corresponding parameters of a given network, and propose a general training paradigm for deep networks by formulating network training as a parameter prediction task.
2) We extend the hypernetwork to learn the correlations between datasets and their corresponding network parameters, such that we can directly generate parameters for arbitrary unseen data with only a single forward propagation.
3) Our method achieves surprisingly good performance on unseen data, which we hope will motivate more researchers to explore this research direction.

2.1. HYPERNETWORKS

The original goal of the hypernetwork proposed in (Ha et al., 2017) is to decrease the number of training parameters, by training a smaller hypernetwork to generate the parameters of another, larger network on a fixed dataset. Because of its promising performance, the hypernetwork has been gradually applied to various tasks (Krueger et al., 2017; Zhang et al., 2019; von Oswald et al., 2020; Li et al., 2020; Shamsian et al., 2021). von Oswald et al. (2020) propose a task-conditioned hypernetwork to overcome catastrophic forgetting in continual learning. It learns an embedding for each task and utilizes the task embedding to generate the corresponding parameters for each task. The Bayesian hypernetwork (Krueger et al., 2017) is proposed to approximate Bayesian inference in neural networks. GHN-2, proposed in Knyazev et al. (2021), attempts to build a mapping between network architectures and network parameters, where the dataset is always fixed. GHN-2 leverages graph neural networks to model the information of the network architectures for learning the mapping. Our work is orthogonal to GHN-2, since we aim to build a mapping between datasets and network parameters, given a network architecture. Moreover, we extend the traditional hypernetwork by incorporating GRU to capture the relations among parameters of different layers, and develop a meta-learning based approach to optimize the hypernetwork.

2.2. ACCELERATION OF NETWORK TRAINING

Many works have been proposed to speed up the training process of deep neural networks in the past decade, including optimization based methods (Kingma & Ba, 2015; Yong et al., 2020; Anil et al., 2020) , normalization based methods (Ioffe & Szegedy, 2015; Ba et al., 2016) , parallel training methods (Chen et al., 2018; Kim et al., 2019) 

3. PROPOSED METHOD

In this section, we will introduce our PudNet in detail. For better illustration, we first give preliminaries and our problem formulation, and then elaborate the details of our method. 

3.1. PRELIMINARIES

$$\arg\min_{\theta} \sum_{i=1}^{N} \mathcal{L}\big(\mathcal{F}(D_i, \Omega; H_\theta),\, M_i^{\Omega}\big), \tag{1}$$

where $\mathcal{F}(D_i, \Omega; H_\theta)$ denotes a forward propagation of our hypernetwork $H_\theta$. The input of the forward propagation is the dataset $D_i$ and its output is the parameters of network $\Omega$ predicted by $H_\theta$. Note that the architecture of $\Omega$ is always fixed during training and testing, e.g., ResNet-18. This makes sense because we often apply a representative deep model to data of different domains. Thus, it is meaningful to predict the network parameters for unseen data using an identical network architecture. $M^{\Omega} = \{M_i^{\Omega}\}_{i=1}^{N}$ denotes the ground-truth parameter set of network $\Omega$ corresponding to the datasets $D_{train}$, where $M_i^{\Omega}$ is the ground-truth parameters for the dataset $D_i$. $\mathcal{L}$ is a loss function measuring the difference between the ground-truth parameters $M_i^{\Omega}$ and the predicted parameters.

The core idea in (1) is to learn a hyper-mapping $H_\theta$ between the datasets $D_{train}$ and the network parameter set $M^{\Omega}$, on the basis of our finding that there are correlations between datasets and the network parameters, as shown in Figure 1. However, it is prohibitive to prepare thousands of datasets $D_i$ and train network $\Omega$ on each $D_i$ to obtain the corresponding ground-truth parameters $M_i^{\Omega}$. To alleviate this problem, we adopt a meta-learning based (Vinyals et al., 2016) approach to train the hypernetwork $H_\theta$, and propose another objective function:

$$\arg\min_{\theta} \sum_{i=1}^{N} \sum_{x_j \in D_i} \mathcal{L}\big(x_j, y_j;\, \mathcal{F}(D_i, \Omega; H_\theta)\big).$$

Instead of optimizing $H_\theta$ by directly matching the predicted parameters $\mathcal{F}(D_i, \Omega; H_\theta)$ with the ground-truth parameters $M_i^{\Omega}$, we can adopt a typical loss, e.g., cross-entropy, to optimize $H_\theta$, where each dataset $D_i$ can be regarded as a task in meta-learning (Vinyals et al., 2016). By learning on multiple tasks, the parameter predictor $H_\theta$ gradually learns to predict performant parameters for the training datasets $D_{train}$.
During testing, we can utilize $\mathcal{F}(D, \Omega; H_\theta)$ to directly predict the parameters for a testing dataset $D$ never seen in $D_{train}$ with only a single forward propagation.
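To make the second objective concrete, the following is a minimal toy sketch, not the PudNet architecture. It assumes a hypothetical one-parameter target model $f(x) = w x$, a hypothetical task family in which the optimal slope depends on the mean of the inputs, and a linear hypernetwork with parameters `theta`. The hypernetwork is trained only through the task loss of the predicted parameter, never by matching ground-truth parameters, and is then applied to an unseen dataset with a single forward pass.

```python
# Toy illustration of training a hypernetwork with the task loss only.
# The linear target model, the scalar sketch and the task family are all
# hypothetical stand-ins chosen for brevity; they are not from the paper.

def make_task(mu):
    """Build a toy dataset whose optimal slope (2 * mu + 1) depends on its input mean."""
    xs = [mu - 0.5, mu, mu + 0.5]
    slope = 2.0 * mu + 1.0  # optimal parameter; never shown to the hypernetwork
    return [(x, slope * x) for x in xs]

def sketch(dataset):
    """Dataset sketch: the mean of the inputs (labels are not used)."""
    return sum(x for x, _ in dataset) / len(dataset)

def predict_weight(theta, dataset):
    """Hypernetwork forward pass: map the dataset sketch to the target weight w."""
    return theta[0] + theta[1] * sketch(dataset)

def train(tasks, steps=500, lr=0.1):
    """Gradient descent on the mean squared task loss of the predicted weight."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for dataset in tasks:
            s = sketch(dataset)
            w = predict_weight(theta, dataset)
            # Derivative of the task loss of f(x) = w * x with respect to w.
            dw = sum(2.0 * (w * x - y) * x for x, y in dataset) / len(dataset)
            grad[0] += dw / len(tasks)      # dw/dtheta[0] = 1
            grad[1] += dw * s / len(tasks)  # dw/dtheta[1] = s
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

train_tasks = [make_task(mu) for mu in (-1.0, -0.5, 0.0, 0.5, 1.0)]
theta = train(train_tasks)

unseen = make_task(0.25)                 # a dataset never used for training
w_pred = predict_weight(theta, unseen)   # single forward pass, no iterative training
```

In this toy setting the predicted weight for the unseen dataset approaches its optimal slope because the sketch carries the information that determines the optimal parameter, which mirrors the correlation assumption stated above.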

3.2. OVERVIEW OF OUR FRAMEWORK

Our goal is to learn a hypernetwork $H_\theta$, so as to directly predict the network parameters for an unseen dataset. However, two issues remain unsolved. First, the sizes of different $D_i$ may differ and may be large, which makes $H_\theta$ hard to train. Second, there may be correlations among parameters of different layers in a network, and how to capture such context relations among parameters has not been fully explored so far. To this end, we propose a novel framework, PudNet, as shown in Figure 2. PudNet first introduces a dataset compression module to compress each dataset $D_i$ into a small sketch $s_i \in \mathbb{R}^{l \times m}$ that summarizes the major characteristics of $D_i$, where $l$ and $m$ are the size and dimension of the sketch, respectively. Then, our context-aware parameter prediction module takes the sketch $s_i$ as input, and outputs the predicted parameters of the target network, e.g., ResNet-18. At its heart, multiple GRUs sharing weights are constructed to capture the dependencies of parameters among different layers in the network. Finally, PudNet is optimized based on a support set in a meta-learning based manner.

3.3. DATASET COMPRESSION

To solve the issue of different sizes of training datasets, we first compress each dataset into a sketch with a fixed size. In recent years, many data compression methods have been proposed, such as matrix sketching (Liberty, 2013; Qian et al., 2015), random projection (Sarlos, 2006; Liberty et al., 2007), etc. In principle, these methods can be applied to our data compression module. For simplicity, we leverage a deep neural network to extract a feature vector as the representation of each sample, and then apply average pooling to generate a final vector as the sketch of the dataset. The sketch $s_i$ for the dataset $D_i$ is calculated as:

$$s_i = \frac{1}{|D_i|} \sum_{x_j \in D_i} T_{\phi}(x_j),$$

where $T_{\phi}(\cdot)$ denotes a feature extractor parameterized by $\phi$, and $|D_i|$ is the size of the dataset $D_i$. The structure of the feature extractor used in the experiments can be found in Appendix A.1. The parameter $\phi$ is jointly trained with PudNet in an end-to-end fashion. In future work, more effective solutions for summarizing the information of a dataset could be explored, e.g., using a statistic network (Edwards & Storkey, 2017) to compress datasets.
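The average-pooling sketch can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: `extract_features` is a hypothetical placeholder for the learned extractor $T_{\phi}$ (here simply the identity on a list of floats), so only the pooling step reflects the text.

```python
def extract_features(sample):
    """Hypothetical stand-in for the learned feature extractor T_phi (identity here)."""
    return list(sample)

def dataset_sketch(dataset, extractor=extract_features):
    """Average the per-sample feature vectors into one fixed-size sketch."""
    features = [extractor(sample) for sample in dataset]
    count = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / count for d in range(dim)]

small = [[1.0, 2.0], [3.0, 4.0]]
large = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
sketch_small = dataset_sketch(small)   # mean of two samples
sketch_large = dataset_sketch(large)   # mean of four samples
```

Both sketches have the same dimension even though the datasets differ in size, which is the property the compression module relies on.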

3.4. CONTEXT-AWARE PARAMETER PREDICTION

After obtaining the sketches for all training datasets, we feed them into the context-aware parameter prediction module, i.e., our hypernetwork, as shown in Figure 2. In the following, we introduce this module in detail.

Capturing Context Relations via GRU

Since an input of a neural network passes sequentially through the layers of the network, the parameters of different layers should not be independent. If we ignore the context relations among parameters of different layers, the solution may be suboptimal. Thus, we utilize GRU (Cho et al., 2014) to capture the context-aware parameter relations, as shown in Figure 3. Note that we make two changes compared to the conventional GRU: 1) the conventional GRU uses a randomly initialized hidden state $h_0$. In contrast, we set the dataset sketch embedding as the initial hidden state, $h_0 = s_i$, so as to provide dataset information for predicting the corresponding structure parameters; 2) in each recurrent step, the conventional GRU usually takes the next word as input in the natural language processing field (Cho et al., 2014). Instead, we take the predicted structure parameters of the previous layer as input, so that the information about shallower-layer parameters memorized in the GRU helps the prediction of parameters in deeper layers. In this way, the dependency relations among parameters of different layers can be well captured.

Figure 3: An illustration of capturing context relations via GRU.

A formal description follows:

$$r_t = \sigma(W_r \cdot [h_{t-1}, a_{t-1}]),$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, a_{t-1}]),$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, a_{t-1}]),$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t,$$
$$a_t = \sigma(W_o \cdot h_t),$$

where $h_t$ transfers dataset-related information, $a_t$ is a latent vector encoding the context information, and $W_r$, $W_z$, $W_h$, $W_o$ are the learnable parameters. As in GRU, the reset gate $r_t$ decides how much information in the hidden state $h_{t-1}$ needs to be reset.
$\tilde{h}_t$ is a new memory, which absorbs the information of $h_{t-1}$ and $a_{t-1}$. $z_t$ is an update gate, which regulates how much information in $\tilde{h}_t$ to update and how much information in $h_{t-1}$ to forget. The context-aware output $a_t$ is taken as input to the weight generator to predict the parameters of the $t$-th layer of the target network $\Omega$.
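The recurrence can be written out directly. The snippet below is a minimal pure-Python sketch of one shared-weight GRU applied across layers, with the hidden state initialized to the dataset sketch. Two points are assumptions rather than details from the text: the initial input `a_0` is set to zeros, and the weights are small random values from a hypothetical `init_params` helper instead of trained parameters.

```python
import math
import random

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_params(dim, seed=0):
    """Hypothetical helper: random (untrained) weights shared by all recurrent steps."""
    rng = random.Random(seed)
    return {
        "W_r": random_matrix(dim, 2 * dim, rng),
        "W_z": random_matrix(dim, 2 * dim, rng),
        "W_h": random_matrix(dim, 2 * dim, rng),
        "W_o": random_matrix(dim, dim, rng),
    }

def gru_step(params, h_prev, a_prev):
    """One step of the modified GRU: the input is the previous context output a_{t-1}."""
    concat = h_prev + a_prev                                   # [h_{t-1}, a_{t-1}]
    r = sigmoid(matvec(params["W_r"], concat))                 # reset gate
    z = sigmoid(matvec(params["W_z"], concat))                 # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]             # r_t * h_{t-1}
    h_cand = [math.tanh(v) for v in matvec(params["W_h"], gated + a_prev)]
    h = [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
    a = sigmoid(matvec(params["W_o"], h))                      # context-aware output a_t
    return h, a

def context_outputs(params, sketch, num_layers):
    """Run the shared GRU once per target-network layer, starting from h_0 = s_i."""
    h = list(sketch)              # h_0 = s_i
    a = [0.0] * len(sketch)       # a_0: assumed to be zeros (not specified in the text)
    outputs = []
    for _ in range(num_layers):
        h, a = gru_step(params, h, a)
        outputs.append(a)
    return outputs

params = init_params(dim=4)
outputs = context_outputs(params, sketch=[0.1, -0.2, 0.3, 0.0], num_layers=3)
```

Each entry of `outputs` plays the role of $a_t$ for one layer; because the same `params` are reused at every step, the cost of the recurrence does not grow with a per-layer set of GRU weights.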

Initial Residual Connection

To ensure that the final context-aware output contains at least a fraction of the initial dataset information, we additionally implement an initial residual connection between the dataset sketch embedding $s_i$ and $a_t$:

$$\hat{a}_t = a_t \times (1 - \eta) + s_i \times \eta,$$

where $\eta$ is a hyperparameter. After obtaining $\hat{a}_t$, we feed it into a weight generator that generates the parameters of $\Omega$.

Weight Generator

Since the target network $\Omega$ usually has different sizes and dimensions in different layers, we construct a weight generator $g_{\psi_t}$ for each layer $t$ to transform $\hat{a}_t$ of a fixed dimension into a network parameter tensor $w_t$ with variable dimensions. Here $g_{\psi_t}$ denotes the weight generator of the $t$-th layer, $\psi_t$ denotes the learnable parameters of $g_{\psi_t}$, and $w_t$ is the predicted parameter of the $t$-th layer in $\Omega$. We can derive the predicted parameter of the $t$-th layer as:

$$w_t = g_{\psi_t}(\hat{a}_t), \tag{5}$$

where $g_{\psi_t}$ consists of one linear layer and two $1 \times 1$ convolutional layers. More details of the weight generator can be found in Appendix A.1. Once the parameters $w_t$ of each layer are predicted, we use them as the final parameters of $\Omega$.
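The residual mixing and the per-layer reshaping can be sketched as follows. This is a simplified illustration under stated assumptions: `generate_weights` uses a single linear map followed by a reshape, whereas the generator described above uses one linear layer and two $1 \times 1$ convolutional layers, and the matrices `psi_small` and `psi_large` are hypothetical hand-written values.

```python
def initial_residual(a_t, s_i, eta):
    """Mix the context output with the dataset sketch: a_t * (1 - eta) + s_i * eta."""
    return [a * (1.0 - eta) + s * eta for a, s in zip(a_t, s_i)]

def generate_weights(psi_t, a_hat, shape):
    """Simplified per-layer generator: one linear map, then reshape to the layer's shape."""
    rows, cols = shape
    flat = [sum(w * x for w, x in zip(row, a_hat)) for row in psi_t]
    if len(flat) != rows * cols:
        raise ValueError("psi_t must produce rows * cols outputs")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

a_hat = initial_residual([1.0, 0.0], [0.0, 1.0], eta=0.2)

# Two layers with different parameter shapes share the same fixed-size input a_hat.
psi_small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]   # 4 outputs -> 2 x 2
psi_large = [[float(i), 1.0] for i in range(6)]                # 6 outputs -> 2 x 3
w_small = generate_weights(psi_small, a_hat, (2, 2))
w_large = generate_weights(psi_large, a_hat, (2, 3))
```

Setting `eta=0` returns the GRU output unchanged and `eta=1` returns the sketch itself, so $\eta$ controls how much raw dataset information reaches each layer's generator.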

3.5. OPTIMIZATION OF OUR FRAMEWORK

In this section, we introduce how to optimize our PudNet. In contrast to traditional classification tasks, where training data and testing data share an identical label space, in our task the label spaces of training and testing can differ and may not even overlap. Thus, a classification head trained on the training data cannot be used to predict labels of testing data. Motivated by several metric learning methods (Chen et al., 2021; Oreshkin et al., 2018), we introduce a parameter-free classification method to solve the above issue. Similar to Chen et al. (2021), we obtain a metric-based category prediction on class $c_k$ as:

$$p(y = c_k \mid x_j, \Omega, D_i; H_\theta) = \frac{\exp\big(\tau \cdot \langle f(x_j; \mathcal{F}(D_i, \Omega; H_\theta)), u_k \rangle\big)}{\sum_{c} \exp\big(\tau \cdot \langle f(x_j; \mathcal{F}(D_i, \Omega; H_\theta)), u_c \rangle\big)},$$

where $u_k$ is the centroid of class $c_k$, i.e., the average output of the predicted network $\Omega$ over samples belonging to class $c_k$ in the support set, as in Snell et al. (2017). $\langle \cdot, \cdot \rangle$ denotes the cosine similarity of two vectors, and $\tau$ is a learnable temperature parameter. $f(x_j; \mathcal{F}(D_i, \Omega; H_\theta))$ is the output of the target neural network $\Omega$ for the input $x_j$. Then, the parameter-free classification loss is defined as:

$$\mathcal{L}_1 = \sum_{i=1}^{N} \sum_{x_j \in D_i} \mathcal{L}\big(p(y \mid x_j, \Omega, D_i; H_\theta), y_j\big),$$

where $y_j$ is the true label of $x_j$ and $\mathcal{L}$ is the cross-entropy loss.

To further improve the performance of the model, we introduce an auxiliary task for training our hypernetwork, by adding a full classification head $Q_{\varphi}$ parameterized by $\varphi$. The classification head maps the output of the target network $\Omega$ to probabilities over the whole set of classes $C_{tr}$ from $D_{train}$. The full classification loss is defined as:

$$\mathcal{L}_2 = \sum_{i=1}^{N} \sum_{x_j \in D_i} \mathcal{L}\big(Q_{\varphi}(f(x_j; \mathcal{F}(D_i, \Omega; H_\theta))), y_j\big).$$

Our parameter prediction task co-trained with a full classification head is related to curriculum learning (Oreshkin et al., 2018). Since learning on a varying label space is more challenging than learning on a static one, the full classification problem that maps features to a static label set can be regarded as a simpler curriculum. This easier 'prerequisite' can help the hypernetwork obtain basic-level knowledge before handling the harder parameter prediction task.

Moreover, to make the parameter-free prediction and the full classification prediction consistent, motivated by (Chen et al., 2022; Wu et al., 2019), we introduce a Kullback-Leibler Divergence loss to encourage their predicted probabilities to match:

$$\mathcal{L}_3 = \sum_{i=1}^{N} \sum_{x_j \in D_i} \mathrm{KL}\big(q(y \mid x_j) \,\|\, p(y \mid x_j)\big),$$

where KL is the Kullback-Leibler Divergence, and $p(y \mid x_j)$ and $q(y \mid x_j)$ are the predicted probabilities of $x_j$ from the parameter-free and full classification methods, respectively. $p(y \mid x_j)$ is padded with zeros so that its dimension matches that of $q(y \mid x_j)$. Finally, the overall multi-task loss is:

$$\mathcal{L}_{total} = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3, \tag{10}$$

where $D_i$ can be regarded as a task similar to that in meta-learning. By minimizing (10), our hypernetwork can be well trained. For unseen data in testing, we utilize our hypernetwork to directly predict its network parameters, and use the parameter-free method for classification. The training procedure of our PudNet is given in Appendix A.4.
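The parameter-free prediction used at test time can be illustrated in isolation. The sketch below computes class centroids from support-set features, scores a query by temperature-scaled cosine similarity, and applies a softmax followed by cross-entropy. The feature vectors, the class names, and the fixed value of `tau` are hypothetical; in the method itself the features come from the target network with predicted parameters and $\tau$ is learned.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_centroids(support_features, support_labels):
    """Average the support features of each class to obtain its centroid u_k."""
    groups = {}
    for feature, label in zip(support_features, support_labels):
        groups.setdefault(label, []).append(feature)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in groups.items()}

def predict_proba(feature, centroids, tau):
    """Softmax over temperature-scaled cosine similarities to the class centroids."""
    logits = {label: tau * cosine(feature, u) for label, u in centroids.items()}
    top = max(logits.values())                       # subtract max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def cross_entropy(probs, true_label):
    return -math.log(probs[true_label])

# Hypothetical 2-D features standing in for outputs of the target network.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["cat", "cat", "dog", "dog"]
centroids = class_centroids(support_features, support_labels)
probs = predict_proba([1.0, 0.05], centroids, tau=10.0)
loss = cross_entropy(probs, "cat")
```

Because the classifier is defined entirely by the centroids, it needs no trained head and therefore still applies when the test classes were never seen during training.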

4.1. DATASET CONSTRUCTION

In the experiments, we construct numerous datasets for evaluating our method based on four datasets: Fashion-MNIST (Xiao et al., 2017), CIFAR-100 (Krizhevsky et al., 2009), Mini-ImageNet (Vinyals et al., 2016), and Animals-10 (Gupta & Brown, 2022). The constructed datasets are summarized as follows.

Fashion-set: We randomly select 6 classes from Fashion-MNIST to construct training datasets and use the remaining 4 classes for constructing testing datasets. We construct 2000 groups of datasets from the 6-category training set to train PudNet. To verify PudNet's ability to directly generate parameters, we construct 500 groups of datasets from the 4-category testing set to generate 500 groups of network parameters. For each group of network parameters, we also construct another dataset that has identical labels but no overlapping samples with the dataset used for generating parameters, in order to test the performance of the predicted network parameters. Each dataset contains 600 randomly sampled images from 2 randomly sampled classes.

CIFAR100-set: We randomly choose 80 classes from CIFAR-100 for constructing training datasets and the 20 remaining, non-overlapping classes to construct testing datasets. We sample 100000 groups of datasets for training PudNet. Similar to Fashion-set, we construct 500 groups of datasets to directly generate their network parameters by PudNet, and create another 500 groups of datasets for testing the performance of the predicted parameters. Each dataset consists of 500 images from 5 randomly selected classes.

ImageNet-set: Similar to CIFAR100-set, the Mini-ImageNet dataset is randomly split into 80 classes for creating training datasets and 20 classes for creating testing datasets. We sample 50000 groups of datasets for training PudNet. Similar to Fashion-set, we use 500 groups of datasets to generate their network parameters, and construct another 500 groups of datasets for testing. Each dataset contains 500 images from 5 randomly selected classes.

CIFAR100→Animals10: To further verify our PudNet, we construct a cross-domain dataset. We use CIFAR100-set to construct training datasets, and Animals-10 for testing datasets. There are 100000 groups of datasets from CIFAR100-set to train PudNet. We randomly split Animals-10 into two non-overlapping subsets: one is used to generate parameters, and the other for testing.
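The construction protocol can be sketched with the standard library. The example follows the CIFAR100-set numbers (80/20 class split, 5 classes and 500 images per dataset) on synthetic image identifiers. Two details are assumptions made for the illustration: images are drawn in equal numbers per class, and the helper names `split_classes` and `sample_dataset_pair` are hypothetical.

```python
import random

def split_classes(all_classes, num_train, rng):
    """Randomly split class labels into disjoint training and testing class sets."""
    shuffled = list(all_classes)
    rng.shuffle(shuffled)
    return shuffled[:num_train], shuffled[num_train:]

def sample_dataset_pair(images_by_class, classes, num_classes, num_images, rng):
    """Sample one dataset for parameter generation and one for evaluation.

    Both datasets share the same labels but contain no common image.
    Assumption: images are split evenly across the chosen classes.
    """
    chosen = rng.sample(classes, num_classes)
    per_class = num_images // num_classes
    generation, evaluation = [], []
    for label in chosen:
        picked = rng.sample(images_by_class[label], 2 * per_class)
        generation += [(image, label) for image in picked[:per_class]]
        evaluation += [(image, label) for image in picked[per_class:]]
    return generation, evaluation

rng = random.Random(0)
# Synthetic stand-in for an image collection: 100 classes with 600 image ids each.
images_by_class = {c: [f"class{c}_img{i}" for i in range(600)] for c in range(100)}
train_classes, test_classes = split_classes(range(100), 80, rng)
generation, evaluation = sample_dataset_pair(images_by_class, test_classes, 5, 500, rng)
```

Repeating `sample_dataset_pair` over the training classes (without the evaluation half) would yield the groups of training datasets, each of which acts as one task.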

4.2. BASELINES

We compare our method with the traditional iterative training paradigm, including training from scratch and one training acceleration method, GC (Yong et al., 2020). We also take the pretrained model as a baseline. In addition, we compare with meta-learning methods, including Match-Net (Vinyals et al., 2016), ProtoNet (Snell et al., 2017), Meta-Baseline (Chen et al., 2021), Meta-DeepDBC (Xie et al., 2022), and MUSML (Jiang et al., 2022). We use two kinds of architectures as our target network $\Omega$: a 3-layer CNN (ConvNet-3) and ResNet-18. To ensure a fair comparison, we use all training datasets for training the meta-learning methods and the pretrained model. For all experiments, we use the ACC (top-1 accuracy) metric to evaluate the classification performance.

4.3. IMPLEMENTATION DETAILS

We perform the experiments on a GeForce RTX 3090 Ti GPU. We set the learning rate to 0.001. For the target network ConvNet-3, we set the hyperparameter $\eta$ to 0.2, 0.1, 0.3, 0.1 for Fashion-set, CIFAR100-set, ImageNet-set, and CIFAR100→Animals10, respectively. For the target network ResNet-18, we set the hyperparameter $\eta$ to 0.2, 0.5, 0.5, 0.5 for Fashion-set, CIFAR100-set, ImageNet-set, and CIFAR100→Animals10, respectively. More details can be found in Appendix A.1 and A.2.

4.4. RESULT AND ANALYSIS

General Performance Analysis

Table 1 and Table 2 show the general results of our method. Our method consistently outperforms the meta-learning methods and the pretrained method. This demonstrates that learning a hyper-mapping between datasets and the corresponding network parameters is effective. To demonstrate the time our method can save, we also compare with training from scratch using Adam (Kingma & Ba, 2015) and the training acceleration technique, GC. It takes around 55, 119, 140 GPU seconds to train ResNet-18 using the accelerated method GC, and the network obtains top-1 accuracies of 99.94%, 75.74%, 72.89% on Fashion-set, CIFAR100-set, and ImageNet-set, respectively. In contrast, our method costs only around 0.5 GPU seconds to predict the parameters of ResNet-18 and still achieves comparable performance (96.24%, 73.33%, 71.57% top-1 accuracies) on the three datasets respectively, at least 100 times faster than the accelerated method.

Performance on Cross-domain Datasets

We further evaluate our method on the cross-domain dataset CIFAR100→Animals10. Table 3 shows the results. Our method obtains accuracy comparable to that of a model trained from scratch for around 5 epochs, while being at least 300 times faster than the traditional training methods. In addition, we also compare with Meta-DeepDBC, which generally achieves the best performance among the meta-learning methods in Table 1 and Table 2, and with MUSML, which was proposed very recently. Our model still outperforms them by a large margin. We expect that such a result could motivate more researchers to explore this direction.

Ablation Study

We design a variant of our method to analyse the effect of the context relation information. PudNet-w.o.-Context denotes our method with the dataset sketch fed directly into the weight generator, without using GRU. The results are listed in Table 4. Our PudNet outperforms PudNet-w.o.-Context by a large margin, demonstrating the effectiveness of capturing dependencies among parameters of different layers. Moreover, we design two variants of our method to further study the contribution of the dataset information. PudNet-Random denotes our method with a randomly initialized GRU hidden state. PudNet-w.o.-initRes denotes our method without the initial residual connection that complements the initial dataset information. As shown in Table 4, the performance of our method decreases when the hidden state of GRU is randomly initialized. This indicates that using the initialization of the hidden state to deliver the information of datasets is effective. Moreover, PudNet is better than PudNet-w.o.-initRes, illustrating the effectiveness of the initial residual connection. Finally, we design another two variants of our method to study the impact of the auxiliary task. PudNet-metric denotes our method using only the parameter-free loss. PudNet-w.o.-kl denotes our method without the KL Divergence loss. As shown in Table 4, PudNet-w.o.-kl performs better than PudNet-metric, demonstrating the effectiveness of the auxiliary full classification task. PudNet outperforms PudNet-w.o.-kl, which illustrates that it is effective to encourage the predicted probability distributions of the parameter-free method and the full classification method to match.

Fine-tuning Predicted Parameters

Since a typical strategy for applying a pretrained model to a new dataset is to fine-tune the model, we evaluate the performance of fine-tuning our method and the baselines on the CIFAR100-set dataset. To do this, we first add an additional linear classification layer to our method and to the baselines, except 'From Scratch'. Then, we randomly select 10000 samples from 20 classes for fine-tuning the models, and use the remaining 2000 samples for testing. Table 5 shows the results. Here 'From Scratch' means directly training the target network from randomly initialized parameters. Our method achieves the best performance. This indicates that the parameters predicted by our PudNet can serve well as a pretrained model.
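The speed-up factors quoted for ResNet-18 can be checked with simple arithmetic. The snippet below is only a worked calculation using the approximate timings reported in the Introduction for Adam training from scratch (around 54, 119, 140 GPU seconds) and for parameter prediction (around 0.5 GPU seconds); since the timings are rounded, the ratios are indicative rather than exact.

```python
def speedup(baseline_seconds, prediction_seconds):
    """Ratio of iterative training time to single-forward-pass prediction time."""
    return baseline_seconds / prediction_seconds

# Approximate GPU seconds for training ResNet-18 with Adam from scratch.
adam_seconds = {"Fashion-MNIST": 54, "CIFAR-100": 119, "Mini-ImageNet": 140}
pudnet_seconds = 0.5  # approximate GPU seconds for one parameter prediction

speedups = {name: speedup(t, pudnet_seconds) for name, t in adam_seconds.items()}
smallest = min(speedups.values())
```

With these figures the ratios are 108, 238 and 280, which agrees with the statements of "at least 100 times" overall and "more than 200 times" on CIFAR-100.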

5. CONCLUSION

In this paper, we found that there are correlations between datasets and the corresponding parameters of a given network, and explored a new training paradigm for deep neural networks. We proposed a new hypernetwork, PudNet, which can directly predict the network parameters for an unseen dataset with only a single forward propagation. Essential to our hypernetwork is the construction of a series of GRUs to capture the relations among parameters of different layers in a network. Extensive experimental results demonstrated the effectiveness and efficiency of our method.

A.3 ADDITIONAL EXPERIMENTS

Performance on Other Full Dataset Analysis: Here we utilize CIFAR100-set to train the model and evaluate the performance on CIFAR-10. Note that the classes in CIFAR-10 are mutually exclusive with the classes in CIFAR-100 (Krizhevsky et al., 2009). Since Meta-DeepDBC achieves the best performance among all meta-learning methods in Table 1 and Table 2, here we only report the results of Meta-DeepDBC and the pretrained model for clarity. Table 6 shows the results on CIFAR-100→CIFAR-10. We observe that our model outperforms the Pretrained and Meta-DeepDBC competitors. This further verifies the effectiveness of our method. We also provide the time consumption of training the target network from scratch with the Adam optimizer on CIFAR-10. It is worth noting that it takes around 622 GPU seconds to train ResNet-18 using Adam from scratch and obtain a top-1 accuracy of 48.21% on CIFAR-10. In contrast, our method costs only around 0.56 GPU seconds to predict the parameters of ResNet-18 and still achieves 51.05% top-1 accuracy, more than 1000 times faster than training from scratch.

Performance on Cross-domain Datasets: We construct another two cross-domain datasets. We use ImageNet-set to construct training datasets, and Animals-10 and CIFAR-10 for testing datasets. There are 50000 groups of datasets from ImageNet-set to train PudNet. We randomly split CIFAR-10 into two non-overlapping subsets: one is used to generate parameters, and the other for testing. The splitting process for Animals-10 is analogous. We further evaluate our method on these two cross-domain datasets: ImageNet→Animals10 and ImageNet→CIFAR10. Table 7 shows the results. Our method still achieves surprisingly good efficiency. For instance, it takes 622.01 GPU seconds to train ResNet-18 on the ImageNet→CIFAR10 dataset using GC from scratch and obtain a top-1 accuracy of 49.89%, while our method PudNet costs only 0.57 GPU seconds to predict the network parameters of ResNet-18 achieving comparable performance (45.07%), more than 1000 times faster than the traditional training paradigm. In addition, our method also outperforms state-of-the-art meta-learning methods by a large margin. We expect that such a result could motivate more researchers to explore this direction.

Performance on Deeper Target Network: We perform another experiment to directly predict the parameters of ResNet-34 by PudNet on CIFAR100-set. The results are listed in Table 8. We observe that our method achieves performance comparable to that of GC at 30 epochs, while being more than 250 times faster than GC. This further demonstrates the efficiency of our method.

Effect of Different Dataset Embedding: To further study the effect of the dataset embedding, we design three variants to predict parameters for ConvNet-3 on CIFAR100. "Sum" denotes summing up the representations of all samples in a dataset as the dataset embedding. "Geometric means" denotes using the geometric mean of sample representations as the dataset embedding. "Mean+Var" denotes concatenating the mean and the variance of sample representations as the dataset embedding. The results are reported in Table 9. We find that these four dataset embedding methods have comparable results. Here we only explore some simple dataset embedding methods. In the future, more complicated data compression methods could be explored, such as matrix sketching (Liberty, 2013; Qian et al., 2015), random projection (Sarlos, 2006; Liberty et al., 2007), and statistic networks (Edwards & Storkey, 2017), etc.

Effect of Different Number of Stacked GRU: We exploit stacked GRU to transfer context-aware information of parameters for ResNet-18.
We conduct experiments to investigate the effect of the number of layers in the stacked GRU and report the results in Table 11. We observe that, as the number of layers increases, the performance first rises and then falls.

Effect of Different Structure of Feature Extractors: To study the influence of feature extractors with different architectures, we add an experiment on CIFAR100-set in terms of ConvNet-3. The results are listed in Table 12, where "2conv", "3conv", and "4conv" denote stacking 2, 3, and 4 convolution layers as the feature extractor, respectively, and "1linear" denotes adding one linear layer after the convolution layer. We observe that our method obtains comparable performance with feature extractors of different structures. Thus, our method is not sensitive to the structure of the feature extractor.

For the target network ConvNet-3 (Figure 6), we construct 2000, 100000, and 50000 groups of datasets for training PudNet on Fashion-set, CIFAR100-set, and ImageNet-set, respectively. Figure 7 reports the results of exploiting PudNet to predict parameters for ResNet-18, for which we construct 2000, 100000, and 100000 groups of datasets for training PudNet on Fashion-set, CIFAR100-set, and ImageNet-set, respectively. We find that with more datasets for training, PudNet obtains better performance.

Transfer Learning: The key idea of transfer learning is to transfer knowledge from source domains to a different but related target domain to improve the performance of the target learner (Zhuang et al., 2020). There are many transfer learning methods, including feature-based methods (Gretton et al., 2012), parameter-based methods (Tommasi & Caputo, 2009), and relational-based methods (Richardson & Domingos, 2006). Feature-based approaches usually transform the original sample features in different domains into a common latent feature space.
Parameter-based methods usually learn to finetune the parameters of the last few layers across different domain datasets. Relational-based methods transfer the logical relationships or rules learned in the source domain to the target domain.

Meta-Learning: Meta-learning introduces the mechanism of "learning to learn", which intends to train a model on a variety of learning tasks such that it can solve new learning tasks using only a small number of training samples (Hospedales et al., 2021). Meta-learning is usually divided into three categories: optimization-based methods, metric-based methods, and model-based methods (Yao et al., 2020). Optimization-based methods usually train the model to be easy to fine-tune with a small number of gradient steps and a small amount of training data (Finn et al., 2017). Metric-based methods learn to compare validation points with training points and predict the labels of the matching training points (Hospedales et al., 2021). Note that few-shot learning can be regarded as an application of metric-based meta-learning (Hospedales et al., 2021). Model-based methods embed the training data into an activation state and make predictions for test data based on this state.

Zero-shot Learning: Zero-shot learning aims to learn a classifier that can classify classes never seen during training, without any labeled data of the novel classes (Wang et al., 2019). Representative zero-shot learning approaches include ESZSL (Romera-Paredes & Torr, 2015), SAE (Kodirov et al., 2017), and f-CLSWGAN (Xian et al., 2018). The core idea of zero-shot learning is to transfer the learned knowledge of seen classes to the classes unseen during training (Pourpanah et al., 2022). Since there is no label information for the unseen classes, auxiliary information for each unseen class is necessary to solve the zero-shot learning problem (Wang et al., 2019).
For example, given the auxiliary information for an unseen class zebra, "look like horse, with stripes", the zero-shot learning model could use this semantic information to recognize the zebra class as long as the model knows the patterns of "horse" and "stripes" (Fu et al., 2015).

Since our work aims to learn a hyper-mapping between datasets and their corresponding network parameters and to directly predict the parameters for an unseen dataset based on this hyper-mapping, our task is totally different from the above works. Despite this, there are some relations between our method and the above works. First, since it is prohibitive to prepare thousands of datasets and train networks on them to obtain the corresponding ground-truth parameters, we develop a new learning manner motivated by meta-learning. Besides, because our method can directly predict network parameters for an unseen dataset, it has the potential to be coupled with zero-shot learning, which is worth further study.



Figure 2: Overview of our proposed PudNet. PudNet first compresses each dataset into a sketch with a fixed size, and then utilizes the hypernetwork to generate parameters of a target network based on the sketch. Finally, PudNet is optimized based on a support set in a meta-learning based manner.
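
The compression step in Figure 2 turns a variable-size dataset into a fixed-size sketch. As an illustration only, the following NumPy sketch pools per-sample feature vectors using the "Sum", "Geometric means", and "Mean+Var" variants named in the dataset-embedding study. The function name `dataset_embedding`, the mode strings, and the plain-mean baseline are assumptions made for this example, not the authors' implementation; the geometric mean additionally assumes strictly positive features.

```python
import numpy as np


def dataset_embedding(features: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Pool per-sample features of shape (n_samples, d) into one vector.

    The output size does not depend on n_samples, so datasets of
    different sizes map to sketches of the same size.
    """
    if mode == "mean":  # assumed baseline, not named in the text
        return features.mean(axis=0)
    if mode == "sum":
        return features.sum(axis=0)
    if mode == "geometric_mean":
        # The geometric mean is only defined for positive values here.
        if np.any(features <= 0):
            raise ValueError("geometric_mean requires strictly positive features")
        return np.exp(np.log(features).mean(axis=0))
    if mode == "mean_var":
        # Concatenate first and second moments: output has size 2 * d.
        return np.concatenate([features.mean(axis=0), features.var(axis=0)])
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical usage: 100 samples with 32-dimensional positive features.
rng = np.random.default_rng(0)
feats = rng.uniform(0.1, 1.0, size=(100, 32))
sketches = {m: dataset_embedding(feats, m)
            for m in ("mean", "sum", "geometric_mean", "mean_var")}
```

All variants except "Mean+Var" keep the feature dimension, so only that variant changes the input size seen by the hypernetwork.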

Figure 6: Effect of Different Groups of Datasets for Training to Predict Parameters for ConvNet-3.

Optimization-based methods mainly aim to improve stochastic gradient descent. For instance, Yong et al. (2020) propose a gradient centralization method that centralizes gradient vectors to improve the Lipschitzness of the loss function. Normalization-based methods intend to propose good normalization methods to speed up the training process. The representative work is batch normalization (Ioffe & Szegedy, 2015), which can make the optimization landscape smooth and lead to fast convergence (Santurkar et al., 2018). Parallel training methods usually combine multiple hardware devices to conduct parallel training, which can reduce training time by distributing the computation across devices. However, these methods still follow the traditional iterative training paradigm. Different from them, we attempt to explore a new training paradigm and transform the network training problem into a prediction task.
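
To make the gradient centralization idea mentioned above more concrete, here is a minimal NumPy sketch of the centralization step as it is commonly formulated: for each output unit, the mean of its gradient slice is subtracted so that the slice has zero mean. The function names, the output-dimension-first weight layout, and the plain SGD update are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def centralize_gradient(grad: np.ndarray) -> np.ndarray:
    """Zero-center the gradient slice of every output unit.

    Assumes the first axis indexes output units (e.g. output channels of
    a convolution). 1-D gradients such as biases are returned unchanged.
    """
    if grad.ndim <= 1:
        return grad
    other_axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=other_axes, keepdims=True)


def sgd_step_with_gc(weight: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One plain SGD update that uses the centralized gradient."""
    return weight - lr * centralize_gradient(grad)


# Hypothetical usage on a convolution weight of shape (out, in, k, k).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 5, 5))
g = rng.normal(size=(4, 3, 5, 5))
w_next = sgd_step_with_gc(w, g, lr=0.1)
```

Note that this remains an iterative update rule, which is the point of contrast with predicting parameters in a single forward pass.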

Notation: We denote by $H_\theta$ our hypernetwork parameterized by $\theta$. Let $D_{train} = \{D_i\}_{i=1}^{N}$ be the set of training datasets, where $D_i$ is the $i$-th dataset and $N$ is the number of training datasets. Each sample $x_j \in D_i$ has a label $y_j \in C_i^{tr}$, where $C_i^{tr}$ is the class set of $D_i$. We use $C^{tr} = \bigcup_{i=1}^{N} C_i^{tr}$ to denote the whole label set of the training datasets. Similarly, we define $D_{test}$ as the set of unseen datasets used for testing and $C^{te}$ as the set containing all labels in $D_{test}$.
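
To make the role of the hypernetwork concrete, the following NumPy sketch shows one way a dataset sketch could be mapped to per-layer convolution weights: the sketch initializes the hidden state of a GRU cell shared by all layers, and each hidden state is turned into a weight tensor of shape d_out × d_in × f × f. This is a sketch under stated assumptions, not the authors' architecture: the per-layer GRU inputs, the linear weight generators, and all names (`GRUCell`, `predict_parameters`) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Standard GRU cell with update gate z, reset gate r, candidate n."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: np.random.Generator):
        self.W = rng.normal(0.0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def __call__(self, x: np.ndarray, h: np.ndarray) -> np.ndarray:
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * n + z * h


def predict_parameters(sketch, layer_shapes, rng):
    """Map a dataset sketch to one weight tensor per target-network layer."""
    d = sketch.shape[0]
    gru = GRUCell(input_dim=d, hidden_dim=d, rng=rng)  # shared across layers
    h = sketch  # initial hidden state h_0 = s_i
    params = []
    for shape in layer_shapes:
        # Assumption: a per-layer input vector (random stand-in for a
        # learned layer embedding).
        layer_input = rng.normal(0.0, 0.1, d)
        h = gru(layer_input, h)
        # Assumption: a linear weight generator followed by a reshape.
        generator = rng.normal(0.0, 0.1, (int(np.prod(shape)), d))
        params.append((generator @ h).reshape(shape))
    return params


# Hypothetical usage: a 16-dimensional sketch and two conv layers.
rng = np.random.default_rng(0)
layer_shapes = [(8, 3, 3, 3), (16, 8, 3, 3)]
predicted = predict_parameters(rng.normal(size=16), layer_shapes, rng)
```

Because the hidden state is carried from one layer to the next, the parameters predicted for a layer depend on those predicted for earlier layers, which is the dependency the shared GRU is meant to capture.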

Table 1: Results of different methods in terms of the target network ConvNet-3 on the Fashion-set, CIFAR100-set, and ImageNet-set datasets.

Table 2: Results of different methods in terms of the target network ResNet-18 on the Fashion-set, CIFAR100-set, and ImageNet-set datasets.

Table 3: Results of different methods on the cross-domain datasets CIFAR100→Animals10.

Table 4: Ablation study of our method.

Table 5: Performance of finetuning all methods using 50 epochs.

Table 6: Results of different methods on the new dataset CIFAR-100→CIFAR-10.

Table 7: Results of different methods on the cross-domain datasets on ResNet-18.

Table 8: Results of different methods in terms of the target network ResNet-34 on CIFAR100-set.

Effect of Different Number of Training Classes with Varying Groups: We analyze the effect of the number of classes in the training set. We utilize PudNet to predict parameters for ConvNet-3. The results are shown in Table 10. Train-C20 denotes that the training set, involving 50000 groups of datasets, contains 20 classes in total. Similarly, Train-C80 denotes that the training set, involving 50000 groups of datasets, has 80 classes. We find that, with more classes included in the training set, the performance of PudNet improves as the number of dataset groups increases.

Table 10: Effect of different number of training classes with varying groups on CIFAR100-set.

Table 11: Effect of different layers of stacked GRU.

We analyze the effect of different groups of datasets for training. The results of utilizing PudNet to predict parameters for ConvNet-3 are shown in Figure 6.

A APPENDIX

A.1 ARCHITECTURE DETAILS

Dataset Compression Module Details. The feature extractor in our framework contains several basic blocks, where each basic block consists of a $5 \times 5$ convolutional layer, a leakyReLU function, and a batch normalization layer. For generating parameters of ConvNet-3, we use one basic block as the feature extractor. For ResNet-18 or ResNet-34, we stack five basic blocks as the feature extractor. Note that our feature extractor is jointly trained with PudNet in an end-to-end manner.

Structure of the Weight Generator. Figure 4 shows the architecture of the weight generator. The weight generator takes as input a vector of dimension $d_a$ and outputs a tensor of size $d_{out} \times d_{in} \times f \times f$ as the parameters of the convolutional layer.

2021), the temperature $\tau = 10$ in Eq. (6) is fixed. Ten labeled samples per class are used as the support set to compute the class centroids. As mentioned in the main paper, we introduce auxiliary tasks to assist optimization. We add a full classification linear layer (e.g., an 80-way linear head in CIFAR-100) to maintain a static class set during training. We also introduce a consistency loss. Because the dimension of the logits deduced by metric-based classification (e.g., 5 in CIFAR-100) does not match that of the logits produced by the full linear head (e.g., 80 in CIFAR-100), we map the 5-dimensional logits to 80-dimensional logits by padding the remaining values with zeros. We search $\eta$ from $\{0, 0.1, 0.2, 0.3, \ldots, 0.9\}$. For the target network ConvNet-3, we set $\eta = 0.2$ for Fashion-set, $\eta = 0.1$ for CIFAR100-set, $\eta = 0.3$ for ImageNet-set, and $\eta = 0.1$ for CIFAR100→Animals10. For the target network ResNet-18, we set $\eta = 0.2$ for Fashion-set, $\eta = 0.5$ for CIFAR100-set and ImageNet-set, and $\eta = 0.5$ for CIFAR100→Animals10.

With larger groups of datasets to learn the hyper-mapping relation, our PudNet could obtain better generalization ability.
However, when the number of groups becomes large, the performance gain slows down.

Parameter Sensitivity Analysis: We analyze the effect of different values of the hyper-parameter $\eta$. Recall that $\eta$ controls the proportion of dataset complementary information in the initial residual connection. Figure 8 shows the results. We observe that our model generally obtains better performance when $\eta > 0$. Additionally, our method is not sensitive to $\eta$ over a relatively large range.

Convergence Analysis: We discuss the convergence property of the proposed method by plotting the loss curves over the training iterations. Here we utilize PudNet to predict parameters for ConvNet-3, based on Fashion-set, CIFAR100-set, and ImageNet-set, respectively. As shown in Figure 9, the training metric-based loss and the training total loss first decrease rapidly as the number of iterations increases, and then gradually decrease to convergence.
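
The zero-padding used for the consistency loss (matching 5-dimensional metric-based logits to an 80-way linear head) can be illustrated with a short NumPy sketch. The text only states that the remaining values are padded with zeros. Placing each logit at the index of its class in the full label set is an assumption made here, and the name `pad_episode_logits` is hypothetical.

```python
import numpy as np


def pad_episode_logits(episode_logits: np.ndarray,
                       episode_classes: np.ndarray,
                       num_total_classes: int) -> np.ndarray:
    """Expand logits of shape (n, k) to shape (n, num_total_classes).

    episode_classes holds the k distinct indices, within the full label
    set, of the classes sampled for this dataset. Every other position
    is filled with zero.
    """
    n = episode_logits.shape[0]
    padded = np.zeros((n, num_total_classes), dtype=episode_logits.dtype)
    padded[:, episode_classes] = episode_logits
    return padded


# Hypothetical usage: a 5-way dataset drawn from 80 training classes.
logits_5way = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
classes = np.array([7, 0, 42, 3, 79])
logits_80way = pad_episode_logits(logits_5way, classes, num_total_classes=80)
```

After padding, both sets of logits share the same shape and can be compared element-wise by whichever consistency loss is used.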

A.4 ALGORITHM PSEUDO-CODE

We provide the training procedure of our PudNet in Algorithm 1. For each training dataset $D_i \in D_{train}$, we first derive the sketch $s_i$ of dataset $D_i$ and set the initial hidden state $h_0 = s_i$ in the GRU. Then, we predict the parameters of each layer in the target network $\Omega$. Finally, we optimize the learnable parameters $\theta, \phi$ by the overall multi-task loss $L_{total}$.

A.5 DISCUSSION WITH SOME RELATED TOPICS

Generalization: The generalization ability of a model is an important research topic in the machine learning community. To generalize well on unseen data, many methods have been proposed. For example, normalization methods such as batch normalization (Bjorck et al., 2018) and layer normalization (Xu et al., 2019) could improve the generalization ability (Lyu et al., 2022). Besides, some regularization techniques such as L2 regularization (Cortes et al., 2012) and dropout (Baldi & Sadowski, 2013) could also help generalization (Wei et al., 2019). Moreover, some works resort to unsupervised pretraining on large-scale data to obtain a model with great generalization ability (Devlin et al., 2018).

