CONTEXTUAL HYPERNETWORKS FOR NOVEL FEATURE ADAPTATION

Abstract

While deep learning has obtained state-of-the-art results in many applications, the adaptation of neural network architectures to incorporate new output features remains a challenge, as neural networks are commonly trained to produce a fixed output dimension. This issue is particularly severe in online learning settings, where new output features, such as items in a recommender system, are added continually with few or no associated observations. As such, methods for adapting neural networks to novel features which are both time- and data-efficient are desired. To address this, we propose the Contextual HyperNetwork (CHN), an auxiliary model which generates parameters for extending the base model to a new feature, by utilizing both existing data as well as any observations and/or metadata associated with the new feature. At prediction time, the CHN requires only a single forward pass through a neural network, yielding a significant speed-up compared to re-training and fine-tuning approaches. To assess the performance of CHNs, we use a CHN to augment a partial variational autoencoder (P-VAE), a deep generative model which can impute the values of missing features in sparsely-observed data. We show that this system obtains improved few-shot learning performance for novel features over existing imputation and meta-learning baselines across recommender systems, e-learning, and healthcare tasks.

1. INTRODUCTION

In many deep learning application domains, it is common to see the set of predictions made by a model grow over time: a new item may be introduced into a recommender system, a new question may be added to a survey, or a new disease may require diagnosis. In such settings, it is valuable to be able to accurately predict the values that this feature takes within data points for which it is unobserved: for example, predicting whether a user will enjoy a new movie in a recommender system, or predicting how a user will answer a new question in a questionnaire. On the introduction of a new feature, there may be few or no labelled data points containing observed values for it; a newly added movie may have received very few or even no ratings. The typically poor performance of machine learning models in this low-data regime is often referred to as the cold-start problem (Schein et al., 2002; Lika et al., 2014; Lam et al., 2008), which is prevalent not only in recommender systems but also in applications where high-quality data is sparse.

This presents a key challenge: the adaptation of a deep learning model to accurately predict the new feature values in the low-data regime. On one hand, it is often required to deploy the model in applications immediately upon the arrival of new features, so it is impractical for the adaptation to wait until much more data has been acquired. On the other hand, simply retraining the model every time a new feature is introduced is computationally costly, and may fall victim to severe over-fitting if there are only a small number of observations available for the new feature. Few-shot learning (Snell et al., 2017; Requeima et al., 2019; Vinyals et al., 2016; Gordon et al., 2018) has seen great successes in recent years, particularly in image classification tasks; however, these approaches typically treat all tasks as independent of one another.
We wish to extend these ideas to the challenge of extending deep learning models to new output features, using a method which captures how a new feature relates to the existing features in the model. Furthermore, we seek a method that is computationally efficient, ideally requiring no fine-tuning of the model, and that is resistant to over-fitting in the few-shot regime. Our contributions are as follows:

• We propose an auxiliary neural network, called the Contextual HyperNetwork (CHN), that can be used to initialize the model parameters associated with a new feature (see Section 2). CHNs are conditioned on both a context set made up of observations for the new feature, and any associated content information or metadata. CHNs amortize the process of performing gradient descent on the new parameters by mapping the newly observed data directly into high-performing new parameter values, with no additional fine-tuning of the model being required. This makes CHNs highly computationally efficient and scalable to large datasets.

• We use a CHN to augment a partial variational autoencoder (P-VAE) and evaluate the system's performance across a range of applications (see Section 4). While CHNs are applicable to a wide range of deep learning models, in this work we choose the P-VAE as the evaluation framework. The result is a flexible deep learning model able to rapidly adapt to new features, even when the data is sparsely observed, e.g. in recommender systems. We show that this model outperforms a range of baselines in both predictive accuracy and speed across recommender system, e-learning, and healthcare tasks.
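To make the idea of amortization concrete, the following is a minimal numpy sketch of a CHN-style forward pass: a context set of observations for the new feature and a metadata vector are mapped, in a single pass, to a parameter vector for the new feature. All names, dimensions, and the specific architecture (mean-pooled context embedding concatenated with a metadata embedding) are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def chn_forward(context_pairs, metadata, W_embed, W_meta, W_out):
    """Hypothetical CHN sketch: map a context set and feature metadata
    to new-feature parameters in one forward pass (no fine-tuning)."""
    # Embed each context observation, then mean-pool so the result is
    # invariant to the ordering and size of the context set.
    h = np.tanh(context_pairs @ W_embed)   # (k, d_h)
    pooled = h.mean(axis=0)                # (d_h,)
    m = np.tanh(metadata @ W_meta)         # (d_h,)
    # Concatenate and map directly to the new feature's parameters.
    return np.concatenate([pooled, m]) @ W_out

k, d_in, d_meta, d_h, d_theta = 5, 8, 3, 16, 10
W_embed = rng.normal(size=(d_in, d_h))
W_meta = rng.normal(size=(d_meta, d_h))
W_out = rng.normal(size=(2 * d_h, d_theta))

context = rng.normal(size=(k, d_in))  # k observations of the new feature
meta = rng.normal(size=(d_meta,))     # metadata for the new feature
theta_new = chn_forward(context, meta, W_embed, W_meta, W_out)
print(theta_new.shape)  # a parameter vector for the new feature
```

In a trained system, the weights of the CHN itself would be learned so that the emitted parameters perform well on held-out target points; here they are random, since only the data flow is being illustrated.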

2.1. PROBLEM SETTING

Our goal is to enable fast adaptation of a machine learning model when new output features are added to augment the originally observed data. Specifically, we consider the original observations as a set of vector-valued data points D = {x^(i)}_{i=1}^m, where each of the feature values in a given data point may be missing. We denote by x_j the j-th feature of a data point x ∈ D, and group the observed and unobserved features within a data point as x = [x_O, x_U]. In many scenarios, such as recommender systems, a machine learning model p(x_U | x_O) then aims to predict the unobserved features x_U given the observed ones x_O.
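A common way to realize this setting in code is to store each data point together with an observation mask; a hypothetical imputation model p(x_U | x_O) would then condition only on the masked-in values. A minimal sketch, assuming NaN marks missing entries:

```python
import numpy as np

# One sparsely observed data point: NaN marks an unobserved feature.
x = np.array([4.0, np.nan, 2.5, np.nan])
observed = ~np.isnan(x)          # boolean mask: True = observed

x_O = x[observed]                # observed features, the model's input
U_idx = np.where(~observed)[0]   # indices of x_U, the prediction targets

print(x_O, U_idx)  # observed values and the indices to impute
```

The mask representation generalizes directly to a whole dataset as a 2-D value matrix plus a 2-D boolean mask of the same shape.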

Now suppose a new output feature x_n becomes available, so that each data vector x ∈ D is augmented to become x̃ = [x; x_n]. This happens when, e.g., a new item is added to a recommender system, or a new type of diagnostic test is added in a medical application. We note that not every data point x ∈ D receives an observed value for the new feature: a newly added movie may have received very few ratings, or a new diagnostic test may have yet to be performed on all of the patients. We refer to the set of data points where the new feature is observed as the context set for the new feature n, i.e. C_n = {x̃ = [x; x_n] | x ∈ D, x_n is observed}; the context set is shown in yellow in Figure 1. Its complement, the target set T_n = {x̃ = [x; x_n] | x ∈ D, x_n is unobserved}, is the set of those data points for which there is no associated observation for the feature. One can also split the augmented data into observed and unobserved parts, i.e. x̃ = [x̃_O, x̃_U]. Using this notation, it is clear that x̃_O = [x_O; x_n] and x̃_U = x_U for x ∈ C_n, while x̃_O = x_O and x̃_U = [x_U; x_n] for x ∈ T_n. In addition, we may also have access to some metadata M_n describing the new feature. This could be categorical data, such as the category of a product in a recommender system or the topic of a question in an e-learning system, or some richer data format such as images or text. We wish to adapt the machine learning model p_θ0(x_U | x_O) to p_θ(x̃_U | x̃_O) so that it is able to accurately predict the value of the unobserved new feature for data points x ∈ T_n.

Figure 1: Data used when adapting to a new feature x_n. The rows which contain yellow blocks are in C_n, and all other rows are in T_n.

A naive strategy would ignore the previous model p_θ0(x_U | x_O) and instead seek the maximum likelihood estimate (MLE) of the parameters for the new model p_θ(x̃_U | x̃_O). This is typically done by training the new model on the context set, temporarily moving the observed new features x_n to the prediction targets:

θ* = argmax_θ Σ_{x ∈ C_n} log p_θ(x_n, x_U | x_O).
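The context/target split and the naive MLE baseline can be sketched end-to-end on toy data. Everything below is illustrative: the dataset is synthetic, and the new feature's "model" is reduced to a linear head fit by least squares on the context set alone (a deep model would instead run gradient descent on the new parameters), which is exactly the setting where few context points invite over-fitting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: rows are data points; the new feature x_n is observed
# (non-NaN) for only a small fraction of them, splitting D into C_n / T_n.
X = rng.normal(size=(100, 4))                   # existing observed features
w_true = np.array([1.0, -2.0, 0.5, 0.0])
x_n = X @ w_true + 0.1 * rng.normal(size=100)   # ground-truth new feature
mask = rng.random(100) < 0.1                    # few-shot: ~10% observed

C = X[mask]           # context set C_n: x_n is observed here
y_C = x_n[mask]
T = X[~mask]          # target set T_n: x_n must be predicted here

# Naive MLE: fit the new feature's parameters using the context set only.
theta, *_ = np.linalg.lstsq(C, y_C, rcond=None)
rmse = np.sqrt(np.mean((T @ theta - x_n[~mask]) ** 2))
print(theta.shape, float(rmse))
```

A CHN replaces the fitting step with a single forward pass that maps (C, y_C) and any metadata directly to θ, avoiding both the optimization cost and, empirically, much of the over-fitting.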

