ZERO-SHOT LEARNING WITH COMMON SENSE KNOWLEDGE GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Zero-shot learning relies on semantic class representations such as hand-engineered attributes or learned embeddings to predict classes without any labeled examples. We propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs are an untapped source of explicit high-level knowledge that requires little human effort to apply to a range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a general-purpose framework with a novel transformer graph convolutional network (TrGCN) for generating class representations. Our proposed TrGCN architecture computes non-linear combinations of a node's neighbourhood and leads to significant improvements on zero-shot learning tasks. We report new state-of-the-art accuracies on six zero-shot benchmark datasets in object classification, intent classification, and fine-grained entity typing. ZSL-KG outperforms the specialized state-of-the-art method for each task by an average of 1.7 accuracy points and outperforms the general-purpose method with the best average accuracy by 5.3 points. Our ablation study of ZSL-KG with alternate graph neural networks shows that our transformer-based aggregator yields an improvement of up to 2.8 accuracy points on these tasks.

1. INTRODUCTION

Deep neural networks require large amounts of labeled training data to achieve optimal performance. This is a severe bottleneck, as obtaining large amounts of hand-labeled data is an expensive process. Zero-shot learning is a training strategy that allows a machine learning model to predict novel classes without the need for any labeled examples of those classes (Romera-Paredes & Torr, 2015; Socher et al., 2013; Wang et al., 2019). Zero-shot models learn parameters for seen classes along with their class representations. During inference, new class representations are provided for the unseen classes. Previous zero-shot learning systems have used hand-engineered attributes (Akata et al., 2015; Farhadi et al., 2009; Lampert et al., 2014), pretrained embeddings (Frome et al., 2013), and learned embeddings (e.g., sentence embeddings) (Xian et al., 2016) as class representations.

Class representations in a zero-shot learning framework should satisfy the following properties: (1) they should adapt to unseen classes without requiring additional human effort, (2) they should provide rich features so that the unseen classes are sufficiently distinguishable from one another, and (3) they should be applicable to a range of downstream tasks. Previous approaches for class representations have various limitations. On one end of the spectrum, attribute-based methods provide rich features, but the attributes have to be fixed ahead of time for the unseen classes. On the other end of the spectrum, pretrained embeddings such as GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013) offer the flexibility of easily adapting to new classes but rely on unsupervised training on large corpora, which may not provide the distinguishing characteristics necessary for zero-shot learning. Many methods lie within this spectrum and learn class representations for zero-shot tasks from descriptions such as attributes, text, and image prototypes.
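The inference scheme described above can be sketched as follows. This is a minimal illustration, not the formulation of any particular method: the projection matrix W, the dot-product compatibility score, and all toy vectors are illustrative assumptions.

```python
import numpy as np

def zero_shot_predict(x, class_reps, W):
    """Score an input against a set of class representations.

    x          : (d_in,) input feature vector
    class_reps : (n_classes, d_cls) one row per class; rows for unseen
                 classes can be added at inference time without retraining
    W          : (d_in, d_cls) projection learned on seen classes only
    Returns the index of the highest-scoring class.
    """
    z = x @ W                # project the input into class-embedding space
    scores = class_reps @ z  # dot-product compatibility with each class
    return int(np.argmax(scores))

# Toy example (all vectors are made up for illustration):
W = np.eye(3)                            # identity projection
seen = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])       # two seen classes
unseen = np.array([[0.0, 0.0, 1.0]])     # a new class added at inference
class_reps = np.vstack([seen, unseen])

x = np.array([0.1, 0.2, 0.9])            # input resembling the unseen class
pred = zero_shot_predict(x, class_reps, W)  # -> 2 (the unseen class)
```

The key property this sketch captures is that only W is learned; swapping in new rows of class_reps is all that is needed to predict an unseen class.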
Existing approaches that achieve state-of-the-art performance make task-specific adjustments and cannot easily be adapted to tasks in other domains (Liu et al., 2019a; Verma et al., 2020). Methods that apply graph neural networks to the ImageNet graph to learn class representations have achieved strong performance on zero-shot object classification (Kampffmeyer et al., 2019; Wang et al., 2018). These methods are general-purpose: we show that they can be adapted to other tasks as well. However, the ImageNet graph may not provide features rich enough for a wide range of downstream tasks.
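To make the graph-based approach concrete, the following is a minimal sketch of one graph-convolution layer producing class representations from node features. The single layer, mean aggregation, and the toy graph are simplifying assumptions; the methods cited above use deeper and more elaborate architectures.

```python
import numpy as np

def gcn_layer(H, adj, W):
    """One graph-convolution layer with mean aggregation.

    H   : (n_nodes, d_in) node features (e.g. word embeddings of class names)
    adj : (n_nodes, n_nodes) binary adjacency matrix with self-loops
    W   : (d_in, d_out) learnable weight matrix
    Each node's new representation is the mean of its neighbourhood's
    features, followed by a linear map and a ReLU non-linearity.
    """
    deg = adj.sum(axis=1, keepdims=True)  # neighbourhood sizes
    H_agg = (adj @ H) / deg               # mean over neighbours
    return np.maximum(H_agg @ W, 0.0)     # linear transform + ReLU

# Toy 3-node graph: node 2 neighbours both 0 and 1 (self-loops included)
adj = np.array([[1, 0, 1],
                [0, 1, 1],
                [1, 1, 1]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
W = np.eye(2)
reps = gcn_layer(H, adj, W)  # node 2's row blends its neighbours' features
```

Note that this mean aggregator is a linear function of the neighbourhood; the TrGCN aggregator proposed in this paper instead computes non-linear combinations of the neighbourhood.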

