ZERO-SHOT LEARNING WITH COMMON SENSE KNOWLEDGE GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Zero-shot learning relies on semantic class representations such as hand-engineered attributes or learned embeddings to predict classes without any labeled examples. We propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs are an untapped source of explicit high-level knowledge that requires little human effort to apply to a range of tasks. To capture the knowledge in the graph, we introduce ZSL-KG, a general-purpose framework with a novel transformer graph convolutional network (TrGCN) to generate class representations. Our proposed TrGCN architecture computes non-linear combinations of the node neighbourhood and leads to significant improvements on zero-shot learning tasks. We report new state-of-the-art accuracies on six zero-shot benchmark datasets in object classification, intent classification, and fine-grained entity typing tasks. ZSL-KG outperforms the specialized state-of-the-art method for each task by an average of 1.7 accuracy points and outperforms the general-purpose method with the best average accuracy by 5.3 points. Our ablation study on ZSL-KG with alternate graph neural networks shows that our transformer-based aggregator adds up to 2.8 points of accuracy improvement on these tasks.

1. INTRODUCTION

Deep neural networks require large amounts of labeled training data to achieve optimal performance. This is a severe bottleneck, as obtaining large amounts of hand-labeled data is an expensive process. Zero-shot learning is a training strategy that allows a machine learning model to predict novel classes without the need for any labeled examples of the new classes (Romera-Paredes & Torr, 2015; Socher et al., 2013; Wang et al., 2019). Zero-shot models learn parameters for seen classes along with their class representations. During inference, new class representations are provided for the unseen classes. Previous zero-shot learning systems have used hand-engineered attributes (Akata et al., 2015; Farhadi et al., 2009; Lampert et al., 2014), pretrained embeddings (Frome et al., 2013), and learned embeddings (e.g., sentence embeddings) (Xian et al., 2016) as class representations. Class representations in a zero-shot learning framework should satisfy the following properties: (1) they should adapt to unseen classes without requiring additional human effort, (2) they should provide rich features such that the unseen classes have sufficient distinguishing characteristics among themselves, and (3) they should be applicable to a range of downstream tasks.

Previous approaches for class representations have various limitations. On one end of the spectrum, attribute-based methods provide rich features, but the attributes have to be fixed ahead of time for the unseen classes. On the other end of the spectrum, pretrained embeddings such as GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013) offer the flexibility of easily adapting to new classes but rely on unsupervised training on large corpora, which may not provide the distinguishing characteristics necessary for zero-shot learning. Many methods lie between these extremes and learn class representations for zero-shot tasks from descriptions such as attributes, text, and image prototypes.
Existing approaches that have achieved state-of-the-art performance make task-specific adjustments and cannot readily be adapted to tasks in other domains (Liu et al., 2019a; Verma et al., 2020). Methods using graph neural networks on the ImageNet graph to learn class representations have achieved strong performance on zero-shot object classification (Kampffmeyer et al., 2019; Wang et al., 2018). These methods are general-purpose, since we show that they can be adapted to other tasks as well. However, the ImageNet graph may not provide rich features suitable for a wide range of downstream tasks. In our work, we propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs (Liu & Singh, 2004; Speer et al., 2017; Tandon et al., 2017; Zhang et al., 2020) are an untapped source of explicit high-level knowledge that requires little human effort to apply to a range of tasks. These graphs have explicit edges between related concept nodes and provide valuable information to distinguish between different concepts. However, adapting existing zero-shot learning frameworks to learn class representations from common sense knowledge graphs is challenging in several ways. GCNZ (Wang et al., 2018) learns graph neural networks with a symmetrically normalized graph Laplacian, which not only requires the entire graph structure during training but also needs retraining if the graph structure changes; i.e., GCNZ is not inductive. Common sense knowledge graphs can be large (2 million to 21 million edges), and training a graph neural network on the entire graph can be prohibitively expensive. DGP (Kampffmeyer et al., 2019) is an inductive method and aims to generate expressive class representations, but it assumes a directed acyclic graph such as WordNet. Common sense knowledge graphs do not have a directed acyclic structure.
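To make the transductive/inductive distinction concrete, the following is a minimal NumPy sketch (ours, not code from any of the cited systems; all function names are illustrative). A full-graph GCN layer in the style of GCNZ needs the symmetrically normalized Laplacian of the entire graph up front, whereas an inductive mean-aggregation layer embeds each node from only its own neighbourhood, so it can be run on nodes unseen during training:

```python
import numpy as np

def gcn_layer(A, X, W):
    """Full-graph GCN layer (GCNZ-style): requires the entire adjacency
    matrix A up front, so a changed graph forces retraining."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetric normalization
    return np.maximum(L @ X @ W, 0.0)           # ReLU activation

def inductive_mean_layer(neighbors, X, W):
    """Inductive mean aggregation: each node only needs the features of
    its (possibly sampled) neighbours, so unseen nodes can be embedded
    at test time without retraining."""
    H = np.stack([X[list(nbrs)].mean(axis=0) for nbrs in neighbors])
    return np.maximum(H @ W, 0.0)
```

Both layers map node features of dimension `d_in` to `d_out`; the difference is purely in what graph information they require at inference time.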
To address these limitations, we propose ZSL-KG, a general-purpose framework with a novel transformer graph convolutional network (TrGCN) to learn class representations. Graph neural networks learn to represent the structure of graphs by aggregating information from each node's neighbourhood. Aggregation techniques used in GCNZ, DGP, and most other graph neural network approaches are linear, in the sense that they take a (possibly weighted) mean or maximum of the neighbourhood features. To capture the complex information in the common sense knowledge graph, TrGCN learns a transformer-based aggregator to compute non-linear combinations of each node's neighbours. A few prior works have considered LSTM-based aggregators (Hamilton et al., 2017a) as a way to increase the expressivity of graph neural networks, but their outputs can be sensitive to the ordering imposed on the nodes in each neighbourhood. For example, on the Animals with Attributes 2 dataset, we find that when given the same test image 10 times with different neighbourhood orderings, an LSTM-based graph neural network outputs inconsistent predictions 16% of the time (Appendix A). One recent work tries to make LSTM-based aggregators less order-sensitive by averaging the outputs over permutations, but this significantly increases the computational cost and provides only a small boost in prediction accuracy (Murphy et al., 2019). In contrast, TrGCN learns a transformer-based aggregator, which is non-linear and naturally permutation invariant. Additionally, our framework is inductive, i.e., the graph neural network can be executed on graphs that are different from the training graph, which is necessary for inductive zero-shot learning, under which the test classes are unknown during training. We demonstrate the effectiveness of our framework on three zero-shot learning tasks in vision and language: object classification, intent classification, and fine-grained entity typing.
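The permutation-invariance argument can be sketched in a few lines of PyTorch. The code below is an illustrative simplification in the spirit of TrGCN, not the paper's implementation: self-attention over the neighbour features (with no positional encodings, so reordering the neighbours permutes the outputs in lockstep), followed by mean pooling, which collapses any ordering to the same vector:

```python
import torch
import torch.nn as nn

class TransformerAggregator(nn.Module):
    """Illustrative transformer-based neighbourhood aggregator.

    Without positional encodings, the transformer encoder is
    permutation-equivariant over the neighbour axis; mean pooling then
    makes the aggregated output permutation invariant, unlike an
    LSTM-based aggregator."""

    def __init__(self, dim, heads=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, neigh_feats):
        # neigh_feats: (num_nodes, num_neighbours, dim)
        h = self.encoder(neigh_feats)   # non-linear mixing of neighbours
        return h.mean(dim=1)            # pool to one vector per node
```

Feeding the same neighbourhood in a different order yields the same aggregated vector (up to floating-point error), which is exactly the property the LSTM-based aggregator lacks.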
We report new state-of-the-art accuracies on six zero-shot benchmark datasets (Xian et al., 2018a; Farhadi et al., 2009; Deng et al., 2009; Coucke et al., 2018; Gillick et al., 2014; Weischedel & Brunstein, 2005). ZSL-KG outperforms the state-of-the-art specialized method for each task by an average of 1.7 accuracy points. ZSL-KG also outperforms GCNZ, the best general-purpose method on average, by 5.3 accuracy points. Our ablation study on ZSL-KG with alternate graph neural networks shows that our transformer-based aggregator adds up to 2.8 points of accuracy improvement on these tasks. In summary, our main contributions are the following:

1. We propose to learn class representations from common sense knowledge graphs for zero-shot learning.

2. We present ZSL-KG, a general-purpose framework based on graph neural networks with a novel transformer-based architecture. Our proposed architecture learns non-linear combinations of each node's neighbourhood and generates expressive class representations.

3. ZSL-KG achieves new state-of-the-art accuracies on Animals with Attributes 2 (Xian et al., 2018a), aPY (Farhadi et al., 2009), ImageNet (Deng et al., 2009), SNIPS-NLU (Coucke et al., 2018), Ontonotes (Gillick et al., 2014), and BBN (Weischedel & Brunstein, 2005).

2. BACKGROUND

In this section, we summarize zero-shot learning and graph neural networks.

