TRACKING THE PROGRESS OF LANGUAGE MODELS BY EXTRACTING THEIR UNDERLYING KNOWLEDGE GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

The state of the art in language modeling, previously dominated by pre-trained word embeddings, is now being pushed forward by large pre-trained contextual representations. This success has driven growing interest in understanding what these models encode in their inner workings. Despite this, understanding their semantic skills has been elusive, often leading to unsuccessful, inconclusive, or contradictory results across different works. In this work, we define a probing classifier that we use to extract the underlying knowledge graph of nine of the currently most influential language models, including word embeddings, context encoders, and text generators. This probe is based on concept relatedness, grounded in WordNet. Our results show that this knowledge is present in all the models, but with several inaccuracies. Furthermore, we show that the different pre-training strategies and architectures lead to different model biases. We conduct a systematic evaluation to discover specific factors that explain why some concepts are challenging for the different families of models. We hope our insights will motivate the future development of models that capture concepts more precisely.

1. INTRODUCTION

Natural language processing (NLP) encompasses a wide variety of applications such as summarization (Kovaleva et al., 2019), information retrieval (Zhan et al., 2020), and machine translation (Tang et al., 2018), among others. Currently, pre-trained language models have become the de facto starting point for most of these applications. The usual pipeline consists of fine-tuning a pre-trained language model with a discriminative learning objective to adapt it to the requirements of each specific task. As key ingredients, these models are pre-trained on massive amounts of unlabeled data that can include millions of documents, and may include billions of parameters. Massive data and parameters are supplemented with a suitable learning architecture, resulting in a highly powerful but also complex model whose internal operation is hard to analyze. The success of pre-trained language models has driven interest in understanding how they manage to solve NLP tasks. As an example, in the case of BERT (Devlin et al., 2019), one of the most popular pre-trained models based on the Transformer architecture (Vaswani et al., 2017), several studies have attempted to access the knowledge encoded in its layers and attention heads (Tenney et al., 2019b; Devlin et al., 2019; Hewitt & Manning, 2019). In particular, Jawahar et al. (2019) show that BERT can solve tasks at a syntactic level, using Transformer blocks to encode a soft hierarchy of features at different levels of abstraction. Similarly, Hewitt & Manning (2019) show that BERT is capable of encoding structural information from text: using a structural probe, they show that syntax trees are embedded in a linear transformation of the encodings provided by BERT.
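The core computation behind a structural probe of the kind cited above can be sketched in a few lines. This is an illustrative toy, not the cited authors' implementation: here the probe matrix B is random rather than learned, and all names (probe_distance, the dimensions) are ours.

```python
import numpy as np

# Structural-probe idea in miniature: a learnable matrix B maps frozen word
# vectors into a space where squared L2 distances are compared against
# syntax-tree distances. B is random here purely for illustration.
rng = np.random.default_rng(0)
hidden_dim, probe_rank = 8, 4
B = rng.normal(size=(probe_rank, hidden_dim))  # the linear transformation

def probe_distance(h_i, h_j, B):
    """Squared distance between two word vectors after applying the probe."""
    diff = B @ (h_i - h_j)
    return float(diff @ diff)

h_cat = rng.normal(size=hidden_dim)  # stand-ins for contextual embeddings
h_dog = rng.normal(size=hidden_dim)
d = probe_distance(h_cat, h_dog, B)
assert d >= 0.0
```

In the actual probing setup, only B would be trained (e.g. by gradient descent on the gap between probe distances and tree distances) while the language model's embeddings stay frozen.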
In general, previous efforts have provided strong evidence that current pre-trained language models encode complex syntactic rules; however, relevant evidence about their ability to capture semantic information remains elusive. As an example, a recent study (Si et al., 2019) attempts to locate the encoding of semantic information in the top layers of Transformer architectures, but its results are contradictory. Similarly, Kovaleva et al. (2019) study the knowledge encoded by self-attention weights, finding evidence of over-parameterization but not of language understanding capabilities. In this work, we study to what extent pre-trained language models encode semantic information. As a key source of semantic knowledge, we focus on how precisely pre-trained language models encode the concept relations embedded in the conceptual taxonomy of WordNet¹ (Miller, 1995). The ability to understand, organize, and correctly use concepts is one of the most remarkable capabilities of human intelligence (Lake et al., 2017). Therefore, quantifying how well a pre-trained language model encodes the conceptual organization behind WordNet is highly valuable. In particular, it can provide useful insights about the inner mechanisms these models use to encode semantic information. Furthermore, an analysis of the concepts and associations that these models find difficult can provide relevant insights about how to improve them. In contrast to most previous works, we do not focus on a particular model, but target a large list of the most popular pre-trained language and text-embedding models. In this sense, one of our goals is to provide a comparative analysis of the capacities of different types of approaches. Following Hewitt & Manning (2019), we study semantic performance by defining a probing classifier based on concept relatedness according to WordNet.
Using this tool, we analyze the different models, shedding light on how and where semantic knowledge is encoded. Furthermore, we explore whether these models encode suitable information to recreate the structure of WordNet. Among our main results, we show that the different pre-training strategies and architectures lead to different model biases. In particular, we show that contextualized word embeddings, such as BERT, encode high-level concepts and the hierarchical relationships among them, creating a taxonomy. This finding corroborates previous results (Reif et al., 2019) claiming that BERT vectors are stored in sub-spaces that correspond to semantic knowledge. Our study also provides evidence about the limitations of current pre-trained language models, demonstrating that they all have difficulties encoding specific concepts. As an example, all the models struggle with concepts related to "taxonomical groups", performing worse than chance in some cases. Our results also reveal that models have very distinctive patterns in terms of where they encode most of the semantic information. These patterns depend on the architecture rather than on model size.
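A relatedness probe of the kind used here can be sketched as a simple classifier over pairs of frozen concept embeddings. The sketch below is a minimal, hypothetical version: the embeddings and labels are synthetic, the pair features (concatenation plus element-wise product) are a common choice rather than the paper's exact design, and only the probe's weights are trained.

```python
import numpy as np

# Sketch of a concept-relatedness probe: logistic regression over pairs of
# frozen embeddings, predicting whether two concepts are related. All data
# here is synthetic; it stands in for language-model concept embeddings.
rng = np.random.default_rng(1)
dim, n_pairs = 16, 400

a = rng.normal(size=(n_pairs, dim))          # embeddings of first concepts
labels = rng.integers(0, 2, size=n_pairs)    # 1 = related, 0 = unrelated
b = a + rng.normal(scale=0.1, size=(n_pairs, dim))  # related: near-copies
b[labels == 0] = rng.normal(size=(int((labels == 0).sum()), dim))

# Pair features: concatenation plus element-wise product (a common choice).
features = np.concatenate([a, b, a * b], axis=1)
w = np.zeros(features.shape[1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the logistic loss; the embeddings stay frozen, so the
# probe can only read out information already present in them.
for _ in range(300):
    p = sigmoid(features @ w)
    w -= 0.1 * features.T @ (p - labels) / n_pairs

accuracy = float(((sigmoid(features @ w) > 0.5) == labels).mean())
```

The probing logic is what matters: because the classifier is deliberately shallow, high accuracy is evidence that relatedness is encoded in the embeddings themselves rather than computed by the probe.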

2. RELATED WORK

The success of deep learning architectures in various NLP tasks has fueled a growing interest in understanding what these models encode. Several works have studied these models' impact on downstream tasks at the syntactic or semantic level. Some studies (Tenney et al., 2019b) claim that success in a specific task helps understand what type of information the model encodes. Other studies have improved the understanding of what and where these models encode information by analyzing correlations between input-targets and specific architecture blocks, such as layers (Jawahar et al., 2019), encoded hidden states (Tang et al., 2018; Saphra & Lopez, 2019), and attention heads (Michel et al., 2019). Evidence of syntactic information: Using probing classifiers, Clark et al. (2019) claim that some of BERT's attention heads show correspondence with syntactic tasks. Goldberg (2019) illustrates BERT's capabilities on syntactic tasks, such as subject-verb agreement. BERT's success in these tasks fuels the belief that BERT can encode the syntax of a language. Hewitt & Manning (2019) propose a structural probe that evaluates whether syntax trees are encoded in a linear transformation of BERT embeddings. The study shows that such a transformation exists in BERT, providing evidence that syntax trees are implicitly embedded in BERT's vector geometry. Reif et al. (2019) find evidence of syntactic representation in BERT's attention matrices, with specific directions in space representing particular dependency relations. Evidence of semantic information: Reif et al. (2019) suggest that BERT's internal geometry may be broken into multiple linear subspaces, with separate spaces for different syntactic and semantic information. Despite this, previous work has not yet reached consensus on this topic.
While some studies show satisfactory results in tasks such as entity types (Tenney et al., 2019a), semantic roles (Rogers et al., 2020), and sentence completion (Ettinger, 2020), other studies show less favorable results in coreference (Tenney et al., 2019b) and multiple-choice reading comprehension (Si et al., 2019), claiming that BERT's performance may not reflect the model's true language understanding and reasoning abilities. Some works have studied which blocks of BERT are used to solve



¹WordNet is a human-generated graph in which each of its 117,000 nodes (also called synsets) represents a concept. In this work we use hyponymy relations, which indicate whether one concept is a subclass of another.
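To make the hyponymy relation concrete, the fragment below hard-codes a few WordNet-style edges and a transitive subclass check. The synset names follow WordNet's lemma.pos.sense convention, but the graph is a tiny hand-made sample for illustration, not the real resource.

```python
# A tiny fragment of WordNet-style hyponymy ("is a subclass of") edges.
# Real WordNet has ~117,000 synsets; this map is illustrative only.
hypernym_of = {
    "dog.n.01": "canine.n.02",
    "canine.n.02": "carnivore.n.01",
    "carnivore.n.01": "mammal.n.01",
    "cat.n.01": "feline.n.01",
    "feline.n.01": "carnivore.n.01",
}

def is_hyponym_of(synset, ancestor, edges):
    """True if `synset` is a (transitive) subclass of `ancestor`."""
    while synset in edges:
        synset = edges[synset]
        if synset == ancestor:
            return True
    return False

assert is_hyponym_of("dog.n.01", "mammal.n.01", hypernym_of)       # dog -> canine -> carnivore -> mammal
assert not is_hyponym_of("mammal.n.01", "dog.n.01", hypernym_of)   # hyponymy is directed
```

Because hyponymy is transitive and directed, this is exactly the kind of hierarchical structure the probe tests for in the models' embedding spaces.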

