LEARNING LANGUAGE REPRESENTATIONS WITH LOGICAL INDUCTIVE BIAS

Abstract

Transformer architectures, which learn strong language representations from large-scale unlabeled text, have achieved great success in solving natural language tasks. In this paper, we seek to go beyond this paradigm and explore a new logical inductive bias for better language representation learning. Logical reasoning is a formal methodology for reaching answers from given knowledge and facts. Inspired by this view, we develop a novel neural architecture named FOLNet (First-Order Logic Network) to encode this new inductive bias. We construct a set of neural logic operators as learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture (FOLNet). Interestingly, we find that the self-attention module in transformers can be composed from two of our neural logic operators, which may partially explain their strong reasoning performance. Our proposed FOLNet has the same input and output interfaces as other pretrained models and thus can be pretrained and finetuned using similar losses. This also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models. With our logical inductive bias, the same set of "logic deduction skills" learned through pretraining is expected to be equally capable of solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer capabilities. Experimental results on several language understanding tasks show that our pretrained FOLNet model outperforms existing strong transformer-based approaches.¹

1. INTRODUCTION

Pretrained transformer models, which learn strong language representations from large-scale unlabeled text, have achieved great success in solving natural language tasks. The learned representations can easily be transferred to different downstream tasks by finetuning on a limited amount of labeled data (Radford et al., 2018; Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019; Yang et al., 2019). They even exhibit strong zero-shot or few-shot generalization without finetuning when the model size is scaled up further (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022). Besides large-scale models and training data, one important reason for this success is the strong relational inductive bias encoded in the transformer architecture (Vaswani et al., 2017); it effectively models the pairwise relations between tokens and uses them to compute the language representations.

In this paper, we seek to go beyond the inductive bias in transformer models and explore a new logical inductive bias for better language representation learning. The main idea is to view the computation of language representations as a logic reasoning process; that is, the language representations are deduced via logic reasoning, step by step, from the original discrete token sequences. Specifically, we treat the tokens in the input sequence as the terms in logic programming, and treat their properties and relations as predicates of different arities. The final language representations are then derived as advanced properties and relations from the basic input properties and relations (e.g., token ids and relative distances). Most importantly, we require this deduction process to follow the principles of first-order logic, in order to encode the logical inductive bias.

Following the above logical inductive bias, we derive a principled neural architecture, named FOLNet (First-Order Logic Network), for learning language representations. Specifically, we construct a set of neural logic operators as learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture. In particular, the FOLNet architecture consists of two interacting branches responsible for unary and binary relational reasoning, respectively. Interestingly, we find that the self-attention mechanism can be constructed from two of our neural logic operators, and that the entire transformer architecture can be understood as a single-branch version of FOLNet. This newly discovered connection might partially explain the surprisingly strong reasoning performance of the transformer architecture (Wei et al., 2022; Lewkowycz et al., 2022). As we will demonstrate in our experiments, the dual-branch architecture has several significant advantages that are essential for learning better language representations.
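To make this connection concrete, the following minimal sketch decomposes single-head self-attention into two operator-like steps: one deduces a binary relation between tokens from their unary properties, and one deduces new unary properties by aggregating over that relation. The names join_op and agg_op, and all tensor shapes, are our own illustrative labels under this reading, not the paper's formal operator definitions.

```python
# A minimal sketch (illustrative assumptions, not the paper's exact
# operators): single-head self-attention decomposed into two steps that
# read as logic operators over unary and binary predicates.
import torch
import torch.nn.functional as F

def join_op(u, w_q, w_k):
    """Deduce a binary relation r(X, Y) from unary properties of X and Y.
    This is the QK^T score computation of self-attention."""
    q, k = u @ w_q, u @ w_k                      # (T, d) each
    return (q @ k.t()) / (q.shape[-1] ** 0.5)    # (T, T) relation scores

def agg_op(r, u, w_v):
    """Deduce new unary properties of X by aggregating the properties of
    all Y related to X. This is the softmax(.)V step of self-attention."""
    return F.softmax(r, dim=-1) @ (u @ w_v)      # (T, d)

T, d = 8, 16
u = torch.randn(T, d)                            # unary predicates (one row per token)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = agg_op(join_op(u, w_q, w_k), u, w_v)       # equals single-head self-attention
```

Under this reading, a transformer repeatedly derives relations from properties and properties from relations, but only the unary stream u persists across layers, which is why we describe it as single-branch.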
Furthermore, our logical inductive bias also yields a new unified understanding of different positional encoding strategies. For instance, we find that the popular relative positional encoding can be constructed from degenerate versions of our two neural logic operators. More importantly, this view allows us to develop a new principled relative positional encoding that is simple yet quite effective in practice. Notably, our proposed FOLNet has the same input and output interfaces as other pretrained transformer models (e.g., BERT) and thus can be trained using similar losses. This interface compatibility also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models in downstream tasks. Our logical inductive bias assumes that the "logic deduction skills" are shared across all natural language tasks; that is, the skills learned during pretraining should be equally applicable to solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer capabilities. Experimental results on several language understanding tasks (GLUE, SQuAD 2.0, and FOLIO) show that our FOLNet model outperforms the transformer architecture by a large margin when both are pretrained using similar losses. These results clearly show the advantage of using the logical inductive bias for learning language representations.
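To summarize the architectural picture described above, the sketch below forward-chains dual-branch layers in which both a unary stream and a binary stream persist and update each other. This is our own minimal construction for illustration only; FOLNet's actual operators and their Horn-clause parameterization are defined in the paper, not here.

```python
# Illustrative dual-branch stack (our own construction, not FOLNet's exact
# design): each layer deduces new binary facts from unary ones and new
# unary facts from binary ones; stacking layers mimics forward chaining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)

    def forward(self, u, r):
        # binary branch: refine pairwise relations using current unary facts
        r = r + self.q(u) @ self.k(u).transpose(-2, -1) / (u.shape[-1] ** 0.5)
        # unary branch: refine token properties by aggregating over relations
        u = u + F.softmax(r, dim=-1) @ self.v(u)
        return u, r

class FOLNetSketch(nn.Module):
    def __init__(self, d, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(DualBranchLayer(d) for _ in range(num_layers))

    def forward(self, u, r):
        for layer in self.layers:       # one deduction step per layer
            u, r = layer(u, r)
        return u, r                     # final unary/binary representations

T, d = 8, 16
u0 = torch.randn(T, d)                  # basic unary predicates (e.g., token ids)
r0 = torch.zeros(T, T)                  # basic binary predicates (e.g., distances)
u, r = FOLNetSketch(d, num_layers=2)(u0, r0)
```

In this sketch, a standard transformer corresponds to the special case where r is recomputed from scratch at every layer and never carried over between layers.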

2. LOGICAL INDUCTIVE BIAS FOR LANGUAGE REPRESENTATIONS

Natural language text can be viewed as a sequence of discrete symbols, and language representation learning considers the problem of mapping these discrete symbols into more computable forms. One widely used approach is distributed representation, which maps discrete token ids into dense vectors (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018). Many different functional forms, such as LSTMs (Hochreiter & Schmidhuber, 1997) and, more recently, transformers (Vaswani et al., 2017), have been used to implement such mappings. They generally encode different kinds of inductive bias for modeling natural language. For example, RNNs use the same set of model parameters to update the hidden states over time, which encodes translation invariance over time (Battaglia et al., 2018). These forms of inductive bias have continuously pushed the state of the art in solving natural language tasks.

In this section, we introduce a new form of inductive bias, named logical inductive bias, which works together with distributed representations to design more effective representation mappings. Our main idea is to view the language representation mapping as a logic reasoning process; that is, the language representations are deduced step by step from the original discrete token sequences. Specifically, we treat the tokens in the input sequence as terms (or objects) in logic programming, and treat their properties and relations as predicates of different arities. In light of the logical inductive bias, the language representations that we seek to compute are the (advanced) properties and relations that can be deduced from the input (basic) properties and relations. Most importantly, we require the construction of this deduction process to follow the principles of first-order logic, in order to encode the logical inductive bias into the representation learning process. We now formulate language representation learning as a logic programming problem, adopting notations similar to those used in Evans & Grefenstette (2018).

• Terms: We consider a first-order logic system without function symbols, so that terms can only be variables or constants, which represent general objects or particular objects of interest, respectively. In the context of language representation learning, we model each instance of a text sequence x (of length T) as a collection of constants x = {x_1, ..., x_T}, where each token x_t is a constant (t = 1, ..., T). As in logic programming, we use lower-case letters to denote constants and upper-case letters to denote variables. For example, X is a variable representing a general object (e.g., a token).

• Atoms: For each term, we define its properties and relations as r-ary predicates p(X_1, ..., X_r), which take the value T (True) or F (False) depending on whether a certain property/relation of (X_1, ..., X_r) holds. For example, whether a token a takes the v-th id in the vocabulary is a unary predicate TokenID_v(a) for v = 1, ..., V, where V is the vocabulary size; such basic input atoms are illustrated in the sketch below.
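As a concrete illustration of these basic input atoms, the snippet below encodes a token sequence into a unary tensor with one TokenID_v column per vocabulary entry, and a binary tensor of relative distances. The tensor layouts and the helper name base_predicates are our own assumptions for illustration, not FOLNet's actual input format.

```python
# Illustrative encoding of the basic input atoms (layouts are assumptions):
# TokenID_v(x_t) as a one-hot unary tensor and the relative distance
# between x_s and x_t as a binary tensor.
import torch
import torch.nn.functional as F

def base_predicates(token_ids: torch.Tensor, vocab_size: int):
    T = token_ids.shape[0]
    # unary[t, v] = 1 iff TokenID_v(x_t) is True
    unary = F.one_hot(token_ids, vocab_size).float()      # (T, V)
    # binary[s, t] = s - t, the signed relative distance
    pos = torch.arange(T)
    binary = (pos[:, None] - pos[None, :]).float()        # (T, T)
    return unary, binary

ids = torch.tensor([12, 7, 7, 301])                       # toy token ids
unary, binary = base_predicates(ids, vocab_size=512)      # (4, 512), (4, 4)
```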

¹ The code along with the pretrained model checkpoints will be released publicly.

