TRANSFORMERS CAN BE TRANSLATED TO FIRST-ORDER LOGIC WITH MAJORITY QUANTIFIERS

Abstract

Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can their inner decision process be captured symbolically in some familiar logic? We show that any transformer neural network can be translated into an equivalent fixed-size first-order logic formula that may also use majority quantifiers. The idea is to simulate transformers with highly uniform threshold circuits and to leverage known theoretical connections between circuits and logic. Our findings also reveal the surprising fact that the entire transformer computation can be reduced merely to the division of two (large) integers. While our results are most pertinent for transformers, they apply equally to a broader class of neural network architectures, namely those with a fixed-depth uniform computation graph made up of standard neural net components, which includes feedforward and convolutional networks.

1. INTRODUCTION

The incredible success of deep learning models, especially very large language and vision models with tens to hundreds of billions of parameters (Brown et al., 2020; Thoppilan et al., 2022), has come at the cost of increasingly limited understanding of how these models actually work and when they might fail. This raises many concerns, such as those around their safe deployment, fairness, and accountability. Is the inner working of such networks fundamentally different from the classical algorithms and symbolic systems that we understand better? Or can their computation be described using a familiar symbolic formalism?

We derive what is, to the best of our knowledge, the first direct connection between a broad class of neural networks and the well-studied classical formalism of first-order logic. Specifically, we show that transformers, and more generally neural networks whose computation graph has constant depth and a "repetitive" or uniform structure, implement nothing but fixed-size first-order logic expressions, provided the logic is allowed to have majority quantifiers (M) in addition to the standard existential (∃) and universal (∀) quantifiers. A majority quantifier takes a sequence of boolean values and returns true if more than half of them are true. The resulting logic formalism is commonly referred to as FO(M).

Theorem 1 (Informal version of Cor. 5.1). For any neural network N with a constant-depth computation graph, there is a fixed-size FO(M) formula ϕ equivalent to N.

This result immediately provides mechanistic interpretability: it demonstrates that, at least in principle, the inner decision process of any transformer model can be efficiently translated into a fixed-size formula (with respect to the input length) in a simple, well-defined logic. The output N(x) of the transformer on any input x is simply the value ϕ(x) of this formula.
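To make the majority quantifier concrete, here is a minimal Python sketch (our own illustration; the names `majority` and `phi` are hypothetical, not from the paper) evaluating the FO(M) formula ϕ(x) = M i. (x_i = 1), which is true exactly when more than half of the input positions hold a 1:

```python
def majority(values):
    """M-quantifier: true iff strictly more than half of the values are true."""
    values = list(values)
    return sum(values) > len(values) / 2

def phi(x):
    # FO(M) formula  phi(x) = M i. (x_i = 1)  over a binary string x:
    # "a majority of positions carry a 1"
    return majority(x[i] == "1" for i in range(len(x)))

print(phi("11010"))  # True: three of five positions are 1
print(phi("10010"))  # False: only two of five positions are 1
```

A fixed formula like this is evaluated over inputs of arbitrary length, which is what makes the "fixed-size with respect to the input length" property in Theorem 1 meaningful.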
Similar to decision trees, FO(M) formulae have the property that each sub-expression corresponds to a logical constraint, i.e., a function mapping the input sequence to a truth value. In contrast, the internal modules of a transformer or of a complex circuit do not satisfy this property, as they map between uninterpretable latent spaces. We thus believe that converting transformers to FO(M) formulae could be leveraged for interpreting their behavior in future work, although a thorough exploration of this idea lies outside the scope of our theoretical contributions in this paper.

Thm. 1 also gives some insight into how to contrast the abilities of transformers and finite-state machines. Classically, the regular languages can be characterized as the languages definable by monadic second-order logical formulae (Büchi, 1960; Elgot, 1961). We have shown that transformers can be simulated by first-order formulae with majority quantifiers. Thus, one advantage transformers have over finite-state machines is the ability to resolve majority quantifiers over their input. However, transformers are not necessarily strictly more powerful than finite-state machines: it is unknown whether monadic second-order quantifiers can be simulated by majority quantifiers.

We derive this connection between transformers and FO(M) by leveraging a key result in circuit complexity theory: the equivalence of FO(M) with highly uniform threshold circuits, specifically with the circuit class known as DLOGTIME-uniform TC^0 (Barrington et al., 1990).* Our proof builds upon and significantly tightens prior work by Hao et al. (2022), Hahn (2020), Merrill et al. (2022), and Merrill & Sabharwal (2022) on relating specific types of transformer networks to circuit complexity classes. We improve on prior analysis on two fronts. First, the class log-uniform TC^0 is much tighter than the classes obtained in previous results.
Second, in contrast to their work, we obtain a characterization for a fully general model of transformers without limiting assumptions. Our formal model is the first to cover realistic design choices and sizes used in typical transformer implementations. Specifically, we show that any transformer network with a fixed-depth computation graph can be simulated by a highly uniform class of threshold circuits:

Theorem 2 (Informal version of Thm. 5). For any neural network N with a constant-depth computation graph, there exists a log-uniform TC^0 circuit family C = {C_n}_{n=0}^∞ such that, for all x of size n, N(x) = C_n(x).

This result, in addition to helping derive Thm. 1, has significant implications of its own. It provides a much tighter class of transformer-hard problems, i.e., problems to which any transformer can be efficiently reduced, than previously known. It shows that every problem complete for the class of log-uniform TC^0 circuits is transformer-hard. Since division (Hesse, 2001; Hesse et al., 2002) is known to be one such problem (Aaronson et al., 2022), this leads to the following rather surprising finding:

Corollary 2.1. For any neural network N with a constant-depth computation graph and input x, the computation of N(x) can be efficiently reduced to integer division in the following sense: for all j ∈ {1, . . . , |N(x)|}, there exist efficiently computable (via log-uniform circuits) integers a_j(x), b_j(x), and i_j such that the j-th bit of N(x) equals the i_j-th bit of ⌊a_j(x)/b_j(x)⌋.

This again allows us to view transformers from a novel perspective: computing a bit of the output of a transformer with hundreds of billions of parameters can be reduced to dividing two integers. Even very large transformers are, in this sense, ultimately simple functions operating at a very large scale.
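The final step of the reduction in Cor. 2.1 is elementary, as the following sketch shows: given the integers the corollary guarantees exist, reading off one bit of a neural network's output amounts to an integer division followed by a bit extraction. The concrete values of `a`, `b`, and `i` below are hypothetical stand-ins for a_j(x), b_j(x), and i_j.

```python
def bit(n, i):
    """Return the i-th least significant bit of a nonnegative integer n."""
    return (n >> i) & 1

# Hypothetical stand-ins for the a_j(x), b_j(x), i_j of Cor. 2.1:
a, b, i = 1_000_003, 7, 4

q = a // b          # floor(a / b): the only "hard" step in the reduction
output_bit = bit(q, i)
print(q, output_bit)  # 142857 0
```

The corollary's content is that the *inputs* to this step, the integers a_j(x) and b_j(x), are themselves computable by log-uniform circuits, so the full transformer computation hides inside the division.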
In summary, our findings shed new light on the inner computation of transformers (and other neural architectures) and its connection to first-order logic. While the literature on neuro-symbolic models has often viewed symbolic and neural systems as very different from each other, our results show that the boundaries between these two computational paradigms are not as rigid as they might seem.

Roadmap. §2 gives a definition and examples of FO(M), and then introduces the background on computation graphs and circuits relevant to our proofs. §3 presents the constructive algorithm for compiling computation graph families into threshold circuit families and justifies its correctness. Specifically, we obtain the result that log-uniform computation graphs (with uniform node types) can be compiled into FO(M) expressions. Then, §4 shows that transformers are uniform computation graph families, which implies that they can be compiled into FO(M).



* For brevity, we will henceforth abbreviate DLOGTIME-uniform as log-uniform. Conceptually, our results converting transformers to logical formulae and circuits resemble empirical work extracting discrete "subcircuits" from transformers (Elhage et al., 2021), although we use the term circuit in a precise formal sense, rather than to mean any discrete rules summarizing model behavior.

