LEARNING CHESS BLINDFOLDED

Abstract

Transformer language models have made tremendous strides in natural language understanding. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simple, constrained, and deterministic domain. Moreover, we observe that chess notation itself allows for directly probing the world state, without requiring any additional probing-related machinery. Additionally, we have access to a vast number of chess games coupled with the exact state at every move, allowing us to measure the impact of various ways of including grounding during language model training. Overall, we find that with enough training data, transformer language models can learn to track pieces and predict legal moves when trained solely from move sequences. However, in adverse circumstances (small training sets or prediction following long move histories), providing access to board state information during training can yield consistent improvements.

1. INTRODUCTION

Recently, transformer-based language models such as GPT-3 have stretched notions of what is possible with the simple self-supervised objective of language modeling, becoming a fixture in state-of-the-art language technologies (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020). However, the black-box nature of these models, combined with the complexity of natural language, makes it challenging to measure how accurately they represent the world state underlying the text. Motivated by these issues, we propose training transformer language models for the game of chess. Chess provides a simple, constrained, and deterministic domain where the exact world state is known, and chess games can be transcribed exactly and unambiguously using chess notations (Section 2). In fact, the form of chess notation allows us to probe our language models for aspects of the board state using simple prompts (Section 3). Due to the simplicity and precision of chess, we can evaluate language model predictions at a finer grain than merely comparing them to the ground truth. For example, even if the next-move prediction does not match the ground-truth move, we can still evaluate whether the move is legal given the board state, and if it is illegal, we can determine why (Appendix D). Moreover, since the state can be modeled exactly, we can also evaluate models using counterfactual queries. The proposed evaluation sets and metrics are described in Section 5.3. A side benefit of working with chess is access to nearly unlimited data coupled with the exact board state at every turn. This board state is a form of grounding for the move sequence and allows us to compare training on move sequences alone to training with access to varying amounts of explicit state.
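This prompt-based probing requires no extra machinery: truncating a UCI move sequence right after a starting square already asks the model where the piece on that square can go. A minimal sketch (the helper name and whitespace tokenization here are our own illustration; the actual model input format may differ):

```python
def piece_probe(moves, query_square):
    """Build a probing prompt: a UCI game prefix followed by a bare
    starting square.  A model that tracks the board state should
    continue with a destination square that is legal for the piece
    currently standing on `query_square`."""
    return " ".join(moves + [query_square])

# After 1.e4 e5 2.Nf3, probe white's light-squared bishop on f1;
# legal continuations would include "b5", "c4", or "e2".
prompt = piece_probe(["e2e4", "e7e5", "g1f3"], "f1")
```

Reading the generated continuation off as a destination square turns ordinary next-token prediction directly into a board-state query.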
Thus, modeling chess with language models may have implications for the debate surrounding the ability of language models to capture meaning when exposed only to text (Bender & Koller, 2020). To test the impact of chess board grounding on learnability and data efficiency, we can train language models with varying degrees of access to the board state (Section 4). Finally, while chess represents a controlled domain, it is by no means trivial for a language model. To illustrate the challenges of language modeling for chess, consider the left board shown in Figure 1b, where white is next to move. In order to generate a valid next move, the language model needs to (a) infer that it is white's turn, (b) represent the locations of all pieces, both white and black, (c) select one of the white pieces that can be legally moved, and finally (d) make a legal move with the selected piece. A language model therefore has to learn to track the board state, to generate moves according to the rules of chess, and, on top of that, to learn chess strategy in order to predict the actual move. We find that when given enough training data, transformers can learn both to track piece locations and to predict legal moves at high levels of accuracy. However, predictive ability suffers in more adverse settings: prediction following long move histories, small training sets, or access to only a limited history (Appendix C.1). These challenging settings can provide an interesting testbed for future development of language models; moreover, because of the probing properties of the notation, errors can be diagnosed in great detail. In these more challenging settings, we show that providing parts of the board state (during training time only) can lead to significant improvements in accuracy. Our results also provide some key insights on transformer language models: (i) They are robust to various ways of incorporating explicit supervision about the board state when given enough training data.
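To make concrete what steps (a)-(d) demand of the model, the state a language model must track implicitly can be written out explicitly in a few lines. The sketch below is deliberately simplified: it ignores castling rook moves, en passant captures, and promotions, all of which a full tracker (and the model) must also handle.

```python
def track_state(moves):
    """Simplified tracker for what a chess language model must infer
    implicitly from a UCI move sequence: whose turn it is and where
    each piece stands.  Ignores castling, en passant, and promotion."""
    # Standard starting position as a dict: square -> piece
    # (uppercase = white, lowercase = black).
    board = {}
    for i, f in enumerate("abcdefgh"):
        board[f + "2"] = "P"
        board[f + "7"] = "p"
        board[f + "1"] = "RNBQKBNR"[i]
        board[f + "8"] = "rnbqkbnr"[i]
    for move in moves:
        src, dst = move[:2], move[2:4]
        board[dst] = board.pop(src)  # a capture simply overwrites dst
    turn = "white" if len(moves) % 2 == 0 else "black"
    return board, turn

board, turn = track_state(["e2e4", "e7e5", "g1f3"])
# Knight now stands on f3, black pawn on e5; it is black's turn.
```

Predicting a legal move then amounts to reading the side to move off the sequence length and the piece locations off the accumulated moves, exactly the computation the transformer must perform internally.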
(ii) In particular, they are robust to changes in input distribution where additional board-state tokens are added to the input sequence only during training (Section 4.1); in contrast to LSTMs, transformers achieve this robustness even with smaller training sets (Appendix F). (iii) Model performance relies strongly on access to the whole sequence history: performance drops when this access is limited to a fixed-size window of previous tokens (Appendix C.1).

To summarize, our contributions are to:

• Propose chess as a testbed for evaluating the world state tracking capabilities of language models.

• Show that by selecting (and tweaking) an appropriate chess notation, we can probe the language model for aspects of the world state using simple prompts (Section 3).

• Propose a suite of probing tasks to evaluate language models for chess on world state tracking (Section 5.3). These probing tasks go beyond simple exact match, use more fine-grained evaluation, and allow for automated error analysis (Appendix D).

• Show that, given enough training data, transformer language models can learn to track piece locations and predict legal moves at high levels of accuracy.

• Evaluate the effect of grounding by training and evaluating a spectrum of transformer language models with varying degrees of access to the world state. We find that grounding helps in the challenging settings of our proposed probing tasks.

• Provide insights on transformer language models, such as their robustness to incorporating the world state in various ways and their dependence on access to the whole history.
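The idea of evaluation beyond exact match can be illustrated with a toy error classifier: instead of asking whether a predicted move equals the ground truth, grade it against the set of currently legal moves. The taxonomy and names below are illustrative only, not the exact categories of our automated error analysis; the legal-move set is a hand-picked toy subset.

```python
def classify_prediction(pred, legal_moves):
    """Grade a predicted UCI move against the set of legal moves,
    distinguishing *why* an illegal prediction fails (illustrative
    two-way error taxonomy)."""
    if pred in legal_moves:
        return "legal"
    if not any(m.startswith(pred[:2]) for m in legal_moves):
        return "illegal: no movable piece on the starting square"
    return "illegal: piece cannot reach the destination square"

# Toy subset of legal moves for demonstration purposes only.
legal = {"f1b5", "f1c4", "g1f3"}
classify_prediction("f1b5", legal)  # the bishop move is legal
classify_prediction("f1a6", legal)  # right piece, unreachable square
```

A prediction that misses the ground-truth move but is still classified "legal" is a very different failure from one that moves a nonexistent piece, and the two can be counted separately.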

2. BACKGROUND

Chess Preliminaries. Figure 1a shows how squares are indicated in chess notations via a combination of lettered columns and numbered rows. Chess notations use this square naming convention to denote the movement of pieces. As our notation, we choose Universal Chess Interface (UCI) notation, which combines the starting square and the destination square to represent a move. The move in Figure 1b, in which the bishop moves from f1 to b5, is thus written as f1b5.



For more details see https://en.wikipedia.org/wiki/Universal_Chess_Interface



Figure 1: Chess Notation. (a) Squares are named by a column letter and a row number. (b) Board state before (left) and after (right) the bishop at f1 is moved to b5; UCI notation represents the move as f1b5.
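A UCI move string decomposes mechanically into its component squares, which is what makes the notation so convenient for prompting. A small parser (our own illustration, not part of any UCI tooling) makes the format explicit:

```python
def parse_uci(move):
    """Split a UCI move into (source, destination, promotion).
    Squares are a column letter a-h plus a row number 1-8;
    promotions append a fifth character (e.g. 'e7e8q')."""
    src, dst, promo = move[:2], move[2:4], move[4:] or None
    for sq in (src, dst):
        assert sq[0] in "abcdefgh" and sq[1] in "12345678", move
    return src, dst, promo

parse_uci("f1b5")   # the bishop move from Figure 1b
parse_uci("e7e8q")  # a pawn promoting to a queen
```

Because every move is exactly a source square followed by a destination square, truncating a game transcript after a source square is a well-formed prompt, which is the property our probing tasks exploit.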

