IMPROVING THE STRENGTH OF HUMAN-LIKE MODELS IN CHESS

Abstract

Designing AI systems that capture human-like behavior has attracted growing attention in applications where humans may want to learn from, or need to collaborate with, these AI systems. Many existing works in designing human-like AI have taken a supervised learning approach that learns from data of human behavior, with the goal of creating models that can accurately predict human behavior. While this approach has shown success in capturing human behavior at different skill levels and even identifying individual behavioral styles, it also suffers from the drawback of mimicking human mistakes. Moreover, existing models only capture a snapshot of human behavior, leaving the question of how to improve them (e.g., from one human skill level to a stronger one) largely unanswered. Using chess as an experimental domain, we investigate the question of teaching an existing human-like model to be stronger using a data-efficient curriculum, while maintaining the model's human similarity. To achieve this goal, we extend the concept of curriculum learning to settings with multiple labeling strategies, allowing us to vary both the curriculum (dataset) and the teacher (labeling strategy). We find that the choice of teacher has a strong impact on both playing strength and human similarity; for example, a teacher that is too strong can be less effective at improving playing strength and degrade human similarity more rapidly. We also find that the choice of curriculum can impact these metrics, but to a smaller extent; for example, training on a curriculum of human mistakes provides only a marginal benefit over training on a random curriculum. Finally, we show that our strengthened models achieve human similarity on datasets corresponding to their strengthened level of play, suggesting that our curriculum training methodology is improving them in human-like steps.

1. INTRODUCTION

AI systems are growing increasingly capable at solving tasks, making decisions, and assisting humans in a wide range of domains. In games such as chess, Go, and poker, AI has demonstrated clear superiority over human performance (Silver et al., 2018; Brown & Sandholm, 2019). In domains such as marketing, transportation, medicine, law, hiring, and finance, AI decision making is capable enough to be deployed alongside humans in many real-life scenarios, and in some cases, AI performance is sufficient to take over entirely.

More recently, researchers have looked into developing AI with the explicit goal of replicating human decision making, in contrast to simply optimizing for AI performance. There are several reasons for doing this. In domains such as autonomous driving, developing human-like AI can result in increased safety for both humans and AI, by providing an understanding of human driving tendencies (Hecker et al., 2019). In situations where AI is assisting or educating humans, humans exhibit a higher level of trust when the AI is human-like (Wang et al., 2019; Li et al., 2021; Kim et al., 2022), and higher levels of satisfaction when interacting with human-like AI (Amigó et al., 2006; Ragot et al., 2020; Jenneboer et al., 2022). In gaming, human-like AI can offer opportunities to practice against models of real opponents (McIlroy-Young et al., 2022b;a), and is generally more enjoyable to play with in competitive or cooperative play (Zhang et al., 2021; Soni & Hingston, 2008). In chess, researchers have used a supervised learning approach on human game data to create models that match human decision-making at different skill levels (McIlroy-Young et al., 2020); these models can even be fine-tuned to match the decisions of an individual player and distinguish between individual playing styles (McIlroy-Young et al., 2022b; 2021).
Since these models strive to match human behavior at different skill levels, they also adopt the human tendencies and mistakes made at those levels. Moreover, each of these models only represents a snapshot of human behavior, with no investigation of how to improve the models, e.g., from one human skill level to a stronger one. In this paper, we attempt to address these questions by developing a methodology for improving the strength of human-like models while retaining their human similarity. Specifically, given a human-like model at a particular skill level, our goal is to improve the model's strength in a data-efficient manner, without degrading its human similarity too much. Thus, we evaluate these human-like models along a trade-off of three metrics: strength, data efficiency, and human similarity (§3.2).

Chess is an ideal domain for our study due to the availability of AI systems at both ends of the strength vs. human similarity spectrum. Chess engines like Stockfish and Leela (the open-source replication of AlphaZero (Silver et al., 2018)) dominate the best human chess players in strength, while Maia Chess (McIlroy-Young et al., 2020) provides a framework for training models that can accurately mimic human behavior at different skill levels. Additionally, using chess affords us a massive amount of public human data, which can be used for training or further analysis. Finally, chess has a well-defined ruleset, which allows us to evaluate the playing strength of any chess model we obtain using computationally cheap simulations.

In devising a methodology for strengthening human-like models, we considered two main approaches. The first is to leverage a state-of-the-art chess learning framework, such as the deep reinforcement learning framework of AlphaZero, but with an additional constraint to match human moves. Such a constrained optimization approach, however, is difficult to integrate into existing human-like models.
Additionally, this process is far removed from any real human learning process. Instead, we chose a methodology that more closely mimics how a human might learn: starting with an existing human-like model, which we call the student, we fine-tune the model using a customized training curriculum. This approach gives us control over how we trade off human similarity for playing strength, while also producing training curricula that are interpretable to humans. We implement this methodology by replicating and extending Maia's framework for training human-like models and fine-tuning them on individual player data (McIlroy-Young et al., 2022b), except that we train our own models and fine-tune them on training curricula that we design.

The core idea behind our approach is similar to the concept of curriculum learning developed by Bengio et al. (2009), which studies how to select the most efficient ordering of the data for training a model. However, we extend this concept in an important way: we consider not just the selection of the data, but also the strategy for labeling the data. That is, we distinguish between the input dataset, which we call the curriculum, and the labeling strategy, which we call the teacher. In the case of chess, the curriculum corresponds to a set of chess positions, and the teacher can be any chess model that suggests which move to make in each position. Whereas traditional curriculum learning assumes a single, "correct" label, in our work we consider chess engines and human-like models of varying strength, which may disagree about which move to make in a given position. In our evaluation, we explore what happens when we vary both the curriculum and the teacher (§4.1), vary only the teacher (§4.2), or vary only the curriculum (§4.3). Our results lead to several interesting conclusions.
We find that the choice of teacher has a strong impact on both playing strength and human similarity, and the strongest teacher is not necessarily the best choice. For example, a teacher like Stockfish is both too strong and too different from the student when the student is a human-like model of novice strength. Instead, using a stronger human-like model as a teacher leads to much better outcomes in strength, human similarity, and training efficiency. We also find that the choice of curriculum can impact these metrics, but to a smaller extent. For example, training on a curriculum of positions on which the student made mistakes provides only a marginal benefit over training on random positions the student encountered. Finally, we show that our strengthened models achieve human similarity on datasets corresponding to their strengthened level of play (§4.4), suggesting that our curriculum training methodology improves these models in human-like steps.
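To make the curriculum/teacher distinction concrete, the sketch below shows the fine-tuning setup in miniature. It is a minimal illustration, not the actual Maia training pipeline: the `Student` class, the toy teacher, and the position format are all hypothetical stand-ins. The key structure it captures is that the curriculum supplies the positions, the teacher supplies the move labels, and the student is updated toward those labels.

```python
import random

def engine_teacher(position):
    """Hypothetical stand-in for a teacher (e.g., a strong engine or a
    stronger human-like model): maps each position to a move label."""
    return max(position["legal_moves"])  # deterministic placeholder policy

class Student:
    """Toy human-like model: a lookup policy with a random fallback,
    standing in for a neural policy network."""
    def __init__(self):
        self.policy = {}  # position id -> preferred move

    def predict(self, position):
        return self.policy.get(position["id"],
                               random.choice(position["legal_moves"]))

    def fine_tune(self, curriculum, teacher):
        """One 'epoch': relabel every curriculum position with the
        teacher's move and update the student toward that label."""
        for position in curriculum:
            self.policy[position["id"]] = teacher(position)

# The curriculum is just a set of positions (random positions, or positions
# where the student erred); the teacher independently supplies the labels.
curriculum = [{"id": i, "legal_moves": ["e4", "d4", "Nf3"]} for i in range(5)]
student = Student()
student.fine_tune(curriculum, engine_teacher)
```

Swapping the teacher (Stockfish vs. a stronger human-like model) changes the labels while the curriculum stays fixed; swapping the curriculum (random positions vs. student mistakes) changes which positions are relabeled while the teacher stays fixed, which is exactly the two-axis variation explored in §4.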

