CONSISTENCY AND MONOTONICITY REGULARIZATION FOR NEURAL KNOWLEDGE TRACING

Abstract

Knowledge Tracing (KT), the task of tracking a human's knowledge acquisition, is a central component of online learning and AI in Education. In this paper, we present a simple yet effective strategy for improving the generalization ability of KT models: we propose three types of novel data augmentation, coined replacement, insertion, and deletion, along with corresponding regularization losses that impose certain consistency or monotonicity biases on the model's predictions for the original and augmented sequences. Extensive experiments on various KT benchmarks, covering 3 widely used neural network architectures and 4 public datasets, show that our regularization scheme consistently improves model performance, e.g., it yields a 6.3% improvement in AUC for the DKT model on the ASSISTmentsChall dataset.

1. INTRODUCTION

In recent years, Artificial Intelligence in Education (AIEd) has gained much attention as one of the emerging fields in educational technology. In particular, the recent COVID-19 pandemic has shifted the setting of education from classroom learning to online learning. As a result, AIEd has become more prominent because of its ability to diagnose students automatically and provide personalized learning paths. High-quality diagnosis and educational content recommendation require a good understanding of a student's current knowledge status, so it is essential to model their learning behavior precisely. For this reason, Knowledge Tracing (KT), the task of modeling a student's evolution of knowledge over time, has become one of the most central tasks in AIEd research.

Since the work of Piech et al. (2015), deep neural networks have been widely used for KT modeling. Current research trends in the KT literature concentrate on building more sophisticated, complex, and large-scale models, inspired by model architectures from Natural Language Processing (NLP), such as LSTM (Hochreiter & Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) architectures, along with additional components that extract question textual information or students' forgetting behaviors (Huang et al., 2019; Pu et al., 2020; Ghosh et al., 2020). However, as the number of parameters of these models grows, they may easily overfit on small datasets, hurting the model's generalizability. This issue has been under-explored in the literature.

To address the issue, we propose simple yet effective data augmentation strategies for improving the generalization ability of KT models, along with novel regularization losses for each strategy. In particular, we suggest three types of data augmentation, coined (skill-based) replacement, insertion, and deletion.
Specifically, we generate augmented (training) samples by randomly replacing questions that a student solved with similar questions, or by inserting/deleting interactions with fixed responses. Then, during training, we impose certain consistency (for replacement) and monotonicity (for insertion/deletion) biases on the model's predictions by optimizing corresponding regularization losses that compare the original and augmented interaction sequences. Our intuition behind the proposed consistency regularization is that the model's outputs for two interaction sequences with the same response logs for similar questions should be close. The proposed monotonicity regularization, in turn, enforces the model's prediction to be monotone with respect to the number of questions answered correctly (or incorrectly), i.e., a student is more likely to answer correctly (or incorrectly) if the student did so more often in the past. By analyzing the distribution of previous correctness rates of interaction sequences, we observe that existing student interaction datasets indeed exhibit such monotonicity properties; see Figure 1 and Section A.2 for details. The overall augmentation and regularization strategies are sketched in Figure 2.

Figure 1: Distribution of the correctness rate of past interactions when the response correctness of the current interaction is fixed, for 4 knowledge tracing benchmark datasets. Orange (resp. blue) represents the distribution of the past-interaction correctness rate when the current interaction's response is correct (resp. incorrect); the x axis represents previous interactions' correctness rates (values in [0, 1]). The orange distribution leans further to the right than the blue distribution, which shows the monotone nature of the interaction datasets. See Section A.2 for details.
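The three augmentation operations described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the per-interaction probabilities, and the `skill_map` structure (mapping each question to questions tagged with the same skill) are our assumptions.

```python
import random

def replace(seq, skill_map, p=0.3, rng=random):
    """Skill-based replacement: swap each question, with probability p,
    for a random question tagged with the same skill; responses are kept."""
    out = []
    for q, r in seq:
        if rng.random() < p:
            q = rng.choice(skill_map[q])  # similar question (same skill)
        out.append((q, r))
    return out

def insert(seq, question_pool, response, p=0.2, rng=random):
    """Insertion: add new interactions with a fixed response
    (all-correct or all-incorrect) at random positions."""
    out = []
    for item in seq:
        if rng.random() < p:
            out.append((rng.choice(question_pool), response))
        out.append(item)
    return out

def delete(seq, response, p=0.2, rng=random):
    """Deletion: drop, with probability p, interactions whose response
    equals `response`; keep at least one interaction."""
    out = [(q, r) for q, r in seq if r != response or rng.random() >= p]
    return out if out else seq[:1]
```

For example, `insert(seq, pool, response=1)` produces the "correct insertion" augmentation of Figure 2, for which the model's predicted correctness probability should not decrease.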
These regularization strategies are motivated by our observation that existing knowledge tracing models' predictions often fail to satisfy the consistency and monotonicity conditions, e.g., see Figure 4 in Section 3.

Figure 2: Augmentation strategies and the corresponding biases on the model's predictions (predicted correctness probabilities). Each tuple represents the question id and the student's response (1 means correct). Replacing interactions with similar questions does not change the model's predictions drastically; introducing new interactions with correct responses increases the model's estimate, while deleting such an interaction decreases it.

We demonstrate the effectiveness of the proposed method with 3 widely used neural knowledge tracing models - DKT (Piech et al., 2015), DKVMN (Zhang et al., 2017b), and SAINT (Choi et al., 2020a) - on 4 public benchmark datasets - ASSISTments2015, ASSISTmentsChall, STATICS2011, and EdNet-KT1. Extensive experiments show that, regardless of dataset or model architecture, our scheme remarkably increases prediction performance, e.g., a 6.2% gain in Area Under Curve (AUC) for DKT on the ASSISTmentsChall dataset. In particular, ours is far more effective on smaller datasets: using only 25% of the ASSISTmentsChall dataset, we improve the AUC of the DKT model from 69.68% to 75.44%, which even surpasses the baseline performance of 74.4% with the full training set. We further provide various ablation studies for the selected design choices, e.g., the AUC of the DKT model on the ASSISTments2015 dataset drops from 72.44% to 66.48% when we impose a 'reversed' (wrong) monotonicity regularization.
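The consistency and monotonicity biases can each be expressed as a penalty comparing the model's predictions on the original and augmented sequences. The sketch below is one plausible instantiation under our own assumptions (a squared-distance consistency term and a hinge-style monotonicity term); the exact losses used in the paper may differ.

```python
def consistency_loss(p_orig, p_aug):
    """Consistency (replacement): predictions on the original sequence and
    on the similar-question-replaced sequence should be close. Here we use
    a mean squared distance as one hypothetical choice of discrepancy."""
    return sum((a - b) ** 2 for a, b in zip(p_orig, p_aug)) / len(p_orig)

def monotonicity_loss(p_orig, p_aug, direction=+1):
    """Monotonicity (insertion/deletion): hinge penalty when the augmented
    prediction violates the expected ordering. Use direction=+1 when the
    augmented prediction should not be lower than the original (e.g., after
    inserting correct interactions), and direction=-1 for the opposite case."""
    return sum(max(0.0, direction * (a - b))
               for a, b in zip(p_orig, p_aug)) / len(p_orig)
```

In training, such terms would be added to the usual binary cross-entropy objective with tuned weights; e.g., a correct-insertion pair with `p_aug >= p_orig` elementwise incurs zero monotonicity penalty.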
We believe that our work can serve as a strong guideline for other researchers attempting to improve the generalization ability of KT models.

1.1. RELATED WORKS AND PRELIMINARIES

Data augmentation is arguably the most trustworthy technique for preventing overfitting and improving the generalizability of machine learning models. In particular, it has been developed as an effective way to impose a domain-specific inductive bias on a model. For example, for computer vision models, simple image warpings such as flips, rotations, distortions, color shifting, blur, and random erasing are the most popular data augmentation methods (Shorten & Khoshgoftaar, 2019). More advanced techniques, e.g., augmenting images by interpolation (Zhang et al., 2017a; Yun et al., 2019) or by using generative adversarial networks (Huang et al., 2018), have also been investigated. For NLP models,

