CONSISTENCY AND MONOTONICITY REGULARIZATION FOR NEURAL KNOWLEDGE TRACING

Abstract

Knowledge Tracing (KT), the task of tracking a human's knowledge acquisition, is a central component of online learning and AI in Education. In this paper, we present a simple yet effective strategy to improve the generalization ability of KT models: we propose three types of novel data augmentation, coined replacement, insertion, and deletion, along with corresponding regularization losses that impose certain consistency or monotonicity biases on the model's predictions for the original and augmented sequences. Extensive experiments on various KT benchmarks show that our regularization scheme consistently improves model performance, under 3 widely used neural networks and 4 public benchmarks, e.g., it yields a 6.3% improvement in AUC under the DKT model and the ASSISTmentsChall dataset.

1. INTRODUCTION

In recent years, Artificial Intelligence in Education (AIEd) has gained much attention as one of the emerging fields in educational technology. In particular, the recent COVID-19 pandemic has transformed the setting of education from classroom learning to online learning. As a result, AIEd has become more prominent because of its ability to diagnose students automatically and provide personalized learning paths. High-quality diagnosis and educational content recommendation require a good understanding of students' current knowledge status, so it is essential to model their learning behavior precisely. For this reason, Knowledge Tracing (KT), the task of modeling a student's evolution of knowledge over time, has become one of the most central tasks in AIEd research.

Since the work of Piech et al. (2015), deep neural networks have been widely used for KT modeling. Current research trends in the KT literature concentrate on building more sophisticated, complex, and large-scale models, inspired by model architectures from Natural Language Processing (NLP), such as LSTM (Hochreiter & Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) architectures, along with additional components that extract question text information or students' forgetting behaviors (Huang et al., 2019; Pu et al., 2020; Ghosh et al., 2020). However, as the number of parameters of these models increases, they may easily overfit on small datasets, hurting generalizability. This issue has been under-explored in the literature. To address it, we propose simple yet effective data augmentation strategies for improving the generalization ability of KT models, along with novel regularization losses for each strategy. In particular, we suggest three types of data augmentation, coined (skill-based) replacement, insertion, and deletion.
Specifically, we generate augmented (training) samples by randomly replacing questions that a student solved with similar questions, or by inserting/deleting interactions with fixed responses. Then, during training, we impose certain consistency (for replacement) and monotonicity (for insertion/deletion) biases on the model's predictions by optimizing corresponding regularization losses that compare the original and augmented interaction sequences. The intuition behind the proposed consistency regularization is that the model's outputs for two interaction sequences with the same response logs for similar questions should be close. The proposed monotonicity regularization is designed to enforce that the model's predictions are monotone with respect to the number of questions answered correctly (or incorrectly), i.e., a student is more likely to answer correctly (or incorrectly) if the student did so more often in the past. By analyzing the distribution of previous correctness rates of interaction sequences, we observe that existing student interaction datasets indeed have this monotonicity property; see Figure 1 and Section A.2 for details. The overall augmentation and regularization strategies are sketched in Figure 2.

Figure 1: Distribution of the correctness rate of past interactions when the response correctness of the current interaction is fixed, for 4 knowledge tracing benchmark datasets. Orange (resp. blue) represents the distribution of the correctness rate (of past interactions) where the current interaction's response is correct (resp. incorrect). The x axis represents previous interactions' correctness rates (values in [0, 1]). The orange distribution leans more to the right than the blue distribution, which shows the monotonicity nature of the interaction datasets. See Section A.2 for details.

Figure 2: Augmentation strategies and the corresponding biases on the model's predictions (predicted correctness probabilities). Each tuple represents the question id and the response of the student's interaction (1 means correct). Replacing interactions with similar questions (Q_1, Q_3 to Q'_1, Q'_3) should not change the model's predictions drastically. Introducing new interactions with correct responses (Q', Q'') increases the model's estimates, but deleting such an interaction ((Q_1, 1)) decreases them.

Such regularization strategies are motivated by our observation that existing knowledge tracing models' predictions often fail to satisfy the consistency and monotonicity conditions, e.g., see Figure 4 in Section 3. We demonstrate the effectiveness of the proposed method with 3 widely used neural knowledge tracing models, DKT (Piech et al., 2015), DKVMN (Zhang et al., 2017b), and SAINT (Choi et al., 2020a), on 4 public benchmark datasets: ASSISTments2015, ASSISTmentsChall, STATICS2011, and EdNet-KT1. Extensive experiments show that, regardless of dataset or model architecture, our scheme remarkably increases prediction performance, e.g., a 6.3% gain in Area Under Curve (AUC) for DKT on the ASSISTmentsChall dataset. In particular, our method is much more effective on smaller datasets: by using only 25% of the ASSISTmentsChall dataset, we improve the AUC of the DKT model from 69.68% to 75.44%, which even surpasses the baseline performance of 74.4% with the full training set. We further provide various ablation studies for the selected design choices, e.g., the AUC of the DKT model on the ASSISTments2015 dataset drops from 72.44% to 66.48% when we impose 'reversed' (wrong) monotonicity regularization.
We believe that our work can serve as a strong guideline for other researchers attempting to improve the generalization ability of KT models.

1.1. RELATED WORKS AND PRELIMINARIES

Data augmentation is arguably the most trustworthy technique to prevent overfitting and improve the generalizability of machine learning models. In particular, it has been developed as an effective way to impose a domain-specific inductive bias on a model. For example, for computer vision models, simple image warpings such as flips, rotations, distortions, color shifting, blur, and random erasing are the most popular data augmentation methods (Shorten & Khoshgoftaar, 2019). More advanced techniques, e.g., augmenting images by interpolation (Zhang et al., 2017a; Yun et al., 2019) or by using generative adversarial networks (Huang et al., 2018), have also been investigated. For NLP models, it is popular to augment texts by replacing words with synonyms (Zhang et al., 2015) or with words that have similar (contextualized) embeddings (Wang & Yang, 2015; Kobayashi, 2018). As an alternative, back translation (Sennrich et al., 2016; Yu et al., 2018) generates an augmented sentence by translating a given sentence into a different language and translating it back to the original language with machine translation models. Recently, Wei & Zou (2019) showed that even simple methods like random insertion/swap/deletion can improve text classification performance. In the area of speech recognition, vocal tract length normalization (Jaitly & Hinton, 2013), synthesizing noisy audio (Hannun et al., 2014), speed perturbation (Ko et al., 2015), and spectrogram augmentation (Park et al., 2019) are popular data augmentation methods. The aforementioned techniques have been used not only in standard supervised learning setups, but also in various unsupervised and semi-supervised learning frameworks that impose certain inductive biases on models, e.g., consistency learning (Sajjadi et al., 2016; Xie et al., 2019; Berthelot et al., 2019; Sohn et al., 2020), which encourages a model's predictions on differently augmented versions of the same input to agree.

In the KT setup, a model observes a student interaction sequence (I_1, ..., I_T), where each I_t = (Q_t, R_t) is a pair of a question id Q_t and the student's response correctness R_t in {0, 1} (1 means correct), and estimates P[R_t = 1 | I_1, I_2, ..., I_{t-1}, Q_t], i.e., the probability that the student answers the question Q_t correctly at the t-th step. Corbett & Anderson (1994) proposed Bayesian Knowledge Tracing (BKT), which models a student's knowledge as a latent variable in a Hidden Markov Model. Various seq2seq architectures, including LSTM (Hochreiter & Schmidhuber, 1997), MANN (Graves et al., 2016), and Transformer (Vaswani et al., 2017), have also been used in the context of KT and have shown their efficacy. Deep Knowledge Tracing (DKT) is the first deep-learning-based model, representing a student's knowledge state as an LSTM's hidden state vectors (Piech et al., 2015). Dynamic Key-Value Memory Network (DKVMN) and its variants exploit relationships between questions/skills via concept vectors and concept-state vectors stored in key and value matrices, which is more interpretable than DKT (Zhang et al., 2017b; Abdelrahman & Wang, 2019). Transformer-based models (Pandey & Karypis, 2019; Choi et al., 2020a; Ghosh et al., 2020; Pu et al., 2020) can learn long-range dependencies with their self-attention mechanisms and can be trained in parallel. Utilizing additional interaction features, such as question texts (Huang et al., 2019; Pandey & Srivastava, 2020), prerequisite relations (Chen et al., 2018), and time information (Nagatani et al., 2019; Choi et al., 2020a; Pu et al., 2020), is another way to improve performance. Recent works use graph neural networks (Nakagawa et al., 2019; Liu et al.; Tong et al., 2020; Yang et al., 2020b) and convolutional networks (Yang et al., 2020a; Shen et al., 2020) to model relations between questions and skills or to extract individualized prior knowledge.
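To make the prediction target concrete, here is a deliberately trivial, non-neural baseline (our own illustration, not a model from the literature above): estimate P[R_t = 1 | I_1, ..., I_{t-1}, Q_t] by a Beta-smoothed overall correctness rate that ignores the question identity entirely. The function name and the (alpha, beta) prior are our own choices.

```python
def predict_next_correct(history, alpha=1.0, beta=1.0):
    """Estimate P[R_t = 1 | I_1..I_{t-1}] as a Beta(alpha, beta)-smoothed
    correctness rate over past interactions, ignoring Q_t entirely.

    history: list of (question_id, response) pairs, response in {0, 1}.
    """
    correct = sum(r for _, r in history)
    return (correct + alpha) / (len(history) + alpha + beta)
```

With an empty history the estimate is the prior mean alpha / (alpha + beta) = 0.5; neural KT models replace this crude statistic with a learned, question-aware sequence model.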

2. CONSISTENCY AND MONOTONICITY REGULARIZATION FOR KT

For a given set of data augmentations A, we train KT models with the following loss:

L_tot = L_ori + Σ_{aug ∈ A} (λ_aug L_aug + λ_reg-aug L_reg-aug)   (2)

where L_ori is the commonly used binary cross-entropy (BCE) loss for original training sequences and L_aug are the same BCE losses for augmented sequences generated by applying the augmentation strategies aug ∈ A.foot_0 L_reg-aug are the regularization losses that impose consistency and monotonicity biases on the model's predictions for the original and augmented sequences, defined in the following sections. Finally, λ_aug, λ_reg-aug > 0 are hyperparameters controlling the trade-off among L_ori, L_aug, and L_reg-aug. In the following sections, we introduce our three simple augmentation strategies, replacement, insertion, and deletion, with the corresponding consistency and monotonicity regularization losses: L_reg-rep, L_reg-cor-ins (or L_reg-incor-ins), and L_reg-cor-del (or L_reg-incor-del), respectively.
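The total objective can be sketched in plain Python (a hedged illustration: the helper names and the toy `bce` are ours, and a real implementation would compute these losses on a framework's tensors with gradients):

```python
import math

def bce(probs, labels):
    """Binary cross-entropy averaged over a response sequence."""
    eps = 1e-7
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(probs)

def total_loss(orig, augmented, lam_aug=1.0, lam_reg=10.0):
    """L_tot = L_ori + sum over augmentations of (lam_aug * L_aug
    + lam_reg * L_reg-aug).

    orig: (predicted probs, 0/1 labels) for the original sequence.
    augmented: list of (probs, labels, reg_loss) per augmentation.
    """
    probs, labels = orig
    loss = bce(probs, labels)                        # L_ori
    for aug_probs, aug_labels, reg_loss in augmented:
        loss += lam_aug * bce(aug_probs, aug_labels)  # L_aug
        loss += lam_reg * reg_loss                    # L_reg-aug
    return loss
```

With an empty augmentation set the objective reduces to the ordinary BCE loss, matching the vanilla training setup.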

2.1. REPLACEMENT

Replacement, which is motivated by synonym replacement in NLP, is an augmentation strategy that replaces questions in the original interaction sequence with other similar questions without changing their responses, where similar questions are defined as questions with overlapping attached skills. Our hypothesis is that the predicted correctness probabilities for questions in an augmented interaction sequence should not change much from those in the original interaction sequence. Formally, for each interaction in the original interaction sequence (I_1, ..., I_T), we randomly decide whether the interaction will be replaced or not, following the Bernoulli distribution with probability α_rep. If an interaction I_t = (Q_t, R_t), with a set of skills S_t associated with the question Q_t, is set to be replaced, we determine I^rep_t = (Q^rep_t, R_t) by selecting a question Q^rep_t whose skill set overlaps S_t. The resulting augmented sequence (I^rep_1, ..., I^rep_T) is generated by replacing I_t with I^rep_t for t ∈ R ⊂ [T] = {1, 2, ..., T}, where R is the set of indices to replace. Then we consider the following consistency regularization loss:

L_reg-rep = E_{t ∈ [T] - R} [(p_t - p^rep_t)^2]   (3)

where p_t and p^rep_t are the model's predicted correctness probabilities for the t-th question of the original and augmented sequences, respectively. We do not include the outputs for the replaced interactions in the loss computation. For the replacement strategy itself, we consider several variants. For instance, randomly selecting a question for Q^rep_t from the question pool is an alternative strategy if a skill set for each question is not available. It is also possible to consider only the outputs for replaced interactions, or the outputs for all interactions in the augmented sequence, in the loss computation. We investigate the effectiveness of each strategy in Section 3.
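A minimal sketch of skill-based replacement and the loss in (3), assuming questions and skill sets are plain Python objects (names such as `skill_map` are our own, not the paper's code):

```python
import random

def replace_augment(seq, skill_map, alpha=0.3, rng=None):
    """Skill-based replacement: each question is swapped, with probability
    alpha, for another question sharing at least one skill; responses are
    kept fixed.  Returns the augmented sequence and the replaced indices R.

    seq: list of (question_id, response); skill_map: question_id -> set of skills.
    """
    rng = rng or random.Random(0)
    aug, replaced = [], set()
    for t, (q, r) in enumerate(seq):
        cands = [q2 for q2, s in skill_map.items() if q2 != q and s & skill_map[q]]
        if cands and rng.random() < alpha:
            aug.append((rng.choice(cands), r))
            replaced.add(t)
        else:
            aug.append((q, r))
    return aug, replaced

def consistency_loss(p, p_rep, replaced):
    """Eq. (3): mean squared gap over positions that were NOT replaced."""
    kept = [t for t in range(len(p)) if t not in replaced]
    return sum((p[t] - p_rep[t]) ** 2 for t in kept) / len(kept)
```

Note that responses are never modified; only question ids change, which is exactly what keeps the consistency hypothesis plausible.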

2.2. INSERTION

When a student answers more questions correctly (resp. incorrectly), the predicted correctness probabilities of KT models for the remaining questions should increase (resp. decrease). Based on this intuition, we introduce a monotonicity constraint by inserting new interactions into the original interaction sequence. Formally, we generate an augmented interaction sequence (I^ins_1, ..., I^ins_T) by inserting correctly (resp. incorrectly) answered interactions I^ins_t = (Q^ins_t, 1) (resp. I^ins_t = (Q^ins_t, 0)) into the original interaction sequence (I_1, ..., I_T) for t ∈ I ⊂ [T], where each question Q^ins_t is randomly selected from the question pool and I, whose size is an α_ins proportion of the original sequence length, is the set of indices of inserted interactions. Our hypothesis is then formulated as p_t ≤ p^ins_σ(t) (resp. p_t ≥ p^ins_σ(t)), where p_t and p^ins_t are the model's predicted correctness probabilities for the t-th question of the original and augmented sequences, respectively. Here, σ is the order-preserving map from each original position to its position in the augmented sequence, which satisfies I_t = I^ins_σ(t) for 1 ≤ t ≤ T. (For instance, in Figure 2, σ sends {1, 2, 3, 4} to {2, 3, 4, 6}.) We impose our hypothesis through the following losses:

L_reg-cor-ins = E_{t ∈ [T]} [max(0, p_t - p^ins_σ(t))],   L_reg-incor-ins = E_{t ∈ [T]} [max(0, p^ins_σ(t) - p_t)]   (4)

where L_reg-cor-ins and L_reg-incor-ins are the losses for augmented interaction sequences with inserted correctly and incorrectly answered interactions, respectively.
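The index bookkeeping for σ can be sketched as follows (our own illustration; `insert_correct` inserts (q, 1) interactions before the given original positions and returns σ as a 0-indexed list):

```python
def insert_correct(seq, new_questions, positions):
    """Insert correctly answered interactions (q, 1) before the given
    0-indexed original positions.  Returns the augmented sequence and
    sigma, mapping each original index to its index in the augmented one."""
    aug, sigma, k = [], [], 0
    pos = sorted(positions)
    for t, inter in enumerate(seq):
        while pos and pos[0] == t:          # insert before original step t
            aug.append((new_questions[k], 1))
            pos.pop(0)
            k += 1
        sigma.append(len(aug))              # sigma(t): new index of I_t
        aug.append(inter)
    return aug, sigma

def monotonicity_loss_cor_ins(p, p_ins, sigma):
    """Eq. (4), correct insertion: penalize predictions that DROP after
    correct answers are inserted."""
    return sum(max(0.0, p[t] - p_ins[sigma[t]]) for t in range(len(p))) / len(p)
```

For the Figure 2 example (insertions before the 1st and 4th original interactions), σ maps 0-indexed positions [0, 1, 2, 3] to [1, 2, 3, 5], matching the 1-indexed {1, 2, 3, 4} to {2, 3, 4, 6} in the text.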

2.3. DELETION

Similar to the insertion augmentation strategy, we propose another monotonicity constraint by removing some interactions from the original interaction sequence, based on the following hypothesis: if a student's response records contain fewer correct (resp. incorrect) answers, the correctness probabilities for the remaining questions should decrease (resp. increase). Formally, from the original interaction sequence (I_1, ..., I_T), we generate an augmented sequence by deleting correctly (resp. incorrectly) answered interactions I_t for t ∈ D ⊂ [T], where D is the set of deleted indices, and let σ be the order-preserving map that sends each remaining original position t ∈ [T] - D to its position in the augmented sequence, i.e., I_t = I^del_σ(t) for t ∈ [T] - D. We impose the hypothesis through the following losses:

L_reg-cor-del = E_{t ∈ [T] - D} [max(0, p^del_σ(t) - p_t)],   L_reg-incor-del = E_{t ∈ [T] - D} [max(0, p_t - p^del_σ(t))]   (5)

where L_reg-cor-del and L_reg-incor-del are the losses for augmented interaction sequences with deleted correctly and incorrectly answered interactions, respectively.

3. EXPERIMENTS

We demonstrate the effectiveness of the proposed method on 4 widely used benchmark datasets: ASSISTments2015, ASSISTmentsChall, STATICS2011, and EdNet-KT1. The ASSISTments datasets are the most widely used benchmarks for Knowledge Tracing, provided by the ASSISTments online tutoring platformfoot_1 (Feng et al., 2009). There are several versions of the dataset depending on the year of collection; we use ASSISTments2015foot_2 and ASSISTmentsChallfoot_3. The ASSISTmentsChall dataset is provided by the 2017 ASSISTments data mining competition. STATICS2011 consists of interaction logs from an engineering statics course and is available on the PSLC datashopfoot_4. EdNet-KT1 is the largest publicly available interaction dataset, consisting of TOEIC (Test of English for International Communication) problem-solving logs collected by Santafoot_5 (Choi et al., 2020b). We reduce the size of the EdNet-KT1 dataset by sampling 6,000 users among 600K users. Detailed statistics and pre-processing methods for these datasets are described in the Appendix. With the exception of the EdNet-KT1 dataset, we use 80% of the students as a training set and the remaining 20% as a test set. We test the DKT (Piech et al., 2015), DKVMN (Zhang et al., 2017b), and SAINT (Choi et al., 2020a) models.

Figure 4: Response correctness prediction for a student in the ASSISTmentsChall dataset. We randomly insert interactions with correct responses (interactions with yellow boundaries). For the vanilla DKT model, the predictions for the original interactions (especially the interactions with green boundaries) decrease, even though the student answered more questions correctly. This problem is resolved when we train the model with monotonicity regularization (with the loss L_tot = L_ori + L_cor-ins + 100 · L_reg-cor-ins): unlike the vanilla DKT model, the predicted correctness probabilities for the original interactions increase after insertion.
For DKT, we set the embedding dimension and the hidden dimension to 256. For DKVMN, the key, value, and summary dimensions are all set to 256, and we set the number of latent concepts to 64. SAINT has 2 layers with hidden dimension 256, 8 heads, and feed-forward dimension 1024. The models use no additional interaction features except question ids and responses as input, and the model weights are initialized with Xavier initialization (Glorot & Bengio, 2010). They are trained from scratch with batch size 64, using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 scheduled by the Noam scheme with 4000 warm-up steps, as Vaswani et al. (2017) suggest. We set each model's maximum sequence length to 100 on the ASSISTments2015 and EdNet-KT1 datasets and 200 on the ASSISTmentsChall and STATICS2011 datasets. The hyperparameters for augmentations are searched over α_aug ∈ {0.1, 0.3, 0.5}, λ_reg-aug ∈ {1, 10, 50, 100}, and λ_aug ∈ {0, 1}. For all datasets, we evaluate our results using 5-fold cross-validation and use Area Under Curve (AUC) as the evaluation metric.

3.1. MAIN RESULTS

The results (AUCs) are shown in Table 1, which compares models without and with augmentations; we report the best results for each strategy. (The detailed hyperparameters for these results are given in the Appendix.) The 4th column represents results using both insertion and deletion, and the last column shows the results with all 3 augmentations. Since there is no big difference in performance gain between insertion and deletion, we only report the performance using one or both of them together. We use skill-based replacement if skill information for each question in the dataset is available, and question-random replacement, which selects new questions among all questions, if not (e.g., ASSISTments2015). As one can see, the models trained with consistency and monotonicity regularization outperform the models without augmentations by a large margin, regardless of model architecture or dataset. Using all three augmentations gives the best performance in most cases; for instance, there is a 6.3% gain in AUC on the ASSISTmentsChall dataset under the DKT model. Furthermore, beyond enhancing prediction performance, our training scheme also resolves the vanilla model's issue where the monotonicity condition on the predictions for the original and augmented sequences is violated. As shown in Figure 4, the predictions of the model trained with monotonicity regularization (correct insertion) increase after insertion, in contrast to the vanilla DKT model's outputs.

Table 3: Ablation test on the directions of monotonicity regularization with the DKT model. The 2nd to 5th rows show the results with the original regularization losses, and the last 4 rows show the results with the reversed regularization losses.
Since overfitting is expected to be more severe on smaller datasets, we conduct experiments using various fractions of the existing training datasets (5%, 10%, 25%, 50%) and show that our augmentations yield more significant improvements for smaller training sets. Figure 3 shows the performance of the DKT model on various datasets, with and without augmentations. For example, on the ASSISTmentsChall dataset, using 100% of the training data gives 74.4% AUC, while the same model trained with augmentations achieves 75.44% AUC with only 25% of the training data.

3.2. ABLATION STUDY

Are constraint losses necessary? One might think that data augmentations alone are enough to boost performance, and that imposing consistency and monotonicity is unnecessary. However, we found that including the regularization losses during training is essential for further performance gain. To see this, we compare the performance of the model trained only with KT losses for both original and augmented sequences, L_tot = L_ori + Σ_{aug ∈ A} λ_aug L_aug (6) (where λ_aug = 1), against training with the consistency and monotonicity regularization losses (2), where A is a set containing a single augmentation. Training a model with the loss (6) can be thought of as using augmentations without imposing any consistency or monotonicity bias. Table 2 shows the results under the DKT model. Using only data augmentation (training with the loss (6)) gives a marginal gain in performance, or even worse performance. However, training with both data augmentation and the consistency or monotonicity regularization losses (2) gives a significantly higher performance gain. On the ASSISTmentsChall dataset, using replacement along with consistency regularization improves AUC by 6%, which is much higher than the 1% improvement from data augmentation alone.

Ablation on monotonicity constraints.

We perform an ablation study to compare the effects of monotonicity regularization and reversed monotonicity regularization. Monotonicity regularization introduces a constraint loss that aligns the model's predictions on the original and augmented sequences with the direction of the insertion or deletion: for example, when a correct response is inserted into the sequence, the predicted correctness probabilities for the original sequence should increase. Reversed monotonicity regularization modifies the predicted correctness probabilities in the opposite manner, so that inserting a correct response would decrease them. For each aug ∈ {cor ins, incor ins, cor del, incor del}, we define a reversed version of the monotonicity regularization loss, L^rev_reg-aug, which imposes the opposite constraint on the model's output, e.g., we define L^rev_reg-cor-ins as

L^rev_reg-cor-ins = E_{t ∈ [T]} [max(0, p^ins_σ(t) - p_t)] = L_reg-incor-ins   (7)

which forces the model's predicted correctness probabilities to decrease when correct responses are inserted. In these experiments, we do not include the KT loss for augmented sequences (we set λ_aug = 0) to observe the effect of the regularization loss alone. Also, the same hyperparameters (α_aug and λ_reg-aug) are used for both the original and reversed constraints. Table 3 shows the performance of the DKT model with the original and reversed monotonicity regularizations. The second row represents the performance with no augmentations, the 3rd to 6th rows represent the results with the original (aligned) insertion/deletion monotonicity regularization losses, and the last four rows represent the results with the reversed monotonicity regularization losses. The results demonstrate that models trained with the aligned monotonicity regularization losses outperform those with the reversed losses.
Moreover, reversed monotonicity regularization shows a large decrease in performance on several datasets, even compared to the model with no augmentation. For example, on the EdNet-KT1 dataset, correct insertion with the original (aligned) regularization improves AUC from 72.75% to 73.70%, while the reversed regularization drops it to 69.67%.

Ablation on replacement. We compare our consistency regularization with two other variations of replacement, consistency regularization on replaced interactions only and on all interactions, and with qDKT (Sonkar et al., 2020). As mentioned in Section 2.1, there are two more possible variations of the consistency loss for replacement, depending on whether we include the replaced interactions' outputs in the loss:

L_reg-rep-ro = E_{t ∈ R} [(p_t - p^rep_t)^2],   L_reg-rep-full = E_{t ∈ [T]} [(p_t - p^rep_t)^2]   (8)

where 'ro' stands for replaced-only. We compare these variations with the original consistency loss L_reg-rep, which does not include predictions for the replaced interactions. For all variations, we use the same replacement probability α_rep and loss weight λ_reg-rep, and we do not include the KT loss for replaced sequences (we set λ_rep = 0), as before. We also compare replacement with qDKT, which uses the following Laplacian loss that regularizes the variance of predicted correctness probabilities for questions falling under the same skill:

L_Laplacian = E_{(q_i, q_j) ∈ Q×Q} [1(i, j) (p_i - p_j)^2]

where Q is the set of all questions, p_i and p_j are the model's predicted correctness probabilities for the questions q_i and q_j, and 1(i, j) is 1 if q_i and q_j have common attached skills and 0 otherwise. It is similar to our replaced-only variation of consistency regularization (L_reg-rep-ro in (8)) in that it only compares the outputs of related questions, but it does not replace questions, and it compares all questions (with the same skills) at once.
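A sketch of the Laplacian regularizer as we read the equation above (the i = j terms contribute zero, so we skip them; the function and argument names are our own):

```python
def laplacian_loss(probs, skills):
    """qDKT-style Laplacian regularizer: mean squared gap of predicted
    correctness probabilities over ordered question pairs sharing at least
    one skill.

    probs: predicted probability per question; skills: set of skills per question.
    """
    n = len(probs)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and skills[i] & skills[j]]
    if not pairs:
        return 0.0
    return sum((probs[i] - probs[j]) ** 2 for i, j in pairs) / len(pairs)
```

Unlike our replacement strategy, this loss never perturbs the input sequence; it directly ties together the outputs of all same-skill questions.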
Since the hyperparameter λ scaling the Laplacian regularization term is not provided in the qDKT paper, we use the same set of hyperparameters as for our other losses and report the best results among them. Table 4 shows that including the replaced interactions' outputs hurts performance. For example, on the EdNet-KT1 dataset, all the variations of consistency regularization and the Laplacian regularization significantly drop AUC to under 70%, while the original consistency regularization boosts performance from 72.75% to 73.87%. To see the effect of using question skill information for replacement, we compare skill-based replacement with three different random versions of replacement: question-random replacement, interaction-random replacement, and skill-set-based replacement. Question-random replacement replaces questions with randomly chosen different ones (without considering skill information), while interaction-random replacement changes both questions and responses (sampling each response with probability 0.5). Skill-set-based replacement is almost the same as the original skill-based replacement, but the candidate questions are restricted to those with exactly the same set of skills (S_t = S^rep_t), not merely overlapping skills. The results in Table 5 show that the performance of question-random replacement depends on the nature of the dataset: it performs similarly to skill-based replacement on the ASSISTmentsChall and EdNet-KT1 datasets, but gives only a minor gain or even drops performance on the other datasets. Applying interaction-random replacement, however, significantly hurts performance across all datasets, e.g., AUC decreases from 86.43% to 84.50% on the STATICS2011 dataset. This demonstrates the importance of fixing the responses of interactions for consistency regularization. Lastly, skill-set-based replacement works similarly to or even worse than the original skill-based replacement.
Note that each question in the STATICS2011 dataset has a single skill attached, so the performances of skill-based and skill-set-based replacement coincide on this dataset.

Table 8: Detailed hyperparameters for the main results (Table 1). Each entry represents a tuple of augmentation probability and constraint loss weight (α_aug, λ_reg-aug) giving the best performance among α_aug ∈ {0.1, 0.3, 0.5} and λ_reg-aug ∈ {1, 10, 50, 100}. We use λ_aug = 1 for all experiments with augmentations, except for the DKT model on the STATICS2011 dataset with the incorrect insertion augmentation (λ_incor-ins = 0).

To see the effect of augmentation probabilities and regularization loss weights, we perform a grid search over α_aug ∈ {0.1, 0.3, 0.5} and λ_reg-aug ∈ {1, 10, 50, 100} with the DKT model; the AUC results are shown as heatmaps in Figure 5.



foot_0 For replacement and insertion, we do not include outputs for augmented interactions in L_aug.
foot_1 https://new.assistments.org/
foot_2 https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data
foot_3 https://sites.google.com/view/assistmentsdatamining
foot_4 https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507
foot_5 https://aitutorsanta.com/



Figure 3: Performances with various sizes of training data under the DKT model. x axis stands for the portion of the training set we use for training (relative to the full train set) and y axis is the AUC. Blue line represents the AUCs of the vanilla DKT model, and red line represents the AUCs of the DKT model trained with augmentations and regularizations.



Table 1: Performances (AUCs) of DKT, DKVMN, and SAINT models on 4 public benchmark datasets. The results show the mean and standard deviation over 5 runs, and the best result for each dataset and model is indicated in bold.

Table 2: Comparison of the performances (AUCs) of the DKT model trained with only data augmentation (i.e., using the loss (6)) and with consistency and monotonicity regularization (i.e., using the loss (2)). AUCs of the vanilla DKT model are given in parentheses below the dataset names.

Table 4: Performances (AUCs) of the DKT model with variations of replacement and of qDKT with Laplacian regularization. The best result for each dataset is indicated in bold.

Table 5: Performances (AUCs) of the DKT model with different types of replacement: question-random replacement, interaction-random replacement, skill-set-based replacement, and skill-based replacement. The best result for each dataset is indicated in bold.

Table 7: Comparison of the average consistency loss for correctly and incorrectly predicted responses of the DKT model.

A.4 HYPERPARAMETERS

Table 8 describes the detailed hyperparameters for each augmentation and model used for the main results (Table 1).

4. CONCLUSION

We propose simple augmentation strategies with corresponding constraint regularization losses for KT and show their efficacy. We only considered the most basic features of interactions, question id and response correctness; other features, such as elapsed time or question texts, would enable diverse augmentation strategies if available. Furthermore, exploring the applicability of our idea to other AIEd tasks (e.g., dropout prediction or at-risk student prediction) is another interesting future direction.

A APPENDIX

A.1 DATASET STATISTICS AND PRE-PROCESSING

Detailed dataset statistics are given in Table 6.

• ASSISTments: For the ASSISTments2015 dataset, we filtered out the logs with CORRECTS not in {0, 1}. Note that the ASSISTments2015 dataset only provides questions and no corresponding skills.
• STATICS2011: A concatenation of a problem name and step name is used as the question id, and the values in the column KC (F2011) are regarded as the skills attached to each question.
• EdNet-KT1: Among 600K students, we filtered out those whose interaction length is in

A.2 MONOTONICITY OF THE DATASETS

We perform data analysis to explore the monotonicity nature of the datasets, i.e., the property that students are more likely to answer correctly if they did so more often in the past. For each interaction of each student, we examine the distribution of past interactions' correctness rates. Formally, for a given interaction sequence (I_1, ..., I_T) with I_t = (Q_t, R_t) and each 2 ≤ t ≤ T, we compute the past correctness rate

correctness rate_{<t} = (1 / (t - 1)) Σ_{τ=1}^{t-1} 1_{R_τ=1}

where 1_{R_τ=1} is an indicator function which is 1 (resp. 0) when R_τ = 1 (resp. R_τ = 0). We compare the distributions of correctness rate_{<t} over all interactions with R_t = 1 and with R_t = 0 separately; the results are shown in Figure 1. The distributions of previous correctness rates for interactions with correct responses lean more to the right than those for interactions with incorrect responses. This shows the positive correlation between the previous correctness rate and the current response correctness, and it also explains why monotonicity regularization improves the prediction performance of knowledge tracing models.
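The past-correctness-rate statistic above can be computed as follows (our own illustration; each element of `seq` is a (question_id, response) pair):

```python
def past_correctness_rates(seq):
    """For each step t >= 2 (1-indexed), pair the correctness rate of the
    interactions before t with the current response R_t.  These pairs,
    split by R_t, give the two distributions compared in Figure 1."""
    out, correct = [], 0
    for t, (_, r) in enumerate(seq):
        if t >= 1:                      # skip the first interaction (no history)
            out.append((correct / t, r))
        correct += r
    return out
```

Aggregating the first components of these pairs separately for r = 1 and r = 0 over all students reproduces the orange and blue histograms of Figure 1.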

A.3 MODEL'S PREDICTIONS AND CONSISTENCY REGULARIZATION LOSSES

Instead of analyzing the consistency nature of the datasets directly, we compare the test consistency loss for correctly and incorrectly predicted responses separately, with the DKT model on the ASSISTmentsChall, STATICS2011, and EdNet-KT1 datasets. Table 7 shows the average consistency loss for correctly and incorrectly predicted responses, for the vanilla DKT model and the model trained with consistency regularization losses. When computing the test consistency loss, we replace each (previous) interaction's question with another question with overlapping skills, with probability α_rep = 0.3. For all models, the average loss for correctly predicted responses is lower than for incorrectly predicted responses, which suggests that a smaller consistency loss is associated with more accurate predictions.

Table 8 entries, (α_aug, λ_reg-aug) per augmentation, by dataset and model:
ASSISTChall: DKT (0.5, 100) (0, 0) (0, 0) (0, 0) (0.5, 1) (0, 0) (0, 0) (0, 0) (0.3, 100); DKVMN (0.5, 1) (0, 0) (0, 0) (0, 0) (0.5, 1) (0, 0) (0, 0) (0, 0) (0.5, 100); SAINT (0, 0) (0, 0) (0.3, 1) (0, 0) (0, 0) (0.3, 1) (0.3, 1) (0, 0) (0.3, 100)
STATICS2011: DKT (0, 0) (0.5, 10) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0, 0) (0.3, 100); DKVMN (0, 0) (0, 0) (0.3, 10) (0, 0) (0, 0) (0, 0) (0.3, 1) (0, 0) (0.3, 10); SAINT (0, 0) (0.5, 1) (0, 0) (0.5, 1) (0, 0) (0.5, 1) (0, 0) (0.5, 1) (0.3, 100)
EdNet-KT1: DKT (0, 0) (0, 0) (0.3, 50) (0, 0) (0, 0) (0.3, 1) (0.3, 1) (0, 0) (0.1, 100); DKVMN (0, 0) (0.5, 1) (0, 0) (0, 0) (0, 0) (0.5, 1) (0, 0) (0, 0) (0.1, 1); SAINT (0, 0) (0.3, 50) (0, 0) (0, 0) (0, 0) (0.3, 50) (0, 0) (0, 0) (0.5, 1)

