UNCOVERING THE EFFECTIVENESS OF CALIBRATION ON OPEN INTENT CLASSIFICATION

Anonymous authors
Paper under double-blind review

Abstract

Open intent classification aims to identify known and unknown intents simultaneously, and it is one of the challenging tasks in modern dialogue systems. While prior approaches build on known intent classifiers trained with the cross-entropy loss, we presume this loss function yields a representation overly biased toward the known intents, which hurts the identification of unknown intents. In this study, we propose a novel open intent classification approach that applies model calibration to previously-proposed state-of-the-art methods. We empirically show that simply changing the learning objective to a more calibrated one outperforms the past state of the art. We further find that the underlying reason behind the calibrated classifier's superiority lies in the high-level layers of the deep neural network. We also discover that our approach is robust in harsh settings where only a few training samples per class exist. Consequently, we expect our findings and takeaways to serve as practical guidelines for open intent classification, thus helping to inform future model design choices.

1. INTRODUCTION

Background and Motivation Beyond the success of intent classification under the supervised regime, one of the next challenges for modern dialogue systems is open intent classification (Scheirer et al., 2013). While the sets of intents in the training and test data are identical under the supervised setting (known as the closed-set setting), an intent classifier in the real world must recognize unknown intents as well as known intents (Zhang et al., 2021). For example, supposing the training set includes N intents, open intent classification solves an N + 1 classification problem where the (N + 1)-th intent is the set of unknown ones (Shu et al., 2017; Lin & Xu, 2019; Zhang et al., 2021). The task is related to open world recognition (Bendale & Boult, 2016; Vaze et al., 2021) and out-of-distribution detection (Hendrycks & Gimpel, 2016; Liang et al., 2017), which are actively studied in the image domain, but it is specifically called open intent classification in natural language processing. Examining previously-proposed open intent classification methods, we find that most of them conventionally train the closed-set classifier with a cross-entropy loss (Bendale & Boult, 2016; Hendrycks & Gimpel, 2016; Prakhya et al., 2017; Shu et al., 2017; Lin & Xu, 2019; Zhang et al., 2021). However, we doubt whether the cross-entropy loss is the best learning objective for identifying open intents. A previous open intent classification study highlighted that an adequate tightness of decision boundaries among known intents is important for detecting unknown intents (Zhang et al., 2021). In other words, the inductive bias established on known intents should be neither overly biased nor too loosely optimized.
Not only in open intent classification: a recently-proposed state-of-the-art open world classification study in computer vision also supports this proposition, showing that acquiring adequate representation power correlates with effective open world classification performance (Vaze et al., 2021). However, as several works have pointed out, the cross-entropy loss is known to convey an inductive bias that is excessively biased to the given labels, because it forces the model to select one single label from the given label space (Recht et al., 2019; Zhang et al., 2016). To this end, we assume the use of cross-entropy loss leaves room for improvement and aim to provide a better-performing open intent classifier.

Main Idea and Its Novelty Our work's key proposition is to apply model calibration while training on known intents. Model calibration adjusts a model's predicted probabilities so that they reflect the true probabilities of the corresponding outcomes (Nixon et al., 2019). According to the calibration literature, calibrated deep neural networks achieve robustness against various noises and perturbations (Müller et al., 2019; Pereyra et al., 2017). Inspired by this finding, we presume that applying calibration to the cross-entropy loss will improve the quality of the inductive bias and raise open intent classification performance. Accordingly, we select state-of-the-art open-world classification methods from the text and image domains and simply apply calibration to their training procedures. Throughout this work, we first show whether calibration improves the inductive bias compared to the cross-entropy loss. We then examine whether this simple idea can outperform previous open intent classifiers in various problem settings and how calibration changes the representation landscape of the trained model.
Although our idea seems to be simple, we highlight that the proposed open intent classifiers are novel because, to the best of our knowledge, our approach is the first attempt to utilize calibration to improve open intent classification performance in the text domain.

Key Contributions

• As a preliminary analysis, we show that model calibration reduces the bias of the conventional known intent classifier and enlarges the distribution discrepancy between known and unknown intents. We analyze that this large discrepancy contributes to better open intent classification performance.
• We further scrutinize that the superiority of C-LC and C-ADB derives from the representations at higher layers of the deep neural networks. We interpret that the proposed methods acquire a better contextual understanding than the previously-proposed methods.
• Lastly, we examine our approaches' stability in extreme training-set settings. We discover that C-ADB is less stable than C-LC given few training samples per known intent; thus, C-ADB should be used with careful consideration.

As the aforementioned methods (Zhang et al., 2021) commonly employ cross-entropy loss to train the known intent classifier, we hypothesize it is not advantageous for establishing good decision boundaries. Under this motivation, we aim to scrutinize the optimal tightness of decision boundaries by applying model calibration to the known intent classifier.

2. RELATED WORKS

Model Calibration Calibration reflects the ground-truth correctness likelihood in a predicted class label. Well-calibrated confidence provides suitable information on why a neural network's prediction is made. Guo et al. (2017) proposed temperature scaling to calibrate modern neural networks that suffer from over-confidence. Lee et al. (2017) suggested two additional terms on the original objective function for detecting out-of-distribution samples. Calibration has been widely examined in computer vision, while its study in natural language processing has only recently begun. Kumar & Sarawagi (2019) showed that neural machine translation models are miscalibrated. Desai & Durrett (2020) applied temperature scaling to BERT and RoBERTa and analyzed their calibration over three tasks; in addition, they showed that further increasing empirical uncertainty helps out-of-domain classification. Moreover, Müller et al. (2019) examined how label smoothing behaves during training and showed that it improves model calibration. Among the various model calibration methods, we employ label smoothing in the proposed methods for the following reasons: 1) it does not require any validation set, and 2) it directly influences the representation power of deep neural networks, while temperature scaling does not.

3. OUR APPROACH

To establish our approach, the LC (Vaze et al., 2021) method is employed from the image domain and the ADB (Zhang et al., 2021) method from the text domain. Note that both approaches utilize cross-entropy loss during model training. To calibrate these models, we apply label smoothing (Szegedy et al., 2016) with calibration strength α and denote the resulting methods as Calibrated Logit-based Classifier (C-LC) and Calibrated ADB (C-ADB), respectively. We highlight that temperature scaling (Guo et al., 2017), another promising calibration method, is not considered in our work because it only changes absolute logit values without any change in the inductive bias. We formalize the cross-entropy loss with label smoothing in Equation 1:

L_LS(p, y) = - Σ_{k=1}^{K} y_k^{LS} log(p_k), where p_k = exp(x^T w_k) / Σ_{i=1}^{K} exp(x^T w_i), y_k^{LS} = y_k (1 - α) + α/K  (1)

Note that K denotes the number of known intents, α stands for the calibration strength, and p_k and y_k^{LS} denote the predicted probability and the label-smoothed ground truth, respectively.

Calibrated Logit-based Classifier (C-LC) C-LC is an open intent classifier that shares the same motivation as the LC, a recently-proposed open-world classification method (Vaze et al., 2021). Its novelty lies in the calculation of confidence: simply using the logit vector before the softmax layer can surprisingly increase open intent classification performance, whereas most prior works used the vector after the softmax layer. During training, LC trains a closed-set classifier with cross-entropy loss. Given a test sample, LC recognizes it as an unknown intent if the sample's confidence yielded by the trained model is smaller than a preset threshold, where confidence is measured as the maximum value of the logit vector extracted from the layer right before the softmax activation. Lastly, the threshold is set to the mean confidence over the training samples. On top of this LC procedure, we establish C-LC by changing the learning objective from the plain cross-entropy loss to the calibrated one in Equation 1.

Calibrated ADB (C-ADB) C-ADB is another open intent classifier proposed in our study. The original ADB identifies unknown intents when a given sample lies far from the known clusters' centroids in the latent space. To empower the model to find an adequate tightness of decision boundaries, ADB trains the model with the cross-entropy loss plus a boundary loss, where the boundary loss learns to predict the radius of each known intent. Given the trained model ϕ(x; θ), ADB estimates each known intent's centroid (C_1, ..., C_K) and radius (R_1, ..., R_K) in the latent space; we refer to this open intent classification procedure as post-processing. While LC simply predicts a given sample as an open intent when its confidence is less than a preset threshold, we highlight that ADB utilizes a post-processor that relies on the estimated distances in the latent space. On top of this ADB procedure, we calibrate the model by applying label smoothing to the cross-entropy loss, as shown in Equation 2:

L_C-ADB = L_LS + L_Boundary  (2)

We presume the calibrated training procedure and the post-processing will create a synergy in identifying unknown intents.
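To make Equation 1 concrete, below is a minimal NumPy sketch of the label-smoothed cross-entropy loss. The function names are ours (illustrative, not from the paper's released code); setting α = 0 recovers the plain cross-entropy objective.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def label_smoothed_ce(logits, y, alpha):
    """Cross-entropy with label smoothing, as in Equation 1.

    logits: (K,) raw scores x^T w_k for K known intents;
    y: integer index of the true intent; alpha: calibration strength in [0, 1].
    """
    K = logits.shape[-1]
    p = softmax(logits)
    # Smoothed target: y_k * (1 - alpha) + alpha / K for each class k.
    y_ls = np.full(K, alpha / K)
    y_ls[y] += 1.0 - alpha
    return float(-(y_ls * np.log(p)).sum())
```

With a peaked prediction, the smoothed loss is larger than the plain cross-entropy, since mass is forced onto the non-target classes; this is the mechanism that discourages the over-confident representations discussed above.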

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

Dataset and Problem Setting We utilize three public datasets (STACKOVERFLOW (Xu et al., 2015), BANKING (Casanueva et al., 2020), OOS (Larson et al., 2019)), which are widely used in past open intent classification studies (Lin & Xu, 2019; Zhang et al., 2021). A brief summary of these datasets is given in Table 1. On these datasets, we postulate three problem settings with different Known Label Ratios (KLR), where the KLR is the ratio of known labels to the total number of labels. Under a KLR of 25%, we use 25% of the total intents as known ones while the other 75% are set as unknown; thus, the open intent classifier can only learn 25% of the total intents during training and must identify both known and unknown intents (75% of the total) during testing. Our study uses KLRs of 25%, 50%, and 75%. We report accuracy and F1-score on the test set as evaluation metrics. Given N known intents, the open intent classifier solves an N + 1 classification problem where the added label denotes the unknown intents. We report the average performance over five runs for each known class ratio.

Implementation Details We optimize the model with AdamW (Loshchilov & Hutter, 2017), set the learning rate to 2e-5, and schedule it with cosine scheduling. Please refer to the supplementary materials for code and further details.

Baselines We employ five baseline open intent classification methods to examine our approaches' effectiveness. Brief elaborations on the baselines are as follows. MSP: Hendrycks & Gimpel (2016) proposed that out-of-distribution examples can be separated based on the maximum softmax probability of the predicted class.

Setup We first and foremost scrutinize whether calibration contributes to discriminating unknown intents from the known labels.
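The KLR protocol above can be sketched as follows. This is an illustrative reconstruction (function names are ours, not the paper's released code): known intents are sampled at the given ratio, the training set keeps only known-intent samples, and every other test label is remapped to the single unknown (N + 1-th) intent.

```python
import random

def make_open_split(samples, known_label_ratio, seed=0):
    """Split intent labels into known vs. unknown under a given KLR.

    samples: list of (text, label) pairs. Labels outside the sampled known
    set are remapped to the single 'unknown' intent at test time.
    """
    labels = sorted({lab for _, lab in samples})
    rng = random.Random(seed)
    n_known = max(1, round(len(labels) * known_label_ratio))
    known = set(rng.sample(labels, n_known))
    # Training only sees known intents; the test set keeps every sample.
    train = [(t, lab) for t, lab in samples if lab in known]
    test = [(t, lab if lab in known else "unknown") for t, lab in samples]
    return known, train, test
```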
As we presume that the cross-entropy loss yields a large bias toward the known intents, we analyze whether it indeed exhibits an insufficiently discriminative decision boundary between known and unknown intents. We also aim to validate whether calibration establishes a more distinct representation between known and unknown intents. To answer these questions, we trained closed-set intent classifiers with various calibration options: no calibration (label smoothing strength α of 0) and label smoothing with strengths of 0.2, 0.5, 0.8, and 0.9, where a larger strength implies stronger calibration. Given the trained classifiers, we extract confidence scores (taken at the layer before the softmax activation) from known and unknown test samples and analyze their distributions. We regard a more distinct separation between the known and unknown distributions as evidence of better representation quality, because it implies the model posits more appropriate discriminative decision boundaries. We perform the analyses on the three datasets under the 75% KLR and visualize the confidence distributions in Figure 1; please refer to the supplementary materials for results under the other KLR options.

Result Following the results in Figure 1, we observe that the cross-entropy loss exhibits a considerable overlap between the confidence distributions of known and unknown intents. Regardless of the dataset, the closed-set classifier shows this overlapping region; thus, we conclude that the cross-entropy loss leaves room for more effective open intent classification. We further find that calibration separates these two distributions by establishing a representation less biased toward the known intents. Interestingly, stronger calibration exhibits a larger discrepancy between known and unknown labels at every KLR.
While we cannot confidently say that stronger label smoothing always creates a better representation for discriminating unknown intents, we conclude that calibration is beneficial for establishing a discrepancy between known and unknown intents regardless of the KLR level.
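The confidence extraction used in this analysis can be sketched as follows, with the mean-confidence gap serving as a crude stand-in for the distributional discrepancy visualized in Figure 1 (both helper names are ours and purely illustrative):

```python
import numpy as np

def max_logit_confidence(logits):
    # LC-style confidence: the maximum entry of the pre-softmax logit vector.
    return np.asarray(logits).max(axis=-1)

def distribution_gap(known_logits, unknown_logits):
    """Crude discrepancy proxy: difference of mean confidences.

    A larger gap suggests more separable known/unknown confidence
    distributions; the paper instead inspects full density plots.
    """
    return float(max_logit_confidence(known_logits).mean()
                 - max_logit_confidence(unknown_logits).mean())
```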

4.3. WHAT IS OPTIMAL CALIBRATION STRENGTH?

Setup Having confirmed calibration as a plausible remedy for the cross-entropy loss's bias, we next investigate the optimal calibration strength for C-LC and C-ADB. As C-ADB has an additional post-processing procedure, we assume its adequate calibration strength will differ from C-LC's optimal one. To examine this question, we train C-LC and C-ADB with various calibration strengths and check the open intent classification performance, regarding the strength that achieves the best performance as optimal. The experiment results are shown in Table 2.

4.4. COMPARISON WITH BASELINES

Setup Having established that C-LC and C-ADB require different calibration strengths, we analyze whether the proposed methods outperform previously-proposed approaches. We implement the baseline approaches described in Section 4.1 and compare performances at various KLR levels. For the calibration strength of C-LC and C-ADB, we use the values that achieve the best accuracy in Table 2 for each KLR and dataset. The results are shown in Table 3.

Calibration is advantageous in low KLR

We discover that the proposed methods accomplish promising open intent classification performance compared to the baselines. In particular, the proposed C-ADB outperforms the baselines at KLRs of 25% and 50% and remains competitive at 75% KLR. C-LC achieves promising performance but fails to outperform the prior methods.

Regarding C-ADB's merely competitive performance at 75% KLR, we hypothesize that a large KLR makes it more difficult to discover an optimal calibration strength. A larger number of intents implies a larger number of known intent clusters; thus, the model cannot easily establish effective decision boundaries between the known clusters. As Zhang et al. (2021) urged, especially under highly complex decision boundaries, calibration must be applied carefully to maximize open intent classification performance. In other words, when many known intents exist, careless calibration carries a higher risk of demolishing the learned representation. However, we simply applied a heuristically-chosen calibration strength (one of 0, 0.2, 0.5, 0.8, 0.9); we expect this coarse choice of strength explains C-ADB's merely competitive performance. To examine this hypothesis, we visualized the representations yielded by the models at different KLRs, extracting them at the last layer of BERT and projecting them with t-SNE (Van der Maaten & Hinton, 2008), a conventional dimensionality reduction method; the results are shown in Figures 2 and 3. They show that a model under a large KLR exposes less well-separated representation clusters of known intents, implying the representation is insufficient to clearly discriminate the many known intents, whereas under a low KLR the known intents are clearly discriminated.
Consequently, we find that calibration is effective at raising open intent classification performance at low KLRs, but a more principled manner of selecting the calibration strength is needed. One possible solution is a learnable calibration strength for C-LC and C-ADB, which we leave as an avenue for improvement.

Validation set can improve C-LC We found that the proposed C-LC could not perform open intent classification better than the original ADB and C-ADB. We analyze that the ADB-based methods' effectiveness stems from the post-processing procedure, as it establishes tighter decision boundaries between known intents. Conversely, C-LC lacks this post-processor and therefore cannot discover an adequate tightness of decision boundaries among known intents, which is why it cannot match the ADB-based methods. To mitigate this limitation, we investigate how C-LC's performance can be elevated. We presume that improving the threshold for identifying unknown intents would raise performance; thus, we utilize a validation set for choosing this threshold and denote the resulting method C-LC-Val. The proposed C-LC-Val uses a validation set consisting of samples from both known and unknown intents. While naive C-LC sets the threshold to the mean confidence over the training samples, C-LC-Val chooses the decision threshold by Youden's J statistic (Youden, 1950), which is commonly used to set the optimal threshold on a ROC curve, as in prior works (Schisterman et al., 2005; Powers, 2011). Note that the calibration strength α for C-LC-Val is selected as the one that achieves the best accuracy in Table 2. We compare C-LC-Val's performance with the other options in Table 4.
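The C-LC-Val threshold selection can be sketched from scratch as below: sweep candidate thresholds over the validation confidences and keep the one maximizing J = TPR − FPR. Treating known-intent samples as the positive class is our assumption for illustration; the function name is ours.

```python
import numpy as np

def youden_threshold(scores, is_known):
    """Pick a confidence threshold by maximizing Youden's J = TPR - FPR.

    scores: confidence per validation sample; is_known: booleans where
    True marks known-intent samples (the 'positive' class here).
    """
    scores = np.asarray(scores, dtype=float)
    is_known = np.asarray(is_known, dtype=bool)
    best_t, best_j = scores.min(), -1.0
    for t in np.unique(scores):
        pred_known = scores >= t          # above threshold -> known intent
        tpr = (pred_known & is_known).sum() / max(is_known.sum(), 1)
        fpr = (pred_known & ~is_known).sum() / max((~is_known).sum(), 1)
        j = tpr - fpr
        if j > best_j:
            best_j, best_t = j, t
    return float(best_t)
```

On well-separated validation confidences, the selected threshold lands at the lowest known-intent score, classifying every validation sample correctly.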
From Table 4, we find that C-LC-Val improves on naive C-LC in several cases; thus, we conclude that utilizing a validation set is one possible way to raise C-LC's performance. We highlight that this validation-set assumption is a restrictive setting, as real-world practitioners cannot always acquire unknown intent samples a priori. Still, we show that practitioners can use C-LC-Val if they can obtain a certain amount of unknown intent samples.

4.5. STABILITY UNDER A FEW TRAINING SAMPLES

Setup We further investigate whether the proposed methods robustly sustain their performance with few labeled training samples. We define the portion of labeled training samples as the Training Data Ratio (TDR) and configure experiments with TDRs of 0.2, 0.4, 0.6, 0.8, and 1.0; under a TDR of 0.2, only 20% of the labeled training samples per intent are available. The results are shown in Table 5 and Table 6, where α denotes the calibration strength and a bold number marks the best performance at a given TDR setting.

Table 5: The proposed methods' performances measured in accuracy under various TDR levels, reported for C-LC and C-ADB (α ∈ {0.2, 0.5, 0.8, 0.9}) on STACKOVERFLOW, BANKING, and OOS at each KLR and TDR.

Therefore, we urge that applying calibration or a post-processor is not a golden key to raising performance in every circumstance. Still, we also observe that the landscape of performance change across KLR and TDR levels depends on the dataset; we leave this analysis as an avenue for improvement, as it is not a core component of our study. As a naive direction of analysis, we expect a method of quantifying the decision boundaries' tightness for the open intent classification task would help. In a nutshell, we recommend that NLP practitioners use the proposed C-ADB carefully: C-ADB might yield inconsistent performance if only a few training samples exist for the known intents.
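The TDR protocol can be sketched as a per-intent subsampling helper (illustrative names; not the paper's exact pipeline):

```python
import random
from collections import defaultdict

def subsample_per_class(samples, tdr, seed=0):
    """Keep a TDR fraction of labeled training samples per intent.

    samples: list of (text, label) pairs; tdr in (0, 1].
    """
    by_label = defaultdict(list)
    for s in samples:
        by_label[s[1]].append(s)
    rng = random.Random(seed)
    kept = []
    for lab, group in sorted(by_label.items()):
        k = max(1, round(len(group) * tdr))  # keep at least one sample
        kept.extend(rng.sample(group, k))
    return kept
```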

5. WHY DOES CALIBRATION CONTRIBUTE TO OPEN INTENT CLASSIFICATION?

Setup Lastly, we investigate how model calibration contributes to the improvement in open intent classification performance. We hypothesize the answer lies in the model's learned representations; thus, we measure representation similarities between models trained under different calibration strengths. We establish five pairs of models trained under different calibration strengths: (0 vs. 0), (0 vs. 0.2), (0 vs. 0.5), (0 vs. 0.8), and (0 vs. 0.9). Given these pairs, we quantitatively measure the similarity between the two models in a pair by applying Centered Kernel Alignment (CKA) (Kornblith et al., 2019). CKA measures the representation similarity between two layers from different models and returns a score between 0 and 1, where 1 means the highest similarity. Among the various methods for quantifying representation similarity, such as CCA or SVCCA, we utilize CKA as it accomplished state-of-the-art performance on benchmarks in its domain (Kornblith et al., 2019; Wu et al., 2020; Sridhar & Sarah, 2020). Following the prior work (Kornblith et al., 2019), we compare representation similarities at every LayerNorm layer between the two models in a given pair. Due to page limits, we show representation similarities among models trained on the STACKOVERFLOW dataset in Figure 4.

Result Upon Figure 4, for both C-LC and C-ADB, we find that model calibration yields high-level representations different from those of the uncalibrated model. While the overall landscapes of representation similarity look alike regardless of calibration strength, the similarity at the very last layer (the 11th) decreases as the calibration strength increases. We interpret this phenomenon to imply that calibration changes the model's high-level representation. In line with prior analyses (Hao et al., 2019; Wu et al., 2020), we further hypothesize that the calibrated model bears a different contextual understanding of the given text input.
We analyze that the calibrated model interprets the text input in a manner less biased toward the known intents; thus, it discriminates unknown intents based on this improved understanding of the text data. Note that this analysis supports the claim that the calibrated model acquires a 'different' representation from the uncalibrated one, but it does not explain how the calibrated representation yields better open intent classification performance. We leave this point as an avenue for improvement.
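As a reference for how such similarity scores are computed, here is a from-scratch sketch of linear CKA; the paper may rely on other variants from Kornblith et al. (2019) (e.g., RBF or mini-batch CKA), so this shows only the linear form under our own naming.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices.

    X: (n, d1) and Y: (n, d2) activations for the same n inputs, taken from
    two models or layers. Returns a similarity in [0, 1], where 1 means the
    representations match up to an orthogonal transform and isotropic scale.
    """
    # Center each feature dimension before comparing.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))
```

By construction, a representation compared with a scaled copy of itself scores 1, which is why CKA can compare layers of different widths across independently trained models.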

6. CONCLUSION

In this study, we propose novel open intent classification methods that apply label smoothing to prior state-of-the-art methods. We experimentally show that this simple idea improves the prior approaches' performance in particular settings, as the calibrated representation yields an adequate tightness of decision boundaries among known intents; consequently, the proposed C-ADB becomes a new state of the art in benchmark settings. Furthermore, we analyze that the proposed methods' superiority derives from high-level representations, implying that model calibration contributes to acquiring a better contextual understanding of text inputs. Last but not least, we also find that C-ADB is less stable than C-LC when few training samples exist; thus, we highly recommend that practitioners utilize it carefully on their own datasets. Nevertheless, several avenues for improvement remain, as discussed in the prior sections. We expect real-world practitioners can use the proposed methods, especially C-ADB, to establish effective NLP applications in their domains.



Figure 1: Distribution of the maximum logit value for known intent (blue) and unknown intent (orange) samples under a KLR of 75%. From top to bottom, rows correspond to STACKOVERFLOW, BANKING, and OOS; from left to right, columns correspond to calibration strengths of 0, 0.2, 0.5, 0.8, and 0.9.

Figure 2: Representation distributions of known intent (colored) and unknown intent (gray) samples in STACKOVERFLOW, yielded by C-LC. In each bracket, the left element is the KLR and the right element is the calibration strength.

Figure 3: Representation distributions of known intent (colored) and unknown intent (gray) samples in STACKOVERFLOW, yielded by C-ADB. In each bracket, the left element is the KLR and the right element is the calibration strength.

Figure 4: Representation similarities between models with and without calibration.

Shu et al. (2021) use several data augmentation strategies to expand distribution-shift examples for ADB. Throughout the prior works, we observe that a key ingredient of precise open intent classification is establishing adequate decision boundaries among the known intent samples, usually described as an appropriate tightness of decision boundaries.

Table 1: Descriptions of the utilized datasets.

Result C-LC achieves its best performance with the strongest calibration: the LC only needs to distinguish the known intent clusters for good performance; thus, applying the strongest calibration was beneficial in elevating open intent classification performance. On the other hand, C-ADB's performance is maximized at a moderate calibration strength. We analyze that this occurs because the original ADB already bears tightened decision boundaries compared to the LC, as ADB tightens them through its post-processing procedure. Suppose we applied the strongest calibration to C-ADB; in that case, the decision boundaries would become overly tightened, as the post-processing procedure and calibration would tighten them simultaneously. As the prior study (Zhang et al., 2021) urged, an overly tightened decision boundary degrades open intent classification performance. We presume the combined influence of post-processing and strong calibration creates extreme boundary tightness, which leads to inferior performance. Conversely, an adequate tightness is accomplished by combining the post-processor with moderate calibration; therefore, C-ADB with moderate calibration strength achieves the best performance. In a nutshell, we discover that C-LC and C-ADB require different calibration strengths for effective open intent classification. Furthermore, we confirm that an open intent classifier indeed requires an adequate level of decision-boundary tightness and re-assure that an overly tightened boundary degrades performance.

Table 2: Our C-LC and C-ADB's open intent classification performances under various calibration strengths. Each method requires a different calibration strength for its best performance.

Table 3: Comparative study of the proposed methods with baselines.

Table 4: Open intent classification performances with the validation set.

[Table 5 body: accuracy of C-LC and C-ADB with α ∈ {0.2, 0.5, 0.8, 0.9} on STACKOVERFLOW, BANKING, and OOS, at KLRs of 25% and 50% and TDRs of 0.2 to 1.0.]

Table 6: The proposed methods' performances measured in F1 score under various TDR levels.

Result We discover that the key findings presented in Section 4.4 still hold unless the training samples per class become scarce. Nevertheless, we observe that C-ADB is less stable than C-LC across the various amounts of training samples per class; thus, C-ADB requires careful configuration, especially when few training samples exist. Following the results, C-LC's performance does not change significantly across TDR levels, while C-ADB's performance exhibits high variability. In particular, C-ADB's performance drops sharply under a low KLR (i.e., 25%) and high calibration strength. We presume the underlying reason behind these phenomena is again the tightness of decision boundaries. Under a low TDR, a model learns insufficient knowledge during training; thus, it risks overfitting and a poor understanding of the known intents. On top of such overfitted, low-quality representations, we presume that applying calibration or post-processors (which raised open intent classification performance with ample training samples) would not contribute to acquiring better decision boundaries; accordingly, it degrades the open intent classification performance.

