CLASS-WEIGHTED EVALUATION METRICS FOR IMBALANCED DATA CLASSIFICATION

Anonymous

Abstract

Class distribution skews in imbalanced datasets may lead to models with prediction bias towards majority classes, making fair assessment of classifiers a challenging task. Balanced Accuracy is a popular metric used to evaluate a classifier's prediction performance under such scenarios. However, this metric falls short when classes vary in importance, especially when class importance is skewed differently from class cardinality. In this paper, we propose a simple and general-purpose evaluation framework for imbalanced data classification that is sensitive to arbitrary skews in class cardinalities and importances. Experiments with several state-of-the-art classifiers, tested on real-world datasets and benchmarks from two different domains, show that our new framework is more effective than Balanced Accuracy, not only in evaluating and ranking model predictions, but also in training the models themselves.

1. INTRODUCTION

For a broad range of machine learning (ML) tasks, predictive modeling in the presence of imbalanced datasets (those with severe distribution skews) has been a long-standing problem (He & Garcia, 2009; Sun et al., 2009; He & Ma, 2013; Branco et al., 2016; Hilario et al., 2018; Johnson & Khoshgoftaar, 2019). Imbalanced training datasets lead to models with prediction bias towards majority classes, which in turn results in misclassification of the underrepresented ones. Yet, those minority classes are often the ones that correspond to the most important events of interest (e.g., errors in system logs (Zhu et al., 2019), infected patients in medical diagnosis (Cohen et al., 2006), fraud in financial transactions (Makki et al., 2019)). While there is often an inverse correlation between class cardinalities and class importance (i.e., rare classes are more important than others), the core problem here is the mismatch between the way these two distributions are skewed: the i-th most common class is not necessarily the i-th most important class (see Figure 1a for an illustration). In fact, rarity is only one of many potential criteria that can determine the importance of a class, which is usually positively correlated with the costs or risks involved in its misprediction. Ignoring these criteria when dealing with imbalanced data classification may have detrimental consequences.

Consider the automatic classification of messages in system event logs as an example (Zhu et al., 2019). An event log is a temporal sequence of event messages produced by a given software system (e.g., an operating system or a cyber-physical system) over a certain period of time. Event logs are particularly useful after a system has been deployed: they can provide DevOps teams with information and insights about errors outside of the testing environment, thereby improving their ability to debug and improve the quality of the system.
There is typically an inverse correlation between the stability/maturity of a system and the frequency of the errors it produces in its event log. Furthermore, the message types that appear least frequently in an event log are usually the ones with the greatest importance. A concrete example of this was a rare anomaly in Uber's self-driving car that led to the death of a pedestrian, because the system flagged it as a false positive (Efrati, 2018). Had this event not been misclassified and dismissed by the system, the pedestrian's death in Arizona might have been avoided.

A plethora of approaches have been proposed for building balanced classifiers (e.g., resampling to balance datasets, imbalanced learning methods, prediction post-processing (Sun et al., 2009; Branco et al., 2016)). A fundamental issue that still remains an open challenge, however, is the lack of a generally accepted methodology for measuring classification performance. Traditional metrics designed to evaluate average-case performance (e.g., Accuracy) are not capable of correctly assessing classifier performance on imbalanced data.

Let us illustrate the problem with the simple example depicted in Figure 1b. The test dataset consists of 100 data items from 3 classes (A, B, C). The great majority of the items belong to class C (70/100), but class B (20/100) has the greatest importance (0.7/1.0). In other words, Cardinality and Importance are both non-uniform and in favor of different classes (i.e., the example falls in the top-right quadrant of Figure 1a). The confusion matrix on the right-hand side shows the results from a classifier that was run against this test dataset. Unsurprisingly, the classifier performed best in labeling the majority class C (60/70 correct predictions). When this result is evaluated using the traditional Accuracy metric, neither Class Cardinality nor Class Importance is taken into account.
If Balanced Accuracy is used instead, we observe the degrading impact of the skew in Class Cardinality (0.38 < 0.65), but Class Importance is still not accounted for. This example demonstrates the need for a new evaluation approach that is sensitive both to Cardinality and Importance skews and to any arbitrary correlation between them. This is especially critical for ensuring a fair assessment of results when comparing across multiple classifiers or problem instances.

Our goal in this paper is to design an evaluation framework for imbalanced data classification that can be reliably used to measure, compare, train, and tune classifier performance in a way that is sensitive to non-uniform class importance. We identify two key design principles for such a framework:

• Simplicity: It should be intuitive and easy to use and interpret.
• Generality: It should be general-purpose, i.e., (i) extensible to an arbitrary number of classes and (ii) customizable to any application domain.

To meet the first design goal, we focus on scalar metrics such as Accuracy (as opposed to graphical metrics such as ROC curves), as they are simpler, more commonly used, and scale well with increasing numbers of classes and models. To meet the second design goal, we target the more general n-ary classification problem (as opposed to binary), and we provide the capability to flexibly adjust class weights to capture non-uniform importance criteria that may vary across application domains.

Note that we primarily focus on Accuracy as our base scalar metric in this paper, as it is seen as the de facto metric for classification problems (Sci). However, our framework is general enough to be extended to other scalar metrics, such as Precision and Recall. Similarly, while we deeply examine two applications (log parsing and sentiment analysis) in this work, our framework is in principle applicable to any domain with imbalanced class and importance distributions.
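To make the contrast concrete, the three metrics in the running example can be computed directly from a confusion matrix. The sketch below uses a hypothetical confusion matrix and hypothetical importance weights: they are chosen only to be consistent with the cardinalities stated above (A=10, B=20, C=70, with 60/70 correct for class C and B carrying importance 0.7), not copied from Figure 1b. The class-weighted score shown is one plausible instantiation of the idea of replacing Balanced Accuracy's uniform per-class weights with importance weights.

```python
# Sketch of Accuracy, Balanced Accuracy, and an importance-weighted
# variant for a 3-class problem. The confusion matrix and the
# importance weights for classes A and C are illustrative assumptions.

# cm[i][j] = number of items of true class i predicted as class j
cm = [
    [1,  4,  5],   # class A: 10 items, 1 correct
    [2,  4, 14],   # class B: 20 items, 4 correct
    [3,  7, 60],   # class C: 70 items, 60 correct
]
importance = [0.2, 0.7, 0.1]  # hypothetical; B is most important (0.7)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(3))
recalls = [cm[i][i] / sum(cm[i]) for i in range(3)]

accuracy = correct / total                       # ignores both skews
balanced_accuracy = sum(recalls) / len(recalls)  # equal class weights
weighted_accuracy = sum(w * r for w, r in zip(importance, recalls))

print(f"Accuracy:          {accuracy:.2f}")
print(f"Balanced Accuracy: {balanced_accuracy:.2f}")
print(f"Weighted Accuracy: {weighted_accuracy:.2f}")
```

With these illustrative numbers, the importance-weighted score is pulled down by the classifier's poor recall on the important class B far more sharply than Balanced Accuracy is, while plain Accuracy rewards the majority-class bias.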
In the rest of this paper, we first provide a brief overview of related work in Section 2. Section 3 presents our new, class-weighted evaluation framework. In Section 4, we show the practical utility of our framework by applying it to three log parsing systems (Drain (He et al., 2017), MoLFI (Messaoudi et al., 2018), Spell (Du & Li, 2016; 2018)) using four real-world benchmarks (Zhu et al., 2019), as well as to a variety of deep learning models developed for sentiment analysis on a customer reviews dataset from Amazon (Ni et al., 2019). Finally, we conclude the paper with a brief discussion of future directions.



Figure 1: Skew in distributions of Class Cardinalities or Class Importance, and the potential mismatch between these two distributions render Accuracy and Balanced Accuracy metrics unusable in general multi-class prediction problems.

