CLASS-WEIGHTED EVALUATION METRICS FOR IMBALANCED DATA CLASSIFICATION

Anonymous

Abstract

Class distribution skews in imbalanced datasets may lead to models with prediction bias towards majority classes, making fair assessment of classifiers a challenging task. Balanced Accuracy is a popular metric used to evaluate a classifier's prediction performance under such scenarios. However, this metric falls short when classes vary in importance, especially when class importance is skewed differently from class cardinality. In this paper, we propose a simple and general-purpose evaluation framework for imbalanced data classification that is sensitive to arbitrary skews in class cardinalities and importances. Experiments with several state-of-the-art classifiers, tested on real-world datasets and benchmarks from two different domains, show that our new framework is more effective than Balanced Accuracy, not only in evaluating and ranking model predictions but also in training the models themselves.

1. INTRODUCTION

For a broad range of machine learning (ML) tasks, predictive modeling in the presence of imbalanced datasets (those with severe class distribution skews) has been a long-standing problem (He & Garcia, 2009; Sun et al., 2009; He & Ma, 2013; Branco et al., 2016; Hilario et al., 2018; Johnson & Khoshgoftaar, 2019). Imbalanced training datasets lead to models with prediction bias towards majority classes, which in turn results in misclassification of the underrepresented ones. Yet those minority classes are often the ones that correspond to the most important events of interest (e.g., errors in system logs (Zhu et al., 2019), infected patients in medical diagnosis (Cohen et al., 2006), fraud in financial transactions (Makki et al., 2019)). While there is often an inverse correlation between class cardinalities and class importance (i.e., rare classes are more important than others), the core problem here is the mismatch between the way these two distributions are skewed: the i-th most common class is not necessarily the i-th most important class (see Figure 1a for an illustration). In fact, rarity is only one of many potential criteria that can determine the importance of a class, which is usually positively correlated with the costs or risks involved in its misprediction. Ignoring these criteria when dealing with imbalanced data classification may have detrimental consequences.

Consider the automatic classification of messages in system event logs as an example (Zhu et al., 2019). An event log is a temporal sequence of event messages produced by a given software system (e.g., an operating system or a cyber-physical system) over a certain period of time. Event logs are particularly useful after a system has been deployed: they can provide DevOps teams with information and insights about errors outside of the testing environment, thereby improving their ability to debug the system and improve its quality.
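To make this mismatch concrete, the following minimal sketch (our illustration, not the paper's proposed metric; the class counts and importance weights are hypothetical) contrasts plain Accuracy, Balanced Accuracy, and a simple importance-weighted recall on a majority-biased classifier:

```python
# Imbalanced 3-class problem: class 0 is the majority, class 2 is the
# rarest. The classifier below predicts the majority class for everything.
y_true = [0] * 90 + [1] * 8 + [2] * 2
y_pred = [0] * 100

def accuracy(y_true, y_pred):
    # Fraction of all samples predicted correctly (dominated by class 0).
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred, cls):
    # Recall restricted to the samples whose true label is `cls`.
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)

def balanced_accuracy(y_true, y_pred):
    # Unweighted mean of per-class recalls: every class counts equally.
    classes = sorted(set(y_true))
    return sum(per_class_recall(y_true, y_pred, c) for c in classes) / len(classes)

def weighted_recall(y_true, y_pred, weights):
    # Mean of per-class recalls weighted by (hypothetical) class importance.
    total = sum(weights.values())
    return sum(w * per_class_recall(y_true, y_pred, c)
               for c, w in weights.items()) / total

print(accuracy(y_true, y_pred))           # 0.9  -- looks deceptively good
print(balanced_accuracy(y_true, y_pred))  # ~0.33 -- exposes the majority bias
# Hypothetical importance weights: the rare class 2 matters most.
print(weighted_recall(y_true, y_pred, {0: 1, 1: 2, 2: 10}))  # ~0.08
```

Balanced Accuracy penalizes the bias but still treats all classes as equally important; the weighted variant additionally reflects that missing the rare, high-importance class should dominate the score.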
There is typically an inverse correlation between the stability/maturity of a system and the frequency of the errors it produces in its event log. Furthermore, the message types that appear least frequently in an event log are usually the ones with the greatest importance. A concrete example was a rare anomaly in Uber's self-driving car that led to the death of a pedestrian, because the system flagged it as a false positive (Efrati, 2018). Had this event not been misclassified and dismissed by the system, the pedestrian's death in Arizona might have been avoided.

A plethora of approaches have been proposed for building balanced classifiers (e.g., resampling to balance datasets, imbalanced learning methods, prediction post-processing (Sun et al., 2009; Branco et al., 2016)). A fundamental issue that still remains an open challenge is the lack of a generally accepted methodology for measuring classification performance. The traditional metrics (e.g., Accuracy), which are designed to evaluate average-case performance, are not capable of correctly assess-

