GEOMETRY MATTERS: EXPLORING LANGUAGE EXAMPLES AT THE DECISION BOUNDARY

Abstract

A growing body of recent evidence has highlighted the limitations of natural language processing (NLP) datasets and classifiers. These include the presence of annotation artifacts in datasets and classifiers that rely on shallow features, such as a single word (e.g., if a movie review contains the word "romantic", the review tends to be positive) or irrelevant words (e.g., learning a proper noun to classify a movie review as positive or negative). The presence of such artifacts has subsequently led to the development of challenging datasets that force models to generalize better. While a variety of heuristic strategies, such as counterfactual examples and contrast sets, have been proposed, the theoretical justification for what makes these examples difficult for the classifier is often lacking or unclear. In this paper, using tools from information geometry, we propose a theoretical way to quantify the difficulty of an example in NLP. Using our approach, we explore difficult examples for several deep learning architectures. We discover that BERT, CNN, and fastText are all susceptible to word substitutions in high-difficulty examples. These classifiers tend to perform poorly on the FIM test set (generated by sampling and perturbing difficult examples), with accuracy dropping below 50%. We replicate our experiments on five NLP datasets (YelpReviewPolarity, AGNEWS, SogouNews, YelpReviewFull, and Yahoo Answers). On YelpReviewPolarity, we observe a correlation coefficient of -0.4 between resilience to perturbations and the difficulty score. Similarly, we observe a correlation of 0.35 between the difficulty score and the empirical success probability of random substitutions. Our approach is simple, architecture-agnostic, and can be used to study the fragilities of text classification models. All the code used will be made publicly available, including a tool to explore difficult examples in other datasets.

1. INTRODUCTION

Machine learning classifiers have achieved state-of-the-art success in tasks such as image classification and text classification. Despite these successes, several recent papers have pointed out flaws in the features learned by such classifiers. Geirhos et al. (2020) cast this phenomenon as shortcut learning, where a classifier ends up relying on shallow features in benchmark datasets that do not generalize well to more difficult datasets or tasks. For instance, Beery et al. (2018) showed that an image dataset constructed for animal detection and classification failed to generalize to images of animals in new locations. In language, this problem manifests at the word level. Poliak et al. (2018) showed that models using only one of the two input sentences for semantic entailment performed better than the majority class by relying on shallow features. Similar observations were made by Gururangan et al. (2018), where linguistic traits such as "vagueness" and "negation" were highly correlated with certain classes. In order to study the robustness of a classifier, it is essential to perturb the examples at the classifier's decision boundary. Contrast sets by Gardner et al. (2020) and counterfactual examples by Kaushik et al. (2020) are two approaches in which the authors perturb datasets to identify difficult examples. In contrast sets, the authors of a dataset manually fill in examples near the decision boundary (highlighted in small circles in Figure 1) to better evaluate classifier performance. In counterfactual examples, the authors use counterfactual reasoning along with Amazon Mechanical Turk to create "non-gratuitous changes." While these approaches are interesting, it is still unclear whether evaluating on them will actually capture a classifier's fragility. Furthermore, these manually curated examples are not guaranteed to be concentrated near the decision boundary (Figure 1).
As such, it is more important to perform evaluation on the examples lying in the green region, which represent confusing examples for the classifier, where even a small perturbation (for instance, substituting the name of an actress) can cause the neural network to misclassify. It is important to note that this does not depend solely on the classifier's certainty, as adversarial examples can fool neural networks into misclassifying with high confidence, as was shown by Szegedy et al. (2013). We now motivate our choice of the Fisher information metric (FIM) to quantify the difficulty of an example. In most natural language processing tasks, deep learning models are used to model the conditional probability distribution p(y | x) of a class label y conditioned on the input x. Here x can represent a sentence, while y can be the sentiment of the sentence. If we view a neural network as a probabilistic mapping from inputs to outputs, a natural property to measure is the Kullback-Leibler (KL) divergence between the output distribution at an example and the output distribution at a perturbation of that example. For small perturbations to the input, the FIM gives a quadratic form that approximates, up to second order, the change in the output probabilities of a neural network. Zhao et al. (2019) used this fact to demonstrate that the eigenvector associated with the maximum eigenvalue of the FIM gives an effective direction in which to perturb an example to generate an adversarial attack in computer vision. Furthermore, from an information geometry viewpoint, the FIM is a Riemannian metric, inducing a manifold geometry on the input space and providing a notion of distance based on changes in the information of inputs. To the best of our knowledge, this is the first work analyzing properties of the Fisher metric to understand classifier fragility in NLP. The rest of the paper is organized as follows: In Section 2, we summarize related work. In Section 3, we discuss our approach for computing the FIM and the gradient-based perturbation strategy. In Section 4, we discuss the results of analyzing the eigenvalues of the FIM on synthetic data and on sentiment analysis datasets with BERT and CNN. Finally, in Section 5, we discuss the implications of studying the eigenvalues of the FIM for evaluating NLP models.
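To make the quantity concrete: for a classifier with output distribution p(y | x), the FIM with respect to the input is G(x) = Σ_y p(y|x) ∇_x log p(y|x) ∇_x log p(y|x)ᵀ, and the difficulty score is its largest eigenvalue. The sketch below (our illustration, not the paper's released code) computes G(x) in closed form for a softmax-linear classifier, where ∇_x log p(y|x) = W_y − Σ_k p_k W_k; the function names and toy weights are illustrative.

```python
import numpy as np

def fisher_information_metric(W, b, x):
    """FIM of p(y|x) = softmax(W x + b) with respect to the input x.

    G(x) = sum_y p(y|x) * g_y g_y^T, where g_y = grad_x log p(y|x).
    For a softmax-linear model, g_y = W[y] - sum_k p_k W[k].
    """
    logits = W @ x + b
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    mean_row = p @ W                            # sum_k p_k W[k]
    G = np.zeros((x.size, x.size))
    for y, p_y in enumerate(p):
        g = W[y] - mean_row
        G += p_y * np.outer(g, g)
    return G

def difficulty_score(W, b, x):
    """Largest eigenvalue of the FIM, used here as the difficulty score."""
    return np.linalg.eigvalsh(fisher_information_metric(W, b, x))[-1]
```

For a deep model the same construction applies with ∇_x log p(y|x) obtained by backpropagation to the input embeddings; the linear case above simply makes the gradients explicit. Note that for a point on the decision boundary of this toy model the score peaks, while confidently classified points far from the boundary score near zero.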



Figure 1: Quantifying difficulty using the largest eigenvalue of the Fisher information metric (FIM). a) Contrast sets and counterfactual examples are not necessarily concentrated near the decision boundary, as shown in this diagram. Difficult examples are the ones in the green region (close to the decision boundary), and this is the region where model fragility should be evaluated. b) We sample points from a two-component Gaussian mixture model and train a classifier to separate the two classes. c) The dataset colored by the largest eigenvalue of the FIM; difficult examples, which have a higher eigenvalue, lie closer to the decision boundary.
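The synthetic experiment of panels (b) and (c) can be reproduced in a few lines. The sketch below is our own illustration (not the authors' released code): it samples a two-component Gaussian mixture, fits a logistic-regression classifier by plain gradient descent, and scores each point by the largest FIM eigenvalue, which for a binary logistic model p = σ(w·x + b) has the closed form p(1−p)‖w‖², peaking at the decision boundary where p = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Panel (b): sample a two-component Gaussian mixture, one component per class.
n = 200
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(n, 2)),
               rng.normal([+2.0, 0.0], 1.0, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train a logistic-regression classifier with plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.3 * (X.T @ (p - y)) / len(y)
    b -= 0.3 * (p - y).mean()

# Panel (c): for binary logistic regression the FIM at x is p(1-p) * w w^T,
# so the difficulty score (its largest eigenvalue) is p(1-p) * ||w||^2.
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
scores = p * (1.0 - p) * (w @ w)
```

Coloring the scatter plot of X by `scores` reproduces panel (c): the score is maximal for points with p ≈ 0.5, i.e., those lying closest to the learned decision boundary.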

