A TRULY CONSTANT-TIME DISTRIBUTION-AWARE NEGATIVE SAMPLING

Anonymous

Abstract

Softmax classifiers with a very large number of classes arise naturally in many applications, such as natural language processing and information retrieval. Computing the full softmax is expensive from both a computational and an energy perspective. A variety of sampling approaches, popularly known as negative sampling (NS), have been proposed to overcome this challenge. Ideally, NS should sample negative classes from a distribution that depends on the input data, the current parameters, and the correct positive class. Unfortunately, because the parameters and data samples change every iteration, no existing sampling scheme is truly adaptive while also sampling the negative classes in constant time per iteration. Therefore, alternative heuristics like random sampling, static frequency-based sampling, or learning-based biased sampling are adopted, which trade away either the sampling cost or the adaptivity of samples per iteration. In this paper, we exhibit a class of distributions for which the sampling scheme is truly adaptive and provably generates negative samples in constant time. Our C++ implementation on a commodity CPU is significantly faster, in terms of wall-clock time, than the most optimized TensorFlow implementations of standard softmax or other sampling approaches on modern GPUs (V100s).

1. INTRODUCTION

Neural Networks (NN) have successfully pushed the boundaries of many application tasks, such as image or text classification (Wang et al., 2017; Yao et al., 2019), speech recognition (Dong et al., 2018), and recommendation systems (Zhang et al., 2015; Medini et al., 2019). Many hard AI problems are currently modeled as massive multiclass or multilabel problems, leading to drastic improvements over prior work. For example, popular NLP models predict the best next word given the full context observed so far; such models are becoming the state of the art. Recommendation systems and related Information Retrieval (IR) problems are classical examples of machine learning with outrageously large outputs (Medini et al., 2019; Jain et al., 2019). In IR, given a user query, the task is to predict a few relevant documents (or products) from among hundreds of millions of possible documents, a typical machine learning problem with a massive output space. Owing to the significance of the problem, machine learning with large output spaces, also known as extreme classification, is a field in itself (Bengio et al., 2019). A large number of classes naturally brings a new set of computational and memory challenges. Fortunately, with access to powerful Graphics Processing Units (GPUs) (Owens et al., 2008), the training of large models has been heavily accelerated. This is because GPUs have a unique advantage for matrix multiplication, which usually requires a cubic-time algebraic operation (O(N^3)) and is the major and costly building block of NN computations. However, the number of concurrent operations required in the large matrix multiplications of classification with an extensive number of classes has reached a limit for further speedups, even using GPUs.

1.1. NEGATIVE SAMPLING

The common approach to address this challenge is known as negative sampling (Pennington et al., 2014; Jean et al., 2014; Rawat et al., 2019; Mikolov et al., 2013b). In negative sampling, we sample only a small subset of classes for each input and compute the softmax and cross-entropy function over that subset, which usually includes the positive (true) class and a small set of negative (false) classes. Negative sampling scales down the computations in the most cumbersome last layer, thereby making training efficient. However, approximating the full softmax with a small sub-sample results in poor convergence if the negative samples are not chosen appropriately. For instance, consider a recommendation system (predicting products relevant to a query) with a large number of products. If the input query is 'Nike Running Shoes', the true loss concentrates on a small number of confusing ('hard') negative classes like 'Adidas Running Shoes'. Since the number of classes is huge, random sampling is unlikely to identify this hard negative class. Other heuristics, like sampling frequent classes as negatives, are also unlikely to find these hard negatives most of the time. Clearly, without discriminating between closely related negative samples, the classifier cannot achieve good accuracy. Our experiments on recommendation datasets clearly indicate this sub-optimality of current negative sampling heuristics. If there existed a way to sample the subset of confusing classes from this skewed distribution, training progress would be greatly accelerated. However, as the example illustrates, such a ground-truth distribution depends on the input sample and the current model parameters. Moreover, this distribution varies significantly as training progresses. Consider the same query 'Nike Running Shoes': initially, when the network has random weights and has not learned anything, all classes are equally confusing.
Thus, uniform sampling is optimal initially, when the network has just started to learn. As training progresses, the network's belief concentrates on a few classes; at this point, a negative sample of, say, 'baby toys' is not at all useful, because the network has already learned to tell them apart. The sampling distribution keeps changing, often drastically, as training progresses. To the best of our knowledge, there does not exist any statistical sampling scheme and implementation for adaptive negative sampling where the cost of maintaining and updating the distribution, per iteration, is O(1) (independent of the number of classes). This is because the input, the current true class, and the parameter updates change all the sampling weights in every iteration. It is widely assumed that no such sampling scheme exists, and hence several heuristic alternatives have been proposed. The first set of alternatives uses a static distribution. The most popular ones, implemented in TensorFlow, assume a static distribution such as one based on the frequency of classes. Uniform sampling is another popular choice. Learning-based alternatives have also been proposed (Bamler & Mandt, 2020), where a machine learning generator predicts (or generates) the negative samples. The sampler is solving the same hard problem, prediction over a large number of classes, as a sub-routine. Most importantly, since the sampling distribution for the same data point shifts drastically throughout training, such ML models are likely to suffer. Negative sampling alternatives try to balance sampling cost with quality. So far, negative sampling methods other than those based on static sampling have failed to demonstrate any training-time improvement over optimized full-softmax implementations on GPUs. Static sampling strategies are known to be fast but lead to poor accuracy.
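To make the mechanics above concrete, the following NumPy sketch contrasts the exact full-softmax loss with a sampled-softmax approximation driven by a static log-uniform (Zipfian) sampler, which is similar in spirit to TensorFlow's frequency-based candidate sampler. This is an illustrative sketch, not the paper's method; all function names are ours.

```python
import numpy as np

def full_softmax_loss(logits, true_class):
    # Exact cross-entropy over all C classes: O(C) per example.
    z = logits - logits.max()
    return np.log(np.exp(z).sum()) - z[true_class]

def log_uniform_negatives(num_classes, num_neg, true_class, rng):
    # Static Zipfian sampler, similar in spirit to TensorFlow's
    # log_uniform_candidate_sampler: classes are assumed sorted by
    # decreasing frequency, and class c is drawn with probability
    # proportional to log(c + 2) - log(c + 1). The distribution is
    # fixed: it never adapts to the input or the parameters.
    c = np.arange(num_classes)
    probs = np.log((c + 2.0) / (c + 1.0))
    probs[true_class] = 0.0          # never sample the positive class
    probs /= probs.sum()
    return rng.choice(num_classes, size=num_neg, replace=False, p=probs)

def sampled_softmax_loss(logits, true_class, num_neg, rng):
    # Approximate loss over the positive class plus a few sampled
    # negatives: O(num_neg) per example instead of O(C).
    negs = log_uniform_negatives(len(logits), num_neg, true_class, rng)
    subset = np.concatenate(([true_class], negs))
    z = logits[subset] - logits[subset].max()
    return np.log(np.exp(z).sum()) - z[0]   # positive sits at index 0
```

Because the sampled partition function sums over only a subset of the classes, the sampled loss never exceeds the full loss; how well it tracks the full loss hinges entirely on whether the sampler finds the high-scoring ('hard') negatives.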
With current alternatives, the cost of improving sample quality does not seem worth it compared to the GPU acceleration of full softmax. In this paper, we change this. Our work provides a truly constant-time adaptive sampling scheme utilizing recent advances in Locality Sensitive Sampling (Charikar & Siminelakis, 2017; Spring & Shrivastava, 2017a). More impressively, we provide an efficient CPU implementation of our proposal, in C++, that outperforms TensorFlow's implementations of softmax and other negative sampling strategies on some of the best available GPUs (V100) in terms of wall-clock training time.
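As a rough illustration of the underlying LSH idea (not this paper's exact construction), the following SimHash sketch hashes the class weight vectors into buckets and, at training time, draws negatives from the bucket that the input embedding falls into. Collision probability is monotonic in cosine similarity, so classes whose weights are close to the current input (the 'hard' negatives) are preferentially retrieved, and the lookup cost is independent of the total number of classes. All identifiers are hypothetical.

```python
import numpy as np

class SimHashNegativeSampler:
    """Illustrative LSH-based negative sampler using SimHash
    (signed random projections). In a real training loop the buckets
    would be rebuilt or incrementally updated as the class weights
    change, which is what makes the sampling distribution adaptive."""

    def __init__(self, class_weights, num_bits, rng):
        self.num_classes, dim = class_weights.shape
        self.planes = rng.normal(size=(num_bits, dim))
        self.buckets = {}
        for c, w in enumerate(class_weights):
            self.buckets.setdefault(self._hash(w), []).append(c)

    def _hash(self, v):
        # Sign pattern of the random projections, used as bucket key.
        return tuple((self.planes @ v > 0).astype(np.int8).tolist())

    def sample(self, embedding, num_neg, true_class, rng):
        # Hash the input embedding and sample negatives from the
        # colliding bucket; classes hashing together are (with high
        # probability) similar to the input, hence 'hard' negatives.
        cands = [c for c in self.buckets.get(self._hash(embedding), [])
                 if c != true_class]
        if len(cands) < num_neg:
            # Fall back to uniform sampling when the bucket is sparse.
            cands = [c for c in range(self.num_classes) if c != true_class]
        return rng.choice(cands, size=num_neg, replace=False)
```

Note that both the hash of the input (a few inner products) and the bucket lookup cost depend only on the embedding dimension and the number of hash bits, not on the number of classes, which is the property the constant-time claim rests on.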

Summary of Contributions:

1) We propose two efficient schemes for sampling 'hard' negatives, where the negative sampling distribution provably adapts to the changing parameters and the data instance. Furthermore, the sampling cost is provably constant (independent of the number of classes).

2) We show that our technique is not only provably adaptive but also practical. We provide an efficient CPU implementation, in C++, of our negative sampling approach. We demonstrate the effectiveness of a truly constant-time negative sampler by showing that our implementation significantly outperforms standard TensorFlow on a V100 GPU in wall-clock speed, while retaining accuracy.

