CONSISTENT DATA DISTRIBUTION SAMPLING FOR LARGE-SCALE RETRIEVAL

Abstract

Retrieving candidate items with low latency and computational cost is important for large-scale advertising systems. Negative sampling is a general approach for modeling million-scale items with rich features in retrieval. The training-inference inconsistency in data distribution introduced by sampling negatives is a key challenge. In this work, we propose a novel negative sampling strategy, Consistent Data Distribution Sampling (CDDS), to address this issue. Specifically, we employ a relatively large set of uniform negatives and batch negatives to adequately train long-tail and hot items respectively, and employ high-divergence negatives to improve learning convergence. To make these training samples approximate the serving item distribution, we introduce an auxiliary loss based on an asynchronously updated item embedding matrix over the entire item pool. Offline experiments on real datasets achieve state-of-the-art performance, and online experiments in multiple advertising scenarios show that our method yields significant increases in GMV. The source code will be released in the future.

1. INTRODUCTION

Industrial search, recommendation, and advertising systems generally contain a large number of users and items. To handle millions or even billions of items, such systems usually comprise a matching stage and a ranking stage. The matching stage, also called retrieval, aims to retrieve a small subset of candidate items from the large item pool. Based on this thousand-scale retrieval subset, the ranking stage concentrates on the specific order of items for final display. Since the retrieval task must balance accuracy and efficiency, the two-tower architecture is the mainstream matching method widely used in most industrial large-scale systems (Yi et al., 2019). In a two-tower retrieval mechanism, million- to billion-scale item embeddings can be prepared in advance and updated on a regular cycle at serving time. FAISS (Johnson et al., 2019) is usually employed to quantize the vectors. With an efficient nearest neighbor search mechanism, retrieving the top similar items from the entire item pool for a given query can be done with low latency. The same procedure, however, is extremely expensive in the training phase: the entire set of item embeddings changes substantially across training steps and cannot feasibly be fed forward at every step. Under this condition, the inconsistency between the training data distribution and the inference data distribution cannot be ignored, and it becomes more serious as the scale of items grows. In general, uniformly sampled negatives are often easy to distinguish and yield diminishing gradient norms (Xiong et al., 2020). Batch negatives lead to insufficient capability when retrieving long-tail items and over-suppress hot items that are interacted with frequently. Existing mixed sampling methods do not clarify the distinct roles and sampling distributions of easy negatives versus hot negatives.
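To make the serving side of the two-tower mechanism concrete, the following is a minimal NumPy sketch of retrieval by maximum inner product. In production the brute-force scan below is replaced by an ANN library such as FAISS; the sizes and names here are toy assumptions of ours, not the paper's system.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 32, 10_000

# Item-tower outputs: computed offline for the whole pool and refreshed on a regular cycle.
item_emb = rng.standard_normal((n_items, dim)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, k: int = 100) -> np.ndarray:
    """Return ids of the k items most similar to the query (inner product)."""
    scores = item_emb @ query_emb           # score every item; an ANN index avoids this full scan
    top = np.argpartition(-scores, k)[:k]   # unordered top-k
    return top[np.argsort(-scores[top])]    # order by descending score
```

At serving time only the query tower runs per request; the item side reduces to an index lookup, which is what makes the two-tower split cheap.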
The combination of various sampling methods is usually designed for a specific case rather than a general situation. Although previous work sought to empirically alleviate the training-inference inconsistency in the matching stage, directly approximating the matching probability over the entire item distribution has been absent. Complex sampling methods make it difficult to analyze the effect of a single negative sample on the selection bias, so empirical approaches that improve efficiency metrics leave it uncertain whether the selection bias is actually reduced. Since only a non-sampling method is truly free of selection bias, optimizing the selection bias becomes possible if the retrieval probability of a single item can be approximated over the entire item pool. To this end, we propose a novel negative sampling approach called Consistent Data Distribution Sampling (CDDS for brevity). To adequately train long-tail and hot items respectively, we employ a relatively large set of uniform negatives and batch negatives. To improve learning convergence, we employ global approximate nearest neighbor negatives as high-divergence negatives. By maintaining an asynchronously updated item embedding matrix, we compute the loss not only between the query embedding and the sampled items but also over the entire item pool. By directly approximating the matching probability over the entire item distribution, our theoretical analysis and experiments show that CDDS achieves a consistent training-inference data distribution, fast learning convergence, and state-of-the-art performance. In summary, the main contributions of this work are as follows:

• We analyze the bottlenecks of different negative sampling strategies and show that introducing the unsampled vast majority of items can alleviate the training-inference inconsistency and resolve these bottlenecks.
• We propose CDDS, which adequately trains long-tail items, improves learning convergence, and directly approximates the matching probability over the entire item distribution.
• Our theoretical analysis proves that CDDS achieves a consistent training-inference data distribution. Extensive experiments on real-world datasets achieve state-of-the-art results. Online experiments in multiple advertising scenarios show that our method achieves significant increases in GMV and advertising cost.
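To make the sampling scheme above concrete, here is a minimal NumPy sketch of a training loss in its spirit: a sampled softmax over in-batch, uniform, and hard negatives, plus an auxiliary softmax computed against an asynchronously refreshed embedding matrix for the entire item pool. The names (`cdds_loss`, `async_items`), toy sizes, and the weight `alpha` are our illustrative assumptions rather than the paper's exact implementation, and real training would backpropagate through the two towers.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items, batch = 16, 5_000, 8

# Embeddings of the entire item pool, refreshed asynchronously (not at every training step).
async_items = rng.standard_normal((n_items, dim)).astype(np.float32)

def xent(logits: np.ndarray, target) -> float:
    """Softmax cross-entropy; `target` gives the positive column for each row."""
    z = logits - logits.max(axis=1, keepdims=True)
    lse = np.log(np.exp(z).sum(axis=1))
    return float(np.mean(lse - z[np.arange(len(z)), target]))

def cdds_loss(q, pos, pos_ids, hard_neg, n_uniform=64, alpha=0.1):
    pos_logit = np.sum(q * pos, axis=1, keepdims=True)
    sim = q @ pos.T                                        # in-batch similarities
    batch_neg = sim[~np.eye(len(q), dtype=bool)].reshape(len(q), -1)
    unif_ids = rng.integers(0, n_items, size=n_uniform)
    unif_neg = q @ async_items[unif_ids].T                 # easy uniform negatives: long-tail coverage
    hard = q @ hard_neg.T                                  # high-divergence ANN negatives
    main = xent(np.hstack([pos_logit, batch_neg, unif_neg, hard]), 0)
    # Auxiliary term: softmax over the full pool, pulling the sampled training
    # distribution toward the serving-time distribution over all items.
    aux = xent(q @ async_items.T, pos_ids)
    return main + alpha * aux
```

The key point the sketch illustrates is that the auxiliary term touches every item through the asynchronous matrix, so no item is entirely absent from the loss even though only a handful are sampled per step.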

2. RELATED WORK

2.1. TWO-TOWER MODELS

The two-tower architecture, which learns query and item vector representations in two separate feed-forward networks, is popular in industrial retrieval. This framework has been widely used in text retrieval (Huang et al., 2013; Yang et al., 2020; Hu et al., 2014), entity retrieval (Gillick et al., 2019), and large-scale recommendation (Cen et al., 2020; Covington et al., 2016; Li et al., 2019). Our work is orthogonal to existing complex user representation architectures such as CNNs (Hu et al., 2014; Shen et al., 2014) and multi-interest models (Li et al., 2019; Cen et al., 2020), which can also benefit from our sampling strategy.






Since the training-inference inconsistency has a great impact in practice, recent research has explored various ways to construct negative training instances to alleviate the problem. Faghri et al. (2017) employ contrastive learning to select hard negatives from current or recent mini-batches. Xiong et al. (2020) construct global negatives by using the dense retrieval model being optimized to retrieve from the entire corpus. Grbovic & Cheng (2018) encode user preference signals and treat users' rejections as explicit negatives. Yang et al. (2020) propose mixed negative sampling, which uses a mixture of batch and uniformly sampled negatives to tackle the selection bias.

2.2. NEGATIVE SAMPLING METHODS FOR TWO-TOWER MODELS

Covington et al. (2016) and Li et al. (2019) formulate the retrieval task as an extreme multi-class classification with a sampled softmax loss. Although the challenges of efficient training and training-inference inconsistency seem to be solved, these methods rely on a predetermined item pool and are not applicable when sampling from complex distributions (Yang et al., 2020). Chen et al. (2020) show that an enlarged negative pool significantly benefits sampling. In text retrieval, ANCE (Xiong et al., 2020) constructs global negatives from the entire corpus by using an asynchronously updated ANN index, theoretically showing that uninformative items yield diminishing gradient norms.
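The sampled softmax formulation mentioned above is commonly paired with a logQ correction, which subtracts the log sampling probability from each negative logit so that the sampled loss remains an estimate of the full softmax over the item pool. The sketch below is a generic NumPy illustration of that standard recipe, not the cited papers' exact code; all names and sizes are our assumptions.

```python
import numpy as np

def sampled_softmax_loss(q, pos, neg, neg_prob):
    """Sampled softmax with logQ correction: subtracting log(sampling probability)
    from negative logits debiases the estimate of the full-pool softmax."""
    pos_logit = np.sum(q * pos, axis=1, keepdims=True)
    neg_logit = q @ neg.T - np.log(neg_prob)[None, :]   # logQ correction per negative
    logits = np.hstack([pos_logit, neg_logit])
    z = logits - logits.max(axis=1, keepdims=True)      # stabilized log-sum-exp
    return float(np.mean(np.log(np.exp(z).sum(axis=1)) - z[:, 0]))
```

Under uniform sampling the correction is a constant shift and cancels; its effect matters precisely when negatives are drawn from a non-uniform (e.g. popularity-based) distribution.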

