CONSISTENT DATA DISTRIBUTION SAMPLING FOR LARGE-SCALE RETRIEVAL

Abstract

Retrieving candidate items with low latency and computational cost is important for large-scale advertising systems. Negative sampling is a general approach to modeling million-scale items with rich features in retrieval. The training-inference inconsistency of data distribution introduced by sampling negatives is a key challenge. In this work, we propose a novel negative sampling strategy, Consistent Data Distribution Sampling (CDDS), to address this issue. Specifically, we employ a relatively large scale of uniform negatives and batch negatives to adequately train long-tail and hot items respectively, and employ high-divergence negatives to improve learning convergence. To make the above training samples approximate the serving item data distribution, we introduce an auxiliary loss based on an asynchronous item embedding matrix over the entire item pool. Offline experiments on real datasets achieve SOTA performance. Online experiments in multiple advertising scenarios show that our method achieves significant increases in GMV. The source code will be released in the future.

1. INTRODUCTION

Industrial search, recommendation, and advertising systems generally contain a large number of users and items. To handle millions or even billions of items, such systems usually comprise a matching stage and a ranking stage. The matching stage, also called retrieval, aims to retrieve a small subset of candidate items from the large item pool. Based on this thousand-scale retrieval subset, the ranking stage concentrates on the specific ranking of items for the final display. Since the retrieval task must consider both accuracy and efficiency, the two-tower architecture is the mainstream matching method widely used in most industrial large-scale systems (Yi et al., 2019). In a two-tower retrieval mechanism, million-scale to billion-scale item embeddings can be prepared in advance and updated on a regular cycle at serving time. FAISS (Johnson et al., 2019) is usually employed to quantize the vectors. With its efficient nearest-neighbor search mechanism, retrieving the top similar items from the entire item pool for a given query incurs low latency. But the same process is extremely expensive in the training phase, especially because the embeddings of the entire item pool change substantially across training steps and are infeasible to recompute at every step. Under this condition, the inconsistency between the training data distribution and the inference data distribution cannot be ignored; moreover, the inconsistency becomes more serious as the scale of the items grows. In general, uniformly sampled negatives are often easy to distinguish and yield diminishing gradient norms (Xiong et al., 2020). Batch negatives lead to insufficient capability when retrieving long-tail items and over-suppress hot items that are interacted with frequently. Current mixed sampling methods do not clarify the roles and sampling distributions of easy and hard negative samples.
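The trade-off between uniform and in-batch negatives described above can be illustrated with a toy sketch. The following NumPy example is not the paper's implementation: the tower outputs are random placeholders, and all names (`in_batch_softmax_loss`, `uniform_negative_loss`, etc.) are our own illustrative choices. It only shows the two standard sampled-softmax losses being contrasted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-tower outputs: random stand-ins for learned embeddings.
batch_size, dim, pool_size = 4, 8, 100
query_emb = rng.normal(size=(batch_size, dim))       # query-tower outputs
item_pool = rng.normal(size=(pool_size, dim))        # entire item pool
pos_emb = item_pool[rng.integers(0, pool_size, size=batch_size)]  # positives

def _softmax_nll(logits, target_col):
    """Negative log-likelihood of the target column under a row-wise softmax."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(logits)), target_col].mean()

def in_batch_softmax_loss(q, pos):
    """In-batch negatives: each row's positive item serves as a negative for
    every other query, so hot items appear (and get suppressed) frequently."""
    logits = q @ pos.T                                    # (B, B) similarities
    return _softmax_nll(logits, np.arange(len(q)))

def uniform_negative_loss(q, pos, pool, num_neg=16):
    """Uniform negatives: items drawn uniformly from the pool, which covers
    long-tail items but tends to produce easy, low-gradient negatives."""
    neg = pool[rng.integers(0, len(pool), size=(len(q), num_neg))]
    pos_logit = (q * pos).sum(axis=1, keepdims=True)      # (B, 1)
    neg_logits = np.einsum('bd,bnd->bn', q, neg)          # (B, num_neg)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)
    return _softmax_nll(logits, np.zeros(len(q), dtype=int))

print(in_batch_softmax_loss(query_emb, pos_emb))
print(uniform_negative_loss(query_emb, pos_emb, item_pool))
```

In a real system the two losses differ not in mechanics but in which items dominate the negative distribution, which is exactly the inconsistency the paper targets.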



Since the training-inference inconsistency has a great impact in practice, recent research has explored various ways to construct negative training instances to alleviate the problem. Faghri et al. (2017) employ contrastive learning to select hard negatives from current or recent mini-batches. Xiong et al. (2020) construct global negatives by using the dense retrieval model being optimized to retrieve from the entire corpus. Grbovic & Cheng (2018) encode user preference signals and treat users' rejections as explicit negatives. Yang et al. (2020) propose mixed negative sampling, which uses a mixture of batch and uniformly sampled negatives to tackle selection bias.
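The mixed-negative idea of Yang et al. (2020) can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (random placeholder embeddings, illustrative names), not the authors' code: the candidate set for each query is simply the batch's positives plus a handful of items drawn uniformly from the pool.

```python
import numpy as np

rng = np.random.default_rng(1)

batch, dim, pool_size, m = 4, 8, 50, 6    # m = extra uniform negatives per step
q = rng.normal(size=(batch, dim))                       # query-tower outputs
item_pool = rng.normal(size=(pool_size, dim))           # entire item pool
pos = item_pool[rng.integers(0, pool_size, size=batch)] # positive items

def mixed_negative_loss(q, pos, item_pool, m):
    """Mixed negative sampling: the softmax candidates are the in-batch
    positives (acting as negatives for the other rows) concatenated with
    m items sampled uniformly from the whole pool."""
    uniform_neg = item_pool[rng.integers(0, len(item_pool), size=m)]
    candidates = np.concatenate([pos, uniform_neg], axis=0)   # (B + m, d)
    logits = q @ candidates.T                                 # (B, B + m)
    logits -= logits.max(axis=1, keepdims=True)               # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits at column i of the candidate matrix.
    return -log_probs[np.arange(len(q)), np.arange(len(q))].mean()

loss = mixed_negative_loss(q, pos, item_pool, m)
print(float(loss))
```

Blending the two sources reshapes the effective negative distribution toward the serving-time item distribution, which is the selection-bias correction the citation refers to.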

