Noise-Robust De-Duplication at Scale

Abstract

Identifying near duplicates within large, noisy text corpora has a myriad of applications, ranging from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, because duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a "re-rank" style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and pre-trained models will facilitate further research and applications.

1. INTRODUCTION

Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of applications. Duplication in training data degrades model performance (Lee et al., 2021), can raise serious privacy risks (Kandpal et al., 2022), and can degrade performance on downstream tasks (Schofield et al., 2017; Liu et al., 2022; Allamanis, 2019). Additionally, the presence of test set leakage complicates evaluation of model performance, a concern that is elevated with large language models, which have greater capacity to memorize training data or can consult an external database. Patterns of duplication are also themselves of interest, for studying the dissemination of reproduced content such as literature or news (Cordell, 2015; Smith et al., 2015; Vesanto et al., 2017) and for reducing noise in datasets used for statistical analyses.

In contrast to the literature on semantic textual similarity, where deep neural architectures predominate, e.g. Reimers & Gurevych (2019), text de-duplication overwhelmingly uses N-gram methods. There have been few efforts to formally evaluate the adequacy of N-gram based de-duplication or to explore potential performance gains from neural text de-duplication. This study builds a large de-duplication dataset and develops neural methods for robust textual de-duplication that significantly outperform N-gram based methods and scale efficiently.

A major hurdle to overcome in systematically studying text de-duplication is the lack of data for an unbiased evaluation of different methods. Typically, there is no way to exhaustively identify all duplicates in a massive corpus.
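To make concrete why N-gram overlap is fragile under the noise this paper targets, the following is a minimal sketch (not the paper's implementation) of word N-gram Jaccard similarity, the quantity that hashing schemes such as MinHash approximate. The example documents are hypothetical OCR-style variants of the same sentence; even two single-character errors sharply depress the overlap score.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """N-gram Jaccard overlap between two documents: |A & B| / |A | B|."""
    A, B = ngrams(a, n), ngrams(b, n)
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

# Hypothetical noisy OCR variants of the same wire article sentence:
doc1 = "the senate passed the bill after a lengthy debate on tuesday"
doc2 = "the senate passed the blll after a lengthy debate on tuesdey"

print(jaccard(doc1, doc2))  # ≈ 0.38, despite near-identical content
```

Each corrupted word removes up to n shingles from the intersection, so OCR noise compounds quickly; this is the failure mode that motivates the neural bi-encoder approach developed below.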

