NOISE-ROBUST DE-DUPLICATION AT SCALE

Abstract

Identifying near duplicates within large, noisy text corpora has applications ranging from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, because duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a "re-rank" style approach combining a bi-encoder and a cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and pre-trained models will facilitate further research and applications.

1. INTRODUCTION

Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of applications. Duplication in training data degrades model performance (Lee et al., 2021), can raise serious privacy risks (Kandpal et al., 2022), and can degrade performance on downstream tasks (Schofield et al., 2017; Liu et al., 2022; Allamanis, 2019). Additionally, the presence of test set leakage complicates evaluation of model performance, a concern that is elevated with large language models, which have greater capacity to memorize training data or can consult an external database. Patterns of duplication are also themselves of interest, for studying the dissemination of reproduced content such as literature or news (Cordell, 2015; Smith et al., 2015; Vesanto et al., 2017) and for reducing noise in datasets used for statistical analyses.

In contrast to the literature on semantic textual similarity, where deep neural architectures predominate (e.g., Reimers & Gurevych, 2019), text de-duplication overwhelmingly uses N-gram methods. There have been few efforts to formally evaluate the adequacy of N-gram based de-duplication or to explore potential performance gains from neural text de-duplication. This study builds a large de-duplication dataset and develops neural methods for robust textual de-duplication that significantly outperform N-gram based methods and scale efficiently.

A major hurdle to systematically studying text de-duplication is the lack of data for an unbiased evaluation of different methods. Typically, there is no way to exhaustively identify all duplicates of a given example in a large corpus, complicating comparisons of recall. To circumvent this challenge, we examine duplication in historical news. Reproduction from news wires and syndicate services was widespread, forming over half the content of U.S. local newspapers. Media historian Julia Guarneri (2017) writes: "by the 1910s and 1920s, most of the articles that Americans read in their local papers had either been bought or sold on the national news market... This constructed a broadly understood American 'way of life' that would become a touchstone of U.S. domestic politics and international relations throughout the twentieth century." Because news is timely, reproduction happens within a narrow time window, and hence annotators can exhaustively identify all duplicates despite the massive overall size of the corpus. To build an unbiased evaluation sample, highly skilled human annotators manually reviewed every front-page article from 973 newspapers on four randomly chosen days in 1930, 1955, and 1974 to create clusters of duplicated articles (including all singletons). Additional data, spanning the period from 1920 to 1977, were compiled for model training. The resulting public NEWS-COPY dataset, which contains 27,210 articles comprising 122,876 positive duplicate pairs, aims to encourage further study of robust de-duplication.

In the absence of evaluation data, the literature has largely assumed that text de-duplication is sufficiently simple that neural methods are not required. However, noise is an integral feature of large text datasets, resulting from OCR errors, abridgement, news aggregators, plagiarism, or machine translation, to name a few sources. This can lead near-duplicate documents to have low N-gram similarity. Amongst duplicated pairs of articles in the NEWS-COPY test set, the average Jaccard similarity between pairs of reproduced articles is 30% using 3-grams (26% with 4-grams, 23% with 5-grams). 19% of duplicates have no 10-grams in common and 31% have no 15-grams in common, often as a result of minor text noise. Neural methods are plausibly more robust.
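To make this fragility concrete, the following is a minimal sketch (not from the NEWS-COPY codebase; the example sentences are invented) of word-level N-gram Jaccard similarity, showing how a few character-level OCR errors can eliminate most long N-gram overlap between true duplicates:

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word-level N-grams in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: str, b: str, n: int) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the two N-gram sets."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

# Two copies of the same (invented) wire story; the second has three
# OCR-style character errors.
clean = "the senate passed the farm relief bill by a wide margin today"
noisy = "the senatc passed the farm rel1ef bill by a w1de margin today"

for n in (3, 5, 10):
    print(f"{n}-gram Jaccard: {jaccard(clean, noisy, n):.2f}")
# Prints 0.11, 0.00, 0.00: three character errors leave this pair with
# no 5-grams or 10-grams in common at all.
```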
Using the NEWS-COPY dataset, we examine de-duplication methods that vary along two key dimensions: whether or not the method is neural, and computational cost. Drawing inspiration from work on semantic textual similarity and on retrieval, we develop two approaches to neural text de-duplication: a contrastively trained bi-encoder combined with clustering, and a "re-rank" style method, which uses a computationally cheap transformer bi-encoder to measure the pairwise similarity between all articles and then passes each article's nearest neighbors to a cross-encoder, at an additional computational cost. We also examine N-gram overlap and locality-sensitive hashing (LSH), the latter of which is highly scalable.

The neural methods significantly outperform the non-neural approaches: the Adjusted Rand Index (ARI) is 93.7 for the re-rank model and 91.5 for the bi-encoder model, versus 73.7 for LSH and 75.0 for N-gram overlap.
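The following is a minimal sketch of this re-rank pipeline, assuming the sentence-transformers, faiss, and networkx libraries; the model names and the 0.9 threshold are illustrative placeholders, not the released NEWS-COPY checkpoints or tuned parameters:

```python
import faiss
import networkx as nx
from sentence_transformers import SentenceTransformer, CrossEncoder

articles = [
    "Senate passes farm relief bill by wide margin...",
    "The senate today passed the farm relief bill...",
    "Mayor opens new bridge across the river...",
]  # toy corpus; in practice, millions of articles

# 1. Bi-encoder: embed every article once (cheap, amortized over the corpus).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
emb = bi_encoder.encode(articles, normalize_embeddings=True)

# 2. Nearest-neighbor search over the embeddings (inner product on
# L2-normalized vectors equals cosine similarity).
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)
k = min(10, len(articles))
sims, nbrs = index.search(emb, k)

# 3. Cross-encoder: re-score only the candidate pairs (expensive per pair,
# but run on a small neighbor set rather than all O(n^2) pairs).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # placeholder
pairs = sorted({(i, j) for i, row in enumerate(nbrs) for j in row if i < j})
scores = cross_encoder.predict([(articles[i], articles[j]) for i, j in pairs])

# 4. Cluster: treat pairs scoring above a threshold as edges; connected
# components are duplicate clusters, and isolated nodes are singletons.
G = nx.Graph()
G.add_nodes_from(range(len(articles)))
G.add_edges_from(p for p, s in zip(pairs, scores) if s > 0.9)  # placeholder cutoff
clusters = list(nx.connected_components(G))
print(clusters)
```

In the bi-encoder-only variant, step 3 is skipped and edges are formed directly from the bi-encoder similarities, which is what makes that approach cheap enough to scale to very large corpora.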
While the primary advantage of hashing, and a central motivation for its frequent usage, is its scalability, massive-scale similarity search (Johnson et al., 2019) is sufficiently cheap on modern GPUs to make neural de-duplication highly scalable. We use our contrastively trained bi-encoder and a single NVIDIA A6000 (48GB) GPU card to de-duplicate a 10 million document, 19 GB corpus in 11 hours and 45 minutes. While this cost is already marginal in the context of working with large text corpora, it could be reduced significantly further by using a lighter-weight language model, as the majority of the time is spent embedding the 10M articles.
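As a rough illustration of why this scales, the sketch below runs the nearest-neighbor step with FAISS on a GPU, assuming the faiss-gpu package; the index type and all parameters are illustrative assumptions, not the paper's exact configuration:

```python
import faiss
import numpy as np

d = 768                                            # bi-encoder embedding dimension
emb = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(emb)                            # inner product = cosine similarity

# An IVF index buckets vectors into nlist coarse cells and scans only nprobe
# cells per query, trading a little recall for large speedups at 10M+ scale.
nlist = 256
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

gpu_index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, index)
gpu_index.train(emb)                     # learn the coarse cells
gpu_index.add(emb)
gpu_index.nprobe = 32                    # cells scanned per query
sims, nbrs = gpu_index.search(emb, 10)   # batched k-NN over the whole corpus
```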
The publicly available neural de-duplication models, available at https://github.com/dell-research-harvard/NEWS-COPY, can be applied to novel de-duplication problems. To evaluate off-the-shelf performance, we apply our bi-encoder model to two subsets of C4 (Colossal Clean Crawled Corpus), a massive dataset created by applying a series of filters to a single snapshot of Common Crawl (Raffel et al., 2019; Dodge et al., 2021): RealNews, which consists of around 13 million digital news articles, and all 90,671 patents scraped from Google's online patent database. We also examine test set leakage between SuperGlue (Sarlin et al., 2020) and RealNews. While there is no unbiased ground truth measure for these datasets, an analysis of predicted duplicates shows that the bi-encoder detects a variety of noisy duplicates that hashing overlooks, resulting from aggregators of digital news, machine translation, and other sources of noise.

The rest of this paper is organized as follows: Section 2 provides an overview of the relevant literature, Section 3 describes the NEWS-COPY dataset, and Section 4 develops the neural de-duplication methods and their non-neural comparisons. Section 5 evaluates the performance of the different de-duplication methods, Section 6 explores scaling, and Section 7 applies de-duplication to a subset of C4. Finally, Section 8 concludes.