SCIREPEVAL: A MULTI-FORMAT BENCHMARK FOR SCIENTIFIC DOCUMENT REPRESENTATIONS

Abstract

Learned representations of scientific documents can serve as valuable input features for downstream tasks, without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking, and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state of the art by up to 1.5 points absolute.

1. INTRODUCTION

Learning representations of whole documents is critical for a variety of NLP tasks including classification, search, and recommendation (Cohan et al., 2020). Recent work has shown how pretrained language models (e.g., Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) can be tailored to produce high-quality representations of documents with contrastive learning (Xu et al., 2021; Gao et al., 2021; Neelakantan et al., 2022). In the scientific domain, training objectives based on contrastive learning of cross-document links (e.g., citations) have shown further improvements in document-level representation learning (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022). These methods are especially useful because the representations they produce can be indexed and later efficiently consumed by lightweight downstream models without additional fine-tuning.

While there has been significant progress in evaluating the generalizability of NLP models (Ye et al., 2021; Sanh et al., 2021), evaluation of scientific document representation models has remained limited. Existing benchmarks either focus on document similarity (Mysore et al., 2021; Voorhees et al., 2021) or include tasks that are highly correlated and not diverse (Cohan et al., 2020).

We introduce SciRepEval, the first benchmark for comprehensive evaluation of document-representation learning models in the scientific domain. Unlike prior work, SciRepEval is large and includes a collection of highly diverse tasks, thus encouraging research on generalization (including instance-level, cross-task, and cross-domain generalization). It consists of 25 realistic tasks that reflect practical use cases of scientific document representations across four formats: text classification, regression, proximity-based ranking (e.g., nearest-neighbor), and ad-hoc search. Eleven of these tasks are new contributions.
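The frozen-embedding workflow described above, where precomputed document vectors are indexed once and then consumed by lightweight downstream components, can be sketched as follows. This is a minimal illustration of proximity-based ranking with cosine similarity; the embeddings here are random placeholders standing in for an encoder's output, not any model evaluated in the benchmark:

```python
import numpy as np

# Placeholder for frozen document embeddings produced offline by an encoder;
# random vectors here, purely for illustration.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 8)).astype(np.float32)

# Normalize once at indexing time so cosine similarity reduces to a dot product.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def rank_neighbors(query: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k documents most similar to `query`."""
    q = query / np.linalg.norm(query)
    scores = index @ q                # cosine similarity against the whole index
    return np.argsort(-scores)[:k]   # highest-scoring documents first

# A query embedding; querying with a document's own vector ranks it first.
top = rank_neighbors(corpus[3], corpus)
```

Because the encoder is never fine-tuned per task, the same index can serve classification, regression, ranking, and search heads, which is exactly the usage pattern the benchmark's task formats are meant to probe.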
SciRepEval contains standard sets of both training and evaluation datasets to simplify and standardize comparisons between methods evaluated on the benchmark. We then use this new benchmark to investigate and improve the generalization ability of document representation models. Following recent work (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022), we further pre-fine-tune a transformer language model (SciNCL; Ostendorff et al., 2022b) to produce high-quality representations for downstream tasks. We hypothesize that condensing all relevant information about a document into a single vector representation might not be expressive enough for generalization across a wide range of tasks. Prior work addresses a similar challenge in the context of document similarity by learning multiple finer-grained representations,

