SCIREPEVAL: A MULTI-FORMAT BENCHMARK FOR SCIENTIFIC DOCUMENT REPRESENTATIONS

Abstract

Learned representations of scientific documents can serve as valuable input features for downstream tasks, without the need for further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state-of-the-art by up to 1.5 points absolute.

1. INTRODUCTION

Learning representations of whole documents is critical for a variety of NLP tasks including classification, search, and recommendation (Cohan et al., 2020). Recent work has shown how pretrained language models (e.g., Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) can be tailored to produce high-quality representations of documents with contrastive learning (Xu et al., 2021; Gao et al., 2021; Neelakantan et al., 2022). In the scientific domain, training objectives based on contrastive learning of cross-document links (e.g., citations) have shown further improvements in document-level representation learning (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022). These methods are especially useful because the representations they produce can be indexed and later efficiently consumed by lightweight downstream models without additional fine-tuning.

While there has been significant progress in evaluating the generalizability of NLP models (Ye et al., 2021; Sanh et al., 2021), evaluation of scientific document representation models has remained limited. Existing benchmarks either focus on document similarity (Mysore et al., 2021; Voorhees et al., 2021) or include tasks that are highly correlated and not diverse (Cohan et al., 2020).

We introduce SciRepEval, the first benchmark for comprehensive evaluation of document-representation learning models in the scientific domain. Unlike prior work, SciRepEval is large and includes a collection of highly diverse tasks, thus encouraging research on generalization (including instance-level, cross-task, and cross-domain generalization). It consists of 25 realistic tasks that reflect practical use cases of scientific document representations across four formats: text classification, regression, proximity-based ranking (e.g., nearest-neighbor), and ad-hoc search. Eleven of these tasks are new contributions.
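To make the "embed once, reuse everywhere" workflow concrete, here is a minimal sketch. The 2-dimensional embeddings, paper IDs, and nearest-centroid classifier are all made up for illustration; in practice the vectors would be produced once by a model such as SPECTER or SciNCL and stored in an index, with only the lightweight downstream model trained per task:

```python
import numpy as np

# Hypothetical frozen document embeddings (in practice produced once by
# the representation model and stored in an index; never fine-tuned here).
embeddings = {
    "p1": np.array([0.9, 0.1]), "p2": np.array([0.8, 0.2]),  # topic A
    "p3": np.array([0.1, 0.9]), "p4": np.array([0.2, 0.8]),  # topic B
}
labels = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

def nearest_centroid(train_ids, query_vec):
    # Lightweight downstream model: one centroid per class over the
    # frozen vectors; classify a query by its nearest centroid.
    centroids = {}
    for c in set(labels[i] for i in train_ids):
        vecs = [embeddings[i] for i in train_ids if labels[i] == c]
        centroids[c] = np.mean(vecs, axis=0)
    return min(centroids, key=lambda c: np.linalg.norm(query_vec - centroids[c]))
```

Because the encoder is frozen, swapping in a different downstream task only requires refitting the cheap classifier, not re-running the language model.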
SciRepEval contains standard sets of both training and evaluation datasets to simplify and standardize comparisons between methods evaluated on the benchmark. We then use this new benchmark to investigate and improve the generalization ability of document representation models. Following recent work (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022), we further pre-fine-tune a transformer language model (SciNCL; Ostendorff et al., 2022b) to produce high-quality representations for downstream tasks. We hypothesize that condensing all relevant information about a document into a single vector representation might not be expressive enough for generalization across a wide range of tasks. Prior work addresses a similar challenge in the context of document similarity by learning multiple finer-grained representations, each associated with a different aspect of a paper (e.g., task, method, results) (Mysore et al., 2022; Ostendorff et al., 2022a). In contrast, we aim to learn effective representations for multiple downstream task formats. Following recent successes in multi-task learning in NLP (Ye et al., 2021; Sanh et al., 2021), we explore large-scale multi-task training in the context of scientific document representations, applying a suitable optimization objective to each task format in SciRepEval: cross-entropy loss for classification, triplet loss for proximity-based ranking and ad-hoc search, and mean squared error loss for regression. We explore two state-of-the-art techniques for generating format-specific document representations: control codes (Keskar et al., 2019; Raffel et al., 2020) given as input to indicate the format, and parameter-efficient adapter methods (Houlsby et al., 2019; Pfeiffer et al., 2021; Stickland & Murray, 2019), in which a separate network module is introduced for every task format.
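As an illustration of applying a suitable objective per task format, the minimal sketch below (hypothetical code, not the released implementation) routes a training example to cross-entropy, triplet, or mean-squared-error loss according to its format:

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single classification example.
    z = logits - logits.max()                      # stabilize the exponent
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def triplet(anchor, pos, neg, margin=1.0):
    # Hinge on Euclidean distances: pull the positive (e.g., a cited or
    # relevant paper) closer to the anchor than the negative, by a margin.
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(0.0, d_pos - d_neg + margin)

def mse(pred, target):
    # Mean squared error for a single regression example.
    return float((pred - target) ** 2)

def format_loss(task_format, **batch):
    # Choose the optimization objective by task format.
    if task_format == "classification":
        return cross_entropy(batch["logits"], batch["label"])
    if task_format in ("proximity", "search"):
        return triplet(batch["anchor"], batch["pos"], batch["neg"])
    if task_format == "regression":
        return mse(batch["pred"], batch["target"])
    raise ValueError(f"unknown task format: {task_format}")
```

In an actual multi-task training loop, batches from the different training tasks would be interleaved and each batch's loss selected this way before a shared backward pass.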
Our experiments investigate: (i) whether existing document representation methods can generalize to a highly diverse set of tasks, (ii) whether multi-task training on diverse data can improve document representation models, and (iii) whether task-format-specific representations can improve generalization. Through extensive analysis, we find that existing state-of-the-art scientific document representation models such as SPECTER (Cohan et al., 2020) and SciNCL (Ostendorff et al., 2022b) struggle to generalize across all task types. Interestingly, simple multi-task training on a large set of tasks does not significantly improve the results. However, we find that multiple task-format-specific representations can substantially improve generalization. To summarize, our contributions are:

(i) SciRepEval, a new comprehensive benchmark of 25 highly diverse and practical tasks for scientific document representation techniques across four different formats, of which 11 tasks are made available for the first time, and six of the tasks are explicitly designed for training. (ii) An extensive investigation of the generalization ability of state-of-the-art scientific document representation models. (iii) A set of new multi-task document representation models that, unlike existing methods, can produce representations tailored to different task formats. The new methods show improved generalization over previous work, outperforming prior methods by up to 1.5 points absolute.

2. BACKGROUND

Representing Scientific Documents  Earlier work on document embeddings used word vectors (J et al., 2016; Le & Mikolov, 2014; Wu et al., 2018), convolutions (Liu et al., 2017; Zamani et al., 2018), bi-encoder networks (Conneau et al., 2017), and BERT-based methods (Reimers & Gurevych, 2019). Recent work has produced large-scale language models pre-trained on scientific corpora (Beltagy et al., 2019; Yasunaga et al., 2022; Trewartha et al., 2022). These tend to perform better than general-purpose models on scientific-domain tasks and serve as a foundation for learning dense embeddings of scientific documents. Cohan et al. (2020) and Ostendorff et al. (2022b) fine-tune SciBERT (Beltagy et al., 2019) with a triplet loss that encourages papers citing each other to have similar embeddings, using the title and abstract of research papers as the input. Both are evaluated on the SciDocs benchmark. However, 4 of the 7 tasks in SciDocs are overly simplistic in that the goal is to distinguish 5 real citations from 20 randomly chosen non-citations (further limitations of SciDocs are discussed in Section 3 and Appendix F); hence, existing techniques already work reasonably well on it. In contrast, SciRepEval provides a more challenging and diverse set of tasks, for both training and evaluation, to motivate methods for producing scientific document representations that do well across multiple task formats. As a first step in this direction, we attempt to learn task-specific embeddings of documents by pre-fine-tuning on multiple objectives simultaneously. Related to our approach are techniques that learn multiple embeddings per paper (Ostendorff et al., 2022a; Mysore et al., 2022). These methods are, however, orthogonal to ours in that they generate an embedding per paper "facet", while we focus on learning separate embeddings per task format. In addition, these techniques target only finer-grained paper similarity, while our aim is to produce general embeddings applicable to a variety of task formats.

Multi-Task Learning Across Formats  Multi-task learning (Caruana, 1993) with deep neural networks has been shown to improve performance over single-task training for related objectives (Liu
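The citation-based triplet construction used by these methods can be sketched as follows; the sampling scheme here (random non-cited papers as negatives) is a simplified illustration, not the exact procedure from those papers, which also use harder negatives:

```python
import random

def make_citation_triplets(citations, papers, k=2, seed=0):
    # citations: dict mapping paper_id -> set of paper_ids it cites.
    # papers: list of all candidate paper_ids in the corpus.
    # Returns (query, positive, negative) id triplets where the positive
    # is cited by the query and the negative is a random non-cited paper.
    rng = random.Random(seed)
    triplets = []
    for q, cited in citations.items():
        candidates = [p for p in papers if p != q and p not in cited]
        for pos in cited:
            for _ in range(k):  # k random negatives per citation edge
                triplets.append((q, pos, rng.choice(candidates)))
    return triplets
```

Each triplet then feeds a margin-based triplet loss over the encoder's embeddings of the papers' titles and abstracts, pulling cited papers together and pushing non-cited papers apart.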

