OF OPENREVIEW: A CRITICAL ANALYSIS OF THE MACHINE LEARNING CONFERENCE REVIEW PROCESS

Anonymous authors
Paper under double-blind review

Abstract

Mainstream machine learning conferences have seen a dramatic increase in the number of participants, along with a growing range of perspectives, in recent years. Members of the machine learning community are likely to overhear allegations ranging from randomness of acceptance decisions to institutional bias. In this work, we critically analyze the review process through a comprehensive study of papers submitted to ICLR between 2017 and 2020. We quantify reproducibility/randomness in review scores and acceptance decisions, and examine whether scores correlate with paper impact. Our findings suggest strong institutional bias in accept/reject decisions, even after controlling for paper quality. Furthermore, we find evidence for a gender gap, with female authors receiving lower scores, lower acceptance rates, and fewer citations per paper than their male counterparts. We conclude our work with recommendations for future conference organizers.

1. INTRODUCTION

Over the last decade, mainstream machine learning conferences have been strained by a deluge of conference paper submissions. At ICLR, for example, the number of submissions has grown by an order of magnitude within the last 5 years alone. Furthermore, the influx of researchers from disparate fields has led to a diverse range of perspectives and opinions that often conflict when it comes to reviewing and accepting papers. This has created an environment where the legitimacy and randomness of the review process is a common topic of discussion. Do conference reviews consistently identify high-quality work? Or has review degenerated into a process orthogonal to meritocracy? In this paper, we put the review process under a microscope using publicly available data from across the web, in addition to hand-curated datasets. Our goals are to:

• Quantify reproducibility in the review process. We employ statistical methods to disentangle sources of randomness in the review process. Using Monte-Carlo simulations, we quantify the level of outcome reproducibility. Simulations indicate that randomness is not effectively mitigated by recruiting more reviewers.

• Measure whether high-impact papers score better. We find that review scores are only weakly correlated with citation impact.

• Determine whether the process has gotten worse over time. We present empirical evidence that the reproducibility of decisions, the correlation between reviewer scores and impact, and the consensus among reviewers have all decreased over time.

• Identify institutional bias. We find strong evidence that area chair (AC) decisions are influenced by institutional name brands. ACs are more likely to accept papers from prestigious institutions (even when controlling for reviewer scores), and papers from more recognizable authors are more likely to be accepted as well.
• Present evidence for a gender gap in the review process. We find that women tend to receive lower scores than men and have a lower acceptance rate overall (even after controlling for differences in the topic distribution between men and women).
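The reproducibility analysis sketched in the first bullet can be illustrated with a small simulation. The model below is an illustrative assumption, not the paper's fitted model: each paper has a latent quality, each reviewer observes quality plus Gaussian noise, and the top fraction by mean score is accepted; we then measure how often two independent committees agree on the accepted set. All parameter values (noise level, acceptance rate) are hypothetical.

```python
import random

def simulate_reproducibility(n_papers=2000, n_reviewers=3, noise=1.5,
                             accept_rate=0.25, n_trials=50, seed=0):
    """Estimate how often two independent committees accept the same papers.

    Hypothetical model: latent quality ~ N(0, 1); each reviewer sees
    quality + N(0, noise); the top `accept_rate` fraction by mean
    reviewer score is accepted.  Parameters are illustrative only.
    """
    rng = random.Random(seed)
    agree, total = 0, 0
    for _ in range(n_trials):
        quality = [rng.gauss(0, 1) for _ in range(n_papers)]

        def committee_decisions():
            # Mean of n_reviewers noisy observations per paper.
            means = [q + sum(rng.gauss(0, noise) for _ in range(n_reviewers)) / n_reviewers
                     for q in quality]
            cutoff = sorted(means, reverse=True)[int(accept_rate * n_papers)]
            return [m >= cutoff for m in means]

        a, b = committee_decisions(), committee_decisions()
        accepted_by_a = [i for i, x in enumerate(a) if x]
        # Fraction of committee A's accepts that committee B also accepts.
        agree += sum(b[i] for i in accepted_by_a)
        total += len(accepted_by_a)
    return agree / total
```

Comparing runs with larger `n_reviewers` against the baseline shows agreement improving only slowly as reviewers are added, consistent with the claim that more reviewers do not eliminate randomness.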

2. DATASET CONSTRUCTION

We scraped data from multiple sources to enable analysis of many factors in the review process. OpenReview was a primary source of data, and we collected titles, abstracts, author lists, emails, scores, and reviews for ICLR papers from 2017-2020. We also communicated with OpenReview maintainers to obtain information on withdrawn papers. There were a total of 5569 ICLR submissions from these years: ICLR 2020 had 2560 submissions, ICLR 2019 had 1565, ICLR 2018 had 960, and ICLR 2017 had 490.

Authors were associated with institutions using both author profiles from OpenReview and the open-source World University and Domains dataset. CS Rankings was chosen to rank academic institutions because it includes institutions from outside the US. The arXiv repository was scraped to find papers that first appeared in non-anonymous form before review. We were able to find 3196 papers on arXiv, 1415 of them from 2020.

Citation and impact measures were obtained from SemanticScholar. This includes citations for individual papers and individual authors, in addition to the publication counts of each author. SemanticScholar search results were screened for duplicate publications using an edit distance metric. The dataset was hand-curated to resolve a number of difficulties, including finding missing authors whose names appear differently in different venues, checking for duplicate author pages that might corrupt citation counts, and hand-checking for duplicate papers when titles/authors are similar.

To study gender disparities in the review process, we produced gender labels for first and last authors on papers in 2020. We assigned labels based on gendered pronouns appearing on personal webpages when possible, and on the use of canonically gendered names otherwise. This produced labels for 2527 out of 2560 papers. We acknowledge the inaccuracies and complexities inherent in labeling complex attributes like gender, and its imbrication with race. However, we do not think these complexities should prevent the impact of gender on reviewer scores from being studied.
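The duplicate screening step described above can be sketched with a simple similarity check over normalized titles. This is a minimal sketch, not the paper's actual pipeline: it uses `difflib`'s ratio as a stand-in for an edit distance metric, and the 0.9 threshold is an illustrative assumption (borderline pairs would still be checked by hand, as the text notes).

```python
from difflib import SequenceMatcher

def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles after lowercasing and
    collapsing whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def flag_duplicates(titles, threshold=0.9):
    """Return index pairs of titles likely referring to the same paper.

    Illustrative threshold; a real pipeline would route near-threshold
    pairs to manual review rather than deciding automatically.
    """
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if normalized_similarity(titles[i], titles[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```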

2.1. TOPIC BREAKDOWN

To study how review statistics vary by subject, and to control for topic distribution in the analysis below, we define a rough categorization of papers into common ML topics. To keep things simple and interpretable, we define a short list of hand-curated keywords for each topic, and identify a paper with that topic if it contains at least one of the relevant keywords. The topics used were theory, computer vision, natural language processing, adversarial ML, generative modelling, meta-learning, fairness, generalization, optimization, graphs, Bayesian methods, and Other. In total, 1605 papers from 2020 fell into the above categories, and 772 of those 1605 papers fell into multiple categories.
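The keyword-matching rule above amounts to a many-to-many tagging scheme, which can be sketched as follows. The keyword lists here are hypothetical placeholders; the paper's actual hand-curated lists are not reproduced, and a naive substring match like this one would need refinement (e.g. word boundaries) in practice.

```python
# Hypothetical keyword lists -- illustrative only, not the paper's curated sets.
TOPIC_KEYWORDS = {
    "computer vision": ["image", "vision", "segmentation"],
    "natural language processing": ["language model", "translation", "nlp"],
    "adversarial ML": ["adversarial"],
    "generative modelling": ["gan", "generative"],
}

def assign_topics(text: str) -> list:
    """Tag a paper with every topic whose keyword list it matches.

    Naive substring matching over lowercased text; a paper matching no
    keyword list falls into the catch-all 'Other' category, and papers
    may receive multiple topic labels.
    """
    lowered = text.lower()
    topics = [topic for topic, keywords in TOPIC_KEYWORDS.items()
              if any(kw in lowered for kw in keywords)]
    return topics or ["Other"]
```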



Figure 1: Breakdown of papers and review outcomes by topic, ICLR 2020. (a) Acceptance rate.

