AN OPEN REVIEW OF OPENREVIEW: A CRITICAL ANALYSIS OF THE MACHINE LEARNING CONFERENCE REVIEW PROCESS

Anonymous authors
Paper under double-blind review

Abstract

Mainstream machine learning conferences have seen a dramatic increase in the number of participants, along with a growing range of perspectives, in recent years. Members of the machine learning community are likely to overhear allegations ranging from the randomness of acceptance decisions to institutional bias. In this work, we critically analyze the review process through a comprehensive study of papers submitted to ICLR between 2017 and 2020. We quantify reproducibility and randomness in review scores and acceptance decisions, and examine whether scores correlate with paper impact. Our findings suggest strong institutional bias in accept/reject decisions, even after controlling for paper quality. Furthermore, we find evidence of a gender gap, with female authors receiving lower scores, lower acceptance rates, and fewer citations per paper than their male counterparts. We conclude with recommendations for future conference organizers.

1. INTRODUCTION

Over the last decade, mainstream machine learning conferences have been strained by a deluge of paper submissions. At ICLR, for example, the number of submissions has grown by an order of magnitude within the last five years alone. Furthermore, the influx of researchers from disparate fields has brought a diverse range of perspectives and opinions that often conflict when it comes to reviewing and accepting papers. This has created an environment in which the legitimacy and randomness of the review process are common topics of discussion. Do conference reviews consistently identify high-quality work? Or has review degenerated into a process orthogonal to meritocracy? In this paper, we put the review process under a microscope using publicly available data from across the web, in addition to hand-curated datasets. Our goals are to:

• Quantify reproducibility in the review process. We employ statistical methods to disentangle sources of randomness in the review process. Using Monte-Carlo simulations, we quantify the level of outcome reproducibility (a minimal simulation sketch follows this list). Simulations indicate that randomness is not effectively mitigated by recruiting more reviewers.

• Measure whether high-impact papers score better. We find that review scores are only weakly correlated with citation impact.

• Determine whether the process has gotten worse over time. We present empirical evidence that the reproducibility of decisions, the correlation between reviewer scores and impact, and the level of consensus among reviewers have all decreased over time.

• Identify institutional bias. We find strong evidence that area chair decisions are influenced by institutional name brands. ACs are more likely to accept papers from prestigious institutions, even when controlling for reviewer scores (a sketch of such a controlled analysis also follows this list), and papers from more recognizable authors are more likely to be accepted as well.

• Present evidence for a gender gap in the review process. We find that women tend to receive lower scores than men and have a lower acceptance rate overall, even after controlling for differences in the topic distributions of men and women.
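To make the reproducibility simulation concrete, the following is a minimal sketch of one way such a Monte-Carlo estimate can be set up; it is our own illustrative construction, not the paper's exact procedure. It assumes each paper's committee is simulated by resampling that paper's observed reviewer scores with replacement, and that a fixed fraction of papers is accepted by mean score. The acceptance rate, committee size, and threshold rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_decisions(score_sets, accept_rate=0.3, n_reviewers=3):
    """Simulate one committee: resample each paper's reviewer scores
    with replacement, then accept the top `accept_rate` fraction of
    papers by mean resampled score."""
    means = np.array([
        rng.choice(scores, size=n_reviewers, replace=True).mean()
        for scores in score_sets
    ])
    threshold = np.quantile(means, 1.0 - accept_rate)
    return means >= threshold

def reproducibility(score_sets, n_trials=1000, **kwargs):
    """Average fraction of papers on which two independently simulated
    committees reach the same accept/reject decision."""
    agreement = []
    for _ in range(n_trials):
        a = simulate_decisions(score_sets, **kwargs)
        b = simulate_decisions(score_sets, **kwargs)
        agreement.append((a == b).mean())
    return float(np.mean(agreement))

# Toy example: observed reviewer scores (1-10 scale) for three papers.
papers = [np.array([6, 7, 5]), np.array([4, 8, 3]), np.array([9, 9, 8])]
print(f"estimated decision agreement: {reproducibility(papers):.2f}")
```

Varying `n_reviewers` in a sketch like this gives a rough sense of how much (or how little) agreement improves as committees grow, which is the question behind the first bullet's claim about recruiting more reviewers.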
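Likewise, to illustrate what "controlling for reviewer scores" can look like in the institutional-bias analysis, here is a minimal sketch using a logistic regression of the accept decision on an institution-prestige indicator, with the mean reviewer score included as a control covariate. The data are synthetic and the variable names (`mean_score`, `top_inst`) are hypothetical; this is one standard way to frame the question, not the paper's analysis.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-paper data: mean reviewer score, an indicator for
# affiliation with a prestigious institution, and the final decision.
rng = np.random.default_rng(1)
n = 500
mean_score = rng.normal(5.5, 1.2, n)
top_inst = rng.binomial(1, 0.25, n)

# Synthetic decisions with a built-in prestige effect, for illustration.
logit = -8.0 + 1.3 * mean_score + 0.6 * top_inst
accepted = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Logistic regression of the decision on prestige, controlling for score.
X = sm.add_constant(np.column_stack([mean_score, top_inst]))
model = sm.Logit(accepted, X).fit(disp=0)
print(model.summary(xname=["const", "mean_score", "top_inst"]))
```

A positive, statistically significant coefficient on the prestige indicator after conditioning on score is the kind of signal the institutional-bias bullet refers to.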

