

Abstract

Diversity is an important criterion for many areas of machine learning (ML), including generative modeling and dataset curation. Yet little work has gone into understanding, formalizing, and measuring diversity in ML. In this paper we address the diversity evaluation problem by proposing the Vendi Score, which extends ideas from ecology to ML. The Vendi Score is defined as the exponential of the Shannon entropy of the eigenvalues of a similarity matrix. This matrix is induced by a user-defined similarity function applied to the sample to be evaluated for diversity. In taking a similarity function as input, the Vendi Score enables its user to specify any desired form of diversity. Importantly, unlike many existing metrics in ML, the Vendi Score does not require a reference dataset or a distribution over samples or labels; it is therefore general and applicable to any generative model, decoding algorithm, and dataset from any domain where similarity can be defined. We showcase the Vendi Score on molecular generative modeling, where we find it addresses shortcomings of the current diversity metric of choice in that domain. We also apply the Vendi Score to generative models of images and decoding algorithms of text, where we find it confirms known results about diversity in those domains. Furthermore, we use the Vendi Score to measure mode collapse, a known shortcoming of generative adversarial networks (GANs). In particular, the Vendi Score reveals that even GANs that capture all the modes of a labeled dataset can be less diverse than the original dataset. Finally, the interpretability of the Vendi Score allows us to diagnose several benchmark ML datasets for diversity, opening the door for diversity-informed data augmentation.¹

1 Introduction

Diversity is a criterion that is sought after in many areas of machine learning (ML), from dataset curation and generative modeling to reinforcement learning, active learning, and decoding algorithms. A lack of diversity in datasets and models can hinder the usefulness of ML in many critical applications, e.g. scientific discovery. It is therefore important to be able to measure diversity. Many diversity metrics have been proposed in ML, but these metrics are often domain-specific and limited in flexibility. These include metrics that define diversity in terms of a reference dataset (Heusel et al., 2017; Sajjadi et al., 2018), a pre-trained classifier (Salimans et al., 2016; Srivastava et al., 2017), or discrete features, like n-grams (Li et al., 2016).

In this paper, we propose a general, reference-free approach that defines diversity in terms of a user-specified similarity function. Our approach is based on work in ecology, where biological diversity has been defined as the exponential of the entropy of the distribution of species within a population (Hill, 1973; Jost, 2006; Leinster, 2021). This value can be interpreted as the effective number of species in the population. To adapt this approach to ML, we define the diversity of a collection of elements x_1, ..., x_n as the exponential of the entropy of the eigenvalues of the n × n similarity matrix K, whose entries are equal to the similarity scores between each pair of elements. This entropy can be seen as the von Neumann entropy associated with K (Bach, 2022), so we call our metric the Vendi Score, for the von Neumann diversity.
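The definition above can be sketched directly in code. The snippet below is a minimal illustration, not the authors' released implementation: it assumes the similarity function is symmetric with self-similarity equal to 1, builds the similarity matrix K, and normalizes it by n so its eigenvalues sum to 1 before taking the exponential of their Shannon entropy.

```python
import numpy as np

def vendi_score(samples, similarity):
    """Sketch of the Vendi Score: exp(entropy of eigenvalues of K / n).

    `similarity` is assumed symmetric with similarity(x, x) == 1.
    """
    n = len(samples)
    # Build the n x n similarity matrix K.
    K = np.array([[similarity(a, b) for b in samples] for a in samples])
    # Eigenvalues of K / n sum to 1 (trace is 1), so they form a distribution.
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 0]  # discard numerical zeros/negatives
    # Exponential of the Shannon entropy of the eigenvalues.
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

With n completely dissimilar elements (K is the identity), the score is n, the "effective number" of distinct elements; with n identical elements (K is all ones), the score is 1.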



¹ Code for calculating the Vendi Score will be made available publicly after the anonymity period.

