INDIVIDUALLY FAIR RANKING

Abstract

We develop an algorithm to train individually fair learning-to-rank (LTR) models. The proposed approach ensures items from minority groups appear alongside similar items from majority groups. This notion of fair ranking is based on the definition of individual fairness from supervised learning and is more nuanced than prior fair LTR approaches that simply ensure the ranking model provides underrepresented items with a basic level of exposure. The crux of our method is an optimal transport-based regularizer that enforces individual fairness and an efficient algorithm for optimizing the regularizer. We show that our approach leads to certifiably individually fair LTR models and demonstrate the efficacy of our method on ranking tasks subject to demographic biases.

1. INTRODUCTION

Information retrieval (IR) systems are everywhere in today's digital world, and ranking models are integral parts of many IR systems. In light of their ubiquity, issues of algorithmic bias and unfairness in ranking models have come to the fore of public attention. In many applications, the items to be ranked are individuals, so algorithmic biases in the output of ranking models directly affect people's lives. For example, gender bias in job search engines directly affects the career success of job applicants (Dastin, 2018). There is a rapidly growing literature on detecting and mitigating algorithmic bias in machine learning (ML). The ML community has developed many formal definitions of algorithmic fairness, along with algorithms to enforce them (Dwork et al., 2012; Hardt et al., 2016; Berk et al., 2018; Kusner et al., 2018; Ritov et al., 2017; Yurochkin et al., 2020). Unfortunately, these issues have received less attention in the IR community. In particular, compared to the myriad mathematical definitions of algorithmic fairness in the ML community, there are only a few definitions of algorithmic fairness for ranking. A recent review of fair ranking (Castillo, 2019) identifies two characteristics of fair rankings:

1. Sufficient exposure of items from disadvantaged groups: rankings should display a diversity of items. In particular, rankings should take care to display items from disadvantaged groups to avoid allocative harms to such items.

2. Consistent treatment of similar items: items with similar relevant attributes should be ranked similarly.

There is a line of work on fair ranking by Singh & Joachims (2018; 2019) that focuses on the first characteristic. In this paper, we complement this line of work by focusing on the second characteristic.
In particular, we (i) specialize the notion of individual fairness in ML to rankings and (ii) devise an efficient algorithm for enforcing this notion in practice. We focus on the second characteristic since, in some sense, consistent treatment of similar items implies adequate exposure: if there are items from disadvantaged groups that are similar to relevant items from advantaged groups, then a ranking model that treats similar items consistently will provide adequate exposure to the items from disadvantaged groups.
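To make the "consistent treatment" desideratum concrete, here is a toy sketch (our own illustration, not code from the paper) that checks whether a scoring function induces the same ranking after a sensitive feature — here a hypothetical binary gender indicator column — is flipped. The function names and the column convention are assumptions for illustration only.

```python
import numpy as np

def rank_order(score_fn, features):
    """Deterministic ranking induced by a scorer: item indices sorted
    by descending score."""
    scores = score_fn(features)
    return tuple(int(i) for i in np.argsort(-scores))

def is_invariant_to_flip(score_fn, features, sensitive_col):
    """Return True if flipping a binary sensitive column (a hypothetical
    gender indicator) leaves the induced ranking unchanged."""
    flipped = features.copy()
    flipped[:, sensitive_col] = 1.0 - flipped[:, sensitive_col]
    return rank_order(score_fn, features) == rank_order(score_fn, flipped)
```

A scorer that ignores the sensitive column passes this check, while one that puts weight on it generally does not; the individual fairness notion developed in this paper generalizes this idea from a single hard-coded flip to a fair metric on queries.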

1.1. RELATED WORK

Our work addresses the fairness of a learning-to-rank (LTR) system with respect to the items being ranked. The majority of work in this area requires a fair ranking to fairly allocate exposure (measured by the rank of an item in a ranking) to items. One line of work (Yang & Stoyanovich, 2017; Zehlike et al., 2017; Celis et al., 2018; Geyik et al., 2019; Celis et al., 2020; Yang et al., 2019b) requires a fair ranking to place a minimum number of minority group items in the top k ranks. Another line of work models the exposure items receive based on rank position and allocates exposure according to these exposure models and item relevance (Singh & Joachims, 2018; Zehlike & Castillo, 2020; Biega et al., 2018; Singh & Joachims, 2019; Sapiezynski et al., 2019). Unlike post-processing approaches, we propose an in-processing algorithm. Some work considers other fairness notions: Kuhlman et al. (2019) propose error-based fairness criteria, and the framework of Asudeh et al. (2019) can handle arbitrary fairness constraints given by an oracle. In contrast, we propose a fundamentally new definition: an individually fair ranking is invariant under sensitive perturbations of the item features. For example, consider ranking a set of job candidates, along with the hypothetical candidate set obtained from the original one by flipping each candidate's gender. We require a fair LTR model to produce the same ranking for both the original and the hypothetical set.

We consider individual fairness as opposed to group fairness (Yang & Stoyanovich, 2017; Zehlike et al., 2017; Celis et al., 2018; Singh & Joachims, 2018; Zehlike & Castillo, 2020; Geyik et al., 2019; Sapiezynski et al., 2019; Kuhlman et al., 2019; Celis et al., 2020; Yang et al., 2019b; Wu et al., 2018; Asudeh et al., 2019). The merits of individual fairness over group fairness are well established; e.g., group-fair models can be blatantly unfair to individuals (Dwork et al., 2012). In fact, we show empirically that individually fair models are also group fair, but not vice versa.

The works of Biega et al. (2018) and Singh & Joachims (2019) also consider individually fair LTR models. However, our notion of individual fairness is fundamentally different: we utilize a fair metric on queries, as in the seminal work that introduced individual fairness (Dwork et al., 2012), instead of measuring the similarity of items through relevance alone. To see the benefit of our approach, consider the job applicant example. If the training data does not contain highly ranked minority candidates, our LTR model will still be able to correctly rank a minority candidate who should be highly ranked at test time, which is not necessarily true for the approaches of Biega et al. (2018) and Singh & Joachims (2019).

2. PROBLEM FORMULATION

A query q ∈ Q to a ranker consists of a candidate set of n items to be ranked, d^q = {d^q_1, . . . , d^q_n}, and a set of relevance scores rel^q = {rel^q(d) ∈ R}_{d ∈ d^q}. Each item is represented by a feature vector ϕ(d) ∈ X describing the match between item d and query q, where X is the feature space of the item representations. We consider stochastic ranking policies π(· | q), which are distributions over rankings r (i.e., permutations) of the candidate set. Our notation for rankings is r(d), the rank of item d in ranking r (and r^{-1}(j) is the j-th ranked item). A policy generally consists of two components: a scoring model and a sampling method. The scoring model is a smooth ML model h_θ parameterized by θ (e.g., a neural network) that outputs a vector of scores h_θ(ϕ(d^q)) = (h_θ(ϕ(d^q_1)), . . . , h_θ(ϕ(d^q_n))). The sampling method defines a distribution on rankings of the candidate set from the scores. For example, the Plackett-Luce model (Plackett, 1975) defines the probability of the ranking r = (d_1, . . . , d_n) as

π_θ(r | q) = ∏_{j=1}^{n} exp(h_θ(ϕ(d_j))) / (exp(h_θ(ϕ(d_j))) + · · · + exp(h_θ(ϕ(d_n)))).   (2.1)
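The Plackett-Luce policy can be read as sampling items without replacement, each draw proportional to exp(score). A minimal sketch of this view (the function names are ours, not from the paper):

```python
import math
import random

def plackett_luce_prob(scores, ranking):
    """Probability of `ranking` (a tuple of item indices, best first)
    under the Plackett-Luce model with the given scores, i.e. the
    product over positions j of exp(score_j) / sum of exp(score) over
    the items not yet placed — cf. Eq. (2.1)."""
    w = [math.exp(s) for s in scores]
    prob = 1.0
    remaining = list(ranking)
    for item in ranking:
        prob *= w[item] / sum(w[i] for i in remaining)
        remaining.remove(item)
    return prob

def sample_ranking(scores, rng=random):
    """Draw one ranking: repeatedly sample an item without replacement
    with probability proportional to exp(score)."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        w = [math.exp(scores[i]) for i in remaining]
        item = rng.choices(remaining, weights=w)[0]
        ranking.append(item)
        remaining.remove(item)
    return tuple(ranking)
```

As a sanity check, the probabilities of all n! permutations sum to 1, and with two equally scored items each ordering has probability 1/2.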

