RS-FAIRFRS: COMMUNICATION EFFICIENT FAIR FEDERATED RECOMMENDER SYSTEM

Abstract

Federated Recommender Systems (FRSs) aim to provide recommendations to clients in a distributed manner while preserving privacy. FRSs suffer from high communication costs because the server must communicate with many clients. Some past literature on federated supervised learning shows that sampling clients randomly improves communication efficiency without jeopardizing accuracy. However, each user is considered a separate client in an FRS, and clients communicate only item gradients. Thus, incorporating random sampling and determining the number of clients to be sampled in each communication round so as to retain the model's accuracy becomes challenging. This paper provides sample complexity bounds on the number of clients that must be sampled in an FRS to preserve accuracy. Next, we consider the issue of demographic bias in FRSs, quantified as the difference in the average error rates across different groups. Supervised learning algorithms mitigate group bias by adding a fairness constraint to the training loss, which requires sharing protected attributes with the server. This is prohibited in a federated setting to ensure clients' privacy. We design RS-FAIRFRS, a Random Sampling based Fair Federated Recommender System, which trains to achieve a fair global model. In addition, it trains local clients towards the fair global model to reduce demographic bias at the client level without requiring them to share their protected attributes. We empirically demonstrate on two widely used real-world datasets (ML1M, ML100K) and different sensitive features (age and gender) that RS-FAIRFRS helps reduce communication cost and demographic bias with improved model accuracy.

1. INTRODUCTION

Recommender systems (RSs) have a wide variety of applications in online platforms such as e-business, e-commerce, e-learning, e-tourism, and music and movie recommendation engines (Lu et al., 2015). Traditional RSs require gathering clients' private information at a central server, leading to serious privacy and security risks. Because of the increased storage and processing power of edge devices, ML models can now be trained locally. This has led to federated learning (FL) (McMahan et al., 2017), which allows clients to share their updates with the server without any data transfer. The server proposes a common model, which is communicated to all clients. Using their data and the global model, clients train locally and communicate the updated model to the server. FL has found many applications in the past few years, e.g., Google keyboard query suggestion (Yang et al., 2018), smartphone voice assistants, mobile edge computing, and visual object detection (Aledhari et al., 2020). These applications face numerous challenges including communication efficiency (Smith et al., 2018), statistical heterogeneity (Smith et al., 2017), systems heterogeneity (Bonawitz et al., 2019), privacy, personalization, fairness (Kairouz et al., 2021), and many more. This paper focuses on two primary issues: communication efficiency and demographic bias in FRSs. Unlike other applications of FL, where one client holds the data of many users, in an FRS each user acts as one client holding that user's profile. FedRec (Lin et al., 2021), an FRS, expects all the clients to train in parallel using matrix factorization (MF). In each communication round, the server aggregates the model updates from a huge number of local clients to obtain a global model, and this global model is then sent back to all the clients. This whole procedure increases the communication cost.
We show that random sampling of clients in each communication round reduces the communication cost even when only the item gradients of sampled users are communicated. Theoretically, we provide bounds on an ideal fraction of clients to be sampled to maintain the model's accuracy. Proving sample complexity bounds is non-trivial as the clients may possess non-IID data. To circumvent this issue, we assume an underlying clustering structure on the clients such that clients within a cluster share similar item vectors. The main novelty lies in proving that random sampling fetches enough representation from each cluster, and that the predicted ratings obtained after sampling a small number of clients are, with high probability, not far from the predicted ratings obtained after communicating with all clients in all rounds.
Fairness in FRSs is a critical yet under-investigated area. We show empirically that FedRec offers better recommendations to a particular group of clients. This unfair treatment can reinforce social stereotypes based on gender, race, or age, with significant repercussions. So far, researchers have studied fairness in the domain of centralized RSs (Li et al., 2022). Many past works (Islam et al., 2019; Yao & Huang, 2017; Li et al., 2021; Yang et al., 2020a) develop bias mitigation strategies for traditional RSs that require sharing sensitive attributes with the server, which causes privacy leakage in the federated setting. In the FL framework, Yue et al. (2021), Kanaparthy et al. (2021), Du et al. (2020), and Zhang et al. (2020) ameliorate bias in the classification setting, where each client possesses the data of many users and can thus train for fairness in each communication round. In contrast, in an FRS each user acts as one client that sends its item gradients to the server after updating the user vectors and item gradients locally. This makes it extremely difficult to train locally towards fairness.
We propose a dual-fair vector update technique with two phases. In Phase 1, the server aggregates the received item vectors and trains them towards fairness on a small fraction of data. Even if the global model is fair, local client updates may result in a heavily biased model. Thus in Phase 2, the clients minimize local error and learn item vectors closer to the globally fair vectors. In summary, our work aims at reducing the communication bottleneck and mitigating group bias in the Federated Recommendation system (FedRec) (Yang et al., 2020b) for the first time. We list our main contributions below:
1. We provide sample complexity bounds on the fraction of clients required for maintaining accuracy within the desirable limit in Theorem 4.1. Our experiments show that sampling this many clients reduces communication costs in FRS without affecting accuracy, even when the clients do not disclose their user vectors and share only updated item gradients.
2. We show the existence of group bias in FRS, quantified by evaluating discrepancies in the average group losses for each sensitive attribute. To mitigate this issue, we propose a novel dual-fairness vector update technique that tackles group fairness at the local as well as the global level.
3. Combining the ideas of random sampling and dual-fairness vector update, we propose RS-FAIRFRS, a novel FRS model which provides communication efficiency and improved fairness as well as accuracy.
4. We show that RS-FAIRFRS mitigates demographic bias and improves accuracy via extensive experimentation on two widely used datasets, ML1M and ML100K, with different demographics (age and gender).

2. RELATED WORK

We divide the literature review into four sections: (i) federated recommender systems (FRS), (ii) client sampling in federated learning, (iii) fairness in centralized RSs, and (iv) fair federated learning models. We emphasize that there does not exist any work which targets fairness in FRS.
(2021) conducted extensive experiments to elicit significant challenges such as generalization issues, diminishing returns, training failures, and fairness concerns due to using large cohorts in FL models. Balakrishnan et al. (2022) adopt a greedy strategy to select clients that represent the overall population and provide convergence guarantees for it. Fraboni et al. (2021a) use clustered sampling for better client representation and reduced variance of the stochastic aggregation weights, and Chen et al. (2020) restrict the number of clients allowed to communicate their updates to the server. Finally, Fraboni et al. (2021b) study the impact of client sampling on the convergence of FL models. While all the above methods assume unbiased client selection, another line of work by Cho et al. (2020) offers the first federated optimization convergence study for biased client selection techniques. FRSs differ from classification settings in that clients communicate only item gradients to the server. None of the above papers provides bounds on the ideal number of clients to sample during communication, even in the federated learning setting. This paper, for the first time, provides sample complexity bounds for FRSs.
Group Fairness in RSs: Xiao et al. (2017) study group fairness by maximizing the satisfaction of each group while minimizing the unfairness between them. Fu et al. (2020) propose a fairness-constrained approach via heuristic re-ranking to mitigate group bias in explainable RSs. Li et al. (2021) categorize clients into advantaged and disadvantaged groups according to their activity level and provide a re-ranking approach to debias the recommendations. Beutel et al. (2019) introduce novel metrics using pairwise comparisons to reason about bias and offer a regularizer that encourages improving the corresponding metric. Yao & Huang (2017) formalize four novel metrics to quantify demographic bias and introduce a regularizer term in the objective function to mitigate it; this approach is somewhat similar to Padala & Gujar (2020). All these methods require the sensitive features of clients to produce fair recommendations, leading to privacy leakage in federated settings. A few fair approaches (Edizel et al., 2020; Bobadilla et al., 2020) do not require sensitive attributes for mitigating bias but learn disparity from data while training, so as to give unbiased recommendations to clients whose demographics are unknown. However, FedRec only permits sharing item vector updates with the server, which makes these techniques inapplicable to building a fair FRS. Unlike all these approaches, our model RS-FAIRFRS uses an in-processing method that neither disturbs the original data nor requires information leakage to the server.
Group Fairness in Federated Learning: Papadaki et al. (2022) provide an optimization algorithm to improve group fairness with performance guarantees similar to those of centralized ML models. Kanaparthy et al. (2021) propose four heuristics for fair federated classification models, considering balanced and heterogeneous data cases separately. Yue et al. (2021) propose GI-Fair to tackle group and individual fairness in FLSs by using a regularization term to penalize the spread in the aggregated loss. Recent work by Hu et al. (2022) uses the concept of bounded group loss to provide theoretical guarantees on group fairness.
All these works were proposed to address demographic fairness in a federated classification setting. They cannot be applied to FRS because in FRS each user acts as one client, unlike classification, where one client can hold the data of many users. Unlike all other methods, RS-FAIRFRS is the first algorithm to provide a dual-fairness vector update in FRS: it learns locally towards the global fair model for local fairness, and achieves global fairness by training the aggregated vectors towards fairness.
FedRec: In a typical FRS with explicit feedback, we have $n$ users (or clients), $u \in \{1, 2, \ldots, n\}$, and $m$ items, $i \in \{1, 2, \ldots, m\}$. Each client $u$ has its rating vector $[r_{ui}]_{i=1}^{m}$, where $r_{ui}$ is the rating given by client $u$ to item $i$. The true and predicted ratings are denoted by $r_{ui}$ and $\hat{r}_{ui}$, respectively. We assume that $r_{ui} = 0$ if the client has not rated an item; $p_{ui} \in \{0, 1\}$ acts as an indicator variable for rated and unrated items. FedRec uses matrix factorization, which identifies the latent structure behind the data, to generate two matrices $U \in \mathbb{R}^{n \times k}$ and $V \in \mathbb{R}^{m \times k}$ such that each client $u$ is associated with a vector $U_u \in \mathbb{R}^{1 \times k}$, called the user vector, and each item $i$ is associated with a vector $V_i \in \mathbb{R}^{1 \times k}$, called the item vector. The predicted rating of the $i$-th item by the $u$-th user is computed as $\hat{r}_{ui} = U_u V_i^T$. The goal of FedRec is then to learn the user vectors locally and the item vectors globally so as to minimize the loss function
$$L_{MF} = \sum_{u \in [n]} \sum_{i \in [m]} p_{ui} (r_{ui} - U_u V_i^T)^2 + \lambda_r (\|V_i\|^2 + \|U_u\|^2) \qquad (1)$$
FedRec aims at predicting the rating of a client $u$ for each item $i$ without the clients sharing their rating behaviors or records. For this, some unrated items are randomly sampled using the sampling parameter $\rho$ and assigned virtual ratings. Then, the item gradients $\nabla V(u, i)$ for both truly and virtually rated items are shared with the server.
The server then aggregates these gradients and sends the aggregated item vectors back to all the clients. Additionally, each client computes the user gradient $\nabla U_u$ locally, which is not shared with the server. The next section proposes RS-FAIRFRS, which addresses communication inefficiency in FRS and mitigates demographic bias.
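As a minimal sketch of the objective in equation 1, the snippet below evaluates the regularized squared error on a toy rating matrix. It assumes dense Python lists, treats a rating of 0 as unrated (p_ui = 0), and accumulates the regularizer once per observed rating, which is only one reading of the formula. The function name `mf_loss` and the toy data are hypothetical.

```python
def mf_loss(R, U, V, lam):
    """Regularized squared error over rated entries (r_ui != 0 means rated)."""
    total = 0.0
    for u, row in enumerate(R):
        for i, r in enumerate(row):
            if r == 0:  # p_ui = 0: unrated item, contributes nothing
                continue
            pred = sum(a * b for a, b in zip(U[u], V[i]))  # U_u . V_i^T
            # Assumption: regularizer applied per observed (u, i) pair.
            reg = sum(x * x for x in U[u]) + sum(x * x for x in V[i])
            total += (r - pred) ** 2 + lam * reg
    return total


# Toy example: 2 users, 2 items, k = 2 latent features.
R = [[5, 0], [0, 3]]
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[5.0, 0.0], [0.0, 3.0]]
loss_no_reg = mf_loss(R, U, V, lam=0.0)   # predictions match the ratings exactly
loss_with_reg = mf_loss(R, U, V, lam=0.1)
```

In this toy case the factorization reproduces both observed ratings, so only the regularization term contributes when `lam` is non-zero.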

4. PROPOSED METHODOLOGY

To reduce the communication cost, in each communication round we randomly sample the clients that communicate $\nabla V(u, i)$ to the server. Random sampling of clients has been proposed in the literature on supervised federated learning (Charles et al., 2021; Fraboni et al., 2021b; Cho et al., 2020). However, since each client is a separate user in FRS and a user shares only the item gradients with the server, it is not clear whether random sampling can reduce the communication cost of a federated recommender system without affecting its accuracy. More importantly, the main question is how many clients we should sample in each communication round. To answer this question, the next section provides sample complexity bounds on the number of clients that must be sampled to obtain the desired accuracy.
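The sampling step described above can be sketched as a single server round. This is an illustration under stated assumptions, not the authors' implementation: gradients are summed over the sampled clients, each client's upload is modelled as a dictionary from item id to gradient vector, and the helper names (`sample_clients`, `aggregate_item_gradients`, `server_round`) are hypothetical.

```python
import random


def sample_clients(n, tau, rng):
    """Uniformly sample round(tau * n) distinct client ids (without replacement)."""
    size = max(1, round(tau * n))
    return rng.sample(range(n), size)


def aggregate_item_gradients(client_grads, sampled, m, k):
    """Sum the item gradients uploaded by the sampled clients only."""
    agg = [[0.0] * k for _ in range(m)]
    for u in sampled:
        for i, grad in client_grads[u].items():  # {item id: gradient vector}
            for d in range(k):
                agg[i][d] += grad[d]
    return agg


def server_round(V, client_grads, tau, gamma, rng):
    """One round: sample clients, aggregate their gradients, update item vectors."""
    n, m, k = len(client_grads), len(V), len(V[0])
    sampled = sample_clients(n, tau, rng)
    agg = aggregate_item_gradients(client_grads, sampled, m, k)
    V_new = [[V[i][d] - gamma * agg[i][d] for d in range(k)] for i in range(m)]
    return V_new, sampled


# Toy example: 10 clients, each uploading the same gradient for item 0.
rng = random.Random(0)
client_grads = [{0: [1.0, 1.0]} for _ in range(10)]
V = [[0.0, 0.0], [0.0, 0.0]]
V_new, sampled = server_round(V, client_grads, tau=0.3, gamma=0.1, rng=rng)
```

With `tau = 0.3`, only three of the ten clients upload gradients in this round, so the per-round upload volume scales with `tau * n` rather than `n`.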

4.1. RANDOM SAMPLING WITHOUT REPLACEMENT

Unlike FedRec (Lin et al., 2021), where the server aggregates item vectors after all the clients have sent their updates, in RS-FAIRFRS the server uniformly samples a $\tau$ fraction of the $n$ clients. It is well known that the users in FL are non-IID. However, users who provide ratings to items tend to possess an underlying clustered structure. Various algorithms (Koren et al., 2009; Gupta et al., 2020) work by identifying the latent patterns of users to provide recommendations, and within a cluster, users are IID. We aim to utilize this homogeneity within clusters without knowing the $K$ clusters or the users belonging to each cluster. For random sampling to work, it is important that the sampled set of clients $C_\tau$ represents the entire population. In the first result, we show that when a certain number of clients are sampled uniformly at random, they represent each cluster equally, which ensures that the sampled set is enough to represent the entire population.
Lemma 1 Suppose $n$ clients are uniformly distributed amongst $K$ clusters. Then, a subset $S \subseteq [n]$ sampled uniformly at random (without replacement) will contain an approximately equal number of clients from each cluster.
We use Hoeffding's bound (Serfling, 1974), which holds for sampling without replacement but provides a very loose bound. Let $X_i^j \in \{0, 1\}$ denote the random variable taking the value 1 when the $i$-th sample belongs to cluster $j$ and 0 otherwise. Then, using Hoeffding's bound, we get
$$P\left(\left|\sum_i X_i^j - \frac{|S|}{K}\right| \geq \epsilon\right) \leq 2 \exp\left(-\frac{2\epsilon^2}{|S|}\right)$$
If we take $|C_\tau| = 2100$, i.e., 35% of the total number of clients (6000), then we get this probability to be roughly 0.62, which is obtained at $K = 10$. Hoeffding's inequality provides a very loose bound; the actual probability is very high, as is evident from some basic experiments provided in the Appendix.
Lemma 2 Given $n$ clients during training, let $\tau$ represent the fraction of clients sampled for each communication round. Let $V_i^\tau = \frac{1}{n\tau} \sum_{i \in C_\tau} V_i$ denote the average of item vectors over the $C_\tau$ sampled clients and $V_i^n = \frac{1}{n} \sum_{i=1}^{n} V_i$ the average of item vectors over all $n$ clients. Then,
$$E[U_u^T V_i^\tau] = E[U_u^T V_i^n]$$
Here, $U_u^T V_i^\tau$ and $U_u^T V_i^n$ denote the predicted rating of any user $u$ when the aggregated item vectors are obtained only from the sampled users and from all the clients in the training set, respectively. This lemma holds trivially if the underlying clients are homogeneous, which is not true in FL. Thus we use the latent clustering assumption and Lemma 1 to prove that the expected predicted ratings under the sampled clients and under the entire population are equal. The detailed proof is provided in the Appendix. Now, we state our main theorem:
Theorem 4.1 (Random Sampling of Clients) Given a rating matrix $R$, let $\langle \{U_u\}_{u=1}^{n}, \{V_i\}_{i=1}^{m} \rangle$ denote a Federated Recommendation Model with predicted ratings lying within a range of $[a, b]$. If $V_i^\tau = \frac{1}{n\tau} \sum_{i \in C_\tau} V_i$ and $V_i^n = \frac{1}{n} \sum_{i=1}^{n} V_i$ represent the average of item vectors over a $\tau$ fraction of clients and over all $n$ clients, respectively, then
$$P\left(|U_u^T V_i^\tau - U_u^T V_i^n| \geq \epsilon\right) \leq 2 \exp\left(-\frac{n\tau\epsilon^2}{2(b - a)^2}\right)$$
The above theorem can be proved using Hoeffding's bound and Lemma 2. According to the theorem, if the ratings lie in $[1, 5]$, then the probability that the error in the predicted rating is more than 10% is less than 1% with just 35% of the clients from a pool of 6000 clients.
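To make the bound of Theorem 4.1 concrete, the following sketch evaluates its right-hand side, 2 exp(-n τ ε² / (2(b - a)²)), and inverts it to get the smallest τ for a target failure probability. The interpretation of ε as an absolute deviation in rating units, the example value ε = 0.25, and the function names are assumptions made for illustration.

```python
import math


def hoeffding_bound(n, tau, eps, a, b):
    """Right-hand side of Theorem 4.1: 2 * exp(-n*tau*eps^2 / (2*(b-a)^2))."""
    return 2.0 * math.exp(-n * tau * eps ** 2 / (2.0 * (b - a) ** 2))


def min_fraction(n, eps, delta, a, b):
    """Smallest tau for which the bound is at most delta (capped at 1)."""
    tau = 2.0 * (b - a) ** 2 * math.log(2.0 / delta) / (n * eps ** 2)
    return min(1.0, tau)


# Illustrative setting: 6000 clients, ratings in [1, 5], 35% sampled.
# eps = 0.25 is an assumed tolerance (5% of the maximum rating of 5).
bound = hoeffding_bound(n=6000, tau=0.35, eps=0.25, a=1, b=5)
tau_needed = min_fraction(n=6000, eps=0.25, delta=0.05, a=1, b=5)
```

Under these assumed values the bound is about 0.033, and a sampling fraction a little above 0.31 already brings the bound down to 0.05.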
Therefore, following our theorem, for a dataset with around 6000 clients we choose $\tau = 0.35$ in our experiments. Our main contribution lies in showing that, while bounding the sample complexity is in general a hard problem, a clustering assumption on the underlying data makes it possible to provide non-trivial bounds. Without this assumption, a straightforward use of Hoeffding's inequality gives a trivial bound of 100% on the sample complexity, whereas we need only 30%. Note that, for the theoretical analysis, we assume the existence of clusters of item vectors with an almost equal number of vectors in each cluster. We verify this experimentally in the Appendix. Furthermore, the clustering only aids in acquiring a bound on the ideal fraction of clients to be sampled. The fairness of RS-FAIRFRS is independent of any clustering, with or without groups (age/gender). Since private clustering is an open challenge in FL and forming clusters requires sharing sensitive attributes, RS-FAIRFRS ensures privacy for most of its users by hiding their demographics from the server.
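The cluster-representation claim of Lemma 1 can be probed with a small simulation. This is a sketch under the lemma's own assumption of equal-sized clusters, using synthetic cluster labels rather than the paper's data. The function name `cluster_counts` and the seed are hypothetical.

```python
import random
from collections import Counter


def cluster_counts(n, K, sample_size, rng):
    """Sample clients uniformly without replacement and count them per cluster."""
    clusters = [c % K for c in range(n)]  # equal-sized clusters (assumption)
    sampled = rng.sample(range(n), sample_size)
    return Counter(clusters[c] for c in sampled)


# Setting taken from the text: 6000 clients, K = 10 clusters, 2100 sampled.
rng = random.Random(0)
counts = cluster_counts(n=6000, K=10, sample_size=2100, rng=rng)
expected = 2100 / 10  # |S| / K clients per cluster if representation were exact
max_dev = max(abs(counts[j] - expected) for j in range(10))
```

Here `max_dev` is the largest gap between a cluster's sampled count and |S|/K, which is the quantity bounded by ε in the Hoeffding inequality above.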

4.2. DUAL-FAIR UPDATE

This section first discusses why existing fairness notions in RSs cannot measure group bias in a federated setting. We then propose a fairness metric which, when added as a constraint to the optimization function at the server, helps achieve a fair global model. Furthermore, we discuss a two-phase mechanism which helps in achieving global as well as local fairness. Existing fairness notions in RS, namely value unfairness, absolute unfairness, and non-parity fairness (Yao & Huang, 2017), consider the difference in average loss on a specific item $i$ between users belonging to advantaged and disadvantaged groups. This is not feasible in FL, as each client is distinct and prohibited from sharing its ratings with the server and other clients. Moreover, Mansoury et al. (2020) mention profile size as an important factor in group bias, as more active clients receive better recommendations. We define user activeness as the number of items rated by a user, i.e., a user is more active if he or she has rated more items $I_u$ out of all the items in the item set $I$. Thus, assuming a binary attribute, our metric accuracy parity ($L_{ap}$) normalizes the sum of squared losses over all the items rated by a client and is given as:
$$L_{ap} = \frac{1}{|g|} \sum_{u \in g} \frac{1}{|I_u|} \sum_{i \in I_u} (r_{ui} - \hat{r}_{ui})^2 - \frac{1}{|\neg g|} \sum_{u \in \neg g} \frac{1}{|I_u|} \sum_{i \in I_u} (r_{ui} - \hat{r}_{ui})^2 \qquad (2)$$
Here, $g$ and $\neg g$ denote the disadvantaged and advantaged groups, respectively. $I_u$ is the set of items rated by $u$, and $\hat{r}_{ui}$ and $r_{ui}$ denote the predicted and true ratings, respectively. For models $\theta$ and $\hat{\theta}$, we say that $\theta$ is a demographically fairer model if $L_{ap}(\theta) < L_{ap}(\hat{\theta})$. Designing a fair FRS involves tackling three significant challenges. (i) Designing debiasing techniques for local clients becomes challenging if they are reluctant to share their sensitive attributes (Ezzeldin et al., 2021). (ii) In settings where each user is not one client, local fairness does not ensure global fairness in FL. The non-IID data in FL makes it impossible for the entire distribution to be represented by one standard distribution. (iii) To achieve fairness, simple federated models usually perform weighted aggregation; however, it becomes infeasible to assign weights to the updates sent to the server without knowing the group to which they belong. The Dual-Fair Update involves training towards fairness in the two phases described below.
FairMF: We assume the availability at the server of the data $D_{server}$ of very few (20%) clients for evaluating the fairness loss. This assumption was also made by Kanaparthy et al. (2021) for building a fair federated model for face classification. Previous works considered 5% of the overall data on the server. Choosing 5% of the overall data in RSs may lead to the privacy leakage of more users. Hence, instead of selecting 5% of the entire population, we select the data of only 20% of the users. This assumption helps achieve fairness without expecting clients to reveal their private sensitive attributes to the server, even during training. Therefore, unlike a simple FRS, the server in RS-FAIRFRS not only aggregates but also helps achieve fairness. We denote each client at the server as $s \in \{1, \ldots, S\}$ such that $S \ll n$, i.e., the number of clients at the server is always much smaller than the number of local clients. FairMF trains on the server data $D_{server}$ towards the global fairness objective. The goal of FairMF is to optimize a loss function defined as the combination of regularized MF (equation 1) and the fairness penalty (equation 2). The final loss function is
$$\min_{U,V} L_{MF} + \lambda_f L_{ap} \qquad (3)$$
The hyperparameter $\lambda_f$ acts as a fairness penalizer. The update equations for the client vector $U_u$ and item vector $V_i$ are obtained by taking derivatives of the fairness loss function with respect to the client and item vectors, respectively. The server runs FairMF for $t_s$ iterations and obtains the final $U_{fair}$ and $V_{fair}$.
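The accuracy parity metric of equation 2 can be sketched as follows. The snippet assumes ratings and predictions are stored as nested dictionaries keyed by user and item, that every user has rated at least one item, and it returns the signed difference exactly as the equation is written. The function names and toy data are hypothetical.

```python
def mean_user_loss(users, ratings, preds):
    """Average over users of the per-user mean squared error on rated items."""
    per_user = []
    for u in users:
        items = ratings[u]  # {item id: true rating}, i.e. the set I_u
        err = sum((r - preds[u][i]) ** 2 for i, r in items.items())
        per_user.append(err / len(items))  # normalize by |I_u|
    return sum(per_user) / len(per_user)   # normalize by group size


def accuracy_parity(group, other, ratings, preds):
    """L_ap: loss of the disadvantaged group g minus loss of the group not-g."""
    return mean_user_loss(group, ratings, preds) - mean_user_loss(other, ratings, preds)


# Toy example with three users; users 0 and 2 form the disadvantaged group.
ratings = {0: {0: 4, 1: 2}, 1: {0: 5}, 2: {1: 3}}
preds = {0: {0: 3, 1: 2}, 1: {0: 5}, 2: {1: 2}}
l_ap = accuracy_parity([0, 2], [1], ratings, preds)
```

Normalizing by |I_u| before averaging over users keeps very active users from dominating their group's loss, which matches the motivation given for the metric.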
Finally, $V_{fair}$ and $[V_i]_{i=1}^{m}$ (the aggregated item vectors) are communicated to all the clients. We provide the exact procedure for FairMF in Algorithm 1.
FO-ClientBatch: Fairness Oriented Client Batch ensures local fairness by learning locally towards the fair global model. Each client downloads both $V_{fair}$ and $V_i$ from the server. While $V_{fair}$ contributes towards fairness, the aggregated $V_i$ gives each user the benefit of other users' participation. Since global fairness does not ensure local fairness in FL models, it is important that each local client also trains towards the fair model. Each user vector is updated locally, followed by an update of $V_i$ towards $V_{fair}$. Thus the local objective function changes to
$$\min_{U,V} L_{MF} + \eta \|V_{fair} - V_i\|^2 \qquad (4)$$
Clients keep $\nabla U_u$ to themselves and communicate only $\nabla V$ to the server. The section below describes the detailed algorithm and communication details.

Algorithm 1 FairMF($D_{server}$, $U_{fair}$, $V$)
    for $T_s = 1, 2, \ldots, t_s$ do
        for each $(s, i)$ in $D_{server}$ do
            if $r_{ui} \neq 0$, update $U_s$ and $V_i$ using the gradients obtained by differentiating equation 3
        end for
    end for
    $U_{fair} \leftarrow U_s$, $V_{fair} \leftarrow V_i$
    return $U_{fair}$, $V_{fair}$

Algorithm 2 ClientFilling($V_i$, $i = 1, 2, \ldots, m$; $U_u$; $u$; $t$)
    if strategy == HF
    ...

Server-side training loop (only the closing steps are available):
    ...
            Calculate the aggregated gradient $\nabla V_i$ for clients in $C_\tau$
            Update $V_i = V_i - \gamma \nabla V_i$
        end for
        Decrease the learning rate $\gamma \leftarrow 0.9\gamma$
    end for

A $\tau$ fraction of the total clients present in the training dataset ($D_{train}$) is randomly sampled by the server; this set is denoted by $C_\tau$. The assumption of $D_{server}$ helps in obtaining fair item vectors $V_{fair}$, which are communicated to all the clients for local fairness. Along with these, $V$ is also sent to each client to retain the federated properties of FedRec and to let clients benefit from the participation of other clients.
In the initial round, the initialized item vectors are communicated; as training proceeds, $V$ is updated by the aggregated item vectors. Further, the server aggregates the item vectors sent by the $C_\tau$ clients only. The fair item vectors $V_{fair}$ and the aggregated item vectors communicated by the server are received by each client $u$, and local training happens at all the clients. As in FedRec, using UA, every client receives some ($\rho$) virtual ratings. In UA, some unrated items are sampled and a virtual rating $r'_{ui}$ is assigned to each of them. In HF, after a certain amount of time $T_{predict}$, the virtual rating is replaced by the predicted rating. To acquire the predicted rating, each client evaluates its user gradients and then updates its user vector. This procedure is called Client Filling (Lin et al., 2021) and is explained in Algorithm 2. The user and item gradients are evaluated by differentiating equation 4. The proposed local objective minimizes the squared distance between the global fair item vectors and the local ones. The updated item gradients are uploaded to the server, which trains FairMF on $D_{server}$ and obtains $U_{fair}$ and $V_{fair}$. This procedure is repeated until convergence.
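A client's local step under the objective of equation 4 can be sketched as below. The gradients are derived here by differentiating the per-rating squared error, an L2 regularizer with weight `lam`, and the proximity term η‖V_fair − V_i‖². Applying the regularizer per rated item, the plain gradient step on the user vector, and the function names are assumptions for illustration rather than the authors' exact update rules.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def local_step(U_u, V, V_fair, ratings, lam, eta, gamma):
    """One pass over a client's rated items.

    Returns the updated user vector (kept on the device) and the item
    gradients that would be uploaded to the server.
    """
    item_grads = {}
    for i, r in ratings.items():
        err = r - dot(U_u, V[i])
        # d/dU of (r - U.V)^2 + lam * ||U||^2
        grad_u = [-2 * err * v + 2 * lam * u for u, v in zip(U_u, V[i])]
        # d/dV of (r - U.V)^2 + lam * ||V||^2 + eta * ||V_fair - V||^2
        grad_v = [-2 * err * u + 2 * lam * v + 2 * eta * (v - vf)
                  for u, v, vf in zip(U_u, V[i], V_fair[i])]
        item_grads[i] = grad_v
        U_u = [u - gamma * g for u, g in zip(U_u, grad_u)]  # never leaves the client
    return U_u, item_grads


# Case 1: prediction is exact, so only the pull towards V_fair is non-zero.
U1, g1 = local_step([1.0], [[2.0]], [[1.0]], {0: 2.0}, lam=0.0, eta=1.0, gamma=0.1)
# Case 2: no fairness term, so the gradients come from the rating error alone.
U2, g2 = local_step([1.0], [[2.0]], [[2.0]], {0: 3.0}, lam=0.0, eta=0.0, gamma=0.1)
```

The term `2 * eta * (v - vf)` is what moves the uploaded item gradient towards the fair item vectors. With `eta = 0` the step reduces to an ordinary matrix factorization update.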

5. EXPERIMENTS

We empirically show that (i) random sub-sampling of clients reduces communication costs without any drop in the accuracy of an FRS, (ii) FedRec suffers heavily from demographic bias, and (iii) RS-FAIRFRS significantly reduces bias without compromising clients' privacy and improves accuracy.

5.1. EXPERIMENTAL SET-UP

We use two benchmark datasets, ML1M and ML100K, with explicit ratings ranging from 1 to 5. The ML1M dataset consists of 1,000,209 ratings of 3,706 movies by 6,040 users, and ML100K consists of 943 users with 100,000 ratings for 1,682 items. ML1M has 4,331 males and 1,709 females; also, there are 5,818 users with age > 18 and 212 users with age < 18. Similarly, in ML100K, users aged above 18 (889) far outnumber those under 18 (54). Furthermore, ML100K has only 273 females as compared to 670 males. All our experiments use RMSE (Root Mean Square Error), as used by Lin et al. (2021), to provide a fair comparison of the accuracy of our model with FedRec. We also provide an analysis of group losses, calculated as the average RMSE score of all clients belonging to a certain group. Finally, we use $L_{ap}$ (equation 2) to evaluate demographic bias in all the experiments. We split each dataset randomly, keeping 20% of the data for the test set and the rest for training. Similar to Lin et al. (2021), we use $k = 20$ latent features and set the sampling parameter to $\rho = 2$. We run all the models until convergence ($T = 20$), and the FairMF procedure is executed for $t_s = 15$ runs. We report all our results as an average over 10 runs. The final tuned hyperparameter values and other details are provided in the Appendix. While it may seem that more data on the server will yield better performance, we run all our experiments on RS-FAIRFRS with 100% and with 20% of the data on the server to show that RS-FAIRFRS with only 20% of the data provides much better accuracy as well as fairness than with all the data on the server. For the exact numbers of the training and testing results, refer to the Appendix. We now present the results through various graphical representations.

5.2. EMPIRICAL EVALUATION

Random Sampling of Clients without Replacement: We randomly sample a $\tau$ fraction of all the available clients. The model's accuracy is analyzed for a wide range of $\tau$ values in Figure 1a. Our model has a very high error for small $\tau$ values, which indicates a failure to converge; with around 35% of the clients, the model performs nearly the same as with 100% of the clients. From Theorem 4.1, it can be shown that the predicted rating of each client with sampled item vectors will be within 5% error with a probability of at least 0.96. In line with the findings of Charles et al. (2021), increasing $\tau$ increases the training loss. Figure 1b presents the average time taken (in seconds) for one communication round by FedRec with different values of $\tau$. Based on these observations, for all our experiments we select $\tau = 0.35$ as an ideal value which provides reasonable accuracy in much less time.

Disparate treatment of FedRec for certain sensitive attributes:

To demonstrate the biased treatment of FedRec towards certain sensitive groups, we compute average losses on each sensitive attribute. Figure 2 shows that the groups with more clients (advantaged groups) enjoy more accurate recommendations, whereas the groups with fewer clients (disadvantaged groups) receive more erroneous recommendations. In both datasets, clients with age > 18 enjoy recommendations with lower RMSE than those with age < 18. Similarly, females in both datasets obtain recommendations with much higher RMSE than males.

Fairness of RS-FAIRFRS:

To analyze the fairness of RS-FAIRFRS, we compare our results with FedRec, RS-FedRec (FedRec with sampling), and RS-FAIRFRS with 100% of the data at the server. We also provide bias results on MF to show that FedRec amplifies the bias due to aggregation. In Figure 3, we show that having only 20% of the clients' data at the server achieves better fairness than 100%, because our model uses random sampling of clients in each round. The sampled clients make use of the server dataset to obtain fair results. With 100% of the data on the server, many outliers are included, which worsens the fairness of the sampled clients and leads to poor fairness as well as accuracy. Our results in the Appendix show that even with only 25% and 50% of the items at the server, our model is able to reduce bias. Further, it is important to note that our model uses randomly initialized user and item vectors. As training proceeds towards a minimum, the random vectors are updated to obtain the closest rating predictions. Thus, the curve smooths over time, but the initial readings can fluctuate strongly. It is evident from all four graphs in Figure 3 that RS-FAIRFRS with 20% of the data on the server attains the least bias on both datasets for all the sensitive attributes. Thus, RS-FAIRFRS (20%) significantly reduces the bias in FedRec without leaking any sensitive information of most of the users during training.
Fairness vs Accuracy in RS-FAIRFRS: RS-FAIRFRS reduces group bias and generates more accurate recommendations. We compute RMSE as well as $L_{ap}$ to show the performance of our model in both respects. Figure 4 shows that RS-FAIRFRS (20%) provides fairer and more accurate recommendations to the clients in both datasets, for both genders as well as for users above and under 18. A lower value of both demographic bias and RMSE indicates a better model.
Users belonging to both age groups in the ML100K dataset enjoy fair recommendations with RS-FAIRFRS (20%). Although MF performs slightly better in terms of accuracy as well as fairness for the gender attribute in ML100K, owing to the efficiency of centralized settings on smaller datasets, overall RS-FAIRFRS (20%) outperforms all the existing federated algorithms.

6. CONCLUSION AND FUTURE WORK

We propose RS-FAIRFRS, which incorporates two key ideas: (i) random sampling of clients to reduce the communication cost, and (ii) a dual-fair update to mitigate group bias. We are the first to theoretically bound the sample complexity on the fraction of clients to be sampled without affecting the model's accuracy. We empirically validate our theoretical results and provide extensive experiments to show that we can achieve much-reduced bias and improved model accuracy. Exploring other types of biases in FRSs can be an interesting research direction. In the future, we would also like to extend our analysis to provide theoretical fairness guarantees for RS-FAIRFRS and explore the use of dual-fair-update-like techniques in other FL domains.

A APPENDIX

A.1 NOTATIONS We list down all the important and frequently used notations in

amount, we can acquire much better accuracy and demographic fairness too. Further, our results point to an important reason for the improved accuracy: RS-FAIRFRS[20%] reduces the loss not only on the advantaged group but also on the disadvantaged group. Our model achieves this due to the metric $l_{ap}$ which, when added to the optimization function of MF, reduces the loss as well as the difference between the losses on the various attributes at each stochastic gradient descent step.

This section introduces some additional findings which validate the ability of our model to improve fairness when we further reduce the amount of data on the server. The initial experiments use the entire rating data of 20% of the users. These users are enough to represent the entire local population and belong to the same platform. This is a reasonable proportion to assume, as at least this many users who already use the online platform might agree to share their sensitive attributes with the server. We further tighten this assumption by considering only a few of the items previously rated by these 20% users, that is, by keeping only a certain proportion of their items on the server. We consider 25% and 50% of the items of the selected 20% users. Figure 7 presents the results of these additional experiments as a scatter plot of demographic bias vs RMSE. The results remain promising even when the dataset on the server is further reduced, and they are consistent in accuracy as well as fairness across all the datasets and sensitive attributes. Though intuitive, our results suggest that having the entire data of 20% of the users is very conducive to obtaining fairer as well as more accurate results.

However, with only 25% of the item ratings, we are still able to obtain much reduced bias and better accuracy for all four dataset and attribute combinations. If we increase the data on the server to 50% of the item ratings, we obtain results which are slightly better than with 25%, and the best outcomes are obtained with all the item ratings. Since RS-FAIRFRS promotes keeping only a small amount of users' data on the server, it is evident from the results that RS-FAIRFRS with 100% data on the server performs the worst. This happens because outliers are included when all the users are present on the server, whereas only 30% of the users from the local population participate in the aggregation at the server. This result emphasizes that RS-FAIRFRS need not consider a heavy fraction of users at the server. A client fraction as small as 20% can help in improving the fairness as well as the accuracy of the model.
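The reduced server datasets used in these additional experiments can be sketched as follows. This is only an illustration under stated assumptions: ratings are given as `(user, item, rating)` triples, users and their items are chosen uniformly at random, and the function name `build_server_dataset` and its parameters are hypothetical rather than part of the paper's code.

```python
import numpy as np

def build_server_dataset(ratings, user_frac=0.2, item_frac=0.25, seed=0):
    """Keep `item_frac` of the ratings of a random `user_frac` of the users.

    ratings: list of (user, item, rating) triples.
    Returns the list of triples assumed to be available at the server.
    """
    rng = np.random.default_rng(seed)
    users = sorted({u for u, _, _ in ratings})
    n_keep = max(1, int(round(user_frac * len(users))))
    kept_users = set(rng.choice(users, size=n_keep, replace=False).tolist())

    server = []
    for u in sorted(kept_users):
        rows = [r for r in ratings if r[0] == u]
        k = max(1, int(round(item_frac * len(rows))))
        idx = rng.choice(len(rows), size=k, replace=False)
        server.extend(rows[i] for i in sorted(idx.tolist()))
    return server
```

Calling the function with `item_frac` set to 0.25, 0.5 and 1.0 would correspond to the three server configurations compared in Figure 7.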



https://grouplens.org/datasets/movielens/1m/ https://grouplens.org/datasets/movielens/100k/




-FAIRFRS Algorithm 3 describes RS-FAIRFRS in whole. With the assumption of the availability of a small dataset ($D_{server}$) corresponding to $s$ users at the server, the procedure begins by initializing $V = [V_i]_{i=1}^{m}$ and $U = [U_u]_{u=1}^{s}$, which represent the item and user vectors respectively. Then, some τ

Algorithm 3 RS-FairFedRec: Communication Efficient and Fairness Aware Federated Recommender System
Input: $D_{server}$, $D_{train}$, $\tau$, $\gamma$, $\lambda_r$, $\lambda_f$, $\alpha$, $\rho$
Output: $U_u$, $V_i$
1: Initialize $V = [V_i]_{i=1}^{m}$ and $U = [U_u]_{u=1}^{s}$
2: for $t = 1, 2, \ldots, T$ do
3:   $C_\tau$ ← {Randomly sub-sample $\tau$ fraction of clients from all the users in $D_{train}$}
4:   $U_{fair}$, $V_{fair}$ ← FairMF($D_{server}$, $V$, $U_{fair}$)
5:   for each client $u \in U$ in parallel do
     i, $i = 1, 2, \ldots, m$; $U_u$; $u$, $t$)
8:   Calculate the gradients $\nabla U_u$ and $\nabla V(u, i)$ by differentiating equation 3
9:   Update $U_u$ via $U_u \leftarrow U_u - \gamma \nabla U_u$
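The sampling and local-update steps of the listing above can be sketched as a single communication round. This is a simplified illustration and not the authors' implementation: equation 3 is not reproduced in this part of the text, so the sketch assumes a standard regularized squared-error MF loss. The FairMF step at the server is omitted, averaging the item gradients over the sampled clients is an assumed aggregation rule, and the name `run_round` and the regularization weight `lam` are hypothetical.

```python
import numpy as np

def run_round(ratings_by_user, U, V, tau, gamma, lam, rng):
    """One round: sample clients, train locally, aggregate item gradients.

    ratings_by_user: dict mapping user index -> list of (item, rating).
    U, V: user and item factor matrices, updated in place.
    Returns the positions of the sampled clients.
    """
    clients = list(ratings_by_user)
    n_sampled = max(1, int(round(tau * len(clients))))
    sampled = rng.choice(len(clients), size=n_sampled, replace=False)

    grad_V = np.zeros_like(V)
    for idx in sampled:
        u = clients[idx]
        for i, r in ratings_by_user[u]:
            err = r - U[u] @ V[i]
            # Gradients of (r - U_u.V_i)^2 + lam * (|U_u|^2 + |V_i|^2) (assumed loss)
            g_u = -2.0 * err * V[i] + 2.0 * lam * U[u]
            g_i = -2.0 * err * U[u] + 2.0 * lam * V[i]
            U[u] -= gamma * g_u   # the user vector stays on the client
            grad_V[i] += g_i      # only item gradients are sent to the server

    # Server side: average the uploaded item gradients and take one step.
    V -= gamma * grad_V / n_sampled
    return sampled
```

Only the sampled clients compute and upload gradients in a round, which is where the communication saving over training all clients in parallel comes from.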

Figure 1: Comparison plots for accuracy and average time per communication round on different values of τ in RS-FedRec on two datasets of ML1M and ML100k

Figure 3: $L_{ap}$ on two datasets and two different demographics for different algorithms.

Figure 4: Fairness vs Accuracy plots

Figure 7: Fairness vs Accuracy plots

However, Lin et al. (2021) argue that only the item vectors of clients should be shared with the server to preserve clients' privacy. Further, Saito et al. (2019) explain that the use of implicit feedback leads to the positive-unlabeled problem. Thus, we analyze FedRec (Lin et al. (2021)) as our base model. It uses explicit feedback, unlike other existing models such as FCF (ud din et al. (2019)),



Hyperparameter Values

Training results of 4 different algorithms on real-world datasets.

Testing Results of 4 different algorithms on real-world datasets


tering (Bock (2007)). Figure 5 shows the elbow curve obtained by plotting the clustering loss, Inertia (Hedman et al. (2007)), against the number of clusters. Inertia is one of the most popular metrics for evaluating clustering results. It is calculated as the sum of squared distances of the points to their closest cluster center. A high value of the total distance implies that the points are far from each other, which is a clear indication that they are less similar to each other. It can be seen that for both datasets we get K = 20 as the ideal number of clusters. Considering K = 20 and assuming a total of n = 6000 clients, we randomly assign each client to one of the 20 clusters. Then, we randomly sample some of the clients and report our observations in Figure 6. Figure 6a shows that after sampling some clients uniformly at random, the average number of clients from each cluster is almost the same. Further, Figure 6b shows that the probability of getting more than 15% error in the above experiment is as small as 7.5% when averaged over 500 runs. Finally, Figure 6c demonstrates that the minimum number of samples in each cluster is also the same. Thus, all three experiments support our Lemma 1.

Proof: Let $S_j$ denote the set of clients in cluster $j$, and $C_j$ denote the set of clients sampled from cluster $j$. Since in each cluster the item vectors come from an identical distribution, we have

represent the average of item vectors $V_i$'s from sets $S_j$ and $C_j$ respectively. From Lemma 1, if a $\tau$ fraction of clients is selected from $n$ clients uniformly at random without repetition, then we have $|C_j| \approx \tau |S_j|$ with high probability. Then,

We now recall our main Theorem 4.1 and provide its detailed proof using the above Lemma and Hoeffding's bound.
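The sampling experiment behind Lemma 1 can be approximated with a short simulation. The values n = 6000 and K = 20 follow the text, while the sampling fraction, the seed and the function name `sampled_cluster_counts` are illustrative assumptions. The exact protocol behind Figure 6 may differ.

```python
import numpy as np

def sampled_cluster_counts(n=6000, k=20, tau=0.3, seed=0):
    """Assign n clients to k clusters at random, then sample a tau fraction.

    Returns (full, drawn): the per-cluster sizes |S_j| and the per-cluster
    numbers of sampled clients |C_j|.
    """
    rng = np.random.default_rng(seed)
    cluster = rng.integers(0, k, size=n)                 # cluster of each client
    sampled = rng.choice(n, size=int(round(tau * n)), replace=False)
    full = np.bincount(cluster, minlength=k)             # |S_j|
    drawn = np.bincount(cluster[sampled], minlength=k)   # |C_j|
    return full, drawn

full, drawn = sampled_cluster_counts()
# Relative deviation of |C_j| from tau * |S_j| for every cluster.
rel_err = np.abs(drawn - 0.3 * full) / (0.3 * full)
```

Repeating the call over many seeds and recording how often `rel_err.max()` exceeds 0.15 would give an estimate comparable in spirit to the error probability reported for Figure 6b.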
$V_i$ represent the average of item vectors over some $\tau$ fraction of clients and total $n$ clients, respectively, then

Proof of Theorem 4.1: From Hoeffding's inequality, we have,

Thus with probability at least $1 - 2\exp\{-2n\tau\epsilon^2/(b-a)^2\}$, we have,

Thus, we get
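To make the probability expression in the proof concrete, the following sketch inverts it. Setting the failure probability $2\exp\{-2n\tau\epsilon^2/(b-a)^2\}$ equal to a target $\delta$ and solving for $\tau$ gives $\tau \ge (b-a)^2 \ln(2/\delta) / (2n\epsilon^2)$. The symbol $\delta$, the function names and the numerical values in the usage note are illustrative assumptions, and this is not a restatement of Theorem 4.1.

```python
import math

def failure_probability(n, tau, eps, a, b):
    """Hoeffding-style bound 2 * exp(-2 n tau eps^2 / (b - a)^2), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * tau * eps ** 2 / (b - a) ** 2))

def min_client_fraction(n, eps, delta, a, b):
    """Smallest tau for which the bound above is at most delta, capped at 1."""
    tau = (b - a) ** 2 * math.log(2.0 / delta) / (2.0 * n * eps ** 2)
    return min(1.0, tau)
```

For example, with n = 6000 clients, values bounded in [0, 1], and ε = δ = 0.05, `min_client_fraction(6000, 0.05, 0.05, 0.0, 1.0)` evaluates to roughly 0.123, so under these assumed values sampling about 12% of the clients per round would suffice for the bound.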

A.4 HYPERPARAMETER TUNING

We carefully tune all the hyper-parameters and list the final tuned values in Table 2. Note that the server in FedRec and RS-FedRec acts only as an aggregator. Thus, the server does not require any tuning, and the final tuned values for both algorithms are client-side parameters. In contrast, the server in RS-FAIRFRS[100%] as well as RS-FAIRFRS[20%] requires fairness-oriented training. Therefore, we present separate hyper-parameter values for the client side and the server side. Since the regularizers at the server as well as the client side prevent overfitting, we use $\lambda_r$ to denote both. The fairness regularizer $\lambda_f$ at the server trains the aggregated item vectors towards fairness using $D_{server}$, and the fairness regularizer $\eta$ at the client side trains the local models towards the globally fair item vectors. Experimentally, we observe that all the models converge well at T = 20, so we use T = 20 in all our experiments. However, when executing FairMF at the server, we train it for $T_s = 15$ epochs after each communication round. We obtain better fairness when we train FairMF for a few epochs instead of letting it converge fully in each communication round, which can be time-consuming and also increases the load at the server. Thus, FairMF executes for only 15 epochs and provides the updated item vectors to be communicated to all the local clients.
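The roles of the three regularizers described above can be illustrated with the following sketch of a server-side and a client-side objective. The exact objectives are not given in this part of the text, so the functional forms are assumptions: a mean squared error fit term, an L2 penalty weighted by $\lambda_r$, a fairness term weighted by $\lambda_f$ that penalizes the gap between the mean squared errors of two groups, and a client-side term weighted by $\eta$ that pulls the local item vectors towards the fair item vectors received from the server. All function and argument names are hypothetical.

```python
import numpy as np

def server_objective(U, V, ratings, group, lam_r, lam_f):
    """Assumed FairMF-style objective on the server dataset.

    ratings: list of (user, item, rating); group[user] is 0 or 1.
    """
    sq_err = np.array([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings])
    g = np.array([group[u] for u, _, _ in ratings])
    fit = sq_err.mean()
    reg = lam_r * (np.sum(U ** 2) + np.sum(V ** 2))
    gap = abs(sq_err[g == 0].mean() - sq_err[g == 1].mean())
    return float(fit + reg + lam_f * gap)

def client_objective(U_u, V, V_fair, user_ratings, lam_r, eta):
    """Assumed local objective of one client.

    user_ratings: list of (item, rating) pairs held by the client.
    """
    fit = np.mean([(r - U_u @ V[i]) ** 2 for i, r in user_ratings])
    reg = lam_r * (np.sum(U_u ** 2) + np.sum(V ** 2))
    pull = eta * np.sum((V - V_fair) ** 2)  # stay close to the fair item vectors
    return float(fit + reg + pull)
```

In this reading, the client never needs its own protected attribute in the loss: the group information enters only through `V_fair`, which is computed at the server from $D_{server}$.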

A.5 TRAINING AND TESTING RESULTS

We report the training as well as the test results of all four federated algorithms on the two real-world datasets, ML1M and ML100k, with two different sensitive attributes in Tables 3 and 4. Note that, similar to the training results, RS-FAIRFRS[20%] performs the best in terms of accuracy as well as fairness. The tables show that FedRec and RS-FedRec perform nearly the same, which indicates that random sampling retains the model accuracy. Further, our results (rows for RS-FAIRFRS[20%] and RS-FAIRFRS[100%]) provide empirical support for the claim made in the main paper that we do not need a huge amount of data at the server. With a minimal

