DISTANTLY SUPERVISED RELATION EXTRACTION IN FEDERATED SETTINGS

Abstract

Distant supervision is widely used in relation extraction to create a large-scale training dataset by aligning a knowledge base with unstructured text. Most existing studies in this field have assumed that a great deal of centralized unstructured text is available. However, in practice, text may be distributed on different platforms and cannot be centralized due to privacy restrictions. Therefore, it is worthwhile to investigate distant supervision in the federated learning paradigm, which decouples the training of the model from the need for direct access to the raw text. However, overcoming the label noise of distant supervision becomes more difficult in federated settings, because the sentences containing the same entity pair are scattered across different platforms. In this paper, we propose a federated denoising framework to suppress label noise in federated settings. The core of this framework is a multiple instance learning based denoising method that is able to select reliable sentences via cross-platform collaboration. Various experimental results on the New York Times dataset and the miRNA gene regulation relation dataset demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

Relation extraction (RE) aims to mine factual knowledge from free text by labeling relations between entity mentions, which is a crucial step in knowledge base (KB) construction. For example, given a sentence "[Steve Jobs] e1 and Wozniak co-founded [Apple] e2 in 1976", a relation extractor should identify that "Steve Jobs" and "Apple" are in a "Founder" relationship. Most existing supervised RE systems, such as Zeng et al. (2014); Zhang & Wang (2015); Wang et al. (2016); Zhou et al. (2016), rely on a large-scale manually annotated training dataset, which is extremely expensive to build and cannot cover all walks of life. To ease the reliance on annotated data, Mintz et al. (2009) proposed distant supervision to automatically generate training data by heuristically aligning a KB with unstructured text. The key assumption of distant supervision is that if two entities have a relation in the KB, then all sentences that mention these two entities will express this relation. Since then, a rich literature has been devoted to this topic, such as Riedel et al. (2010); Hoffmann et al. (2011); Zeng et al. (2015); Lin et al. (2016); Ye & Ling (2019); Yuan et al. (2019). Though the progress is exciting, distant supervision approaches have so far been limited to the centralized learning paradigm, which assumes that a great deal of text is easily accessible. However, in practice, text may be distributed on different platforms and heavily intertwined with sensitive personal information, especially in the healthcare and financial fields (Yang et al., 2019; Zerka et al., 2020; Chamikara et al., 2020). Due to privacy restrictions, it is almost impossible or cost-prohibitive to centralize text from multiple platforms. Recently, federated learning (McMahan et al., 2016) has provided a compelling solution for learning a model from decentralized and privacy-sensitive data.
The main idea behind federated learning is that each platform trains a local model on its own local data while a master server coordinates the platforms to collaboratively train a global model by aggregating these local model updates. Unfortunately, directly applying federated learning to decentralized distantly supervised data fails, because conventional federated learning requires the local data to come with noise-free labels (Tuor et al., 2020), whereas in distant supervision automatic labeling is inevitably accompanied by label noise (Riedel et al., 2010; Hoffmann et al., 2011; Zeng et al., 2015; Lin et al., 2016): not all sentences that mention an entity pair express the relation between them. Training on such noisy data will substantially hinder the performance of the RE model.

Figure 1: An example of the sentences that contain the same entity pair distributed on two platforms: S1 on Platform 1 is a true positive, while S2 on Platform 2, "Steve Jobs resigned as chief executive from Apple in 2011.", is a false positive. The triple (Steve Jobs, Founder, Apple) is a fact in the KB.

Moreover, even previous denoising methods, such as Zeng et al. (2015); Lin et al. (2016); Ye & Ling (2019), cannot handle label noise well in federated settings. This point is illustrated by the example in Figure 1. S1 and S2 contain the same entity pair ("Steve Jobs", "Apple") but are distributed on two platforms. S1 is a true positive, while S2 is a false positive instance that does not express the "Founder" relation. In centralized training, there is no barrier between Platform 1 and Platform 2; considering S1 and S2 together therefore easily filters out the noise, either by selecting only S1 (Zeng et al., 2015) or by placing a small weight on S2 (Lin et al., 2016; Ye & Ling, 2019). However, raw data exchange between platforms is prohibited in federated settings. Lacking the comparison with S1, previous denoising methods would mistakenly regard S2 as a true positive instance. As a result, S2 is retained and poisons the local model on Platform 2, which in turn degrades the global model. To suppress label noise in federated settings, we propose a federated denoising framework in this paper. The core of this framework is a multiple instance learning (MIL) (Dietterich et al., 1997; Maron & Lozano-Pérez, 1998) based denoising algorithm, called Lazy MIL, which is executed only at the beginning of each communication round and then rests until the next round. Since the sentences containing the same entity pair are scattered across different platforms, the Lazy MIL algorithm coordinates multiple platforms to jointly select reliable sentences. Once sentences have been selected, they are used repeatedly to train local models until the end of the round.
In summary, the contributions of this paper are:

• Considering data decentralization and privacy protection, we investigate distant supervision under the federated learning paradigm, which decouples model training from the need for direct access to the raw data. To the best of our knowledge, combining federated learning with distant supervision is still unexplored territory, which is the main focus of this paper.

• Since automatic labeling in distant supervision is inevitably accompanied by label noise, we present a multiple instance learning based denoising method, which can select reliable instances via cross-platform collaboration.

• The proposed method yields promising results on two benchmark datasets, and we perform various experiments to verify its effectiveness. The code will be released at http://anonymized.

2. RELATED WORK

In this section, we will briefly review the recent progress in distant supervision and some existing studies in federated learning.

Distant supervision.

Relation extraction is the task of mining factual knowledge from free text by labeling relations between entity mentions. To alleviate the dependence of supervised methods on annotated data, Mintz et al. (2009) proposed distant supervision, which uses a knowledge base to annotate a large-scale dataset automatically. However, automatic labeling is inevitably accompanied by label noise. To deal with label noise, most distantly supervised approaches (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Luo et al., 2017; Ye & Ling, 2019; Yuan et al., 2019) focus on reducing label noise at bag-level prediction, where a "bag" is the set of sentences containing the same entity pair. These studies fall under the multiple instance learning framework, which assumes that at least one sentence in a bag expresses the relation. Another line of work aims to reduce label noise at sentence-level prediction. These studies (Zeng et al., 2018; Feng et al., 2018; Qin et al., 2018a;b) use reinforcement learning or adversarial training to select trustable relation labels by matching the predicted labels with the distantly supervised labels. Compared with previous studies, our work focuses on reducing label noise at bag-level prediction and extends distant supervision to federated settings. Distant supervision has also been applied to other natural language processing tasks, such as named entity recognition (Ghaddar & Langlais, 2018; Shang et al., 2018; Nooralahzadeh et al., 2019), event extraction (Chen et al., 2017), sentiment classification (Go et al., 2009) and question answering (Joshi et al., 2017; Lin et al., 2018; Cheng et al., 2020).

Federated Learning. Recently, federated learning (McMahan et al., 2016; Konečnỳ et al., 2016a;b; Bonawitz et al., 2017; Smith et al., 2017; Caldas et al., 2018; Zhao et al., 2018; Li et al., 2018; Jeong et al., 2018; Peng et al., 2019; Li et al., 2019; Wang et al., 2020; Rothchild et al., 2020; Yu et al., 2020) has attracted increasing attention as a paradigm for training models on decentralized data without direct access to it.

3. FEDERATED DENOISING FRAMEWORK

3.1. TASK DEFINITION

In this paper, we focus on distant supervision in federated settings. Define K platforms {P_1, ..., P_K} with respective unlabeled corpora {D_1, ..., D_K}. Under the assumption of centralized training, each platform transfers or shares its local corpus with a server, and the server takes the integrated corpus D = D_1 ∪ ... ∪ D_K to conduct training; in contrast, distant supervision in federated settings requires that platform P_i not expose its corpus D_i to others (including the server). In distant supervision, a KB is required to automatically label these corpora. In this paper, we focus only on the data security of these unlabeled corpora and assume the KB is publicly available to all platforms. The issue of protecting the security of the KB is beyond the scope of the current work. To solve this task, we propose a federated denoising framework, whose key components are elaborated in the following sections. Concretely, we first introduce the basic relation extractor in Section 3.2, which is the network architecture shared by the global model and local models. Then, we present how to select reliable instances via cross-platform collaboration in Section 3.3. Next, we describe how to use the selected instances to train the local model in Section 3.4. Finally, we present how to use the FedAvg algorithm to update the global model in Section 3.5.

3.2. RELATION EXTRACTOR

Following previous studies (Zeng et al., 2015), we adopt the Piecewise Convolutional Neural Network (PCNN) as our relation extractor. Given a sentence s and two entities within it, we first split the sentence into tokens, and each token w_i is mapped into a dense word embedding e_i ∈ R^{d_w}. To specify the entity pair, the relative distances between the current token w_i and the two entities are transformed into two positional features by looking up position embedding matrices. Each token is then represented as the concatenation of its word embedding and the two positional features, and fed into a convolutional neural network. Then, piecewise max pooling (Zeng et al., 2015) is employed to extract a high-level sentence representation: the input sentence is divided into three segments based on the two entities, and the maximum value of the CNN outputs in each segment is returned. After that, we apply a single fully connected layer to output the logit vector o. Finally, the conditional probability of the j-th relation is:

p(rel_j | s, Θ) = exp(o_j) / Σ_{i=1}^{M} exp(o_i)    (1)

where Θ denotes the model parameters and M is the total number of relations.

Algorithm 1 Lazy Multiple Instance Learning
1: Input: global model parameters Θ, the set of activated platforms A
2: Define two dictionaries on the server, named V and I            ▷ Run on the master server
3: Distribute Θ to each platform in A
4: for each platform i ∈ A in parallel do                          ▷ Run on the activated platforms
5:   for each triple (h, r, t) in the KB do
6:     for each sentence s_z^i in the bag b_i do
7:       Compute p(r | s_z^i, Θ) according to Equation 1
8:     v_i, id_i ← max_z p(r | s_z^i, Θ), s_z^i ∈ b_i              ▷ v_i is called the uploaded value
9:     Upload [v_i, id_i, i] to the server and append [v_i, id_i, i] to V[(h, r, t)]
10: for each key (h, r, t) in V do                                 ▷ Run on the master server
11:   Sort V[(h, r, t)] in descending order according to the uploaded values
12:   I[(h, r, t)] ← V[(h, r, t)][0]
13: Broadcast I to each platform in A

3.3. LAZY MULTIPLE INSTANCE LEARNING

To avoid the local relation extractor being poisoned by false positive instances, we propose lazy multiple instance learning (Lazy MIL), which selects reliable instances via cross-platform collaboration. An overview of Lazy MIL is given in Algorithm 1. Suppose there is a triple (h, r, t) in the public KB; the set of sentences containing the head entity h and tail entity t is represented as {(s_1^1, s_2^1, ..., s_{n_1}^1), ..., (s_1^K, s_2^K, ..., s_{n_K}^K)}, where s_i^j indicates the i-th instance on platform j. In the q-th communication round, assume that only platforms i and j are activated. At the beginning of the round, the parameters of the global model Θ_q are distributed to the activated platforms i and j to initialize their local models, which ensures that all activated local models share the same parameters in Lazy MIL. On platform i, the sentences in the set (s_1^i, s_2^i, ..., s_{n_i}^i) are fed into the local model to obtain the conditional probabilities of the relation r according to Equation 1, where r is the predicate of the triple. The value v_i and index id_i of the instance with the maximum conditional probability for relation r are computed as follows:

v_i, id_i = max_z p(r | s_z^i, Θ_q), 1 ≤ z ≤ n_i    (2)

After this computation, platform i uploads the value v_i and index id_i to the master server. At the same time, the same procedure is performed on platform j, and the value v_j and index id_j are also uploaded to the server. The master server decides which local instance is selected among all activated platforms based on the uploaded values. If v_i > v_j, then the id_i-th sentence on platform i is selected as the reliable sentence that expresses the triple (h, r, t) in this round. This decision, called the denoising information, is broadcast to all activated platforms.
Each activated platform selects reliable training instances from its local corpus according to this denoising information. Note that since only the values and indices of conditional probabilities are uploaded to the master server, Lazy MIL leaks almost no corpus information from any platform.
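To make the protocol concrete, the selection step of Lazy MIL can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: toy probabilities stand in for the PCNN outputs p(r | s, Θ), and the function names are ours.

```python
def platform_upload(local_probs):
    """Run on an activated platform: for one triple (h, r, t), upload only the
    maximum probability of relation r over the local bag, plus its local index.
    The sentences themselves never leave the platform."""
    best_idx = max(range(len(local_probs)), key=lambda z: local_probs[z])
    return local_probs[best_idx], best_idx

def server_select(uploads):
    """Run on the master server: among all activated platforms' uploads
    {platform_id: (value, index)}, keep the platform whose best sentence has
    the highest probability; broadcast this decision as denoising information."""
    best_platform = max(uploads, key=lambda p: uploads[p][0])
    value, index = uploads[best_platform]
    return best_platform, index

# Toy example for one triple: probabilities of the "Founder" relation.
uploads = {
    1: platform_upload([0.91, 0.40]),   # platform 1: its first sentence matches well
    2: platform_upload([0.35]),         # platform 2: likely a false positive
}
platform, index = server_select(uploads)
# Only the sentence at (platform, index) is kept as reliable for this triple.
```

In this toy round the server keeps sentence 0 of platform 1 and discards platform 2's candidate, mirroring how S2 in Figure 1 would be filtered out.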

3.4. LOCAL MODEL TRAINING

After platform i selects reliable instances from its local corpus D_i, the selected reliable instance set D'_i is used to train the local relation extractor. We use the cross-entropy loss to optimize the parameters Θ_q, defined as follows:

J(Θ_q; D'_i) = - (1 / |D'_i|) Σ_{u=1}^{|D'_i|} log p(r_u | s_u, Θ_q)    (3)

where s_u indicates the u-th sentence in the selected reliable instance set D'_i. After training for E epochs on the selected set, the trained parameters Θ^i_{q+1} are uploaded to the master server, where the superscript i indicates that the parameters were trained on platform i.

Algorithm 2 Federated Denoising Framework
1: Hyperparameters: K is the total number of platforms; C is the fraction of platforms; B is the local minibatch size; E is the number of local epochs; η is the learning rate
2: Master server executes:
3: Initialize global model parameters Θ_0
4: for communication round q = 0, 1, ... do
5:   m ← max(C × K, 1)                                   ▷ Select activated platforms
6:   A_q ← (random set of m platforms)
7:   Execute the lazy multiple instance learning algorithm    ▷ Defined in Algorithm 1
8:   for each platform i ∈ A_q in parallel do
9:     Θ^i_{q+1} ← LocalTraining(i, Θ_q)
10:  Θ_{q+1} ← Σ_{i∈A_q} (|D_i| / Σ_{j∈A_q} |D_j|) Θ^i_{q+1}    ▷ Defined in Equation 5
11:
12: Function LocalTraining(i, Θ):                         ▷ Run on platform i
13:   Generate the denoised dataset D'_i from D_i based on the denoising information I
14:   B ← (split D'_i into batches of size B)
15:   for each local epoch e from 1 to E do
16:     for each batch b ∈ B do
17:       Θ ← Θ - η∇J(Θ; b)                               ▷ J is defined in Equation 3
18:   return Θ to the master server
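As a sketch, the LocalTraining procedure amounts to E epochs of minibatch SGD on the denoised local set. In the snippet below, which is illustrative rather than the paper's code, `grad_fn(theta, batch)` stands in for the cross-entropy gradient ∇J(Θ; b) of Equation 3, and parameters are plain Python lists.

```python
def local_training(theta, denoised_data, E, B, lr, grad_fn):
    """Sketch of LocalTraining in Algorithm 2: split the denoised local
    dataset D'_i into minibatches of size B and run E epochs of SGD."""
    batches = [denoised_data[i:i + B] for i in range(0, len(denoised_data), B)]
    for _ in range(E):
        for b in batches:
            g = grad_fn(theta, b)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy run: a "gradient" equal to the parameters, so each update scales by (1 - lr).
theta = local_training([1.0], denoised_data=list(range(4)), E=1, B=2,
                       lr=0.1, grad_fn=lambda th, b: th)
# Two batches and one epoch: the parameter is multiplied by 0.9 twice (0.81).
```

The returned parameters are what the platform would upload to the master server as Θ^i_{q+1}.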

3.5. GLOBAL MODEL UPDATE

Suppose A_q is the set of activated platforms in the q-th communication round. After all activated platforms finish local training, the master server collects the trained parameters {Θ^i_{q+1} | i ∈ A_q} to update the global model. We define the objective of the global model as follows:

min_{Θ_q} Σ_{i∈A_q} (|D_i| / Σ_{j∈A_q} |D_j|) J(Θ_q; D_i)    (4)

where J(Θ_q; D_i) is the local loss function for platform i. Following previous studies (McMahan et al., 2016), we optimize this global objective by taking the weighted average of all trained parameters:

Θ_{q+1} = Σ_{i∈A_q} (|D_i| / Σ_{j∈A_q} |D_j|) Θ^i_{q+1}    (5)

where Θ^i_{q+1} are the optimal parameters obtained by minimizing the local loss function on the local data of platform i. Since the trained parameters from different platforms are aggregated together, the corpus information of each platform is hard to infer. Thus, the corpora on the platforms are well protected. Complete pseudo-code of this framework is given in Algorithm 2.

Data Partitioning. To study distant supervision in federated settings, we need to specify how to distribute the data across platforms. In this paper, we focus on the IID situation, where the training data are shuffled and then partitioned across the K platforms (K being the total number of platforms).
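The IID partitioning and the weighted average of Equation 5 can be sketched as follows. This is a minimal stand-in, with parameters represented as plain Python lists and illustrative function names; a real implementation would average model weight tensors.

```python
import random

def iid_partition(dataset, K, seed=0):
    """Shuffle the training data and split it across K platforms (IID setting)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    return [data[i::K] for i in range(K)]

def fedavg(local_params, local_sizes):
    """Weighted average of local parameter vectors, as in Equation 5:
    each platform's weight is |D_i| / sum_j |D_j|."""
    total = sum(local_sizes)
    dim = len(local_params[0])
    return [sum(p[d] * n for p, n in zip(local_params, local_sizes)) / total
            for d in range(dim)]

parts = iid_partition(range(10), K=2)
theta = fedavg([[1.0, 0.0], [3.0, 2.0]], local_sizes=[1, 3])
# theta == [2.5, 1.5]: the platform holding more data dominates the average.
```

Note that the server only ever sees parameter vectors and dataset sizes, never the underlying sentences, which is the privacy property the framework relies on.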

4. EXPERIMENTS

Evaluation Metrics. We evaluate our approach and the baseline methods on the held-out test sets of the two datasets. Precision-recall (PR) curves, area under the curve (AUC) values and Precision@N (P@N) values are adopted as evaluation metrics in our experiments. For a fair comparison, we implement our method and all baselines in the same experimental settings. We divide the hyperparameters into three parts, i.e., fixed hyperparameters, unfixed hyperparameters and federated hyperparameters. Fixed hyperparameters follow the settings in Lin et al. (2016), including the 50-dimensional pretrained word embeddings for NYT, the 5-dimensional position embeddings, and a CNN module with 230 filters and a window size of 3. For MIRGENE, 200-dimensional word embeddings pretrained on PubMed and MIMIC-III are used. The optimal unfixed hyperparameters are determined by grid search based on the performance on the development set, and the search space of unfixed hyperparameters is shown in Table 1. Federated hyperparameters include the total number of platforms K, the fraction of platforms C, the local minibatch size B and the number of local epochs E; all of these control the amount of computation. In the end-to-end comparison, we fix K to 100, B to 32 and E to 3, and set the hyperparameter space of C to {0.1, 0.2, 0.5, 1} following McMahan et al. (2016). We use stochastic gradient descent as the local training optimizer, and all experiments can be run on a single GeForce GTX 1080 Ti.

4.3. BASELINES

We compare our method with the following baselines in federated settings: (1) NONE: directly applying the FedAvg algorithm (McMahan et al., 2016) to the automatically labeled data, with no denoising module; (2) ONE: Zeng et al. (2015) proposed leveraging multiple instance learning to choose the most reliable sentence as the bag representation; (3) ATT: proposed by Lin et al. (2016), which uses an attention mechanism to select reliable instances by placing soft weights on a set of noisy sentences; (4) AVE (Lin et al., 2016): a naive version of ATT that represents each sentence set as the average vector of the sentences inside it; (5) ATT RA (Ye & Ling, 2019): a variant of ATT that calculates the bag representations in a relation-aware way. The federated framework of these baselines is shown in Algorithm 3 in the appendix.
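For intuition, the bag-level aggregation strategies behind ONE, AVE and ATT can be sketched with pure-Python stand-ins for the vector operations. This is an illustrative simplification: real implementations operate on learned sentence encodings, and ATT's scores come from a trained relation query.

```python
import math

def bag_one(sent_probs):
    """ONE: hard selection; keep only the most confident sentence in the bag."""
    return max(sent_probs)

def bag_ave(sent_reprs):
    """AVE: represent the bag as the unweighted mean of its sentence vectors."""
    dim = len(sent_reprs[0])
    return [sum(s[d] for s in sent_reprs) / len(sent_reprs) for d in range(dim)]

def bag_att(sent_reprs, scores):
    """ATT: softmax the relation-query scores into soft weights, so noisy
    sentences can be down-weighted instead of discarded outright."""
    exps = [math.exp(a) for a in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(sent_reprs[0])
    return [sum(w * s[d] for w, s in zip(weights, sent_reprs)) for d in range(dim)]
```

With uniform scores, ATT reduces to AVE; the denoising effect of ATT comes entirely from contrasting sentences within the bag, which is exactly what becomes hard when a platform holds only one or two sentences of a bag.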

4.4. MAIN RESULTS

Figure 2 and Figure 3 show the precision-recall curves on the NYT dataset and the MIRGENE dataset. In the appendix, we also present the AUC values of these curves in Table 8 and detailed precision values measured at different points along these curves. To reduce randomness, we run 10 models using random seeds with early stopping on the development set. Table 2 and Table 3 show the mean and standard deviation of the test AUC values for each method on the NYT 10 dataset and the MIRGENE dataset, respectively. We find that: (1) Our method significantly outperforms all baselines in federated settings. We believe the reason is that our denoising method can use cross-platform information to prevent false positive instances from poisoning local models, which leads to better performance of the global model. (2) Directly applying the FedAvg algorithm (McMahan et al., 2016) to the automatically labeled data achieves the worst results on both datasets. The reason is that training on noisy data substantially hinders the performance of the model. Therefore, it is necessary to denoise in federated distant supervision. (3) C is the fraction of platforms that are activated in each round, which controls the amount of multi-platform parallelism. With increasing platform parallelism, the performance of all baselines declines slightly while our method performs better. Intuitively, increasing platform parallelism should lead to better results, since involving more platforms in training increases the likelihood that all sentences with the same entity pair appear simultaneously. However, due to the lack of cross-platform collaboration, each baseline handles label noise based only on its own local data, which hampers performance.
In contrast, our method selects reliable instances among all activated platforms, which can effectively reap the benefits of increasing platform parallelism. (4) Leveraging attention mechanisms to denoise, an effective solution in centralized settings, seems not to work in federated settings. Compared with centralized training, the sentences in a bag are scattered across different platforms in federated settings, so the number of sentences with the same entity pair on one platform is small, which may lead to placing large attention weights on noisy sentences due to the lack of contrast among the sentences in the bag.

4.5. INCREASING THE SIZE OF LOCAL DATASETS

In this section, we increase the size of the local datasets by setting K to 50. In this way, each local dataset is twice as large as before (when K is set to 100). For a fair comparison, we fix C = 0.1, B = 32 and E = 3. Figure 4 shows the results of the AUC values. In the appendix, we also present the corresponding precision-recall curves in Figure 7 and detailed precision values measured at different points along these curves in Table 10. From the results, we observe that: (1) Our proposed method significantly surpasses all baselines on both datasets.


(2) Compared with setting K to 100, the result of directly applying the FedAvg algorithm (McMahan et al., 2016) to the automatically labeled data remains almost unchanged when K is set to 50. (3) As the size of the local datasets increases, all denoising methods achieve better results. The most likely reason is that, compared with setting K to 100, setting K to 50 increases the probability that all sentences with the same entity pair exist on the same platform.

4.6. VARYING THE NUMBER OF LOCAL UPDATES

In this section, we investigate the impact of varying the number of local updates. The number of local updates is given by E|D'_i|/B, where |D'_i| is the size of the denoised dataset on platform i in a round, B is the local minibatch size and E is the number of local epochs. Increasing B, decreasing E, or both will reduce the computation in each round. We fix C to 0.1, and only B and E are varied in this section. The results are shown in Figure 4. We find that: (1) Compared with the other denoising baselines, our method converges faster to the optimal results. We conjecture that this is because the proposed denoising method can effectively filter out the noise, which makes the relation extractor less affected by false positive instances and faster to converge. (2) When setting B to 64 and E to 1, our method achieves the best AUC value. (3) Increasing the local minibatch size B may improve extraction performance. (4) Increasing the number of local epochs E can speed up convergence, but may not make the global model converge to a higher AUC value. These findings are in line with McMahan et al. (2016), which shows that over-optimizing on the local dataset may hurt performance.
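The number-of-updates expression can be computed directly. The sketch below adds a ceiling to account for the final partial minibatch, a detail the formula E|D'_i|/B leaves implicit; the function name is ours.

```python
import math

def local_updates(E, dataset_size, B):
    """Number of local gradient updates per communication round:
    E epochs times the number of size-B minibatches in the denoised set."""
    return E * math.ceil(dataset_size / B)

# E.g., the default setting E = 3, B = 32 on a denoised set of 100 sentences:
updates = local_updates(3, 100, 32)  # 3 epochs * 4 batches = 12 updates per round
```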

5. CONCLUSION

Considering data decentralization and privacy protection, we investigate distant supervision under the federated learning paradigm, which permits learning to be done while data stays in its local environment. To suppress label noise in federated settings, we propose a federated denoising framework, which selects reliable instances via cross-platform collaboration. Extensive experiments on two datasets have demonstrated the effectiveness of our method. Distant supervision in federated settings is far from solved, and this work is just the beginning. There are still many problems to be solved, such as the noisy bag problem (Xu et al., 2013; Liu et al., 2017) and the shifted label problem (Ye et al., 2019). The noisy bag problem means that all sentences containing the same entity pair are incorrectly labeled, and the shifted label problem means that the label distribution of the training set does not align with that of the test set. In federated settings, how these problems affect the relation extractor is still unknown. In future work, we will devote ourselves to solving these problems in federated settings.

A CASE STUDIES

Table 5 shows how different denoising methods select reliable instances in the training phase. In this case, a KB fact is (Podgorica, /location/country/capital, Montenegro). Aligning this KB fact with decentralized raw text generates four training instances, which are distributed on four different platforms. Only the sentence on Platform 26 correctly represents the "/location/country/capital" relation. The sentences on the other platforms are all false positive instances, which do not express the "/location/country/capital" relation.

From this case, we can find that: (1) If the FedAvg algorithm (McMahan et al., 2016) were directly applied to the automatically labeled data, it would face a noisy environment where most sentences are false positives. (2) Previous denoising methods, such as ONE (Zeng et al., 2015), ATT (Lin et al., 2016) and ATT RA (Ye & Ling, 2019), all fail to filter out the false positive instances. In the worst cases, these methods lose their denoising function entirely. (3) Our proposed method removes all false positive instances and keeps only the true positive instance to train local models.

B ADDITIONAL EXPERIMENTS

B.1 CAN A STRONG EXTRACTOR MITIGATE THE LABEL NOISE IN FEDERATED SETTINGS?

In this section, we investigate the impact of a stronger extractor. More concretely, we replace the PCNN-based extractor with a BERT-based extractor (Devlin et al., 2018). In the BERT-based extractor, we use the entity mention pooling architecture (Soares et al., 2019) to represent relations with the Transformer model (Vaswani et al., 2017), as shown in Figure 5. Given a sentence s and two entities within it, we first segment the sentence into tokens by byte pair encoding (Sennrich et al., 2016) and feed these tokens into the BERT encoder, whose output is a context-aware embedding for each token. We then apply max pooling over the context-aware embeddings of the word pieces in each entity mention to get two vectors h_e1 and h_e2 representing the two entity mentions. Finally, we concatenate these two vectors to obtain the relation representation.

Table 10: P@100, P@200, P@300 and their mean for each model in held-out evaluation on the NYT 10 dataset and the MIRGENE dataset when K is set to 50 and C is set to 0.1.
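The entity mention pooling step can be sketched as follows. This is a minimal stand-in: plain Python lists represent the encoder's context-aware token embeddings, and the spans and values are illustrative rather than real BERT outputs.

```python
def mention_pool(token_embs, span):
    """Max-pool the context-aware embeddings over the word pieces of one
    entity mention; span = (start, end) token indices, end exclusive."""
    lo, hi = span
    dim = len(token_embs[0])
    return [max(token_embs[t][d] for t in range(lo, hi)) for d in range(dim)]

def relation_repr(token_embs, span_e1, span_e2):
    """Concatenate the pooled mention vectors h_e1 and h_e2 to represent
    the relation between the two entities."""
    return mention_pool(token_embs, span_e1) + mention_pool(token_embs, span_e2)

# Toy encoder output: three tokens with 2-dimensional embeddings.
embs = [[0.0, 1.0], [2.0, 0.0], [1.0, 5.0]]
rep = relation_repr(embs, span_e1=(0, 2), span_e2=(2, 3))
# rep == [2.0, 1.0, 1.0, 5.0]: h_e1 followed by h_e2
```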



Footnotes:
1. A set of sentences containing the same entity pair is called a "bag".
2. https://github.com/thunlp/OpenNRE
3. https://github.com/leebird/bionlp17
4. The lr, lr decay, weight decay and dropout are fixed to 0.1, 0.01, 1e-5 and 0.1, respectively, which are not the optimal hyperparameters for most experiments.
5. https://github.com/google-research/bert
6. https://en.wikipedia.org/wiki/Master/slave_(technology)




4.1. DATASETS AND EVALUATION METRICS

Since conducting experiments on non-public privacy-sensitive datasets is not reproducible, we choose public distantly supervised relation extraction datasets to investigate the effectiveness of the proposed framework. NYT 10 (Riedel et al., 2010) is a widely used dataset in distant supervision. It was automatically generated by aligning the semantic triples in Freebase with the New York Times corpus. The training set contains 466,876 sentences, 251,928 entity pairs and 16,444 relational facts. The development set contains 55,167 sentences, 28,077 entity pairs and 1,808 relational facts. The test set contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts. There are 52 actual relations and a special relation NA representing no relation between the two entities. MIRGENE (Li et al., 2017) is a large-scale biomedical dataset, generated by aligning Tarbase and miRTarBase with abstracts in Medline. An example is the following: "MicroRNA-223 regulates FOXO1 expression and cell proliferation", where MicroRNA-223 is a miRNA and FOXO1 is a gene. There are 172,727 sentences in the training set and 1,239 sentences in the test set. We randomly sampled 10% of the training set as the development set.

Figure 4: AUC values vs. communication rounds on NYT data with different E (the number of local epochs) and B (the local minibatch size).

Figure 5: The main architecture for BERT-based extractor.

Figure 7: Aggregate precision-recall curves on NYT 10 dataset and MIRGENE dataset when K is set to 50 and C is set to 0.1.


Table 1: The search space of unfixed hyperparameters.

Figure 2: Aggregate precision-recall curves on NYT 10 dataset, where C is the fraction of platforms that are activated in each round.

Table 2: AUC values on NYT 10 dataset. We run 10 models using random seeds with early stopping on the development set, and report the mean and standard deviation of the test AUC values for all methods.

AUC values on NYT 10 dataset and MIRGENE dataset when K = 50.

Table 5: A case to illustrate the effectiveness of the proposed model. A fact in the KB is (Podgorica, /location/country/capital, Montenegro). Only the sentence on Platform 26 expresses the "/location/country/capital" relation, while the other sentences are all false positives.

From the result, we find that: (1) Involving a stronger encoder can improve the performance for all denoising methods. (2) Whether leveraging PCNN or BERT as the encoder, our method significantly outperforms all baselines.

The AUC values of the PCNN-based extractor and the BERT-based extractor on NYT 10 dataset and MIRGENE dataset when K is set to 100 and C is set to 0.1.

In Algorithm 3, we present the federated framework of the denoising baselines. Compared with the FedAvg algorithm (McMahan et al., 2016), we only add one denoising step in local training. Compared with the proposed federated denoising framework, local platforms in the baseline framework handle label noise based only on their own local data.

D SOME TABLES AND FIGURES MENTIONED IN THE MAIN TEXT

Table 8: AUC values on NYT 10 dataset and MIRGENE dataset.

P@100, P@200, P@300 and their mean for each model in held-out evaluation on NYT 10 dataset and MIRGENE dataset.


To ablate parallel computation from the conventional federated learning topology, we propose a simple alternative, called chain learning, to handle decentralized data. In chain learning, we train only one local platform at a time, and then synchronize the model parameters to the next platform for further training. We show the architecture of chain learning in Figure 6.
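Chain learning can be sketched as a purely sequential parameter hand-off. In this illustrative snippet (names are ours), `train_fn` stands in for one platform's local training step; no parallel aggregation is involved.

```python
def chain_learning(theta, platforms, train_fn):
    """Train on one platform at a time, passing the updated parameters along
    the chain; the last platform's output is the final model."""
    for local_data in platforms:
        theta = train_fn(theta, local_data)
    return theta

# Toy hand-off: each platform "trains" by adding the size of its local dataset.
final = chain_learning(0, [[1, 2], [3]], lambda th, d: th + len(d))
# final == 3 after visiting both platforms in order
```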

