ITERATIVE TASK-ADAPTIVE PRETRAINING FOR UNSUPERVISED WORD ALIGNMENT

Abstract

How to establish a closer relationship between pre-training and the downstream task is a valuable question. We argue that task-adaptive pretraining should not be performed only before the task. For word alignment, we propose an iterative self-supervised task-adaptive pretraining paradigm that ties together word alignment and self-supervised pretraining through code-switching data augmentation. Once we obtain the aligned pairs predicted from multilingual contextualized word embeddings, we use these pairs and the original parallel sentences to synthesize code-switched sentences. The multilingual model is then further finetuned on the augmented code-switched dataset, and the finetuned model is used to produce new aligned pairs. This process is executed iteratively. Our paradigm is suitable for almost all unsupervised word alignment methods based on multilingual pre-trained LMs and does not need gold labeled data, extra parallel data or any other external resources. Experimental results on six language pairs demonstrate that our paradigm consistently improves the baseline methods. Compared to resource-rich languages, the improvements on relatively low-resource or morphologically different languages are more significant. For example, the AER scores of three different alignment methods based on XLM-R are reduced by about 4 ∼ 5 percentage points on the language pair En-Hi.

1. INTRODUCTION

Although pre-trained language models (PTLMs) (Devlin et al., 2019b; Conneau et al., 2020) trained with massive textual and computational resources have achieved high performance in natural language processing tasks, there can be a distributional mismatch between the pretraining and target domain corpora. To tackle domain discrepancies, domain-adaptive pretraining with a large corpus in the domain of the downstream task is often employed, as in BioBERT (Lee et al., 2020). However, this approach requires large corpora in the target domain and entails a high computational cost. Gururangan et al. (2020) propose task-adaptive pretraining and explore the benefits of continued pretraining on data from the task distribution. Other works (Gu et al., 2020; Karouzos et al., 2021; Nishida et al., 2021) also focus on establishing a closer relationship between pre-training and the downstream task. For example, Gu et al. (2020) add a task-guided pre-training stage with selective masking between general pre-training and fine-tuning. Karouzos et al. (2021) simultaneously minimize a task-specific loss on the source data and a language modeling loss on the target data during fine-tuning.

However, these methods generally follow a fixed paradigm: task-adaptive pretraining, then task training. There is an obvious lack of interactive feedback. Can the output of the task be used to improve pretraining (see Figure 1)? We find that an iterative self-supervised task-adaptive pretraining paradigm can be designed for unsupervised word alignment. In the following, we describe in detail how we design this new paradigm.

Continued pretraining of an LM on the unlabeled data of a given task (task-adaptive pretraining) (Gururangan et al., 2020) has been shown to be beneficial for task performance. However, we think that simply pre-training LMs with MLM or TLM on the original (non-code-switched) parallel sentences is not closely integrated with the word alignment task.

Based on the assumption that a closer interaction between task pre-training and the task itself can improve performance, we propose an iterative self-supervised continued pretraining paradigm that constantly pushes pre-trained LMs toward the word alignment task. We augment sentences with self-labeled pairs and a code-switching strategy. Once we obtain the aligned pairs predicted from the multilingual contextualized word embeddings, we use these pairs and the original parallel sentences to synthesize code-switched sentences. The multilingual model is then further finetuned on the augmented code-switched dataset, and the finetuned model is used to produce new aligned pairs. This process is executed iteratively. On the one hand, data augmentation with self-labeled pairs and code-switching can alleviate data scarcity. On the other hand, training LMs with MLM on code-switched sentences can bring the different languages closer together in the embedding space, and if we train LMs on both the code-switched sentences and the corresponding original sentences, the code-switched tokens (the predicted pairs) will also move towards each other in the embedding space.

Figure 1: Continued pretraining of an LM on the unlabeled data of a given task (task-adaptive pretraining) (Gururangan et al., 2020) has been shown to be beneficial for task performance. However, it only occurs before the task and is not closely integrated with it. For the word alignment task, we propose a new paradigm in which task-adaptive pretraining and word alignment are executed iteratively.

Our contributions can be listed as follows:

• We design an iterative task-adaptive pretraining paradigm for word alignment, in which task-adaptive pretraining is performed not only before the task but also after it. In other words, task-adaptive pretraining and word alignment are executed iteratively.
• Our paradigm is suitable for almost all unsupervised word alignment methods based on multilingual pre-trained LMs and does not need gold labeled data, extra parallel data or any other external resources. We also do not need to introduce a carefully designed loss function, and the results can be easily reproduced.
• In-depth analysis reveals that training a model with standard masked language modeling on source-language, target-language and code-switched sentences approximately optimizes the similarity of the switched tokens. This can serve as a potential explanation of why code-switching can be used to improve machine translation and cross-lingual tasks.
• We perform experiments on six language pairs and demonstrate that our paradigm consistently improves the baseline methods. For example, the AER scores of three different alignment methods based on XLM-R are reduced by about 4 ∼ 5 percentage points on the language pair En-Hi.

2.1. TASK ADAPTIVE PRETRAINING

Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. Gururangan et al. (2020) propose domain-adaptive pretraining and task-adaptive pretraining. They explore the benefits of continued pretraining on data from the task distribution and the domain distribution. Gu et al. (2020) 

2.2. WORD ALIGNMENT

Statistical alignment models build directly on lexical translation models such as the IBM models (Brown et al., 1993), and their implementations Giza++ (Och & Ney, 2003), fast-align (Dyer et al., 2013) and eflomal (Östling & Tiedemann, 2016) are widely used for alignment. Based on NMT models (Bahdanau et al., 2015) trained on parallel corpora, researchers have proposed several methods to extract alignments from them (Cohn et al., 2016; Zenkel et al., 2019; Garg et al., 2019; Chen et al., 2021; Zenkel et al., 2020a; Chatterjee et al., 2022). There is also work on supervised neural word alignment (Stengel-Eskin et al., 2019; Nagata et al., 2020). For example, Nagata et al. (2020) present a supervised word alignment method based on cross-language span prediction and formalize word alignment as a question answering task. However, supervised data are not always accessible, making these methods inapplicable in many scenarios.

2.3. CODE-SWITCHING

Code-switching is a prevalent phenomenon in multilingual communities, where words, morphemes and phrases from two or more languages are switched in speech or writing. It has been employed to improve NMT tasks (Yang et al., 2020a; Liu et al., 2021; Yang et al., 2020b) and cross-lingual tasks (Qin et al., 2020; Zhang et al., 2021; Lee et al., 2021). Most related work attributes the improvement in model performance to an intuitive assumption: training models on code-switched data encourages them to align representations from the source and target languages by mixing their context information. In this work, we dig deeper and give an approximate formal explanation of the connection between word alignment and masked language modeling.

3. METHOD

Although the multilingual contextualized embeddings of pre-trained LMs can achieve reasonable performance even without explicit training on parallel data, there is still a clear gap: the model is trained with a language modeling loss but tested on the word alignment task. Dou & Neubig (2021) leverage pre-trained LMs and fine-tune them on parallel texts with new objectives designed to improve alignment quality. However, they need large amounts of extra parallel sentences, at least thousands of times more than the test sentences, which limits their application to low-resource languages and to settings without parallel text. This leads to a practical and valuable setting: if we only have a few parallel sentences that need to be aligned and a pre-trained LM, can we further improve the performance of alignment methods?

Figure 2 illustrates an overview of our paradigm. More accurate alignment results in higher-quality code-switched sentences, and finetuning on higher-quality code-switched sentences encourages pretrained LMs to align representations from the source and target languages by mixing their context information. A better pretrained LM in turn improves the accuracy of alignment, so the two steps reinforce each other across iterations. In the following paragraphs, we elaborate on each part.
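The iteration described above can be sketched as a small driver loop. This is only a minimal sketch under our own assumptions, not the released implementation: the callables `align`, `augment` and `finetune` are hypothetical placeholders for the alignment method, the code-switched augmentation and the MLM finetuning step, and the convention that round 0 uses empty alignment sets follows the description of Iter 0 in the text.

```python
from typing import Callable, List, Set, Tuple

Pair = Tuple[int, int]                      # (source index, target index)
Sentence = List[str]
Bitext = List[Tuple[Sentence, Sentence]]    # list of (source, target) sentences
Alignments = List[Set[Pair]]                # one alignment set per sentence pair


def iterative_tapt(
    bitext: Bitext,
    model: object,
    align: Callable[[object, Bitext], Alignments],
    augment: Callable[[Bitext, Alignments, Alignments], list],
    finetune: Callable[[object, list], object],
    iterations: int = 3,
):
    """Alternate MLM finetuning and word alignment (hypothetical driver)."""
    previous: Alignments = [set() for _ in bitext]  # A_t, empty at the start
    current: Alignments = [set() for _ in bitext]   # A_{t+1}, empty at the start
    # Round 0 sees no aligned pairs, so it reduces to plain task-adaptive
    # pretraining on the original sentences; later rounds add code-switching.
    for _ in range(iterations + 1):
        corpus = augment(bitext, previous, current)
        model = finetune(model, corpus)
        previous, current = current, align(model, bitext)
    return model, current
```

Passing the three steps in as callables keeps the sketch independent of any particular alignment method or training library.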

3.1. WORD ALIGNMENT

The task of word alignment can be defined as follows. Given a source-language sentence $x = \langle x_1, \cdots, x_n \rangle$ of length $n$ and its target-language translation $y = \langle y_1, \cdots, y_m \rangle$ of length $m$, a word alignment method needs to find a set of pairs of source and target words:

$A = \{\langle x_i, y_j \rangle : x_i \in x, y_j \in y\}$

Aligned words are assumed to correspond to each other, i.e., for each word pair $\langle x_i, y_j \rangle$, $x_i$ and $y_j$ are semantically similar to each other within the context of the sentence. We focus on improving methods that leverage multilingual pre-trained LMs (Devlin et al., 2019b; Conneau et al., 2020) for word alignment by extracting alignments from similarity matrices induced from their contextualized embeddings, without relying on parallel training data. For each pair of parallel sentences $x$ and $y$, these methods extract the hidden states of the $k$-th layer of the multilingual model: $h^k_x = \langle h^k_{x_1}, \cdots, h^k_{x_n} \rangle$ and $h^k_y = \langle h^k_{y_1}, \cdots, h^k_{y_m} \rangle$.

Given these contextualized word embeddings, there are many methods to obtain alignments. For example, a simple and effective method, Argmax (Sabet et al., 2020), aligns $x_i$ and $y_j$ when $h^k_{x_i}$ is the most similar to $h^k_{y_j}$ and vice versa. That is, we set $A_{ij} = 1$ if $i = \arg\max_l S^k_{l,j} \wedge j = \arg\max_l S^k_{i,l}$, and $A_{ij} = 0$ otherwise, where $S^k_{ij} = \mathrm{sim}(h^k_{x_i}, h^k_{y_j})$ is some normalized measure of similarity, e.g., cosine similarity. Some other methods frame alignment as an assignment problem (Sabet et al., 2020) or a regularized optimal transport problem (Dou & Neubig, 2021; Chi et al., 2021), defined by

$A = \arg\max_{A \in \{0,1\}^{l_e \times l_f}} \sum_{i=1}^{l_e} \sum_{j=1}^{l_f} A_{ij} S_{ij}$

where $l_e$ and $l_f$ are the lengths of the two sentences. In our setting, these methods are used to self-label the parallel sentences.
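The Argmax rule can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the embeddings are already available as lists of floats and that cosine similarity is used; the function names are ours and are not taken from the SimAlign repository.

```python
import math
from typing import List, Set, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def argmax_align(src: List[List[float]], tgt: List[List[float]]) -> Set[Tuple[int, int]]:
    """Keep (i, j) only if j is the best match of i and i is the best match of j."""
    if not src or not tgt:
        return set()
    sim = [[cosine(h_x, h_y) for h_y in tgt] for h_x in src]
    pairs = set()
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)            # argmax over targets
        best_i = max(range(len(sim)), key=lambda l: sim[l][j])   # argmax over sources
        if best_i == i:
            pairs.add((i, j))
    return pairs
```

Ties are broken in favor of the lowest index here, which is a choice of this sketch rather than something the text specifies.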
In order to get high-quality aligned pairs, we filter the pairs with a threshold $\epsilon$:

$A_{filter} = \{\langle x_i, y_j \rangle : S^k_{ij} > \epsilon\}$ (2)

With $p$ aligned pairs, switching any subset of them gives at most

$C_p^0 + C_p^1 + C_p^2 + \cdots + C_p^p = 2^p$ (3)

different code-switched sentence pairs, so the amount of augmented data can grow exponentially in $p$. In practice, we do not exhaust all code-switched sentences; the sampling method is given in Algorithm 1. We set $A_0 = \emptyset$ and $A_1 = \emptyset$ so that Algorithm 1 still works when $t = 0$; in this setting, the 0-th iteration is exactly standard task-adaptive pretraining. When $t > 1$, the alignment results of two successive iterations are used so that we can assign different sampling probabilities to the intersection and to the new aligned pairs. In general, we give a higher probability to new aligned pairs. In addition, we also add the original parallel sentences to the augmented dataset that is prepared for masked language modeling.

Algorithm 1: Augmented Sentences Sampling
input: A pair of parallel sentences $\langle x, y \rangle$, alignment of the $t$-th iteration $A_t$, alignment of the $(t+1)$-th iteration $A_{t+1}$, sampling probability $P_{old}$ for old pairs, sampling probability $P_{new}$ for new pairs, the number of sampling rounds $rounds$.
output: Augmented sentences $S_{aug}$
$S_{aug} \leftarrow \emptyset$
$A_{old} \leftarrow A_t \cap A_{t+1}$
$A_{new} \leftarrow A_{t+1} \setminus (A_t \cap A_{t+1})$
$SET_{AP} \leftarrow \{(A_{old}, P_{old}), (A_{new}, P_{new})\}$
for $i \leftarrow 1$ to $rounds$ do
    for $(A, P) \in SET_{AP}$ do
        for $pair \in A$ do
            if $P \geq$ random.random() then
                $\langle x, y \rangle$ = Code-switch($\langle x, y \rangle$, $pair$)
    $S_{aug} \leftarrow S_{aug} \cup \langle x, y \rangle$
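Algorithm 1 can be written as a short runnable sketch. Several details below are our assumptions rather than statements from the text: each round starts again from the original sentence pair, one augmented pair is collected per round, a pair $(i, j)$ is represented by token indices, and code-switching swaps the two aligned words between the source and the target sentence. The names `sample_augmented` and `rng` are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple

Pair = Tuple[int, int]  # (source index i, target index j)


def sample_augmented(
    x: List[str],
    y: List[str],
    a_t: Set[Pair],
    a_t1: Set[Pair],
    p_old: float,
    p_new: float,
    rounds: int,
    rng: Optional[random.Random] = None,
) -> Set[Tuple[Tuple[str, ...], Tuple[str, ...]]]:
    """Sample code-switched variants of (x, y) from old and new aligned pairs."""
    rng = rng or random.Random()
    a_old = a_t & a_t1            # pairs predicted in both iterations
    a_new = a_t1 - a_old          # pairs that only appear in the newer iteration
    augmented = set()
    for _ in range(rounds):
        x_cs, y_cs = list(x), list(y)   # assumption: restart from the originals
        for pairs, prob in ((a_old, p_old), (a_new, p_new)):
            for i, j in sorted(pairs):
                if prob >= rng.random():
                    # Swap the aligned words, always reading from the originals.
                    x_cs[i], y_cs[j] = y[j], x[i]
        augmented.add((tuple(x_cs), tuple(y_cs)))
    return augmented
```

Because the result is a set, rounds that happen to sample the same switches collapse into a single augmented pair, so the output can contain fewer than `rounds` entries.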

3.3. MLM ON CODE-SWITCHED DATASET

For masked language modeling (MLM), the input sequence $x$ consists of multiple individual tokens. A fraction of the input tokens is chosen randomly and replaced with <MASK> tokens. Assume that the masked indices are collected in a set $mask(x)$, and let $\hat{x}$ denote a masked token. The model with parameters $\theta$ then learns to predict the tokens in $mask(x)$ from the surrounding unmasked tokens $x_{\setminus mask(x)}$:

$L_{MLM} = -\sum_{\hat{x} \in mask(x)} \log P(\hat{x} \mid x_{\setminus mask(x)}; \theta)$

Training the model on code-switched and original parallel sentences encourages it to align representations from the source and target languages by mixing their context information. More importantly, self-labeled word pairs share the same surrounding tokens. Training on these sentences therefore aligns the code-switched tokens implicitly, and the predicted pairs move towards each other in the embedding space. For example, in Figure 2, "Das" and "This" form an aligned pair. When we train the model on the code-switched sentence "Das is quite a good idea." and the original sentence "This is quite a good idea.", the word "Das" and the English word "This" move towards each other in the embedding space because they have the same surrounding tokens. We dig deeper and give an approximate formal explanation of the connection between word alignment and masked language modeling, which can also serve as a potential explanation of why code-switching can be used to improve machine translation and cross-lingual tasks.

Proposition: Training a model with standard masked language modeling on source-language, target-language and source-target code-switched sentences is approximately optimizing:

$c^{src}_{x_i} \sim e_{x_i} \sim e_{y_i} \sim c^{tgt}_{y_i}$ (4)

where $c$ denotes a contextualized embedding, $e$ denotes a word embedding in the vocabulary, and $\sim$ denotes similarity (see the appendix for details). We assume that this approximate similarity is sufficient to induce word alignment. Then we have:

Corollary: The performance based on the last layer of the optimized model (with iterative task-adaptive pretraining) is better than the best results (usually from the eighth layer) of the baseline model (without iterative task-adaptive pretraining).

In Section 4, we test whether this corollary holds in experiments.
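The masking step of MLM can be sketched as follows. This is an illustrative sketch only: the 80%/10%/10% replacement rule is the setting reported in Section 4.3, while the default `mask_prob = 0.15` is the usual BERT value and is our assumption, since the text only says "a fraction". The function name and the use of `None` for unselected positions are also ours.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_prob: float = 0.15,          # assumption: BERT's usual masking rate
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); labels are None where no loss is computed."""
    rng = rng or random.Random()
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # this position is predicted
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: special mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                inputs.append(tok)                # 10%: original token kept
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

Applied to a code-switched sentence and its original counterpart, the same positions can end up masked with identical surrounding context, which is the situation analyzed in the proposition.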

4. EXPERIMENTS

Our experiments focus on three questions: (1) to what extent our paradigm can improve word alignment across methods, models, languages and layers; (2) what effect the number of augmented sentences and the sampling probability have; and (3) what each component contributes, as measured by a detailed ablation study.

4.1. DATASET

Our test data are a diverse set of six language pairs: Persian, Czech, German, French, Hindi and Romanian, always paired with English. All of them are public datasets: En-Fa (Tavakoli & Faili), En-Cs (Marecek), En-De (footnote 1), En-Fr (Och & Ney, 2000), and En-Hi and En-Ro (footnote 2). See Table 1 for the dataset statistics, where the pairs are English-Hindi (En-Hi), English-Persian (En-Fa), English-Czech (En-Cs), English-German (En-De), English-French (En-Fr) and English-Romanian (En-Ro). "Size" refers to the number of sentences, S denotes sure alignments and P denotes possible alignments (S ⊂ P).

4.2. EVALUATION MEASURES

We use Alignment Error Rate (AER) (Och & Ney, 2003) as the standard evaluation measure:

$AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}$ (5)

where $A$ is the set of predicted alignment edges, $S$ is the set of sure (unambiguous) alignments and $P$ is the set of possible (ambiguous) alignments, with $S \subset P$. We report AER as a percentage.
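Equation (5) translates directly into code. The sketch below assumes alignments are given as sets of (source index, target index) tuples and returns a fraction rather than a percentage; taking the union of `possible` with `sure` enforces S ⊂ P even if a gold file lists the two sets separately. Returning 0.0 when both the prediction and the sure set are empty is a convention of this sketch.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]  # (source index, target index)


def aer(predicted: Iterable[Edge], sure: Iterable[Edge], possible: Iterable[Edge]) -> float:
    """Alignment Error Rate: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s              # sure edges are also possible edges
    if not a and not s:
        return 0.0                     # convention for the degenerate empty case
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

For example, predicting both edges of a sentence whose gold annotation has one sure edge and one additional possible edge gives an AER of 0, because possible edges are not penalized when predicted and not required when missing.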

4.3. BASE METHODS AND MODELS

Our experiments focus on three alignment methods based on multilingual pretrained models: Argmax, Itermax and Match, proposed in SimAlign (Sabet et al., 2020). We follow the default settings of the repository (footnote 3) without any modification and use the contextualized embeddings from the 8-th layer. We employ two multilingual pretrained models: multilingual BERT (mBERT), which is pretrained on the 104 largest Wikipedia languages, and XLM-RoBERTa base (Conneau et al., 2020), which is pretrained on 100 languages using cleaned CommonCrawl data (Wenzek et al., 2020). Because pretrained LMs often use subword segmentation techniques (Kudo & Richardson, 2018; Sennrich et al., 2016) and the above alignment extraction methods produce alignments only at the subword level, we follow previous work (Sabet et al., 2020; Zenkel et al., 2020b; Dou & Neubig, 2021) and consider two words to be aligned if any of their subwords are aligned.

For masked language modeling, we follow Devlin et al. (2019a): a token selected for masking is replaced with the special [MASK] token 80% of the time, with a random token 10% of the time, and is left unchanged 10% of the time. The batch size is 32 and the maximum number of epochs is 10. We use the Adam optimizer with a learning rate of 2e-5 and a weight decay of 0.03. We set the filtering threshold to 0.9. When the number of iterations is 1, there is only one alignment set, and we set the sampling probability to 0.7. When the number of iterations is more than 1, we set the sampling probability $P_{old}$ to 0.7 and $P_{new}$ to 1.0. The number of sampling rounds defaults to 5.
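The rule "two words are aligned if any of their subwords are aligned" can be sketched in a few lines. We assume, for illustration, that each sentence comes with a list mapping every subword position to the index of the word it belongs to; the name `word_ids` is ours and is not taken from any particular tokenizer API.

```python
from typing import Iterable, List, Set, Tuple


def subword_to_word_alignment(
    subword_pairs: Iterable[Tuple[int, int]],
    src_word_ids: List[int],
    tgt_word_ids: List[int],
) -> Set[Tuple[int, int]]:
    """Project subword-level alignment edges onto word-level edges."""
    # A word pair is kept as soon as one of its subword pairs is aligned;
    # the set removes duplicates coming from several aligned subwords.
    return {(src_word_ids[i], tgt_word_ids[j]) for i, j in subword_pairs}
```

For instance, if source subwords 0 and 1 both belong to word 0, any edge starting from either subword becomes an edge starting from word 0.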

4.4. RESULTS

As shown in Table 2, we report the F1 and AER scores on the six language pairs. Table 2 also includes some widely used algorithms that are not based on pretrained models, such as Giza++ (Och & Ney, 2003), fast-align (Dyer et al., 2013) and eflomal (Östling & Tiedemann, 2016). Our method improves the quality of the alignment pairs generated by all three alignment methods. Recall from Section 3.2 that standard task-adaptive pretraining is equivalent to Iter 0. As the number of iteration rounds increases, our method keeps improving model performance, which shows that iterative task-adaptive pretraining is effective. In addition, there is a notable trend: the worse the initial performance, the larger the improvement. For En-Cs, En-De and En-Fr, the baseline methods perform well and the AER scores are generally lower than 20 percent. After three iterations, the AER scores are reduced by about 1.5 points on average. For En-Ro, En-Fa and En-Hi, which are relatively low-resource or morphologically different language pairs, the baseline methods perform poorly and the AER scores are generally greater than 30 percent. After three iterations, the AER scores are reduced by about 4 ∼ 5 points on average. This result is obtained without any additional parallel sentences or gold labels.

In the following experiments, we use the alignment method Argmax for further analysis and explore word alignment across different layers. Figure 3 and Table 5 show that our paradigm consistently improves baseline performance across all layers. For both En-Hi and En-Fa, the performance of the last layer with our method consistently outperforms the best result of the baseline models. So the Corollary in Section 3.3 holds, and our assumption that "this approximate similarity is sufficient to induce word alignment" is supported by the experiments.

4.5. ANALYSIS

We further explore the effect of the sampling probability and the number of augmented sentences on model performance. We choose XLM-RoBERTa base as the multilingual pretrained model and Argmax as the alignment method. Given the large number of possible experiments when considering six language pairs, we do not present all scores for all languages and instead pick four of them in most cases: En-De, an established and well-known dataset; En-Fa and En-Hi, two low-resource languages written in a different script; and En-Ro. Only two rounds of iterative training are performed (t=0,1) and we only list the final results.

Table 3 shows the Alignment Error Rate under different sampling probabilities. When the sampling probability of code-switching is too high or too low, the diversity of the augmented sentences declines, which may hurt model performance. Table 3 supports this point: although the final scores of different sampling probabilities are close, the intermediate probability 0.7 achieves the best scores across the four languages.

Table 4 shows the effect of the number of augmented sentences. The number of sampling rounds is proportional to the number of augmented sentences. From this table, we can infer that performance generally improves when more augmented sentences become available, but the gains diminish. At the same time, the cost of MLM training increases with more augmented sentences. So we set the number of sampling rounds to 5 unless stated otherwise.

4.6. ABLATION STUDIES

For the ablation study, we choose XLM-RoBERTa base as the multilingual pretrained model and Argmax as the alignment method. We consider three kinds of ablation. Table 6 lists the final iterative results. "NO-CS" means that only the original monolingual parallel sentences are used and there are no code-switched sentences. "Random" means that the pairs used for code-switching are randomly generated, so they are not gold pairs in most cases. Note that the "Random" dataset also includes the original monolingual parallel sentences. "NO-Filter" means that we do not use a threshold to filter the pairs, so all aligned pairs are used to produce code-switched sentences. The result of "NO-CS" indicates that the method can still improve performance without code-switched sentences. In fact, this setting is almost equivalent to standard task-adaptive pretraining and Iter 0 (Section 3.2). The comparison of "Random" and "NO-CS" shows that the improvement of "Random" mainly comes from the original monolingual parallel sentences, and the randomly code-switched sentences bring only a very slight boost. The comparison of "NO-Filter" and "Ours" indicates that filtering the pairs with a threshold is beneficial to model performance.

5. CONCLUSION

Inspired by the fact that continued pretraining of pretrained models on the unlabeled data of a given task has been shown to be beneficial for task performance, we design an iterative task-adaptive pretraining paradigm for word alignment, in which task-adaptive pretraining is performed not only before the task but also after it. The multilingual models are repeatedly finetuned on the augmented code-switched dataset, and the two steps of the iteration reinforce each other. More accurate alignment results in higher-quality code-switched sentences, and finetuning on higher-quality code-switched sentences encourages pretrained LMs to align representations from the source and target languages by mixing their context information. A better pretrained LM in turn improves the accuracy of alignment. Experimental results on six language pairs demonstrate that our paradigm consistently improves the baseline methods.

We are considering a more general paradigm of iterative task-adaptive pretraining and will apply it to other token-level tasks such as Named Entity Recognition and part-of-speech tagging. How to establish a connection between the downstream tasks and the self-supervised tasks of the pretraining stage is a key point, especially for low-resource tasks and languages. Prompt-based methods, in which downstream tasks are reformulated as language modeling, are alternative solutions, and we are trying to combine our iterative task-adaptive pretraining with prompt-based methods.

A APPENDIX

A.1 A PROOF OF MLM ON CODE-SWITCHED DATASET

Proposition. Training a model with standard masked language modeling on source-language, target-language and source-target code-switched sentences is approximately optimizing:

$c^{src}_{x_i} \sim e_{x_i} \sim e_{y_i} \sim c^{tgt}_{y_i}$

where $c^{src}_{x_i}$ represents the contextualized embedding of token $x_i$ in the source sentence and $c^{tgt}_{y_i}$ represents the contextualized embedding of token $y_i$ in the target sentence, and $e_{x_i}$ and $e_{y_i}$ are the word embeddings of tokens $x_i$ and $y_i$ in the vocabulary.

Note. Under the existing conditions, we cannot derive a strict bound, only an approximate conclusion. Given $a \in R^n$, $b \in R^n$ and $c \in R^n$, we write $a \sim b$ if the projection components of $a$ and $b$ onto another vector $c$ are the same: $a \cdot c = b \cdot c$. We think this is approximately reasonable for the word alignment task, because word alignment methods align two words as long as their similarity is higher than that of the other words in the two parallel sentences; the similarity does not need to exceed a fixed number. In Section 3.3 we give a corollary, and the subsequent experiments support this assumption.

Proof. We denote the embeddings of the corresponding original tokens as $e_1, e_2, \cdots, e_L$. The MLM objective $L_{MLM}(x)$ can be formulated as:

$-\frac{1}{|M|} \sum_{i \in M} \log \frac{\exp(m_i \cdot e_i)}{\sum_{k=1}^{|V|} \exp(m_i \cdot e_k)} = \frac{1}{|M|} \sum_{i \in M} \log \sum_{k=1}^{|V|} \exp(m_i \cdot e_k - m_i \cdot e_i)$

where $M$ denotes the set of masked tokens and $|V|$ is the size of the vocabulary $V$. $m_i$ is the hidden state of the last layer at the masked position and can be regarded as a fusion of the contextualized representations of the surrounding tokens.

Consider two sentences: a source-language sentence $x = \langle x_1, \cdots, x_{i-1}, x_i, x_{i+1}, \cdots, x_n \rangle$ of length $n$ and its code-switched sentence $x' = \langle x_1, \cdots, x_{i-1}, y_i, x_{i+1}, \cdots, x_n \rangle$, where $\langle x_i, y_i \rangle$ is an aligned pair. If we mask only $x_i$ in $x$ and $y_i$ in $x'$, then $x_{mask} = \langle x_1, \cdots, x_{i-1}, <mask>, x_{i+1}, \cdots, x_n \rangle = x'_{mask}$, so both masked inputs give the same hidden state $m$ at the masked position, and the loss function can be written as

$L_{MLM} = L_x + L_{x'} = \log \sum_{k=1}^{|V|} \exp(m \cdot e_k - m \cdot e_{x_i}) + \log \sum_{k=1}^{|V|} \exp(m \cdot e_k - m \cdot e_{y_i})$ (7)

The inequality below is easily proved:

$\max\{x_1, \ldots, x_n\} \le \log \sum_{i=1}^{n} e^{x_i} \le \max\{x_1, \ldots, x_n\} + \log n$

Applying it to $L_x$, whose exponent for $k = x_i$ is exactly 0, gives

$\max\{m \cdot e_1 - m \cdot e_{x_i}, \ldots, 0, \ldots, m \cdot e_{|V|} - m \cdot e_{x_i}\} \le \log \sum_{k=1}^{|V|} \exp(m \cdot e_k - m \cdot e_{x_i}) \le \max\{m \cdot e_1 - m \cdot e_{x_i}, \ldots, 0, \ldots, m \cdot e_{|V|} - m \cdot e_{x_i}\} + \log |V|$ (10)

In Ineq. (10), 0 is a fixed value. So when training with this loss function, the model is optimized to learn $m \cdot e_k - m \cdot e_{x_i} \le 0$ for all $k \in \{1, \ldots, |V|\}$. In other words, $m \cdot e_k \le m \cdot e_{x_i}$ for all $k$. When $k = y_i$, we have $m \cdot e_{y_i} \le m \cdot e_{x_i}$. Similarly, for $L_{x'}$, we have $m \cdot e_{x_i} \le m \cdot e_{y_i}$. So when training with the loss function $L_{MLM} = L_x + L_{x'}$, the model is optimized to learn $m \cdot e_{x_i} = m \cdot e_{y_i}$. This equation cannot ensure $e_{x_i} = e_{y_i}$, but it gives $e_{x_i} \sim e_{y_i}$ to some extent.

For standard masked language modeling, there is a probability that the original token will not be masked, and we use $c^{src}_{x_i}$ to represent the hidden state of the last layer, which is the contextualized embedding of token $x_i$. So we have $c^{src}_{x_i} \cdot e_k \le c^{src}_{x_i} \cdot e_{x_i}$ for all $k \in \{1, \ldots, |V|\}$. Obviously, $e_{x_i} \sim c^{src}_{x_i}$. Similarly, if we consider the target-language sentence $y = \langle y_1, \cdots, y_{i-1}, y_i, y_{i+1}, \cdots, y_n \rangle$, we have $e_{y_i} \sim c^{tgt}_{y_i}$. So training a model with masked language modeling on source-language, target-language and source-target code-switched sentences is approximately optimizing:

$c^{src}_{x_i} \sim e_{x_i} \sim e_{y_i} \sim c^{tgt}_{y_i}$

Footnotes:
1 http://www-i6.informatik.rwth-aachen.de/goldAlignment/
2 http://web.eecs.umich.edu/~mihalcea/wpt05/
3 https://github.com/cisnlp/simalign

Algorithm 1, loop body:
for $i \leftarrow 1$ to $rounds$ do
    for $(A, P) \in SET_{AP}$ do
        for $pair \in A$ do
            if $P \geq$ random.random() then
                $\langle x, y \rangle$ = Code-switch($\langle x, y \rangle$, $pair$)
    $S_{aug} \leftarrow S_{aug} \cup \langle x, y \rangle$

3.2. CODE-SWITCHED AUGMENTATION

Code-switching is a prevalent phenomenon in multilingual communities, where words, morphemes and phrases from two or more languages are switched in speech or writing. The switched items are usually semantically similar. Suppose we obtain $p$ different pairs of source and target words for the sentences $x = \langle x_1, \cdots, x_n \rangle$ and $y = \langle y_1, \cdots, y_m \rangle$; it is then easy to augment sentences by code-switching. The source word and the target word in a pair can be regarded as synonyms and can be switched. If we augment sentences by switching one token at a time, we have $C_p^1$ choices, which means we can get at most $C_p^1$ different sentence pairs. If the number of switched pairs ranges from 0 to $p$, the maximum number of different sentence pairs, without considering special cases, is given by Equation (3).

Equation (4), Section 3.3: $c^{src}_{x_i} \sim e_{x_i} \sim e_{y_i} \sim c^{tgt}_{y_i}$, where $c^{src}_{x_i}$ represents the contextualized embedding of token $x_i$ in the source sentence and $c^{tgt}_{y_i}$ represents the contextualized embedding of token $y_i$ in the target sentence, and $e_{x_i}$ and $e_{y_i}$ are the word embeddings of tokens $x_i$ and $y_i$ in the vocabulary. $c^{src}_{x_i} \sim e_{x_i}$ means that $c^{src}_{x_i}$ is similar to $e_{x_i}$. See the appendix for details.

Figure 3: Comparison between our method and the baseline across layers. Lower is better.
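The log-sum-exp bound used in the proof of Appendix A.1, $\max\{x_1, \ldots, x_n\} \le \log \sum_i e^{x_i} \le \max\{x_1, \ldots, x_n\} + \log n$, can be checked numerically. The sketch below is only a sanity check on toy values, not part of the proof; the helper names and the small tolerance `eps` for floating-point rounding are ours.

```python
import math
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def check_bound(values: Sequence[float], eps: float = 1e-12) -> bool:
    """True if max <= logsumexp <= max + log(n), up to rounding error."""
    lse = log_sum_exp(values)
    top = max(values)
    return top - eps <= lse <= top + math.log(len(values)) + eps
```

In the proof, the values play the role of the differences $m \cdot e_k - m \cdot e_{x_i}$. One of them is always 0, so the loss term is bounded below by 0 and approaches that floor only when all other differences become strongly negative.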

Table 1: Statistics of datasets. Test sentences of the six gold word alignment datasets used in our experiments.

Table 2: Evaluation results on six datasets. Argmax, Itermax and Match are three different alignment methods (Sabet et al., 2020) based on multilingual contextualized embeddings. Results are averaged over different runs. Best results are in bold. (For AER, lower is better.)

Table 4: The effect of the number of augmented sentences.

Table 5: Evaluation results of different layers.

Table 6: Ablation study of code-switching strategy.

