REPRESENTATION AND BIAS IN MULTILINGUAL NLP: INSIGHTS FROM CONTROLLED EXPERIMENTS ON CONDITIONAL LANGUAGE MODELING

Abstract

Inspired by the phenomenon of performance disparity between languages in machine translation, we investigate whether and to what extent languages are equally hard to "conditional-language-model". Our goal is to improve our understanding and expectation of the relationship between language, data representation, size, and performance. We study one-to-one, bilingual conditional language modeling through a series of systematically controlled experiments with the Transformer and the 6 languages from the United Nations Parallel Corpus. We examine character, byte, and word models in 30 language directions and 5 data sizes, and observe indications of a script bias on the character level, a length bias on the byte level, and a word bias that gives rise to a hierarchy in performance across languages. We also identify two types of sample-wise non-monotonicity: while word-based representations are prone to exhibit Double Descent, length can induce unstable performance across the size range studied, a novel meta phenomenon which we term erraticity. By eliminating statistically significant performance disparity on the character and byte levels through normalization of length and vocabulary in the data, we show that, in the context of computing with the Transformer, there is no complexity intrinsic to languages other than that related to their statistical attributes, and that performance disparity is not a necessary condition but a byproduct of word segmentation. Our application of statistical comparisons as a fairness measure also serves as a novel rigorous method for the intrinsic evaluation of languages, resolving a decades-long debate on language complexity. While all these quantitative biases leading to disparity are mitigable through a shallower network, we find room for a human bias to be reflected upon. We hope our work helps open up new directions in the area of language and computing that would be fairer and more flexible, and fosters a new transdisciplinary perspective for DL-inspired scientific progress.

1. INTRODUCTION

With a transdisciplinary approach to explore a space at the intersection of Deep Learning (DL) / Neural Networks (NNs), language sciences, and language engineering, we report our undertaking in use-inspired basic research: with an application-related phenomenon as inspiration, we seek fundamental scientific understanding through empirical experimentation. This is not an application or machine translation (MT) paper, but one that strives to evaluate and seek new insights on language in the context of DL, with a view to contributing to our evaluation, segmentation, and model interpretation practice in multilingual Natural Language Processing (NLP).

Our inspiration: performance disparity in MT. The use case that inspired our investigation is the disparity of MT results reported in Junczys-Dowmunt et al. (2016). Of the 6 official languages of the United Nations (UN), namely Arabic (AR), English (EN), Spanish (ES), French (FR), Russian (RU), and Chinese (ZH), results with target languages AR, RU, and ZH seem to be worse than those with EN/ES/FR, regardless of the algorithm, be it phrase-based Statistical MT (SMT/Moses (Koehn et al., 2007)) or Neural MT (NMT). The languages have the same amount of line-aligned, high-quality parallel data available for training, evaluation, and testing. This prompts the question: are some languages indeed harder to translate from or to?

Problem statement: are all languages equally hard to Conditional-Language-Model (CLM)? A similar question concerning (monolingual) language modeling (LMing) was posed in Cotterell et al. (2018) and Mielke et al. (2019), along with the introduction of a method to evaluate LMs with multiway parallel corpora (multitexts) in information-theoretic terms. To explicitly focus on modeling the complexities that may or may not be intrinsic to the languages, we study the more fundamental process of CLMing without performing any translation. This allows us to eliminate confounds associated with generation and other evaluation metrics. One could think of our effort as estimating conditional probabilities with the Transformer, with a bilingual setup where the perplexity of one target language (l_trg) is estimated given the parallel data in one source language (l_src), where l_src ≠ l_trg. We focus on the very basics and examine the first step in our pipeline, input representation, holding everything else constant. Instead of measuring absolute cross-entropy scores, we evaluate the relative differences between languages across 5 orders of magnitude of data size in 3 different representation types/levels. We consider bias to be present when performance disparity in our Transformer models is statistically significant.
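
The quantity compared throughout can be illustrated with a minimal sketch. It assumes only the standard definition of cross-entropy in bits (the sum of negative base-2 log probabilities over the target tokens) and is not the evaluation code used in the experiments; the function name `total_bits` and the toy probabilities are hypothetical.

```python
import math


def total_bits(token_probs):
    """Total number of bits needed to encode a target sequence.

    `token_probs` holds the probability the model assigned to each
    reference target token, conditioned on the source sentence and
    the preceding target tokens (hypothetical inputs for illustration).
    """
    return -sum(math.log2(p) for p in token_probs)


# Toy example: three target tokens predicted with these probabilities.
probs = [0.5, 0.25, 0.125]
print(total_bits(probs))  # 1 + 2 + 3 bits
```

Summing such totals over a held-out set of fixed size gives one score per language direction, which is what makes relative comparisons between target languages possible.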

1.1. SUMMARY OF FINDINGS AND CONTRIBUTIONS

In investigating performance disparity as a function of size and data with respect to language and representation on the Transformer in the context of CLMing, we find:
1. In a bilingual (one-to-one) CLMing setup, there is neutralization of source language instances, i.e. there are no statistically significant differences between source language pairs. Only pairs of target languages differ significantly (see Table 1).
2. We identify 2 types of sample-wise non-monotonicity on each of the primary representation levels we studied:
(a) Double Descent (Belkin et al., 2019; Nakkiran et al., 2020): on the word level, for all languages, performance at 10^2 lines is typically better than at 10^3 before it improves again at 10^4 and beyond. This phenomenon can also be observed in character models with ZH as a target language as well as on the word level with non-neural n-gram LMs;
(b) erraticity: performance is irregular and exhibits great variance across runs. We find sequence length to be predictive of this phenomenon. We show that this can be rectified by data transformation or hyperparameter tuning. In our study, erraticity affects AR and RU on the byte level, where the sequences are too long with UTF-8 encoding, and ZH when decomposed into strokes on the character level.
3. In eliminating performance disparity through lossless data transformation on the character and byte levels, we resolve language complexity (§ 4 and App. J). We show that, in the context of computing with the Transformer, unless word-based methods are used, there is no linguistic/morphological complexity applicable or necessary. There is no complexity that is intrinsic to a language aside from its statistical properties. Hardness in modeling is relative to and bounded by its representation level (representation relativity). On the character and byte levels, hardness is correlated with statistical properties concerning sequence length and vocabulary of a language, irrespective of its linguistic typological, phylogenetic, historical, or geographical profile, and can be eliminated. On the word level, hardness is correlated with vocabulary, and a complexity hierarchy arises through the manual preprocessing step of word tokenization. This complexity/disparity effected by word segmentation cannot be eliminated, because the definition of a "word" differs qualitatively across languages: it neither holds universally nor is suitable/consistent for fair crosslinguistic comparisons. We find clarification of this expectation of disparity necessary because more diligent error analyses need to be afforded instead of simply accepting massively disparate results or inappropriately attributing under-performance to linguistic reasons.
4. Representational units of finer granularity can help close the gap in performance disparity.
5. Bigger/overparameterized models can magnify/exacerbate the effects of differences in data statistics. Quantitative biases that lead to disparity are mitigable through numerical methods.

Outline of the paper. In § 2, we define our method and experimental setup. We present our results and analyses on the primary representations in § 3 and those from the secondary set of controls in § 4, in a progressive manner to ease understanding. Meta analyses on fairness evaluation, non-monotonic behavior, and discussion on biases are in § 5. Additional related work is in § 6. We refer our readers to the Appendices for more detailed descriptions/discussions and reports on supplementary experiments.

2. METHOD AND DEFINITIONS

Controlled experiments as basic research for scientific understanding. Using the United Nations Parallel Corpus (Ziemski et al., 2016), the data from which the MT results in Junczys-Dowmunt et al. (2016) stem, we perform a series of controlled experiments on the Transformer, holding the hyperparameter settings for all 30 one-to-one language directions from the 6 languages constant. We control for size (from 10^2 to 10^6 lines) and language with respect to representational granularity. We examine 3 primary representation types (character, byte (UTF-8), and word), and upon encountering some unusual phenomena, we perform a secondary set of controls with 5 alternate representations. On the character level: Pinyin and Wubi (ASCII representations for ZH phones and character strokes, respectively); on the byte level: code page 1256 (for AR) and code page 1251 (for RU); and on the word level: Byte Pair Encoding (BPE) (Sennrich et al., 2016), an adapted compression algorithm from Gage (1994). These symbolic variants allow us to manipulate the statistical properties of the representations, while staying as "faithful" to the language as possible. We adopt this symbolic data-centric approach because we would like to more directly interpret the confounds, if any, that make language data different from other data types. We operate on a smaller data size range as this is more common in traditional domain sciences and one of our higher goals is to bridge an understanding between language sciences and engineering (the latter being the dominant focus in NLP). We run statistical tests to identify the strongest correlates of performance and to assess whether the differences between the mean performance of different groups are indeed significant. We are concerned not with the absolute scores, but with the relations between scores from different languages and the generalizations derived therefrom.

Information-theoretic, fair evaluation with multitexts. Most sequence-to-sequence models are optimized using a cross-entropy loss (see Appendix B for definition). Cotterell et al. (2018) propose to use "renormalized" perplexity (PP) to evaluate LMs fairly using the total number of bits divided by some constant. In our case, we choose instead a simpler method of using an "unnormalized" PP, directly using the total number of bits needed to encode the development (dev) set, which has a constant size of 3,077 lines per language.

Disparity/Inequality. In the context of our CLMing experiments, we consider there to be "disparity" or "inequality" between languages l_1 and l_2 if there are significant differences between the performance distributions of these two languages with respect to each representation. Here, by performance we mean the number of bits required to encode the held-out data using a trained CLM. With 30 directions, there are 15 possible pairs of source languages (l_src1, l_src2) and 15 possible pairs of target languages (l_trg1, l_trg2). To assess whether the differences are significant, we perform unpaired two-sided significance tests with the null hypothesis that the score distributions for the two languages are not different. Upon testing for normality with the Shapiro-Wilk test (Shapiro & Wilk, 1965; Royston, 1995), we use the parametric unpaired two-sample Welch's t-test (Welch, 1947) (when normal) or the non-parametric unpaired Wilcoxon test (Wilcoxon, 1945) (when not normal) for the comparisons. We use the implementation in R (R Core Team, 2014) for these 3 tests. To account for the multiple comparisons we are performing, we correct all p-values using Bonferroni's correction (Benjamini & Heller, 2008; Dror et al., 2017) and follow Holm's procedure (Holm, 1979; Dror et al., 2017) to identify the pairs of l_1 and l_2 with significant differences after correction. We report all 3 levels of significance (α ≤ 0.05, 0.01, 0.001) for a more comprehensive evaluation.
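
The multiple-comparison step described above can be sketched as follows. This is a minimal pure-Python illustration of Holm's step-down procedure applied to the 15 language pairs, not the R code used in the study; the function name `holm_reject` and the p-values are hypothetical, and the normality and two-sample tests themselves are assumed to have been run elsewhere.

```python
from itertools import combinations

LANGS = ["AR", "EN", "ES", "FR", "RU", "ZH"]

# 6 languages give 15 unordered pairs (of source or of target languages).
PAIRS = list(combinations(LANGS, 2))


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, one per input p-value, that is True
    where the null hypothesis is rejected at family-wise level `alpha`.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Hypothetical p-values for four comparisons.
print(len(PAIRS))
print(holm_reject([0.001, 0.04, 0.03, 0.5]))
```

Running the procedure once per significance level (0.05, 0.01, 0.001) would yield the three levels of significance reported.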

Experimental setup

The systematic, identical treatment we give to our data is described as follows, with further preprocessing and hyperparameter details in Appendices B and C, respectively. The distinctive point of our experiment is that the training regime is the same for all (intuition in App. O.1). After filtering length to 300 characters maximum per line in parallel for the 6 languages, we made 3 subsets of the data with 1 million lines each: one having lines in the order of the original corpus (dataset A) and two others randomly sampled (without replacement) from the full corpus (datasets B & C). Lines in all datasets are extracted in parallel and remain fully aligned for the 6 languages. For each run and each representation, there are 30 pairwise directions (i.e. one l_src to one l_trg) that result from the 6 languages. We trained all 150 (30 directions in 5 sizes) 6-layer Transformer models for each run using the SOCKEYE Toolkit (Hieber et al., 2018). We optimize using PP and use early stopping if no PP improvement occurs after 3 checkpoints, up to 50 epochs maximum, taking the best checkpoint. Characters and bytes are supposed to mitigate the out-of-vocabulary (OOV) problem on the word level. In order to assess the effect of modeling with finer granularity more precisely, all vocabulary items in the train set, including those appearing only once, are accounted for (i.e. full vocabulary on train, as in Gerz et al. (2018a; b)). But we allow our system to categorize all unknown items in the dev set as unknown (UNK) so as to measure OOVs (open vocabulary on dev (Jurafsky & Martin, 2009)). To identify correlates of performance, we perform Spearman's correlation (Spearman, 1904) with some basic statistical properties of the data (e.g. length, vocabulary size (|V|), type-token ratio, OOV rate) as metrics; a complete list thereof is provided in Appendix F. For each of the 3 primary representations (character, byte, and word), we performed 5 runs total in 5 sizes (10^2 to 10^6 lines) (runs A0, B0, C0, A1, & A2) and 7 more runs in 4 sizes (10^2 to 10^5 lines) (A3-7, B1, & C1), also controlling for seeds. For the alternate/secondary representations, we ran 3 runs each in 5 sizes (10^2 to 10^6 lines) (A0, B0, & C0).
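
The three primary representations and two of the data statistics mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing pipeline of the experiments (described in Appendix B): words are assumed here to be whitespace-separated, whitespace is kept as a character on the character level, and the helper names `segment`, `type_token_ratio`, and `oov_token_rate` are hypothetical.

```python
def segment(line, level):
    """Split one line of text into units at the given representation level."""
    if level == "char":
        return list(line)                  # Unicode code points
    if level == "byte":
        return list(line.encode("utf-8"))  # UTF-8 byte values (ints)
    if level == "word":
        return line.split()                # assumes pre-tokenized text
    raise ValueError(f"unknown level: {level}")


def type_token_ratio(tokens):
    """Number of distinct units divided by the total number of units."""
    return len(set(tokens)) / len(tokens)


def oov_token_rate(train_tokens, dev_tokens):
    """Share of dev tokens never seen in training (mapped to UNK)."""
    vocab = set(train_tokens)  # full vocabulary on train
    unknown = sum(1 for t in dev_tokens if t not in vocab)
    return unknown / len(dev_tokens)


line = "a b a"
for level in ("char", "byte", "word"):
    units = segment(line, level)
    print(level, len(units), type_token_ratio(units))
```

The same line thus yields different sequence lengths and vocabulary statistics depending only on how it is segmented.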

3. EXPERIMENTAL RESULTS OF PRIMARY REPRESENTATIONS

Subfigures 1a, 1b, and 1c present the mean results across 12 runs of the 3 primary representations: character, byte, and word, respectively. The x-axis represents data size in number of lines and the y-axis the total conditional cross-entropy, measured in bits (Eq. 1 in Appendix B). Each line connects 5 data points corresponding to the number of bits the CLMs (trained with training data of 10^2, 10^3, 10^4, 10^5, and 10^6 lines) need to encode the target language dev set given the corresponding text in the source language. These are the same data in the same 30 language directions and 5 sizes with the same training regime, just preprocessed/segmented differently. This confirms representation relativity: languages (or any objects being modeled) need to be evaluated relative to their representation. "One size does not fit all" (Durrani et al., 2019); our conventional way of referring to "language" (as a socio-cultural product or with traditional word-based approaches, or even for most multilingual tasks and competitions) is too coarse-grained (see also Fisch et al. (2019) and Ponti et al. (2020)).

Subfigures 1d, 1e, and 1f display the corresponding information sorted into facets by target language, with source languages represented as line types. Through these we see more clearly that results can be grouped rather neatly by target language (cf. figures sorted by source language in Appendix H). As implicit in the Transformer's architecture, the decoder is unaware of the source language in the encoder. As shown in Table 1 in § 5, which summarizes the number of source and target language pairs with significant differences, there are no significant differences across any source language pairs. The Transformer neutralizes source language instances. This could explain why transfer learning or multilingual/zero-shot translation (Johnson et al., 2017) is possible at all on a conceptual level.

In general, for character and byte models, most language directions do seem to converge at 10^4 lines to similar values across all target languages, with few notable exceptions. There are some fluctuations past 10^4, indicating that further tuning of hyperparameters would be beneficial, as our present setting possibly works most favorably at 10^4. On the character level, target language ZH (ZH_trg) shows a different learning pattern throughout. On the byte level, AR_trg and RU_trg display non-monotonic and unstable behavior, which we refer to as erratic. Word models exhibit Double Descent across the board (note the spike at 10^3), but overall, difficult/easy languages stay consistent, with AR and RU being the hardest, followed by ES and FR, then EN and ZH. A practical takeaway from this set of experiments: in order to obtain more robust training results, use bytes for ZH (as suggested in Li et al. (2019a)) and characters for AR and RU (e.g. Lee et al. (2017)), also if one wanted to avoid any "class" problems in performance disparity with words. Performance disparity for these representations is reported in Table 1 under "CHAR", "BYTE", and "WORD".

Do note, however, that the intrinsic performance of ZH with word segmentation is not particularly subpar. But this often does not correlate with its poorer results on downstream tasks (recall results from Junczys-Dowmunt et al. (2016)). The notion of word in ZH is highly contested and ambiguous: 1) it is often aimed to align with that in other languages so as to accommodate manual feature engineering and academic theories, 2) there is great variation among different conventions, and 3) native ZH speakers identify characters as words. There are therefore reasons to rethink this procedure now that fairer and language-independent processing in finer granularity is possible (cf. Li et al. (2019b) as well as Duanmu (2017) for a summary on the contested nature of wordhood in ZH). A more native analysis of ZH, despite its being considered a high-resource language, has not yet been recognized in NLP.

4. UNDERSTANDING THE PHENOMENA WITH ALTERNATE REPRESENTATIONS

To understand why some languages show different results than others, we carried out a secondary set of control experiments with representations targeting the problematic statistical properties of the corresponding target languages. (An extended version of this section is provided in Appendix P.)

Character level. We reduced the high |V| in ZH with representations in ASCII characters: Pinyin and Wubi. The former is a romanization of ZH characters based on their pronunciations and the latter an input algorithm that decomposes character-internal information into stroke shape and ordering and matches these to 5 classes of radicals (Lunde, 2008). We replaced the ZH data in these formats only on the target side and reran the experiments involving ZH_trg on the character level. Results in Figure 2 and Table 1 show that the elimination of disparity on the character level is possible if ZH is represented through Pinyin (transliteration), as in Subfigure 2c. But models with ZH logographic scripts display a behavioral tendency unlike those with other (phonetic) alphabetic scripts (Subfigure 2a). Work published thus far using Wubi with the Transformer seems to have needed some form of architectural modification (Gao et al., 2020) or a different architecture altogether (Nikolov et al., 2018; Zhang et al., 2019), suggesting a possible script bias (to be further discussed in § 5 under "Bases for biases").

Byte level. Length is the most salient statistical attribute that makes AR and RU outliers. To shorten their sequence lengths, we tested with alternate encodings on AR_trg and RU_trg: code pages 1256 and 1251, which provide 1-byte encodings specific to AR and RU, respectively. Results are shown in Subfigures 3a and 3b. Not only is erraticity resolved, but the number of target language pairs with significant differences (out of 15 possible) also reduces from 8 with the UTF-8 byte representation to 0 (Table 1 under "ARRU_t"), indicating that we eliminated disparity with this optimization heuristic. Since our heuristic is a lossless and reversible transform, it shows that a complexity that is intrinsic and necessary in language does not exist in computing, however diverse the languages may be, as our 6 are, from the conventional linguistic typological, phylogenetic, historical, or geographical perspectives. Please refer to Appendix J for our discussion on language complexity.
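
The effect of the code pages on sequence length can be illustrated with Python's built-in codecs `cp1251` and `cp1256`. This is a minimal sketch with toy strings, not the conversion script used in the experiments; the helper name `byte_lengths` is hypothetical, and the round trip is lossless only for text whose characters are all covered by the chosen code page.

```python
def byte_lengths(text, codepage):
    """Return (UTF-8 length, code-page length) in bytes for `text`."""
    return len(text.encode("utf-8")), len(text.encode(codepage))


ru = "мир"                       # 3 Cyrillic letters
ar = "\u0633\u0644\u0627\u0645"  # 4 Arabic letters (seen, lam, alef, meem)

print(byte_lengths(ru, "cp1251"))  # 2 bytes per letter in UTF-8, 1 in cp1251
print(byte_lengths(ar, "cp1256"))  # 2 bytes per letter in UTF-8, 1 in cp1256

# The transform is reversible: decoding recovers the original string.
assert ru.encode("cp1251").decode("cp1251") == ru
assert ar.encode("cp1256").decode("cp1256") == ar
```

For these letters the byte sequence is halved, which is the length normalization that the byte-level controls rely on.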

Word level

The main difference between word and character/byte models is that length is not a top contributing factor correlating with performance; |V| is instead. This is understandable as word segmentation neutralizes sequence lengths. To remedy the OOV problem, we use BPE, which learns a fixed vocabulary of variable-length character sequences (on the word level, as it presupposes word segmentation) from the training data. It is more fine-grained than word segmentation and is known for its capability to model subword units for morphologically complex languages (e.g. AR and RU). We use the same vocabulary size of 30,000 as specified in Junczys-Dowmunt et al. (2016). This reduced our averaged OOV token rate by 89-100% across the 5 sizes. The number of language pairs with significant differences reduced to 7 from 8 for word models, showing how finer-grained modeling has a positive effect on closing the disparity gap.
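
How BPE builds variable-length units on top of word segmentation can be sketched as follows. This is a simplified illustration in the spirit of the merge-learning loop of Sennrich et al. (2016), not the tool or settings used in the experiments; the function name `learn_bpe`, the end-of-word marker `</w>`, the tie-breaking rule, and the toy word counts are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from word counts."""
    # Each word starts as a sequence of characters plus an end marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


print(learn_bpe({"abab": 3, "ab": 2}, 2))
```

Because merges never cross word boundaries, the resulting units remain bounded by the prior word segmentation, which is why BPE is treated here as a word-level representation.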

5. META-RESULTS, ANALYSIS, AND DISCUSSION

Performance disparity. Table 1 lists the number of language pairs with significant differences under the representations studied. That our character and byte models can show no performance disparity for the same languages on the same data indicates that disparity is not a necessary condition. In fact, the customary expectation that languages ought to perform differently stems from our word segmentation practice. Furthermore, the order of AR/RU > ES/FR > EN/ZH (Figure 1c) resembles the idea of morphological complexity. Considering there are character-internal meaningful units in languages with logographic script such as ZH (cf. Zhang & Komachi (2018)) that are rarely captured, studied, or referred to as "morphemes", this goes to show that linguistic morphology, along with its complexity, as is practiced today and as it has occurred in the NLP discourse thus far, has only been relevant on and is bounded to the "word" level. The definition of word, however, has been recognized as problematic for a very long time in the language sciences (see Haspelmath (2011) and references therein from the past century). Since the conventional notion of word, which has been centered on English and languages with alphabetic scripts, has a negative impact on languages both morphologically rich (see Minkov et al. (2007), Seddah et al. (2010), inter alia), AR and RU in our case, as well as morphologically "frugal" (Koehn, 2005), as in ZH, finer-grained modeling with characters and bytes (or n-gram variants/pieces thereof) is indeed a more sensible option and enables a greater variety of languages to be handled with more simplicity, fairness, independence, and flexibility. While the lack of significant differences between pairs of source languages would signify neutralization of source language instances, it does not mean that source languages have no effect on the target. For our byte solutions with code pages, we also experimented with source side optimization in the directions that involve AR/RU as source. This affected the distribution of the disparity results for that representation, with 2 pairs being significantly different (see Table 1 under "ARRU_s,t"). We defer further investigation on the nature of source language neutralization to future work.

Sample-wise Double Descent (DD). Sample-wise non-monotonicity/DD (Nakkiran et al., 2020) denotes a degradation followed by an improvement in performance with increasing data size. We notice that word models and character models with ZH_trg, i.e. models with high target |V|, are prone to exhibit a spike at 10^3. A common pattern for these is that the ratio of target training token count to number of parameters falls into O(10^-4) for 10^2 lines, O(10^-3) at 10^3, O(10^-2) at 10^4, O(10^-1) for 10^5 lines, and so on. But for more atomic units such as alphabetic (not logographic) characters (be it Latin, Cyrillic, or Abjad) and for bytes, this progression instead begins at O(10^-3) at 10^2 lines. Instead of regarding the spike at 10^3 as irregular, we may want to think of the learning curve as shifted by 1 order of magnitude to the right for characters and bytes, and/or of the performance at 10^2 lines for words and ZH characters as abnormal due to overparameterization. This would also fit in with the findings by Belkin et al. (2019) and Nakkiran et al. (2020) attributing DD to overparameterization. If we could use this ratio and logic of higher |V| to automatically detect "non-atomic" units, ones that can be further decomposed, this observation could potentially be beneficial for advancing other sciences, e.g. biology. From a cognitive modeling perspective, the similarity in behavior of ZH characters and words of other languages can affirm the interpretation of wordhood for those ZH speakers who identify ZH characters as words (see also the last paragraph in § 3 and Appendix J). While almost all work attributes DD to algorithmic reasons, concurrent work by Chen et al. (2020) corroborates our observation and confirms that DD arises due to "the interaction between the properties of the data and the inductive biases of learning algorithms". Other related work on DD and its more recent development can also be found in their work. We performed additional experiments testing our setting on the datasets used by Nakkiran et al. (2020) and testing our data on a non-neural LM. Results support our findings and are provided in Appendix K. The number of model parameters can be found in Appendix L.

Erraticity. We observe another type of sample-wise non-monotonicity, one that signals irregular and unstable performance across data sizes and runs. Within one run, erraticity can be observed directly as changes in direction on the y-axis. Across runs, large variance can be observed, even with the same dataset (see Figure 18 in Appendix M). Erraticity can also be observed indirectly through a negative correlation between data size and performance. Much work on length bias in NMT has focused on solutions related to search, e.g. Murray & Chiang (2018). Our experiments show that a kind of length bias can surface already with CLMing, without generation taking place. If the connection between erraticity and length bias can indeed be drawn, it could strengthen the case for global conditioning (Sountsov & Sarawagi, 2016). (See Appendix M for more discussion and results.)

Script bias, erraticity, word bias: are these necessary conditions? To assess whether the observed phenomena are particular to this one setting, we performed one run with dataset A in 4 sizes with the primary representations on 1-layer Transformers (see Appendix N). We observed no significant disparity across the board. This shows that larger/overparameterized models can magnify/exacerbate the differences in the data statistics. That hyperparameter tuning (in this case, through the reduction of the number of layers) can mitigate effects from data statistics is, to the best of our knowledge, a novel insight, suggesting also that a general expectation of monotonic development as data size increases can indeed be held. Our other findings remain consistent (representational relativity, source language neutralization, and DD on the word level).

Bases for biases. Recall that in § 1, we "consider bias to be present when performance disparity in our Transformer models is statistically significant". As shown in our data statistics and analysis (Appendices D and P, respectively), script bias, length bias with respect to erraticity in CLMing, and word bias are all evident in the vocabulary and length information in the data statistics. Hence these disparities in performance are really a result of the Transformer being able to model these differences in data at such a magnitude that the differences are statistically significant. The meta phenomenon of erraticity, however, warrants an additional consideration indicative of the empirical limits of our compute (cf. Xu et al. (2020)), even when the non-monotonicity is not observed during the training of each model. In eliminating performance disparity in character and byte models by normalizing vocabulary and length statistics in the data, we demonstrated that performance disparity as expected from the morphological complexity hierarchy is due to word tokenization, not intrinsic or necessary in language. This is the word bias. Qualitative issues in the concept of word will persist and make crosslinguistic comparison involving "words" unfair even if one were able to find a quantitative solution to mitigate the OOV issue, the bottleneck in word-based processing. We humans have a choice in how we see/process languages. That some might still prefer to continue with a crosslinguistic comparison with "words" and assert the superiority of "word" tokenization speaks for a view that is centered on "privileged" languages; in that case, word bias is a human bias. And, in eliminating performance disparity across the board with our one-layer models, we show that all quantitative differences in data statistics between languages can also be modeled in a "zoomed-out"/"desensitized" mode, suggesting that while languages can be perceived as being fundamentally different in different ways in different granularities, they can also be viewed as fundamentally similar.

6. ADDITIONAL RELATED WORK

Similar to our work in testing for hardness are Cotterell et al. (2018) , Mielke et al. (2019) , and Bugliarello et al. (2020) . The first two studied (monolingual) LMs -the former tested on the Europarl languages (Koehn, 2005) with n-gram and character models and concluded that morphological complexity was the culprit to hardness, the latter studied 62 languages of the Bible corpus (Mayer & Cysouw, 2014) in addition and refuted the relevance of linguistic features in hardness based on character and BPE models on both corpora in word-tokenized form. Bugliarello et al. (2020) compared translation results of the Europarl languages with BPEs at one data size and concluded that it is easier to translate out of EN than into it, statistical significance was, however, not assessed. In contrast, we ablated away the confound of generation and studied CLMing with controls with a broader range of languages with more diverse statistical profiles in 3 granularities and up to 5 orders of magnitude in data size. That basic data statistics are the driver of success in performance in multilingual modeling has so far only been explicitly argued for in Mielke et al. (2019) . We go beyond their work in monolingual LMs to study CLMs and evaluate also in relation to data size, representational granularity, and quantitative and qualitative fairness. Bender ( 2009) advocated the relevance of linguistic typology for the design of language-independent NLP systems based on crosslinguistic differences in word-based structural notions, such as parts of speech. Ponti et al. ( 2019) found typological information to be beneficial in the few-shot setting on the character level for 77 languages with Latin scripts. But no multilingual work has thus far explicitly examined the relation between linguistic typology and the statistical properties of the data, involving languages with diverse statistical profiles in different granularities. 
As obtaining training data is often the most difficult part of an NLP or Machine Learning (ML) project, Johnson et al. (2018) introduced an extrapolation methodology to directly model the relation between data size and performance. Our work can be viewed as one preliminary step towards this goal. To the best of our knowledge, there has been no prior work on demonstrating the neutralization of source language instances through statistical comparisons, a numerical analysis of DD for sequence-to-sequence models, the meta phenomenon of a sample-wise non-monotonicity (erraticity) being related to length, or the connection between effects of data statistics and modification in architectural depth.

7. CONCLUSION

Summary We performed a novel, rigorous relational assessment of performance disparity across different languages, representations, and data sizes in CLMing with the Transformer. Different disparity patterns were observed on different representation types (character, byte, and word), which can be traced back to the data statistics. The disparity pattern reflected on the word level corresponds to the morphological complexity hierarchy, reminding us that the definition of morphology is predicated on the notion of word and indicating how morphological complexity can be modeled by the Transformer simply through word segmentation. As we were able to eliminate disparity on the same data on the character and byte levels by normalizing length and vocabulary, we showed that morphological complexity is not a necessary concept but one that results from word segmentation and is bounded to the word level, orthogonal to the performance of character or byte models. Representational units of finer granularity were shown to help eliminate performance disparity, though at the cost of longer sequence length, which can have a negative impact on robustness. In addition, we found all word models and character models with ZH_trg to behave similarly in being prone to exhibit a peak (as sample-wise DD) around 10^3 lines in our setting. While bigger/overparameterized models can magnify the effect of data statistics, exacerbating the disparity, we found that a decrease in model depth can eliminate these quantitative biases, leaving only the qualitative aspect of "word" and the necessity of word segmentation in question.

Outlook Machine learning has enabled greater diversity in NLP (Joshi et al., 2020). Fairness, in the elimination of disparity, does not require big data. This paper made a pioneering attempt to bridge research in DL/NNs, language sciences, and language engineering through a data-centric perspective.
We believe a statistical science for NLP as a data science can well complement algorithmic analyses with an empirical view contributing to a more generalizable pool of knowledge for NNs/DL/ML. A more comprehensive study not only can lead us to new scientific frontiers, but also better design and evaluation, benefitting the development of a more general, diverse and inclusive Artificial Intelligence.

APPENDICES

A RE-VISUALIZATION OF FIGURE 

B DATA SELECTION AND PREPROCESSING DETAILS

The UN Parallel Corpus v1.0 (Ziemski et al., 2016) consists of manually translated UN documents from 1990 to 2014 in the 6 official UN languages. Therein is a subcorpus that is fully aligned by line, comprising the 6-way parallel corpus we use. We tried to have as little preprocessing or filtering as necessary to eliminate possible confounds. But as the initial runs of our experiment failed due to insufficient memory on a single GPU with 12 GB VRAM, we filtered out lines with more than 300 characters in any language, in lockstep with one another for all the 6 languages, such that the subcorpora would remain parallel, thereby keeping the material of each language semantically equivalent to one another. 8,944,859 lines for each language were retained as our training data, which cover up to the 75th percentile in line length for all 6 languages. In order to monitor the effect of data size, we made subcorpora of each language in 5 sizes by heading the first 10^2, 10^3, 10^4, 10^5, 10^6 lines. We refer to this as dataset A. In addition, to better understand and verify the consistency of the phenomena observed, we made 2 supplemental datasets by shuffling the 8,944,859 lines two different times randomly and heading the number of lines in our 5 sizes for each language, again in lockstep with one another (datasets B and C). For character modeling, we used a dummy symbol to denote each whitespace. For byte, we turned each UTF-8-encoded character into a byte string in decimal value, such that each token is a number between 0 and 255, inclusive. For word, we followed Junczys-Dowmunt et al. (2016) and used the Moses tokenizer (Koehn et al., 2007), as is standard in NMT practice when word tokenization is applied, and Jieba for segmentation in ZH. For Pinyin, we used the implementation from https://github.com/lxyu/pinyin in the numerical format such that each character/syllable is followed by a single digit indicating its lexical tone in Mandarin.
For Wubi, we used the dictionary from the implementation from https://github.com/arcsecw/wubi. We have implemented all representations such that they would be reversible even when the sequence contains code-mixing. We used the official dev set as provided in Ziemski et al. (2016); 3,077 lines per language remained from 4,000 after filtering line length to 300 characters. Data statistics are provided in Appendix D for reference.

The systematic training regime is identical for all language directions. For each primary representation type (character, byte, and word), we performed:
• 5 runs in 5 sizes (10^2-10^6): A0 (seed=13), B0 (13), C0 (9948), A1 (9948), A2 (265), and
• 7 more runs in 4 sizes (10^2-10^5): A3 (777), A4 (42), A5 (340589), A6 (1000), A7 (83146), B1 (9948), & C1 (13).

For each run and each size, there are 30 pairwise directions (i.e. 1 source language to 1 target language, e.g. AR-EN for Arabic to English) that result from the 6 languages. We trained all 150 jobs for each run and representation using the Transformer model (Vaswani et al., 2017) as supported by the SOCKEYE Toolkit (Hieber et al., 2018) (version 1.18.85), based on MXNet (Chen et al., 2015). A detailed description of the architecture of the Transformer can be found in Vaswani et al. (2017). The same set of hyperparameters applies to all, and its values are listed in Appendix C.
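The character and byte representations described in this appendix can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the choice of `▁` as the dummy whitespace symbol are assumptions made for illustration, and the character round trip only holds if the dummy symbol does not already occur in the data.

```python
# Minimal sketch of reversible character- and byte-level tokenization.
# WHITESPACE_SYMBOL is a hypothetical choice; the paper only states that
# "a dummy symbol" denotes each whitespace.
WHITESPACE_SYMBOL = "▁"


def to_char_tokens(line: str) -> list[str]:
    """Split a line into characters, marking each space with the dummy symbol."""
    return [WHITESPACE_SYMBOL if ch == " " else ch for ch in line]


def from_char_tokens(tokens: list[str]) -> str:
    """Invert to_char_tokens (assumes the dummy symbol is absent from the text)."""
    return "".join(" " if tok == WHITESPACE_SYMBOL else tok for tok in tokens)


def to_byte_tokens(line: str) -> list[str]:
    """Encode a line as UTF-8 and emit each byte as a decimal token in 0..255."""
    return [str(b) for b in line.encode("utf-8")]


def from_byte_tokens(tokens: list[str]) -> str:
    """Invert to_byte_tokens by rebuilding the byte string and decoding it."""
    return bytes(int(tok) for tok in tokens).decode("utf-8")
```

Because both mappings operate on code points or bytes rather than on language-specific units, the same functions apply unchanged to code-mixed lines.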

Notes on training time

Each run of 30 directions in 5 sizes took approximately 8-12 days for character and byte models. Byte models generally took longer, so training time is positively correlated with length (concurring with observations by Cherry et al. (2018) as they compared character with BPE models). A maximum length of 300 characters entails a maximum length of at least 300 bytes in UTF-8. Each run of word models (30 directions, 5 sizes) took about 6 days (excluding the training of some 7-9 directions out of 30 per run involving AR_trg or RU_trg at 10^6 on word level, which took about 12-18 hours per direction to train on a CPU, as these required more space and would otherwise run out of memory (OOM) on our GPUs). These figures do not include the additional probing experiments described in § 4.

Evaluation metric

Most sequence-to-sequence models are optimized using a cross-entropy loss, defined as:

$$H(t, s) = -\sum_{i=1}^{N} \log_2 p(t_i \mid t_{<i}, s)$$

where $t$ is the sequence of tokens to be predicted, $t_i$ refers to the $i$-th token in that sequence, $s$ is the sequence of tokens conditioned on, and $N = |t|$. It is customary to report scores as PP, which is $2^{\frac{1}{N} H(t, s)}$, i.e. 2 to the power of the cross-entropy averaged by the number of tokens (based on whichever granularity of unit is used for training) in the data. Cotterell et al. (2018) propose to use "renormalized" PP to evaluate LMs fairly through the division of an arbitrary constant. In our case, we choose instead a simpler method of using an "unnormalized" PP, i.e. the total number of bits needed to encode the development (dev) set, which has a constant size of 3,077 lines per language (after length filtering of the same dev set used in Junczys-Dowmunt et al. (2016)) for all training sizes. As the implementation we used (SOCKEYE (Hieber et al., 2018)) only reports PP, we transform it back to entropy as defined above by noting that $H(t, s) = \log_2 PP(t \mid s) \times N$.
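The conversion from a reported perplexity back to the total number of bits follows directly from the two formulas above and can be sketched in a few lines of Python. The function names are illustrative, and whether N should include the end-of-sentence marker added by the toolkit is left as an assumption for the caller to settle.

```python
import math


def total_bits(perplexity: float, num_tokens: int) -> float:
    """Recover H(t, s) = log2(PP) * N from a per-token perplexity PP.

    num_tokens is N, the number of target tokens over which PP was averaged
    (assumption: the caller counts tokens the same way the toolkit does).
    """
    return math.log2(perplexity) * num_tokens


def perplexity_from_bits(bits: float, num_tokens: int) -> float:
    """Inverse mapping: PP = 2 ** (H / N)."""
    return 2.0 ** (bits / num_tokens)
```

For instance, a perplexity of 8 over 4 target tokens corresponds to log2(8) × 4 = 12 bits in total.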

C HYPERPARAMETER SETTING

• encoder transformer;
• decoder transformer;
• num-layers 6:6;
• num-embed 512:512;
• transformer-model-size 512;
• transformer-attention-heads 8;
• transformer-feed-forward-num-hidden 2048;
• transformer-activation-type relu;
• transformer-positional-embedding-type fixed;
• transformer-preprocess d;
• transformer-postprocess drn;
• transformer-dropout-attention 0.1;
• transformer-dropout-act 0.1;
• transformer-dropout-prepost 0.1;
• batch-size 15;
• batch-type sentence;
• max-num-checkpoint-not-improved 3;
• max-num-epochs 50;
• optimizer adam;
• optimized-metric perplexity;
• optimizer-params epsilon: 0.000000001, beta1: 0.9, beta2: 0.98;
• label-smoothing 0.0;
• learning-rate-reduce-num-not-improved 4;
• learning-rate-reduce-factor 0.001;
• loss-normalization-type valid;
• max-seq-len 300 for character, word, and BPE, 672 for all bytes, 688 for Wubi, 680 for Pinyin;
• checkpoint-frequency/interval 4000.

(For smaller datasets, the end of 50 epochs is often reached before the first checkpoint. Since SOCKEYE only outputs scores at checkpoints, we adjusted the checkpoint frequency as follows to get a score outputted by the end of 50 epochs: 1000 for 100 lines for all character & byte instances, 400 for 100 lines for word and 500 for 100 lines for BPE, 3450 for 1000 lines for word & BPE. For the very few cases where this default does not suffice due to bucketing of similar-length sequences, we manually set the checkpoint frequency to the last batch.)

D DATA STATISTICS

Note that Sockeye adds for its calculation 4 additional types: <pad>, <s>, </s>, <unk>.
• Number of tokens. This excludes the 1 EOS/BOS (end-/beginning-of-sentence) marker added by Sockeye to each line.
• Out-of-vocabulary (OOV) type rate (in %), i.e. the fraction of the types in the dev data that is not covered by the types in the training data.
• OOV token rate (in %), i.e. the fraction of tokens in the dev data that is treated as UNKnowns.
• Type-token-ratio (in %), i.e. the ratio between the number of types and tokens in the data. This is a rough proxy for lexical diversity in that a value of [...]

[Table values; row and column labels not recoverable]
1,669,230.67; 1,182,122.14; 790,296.63; 812,541.59; 801,797.31; 1,682,368.21; 1,214,712.80; 768,329.55; 773,077.60; 743,187.12; 898,467.63; 1,039,423.53; 831,131.85; 792,199.69; 788,770.52; 1,415,204.23; 1,169,086.06; 889,203.72; 822,286.95; 1,674,603.44; 1,182,589.30; 791,660.51; 787,110.43; 799,124.88; 1,677,201.07; 1,195,378.25; 749,358.45; 729,881.05; 705,375.66; 926,213.05; 1,023,048.10; 859,552.52; 819,669.80; 812,027.11; 1,425,834.11; 1,173,956.40; 925,854.48; 844,751.77; 1,839,654.03; 1,354,401.84; 848,031.25; 869,551.01; 869,612.93; 2,197,541.98; 2,228,319.17; 2,431,853.83; 1,947,404.65; 1,829,839.51; 918,172.29; 1,110,884.62; 831,112.97; 788,608.20; 777,418.25; 1,587,521.54; 1,327,483.08; 929,790.50; 860,789.46; 1,280,817.03; 1,281,197.26; 1,132,398.00; 1,120,681.62; 1,121,645.99; 1,568,743.13; 1,201,566.37; 769,183.30; 752,991.20; 745,264.85; 789,140.01; 942,441.68; 729,612.42; 676,409.89; 673,012.11; 1,215,900.34; 1,090,617.74; 784,579.94; 697,066.73
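The OOV and type-token statistics defined in Appendix D can be computed along the following lines. This is a hedged Python sketch rather than the authors' script: the function names are hypothetical, tokens are assumed to be already segmented at the chosen granularity, and the 4 special types that Sockeye adds are not counted.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Type-token ratio in %: number of distinct types over number of tokens."""
    return 100.0 * len(set(tokens)) / len(tokens)


def oov_type_rate(train_tokens: list[str], dev_tokens: list[str]) -> float:
    """Share (in %) of dev types that never occur in the training data."""
    train_types = set(train_tokens)
    dev_types = set(dev_tokens)
    return 100.0 * len(dev_types - train_types) / len(dev_types)


def oov_token_rate(train_tokens: list[str], dev_tokens: list[str]) -> float:
    """Share (in %) of dev tokens that would be mapped to <unk>."""
    train_types = set(train_tokens)
    unknown = sum(1 for tok in dev_tokens if tok not in train_types)
    return 100.0 * unknown / len(dev_tokens)
```

With training tokens `a b c a` and dev tokens `a d d b`, one of the three dev types is unseen, while two of the four dev tokens are unknown, which shows how the type and token rates can diverge.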

F CORRELATION STATISTICS

Best correlating metrics, i.e. the union of top 3 metrics for all representations. For each representation, the top 3 metrics are boldfaced. All correlations are highly significant (p < 10^{-30}), except for min source length for WORD (p ≈ 0.0001) and min target length for WORD (p ≈ 0.3861). The full list of metrics used for the correlation analysis is:

Languages are often compared with regard to their complexity from a computational, theoretical and learning perspective. In computational linguistics, it is generally known that methods mainly developed for the English language do not necessarily transfer well to other languages. The cross-linguistic variation in the amount of information encoded at the level of a word is, for instance, recognized as one of the main challenges for multilingual syntactic parsing (formulated as The Architectural Challenge (Tsarfaty et al., 2013)). Complexity of this kind is also found to influence machine translation: translating from morphologically rich languages into English is easier than the other way around (Koehn, 2005).


***** Morphology is "the study of the formation and internal structure of words". Morphemes are "the smallest meaningful units of language". (Bender, 2013) *****

AR and RU are traditionally considered morphologically complex (see e.g. Minkov et al. (2007), Seddah et al. (2010), and proceedings of related workshops in subsequent years), and ZH is considered to lack morphological richness (Koehn, 2005). But this definition of morphology is predicated on the notion of word, defined primarily from an alphabetic perspective. As pointed out by Zhang & Komachi (2018), "the important differences between logographic and alphabetic writing systems have long been overlooked". In logographic languages (i.e. languages with logographic scripts), there can be units within a character that carry semantic and phonetic information that have never been accounted for in the traditional practice of morphology or in the computation of morphological complexity. For example, in the comparison of different morphological complexity measures by Bentz et al. (2016), all measures studied are defined with the notion of word. Yet, there is no universally valid definition of a "word": the form/idea (as in, the philosophical concept) of a "word" may be there for most languages/cultures (though that is certainly also debatable), but its instantiations are different in different languages/cultures, as well as in different genres/settings within one language. The variability in the definition of word is evident in the variation in language-specific word tokenization algorithms, along with the "indeterminacy of word segmentation" or a work-in-progress status for the definition of "word" advocated by Haspelmath (2011), as well as the contested nature of wordhood, esp. for logographic languages such as ZH (see Duanmu (2017) and Li et al.
(2019b) for how some ZH speakers do indeed consider a ZH character to be a word or how "word", as conventionally used in NLP, is not a native term or does not correspond with speakers' judgement). Our results with the Transformer indicate that a notion of morphological complexity can be modeled given our word tokenization scheme, confirming that morphological complexity is only predicated on the notion of word and bounded within the word level, and orthogonal to the performance of character or byte models. That is, unless word-based segmentation has been applied, there is no reason to attribute crosslinguistic performance disparity to differences in morphological complexity. In fact, on the character and byte level, we were able to achieve performance without disparity. Hence disparity is not a necessary condition but an expectation that has been in mutual reinforcement with our practice of word segmentation, while the definitions of "morphological complexity" and "word" are in a circular dependency with each other. In this paper, we resolve language complexity, more specifically that of morphological complexity, in the context of computing through CLMing with the Transformer, in that we explain away the representation granularities and criteria relevant for such calculation.

TLDR: Up to the point of our taking up the subject of language complexity in this paper, there has not been a rigorous definition of "language complexity". Conventionally, "language complexity" is synonymous with "linguistic complexity" (with the tradition of "linguistics" being primarily word-based), and people just assume linguistic complexity, e.g. morphological/syntactic complexity, to be intrinsic and necessary in languages (across representation levels). Our findings show that linguistic complexity is relative to the representation granularity, i.e. since morphology is based on words, it is bounded to the word level.
***** An alternative perspective, with finer prints: We have also developed a more rigorous interpretation. We take on the definition of "language complexity" as one that is related to the statistical attributes of languages. We assume and define solving as the elimination of statistically significant performance disparity. In larger (6-layer) models, and according to the conventional definition of "language" (i.e. language as a whole), we solved language complexity with compression of AR and RU in byte representations. In smaller (1-layer) models, one can think of the situation as: i) no complexity has been modeled by the Transformer, hence there is nothing to solve, or ii) there is no complexity between these languages to begin with, or iii) the Transformer solved the complexity. With respect to each representation level/granularity in the larger models:
• BYTE: one can think of us as having solved complexity with byte representations or with 1-layer models, for these 6 languages empirically. Theoretically, there could be languages with longer sequence lengths than RU and AR; in those cases, we do not claim to have solved the matter empirically but only to have resolved it conceptually. But this is the most that anyone could do at the moment, as there is no relevant parallel data available.
• CHARACTER: one can think of us as having solved it via bytes or 1-layer models. Whether we can be considered to have solved it via Pinyin for ZH depends on whether the evaluator accepts that decomposition into a phonetic-only representation qualifies as a solution for the ZH language.
• WORD: one can think of us as having solved it via bytes or 1-layer models. It is not possible to solve it strictly within the word level without creating word segmentation criteria that would be unrelatable to native speakers.
And since "word" is exclusively a human concept, we must either claim that a universal solution is undefined or undefinable for computing, or retreat to a unit that is the greatest common factor crosslinguistically. Since some ZH speakers consider ZH characters as words, we return to the character-level solution. It is beyond the scope of our paper to solve the qualitative disparity on the word level. However, we do advocate a more inclusive evaluation and critical reflection on the possibility of discontinuing the usage of "word" as such a non-technical term biases against both "morphologically complex" and "morphologically simple" languages. The world of languages in written form can be divided into those with logographic scripts and those with (phonetic) alphabetic ones, with the unit of character being the greatest common factor of them all, from the human perspective. For technical processing, esp. for fair multilingual sequence-to-sequence modeling with the Transformer, we recommend measures that are more standardized, such as those based on bytes or characters. There is room for improvement in the design of character encoding that complements the statistical profiles, e.g. with relative rank in sequence length, of different languages. We believe there is crosslinguistic systematicity on the character level to be leveraged. One's readiness to accept this as a solution to language complexity can be a subjective matter. One may insist that language complexity be solved exclusively with monolingual LMing (which lies outside the scope of the present work), instead of being confounded with the logic of one language being conditional on another. One may also object to the idea of (re-)solving morphological complexity being equivalent to or leading to solving language complexity as a whole, for there could also be e.g. 
syntactic complexity (although as substantial "information concerning syntactic units and relations is expressed at word level" in morphologically rich languages (Tsarfaty et al., 2010) , the boundary between morphology and syntax is less distinct for some languages than others (Haspelmath, 2011) ). If, however, our results could be extended, we wonder if syntactic complexity could be due to our sentence segmentation or a combination of word and sentence segmentation practice. That we leave for future work for those who are interested in the topic.

K SAMPLE-WISE DOUBLE DESCENT (DD)

K.1 OUR EXPERIMENTAL FRAMEWORK ON DD DATASETS FROM (NAKKIRAN ET AL., 2020)

Text experiments from previous work reporting sample-wise DD involved words (Belkin et al., 2019) and BPEs (Nakkiran et al., 2020). We applied our experimental framework, testing data points with 10^n lines, on the datasets reported in Nakkiran et al. (2020) to exhibit DD. WMT'14 [9] EN-FR was reported to demonstrate model-wise DD, and IWSLT'14 (Cettolo et al., 2012) DE-EN both model-wise and sample-wise DD. We downloaded and prepared the data with scripts [10] from the FAIRSEQ Toolkit (Ott et al., 2019). The WMT data was preprocessed with 40,000 BPE operations and IWSLT with 10,000. Our focus is on sample-wise DD, and hence our goal was to see if the spike at 10^3 we observed with the UN data would apply also to these datasets. We used the same training regime [11] with the Transformer and Adam on SOCKEYE as before and tested both language directions on the entirety of both datasets, with no subsampling. For the IWSLT dataset, we tested data sizes with 10^2-10^5 lines, then at 160,239, as that is the total number of lines available. For the WMT dataset, we tested from 10^2 to 10^7, then at 35,762,532. This shows that the effect we reported in § 5 also holds on these datasets: "the ratio of target training token count to number of parameters falls into O(10^{-4}) for 10^2 lines, O(10^{-3}) at 10^3, O(10^{-2}) at 10^4, and O(10^{-1}) for 10^5 lines and so on".

[9] http://www.statmt.org/wmt14/translation-task.html
[10] https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2fr.sh and https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt14.sh
[11] max-seq-len 300; checkpoint-frequency 4000 except for cases where 50 epochs would be reached before the first checkpoint: 400 for 10^2 lines and 3450 for 10^3 lines.
We also experimented with KenLM (Heafield, 2011; Heafield et al., 2013), a non-neural LM with modified Kneser-Ney smoothing (Kneser & Ney, 1995; Chen & Goodman, 1999), on our dataset A and found that on the word level, such a spike (or a hump) is common across all languages, see Figure 17. The target-token-to-parameter ratio is under 1 for most of these smaller data sizes. This seems related to the analytical findings in Opper et al. (1990), where the pseudo-inverse solution to a simple learning problem was shown to exhibit non-monotonicity, with the peak exactly as the ratio of data to parameters (α) approaches 1. The number of parameters of a k-gram model is the number of unique n-grams, 1 ≤ n ≤ k. Table 4 shows the ratios for our trigram model (all n-gram models of higher order exhibit the same effect). On word level, where the function of number of bits to data size is not always monotonic, we observe less of a monotonic development whenever the token-to-parameter ratio is smaller than 1. This is more notably shown in the first 4 sizes in AR, with a hump-like curve before the performance improves at 10^6. This is different from the sharper descent for ES and FR, where only the first two data sizes have a non-monotonic relationship and a token-to-parameter ratio less than 1. Taking the token-to-parameter ratio as a rough proxy for over- (< 1) and under-parameterization (> 1), this can be seen as an instance of non-monotonicity with respect to data size in the "critical regime", i.e. when the model transitions from being (heavily) over- to under-parameterized (Belkin et al., 2019; Nakkiran, 2019).

A remark on modeling with finer granularity Our KenLM results show that the performance of bytes and characters is not on par with that of words with non-neural algorithms. NNs/DL have enabled much progress in this regard.

Figure 19: Additional experiment with maximum length of 300 bytes (with no hyperparameter tuning, in our blind one-setting-for-all evaluation).
Considering there are languages with much higher character sequence length than RU, there is food for thought for the design of next-generation Multilingual Plane.
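The token-to-parameter ratio used in Appendix K for the k-gram models can be illustrated with a short Python sketch that counts the unique n-grams for 1 ≤ n ≤ k as the number of parameters. This is an approximation under stated assumptions, not a reproduction of KenLM's internal accounting: sentence-boundary markers are ignored, and the function names are hypothetical.

```python
def count_ngram_parameters(lines: list[list[str]], k: int) -> int:
    """Number of unique n-grams with 1 <= n <= k over tokenized lines.

    Assumption: no sentence-boundary markers are added, unlike in an
    actual KenLM model.
    """
    ngrams = set()
    for tokens in lines:
        for n in range(1, k + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams)


def token_to_parameter_ratio(lines: list[list[str]], k: int) -> float:
    """Ratio below 1 is read as over-, above 1 as under-parameterization."""
    num_tokens = sum(len(tokens) for tokens in lines)
    return num_tokens / count_ngram_parameters(lines, k)
```

On a single toy line `a b a b` with k = 3, there are 2 unigrams, 2 bigrams, and 2 trigrams, giving 4 tokens over 6 parameters and hence a ratio below 1.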

N EXPERIMENTS WITH ONE-LAYER TRANSFORMER

We performed 1 run with dataset A in 4 sizes (10^2-10^5 lines, seed=13) with the primary representations of characters, bytes, and words, on 1-layer Transformers (num-layers 1:1, all other hyperparameters remain the same as for our main experiments). We compared this against run A0 in 4 sizes with the same seed. (Based on how our null hypothesis is set up, the higher the number of runs, the more likely it is for there to be disparity. What is important is that we evaluate based on an equal number of runs and on the same data for all candidates.) Results are shown in Table 5, with no statistically significant disparity observed on the models trained with 1 layer across the board. Many are under the impression that big data is the cause of the neutralization of language instances in DL/NNs. But, as this set of experiments shows, it is possible for there to be no statistically significant differences between them, with as little as our smallest data size of 100 lines.

Q: The experiments here used one setting for all. Some model configurations might train better and converge close to their optima while other configurations might not reach their full potential. Can this not create a distortion in the results?

A: For conventional engineering practice, we agree that hyperparameter tuning would be a sine qua non. However, the evaluation objective is the relational distance between languages, hence we need to see it in a different light. Here is a loose analogy: *** Assume 3 objects in 3 different locations in space. Relative evaluation from one setting allows one to capture the distance between these objects. It does not matter whether these three objects are in their "best" states. For example, if one were to use a camera to capture these 3 objects and one does not adjust the setting (using just one random aperture, shutter speed, and focus), i.e.
no tuning to capture any of these 3 specifically, nor does one try to model these 3 to their individual bests separately, what would result could be a picture that captures one of these 3 objects more favorably than the others, or it could be that all of these would be blurred. But either way, there is a degree of blurriness to be measured, giving us an idea of the relative distance between the objects. Such relative measurement is the evaluation strategy that our paper adopts. Now, to add to the camera analogy, say one of the objects is running water, which was extra blurry [erraticity]: we suggest freezing the water, so even from the one arbitrary angle, it could be captured better. And it worked. Also, while one might generally like to have a "pretty" photo, one that is e.g. taken with sub-optimal lighting, say, overexposure, can have a telling effect as it can bring out details in something dark, like a black box. ***

Alternatively, one can tune hyperparameters for each model individually such that each model would be a more optimized one and then compare these models. In that case, one would be interpreting the differences between languages in terms of hyperparameters, and the paper would be one that is algorithm-centric. That is of course also a possibility. Our approach, however, is a data-centric one. We would, first of all, like to understand the nature of language data, i.e. what it is about language, if there is anything at all, that makes it a different data type than other data, and what kind of structural constraints, if any, we need to take into consideration. Then, with findings from this data perspective, we try to relate back to the algorithm and make connections so as to create a more holistic picture.

O.2 TRANSLATIONESE / WORD ORDER

Q: Multitexts are parallel texts or translations with the same meaning. There is little to no variation in word order, hence they are just "Translationese" (Gellerstam, 1986).
That is why they turn out to be the same, with no performance disparity. A: Our findings do show that when the semantics is properly controlled, such as in multitexts, the factors influencing performance are statistical properties related to sequence length and vocabulary, e.g. |V | or TTR, and the languages tested can be different. Semantic equivalence is also not a reason why we should expect neutralization of source language instances, as that would mean we should expect equal results across target languages. We agree that faithfulness is often a priority in producing good translations. Whether the translations are produced by humans or machines, only a single best translation can surface as the translation of choice. There may be many other competing hypotheses, but regardless of whether it is done through an automatic ranking algorithm by a machine or through a human expert, the purpose of translation is the same. However, styles and preferences in translations can vary. While faithfulness is generally preferred in the translations of legal texts, more freedom with skillful rearrangement of and play on words (or rather, character or sub-character sequences) or sounds being a criterion for literary texts could be appreciated by certain readers. We agree that it could be very interesting and necessary to model these variations, and we understand that languages can surface in many multimodal forms beyond the confines of texts as well. But with a data-driven perspective, to model this broader variation in language, we need corresponding datasets -we suggest contrast sets where the difference in e.g. sequential order is explicit. And for evaluation, we would require an even more systematic meta evaluation, one that spans different datasets. But the argument that language or data could be different beyond how it appears in one dataset is irrelevant in the evaluation of experiments involving said dataset.

P UNDERSTANDING THE PHENOMENA WITH ALTERNATE REPRESENTATIONS (EXTENDED VERSION)

[Appendix P is an extended version of § 4.] To understand why some languages show different results than others, we carried out a secondary set of control experiments with representations targeting the problematic statistical properties of the corresponding target languages.

Character level On the character level, it is well known that ZH differs from the other languages in its high |V|; in this study it has an averaged mean±std of 2550±1449 across all 5 data sizes from all 3 datasets, compared to 170±87 from all other 5 languages combined, whether these are in the Latin or Cyrillic alphabet or the Abjad script. But what is often not known is that the character sequence length of logographic languages such as ZH is typically short (think and compare the sequence length of the Ancient Egyptian hieroglyphs or the Demotic script with that of the Greek script on the Rosetta Stone). Here in our case, the averaged mean sequence length in characters for ZH is 35±19, compared to 129±71 from the other 5 languages. Heuristics to mitigate high |V| often involve decomposition, which automatically resolves the problem of short sequence length. We tried 2 methods to lower character |V| with representations in ASCII characters: Pinyin and Wubi. The former is a romanization of ZH characters based on their pronunciations, and the latter is an input algorithm that decomposes character-internal information into stroke shape and ordering and matches these to 5 classes of radicals (Lunde, 2008). We replaced the ZH data with these formats only on the target side and reran the experiments involving ZH as a target language (ZH_trg) on the character level. Results in Figure 2 and Table 1 show that the elimination of disparity on character level is possible if ZH is represented through Pinyin (transliteration), as in Subfigure 2c. But Wubi exhibits erraticity (Subfigure 2a). Wubi in our data has a maximum sequence length of 688 characters.
As we shall also show in our byte-level analysis below, there are reasons to attribute erraticity to length. Decomposition into strokes may seem like a natural remedy analogous to decomposing an EN word into character sequences, but one needs to be mindful of not exceeding an optimal length given finite computation. Considering that the ZH in the UN data is represented in simplified characters, decomposing traditional characters would surely complicate the problem. As there are also sub-character semantic and phonetic units (Zhang & Komachi, 2018) that can be exploited for information and aligned with character sequences of other alphabets, qualitative advances in this area can indeed be a new state of the art.

Byte level

On the byte level, we observe irregularity for AR and RU. We find the minimum sequence length of the target language to be one of the metrics correlating most strongly and positively with the total number of bits (ρ = 0.60). Our data is based on 300 characters as maximum length per line. While we wanted to retain at least 75% of the UN data after length filtering, this length still renders a maximum sequence length that exceeds 100 words (the default maximum length for the word alignment model, GIZA++ (Och & Ney, 2003), in the traditional SMT pipeline). Translated into bytes with UTF-8 encoding, data with 300 characters maximum gives us, e.g. for the 10^6-line datasets, an averaged mean±std of 185±106 in length for AR and 246±142 for RU, considerably larger than that for ZH (94±53) and for EN/ES/FR (≈145.41±77). With UTF-8 encoding, each character in AR, RU, and ZH takes 2 or more bytes. ZH typically has a shorter line length in characters, which compensates for the total byte sequence length, even when most ZH characters are 3 bytes each. AR and RU, however, generally have a long line length in characters, so when converted to bytes, the sequence length remains long even when most of the characters might be just 2 bytes each.
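The relation between character length and byte length under UTF-8 can be sketched in a few lines. The sample strings below are our own illustrative choices and are not taken from the UN data; the helper name `utf8_profile` is hypothetical.

```python
from typing import Tuple

# Illustrative sample strings (assumptions, not lines from the UN corpus).
SAMPLES = {
    "EN": "peace",
    "RU": "мир",
    "AR": "سلام",
    "ZH": "和平",
}


def utf8_profile(text: str) -> Tuple[int, int]:
    """Return (length in characters, length in UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for lang, text in SAMPLES.items():
    n_chars, n_bytes = utf8_profile(text)
    # Basic Latin letters take 1 byte, Cyrillic and Arabic letters 2 bytes,
    # and common CJK characters 3 bytes under UTF-8.
    print(f"{lang}: {n_chars} chars, {n_bytes} bytes, "
          f"{n_bytes / n_chars:.1f} bytes/char")
```

A short ZH line thus stays short in bytes despite its 3-byte characters, whereas a long AR or RU line roughly doubles in length.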
Results from our pairwise comparisons indicate 8 (non-directional) language pairs to be significantly different (see Table 1 under "BYTE"): ES-RU, EN-RU, FR-RU, RU-ZH, AR-RU, AR-EN, AR-ZH, and AR-FR, all involving AR or RU. (Appendix I also lists the language pairs with significant differences for other representations.)
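As a small sanity check of the counts involved, the following sketch enumerates the 30 directions and 15 non-directional pairs of the 6 languages and confirms that each of the 8 byte-level pairs listed above involves AR or RU. The variable and function names are ours, introduced for illustration only.

```python
from itertools import combinations, permutations

LANGS = ["AR", "EN", "ES", "FR", "RU", "ZH"]

# Ordered (source, target) directions and unordered language pairs.
directions = list(permutations(LANGS, 2))
pairs = list(combinations(LANGS, 2))

# Byte-level pairs reported as significantly different in the text above.
BYTE_SIGNIFICANT = ["ES-RU", "EN-RU", "FR-RU", "RU-ZH",
                    "AR-RU", "AR-EN", "AR-ZH", "AR-FR"]


def involves_ar_or_ru(pair: str) -> bool:
    """True if either language of a pair such as 'EN-RU' is AR or RU."""
    return bool({"AR", "RU"} & set(pair.split("-")))


print(len(directions), len(pairs))
print(all(involves_ar_or_ru(p) for p in BYTE_SIGNIFICANT))
```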



We provide a re-visualization of these grouped in 6 facets by target language in Figure 4 in Appendix A.

using implementation from https://github.com/rtmdrr/replicability-analysis-NLP

aside from its statistical properties related to length and vocabulary. "Language" here refers to language represented through all representations.

But there are no reasons why linguistics or linguistic typology cannot encompass a statistical science of language beyond/without "words", or with continuous representations of characters and bytes. In fact, that could complement the needs of language engineering and the NNs/DL/ML communities better.

GPUs used for experiments in this paper range from an NVIDIA TITAN RTX (24 GB), NVIDIA GeForce RTX 2080 Ti (11 GB), and a GTX Titan X (12 GB) to a GTX 1080 (8 GB). All jobs were run on a single GPU setting. Some word-level experiments involving ARtrg or RUtrg at 10 had to be run on a CPU as 24 GB VRAM were not sufficient. Models with higher maximum sequence lengths (e.g. byte models) were trained with 24 GB VRAM. Difference in equipment does not necessarily lead to degradation/improvement in scores.

The terms "line" and "sentence" have been used interchangeably in the NLP literature. We use "line" to denote a sequence that ends with a newline character and "sentence" as one with an ending punctuation. Most parallel corpora, such as ours, are aligned by line, as a line may be part of a sentence or without an ending punctuation (e.g. a header/title). Using a standardized unit such as "line" would also be a fairer measure to linguae/scriptiones continuae (languages/scripts with no explicit punctuation).
https://github.com/fxsjy/jieba

(non-directional) language pairs total possible from 30 language directions, p=0.001.

LANG PAIR   CHAR   Pinyin   Wubi   BYTE   ARRUt   ARRUs,t   WORD   BPE
AR-EN   X X X
AR-ES
EN-ES   X
AR-FR   X
EN-FR   X X
ES-FR
AR-RU   X
EN-RU   X X X X
ES-RU   X
FR-RU   X
AR-ZH   X X X X X
EN-ZH   X X
ES-ZH   X X X
FR-ZH   X X X X
RU-ZH   X X X X X

Language pairs with significant differences indicate that the 2 languages are not equally/similarly good or equally/similarly bad.
• Character models with ZH behave differently but the disparity can be eliminated with Pinyin.
• Byte models with AR and RU exhibit unstable performance due to length but this can be rectified with compression on the target side only (ARRUt).
• Word-based models, including BPE, however, consistently favor EN and ZH (though it is more of a "mis-segmentation" for the latter, see § 3 and Appendix J) and disfavor AR and RU (as morphologically complex languages with higher OOV rates).

An exception could be that of the type/token ratio (TTR). One could imagine applying TTR on the character level for ZH, and that would be indicative of its morphological richness on the character level. However, that has thus far never been practiced or recognized in NLP.

Figures are rounded to whole number. Complete tables of data statistics are provided in Appendix D. Top-3 correlates for each representation can be found in Appendix F.



Figure 1: Number of bits (the lower the better) as a function of data size plotted for all 30 directions. Subfigures 1d, 1e, and 1f depict the corresponding information as in 1a, 1b, and 1c (showing mean across 12 runs), respectively, but sorted in 6 facets by target language and with error bars. Legend in Subfigure 1g shows the correspondence between colors and source languages, in Subfigure 1h between line types and target languages. (These figures are also shown enlarged in Appendix G.)

Figure 2: Character-level remedies for ZH: Wubi vs. Pinyin.

Figure 4: Results of the Moses baseline systems (right group in each facet) and neural models (left) with 1.2 million iterations (1 iteration corresponds to 1 mini-batch) for the 30 directions of the 6-way UN corpus, tokenized (ZH segmented), lowercased, and length filtered to 100 BPE tokens.

Figure 5: CHAR: character models

Figure 16: sample-wise DD shown at 10^3

Figure 17: Kneser-Ney (monolingual) n-gram LMs on the same data (A) used for our neural CLMs

Figure 18: Same data with differing seeds

Figure 20: One-layer Transformer models

Number of language pairs out of 15 with significant differences, with respective p-values. ARRUt refers to AR & RU being optimized only on the target side; whereas ARRUs,t denotes optimization on both source and target sides (relevant for directions AR-RU and RU-AR).


1 would indicate that no type is ever seen twice, and a value very close to 0

Target-Train-Token-to-Parameter ratio (TTT2P ratio) for

Target-Train-Token-to-Parameter ratio (TTT2P ratio) for IWSLT'14DE-EN and EN-DE

Token-to-parameter ratios on non-neural monolingual trigram LMs

Number of language pairs out of 15 with significant differences, with respective p-values. BYTE 6layers is the representation with erratic ARtrg and RUtrg.
Columns: CHAR 6layers, BYTE 6layers, WORD 6layers, CHAR 1layer, BYTE 1layer, WORD 1layer

M ERRATICITY

Length has been an issue since the dawn of the encoder-decoder approach for NMT (Cho et al., 2014). Most work on length bias, except for that by e.g. Sountsov & Sarawagi (2016), seems to have focused on the evaluation of generated translation output and monitored performance degradation with respect to sequence length, often arguing that beam size plays a role (Koehn & Knowles, 2017; Murray & Chiang, 2018). (Related work in Stahlberg & Byrne (2019) provides a good summary of this issue.) While there could also be confounds in search, our experiments show that a kind of length bias can surface already with CLMing, without generation taking place. To our knowledge, length bias has not been expressed as a sample-wise non-monotonicity across a data size range as large as ours. While the connection between erraticity in CLMs and length bias in NMT models remains to be verified on a case-by-case basis, the knowledge that length also affects robustness (not just consistently poor/poorer performance) could support further experimentation/replication of any study. Failed attempts to reproduce results may be explainable by erraticity.

One may argue that erraticity may not be relevant when each model is more optimally trained (as opposed to being treated with our one-setting-for-all regime). But we do want to stress that this very stark contrast between erratic and non-erratic behavior is possible, prompting a question on fairness: is there a one-for-all setting under which the languages with non-erratic behavior shown in our study would demonstrate erraticity and vice versa?

To the best of our knowledge, the meta phenomenon of erraticity, as a sample-wise non-monotonicity measured intrinsically with cross-entropy and contributing to large variance across runs, is a novel discovery and contribution to research in robustness.
We hope our work will inspire further evaluation on other models/architectures, reflection and theories on our assumption of unbounded computation (e.g. Xu et al. (2020)), as well as new understanding and solutions that take data statistics and realistic computational aspects into account. We defer a more comprehensive analysis of erraticity with further experiments to future work.

DATA

To confirm that erraticity is not due to data-specific reasons, e.g. when certain data segments might be "easier" to model than others, we show figures from 2 runs (Figs. 18a and 18b) on the same dataset that differ only in seed yet show wildly differing performance. Note that changes in the y-direction can vary much, indicating large variance across runs.

By establishing that high variance holds across sample sizes, we showcase how it would be possible to test on just 2 or 3 data points of smaller sizes to gauge robustness at higher orders of magnitude. High variance serves as a signal of when the system is being "stress-tested" and hyperparameters need re-tuning. Spot-testing on a couple of smaller data sizes can indeed save much time and energy. Take our run B0 byte models as an example: training for EN-RU took 15 minutes for the 10^2-line model, 40 minutes for 10^3, 1 hour 50 minutes for 10^4, and 3 hours 36 minutes for 10^5. One can imagine how these would just be a fraction of the training time for bigger models. (Likewise for our ratio of target training token count to number of parameters: knowing when a representation might be prone to DD within a data size range could help prevent practitioners from prematurely declaring experimental results as negative or from unnecessarily rerunning an experiment because bigger data did not lead to better results.)

M.2 ADDITIONAL EXPERIMENT WITH LENGTH FILTERING TO 300 BYTES

Figures 19a and 19b show results of an additional experiment with a subset of the data in byte (UTF-8) representation length-filtered to 300, including dev data: erraticity remains for AR and RU. Scores are lower, though they cannot be compared with the experiments in the main paper due to the difference in dev data size (3,077 lines vs. 1,804 lines here). The total number of lines for train is 5,533,672 for each language, from which we took the initial 10^2 to 10^6. As in our main experiments, we filtered out only whole lines, i.e. not by discarding the tails of longer lines.
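A minimal sketch of whole-line length filtering in bytes is given below. It assumes, as an illustration only, that a line pair is kept when both source and target fit within the limit; the function name `filter_parallel_by_bytes` and the toy data are hypothetical, and this is not our actual preprocessing script.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel_by_bytes(
    pairs: Iterable[Tuple[str, str]], max_bytes: int = 300
) -> Iterator[Tuple[str, str]]:
    """Yield only line pairs whose UTF-8 lengths are within max_bytes.

    Over-long lines are dropped as a whole; nothing is truncated.
    Filtering on both sides is an assumption made for this sketch.
    """
    for src, trg in pairs:
        src_len = len(src.encode("utf-8"))
        trg_len = len(trg.encode("utf-8"))
        if src_len <= max_bytes and trg_len <= max_bytes:
            yield src, trg


toy = [
    ("short line", "короткая строка"),  # kept
    ("a" * 301, "b"),                   # source too long: dropped
    ("ok", "я" * 151),                  # 151 Cyrillic letters = 302 bytes: dropped
]
kept = list(filter_parallel_by_bytes(toy))
print(len(kept))
```

Note that the third pair would pass a 300-character filter but fails the 300-byte one, which is the kind of shift in the data that a byte-based limit introduces for AR and RU.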
300 bytes are not long sequences, but without data transformation or hyperparameter tuning, things can look unfair. The EN translation of the longest RU line in this dataset is: "47. It is

Leveraging language-specific code pages can be a useful practical trick, a reminder that there are alternatives to UTF-8 for analyses and back-end processing if data is clean and homogeneous and if success of larger-scale prediction is not a concern. But a more sustainable alternative is to design a more adaptive and flexible character encoding scheme in general, taking into account statistical profiles such as length (with respect to characters and bytes) and sub-character (atomic/elementary/compound) information of all (or as many as possible) of the world's languages.
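The effect of a language-specific code page on byte length can be sketched with Python's standard codecs. The choice of Windows-1251 for RU and Windows-1256 for AR is an assumption made for illustration; the sample strings and the helper name `byte_lengths` are likewise hypothetical.

```python
from typing import Dict, Sequence


def byte_lengths(text: str, encodings: Sequence[str]) -> Dict[str, int]:
    """Return the encoded length of text in bytes for each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Single-byte code pages halve the byte length of 2-byte UTF-8 scripts.
print(byte_lengths("мир", ["utf-8", "cp1251"]))   # RU sample
print(byte_lengths("سلام", ["utf-8", "cp1256"]))  # AR sample

# A code page only covers its own repertoire, hence the caveat about
# clean and homogeneous data: characters outside it cannot be encoded.
try:
    "和平".encode("cp1251")
except UnicodeEncodeError as err:
    print("not encodable in cp1251:", err.reason)
```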

Word level

The main difference between word and character/byte models is the absence of length as a top contributing factor correlating with performance. Instead, what matters more are metrics concerning word vocabulary, with the top correlate being the OOV token rate in the target language (ρ = 0.66). This is understandable as word segmentation neutralizes sequence lengths: the longer lengths in phonetic alphabetic scripts are shortened through multiple-character groupings, while the shorter lengths in logographic scripts (cf. the difference in length for the 3 scripts on the Rosetta Stone; logographic scripts are typically shorter than phonetic ones) are lengthened by the insertion of whitespaces. To remedy the OOV problem, we use BPE, which learns a fixed vocabulary of variable-length character sequences (on the word level, as it presupposes word segmentation) from the training data. It is more fine-grained than word segmentation and is known for its capability to model subword units for morphologically complex languages (e.g. AR and RU). We use the same vocabulary size of 30,000 as specified in Junczys-Dowmunt et al. (2016). This reduced our averaged OOV token rate by 89-100% across the 5 sizes. The number of language pairs with significant differences (p ≤ 0.001) dropped from 8 for word models to 7, showing how finer-grained modeling has a positive effect on closing the disparity gap.
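For concreteness, the OOV token rate can be sketched as the fraction of evaluation tokens that never occur in the training data. This definition, the whitespace tokenization, and the function name `oov_token_rate` are assumptions made for illustration and may differ in detail from the exact statistic reported here.

```python
from typing import Sequence


def oov_token_rate(train_tokens: Sequence[str], eval_tokens: Sequence[str]) -> float:
    """Fraction of eval tokens (counted with repetition) unseen in training."""
    if not eval_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in eval_tokens if tok not in vocab)
    return unseen / len(eval_tokens)


# Toy whitespace-tokenized data (illustrative only).
train = "the cat sat".split()
dev = "the dog sat down".split()
print(oov_token_rate(train, dev))
```

Under a subword scheme such as BPE, unseen words are split into units that are mostly present in the learned vocabulary, which is why the rate drops so sharply.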

