AMBERT: A PRE-TRAINED LANGUAGE MODEL WITH MULTI-GRAINED TOKENIZATION

Abstract

Pre-trained language models such as BERT have exhibited remarkable performance in many natural language understanding (NLU) tasks. The tokens in these models are usually fine-grained in the sense that for languages like English they are words or sub-words and for languages like Chinese they are characters. In English, for example, there are multi-word expressions which form natural lexical units, and thus the use of coarse-grained tokenization also appears to be reasonable. In fact, both fine-grained and coarse-grained tokenizations have advantages and disadvantages for learning of pre-trained language models. In this paper, we propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT), on the basis of both fine-grained and coarse-grained tokenizations. For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization, employs one encoder for processing the sequence of words and another encoder for processing the sequence of phrases, utilizes shared parameters between the two encoders, and finally creates a sequence of contextualized representations of the words and a sequence of contextualized representations of the phrases. Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE. The results show that AMBERT outperforms the existing best performing models in almost all cases; the improvements are particularly significant for Chinese. We also develop a version of AMBERT which performs equally well as AMBERT but uses about half of its inference time.

1. INTRODUCTION

Pre-trained models such as BERT, RoBERTa, and ALBERT (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2019) have shown great power in natural language understanding (NLU). The Transformer-based language models are first learned from a large corpus in pre-training, and then learned from labeled data of a downstream task in fine-tuning. With Transformer (Vaswani et al., 2017), the pre-training technique, and big data, the models can effectively capture the lexical, syntactic, and semantic relations between the tokens in the input text and achieve state-of-the-art performance in many NLU tasks, such as sentiment analysis, text entailment, and machine reading comprehension. In BERT, for example, pre-training is mainly conducted based on mask language modeling (MLM), in which about 15% of the tokens in the input text are masked with a special token [MASK], and the goal is to reconstruct the original text from the masked text. Fine-tuning is separately performed for individual tasks such as text classification, text matching, and text span detection. Usually, the tokens in the input text are fine-grained; for example, they are words or sub-words in English and characters in Chinese. In principle, the tokens can also be coarse-grained, that is, for example, phrases in English and words in Chinese. There are many multi-word expressions in English such as 'New York' and 'ice cream', and the use of phrases also appears to be reasonable. It is more sensible to use words (including single character words) in Chinese, because they are basic lexical units. In fact, all existing pre-trained language models employ single-grained (usually fine-grained) tokenization. Previous work indicates that the fine-grained approach and the coarse-grained approach both have pros and cons.
The tokens in the fine-grained approach are less complete as lexical units but their representations are easier to learn (because there are fewer token types and more tokens in training data), while the tokens in the coarse-grained approach are more complete as lexical units but their representations are more difficult to learn (because there are more token types and fewer tokens in training data). Moreover, for the coarse-grained approach there is no guarantee that tokenization (segmentation) is completely correct. Sometimes ambiguity exists and it would be better to retain all possibilities of tokenization. In contrast, for the fine-grained approach tokenization is carried out at the primitive level and there is no risk of 'incorrect' tokenization. For example, Li et al. (2019) observe that fine-grained models consistently outperform coarse-grained models in deep learning for Chinese language processing. They point out that the reason is that low frequency words (coarse-grained tokens) tend to have insufficient training data and tend to be out of vocabulary, and as a result the learned representations are not sufficiently reliable. On the other hand, previous work also demonstrates that masking of coarse-grained tokens in pre-training of language models is helpful (Cui et al., 2019; Joshi et al., 2020). That is, although the model itself is fine-grained, masking on consecutive tokens (phrases in English and words in Chinese) can lead to learning of a more accurate model. In Appendix A, we give examples of attention maps in BERT to further support the assertion. In this paper, we propose A Multi-grained BERT model (AMBERT), which employs both fine-grained and coarse-grained tokenizations. For English, AMBERT extends BERT by simultaneously constructing representations for both words and phrases in the input text using two encoders. Specifically, AMBERT first conducts tokenization at both word and phrase levels.
It then takes the embeddings of words and phrases as input to the two encoders. It utilizes the same parameters across the two encoders. Finally it obtains a contextualized representation for the word and a contextualized representation for the phrase at each position. Note that the number of parameters in AMBERT is comparable to that of BERT, because the parameters in the two encoders are shared. There are only additional parameters from multi-grained embeddings. AMBERT can represent the input text at both word-level and phrase-level, to leverage the advantages of the two approaches of tokenization, and create richer representations for the input text at multiple granularities. We conduct extensive experiments to compare AMBERT with the baselines as well as alternatives to AMBERT, using the benchmark datasets in English and Chinese. The results show that AMBERT outperforms single-grained BERT models by a large margin in both Chinese and English. In English, compared to Google BERT, AMBERT achieves a 2.0% higher GLUE score, a 2.5% higher RACE score, and a 5.1% higher SQuAD score. In Chinese, AMBERT improves the average score by over 2.7% in CLUE. Furthermore, a simplified version of AMBERT with only the fine-grained encoder can perform much better than the single-grained BERT models with a similar amount of inference computation. We make the following contributions in this work.
• Study of multi-grained pre-trained language models,
• Proposal of a new pre-trained language model called AMBERT as an extension of BERT, which makes use of multi-grained tokens and shared parameters,
• Empirical verification of AMBERT on the English and Chinese benchmark datasets GLUE, SQuAD, RACE, and CLUE.

2. RELATED WORK

There has been a large amount of work on pre-trained language models. ELMo (Peters et al., 2018) is one of the first pre-trained language models for learning of contextualized representations of words in the input text. Leveraging the power of Transformer (Vaswani et al., 2017), GPTs (Radford et al., 2018; 2019) are developed as unidirectional models to make prediction on the input text in an autoregressive manner, and BERT (Devlin et al., 2018) is developed as a bidirectional model to make prediction on the whole or part of the input text. Mask language modeling (MLM) and next sentence prediction (NSP) are the two tasks in pre-training of BERT. Since the inception of BERT, a number of new models have been proposed to further enhance its performance. XLNet (Yang et al., 2019) is a permutation language model which can improve the accuracy of MLM. RoBERTa (Liu et al., 2019) represents a new way of training more reliable BERT with a very large amount of data. ALBERT (Lan et al., 2019) is a light-weight version of BERT, which shares parameters across layers. StructBERT (Wang et al., 2019) incorporates word and sentence structures into BERT for learning of better representations of tokens and sentences. ERNIE2.0 (Sun et al., 2020) is a variant of BERT pre-trained in multiple tasks with coarse-grained tokens masked. ELECTRA (Clark et al., 2020) has a GAN-style architecture for efficiently utilizing all tokens in pre-training. It has been found that the use of coarse-grained tokens is beneficial for pre-trained language models. Devlin et al. (2018) point out that 'whole word masking' is effective for training of BERT. It is also observed that whole word masking is useful for building a Chinese BERT (Cui et al., 2019). In ERNIE (Sun et al., 2019b), entity level masking is employed as a strategy for pre-training and proved to be effective for language understanding tasks (see also Zhang et al., 2019).
In SpanBERT (Joshi et al., 2020), text spans are masked in pre-training and the learned model can substantially enhance the accuracies of span selection tasks. It is indicated that word segmentation is especially important for Chinese and a BERT-based Chinese text encoder is proposed with n-gram representations (Diao et al., 2019). All existing work focuses on the use of single-grained tokens in learning and utilization of pre-trained language models. In this work, we propose a general technique of exploiting multi-grained tokens for pre-trained language models and apply it to BERT.

3. OUR METHOD: AMBERT

In this section, we present the model, pre-training, and fine-tuning of AMBERT. We also discuss alternatives to AMBERT.

3.1. MODEL

Figure 1 gives an overview of AMBERT. AMBERT takes a text as input. Tokenization is conducted on the input text to obtain a sequence of fine-grained tokens and a sequence of coarse-grained tokens. AMBERT has two encoders, one for processing the fine-grained token sequence and the other for processing the coarse-grained token sequence. Each of the encoders has exactly the same architecture as that of BERT (Devlin et al., 2018) or Transformer encoder (Vaswani et al., 2017). The two encoders share the same parameters at each corresponding layer, except that each has its own token embedding parameters. The fine-grained encoder generates contextualized representations from the sequence of fine-grained tokens through its layers. In parallel, the coarse-grained encoder generates contextualized representations from the sequence of coarse-grained tokens through its layers. AMBERT outputs a sequence of contextualized representations for the fine-grained tokens and a sequence of contextualized representations for the coarse-grained tokens. AMBERT is expressive in that it learns and utilizes contextualized representations of the input text at both fine-grained and coarse-grained levels. The model retains all possibilities of tokenizations and automatically learns the attention weights (importance) of representations of multi-grained tokens.
AMBERT is also efficient through sharing of parameters between the two encoders. The parameters represent the same ways of combining representations, no matter whether representations are those of fine-grained tokens or coarse-grained tokens.
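The parameter sharing described above can be illustrated with a toy sketch. It is not the authors' implementation: the "encoder" here is a single hypothetical tanh layer with no attention (so the outputs are not actually contextualized), and the names `make_embeddings`, `shared_layer`, and `encode`, as well as all sizes, are assumptions made for illustration. The sketch only shows that the two token sequences use separate embedding tables while passing through the same layer weights.

```python
import math
import random

def make_embeddings(vocab, dim, rng):
    """One embedding table per granularity; these are NOT shared."""
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

def shared_layer(vec, weights):
    """Toy stand-in for an encoder layer: tanh(W @ vec)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

def encode(tokens, embeddings, weights):
    """Embed each token with its own table, then apply the shared layer."""
    return [shared_layer(embeddings[t], weights) for t in tokens]

rng = random.Random(0)
dim = 4
fine_tokens = ["[CLS]", "new", "york", "[SEP]"]    # fine-grained sequence
coarse_tokens = ["[CLS]", "new_york", "[SEP]"]     # coarse-grained sequence

fine_emb = make_embeddings(fine_tokens, dim, rng)      # fine-grained embeddings
coarse_emb = make_embeddings(coarse_tokens, dim, rng)  # coarse-grained embeddings
# A single weight matrix used by both encoders (the shared parameters).
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

r_x = encode(fine_tokens, fine_emb, weights)      # one vector per fine token
r_z = encode(coarse_tokens, coarse_emb, weights)  # one vector per coarse token
```

Under this reading, the only parameters added relative to a single-grained model are those of the second embedding table.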

3.2. PRE-TRAINING

Pre-training of AMBERT is mainly conducted on the basis of mask language modeling (MLM), at both fine-grained and coarse-grained levels. (Next sentence prediction (NSP) is not essential, as indicated in many studies after BERT (Lan et al., 2019; Liu et al., 2019). We only use NSP in our experiments for comparison purposes.) Let $\hat{x}$ denote the sequence of fine-grained tokens with some of them being masked, and $\bar{x}$ denote the masked fine-grained tokens. Let $\hat{z}$ denote the sequence of coarse-grained tokens with some of them being masked, and $\bar{z}$ denote the masked coarse-grained tokens. Pre-training is defined as optimization of the following function,

$$\min_\theta -\log p_\theta(\bar{x}, \bar{z} \mid \hat{x}, \hat{z}) \approx \min_\theta -\sum_{i=1}^{m} m_i \log p_\theta(x_i \mid \hat{x}) - \sum_{j=1}^{n} n_j \log p_\theta(z_j \mid \hat{z}), \qquad (1)$$

where $m_i$ takes 1 or 0 as values and $m_i = 1$ indicates that fine-grained token $x_i$ is masked, $m$ denotes the total number of fine-grained tokens; $n_j$ takes 1 or 0 as values and $n_j = 1$ indicates that coarse-grained token $z_j$ is masked, $n$ denotes the total number of coarse-grained tokens; and $\theta$ denotes parameters.
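As a minimal sketch of the right-hand side of the pre-training objective, the function below sums the negative log-probabilities of the original tokens over the masked positions at each granularity. The function names and the example probabilities are hypothetical; in practice the probabilities would come from the model's softmax outputs, which are not modeled here.

```python
import math

def masked_lm_loss(token_probs, mask):
    """Sum of -log p over masked positions.

    token_probs[i] is the (assumed positive) model probability of the original
    token at position i; mask[i] is 1 if that position is masked, else 0.
    """
    return -sum(math.log(p) for p, m in zip(token_probs, mask) if m)

def ambert_pretraining_loss(fine_probs, fine_mask, coarse_probs, coarse_mask):
    """Fine-grained MLM term plus coarse-grained MLM term."""
    return (masked_lm_loss(fine_probs, fine_mask)
            + masked_lm_loss(coarse_probs, coarse_mask))

# Hypothetical probabilities for 4 fine-grained and 3 coarse-grained tokens.
fine_probs, fine_mask = [0.9, 0.5, 0.8, 0.25], [0, 1, 0, 1]
coarse_probs, coarse_mask = [0.9, 0.5, 0.7], [0, 1, 0]

loss = ambert_pretraining_loss(fine_probs, fine_mask, coarse_probs, coarse_mask)
# Only masked positions contribute: -(log 0.5 + log 0.25) - log 0.5 = log 16.
```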

3.3. FINE-TUNING

In fine-tuning of AMBERT for classification, the fine-grained encoder and coarse-grained encoder create special [CLS] representations, and both representations are used for classification. Fine-tuning is defined as optimization of the following function, which is a regularized loss of multi-task learning, starting from the pre-trained model,

$$\min_\theta -\log p_\theta(y \mid x) = \min_\theta -\log p_\theta(y \mid r_{x0}) - \log p_\theta(y \mid r_{z0}) - \log p_\theta(y \mid [r_{x0}, r_{z0}]) + \lambda \lVert \tilde{y}_x - \tilde{y}_z \rVert_2, \qquad (2)$$

where $x$ is the input text, $y$ is the classification label, $r_{x0}$ and $r_{z0}$ are the [CLS] representations of the fine-grained encoder and coarse-grained encoder, $[a, b]$ denotes concatenation of vectors $a$ and $b$, $\lambda$ is a coefficient, and $\lVert \cdot \rVert_2$ denotes the L2 norm. The last term is based on agreement regularization (Brantley et al., 2019), which forces agreement between the predictions ($\tilde{y}_x$ and $\tilde{y}_z$). Similarly, fine-tuning of AMBERT for span detection can be carried out, in which the representations of fine-grained tokens are concatenated with the representations of corresponding coarse-grained tokens. The concatenated representations are then utilized in the task.
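The classification loss above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the three inputs are taken to be class-probability vectors produced by hypothetical classifier heads on the fine-grained [CLS] vector, the coarse-grained [CLS] vector, and their concatenation, and the agreement term is computed as the (unsquared) L2 norm of the difference between the first two.

```python
import math

def finetune_loss(p_x, p_z, p_xz, label, lam=1.0):
    """Three negative log-likelihood terms plus agreement regularization.

    p_x, p_z, p_xz: class-probability vectors from the fine-grained head, the
    coarse-grained head, and the head on the concatenated representation
    (all assumed to be outputs of a softmax, hence positive).
    label: index of the gold class. lam: regularization coefficient.
    """
    nll = -(math.log(p_x[label]) + math.log(p_z[label]) + math.log(p_xz[label]))
    agreement = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_x, p_z)))
    return nll + lam * agreement

# When the two heads agree exactly, the regularizer vanishes.
agree = finetune_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], label=0)

# When they disagree, the penalty is the L2 distance between their predictions.
with_reg = finetune_loss([0.8, 0.2], [0.5, 0.5], [0.7, 0.3], label=0, lam=1.0)
without_reg = finetune_loss([0.8, 0.2], [0.5, 0.5], [0.7, 0.3], label=0, lam=0.0)
```

Setting `lam=0.0` recovers the unregularized multi-task loss, which corresponds to the coefficient the paper reports using for RACE.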

3.4. ALTERNATIVES

We can consider two alternatives to AMBERT, which also rely on multi-grained tokenization. We refer to them as AMBERT-Combo and AMBERT-Hybrid and make comparisons of them with AMBERT in our experiments. AMBERT-Combo has two individual encoders, an encoder (BERT) working on the fine-grained token sequence and the other encoder (BERT) working on the coarse-grained token sequence, without parameter sharing between them. In learning and inference AMBERT-Combo simply combines the output layers of the two encoders. Its fine-tuning is similar to that of AMBERT. AMBERT-Hybrid has only one encoder (BERT) working on both the fine-grained token sequence and the coarse-grained token sequence. It creates representations on the concatenation of two sequences and lets the representations of the two sequences interact with each other at each layer. Its pre-training is formalized in the following function,

$$\min_\theta -\log p_\theta(\bar{x}, \bar{z} \mid \hat{x}, \hat{z}) \approx \min_\theta -\sum_{i=1}^{m} m_i \log p_\theta(x_i \mid \hat{x}, \hat{z}) - \sum_{j=1}^{n} n_j \log p_\theta(z_j \mid \hat{x}, \hat{z}), \qquad (3)$$

where the notations are the same as in (1): $\hat{x}$ and $\hat{z}$ are the masked input sequences, and $\bar{x}$ and $\bar{z}$ are the masked tokens to be recovered. Its fine-tuning is the same as that of BERT.

4. EXPERIMENTS

We make comparisons between AMBERT and the baselines including fine-grained BERT and coarse-grained BERT, as well as the alternatives including AMBERT-Combo and AMBERT-Hybrid, using benchmark datasets in both Chinese and English. The experiments on the alternatives can also be seen as an ablation study on AMBERT.

4.1. DATA FOR PRE-TRAINING

For Chinese, we use a corpus consisting of 25 million documents (57G uncompressed text) from Jinri Toutiao (footnote 1). Note that there is no common corpus for training of Chinese BERT. For English, we use a corpus of 13.9 million documents (47G uncompressed text) from Wikipedia and OpenWebText (Gokaslan & Cohen, 2019). Unfortunately, BookCorpus, one of the two corpora in the original paper for English BERT, is no longer publicly available. The characters in the Chinese texts are naturally taken as fine-grained tokens. We conduct word segmentation on the texts and treat the words as coarse-grained tokens. We employ a word segmentation tool based on an n-gram model. Both tokenizations exploit WordPiece embeddings (Wu et al., 2016). There are 21,128 characters and 72,635 words in the vocabulary of Chinese. The words in the English texts are naturally taken as fine-grained tokens. We perform coarse-grained tokenization on the English texts in the following way. Specifically, we first calculate the n-grams in the Wikipedia documents using KenLM (Heafield, 2011). We next build a phrase-level dictionary consisting of phrases whose frequencies are sufficiently high and whose last words highly depend on their previous words. We then employ a left-to-right search algorithm to perform phrase-level tokenization on the texts. There are 30,522 words and 77,645 phrases in the vocabulary of English.
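The English phrase-level tokenization can be sketched in miniature as below. Several choices here are assumptions rather than details given in the paper: phrases are restricted to bigrams, "sufficiently high frequency" and "last word highly depends on previous words" are modeled with hypothetical thresholds `min_count` and `min_cond_prob` on the bigram count and on P(last word | first word), n-gram counts are computed directly rather than with KenLM, and the left-to-right search is taken to be greedy longest match.

```python
from collections import Counter

def build_phrase_dict(sentences, min_count=2, min_cond_prob=0.6):
    """Collect bigrams that are frequent and whose second word is highly
    predictable from the first (both thresholds are illustrative)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {
        pair for pair, count in bigrams.items()
        if count >= min_count and count / unigrams[pair[0]] >= min_cond_prob
    }

def phrase_tokenize(words, phrases, max_len=2):
    """Greedy left-to-right search: at each position take the longest
    dictionary phrase, falling back to the single word."""
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_len, 0, -1):
            cand = tuple(words[i:i + n])
            if n == 1 or (len(cand) == n and cand in phrases):
                tokens.append(" ".join(cand))
                i += n
                break
    return tokens

corpus = [
    ["i", "love", "new", "york"],
    ["new", "york", "is", "big"],
    ["ice", "cream", "is", "cold"],
    ["i", "like", "ice", "cream"],
]
phrases = build_phrase_dict(corpus)
coarse = phrase_tokenize(["i", "like", "new", "york"], phrases)
```

On this toy corpus the dictionary contains only "new york" and "ice cream", so the example sentence is tokenized into two words and one phrase.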

4.2. EXPERIMENTAL SETUP

We make use of the same parameter settings for the AMBERT and BERT models. All models in this paper are 'base-models' having 12 layers of encoder. It is too computationally expensive for us to train the models as 'large models' having 24 layers. The hyper-parameters are basically the same as those in the original BERT paper (Devlin et al., 2018) , which are given in Appendix C. The optimizer is Adam (Kingma & Ba, 2014). To enhance efficiency, we use mixed precision for all the models. Training is carried out on Nvidia V-100. The numbers of GPUs used for training are from 32 to 64, depending on the model sizes. In pre-training of the AMBERT models, in total 15% of the coarse-grained tokens are masked, which is the same proportion for the BERT models. To retain consistency, the masked coarse-grained tokens are also masked as fine-grained tokens. In fine-tuning, we use the same hyper-parameters as those in the original papers of the baselines, and all the hyper-parameters are given in Appendix C.
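The consistency constraint on masking can be sketched as follows: a fraction of the coarse-grained tokens is selected, and every fine-grained token inside a selected coarse-grained token is masked as well. This is a simplified illustration; the input format (each coarse-grained token given as the list of its fine-grained pieces), the function name, and the rounding rule are assumptions, and BERT's random-replacement and keep-unchanged variants of masking are omitted.

```python
import random

MASK = "[MASK]"

def mask_multi_grained(coarse_tokens, mask_rate=0.15, seed=0):
    """Mask about `mask_rate` of the coarse-grained tokens and the
    fine-grained tokens they cover.

    coarse_tokens: list of coarse tokens, each a list of its fine-grained pieces.
    Returns (masked coarse sequence, masked fine sequence, chosen indices).
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(coarse_tokens)))
    chosen = set(rng.sample(range(len(coarse_tokens)), n_mask))
    coarse_out, fine_out = [], []
    for idx, pieces in enumerate(coarse_tokens):
        if idx in chosen:
            coarse_out.append(MASK)
            fine_out.extend([MASK] * len(pieces))  # keep the two views consistent
        else:
            coarse_out.append(" ".join(pieces))
            fine_out.extend(pieces)
    return coarse_out, fine_out, chosen

sentence = [["the"], ["new", "york"], ["times"], ["reported"],
            ["on"], ["ice", "cream"], ["sales"]]
coarse_masked, fine_masked, chosen = mask_multi_grained(sentence)
```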

4.3.1. BENCHMARKS

We use the benchmark datasets of Chinese Language Understanding Evaluation (CLUE) (Xu et al., 2020) for experiments in Chinese. CLUE contains six classification tasks, namely TNEWS, IFLYTEK, CLUEWSC2020, AFQMC, CSL, and CMNLI (footnote 2), and three reading comprehension tasks, namely CMRC2018, ChID, and C³. The details of all the benchmarks are shown in Appendix B. Data augmentation is also performed for all models in the tasks of TNEWS, CSL and CLUEWSC2020 to achieve better performances (see Appendix D for detailed explanation).

4.3.2. EXPERIMENTAL RESULTS

We compare AMBERT with the BERT baselines, including the BERT model released from Google, referred to as Google BERT, and the BERT model trained by us, referred to as Our BERT, including character based (fine-grained) and word based (coarse-grained) models. A case study is given in Appendix E. Table 1 shows the results of the classification tasks. AMBERT improves average scores of the BERT baselines by about 1.0% and also works better than AMBERT-Combo and AMBERT-Hybrid. The results of the machine reading comprehension (MRC) tasks are shown in Table 2. AMBERT improves average scores of the BERT baselines by over 3.0%. Our BERT (word) performs poorly in CMRC2018. This is probably because the results of word segmentation are not accurate enough for the task. AMBERT-Combo and AMBERT-Hybrid are on average better than single-grained BERT models. AMBERT further outperforms both of them. We also compare AMBERT with the state-of-the-art models at the leader board of CLUE (footnote 3). The base models, whose parameters are fewer than 200M, are trained with different datasets and procedures, and thus the comparisons should only be taken as references. Note that the settings of the base models are the same as those of Xu et al. (2020). Table 3 shows the results. The average score of AMBERT is higher than all the other models. We conclude that multi-grained tokenization is very helpful for pre-trained language models and the design of AMBERT is reasonable.
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of nine NLU tasks. Following BERT (Devlin et al., 2018), we exclude the task WNLI for the reason that results of different models on this task are undifferentiated. In addition, three machine reading comprehension tasks are also included, i.e., SQuAD v1.1, SQuAD v2.0, and RACE. The details of English benchmarks can be found in Appendix B.

4.4.2. EXPERIMENTAL RESULTS

We compare AMBERT with the BERT models on the tasks in GLUE. The results of Google BERT are from the original paper (Devlin et al., 2018), and the results of Our BERT are obtained by us. From Table 4 we can see the following trends. 1) Multi-grained models, particularly AMBERT, can achieve better results than single-grained models. 2) Among the multi-grained models, AMBERT performs best with fewer parameters and less computation. A case study is given in Appendix E. We also make comparisons on the SQuAD tasks. The results of Google BERT are either from the papers (Devlin et al., 2018; Yang et al., 2019) or from our runs with the official code. From Table 5 we draw the following conclusions. 1) In SQuAD, AMBERT outperforms Google BERT by a large margin. Our BERT (word) generally performs well and Our BERT (phrase) performs poorly in the span detection tasks. 2) In RACE, AMBERT performs best among all the baselines for both development set and test set. 3) AMBERT is the best multi-grained model. We compare AMBERT with the state-of-the-art models in both GLUE (footnote 4) and MRC. The results of baselines, in Table 6, are either reported in published papers or re-implemented by us with HuggingFace's Transformers (Wolf et al., 2019). We use the provided implementation in HuggingFace's Transformers, without additional data augmentation or question-answering module (footnote 5). Note that AMBERT outperforms all the models on average (footnote 6) without using training techniques such as bigger batches and dynamic masking.

4.5. ENHANCEMENT OF INFERENCE SPEED

Although AMBERT can make significant improvements over single-grained models, the computation time is doubled. To enhance the inference speed of AMBERT, we develop a simplified version of it, referred to as AMBERT-Single, in which the two encoders are pre-trained and fine-tuned in learning and only the fine-grained encoder is utilized in inference. Note that it has the same amount of inference computation as BERT. We conduct experiments on CLUE/GLUE/SQuADs/RACE with AMBERT-Single. The results on the development sets are shown in Table 7. We conclude that, a) for the English tasks, AMBERT-Single achieves similar results as AMBERT and outperforms "Our BERT (Single)" by a large margin using the same amount of inference time; b) for the Chinese tasks, AMBERT-Single is slightly worse than AMBERT and performs much better than "Our BERT (Single)". Therefore, in practice, one can train an AMBERT with two encoders and use only one of them in inference, i.e., AMBERT-Single.

4.6. REGULARIZATION IN FINE-TUNING

Table 8 shows the results of using different values as regularization coefficients in fine-tuning on the development sets of CLUE, GLUE and RACE. It appears that for most tasks the use of regularization is necessary. For simplicity, we did not use the best value of coefficient for each task and instead we adopt 0.0 for RACE and 1.0 for the other tasks. 

4.7. DISCUSSIONS

We further investigate the reason that AMBERT is superior to AMBERT-Combo. Figure 2 shows the distances between the [CLS] representations of the fine-grained encoder and coarse-grained encoder in AMBERT-Combo and AMBERT after pre-training, in terms of cosine dissimilarity (one minus cosine similarity) and normalized Euclidean distance. One can see that the distances in AMBERT-Combo are larger than the distances in AMBERT in the tasks. We perform the assessment using the data in the other tasks and find similar trends. The results indicate that the representations of the fine-grained encoder and coarse-grained encoder are closer in AMBERT than in AMBERT-Combo. These are natural consequences of using AMBERT and AMBERT-Combo, whose parameters are respectively shared and unshared across encoders. It implies that the higher performance of AMBERT is due to its parameter sharing, which can use fewer parameters to learn and represent similar ways of combining tokens no matter whether they are fine-grained or coarse-grained. An intuitive explanation is that the ways of combining representations of fine-grained tokens and the ways of combining representations of coarse-grained tokens "in the same contexts" are exactly the same. We also examine the reasons that AMBERT works better than AMBERT-Hybrid, while both of them exploit multi-grained tokenization. Figure 3 shows the attention weights of first layers in AMBERT and AMBERT-Hybrid, as well as the single-grained BERT models, after pre-training. In AMBERT-Hybrid, the fine-grained tokens attend more to the corresponding coarse-grained tokens and as a result the attention weights among fine-grained tokens are weakened. In contrast, in AMBERT the attention weights among fine-grained tokens and those among coarse-grained tokens are intact. It appears that attentions among single-grained tokens (fine-grained ones and coarse-grained ones) play important roles in downstream tasks.
[Figure 3 panels: Our BERT (char), Our BERT (word), AMBERT-Hybrid, AMBERT; Our BERT (word), Our BERT (phrase), AMBERT-Hybrid, AMBERT.]
To answer the question why the improvements by AMBERT on Chinese are larger than on English in the same pre-training settings, we make a further analysis. We tokenize 10,000 randomly selected Chinese sentences in CLUE with our Chinese word tokenizer. As shown in Table 9, the average proportion of words is 51.5%, which indicates that about half of the tokens are fine-grained and half are coarse-grained in Chinese. We also tokenize 10,000 randomly selected English sentences in GLUE with our English phrase tokenizer. The average proportion of phrases is only 13.1%, which means that there are far fewer coarse-grained tokens than fine-grained tokens in English. Therefore, we postulate that for Chinese it is necessary for a model to process the language at both fine-grained and coarse-grained levels. AMBERT indeed has the capability.

5. CONCLUSION

In this paper, we have proposed a novel pre-trained language model called AMBERT, as an extension of BERT. AMBERT employs multi-grained tokenization, that is, it uses both words and phrases in English and both characters and words in Chinese. With multi-grained tokenization, AMBERT learns in parallel the representations of the fine-grained tokens and the coarse-grained tokens using two encoders with shared parameters. Experimental results have demonstrated that AMBERT significantly outperforms BERT and other models in NLU tasks in both English and Chinese. AMBERT increases the average score of Google BERT by about 2.7% in the Chinese benchmark CLUE. AMBERT improves Google BERT by over 3.0% on a variety of tasks in the English benchmarks GLUE, SQuAD (1.1 and 2.0), and RACE. We also develop AMBERT-Single, which performs equally well as AMBERT with about half of the inference time. As future work, we plan to study the following issues: 1) to investigate model acceleration methods in learning of AMBERT, such as sparse attention (Child et al., 2019; Kitaev et al., 2020; Zaheer et al., 2020) and synthetic attention (Tay et al., 2020); 2) to apply the technique of AMBERT to other pre-trained language models such as XLNet; 3) to employ AMBERT in other NLU tasks.

A ATTENTION MAPS FOR SINGLE-GRAINED MODELS

We construct fine-grained and coarse-grained BERT models for English and Chinese, and examine the attention maps of the models using the BertViz tool (Vig, 2019). Figure A shows the attention maps of the first layer of fine-grained models for several sentences in English and Chinese. One can see that there are tokens that improperly attend to other tokens in the sentences. For example, in the English sentences, the words "drawing", "new", and "dog" have high attention weights to "portrait", "york", and "food", respectively, which are not appropriate. Similarly, in the Chinese sentences, the chars "拍", "北", "长" have high attention weights to "卖", "京", "市", respectively, which are also not reasonable. (It is verified that the bottom layers of BERT mainly represent lexical information, the middle layers mainly represent syntactic information, and the top layers mainly represent semantic information (Jawahar et al., 2019).) Ideally, at the first layer a token should only attend to the tokens with which it forms a lexical unit. This cannot be guaranteed in the fine-grained BERT model, however, because usually a fine-grained token may belong to multiple lexical units (i.e., there is ambiguity). Figure 5 shows the attention maps of the first layer of coarse-grained models for the same sentences in English and Chinese. In the English sentences, the words are combined into the phrases of "drawing room", "york minister", and "dog food". The attentions are appropriate in the first two sentences, but not in the last sentence because of the incorrect tokenization. Similarly, in the Chinese sentences, the high attention weights of words "球拍 (bat)" and "京城 (capital)" are reasonable, but that of word "市长 (mayor)" is not. Note that incorrect tokenization is inevitable.

B DETAILED DESCRIPTIONS FOR THE BENCHMARKS

B.1 CHINESE TASKS

TNEWS is a text classification task in which titles of news articles in TouTiao are to be classified into 15 classes. IFLYTEK is a task of assigning app descriptions into 119 categories. CLUEWSC2020, standing for the Chinese Winograd Schema Challenge, is a co-reference resolution task. AFQMC is a binary classification task that aims to predict whether two sentences are semantically similar. CSL uses the Chinese Scientific Literature dataset containing abstracts and their keywords of papers and the goal is to identify whether given keywords are the original keywords of a paper. CMNLI is based on translation from MNLI (Williams et al., 2017), which is a large-scale, crowd-sourced entailment classification task. CMRC2018 (Cui et al., 2018) makes use of a span-based dataset for Chinese machine reading comprehension. ChID (Zheng et al., 2019) is a large-scale Chinese IDiom cloze test. C³ (Sun et al., 2019a) is a free-form multiple-choice machine reading comprehension for Chinese.

B.2 ENGLISH TASKS

CoLA (Warstadt et al., 2019) contains English acceptability judgments drawn from books and journal articles on linguistic theory. SST-2 (Socher et al., 2013) consists of sentences from movie reviews 

D DATA AUGMENTATION

To enhance the performance, we conduct data augmentation for the three Chinese classification tasks of TNEWS, CSL, and CLUEWSC2020. In TNEWS, we use both keywords and titles. In CSL, we concatenate keywords with a special token " ". In CLUEWSC2020, we duplicate a few instances having pronouns in the training data such as "她 (she)".

[Rows of Table 13 as extracted:]
In December of that year, the ABC television network premiered The Dating Game, a pioneer series in its genre, which was a reworking of the blind date concept in which a suitor selected one of three contestants sight unseen based on the answers to selected questions. (In December of that year, the ABC television network premiered the dating game, a pioneer series in its genre, which was a reworking of the blind date concept in which a suitor selected one of three contestants sight unseen based on the answers to selected questions.) | 0 0 1 0
What are two basic primary resources used to guage complexity? (What are two basic primary resources used to guage complexity?) The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. (The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage.) | 0 1 1 0
What is the frequency of the radio station WBT in North Carolina? (What is the frequency of the radio station WBT in north carolina?) WBT will also simulcast the game on its sister station WBTFM (99.3 FM), which is based in Chester, South Carolina. (WBT will also simulcast the game on its sister station WBTFM (99.3 FM), which is based in Chester, South Carolina.)

We also qualitatively study the results of BERT and AMBERT, and find that they support our claims (cf. Section 1) very well.
Here, we give some random examples from the entailment tasks (QNLI and CMNLI) in Table 13 . One can have the following observations. 1) The fine-grained models (e.g., Our BERT word) cannot effectively use complete lexical units such as "Doctor Who" and "打死" (sentence pairs 1 and 5), which may result in incorrect predictions. 2) The coarse-grained models (e.g., Our BERT phrase), on the other hand, cannot effectively deal with incorrect tokenizations, for example, "the blind" and "格式" (sentence pairs 2 and 6). 3) AMBERT is able to make effective use of complete lexical units such as "sister station" in sentence pair 4 and "员工/ 工人" in sentence pair 7, and robust to incorrect tokenizations, such as "used to" in sentence pair 3. 4) AMBERT can in general make more accurate decisions on difficult sentence pairs with both fine-grained and coarse-grained tokenization results.
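The data augmentation steps of Appendix D can be sketched as follows. This is a minimal illustration, not the authors' code: the field name `pronoun`, the way titles and keywords are joined for TNEWS, and the placeholder separator `KEYWORD_SEP` are all assumptions (the actual special token is not legible in the text above), and the sketch duplicates every matching instance whereas the paper duplicates only "a few".

```python
from typing import Dict, List, Sequence

# Hypothetical placeholder: the real special token is not recoverable from the text.
KEYWORD_SEP = "<KW_SEP>"


def tnews_input(title: str, keywords: Sequence[str]) -> str:
    """TNEWS: use both the title and the keywords (joining format is assumed)."""
    return title + " " + " ".join(keywords)


def csl_keywords(keywords: Sequence[str], sep: str = KEYWORD_SEP) -> str:
    """CSL: concatenate the keywords with a special token."""
    return sep.join(keywords)


def duplicate_pronoun_instances(
    train: List[Dict[str, str]],
    pronouns: Sequence[str] = ("她",),
    copies: int = 1,
) -> List[Dict[str, str]]:
    """CLUEWSC2020: append extra copies of training instances with given pronouns."""
    augmented = list(train)
    for example in train:
        # "pronoun" is an assumed field name for the instance's pronoun.
        if example["pronoun"] in pronouns:
            augmented.extend(dict(example) for _ in range(copies))
    return augmented
```

Duplication is applied to the training split only, so the development and test distributions stay unchanged.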



Jinri Toutiao is a popular news app in China.
The task is introduced at the CLUE website.
The leader board of CLUE is at https://www.cluebenchmarks.com/rank.html.
The leader board of GLUE is at https://gluebenchmark.com/leaderboard.
For that reason, we cannot use the results for SQuAD 2.0 in Clark et al. (2020).
In the previous versions, we reported the results (average score 82.3) of AMBERT when we were only able to use a smaller dataset for pre-training in English.



Figure 1: An overview of AMBERT, showing the process of creating multi-grained representations. The input is a sentence in English and the output is the overall representation of the sentence. There are two encoders for processing the sequence of fine-grained tokens and the sequence of coarse-grained tokens respectively. The final contextualized representations of fine-grained tokens and coarse-grained tokens are denoted as $r_{x_0}, r_{x_1}, \cdots, r_{x_m}$ and $r_{z_0}, r_{z_1}, \cdots, r_{z_n}$ respectively.

Figure 2: Distances between representations of fine-grained and coarse-grained encoders (representations of [CLS]) in AMBERT-Combo and AMBERT. CD and ED stand for cosine dissimilarity (one minus cosine similarity) and normalized Euclidean distance respectively.
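The two distances named in the caption of Figure 2 can be illustrated with a short sketch. Cosine dissimilarity follows the caption directly (one minus cosine similarity). The exact form of the "normalized Euclidean distance" is not given here, so the version below, the Euclidean distance between L2-normalized vectors, is an assumption.

```python
import math
from typing import Sequence


def _norm(u: Sequence[float]) -> float:
    """L2 norm of a vector."""
    return math.sqrt(sum(a * a for a in u))


def cosine_dissimilarity(u: Sequence[float], v: Sequence[float]) -> float:
    """CD: one minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (_norm(u) * _norm(v))


def normalized_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """ED (assumed definition): Euclidean distance after L2-normalizing u and v."""
    nu, nv = _norm(u), _norm(v)
    return math.sqrt(sum((a / nu - b / nv) ** 2 for a, b in zip(u, v)))
```

Under this assumed definition the two quantities are tied by ED² = 2·CD, so they rank pairs of [CLS] representations identically and differ only in scale.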

Figure 3: Attention weights of first layers of Our BERT (word/phrase), AMBERT-Hybrid and AMBERT, for English and Chinese sentences.

Figure 4: Attention maps of first layers of fine-grained BERT models for English and Chinese sentences. The Chinese sentences are "商店里的乒乓球拍卖完了 (Table tennis bats are sold out in the shop)", "北上京城施展平生报复 (Go north to Beijing to fulfill the dream)", "南京市长江大桥位于南京 (The Nanjing Yangtze River bridge is located in Nanjing)". Different colors represent attention weights in different heads and darkness represents weight.

Figure 5: Attention maps of first layers of coarse-grained BERT models for English and Chinese sentences. Note that tokenizations may have errors.

打/那些/面对/我们/的/人/，/乔恩/告诉/阿/德/林/。)
"打死那些面对我们的人，"阿德林对乔恩说。 ("/打死/那些/面对/我们/的/人/，/"/阿/德/林/对/乔恩/说/
/已经/采取/了/一/系列/措施/来/增强/我们/员工/的/能力/，/并/对/他们/进行/投资/。)
/行业/的/故事/之所以/活跃/起来/，/是/因为/现实/太平/淡/了/。)
现实是如此平淡，以致于虚拟现实技术业务得到了刺激。 (现实/是/如此/平淡/，/以致/于/虚拟/现实/技术/业务/得到/了/刺激/

Performances on classification tasks in CLUE in terms of accuracy (%). The numbers in boldface denote the best results of tasks. Average accuracies of models are also given. Numbers of parameters (param) and time complexities (cmplx) of models are also shown, where l, n, and d denote layer number, sequence length, and hidden representation size respectively. The tasks with mark † are those with data augmentation.

Performances on MRC tasks in CLUE in terms of F1, EM (Exact Match) and accuracy. The numbers in boldface denote the best results of tasks. Average scores of models are also given.

State-of-the-art results of Chinese base models in CLUE.

Performance on the tasks in GLUE. The average score over all the tasks is slightly different from the official GLUE score, since we exclude WNLI. CoLA uses Matthews Corr. MRPC and QQP use both F1 and accuracy scores. STS-B computes Pearson-Spearman Corr. Accuracy scores are reported for the other tasks. Results of MNLI include MNLI-m and MNLI-mm. The other settings are the same as Table 1.

Performance on three English MRC tasks. We use EM and F1 to evaluate the performance of text span detection (SQuAD), and report accuracies for RACE, on both the development set and the test set.



Performances on the development sets of CLUE, GLUE, SQuAD and RACE with AMBERT-Single or Our BERT (better one) for inference. CN-Models and EN-Models denote Chinese and English pre-trained models respectively. CoLA uses Matthews Corr. We report EM of CMRC2018 and the average EM of SQuAD1.1 and SQuAD2.0. The other metrics are all accuracies.

Performances on the development sets of CLUE, GLUE and RACE with different regularization coefficients in fine-tuning. CN-Models and EN-Models stand for Chinese and English pre-trained models respectively. CoLA uses Matthews Corr. The other metrics are accuracies.



Hyper-parameters for fine-tuning of Chinese tasks.

Hyper-parameters for fine-tuning of English tasks.

Case study for sentence matching tasks in both English and Chinese (QNLI and CMNLI). The value "0" denotes an entailment relation, while the value "1" denotes no entailment relation. WORD/PHRASE represents Our BERT word/phrase. In English the tokens in the same phrase are concatenated with " ", and in Chinese phrases are split with "/".

There have also been many references to Doctor Who in popular culture and other science fiction, including Star Trek: The Next Generation ("The Neutral Zone") and Leverage. (There have also been many references to Doctor Who in popular culture and other science fiction, including Star Trek: the next generation ("the neutral zone") and
What was the name of the blind date concept program debuted by ABC in 1966? (What was the name of the blind date concept program debuted by ABC in 1966?)

C HYPER-PARAMETERS

C.1 HYPER-PARAMETERS IN PRE-TRAINING

We adopt the standard hyper-parameters of BERT in pre-training of the models, except for batch sizes, which are tuned to make our fine-grained BERT models comparable to the Google BERT models. Table 10 shows the hyper-parameters of our Chinese AMBERT and English AMBERT. Our BERT models and the alternatives of AMBERT (AMBERT-Combo and AMBERT-Hybrid) all use the same hyper-parameters in pre-training.

C.2 HYPER-PARAMETERS IN FINE-TUNING

For the Chinese tasks, since none of the original papers report detailed hyper-parameters for fine-tuning of the baseline models, we uniformly use the hyper-parameters shown in Table 11 for all models, except for the number of training epochs, because AMBERT and AMBERT-Combo have more parameters and need more training to converge. For every model, we choose the number of training epochs at which the performance on the development set stops improving. As for the English tasks, Table 12 shows all the hyper-parameters in fine-tuning of the models. We adopt the best hyper-parameters in the original papers for the baselines. Moreover, for AMBERT‡, we also tune the learning rate ([1e-5, 2e-5, 3e-5]) and the batch size ([16, 32]) for GLUE with the same method as in RoBERTa (Liu et al., 2019).
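The GLUE tuning described above amounts to a small sweep over six configurations. The sketch below is illustrative only: `evaluate` is a hypothetical callback that fine-tunes with the given settings and returns a development-set score, and selecting the configuration by that score is an assumption about the procedure, not a detail stated in this appendix.

```python
from itertools import product
from typing import Callable, Tuple

# Search space taken from the text above.
LEARNING_RATES = [1e-5, 2e-5, 3e-5]
BATCH_SIZES = [16, 32]


def grid_search(evaluate: Callable[[float, int], float]) -> Tuple[float, float, int]:
    """Return (best_score, learning_rate, batch_size) over the full grid.

    `evaluate(lr, batch_size)` is a hypothetical user-supplied function that
    fine-tunes a model and returns its development-set score (higher is better).
    """
    best = None
    for lr, batch_size in product(LEARNING_RATES, BATCH_SIZES):
        score = evaluate(lr, batch_size)
        if best is None or score > best[0]:
            best = (score, lr, batch_size)
    return best
```

Ties are resolved in favor of the configuration visited first, which keeps the result deterministic for a fixed `evaluate`.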

