LEARNING UNCERTAINTY FOR UNKNOWN DOMAINS WITH ZERO-TARGET-ASSUMPTION

Abstract

We introduce Maximum-Entropy Rewarded Reinforcement Learning (MERRL), a framework that selects training data for more accurate Natural Language Processing (NLP). Because conventional data selection methods choose training samples based on knowledge of the test domain rather than on real-life data, they frequently fail in unknown domains such as patents and Twitter. Our approach selects training samples that maximize information uncertainty measured by entropy, including observation entropy (empirical Shannon entropy, Min-entropy, and Rényi entropy) and prediction entropy using mutual information, to cover more of the queries that may appear in unknown worlds. MERRL, using regularized A2C and SAC, reduces perplexity by up to 99.7 points (43.4% relative) in language modeling, improves accuracy by up to 25.0 points (40.0% relative) in sentiment analysis, and improves F1 by up to 5.0 points (30.8% relative) in named entity recognition across various domains, demonstrating strong generalization on unknown test sets.

1. INTRODUCTION

We introduce a novel training set selection method that improves out-of-domain Natural Language Processing (NLP) model accuracy without requiring target-domain information. Machine learning is a data-driven process whose success relies heavily on the data in use. System performance is typically measured on a specific test set; in reality, however, the test domain is often unknown during model training, resulting in a critical performance gap between laboratory findings and language use in the real world. For example, we often observe that a system reporting human-parity results generates surprising errors in real-life use. Some work addresses this discrepancy by augmenting or selecting data (Wang et al., 2022). Data optimization can be expensive and error-prone for general domains (Jha et al., 2020). Thus, conventional approaches choose critical in-domain data that may work well for a pre-defined target domain (Moore & Lewis, 2010; Kirchhoff & Bilmes, 2014; van der Wees et al., 2017; Fan et al., 2017; Qu et al., 2019; Liu et al., 2019; Kang et al., 2020). However, domain-specific data selection has two problems. First, shifting data toward one target domain may fail in the source and other domains. Second, when target domains are unknown, as in most real-world applications, we do not know what data we will receive before the model launches. In our study, we select training data without using target-domain information to achieve learning generalization. Our data selection objective is to maximize the uncertainty of the training data. Specifically, we use entropy to measure the uncertainty based on the principle of maximum entropy, which states that, subject to known constraints, the probability distribution that best represents the current state of knowledge is the one with the largest entropy (Jaynes, 1957; Katz, 1967; Hernando et al., 2012).
Therefore, a system with the largest remaining uncertainty contains the fewest extra biases or uncalled-for assumptions and is ideal for modeling distributions of unknown test domains. To that end, we propose to measure the amount of uncertainty both in our observational data and in our model's prediction output. As observation entropy, we use Shannon entropy, Rényi entropy, and Min entropy on the n-gram relative frequencies of all sentences in the dataset, rather than of one sentence, to model the dependency among sentences. As prediction entropy, we compute the mutual information between the neural network input and its latent representation to quantify how well the information is compressed according to the Information Bottleneck principle. In this way, our approach models inter-dependencies among samples that are critical to learning but often neglected (Steinwart et al., 2009; Zhelezniak et al., 2019; Fan et al., 2017). Putting things into NLP context, we may ask: "Why does higher entropy of the training dataset lead to a more generalized learning ability of an NLP model?" Consider a toy example of three sentences {To be. Not to be. To be or not to be.} with word frequencies "or" (1), "to" (4), "be" (4), "not" (2). Although "to" occurs more often, "not" represents the opposite meaning and contributes more to the Shannon entropy value. As a hypothetical example, we assume these four words compose the full vocabulary of our world. Now consider each word as a sample, i.e., Pr("to") = 4/11, Pr("or") = 1/11, Pr("be") = 4/11, and Pr("not") = 2/11. Suppose there are subsets A and B, where subset A selects "to" four times, which has a unigram entropy of 0.16, while subset B selects "to", "or", "be", and "not" once each, which has a unigram entropy of 0.49.
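The toy comparison above can be sketched in a few lines. This is a minimal illustration using the plain within-subset relative-frequency estimator; the paper's reported values (0.16 and 0.49) depend on the authors' particular estimator and log base, so only the ordering, not the exact numbers, is reproduced here.

```python
import math
from collections import Counter

def unigram_entropy(words, base=2.0):
    """Empirical Shannon entropy of the unigram relative frequencies."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

# Subset A repeats one word; subset B covers the full toy vocabulary.
subset_a = ["to", "to", "to", "to"]
subset_b = ["to", "or", "be", "not"]

h_a = unigram_entropy(subset_a)  # 0.0 bits: a single repeated word carries no uncertainty
h_b = unigram_entropy(subset_b)  # 2.0 bits: maximal for four equiprobable words
assert h_b > h_a
```

Whatever the estimator, subset B's entropy exceeds subset A's, which is the property the selection objective exploits.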
The entropy of subset B is higher than that of subset A, and the (maximum) out-of-vocabulary (OOV) rate of subset B is smaller than that of subset A (for a random test), suggesting more generalized training data that results in more accurate predictions. This observation indicates that increasing the entropy of training data helps build a generalized machine learning model. Moving from the above hypothetical example to a real dataset, does higher entropy also indicate better learning generalization, specifically fewer OOV words and higher prediction accuracy? Figure 1-(b) shows a preliminary NER experiment on the CoNLL-2003 dataset (Sang & Meulder, 2003) with one in-domain and five out-of-domain (OOD) test sets (details in Appendix). We observe that the unigram entropy of the training subset correlates negatively (Pearson correlation coefficient: -0.94) with the OOV rates of the six test sets and strongly positively with the in-domain and out-of-domain test F1 scores (Pearson correlation coefficient: 0.80). This result indicates that a subset with higher entropy is more likely to generalize to a new test domain with a lower OOV rate and a higher F1 score, demonstrating that training set optimization using entropy can effectively enhance prediction accuracy on unseen domains. Knowing that a training set with higher entropy leads to more generalized learning, how can we optimize the subset to maximize its information content without any target-domain assumption? In general, the subset selection optimization problem is computationally intractable, so we use regularized Advantage Actor Critic (A2C) (Mnih et al., 2016) and Soft Actor Critic (SAC) (Haarnoja et al., 2018) to approximate the set optimization. As illustrated in Figure 1-(a), our method equipartitions the training data into mini-batches and simultaneously learns a policy network to select data sequentially and two Q networks to estimate future returns with our entropy rewards.
MERRL has the advantages of low variance, monotonic policy improvement, and sampling efficiency, and it significantly outperforms data selection baselines (Ma et al., 2019; Liu et al., 2019; Aharoni & Goldberg, 2020). Our work contributes the following to ongoing work on learning generalization: 1. maximizing uncertainty measured by entropy for learning generalization without target-domain assumptions; 2. entropy-regularized A2C and SAC reinforcement learning algorithms with entropy rewards for training subset optimization, which is typically computationally intractable; 3. a data selection framework, MERRL, that models training sample dependency and demonstrates significant improvements in NLP accuracy and generalization on various tasks and domains. The rest of the paper is organized as follows. In Section 2, we introduce MERRL in detail. In Section 3, we empirically verify the generalization and accuracy improvements of MERRL. We discuss related work in Section 4 and conclude in the last section.

2. METHOD

Below, we describe our MERRL framework in detail, including problem definitions (Section 2.1), the proposed framework (Section 2.2), the training algorithms (Section 2.3), and the entropy-based reward functions (Section 2.4).

2.1. DEFINITIONS

In training set optimization, we formalize the components of the environment as illustrated in Figure 1-(a), including a training dataset, an NLP model F, and a reward function R. The training set is denoted as X = {x_i}_{i=1}^{n}, where x_i is a sentence (document) and n is the training set size. We shuffle and randomly partition X into T disjoint data batches (Liu et al., 2019), so that X = {B_t}_{t=1}^{T} = {B_1, B_2, ..., B_T}, with B_t = {x_{(t-1)(n|T)+1}, x_{(t-1)(n|T)+2}, ..., x_{t(n|T)}}, where n|T is the integer division of n by T and t ≤ T. If mod(n, T) ≠ 0, then the last batch has a variable size of mod(n, T) and collects the remaining sentences. MERRL selects a subset of data from each mini-batch in sequence. This series of selections can be viewed as a sequential decision-making process and modeled by a Markov decision process (MDP) consisting of four elements: a set of states S, a set of actions A, a transition function P : S × A × S → [0, ∞), and a reward function R : S → R. Given an MDP (S, A, P, R), the goal of a reinforcement learning system, or agent, is to learn an optimal policy function π, a mapping from the set of states S perceived from the environment to the set of actions A, formally π : S → A (Uc-Cetina et al., 2021). In our data selection context, the MDP elements (S, A, P, R) are specified as: the observation space S ⊆ R^{|B_t| × d}, where |B_t| is the size of a batch and d is the sentence (document) embedding dimension; the action space A ⊆ {0, 1}^{|B_t|} of binary selection vectors; the uniform transition function P, which gives the next state; and the entropy-based reward functions R (details in Section 2.4).
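The batching scheme above can be sketched as follows. This is a minimal sketch assuming one reading of the text, namely that the mod(n, T) leftover sentences form one final variable-size batch after the T equal-size batches; the function name `partition` is ours.

```python
def partition(dataset, T):
    """Split a shuffled dataset into T batches B_1..B_T of size n // T (the
    integer division n|T); the mod(n, T) leftover sentences, if any, go into
    one final variable-size batch, as described in Section 2.1."""
    n = len(dataset)
    size = n // T
    batches = [dataset[t * size:(t + 1) * size] for t in range(T)]
    if n % T != 0:
        batches.append(dataset[T * size:])
    return batches

batches = partition(list(range(10)), T=3)
# three batches of size 3, plus one remainder batch of size mod(10, 3) = 1
assert [len(b) for b in batches] == [3, 3, 3, 1]
```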

2.2. MERRL FRAMEWORK

In our reinforcement learning (RL) setting, the policy π interacts with the environment over a number of discrete time steps T and stores the collected experience (s, a, r) in the replay buffer. After some fixed time, the replay buffer samples a tuple and updates the Q networks and the policy network respectively. At each time step t ∈ {1, ..., T}, the policy π receives a batch of sentence embeddings from the environment and selects a subset of data. Then, the environment gives the next state s_{t+1} and a scalar reward r_t to the agent. The reward r_t measures how good the selected data is. The return is the total discounted accumulated reward R_t = Σ_{j=0}^{T-t} γ^j r_{t+j} from time step t to terminal time step T, with discount factor γ ∈ [0, 1]. Our goal is to learn an optimal policy π to maximize the expected return from each state s_t. Each time step contains eight steps, as shown in Figure 1-(a). At step 1, an encoder (e.g., an embedding layer in an LSTM or an encoder in a transformer) inside the NLP model transforms the batch of raw data B_t into a batch of (document) embeddings, denoted as s_t. Next, at steps 2 and 3, the policy outputs action a_t along with the selected data B̂_t. Specifically, the policy takes the state s_t as input and outputs a probability distribution over s_t, so that each sentence is associated with a probability representing how likely it is to be selected. The selected subset B̂_t is then obtained by Bernoulli sampling each sentence in the state s_t. The result of Bernoulli sampling is represented as an action vector a_t, where each value is either 0 or 1, representing each sentence in the batch not being or being selected. At step 4, as soon as we obtain B̂_t, the NLP model F as well as the encoder g are finetuned on the selected subset B̂_t. At step 5, the scalar reward r_t = R(s_t, a_t) is calculated by the designed reward functions R (defined in Section 2.4).
Next, at step 6, the tuple (s_t, a_t, r_t) is stored in the replay buffer. After some fixed number of time steps, at step 7, we sample a previously stored tuple to update the two Q networks. Finally, at step 8, we take the minimum of the outputs of the two Q networks given the sampled (s_{t'}, a_{t'}, r_{t'}) to update the policy network π with regard to the objectives expanded in Section 2.3.
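One MERRL time step (steps 1–6) can be sketched as below. This is a hedged, schematic rendering: `encode`, `policy`, `finetune`, and `reward_fn` are stand-in callables for the real encoder, policy network, NLP-model update, and entropy reward, and the toy arguments at the bottom are purely illustrative.

```python
import random

def merrl_step(batch, encode, policy, finetune, reward_fn, buffer):
    """One MERRL time step (steps 1-6 of Figure 1-(a)), as a minimal sketch."""
    s_t = [encode(x) for x in batch]                        # step 1: embed the batch
    probs = policy(s_t)                                     # step 2: per-sentence selection probabilities
    a_t = [1 if random.random() < p else 0 for p in probs]  # step 3: Bernoulli-sample the action vector
    selected = [x for x, a in zip(batch, a_t) if a == 1]
    finetune(selected)                                      # step 4: finetune the NLP model on the subset
    r_t = reward_fn(s_t, a_t)                               # step 5: entropy-based reward
    buffer.append((s_t, a_t, r_t))                          # step 6: store experience in the replay buffer
    return selected, r_t

random.seed(0)
buffer = []
selected, r = merrl_step(
    batch=["to be", "not to be", "to be or not to be"],
    encode=lambda x: [float(len(x))],       # toy 1-d "embedding"
    policy=lambda s: [0.5] * len(s),        # toy uniform selection policy
    finetune=lambda data: None,             # no-op stand-in for the model update
    reward_fn=lambda s, a: float(sum(a)),   # toy reward: number of sentences selected
    buffer=buffer,
)
assert len(buffer) == 1 and r == float(sum(buffer[0][1]))
```

Steps 7–8 (sampling the buffer and updating the twin Q networks and the policy) are covered by the objectives in Section 2.3.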

2.3. ENTROPY-BASED TRAINING ALGORITHMS

We draw on two algorithms to estimate the policy function π, the on-policy Advantage Actor Critic (A2C) with entropy regularization and off-policy Soft Actor Critic (SAC).

2.3.1. A2C WITH ENTROPY REGULARIZATION

The A2C algorithm maintains a policy function π_θ and a value function V_{θ_v}. It builds on the vanilla policy gradient method, which directly optimizes the policy function by performing gradient ascent on ∇_θ log π(a_t|s_t) R_t, an unbiased estimate of ∇_θ E[R_t]. Intuitively, it increases the log probability of the sampled action, weighted by the return R_t (Uc-Cetina et al., 2021). The value function V_{θ_v} is used as a baseline scaling the policy gradient to reduce the variance of the optimization process, with the objective E_t[(r_t − V(s_t))^2] (Schulman et al., 2015; Mnih et al., 2015). With the baseline, the policy gradient becomes ∇_θ log π(a_t|s_t) A_t, where A_t is estimated by the difference between the empirical return R_t and the value function V(s_t):

A_t = Σ_{j=0}^{T−t−1} γ^j r_{t+j} + γ^{T−t} V(s_T) − V(s_t).

To enhance the robustness of the policy in the face of a high-dimensional action space, we adopt the maximum entropy objective (Ziebart, 2010), which augments the standard reinforcement learning objective with an entropy term H(π(·|s_t)) to encourage exploration of diverse behaviours and stabilize training (Mnih et al., 2016; Schulman et al., 2017). Consequently, the parameters of the policy function θ and the value function θ_v are updated by:

θ_{t+1} = θ_t + α (∇_θ log π_θ(a_t|s_t) A_t + β ∇_θ H(π(s_t; θ)))    (1)

θ_{v,t+1} = θ_{v,t} − α ∇_{θ_v} (r_t − V(s_t))^2    (2)

where α is the learning rate, H is the entropy of the policy π, and β controls the trade-off between exploitation and exploration.
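The advantage estimate and entropy bonus above can be sketched numerically. This is a minimal illustration of the two quantities, not the full A2C update; the function names are ours, and `values[t]` stands in for the critic's V(s_t).

```python
import math

def a2c_targets(rewards, values, v_last, gamma=0.99):
    """n-step advantages A_t = sum_{j=0}^{T-t-1} gamma^j r_{t+j}
    + gamma^{T-t} V(s_T) - V(s_t), matching the estimate above Eq. (1)."""
    T = len(rewards)
    advantages = []
    for t in range(T):
        ret = sum(gamma ** j * rewards[t + j] for j in range(T - t))
        ret += gamma ** (T - t) * v_last          # bootstrap from the terminal value
        advantages.append(ret - values[t])        # subtract the baseline V(s_t)
    return advantages

def policy_entropy(probs):
    """H(pi(.|s_t)) for a categorical policy, used as the exploration bonus."""
    return -sum(p * math.log(p) for p in probs if p > 0)

adv = a2c_targets(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], v_last=0.0)
# A uniform policy carries more entropy bonus than a near-deterministic one.
assert policy_entropy([0.25] * 4) > policy_entropy([0.97, 0.01, 0.01, 0.01])
```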

2.3.2. SAC

Though A2C with the maximum entropy objective improves training stability, it suffers from poor sample efficiency. In contrast, SAC (Haarnoja et al., 2018) uses a replay buffer to reuse past experiences and reduce sample complexity. To this end, SAC maintains a soft Q-function Q_ϕ(s_t, a_t) and a policy function π_θ(a_t|s_t), where ϕ and θ are the parameters of these networks respectively. The soft Q-function parameters can be optimized with the soft Bellman residual objective:

J_Q(ϕ) = E_{(s_t, a_t) ∼ D} [ (1/2) (Q_ϕ(s_t, a_t) − (r(s_t, a_t) + γ E_{s_{t+1} ∼ p}[V_φ̄(s_{t+1})]))^2 ]    (3)

where the parameters φ̄ are obtained as an exponentially moving average of ϕ, and the soft state value function V is defined as follows, following SAC for discrete action settings (Christodoulou, 2019):

V(s_t) = π(s_t)^T [Q(s_t) − β log π(s_t)]    (4)

The policy parameters are updated towards the exponential of the new Q-function with a KL-divergence objective, which can be transformed into the following form for discrete action settings:

J_π(θ) = E_{s_t ∼ D} [ π_θ(s_t)^T [β log π_θ(s_t) − Q_ϕ(s_t)] ]    (5)

In practice, SAC maintains two soft Q-functions Q_{ϕ1} and Q_{ϕ2} and substitutes the soft Q-function in equation 3 and equation 5 with min(Q_{ϕ1}, Q_{ϕ2}) to mitigate bias (Fujimoto et al., 2018).
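The discrete-action soft value in Eq. (4) and the twin-Q minimum can be sketched as below. This is a minimal numeric illustration under toy values; the function names are ours, and the real networks produce the `pi` and `q` vectors.

```python
import math

def soft_value(pi, q, beta):
    """Discrete-action soft value V(s) = pi(s)^T [Q(s) - beta * log pi(s)],
    as in Eq. (4); pi and q are per-action vectors for one state."""
    return sum(p * (qa - beta * math.log(p)) for p, qa in zip(pi, q) if p > 0)

def clipped_q(q1, q2):
    """Elementwise min of the two soft Q-heads (Fujimoto et al., 2018)."""
    return [min(a, b) for a, b in zip(q1, q2)]

pi = [0.5, 0.5]
q = clipped_q([1.0, 2.0], [1.5, 1.8])   # -> [1.0, 1.8]
v = soft_value(pi, q, beta=0.1)
# V = 0.5*(1.0 + 0.1*ln 2) + 0.5*(1.8 + 0.1*ln 2) = 1.4 + 0.1*ln 2
assert abs(v - (1.4 + 0.1 * math.log(2))) < 1e-9
```

Note how the −β log π term rewards keeping the policy stochastic: a more uniform π yields a larger soft value for the same Q estimates.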

2.4. ENTROPY-BASED REWARD FUNCTIONS

We introduce two classes of reward functions from the angle of syntactic heuristics of training data (2.4.1) and the theory of information bottleneck (2.4.2).

2.4.1. OBSERVATION ENTROPY

Although there is no consensus on what the best in-domain data for generalization are, experiments (Adila & Kang, 2022) find that models latch onto syntactic heuristics, such as word overlap between in-domain and out-of-distribution sentences, to make predictions. Ruder & Plank (2017) demonstrate that extracting word entropy as a heuristic feature to select training data favors domain adaptation in NLP. Based on these findings, we follow classic count-based methods (Song et al., 2012; Ruder & Plank, 2017; Parcheta et al., 2018; Tevet & Berant, 2020), i.e., N-grams, as an indicator of how good the selected data is. Specifically, we apply Shannon entropy (Shannon, 1948), Rényi entropy (Rényi, 1961), and Min entropy (Smith, 2011) as reward functions in our reinforcement learning framework. All entropy measures are computed on word n-gram relative frequencies over all sentences in the dataset. For a set G with M sentences, each sentence x_i containing J_i words, we define the empirical set entropy as the sum of n-gram entropies:

H(G) = Σ_{i=1}^{M} h(x_i; n),    h(x_i; n) = (1 / (1 − α)) log Σ_j p(x_{i,j}^{j+n−1})^α,

where p(x_{i,j}^{j+n−1}) is the relative frequency of the n-gram from word j to word j + n − 1 of sentence x_i, and α is the order parameter of Rényi entropy. In particular, as α approaches 1, Rényi entropy reduces to Shannon entropy; as α approaches infinity, Rényi entropy converges to Min entropy. We also use an interpolated set entropy that combines n-gram entropies of different orders:

H'(G) = Σ_{n=1}^{k} λ_n Σ_{i=1}^{M} h(x_i; n).

We use the 2nd-order interpolated set entropy as our default setting in the following sections.
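The Rényi entropy family and its two limits can be sketched as below. This is a minimal illustration over an explicit probability vector (the n-gram relative frequencies would be such a vector); the α → 1 case is handled by falling back to the Shannon formula, a standard convention rather than anything specific to this paper.

```python
import math

def renyi_entropy(probs, alpha):
    """H_alpha(p) = (1 / (1 - alpha)) * log(sum_i p_i^alpha).
    alpha -> 1 recovers Shannon entropy -sum p log p;
    alpha -> infinity approaches Min entropy -log(max_i p_i)."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

p = [0.5, 0.25, 0.25]
shannon = renyi_entropy(p, 1.0)
assert abs(renyi_entropy(p, 1.0001) - shannon) < 1e-3        # alpha -> 1 limit
assert abs(renyi_entropy(p, 200.0) - (-math.log(0.5))) < 1e-2  # ~ Min entropy
```

For a uniform distribution all orders coincide, which is a quick sanity check on any implementation.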

2.4.2. PREDICTION ENTROPY

From an information-theoretic perspective, the Information Bottleneck (IB) principle indicates that the mutual information between the input of a neural network and its latent representation needs to be well compressed to generalize well on out-of-domain data (Tishby et al., 2000; Tishby & Zaslavsky, 2015). Specifically, IB seeks a latent representation Z such that the mutual information between input X and Z, denoted I(X; Z), is minimized, while the mutual information between Z and output Y, denoted I(Y; Z), is maximized. Formally, IB is implemented by minimizing the following Lagrangian:

minimize { I(X; Z) − λ I(Y; Z) }

Intuitively, the smaller the mutual information I(X; Z), the better Z compresses X, the less likely Z is to learn spurious correlations with X, and the more robust the representation Z is. However, since Z is high dimensional, the exact computation of the mutual information I(X; Z) is intractable. Instead, because the prediction Ŷ is a deterministic function of the input given the model, I(X; Z) can be bounded from below by the prediction entropy:

I(X; Z) ≥ I(X; Ŷ) = H(Ŷ) − H(Ŷ | X) = H(Ŷ),

H(Ŷ) ≈ −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{|Y|} p_j(x_i; θ) log p_j(x_i; θ),

where p_j(x_i; θ) is the predicted probability of label Y_j for sample x_i given the model θ, and |Y| is the number of labels. Adopting this observation in our context, we minimize I(X; Z) by using −H(Ŷ) as the reward to select training data within a mini-batch that can learn the optimal latent representation for out-of-distribution generalization.
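The prediction-entropy estimate and its use as a reward can be sketched as below. This is a minimal sketch with hypothetical probability rows; in the framework, `prob_rows` would be the model's softmax outputs over the selected mini-batch.

```python
import math

def prediction_entropy(prob_rows):
    """H(Y_hat) ~= -(1/n) * sum_i sum_j p_j(x_i) log p_j(x_i): the average
    entropy of the model's predicted label distributions over the subset."""
    n = len(prob_rows)
    return -sum(p * math.log(p) for row in prob_rows for p in row if p > 0) / n

def prediction_entropy_reward(prob_rows):
    """Reward -H(Y_hat): a higher reward corresponds to a smaller value of
    the I(X; Z) lower bound derived above."""
    return -prediction_entropy(prob_rows)

confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]
uncertain = [[1 / 3] * 3, [1 / 3] * 3]
# Confident predictions earn a higher (less negative) reward.
assert prediction_entropy_reward(confident) > prediction_entropy_reward(uncertain)
```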

3. EXPERIMENTS

We describe our experimental details and demonstrate that MERRL improves over baselines in three NLP applications across various out-of-distribution domains: two classification tasks, sentiment analysis and named entity recognition, and one generation task, language modeling, all without any out-of-domain knowledge. For each task, we experiment with two reinforcement learning algorithms to train the data selector, as well as three reward functions; e.g., A2C-OE denotes A2C with entropy regularization rewarded by Observation Entropy. We list the hyperparameters used in MERRL in Appendix A.2.

3.1. NLP EXPERIMENTS

Baselines We compare our methods with six baselines: 1) ALL: models trained on all in-domain training data; 2) RAND: models trained on a randomly selected 50% of the in-domain data; 3) MTL: marginal transfer learning by Blanchard et al. (2021), a domain generalization framework using kernel methods to augment the feature space; 4) PLM (Ma et al., 2019), which uses the large pretrained language model BERT (Devlin et al., 2018b) to learn a domain classifier and selects data according to the probability given by the domain classifier; 5) COS (Aharoni & Goldberg, 2020), which uses cosine distance to measure the distance between an in-domain sentence and the centroid of a target (out-of-distribution) domain, and selects sentences close to the target domain; 6) VPG (Liu et al., 2019), which uses the vanilla policy gradient method to choose data from a target distribution that resembles the in-domain distribution. Note that PLM, COS, and VPG are all data selection methods requiring out-of-domain data, while ALL, RAND, MTL, and all our methods do not use any out-of-domain knowledge. For the training data size, ALL and MTL use all in-domain training data; VPG and our methods choose roughly 50% of the in-domain training data (complete data statistics in Appendix A.4), and we control PLM and COS, which both require a pre-defined selected data size, to select 50% of the in-domain data. Sentiment Analysis We use the Amazon product review dataset (Blitzer et al., 2007) for the sentiment analysis task. Specifically, we use the processed labeled domain data (books, dvd, and kitchen) to train our task model and the 21 unprocessed domains as test data. We use a CNN classifier (Kim, 2014) as the sentiment analysis model and pre-train it for two epochs following Liu et al. (2019) for a fair comparison. Named Entity Recognition We use the CoNLL-2003 dataset as the in-domain training set and the five domains from the CrossNER dataset (Liu et al., 2020) as test sets, each of which has specialized entity categories.
We finetune the pretrained BERT model (Devlin et al., 2018a) on the source training set by adding a linear layer on top of the hidden-states output of the last layer, and report F1-scores on the five test sets on the left of Table 2. SAC outperforms A2C across all domains, and SAC-PE improves the test score on the music domain by up to 14.3% compared to MTL. Language Modeling We experiment with two moderate-size datasets, WikiText-2 (Merity et al., 2016) and Penn Treebank. Our baseline is a Transformer language model (Vaswani et al., 2017) trained from scratch with default hyper-parameters. The RL loop in Figure 1-(a) initializes the language model from the checkpoint of the pre-trained transformer model. For evaluation, we report perplexity scores on datasets from different domains: the English side of IWSLT'17 (TED talks) and the English side of WMT Biomedical'21. The baseline transformer model and all language models trained on selected data are updated with the fairseq toolkit (Ott et al., 2019) and trained until the in-domain validation perplexity does not improve for 5 epochs. The evaluation results are shown on the right of Table 2. Perplexity on the two test domains is largely improved, with at most a 43.4% relative improvement (a decrease from 229.53 with VPG to 129.78 with SAC-PE) in the biomedical domain with WikiText-2 as in-domain data.

3.2. ANALYSIS

SAC VS. A2C We plot the learning curves of the three reinforcement learning algorithms on the left of Figure 2. The average reward of SAC is significantly higher than that of A2C with entropy regularization (shortened as A2C) and VPG. SAC and A2C both converge at around 10,000 timesteps, while VPG converges at around 20,000 timesteps. Comparing A2C and VPG, A2C clearly has a smaller variance than VPG. In short, SAC is the most effective of the three algorithms, and A2C reduces variance compared to VPG. In particular, with a limited training time budget (e.g., 5,000 timesteps), SAC leads to the best performance in training set optimization, which matches our empirical results. Batch size Unlike previous applications of reinforcement learning in NLP (Yoon et al., 2020; Fang et al., 2017; Wu et al., 2018), which reward a single sample/sentence, our reward function measures the informativeness of a whole set of data. In this case, the observation (state) space is no longer a vector but a batch of vectors. Thus, the batch size |B_t| is a newly introduced hyperparameter in our subset optimization problem that affects both the action space and the state space. While previous work uses larger batch sizes (|B_t| ≥ 2000) to improve the stability of reinforcement learning training (Yoon et al., 2020; McCandlish et al., 2018), we find that training set optimization benefits from smaller batch sizes when the total training step budget T is fixed, as shown on the right of Figure 2. The reason relates to our designed state space, which is not a single vector but a batch of vectors, so a larger batch size directly enlarges the action space to 2^{|B_t|} and makes training harder.
Visualization We plot t-SNE 2D visualizations of the data selected from the training source domains (books, DVD, and kitchen) by VPG (Liu et al., 2019) (blue) and by SAC-OE (red), as well as a surprise (unknown) test domain (magazines, green dots). We embed each sentence using the sentence-transformers tool (Reimers & Gurevych, 2019). In Figure 3, the middle plot shows the coverage of the data selected by SAC-OE (3361 sentences, of which 53.3% do not overlap with VPG's selection). While similar in dataset size, the blue dots are more densely packed, especially in the several dense clusters formed by blue points in the bottom part of the left plot. In contrast, the red dots cover more of the test domain area than the blue dots, especially in the yellow highlighted areas. To gain more intuition, we draw the convex hulls (Barber et al., 2013) of the red dots and the blue dots respectively, shown on the right. The red hull encloses the blue hull after removing the outliers from both sets. Furthermore, we compute the out-of-vocabulary rates of all test domains in the Amazon product review dataset and the in-domain vocabulary sizes of the VPG- and SAC-OE-selected sets.



Figure 1: (a): Maximum-Entropy Rewarded Reinforcement Learning framework. (b): Higher training set entropy, better learning generalization, w.r.t. F1 score and OOV.

Figure 1-(b) shows our named entity recognition (NER) task results on the CoNLL-2003 dataset (Sang & Meulder, 2003).

For a set G with M training examples, we define the k-th order interpolated set entropy as a linear combination of the n-gram entropies from n = 1 to n = k, weighted by λ_n with Σ_{n=1}^{k} λ_n = 1. For example, if k = 3, it combines the unigram, bigram, and trigram set entropies with weights λ_1, λ_2, and λ_3, respectively.

Figure 2: Left: Learning curves of the three reinforcement learning algorithms across three random seeds on the NER task. Right: Smaller batch size |B_t| results in better test perplexity on two test sets.

Input: Dictionaries d_uni, d_bi, d_tri that store the unigram, bigram, and trigram entropy of every sample in the source training set; a batch of training samples G = {s_i}_{i=1}^{M} of size M; ratios α, β, γ ∈ [0, 1).
Output: Reward value of the set N-gram entropy H(G)
1: Initialize H(G) = 0
2: Initialize the unigram, bigram, and trigram set entropies: h_1(G) = 0, h_2(G) = 0, h_3(G) = 0
3: for all s ∈ G do
4:   Obtain the sentence entropies d_uni[s], d_bi[s], d_tri[s] for s
5:   Update the set entropies: h_1(G) = h_1(G) + d_uni[s]; h_2(G) = h_2(G) + d_bi[s]; h_3(G) = h_3(G) + d_tri[s]
6: end for
7: H(G) = α h_1(G) + β h_2(G) + γ h_3(G)
8: return H(G)
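The reward procedure above can be rendered directly in Python. This is a minimal sketch: the per-sentence entropy dictionaries here hold made-up values purely for illustration (in the framework they are precomputed once over the source training set), and the weight arguments `a`, `b`, `c` correspond to the algorithm's ratios α, β, γ.

```python
def ngram_entropy_reward(G, d_uni, d_bi, d_tri, a, b, c):
    """Set N-gram entropy reward: sum the precomputed per-sentence unigram,
    bigram, and trigram entropies over the batch G, then interpolate."""
    h1 = sum(d_uni[s] for s in G)   # unigram set entropy
    h2 = sum(d_bi[s] for s in G)    # bigram set entropy
    h3 = sum(d_tri[s] for s in G)   # trigram set entropy
    return a * h1 + b * h2 + c * h3

# Hypothetical precomputed per-sentence entropies, for illustration only.
d_uni = {"to be": 0.5, "not to be": 0.8}
d_bi = {"to be": 0.3, "not to be": 0.6}
d_tri = {"to be": 0.0, "not to be": 0.2}

r = ngram_entropy_reward(["to be", "not to be"], d_uni, d_bi, d_tri, 0.5, 0.3, 0.2)
assert abs(r - (0.5 * 1.3 + 0.3 * 0.9 + 0.2 * 0.2)) < 1e-9
```

Precomputing the dictionaries keeps each reward call O(|G|) dictionary lookups, which matters since the reward is evaluated at every RL time step.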

Table 1: Sentiment analysis accuracy [%] on Amazon unprocessed domains. Baselines PLM (Ma et al., 2019), COS (Aharoni & Goldberg, 2020), and VPG (Liu et al., 2019) use the test/target domain data of each column, while our methods outperform all of them without using any target-domain knowledge. Last row: absolute improvement between SAC-PE and the best domain generalization method, MTL (Blanchard et al., 2021).

Table 2: Left: NER F1-scores. Right: Language modeling perplexity scores on two test domains. First row: source training domain; second row: test domains. Results are averaged over three runs.

OOV of VPG-selected data (Liu et al., 2019) and SAC-selected data on the test domains of the Amazon product review dataset. Last column: training vocabulary of the selected set.

Table 1 shows the sentiment analysis results, averaged over five random seeds. Our methods outperform all baselines on all unprocessed Amazon domains. It is worth noting that even with test-domain knowledge, the baselines PLM and COS fail to select the "right" data for specific domains.

SAC-OE has significantly lower OOV across all test domains and a larger in-domain vocabulary than VPG. In summary, we infer that SAC-OE-selected data has superior generalization ability compared to VPG, since it selects a training set with a more diverse vocabulary and wider coverage of the semantic space.

4. RELATED WORK

There has been a number of influential works on data selection (Moore & Lewis, 2010; Axelrod et al., 2011; Ruder & Plank, 2017) that significantly contributed to today's NLP state of the art. More recently, Fan et al. (2017), Feng et al. (2018), Qu et al. (2019), Fang et al. (2017), and Liu et al. (2019) incorporate reinforcement learning into data selection. Another direction examines the potential of large pretrained language models to select data (Yuan et al., 2020; Aharoni & Goldberg, 2020; Ma et al., 2019). These works mainly select training data close to a given target domain for domain adaptation. In contrast, we aim to enhance model generalization and increase accuracy on any arbitrary domain. Furthermore, we advance existing data selection techniques using A2C and SAC, which simultaneously optimize the value (Q) network and the policy network for better convergence and lower variance, resulting in higher prediction accuracy and generality. Adapting such methods to NLP tasks requires further consideration of how to generalize over text data; our method emphasizes both the characteristics of text data and general prediction entropy, which can be directly generalized to other fields. Another relevant and emergent line of work is data pruning, which aims to select a minimal subset of training data to reduce training costs (Sorscher et al., 2022; Yang et al., 2022) or to enhance model robustness (Kaufmann et al., 2022).

Notation table

A.5 DATA STATISTICS

See Table 6 for the MERRL-selected data statistics.

A.6 OOV OF MERRL-SELECTED DATA

We show the full OOV results for the selected data in Table 7.

Out-of-vocabulary of VPG-selected data (Liu et al., 2019) and SAC-selected data on the test domains of the Amazon product review dataset.

ACKNOWLEDGMENTS

We thank the Amazon Alexa Prize, National Science Foundation (NSF) Award No. 1747728, and NSF CRAFT Award No. 22001 for funding this research.

