DIRICHLET-BASED UNCERTAINTY CALIBRATION FOR ACTIVE DOMAIN ADAPTATION

Abstract

Active domain adaptation (DA) aims to maximally boost model adaptation on a new target domain by actively selecting limited target data to annotate, whereas traditional active learning methods may be less effective since they do not consider the domain shift issue. Although active DA methods address this by further introducing targetness to measure the representativeness of target domain characteristics, their predictive uncertainty is usually computed from the prediction of deterministic models, which can easily be miscalibrated on data with distribution shift. Considering this, we propose a Dirichlet-based Uncertainty Calibration (DUC) approach for active DA, which simultaneously achieves the mitigation of miscalibration and the selection of informative target samples. Specifically, we place a Dirichlet prior on the prediction and interpret the prediction as a distribution on the probability simplex, rather than a point estimate like deterministic models. This interpretation enables us to consider all possible predictions, mitigating the miscalibration of a unilateral prediction. Then a two-round selection strategy based on different uncertainty origins is designed to select target samples that are both representative of the target domain and conducive to discriminability. Extensive experiments on cross-domain image classification and semantic segmentation validate the superiority of DUC.

1. INTRODUCTION

Despite the superb performances of deep neural networks (DNNs) on various tasks (Krizhevsky et al., 2012; Chen et al., 2015), their training typically requires massive annotations, which poses a formidable cost for practical applications. Moreover, they commonly assume that training and testing data follow the same distribution, making the model brittle to distribution shifts (Ben-David et al., 2010). Alternatively, unsupervised domain adaptation (UDA) has been widely studied, which assists model learning on an unlabeled target domain by transferring knowledge from a labeled source domain (Ganin & Lempitsky, 2015; Long et al., 2018). Despite the great advances of UDA, the unavailability of target labels greatly limits its performance, leaving a large gap to its supervised counterpart. Actually, given an acceptable budget, a small set of target data can be annotated to significantly boost the performance of UDA. With this consideration, recent works (Fu et al., 2021; Prabhu et al., 2021) integrate the idea of active learning (AL) into DA, resulting in active DA. The core of active DA is to annotate the most valuable target samples to maximally benefit the adaptation. However, traditional AL methods based on either predictive uncertainty or diversity are less effective for active DA, since they do not consider the domain shift. Predictive uncertainty (e.g., margin (Joshi et al., 2009), entropy (Wang & Shang, 2014)) based methods cannot measure the target-representativeness of samples; as a result, the selected samples are often redundant and less informative. As for diversity based methods (Sener & Savarese, 2018; Nguyen & Smeulders, 2004), they may select samples that are already well-aligned with the source domain (Prabhu et al., 2021). Aware of these issues, active DA methods integrate both predictive uncertainty and targetness into the selection process (Su et al., 2019; Fu et al., 2021; Prabhu et al., 2021).
Yet, the existing focus is on the measurement of targetness, e.g., using a domain discriminator (Su et al., 2019) or clustering (Prabhu et al., 2021). The predictive uncertainty they use is still mainly based on the prediction of deterministic models, which is essentially a point estimate (Sensoy et al., 2018) and can easily be miscalibrated on data with distribution shift (Guo et al., 2017). As in Fig. 1 (a), a standard DNN is wrongly overconfident on most target data; correspondingly, its predictive uncertainty is unreliable. In the illustration of Fig. 1, the model is trained with images of "keyboard", "computer" and "monitor" from the Clipart domain of the Office-Home dataset; for the two images from the Real-World domain, the entropy of the expected prediction cannot distinguish them, whereas U_dis and U_data, calculated from the prediction distribution, can reflect what contributes more to their uncertainty and be utilized to guarantee the information diversity of the selected data. To solve this, we propose a Dirichlet-based Uncertainty Calibration (DUC) method for active DA, which is mainly built on Dirichlet-based evidential deep learning (EDL) (Sensoy et al., 2018). In EDL, a Dirichlet prior is placed on the class probabilities, by which the prediction is interpreted as a distribution on the probability simplex. That is, the prediction is no longer a point estimate, and each possible prediction occurs with a certain probability. The resulting benefit is that the miscalibration of a unilateral prediction can be mitigated by considering all possible predictions. For illustration, we plot the expected entropy of all possible predictions using the Dirichlet-based model in Fig. 1 (a), and we see that most target data with domain shift are calibrated to have greater uncertainty, which avoids the omission of potentially valuable target samples in deterministic model based methods.
Besides, based on Subjective Logic (Jøsang, 2016), the Dirichlet-based evidential model intrinsically captures different origins of uncertainty: the lack of evidences and the conflict of evidences. This property further motivates us to consider different uncertainty origins during sample selection, so as to comprehensively measure the value of samples from different aspects. Specifically, we introduce the distribution uncertainty to express the lack of evidences, which mainly arises from the distribution mismatch, i.e., the model is unfamiliar with the data and lacks knowledge about it. In addition, the conflict of evidences is expressed as the data uncertainty, which comes from the natural data complexity, e.g., low discriminability. The two uncertainties are respectively captured by the spread and the location of the Dirichlet distribution on the probability simplex. As in Fig. 1 (b), the real-world style of the first target image obviously differs from the source domain, and the image presents a broader spread on the probability simplex, i.e., higher distribution uncertainty. This uncertainty enables us to measure targetness without introducing a domain discriminator or clustering, greatly saving computation costs. The second target image, by contrast, provides different information mainly from the aspect of discriminability, with its Dirichlet distribution concentrated around the center of the simplex. Based on the two different origins of uncertainty, we design a two-round selection strategy to select both target-representative and discriminability-conducive samples for label query. Contributions: 1) We explore the uncertainty miscalibration problem that is ignored by existing active DA methods, and achieve informative sample selection and uncertainty calibration simultaneously within a unified framework.
2) We provide a novel perspective for active DA by introducing the Dirichlet-based evidential model, and design an uncertainty origin-aware selection strategy to comprehensively evaluate the value of samples. Notably, no domain discriminator or clustering is used, which is more elegant and saves computation costs. 3) Extensive experiments on both cross-domain image classification and semantic segmentation validate the superiority of our method.

2. RELATED WORK

Active Learning (AL) aims to reduce the labeling cost by querying the most informative samples to annotate (Ren et al., 2022), and the core of AL is the query strategy for sample selection. Committee-based strategies select samples with the largest prediction disagreement between multiple classifiers (Seung et al., 1992; Dagan & Engelson, 1995). Representative-based strategies choose a set of representative samples in the latent space by clustering or core-set selection (Nguyen & Smeulders, 2004; Sener & Savarese, 2018). Uncertainty-based strategies pick samples based on the prediction confidence (Lewis & Catlett, 1994), entropy (Wang & Shang, 2014; Huang et al., 2018), etc., to annotate the samples that the model is most uncertain about. Although these query strategies have shown promising performances, traditional AL usually assumes that the labeled and unlabeled data follow the same distribution, and hence may not deal well with the domain shift in active DA. Active Learning for Domain Adaptation intends to maximally boost the model adaptation from the source to the target domain by selecting the most valuable target data to annotate, given a limited labeling budget. Aware of the limitations of traditional AL, researchers incorporate AL with an additional criterion of targetness (i.e., the representativeness of the target domain). For instance, besides predictive uncertainty, AADA (Su et al., 2019) and TQS (Fu et al., 2021) additionally use the score of a domain discriminator to represent targetness. Yet, the learning of the domain discriminator is not directly linked with the classifier, so the selected samples are not necessarily beneficial for classification. Another line models targetness based on clustering, e.g., CLUE (Prabhu et al., 2021) and DBAL (Deheeger et al., 2021). Differently, EADA (Xie et al., 2021) represents targetness as a free energy bias and explicitly reduces the free energy bias across domains to mitigate the domain shift.
Despite these advances, the focus of existing active DA methods is on the measurement of targetness. Their predictive uncertainty is still based on the point estimate of the prediction, which can easily be miscalibrated on target data. Deep Learning Uncertainty measures the trustworthiness of decisions from DNNs. One line of research concentrates on better estimating the predictive uncertainty of deterministic models via ensembling (Lakshminarayanan et al., 2017) or calibration (Guo et al., 2017). Another line explores combining deep learning with Bayesian probability theory (Denker & LeCun, 1990; Goan & Fookes, 2020). Despite the potential benefits, Bayesian neural networks (BNNs) are limited by the intractable posterior inference and the expensive sampling for uncertainty estimation (Amini et al., 2020). Recently, evidential deep learning (EDL) (Sensoy et al., 2018) was proposed to reason about uncertainty based on belief or evidence theory (Dempster, 2008; Jøsang, 2016), where the categorical prediction is interpreted as a distribution by placing a Dirichlet prior on the class probabilities. Compared with BNNs, which need multiple samplings to estimate the uncertainty, EDL requires only a single forward pass, greatly saving computational costs. Attracted by this benefit, TNT (Chen et al., 2022) leverages it for detecting novel classes, GKDE (Zhao et al., 2020) integrates it into graph neural networks for detecting out-of-distribution nodes, and TCL (Li et al., 2022) also builds on it.

3. METHOD

3.1. PROBLEM FORMULATION

Following (Fu et al., 2021), we assume that the source and target domains share the same label space Y = {1, 2, ..., C} but follow different data distributions. Meanwhile, we denote the labeled target set as T_l, which is initially an empty set ∅. When training reaches the active selection step, b unlabeled target samples are selected to query their labels from the oracle and added into T_l. Then we have T = T_l ∪ T_u, where T_u is the remaining unlabeled target set.
This active selection step repeats several times until the total labeling budget B is reached. To gain maximal benefit from the limited labeling budget, the main challenge of active DA is how to select the most valuable target samples to annotate under the domain shift, which has been studied by several active DA methods (Su et al., 2019; Fu et al., 2021; Prabhu et al., 2021; Deheeger et al., 2021; Xie et al., 2021). Though they have specially considered targetness to represent target domain characteristics, their predictive uncertainty is still mainly based on the prediction of deterministic models, which can easily be miscalibrated under the domain shift, as found in (Lakshminarayanan et al., 2017; Guo et al., 2017). Instead, we tackle active DA via the Dirichlet-based evidential model, which treats the categorical prediction as a distribution rather than a point estimate like previous methods.

3.2. PRELIMINARY OF DIRICHLET-BASED EVIDENTIAL MODEL

Let us start with the general C-class classification. X denotes the input space and the deep model f parameterized with θ maps an instance x ∈ X into a C-dimensional vector, i.e., f : X → R^C. For a standard DNN, the softmax operator is usually adopted on top of f to convert the logit vector into the predicted class probability vector ρ, while this manner essentially gives a point estimate of ρ and can easily be miscalibrated on data with distribution shift (Guo et al., 2017). To overcome this, the Dirichlet-based evidential model is proposed by Sensoy et al. (2018), which treats the prediction of the class probability vector ρ as the generation of subjective opinions, where each subjective opinion appears with a certain degree of uncertainty. In other words, unlike traditional DNNs, the evidential model treats ρ as a random variable. Specifically, a Dirichlet distribution, the conjugate prior of the multinomial distribution, is placed over ρ to represent the probability density of each possible ρ. Given sample x_i, the probability density function of ρ is denoted as

$$p(\rho|x_i,\theta)=\mathrm{Dir}(\rho|\alpha_i)=\begin{cases}\dfrac{\Gamma\big(\sum_{c=1}^{C}\alpha_{ic}\big)}{\prod_{c=1}^{C}\Gamma(\alpha_{ic})}\prod_{c=1}^{C}\rho_c^{\alpha_{ic}-1}, & \text{if }\rho\in\triangle^{C},\\ 0, & \text{otherwise},\end{cases}\qquad \alpha_{ic}>0,\tag{1}$$

where α_i denotes the parameters of the Dirichlet distribution for sample x_i, Γ(·) is the Gamma function and △^C is the C-dimensional unit simplex: △^C = {ρ | Σ_{c=1}^{C} ρ_c = 1 and ∀ρ_c, 0 ≤ ρ_c ≤ 1}. The parameters can be expressed as α_i = g(f(x_i, θ)), where g(·) is a function (e.g., the exponential function) that keeps α_i positive. In this way, the prediction for each sample is interpreted as a distribution over the probability simplex, rather than a point on it, and we can mitigate the uncertainty miscalibration by considering all possible predictions rather than a unilateral prediction.
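As a minimal numerical illustration of Eq. (1), the sketch below builds a Dirichlet opinion from hypothetical logits, assuming g(·) = exp; the logit values are made up for the example.

```python
import numpy as np

# Hypothetical 4-class logits f(x, theta) for one sample; exp(.) plays the
# role of g(.) in Eq. (1), keeping every alpha_ic strictly positive.
logits = np.array([2.0, -1.0, 0.5, 0.0])
alpha = np.exp(logits)            # Dirichlet parameters alpha_i = g(f(x_i, theta))
assert np.all(alpha > 0)

# The prediction is now a distribution over the simplex: sampling from
# Dir(alpha) yields many plausible class-probability vectors rho, each a
# point on the C-dimensional unit simplex.
rng = np.random.default_rng(0)
rho_samples = rng.dirichlet(alpha, size=5)
assert np.allclose(rho_samples.sum(axis=1), 1.0)   # every sample lies on the simplex
```

Each row of `rho_samples` is one possible prediction ρ, weighted by the density in Eq. (1).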
Further, based on the theory of Subjective Logic (Jøsang, 2016) and DST (Dempster, 2008), the parameters α_i of the Dirichlet distribution are closely linked with the evidences collected to support the subjective opinion for sample x_i, via the equation e_i = α_i − 1, where e_i is the evidence vector. The uncertainty of each subjective opinion ρ also relates to the collected evidences: both the lack of evidences and the conflict of evidences can result in uncertainty. Given the relation between α_i and evidences, the two origins of uncertainty are naturally reflected by different characteristics of the Dirichlet distribution: its spread and its location over the simplex, respectively. As shown in Fig. 1 (b), opinions with a lower amount of evidences have a broader spread on the simplex, while opinions with conflicting evidences locate close to the center of the simplex and present low discriminability. Connection with softmax-based DNNs. Considering sample x_i, the predicted probability for class c can be denoted as Eq. (2), by marginalizing over ρ. The derivation is in Sec. E.1 of the appendix.

$$P(y=c|x_i,\theta)=\int p(y=c|\rho)\,p(\rho|x_i,\theta)\,d\rho=\frac{\alpha_{ic}}{\sum_{k=1}^{C}\alpha_{ik}}=\frac{g(f_c(x_i,\theta))}{\sum_{k=1}^{C}g(f_k(x_i,\theta))}=\mathbb{E}[\mathrm{Dir}(\rho_c|\alpha_i)].\tag{2}$$

Specially, if g(·) adopts the exponential function, then softmax-based DNNs can be viewed as predicting the expectation of the Dirichlet distribution. However, the marginalization process conflates uncertainties from different origins, making it hard to ensure the information diversity of selected samples, because we do not know what information a sample can bring.
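The equivalence in Eq. (2) for g = exp can be checked numerically; the logits below are made-up toy values.

```python
import numpy as np

logits = np.array([1.5, 0.2, -0.7])          # hypothetical f(x_i, theta)
alpha = np.exp(logits)                        # g = exp, as in Eq. (2)

expected_pred = alpha / alpha.sum()           # E[Dir(rho_c | alpha_i)]
softmax_pred = np.exp(logits) / np.exp(logits).sum()

# With g = exp, the marginalized (expected) prediction coincides with softmax:
# the softmax output is exactly the mean of the Dirichlet distribution.
assert np.allclose(expected_pred, softmax_pred)
```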

3.3. SELECTION STRATEGY WITH AWARENESS OF UNCERTAINTY ORIGINS

In active DA, to gain the utmost benefit from a limited labeling budget, the selected samples should ideally be 1) representative of the target distribution and 2) conducive to discriminability. For the former, existing active DA methods use either the score of a domain discriminator (Su et al., 2019; Fu et al., 2021) or the distance to cluster centers (Prabhu et al., 2021; Deheeger et al., 2021). For the latter, the predictive uncertainty (e.g., margin (Xie et al., 2021), entropy (Prabhu et al., 2021)) of standard DNNs is utilized to express the discriminability of target samples. Differently, we capture the two characteristics in a unified framework, without introducing a domain discriminator or clustering. For the evidential model supervised with source data, if target samples are obviously distinct from the source domain, e.g., realistic vs. clipart style, the evidences collected for these target samples may be insufficient, because the model lacks knowledge about this kind of data. Built on this, we use the uncertainty resulting from the lack of evidences, called distribution uncertainty, to measure targetness. Specifically, the distribution uncertainty U_dis of sample x_j is defined as

$$U_{dis}(x_j,\theta)\triangleq I[y,\rho|x_j,\theta]=\sum_{c=1}^{C}\bar{\rho}_{jc}\Big(\psi(\alpha_{jc}+1)-\psi\big(\textstyle\sum_{k=1}^{C}\alpha_{jk}+1\big)\Big)-\sum_{c=1}^{C}\bar{\rho}_{jc}\log\bar{\rho}_{jc},\tag{3}$$

where ψ(·) is the digamma function and ρ̄_jc = E[Dir(ρ_c|α_j)]. Here, we use the mutual information to measure the spread of the Dirichlet distribution on the simplex, like Malinin & Gales (2018). A higher U_dis indicates a larger variance of opinions due to the lack of evidences, i.e., the Dirichlet distribution is broadly spread on the probability simplex. For discriminability, we also utilize the predictive entropy to quantify it. But different from previous methods, which are based on the point estimate (i.e., the expectation of the Dirichlet distribution), we define it as the expected entropy of all possible predictions.
Specifically, given sample x_j and model parameters θ, the data uncertainty U_data is expressed as

$$U_{data}(x_j,\theta)\triangleq \mathbb{E}_{p(\rho|x_j,\theta)}\big[H[P(y|\rho)]\big]=\sum_{c=1}^{C}\bar{\rho}_{jc}\Big(\psi\big(\textstyle\sum_{k=1}^{C}\alpha_{jk}+1\big)-\psi(\alpha_{jc}+1)\Big).\tag{4}$$

Here, we do not adopt H[E[Dir(ρ|α_j)]], i.e., the entropy of the point estimate, to denote data uncertainty, in that the expectation operation conflates uncertainties from different origins, as shown in Eq. (2). Having the distribution uncertainty U_dis and data uncertainty U_data, we select target samples according to the strategy in Fig. 2. In each active selection step, samples are selected in two rounds: in the first round, the top κb target samples with the highest U_dis are selected; then, according to the data uncertainty U_data, we choose the top b target samples from the first-round candidates to query labels. Experiments on the uncertainty ordering and the selection ratio in the first round are provided in Sec. D.1 and Sec. 4.3. Relation between U_dis, U_data and typical entropy. Firstly, according to Eq. (2), the typical entropy of sample x_j can be written as H[P(y|x_j, θ)] = H[E[Dir(ρ|α_j)]] = −Σ_{c=1}^{C} ρ̄_jc log ρ̄_jc, where ρ̄_jc = E[Dir(ρ_c|α_j)]. Then, adding Eq. (3) and Eq. (4) together, we have U_dis(x_j, θ) + U_data(x_j, θ) = H[P(y|x_j, θ)]. Our method thus amounts to decomposing the typical entropy into two origins of uncertainty, by which both of our selection criteria are closely related to the prediction. In contrast, the targetness measured with a domain discriminator or clustering centers is not directly linked with the prediction, so the selected samples may already be nicely classified. Discussion. Although Malinin & Gales (2018) propose the Dirichlet Prior Network (DPN) to distinguish between data and distribution uncertainty, their objective differs from ours. Malinin & Gales (2018) mainly aim to detect out-of-distribution (OOD) data, and DPN is trained using the KL-divergence between the model and the ground-truth Dirichlet distribution.
Frustratingly, the ground-truth Dirichlet distribution is unknown. Though they manually construct a Dirichlet distribution as the proxy, the parameter of Dirichlet for the ground-truth class still needs to be set by hand, rather than learned from data. In contrast, by interpreting from an evidential perspective, our method does not require the ground-truth Dirichlet distribution and automatically learns sample-wise Dirichlet distribution by maximizing the evidence of ground-truth class and minimizing the evidences of wrong classes, which is shown in Sec. 3.4. Besides, they expect to generate a flat Dirichlet distribution for OOD data, while this is not desired on our target data, since our goal is to improve their accuracy. Hence, we additionally introduce losses to reduce the distribution and data uncertainty of target data.
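To make the two uncertainties concrete, the following NumPy/SciPy sketch evaluates Eq. (3) and Eq. (4) for toy Dirichlet parameters and checks the entropy decomposition; all α values are made up for illustration.

```python
import numpy as np
from scipy.special import digamma

def uncertainties(alpha):
    """Eq. (3) and Eq. (4) for a single Dirichlet opinion Dir(rho | alpha)."""
    a0 = alpha.sum()
    rho_bar = alpha / a0                      # expected class probabilities
    # Eq. (4): expected entropy of all possible predictions (data uncertainty).
    u_data = np.sum(rho_bar * (digamma(a0 + 1.0) - digamma(alpha + 1.0)))
    # Eq. (3): mutual information between labels and class probabilities
    # (distribution uncertainty), measuring the spread over the simplex.
    u_dis = (np.sum(rho_bar * (digamma(alpha + 1.0) - digamma(a0 + 1.0)))
             - np.sum(rho_bar * np.log(rho_bar)))
    return u_dis, u_data

# Lack of evidences (alpha near 1): high distribution uncertainty.
u_dis_lack, _ = uncertainties(np.array([1.1, 1.1, 1.1]))
# Plenty of, but conflicting, evidences: uncertainty is mostly data uncertainty.
u_dis_conf, _ = uncertainties(np.array([50.0, 50.0, 50.0]))
assert u_dis_lack > u_dis_conf                 # lack of evidence -> broad spread

# Decomposition: U_dis + U_data equals the entropy of the expected prediction.
alpha = np.array([8.0, 2.0, 1.0])
u_dis, u_data = uncertainties(alpha)
rho_bar = alpha / alpha.sum()
assert np.isclose(u_dis + u_data, -np.sum(rho_bar * np.log(rho_bar)))
```

The two toy cases mirror Fig. 1 (b): near-unit α spreads the Dirichlet broadly (high U_dis), while large-but-equal α concentrates it at the simplex center (high U_data, low U_dis).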

3.4. EVIDENTIAL MODEL LEARNING

To obtain reliable and consistent opinions for labeled data, the evidential model is trained to generate sharp Dirichlet distributions located at the corresponding corners of the simplex for these labeled data. Concretely, we train the model by minimizing the negative logarithm of the marginal likelihood (L_nll) and the KL-divergence between two Dirichlet distributions (L_kl). L_nll is expressed as

$$\mathcal{L}_{nll}=\frac{1}{n_s}\sum_{x_i\in\mathcal{S}}-\log\int p(y=y_i|\rho)\,p(\rho|x_i,\theta)\,d\rho+\frac{1}{|\mathcal{T}_l|}\sum_{x_j\in\mathcal{T}_l}-\log\int p(y=y_j|\rho)\,p(\rho|x_j,\theta)\,d\rho$$
$$=\frac{1}{n_s}\sum_{x_i\in\mathcal{S}}\sum_{c=1}^{C}\Upsilon_{ic}\Big(\log\textstyle\sum_{k=1}^{C}\alpha_{ik}-\log\alpha_{ic}\Big)+\frac{1}{|\mathcal{T}_l|}\sum_{x_j\in\mathcal{T}_l}\sum_{c=1}^{C}\Upsilon_{jc}\Big(\log\textstyle\sum_{k=1}^{C}\alpha_{jk}-\log\alpha_{jc}\Big),\tag{5}$$

where Υ_ic/Υ_jc is the c-th element of the one-hot label vector Υ_i/Υ_j of sample x_i/x_j. L_nll is minimized to ensure the correctness of the prediction. As for L_kl, it is denoted as

$$\mathcal{L}_{kl}=\frac{1}{C\cdot n_s}\sum_{x_i\in\mathcal{S}}\mathrm{KL}\big[\mathrm{Dir}(\rho|\tilde{\alpha}_i)\,\big\|\,\mathrm{Dir}(\rho|\mathbf{1})\big]+\frac{1}{C\cdot|\mathcal{T}_l|}\sum_{x_j\in\mathcal{T}_l}\mathrm{KL}\big[\mathrm{Dir}(\rho|\tilde{\alpha}_j)\,\big\|\,\mathrm{Dir}(\rho|\mathbf{1})\big],\tag{6}$$

where α̃_{i/j} = Υ_{i/j} + (1 − Υ_{i/j}) ⊙ α_{i/j} and ⊙ is the element-wise multiplication. α̃_{i/j} can be seen as removing the evidence of the ground-truth class. Minimizing L_kl forces the evidences of the other classes to shrink, avoiding the collection of misleading evidences and increasing discriminability. Here, we divide the KL-divergence by the number of classes, since its scale differs largely for different C. Due to the space limitation, the computable expression is given in Sec. E.4 of the appendix. In addition to the training on the labeled data, we also explicitly reduce the distribution and data uncertainties of unlabeled target data by minimizing L_un, which is formulated as

$$\mathcal{L}_{un}=\beta\mathcal{L}_{U_{dis}}+\lambda\mathcal{L}_{U_{data}}=\frac{\beta}{|\mathcal{T}_u|}\sum_{x_k\in\mathcal{T}_u}U_{dis}(x_k,\theta)+\frac{\lambda}{|\mathcal{T}_u|}\sum_{x_k\in\mathcal{T}_u}U_{data}(x_k,\theta),\tag{7}$$

where β and λ are two hyper-parameters balancing the two losses. On the one hand, this regularization term is conducive to improving the predictive confidence of some target samples.
On the other hand, it contributes to selecting valuable samples whose uncertainty cannot easily be reduced by the model itself, so that external annotation provides more guidance. To sum up, the total training loss is L_total = L_edl + L_un = (L_nll + L_kl) + (β L_U_dis + λ L_U_data). The training procedure of DUC is shown in Sec. B of the appendix. At the inference stage, we simply use the expected opinion, i.e., the expectation of the Dirichlet distribution, as the final prediction. Discussion. Firstly, L_nll and L_kl are widely used in EDL-inspired methods for supervision, e.g., Bao et al. (2021); Chen et al. (2022); Li et al. (2022). Secondly, our motivation and methodology differ from EDL. EDL does not consider the origin of uncertainty, since it is mainly proposed for OOD detection, which is less concerned with that; models can reject samples as long as the total uncertainty is high. By contrast, our goal is to select the most valuable target samples for model adaptation. Though target samples can be seen as OOD samples to some extent, simply sorting them by the total uncertainty is not a good strategy, since the total uncertainty cannot reflect the diversity of information; a better choice is to measure the value of samples from multiple aspects. Hence, we introduce a two-round selection strategy based on different uncertainty origins. Besides, according to the results in Sec. D.5, our method can empirically mitigate the domain shift by minimizing L_U_dis, which makes it more suitable for active DA; this is not included in EDL.

(Table note: methods with # are based on DeepLab-v3+ (Chen et al., 2018) and others on DeepLab-v2 (Chen et al., 2015); methods with budget "-" are the source-only or UDA methods.)
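As a concrete sketch of the evidential losses, the snippet below evaluates the per-sample negative log marginal likelihood of Eq. (5) and the closed-form KL term of Eq. (6); the Dirichlet parameters and the label are made-up toy values, and the helper names are hypothetical.

```python
import numpy as np
from scipy.special import digamma, gammaln

def nll_loss(alpha, label):
    """Eq. (5) for a single sample: -log of the marginal likelihood."""
    return np.log(alpha.sum()) - np.log(alpha[label])

def kl_to_uniform(alpha_tilde):
    """Closed form of KL( Dir(rho | alpha_tilde) || Dir(rho | 1) )."""
    a0 = alpha_tilde.sum()
    return (gammaln(a0) - gammaln(len(alpha_tilde)) - np.sum(gammaln(alpha_tilde))
            + np.sum((alpha_tilde - 1.0) * (digamma(alpha_tilde) - digamma(a0))))

alpha = np.array([9.0, 2.0, 1.5])          # hypothetical evidences + 1
label = 0
# alpha_tilde removes the evidence of the ground-truth class (Eq. 6).
one_hot = np.eye(3)[label]
alpha_tilde = one_hot + (1.0 - one_hot) * alpha

# More evidence for the true class lowers the NLL.
assert nll_loss(np.array([20.0, 2.0, 1.5]), 0) < nll_loss(alpha, 0)
# No wrong-class evidence (alpha_tilde = 1) -> zero KL penalty.
assert np.isclose(kl_to_uniform(np.ones(3)), 0.0)
assert kl_to_uniform(alpha_tilde) > 0.0
```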


considering targetness. Our method beats EADA by 1.3%, validating the efficacy of regarding the prediction as a distribution and selecting data based on both the distribution and data uncertainties. Results on VisDA-2017 are given in Table 2. On this large-scale dataset, our method still works well, achieving the highest accuracy of 88.9%, which further validates its effectiveness.

4.2.2. SEMANTIC SEGMENTATION

Results on GTAV→Cityscapes are shown in Table 3. Firstly, we can see that with only a 5% labeling budget, the performance of domain adaptation can be significantly boosted compared with UDA methods. Besides, compared with active DA methods (AADA and MADA), our DUC largely surpasses them in mIoU: DUC (69.8, 10.5↑) v.s. AADA (59.3), DUC (69.8, 4.9↑) v.s. MADA (64.9). This can be attributed to the more informative target samples selected by DUC and to the domain shift implicitly mitigated by reducing the distribution uncertainty of unlabeled target data. Results on SYNTHIA→Cityscapes are presented in Table 4. Due to the domain shift from virtual to realistic scenes as well as the variety of driving scenes and weather conditions, this adaptation task is challenging, yet our method still achieves considerable improvements. Concretely, in terms of the average mIoU over 16 classes, DUC exceeds AADA and MADA by 8.4% and 2.2%, respectively. We attribute these gains to the better measurement of targetness and discriminability, which are both closely related to the prediction, so the selected target pixels are truly conducive to classification. Distribution of U_dis Across Domains. To answer whether the distribution uncertainty can represent targetness, we plot in Fig. 3 (a) the distribution of U_dis, where the model is trained on the source domain with L_edl. We see that the U_dis of target data is noticeably biased from that of the source domain. Such results show that our U_dis can play the role of a domain discriminator without introducing one. Moreover, the score of a domain discriminator is not directly linked with the prediction, so the samples it selects are not necessarily beneficial for the classifier, while our U_dis is closely related to the prediction.

4.3. ANALYTICAL EXPERIMENTS

Expected Calibration Error (ECE). Following (Joo et al., 2020), we plot the expected calibration error (ECE) (Naeini et al., 2015) on target data in Fig. 3 (b) to evaluate calibration. Obviously, our model presents better calibration performance, with a much lower ECE. In contrast, when the confidence is high, the accuracy of the standard DNN is much lower than its confidence. This implies that the standard DNN easily produces overconfident but wrong predictions for target data, making the estimated predictive uncertainty unreliable. DUC mitigates this miscalibration problem. Effect of Selection Ratio in the First Round. The hyper-parameter κ controls the selection ratio in the first round, and Fig. 4 (a) presents the results on Office-Home with different κ. The performance with too large or too small a κ is inferior, which results from the imbalance between targetness and discriminability. When κ = 1 or κ = 100, the selection degenerates to a one-round sampling manner according to U_dis and U_data, respectively. In general, we find κ ∈ {10, 20, 30} works better. Hyper-parameter Sensitivity. β and λ control the trade-off between L_U_dis and L_U_data. We test the sensitivity of the two hyper-parameters on the Office-Home dataset. The results are presented in Fig. 4 (b), where β ∈ {0.01, 0.05, 0.1, 0.5, 1.0} and λ ∈ {0.001, 0.005, 0.01, 0.05, 0.1}. According to the results, DUC is not very sensitive to β but is somewhat sensitive to λ. In general, we recommend trying λ ∈ {0.01, 0.05, 0.1}.
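For reference, the ECE metric used above can be computed with a short, self-contained sketch; the confidence/correctness arrays below are toy values, not our experimental data.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE (Naeini et al., 2015): the |accuracy - confidence| gap per
    confidence bin, weighted by the fraction of samples in that bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Well-calibrated toy predictions: accuracy matches confidence in the bin.
conf = np.array([0.95] * 20)
corr = np.array([1] * 19 + [0])            # 95% accuracy at 95% confidence
assert expected_calibration_error(conf, corr) < 0.01

# Overconfident predictions (the failure mode of standard DNNs under shift).
corr_bad = np.array([1] * 10 + [0] * 10)   # 50% accuracy at 95% confidence
assert expected_calibration_error(conf, corr_bad) > 0.4
```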

5. CONCLUSION

In this paper, we address active domain adaptation (DA) from the evidential perspective and propose a Dirichlet-based Uncertainty Calibration (DUC) approach. Compared with existing active DA methods which estimate predictive uncertainty based on the prediction of deterministic models, we interpret the prediction as a distribution on the probability simplex via placing a Dirichlet prior on the class probabilities. Then, based on the prediction distribution, two uncertainties from different origins are designed in a unified framework to select informative target samples. Extensive experiments on both image classification and semantic segmentation verify the efficacy of DUC.

A BROADER IMPACT AND LIMITATIONS

Our work focuses on active domain adaptation (DA), which aims to maximally improve model adaptation from a labeled domain (termed source domain) to an unlabeled domain (termed target domain) by annotating limited target data. In this paper, we suggest a new perspective for active DA and further boost the adaptation performance on both cross-domain image classification and semantic segmentation benchmarks. These advances mean that our method may potentially benefit relevant social activities, e.g., commodity classification and autonomous driving in different scenes, without incurring high labor costs to annotate massive new data for each scene. While we do not anticipate adverse impacts, our method may suffer from some limitations. For example, our work is restricted to classification and segmentation tasks in this paper. In the future, we will explore our method in other tasks, e.g., object detection and regression, hoping to benefit more diverse fields. Besides, we only train the Dirichlet-based model using evidential deep learning in this paper. Yet, there may exist better training frameworks, e.g., the normalizing flow-based Dirichlet Posterior Network, which can predict a closed-form posterior distribution over predicted probabilities for any input sample. In the future, we may also explore extending our approach to the training framework of the normalizing flow-based Dirichlet Posterior Network.

B ALGORITHM OF DUC

The training procedure of DUC is shown in Algorithm 1. Its active selection step proceeds as follows:

6: ∀x_j ∈ T_u, compute its distribution and data uncertainties: U_dis(x_j, θ), U_data(x_j, θ).
7: temp_Candi ← select top κb samples with highest U_dis from T_u.
8: Candi ← select top b samples with highest U_data from temp_Candi.
9: Query the labels of Candi from the oracle.
10: T_u = T_u \ Candi, T_l = T_l ∪ Candi.

IMAGE CLASSIFICATION

For the optimizer, we adopt the mini-batch stochastic gradient descent (SGD) optimizer with batch size 32, momentum 0.9, weight decay 0.001 and the learning rate schedule strategy in (Long et al., 2018). The initial learning rates for miniDomainNet, Office-Home and VisDA-2017 are 0.002, 0.004 and 0.001, respectively. As for hyper-parameters, we select them by grid search and finally use β = 1.0, λ = 0.05, κ = 10 for the miniDomainNet and Office-Home datasets. We run each task on a single NVIDIA GeForce RTX 2080 Ti GPU.
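The two-round selection in steps 6-9 of Algorithm 1 can be sketched as follows; the uncertainty scores, budget b and ratio κ below are toy values, and the function name is hypothetical.

```python
import numpy as np

def two_round_select(u_dis, u_data, b, kappa):
    """First keep the kappa*b samples with highest distribution uncertainty,
    then query the b of those with highest data uncertainty.
    Returns indices into the unlabeled pool."""
    u_dis, u_data = np.asarray(u_dis), np.asarray(u_data)
    # Round 1: target-representative candidates (highest U_dis).
    temp_candi = np.argsort(-u_dis)[: kappa * b]
    # Round 2: discriminability-conducive samples (highest U_data) among them.
    order = np.argsort(-u_data[temp_candi])[:b]
    return temp_candi[order]

# Toy pool of 8 samples with hypothetical uncertainty scores.
u_dis = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
u_data = [0.1, 0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.5]
selected = two_round_select(u_dis, u_data, b=2, kappa=2)
# Round-1 candidates by U_dis: indices {0, 2, 4, 6}; among them, the two
# highest U_data values belong to indices 2 and 6.
assert sorted(selected.tolist()) == [2, 6]
```

The queried indices would then be moved from T_u to T_l, as in step 10.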

SEMANTIC SEGMENTATION

For semantic segmentation, we also implement the experiments using PyTorch (Paszke et al., 2019) and adopt DeepLab-v2 (Chen et al., 2015) and DeepLab-v3+ (Chen et al., 2018) with the ResNet-101 (He et al., 2016) backbone pre-trained on ImageNet (Deng et al., 2009). Regarding the total labeling budget B, we annotate 5% of the pixels of target images in total, divided into 5 active selection steps; in other words, we annotate 1% of the pixels of every image in each step. For data preprocessing, source images are resized to 1280×720 and target images to 1280×640. Similarly, the model is optimized using the mini-batch SGD optimizer with batch size 2, momentum 0.9 and weight decay 0.0005. The "poly" learning rate schedule with an initial learning rate of 3e-4 is employed. We set β = 1.0, λ = 0.01, κ = 10 for the semantic segmentation tasks. Each semantic segmentation task is run on a single NVIDIA GeForce RTX 3090 GPU.

D.1 ORDERING OF U_dis AND U_data IN THE TWO ROUNDS

In Table 8, we explore the influence of different orderings, where ⟨U_dis, U_data⟩ denotes that U_dis and U_data are respectively used in the first and second round. We notice that ⟨U_dis, U_data⟩ generally surpasses ⟨U_data, U_dis⟩, showing that selecting discriminability-conducive samples from target-representative ones is better for active DA than the converse manner. Thus, we adopt ⟨U_dis, U_data⟩ throughout the paper.

D.2 PERFORMANCE GAIN WITH DIFFERENT LABELING BUDGETS

In Table 6, we present the performances on the Office-Home dataset with different total labeling budgets B. As expected, better performance is obtained as more labeled target samples become accessible. In addition, we observe that the rate of improvement generally slows as the total labeling budget increases. This observation demonstrates that not all samples are equally informative and that our method can successfully select the relatively informative ones. For example, when the labeling budget increases from 17.5% to 20%, the performance gain is much smaller, which implies that the majority of informative samples have already been selected by our method.

D.3 COMBINATION WITH SEMI-SUPERVISED LEARNING

To further improve the performance, one can incorporate ideas from semi-supervised learning to exploit the unlabeled target data during training as well. Here, we consider one representative semi-supervised learning method: FixMatch (Sohn et al., 2020). Specifically, we apply strong and weak augmentations to each unlabeled target sample x_j, obtaining two views x_j^strong and x_j^weak, and use the pseudo label of the weakly augmented view x_j^weak as the label of the strongly augmented view x_j^strong. The model is then trained to minimize the loss L_nll^fixmatch, i.e., the negative logarithm of the marginal likelihood of strongly augmented views. Concretely, L_nll^fixmatch is formulated as

$$\mathcal{L}_{nll}^{fixmatch} = \frac{1}{M}\sum_{x_j \in \mathcal{T}^u \wedge \tau < \max_c \bar{\rho}_{jc}^{weak}} -\log\int p(y=\hat{y}_j^{weak}\,|\,\rho)\,p(\rho\,|\,x_j^{strong},\theta)\,d\rho = \frac{1}{M}\sum_{x_j \in \mathcal{T}^u \wedge \tau < \max_c \bar{\rho}_{jc}^{weak}} \sum_{c=1}^{C}\Upsilon_{jc}^{weak}\Big(\log\sum_{k=1}^{C}\alpha_{jk}^{strong} - \log\alpha_{jc}^{strong}\Big),$$

where ŷ_j^weak is the pseudo label derived from the weakly augmented view. Table 7 presents the results on the Office-Home dataset when combining our method DUC with the semi-supervised learning method FixMatch (Sohn et al., 2020), where the hyper-parameter τ is set to 0.8. We can see that utilizing unlabeled target data indeed helps to improve the performance. Of course, other semi-supervised learning methods are also applicable.
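A minimal, list-based sketch of this loss (a simplified stand-in for a batched PyTorch implementation; the function names are ours). It uses the Dirichlet identity -log E[ρ_c] = log α_0 - log α_c for the inner negative log-likelihood term.

```python
import math

def dirichlet_nll(alpha_strong, y_hat):
    """Negative log marginal likelihood of class y_hat under Dir(alpha):
    -log E[rho_y] = log(sum_c alpha_c) - log(alpha_y)."""
    return math.log(sum(alpha_strong)) - math.log(alpha_strong[y_hat])

def fixmatch_loss(alpha_weak_batch, alpha_strong_batch, tau):
    """Sketch of L_nll^fixmatch: pseudo-label each sample from the weak
    view's expected prediction rho_bar = alpha / alpha_0, keep samples
    whose max rho_bar exceeds tau, and average the Dirichlet NLL of the
    corresponding strong view."""
    losses = []
    for a_w, a_s in zip(alpha_weak_batch, alpha_strong_batch):
        a0 = sum(a_w)
        rho_bar = [a / a0 for a in a_w]       # expected weak prediction
        conf = max(rho_bar)
        if conf > tau:                        # threshold on the weak view
            y_hat = rho_bar.index(conf)       # pseudo label
            losses.append(dirichlet_nll(a_s, y_hat))
    return sum(losses) / len(losses) if losses else 0.0
```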

D.4 QUALITATIVE VISUALIZATION OF SELECTED SAMPLES

In the label histogram of Fig. 6, we plot the ground-truth label distribution of the samples selected by DUC, with the total labeling budget B = 5% × n_t. For the Ar→Cl task, "Bottle", "Knives" and "Toys" are the top-3 picked classes, while "Bucket", "Pencil" and "Spoon" turn out to be the top-3 picked classes in the Cl→Ar task. This shows that our method DUC can adaptively select informative samples for different target domains. Although a few categories are not picked, the samples selected by DUC are generally category-diverse. Moreover, according to the visualization of selected samples, the style of the target domain is indeed reflected in these selected samples. In addition, we also visualize the selected pixels for the task GTAV→Cityscapes in Fig. 7. Overall, the selected pixels come from diverse objects that are hard to classify or are close together. Annotating such pixels brings more beneficial knowledge to the model.

D.5 T-SNE VISUALIZATION FOR SHOWING EFFECTS OF L U dis

To verify that reducing our distribution uncertainty U_dis conduces to domain alignment, we respectively train the model with L_edl and L_edl + βL_U_dis, where there is no labeling budget. The t-SNE (van der Maaten & Hinton, 2008) visualization of features from the source and target domains on tasks Ar→Cl and Cl→Ar is shown in Fig. 8. From the results, we can see that reducing the distribution uncertainty of target data indeed helps to alleviate the domain shift, which makes our method more suitable for active DA compared with EDL (Sensoy et al., 2018). Besides, the results also verify that our distribution uncertainty can measure the targetness of samples.

According to the definition of mutual information (Kieffer, 1994; Shannon, 1948), I[y, ρ|x_j, θ] can be expressed as

$$I[y,\rho\,|\,x_j,\theta] = \int \sum_{c=1}^{C} p(y=c,\rho\,|\,x_j,\theta)\log\frac{p(y=c,\rho\,|\,x_j,\theta)}{p(y=c\,|\,x_j,\theta)\,p(\rho\,|\,x_j,\theta)}\,d\rho. \quad (18)$$

Since the deep model induces the Markov chain (x_j, θ) → ρ → y, y and (x_j, θ) are conditionally independent given ρ, i.e., p(y, ρ|x_j, θ) = p(y|ρ)p(ρ|x_j, θ). Then, Eq. (18) can be further expanded as in Eq. (19). The derivation of Eq. (19) is based on the conclusion from Eq. (12), i.e., P(y = c|x_j, θ) = E_{Dir(ρ|α_j)}[ρ_c] = α_jc/α_j0.



Here, ρ = [ρ_1, ρ_2, ..., ρ_C] = [P(y = 1), P(y = 2), ..., P(y = C)] is a vector of class probabilities.



Figure 1: (a): point-estimate entropy of DNN and expected entropy of Dirichlet-based model, where colors of points denote class identities. Both models are trained with source data. (b): examples of the prediction distribution of three "monitor" images on the simplex. The model is trained with images of "keyboard", "computer" and "monitor" from the Clipart domain of Office-Home dataset. For the two images from the Real-World domain, the entropy of expected prediction cannot distinguish them, whereas U dis and U data calculated based on the prediction distribution can reflect what contributes more to their uncertainty and be utilized to guarantee the information diversity of selected data.

Figure 2: Illustration of DUC. When the training reaches an active selection step, the distribution uncertainty U_dis and data uncertainty U_data of unlabeled target samples are calculated according to the Dirichlet distribution with parameter α. Then the κb samples with the highest U_dis are chosen in the first round. In the second round, according to U_data, we select the top b samples from the instances chosen in the first round and query their labels. These labeled target samples are added into the supervised learning. When the total labeling budget B is reached, the active selection stops.

Here, θ denotes the parameters of the evidential deep model, ψ(·) is the digamma function and ρ̄_jc = E[Dir(ρ_c|α_j)]. We use mutual information to measure the spread of the Dirichlet distribution on the simplex, following Malinin & Gales (2018). A higher U_dis indicates a larger variance of opinions due to the lack of evidence, i.e., the Dirichlet distribution is broadly spread over the probability simplex.
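As a sketch, the distribution uncertainty can be computed from α via the standard closed form of the Dirichlet mutual information, U_dis = -Σ_c ρ̄_jc (log ρ̄_jc - ψ(α_jc + 1) + ψ(α_j0 + 1)) with ρ̄_jc = α_jc/α_j0 (Malinin & Gales, 2018). The `digamma` helper below is a numerical stand-in for `scipy.special.digamma`, used to keep the example stdlib-only.

```python
import math

def digamma(x, h=1e-5):
    # Numerical digamma via the lgamma derivative; a stand-in for
    # scipy.special.digamma.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def distribution_uncertainty(alpha):
    """U_dis = I[y, rho | x, theta] for Dir(alpha):
    -sum_c rho_bar_c * (log rho_bar_c - psi(alpha_c + 1) + psi(alpha_0 + 1)),
    with rho_bar_c = alpha_c / alpha_0."""
    a0 = sum(alpha)
    u = 0.0
    for a in alpha:
        rb = a / a0
        u -= rb * (math.log(rb) - digamma(a + 1) + digamma(a0 + 1))
    return u
```

A flat Dirichlet with little total evidence (α = [1, 1]) is broadly spread on the simplex and yields a high U_dis, while the same expected prediction backed by much evidence (α = [100, 100]) yields a near-zero U_dis.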

Complexity of Query Selection. The time consumed by the selection process mainly comes from sorting samples. In the first round of each active selection step, the time complexity is O(|T^u| log|T^u|), and in the second round it is O((κb) log(κb)). Thus, the complexity of each selection step is O(|T^u| log|T^u|) + O((κb) log(κb)). Assuming the number of total selection steps is r, the total complexity is Σ_{m=1}^{r} (O(|T^u_m| log|T^u_m|) + O((κb) log(κb))), where T^u_m is the unlabeled target set in the m-th active selection step. Since r is quite small (5 in our paper) and κb ≤ |T^u_m| ≤ n_t, the approximate overall time complexity is O(n_t log n_t).
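The two-round selection described above can be sketched as follows (a plain-Python stand-in for a batched implementation; sorting dominates the cost, giving the complexities stated above):

```python
def two_round_select(u_dis, u_data, b, kappa):
    """Two-round query selection: first take the kappa*b unlabeled
    samples with the highest distribution uncertainty U_dis, then from
    those pick the top b by data uncertainty U_data. Returns the
    indices of the samples to query."""
    # Round 1: indices of the kappa*b highest-U_dis samples,
    # O(|T_u| log |T_u|) from the sort.
    round1 = sorted(range(len(u_dis)),
                    key=lambda i: u_dis[i], reverse=True)[:kappa * b]
    # Round 2: among them, the b highest-U_data samples,
    # O(kappa*b log(kappa*b)).
    return sorted(round1, key=lambda i: u_data[i], reverse=True)[:b]
```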

Figure 3: (a): The distribution of log U_dis for source and target data on tasks Ar → Cl and Cl → Ar. For clarity, we apply the logarithm to U_dis. (b): Expected calibration error (ECE) of target data, where the standard DNN with cross-entropy (CE) loss and our model are both trained with source data.

when U_dis is large, because opinions derived from insufficient evidence are unreliable. Instead, reducing both U_dis and U_data is the right choice, which is further verified by the 7.1% improvement of variant D over pure EDL. Besides, even without L_U_dis and L_U_data, variant A still exceeds CLUE by 1.6%, showing our superiority. Finally, we try different selection strategies. Variants E to H denote that only one criterion is used in the selection. We see that DUC beats variants F, G and H, since the entropy degenerates into uncertainty based on a point estimate, while U_dis or U_data alone considers only targetness or only discriminability. In contrast, DUC selects samples with both characteristics.

Figure 4: (a): Effect of different first-round selection ratios κ% on Office-Home. (b): Hyper-parameter sensitivity of β, λ on Office-Home.

Pseudo code of the proposed DUC
Input: labeled source dataset S, unlabeled target dataset T, selection steps R, total annotation budget B, hyper-parameters κ, β, λ, total training steps T.
Output: learned model parameters θ.
1: Initialize model parameters θ.
2: Define T^l = ∅ and T^u = T, b = B/|R|.
3: for t = 1 to T do
4:    Update parameters θ via minimizing L_total.
5:    if t is an active selection step then
6:        Compute U_dis and U_data for all samples in T^u.
7:        First round: choose the κb samples with the highest U_dis.
8:        Second round: select the top b of them according to U_data as Candi.
9:        Query the labels of Candi from the oracle.
10:       T^u = T^u \ Candi, T^l = T^l ∪ Candi.
11:   end if
12: end for
13: return Final model parameters θ.

Here, ŷ_j^weak = argmax_c E[Dir(ρ_c|α_j^weak)] and M = |{x_j | x_j ∈ T^u ∧ τ < max_c ρ̄_jc^weak}|. τ is a hyper-parameter denoting the threshold above which the pseudo label is retained, and Υ_jc^weak is the c-th element of the one-hot label vector Υ_j^weak for the pseudo label ŷ_j^weak.

Figure7: Visualization of selected pixels in the task GTAV→Cityscapes, with the total labeling budget of 5% pixels. Here, we randomly choose ten images from Cityscapes for display.

Figure 8: The t-SNE visualization of features learned by the model trained with L edl and L edl + βL U dis respectively. Red and blue dots represent source and target features, respectively.

$$I[y,\rho\,|\,x_j,\theta] = \int p(\rho\,|\,x_j,\theta)\sum_{c=1}^{C} p(y=c\,|\,\rho)\log\frac{p(y=c\,|\,\rho)}{p(y=c\,|\,x_j,\theta)}\,d\rho = \int p(\rho\,|\,x_j,\theta)\sum_{c=1}^{C}\big(\rho_c\log\rho_c - \rho_c\log p(y=c\,|\,x_j,\theta)\big)\,d\rho = \int p(\rho\,|\,x_j,\theta)\sum_{c=1}^{C}(\rho_c\log\rho_c)\,d\rho - \sum_{c=1}^{C}\mathbb{E}_{p(\rho|x_j,\theta)\sim\mathrm{Dir}(\rho|\alpha_j)}[\rho_c]\log p(y=c\,|\,x_j,\theta). \quad (19)$$

The derivation of Eq. (20) is based on the conclusion in Section E.2.

E.4 KULLBACK-LEIBLER DIVERGENCE L_kl

For p(ρ|α̃_i) ∼ Dir(ρ|α̃_i), the probability density function is defined as p(ρ|α̃_i) = (1/B(α̃_i)) ∏_{c=1}^C ρ_c^{α̃_ic − 1}, where B(·) is the multivariate Beta function, B(α̃_i) = ∏_{c=1}^C Γ(α̃_ic) / Γ(∑_{c=1}^C α̃_ic), and Γ(·) is the Gamma function. The Kullback-Leibler divergence between the Dirichlet distributions Dir(ρ|α̃_i) and Dir(ρ|1) is formulated as

$$\mathrm{KL}\big[\mathrm{Dir}(\rho|\tilde{\alpha}_i)\,\|\,\mathrm{Dir}(\rho|\mathbf{1})\big] = \int p(\rho|\tilde{\alpha}_i)\log\frac{p(\rho|\tilde{\alpha}_i)}{p(\rho|\mathbf{1})}\,d\rho = \log\frac{\Gamma(\sum_{c=1}^{C}\tilde{\alpha}_{ic})}{\Gamma(C)\prod_{c=1}^{C}\Gamma(\tilde{\alpha}_{ic})} + \sum_{c=1}^{C}(\tilde{\alpha}_{ic}-1)\Big(\psi(\tilde{\alpha}_{ic}) - \psi\big(\textstyle\sum_{k=1}^{C}\tilde{\alpha}_{ik}\big)\Big),$$

where we use $\mathbb{E}_{\rho_c\sim\mathrm{Beta}(\rho_c|\tilde{\alpha}_{ic},\,\tilde{\alpha}_{i0}-\tilde{\alpha}_{ic})}[\log\rho_c] = \psi(\tilde{\alpha}_{ic}) - \psi(\tilde{\alpha}_{i0})$.

Thus, the computable expression of L_kl is given by

$$\mathcal{L}_{kl} = \frac{1}{C\cdot n_s}\sum_{x_i\in\mathcal{S}} \mathrm{KL}\big[\mathrm{Dir}(\rho|\tilde{\alpha}_i)\,\|\,\mathrm{Dir}(\rho|\mathbf{1})\big] + \frac{1}{C\cdot|\mathcal{T}^l|}\sum_{x_j\in\mathcal{T}^l} \mathrm{KL}\big[\mathrm{Dir}(\rho|\tilde{\alpha}_j)\,\|\,\mathrm{Dir}(\rho|\mathbf{1})\big].$$
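A sketch of the closed-form KL term above (the `digamma` helper is a numerical stand-in for `scipy.special.digamma`); as expected, the divergence vanishes when α̃ = 1, i.e., when the Dirichlet is already uniform.

```python
import math

def digamma(x, h=1e-5):
    # Numerical digamma via the lgamma derivative; a stand-in for
    # scipy.special.digamma.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def kl_dirichlet_to_uniform(alpha):
    """Closed-form KL[Dir(alpha) || Dir(1)]:
    log Gamma(alpha_0) - log Gamma(C) - sum_c log Gamma(alpha_c)
    + sum_c (alpha_c - 1) * (psi(alpha_c) - psi(alpha_0))."""
    a0 = sum(alpha)
    C = len(alpha)
    kl = math.lgamma(a0) - math.lgamma(C) - sum(math.lgamma(a) for a in alpha)
    kl += sum((a - 1.0) * (digamma(a) - digamma(a0)) for a in alpha)
    return kl
```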

Accuracy (%) on miniDomainNet with 5% of target samples as the labeling budget (ResNet-50). The 12 tasks are clp→pnt, clp→rel, clp→skt, pnt→clp, pnt→rel, pnt→skt, rel→clp, rel→pnt, rel→skt, skt→clp, skt→pnt and skt→rel, plus the average (Avg). For miniDomainNet, since the compared baselines do not report results on this dataset, we report our own runs based on their open-source code.

Accuracy (%) on Office-Home and VisDA-2017 with 5% target samples as the labeling budget (ResNet-50).

mIoU (%) comparisons on the task GTAV → Cityscapes. The per-class columns are: road, sidewalk, building, wall, fence, pole, light, sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motor, bike, and mIoU. Methods marked with # are based on DeepLab-v3+ (Chen et al., 2018) and the others on DeepLab-v2 (Chen et al., 2015). Methods with budget "-" are the source-only or UDA methods. EADA denotes results based on our own runs with the corresponding open-source code.

mIoU (%) comparisons on the task SYNTHIA → Cityscapes. mIoU* is reported as the average over 13 classes, excluding "wall", "fence" and "pole".

Ablation study of DUC on Office-Home.

Results with different total labeling budget B on Office-Home (ResNet-50).

Accuracy (%) on Office-Home with 5% of target samples as the labeling budget (ResNet-50), when DUC is combined with the semi-supervised learning method.

The budget is divided into 5 selection steps, i.e., the labeling budget in each selection step is b = B/5 = 1% × n_t. For data preprocessing, we use RandomHorizontalFlip, RandomResizedCrop and ColorJitter during training and CenterCrop at test time.

D.1 EFFECTS OF THE ORDERING OF U_dis, U_data IN TWO-ROUND SAMPLING

Since our selection strategy is a two-round sampling manner, there naturally exists an ordering of U_dis and U_data over the two rounds. In Table 8, we explore the influence of different orderings, where ⟨U_dis, U_data⟩ denotes that U_dis and U_data are used in the first and second round, respectively. We notice that ⟨U_dis, U_data⟩ generally surpasses ⟨U_data, U_dis⟩. This shows that selecting discriminability-conducive samples from target-representative samples is better for active DA than the converse manner. Thus, we adopt ⟨U_dis, U_data⟩ throughout the paper.

Analysis on the ordering of U_dis, U_data.

ACKNOWLEDGEMENTS

This work was supported by National Key R&D Program of China (No. 2021YFB3301503).

VisDA-2017

miniDomainNet (Zhou et al., 2021) is a subset of DomainNet (Peng et al., 2019), a large-scale image classification dataset for domain adaptation. miniDomainNet contains more than 130,000 images of 126 classes from four domains: Clipart (clp), Painting (pnt), Real (rel) and Sketch (skt). The large data scale and multiplicity make adaptation on this dataset quite challenging. We build 12 adaptation tasks, clp→pnt, ..., skt→rel, by permuting the four domains to evaluate our method.

Office-Home

Office-Home (Venkateswara et al., 2017). The image illustration of the different datasets is shown in Fig. 5.

C.2 IMPLEMENTATION DETAILS

IMAGE CLASSIFICATION

All experiments are implemented via PyTorch (Paszke et al., 2019). For image classification, we use ResNet-50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) as the backbone, and the exponential function is applied to the model output to ensure α is non-negative. Following (Xie et al., 2021; Fu et al., 2021), the total labeling budget B is set as 5% of target samples.

Given a sample x_i and a model f parameterized by θ, the predicted class probability for class c can be obtained as P(y = c|x_i, θ) = ∫ ρ_c p(ρ|x_i, θ) dρ, where ρ_c is the c-th element of the class probability vector ρ. According to (Ng et al., 2011), the marginal distributions of a Dirichlet are Beta distributions. Thus, given p(ρ|x_i, θ) ∼ Dir(ρ|α_i), we have p(ρ_c|x_i, θ) ∼ Beta(ρ_c|α_ic, α_i0 − α_ic), where α_i = g(f(x_i, θ)), α_i0 = ∑_{k=1}^C α_ik, and g(·) is a function (e.g., the exponential function) keeping α_i (i.e., the parameters of the Dirichlet distribution for sample x_i) non-negative. According to the probability density function of the Beta distribution, we further have p(ρ_c|x_i, θ) = ρ_c^{α_ic − 1}(1 − ρ_c)^{α_i0 − α_ic − 1} / B(α_ic, α_i0 − α_ic), where B(·,·) is the Beta function and B(α_ic, α_i0 − α_ic) = Γ(α_ic)Γ(α_i0 − α_ic)/Γ(α_i0), with Γ(·) denoting the Gamma function. Based on these, we can further derive P(y = c|x_i, θ) = E[ρ_c] = α_ic/α_i0. In particular, if g(·) adopts the exponential function, traditional softmax-based models can be viewed as predicting the expectation of the Dirichlet distribution.
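A quick sketch verifying the last remark: with g = exp, the expected Dirichlet prediction α_c/α_0 coincides with the ordinary softmax output (function names here are ours, for illustration only).

```python
import math

def expected_prediction(logits):
    """P(y=c|x, theta) = alpha_c / alpha_0 with alpha = exp(logits),
    i.e., the expectation of the Dirichlet distribution."""
    alpha = [math.exp(z) for z in logits]
    a0 = sum(alpha)
    return [a / a0 for a in alpha]

def softmax(logits):
    """Standard (max-shifted) softmax for comparison."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]
```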

