GENERALIZABLE PERSON RE-IDENTIFICATION WITHOUT DEMOGRAPHICS

Abstract

Domain generalizable person re-identification (DG-ReID) aims to learn a ready-to-use domain-agnostic model directly for cross-dataset/domain evaluation, while current methods mainly exploit demographic information such as domain and/or camera labels for domain-invariant representation learning. However, such demographic information is not always accessible in practice due to privacy and security issues. In this paper, we consider person re-identification in a more general setting, i.e., domain generalizable person re-identification without demographics (DGWD-ReID). To address the underlying uncertainty of the domain distribution, we introduce distributionally robust optimization (DRO) to learn person re-identification models that perform well on all possible data distributions within an uncertainty set, without demographics. However, directly applying the popular Kullback-Leibler divergence constrained DRO (KL-DRO) fails to generalize well under the distribution shifts of real-world scenarios, since its convexity condition may not hold for overparameterized neural networks. Motivated by this, we analyze and reformulate KL-DRO via the change-of-measure technique, and propose a simple yet efficient approach, Unit-DRO, which minimizes the loss over a reweighted dataset in which hard samples are up-weighted and other samples are down-weighted. We perform extensive experiments on both domain generalizable and cross-domain person ReID tasks, and the empirical results show that Unit-DRO achieves superior performance compared to all baselines without using demographics.

1. INTRODUCTION

Person re-identification (ReID) aims to find the correspondences between person images from the same identity across multiple camera views. As illustrated in Figure 1, previous studies mainly follow three settings: 1) supervised person ReID Zhang et al.
(2020), where training and test data are independently and identically distributed (i.i.d.), drawn from the same distribution. Though recent supervised methods have achieved remarkable performance, they are usually non-robust in out-of-distribution (OOD) settings; 2) unsupervised domain adaptive person ReID (UDA-ReID) and cross-domain person ReID (CD-ReID) Luo et al. (2020), where UDA-ReID relies on large amounts of unlabeled data for retraining and CD-ReID cannot exploit the benefits brought by multi-source domains; 3) domain generalizable person ReID (DG-ReID) Dai et al. (2021a), where the model is trained on multiple large-scale datasets and tested on unseen domains directly, without extra data collection/annotation or model updating on new domains. Therefore, DG-ReID is receiving increasing attention due to its great value in real-world person retrieval applications. However, current DG-ReID research usually comes with a serious disadvantage: it requires demographic information (e.g., domain labels Choi et al. (2021); Zhao et al. (2021), camera IDs Zhang et al. (2021b); Dai et al. (2021a), and video timestamps Yuan et al. (2020)) as extra supervision for model training. Such demographics implicitly define the variations in training data that the learned model should be invariant or robust to. Unfortunately, demographic information is usually not available in practice for the following reasons: 1) the collection of demographics inevitably leads to privacy problems Veale & Binns (2017), e.g., the risks of exposing geographical locations and/or environment information; 2) the collection/annotation of domain labels is an expensive and ethically fraught endeavour Michel et al. (2021); and 3) such coarse-grained labels and the noise in manually annotated domain labels may exacerbate the hidden stratification issue, which hinders a variety of safety-critical applications Creager et al. (2021); Kim & Lee (2021) (see Appendix A for more discussions).
Therefore, as shown in Figure 1d, we consider a more general setting, i.e., DG-ReID without demographics (DGWD-ReID), in which no domain or camera labels are available during training.



To address the underlying uncertainty of the domain distribution without using demographics, distributionally robust optimization (DRO) is a promising paradigm, which explicitly seeks prediction functions robust to distribution shifts Hu et al. (2018). Specifically, DRO considers a minimax game: the inner optimization shifts the training distribution within a pre-specified uncertainty set so as to maximize the expected loss on the test distribution, while the outer optimization minimizes this adversarial expected loss. The uncertainty set defined by an f-divergence ball (such as the Kullback-Leibler divergence) around the training distribution is very popular, and the resulting method is known as KL-DRO Hu & Hong (2013). However, the convexity assumption in KL-DRO usually does not hold in real-world scenarios, leading to inferior performance for overparameterized neural networks. We address this issue and reformulate KL-DRO: we first solve the inner optimization problem and then obtain a closed-form expression of the optimal objective. Specifically, different from previous work that converts the minimax DRO problem into a single minimization problem via the closed-form expression Hu & Hong (2013), we utilize a change-of-measure technique and reformulate the minimax optimization as an importance sampling problem, termed Unit-DRO [1]. By doing this, Unit-DRO avoids the bi-level optimization of traditional DRO problems and scales well to overparameterized regimes. Concretely, Unit-DRO upweights samples that are prone to be misclassified and downweights the others: it assigns a normalized weight e^{ℓ/τ*}/E[e^{ℓ/τ*}] to each data-label pair (x, y), where ℓ is the error incurred by (x, y) and τ* is a hyperparameter.
During implementation, there are still two main challenges in applying Unit-DRO: 1) it struggles with the hyperparameter τ*, and we observe that a constant τ* during training always leads to inferior performance in practice; 2) the normalization factor E[e^{ℓ/τ*}] requires an expectation over the training distribution, which is not compatible with stochastic mini-batch training. To tackle the first problem, we propose a multi-step solution that adaptively sets the value of τ* according to the training step. For the second, we maintain a weight queue that stores historical sample weights for a better estimation of E[e^{ℓ/τ*}] over the training distribution. Compared to previous DG-ReID methods, Unit-DRO requires neither meta-learning pipelines nor model structure engineering. In this paper, we evaluate the proposed Unit-DRO for person ReID by comparing it with existing DG-ReID and CD-ReID methods. Unit-DRO outperforms a variety of recent methods by a large margin on both DG-ReID and CD-ReID benchmarks, even including methods that use demographics. To better understand the proposed Unit-DRO, we perform comprehensive ablation studies on several important components, such as the multi-step τ* solution and the weight queue. Furthermore, we visualize the learned weight distributions and t-SNE embeddings, and measure the domain divergence and error sets to show the strong invariant-learning capability of Unit-DRO. Empirical results show that the proposed Unit-DRO can effectively retrieve valuable samples or subgroups without demographics.

2. RELATED WORK

DG-ReID. Generalizable methods have recently been proposed to learn invariant representations that generalize to unseen domains Song et al. (2019); Choi et al. (2021); Zhang et al. (2021b). Existing methods mainly utilize domain divergence minimization strategies or a meta-learning pipeline. In view of the current research trend (Table 1), most methods rely on demographics to learn invariant features.

3. METHOD

Problem Formulation. In the current DG-ReID setting, there is a labeled training set drawn from several different domains: P = ∪_{k=1}^{N} P_k with P_k = {(x_i, y_i)}_{i=1}^{N_k}, where N is the number of domains, N_k is the number of images in domain P_k, and x_i ∈ X, y_i ∈ Y denote an image and its corresponding label, respectively. During training, we use all aggregated image-label pairs from P. During testing, we evaluate the person retrieval performance on the unseen target domain G without any additional model updating. Therefore, the goal of DG-ReID is to learn a model f_θ : X → Y that minimizes the expected error on the unseen target domain G:

min_{θ∈Θ} E_{(x,y)∈G}[ℓ(x, y; θ)], (1)

where ℓ is the predefined loss function. This objective encodes the goal of learning a model that does not depend on spurious correlations: if a model makes decisions according to domain-specific information, it is naturally brittle in an entirely distinct domain. However, previous studies mostly leverage demographics (e.g., domain/camera labels and video timestamps) to remove the spurious correlations, and such information is not always available in real-world applications. Therefore, we consider a more general setting in which the above-mentioned demographic information is unknown during training, i.e., DG-ReID without demographics or DGWD-ReID, in line with the motivation that annotating demographics is expensive and likely to expose private information.

Baseline Algorithm. We introduce the objectives used in our baseline as follows. The first is the cross-entropy loss. Given n training points {(x_1, y_1), ..., (x_n, y_n)}, the loss for person identity classification is L_ce = (1/n) Σ_{i=1}^{n} ℓ(x_i, y_i; θ), where ℓ is the cross-entropy loss function. Label smoothing is also applied to prevent the model from overfitting to the identity labels.
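As an illustration, here is a minimal NumPy sketch of label-smoothed cross-entropy (the smoothing value 0.1 follows the setup in Section 4; the function name and the eps/(K-1) spreading of the off-target mass are our assumptions, not the paper's code):

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, eps=0.1):
    """Cross-entropy with label smoothing: the target places 1-eps on the
    true class and spreads eps/(K-1) over the remaining K-1 classes."""
    logits = np.asarray(logits, dtype=float)
    K = logits.shape[-1]
    # numerically stable log-softmax
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    target = np.full_like(log_p, eps / (K - 1))
    target[np.arange(len(labels)), labels] = 1.0 - eps
    return float(-(target * log_p).sum(axis=-1).mean())
```

With eps=0 this reduces to the standard cross-entropy; a positive eps penalizes overconfident predictions on the identity labels.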
Inspired by recent ReID methods, we further introduce the triplet loss to enhance intra-class compactness and inter-class separability in the embedding space. Following Hermans et al. (2017), given an anchor sample x_i^a, we evaluate the triplet loss using the hardest positive and negative samples, x_i^p and x_i^n, within each mini-batch:

L_tr(x_i^a, x_i^p, x_i^n; θ) = max{d(x_i^a, x_i^p; θ) - d(x_i^a, x_i^n; θ) + m, 0},

where d(·, ·) is a pairwise distance such as the Euclidean distance, and m is the margin between positive and negative pairs. We use a BNNeck structure Luo et al. (2019a) to maximize the synergy between L_ce and L_tr, and integrate a mixture of batch normalization and instance normalization with learnable parameters Choi et al. (2021), both of which have been shown to be very useful for DG-ReID. In the following, we reuse ℓ(x, y; θ) to denote the sum of the cross-entropy loss and the triplet loss.
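A minimal NumPy sketch of the batch-hard triplet loss described above (the function name is ours; the Euclidean distance and the margin m = 0.3 follow the paper's setup):

```python
import numpy as np

def batch_hard_triplet_loss(feats, pids, margin=0.3):
    """Batch-hard triplet loss (Hermans et al., 2017): for each anchor,
    use the farthest positive and the closest negative in the mini-batch.
    feats is (N, D); pids holds the person identity of each row."""
    feats = np.asarray(feats, dtype=float)
    pids = np.asarray(pids)
    # pairwise Euclidean distances, with a tiny epsilon for sqrt stability
    d = np.sqrt(((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1) + 1e-12)
    same = pids[:, None] == pids[None, :]
    losses = []
    for i in range(len(feats)):
        pos = d[i][same[i] & (np.arange(len(feats)) != i)]  # same id, not self
        neg = d[i][~same[i]]                                # different id
        if len(pos) == 0 or len(neg) == 0:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(pos.max() - neg.min() + margin, 0.0))
    return float(np.mean(losses))
```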

3.1. UNIT-DRO

To address the underlying uncertainty of the domain distribution, we introduce Unit-DRO, a novel generalization framework that requires no priors about demographics. We first review the basic distributionally robust optimization (DRO) framework Ben-Tal et al. (2009); Rahimian & Mehrotra (2019). In DRO, the worst-case expected risk over a predefined family of distributions Q (termed the uncertainty set) replaces the expected risk on the unseen target distribution G in Equ.1, yielding the objective

min_{θ∈Θ} max_{q∈Q} E_{(x,y)∈q}[ℓ(x, y; θ)]. (3)

Lemma 1 (Modified from Section 2 in Hu & Hong (2013)). Assume the model family Θ and the uncertainty set Q are convex and compact, and the loss ℓ is continuous and convex for all x ∈ X, y ∈ Y. Suppose the empirical distribution P has density p(x, y). Then the inner maximum of Equ.3 has the closed-form solution

q*(x, y) = p(x, y) e^{ℓ(x,y;θ)/τ*} / E_P[e^{ℓ(x,y;θ)/τ*}], (4)

where τ* satisfies E_P[(e^{ℓ(x,y;θ)/τ*} / E_P[e^{ℓ(x,y;θ)/τ*}]) (ℓ(x,y;θ)/τ* - log E_P[e^{ℓ(x,y;θ)/τ*}])] = η, and q*(x, y) is the optimal density in Q. The min-max problem in Equ.3 is then equivalent to

min_{θ∈Θ, τ>0} τ log E_P[e^{ℓ(x,y;θ)/τ}] + ητ. (5)

We refer to Equ.5 as KL-DRO. Unfortunately, the convexity condition of KL-DRO does not hold for overparameterized neural networks, so applying it may fail to generalize under the distribution shifts of real-world scenarios. As illustrated in Figure 2, where we compare the training statistics with the baseline, KL-DRO is highly unstable and attains inferior results. Therefore, instead of following KL-DRO and directly using the inner maximum, we reformulate Equ.3 as

min_{θ∈Θ} max_{D_KL(Q||P)≤η} E_{(x,y)∈P}[(q(x, y)/p(x, y)) ℓ(x, y; θ)] = min_{θ∈Θ} E_{(x,y)∈P}[(e^{ℓ(x,y;θ)/τ*} / E_P[e^{ℓ(x,y;θ)/τ*}]) ℓ(x, y; θ)], (6)

where the left-hand side applies the change-of-measure technique to rewrite the inner expectation under P, and the right-hand side replaces the inner maximum by its closed-form solution q*(x, y) from Equ.4.
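To make the dual form in Equ.5 concrete, the sketch below evaluates τ log E_P[e^{ℓ/τ}] + ητ on a grid of τ values for a fixed set of per-sample losses and returns the minimizer. This is an illustrative one-dimensional search, not the paper's training procedure; the function name and the grid are our choices.

```python
import numpy as np

def kl_dro_dual(losses, eta, taus=None):
    """Grid-search the KL-DRO dual objective tau*log E_P[exp(l/tau)] + eta*tau
    (Equ.5) over tau, treating `losses` as the empirical values l(x, y; theta).
    Returns (best_tau, best_value)."""
    losses = np.asarray(losses, dtype=float)
    if taus is None:
        taus = np.linspace(0.05, 50.0, 2000)
    vals = [t * np.log(np.mean(np.exp(losses / t))) + eta * t for t in taus]
    i = int(np.argmin(vals))
    return float(taus[i]), float(vals[i])
```

By Jensen's inequality the first term is never below the average loss, so the dual value always upper-bounds the ERM risk, and it grows with the radius η.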
Note that both the value of τ* and the normalizer E_P[e^{ℓ(x,y;θ)/τ*}] depend on the expectation of losses over all training data, which is intractable at each mini-batch optimization step. For simplicity, we can treat τ* as a hyperparameter and take the average over each mini-batch as a preliminary estimator of the normalizer. This gives the formulation of vanilla Unit-DRO:

L_Unit-DRO(θ, τ*) = min_{θ∈Θ} (1/N) Σ_{i=1}^{N} [e^{ℓ(x_i,y_i;θ)/τ*} / ((1/N) Σ_{j=1}^{N} e^{ℓ(x_j,y_j;θ)/τ*})] ℓ(x_i, y_i; θ),

where N is the batch size. However, vanilla Unit-DRO does not work well in practice, and we address the following two problems to obtain a robust Unit-DRO solution. Multi-Step τ*. The first problem is that a constant hyperparameter τ* is usually suboptimal for the whole learning process. As shown in Figure 3, we visualize the densities of the weight e^{ℓ(x,y;θ)/τ*}/E_P[e^{ℓ(x,y;θ)/τ*}] at different optimization steps when using a constant τ* (please refer to Section 4.3 for the detailed setups). Specifically, we find that: 1) a small τ* leads to high variance in the weight distribution and is sensitive to outliers; 2) a large τ* is so conservative that the weights of all samples are almost identical, and the performance is thus similar to the baseline method. To tackle this problem, we propose a multi-step solution for the hyperparameter τ*, which declines with the training/optimization steps. The intuition behind the multi-step τ* is that at the beginning, we use a large τ*, so the model assigns almost uniform weights to all samples and does not yet need to identify which samples are more important. As training proceeds, we decrease the value of τ* and increase the weights of important (i.e., hard-to-distinguish) samples. Weight Queue M. The second problem is that the expectation over each mini-batch may not be a good estimator of the normalizer E_P[e^{ℓ(x,y;θ)/τ*}].
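As a concrete sketch, the vanilla Unit-DRO mini-batch objective takes only a few lines of NumPy (the function name is ours; `losses` stands for the per-sample values ℓ(x_i, y_i; θ) within one mini-batch):

```python
import numpy as np

def unit_dro_batch_loss(losses, tau):
    """Vanilla Unit-DRO: reweight each per-sample loss by
    exp(l_i/tau) / mean_j exp(l_j/tau). The weights average to 1 over the
    batch, so hard samples are up-weighted and easy ones down-weighted."""
    losses = np.asarray(losses, dtype=float)
    w = np.exp(losses / tau)
    w = w / w.mean()  # batch estimate of the normalizer E_P[exp(l/tau)]
    return float((w * losses).mean()), w
```

Note that with a large τ the weights approach 1 and the objective reduces to the ERM baseline, while a small τ pushes it toward the worst-case sample.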
To address this problem, we introduce a queue M = {w_i := e^{ℓ(x_i,y_i;θ)/τ*}}_{i=1}^{M} to maintain the historical weights, where the queue size M depends on the batch size N and determines how well M estimates E_P[e^{ℓ(x,y;θ)/τ*}] (please see more empirical analysis in Section 4.3). Finally, we have the objective function of Unit-DRO:

L_Unit-DRO(θ, τ*(t)) = min_{θ∈Θ} (1/N) Σ_{i=1}^{N} [e^{ℓ(x_i,y_i;θ)/τ*(t)} / ((1/|M|) Σ_{w_j∈M} w_j)] ℓ(x_i, y_i; θ), (7)

where t is the index of the training step and τ*(t) is a piecewise function of t. As shown in Figure 2, the training statistics of Unit-DRO are more stable than those of KL-DRO, and Unit-DRO also outperforms baseline methods by a large margin. We depict the online optimization algorithm in Appendix Algorithm 1. Note that in Algorithm 1 of Group-DRO Sagawa et al. (2019), all samples in the same domain share the same weight, which can be seen as a special case of the proposed Unit-DRO. Compared with Group-DRO, one of its key improvements is an implementation trick: the group weights are updated by exponential gradient ascent instead of picking the group with the worst average loss at each step. Group-DRO shows that this improvement helps training stability and model convergence but cannot explain why it works. In contrast, the adaptive weights used in this paper are interpretable: the optimal distribution of DRO with a KL constraint is the empirical distribution multiplied by the exponential term e^{ℓ(x,y;θ)/τ*}.
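The queue-based normalizer can be sketched as follows, assuming a simple bounded deque of historical weights (the class name and interface are illustrative; the paper's defaults are a queue size M = 800 with batch size N = 80):

```python
import numpy as np
from collections import deque

class UnitDRO:
    """Unit-DRO with a weight queue M: the normalizer E_P[exp(l/tau)] is
    estimated from the last M historical weights rather than from the
    current mini-batch alone."""

    def __init__(self, queue_size=800):
        self.queue = deque(maxlen=queue_size)  # old weights drop off the front

    def batch_loss(self, losses, tau):
        losses = np.asarray(losses, dtype=float)
        w = np.exp(losses / tau)
        self.queue.extend(w.tolist())            # update the weight queue
        normalizer = float(np.mean(self.queue))  # (1/|M|) * sum of stored weights
        return float(np.mean(w / normalizer * losses))
```

Because the queue mixes current and historical weights, a batch whose losses exceed the recent history is up-weighted relative to a purely in-batch normalization.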

4. EXPERIMENTS

In this section, we evaluate the proposed Unit-DRO and answer the following questions: without demographics, how does Unit-DRO perform compared to other CD-ReID and DG-ReID methods? What is the influence of different hyperparameters in Unit-DRO? Why does Unit-DRO improve over the baseline? To answer the first question, we compare Unit-DRO with baseline methods on both DG-ReID and CD-ReID benchmarks. We then perform detailed ablation studies for the second question, and conduct comprehensive analyses for the third, e.g., error set analysis, feature visualization, and domain divergence measurement.

4.1. IMPLEMENTATION DETAILS

We use a backbone network with a width multiplier of 1.4, initialized with weights pre-trained on ImageNet Deng et al. (2009). All training images are resized to 256 × 128 pixels and the batch size is N = 80. We use the SGD optimizer with momentum 0.9 and weight decay 5e-4. The learning rate starts from 0.01 and decays by 0.1× at epochs 40 and 70, with a warmup schedule for the first 10 epochs. We initialize the multi-step τ* at 100, and decay it to 20 and 5 at epochs 40 and 70, respectively. The default size of the weight queue is M = 800. Training runs for 100 epochs. We use label smoothing with parameter 0.1, and the triplet-loss margin is 0.3. All experiments are conducted on a machine with an i7-8700K CPU, 32GB RAM, and four GeForce RTX 2080 Ti GPUs.
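The two τ* schedules used in this paper can be sketched as follows (the piecewise 100 → 20 → 5 decay at epochs 40 and 70 matches the setup above; the linear alternative is the one discussed in Section 4.3; function names are ours):

```python
def multi_step_tau(t, milestones=((0, 100.0), (40, 20.0), (70, 5.0))):
    """Piecewise-constant tau*(t): start large (near-uniform weights) and
    decay at fixed epochs, following the 100 -> 20 -> 5 schedule."""
    tau = milestones[0][1]
    for epoch, value in milestones:
        if t >= epoch:
            tau = value
    return tau

def linear_decay_tau(t, T, tau0=100.0):
    """Alternative linear schedule tau*(t) = tau0 * (1 - t/(T+1)),
    which avoids a grid search over the decay milestones."""
    return tau0 * (1.0 - t / (T + 1))
```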

4.2. RESULTS

DG-ReID. Considering that most existing methods cannot work without demographic information, we compare the proposed Unit-DRO with other methods under a typical DG-ReID setting, i.e., all other methods use demographics except Unit-DRO. As shown in Table 3, the proposed Unit-DRO significantly outperforms Group-DRO, and it also achieves comparable or better performance than recent state-of-the-art DG-ReID methods that use demographics. We hope that the proposed Unit-DRO can serve as a strong baseline for both DG-ReID and DGWD-ReID. We observe that current DG-ReID methods all apply a utopian model selection strategy: they report their best result by carefully checking the test performance after each training epoch, so the number of training epochs corresponding to the best performance varies across test datasets. We argue that such a model selection strategy is inadvisable: under the DG setting, access to test domain data for model selection should be restricted. Thus, we use the last checkpoint and report its results as the final performance over all test datasets. The results in Table 3 show that, without the utopian model selection strategy, there is always a certain degree of performance decline for existing DG-ReID methods, which further highlights the advantages of the proposed Unit-DRO. Other DG-ReID Protocols. In addition to the most popular evaluation protocol used in Gulrajani & Lopez-Paz (2020) and Tab. 12, we also compare Unit-DRO with other methods under several additional protocols (summarized in Tab. 2). Besides, due to privacy issues, the Duke dataset is not appropriate for use. We thus conduct experiments under different evaluation protocols with Duke removed from the source/target domains, and Table 4 shows that the performance margin between Unit-DRO and the other baselines becomes even larger. Because these protocols are used in different DG-ReID papers, we choose the SOTA method under each protocol for comparison.
The privacy issue also motivates our DGWD setting, where images are randomly sampled and it is harder to identify which domain each image comes from. Choosing the best τ*. The selection of the multi-step τ* is not complicated: for the first few epochs, we simply set τ* to a large value such as 100, and at epochs 40 and 70 we decay it to smaller values. When |M| = 800, the performance is not very sensitive to different choices of τ*. Sensitivity. As shown in line 3, even with a constant τ* and without a memory bank, Unit-DRO beats both ERM and KL-DRO by a large margin, so the method is not overly sensitive. To avoid a grid search over hyperparameters, we further propose an alternative approach (linear-decay τ*), which models τ* as a decreasing function of the training step. Specifically, with τ* = 100(1 - t/(T+1)), this approach attains 65.0 Rank-1 accuracy and 72.3 mAP, which is comparable to the grid-search result while being simpler. Sample Weights. Considering that the proposed Unit-DRO upweights and downweights different samples, we visualize the distribution of sample weights to better understand the influence of different components. Specifically, during training, we save the mean and variance of the sample weights every 1k iterations/steps. We assume these weights follow a Gaussian distribution N(µ, δ) and plot diagrams based on the mean µ and variance δ; the x-coordinate of these diagrams spans [µ - 3δ, µ + 3δ] rather than the real values of the weights. Based on the loss value of each sample, we calculate the weights under the following two settings: 1) sample weights without the weight queue, in which case the weights are normalized within their batches, so the mean of all distributions here is 1 (as shown in Figure 3, this setting has already been discussed in the former sections); 2) sample weights with different lengths of the weight queue |M|.
In Figure 4, we show the distribution of sample weights at 1k, 5k, 10k, and 20k training steps to indicate how the weight distribution changes during training. Intuitively, we need a large |M| to better estimate E_P[e^{ℓ(x,y;θ)/τ*}]. However, when |M| becomes too large, the estimation turns inaccurate. Consider an extreme case, |M| = T - 1: the queue then contains essentially all training data, so estimating E_P[e^{ℓ(x,y;θ)/τ*}] at step T with such a queue is catastrophic, because a large queue holds too many stale weights that no longer match the current model. Figure 4 depicts this phenomenon, where the distribution with a larger |M| has a smaller µ. See Appendix E.2 for more visualization results and discussions of the distribution diagrams for the multi-step τ*. Visualization using t-SNE. We compare the proposed Unit-DRO with MetaBIN and DualNorm through t-SNE visualization. We observe a distinct division of different domains in Figure 5a, which indicates that DualNorm learns a domain-specific feature space. MetaBIN and the proposed Unit-DRO tackle this problem well, and the overlaps between different domains in Figure 5b and Figure 5c are more prominent. With t-SNE visualization, we see that Unit-DRO can learn domain-invariant representations while keeping discriminative capability for ReID tasks. However, MetaBIN follows a meta-learning pipeline and requires extra expensive demographics, whereas the proposed Unit-DRO is much simpler and uses no demographics at all. We provide more visualization results and analysis of the discriminative capability in Section E.3 of the Appendix. Domain Divergence. We use the MMD distance Tolstikhin et al. (2016) and the A-distance Long et al. (2015) as measures of domain discrepancy Ben-David et al. (2010).
As shown in Table 7, we find that Unit-DRO learns comparable or even more invariant representations than MetaBIN, and outperforms DualNorm by a large margin. We also study the correlation between the weights for each dataset and the MMD distance. For each dataset, we calculate the sum of MMD distances between it and all other datasets, as well as the average weight assigned by the final model. Table 8 shows that for a tough dataset (e.g., CUHK02) with a large divergence from the other datasets, Unit-DRO assigns a relatively higher average weight. This shows that even without demographics, Unit-DRO can find meaningful subgroups and upweight them. We also see that Unit-DRO upweights samples in CUHK-SYSU, which has a relatively small MMD distance to the other datasets; this is because generalization ability depends not only on domain divergence but also on other factors. We discuss these factors and perform error set analysis in Section E.6 of the Appendix, and we plot the MMD distance for every pair of datasets with further analysis in Section E.5 of the Appendix.

5. CONCLUSION AND FUTURE WORK

Traditional DG-ReID methods fail to work in cases where domain information, such as camera labels or other demographics, is not available due to security and privacy issues. To this end, we introduce DGWD-ReID, a more general setting that requires the model to learn domain-invariant representations without demographics. To address this problem, we propose Unit-DRO, a simple yet effective algorithm that substantially improves generalization performance without requiring expensive demographics during training. Extensive experimental results demonstrate that the proposed Unit-DRO not only achieves comparable or better performance than other DG-ReID methods that use demographics, but also attains superior generalization capability in general domain generalization applications. Different from typical classification datasets, where domains are usually partitioned by image style, ReID datasets have more fine-grained variation factors, e.g., image styles, camera perspective changes within one dataset, and different shooting conditions at different times on the same camera. We believe that simply designating each dataset as a domain is suboptimal, and better domain inference methods that account for the above variation factors will be the subject of future study.

A. MORE DISCUSSIONS

1. Privacy risks of using demographics in ReID tasks. ReID is research about people and thus naturally carries higher privacy requirements. Utilizing demographic information raises the following two concerns. 1) Although the demographic data used in current ReID research is simple, e.g., the different campuses in Market1501 and CUHK, the demographic information in practical ReID deployments is much richer and can be obtained easily from inherent physical properties of cameras (such as their MAC addresses and geographical locations).
Thus, utilizing finer and richer physical properties of cameras increases the risk of leaking pedestrians' private information. 2) The social relationships between different people may also be exposed by demographic information, which can be even more sensitive than geographical information. For example, if the images of two person IDs are marked with the same camera, their social relationship may be easily inferred by counting their co-occurrence frequency across all cameras. In summary, the use of demographic data in real-world ReID applications carries evident risks to personal privacy. We therefore propose the DGWD-ReID setting, where only a large-scale gallery of pedestrian images, without any demographic data such as camera IDs, can be used for training ReID models. Under DGWD-ReID, researchers are forced to exploit invariant features from the training data itself rather than resort to side information, i.e., demographic data, thereby reducing the risks of privacy leakage in ReID applications. The annotation of subject IDs is also expensive and poses further privacy risks; we will tackle the challenging problem of unsupervised DG-ReID in future work. 2. The importance of the DGWD setting in ML. Learning generalizable models without domain labels is becoming an important concern in the ML community Creager et al. (2021); Kim & Lee (2021). Our algorithm can also be applied to generalized DG tasks where environment labels are unavailable at training time, either because they are difficult to obtain or due to privacy. 3. Finally, a previous study Srivastava et al. (2020) shows that it is still unclear how to optimally partition the training images into domains so as to most benefit generalization capability, which indicates that directly using demographic data as domain labels may be inferior for learning domain-invariant representations.
Thus, it is also a promising topic for DGWD-ReID, as discussed in the conclusion. In the fairness community, existing work has found that a rational methodology can discover domains that are maximally informative for downstream invariance learners. Domain IDs and camera IDs make sense for humans; however, the similarities between different domains/camera views vary greatly. How to find more optimal domain partitions for ReID tasks is still an open problem.

Algorithm 1 (Unit-DRO online optimization). At each training step t:
L = (1/N) Σ_{i=1}^{N} [e^{ℓ(x_i,y_i;θ_{t-1})/τ*(t)} / ((1/|M|) Σ_{w_j∈M} w_j)] ℓ(x_i, y_i; θ_{t-1}) // calculate the reweighted loss
M_t = M_{t-1}[N:] ∪ {e^{ℓ(x_i,y_i;θ_{t-1})/τ*(t)}}_{i=1}^{N} // update the weight queue
θ_t ← SGD(L, θ_{t-1}, η) // update model parameters

Figure 11d and Figure 11b show that Unit-DRO matches well on the i-LIDS and PRID datasets. However, we observe an interesting phenomenon, termed "inter-identity clusters": probes and galleries of different identities come together in some clusters. Such clusters are frequently seen on the VIPeR and GRID datasets (Figure 11a and Figure 11b), which reveals why Unit-DRO performs much more poorly on these two datasets.

E.4 IMPLEMENTATION OF DOMAIN DIVERGENCE MEASUREMENT

In general, the MMD distance Tolstikhin et al. (2016) represents distances between distributions as distances between the mean embeddings of their features. Following the MMFA model Lin et al., we measure the overall divergence of the four test datasets {d_1, ..., d_4} by the average pairwise A-distance:

A(U) = (1/6) Σ_{i=1}^{4} Σ_{j=i+1}^{4} A(d_i, d_j).
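The averaging above can be sketched in a few lines (the helper name is ours; `dist` stands for any symmetric divergence such as the A-distance or MMD):

```python
from itertools import combinations

def average_pairwise_distance(dist, datasets):
    """Average a symmetric pairwise divergence over all C(n, 2) dataset
    pairs, as in A(U) = (1/6) * sum_{i<j} A(d_i, d_j) for four datasets."""
    pairs = list(combinations(datasets, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```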

E.5 ADDITIONAL DOMAIN DIVERGENCE MEASUREMENT RESULTS

The MMD distance between every dataset pair over all datasets is plotted in Figure 13a; Figure 13b shows the distances among the five training datasets, and Figure 13c those among the four test datasets. For the training datasets, we find that CUHK02 maintains a large divergence from almost all other domains. Namely, CUHK02 is more likely to be an out-of-distribution dataset and matters more for generalization capability; hence, Unit-DRO assigns relatively higher weights to samples in CUHK02. Among the test datasets, GRID maintains the largest MMD distance, which partly explains why Unit-DRO performs badly on GRID. However, domain divergence is not the only factor that affects generalization performance: Figure 13c shows that PRID has a larger domain divergence than VIPeR, yet Unit-DRO performs better on PRID than on VIPeR. We explore the underlying reasons in Section E.6.

E.6 ERROR SET ANALYSIS

We select some successfully retrieved pairs [2] and failure cases in Figure 14, plotting query images at the top and the corresponding gallery images at the bottom. Figure 14a shows that the query and gallery images in the failure case have a relatively large view change (front and back shooting). In contrast, the successfully retrieved query-gallery pairs in Figure 14b have almost the same camera view. This result shows that Unit-DRO cannot fully overcome the challenges brought by camera view changes; advanced structures from supervised ReID methods could be leveraged to reduce the sensitivity of Unit-DRO to camera perspective. Figure 14c and Figure 14b show that the camera perspective changes between the query and gallery sets of the PRID dataset are small, which is one reason why Unit-DRO performs much better on PRID than on GRID [3]. The error set analysis also explains the phenomenon mentioned in Section E.5, that Unit-DRO performs better on a dataset with relatively high domain divergence (PRID) than on one with low domain divergence (VIPeR): Figure 14e shows that query-gallery pairs in VIPeR always exhibit camera view changes of more than 90°, which are harder to identify than those in PRID. Finally, the i-LIDS dataset has the lowest MMD distances to the other datasets, and the camera perspective changes between its query-gallery pairs are always small; these properties enable Unit-DRO to achieve a Rank-1 accuracy of 80.7 on i-LIDS. In summary, the domain style divergence, the intrinsic characteristics of the datasets (camera perspective changes), and the model capacity [4] all affect the performance of DG-ReID and DGWD-ReID methods.



In contrast to the word "Group" in Group-DRO Sagawa et al. (2019), which assigns weights to domains, the word "Unit" in our proposed Unit-DRO assigns weights to samples.
¹ We call a query-gallery image pair a "successfully retrieved pair" if the distance between the query image and its corresponding gallery image is the smallest among all gallery images; all other pairs are called failure cases.
² Another reason is the domain divergence, as discussed in Section E.5.
³ Larger backbones and advanced learning paradigms always attain better generalization capability.



Figure 1: An illustration of different person re-identification settings. (a) Supervised person ReID. (b) CD-ReID and UDA-ReID. (c) DG-ReID. (d) DGWD-ReID.

Figure 2: Training statistics.

\[
\min_{\theta\in\Theta}\ \max_{Q:\,\mathrm{KL}(Q\|P)\leq\eta}\ \mathbb{E}_{(x,y)\sim P}\!\left[\frac{q(x,y)}{p(x,y)}\,\ell(x,y;\theta)\right]
= \min_{\theta\in\Theta}\ \mathbb{E}_{(x,y)\sim P}\!\left[\frac{e^{\ell(x,y;\theta)/\tau^{*}}}{\mathbb{E}_{P}\!\left[e^{\ell(x,y;\theta)/\tau^{*}}\right]}\,\ell(x,y;\theta)\right].
\]
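In practice, the right-hand side amounts to reweighting each sample by a softmax-style tilt of its loss. A minimal sketch, assuming a fixed temperature `tau` in place of the optimal τ* and using hypothetical variable names:

```python
import numpy as np

def unit_dro_weights(losses, tau):
    """Per-sample weights e^{l/tau} / E_P[e^{l/tau}] from the reformulated
    KL-DRO objective (sketch; tau stands in for the optimal tau*)."""
    z = losses / tau
    z = z - z.max()          # subtract the max to stabilize the exponentials
    w = np.exp(z)
    return w / w.mean()      # normalize so the weights average to 1

losses = np.array([0.2, 0.5, 2.0, 4.0])   # hard samples have large loss
w = unit_dro_weights(losses, tau=1.0)
reweighted_risk = np.mean(w * losses)     # estimate of the tilted objective
# Hard samples are up-weighted and easy samples down-weighted,
# so the reweighted risk exceeds the plain empirical risk.
```

Because the weights increase with the loss, the reweighted risk upper-bounds the ordinary empirical risk, which is exactly the "hard samples up-weighted, other samples down-weighted" behavior described in the abstract.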

Figure 3: Visualizing the distribution of sample weight at 1k, 5k, 10k, 20k steps, respectively (from left to right). The horizontal axis represents the weight.

(1) one-to-multiple setting Jin et al. (2020); (2) multiple-to-one setting Dai et al. (2021b); (3) multiple-to-one setting Zhao et al. (2021); and (4) multiple-to-multiple setting Jin et al. (2020). We summarize the differences between these protocols in Tab. 2. As shown in Tab. 11, Unit-DRO outperforms other methods by a clear margin in both average mAP and Rank-1 accuracy. Due to the page limit, please also see the results of protocols (2)∼(4) in the Appendix, which demonstrate the robustness of the proposed Unit-DRO across different evaluation protocols.

Figure 4: Visualizing the distribution of the sample weight at 1k, 5k, 10k, 20k steps, respectively (from left to right). The horizontal axis is the value of the weight and the vertical axis is the density.

Figure 5: Visualization of the embeddings on training and test datasets. Query and gallery samples of these unseen datasets are shown using different marker types. Best viewed in color.

Figure 6: Samples on ReID datasets.

Figure 7: An illustration of DGWD-ReID, which reweights training instances without demographic information.

Figure 14: Error set analysis. (a) Failure cases in the GRID dataset. (b) Successfully retrieved pairs in the GRID dataset. (c) Failure cases in the PRID dataset. (d) Successfully retrieved pairs in the PRID dataset. (e) Failure cases in the VIPeR dataset. (f) Successfully retrieved pairs in the VIPeR dataset. (g) Failure cases in the i-LIDS dataset. (h) Successfully retrieved pairs in the i-LIDS dataset.

Though the existing strong baseline Liao & Shao (2020), normalization Jin et al. (2020), and augmentation methods Yan et al. (2020) require no demographics, the current research trend of DG-ReID still relies on demographic information as extra supervision.

Specifically, the uncertainty set Q encodes the possible test distributions on which we want our model to perform well. If Q contains G, the DRO objective upper-bounds the expected risk under G. We first construct the uncertainty set Q as the KL-divergence ball around the empirical distribution P: given the KL upper bound (radius) η, the uncertainty set is Q = {Q : KL(Q||P) ≤ η}. The min-max problem in Equ. 2 can then be reformulated by applying the change-of-measure technique.
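For intuition, membership in the KL ball can be checked directly for a discrete reweighting of n training samples, where P is the empirical (uniform) distribution. The weights, sample count, and radius below are made-up numbers for illustration:

```python
import numpy as np

def kl_to_empirical(q):
    """KL(Q || P) for a discrete reweighting q of n samples, where P is
    the empirical (uniform) distribution p_i = 1/n."""
    n = len(q)
    p = np.full(n, 1.0 / n)
    mask = q > 0                 # the convention 0 * log 0 = 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

n, eta = 4, 0.5
uniform = np.full(n, 1.0 / n)          # P itself: KL = 0, always in the ball
tilted = np.array([0.1, 0.1, 0.2, 0.6])  # up-weights one hard sample
in_ball_uniform = kl_to_empirical(uniform) <= eta
in_ball_tilted = kl_to_empirical(tilted) <= eta
```

Shrinking η toward 0 collapses Q to the empirical distribution (ordinary ERM), while growing η admits increasingly adversarial reweightings.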

.1 EXPERIMENTAL SETUP

Datasets. Following Song et al. (2019); Jia et al. (2019); Zhang et al. (2021b), we evaluate Unit-DRO with multiple data sources (MS), where the source domains cover five large-scale ReID datasets: CUHK02 Li & Wang (2013), CUHK03 Li et al. (2014), Market1501 Zheng et al. (2015), DukeMTMC-ReID Zheng et al. (2017), and CUHK-SYSU PersonSearch Xiao et al. (2016). The unseen test domains are VIPeR Gray et al. (2007), PRID Hirzer et al. (2011), QMUL GRID Liu et al. (2012), and i-LIDS Wei-Shi et al. (2009). We include a detailed illustration of the datasets and evaluation protocols in Appendix D.1. In the CD setting, we employ Market1501 and DukeMTMC-ReID, alternately using the two datasets as source and target domains. Summary of different DG-ReID protocols.

Comparison with recent state-of-the-art DG-ReID methods. † means the results of the last checkpoint are reported.

Results for general DG tasks.

Comparison with SOTA DG-ReID methods under different evaluation protocols, where Duke is removed from both source and target domains. The best accuracy is highlighted in bold.

Ablation studies on different Unit-DRO components.

Divergence measurement on unseen datasets (U), training datasets (T), and all datasets (A).

Average weight and one-to-all MMD distance for training datasets.

Domain generalization. Domain/out-of-distribution generalization Muandet et al. (2013); Zhang et al. (2021a) aims to learn a model that can extrapolate well in unseen environments. Representative methods like Invariant Risk Minimization (IRM) Arjovsky et al. (2020) and its variants Ahuja et al. (2020) have recently been proposed to tackle this challenge. IRM centers on extracting data representations that lead to invariant prediction across environments in a multi-environment setting. The main difference here is that we propose to learn invariant representations without demographics.

Training Datasets Statistics. For cross-domain evaluation, we alternately use the Market1501 and DukeMTMC-ReID datasets as source/target domains. For example, "Market-Duke" indicates that the labeled source domain is Market1501 and DukeMTMC-ReID is the unseen target domain. Since the style variation within a single dataset is relatively small, previous DG-ReID methods must utilize fine-grained demographics, e.g., camera labels Zhang et al. (2021b), or carefully tune all hyperparameters Choi et al. (2021). As in the DG-ReID setting, Unit-DRO does not use any demographic information here. As shown in Table 12, the proposed Unit-DRO even achieves slightly better performance than recent state-of-the-art CD-ReID methods that use demographics, suggesting the good cross-domain generalizability of Unit-DRO. Figure 10 shows the t-SNE results on the five training datasets and Figure 12 shows the t-SNE results on the Market-Duke benchmark. All of these results demonstrate a common pattern: DualNorm Jia et al. (2019) retains large domain divergences and its embeddings are far from "domain invariant"; MetaBIN Choi et al. (2021) utilizes a complex framework and expensive demographics, which enables it to reduce domain divergences; and Unit-DRO achieves comparable or even better results than MetaBIN Choi et al. (2021) in a simpler and cheaper paradigm. Considering discriminative capability, Figure 11 visualizes the probe and gallery samples on the four test datasets individually. The utopian discrimination result is that every query-gallery pair has the closest intra-identity distance and a relatively large inter-identity distance.

Comparison with recent DG-ReID methods using the protocol (1).

Comparison with recent state-of-the-art CD-ReID methods.

Comparisons against state-of-the-art DG methods for person ReID on evaluation protocols (ii) and (iii). Protocols (ii) and (iii) are both multiple-to-one settings, used in RaMoE Dai et al. (2021b) and M3L Zhao et al. (2021) respectively. Unit-DRO outperforms both of them in these two settings.

Comparisons against state-of-the-art DG methods for person ReID on evaluation protocol (iv). Unit-DRO outperforms RaMoE Dai et al. (2021b) in protocol (iv) by a large margin.


The training datasets are summarized in Table 9 and the test datasets are summarized in Table 10. All the assets (i.e., datasets and the codes for baselines) we use carry an MIT license containing a copyright notice, and this permission notice shall be included in all copies or substantial portions of the software.

D.2 EVALUATION PROTOCOLS

GRID Liu et al. (2012) contains 250 probe images and 250 true-match images of the probes in the gallery. Besides, there are 775 additional images that do not belong to any of the probes. We randomly take out 125 probe images; the remaining 125 probe images and the 1025 (775 + 250) images in the gallery are used for testing.

i-LIDS Wei-Shi et al. (2009) has two versions, images and sequences; the former is used in our experiments. It involves 300 different pedestrian pairs observed across two disjoint camera views 1 and 2 in public open space. We randomly select 60 pedestrian pairs, and two images per pair are randomly selected as the probe image and gallery image respectively.

PRID2011 Hirzer et al. (2011) has single-shot and multi-shot versions; we use the former in our experiments. The single-shot version has two camera views A and B, which capture 385 and 749 pedestrians respectively; only 200 pedestrians appear in both views. During evaluation, 100 randomly selected identities present in both views are chosen, the remaining 100 identities in view A constitute the probe set, and the remaining 649 identities in view B constitute the gallery set.

VIPeR Gray et al. (2007) contains 632 pedestrian image pairs. Each pair contains two images of the same individual seen from different camera views 1 and 2, taken from arbitrary viewpoints under varying illumination conditions. To compare with other methods, we randomly select half of these identities from camera view 1 as probe images and their matched images in view 2 as gallery images. We follow the single-shot setting.
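The GRID split described above can be sketched in a few lines. The function and identifiers below are hypothetical and only illustrate the counts (125 remaining probes, 1025 gallery images):

```python
import random

def grid_split(probe_ids, extra_gallery, seed=0):
    """One random GRID test split: 250 probe/true-match identity pairs
    plus 775 distractor images that match no probe.

    probe_ids: 250 identity IDs, each with a probe and a true match.
    extra_gallery: 775 distractor image IDs.
    """
    rng = random.Random(seed)
    held_out = set(rng.sample(probe_ids, 125))      # randomly removed probes
    probes = [i for i in probe_ids if i not in held_out]
    # The gallery keeps all 250 true matches plus the 775 distractors.
    gallery = list(probe_ids) + list(extra_gallery)
    return probes, gallery

probes, gallery = grid_split(list(range(250)),
                             [f"d{i}" for i in range(775)])
# len(probes) == 125 and len(gallery) == 1025 (250 + 775)
```

The other datasets follow the same pattern with their own identity counts and view constraints.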
The average rank-k (R-k) accuracy and mean Average Precision (mAP) over 10 random splits are reported based on the evaluation protocol.
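Under the single-shot protocols above, rank-k accuracy and mAP for one split can be computed from a query-gallery distance matrix. This is an illustrative sketch (the function name and toy distances are ours) assuming each query has exactly one true match in the gallery:

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids, k=1):
    """Rank-k accuracy and mAP from a (num_query, num_gallery) distance
    matrix in the single-shot setting (one true match per query)."""
    order = np.argsort(dist, axis=1)                # gallery sorted per query
    matches = np.asarray(g_ids)[order] == np.asarray(q_ids)[:, None]
    ranks = matches.argmax(axis=1)                  # position of the true match
    rank_k = float(np.mean(ranks < k))
    # With a single true match, average precision reduces to 1 / (rank + 1).
    mAP = float(np.mean(1.0 / (ranks + 1)))
    return rank_k, mAP

dist = np.array([[0.1, 0.9, 0.8],    # query 0: true match ranked first
                 [0.7, 0.6, 0.2]])   # query 1: true match ranked second
r1, mAP = cmc_map(dist, q_ids=[0, 1], g_ids=[0, 1, 2], k=1)
# r1 = 0.5 (one of two queries hit at rank 1), mAP = (1.0 + 0.5) / 2 = 0.75
```

Averaging these two numbers over the 10 random splits yields the reported R-k and mAP.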

