LEARNING A MAX-MARGIN CLASSIFIER FOR CROSS-DOMAIN SENTIMENT ANALYSIS

Anonymous

Abstract

Sentiment analysis is a costly yet necessary task for enterprises that want to study the opinions of their customers, improve their products and services, and determine optimal marketing strategies. Because products and services span a wide range of domains, cross-domain sentiment analysis methods have received significant attention in recent years. These methods mitigate the domain gap between different applications by training cross-domain generalizable classifiers, which relax the need for individual data annotation in each domain. Most existing methods focus on learning domain-agnostic representations that are invariant with respect to both the source and the target domains. As a result, a classifier trained using annotated data in a source domain will generalize well in a related target domain. In this work, we introduce a new domain adaptation method that induces large margins between different classes in an embedding space, based on the notion of a prototypical distribution. This embedding space is trained to be domain-agnostic by matching the data distributions across the domains. Large margins in the source domain help to reduce the effect of "domain shift" on the performance of a trained classifier in the target domain. Theoretical and empirical analyses are provided to demonstrate that the method is effective.

1. INTRODUCTION

The main goal in sentiment classification is to automatically predict the polarity of user feedback, e.g., Amazon customer reviews. The popularity of online shopping and reviews, fueled further by the recent pandemic, provides a valuable resource for businesses to study the behavior and preferences of consumers and to align their products and services with market demand. A major challenge for automatic sentiment analysis is that polarity is expressed using completely dissimilar terms and phrases in different domains. For example, while terms such as "fascinating" and "boring" are used to describe books, terms such as "tasty" and "stale" are used to describe food products. As a result of this discrepancy, a model that is trained for a particular domain may not generalize well in other domains, referred to as the problem of "domain gap" (Wei et al., 2018). Since generating annotated training data for all domains is expensive and time-consuming, cross-domain sentiment analysis has gained significant attention recently (Saito et al., 2018; Li et al., 2017; Peng et al., 2018; He et al., 2018; Li et al., 2018; Barnes et al., 2018; Sarma et al., 2019; Li et al., 2019; Guo et al., 2020; Xi et al., 2020; Dai et al., 2020; Lin et al., 2020). The goal in cross-domain sentiment classification is to relax the need for data annotation by transferring knowledge from a domain with annotated data to domains with unannotated data. This problem has been studied more broadly in the "domain adaptation" literature. A common approach for domain adaptation is to map data points from the two domains into a shared embedding space in order to align the data distributions (Redko & Sebban, 2017). Once the embedding space is domain-agnostic, a classifier trained using the source domain annotated data will generalize in the target domain.
In the sentiment analysis problem, this means that the polarity of natural language can be expressed independently of the domain in the embedding space. We can model this embedding space as the output of a shared deep encoder which is trained to align the distributions of both domains at its output. This training procedure has been implemented using adversarial learning (Pei et al., 2018; Long et al., 2018; Li et al., 2019; Dai et al., 2020), which aligns distributions indirectly, or using loss functions that are designed to align the two distributions directly (Peng et al., 2018; Barnes et al., 2018; Kang et al., 2019; Guo et al., 2020; Xi et al., 2020; Lin et al., 2020). Contributions: our main contribution is a new cross-domain sentiment analysis algorithm for model adaptation that introduces large margins between classes in the source domain. Our idea is based on learning a prototypical distribution for the source domain in a cross-domain embedding space which is trained to be domain-agnostic. We model this distribution as a Gaussian mixture model (GMM). We estimate the parameters of the prototypical distribution using the subset of source samples for which the classifier is confident about its predictions. As a result, larger margins between classes are introduced in the prototypical distribution, which helps reduce the domain gap. We then use this prototypical distribution to align the source and the target distributions by minimizing the Sliced Wasserstein Distance (SWD) (Lee et al., 2019). We draw confident random samples from this distribution and enforce that the target distribution in the embedding space matches this prototypical distribution, in addition to the source distribution. We provide a theoretical proof demonstrating that our method minimizes an upper bound for the target domain expected error. Experimental results demonstrate that our algorithm outperforms state-of-the-art sentiment analysis algorithms.

2. RELATED WORK

While domain adaptation methods for visual domains usually use generative adversarial networks (GANs) (Goodfellow et al., 2014) and align distributions indirectly, the dominant approach for cross-domain sentiment analysis is to design appropriate loss functions that impose domain alignment directly. The main reason is that natural language is expressed in terms of discrete units such as words, phrases, and sentences. Since this space is not continuous, the mapping to real-valued vectors is not differentiable, so adversarial learning procedures cannot be easily implemented for pure natural language processing (NLP) applications. Several alignment loss functions have been designed for cross-domain sentiment analysis. One group of methods aligns the lower-order distributional moments, e.g., means and covariances, of the two domains in an embedding space (Wu & Huang, 2016; Peng et al., 2018; Sarma et al., 2019; Guo et al., 2020). An improvement over these methods is to use probability distribution metrics that capture the information encoded in higher-order statistics (Shen et al., 2018). Damodaran et al. (Bhushan Damodaran et al., 2018) demonstrated that using the Wasserstein distance (WD) for domain alignment boosts performance significantly in visual domain applications (Long et al., 2015; Sun & Saenko, 2016). In the current work, we rely on the sliced Wasserstein distance (SWD) for aligning distributions. SWD has been used for domain adaptation in visual domains (Lee et al., 2019). The major reason for the performance degradation of a source-trained model in a target domain stems from "domain shift": the boundaries between the classes change in the embedding space even for related domains, which in turn increases the possibility of misclassification.
It has been argued that if a max-margin classifier is trained in the source domain, it can generalize better than many methods that try to align distributions without further model adaptation (Tommasi & Caputo, 2013). Inspired by the notion of "class prototypes", our method both aligns distributions in the embedding space and induces larger margins between classes using the notion of "prototypical distributions". Recently, cross-domain alignment of the class prototypes has been used for domain adaptation (Pan et al., 2019; Chen et al., 2019). The idea is that when a deep network classifier is trained in a domain with annotated data, the data points of the classes form separable clusters in an embedding space, modeled via the network responses in hidden layers. A class prototype is defined as the mean of each class-specific data cluster in the embedding space. Domain adaptation can then be addressed by aligning the prototypes across the two domains as a surrogate for distributional alignment. Following the above, our work uses the prototypical distribution, rather than simply the prototypes, to induce a maximum margin between the class-specific clusters after an initial training phase in the source domain. Since the prototypical distribution is multimodal, we can estimate it using a Gaussian mixture model (GMM). We estimate the GMM using the source samples for which the classifier is confident, and use random samples with high-confidence labels to induce larger margins between classes compared to using the original source domain data.

3. CROSS-DOMAIN SENTIMENT ANALYSIS

Consider two sentiment analysis problems: a source domain $S$ with an annotated dataset $D_S = (X_S, Y_S)$, where $X_S = [x^s_1, \ldots, x^s_N] \in \mathcal{X} \subset \mathbb{R}^{d\times N}$ and $Y_S = [y^s_1, \ldots, y^s_N] \in \mathcal{Y} \subset \mathbb{R}^{k\times N}$, and a target domain $T$ with an unannotated dataset $D_T = (X_T)$, where $X_T = [x^t_1, \ldots, x^t_M] \in \mathcal{X} \subset \mathbb{R}^{d\times M}$. The real-valued feature vectors $X_S$ and $X_T$ are obtained after pre-processing the input text data using common NLP methods, e.g., bag of words or word2vec. We consider that both domains share the same types of sentiments, so the one-hot labels $y^s_i$ encode $k$ sentiment types, e.g., negative or positive in binary sentiment analysis. Additionally, we assume that the source and the target feature data points are drawn independently and identically distributed from the domain-specific distributions $x^s_i \sim p_S(x)$ and $x^t_i \sim p_T(x)$, respectively, where $p_T(x) \neq p_S(x)$. Given a family of parametric functions $f_\theta: \mathbb{R}^d \rightarrow \mathcal{Y}$, e.g., deep neural networks with learnable parameters $\theta$, and an ideal labeling function $f(\cdot)$, i.e., $\forall (x, y): y = f(x)$, the goal is to search for the optimal predictor model $f_{\theta^*}(\cdot)$ in this family for the target domain. This model should have minimal expected error for sentiment analysis, i.e., $\theta^* = \arg\min_\theta \{e_\theta\} = \arg\min_\theta \{\mathbb{E}_{x^t \sim p_T(x)}(\mathcal{L}(f(x^t), f_\theta(x^t)))\}$, where $\mathcal{L}(\cdot)$ is a proper loss function and $\mathbb{E}(\cdot)$ denotes the expectation operator. Since the target domain data is unlabeled, the naive approach is to estimate the optimal model using standard empirical risk minimization (ERM) in the source domain:

$\hat{\theta} = \arg\min_\theta \{\hat{e}_\theta(X_S, Y_S, \mathcal{L})\} = \arg\min_\theta \{\frac{1}{N}\sum_i \mathcal{L}(f_\theta(x^s_i), y^s_i)\}$    (1)

Given a large enough labeled dataset in the source domain, the ERM model generalizes well in the source domain. The source-trained model may also perform much better than chance in the target domain, given that the two domains are related.
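As a concrete illustration of the ERM baseline in equation 1, the following minimal sketch trains a linear softmax model with cross-entropy on synthetic data. All names, data, and hyperparameters here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def train_erm(X_s, Y_s, lr=0.5, epochs=200):
    """Source-only ERM baseline: a linear softmax classifier f_theta
    trained with mean cross-entropy on labeled source data (hypothetical
    helper). X_s: (N, d) features, Y_s: (N, k) one-hot labels."""
    N, d = X_s.shape
    k = Y_s.shape[1]
    W, b = np.zeros((d, k)), np.zeros(k)
    for _ in range(epochs):
        logits = X_s @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y_s) / N                         # grad of mean cross-entropy
        W -= lr * X_s.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# toy example: two linearly separable clusters standing in for two sentiments
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
Y = np.zeros((100, 2)); Y[:50, 0] = 1; Y[50:, 1] = 1
W, b = train_erm(X, Y)
acc = np.mean((X @ W + b).argmax(1) == Y.argmax(1))
```

Such a source-only model is the lower-bound (SO) baseline used later in the experiments.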
However, its performance in the target domain will degrade compared to its performance in the source domain, due to the distributional discrepancy between the two domains, i.e., $p_S \neq p_T$. Our goal is to benefit from the information encoded in the unlabeled target domain data and adapt the source-trained classifier $f_\theta$ to generalize in the target domain. We use the common approach of reducing the domain gap by mapping data into a shared embedding space. We consider that the predictor model $f_\theta(\cdot)$ can be decomposed into a deep encoder $\phi_v(\cdot): \mathcal{X} \rightarrow \mathcal{Z} \subset \mathbb{R}^p$ and a classifier $h_w(\cdot): \mathcal{Z} \rightarrow \mathcal{Y}$ such that $f_\theta = h_w \circ \phi_v$, where $\theta = (w, v)$. Here $\mathcal{Z}$ is an embedding space which is modeled by the encoder responses at its output. We assume that the classes have become separable for the source domain in this space after an initial training phase (see Figure 1, left). If we can adapt the source-trained encoder network such that the two domains share similar distributions in $\mathcal{Z}$, i.e., $\phi_v(p_S)(\cdot) \approx \phi_v(p_T)(\cdot)$, the embedding space becomes domain-agnostic. As a result, the source-trained classifier network will generalize with similar performance in the target domain. A number of prior cross-domain sentiment analysis algorithms use this strategy: select a proper probability distribution metric to compute the distance between the distributions $\phi_v(p_S(x^s))$ and $\phi_v(p_T(x^t))$, and then train the encoder network to align the domains by minimizing this distance:

$\hat{v}, \hat{w} = \arg\min_{v,w} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(h_w(\phi_v(x^s_i)), y^s_i\big) + \lambda D\big(\phi_v(p_S(X_S)), \phi_v(p_T(X_T))\big)$    (2)

where $D(\cdot, \cdot)$ denotes a probability metric that measures the domain discrepancy and $\lambda$ is a trade-off parameter between the source ERM term and the domain alignment term. In this work, we use SWD (Lee et al., 2019) to compute $D(\cdot, \cdot)$ in equation 2. Using SWD has three advantages. First, SWD can be computed efficiently compared to WD, based on the closed-form solution of the WD in one dimension.
Second, SWD can be computed using the empirical samples that are drawn from the two distributions. Finally, SWD possesses a non-vanishing gradient even when the supports of the two distributions do not overlap (Bonnotte, 2013; Lee et al., 2019). Hence, SWD is suitable for deep learning problems, which are normally solved using first-order gradient-based optimization techniques, e.g., Adam. While methods based on variations of equation 2 are effective at reducing the domain gap to some extent, our goal is to improve upon the baseline obtained by equation 2 by introducing a loss term that increases the margin between classes in the target domain (see the embedding space in Figure 1, right). By doing so, we aim to mitigate the negative effect of domain shift in the target domain.
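Since the alignment metric $D(\cdot,\cdot)$ is the empirical SWD (derived in the Appendix), a minimal NumPy sketch may help fix ideas. The function name and defaults are illustrative assumptions:

```python
import numpy as np

def empirical_swd(Z_s, Z_t, n_proj=50, rng=None):
    """Empirical sliced Wasserstein distance (squared-l2 cost) between two
    equally sized sets of embeddings of shape (M, p). For each random unit
    direction, project both sets to 1-D, sort, and average the squared
    differences (the closed-form 1-D Wasserstein distance), then average
    over directions. Illustrative sketch, not the paper's code."""
    rng = rng or np.random.default_rng(0)
    p = Z_s.shape[1]
    total = 0.0
    for _ in range(n_proj):
        gamma = rng.normal(size=p)
        gamma /= np.linalg.norm(gamma)      # uniform direction on the sphere
        proj_s = np.sort(Z_s @ gamma)
        proj_t = np.sort(Z_t @ gamma)
        total += np.mean((proj_s - proj_t) ** 2)
    return total / n_proj
```

In the actual model the same project-sort-compare computation is expressed in a differentiable framework so gradients flow to the encoder; NumPy is used here only to illustrate the structure.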

4. LEARNING MAX-MARGIN CLASSIFIERS

Algorithm 1 SAUM² (λ, τ)
1: Initial Training:
2: Input: source dataset $D_S = (X_S, Y_S)$
3: Training on the Source Domain:
4:   $\hat{\theta}_0 = (\hat{w}_0, \hat{v}_0) = \arg\min_\theta \sum_i \mathcal{L}(f_\theta(x^s_i), y^s_i)$
5: Prototypical Distribution Estimation:
6:   Use equation 4 to estimate $\alpha_j, \mu_j, \Sigma_j$
7: Model Adaptation:
8: Input: target dataset $D_T = (X_T)$
9: Pseudo-Dataset Generation:
10:  $D_P = (Z_P, Y_P) = ([z^p_1, \ldots, z^p_{N_p}], [y^p_1, \ldots, y^p_{N_p}])$, where:
11:  $z^p_i \sim \hat{p}_J(z), \ 1 \le i \le N_p$
12:  $y^p_i = \arg\max_j \{h_{\hat{w}_0}(z^p_i)\}$, subject to $\max\{h_{\hat{w}_0}(z^p_i)\} > \tau$
13: for itr = 1, ..., ITR do
14:  Update the model by solving equation 6
15: end for

Our idea for increasing the margins between the classes is based on using an intermediate prototypical distribution in the embedding space. We demonstrate that this distribution can be used to induce larger margins between the classes. To this end, we consider that the classifier subnetwork ends in a softmax layer. This means that the classifier should become a maximum a posteriori (MAP) estimator after training, able to assign a membership probability to a given input feature vector. Under this formulation, the model will generalize in the source domain if, after supervised training using the source data, the input distribution is transformed into a multi-modal distribution $p_J(\cdot) = \phi_v(p_S)(\cdot)$ with $k$ modes in the embedding space (see Figure 1, left). Each mode of this distribution corresponds to one type of sentiment. The mean of each mode corresponds to the notion of a "class prototype" in prior works (Pan et al., 2019; Chen et al., 2019). Following a similar terminology, we refer to this distribution as the prototypical distribution. The geometric distance between the modes of the prototypical distribution corresponds to the margins between classes. If we test the source-trained model in the target domain, the boundaries between class modes will change due to the existence of "domain shift", i.e., $\phi_v(p_T)(\cdot) \neq \phi_v(p_S)(\cdot)$. Intuitively, as visualized in Figure 1, if we can increase the margins between the class-specific modes in the source domain, domain shift will cause less performance degradation (Tommasi & Caputo, 2013).

We estimate the prototypical distribution in the embedding space as a parametric GMM:

$p_J(z) = \sum_{j=1}^{k} \alpha_j \mathcal{N}(z|\mu_j, \Sigma_j)$    (3)

where $\mu_j$ and $\Sigma_j$ denote the mean and covariance matrix of each component and $\alpha_j$ denotes the mixture weight of each component. We need to solve for these parameters to estimate the prototypical distribution. Note that, unlike the usual case in which the expectation maximization algorithm (Bilmes et al., 1998) is used to estimate GMM parameters, the source data points are labeled. As a result, we can estimate $\mu_j$ and $\Sigma_j$ for each component independently using standard MAP estimates. Similarly, the weights $\alpha_j$ can be computed by a MAP estimate. Let $S_j$ denote the support set for class $j$ in the training dataset, i.e., $S_j = \{(x^s_i, y^s_i) \in D_S \mid \arg\max(y^s_i) = j\}$. To cancel out outliers, we include in the sets $S_j$ only those source samples for which the source-trained model predicts the corresponding labels correctly. The MAP estimates for the mode parameters are:

$\hat{\alpha}_j = \frac{|S_j|}{N}, \quad \hat{\mu}_j = \frac{1}{|S_j|}\sum_{(x^s_i, y^s_i)\in S_j} \phi_v(x^s_i), \quad \hat{\Sigma}_j = \frac{1}{|S_j|}\sum_{(x^s_i, y^s_i)\in S_j} \big(\phi_v(x^s_i) - \hat{\mu}_j\big)\big(\phi_v(x^s_i) - \hat{\mu}_j\big)^\top$    (4)

The computations in equation 4 can be done efficiently; for a complexity analysis, please refer to the Appendix. Our idea is to use this prototypical distribution estimate to induce larger margins in the source domain (see Figure 1, right). We update the domain alignment term in equation 2 to induce larger margins.
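A sketch of the MAP estimates in equation 4, given encoder outputs and class-index labels (the helper name and interface are assumptions for illustration):

```python
import numpy as np

def estimate_prototypical_gmm(Z_s, y_true, y_pred, k):
    """MAP estimates of the GMM prototypical distribution (equation 4).
    Z_s: (N, p) source embeddings phi_v(x_i^s); y_true, y_pred: (N,) class
    indices. Only samples the source-trained model classifies correctly
    enter each support set S_j, filtering out outliers."""
    N = Z_s.shape[0]
    alphas, mus, sigmas = [], [], []
    for j in range(k):
        S_j = Z_s[(y_true == j) & (y_pred == j)]   # filtered support set
        alphas.append(len(S_j) / N)                # mixture weight alpha_j
        mu_j = S_j.mean(axis=0)                    # class prototype mu_j
        diff = S_j - mu_j
        sigmas.append(diff.T @ diff / len(S_j))    # covariance Sigma_j
        mus.append(mu_j)
    return np.array(alphas), np.stack(mus), np.stack(sigmas)
```

Because the labels partition the data, each component is estimated independently, with no need for expectation maximization.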
To this end, we replace the source domain samples in the domain alignment term with samples of a labeled pseudo-dataset $D_P = (Z_P, Y_P)$ that we generate using the GMM estimate, where $Z_P = [z^p_1, \ldots, z^p_{N_p}] \in \mathbb{R}^{p\times N_p}$ and $Y_P = [y^p_1, \ldots, y^p_{N_p}] \in \mathbb{R}^{k\times N_p}$. This pseudo-dataset is generated from the prototypical distribution: we draw samples $z^p_i \sim \hat{p}_J(z)$ from the prototypical distribution estimate. To induce larger margins between classes, we feed the drawn samples into the classifier network and check the confidence of the classifier in its predictions for these randomly drawn samples. We set a threshold $0 < \tau < 1$ and select the subset of drawn samples for which the confidence of the classifier exceeds $\tau$:

$(z^p_i, y^p_i) \in D_P \quad \text{if} \quad z^p_i \sim \hat{p}_J(z), \ \max\{h(z^p_i)\} > \tau, \ \text{and} \ y^p_i = \arg\max_j \{h(z^p_i)\}$    (5)

Given the GMM distributional form, selecting samples based on the threshold $\tau$ means that we include GMM samples that are closer to the class prototypes (see Figure 1). In other words, the margins between the clusters in the source domain increase if we use the generated pseudo-dataset for domain alignment. Hence, we update equation 2 and solve the following optimization problem:

$\hat{v}, \hat{w} = \arg\min_{v,w} \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(h_w(\phi_v(x^s_i)), y^s_i\big) + \frac{1}{N_p}\sum_{i=1}^{N_p} \mathcal{L}\big(h_w(z^p_i), y^p_i\big) + \lambda D\big(\phi_v(X_T), Z_P\big) + \lambda D\big(\phi_v(X_S), Z_P\big)$    (6)

The first and second terms in equation 6 are ERM terms for the source dataset and the generated pseudo-dataset, which guarantee that the classifier continues to generalize well in the source domain after adaptation. The third and fourth terms are empirical SWD losses (see the Appendix for more details) that align the source and target domain distributions with the pseudo-dataset, which, as described above, induces larger margins. The hope is that, as visualized in Figure 1, these terms reduce the effect of domain shift.
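The confident-sampling rule in equation 5 can be sketched as follows (illustrative names; `classifier` stands in for the frozen softmax head $h_{\hat{w}_0}$):

```python
import numpy as np

def draw_pseudo_dataset(alphas, mus, sigmas, classifier, tau, n_draw, rng=None):
    """Sketch of equation 5: draw z ~ p_J(z) from the GMM estimate and keep
    only samples for which the frozen classifier's confidence exceeds tau,
    labeling each kept sample with the predicted class. `classifier` maps
    (n, p) -> (n, k) softmax probabilities; all names are illustrative."""
    rng = rng or np.random.default_rng(0)
    comps = rng.choice(len(alphas), size=n_draw, p=alphas)
    Z = np.stack([rng.multivariate_normal(mus[c], sigmas[c]) for c in comps])
    probs = classifier(Z)
    keep = probs.max(axis=1) > tau            # confident samples only
    return Z[keep], probs[keep].argmax(axis=1)
```

Raising τ keeps only samples near the class prototypes, which is what widens the margins between the class clusters used for alignment.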
Our proposed solution, named Sentiment Analysis Using Max-Margin classifiers (SAUM²), is presented in Algorithm 1 and visualized in Figure 1.

5. THEORETICAL ANALYSIS

We provide a theoretical justification for our algorithm. Following a standard PAC-learning framework, we prove that Algorithm 1 minimizes an upper bound for the target domain expected error. Consider that the hypothesis class in a PAC-learning setting is the family of classifier subnetworks $\mathcal{H} = \{h_w(\cdot) \mid h_w(\cdot): \mathcal{Z} \rightarrow \mathbb{R}^k, w \in \mathbb{R}^V\}$, where $V$ denotes the number of learnable parameters. We denote the expected error of a model $h_w(\cdot) \in \mathcal{H}$ on the source and the target domains by $e_S(w)$ and $e_T(w)$, respectively. Given the source and the target datasets, we can represent the empirical source and target distributions in the embedding space as $\hat{\mu}_S = \frac{1}{N}\sum_{n=1}^{N} \delta(\phi_v(x^s_n))$ and $\hat{\mu}_T = \frac{1}{M}\sum_{m=1}^{M} \delta(\phi_v(x^t_m))$. Similarly, we can build an empirical version of the prototypical distribution, $\hat{\mu}_P = \frac{1}{N_p}\sum_{q=1}^{N_p} \delta(z^p_q)$. In our analysis we also use the notion of the joint-optimal model $h_{S,T}(\cdot)$, defined as $w^* = \arg\min_w e_{S,T}(w) = \arg\min_w \{e_S(w) + e_T(w)\}$ for any given domains $S$ and $T$. When we have labeled data in both domains, this is the best model that can be trained using ERM. Existence of a good joint-trained model guarantees that the domains are related, e.g., similar sentiment polarities are encoded consistently across the two domains.

Theorem 1: Consider that we use the procedure described in Algorithm 1 for cross-domain sentiment analysis. Then the following inequality holds for the target expected error:

$e_T \le e_S + D(\hat{\mu}_S, \hat{\mu}_P) + D(\hat{\mu}_T, \hat{\mu}_P) + (1 - \tau) + e'_{S,P} + \sqrt{2\log(1/\xi)/\zeta}\left(\sqrt{1/N} + \sqrt{1/M} + 2\sqrt{1/N_p}\right)$    (7)

where $\xi$ is a constant which depends on the characteristics of the loss function $\mathcal{L}(\cdot)$.

Proof: The complete proof is included in the Appendix due to the space limit.

Theorem 1 provides an explanation that justifies Algorithm 1. We observe that all the terms in the upper bound of the target expected error on the right-hand side of equation 7 are minimized by the SAUM² algorithm. The source expected error is minimized by the first term in equation 6.
The second and third terms of the bound are minimized by the third and fourth terms of equation 6. The fourth term, $1 - \tau$, is small if we set $\tau \approx 1$. Note, however, that when we select $\tau$ too close to 1, the pseudo-samples will be concentrated around the prototypes. As a result, we will not match the higher-order distributional moments in the terms $D(\hat{\mu}_S, \hat{\mu}_P)$ and $D(\hat{\mu}_T, \hat{\mu}_P)$, which can make the upper bound looser. The term $e'_{S,P}$ is minimized through the first and second terms of equation 6. This is important, as using the pseudo-dataset provides a way to minimize this term directly. As can be seen in our proof in the Appendix, if we do not use the pseudo-dataset, this term is replaced with $e'_{S,T}$, which cannot be minimized directly due to the lack of annotated data in the target domain. The last term in equation 7 is a constant term that, as is common in PAC-learning bounds, becomes negligible if we have access to large enough datasets. Hence, all the terms in the upper bound are minimized, and if this upper bound is tight, a good model will be trained for the target domain. If the two domains are related, e.g., share the same classes, and the classes become separable in the embedding space, i.e., the GMM estimation error is small, then the upper bound is likely to be tight. However, we highlight that the possibility of a tight upper bound is a condition for our algorithm to work. This is a common limitation for most parametric algorithms.

6. EXPERIMENTAL VALIDATION

Our implementation code is provided in the Appendix.

6.1. EXPERIMENTAL SETUP

Most existing works report performance on cross-domain tasks defined using the Amazon Reviews benchmark dataset (Blitzer et al., 2007). The dataset is built from Amazon product reviews in four product domains: Books (B), DVD (D), Electronics (E), and Kitchen (K) appliances. Each review is considered to have positive (higher than 3 stars) or negative (3 stars or lower) sentiment. Each review is encoded as a 5000-dimensional or 30000-dimensional tf-idf feature vector of bag-of-words unigrams and bigrams. We report our performance on the 12 definable cross-domain tasks for this dataset. Each task consists of 2000 labeled reviews for the source domain and 2000 unlabeled reviews for the target domain, with 2500-5500 examples for testing. We report the average prediction accuracy and standard deviation (std) over 10 runs on the target domain testing split for our algorithm. We compare our method against several recently developed algorithms: DSN (Bousmalis et al., 2016), CMD (Zellinger et al., 2017), ASYM (Saito et al., 2018), PBLM (Ziser & Reichart, 2018), MT-Tri (Ruder & Plank, 2018), TRL (Ziser & Reichart, 2019), and TAT (Liu et al., 2019). DSN and CMD are similar to SAUM² in that both align distributions in an embedding space. DSN learns shared and domain-specific knowledge for each domain and aligns the shared knowledge using the mean-based maximum mean discrepancy metric. CMD uses the central moment discrepancy metric for domain alignment. ASYM benefits from the idea of pseudo-labeling target samples to update the base model. MT-Tri is based on ASYM but also benefits from multi-task learning. TRL and PBLM do not use distribution alignment and are based on pivot-based language models. TAT is a recent work that has successfully used adversarial learning for cross-domain sentiment analysis. We report the results provided by the authors for the tasks in our table, and report the std when it is reported in the original paper.
All methods use 5000-dimensional features except TAT, which uses 30000-dimensional features. In our results, methods are comparable when they use features of the same dimension. We report the performance of the source-only (SO) model as a lower bound. We used the benchmark network architecture used in the above-mentioned works: an encoder with one hidden dense layer of 50 nodes with sigmoid activation, and a classifier consisting of a softmax layer with two output nodes. We implemented our method in Keras, used the Adam optimizer, and tuned the learning rate in the source domain. We set $\tau = 0.99$ and $\lambda = 10^{-2}$. We observed empirically that our algorithm is not sensitive to the value of $\lambda$.
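For reference, the benchmark architecture can be sketched as a NumPy forward pass (the weights below are random placeholders for shape-checking only; the paper's model is trained in Keras with Adam):

```python
import numpy as np

def encoder(x, V, b):
    """Benchmark encoder phi_v: one dense hidden layer, 50 sigmoid units."""
    return 1.0 / (1.0 + np.exp(-(x @ V + b)))

def classifier(z, W, c):
    """Classifier h_w: a softmax layer with two output nodes."""
    logits = z @ W + c
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# illustrative random initialization for 5000-dimensional tf-idf inputs
rng = np.random.default_rng(0)
V, b = rng.normal(0.0, 0.01, (5000, 50)), np.zeros(50)
W, c = rng.normal(0.0, 0.01, (50, 2)), np.zeros(2)
probs = classifier(encoder(rng.random((4, 5000)), V, b), W, c)
```

The 50-dimensional sigmoid layer output is the embedding space $\mathcal{Z}$ in which distributions are aligned.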

6.2. RESULTS

Our results are reported in Table 1, where bold font denotes the best performance among the methods that use 5000-dimensional features. We see that the SAUM² algorithm performs well and in most cases achieves the best performance; note that no single method performs best across all tasks. We also observe that DSN and CMD, which align the source and target distributions and are thus the methods most similar to our algorithm, have relatively similar performances. This observation suggests that we should not expect a considerable performance boost from simply aligning the distributions with a new alignment loss function; the improvement of SAUM² over these methods therefore likely stems from inducing larger margins. We verify this intuition in our ablative study. We also observe that increasing the dimension of the tf-idf features to 30000 leads to a performance boost, which is likely the reason behind the good performance of TAT compared to the other methods. Hence, features of the same dimension must be used for a fair comparison. To provide intuition for our rationale, we used the UMAP visualization tool (McInnes et al., 2018) to reduce the data representations in the 50D embedding space to two dimensions for 2D visualization. Figure 2 visualizes the testing split of the source domain before model adaptation, the testing split of the target domain before and after model adaptation, and random samples drawn from the prototypical distribution for the D→K task. Each point represents one data point and each color represents one of the sentiments. Observing Figure 2a and Figure 2b, we conclude that the GMM prototypical distribution approximates the source domain distribution reasonably well, while a margin between the classes in the boundary region is observable.
Figure 2c visualizes the target domain samples prior to model adaptation. As expected, we observe that the domain gap has caused less separation between the classes, as is also evident from the SO performance in Table 1. Figure 2d visualizes the target domain samples after adaptation using the SAUM² algorithm. Comparing Figure 2d with Figure 2c and Figure 2a, we see that the classes have become more separated. Also, careful comparison of Figure 2d and Figure 2b reveals that the SAUM² algorithm has biased the target domain data points to lie further from the boundary.

Table 1 (excerpt):
B→D: 81.9 ± 0.5   B→E: 80.9 ± 0.8   B→K: 83.2 ± 0.8
D→B: 74.0 ± 0.9   D→E: 80.9 ± 0.7   D→K: 83.4 ± 0.6
E→B: 74.1 ± 0.6   E→D: 74.0 ± 0.3   E→K: 87.8 ± 0.9

6.3. ABLATION STUDIES

First note that the source-only (SO) model, trained using equation 1, already serves as a basic ablative baseline for verifying the effect of domain alignment: improvement over this baseline demonstrates the effect of using the information encoded in the unlabeled target data. In Table 2, we provide an additional ablative study, reporting the result of alignment-only (AO) model adaptation based on equation 2. The AO model does not benefit from the margins that the SAUM² algorithm induces between the classes. Comparing the AO results with Table 1, we conclude that the increased margins are important to our performance: compared to other cross-domain sentiment analysis methods, the performance boost of our algorithm stems from inducing large margins. This suggests that researchers should investigate secondary techniques for domain adaptation in NLP, in addition to probability distribution alignment. Finally, we study the effect of the value of the confidence parameter on performance. In Figure 3, we visualize the performance of our algorithm for the task B→D when $\tau$ is varied in the interval [0, 0.99]. When $\tau = 0$, the samples are not necessarily confident samples. We observe that as we increase the value of $\tau$, the performance increases as a result of inducing larger margins. For values $\tau > 0.8$, the performance has less variance, which suggests robust performance when $\tau \approx 1$. These empirical observations about $\tau$ accord with our theoretical result, stated in equation 7.

7. CONCLUSIONS

We developed a method for cross-domain sentiment analysis based on aligning two domain-specific distributions in a shared embedding space. We demonstrated that one can improve upon this baseline by inducing larger margins between the classes in the source domain using an intermediate multimodal prototypical distribution. As a result, the domain shift problem is mitigated in the target domain. Our experiments demonstrate that our algorithm is effective. A future research direction is to address cross-domain sentiment analysis when different types of sentiments exist across the domains.

A APPENDIX A.1 SLICED WASSERSTEIN DISTANCE

We relied on minimizing the Sliced Wasserstein Distance (SWD) for domain alignment. SWD is defined based on the Wasserstein distance (WD) and is a means of obtaining a more computationally efficient distribution metric. The WD between two distributions $p_S$ and $p_T$ is defined as:

$W_c(p_S, p_T) = \inf_{\gamma \in \Gamma(p_S, p_T)} \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\gamma(x, y)$    (8)

where $\Gamma(p_S, p_T)$ is the set of all joint distributions $p_{S,T}$ with single-variable marginals $p_S$ and $p_T$, and $c: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^+$ is the cost function, e.g., the $\ell_2$-norm Euclidean distance. Computing WD involves solving a complicated optimization problem in the general case. However, when the two distributions are one-dimensional, WD has a closed-form solution:

$W_c(p_S, p_T) = \int_0^1 c\big(P_S^{-1}(\tau), P_T^{-1}(\tau)\big)\, d\tau$    (9)

where $P_S$ and $P_T$ are the cumulative distribution functions of the 1D distributions $p_S$ and $p_T$. This closed-form solution motivates the definition of SWD, which extends the applicability of equation 9 to higher-dimensional distributions. SWD is based on the idea of slice sampling (Neal, 2003): project two $d$-dimensional distributions onto their marginal one-dimensional distributions along a subspace, i.e., slice the high-dimensional distributions, and then compute the distance between the two distributions by integrating the WD between the resulting 1D marginal distributions over all possible 1D subspaces using the closed-form solution. This is a valid replacement for the WD because any probability distribution can be recovered from the set of its one-dimensional marginal projection distributions (Helgason, 2011). More specifically, a one-dimensional slice of the distribution $p_S$ is defined as:

$\mathcal{R}p_S(t; \gamma) = \int_{\mathbb{S}^{d-1}} p_S(x)\, \delta\big(t - \langle \gamma, x\rangle\big)\, dx$    (10)

where $\delta(\cdot)$ denotes the Dirac delta function, $\langle \cdot, \cdot \rangle$ denotes the vector inner product, $\mathbb{S}^{d-1}$ is the $d$-dimensional unit sphere, and $\gamma$ is the projection direction.
The SWD is defined as the integral of the WD between the sliced distributions over all 1D subspaces \gamma on the unit sphere:

SW(p_S, p_T) = \int_{S^{d-1}} W(R p_S(\cdot; \gamma), R p_T(\cdot; \gamma)) \, d\gamma.    (11)

The main advantage of using SWD is that computing it does not require solving a numerically expensive optimization. In our practical setting, only samples from the distributions are available and we do not have their distributional form. Another advantage of SWD is that its empirical version can be computed from the one-dimensional empirical WD, which can be approximated as the \ell_p-distance between the sorted samples. We can compute the integrand of equation 11 for a fixed \gamma and then approximate the integral in equation 11 via Monte Carlo numerical integration. To this end, we draw random projection directions \gamma from the uniform distribution on the unit sphere, compute the 1D WD along each direction, and approximate the integral by the arithmetic average over a suitably large number of drawn directions. More specifically, the SWD between f-dimensional samples \{\phi(x_i^S) \in \mathbb{R}^f \sim p_S\}_{i=1}^M and \{\phi(x_j^T) \in \mathbb{R}^f \sim p_T\}_{j=1}^M in our setting can be approximated as the following sum:

SW^2(p_S, p_T) \approx \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{M} \left| \langle \gamma_l, \phi(x^S_{s_l[i]}) \rangle - \langle \gamma_l, \phi(x^T_{t_l[i]}) \rangle \right|^2,    (12)

where \gamma_l \in S^{f-1} is a uniformly drawn random sample from the unit f-dimensional sphere S^{f-1}, and s_l[i] and t_l[i] are the sorted indices of \{\langle \gamma_l, \phi(x_i) \rangle\}_{i=1}^M for the source and target domains, respectively. We utilize this empirical version of SWD in equation 12 to align the distributions in the embedding space. Note that equation 12 is differentiable with respect to the encoder parameters, and hence we can use gradient-based optimization techniques to minimize it with respect to the model parameters.

A.2 PROOF OF THEOREM 1

We use the following theorem by Redko et al.
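The Monte Carlo approximation above can be sketched in a few lines. This is an illustrative NumPy version operating on already-embedded samples (the function name and arguments are our own; a real implementation would use differentiable tensor operations inside the training loop):

```python
import numpy as np

def sliced_wasserstein_sq(xs, xt, num_projections=50, rng=None):
    """Monte Carlo estimate of the squared SWD between two equal-size
    sample sets xs and xt of shape (M, f).

    Each iteration draws a random direction on the unit sphere S^{f-1},
    projects both sample sets onto it, and accumulates the squared 1D WD
    between the sorted projections, matching the empirical SWD sum.
    """
    rng = np.random.default_rng(rng)
    xs, xt = np.asarray(xs, dtype=float), np.asarray(xt, dtype=float)
    total = 0.0
    for _ in range(num_projections):
        gamma = rng.standard_normal(xs.shape[1])
        gamma /= np.linalg.norm(gamma)  # uniform direction on the unit sphere
        ps, pt = np.sort(xs @ gamma), np.sort(xt @ gamma)
        total += np.sum((ps - pt) ** 2)
    return total / num_projections
```

Because sorting matches projected samples optimally in 1D, the estimate is zero when the two sample sets coincide and grows as the embedded distributions drift apart, which is exactly the signal minimized during alignment.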
Redko & Sebban (2017) and a result by Bolley et al. (2007) on the convergence of the empirical distribution to the true distribution in the WD metric in our proof.

Theorem 2 (Redko & Sebban (2017)): Under the assumptions described in our framework, assume that a model is trained on the source domain. Then for any d' > d and \zeta < \sqrt{2}, there exists a constant number N_0 depending on d' such that for any \xi > 0 and \min(N, M) \ge \max(\xi^{-(d'+2)}, 1), with probability at least 1 - \xi, the following holds:

e_T \le e_S + W(\hat{\mu}_T, \hat{\mu}_S) + e_{S,T} + \sqrt{2 \log(1/\xi)/\zeta} \left( \sqrt{1/N} + \sqrt{1/M} \right).    (13)

Theorem 2 provides an upper bound on the performance of a source-trained model in the target domain. Redko & Sebban (2017) prove Theorem 2 for a binary classification setting; we also provide our proof in this case, but it can be extended. The second term in equation 13 captures the effect of domain shift on the performance of a source-trained model in a target domain. When the distance between the two distributions is large, this term is large and hence the upper bound in equation 13 is loose, which indicates potential performance degradation. Our algorithm mitigates the domain gap because this term is minimized through minimization of the second and third terms in Theorem 1.

Theorem 1: Assume that we use the procedure described in Algorithm 1 for cross-domain sentiment analysis. Then the following inequality holds for the target expected error, where \xi is a constant that depends on L(\cdot) and e_C(w^*) denotes the expected risk of the optimal model jointly trained on both the source domain and the pseudo-dataset.

Proof: Due to the construction of the pseudo-dataset, the probability that the predicted label for a pseudo-data point is false is equal to 1 - \tau. Let:

|L(h_{w_0}(z_i^p), y_i^p) - L(h_{w_0}(z_i^p), \hat{y}_i^p)| = \begin{cases} 0, & \text{if } y_i^p = \hat{y}_i^p \\ 1, & \text{otherwise.} \end{cases}    (15)
We use Jensen's inequality and take expectation on both sides of equation 15 to deduce equation 16.

The performance of our algorithm has degraded only slightly, i.e., it has remained robust to a large extent with respect to label imbalance. As a secondary sanity check, we present the UMAP visualization for the testing data split of the task D→K in the imbalanced 90/10 regime in Figure 4. The observations in Figure 4 match what we report in Table 3, confirming that our algorithm does not suffer considerable degradation when the target domain data is imbalanced for the tasks built using the Amazon reviews dataset.



Figure 1: Architecture of the proposed cross-domain sentiment analysis framework. Left: separable clusters are formed in the embedding space after initial supervised model training in the source domain and then the prototypical distribution is estimated as a GMM. Right: random samples from the GMM with high-confident labels are used to generate a pseudo-dataset. The pseudo-dataset induces larger margins between the classes to mitigate the effect of domain shift in the target domain.



Figure 2: UMAP visualization for the task D→K: (a) the source domain testing split, (b) the prototypical distribution samples, (c) the target domain testing split prior to adaptation, and (d) the target domain testing split after adaptation. (Best viewed in color).

Figure 3: Effect of τ on performance.

e_T \le e_S + D(\hat{\mu}_S, \hat{\mu}_P) + D(\hat{\mu}_T, \hat{\mu}_P) + (1 - \tau) + e_{S,P} + \dots

|e_P - e_T| \le \mathbb{E}\left[ |L(h_{w_0}(z_i^p), y_i^p) - L(h_{w_0}(z_i^p), \hat{y}_i^p)| \right] \le (1 - \tau).    (16)

Applying equation 16 below, we deduce:

e_S + e_T = e_S + e_T + e_P - e_P \le e_S + e_P + |e_T - e_P| \le e_S + e_P + (1 - \tau).    (17)

Taking the infimum on both sides of equation 17, we deduce:

e_{S,T} \le e_{S,P} + (1 - \tau).    (18)

Now, considering Theorem 2 for the two domains S and T and using equation 18 in equation 13, we conclude:

e_T \le e_S + D(\hat{\mu}_T, \hat{\mu}_S) + e_{S,P} + (1 - \tau) + \sqrt{2 \log(1/\xi)/\zeta} \left( \sqrt{1/N} + \sqrt{1/M} \right).    (19)

Table 3: Effect of label imbalance in the target domain on the proposed method.

80/20   82.8 ± 0.3   83.2 ± 0.5   85.5 ± 0.3   78.7 ± 0.2   83.3 ± 0.2   86.8 ± 0.2
90/10   82.9 ± 0.5   83.4 ± 0.3   85.8 ± 0.2   78.5 ± 0.4   83.3 ± 0.4   86.8 ± 0
80/20   78.7 ± 0.2   78.5 ± 0.5   88.6 ± 0.1   76.3 ± 0.6   77.9 ± 0.4   86.6 ± 0.1
90/10   78.7 ± 0.2   78.0 ± 0.4   88.0 ± 0.2   76.5 ± 0.5   77.3 ± 0.3   86.7 ± 0.2


Figure 4: UMAP visualization for the task D→K in the imbalanced regime of 90/10: (a) the source domain testing split, (b) the prototypical distribution samples, (c) the target domain testing split prior to adaptation, and (d) the target domain testing split after adaptation. (Best viewed in color).

Table: Classification accuracy for the cross-domain sentiment analysis tasks on the Amazon Reviews dataset.

Performance of AO setting.


Now, using the triangle inequality on the metric D, we can deduce:

D(\hat{\mu}_T, \hat{\mu}_S) \le D(\hat{\mu}_T, \mu_P) + D(\hat{\mu}_S, \mu_P) \le D(\hat{\mu}_T, \hat{\mu}_P) + D(\hat{\mu}_S, \hat{\mu}_P) + 2 D(\hat{\mu}_P, \mu_P).    (20)

We now bound the term D(\hat{\mu}_P, \mu_P) using Theorem 1.1 in the work by Bolley et al. (2007).

Theorem 3 (Theorem 1.1 of Bolley et al. (2007)): Consider that p(\cdot) \in \mathcal{P}(Z), and let \hat{p}(\cdot) denote the empirical distribution built from the samples \{x_i\}_{i=1}^N drawn i.i.d. from x_i \sim p(x). Then for any d' > d and \zeta < \sqrt{2}, there exists N_0 such that for any \epsilon > 0 and N \ge N_0 \max(1, \epsilon^{-(d'+2)}), we have:

P(W(p, \hat{p}) > \epsilon) \le \exp\left(-\frac{\zeta^2}{2} N \epsilon^2\right),    (21)

where W denotes the WD. This relation bounds the distance between the empirical distribution and the true distribution in the WD metric.

Applying equation 20 and equation 21 to equation 19 concludes Theorem 1 as stated.

A.3 COMPLEXITY ANALYSIS FOR GMM ESTIMATION

Estimating a GMM is usually a computationally expensive task. The major reason is that the data points are normally unlabeled. This necessitates relying on iterative algorithms such as the expectation maximization (EM) algorithm Moon (1996). Performing iterative E and M steps until convergence leads to high computational complexity Roweis (1998). However, estimating the prototypical distribution with a GMM is much simpler in our learning setting. The existence of labels helps us decouple the Gaussian components and compute the parameters using a MAP estimate for each mode in a single step. Given the above and considering that the source domain data is balanced, the complexity of computing \alpha_j is O(N) (just checking whether the data points x_i^s belong to S_j). The complexity of computing \mu_j is O(NF/k), where F is the dimension of the embedding space. The complexity of computing the covariance matrices is O(NF^2); assuming F^2 \le N, which seems to be a reasonable practical assumption, the total complexity of computing the GMM would be O(N^2).
Given the large number of learnable parameters in most deep neural networks, which exceeds N in most cases, this complexity is fully dominated by the complexity of a single step of backpropagation. Hence, computing the GMM parameters does not increase the computational complexity of our algorithm.
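The one-step estimate above can be written directly: labels partition the embeddings by class, so each component's weight, mean, and covariance follow from a single pass over the data with no EM iterations. A sketch (the function name and array layout are our own illustration):

```python
import numpy as np

def fit_gmm_with_labels(z, y, num_classes):
    """One-step estimate of a GMM over labeled embeddings z of shape (N, F).

    Labels decouple the mixture components, so no EM iterations are needed:
    each component's parameters come from the points of a single class.
    """
    z, y = np.asarray(z, dtype=float), np.asarray(y)
    weights, means, covs = [], [], []
    for j in range(num_classes):
        zj = z[y == j]
        weights.append(len(zj) / len(z))        # alpha_j: O(N)
        means.append(zj.mean(axis=0))           # mu_j:    O(N F / k)
        covs.append(np.cov(zj, rowvar=False))   # Sigma_j: O(N F^2) overall
    return np.array(weights), np.array(means), np.array(covs)
```

Each statement is a single vectorized pass over one class's points, which is where the per-term complexities quoted in the analysis come from.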

A.4 ADDITIONAL EXPERIMENTS ON IMBALANCED DATA

We have used a dataset that is balanced in terms of class labels in our experiments. However, the label distribution of the target domain training dataset cannot be enforced to be balanced in practical applications due to the absence of labels. This is a largely unexplored challenge in the domain adaptation literature, which we study in this section. To study the effect of label imbalance in a controlled experiment, we synthetically design imbalanced datasets from the Amazon dataset. We repeat our earlier experiments, with the only difference being the use of imbalanced target domain datasets. We design two experiments by including only 111 and 250 data points from class one, so that the target domain datasets have 90/10 and 80/20 imbalance ratios between the two classes, respectively. We provide domain adaptation results for these two imbalanced scenarios in Table 3.

