META AUXILIARY LABELS WITH CONSTITUENT-BASED TRANSFORMER FOR ASPECT-BASED SENTIMENT ANALYSIS

Abstract

Aspect-based sentiment analysis (ABSA) is a challenging natural language processing task that can benefit from syntactic information. Previous work exploits dependency parses to improve performance on the task, but this requires the existence of good dependency parsers. In this paper, we build a constituent-based transformer for ABSA that can induce constituents without constituent parsers. We also apply meta auxiliary learning to generate labels on edges between tokens, supervised by the objective of the ABSA task. Without input from dependency parsers, our models outperform previous work on three Twitter data sets and match previous work closely on two review data sets.

1. INTRODUCTION

Aspect-based Sentiment Analysis (ABSA) is the task of predicting the sentiment polarity towards an observed aspect in a sentence. Recent work (Bai et al., 2020; Huang & Carley, 2019; Sun et al., 2019; Wang et al., 2020) used syntactic information from dependency parses to achieve new state-of-the-art results on benchmark ABSA data sets. However, these works (i) assumed the existence of good dependency parsers, and (ii) could not further optimize the pre-defined dependency labels for downstream ABSA performance. Motivated by these limitations, we propose to induce syntactic information with supervision from the ABSA task itself.

To take syntax into account, we aim to induce the syntactic information necessary for the ABSA task through inductive biases. We first design a Constituent-based Transformer (ConsTrans) that groups tokens into constituents under supervision from the ABSA objective. We argue that the formation of constituents provides a hierarchical structure of the sentence that is well suited to sentiment analysis. For example, in the sentence "Chinese dumplings in this restaurant taste very good" with the aspect term "Chinese dumplings", it is important to accurately associate the phrase "taste very good" with the aspect.

Next, as seen in Figure 1, even though the dependency graph structures of the two sentences are identical, the sentiment towards "Chelsea" is positive in the sentence on the left and negative in the one on the right. The type of syntactic relationship between tokens is therefore useful for identifying the sentiment towards the aspect term. Hence, we further extend ConsTrans into a Relational Constituent-based Transformer (RelConsTrans) that learns relation embeddings between every pair of tokens in the input sentence. We find that simply adding relation embeddings fails to outperform ConsTrans. Inspired by Liu et al. (2019), we therefore extend RelConsTrans to supervise the relation embeddings with an auxiliary label generator (RelConsTransLG). In previous work (e.g. Bai et al., 2020; Huang & Carley, 2019), the dependency parser played the role of the auxiliary label generator; however, such parsers were not trained to provide auxiliary labels meant to improve ABSA. RelConsTransLG enables us to train the auxiliary label generator alongside the primary task, so that the generated auxiliary labels directly enhance ABSA performance.

We evaluate our models on five data sets: restaurant and laptop reviews (Pontiki et al., 2014), the ACL14 Twitter14 data (Dong et al., 2014), and Twitter15 and Twitter17 from a multi-modal ABSA data set (Yu & Jiang, 2019). Compared against previous work that used dependency parsers, our models outperform them on all three Twitter data sets and match previous work closely on the review data sets, even without using a constituent or dependency parser.

2. MODEL FORMULATION

Given a sentence of m tokens, s = {w_0, ..., w_{m-1}}, and a target aspect t = {w_j, ..., w_{j+q-1}} of length q, the objective of ABSA is to predict the sentiment polarity y ∈ {negative, neutral, positive} towards the target aspect t mentioned in sentence s. In all our models, we use the pretrained BERT model (BERT-base-uncased; Devlin et al., 2018) to obtain contextual embeddings as inputs to our model, and we fine-tune it together with the model. We format the input to BERT as a sentence pair: [CLS] + s + [SEP] + t + [SEP]. We represent each token w_i with the representation h^{bert,12}_i obtained from the last (12th) layer of BERT as input to our model. Our base model is a 4-layer transformer over this representation, similar to the baseline Transformer(B) in Bai et al. (2020). In the rest of this section, we describe the modifications we make to this transformer to build our three proposed models: ConsTrans, RelConsTrans and RelConsTransLG.
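The sentence-pair input format above can be sketched as follows. This is a minimal illustration at the token level only; the function name and the toy aspect-span arguments are ours, and in practice the sequence would be produced by the BERT-base-uncased WordPiece tokenizer rather than assembled by hand.

```python
def format_absa_input(sentence_tokens, aspect_start, aspect_len):
    """Build the BERT sentence-pair input [CLS] + s + [SEP] + t + [SEP],
    where the target aspect t is the span of q tokens starting at
    position j inside the sentence s (illustrative sketch, not the
    authors' code)."""
    t = sentence_tokens[aspect_start:aspect_start + aspect_len]
    return ["[CLS]"] + sentence_tokens + ["[SEP]"] + t + ["[SEP]"]

# Example from the paper: aspect term "Chinese dumplings" (j = 0, q = 2)
s = "Chinese dumplings in this restaurant taste very good".split()
pair_input = format_absa_input(s, aspect_start=0, aspect_len=2)
```

Each position in this sequence would then be replaced by its contextual embedding from the last BERT layer before entering the 4-layer transformer.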

2.1. CONSTITUENT-BASED TRANSFORMER (ConsTrans)

ConsTrans contains a stack of 4 Transformer encoder layers (Vaswani et al., 2017), each with a Multi-Head Attention (MHA) sub-layer and a position-wise feed-forward sub-layer. As illustrated in Figure 2a, the encoder stack of ConsTrans is split into two parts: the lower layers and the upper layers. In all our experiments, we use 2 layers each in the lower and upper parts. The main difference between a vanilla Transformer network and ConsTrans is that the attention scores computed in the MHA layer between a pair of tokens are adjusted according to the probability that the two tokens belong to the same constituent. In the lower layers, attention weights are adjusted so that greater weight is assigned to tokens within the same constituent. This adjustment is not imposed at the upper layers of the encoder, allowing for longer-range interactions. Figure 2b shows a single encoder layer from the lower layers of the encoder stack. The shaded region in the figure, which highlights the difference from a vanilla Transformer encoder layer, contains three components: the MHA, which provides the vanilla attention scores; the constituent probability scorer; and the adjusted MHA scorer, which computes the final attention. Kim et al. (2020b) found that tokens from the same constituent tend to exhibit similar attention distributions. Hence, we propose to estimate the probability that a pair of tokens belong to the same constituent from the similarity of their attention distributions. We use the scaled dot-product attention (Vaswani et al., 2017) in the MHA layer to first obtain the attention
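The mechanism of a lower ConsTrans layer can be sketched as below. The exact constituent probability scorer is not specified in this excerpt, so this sketch makes an assumption: following Kim et al. (2020b)'s observation, it estimates the probability that tokens i and j share a constituent from the dot-product similarity of their vanilla attention distributions, then uses that probability to rescale and renormalize the attention weights for a single head.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def constituent_adjusted_attention(Q, K, V):
    """Single-head sketch of a lower ConsTrans layer's adjusted attention.
    The specific form of the constituent probability is our assumption,
    not the authors' formula: similarity of attention distributions is
    used as a proxy for two tokens belonging to the same constituent."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))            # vanilla attention, shape (m, m)
    sim = A @ A.T                                # pairwise similarity of attention rows
    P = sim / sim.sum(axis=-1, keepdims=True)    # assumed constituent probability
    adj = A * P                                  # upweight same-constituent pairs
    adj = adj / adj.sum(axis=-1, keepdims=True)  # renormalize rows to sum to 1
    return adj @ V

# Toy usage: 4 tokens, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = constituent_adjusted_attention(Q, K, V)
```

In the upper layers, the vanilla attention A would be used directly, so long-range interactions across constituent boundaries are not down-weighted there.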



Figure 1: Dependency parse labels as auxiliary labels that help sentiment disambiguation. Tokens in bold and underlined are the aspect terms. Example taken from Bai et al. (2020).

(a) ConsTrans Encoder Stack: dotted arrows refer to lower attention weights between tokens from different constituents. (b) A lower ConsTrans layer: the shaded region is different from the vanilla Transformer.

