SO-TVAE: SENTIMENT-ORIENTED TRANSFORMER-BASED VARIATIONAL AUTOENCODER NETWORK FOR LIVE VIDEO COMMENTING

Abstract

Automatic live video commenting has received increasing attention due to its significance in narration generation, topic explanation, etc. However, current methods do not consider the sentiment of the generated comments. Sentimental factors are critical in interactive commenting, yet remain under-explored. Thus, in this paper, we introduce and investigate a new task, namely sentiment-guided automatic live video commenting, which aims to generate live video comments under sentiment guidance. To address this problem, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch-attention module. Specifically, the sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance; its output is then fused with cross-modal features to generate live video comments. Furthermore, a batch-attention module is proposed to alleviate the problem of missing sentimental samples caused by data imbalance, which is common in live videos since video popularity varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in terms of both the quality and the diversity of generated comments. Related code will be released.

1. INTRODUCTION

Live video commenting, commonly known as "danmaku" or "bullet screen", is a new interactive mode on online video websites that allows viewers to write real-time comments and interact with each other. Recently, the automatic live video commenting (ALVC) task, which aims to generate real-time video comments for viewers, has been increasingly applied to narration generation, topic explanation, and video science popularization, as it can help attract the attention and discussion of viewers. However, previous works aim to generate factual and objective comments without considering sentimental factors. In real-world applications, it is difficult for comments with a single sentiment to resonate with everyone (Wang & Zong, 2021; Yan et al., 2021). On the other hand, sentiment-guided comments would help video labeling and distribution, and further encourage human-interactive comment generation. Thus, in this paper, we introduce and investigate a new task, namely sentiment-guided automatic live video commenting, which aims to generate live video comments under sentiment guidance. Two major difficulties make this task extremely challenging. Firstly, sentiment, comment, and video are heterogeneous modalities, and all three are important in sentiment-guided live video comment generation. Previous works focus on sentiment-guided text generation (Li et al., 2021; Kim et al., 2022; Sabour et al., 2021) or video comment generation (Ma et al., 2019; Wang et al., 2020), but cannot handle the more complex setting that requires considering all three modalities simultaneously. Secondly, the imbalance of video data (Wu et al., 2021) hinders the generation of comments with the desired sentiment. This imbalance lies in two aspects.
On the one hand, as video popularity varies, the number of comments differs greatly between videos; on the other hand, within a given video, comments usually follow a dominant sentimental trend and lack comments of other sentiments. To this end, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch-attention module, to address the above two challenges. The proposed sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance. Specifically, we first leverage a Gaussian mixture distribution mapping, guided by sentimental information, to learn diverse comments with different sentiments. In addition, to prevent the VAE from ignoring the input information and directly learning the mapping of the latent space, we propose a novel sentiment-oriented random mask encoding mechanism that balances the learning ability of the model across modalities and further improves performance. Moreover, a batch-attention module is proposed to alleviate the data imbalance problem in live video. We apply batch-level attention in parallel with multi-modal attention to process the multi-modal features along both the batch dimension and the spatial dimension. In this way, the batch-attention module mitigates the gap between samples, explores the sample relationships within a mini-batch, and further improves the quality and diversity of the generated comments.

The main contributions of this work are as follows. Firstly, we introduce and investigate a new task, namely sentiment-guided automatic live video commenting, which aims to generate live video comments under sentiment guidance.
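To make these two mechanisms concrete, the following is a minimal pure-Python sketch (not the paper's implementation; the latent dimension, per-sentiment component parameters, and mask rate are all illustrative assumptions) of sentiment-conditioned sampling from a Gaussian mixture latent prior via the reparameterization trick, together with a random mask over the input tokens:

```python
import random

random.seed(0)

# Hypothetical per-sentiment Gaussian components of the mixture prior:
# each sentiment label indexes its own (mu, sigma) pair in latent space.
SENTIMENTS = {"positive": 0, "neutral": 1, "negative": 2}
LATENT_DIM = 4
component_mu = {s: [0.5 * i] * LATENT_DIM for s, i in SENTIMENTS.items()}
component_sigma = {s: [1.0] * LATENT_DIM for s in SENTIMENTS}

def sample_latent(sentiment):
    """Reparameterized draw z = mu + sigma * eps from the Gaussian
    component selected by the sentiment label, with eps ~ N(0, 1)."""
    mu = component_mu[sentiment]
    sigma = component_sigma[sentiment]
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def random_mask(tokens, mask_rate=0.3, mask_token="<mask>"):
    """Randomly mask input tokens so the decoder cannot simply copy
    the surrounding comments and ignore the latent code."""
    return [mask_token if random.random() < mask_rate else t for t in tokens]

z = sample_latent("positive")
masked = random_mask(["this", "scene", "is", "so", "funny"])
```

In a full model, `z` would be fused with the cross-modal features before decoding; here it only illustrates how the sentiment label selects the mixture component.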
To this end, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch-attention module. Secondly, we propose a sentiment-oriented diversity encoder module, which combines a VAE with a random mask mechanism to achieve semantic diversity and to align sentiment features with the language and video modalities under sentiment guidance. Meanwhile, we propose a batch-attention module for sample-relationship learning, to alleviate the problem of missing sentimental samples caused by data imbalance. Thirdly, we perform extensive experiments on two public datasets. The experimental results on all evaluation metrics show that our model outperforms state-of-the-art models in terms of both quality and diversity.
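As an illustration of the batch-attention idea, the sketch below (a simplified stand-in, not the paper's module; the feature vectors and dimensions are hypothetical) applies scaled dot-product attention across the batch dimension, so that each sample aggregates features from the other samples in its mini-batch:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def batch_attention(feats):
    """Scaled dot-product attention along the BATCH dimension: each
    sample's feature vector attends to every sample in the mini-batch,
    letting sentiment-poor samples borrow information from related
    samples. `feats` is a list of B feature vectors of size d."""
    d = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in feats]
        weights = softmax(scores)  # convex weights over the batch
        out.append([sum(w * v[j] for w, v in zip(weights, feats))
                    for j in range(d)])
    return out

mixed = batch_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output vector is a convex combination of the batch's feature vectors; in the full model this runs in parallel with the usual spatial multi-modal attention rather than replacing it.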



Figure 1: A live video commenting example from Livebot with selected video frames and live comments. Green: positive comments; black: neutral comments; blue: negative comments.

2.1 PROBLEM FORMULATION AND MULTI-MODAL ENCODER

Given a video clip V_t with k frames I_t = {I_1, . . . , I_t , . . . , I_k } and m surrounding comments S_t = {s_1 , s_2 , . . . , s_m }, which are first concatenated into a word sequence x_t = {x_1 , x_2 , . . . , x_p }, where p is the total number of words in S_t, automatic live video commenting aims to generate human-like comments at timestamp t. The overview of the framework is illustrated in Figure 2. Specifically, the input frames and surrounding comments are first encoded into initial features with a pre-trained convolutional neural network (ResNet) (He et al., 2016) and a long short-term memory network (LSTM) (Hochreiter & Schmidhuber, 1997), respectively. Then we employ a co-attention module to further enhance the feature encoding and generate the attended visual representations VI and the

