LEARNING FROM LABELED IMAGES AND UNLABELED VIDEOS FOR VIDEO SEGMENTATION

Abstract

Performance on video object segmentation still lags behind that of image segmentation due to a paucity of labeled videos. Annotations are time-consuming and laborious to collect, and may not be feasible to obtain in certain situations. However, there is a growing amount of freely available unlabeled video data, which has spurred interest in unsupervised video representation learning. In this work we focus on the setting in which there is little or no access to labeled videos for video object segmentation. To this end, we leverage large-scale image segmentation datasets and adversarial learning to train 2D/3D networks for video object segmentation. We first motivate treating images and videos as two separate domains by analyzing the performance gap of an image segmentation network trained on images and applied to videos. Through studies on several image and video segmentation datasets, we show how an adversarial loss placed at various locations within the network can make feature representations invariant to these domains and improve performance when the network has access only to labeled images and unlabeled videos. To prevent the loss of discriminative semantic class information, we apply our adversarial loss within clusters of features and show that this boosts our method's performance with Transformer-based models.

1. INTRODUCTION

Video object segmentation is attracting increasing attention due to its importance in many applications, such as robotics and video editing. Much progress has been made on general video understanding tasks such as action recognition, but dense labelling tasks such as video segmentation lag behind. This is mainly due to the laborious and time-consuming nature of collecting pixel annotations, resulting in small, sparsely annotated video datasets. To deal with this problem, researchers often use large-scale image segmentation datasets to learn semantic representations through pretraining. While image segmentation datasets are a useful source of labeled data for pretraining, the learned representations do not translate well to videos containing artifacts such as motion blur, low lighting, and low resolution. Figure 1 shows an example of the domain differences arising in common datasets, where the boundaries and small spokes of the moving bicycle are blurred. When image-pretrained models are applied directly to videos, a performance drop is observed in object detection (Kalogeiton et al., 2015; Tang et al., 2012) and in our own video segmentation experiments. Further, Kalogeiton et al. find that in addition to motion blur, the location of objects in the frame, the diversity of appearances and aspects, and camera framing all affect the performance of image-pretrained models on videos. Thus, supervised training on images alone is insufficient for pixel-wise video understanding, producing a need for a convenient alternative representation learning method that uses unlabeled videos.

Unsupervised video representation learning has mainly focused on the classification task rather than on segmentation. Some methods learn useful features from unlabeled videos with data augmentations (Behrmann et al., 2021), contrastive losses (Han et al., 2021), and pretext tasks such as frame shuffling (Xu et al., 2019).
The learned networks are successful on video classification but do not translate well to segmentation due to a loss of detail that occurs in the spatial bottleneck. Applying the same self-supervised methods to segmentation leads to poor performance because the learned features do not contain enough local information for the decoder to reconstruct the output. In this paper, we propose an approach to video segmentation that takes advantage of both labeled image segmentation data and unlabeled videos. Taking inspiration from Tang et al. (2012), we use unlabeled videos to minimize the domain difference between image representations and the spatial component of video representations. We train our video segmentation networks to be invariant to properties specific to video (motion blur, viewpoints, etc.) so that they can be trained on labeled images and applied to videos without a performance drop. To achieve this invariance, we insert a domain discriminator within a video segmentation network to discourage it from learning image- or video-specific features. We train the network for segmentation using labeled images while using unlabeled videos to adversarially train the discriminator, which predicts the domain of the sample, as shown in Figure 2. We experiment with two different segmentation backbones: convolutional neural networks (CNNs) and Transformer-based networks. To take advantage of temporal information in unlabeled videos while retaining spatial information from labeled images, we also apply our method to VideoSwin (Liu et al., 2021b) with a spatiotemporal window size. In our CNN we place the discriminator at the end of the encoder, prior to the decoding stage. In our Transformers we experiment with placing the discriminator either at the end of the encoder or after the patch embedding layer. We find that placing the discriminator after the patch embedding to target low-level features boosts the contribution of the adversarial loss.
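The adversarial objective described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' exact implementation: we assume a two-player objective in which the discriminator minimizes a binary cross-entropy over domain labels (image vs. video) while the segmentation encoder is trained to fool it, alongside a supervised segmentation loss on labeled images. The helper names (`discriminator_loss`, `encoder_adversarial_loss`, `total_encoder_loss`) and the weight `lam` are hypothetical.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy between predicted domain probabilities p and labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def discriminator_loss(p_image, p_video):
    """Discriminator is trained to output 0 for image features, 1 for video features."""
    p = np.concatenate([p_image, p_video])
    y = np.concatenate([np.zeros_like(p_image), np.ones_like(p_video)])
    return bce(p, y)

def encoder_adversarial_loss(p_image, p_video):
    """The encoder is updated to fool the discriminator (flipped labels),
    pushing image and video features toward a domain-invariant representation."""
    p = np.concatenate([p_image, p_video])
    y = np.concatenate([np.ones_like(p_image), np.zeros_like(p_video)])
    return bce(p, y)

def total_encoder_loss(seg_loss, p_image, p_video, lam=0.1):
    """Supervised segmentation loss on labeled images plus the weighted
    adversarial term computed from unlabeled videos."""
    return seg_loss + lam * encoder_adversarial_loss(p_image, p_video)
```

In practice the two objectives are often folded into a single backward pass with a gradient reversal layer; they are written out separately here for clarity.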
We conduct experiments using the video object segmentation (VOS) datasets DAVIS 2019 and FBMS and show that, in our target setting with no access to labeled videos, our method improves segmentation performance over models trained only on labeled images.
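The abstract notes that applying the adversarial loss within clusters of features helps preserve discriminative class information. A minimal sketch under our own assumptions: features are hard-assigned to precomputed centroids (e.g., from k-means over pooled features), and the discriminator's domain loss is averaged per cluster rather than over the whole batch. The function names and the probability-based interface are hypothetical, not the authors' implementation.

```python
import numpy as np

def assign_clusters(feats, centroids):
    """Hard-assign each feature vector to its nearest centroid (squared L2)."""
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def clusterwise_adversarial_loss(p, domains, assign, num_clusters, eps=1e-7):
    """Average a per-cluster domain BCE so domain invariance is enforced
    among semantically similar features rather than across the whole batch.

    p:       (N,) discriminator probabilities that each feature is from video
    domains: (N,) ground-truth domain labels, 0 = image, 1 = video
    assign:  (N,) cluster index of each feature
    """
    p = np.clip(p, eps, 1.0 - eps)
    per_sample = -(domains * np.log(p) + (1 - domains) * np.log(1 - p))
    cluster_losses = [per_sample[assign == k].mean()
                      for k in range(num_clusters) if np.any(assign == k)]
    return float(np.mean(cluster_losses))
```

Averaging within clusters weights each semantic group equally, so a dominant class cannot drown out the domain-alignment signal for rarer ones.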

2. RELATED WORK

For video object segmentation, a network must generate a pixel-wise classification for one or more moving target objects in a video. Labels must be consistent across frames but not necessarily across videos. In the unsupervised VOS track, no annotations are provided during inference. Without the target object annotation from the first frame, unsupervised VOS methods rely on knowledge from image pretraining or optical flow. Li et al. (2018) leverage embeddings from an instance segmentation network trained on still images to generate an embedding for each object in a scene, then use semantic scores and motion features from optical flow to select foreground object embeddings for a track. Similarly, RTNet (Ren et al., 2021) distinguishes foreground objects from moving distractors using a module that computes similarities between pairs of motion and appearance features from different objects. Recently, TransportNet (Zhang et al., 2021) aligns RGB and flow features in a two-stream network by optimizing the Wasserstein distance with a factorized Sinkhorn method. Yang et al. (2021c) perform this alignment using co-attention between RGB and flow. Yang et al. (2021a) rely solely on optical flow as the input to their segmentation network. Optical flow features can be cumbersome to generate prior to segmentation, so in our work we focus on learning strong semantic object representations from labeled images and unlabeled videos.

Much work has been done on self- and unsupervised learning for classification, but video segmentation representations have not been as well explored. Inspired by MoCo (He et al., 2020), VideoMoCo (Pan et al., 2021) adopts a momentum encoder but drops frames from videos to improve temporal robustness. Vi2CLR (Diba et al., 2021) learns features from unlabeled videos with separate 3D and 2D CNN encoders by clustering latent frame and clip features, then applying a contrastive loss between clusters.
Our method differs from these because it targets learning detailed segmentation features. Self-training (Zoph et al., 2020) is complementary to our method, as are the self-supervised



Figure 1: A single image from the COCO dataset (left) and a frame from DAVIS 2019 (right) show domain differences, such as motion blur, arising from video artifacts.

