LEARNING FROM LABELED IMAGES AND UNLABELED VIDEOS FOR VIDEO SEGMENTATION

Abstract

Performance on video object segmentation still lags behind that of image segmentation due to a paucity of labeled videos. Annotations are time-consuming and laborious to collect, and in some situations cannot feasibly be obtained at all. However, there is a growing amount of freely available unlabeled video data, which has spurred interest in unsupervised video representation learning. In this work we focus on the setting in which there is little or no access to labeled videos for video object segmentation. To this end, we leverage large-scale image segmentation datasets and adversarial learning to train 2D/3D networks for video object segmentation. We first motivate treating images and videos as two separate domains by analyzing the performance gap of an image segmentation network trained on images and applied to videos. Through studies on several image and video segmentation datasets, we show how an adversarial loss placed at various locations within the network can make feature representations invariant to these domains and improve performance when the network has access only to labeled images and unlabeled videos. To prevent the loss of discriminative semantic class information, we apply our adversarial loss within clusters of features and show that this boosts our method's performance with Transformer-based models.

1. INTRODUCTION

Video object segmentation is attracting increasing attention due to its importance in applications such as robotics and video editing. Much progress has been made on general video understanding tasks such as action recognition, but progress lags on dense labeling tasks such as video segmentation. This is mainly due to the laborious and time-consuming nature of collecting pixel annotations, resulting in small, sparsely annotated video datasets. To deal with this problem, researchers often use large-scale image segmentation datasets to learn semantic representations through pretraining. While image segmentation datasets are a useful source of labeled data for pretraining, the learned representations do not translate well to videos, which contain artifacts such as motion blur, low lighting, and low resolution. Figure 1 shows an example of the domain differences arising in common datasets, where the boundaries and small spokes of the moving bicycle are blurred. When image-pretrained models are applied directly to videos, a performance drop is observed in object detection (Kalogeiton et al., 2015; Tang et al., 2012) and in our own video segmentation experiments. Further, Kalogeiton et al. find that in addition to motion blur, the location of objects in the frame, the diversity of appearances and aspects, and camera framing all affect the performance of image-pretrained models on videos. Thus, supervised training on images is insufficient for pixel-wise video understanding, producing a need for a convenient alternative representation learning method that uses unlabeled videos.

Unsupervised video representation learning has mainly focused on the classification task rather than on segmentation. Some methods learn useful features from unlabeled videos with data augmentations (Behrmann et al., 2021), contrastive losses (Han et al., 2021), and pretext tasks such as frame shuffling (Xu et al., 2019).
The learned networks are successful on video classification but do not translate well to segmentation, due to the loss of spatial detail that occurs in the network's bottleneck: the features learned by these self-supervised methods do not retain enough local information for a decoder to reconstruct a dense, pixel-wise output.
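The domain-adversarial idea mentioned in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy toy, not the paper's actual architecture: a linear domain classifier is trained to separate "image" features (label 0) from "video" features (label 1), while the features themselves receive the reversed, scaled gradient, as in gradient-reversal training, nudging the two domains toward a shared representation.

```python
import numpy as np

def domain_adversarial_step(feats_img, feats_vid, w, lam=0.1, lr=0.05):
    """One illustrative update step (hypothetical sketch, not the paper's code).

    The linear classifier w descends the domain-classification loss; the
    features ascend it (gradient reversal, scaled by lam), which pushes
    image and video features toward domain invariance.
    """
    X = np.vstack([feats_img, feats_vid])
    y = np.concatenate([np.zeros(len(feats_img)), np.ones(len(feats_vid))])
    p = 1.0 / (1.0 + np.exp(-X @ w))                    # predicted P(domain = video)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross-entropy
    grad_w = X.T @ (p - y) / len(y)                     # dloss/dw
    grad_X = np.outer(p - y, w) / len(y)                # dloss/dX
    w_new = w - lr * grad_w                             # classifier minimizes loss
    X_new = X + lr * lam * grad_X                       # features maximize it (reversed)
    n = len(feats_img)
    return X_new[:n], X_new[n:], w_new, loss

rng = np.random.default_rng(0)
feats_img = rng.normal(loc=+1.0, size=(8, 4))  # toy "image-domain" features
feats_vid = rng.normal(loc=-1.0, size=(8, 4))  # toy "video-domain" features
w = np.zeros(4)
for _ in range(3):
    feats_img, feats_vid, w, loss = domain_adversarial_step(feats_img, feats_vid, w)
```

In practice such a discriminator operates on intermediate network features rather than raw vectors, and the reversal is implemented inside the autodiff graph, but the min-max structure is the same.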

