CLIP-FLOW: CONTRASTIVE LEARNING WITH ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW

Abstract

Synthetic datasets are often used to pretrain end-to-end optical flow networks, owing to the lack of large-scale labeled real-scene data. However, accuracy drops sharply when moving from synthetic to real scenes. How can we better transfer knowledge learned from the synthetic to the real domain? To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo-labeling framework that transfers pretraining knowledge to the target real domain. We leverage large-scale unlabeled real data to facilitate transfer learning under the supervision of iteratively updated pseudo ground-truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss between reference features and the features warped by pseudo ground-truth flows, which further reinforces accurate matches and dampens mismatches caused by motion, occlusion, or noisy pseudo labels. Adopting RAFT as the backbone, we obtain an F1-all error of 4.11% on the KITTI 2015 benchmark, i.e. a 19% error reduction from RAFT (5.10%), ranking 2nd at the time of submission. Our framework also extends to other models, e.g. CRAFT, reducing its F1-all error on the KITTI 2015 benchmark from 4.79% to 4.66%.
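The contrastive flow loss sketched above can be illustrated in a few lines. The snippet below is a minimal sketch, not the paper's exact implementation: `warp_features` resamples target-frame features back to the reference frame via the pseudo ground-truth flow (nearest-neighbor sampling for brevity; bilinear sampling would normally be used), and an InfoNCE-style objective treats each reference feature and its warped counterpart as a positive pair, with all other locations as negatives. The function names and the temperature value are assumptions for illustration.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp target-frame features to the reference frame using a flow field.
    feat: (H, W, C) features of frame 2; flow: (H, W, 2) pseudo-GT flow
    mapping each frame-1 pixel to its frame-2 location.
    Nearest-neighbor sampling is used here for simplicity."""
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[y2, x2]

def contrastive_flow_loss(f1, f2_warped, tau=0.07):
    """InfoNCE-style loss: each reference feature should match the feature
    the pseudo flow warps to it (positive, on the diagonal) and not other
    locations (negatives). f1, f2_warped: (N, C) L2-normalized features."""
    logits = f1 @ f2_warped.T / tau                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives on diagonal
```

In practice the warped features come from the second frame's encoder, and occluded or low-confidence pixels would be masked out before the loss is computed.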

1. INTRODUCTION

Optical flow is critical to many high-level vision problems, such as action recognition (Simonyan & Zisserman, 2014; Sevilla-Lara et al., 2018; Sun et al., 2018b), video segmentation (Yang et al., 2021; Yang & Ramanan, 2021) and editing (Bonneel et al., 2015), and autonomous driving (Janai et al., 2020). Traditional methods (Horn & Schunck, 1981; Menze et al., 2015; Ranftl et al., 2014; Zach et al., 2007) mainly formulate flow estimation as an optimization problem over hand-crafted features, searching the space of dense displacement fields between a pair of input images, which is often time-consuming. Recently, data-driven deep learning methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020) have proved successful at estimating optical flow, thanks to the availability of high-quality synthetic datasets (Butler et al., 2012b; Dosovitskiy et al., 2015; Mayer et al., 2016; Krispin et al., 2016). Most recent works (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020; Jeong et al., 2022) train mainly on synthetic datasets, since there are no sufficiently large labeled real optical flow datasets for training deep models. State-of-the-art (SOTA) models consistently achieve more accurate results on synthetic datasets such as Sintel (Butler et al., 2012a) than on real-scene datasets such as KITTI 2015 (Menze & Geiger, 2015). This is mainly because the model tends to overfit the small training set, which echoes Tab. 1: for all previous SOTA methods, there is a large gap between training and test F1-all error when training and testing on the KITTI dataset. We therefore argue that this performance gap stems from the dearth of real training data and the large distribution gap between synthetic and real-scene data.
Although such models can explain all kinds of synthetic data almost perfectly, they perform rather unsatisfactorily on real data. Our work focuses on bridging this glaring performance gap between synthetic and real-scene data. As in previous data-driven approaches, smarter and longer training strategies prove beneficial and yield better optical flow results. Through our work we also seek answers to the following two questions: (i) How can we take advantage of current SOTA optical flow models to further consolidate gains on real datasets? (ii) How can we use semi-supervised learning, together with contrastive feature representation learning, to effectively utilize the huge amount of unlabeled real data at our disposal?
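One practical ingredient of such a semi-supervised pipeline is filtering the pseudo flow labels generated on unlabeled real frames before they are reused as supervision. The sketch below shows a forward-backward consistency check, a common heuristic for this purpose; the function name and threshold are our own illustrative choices, not necessarily the paper's exact criterion.

```python
import numpy as np

def consistency_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Keep a frame-1 pixel as a trustworthy pseudo label when the backward
    flow, sampled at the frame-2 location its forward flow points to, nearly
    cancels the forward flow. flow_fwd, flow_bwd: (H, W, 2) fields.
    Returns a boolean (H, W) mask (True = keep)."""
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Frame-2 coordinates each frame-1 pixel maps to (rounded, clipped).
    x2 = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, H - 1)
    bwd_at_target = flow_bwd[y2, x2]          # backward flow at matched pixel
    err = np.linalg.norm(flow_fwd + bwd_at_target, axis=-1)
    return err < thresh
```

Pixels rejected by such a mask (typically occlusions or unreliable matches) would be excluded both from the pseudo-label supervision and from the positive pairs of a contrastive objective.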

