CLIP-FLOW: CONTRASTIVE LEARNING WITH ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW

Abstract

Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled real-scene data. However, accuracy drops significantly when moving from synthetic to real scenes. How can we better transfer the knowledge learned from synthetic domains to real ones? To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo-labeling framework that transfers pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning under the supervision of iteratively updated pseudo ground-truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss on reference features and the features warped by pseudo ground-truth flows, to further boost accurate matching and dampen mismatches caused by motion, occlusion, or noisy pseudo labels. We adopt RAFT as our backbone and obtain an F1-all error of 4.11% on the KITTI 2015 benchmark, i.e., a 19% error reduction from RAFT (5.10%), ranking 2nd at the time of submission. Our framework can also be extended to other models, e.g., CRAFT, reducing its F1-all error from 4.79% to 4.66% on the KITTI 2015 benchmark.

1. INTRODUCTION

Optical flow is critical to many high-level vision problems, such as action recognition (Simonyan & Zisserman, 2014; Sevilla-Lara et al., 2018; Sun et al., 2018b), video segmentation (Yang et al., 2021; Yang & Ramanan, 2021) and editing (Bonneel et al., 2015), and autonomous driving (Janai et al., 2020). Traditional methods (Horn & Schunck, 1981; Menze et al., 2015; Ranftl et al., 2014; Zach et al., 2007) mainly formulate flow estimation as an optimization problem over hand-crafted features, searching the space of dense displacement fields between a pair of input images, which is often time-consuming. Recently, data-driven deep learning methods (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020) have proved successful at estimating optical flow, thanks to the availability of high-quality synthetic datasets (Butler et al., 2012b; Dosovitskiy et al., 2015; Mayer et al., 2016; Krispin et al., 2016). Most recent works (Dosovitskiy et al., 2015; Ilg et al., 2017; Teed & Deng, 2020; Jeong et al., 2022) train mainly on synthetic datasets, given that there are not enough labeled real optical flow datasets to train a deep learning model. State-of-the-art (SOTA) models consistently obtain more accurate results on synthetic datasets like Sintel (Butler et al., 2012a) than on real-scene datasets like KITTI 2015 (Menze & Geiger, 2015). This is mainly because the model tends to overfit the small training set, which is echoed in Tab. 1: for all previous SOTA methods, there is a large gap between the training and test F1-all errors when training and testing on KITTI. We therefore argue that this performance gap stems from the dearth of real training data and from the large distribution gap between synthetic and real-scene data.
Although such models can explain all kinds of synthetic data almost perfectly, they perform rather unsatisfactorily on real data. Our work focuses on bridging this glaring performance gap between synthetic and real-scene data. As in previous data-driven approaches, smarter and longer training strategies prove beneficial and help obtain better optical flow results. Through our work we also try to answer the following two questions: (i) How can we take advantage of current SOTA optical flow models to further consolidate gains on real datasets? (ii) How can we use semi-supervised learning along with contrastive feature representation learning to effectively utilize the huge amount of unlabeled real data at our disposal? Unsupervised visual representation learning (He et al., 2020; Chen et al., 2020b) has proved successful in boosting most major vision tasks, such as image classification, object detection, and semantic segmentation, to name a few. Works such as (He et al., 2020; Chen et al., 2020b) also emphasize the importance of the contrastive loss when dealing with huge, dense datasets. Given that optical flow tasks generally lack real ground-truth labels, we ask whether leveraging unsupervised visual representation learning can boost optical flow performance. To answer this question, we examine the impact of contrastive learning and pseudo labeling during training under a semi-supervised setting. We conduct exhaustive experiments on the KITTI-Raw (Geiger et al., 2013) and KITTI 2015 (Menze & Geiger, 2015) datasets to evaluate the performance gain, and show encouraging results. We believe the observed gain reflects the fact that representation learning techniques such as contrastive learning help produce a much more refined 4D cost correlation volume.
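To make the idea of a contrastive flow loss concrete, the following is a minimal NumPy sketch, not the paper's implementation: target features are warped back by a pseudo ground-truth flow (here with simple nearest-neighbour sampling), and an InfoNCE-style objective treats each reference feature and its warped counterpart as a positive pair, with features at all other locations as negatives. All function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map (H, W, C) by a flow field (H, W, 2) using
    nearest-neighbour sampling; out-of-bounds targets clamp to the border.
    (A real implementation would typically use bilinear sampling.)"""
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    y2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[y2, x2]

def contrastive_flow_loss(feat1, feat2, pseudo_flow, tau=0.07):
    """InfoNCE-style loss: each reference feature should match the feature
    warped back from frame 2 by the pseudo ground-truth flow (positive)
    and not the features at other pixel locations (negatives)."""
    H, W, C = feat1.shape
    warped = warp_features(feat2, pseudo_flow)
    f1 = feat1.reshape(-1, C)
    f2 = warped.reshape(-1, C)
    f1 = f1 / (np.linalg.norm(f1, axis=1, keepdims=True) + 1e-8)
    f2 = f2 / (np.linalg.norm(f2, axis=1, keepdims=True) + 1e-8)
    logits = f1 @ f2.T / tau                       # (HW, HW) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positive pairs on the diagonal
```

In practice the loss would be computed on downsampled encoder features and masked by pseudo-label confidence, so that occluded or noisy pixels contribute less.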
To constrain the flow per pixel, we employ a simple positional encoding of 2D Cartesian coordinates on the input frames, as suggested in (Liu et al., 2018), which further consolidates the gain achieved by contrastive learning. We follow this with iterative flow-refinement training using pseudo labeling, which builds on the previous gains to give us SOTA results. We also highlight that we follow a specific, well-calibrated training strategy to fully exploit the gains of our method. Without loss of generality, we use RAFT (Teed & Deng, 2020) as the backbone network in our experiments. To compare fairly with existing SOTA methods, we test our proposed method on the KITTI 2015 test set, where we achieve the best F1-all error among all published methods by a significant margin. To summarize, our main contributions are: 1) We provide a detailed training strategy that uses SSL methods on top of the well-known RAFT model to improve SOTA performance for optical flow estimation. 2) We present ways to employ contrastive learning and pseudo labeling effectively, so that they jointly improve upon existing benchmarks. 3) We discuss the positive impact of a simple 2D positional encoding, which benefits flow training on both the Sintel and KITTI 2015 datasets.
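A positional encoding of 2D Cartesian coordinates in the style of (Liu et al., 2018) can be sketched as appending two normalized coordinate channels to the input frame; this is a minimal illustration under that assumption, with the function name and [-1, 1] normalization chosen for exposition rather than taken from the paper.

```python
import numpy as np

def add_coord_channels(img):
    """Append two channels of x / y pixel coordinates, normalized to [-1, 1],
    to an image of shape (H, W, C), CoordConv-style."""
    H, W, _ = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    xs = 2.0 * xs / max(W - 1, 1) - 1.0   # column index -> [-1, 1]
    ys = 2.0 * ys / max(H - 1, 1) - 1.0   # row index    -> [-1, 1]
    return np.concatenate([img, xs[..., None], ys[..., None]], axis=-1)
```

The extra channels give the network an explicit notion of where each pixel sits in the frame, which is what allows the flow to be constrained per pixel.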

2. RELATED WORK

Optical flow estimation. Maximizing visual similarity between neighboring frames by formulating the problem as energy minimization (Black & Anandan, 1993; Bruhn et al., 2005; Sun et al., 2014) has been the primary approach to optical flow estimation. Previous works such as (Dosovitskiy et al., 2015; Ilg et al., 2017; Ranjan & Black, 2017; Sun et al., 2018a; 2019; Hui et al., 2018; 2020; Zou et al., 2018) have established the efficacy of deep neural networks in estimating optical flow under both supervised and self-supervised settings. Iterative improvements in model architecture and better regularization terms have been primarily responsible for the better results. However, most of these works struggle to handle occlusion and small, fast-moving objects, to capture global motion, and to recover from early mistakes. To overcome these limitations, Teed & Deng (2020) proposed RAFT, which adopts a learning-to-optimize strategy using a recurrent GRU-based decoder to iteratively update a flow field f initialized at zero. Inspired by the success of RAFT, a number of variants have emerged, such as CRAFT (Sui et al., 2022), GMA (Jiang et al., 2021), Sparse Volume RAFT (Jiang et al., 2021), and FlowFormer (Huang et al., 2022), all of which benefit from the all-pairs correlation volume for estimating optical flow. The current state-of-the-art RAFT-OCTC (Jeong et al., 2022) also uses a RAFT-based architecture and imposes consistency over various proxy tasks to improve flow estimation. Considering RAFT's effectiveness, generalizability, and relatively small model size, we adopt RAFT (Teed & Deng, 2020) as our base architecture and employ semi-supervised iterative pseudo labeling together with the contrastive flow loss to achieve state-of-the-art results on the KITTI 2015 (Menze & Geiger, 2015) benchmark.

Semi-Supervised and Representation Learning. Semi-supervised learning (SSL) and representation learning have shown success on a range of computer vision tasks, both during pretext-task training and on specific downstream tasks. Most of these methods leverage contrastive learning (Chen et al., 2020a; b; He et al., 2020), clustering (Caron et al., 2020), and pseudo labeling (Caron et al., 2021; Chen & He, 2021; Grill et al., 2020; Hoyer et al., 2021) as an enforcing mechanism. Recent works such as (Caron et al., 2020; 2021; Chen et al., 2020a; b; 2021; Grill et al., 2020; He
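The all-pairs correlation volume shared by these RAFT-style architectures can be sketched in a few lines of NumPy: every feature vector in frame 1 is dotted with every feature vector in frame 2, yielding a 4D volume that the recurrent decoder then looks up at candidate displacements. This is an illustrative sketch of the general construction, not code from any of the cited models; the function name and the 1/sqrt(C) scaling follow common practice and are assumptions here.

```python
import numpy as np

def all_pairs_correlation(feat1, feat2):
    """Build the 4D all-pairs correlation volume used by RAFT-style models:
    corr[i, j, k, l] = <feat1[i, j], feat2[k, l]> / sqrt(C),
    for feature maps of shape (H, W, C)."""
    H, W, C = feat1.shape
    corr = np.einsum('ijc,klc->ijkl', feat1, feat2)
    return corr / np.sqrt(C)
```

Because the volume covers all pixel pairs, it captures arbitrarily large displacements in one shot; RAFT makes this tractable by computing it once at 1/8 resolution and pooling it into a pyramid for lookups.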

