LEARNING TO ESTIMATE SINGLE-VIEW VOLUMETRIC FLOW MOTIONS WITHOUT 3D SUPERVISION

Abstract

We address the challenging problem of jointly inferring the 3D flow and volumetric densities moving in a fluid from a monocular input video with a deep neural network. Despite the complexity of this task, we show that it is possible to train the corresponding networks without requiring any 3D ground truth for training. In the absence of such ground truth data, we can train our model with observations from real-world capture setups instead of relying on synthetic reconstructions. We make this unsupervised training approach possible by first generating an initial prototype volume, which is then moved and transported over time without the need for volumetric supervision. Our approach relies purely on image-based losses, an adversarial discriminator network, and regularization. Our method can estimate long-term sequences in a stable manner, while closely matching targets for inputs such as rising smoke plumes.

1. INTRODUCTION

Estimating motion is a fundamental problem and is studied for a variety of settings in two and three dimensions (Ranjan & Black, 2017; Hur & Roth, 2021; Gregson et al., 2014). It is also a highly challenging problem, since the motion u is a secondary quantity that typically cannot be measured directly and has to be recovered from changes observed in transported markers ρ. We focus on volumetric, momentum-driven materials like fluids, where, in contrast to the single-step estimation in optical flow (OF), motion estimation typically considers multiple coupled steps to achieve a stable global transport. Furthermore, in this setting the volumetric distribution of the markers ρ is usually unknown and needs to be reconstructed from the observations in parallel to the motion estimation. So far, most research has focused on the reconstruction of single scenes. Classic methods use an optimization process working with an explicit volumetric representation (Eckert et al., 2019; Zang et al., 2020; Franz et al., 2021), while some more recent approaches optimize single scenes with neural fields (Mildenhall et al., 2020; Chu et al., 2022). As such an optimization is typically extremely costly and has to be redone for each new scene, training a neural network to infer an estimate of the motion in a single pass is very appealing. Like most direct optimizations, existing neural network methods rely on multiple input views to simplify the reconstruction (Qiu et al., 2021). However, this severely limits the settings in which inputs can be captured, as a fully calibrated lab environment is often the only place where such input sequences can be recorded. The flexibility of motion estimation from single views makes it a highly attractive direction, and physical priors in the form of governing equations make this possible in the context of fluids (Eckert et al., 2018; Franz et al., 2021).
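The coupling between the observed markers ρ and the unknown motion u can be made explicit with the standard transport (advection) equation for a passive marker carried by the flow, typically paired with an incompressibility constraint as a physical prior. This is a generic formulation common to this line of work; the precise set of priors varies between individual methods:

```latex
% Passive transport of the marker density \rho by the velocity field \mathbf{u}:
\frac{\partial \rho}{\partial t} + \mathbf{u} \cdot \nabla \rho = 0
% Divergence-free constraint of incompressible flow, used as a physical prior:
\nabla \cdot \mathbf{u} = 0
```

Because only projections of ρ are observed, u is constrained solely through the changes it induces in ρ over time, which is what makes the joint estimation ill-posed for a single view.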
Nonetheless, despite the use of strong priors, the single viewpoint makes it challenging to adequately handle the otherwise fully unconstrained depth dimension. We target a deep learning-based approach where a neural network learns to represent the underlying motion structures, such that almost instantaneous, single-pass motion inference is made possible without relying on ground truth motion data. The latter is especially important for complex volumetric motions, as reference motions of real fluids cannot be acquired directly. Instead, one has to work with reconstructions or even simulated data, suffering from a mismatch between the observations and the synthetic motion data. While obtaining multiple calibrated captures for training is feasible, using the additional views only for the losses still leads to issues with the depth ambiguity of single-view inputs. In this work, we address the challenging problem of training neural networks to infer 3D motions from monocular videos in scenarios where no 3D reference data is available. To the best of our knowledge, we are the first to propose an end-to-end approach, denoted by Neural Global Transport (NGT) in the following, which (i) yields a neural network to estimate a global, dense 3D velocity from a single-view image sequence without requiring any 3D ground truth as targets. Among others, this is made possible by a custom 2D-to-3D U-Net architecture. (ii) We address the resulting depth-ambiguity problem using a new approach with differentiable rendering and an adversarial technique. (iii) A single network trained with the proposed approach generalizes across a range of different inputs, vastly outperforming optimization-based approaches in terms of runtime.
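To illustrate the principle of purely image-based supervision, the following is a minimal sketch, not the renderer or loss of the actual method: a density volume is projected to an image with a simplified absorption-only orthographic renderer, and an L2 loss compares the rendering against the observed frame. All function names here are hypothetical, and the full approach additionally involves perspective cameras, transport over time, and an adversarial term.

```python
import numpy as np

def render_ortho(density, axis=2):
    # Absorption-only orthographic projection: integrate density along the
    # depth axis and map accumulated density to opacity. A differentiable
    # renderer in practice would also model the camera and lighting.
    return 1.0 - np.exp(-density.sum(axis=axis))

def image_loss(density, target_image, axis=2):
    # L2 loss between the rendered view and the observed input frame;
    # gradients of such a loss can supervise the 3D volume without 3D data.
    rendered = render_ortho(density, axis)
    return np.mean((rendered - target_image) ** 2)

# A volume compared against its own projection yields zero loss.
vol = np.random.rand(16, 16, 16)
target = render_ortho(vol)
assert image_loss(vol, target) == 0.0
```

Note that many different volumes project to the same image along the depth axis; this is exactly the ambiguity that the adversarial discriminator on unseen views is meant to resolve.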

2. RELATED WORK

Optical flow.  Flow estimation is of great interest in a multitude of settings, from 2D optical flow over scene flow to the capture and physically accurate reconstruction of volumetric fluid flows. Operating on a pair of 2D images, optical flow estimates a motion that maps one image to the other (Sun et al., 2014). In this setting, multi-scale approaches in the form of spatial pyramids are a longstanding technique for handling displacements of different scales (Glazer, 1987). More recent CNN-based methods also employ spatial pyramids (Dosovitskiy et al., 2015; Ranjan & Black, 2017) and can learn to estimate optical flow in an unsupervised fashion (Ahmadi & Patras, 2016; Yu et al., 2016; Luo et al., 2021).

Scene flow.  Scene flow (Vedula et al., 1999) bridges the gap to 3D, where earlier approaches based on energy minimization or variational methods (Zhang & Kambhamettu, 2001; Huguet & Devernay, 2007) have been surpassed by CNNs that offer superior runtime performance while retaining state-of-the-art accuracy (Ilg et al., 2018; Saxena et al., 2019) and can also be trained without the need for ground truth data (Lee et al., 2019; Wang et al., 2019). Flow estimation from a single input is of particular importance, as it vastly simplifies data acquisition, and several methods for monocular scene flow have been proposed (Brickwedde et al., 2019; Yang & Ramanan, 2020; Luo et al., 2020). These can be un- or self-supervised (Ranjan et al., 2019; Hur & Roth, 2020) and benefit from using multiple frames (Hur & Roth, 2021).

3D reconstruction.  In the context of fluids it is common to also reconstruct an explicit representation of the transported quantities. Such 3D reconstruction is typically addressed for clearly visible surfaces (Musialski et al., 2013; Koutsoudis et al., 2014), where some of the proposed algorithms can incorporate deformations (Zang et al., 2018; Kato et al., 2018). In this context, volumetric reconstructions make use of voxel grids (Papon et al., 2013; Moon et al., 2018).

Fluid flows are traditionally extremely difficult to capture, and methods ranging from Schlieren imaging (Dalziel et al., 2000; Atcheson et al., 2008; 2009) and particle imaging velocimetry (PIV) (Grant, 1997; Elsinga et al., 2006; Xiong et al., 2017) over laser scanners (Hawkins et al., 2005; Fuchs et al., 2007) to structured light (Gu et al., 2013) and light-path approaches (Ji et al., 2013) all require specialized setups. The use of multiple commodity cameras simplifies the acquisition (Gregson et al., 2014; Eckert et al., 2019) and allows for view interpolation to create additional constraints (Zang et al., 2020). Few works have attempted to solve monocular flow estimation in the fluids setting. For single-scene optimization, Eckert et al. (2018) constrain the motion along the view depth, while Franz et al. (2021) use an adversarial approach to regularize the reconstruction from unseen views. Qiu et al. (2021) have proposed a network that can estimate a long-term motion from a single view, but it still requires 3D ground truth for training.

