LEARNING TO ESTIMATE SINGLE-VIEW VOLUMETRIC FLOW MOTIONS WITHOUT 3D SUPERVISION

Abstract

We address the challenging problem of jointly inferring the 3D flow and the volumetric densities moving in a fluid from a monocular input video with a deep neural network. Despite the complexity of this task, we show that it is possible to train the corresponding networks without any 3D ground truth. In the absence of ground truth data, we can train our model with observations from real-world capture setups instead of relying on synthetic reconstructions. We make this unsupervised training possible by first generating an initial prototype volume, which is then transported over time without the need for volumetric supervision. Our approach relies purely on image-based losses, an adversarial discriminator network, and regularization. Our method estimates long-term sequences in a stable manner while closely matching the targets for inputs such as rising smoke plumes.

1. INTRODUCTION

Estimating motion is a fundamental problem that has been studied in a variety of settings in two and three dimensions (Ranjan & Black, 2017; Hur & Roth, 2021; Gregson et al., 2014). It is also a highly challenging problem, since the motion u is a secondary quantity that typically cannot be measured directly and has to be recovered from changes observed in transported markers ρ. We focus on volumetric, momentum-driven materials like fluids, where, in contrast to the single-step estimation of optical flow (OF), motion estimation typically considers multiple coupled steps to achieve a stable global transport. Furthermore, in this setting the volumetric distribution of the markers ρ is usually unknown and needs to be reconstructed from the observations in parallel to the motion estimation. So far, most research has focused on the reconstruction of single scenes. Classic methods use an optimization process working with an explicit volumetric representation (Eckert et al., 2019; Zang et al., 2020; Franz et al., 2021), while some more recent approaches optimize single scenes with neural fields (Mildenhall et al., 2020; Chu et al., 2022). As such an optimization is typically extremely costly and has to be redone for each new scene, training a neural network to infer an estimate of the motion in a single pass is very appealing. Similar to most direct optimizations, existing neural network methods rely on multiple input views to simplify the reconstruction (Qiu et al., 2021). However, this severely limits the settings in which inputs can be captured, as a fully calibrated lab environment is often the only place where such input sequences can be recorded. This makes motion estimation from a single view a highly attractive direction, and physical priors in the form of the governing equations make it possible in the context of fluids (Eckert et al., 2018; Franz et al., 2021).
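To make the transport relation concrete: the marker field ρ is passively carried by the velocity u, so u is only observable through the changes it induces in ρ from frame to frame. The following is a minimal, hypothetical sketch of one such transport step on a 2D grid (the paper's setting is 3D), using a simple semi-Lagrangian backtrace with bilinear interpolation; the function name and grid layout are illustrative assumptions, not part of the described method.

```python
# Hypothetical sketch: one semi-Lagrangian transport step of a marker
# field rho by a velocity field u (2D for brevity; the actual task is 3D).
import numpy as np

def advect(rho, u, dt=1.0):
    """Transport density rho (H, W) by velocity u (H, W, 2) for one step."""
    h, w = rho.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Backtrace: each cell samples rho at the position it came from.
    src_y = np.clip(ys - dt * u[..., 0], 0, h - 1)
    src_x = np.clip(xs - dt * u[..., 1], 0, w - 1)
    y0 = np.floor(src_y).astype(int); x0 = np.floor(src_x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1);   x1 = np.minimum(x0 + 1, w - 1)
    fy = src_y - y0; fx = src_x - x0
    # Bilinear interpolation of the backtraced samples.
    top = rho[y0, x0] * (1 - fx) + rho[y0, x1] * fx
    bot = rho[y1, x0] * (1 - fx) + rho[y1, x1] * fx
    return top * (1 - fy) + bot * fy

# A uniform rightward velocity moves a blob of density one cell to the right.
rho = np.zeros((8, 8)); rho[4, 2] = 1.0
u = np.zeros((8, 8, 2)); u[..., 1] = 1.0  # one cell per step in x
rho_next = advect(rho, u)
```

Chaining many such steps is what makes the estimation multi-step and coupled: an error in u at one frame is transported into all subsequent frames, which is why single-step optical-flow formulations do not directly apply.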
Nonetheless, despite the use of strong priors, a single viewpoint makes it challenging to adequately handle the otherwise fully unconstrained depth dimension. We target a deep learning-based approach in which a neural network learns to represent the underlying motion structures, such that near-instantaneous, single-pass motion inference becomes possible without relying on ground truth motion data. The latter is especially important for complex volumetric

