VIDEOFLOW: A FRAMEWORK FOR BUILDING VISUAL ANALYSIS PIPELINES

Anonymous authors
Paper under double-blind review

Abstract

The past years have witnessed an explosion of deep learning frameworks such as PyTorch and TensorFlow, following the success of deep neural networks. These frameworks have significantly facilitated algorithm development in multimedia research and production. However, how to easily and efficiently build an end-to-end visual analysis pipeline with these algorithms remains an open issue. In most cases, developers have to spend a huge amount of time, alongside algorithm development, handling data input and output, optimizing computation efficiency, or even debugging exhausting memory leaks. VideoFlow aims to overcome these challenges by providing a flexible, efficient, extensible, and secure visual analysis framework for both academia and industry. With VideoFlow, developers can focus on improving the algorithms themselves as well as constructing a complete visual analysis workflow. VideoFlow has been incubated in smart-city innovation practice for more than three years and has been widely used in dozens of intelligent visual analysis systems. VideoFlow will be open-sourced at https://github.com/xxx/videoflow.

1. INTRODUCTION

The success of computer vision techniques is spawning intelligent visual analysis systems in real applications. Rather than serving individual models, these systems are typically powered by a workflow that decodes images/videos, runs several serial or parallel algorithm processing stages, and sinks the analysis results. The varied visual analysis requirements of different real-world scenarios create a strong demand for a framework that supports fast algorithm development, flexible pipeline construction, efficient workflow execution, and secure model protection.

Some existing frameworks approach a subset of these targets, such as DeepStream (Purandare, 2018) and MediaPipe (Lugaresi et al., 2019). DeepStream is built on top of GStreamer (GSTREAMER, 1999), which primarily targets audio/video media editing rather than analysis. MediaPipe can be used to build prototypes, polish them into cross-platform applications, and measure performance. Although it is flexible and extensible with respect to calculators, real online services in industry also require efficiency, model security, and extensibility in more aspects. In this paper, we present VideoFlow, which meets the visual analysis requirements of both algorithm development and deployment in real systems with the following highlights.

Flexibility. VideoFlow is designed around the stateful Computation Graph and the stateless Resource. The computation graph abstracts the visual processing workflow into a stateful directed acyclic graph, so that developers can focus on implementing processing units (graph nodes) and constructing the whole workflow. A Resource is a stateless computation module shared among computation graphs; the most typical resource is deep learning model inference. Resources decouple stateless visual processing components from the whole complicated visual analysis pipeline, helping developers focus on optimizing these computation- or Input/Output (IO)-intensive implementations.

Efficiency.
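The graph/resource split above can be sketched in a few lines of Python. Note that this is a minimal illustration under assumed names (`Node`, `Graph`, `Resource` are hypothetical and not VideoFlow's actual API): stateful nodes belong to one graph instance per video, while a stateless resource such as model inference is shared across graph instances.

```python
class Resource:
    """Stateless, shared computation (stand-in for model inference)."""
    def infer(self, frame):
        # Toy stand-in for a real detector; the paper's resource would
        # dispatch to a deep learning inference backend.
        return {"label": "person" if frame % 2 == 0 else "car"}

class Node:
    """Stateful processing unit (a node of one computation graph)."""
    def __init__(self, name, fn):
        self.name, self.fn, self.downstream = name, fn, []
    def link(self, node):
        self.downstream.append(node)
        return node
    def process(self, data, out):
        result = self.fn(data)
        if not self.downstream:          # sink node: emit results
            out.append((self.name, result))
        for nxt in self.downstream:
            nxt.process(result, out)

class Graph:
    """A directed acyclic graph of nodes, one instance per video."""
    def __init__(self, source):
        self.source = source
    def run(self, frames):
        out = []
        for f in frames:
            self.source.process(f, out)
        return out

# Build a two-stage pipeline: decode -> detect (shared Resource).
detector = Resource()                    # shared across graph instances
decode = Node("decode", lambda f: f)     # stateful per-video node
detect = Node("detect", detector.infer)
decode.link(detect)

results = Graph(decode).run(frames=[0, 1, 2])
```

Because `detector` lives outside any graph, several `Graph` instances (one per video stream) could share it, which is the decoupling the Resource abstraction provides.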
VideoFlow is designed for efficiency at four levels. (1) Resource level: resources aggregate the scattered computation requests of computation graph instances into intensive batched processing. (2) Video level: all videos are analyzed in parallel within a shared execution engine. (3) Frame level: video frames are processed in parallel for operations that do not depend on frame order. (4) Operator level: visual analysis is a multi-branch pipeline in most
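Resource-level aggregation, the first of the four levels, can be illustrated with a small sketch. The queueing and batching policy below is an assumption for illustration, not the paper's implementation: requests submitted by many graph instances accumulate in a queue and are drained in batches, so the backend runs one intensive batched call instead of many small ones.

```python
from queue import Queue

class BatchedResource:
    """Aggregates scattered inference requests into batched calls."""
    def __init__(self, max_batch=4):
        self.pending = Queue()
        self.max_batch = max_batch
        self.batch_sizes = []            # record batch sizes for inspection

    def submit(self, frame):
        """Called by individual computation graph instances."""
        self.pending.put(frame)

    def flush(self):
        """Drain up to max_batch requests and process them together."""
        batch = []
        while not self.pending.empty() and len(batch) < self.max_batch:
            batch.append(self.pending.get())
        if not batch:
            return []
        self.batch_sizes.append(len(batch))
        return [f * 2 for f in batch]    # stand-in for batched inference

res = BatchedResource(max_batch=4)
# Six requests arriving from different computation-graph instances:
for frame in range(6):
    res.submit(frame)
outputs = res.flush() + res.flush()      # two batched calls instead of six
```

With a batch limit of four, the six scattered requests collapse into two backend calls (sizes 4 and 2), which is the kind of aggregation that makes GPU inference far more efficient than per-request dispatch.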

