VIDEOFLOW: A FRAMEWORK FOR BUILDING VISUAL ANALYSIS PIPELINES

Anonymous authors
Paper under double-blind review

Abstract

Since the success of deep neural networks, the past years have witnessed an explosion of deep learning frameworks such as PyTorch and TensorFlow. These frameworks have significantly facilitated algorithm development in multimedia research and production. However, how to easily and efficiently build an end-to-end visual analysis pipeline with these algorithms remains an open issue. In most cases, developers have to spend a huge amount of time on data input and output, computation-efficiency optimization, or even debugging exhausting memory leaks, alongside algorithm development itself. VideoFlow aims to overcome these challenges by providing a flexible, efficient, extensible, and secure visual analysis framework for both academia and industry. With VideoFlow, developers can focus on the improvement of the algorithms themselves, as well as the construction of a complete visual analysis workflow. VideoFlow has been incubated in smart-city practice for more than three years and has been widely used in tens of intelligent visual analysis systems. VideoFlow will be open-sourced at https://github.com/xxx/videoflow.

1. INTRODUCTION

The success of computer vision techniques is spawning intelligent visual analysis systems in real applications. Rather than serving individual models, these systems are typically powered by a workflow of image/video decoding, several serial or parallel algorithm processing stages, and sinking of analysis results. The varied visual analysis requirements of different real-world scenarios place high demands on a framework: fast algorithm development, flexible pipeline construction, efficient workflow execution, and secure model protection. Some existing frameworks approach some of these targets, such as DeepStream (Purandare, 2018) and MediaPipe (Lugaresi et al., 2019). DeepStream is built on top of GStreamer (GSTREAMER, 1999), which primarily targets audio/video media editing rather than analysis. MediaPipe can be used to build anything from prototypes to polished cross-platform applications and to measure performance. Though it is flexible and extensible with respect to calculators, real online services in industry also expect efficiency, model security, and extensibility in more aspects. In this paper, we present VideoFlow to meet the visual analysis requirements of both algorithm development and deployment in real systems, with the following highlights.

Flexibility. VideoFlow is designed around the stateful Computation Graph and the stateless Resource. The computation graph abstracts the visual processing workflow into a stateful directed acyclic graph. Developers can focus on the implementation of processing units (graph nodes) and the construction of the whole workflow. A Resource is a stateless computation module shared among computation graphs; the most typical resource is deep learning model inference. Resources decouple the stateless visual processing components from the whole complicated visual analysis pipeline, helping developers focus on optimizing these computation- or IO-intensive implementations.
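The graph/resource split described above can be sketched as follows. This is a minimal illustrative example, not VideoFlow's real C++ API: the class and method names (`Graph`, `Node`, `Resource.infer`) are hypothetical, and the "inference" is a stand-in computation.

```python
# Illustrative sketch (not VideoFlow's real API): a stateful computation
# graph whose nodes call a stateless, shared Resource for inference.

class Resource:
    """Stateless shared computation, e.g. a model-inference wrapper."""
    def infer(self, frame):
        # Placeholder inference: score a frame by its mean pixel value.
        return sum(frame) / len(frame)

class Node:
    """A stateful processing unit (graph node)."""
    def __init__(self, name, fn, upstream=()):
        self.name, self.fn, self.upstream = name, fn, list(upstream)

class Graph:
    """A directed acyclic graph executed in topological order."""
    def __init__(self):
        self.nodes = []

    def add(self, name, fn, upstream=()):
        node = Node(name, fn, upstream)
        self.nodes.append(node)
        return node

    def run(self, frame):
        results = {}
        for node in self.nodes:  # insertion order is topological here
            inputs = [results[u.name] for u in node.upstream]
            results[node.name] = node.fn(frame, *inputs)
        return results

model = Resource()                     # shared across graph instances
g = Graph()
decode = g.add("decode", lambda f: f)  # stand-in for video decoding
score = g.add("score", lambda f, d: model.infer(d), upstream=[decode])
print(g.run([10, 20, 30])["score"])    # -> 20.0
```

Because `model` lives outside any particular graph, many graph instances could share it, which is the decoupling the Resource abstraction is meant to provide.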
Efficiency. VideoFlow is designed for efficiency at four levels. (1) Resource level: resources aggregate scattered computation requests from computation graph instances into intensive batched processing. (2) Video level: all videos are analyzed in parallel in a shared execution engine. (3) Frame level: video frames are processed in parallel by operations that are insensitive to frame order. (4) Operator level: visual analysis is in most cases a multi-branch pipeline; different branches, and operators within a branch that have no sequential dependency, are executed in parallel.

Extensibility. VideoFlow is designed from the beginning to be as modular as possible, allowing easy extension of almost all its components. It can be extended to different hardware devices such as Graphics Processing Units (GPUs) and Neural Processing Units (NPUs), and can be hosted on either x86 or ARM platforms. Developers can customize their own implementations with VideoFlow as a dependency library, and the extended implementations can be registered back into VideoFlow as plugins at runtime.

Security. Model protection is an important problem in industry. VideoFlow encodes model files into encrypted binary code as part of the compiled library. The secret key can be obscured inside the same library or exported to a separate key management service (KMS). At runtime, VideoFlow decrypts the models and periodically verifies authorization from a remote service.

VideoFlow has been incubated in smart-city practice for more than three years. It is designed for computer vision practitioners, including engineers, researchers, students, and software developers.
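The resource-level aggregation idea can be sketched in a few lines. This is a hypothetical toy, not VideoFlow's implementation: requests submitted by many graph instances are queued and then drained as one batch, which is how a shared inference resource could amortize per-call overhead.

```python
# Hypothetical sketch of resource-level aggregation: requests from many
# graph instances are queued, then processed together as one batch.

from queue import Queue

class BatchingResource:
    def __init__(self, max_batch=4):
        self.pending = Queue()
        self.max_batch = max_batch

    def submit(self, frame):
        """Called by any graph instance; non-blocking enqueue."""
        self.pending.put(frame)

    def flush(self):
        """Drain up to max_batch queued requests and run them together."""
        batch = []
        while not self.pending.empty() and len(batch) < self.max_batch:
            batch.append(self.pending.get())
        # One "batched inference" call instead of len(batch) separate calls.
        return [sum(f) / len(f) for f in batch]

res = BatchingResource(max_batch=3)
for frame in ([1, 3], [2, 2], [5, 5], [10, 10]):  # four instances submit
    res.submit(frame)
print(res.flush())  # first three processed as one batch -> [2.0, 2.0, 5.0]
print(res.flush())  # the remaining request -> [10.0]
```

In a real system the flush would be triggered by a timeout or a full batch on a worker thread; the sketch only shows the aggregation itself.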
The targets of VideoFlow are to: 1) free developers from exhausting data loading/sinking, parallel programming, and debugging so they can focus on optimizing algorithms; 2) enable easy extension of video decoding, deep model inference, and algorithm implementation; 3) provide a highly efficient framework for large-scale visual processing in industry rather than just experimental prototypes; and 4) protect the intellectual property of models and algorithms, ensuring they work only with authorization.

2. RELATED WORK

2.1. DEEP LEARNING FRAMEWORKS

Almost all existing deep learning frameworks, such as Caffe (Jia et al., 2014), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), and MXNet (Chen et al., 2015), describe networks as directed graphs or even dynamic graphs. VideoFlow draws lessons from this successful design for visual analysis. The difference is that the basic units in deep networks are low-level operations such as convolutions, whereas VideoFlow works with higher-level processing such as object tracking. The data transferred between operators in VideoFlow is also much more complex than the tensors of deep learning. For model inference, there are specially optimized engines from hardware manufacturers, such as TensorRT (Vanholder, 2016) and MKL-DNN/oneAPI (Intel). In the open-source community, developers put forward TVM, which eases extension to different hardware for more effective inference (Chen et al., 2017). On top of these engines, there are serving platforms for individual models rather than workflow construction, such as TensorFlow Serving (Google, 2016) and NVIDIA Triton Inference Server (Goodwin & Jeong, 2019). VideoFlow integrates these inference engines as Resources through their C++ interfaces.
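One plausible way to keep graph nodes independent of any particular inference engine is a plugin-style registry behind a single Resource interface. The sketch below is illustrative only; the registry, decorator, and engine names are hypothetical, and the "engine" is a stand-in rather than a real TensorRT or oneAPI binding.

```python
# Illustrative sketch (hypothetical names): inference engines registered
# behind one Resource-style interface, so a graph node never depends on
# a specific backend such as TensorRT or oneDNN.

ENGINE_REGISTRY = {}

def register_engine(name):
    """Class decorator that registers an engine backend by name."""
    def wrap(cls):
        ENGINE_REGISTRY[name] = cls
        return cls
    return wrap

@register_engine("toy-cpu")
class ToyCpuEngine:
    def infer(self, tensor):
        return [x * 2 for x in tensor]  # stand-in for a real forward pass

def make_resource(engine_name):
    """Instantiate whichever backend was registered under this name."""
    return ENGINE_REGISTRY[engine_name]()

res = make_resource("toy-cpu")
print(res.infer([1, 2, 3]))  # -> [2, 4, 6]
```

Registering a GPU or NPU backend would then only require another decorated class, with no change to the code that calls `infer`.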

2.2. VISUAL ANALYSIS FRAMEWORKS

Recent years have witnessed several visual analysis frameworks. NVIDIA launched the DeepStream project early on for video analysis on GPUs (Purandare, 2018). It is oriented toward and optimized for GPUs and TensorRT, regardless of the growing variety of heterogeneous hardware devices. Moreover, it is built on top of GStreamer (GSTREAMER, 1999), which primarily targets audio/video media editing rather than analysis, limiting its flexibility and extensibility. The gst-video-analytics project (Intel, 2019) is also built on top of GStreamer (Deuermeyer & Andrey). Google proposed MediaPipe, which likewise builds computation graphs for arbitrary streaming data processing (Lugaresi et al., 2019). MediaPipe can be used to build anything from prototypes to polished cross-platform applications and to measure performance. Though it is flexible and extensible with respect to calculators, real online visual analysis expects extensibility in more aspects, more efficiency optimization, and model security protection. Compared to MediaPipe, VideoFlow features these advantages for better application in both academia and industry. Another framework, also named Videoflow (de Armas, 2019), is designed to facilitate easy and quick definition of computer vision stream-processing pipelines; however, it is just an experimental prototype platform, with limitations in extensibility, efficiency, and security.

