LDMIC: LEARNING-BASED DISTRIBUTED MULTI-VIEW IMAGE CODING

Abstract

Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces an epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code is released at https://github.com/Xinjie-Q/LDMIC.
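The excerpt does not give the equations of the joint context transfer module, but the core idea it names, cross-attention between the feature maps of two views at the decoder, can be sketched as follows. This is a minimal NumPy illustration of the general mechanism, not the paper's implementation; the function name `joint_context_transfer` and the single-head, unprojected Q/K/V are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_context_transfer(feat_a, feat_b):
    """Hypothetical single-head cross-attention from view A to view B.

    feat_a, feat_b: (H, W, C) decoder feature maps of two views.
    Every spatial position in view A attends to *all* positions in
    view B, so the captured correlation is global and does not rely
    on an epipolar geometric constraint between the cameras.
    """
    h, w, c = feat_a.shape
    q = feat_a.reshape(-1, c)  # queries from view A: (H*W, C)
    k = feat_b.reshape(-1, c)  # keys from view B
    v = feat_b.reshape(-1, c)  # values from view B
    # scaled dot-product attention over all positions of view B
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)  # (H*W, H*W)
    ctx = attn @ v  # inter-view context aggregated for view A
    # residual fusion: enrich view A's features with view B's context
    return (q + ctx).reshape(h, w, c), attn

fused, attn = joint_context_transfer(
    rng.standard_normal((4, 4, 8)), rng.standard_normal((4, 4, 8))
)
print(fused.shape, attn.shape)  # (4, 4, 8) (16, 16)
```

Because attention compares all position pairs, the decoder can fuse views whose fields of view overlap arbitrarily, which is what lets the encoders stay independent.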

1. INTRODUCTION

Multi-view image coding (MIC) aims to jointly compress a set of correlated images captured from different viewpoints, and can achieve high coding efficiency for the whole image set by exploiting inter-image correlation. It plays an important role in many applications, such as autonomous driving (Yin et al., 2020), virtual reality (Fehn, 2004), and robot navigation (Sanchez-Rodriguez & Aceves-Lopez, 2018). As shown in Figure 1(a), existing multi-view coding standards, e.g., H.264-based MVC (Vetro et al., 2011) and H.265-based MV-HEVC (Tech et al., 2015), adopt a joint coding architecture to compress different views. Specifically, they follow the predictive compression procedure of video standards, in which a selected base view is compressed by single image coding. When compressing a dependent view, disparity estimation and compensation are employed at the encoder to generate a predicted image. Then the disparity information, as well as the residual errors between the input and predicted image, is compressed and passed to the decoder. In this way, the redundancy between different views is reduced sequentially. These methods depend on hand-crafted modules, which prevents the whole compression system from enjoying the benefits of end-to-end optimization. Inspired by the great success of learning-based single image compression (Ballé et al., 2017; 2018; Minnen et al., 2018; Cheng et al., 2020), several recent works have investigated the application of deep learning techniques to stereo image coding, a special case of MIC. In particular, Liu et al. (2019), Deng et al. (2021) and Wödlinger et al. (2022), mimicking traditional MIC techniques, adopt a unidirectional coding mechanism and explicitly utilize disparity-compensated prediction in the pixel/feature space to reduce the inter-view redundancy. Meanwhile, Lei et al.
(2022) introduce a bi-directional coding framework, called BCSIC, to jointly compress the left and right images simultaneously, exploiting the content dependency between the stereo pair. These early studies demonstrate the potential of deep neural networks (DNNs) to save significant bit-rate for MIC. However, several significant shortcomings hamper the deployment and application scope of existing MIC methods. Firstly, both the traditional and learning-based approaches demand inter-view prediction at the encoder, i.e., joint encoding, which requires the cameras to communi-

