LDMIC: LEARNING-BASED DISTRIBUTED MULTI-VIEW IMAGE CODING

Abstract

Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces epipolar geometric constraints between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code is released at https://github.com/Xinjie-Q/LDMIC.

1. INTRODUCTION

Multi-view image coding (MIC) aims to jointly compress a set of correlated images captured from different viewpoints, which promises high coding efficiency for the whole image set by exploiting inter-image correlation. It plays an important role in many applications, such as autonomous driving (Yin et al., 2020), virtual reality (Fehn, 2004), and robot navigation (Sanchez-Rodriguez & Aceves-Lopez, 2018). As shown in Figure 1(a), existing multi-view coding standards, e.g., H.264-based MVC (Vetro et al., 2011) and H.265-based MV-HEVC (Tech et al., 2015), adopt a joint coding architecture to compress different views. Specifically, they follow the predictive compression procedure of video coding standards, in which a selected base view is compressed by single-image coding. When compressing a dependent view, both disparity estimation and compensation are employed at the encoder to generate a predicted image. Then the disparity information, together with the residual errors between the input and the predicted image, is compressed and passed to the decoder. In this way, the redundancy between different views is progressively removed. However, these methods depend on hand-crafted modules, which prevents the whole compression system from enjoying the benefits of end-to-end optimization. Inspired by the great success of learning-based single-image compression (Ballé et al., 2017; 2018; Minnen et al., 2018; Cheng et al., 2020), several recent works have investigated the application of deep learning techniques to stereo image coding, a special case of MIC. In particular, Liu et al. (2019), Deng et al. (2021), and Wödlinger et al. (2022), mimicking traditional MIC techniques, adopt a unidirectional coding mechanism and explicitly utilize disparity-compensated prediction in the pixel/feature space to reduce inter-view redundancy. Meanwhile, Lei et al. (2022) introduce a bi-directional coding framework, called BCSIC, which jointly compresses the left and right images simultaneously to exploit the content dependency between the stereo pair. These pioneering studies demonstrate the potential of deep neural networks (DNNs) to save significant bit-rate in MIC.
However, several significant shortcomings hamper the deployment and application scope of existing MIC methods. Firstly, both the traditional and learning-based approaches demand inter-view prediction at the encoder, i.e., joint encoding, which requires the cameras to communicate with each other or to transmit the data to an intermediate common receiver, thereby consuming a tremendous amount of communication resources and increasing the deployment cost (Gehrig & Dragotti, 2007). This is undesirable in applications relevant to wireless multimedia sensor networks (Akyildiz et al., 2007). An alternative is to deploy special sensors such as stereo cameras as the encoder devices to acquire the data, but these devices are generally more expensive than monocular sensors and suffer from a limited field of view (FoV) due to the constraints on distance and position between the built-in sensors (Li, 2008). Secondly, most of the prevailing schemes, except BCSIC, are built on disparity correlations defined by the epipolar geometric constraint (Scharstein & Szeliski, 2002), which usually requires knowing the internal and external parameters of the cameras in advance, such as camera locations, orientations, and camera matrices. However, it is difficult for a distributed camera system without communication to access such prior knowledge (Devarajan et al., 2008). For example, the specific location of cameras in autonomous driving is usually not expected to be perceived by other vehicles or infrastructure, in order to avoid leaking the location and trajectory of individuals (Xiong et al., 2020).
Finally, as shown in Table 1 and Figure 4, compared with state-of-the-art (SOTA) learning-based single-image codecs (Minnen et al., 2018; Cheng et al., 2020), existing DNN-based MIC methods are not competitive in terms of rate-distortion (RD) performance, which is potentially caused by inefficient inter-view prediction networks.
To address the above challenges, we resort to innovations in the image coding architecture. In particular, our inspiration comes from the Slepian-Wolf (SW) theorem (Slepian & Wolf, 1973; Wolf, 1973) on distributed source coding (DSC).¹ The SW theorem shows that separate encoding with joint decoding of two or more correlated sources can theoretically achieve the same compression rate as a joint encoding-decoding scheme under lossless compression. It has been extended to the lossy case by Berger (1978) and Tung (1978), who provide inner and outer bounds of the achievable rate region. Based on these information-theoretic results on DSC, we develop a learning-based distributed multi-view image coding (LDMIC) framework. Specifically, to avoid collaboration between different cameras, as shown in Figure 1(b), each view image is mapped to a quantized latent representation by an individual encoder, while a joint decoder reconstructs the whole image set, which avoids both communication among cameras and the use of special sensors. Instead of relying on disparity-based correlations, we design a joint context transfer (JCT) module based on the cross-attention mechanism, agnostic to geometry priors, to exploit the global content dependencies between different views at the decoder, making our approach applicable to arbitrary multi-camera systems with overlapping FoVs.
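As a brief reminder of the classical result we build on, for two correlated sources $X_1$ and $X_2$ with rates $R_1$ and $R_2$, the Slepian-Wolf achievable rate region for separate encoding with joint decoding is

$$
R_1 \ge H(X_1 \mid X_2), \qquad
R_2 \ge H(X_2 \mid X_1), \qquad
R_1 + R_2 \ge H(X_1, X_2),
$$

so the minimum achievable sum rate $H(X_1, X_2)$ matches that of a joint encoder, even though the two encoders never communicate.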
Finally, since the separate encoding and joint decoding scheme is implemented by DNNs, end-to-end RD optimization is leveraged to implicitly help the encoders learn to remove part of the inter-view redundancy, thus improving the compression performance of the overall system. In summary, our main contributions are as follows:
• To the best of our knowledge, this is the first work to develop a deep learning-based view-symmetric framework for multi-view image coding. It decouples the inter-view operations at the encoder, which is highly desirable for distributed camera systems.
• We present a joint context transfer module at the decoder to explicitly capture inter-view correlations for generating more informative representations. We also propose an end-to-end encoder-decoder training strategy to implicitly make the latent representations more compact.
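To make the role of geometry-agnostic cross-attention in the JCT module concrete, here is a minimal numpy sketch in which the decoder lets one view's features attend globally to another view's features. All names are illustrative, and the random projection matrices stand in for learned weights, so this is a conceptual sketch rather than the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_a, f_b, d_k):
    """Transfer context from view B to view A without any disparity prior.

    f_a: (N_a, d) flattened feature map of view A (queries).
    f_b: (N_b, d) flattened feature map of view B (keys/values).
    Every position of A attends to every position of B, so no
    epipolar geometry or camera parameters are required.
    """
    rng = np.random.default_rng(0)
    d = f_a.shape[1]
    # Stand-ins for learned projections (random for this sketch).
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = f_a @ w_q, f_b @ w_k, f_b @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (N_a, N_b) global weights
    return attn @ v                          # (N_a, d) context from B

# Two 4x4 feature maps with 8 channels, flattened to (16, 8).
f_a = np.random.default_rng(1).standard_normal((16, 8))
f_b = np.random.default_rng(2).standard_normal((16, 8))
ctx = cross_attention(f_a, f_b, d_k=8)
print(ctx.shape)  # -> (16, 8)
```

Because the attention weights depend only on feature content, the same mechanism applies to any pair of views with overlapping FoV, regardless of how the cameras are arranged.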



¹ More details about the theorems and propositions of distributed source coding are provided in Appendix 6.4.



Figure 1: Overview of different multi-view image coding architectures, including (a) a joint encoding architecture and (b) the proposed symmetric distributed coding architecture.

