DEEAPR: CONTROLLABLE DEPTH ENHANCEMENT VIA ADAPTIVE PARAMETRIC FEATURE ROTATION

Abstract

Understanding the depth of an image provides viewers with a better interpretation of its 3D structure. Photographers utilize numerous factors that affect depth perception to aesthetically improve a scene. Unfortunately, controlling depth perception after an image has been captured is a difficult process, as it requires accurate and explicit depth information. Moreover, defining a quantitative metric for a subjective quality (i.e., depth perception) is difficult, which makes supervised learning a great challenge. To this end, we propose DEpth Enhancement via Adaptive Parametric feature Rotation (DEEAPR), which modulates the perceptual depth of an input scene using a single control parameter, without the need for explicit depth information. We first embed the content-independent depth perception of a scene by visual representation learning. Then, we train the controllable depth enhancer network with a novel modulator, the parametric feature rotation block (PFRB), which allows for continuous modulation of a representative feature. We demonstrate the effectiveness of our proposed approach by verifying each component through an ablation study and comparisons to other controllable methods.



1. INTRODUCTION

Enhancing depth perception in 2D images presents more realistic image content to viewers, as it allows for better interpretation of the 3D scene structure. The human visual system uses a variety of cues to infer depth information, such as the disparity that arises from binocular vision or motion (Kim et al., 2016). Depth cues in single static images, such as occlusion, shading, or blur, are referred to as pictorial depth cues (O'Shea et al., 1997). Many artists and photographers utilize pictorial depth cues by adding synthetic effects to increase the impression of depth in still images. For example, amplifying defocus blur triggers the depth-of-focus cue, where objects within the range of focus appear sharp and those farther away appear blurry. However, enhancing the depth perception of an image by manipulating depth cues without explicit depth information of the scene is a challenging task. Moreover, depth perception of a scene is a subjective quality that varies from image to image, which is difficult to learn in a supervised manner. Our final goal is to modulate the perceptual depth of an input scene using a single control parameter, without the need for explicit depth information. While these traits of depth perception make supervised training very difficult, we rely on the recent success of unsupervised visual representation learning, which considers a high-dimensional space in which the degree of similarity is inversely correlated with the distance between instances. As a first step, we embed the content-independent depth perception of a scene onto the representation space by combining contrastive learning (Hadsell et al., 2006) and metric learning (Balntas et al., 2016). The representation space of the encoder then bridges the image space and the control parameter axis. Instead of relying on full supervision, we aim to learn the direction of depth enhancement during depth enhancer training. Fig.
1 illustrates our DEEAPR framework, which combines visual representation learning and a controllable neural network. We first train a depth representation encoder in an unsupervised manner, then train the depth perception enhancer such that an appropriate change is induced in the image domain when the depth representative feature is modulated. Our contributions are three-fold. First, we propose a novel strategy to learn a visual representation space of style disentangled from image content. Second, we present a controllable neural network that enhances the depth perception of an input image according to a single control parameter. Lastly, we present a novel modulator, the parametric feature rotation block (PFRB), which allows for continuous modulation of a feature representation while preserving its norm.
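To make the norm-preserving property concrete, below is a minimal, hypothetical sketch of a parametric rotation applied to a feature vector. It assumes channel-pairwise 2D rotations scaled by the control parameter alpha; this is our illustration of the general idea, not the paper's exact PFRB formulation.

```python
import numpy as np

def parametric_feature_rotation(feat, alpha, theta_max=np.pi / 2):
    """Rotate interleaved channel pairs of a feature by alpha * theta_max.

    Rotation matrices are orthogonal, so the feature norm is preserved
    for any alpha; alpha = 0 leaves the feature unchanged.
    """
    assert feat.shape[-1] % 2 == 0, "needs an even number of channels"
    theta = alpha * theta_max
    c, s = np.cos(theta), np.sin(theta)
    x, y = feat[..., 0::2], feat[..., 1::2]  # channel pairs (x_i, y_i)
    out = np.empty_like(feat)
    out[..., 0::2] = c * x - s * y  # standard 2D rotation per pair
    out[..., 1::2] = s * x + c * y
    return out
```

Sweeping alpha continuously then traces a path on the hypersphere of constant feature norm, which is the kind of continuous, norm-preserving modulation the PFRB is designed to provide.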

2. RELATED WORK

2.1. VISUAL REPRESENTATION LEARNING

The recent success of unsupervised representation learning in natural language processing has motivated the computer vision community to pursue learning a manifold of natural images. To learn an effective feature space for downstream tasks, many works utilize a contrastive loss (Hadsell et al., 2006) to discriminate highly correlated instances (i.e., positive pairs) from others. In general, various content-preserving data augmentations are applied to the same image independently in order to generate positive pairs (Chen et al., 2020a). All other examples obtained from different images are considered negative samples, and the objective is to minimize the distance between positive pairs relative to myriads of negative samples. Momentum Contrast (MoCo) (He et al., 2020b) is a representative approach that builds a dynamic dictionary encoded by a moving-averaged momentum encoder. Furthermore, DASR (Wang et al., 2021) adapts MoCo to learn a content-independent degradation representation for the blind super-resolution task. Our strategy is similar to MoCo v2 (Chen et al., 2020b) and partly motivated by DASR in that we aim to learn a content-independent representation of depth perception via contrastive learning. We use the triplet margin loss (Balntas et al., 2016) as an additional regularizer to disentangle depth perception from the image content.
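As a rough illustration of the two objectives combined here, the NumPy sketch below shows an InfoNCE-style contrastive loss and a triplet margin loss. The function names, unit-normalization, and temperature value are our assumptions for illustration, not the paper's training code.

```python
import numpy as np

def info_nce(query, pos_key, neg_keys, tau=0.07):
    """InfoNCE-style contrastive loss: pull the query toward its
    positive key while pushing it away from negative keys."""
    q = query / np.linalg.norm(query)
    k = pos_key / np.linalg.norm(pos_key)
    negs = neg_keys / np.linalg.norm(neg_keys, axis=1, keepdims=True)
    logits = np.concatenate([[q @ k], negs @ q]) / tau
    logits -= logits.max()  # numerical stability before softmax
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def triplet_margin(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: keep the anchor closer to the positive
    than to the negative by at least `margin` (zero once satisfied)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```

The contrastive term shapes the overall embedding space, while the triplet term acts as the regularizer that pushes same-perception, different-content samples together.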

2.2. DEPTH PERCEPTION ENHANCEMENT

Interpretation of 3D scene structure from single static images can be obtained from various pictorial depth cues, such as occlusion, relative size, linear perspective, aerial perspective, texture gradients, depth-of-focus, and shading (Cipiloglu et al., 2010). These depth cues are utilized in conventional methods by, for example, enhancing local contrast (Ritschel et al., 2008; Lee et al., 2014), slightly darkening or lightening the background or partly occluded objects (Luft et al., 2004), or applying shading and/or shadows (Vergne et al., 2011; Lopez-Moreno et al., 2011). While these methods require explicit use of depth information at the pixel level, Hel-Or et al. (2017) extract reflectance derivatives of an image to manipulate shading cues. Conventional methods often require precise manipulation of a multitude of parameters for a plausible and visually pleasing output. More recently, there has been great interest in developing deep learning-based approaches to render an artificial depth-of-field effect, often referred to as the Bokeh effect. Indeed, through the AIM Bokeh Challenges (Ignatov et al., 2019; 2020b), numerous approaches created depth-dependent blur effects by using depth information explicitly (Peng et al., 2022) or implicitly (Ignatov et al., 2020a; Qian et al., 2020; Dutta et al., 2021). Our DEEAPR differs from the aforementioned approaches in two major aspects: it produces a depth-enhanced image based on a single control parameter, and it does so without explicit depth information.
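For intuition, a toy depth-dependent blur of the kind used by such depth-of-field methods can be sketched as below: pixels far from a chosen focal plane are blurred more. This is a hypothetical illustration of the conventional, depth-map-driven pipeline, precisely what DEEAPR avoids, since it requires an explicit per-pixel depth map.

```python
import numpy as np

def box_blur(img, k=5):
    """Simple k x k mean filter on a 2D image, edge-padded."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def depth_of_field(img, depth, focal):
    """Toy Bokeh-like effect: blend sharp and blurred copies per pixel,
    weighted by distance of the depth map from the focal plane."""
    blurred = box_blur(img)
    w = np.clip(np.abs(depth - focal), 0.0, 1.0)  # 0 = in focus
    return (1.0 - w) * img + w * blurred
```

Note how every output pixel depends on the depth map `depth`; producing a comparable effect from a single global control parameter, with no depth map at all, is the harder setting DEEAPR targets.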

2.3. CONTROLLABLE NEURAL NETWORK

Conventional deep learning methods learn a deterministic mapping for each task and output a single image given an input. Controllable methods, on the other hand, provide the flexibility to produce images at continuous levels of modulation over diverse imagery effects. Controllability has appeared in several low-level vision tasks, such as style transfer, to permit smooth interpolation between different artistic styles (Dumoulin et al., 2016; Ghiasi et al., 2017; Huang & Belongie, 2017; Wang et al.,



Figure 1: Illustration of DEEAPR framework.

