ADVERSARIAL TRAINING OF SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION AGAINST PHYSICAL-WORLD ATTACKS

Abstract

Monocular Depth Estimation (MDE) is a critical component in applications such as autonomous driving. Various attacks against MDE networks, especially physical-world ones, pose a great threat to the security of such systems. Traditional adversarial training requires ground-truth labels and hence cannot be directly applied to self-supervised MDE, which has no ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) ignore the domain knowledge of MDE and can hardly achieve optimal performance. In this work, we propose a novel adversarial training method for self-supervised MDE models based on view synthesis, without using ground-truth depth. We improve adversarial robustness against physical-world attacks by using L0-norm-bounded perturbations in training. We compare our method with supervised learning based and contrastive learning based methods tailored for MDE. Results on two representative MDE networks show that we achieve better robustness against various adversarial attacks with nearly no degradation in benign performance.

1. INTRODUCTION

Monocular Depth Estimation (MDE) is a technique that estimates depth from a single image. It enables 2D-to-3D projection by predicting the depth value for each pixel in a 2D image and serves as a very affordable replacement for expensive Lidar sensors. It hence has a wide range of applications such as autonomous driving (Liu et al., 2021a), visual SLAM (Wimbauer et al., 2021), and visual relocalization (Liu et al., 2021b). In particular, self-supervised MDE has gained fast-growing popularity in industry (e.g., Tesla Autopilot (Karpathy, 2020)) because it does not require ground-truth depth collected by Lidar during training while achieving accuracy comparable to supervised training. Exploiting vulnerabilities of deep neural networks, multiple digital-world (Zhang et al., 2020; Wong et al., 2020) and physical-world attacks (Cheng et al., 2022) against MDE have been proposed. They mainly use optimization-based methods to generate adversarial examples that fool the MDE network. Due to the importance and broad usage of self-supervised MDE, these adversarial attacks have posed a great threat to the security of applications such as autonomous driving, making defense and MDE model hardening an urgent need. Adversarial training (Goodfellow et al., 2014) is the most popular and effective way to defend against adversarial attacks. However, it usually requires ground-truth labels in training, making it not directly applicable to self-supervised MDE models that lack depth ground truth. Although contrastive learning has gained much attention recently and has been used for self-supervised adversarial training (Ho & Vasconcelos, 2020; Kim et al., 2020), it does not consider the domain knowledge of depth estimation and can hardly achieve optimal results (shown in Section 4.2). In addition, many existing adversarial training methods do not consider certain properties of physical-world attacks, such as strong perturbations.
Hence, in this paper, we focus on hardening self-supervised MDE models against physical-world attacks without requiring ground-truth depth. While traditional adversarial training assumes perturbations bounded in L2 or L∞ norm (i.e., measuring the overall perturbation magnitude across all pixels), physical-world attacks are usually unbounded in those norms: they tend to be stronger attacks in order to persist through variations in environmental conditions. To harden MDE models against such attacks, we utilize a loss function that effectively approximates the L0 norm (measuring the number of perturbed pixels regardless of their perturbation magnitude) while remaining differentiable. Adversarial samples generated by minimizing this loss can effectively mimic physical attacks. We make the following contributions: (1) We develop a new method to synthesize 2D images that follow physical-world constraints (e.g., relative camera positions) and directly perturb such images in adversarial training, minimizing the physical-world cost. (2) Our method utilizes the reconstruction consistency from one view to the other to enable self-supervised adversarial training without ground-truth depth labels. (3) We generate L0-bounded perturbations with a differentiable loss and randomize the camera and object settings during synthesis to effectively mimic physical-world attacks and improve robustness.
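To make the idea of a differentiable L0 approximation concrete, a common surrogate replaces the hard count of perturbed pixels with a smooth, saturating function of per-pixel perturbation magnitude. The NumPy sketch below illustrates this general technique; the function name and the threshold `tau` are our own choices for illustration, not the paper's exact formulation:

```python
import numpy as np

def soft_l0(delta, tau=0.1):
    """Differentiable surrogate for the L0 'norm' of a perturbation.

    delta: perturbation array of shape (H, W, C).
    tanh(|d| / tau) saturates to 1 when a pixel's perturbation magnitude
    is well above tau and stays near 0 when it is well below, so the sum
    approximates the number of perturbed pixels while remaining smooth
    (hence usable inside gradient-based attack generation).
    """
    per_pixel = np.abs(delta).max(axis=-1)  # magnitude per pixel, over channels
    return float(np.tanh(per_pixel / tau).sum())
```

Minimizing such a surrogate (jointly with the attack objective) encourages perturbations concentrated on few pixels with unrestricted magnitude, matching the character of physical patches.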



Figure 1: Self-supervised adversarial training of MDE with view synthesis.

A straightforward proposal to harden MDE models is to perturb 3D objects in various scenes and ensure the estimated depths remain correct. However, such adversarial training is difficult to realize. First, 3D perturbations are difficult to achieve in the physical world. While one could train the model in simulation, such training needs to be supported by a high-fidelity simulator and a powerful scene rendering engine that can precisely project 3D perturbations to 2D variations. Second, since self-supervised MDE training has no ground-truth depth, even if realistic 3D perturbations could be achieved and used in training, the model might converge on incorrect (but robust) depth estimations. In this paper, we propose a new self-supervised adversarial training method for MDE models. Figure 1a provides a conceptual illustration of our technique. A board A printed with the 2D image of a 3D object (e.g., a car) is placed at a fixed location (next to the car at the top-right corner). We use two cameras (close to each other at the bottom), C_t and C_s, to provide a stereo view of the board (images I_t and I_s in Figure 1b). Observe that there are fixed geometric relations between pixels in the two 2D views produced by the two respective cameras, such that the image in one view can be transformed to yield the image from the other view. Intuitively, I_t can be acquired by shifting I_s to the right. Note that when two cameras are not available, two close-by frames in a video stream can form the two views as well. During adversarial training, camera C_t takes a picture I_t of the original 2D image board A, and camera C_s likewise takes a picture I_s of the board (step 1). The bounding box of the board A is recognized in I_t, and the pixels in the bounding box corresponding to A are perturbed (step 2). Note that these are 2D perturbations similar to those in traditional adversarial training.
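The geometric relation between the two views can be made concrete for the simple case of a rectified stereo pair, where every pixel shifts horizontally by a disparity inversely proportional to its depth. The sketch below (our own illustrative code, using nearest-neighbour sampling for brevity; real self-supervised MDE training uses differentiable bilinear sampling under a full camera model) reconstructs the target view from the source view given a depth map:

```python
import numpy as np

def warp_stereo(img_s, depth_t, focal, baseline):
    """Reconstruct the target view I_{s->t} from the source image.

    img_s:   source-view image, shape (H, W).
    depth_t: per-pixel depth estimated for the target view, shape (H, W).
    For a rectified stereo pair, disparity = focal * baseline / depth
    gives the horizontal offset at which each target pixel should sample
    the source view.
    """
    h, w = depth_t.shape
    disparity = focal * baseline / depth_t
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - disparity).astype(int), 0, w - 1)
    return img_s[ys, src_x]
```

An erroneous depth map yields a wrong disparity, so the reconstruction misaligns with the true target image, which is precisely the signal the training exploits.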
At step 3, the perturbed image I_t + perturbations is fed to the MDE model for depth estimation, yielding a 3D projection of the object. Due to the perturbations, a vulnerable model produces distance errors, denoted by the red arrow between A and the projected 3D object in Figure 1a. At step 4, we try to reconstruct I_t from I_s. The reconstruction is parameterized on the cameras' relative pose transformation and the estimated distance of the object from the camera. Due to the distance error, the reconstructed image I_{s→t} (shown in Figure 1b) differs from I_t. Observe that part of the car (the upper part inside the red circle) is distorted. In comparison, Figure 1b also shows the reconstructed image I_{s→t}^{ben} without the perturbation, which is much more similar to I_t. The goal of our training (of the subject MDE model) is hence to reduce the differences between the original and reconstructed images. The above process is conceptual; its faithful realization entails a substantial physical-world overhead. In Section 3, we describe how to avoid the majority of this cost through image synthesis and training on synthesized data.
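The training signal described above reduces to a photometric difference between the target image and its reconstruction from the other view. A minimal sketch of such a loss (the plain mean absolute error below stands in for the full photometric loss, which in self-supervised MDE typically also includes a structural-similarity term):

```python
import numpy as np

def photometric_loss(i_t, i_recon):
    """Mean absolute photometric error between the target image and its
    reconstruction from the other view. A correct depth map aligns the
    reconstruction with i_t, driving this loss toward zero; adversarial
    training perturbs the input, recomputes depth, and updates the model
    so that this loss stays small even under perturbation."""
    return float(np.abs(i_t - i_recon).mean())
```

The attacker maximizes this loss subject to the L0-style perturbation budget, while the defender minimizes it over model parameters, giving a min-max objective that needs no ground-truth depth.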

