ADVERSARIAL TRAINING OF SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION AGAINST PHYSICAL-WORLD ATTACKS

Abstract

Monocular Depth Estimation (MDE) is a critical component in applications such as autonomous driving. Various attacks against MDE networks exist; these attacks, especially the physical ones, pose a great threat to the security of such systems. Traditional adversarial training methods require ground-truth labels and hence cannot be directly applied to self-supervised MDE, which has no ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) ignore the domain knowledge of MDE and can hardly achieve optimal performance. In this work, we propose a novel adversarial training method for self-supervised MDE models based on view synthesis without using ground-truth depth. We improve adversarial robustness against physical-world attacks by using L0-norm-bounded perturbations in training. We compare our method with supervised-learning-based and contrastive-learning-based methods that are tailored for MDE. Results on two representative MDE networks show that we achieve better robustness against various adversarial attacks with nearly no benign performance degradation.

1. INTRODUCTION

Monocular Depth Estimation (MDE) is a technique that estimates depth from a single image. It enables 2D-to-3D projection by predicting the depth value for each pixel in a 2D image and serves as a very affordable replacement for expensive Lidar sensors. It hence has a wide range of applications such as autonomous driving (Liu et al., 2021a), visual SLAM (Wimbauer et al., 2021), and visual relocalization (Liu et al., 2021b). In particular, self-supervised MDE has gained fast-growing popularity in industry (e.g., Tesla Autopilot (Karpathy, 2020)) because it does not require the ground-truth depth collected by Lidar during training while achieving accuracy comparable to supervised training. Exploiting vulnerabilities of deep neural networks, multiple digital-world (Zhang et al., 2020; Wong et al., 2020) and physical-world attacks (Cheng et al., 2022) against MDE have been proposed. They mainly use optimization-based methods to generate adversarial examples that fool the MDE network. Given the importance and broad usage of self-supervised MDE, these adversarial attacks pose a great threat to the security of applications such as autonomous driving, which makes defense and MDE model hardening an urgent need.

Adversarial training (Goodfellow et al., 2014) is the most popular and effective way to defend against adversarial attacks. However, it usually requires ground-truth labels in training, making it not directly applicable to self-supervised MDE models with no depth ground truth. Although contrastive learning has gained a lot of attention recently and has been used for self-supervised adversarial training (Ho & Vasconcelos, 2020; Kim et al., 2020), it does not consider the domain knowledge of depth estimation and can hardly achieve optimal results (shown in Section 4.2). In addition, many existing adversarial training methods do not consider certain properties of physical-world attacks, such as strong perturbations. Hence, in this paper, we focus on hardening self-supervised MDE models against physical-world attacks without requiring ground-truth depth.
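To make the L0-norm bound mentioned above concrete: unlike an L-infinity bound, which limits how much each pixel may change, an L0 bound limits how many pixels may change at all, which better models physical patches that strongly alter a small region. Below is a minimal NumPy sketch (our own illustration, not the paper's implementation; the function name `project_l0` is an assumption) of projecting a perturbation onto an L0 ball by keeping only its k largest-magnitude entries.

```python
import numpy as np

def project_l0(delta, k):
    """Project perturbation `delta` onto the L0 ball of radius k:
    keep the k entries with the largest magnitude, zero out the rest."""
    flat = delta.reshape(-1)
    if np.count_nonzero(flat) <= k:
        return delta
    # Indices of the k largest-magnitude entries (partial sort).
    topk = np.argpartition(np.abs(flat), -k)[-k:]
    mask = np.zeros_like(flat)
    mask[topk] = 1.0
    return (flat * mask).reshape(delta.shape)

# Example: restrict a 3x3 perturbation to its 2 strongest pixels.
delta = np.array([[0.1, -0.9, 0.0],
                  [0.5,  0.2, -0.3],
                  [0.0,  0.8, 0.05]])
sparse = project_l0(delta, k=2)
```

In an adversarial-training loop, such a projection would typically be applied after each gradient ascent step on the perturbation, so the inner attack stays within the L0 budget used during training.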

