BEYOND THE PIXELS: EXPLORING THE EFFECTS OF BIT-LEVEL NETWORK AND FILE CORRUPTIONS ON VIDEO MODEL ROBUSTNESS

Abstract

We investigate the robustness of video machine learning models to bit-level network and file corruptions, which can arise from network transmission failures or hardware errors, and explore defenses against such corruptions. We simulate network and file corruptions at multiple corruption levels, and find that bit-level corruptions can cause substantial performance drops on common action recognition and multi-object tracking tasks. We explore two types of defenses against bit-level corruptions: corruption-agnostic and corruption-aware defenses. We find that corruption-agnostic defenses such as adversarial training have limited effectiveness, performing up to 11.3 accuracy points worse than a no-defense baseline. In response, we propose Bit-corruption Augmented Training (BAT), a corruption-aware baseline that exploits knowledge of bit-level corruptions to enforce model invariance to such corruptions. BAT outperforms corruption-agnostic defenses, recovering up to 7.1 accuracy points over a no-defense baseline on highly-corrupted videos while maintaining competitive performance on clean/near-clean data.

1. INTRODUCTION

Video is becoming an increasingly common data modality, with applications in online conferencing (Jansen et al., 2018), autonomous vehicles (Chen et al., 2018; Bojarski et al., 2016), action recognition (Kuehne et al., 2011; Soomro et al., 2012; Carreira & Zisserman, 2017), and event detection (Fu et al., 2019; Gaidon et al., 2013). Recent work in computer vision has studied the robustness of machine learning (ML) models to pixel-space corruptions such as adversarial examples (Madry et al., 2017; Carlini & Wagner, 2016; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Szegedy et al., 2013). However, in the real world, videos are susceptible to corruptions beyond the pixels, such as bit-level network and file corruptions (Fig. 1). Bit-level corruptions are generally non-adversarial and arise in the real world from network congestion (Mushtaq & Mellouk, 2017), i.e., transmission problems in video conferencing, or hardware errors during storage (Sivathanu et al., 2005; Zhang et al., 2010). These corruptions spuriously modify bits in a video file, leading to data loss or visual distortion (e.g. object duplication, noisy patches, or freeze-frames). Although some previous work in computer vision robustness has been compression-aware, e.g. studying the effects of JPEG compression (Aydemir et al., 2018; Dziugaite et al., 2016; Das et al., 2018) or changing bitrates (Guo et al., 2017b), direct corruptions to the bit representations of videos remain underexplored. In this work, we take a first step in understanding video model robustness to naturally-occurring levels of network and file corruptions, and explore baseline defenses to these corruptions. To this end, we simulate a wide range of corruption severity levels consistent with empirical studies of real-world corruptions (Schroeder et al., 2016; Hu & Zhang, 2018), and explore their effects on video artifacts and model performance.
We evaluate model performance on two action recognition datasets (Kuehne et al., 2011; Soomro et al., 2012) and one multi-object tracking dataset (Leal-Taixé et al., 2015) under simulated network packet loss and random bit-flip file corruptions. Our experiments demonstrate that model performance begins to degrade when packet loss rates exceed 1%, or when the proportion of randomly-flipped bits in a file exceeds 10⁻⁶, and ultimately drops by up to 77.1% under the most severe corruption levels.
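The random bit-flip file corruption described above can be sketched as follows. This is a minimal illustrative simulation, not the paper's actual implementation: the function `flip_bits` and its parameters are our own naming, and a real experiment would apply this to encoded video bytes (e.g. an MP4 file) before decoding.

```python
import random

def flip_bits(data: bytes, flip_prob: float, seed: int = 0) -> bytes:
    """Flip each bit of a byte string independently with probability
    flip_prob, simulating random bit-level file corruption."""
    rng = random.Random(seed)
    out = bytearray(data)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < flip_prob:
                out[i] ^= 1 << bit  # toggle one bit
    return bytes(out)

# Example: corrupt a dummy 10 KB "file" at a 10^-3 bit-flip rate.
clean = bytes(10_000)  # 80,000 bits, all zero
corrupted = flip_bits(clean, flip_prob=1e-3, seed=42)
n_flipped = sum(bin(b).count("1") for b in corrupted)
```

At a flip rate of 10⁻³ the expected number of flipped bits here is about 80; at the 10⁻⁶ threshold cited above, a file of this size would typically see fewer than one flip, which is why degradation only appears at higher rates or on much larger files.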

