BEYOND THE PIXELS: EXPLORING THE EFFECTS OF BIT-LEVEL NETWORK AND FILE CORRUPTIONS ON VIDEO MODEL ROBUSTNESS

Abstract

We investigate the robustness of video machine learning models to bit-level network and file corruptions, which can arise from network transmission failures or hardware errors, and explore defenses against such corruptions. We simulate network and file corruptions at multiple corruption levels, and find that bit-level corruptions can cause substantial performance drops on common action recognition and multi-object tracking tasks. We explore two types of defenses against bit-level corruptions: corruption-agnostic and corruption-aware defenses. We find that corruption-agnostic defenses such as adversarial training have limited effectiveness, performing up to 11.3 accuracy points worse than a no-defense baseline. In response, we propose Bit-corruption Augmented Training (BAT), a corruption-aware baseline that exploits knowledge of bit-level corruptions to enforce model invariance to such corruptions. BAT outperforms corruption-agnostic defenses, recovering up to 7.1 accuracy points over a no-defense baseline on highly-corrupted videos while maintaining competitive performance on clean/near-clean data.

1. INTRODUCTION

Video is becoming an increasingly common data modality, with applications in online conferencing (Jansen et al., 2018), autonomous vehicles (Chen et al., 2018; Bojarski et al., 2016), action recognition (Kuehne et al., 2011; Soomro et al., 2012; Carreira & Zisserman, 2017), and event detection (Fu et al., 2019; Gaidon et al., 2013). Recent work in computer vision has studied the robustness of machine learning (ML) models to pixel-space corruptions such as adversarial examples (Madry et al., 2017; Carlini & Wagner, 2016; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2016; Szegedy et al., 2013). However, in the real world, videos are susceptible to corruptions beyond the pixels, such as bit-level network and file corruptions (Fig. 1). Bit-level corruptions are generally non-adversarial and arise in the real world from network congestion (Mushtaq & Mellouk, 2017), i.e., transmission problems in video conferencing, or hardware errors during storage (Sivathanu et al., 2005; Zhang et al., 2010). These corruptions spuriously modify bits in a video file, leading to data loss or visual distortion (e.g., object duplication, noisy patches, or freeze-frames). Although some previous work in computer vision robustness has been compression-aware, e.g., studying the effects of JPEG compression (Aydemir et al., 2018; Dziugaite et al., 2016; Das et al., 2018) or changing bitrates (Guo et al., 2017b), direct corruptions to the bit representations of videos remain underexplored. In this work, we take a first step in understanding video model robustness to naturally-occurring levels of network and file corruptions, and explore baseline defenses to these corruptions. To this end, we simulate a wide range of corruption severity levels consistent with empirical studies of real-world corruptions (Schroeder et al., 2016; Hu & Zhang, 2018), and explore their effects on video artifacts and model performance.
We evaluate model performance on two action recognition datasets (Kuehne et al., 2011; Soomro et al., 2012) and one multi-object tracking dataset (Leal-Taixé et al., 2015) under simulated network packet loss and random bit-flip file corruptions. Our experiments demonstrate that model performance begins to degrade when packet loss rates exceed 1%, or when the proportion of randomly-flipped bits in a file exceeds $10^{-6}$, and ultimately drops by up to 77.1% under the most severe corruption levels. In response, we propose Bit-corruption Augmented Training (BAT), a corruption-aware baseline. BAT augments data with corrupted video files from randomly chosen bit-level corruptions during training. Experiments show that BAT recovers up to 7.1 points over a no-defense baseline on highly-corrupted videos, while maintaining competitive performance on clean/near-clean data. This suggests that the noise structure of bit-level corruptions is fundamentally different from that of adversarial noise, and that incorporating knowledge of bit-level corruptions is crucial to model robustness against network and file corruptions. Lastly, to better understand the effect of augmenting with different corruptions, we explore variants of BAT that sample augmentations from different subsets of simulated corruptions. Interestingly, we find that augmenting with high levels of network and file corruption is key to improving model robustness, resulting in an accuracy gain of up to 10.1 points over the no-defense baseline. We conclude that BAT is a promising starting point for robustness to bit-level corruptions in video machine learning. Our results motivate future studies to understand and defend against bit-level corruption in ML robustness.
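As a rough illustration of the two corruption types and of BAT-style augmentation described above, the following is a minimal sketch operating directly on the bytes of an encoded file. It is not the simulation procedure used in this work: the function names, the fixed 1400-byte packet size, the byte-chunk model of packets, and the `p_corrupt` parameter are all assumptions made for illustration. In a real pipeline, the corrupted bytes would still have to be passed through a video decoder before reaching the model.

```python
import random
from typing import Callable, Sequence

# Hypothetical type for a corruption: takes encoded bytes and an RNG.
Corruption = Callable[[bytes, random.Random], bytes]


def flip_random_bits(data: bytes, proportion: float, rng: random.Random) -> bytes:
    """Flip a fixed proportion of bits, chosen uniformly without replacement."""
    total_bits = len(data) * 8
    num_flips = round(total_bits * proportion)
    corrupted = bytearray(data)
    for position in rng.sample(range(total_bits), num_flips):
        corrupted[position // 8] ^= 1 << (position % 8)
    return bytes(corrupted)


def drop_packets(data: bytes, loss_rate: float, rng: random.Random,
                 packet_size: int = 1400) -> bytes:
    """Split the bytes into fixed-size chunks (an assumed stand-in for network
    packets) and drop each chunk independently with probability loss_rate."""
    packets = [data[i:i + packet_size] for i in range(0, len(data), packet_size)]
    kept = [packet for packet in packets if rng.random() >= loss_rate]
    return b"".join(kept)


def bat_style_augment(clean: bytes, corruptions: Sequence[Corruption],
                      rng: random.Random, p_corrupt: float = 0.5) -> bytes:
    """With probability p_corrupt, apply one randomly chosen corruption;
    otherwise return the clean bytes unchanged."""
    if rng.random() >= p_corrupt:
        return clean
    corruption = rng.choice(corruptions)
    return corruption(clean, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    encoded_video = bytes(range(256)) * 64  # placeholder for a real encoded file
    corruptions = [
        lambda d, r: flip_random_bits(d, 1e-3, r),
        lambda d, r: drop_packets(d, 0.05, r),
    ]
    sample = bat_style_augment(encoded_video, corruptions, rng)
    print(len(encoded_video), len(sample))
```

Choosing bit positions without replacement means the stated proportion of bits is flipped exactly, which matches the "proportion of randomly-flipped bits" wording; flipping each bit independently with that probability would be an equally reasonable reading.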

2. BIT-LEVEL VIDEO CORRUPTIONS

In this section, we first give a brief overview of video encoding to build intuition about the effects of network and file corruptions (Section 2.1), and then describe how we simulate a naturally-occurring range of network and file corruptions (Section 2.2).

2.1. VIDEO ENCODING

We present a simplified explanation of video encoding, using the H.264 codec as an example (Richardson, 2003). Let $\mathcal{X} = \mathbb{R}^{H \times W \times C \times T}$ be the space of possible videos. An uncompressed video $x \in \mathcal{X}$ consists of a sequence of frames. These frames are often redundant (i.e., two neighboring frames in a video are likely to look similar), so the H.264 codec stores only a small fraction of frames (known as I-frames) and encodes the remaining frames as differences from I-frames. To further save space, the frames go through a series of compression steps like frequency space conversion, quantization, and entropy encoding, much like JPEG compression. We notate the encoding process as the function $\mathrm{Enc}(x)$, which is a mapping from pixel-space to bit-space. The corresponding decoding function is denoted as $\mathrm{Dec}(\cdot)$.
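To make the intuition concrete, the toy sketch below mimics only the difference-coding step described above: the first frame of each group is stored in full (playing the role of an I-frame) and the remaining frames are stored as differences from it. It is not H.264; frequency-space conversion, quantization, and entropy encoding are omitted, and the names `encode`, `decode`, `group_size`, and `frames_affected_by_flip` are hypothetical. The point is only that a single flipped bit in an I-frame damages every frame that depends on it, while a flipped bit in a difference frame damages that frame alone.

```python
from typing import List, Tuple

Frame = List[int]                 # a "frame" is a flat list of pixel values
Encoded = List[Tuple[str, Frame]]  # ("I", pixels) or ("D", differences)


def encode(frames: List[Frame], group_size: int) -> Encoded:
    """Store every group_size-th frame in full; store the rest as differences
    from the most recent full frame."""
    encoded: Encoded = []
    reference: Frame = []
    for index, frame in enumerate(frames):
        if index % group_size == 0:
            reference = frame
            encoded.append(("I", list(frame)))
        else:
            encoded.append(("D", [p - r for p, r in zip(frame, reference)]))
    return encoded


def decode(encoded: Encoded) -> List[Frame]:
    """Invert encode(): rebuild each frame from its reference and differences."""
    frames: List[Frame] = []
    reference: Frame = []
    for kind, payload in encoded:
        if kind == "I":
            reference = list(payload)
            frames.append(list(payload))
        else:
            frames.append([r + d for r, d in zip(reference, payload)])
    return frames


def frames_affected_by_flip(frames: List[Frame], group_size: int,
                            entry: int, pixel: int, mask: int) -> List[int]:
    """Flip bits (given by mask) in one stored value and return the indices
    of decoded frames that no longer match the originals."""
    encoded = [(kind, list(payload)) for kind, payload in encode(frames, group_size)]
    encoded[entry][1][pixel] ^= mask
    decoded = decode(encoded)
    return [i for i, (a, b) in enumerate(zip(decoded, frames)) if a != b]


if __name__ == "__main__":
    video = [[10, 20, 30], [11, 20, 30], [12, 21, 30],
             [50, 60, 70], [50, 61, 70], [51, 61, 70]]
    assert decode(encode(video, group_size=3)) == video
    # One flipped bit in the first I-frame corrupts its whole group.
    print(frames_affected_by_flip(video, 3, entry=0, pixel=0, mask=0b1000))
    # One flipped bit in a difference frame corrupts only that frame.
    print(frames_affected_by_flip(video, 3, entry=4, pixel=1, mask=0b1))
```

In a real codec the effect can be considerably worse than this toy suggests, since a flipped bit in an entropy-coded stream can desynchronize the decoder rather than perturb a single value.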



Figure 1: An overview of the video machine learning pipeline under bit-level corruptions, which can arise due to file storage and network transmission. We explore various defenses against network and file corruptions.

We then focus on exploring two categories of defenses against network and file corruptions: corruption-agnostic methods without knowledge of bit-level corruptions, and corruption-aware methods that exploit knowledge of bit-level corruptions to enforce model invariance to network and file corruptions. For corruption-agnostic defenses, we evaluate out-of-distribution (OOD) detection (Hendrycks & Gimpel, 2016; Liang et al., 2018) and adversarial training (Goodfellow et al., 2014) as baselines. Our findings suggest that corruption-agnostic defenses have limited effectiveness, especially at low corruption levels; for example, under adversarial training, model performance drops by up to 11.3 points on corrupted data, and 8.6 points on clean data, compared to the no-defense baseline.

