DISENTANGLING STYLE AND CONTENT FOR LOW RESOURCE VIDEO DOMAIN ADAPTATION: A CASE STUDY ON KEYSTROKE INFERENCE ATTACKS

Abstract

Keystroke inference attacks are a form of side-channel attacks in which an attacker leverages various techniques to recover a user's keystrokes as she inputs information into some display (for example, while sending a text message or entering her PIN). Typically, these attacks leverage machine learning approaches, but assessing the realism of the threat space has lagged behind the pace of machine learning advancements, due in part to the challenges of curating large real-life datasets. This paper aims to overcome the challenge of having a limited amount of real data by introducing a video domain adaptation technique that leverages synthetic data through supervised disentangled learning. Specifically, for a given domain, we decompose the observed data into two factors of variation: Style and Content. Doing so provides four learned representations: real-life style, synthetic style, real-life content, and synthetic content. We then combine them into feature representations from all combinations of style-content pairings across domains, and train a model on these combined representations to classify the content (i.e., labels) of a given datapoint in the style of another domain. We evaluate our method on real-life data using a variety of metrics to quantify the amount of information an attacker is able to recover. We show that our method prevents our model from overfitting to a small real-life training set, indicating that it is an effective form of data augmentation.

1. INTRODUCTION

We are exceedingly reliant on our mobile devices in our everyday lives. Numerous activities, such as banking, communications, and information retrieval, have gone from having separate channels to collapsing into one: our mobile phones. While this has made many of our lives more convenient, this phenomenon further incentivizes attackers seeking to steal information from users. Therefore, studying different attack vectors and understanding the realistic threats that arise from attackers' abilities to recover user information is imperative to formulating defenses. The argument for studying these attacks is not a new one. A rich literature of prior works studying both attacks and defenses has assessed a wide array of potential attack vectors. The majority of these attacks utilize various machine learning algorithms to predict the user's keystrokes (Raguram et al., 2011; Cai & Chen, 2012; Xu et al., 2013; Sun et al., 2016; Chen et al., 2018; Lim et al., 2020), but the ability to assess attackers leveraging deep learning methods has lagged due to the high costs of curating real-life datasets for this domain and the lack of publicly available datasets. Despite all the recent attention to keystroke inference attacks, numerous questions have gone unanswered. Which defenses work against adversaries who leverage deep learning systems? Which defenses are easily undermined? Are there weaknesses in deep learning systems that we can use to develop better defenses to thwart state-of-the-art attacks? These questions capture the essence of the underlying principles for research into defenses against keystroke inference attacks. Given the back-and-forth nature of researching attacks and defenses, these questions cannot be addressed because of the current inability to assess attacks with deep learning methods.
This paper aims to overcome the challenge of having a limited amount of labeled real-life data by introducing a video domain adaptation technique that is able to leverage abundantly labeled synthetic data. We show that by disentangling our data into separate style and content representations, we can subsequently create style-content pairs across both domains and combine them into representations that contain the content in the style of their inputs, i.e., style transfer in the feature space. This is especially attractive in the case of pairs of real-life style and synthetic content, as this is an effective data augmentation scheme. Style representations need to be well separated between domains, whereas content representations need to be indistinguishable. To enforce this disentanglement, we introduce auxiliary losses on the latent spaces. Through a series of ablations, we show that doing so improves performance. In our context, Content answers the question: What was typed? For example, the sentence that a user types. Style answers the question: How was it typed?

Figure 1: An example highlighting the discrepancies between the synthetic data (rows 1 and 3) and real-life data (rows 2 and 4). Rows 1 and 2 show sequences of the word order being typed, sampled with the same number of frames between keypresses. Frames with green boxes indicate ones in which a key was pressed; e.g., in the first frame of the first two rows, the key o was pressed. While the content of the two sequences is the same, the style differs, e.g., the texture and the trajectory between keypresses. To further highlight the temporal distribution shift, rows 3 and 4 show the thumb trajectory between w and h for both the synthetic and real sequences. While the finger is linearly interpolated in the synthetic domain, the real-life trajectory is more complex and challenging to model with a simulator. We highlight the thumb tip in red and the trajectories in blue.
For example, the texting pattern. The majority of visual domain adaptation methods do not work well in our problem setting because they mainly focus on tasks in which the domain shift is limited to a shift in texture, e.g., image classification and semantic segmentation (Ganin & Lempitsky, 2014; Shrivastava et al., 2016; Tzeng et al., 2017; Hoffman et al., 2017; Motiian et al., 2017). When predicting keystroke sequences, addressing the domain shift with respect to texture is not sufficient. While there is a clear difference in texture, we must also address the temporal domain shift, e.g., different finger motions and speeds. Notice the difference between the thumb trajectories in the two example videos displayed in Figure 1: the synthetic thumb is linearly interpolated, whereas the real one moves in a more complex fashion. Our pairing mechanism is inspired by the one introduced by Motiian et al. (2017), who devise a training regime that pairs the scarce data in the target domain with the data from the source domain. This strategy aims to augment the data in the target domain to the order of the source domain. In our work, we loosen the restriction of needing pairs with the same label, adapting it to our setting of not having paired sentences. This makes our pairing mechanism more general and applicable to other settings. To summarize, our main contributions are: 1) a framework for low-resource video domain adaptation using supervised disentangled learning, and 2) a novel method to assess the threat of keystroke inference attacks by an attacker using a deep learning system while having limited real-life data.
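The cross-domain pairing mechanism can be illustrated with a minimal sketch. The encoders are elided here; we assume style and content have already been extracted as plain feature vectors, and we stand in for the learned fusion module with simple concatenation. Function names and feature sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of cross-domain style-content pairing (illustrative assumptions:
# the real fusion is a learned module, not concatenation).

def combine(style, content):
    """Fuse a style vector with a content vector into one representation.
    List concatenation stands in for the learned fusion step."""
    return style + content

def all_pairings(real_styles, synth_styles, real_contents, synth_contents):
    """Enumerate every style-content pairing across both domains. The
    (real style, synthetic content) pairs are the key ones: they act as
    extra 'real-looking' training data with known synthetic labels."""
    styles = [("real", s) for s in real_styles] + \
             [("synthetic", s) for s in synth_styles]
    contents = [("real", c) for c in real_contents] + \
               [("synthetic", c) for c in synth_contents]
    return [(sd, cd, combine(s, c)) for sd, s in styles for cd, c in contents]

# Toy 2-D features, one example per domain: four pairings result.
pairs = all_pairings([[0.1, 0.2]], [[0.9, 0.8]], [[1.0, 0.0]], [[0.0, 1.0]])
assert len(pairs) == 4
assert ("real", "synthetic", [0.1, 0.2, 0.0, 1.0]) in pairs  # augmented pair
```

With one example per factor and domain, the enumeration yields all four style-content combinations; in practice the same enumeration runs over minibatches of encoded features.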
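The two auxiliary objectives on the latent spaces can likewise be sketched with a toy binary domain classifier: style features are trained with an ordinary cross-entropy so the classifier can tell real from synthetic, while content features are trained with a confusion-style loss that is minimized when the classifier's output is uniform. The function names and this specific confusion formulation are assumptions for illustration, not the paper's exact losses.

```python
import math

def style_domain_loss(p_real, is_real):
    """Binary cross-entropy on a domain classifier's probability that a
    STYLE feature came from the real domain; low when the classifier is
    confident and correct, i.e., when styles are well separated."""
    p = min(max(p_real, 1e-7), 1.0 - 1e-7)  # clamp for numerical safety
    return -math.log(p) if is_real else -math.log(1.0 - p)

def content_confusion_loss(p_real):
    """Cross-entropy against the uniform (0.5/0.5) target for a CONTENT
    feature; minimized at p_real = 0.5, i.e., when the classifier cannot
    tell which domain the content came from."""
    p = min(max(p_real, 1e-7), 1.0 - 1e-7)
    return -0.5 * (math.log(p) + math.log(1.0 - p))

# Well-separated styles are rewarded...
assert style_domain_loss(0.95, is_real=True) < style_domain_loss(0.5, is_real=True)
# ...while domain-indistinguishable content is rewarded.
assert content_confusion_loss(0.5) < content_confusion_loss(0.95)
```

Minimizing the first loss pushes style representations apart by domain, while minimizing the second (with respect to the content encoder) pushes content representations toward domain invariance.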

2. BACKGROUND

Keystroke Inference Attacks. Some of the early works on (vision-based) keystroke inference attacks focused on direct line of sight and reflective surfaces (e.g., teapots, sunglasses, eyes) (Backes et al., 2008; 2009; Raguram et al., 2011; Xu et al., 2013; Yue et al., 2014; Ye et al., 2017; Lim et al., 2020) to infer sensitive data. The attackers train models that account for various capture angles by aligning the user's mobile phone to a template keyboard. Collectively, these works showed that

