MACTA: A MULTI-AGENT REINFORCEMENT LEARNING APPROACH FOR CACHE TIMING ATTACKS AND DETECTION

Abstract

Security vulnerabilities in computer systems raise serious concerns as computers process an unprecedented amount of private and sensitive data today. Cache-timing attacks (CTA) pose an important practical threat as they can effectively breach many protection mechanisms in today's systems. However, current detection techniques for cache timing attacks rely heavily on heuristics and expert knowledge, which can lead to brittleness and the inability to adapt to new attacks. To mitigate the CTA threat, we propose MACTA, a multi-agent reinforcement learning (MARL) approach that leverages population-based training to train both attackers and detectors. Following best practices, we develop a realistic simulated MARL environment, MA-AUTOCAT, which enables training and evaluation of cache-timing attackers and detectors. Our empirical results suggest that MACTA is an effective solution without any manual input from security experts. MACTA detectors can generalize to a heuristic attack not seen during training with a 97.8% detection rate and reduce the attack bandwidth of RL-based attackers by 20% on average. At the same time, MACTA attackers are qualitatively more effective than the other attacks studied, and the average evasion rate of MACTA attackers against an unseen state-of-the-art detector can reach up to 99%. Furthermore, we find that agents equipped with a Transformer encoder can learn effective policies in this environment in situations where agents with multi-layer perceptron encoders cannot, suggesting the potential of Transformer structures for CTA problems.

1. INTRODUCTION

With increasingly sensitive data and tasks, security in modern computer systems is recognized as one of the 14 grand challenges for engineering (National Academy of Engineering, 2007). As a concrete example, cache-timing attacks (CTA) in processor caches have been shown to leak private encryption keys (Yarom & Falkner, 2014; Liu et al., 2015), break existing security isolation (Kocher et al., 2019), cause privilege escalation (Lipp et al., 2018), and break new hardware security features in the latest processors (Ravichandran et al., 2022). In CTA, the attacker gains access to private information (e.g., via memory access patterns) of a victim that shares a cache with the attacker. Over the decades, attack and defense policies in CTA have been explored manually by computer architecture experts. To defend against such attacks, statistical analysis and machine learning models with static strategies have been proposed for CTA detection; e.g., CC-Hunter (Chen & Venkataramani, 2014) uses auto-correlation and Cyclone (Harris et al., 2019) uses an SVM classifier. Yet, new CTA attacks are still being reported (Xiong & Szefer, 2020; Briongos et al., 2020; Saileshwar et al., 2021; Guo et al., 2022b;a), showing higher leakage rates or the ability to bypass existing defensive mechanisms. Computer security can be seen as a competitive game between attackers and defenders, and game-theoretic approaches that analyze strategies (policies) for both sides have been proposed (Anwar et al., 2018; Elderman et al., 2017; Eghtesad et al., 2020). These methods highly abstract the attack and defense strategies, usually based on known attacks and defenses, and analyze simplified games over limited strategy spaces. For example, Anwar et al. (2018) study a CTA-like attack scenario where the attacker decides when to terminate its attack and the defender decides an abstract security level.
However, real-world CTA has large action and state spaces for the different agents, sparse rewards, and long game horizons, making game analysis hard without exploring all possible policies. In this work, we use multi-agent reinforcement learning (MARL) to jointly explore and optimize complex attack/defense policies in CTA, taking an integrated approach of reinforcement learning and game theory. First, we build a multi-agent gym environment, MA-AUTOCAT, that closely models a realistic CTA setting and allows efficient learning for both attackers and defenders. Specifically, we study a detect-and-terminate defense. Second, we introduce and evaluate a MARL approach, named MACTA, to automatically find both attacker and detector policies through self-play, similar to past successes in games with large state/action spaces (e.g., StarCraft (Vinyals et al., 2019), Go (Silver et al., 2016), and Poker (Brown & Sandholm, 2019)). MACTA adopts Fictitious Play (FP) (Brown, 1951), population-based training in MARL (Vinyals et al., 2019), and Proximal Policy Optimization (PPO) (Schulman et al., 2017) to learn the best-response policy to a pool of diverse opponents, avoiding cyclic behaviors of the attacker/defender policies. Finally, MACTA uses a Transformer architecture to parameterize the policy/value function so that an important subset of actions can be picked up quickly during training, yielding fast policy learning. We performed extensive experiments in a representative cache-timing-attack setting. The experiments show that policies trained with MACTA can generalize to detectors/attackers that they were not exposed to during the learning phase (henceforth referred to as "unseen detectors/attackers"). The MACTA detector exhibits a 97.8% detection rate on an existing human-designed attack without training on it and can lower the number of attacks per episode (bandwidth) of adaptive attackers by 20% on average.
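The alternating best-response scheme described above can be sketched as follows. This is a minimal illustration of fictitious-play-style population training, not the paper's implementation: `best_response` is a hypothetical placeholder for a full PPO training run against opponents sampled from the pool.

```python
def best_response(opponent_pool, train_steps):
    """Hypothetical stand-in for training a new policy (e.g., with PPO)
    against opponents sampled uniformly from the given pool."""
    return {"trained_against": len(opponent_pool), "steps": train_steps}

def fictitious_play(init_attacker, init_detector, generations, train_steps=1000):
    """Alternate best-response training against growing opponent pools.

    Training each new policy against the whole pool (rather than only the
    latest opponent) is what discourages cyclic attacker/defender behavior.
    """
    attacker_pool = [init_attacker]
    detector_pool = [init_detector]
    for _ in range(generations):
        # Attacker learns a best response to the current detector pool ...
        attacker_pool.append(best_response(detector_pool, train_steps))
        # ... then the detector responds to the enlarged attacker pool.
        detector_pool.append(best_response(attacker_pool, train_steps))
    return attacker_pool, detector_pool

attackers, detectors = fictitious_play("a0", "d0", generations=3)
print(len(attackers), len(detectors))  # 4 4
```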
The MACTA attacker can bypass a previously unseen detector, Cyclone, with a more than 99% success rate. While there has been increasing interest and effort in using machine learning for computer system security recently, our work is the first to introduce the hardware timing attack problem as a promising application of MARL and to show that MARL can be effectively applied to detect simulated CTAs with strong generalization. Our main contributions are as follows:
• We contribute a simulated multi-agent environment, MA-AUTOCAT, that models realistic CTA and allows learning of both cache timing attacks and defenses.
• We introduce and evaluate MACTA, a multi-agent learning approach for CTA, and show that the resultant detector acquires interesting high-level patterns that generalize to novel attackers and make the cache less exploitable by high-bandwidth attacks.
• Our study of the neural architecture of learning agents indicates that CTA is one case where Transformers are significantly better at retrieving state information than multi-layer perceptrons.
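One intuition for why a Transformer encoder helps here is that attention can concentrate weight on the few informative timesteps in a long observation history, whereas an MLP over a flattened history must learn fixed per-position weights. A minimal sketch of scaled dot-product attention in plain Python (illustrative only; the paper's agents use full Transformer encoders, and all names below are our own):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Single-query scaled dot-product attention over a history.

    Scores each past step's key against the query; the softmax weights
    then concentrate on the most relevant steps, and the output is the
    weighted mix of their value vectors.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the first key, so the output is dominated by the
# first value vector even though the second step is also present.
out = attention(query=[10.0, 0.0],
                keys=[[10.0, 0.0], [0.0, 10.0]],
                values=[[1.0], [0.0]])
print(out)  # close to [1.0]
```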

2. THE CACHE TIMING ATTACK CHALLENGE

The cache timing attack challenge is a fundamental problem to address, as such attacks are stealthy yet powerful. We defer the detailed motivation for studying this problem to Appendix A.1 and introduce the domain knowledge and problem formulation in this section.

2.1. DOMAIN DESCRIPTION

A cache is a small and fast on-chip memory commonly used in modern processor designs to reduce the latency of memory accesses. Accessing a memory address whose data is available in the cache is fast (a "cache hit"). If the data is not in the cache, it must be retrieved from the main memory, which is much slower (a "cache miss"). Surprisingly, this timing difference in memory accesses due to caching can leak information across different programs/processes executing with a shared cache, a vulnerability known as cache timing attacks (CTA). As shown in Figure 1 (a), CTA involves the attacker process and the victim process
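The hit/miss timing channel can be illustrated with a toy prime-and-probe round on a simulated cache. This is a deliberately simplified model we constructed for illustration (fully associative, LRU, abstract latencies); real CTAs measure hardware cycle counts on set-associative caches.

```python
CACHE_SIZE = 4  # toy fully-associative cache with LRU replacement

class ToyCache:
    """Minimal cache model: hits are 'fast', misses are 'slow'."""
    HIT_LATENCY, MISS_LATENCY = 1, 100  # abstract time units

    def __init__(self):
        self.lines = []  # most-recently-used address is last

    def access(self, addr):
        if addr in self.lines:             # cache hit
            self.lines.remove(addr)
            self.lines.append(addr)
            return self.HIT_LATENCY
        if len(self.lines) >= CACHE_SIZE:  # evict the LRU line
            self.lines.pop(0)
        self.lines.append(addr)            # cache miss: fill the line
        return self.MISS_LATENCY

cache = ToyCache()
# Attacker primes the cache with its own addresses 0..3.
for a in range(4):
    cache.access(a)
# Victim touches its own address, evicting one attacker line (LRU: addr 0).
cache.access(100)
# Attacker probes in reverse order and times each access: the single
# slow probe reveals which line the victim's access displaced.
latencies = [cache.access(a) for a in reversed(range(4))]
print(latencies)  # [1, 1, 1, 100] -- the miss betrays the victim's access
```

Probing in reverse priming order keeps the attacker's own probes from evicting each other, so the lone miss is attributable to the victim.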

