DEEPDFA: DATAFLOW ANALYSIS-GUIDED EFFICIENT GRAPH LEARNING FOR VULNERABILITY DETECTION

Anonymous

Abstract

Deep learning-based vulnerability detection models have recently been shown to be effective and, in some cases, to outperform static analysis tools. However, the highest-performing approaches use token-based transformer models, which do not leverage domain knowledge. Classical program analysis techniques such as dataflow analysis can detect many types of bugs and are the most commonly used methods in practice. Motivated by the causal relationship between bugs and dataflow analysis, we present DeepDFA, a dataflow analysis-guided graph learning framework and embedding that uses program semantic features for vulnerability detection. We show that DeepDFA is both effective and efficient. DeepDFA ranked first in recall, first in generalizing to unseen projects, and second in F1 among all the state-of-the-art models we experimented with. It is also the smallest model in terms of the number of parameters, and it was trained in 9 minutes, 69x faster than the highest-performing baseline. DeepDFA can also be combined with other models. By integrating LineVul and DeepDFA, we achieved the best vulnerability detection performance: 96.40 F1 score, 98.69 precision, and 94.22 recall.

1. INTRODUCTION

Software vulnerabilities cause great harm to people and corporations. Many Internet users have had their personal information breached because of security vulnerabilities, with common reports of breaches exposing millions of records (wik, 2021). The average data breach costs the target company $4.24 million, according to IBM's 2021 report (ibm, 2021). The number of vulnerabilities grew every year from 2016 to 2021, as reported by the Common Vulnerabilities and Exposures (CVE) program (cve, 2021). Because of the importance of this problem, software companies have invested heavily in developing vulnerability detection tools that can scan software before its release (Lu et al., 2021; Zheng et al., 2021). Deep neural networks have shown great progress in vulnerability detection, with the recent LineVul paper (Fu & Tantithamthavorn, 2022) reporting a 0.91 F1 score on a commonly used real-world vulnerability dataset (Fan et al., 2020), and many deep learning-based tools outperforming static analysis (Li et al., 2018; Ding et al., 2022; Cao et al., 2022). Current state-of-the-art models use graph neural networks (GNNs) with unsupervised word embeddings and large pretrained transformers, which can perform well on this task. However, these models are large and expensive to train, and we show that they do not generalize well to unseen projects. Empirical studies have found that the models can focus on spurious features that are not relevant to the cause of the bug, such as variable names (Chakraborty et al., 2021). Inspired by the work of Xu et al. (2019) and Cranmer et al. (2020), we designed a novel graph learning framework and embedding technique that is guided by a program analysis algorithm for vulnerability detection, namely Dataflow Analysis (DFA).
DFA computes the data usage patterns and relations in the control flow graph (CFG) of a program and reports a vulnerability based on its root cause, i.e., whether the values and data relations collected from the program indicate the occurrence of the vulnerable conditions. We explored an analogy between DFA and the GNN message-passing algorithm, and designed an embedding technique that encodes dataflow information at each node of the CFG. Graph learning on such an embedding thus simulates the dataflow computation in DFA. We propose DeepDFA, a Deep Learning Framework guided by DFA, shown in Figure 1. Given the source code of a potentially vulnerable program, we convert it to a CFG and encode the nodes using an abstract dataflow embedding we designed. Our abstract dataflow embedding represents variable definitions using the properties that are most important for vulnerability detection, based on domain knowledge from program analysis, e.g., the data types, API calls, constants, and operators present in the definition. Then, we apply graph learning and its message-passing mechanism to propagate this information across the edges of the CFG, similar to what is done in a dataflow analysis. Finally, we use the learned graph representation to classify whether the function is vulnerable. Our evaluation shows that DeepDFA is significantly faster than our baseline models, in terms of both training and inference time. It took only 9 minutes to train and 1.92 ms/example for inference on CPU. Yet DeepDFA still ranked first in recall, ranked second in F1, and generalized to unseen projects better than the other baseline models. Importantly, we show that the DeepDFA embedding can be used with other models to further improve their performance.
Figure 1: Overview of our DeepDFA approach.
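
To make the analogy between dataflow analysis and message passing more concrete, the following is a minimal Python sketch of a classical reaching-definitions analysis that uses integers as bitvectors. It is an illustration only and not DeepDFA's implementation: the function name `reaching_definitions`, the four-node example CFG, and its GEN/KILL sets are all assumptions made for this example. Each iteration aggregates information from a node's CFG predecessors (a bitwise OR) and then applies a node-local update, which mirrors the aggregate and update steps of GNN message passing.

```python
from typing import Dict, List, Tuple


def reaching_definitions(
    num_nodes: int,
    edges: List[Tuple[int, int]],
    gen: Dict[int, int],
    kill: Dict[int, int],
) -> Tuple[List[int], List[int]]:
    """Iterative reaching-definitions analysis over a CFG.

    Bitvectors are plain ints: bit i is set when definition i reaches
    the program point. Returns the (IN, OUT) bitvectors for each node.
    """
    # Build predecessor lists from the CFG edge list.
    preds: List[List[int]] = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        preds[dst].append(src)

    in_sets = [0] * num_nodes
    out_sets = [0] * num_nodes
    changed = True
    while changed:  # iterate until a fixed point is reached
        changed = False
        for n in range(num_nodes):
            # Aggregate: union (bitwise OR) of the predecessors' OUT sets.
            new_in = 0
            for p in preds[n]:
                new_in |= out_sets[p]
            # Update: OUT[n] = GEN[n] | (IN[n] & ~KILL[n]).
            new_out = gen.get(n, 0) | (new_in & ~kill.get(n, 0))
            if new_in != in_sets[n] or new_out != out_sets[n]:
                in_sets[n], out_sets[n] = new_in, new_out
                changed = True
    return in_sets, out_sets


# Hypothetical CFG for illustration:
#   node 0: x = 1    (definition d0, bit 0)
#   node 1: if (c)
#   node 2: x = 2    (definition d1, bit 1)
#   node 3: use(x)
edges = [(0, 1), (1, 2), (1, 3), (2, 3)]
gen = {0: 0b01, 2: 0b10}   # definitions generated at each node
kill = {0: 0b10, 2: 0b01}  # definitions of x overwritten at each node

in_sets, out_sets = reaching_definitions(4, edges, gen, kill)
print([bin(b) for b in in_sets])
```

In this example both definitions of `x` reach the use at node 3, because one path through the branch skips the second assignment. A learned model replaces the fixed OR and GEN/KILL functions with trainable aggregate and update functions over node embeddings.
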
We created the top state-of-the-art vulnerability detection model by combining LineVul and DeepDFA, achieving a 96.40 F1 score, 98.69 precision, and 94.22 recall. In line with research such as GIN (Xu et al., 2019) and Cranmer et al. (2020), DeepDFA also demonstrated that deep learning guided by domain algorithms can achieve better results with minimal resources. DeepDFA did not use any token-level or text-level features, which hardly reveal the cause of vulnerabilities; instead, it used an abstract dataflow embedding to simulate dataflow propagation (a causal domain algorithm) via graph learning, and borrowed the idea of bitvectors from dataflow analysis to make embedding and learning efficient. In summary, we made the following contributions in this paper:
1. We designed an abstract dataflow embedding to efficiently learn the program semantics relevant to vulnerability detection;
2. We applied graph learning to the control flow graph (CFG) of the program and the abstract dataflow embedding to simulate reaching-definition dataflow analysis;
3. We implemented DeepDFA and experimentally demonstrated that it outperforms baselines in vulnerability detection in terms of effectiveness, efficiency, and generalization to unseen projects;
4. We provided a comprehensive account of the analogy between dataflow analysis and graph learning, which helps explain why DeepDFA performs well and is efficient; and
5. We showed that DeepDFA can be used to improve other models, and we delivered the best vulnerability detection model in the state of the art by combining LineVul and DeepDFA.

2. RELATED WORK

A vulnerability is a flaw in a software program that can be exploited to negatively impact the consumer or system (cve, 2022). In the literature, vulnerability detection is typically framed as a binary classification problem. Devign (Zhou et al., 2019), ReVeal (Chakraborty et al., 2021), IVDetect (Li

