DEEPDFA: DATAFLOW ANALYSIS-GUIDED EFFICIENT GRAPH LEARNING FOR VULNERABILITY DETECTION

Anonymous

Abstract

Deep learning-based vulnerability detection models have recently been shown to be effective and, in some cases, to outperform static analysis tools. However, the highest-performing approaches use token-based transformer models, which do not leverage domain knowledge. Classical program analysis techniques such as dataflow analysis can detect many types of bugs and are the most commonly used methods in practice. Motivated by the causal relationship between bugs and dataflow analysis, we present DeepDFA, a dataflow analysis-guided graph learning framework and embedding that uses program semantic features for vulnerability detection. We show that DeepDFA is performant and efficient. DeepDFA ranked first in recall, first in generalizing over unseen projects, and second in F1 among all the state-of-the-art models we experimented with. It is also the smallest model in terms of the number of parameters, and was trained in 9 minutes, 69x faster than the highest-performing baseline. DeepDFA can also be combined with other models. By integrating LineVul and DeepDFA, we achieved the best vulnerability detection performance of 96.4 F1 score, 98.69 precision, and 94.22 recall.

1. INTRODUCTION

Software vulnerabilities cause great harm to people and corporations. Many Internet users have had their personal information breached because of security vulnerabilities, with common reports of breaches exposing millions of records (wik, 2021). The average data breach costs the target company $4.24 million, according to IBM's 2021 report (ibm, 2021). The number of vulnerabilities is growing every year, as reported by the Common Vulnerabilities and Exposures (CVE) program for 2016-2021 (cve, 2021). Because of the importance of this problem, software companies have invested heavily in developing vulnerability detection tools that can scan software before its release (Lu et al., 2021; Zheng et al., 2021). Deep neural networks have made great progress in vulnerability detection, with the recent LineVul paper (Fu & Tantithamthavorn, 2022) reporting a 0.91 F1 score on a commonly used real-world vulnerability dataset (Fan et al., 2020), and many deep learning-based tools outperforming static analysis (Li et al., 2018; Ding et al., 2022; Cao et al., 2022). Current state-of-the-art models use graph neural networks (GNNs) with unsupervised word embeddings and large pretrained transformers, which can perform well on this task. However, these models are large and expensive to train, and we show that they do not generalize well to unseen projects. Empirical studies have found that the models can focus on spurious features that are not relevant to the cause of the bug, such as variable names (Chakraborty et al., 2021). Inspired by the work of Xu et al. (2019) and Cranmer et al. (2020), we designed a novel graph learning framework and embedding technique guided by a program analysis algorithm for vulnerability detection, namely Dataflow Analysis (DFA).
DFA computes the data usage patterns and relations in the control flow graph (CFG) of a program and reports a vulnerability based on its root cause, i.e., whether the values and data relations collected from the program indicate the occurrence of the vulnerable conditions. We explored an analogy between DFA and the GNN message-passing algorithm, and designed an embedding technique that encodes dataflow information at each node of the CFG. Graph learning on such an embedding thus simulates the dataflow computation in DFA. We propose DeepDFA, a Deep Learning Framework guided by DFA, shown in Figure 1. Given the source code of a potentially vulnerable program, we convert it to a CFG and encode the nodes

