VQR: AUTOMATED SOFTWARE VULNERABILITY REPAIR THROUGH VULNERABILITY QUERIES

Abstract

Recently, automated vulnerability repair (AVR) approaches have been widely adopted to combat the increasing number of software security issues, and transformer-based models in particular achieve competitive results. However, while existing AVR models learn to generate vulnerability repairs, they lack a mechanism for providing the precise location of the vulnerable code (i.e., they may generate repairs for non-vulnerable areas). To address this problem, we base our framework on vision transformer (VIT)-based approaches for object detection, which learn to locate bounding boxes via cross-matching between object queries and image patches. Analogously, we cross-match vulnerability queries and their corresponding vulnerable code areas through the cross-attention mechanism to generate more accurate repairs. To strengthen this cross-matching, we propose to learn a novel vulnerability query mask that focuses strongly on vulnerable code areas and integrate it into the cross-attention. Moreover, we also incorporate the vulnerability query mask into the self-attention to learn embeddings that emphasize the vulnerable areas of a program. Through an extensive evaluation on 5,417 real-world vulnerabilities, our approach outperforms all of the baseline methods by 2.68%-32.33%. The training code and pre-trained models are available at https://github.

1. INTRODUCTION

Software vulnerabilities are security flaws, glitches, or weaknesses found in software code that could lead to a severe system crash or be leveraged as a threat source by attackers (CSRC, 2020). According to the National Vulnerability Database (NVD), the number of vulnerabilities discovered yearly increased from 6,447 in 2016 to 20,156 in 2021, and 18,017 vulnerabilities have already been found in 2022. This trend indicates that more vulnerabilities are being discovered and disclosed every year, increasing the workload for security analysts who must track down and patch them. In particular, fixing a vulnerability takes 58 days on average according to vulnerability statistics reported in 2022 (Edgescan, 2022). Recently, Deep Learning (DL)-based approaches have been proposed to automate the vulnerability repair process by learning the representation of vulnerable programs and generating repair patches accordingly, which may potentially accelerate manual security analysis. Specifically, the transformer architecture has been widely adopted to generate accurate patches that repair vulnerable code automatically (Chen et al., 2022; Chi et al., 2022; Berabi et al., 2021; Fu et al., 2022). The attention-based transformer has been shown to be more effective than RNNs because its self-attention mechanism learns global dependencies across all word embeddings rather than processing the input sequentially. For the software vulnerability repair (SVR) problem, awareness of and attention to the vulnerable code areas, including vulnerable statements, are crucially important: they guide an SVR model to focus on the vulnerable statements and produce better repairs. However, this is challenging because the vulnerable areas are located at arbitrary spatial positions within the source code.
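To make the contrast with sequential RNN processing concrete, the following minimal numpy sketch of (single-head, unprojected) scaled dot-product self-attention shows how every output embedding is a weighted mixture of all input token embeddings at once; it is a simplified illustration, not the paper's actual model.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over token embeddings X (n, d).

    Each output row is a weighted mix of ALL rows of X, so every token's
    representation depends on the whole sequence at once, unlike an RNN
    that consumes tokens one step at a time.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ X                                # (n, d) context-mixed embeddings

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # 5 code tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)                 # (5, 8)
```

In a full transformer, X would additionally be projected into separate query, key, and value matrices per attention head; the global-mixing behavior is the same.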
Toward this challenge, we observe that object detection in computer vision intuitively shares a similar concept with vulnerability repair: both approaches need to localize specific items in their input. In particular, by linking the vulnerable code areas in a source code to the objects in an image, we borrow principles from VIT-based object detection approaches (Carion et al., 2020; Zhu et al., 2020; Wang et al., 2021a) to propose a novel solution for the SVR problem. Specifically, our approach connects detecting spatial objects in an image for predicting bounding boxes with localizing vulnerable code tokens in a source code for generating repair tokens. Our model consists of a vulnerability repair encoder that produces code token embeddings and a vulnerability repair decoder that generates repair tokens. Similar to the object queries in VIT-based object detection, which attend to objects in an image to predict the corresponding bounding boxes, we devise vulnerability queries (VQ) that attend to the vulnerable areas in a source code to predict repair tokens. The cross-attention mechanism employed in the vulnerability repair decoder assists the VQs in cross-matching and paying more attention to the vulnerable code areas. Furthermore, for real-world software vulnerabilities, not all code tokens in a source code are considered vulnerable; only some code tokens are likely to be more vulnerable than others (Nguyen et al., 2021; Fu & Tantithamthavorn, 2022). To strengthen the attention of the VQs to the vulnerable code areas, we train an additional model to learn a vulnerability mask. Specifically, given a source code, the vulnerability mask assigns significantly higher vulnerability scores to the vulnerable code tokens.
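A minimal numpy sketch of the core idea: vulnerability queries cross-attend over code token embeddings, and a per-token vulnerability mask biases the attention scores toward the vulnerable tokens. The function and variable names (`masked_cross_attention`, `vuln_mask`) and the additive log-mask bias are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_attention(queries, code_emb, vuln_mask):
    """Cross-match vulnerability queries (m, d) against code token
    embeddings (n, d), biased by a per-token vulnerability mask (n,).

    Adding log-mask scores before the softmax up-weights the code tokens
    that the mask marks as vulnerable -- a minimal stand-in for
    integrating a learned vulnerability query mask into cross-attention.
    """
    d = queries.shape[-1]
    scores = queries @ code_emb.T / np.sqrt(d)   # (m, n) query-token similarities
    scores = scores + np.log(vuln_mask + 1e-9)   # bias toward vulnerable tokens
    return softmax(scores) @ code_emb            # (m, d) attended context

rng = np.random.default_rng(1)
code_emb = rng.normal(size=(6, 8))       # 6 code tokens
queries = rng.normal(size=(2, 8))        # 2 vulnerability queries
vuln_mask = np.array([0.01, 0.01, 0.9, 0.9, 0.01, 0.01])  # tokens 2-3 flagged
out = masked_cross_attention(queries, code_emb, vuln_mask)
print(out.shape)                         # (2, 8)
```

With a near-uniform mask this reduces to ordinary cross-attention, so the mask acts purely as a learned prior over which tokens the queries should match.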
We then apply the vulnerability mask to both the vulnerability repair encoder and decoder, yielding our approach, named Vulnerability Query based Software Vulnerability Repair (VQR). Figure 1 presents a conceptual overview of VQR. In summary, our contributions are (i) a novel vulnerability repair framework based on object detection that uses vulnerability queries to generate repair patches; (ii) a novel vulnerability query mask that helps the repair model locate vulnerable code tokens more accurately during vulnerability querying; (iii) a comprehensive evaluation of our proposed approach against other automated vulnerability repair approaches on a benchmark dataset of real-world vulnerabilities.
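The separately trained mask model can be pictured as a per-token scorer over the encoder's code embeddings. The sketch below is a hypothetical stand-in (the linear-plus-sigmoid scorer and all names are our illustrative assumptions): it produces scores in (0, 1) that are intended to be high for vulnerable tokens, which can then bias both the encoder's self-attention and the decoder's cross-attention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_vuln_mask(code_emb, w, b):
    """Score each code token's likelihood of being vulnerable from its
    embedding (n, d) -> (n,). A hypothetical per-token linear layer with
    a sigmoid, standing in for the separately trained mask model."""
    return sigmoid(code_emb @ w + b)

rng = np.random.default_rng(2)
code_emb = rng.normal(size=(6, 8))   # 6 code tokens, 8-dim embeddings
w, b = rng.normal(size=8), 0.0       # toy, untrained scorer parameters
mask = predict_vuln_mask(code_emb, w, b)
print(mask.shape)                    # (6,)
```

In practice the mask model would be trained (e.g., against statement-level vulnerability labels) so that its scores concentrate on the vulnerable region rather than being random as in this toy example.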

2. RELATED WORK

Automated Vulnerability Repair (AVR) is the task of using machine learning models to generate repair patches for vulnerable programs. RNN-based models such as SequenceR (Chen et al., 2019) were proposed to encode vulnerable programs and decode the corresponding repairs sequentially; SequenceR used a Bi-LSTM encoder with a unidirectional LSTM decoder to generate repairs. More recently, attention-based transformer models have been leveraged in the AVR domain and shown to be more accurate than RNNs. For instance, VRepair (Chen et al., 2022) relied on an encoder-decoder transformer with transfer learning from bug-fix data to boost vulnerability repair performance on C/C++ programs. SeqTrans (Chi et al., 2022) constructed code sequences by considering the data-flow dependencies of programs and used the same architecture as VRepair. Berabi et al. (2021) used a T5 model pre-trained on a natural language corpus (i.e., T5-large (Raffel et al., 2020)) to fix JavaScript programs, while Fu et al. (2022) utilized a T5 model pre-trained on source code (i.e., CodeT5 (Wang et al., 2021b)) to repair C/C++ programs. Additionally, Mashhadi & Hemmati (2021) applied the CodeBERT (Feng et al., 2020) model to repair Java bugs. Those large pre-trained language models have demonstrated



Figure 1: Intuitively, not all code tokens in a program need to be repaired, and the repair can span multiple areas. Similarly, not all pixels in an image contain objects, and objects can appear in multiple locations. Thus, in object detection, object queries are used in VIT-based approaches (Carion et al., 2020; Zhu et al., 2020; Wang et al., 2021a) to predict bounding boxes and locate objects. Following the same principle, we leverage vulnerability queries to attend to the vulnerable code tokens in the vulnerable code areas and generate repairs for them.



