Vulnerability Detection with Fine-grained Interpretations

This repository contains the code and data for the paper "Vulnerability Detection with Fine-grained Interpretations".

Introduction

Despite the successes of machine learning-based vulnerability detectors (VD), they are limited to providing only the decision on whether a given piece of code is vulnerable or not, without details on what part of the code is relevant to the detected vulnerability. We present IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) via providing VD interpretations at the fine-grained level in terms of vulnerable statements. For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model to discriminate vulnerable statements better than using the mixture of vulnerable code and contextual code as in existing approaches. In addition to the coarse-grained vulnerability detection result, we leverage interpretable ML to provide users with fine-grained interpretations that include the sub-graph in the PDG with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing ML-based approaches by 64–122% and 105–255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretations in 67% of the cases with a top-5 ranked list. It improves over ATT and GRAD interpretation models by 12.3–400% and 9–400% in accuracy.


Contents

  1. Dataset
  2. AST and Graph Generation
  3. Preprocessing
  4. Requirement
  5. Settings
  6. Code
  7. Reference

Dataset

The datasets we used in the paper:

Fan et al. [1]: https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing

Reveal [2]: https://drive.google.com/drive/folders/1KuIYgFcvWUXheDhT--cBALsfy1I4utOy

FFmpeg+Qemu [3]: https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF

AST and Graph Generation

In this study, we use Joern to generate the ASTs and graphs. However, Joern is updated frequently, and some of its functionality has changed. If you want to use the scripts that we used to generate the graphs, please run:

git checkout cbca30d2631a48aed47be1ba46c6d8b5aa23c103

to roll Joern back to the older version that we used. The scripts for generating the graphs can be found at:

https://github.com/vulnerabilitydetection/VulnerabilityDetectionResearch/tree/new_implementation/IVDetect/scripts/joern_graphs.sc

If you are using a newer version of Joern or have detailed questions about Joern, please see Joern's repository at https://github.com/joernio/joern for more details on AST and graph generation.

We provide an example CSV dataset to show what the generated dataset looks like: https://drive.google.com/file/d/1LHOC4JDpnQ7gWnEHGfc4soQYHAPomlNp/view?usp=sharing

See utils/process.py for details on how the generated dataset is used.
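As a rough way to get a first look at a generated CSV before running the preprocessing, the sketch below reads the header and the first few rows with Python's standard csv module. The helper name `peek_csv` is illustrative, and nothing is assumed about the column names; utils/process.py remains the reference for the fields the preprocessing actually expects.

```python
import csv


def peek_csv(path, n_rows=3):
    """Return (column names, first n_rows rows as dicts) for a CSV file."""
    # Cells that hold source code can be very long, so raise the field size limit.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = []
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            rows.append(row)
        return reader.fieldnames, rows

# Example call (the path is a placeholder for wherever you saved the file):
# columns, rows = peek_csv("example.csv")
```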

Note that, based on our findings, the ASTs and graphs generated by different versions of Joern may differ significantly. If you use a newer version of Joern to generate the ASTs and graphs, the model's performance may differ from the results reported in the paper.

Preprocessing

After you generate the ASTs and graphs and store them in the same format as the example data, you can use the preprocessing code in utils/process.py to preprocess the data and generate the features used by our model.

Alternatively, you can go directly to the Code section: gen_graphs.py shows how to use the preprocessing code in utils/process.py to generate the features for the model.

Requirement

All requirements are listed in requirement.txt.

Settings

Our approach can use NNI (Auto-ML) to tune the parameters. To do so, uncomment all lines containing nni in main.py and comment out line 195 in main.py. Then run nnictl create --config config.yml to tune the model parameters automatically.
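The NNI-specific lines of main.py are not reproduced in this README. As a hedged sketch of the usual NNI trial pattern, the snippet below merges tuner-suggested values over default hyperparameters with `nni.get_next_parameter()` and reports the final metric with `nni.report_final_result()`. The parameter names (`lr`, `hidden_dim`) and the `train_and_eval` stub are hypothetical and not taken from main.py, and the import is guarded so the sketch also runs where NNI is not installed.

```python
try:
    import nni  # only needed when the script is launched through nnictl
except ImportError:  # fall back to the defaults when NNI is absent
    nni = None

# Hypothetical defaults; the real parameter names live in main.py.
DEFAULTS = {"lr": 1e-3, "hidden_dim": 128}


def get_params(defaults):
    """Return a copy of the defaults, overridden by tuner suggestions if any."""
    params = dict(defaults)
    if nni is not None:
        params.update(nni.get_next_parameter() or {})
    return params


def train_and_eval(params):
    """Placeholder for the real training loop; returns a dummy score."""
    return 0.0


def run_trial():
    params = get_params(DEFAULTS)
    score = train_and_eval(params)
    if nni is not None:
        nni.report_final_result(score)  # the metric the tuner optimizes
    return params, score
```

Keeping the defaults in one dictionary means the same script can run with or without the tuner, which avoids commenting lines in and out by hand.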

Code

  1. Run git clone https://github.com/vulnerabilitydetection/VulnerabilityDetectionResearch.git to get the repository.

  2. Run gen_graphs.py. Line 166 sets the output directory and line 52 sets the input data name. This first run is expected to end with a file-not-found error.

  3. Run glove/ash.sh and glove/pdg.sh to generate the GloVe embedding.

  4. Comment out line 55 in gen_graphs.py and run gen_graphs.py again.

  5. Run train_test_valid.py to split the dataset.

  6. Run main.py to train and test the model.
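Step 3 produces GloVe embeddings. Assuming the scripts write vectors in GloVe's standard plain-text format (one token per line followed by its whitespace-separated float components), the sketch below loads such a file into a dictionary for inspection. The output file name is not stated in this README, so the path is left to the caller, and the helper name `load_glove_vectors` is illustrative rather than part of the repository.

```python
def load_glove_vectors(path):
    """Load GloVe text vectors into a dict mapping token -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            token, values = parts[0], parts[1:]
            vectors[token] = [float(v) for v in values]
    return vectors
```

Checking that every vector has the same length is a quick sanity test before rerunning gen_graphs.py in step 4.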

The pre-trained model can be downloaded from: https://drive.google.com/file/d/1KQv0aRUFCh-_jQCu8K7uQsB0c_5uCQKa/view?usp=sharing

The relevant test dataset can be downloaded from: https://drive.google.com/file/d/1uMnm7_W9DgXN4AbJ0iUir052H1AF4hA1/view?usp=sharing

Because of the randomness in deep learning models and differences in data splitting, the model performance may differ from the results reported in the paper.
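One common way to reduce this run-to-run variation is to fix the random seeds and make the split deterministic. The sketch below is a generic illustration, not the logic of train_test_valid.py: the 80/10/10 ratio and the function names are assumptions, and the NumPy and PyTorch seeding is applied only if those libraries are installed.

```python
import random


def set_seed(seed):
    """Seed Python's RNG and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def split_indices(n, seed=0, train=0.8, valid=0.1):
    """Shuffle range(n) with a private RNG and cut it into three parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # same seed gives the same order
    n_train = int(n * train)
    n_valid = int(n * valid)
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]
```

Seeding narrows the variation but does not remove it entirely, since some GPU operations are nondeterministic.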

Reference

[1] Jiahao Fan, Yi Li, Shaohua Wang, and Tien Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In The 2020 International Conference on Mining Software Repositories (MSR). IEEE.

[2] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2020. Deep Learning based Vulnerability Detection: Are We There Yet? arXiv preprint arXiv:2009.07235 (2020).

[3] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems. 10197–10207.
