This repository contains the dataset CWE-Bench-Java presented in the paper LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. At a high level, this dataset contains 120 CVEs spanning 4 CWEs, namely path-traversal, OS-command injection, cross-site scripting, and code-injection. Each CVE includes the buggy and fixed source code of the project, along with the information of the fixed files and functions. We provide the seed information in this repository, and we provide scripts for fetching, patching, and building the repositories. The dataset collection process is illustrated in the figure below:
In this dataset, each project is uniquely identified with a Project Slug, encompassing its repository name, CVE ID, and a tag corresponding to the buggy version of the project. We show one example below:
DSpace__DSpace_CVE-2016-10726_4.4
^^^^^^  ^^^^^^ ^^^^^^^^^^^^^^ ^^^
|       |      |              |--> Version Tag
|       |      |--> CVE ID
|       |--> Repository name
|--> Github Username
All the patches, advisory information, build information, and fix information are associated with project slugs. Since there are 120 projects in the CWE-Bench-Java dataset, we have 120 unique project slugs. Note that a single repository may be found to have different CVEs in different versions.
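As a minimal sketch, a project slug can be split back into its four parts with a regular expression. The pattern below assumes the layout `<username>__<repository>_<CVE-ID>_<tag>` shown above and that the username itself contains no double underscore; `ProjectSlug` and `parse_project_slug` are illustrative names, not part of the repository's scripts.

```python
import re
from typing import NamedTuple


class ProjectSlug(NamedTuple):
    github_username: str
    github_repository_name: str
    cve_id: str
    github_tag: str


# Assumed layout: <username>__<repository>_<CVE-ID>_<tag>
SLUG_PATTERN = re.compile(
    r"^(?P<user>.+?)__(?P<repo>.+)_(?P<cve>CVE-\d{4}-\d+)_(?P<tag>.+)$"
)


def parse_project_slug(slug: str) -> ProjectSlug:
    """Split a project slug into username, repository, CVE ID, and tag."""
    match = SLUG_PATTERN.match(slug)
    if match is None:
        raise ValueError(f"not a recognizable project slug: {slug!r}")
    return ProjectSlug(match["user"], match["repo"], match["cve"], match["tag"])


print(parse_project_slug("DSpace__DSpace_CVE-2016-10726_4.4"))
```

Anchoring on the CVE ID in the middle keeps the split unambiguous even when the repository name or the tag contains underscores.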
- data/
  - project_info.csv
  - build_info.csv
  - fix_info.csv
  - patches/<project_slug>.patch
  - advisory/<project_slug>.json
The core set of information in this dataset lies in two files, data/project_info.csv and data/fix_info.csv.
We also provide other essential information such as CVE advisory, build information, and patches for the projects to be compiled and built.
We now go into the project information and fix information CSVs.
id | project_slug | cve_id | cwe_id | cwe_name | github_username | github_repository_name | github_tag | github_url | advisory_id | buggy_commit_id | fix_commit_ids |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | DSpace__DSpace_CVE-2016-10726_4.4 | CVE-2016-10726 | CWE-022 | Path Traversal | DSpace | DSpace | 4.4 | https://github.com/DSpace/DSpace | GHSA-4m9r-5gqp-7j82 | ca4c86b1baa4e0b07975b1da86a34a6e7170b3b7 | 4239abd2dd2ae0dedd7edc95a5c9f264fdcf639d |
Each row in data/project_info.csv looks like the example above.
We now get into each field and explain what they are.
- id: an integer from 1 to 120
- project_slug: (explained in the previous section)
- cve_id: a common vulnerability identifier of the form CVE-XXXX-XXXXX
- cwe_id: a common weakness enumeration (CWE) identifier. In our dataset, the only values are CWE-022, CWE-078, CWE-079, and CWE-094
- cwe_name: the name of the CWE
- github_username: the user/organization that owns the repository on Github
- github_repository_name: the repository name on Github
- github_tag: the tag associated with the version where the vulnerability is found; usually a version tag
- github_url: the URL to the github repository
- advisory_id: the advisory ID in Github Security Advisory database
- buggy_commit_id: the commit hash (like ca4c86b1baa4e0b07975b1da86a34a6e7170b3b7) where the vulnerability can be reproduced
- fix_commit_ids: the set of commit hashes (sequentially ordered and separated with semicolon ;) corresponding to the fix of the vulnerability
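The fields above can be read with Python's standard csv module. The following is a sketch only: it assumes the file is a comma-separated CSV whose header matches the column names listed above, and it uses an inline copy of the example row instead of opening data/project_info.csv. `split_commit_ids` and `read_projects` are illustrative helper names.

```python
import csv
import io

# Inline stand-in for data/project_info.csv (assumed to be comma-separated).
SAMPLE = (
    "id,project_slug,cve_id,cwe_id,cwe_name,github_username,"
    "github_repository_name,github_tag,github_url,advisory_id,"
    "buggy_commit_id,fix_commit_ids\n"
    "1,DSpace__DSpace_CVE-2016-10726_4.4,CVE-2016-10726,CWE-022,"
    "Path Traversal,DSpace,DSpace,4.4,https://github.com/DSpace/DSpace,"
    "GHSA-4m9r-5gqp-7j82,ca4c86b1baa4e0b07975b1da86a34a6e7170b3b7,"
    "4239abd2dd2ae0dedd7edc95a5c9f264fdcf639d\n"
)


def split_commit_ids(field):
    """Split the semicolon-separated fix_commit_ids field, keeping its order."""
    return [commit.strip() for commit in field.split(";") if commit.strip()]


def read_projects(handle):
    """Read project rows, converting id to int and fix_commit_ids to a list."""
    projects = []
    for row in csv.DictReader(handle):
        row["id"] = int(row["id"])
        row["fix_commit_ids"] = split_commit_ids(row["fix_commit_ids"])
        projects.append(row)
    return projects


projects = read_projects(io.StringIO(SAMPLE))
print(projects[0]["project_slug"], projects[0]["fix_commit_ids"])
```

To read the real file, pass `open("data/project_info.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.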
The data/fix_info.csv file contains the fixed Java methods and classes for each CVE.
In general, the fix could span over multiple commits, and a change could be made to arbitrary files in the repository, including resources (like .txt, .html) and Java source files (including core source code and test cases).
In this table, we only include the methods and classes that are considered core.
Many of the rows in this table are manually vetted and labeled.
Note that there may be fixes on class variables, in which case there will not be method information associated with the fix.
A single function may be "fixed" by multiple commits.
Each row in data/fix_info.csv looks like the following.
project_slug | cve | github_username | github_repository_name | commit | file | class | class_start | class_end | method | method_start | method_end | signature |
---|---|---|---|---|---|---|---|---|---|---|---|---|
apache__activemq_CVE-2014-3576_5.10.2 | CVE-2014-3576 | apache | activemq | 00921f22ff9a8792d7663ef8fadd4823402a6324 | activemq-broker/src/main/java/org/apache/activemq/broker/TransportConnection.java | TransportConnection | 104 | 1655 | processControlCommand | 1536 | 1541 | Response processControlCommand(ControlCommand) |
- project_slug: the unique identifier of each project
- cve_id: the CVE id
- github_username: the user/organization that owns the repository on Github
- github_repository_name: the repository name on Github
- commit: the commit hash containing this fix
- file: the .java file that is fixed
- class: the name of the class that is fixed
- class_start, class_end: the start and end line number of the class
- method: the name of the method that is fixed
- method_start, method_end: the start and end line number of the method
- signature: the signature of the method. Note that we might have multiple overloaded methods with the same name but with different signatures
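The line ranges above can be used to pull a fixed method out of a checked-out source file. This is a sketch under two assumptions that the text does not state explicitly: line numbers are 1-based and inclusive, and rows for class-variable fixes leave the method fields empty. `extract_span` and `method_span` are hypothetical helpers, and the toy Java source is made up for illustration.

```python
from typing import Optional, Tuple


def extract_span(source_text: str, start: int, end: int) -> str:
    """Return lines start..end of a file, assuming 1-based inclusive numbering."""
    lines = source_text.splitlines()
    if not 1 <= start <= end <= len(lines):
        raise ValueError(f"span {start}-{end} is outside the file ({len(lines)} lines)")
    return "\n".join(lines[start - 1:end])


def method_span(row: dict) -> Optional[Tuple[int, int]]:
    """Return (method_start, method_end), or None when the row has no method info."""
    if not row.get("method_start") or not row.get("method_end"):
        return None
    return int(row["method_start"]), int(row["method_end"])


# Values taken from the example row above.
row = {"method": "processControlCommand", "method_start": "1536", "method_end": "1541"}
print(method_span(row))

# Hypothetical toy file, used only to show the slicing.
toy_source = "class A {\n  void f() {\n  }\n}\n"
print(extract_span(toy_source, 2, 3))
```

Because line numbers refer to a specific commit, the file should be read from the checkout that matches the row being inspected.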
We provide the scripts to fetch, patch, and build projects, assuming that you have a machine with any distribution of Linux x64.
In order to run the scripts, make sure that you have plenty of space on the host machine (as many projects can be very large).
Fetching the repositories requires that you have git, wget, zip/unzip, tar, and python3 available on your system.
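Before running the scripts, the presence of these tools can be checked with a few lines of Python. This is an optional convenience sketch, not something the repository ships; `missing_tools` is an illustrative name.

```python
import shutil

# Command-line tools listed above as prerequisites for fetching.
REQUIRED_TOOLS = ["git", "wget", "zip", "unzip", "tar", "python3"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools are available.")
```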
For building, we need Java distributions as well as Maven and Gradle for package management.
In case you have a different system than Linux x64, please modify scripts/jdk_version.json, scripts/mvn_version.json, and scripts/gradle_version.json to specify the corresponding JDK/MVN/Gradle files.
In addition, please prepare 3 versions of JDK and put them under the java-env folder.
This is because Oracle requires an account to download the JDK, so we are unable to provide an automated script.
Download from the following URLs:
- JDK 7u80: https://www.oracle.com/java/technologies/javase/javase7-archive-downloads.html
- JDK 8u202: https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
- JDK 17: https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html
At this point, you should have a java-env directory that looks like
- java-env/
  - jdk-7u80-linux-x64.tar.gz
  - jdk-8u202-linux-x64.tar.gz
  - jdk-17_linux-x64_bin.tar.gz
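A quick way to confirm that the directory matches this layout is sketched below. It only checks for the three Linux x64 archive names listed above, so it would need adjusting if the version JSON files were modified for another platform; `missing_jdk_archives` is an illustrative name.

```python
from pathlib import Path

# Archive names taken from the java-env listing above (Linux x64).
EXPECTED_ARCHIVES = [
    "jdk-7u80-linux-x64.tar.gz",
    "jdk-8u202-linux-x64.tar.gz",
    "jdk-17_linux-x64_bin.tar.gz",
]


def missing_jdk_archives(java_env="java-env"):
    """Return the expected JDK archives that are not present in java_env."""
    root = Path(java_env)
    return [name for name in EXPECTED_ARCHIVES if not (root / name).is_file()]


print(missing_jdk_archives())
```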
Now, you can run our script using the command:
$ python3 scripts/setup.py
This script will do the following things:
- Create build-info and project-sources directories
- Install multiple JDK versions by unzipping your provided JDK distributions (in java-env)
- Install multiple MAVEN versions by automatically downloading and unzipping the MAVEN distributions (in java-env)
- Install multiple Gradle versions by automatically downloading and unzipping the Gradle distributions (in java-env)
- For each project in our dataset
  - Fetch the project into the project-sources/<project_slug> directory
  - Build the project by trying multiple versions of JDK and MAVEN. The build information (whether the build succeeded or not) will be stored to build-info/
The resulting build information will be stored under the build-info directory.
Each file is a .json file storing content like the following, which specifies the Java version and the Maven/Gradle/Gradlew versions used.
{"jdk": "8u202", "mvn": "3.5.0"}
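These files can be collected into a single dictionary for inspection. The sketch below assumes that each file in build-info/ is named after its project slug with a .json extension, which is not stated above; only the jdk and mvn keys are taken from the example, and `load_build_info` is an illustrative name.

```python
import json
from pathlib import Path


def load_build_info(build_info_dir="build-info"):
    """Map each JSON file's stem (assumed to be the project slug) to its content."""
    result = {}
    for path in sorted(Path(build_info_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            result[path.stem] = json.load(handle)
    return result


# The example content shown above, parsed directly.
example = json.loads('{"jdk": "8u202", "mvn": "3.5.0"}')
print(example["jdk"], example["mvn"])
```

Calling `load_build_info()` after a run of the setup script would then give one entry per JSON file found in build-info/.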
For those of you who do not want to build, you do not need to provide the JDK distributions. Just directly run the following command:
$ python3 scripts/setup.py --no-build
You can use the following arguments (or a combination of them) to specify the set of projects you want to build.
For --filter and --exclude, we will use the specified names to match against project slugs.
Here are a few examples:
$ python3 scripts/setup.py --cwe CWE-022 CWE-078 # only builds projects under CWE-022 and CWE-078
$ python3 scripts/setup.py --filter keycloak # only build keycloak projects (there are multiple of them)
$ python3 scripts/setup.py --exclude apache # do not build any apache related projects
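To make the selection semantics concrete, here is a sketch of how such options could combine. It assumes case-sensitive substring matching against the project slug, which may differ from what scripts/setup.py actually does; `select_projects` is a hypothetical function, and every slug except the DSpace one is a made-up placeholder.

```python
def select_projects(projects, cwes=None, filters=None, excludes=None):
    """Return slugs that pass the CWE, filter, and exclude criteria (assumed semantics)."""
    selected = []
    for project in projects:
        slug = project["project_slug"]
        if cwes and project["cwe_id"] not in cwes:
            continue
        if filters and not any(name in slug for name in filters):
            continue
        if excludes and any(name in slug for name in excludes):
            continue
        selected.append(slug)
    return selected


# Only the first entry comes from the dataset example; the rest are placeholders.
projects = [
    {"project_slug": "DSpace__DSpace_CVE-2016-10726_4.4", "cwe_id": "CWE-022"},
    {"project_slug": "example__keycloak_CVE-0000-0001_1.0", "cwe_id": "CWE-079"},
    {"project_slug": "apache__example_CVE-0000-0002_2.0", "cwe_id": "CWE-078"},
]

print(select_projects(projects, cwes=["CWE-022", "CWE-078"]))
print(select_projects(projects, filters=["keycloak"]))
print(select_projects(projects, excludes=["apache"]))
```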
After the build attempts, results are written to the build-info/ directory as well as to data/build_info.csv.
Specifically, the build-info/ directory contains an individual JSON file for each project, while data/build_info.csv contains a full CSV table for all the projects.
Each row in data/build_info.csv consists of the following:
project_slug | status | jdk_version | mvn_version | gradle_version | use_gradlew |
---|---|---|---|---|---|
perwendel__spark_CVE-2018-9159_2.7.1 | success | 8u202 | 3.5.0 | n/a | n/a |
- project_slug: the identifier of the project
- status: either success or failure, indicating whether the project has been successfully built
- jdk_version: if built successfully, the JDK version used. May be 7u80, 8u202, or 17
- mvn_version: if built successfully with MAVEN, the MAVEN version used. May be 3.5.0 or 3.9.8. If not built successfully or not built with MAVEN, it will be n/a
- gradle_version: if built successfully with Gradle, the Gradle version used. Otherwise, it will be n/a
- use_gradlew: if built successfully with the custom gradlew script provided with the project itself, it will be 1. Otherwise it will be n/a
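A short summary of the build table can be produced with the standard library. The sketch assumes a comma-separated file with the header shown above and runs on an inline sample; the first data row is the example from the table, while the second is a made-up failure row added for illustration. `summarize_builds` is an illustrative name.

```python
import csv
import io
from collections import Counter

# Inline stand-in for data/build_info.csv; the second row is hypothetical.
SAMPLE = (
    "project_slug,status,jdk_version,mvn_version,gradle_version,use_gradlew\n"
    "perwendel__spark_CVE-2018-9159_2.7.1,success,8u202,3.5.0,n/a,n/a\n"
    "example__project_CVE-0000-0001_1.0,failure,n/a,n/a,n/a,n/a\n"
)


def summarize_builds(handle):
    """Count rows per status and, for successful builds, per JDK version."""
    status_counts = Counter()
    jdk_counts = Counter()
    for row in csv.DictReader(handle):
        status_counts[row["status"]] += 1
        if row["status"] == "success":
            jdk_counts[row["jdk_version"]] += 1
    return status_counts, jdk_counts


status_counts, jdk_counts = summarize_builds(io.StringIO(SAMPLE))
print(dict(status_counts), dict(jdk_counts))
```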
Consider citing our paper:
@article{li2024iris,
title={LLM-Assisted Static Analysis for Detecting Security Vulnerabilities},
author={Ziyang Li and Saikat Dutta and Mayur Naik},
year={2024},
eprint={2405.17238},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2405.17238},
}