CWE-Bench-Java

This repository contains the dataset CWE-Bench-Java presented in the paper LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. At a high level, this dataset contains 120 CVEs spanning 4 CWEs, namely path-traversal, OS-command injection, cross-site scripting, and code-injection. Each CVE includes the buggy and fixed source code of the project, along with the information of the fixed files and functions. We provide the seed information in this repository, and we provide scripts for fetching, patching, and building the repositories. The dataset collection process is illustrated in the figure below:

Project Identifier

In this dataset, each project is uniquely identified with a Project Slug, encompassing its repository name, CVE ID, and a tag corresponding to the buggy version of the project. We show one example below:

DSpace__DSpace_CVE-2016-10726_4.4
^^^^^^  ^^^^^^ ^^^^^^^^^^^^^^ ^^^
|       |      |              |--> Version Tag
|       |      |--> CVE ID
|       |--> Repository name
|--> Github Username

All the patches, advisory information, build information, and fix information are associated with project slugs. Since there are 120 projects in the CWE-Bench-Java dataset, we have 120 unique project slugs. Note that a single repository may be found to have different CVEs in different versions.

Packaged Data

- data/
  - project_info.csv
  - build_info.csv
  - fix_info.csv
- patches/<project_slug>.patch
- advisory/<project_slug>.json

The core set of information in this dataset lies in two files, data/project_info.csv and data/fix_info.csv. We also provide other essential information such as CVE advisory, build information, and patches for the projects to be compiled and built. We now go into the project information and fix information CSVs.

Project Info

id	project_slug	cve_id	cwe_id	cwe_name	github_username	github_repository_name	github_tag	github_url	advisory_id	buggy_commit_id	fix_commit_ids
1	DSpace__DSpace_CVE-2016-10726_4.4	CVE-2016-10726	CWE-022	Path Traversal	DSpace	DSpace	4.4	https://github.com/DSpace/DSpace	GHSA-4m9r-5gqp-7j82	ca4c86b1baa4e0b07975b1da86a34a6e7170b3b7	4239abd2dd2ae0dedd7edc95a5c9f264fdcf639d

Each row in data/project_info.csv looks like the example above. We now get into each field and explain what they are.

id: an integer from 1 to 120
project_slug: (explained in the previous section)
cve_id: a common vulnerability identifier CVE-XXXX-XXXXX
cwe_id: a common weakness enumeration (CWE) identifier. In our dataset, there is only CWE-022, CWE-078, CWE-079, CWE-094
cwe_name: the name of the CWE
github_username: the user/organization that owns the repository on Github
github_repository_name: the repository name on Github
github_tag: the tag associated with the version where the vulnerability is found; usually a version tag
github_url: the URL to the github repository
advisory_id: the advisory ID in Github Security Advisory database
buggy_commit_id: the commit hash (like ca4c86b1baa4e0b07975b1da86a34a6e7170b3b7) where the vulnerability can be reproduced
fix_commit_ids: the set of commit hashes (sequentially ordered and separated with semicolon ;) corresponding to the fix of the vulnerability

Fix Info

The data/fix_info.csv file contains the fixed Java methods and classes to each CVE. In general, the fix could span over multiple commits, and a change could be made to arbitrary files in the repository, including resources (like .txt, .html) and Java source files (including core source code and test cases). In this table, we only include the methods and classes that are considered core. Many of the rows in this table is manually vetted and labeled. Note that there may be fixes on class variables, in which case there will not be method information associated with the fix. A single function may be "fixed" by multiple commits.

Each row in data/fix_info.csv looks like the following.

project_slug	cve	github_username	github_repository_name	commit	file	class	class_start	class_end	method	method_start	method_end	signature
apache__activemq_CVE-2014-3576_5.10.2	CVE-2014-3576	apache	activemq	`00921f22ff9a8792d7663ef8fadd4823402a6324`	`activemq-broker/src/main/java/org/apache/activemq/broker/TransportConnection.java`	`TransportConnection`	104	1655	`processControlCommand`	1536	1541	`Response processControlCommand(ControlCommand)`

project_slug: the unique identifier of each project
cve_id: the CVE id
github_username: the user/organization that owns the repository on Github
github_repository_name: the repository name on Github
commit: the commit hash containing this fix
file: the .java file that is fixed
class: the name of the class that is fixed
class_start, class_end: the start and end line number of the class
method: the name of the method that is fixed
method_start, method_end: the start and end line number of the method
signature: the signature of the method. Note that we might have multiple overloaded methods with the same name but with different signatures

Fetch and Build Projects

We provide the scripts to fetch, patch, and build projects, assuming that you have a machine with any distribution of Linux x64. In order to run the scripts, make sure that you have plenty of space on the host machine (as many projects can be very large). Fetching repository requires that you have git, wget, zip/unzip, tar, and python3 available on your system.

For building, we need Java distributions as well as Maven and Gradle for package management. In case you have a different system than Linux x64, please modify scripts/jdk_version.json, scripts/mvn_version.json, and scripts/gradle_version.json to specify the corresponding JDK/MVN/Gradle files. In addition, please prepare 3 versions of JDK and put them under the java-env folder. This is due to that Oracle requires an account to download JDK, and we are unable to provide an automated script. Download from the following URLs:

At this point, you should have a java-env directory that looks like

- java-env/
  - jdk-7u80-linux-x64.tar.gz
  - jdk-8u202-linux-x64.tar.gz
  - jdk-17_linux-x64_bin.tar.gz

Now, you can run our script using the command:

$ python3 scripts/setup.py

This script will do the following things

Create build-info and project-sources directories
Install multiple JDK versions by unzipping your provided JDK distributions (in java-env)
Install multiple MAVEN versions by automatically downloading and unzipping the MAVEN distributions (in java-env)
Install multiple Gradle versions by automatically downloading and unzipping the Gradle distributions (in java-env)
For each project in our dataset
- Fetch the project into project-sources/<project_slug> directory
- Build the project by trying multiple versions of JDK and MAVEN. The build information (whether it succeed or not) will be stored to build-info/

The resulting build information will be stored under build-info directory. Each file is a .json storing content like the following that specifies the corresponding Java version and the used Maven/Gradle/Gradlew versions.

{"jdk": "8u202", "mvn": "3.5.0"}

Fetch without building

For those of you who do not want to build, you do not need to provide the JDK distributions. Just directly run the following command:

$ python3 scripts/setup.py --no-build

Only fetch/build a subset of projects

You can use the following arguments (or a combination of them) to specify the set of projects you want to build. For --filter and --exclude, we will use the specified names to match against project slugs. Here is a few examples:

$ python3 scripts/setup.py --cwe CWE-022 CWE-078   # only builds projects under CWE-022 and CWE-078
$ python3 scripts/setup.py --filter keycloak       # only build keycloak projects (there are multiple of them)
$ python3 scripts/setup.py --exclude apache        # do not build any apache related projects

Build Information

After building attempts, results will be output to the build-info/ directory as well as the data/build_info.csv. Specifically, build-info/ directory contains individual JSON files for each project; data/build_info.csv contains a full CSV table for all the projects. Each row in data/build_info.csv consists of the following:

project_slug	status	jdk_version	mvn_version	gradle_version	use_gradlew
perwendel__spark_CVE-2018-9159_2.7.1	success	8u202	3.5.0	n/a	n/a

project_slug: the identifier of the project
status: either success or failure indicating whether the project has been successfully built
jdk_version: if built successfully, the JDK version used. May be 7u80, 8u202, or 17
mvn_version: if built successfully with MAVEN, the MAVEN version used. Maybe 3.5.0 or 3.9.8. If not built successful or not built with MAVEN, it will be n/a
gradle_version: if built successfully with Gradle, the Gradle version used. Otherwise, it will be n/a
use_gradlew: if built successfully with the custom gradlew script provided with the project itself, it will be 1. Otherwise it will be n/a

Citation

Consider citing our paper:

@article{li2024iris,
      title={LLM-Assisted Static Analysis for Detecting Security Vulnerabilities},
      author={Ziyang Li and Saikat Dutta and Mayur Naik},
      year={2024},
      eprint={2405.17238},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2405.17238},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CWE-Bench-Java

Project Identifier

Packaged Data

Project Info

Fix Info

Fetch and Build Projects

Fetch without building

Only fetch/build a subset of projects

Build Information

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
advisory		advisory
baselines		baselines
data		data
java-env		java-env
patches		patches
resources		resources
scripts		scripts
.gitignore		.gitignore
README.md		README.md

clairew/cwe-bench-java

Folders and files

Latest commit

History

Repository files navigation

CWE-Bench-Java

Project Identifier

Packaged Data

Project Info

Fix Info

Fetch and Build Projects

Fetch without building

Only fetch/build a subset of projects

Build Information

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages