Detection of machine-generated code. Paper accepted to ICSE 2025.


YerbaPage/DetectCodeGPT

DetectCodeGPT

Welcome to the repository for the research paper: "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers." Our paper has been accepted to the 47th International Conference on Software Engineering (ICSE 2025).

Getting Started

Prerequisites

The experiments were run with Python 3.9.7 on an Ubuntu 22.04.1 server.

To install all required packages, navigate to the root directory of this project and run:

pip install -r requirements.txt

Data Preparation

To prepare the datasets used in our study:

  1. Navigate to the code-generation directory.

  2. Obtain datasets from either:

  3. Update the data paths and model specifications in generate.py to reflect your local setup.

  4. Execute the data generation script:

    python generate.py

Usage

Conducting the Empirical Study

Note: You can skip the empirical study if you are only interested in detecting machine-generated code with DetectCodeGPT.

After data preparation, you can proceed to the empirical analysis:

  1. Navigate to the code-analysis directory.

  2. Analyze code length:

    python analyze_length.py

  3. Verify Zipf's and Heaps' laws, and compute token frequencies:

    python analyze_law_and_frequency.py

  4. Analyze the proportion of different token categories:

    python analyze_proportion.py

  5. Study the naturalness of code snippets:

    python analyze_naturalness.py
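
For readers unfamiliar with the two laws checked in step 3, the following is a minimal, standard-library sketch of what such a check might look like. It is not the code in analyze_law_and_frequency.py: the regex tokenizer and every function name here are assumptions made for illustration, and the repository's scripts may tokenize and fit differently. Zipf's law predicts that log(frequency) falls roughly linearly with log(rank); Heaps' law predicts that vocabulary size grows as a power of the number of tokens seen.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Rough lexer (an assumption): identifiers, integers, or single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs (needs >= 2 distinct xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def zipf_slope(tokens):
    """Slope of log(frequency) vs. log(rank); near -1 for a Zipfian corpus.

    Requires at least two distinct tokens.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_ranks = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    log_freqs = [math.log(freq) for freq in freqs]
    return fit_slope(log_ranks, log_freqs)


def heaps_curve(tokens):
    """Vocabulary size V(n) after reading the first n tokens, for each n."""
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes


def heaps_exponent(tokens):
    """Slope of log V(n) vs. log n, an estimate of the Heaps exponent.

    Requires at least two tokens.
    """
    sizes = heaps_curve(tokens)
    log_n = [math.log(n) for n in range(1, len(sizes) + 1)]
    log_v = [math.log(v) for v in sizes]
    return fit_slope(log_n, log_v)


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    toks = tokenize(sample)
    print("tokens:", len(toks), "vocabulary:", heaps_curve(toks)[-1])
    print("Zipf slope:", zipf_slope(toks))
    print("Heaps exponent:", heaps_exponent(toks))
```

On a real corpus, you would concatenate the tokens of all human-written (or all machine-generated) snippets before fitting, then compare the two slopes.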

Using DetectCodeGPT

To evaluate our DetectCodeGPT model:

  1. Navigate to the code-detection directory.

  2. Configure main.py with the appropriate model and dataset paths.

  3. Run the model evaluation script:

    python main.py

Note: If you generated the code with a custom model, update 'base_model_name': "codellama/CodeLlama-7b-hf" in main.py to your model's name before running detection.
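
The key-value form quoted in the note suggests that main.py keeps its settings in a Python dict. The sketch below assumes so. Only the `base_model_name` key and its default value come from this README; the dict name `config` and the replacement model identifier are placeholders, and the real file may hold additional keys or a different structure.

```python
# Hypothetical excerpt: only the 'base_model_name' key and its default value
# are taken from this README; the dict name and the new value are placeholders.
config = {
    'base_model_name': "codellama/CodeLlama-7b-hf",  # default named in the note
}

# Point the detector at the model that generated your code.
config['base_model_name'] = "your-org/your-code-model"  # placeholder identifier
```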

Acknowledgements

The code is adapted from the original DetectGPT and DetectLLM repositories. We thank the authors for their contributions.

Citation

If you use DetectCodeGPT in your research, please cite our paper:

@inproceedings{shi2025detectcodegpt,
  title={Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers},
  author={Shi, Yuling and Zhang, Hongyu and Wan, Chengcheng and Gu, Xiaodong},
  booktitle={Proceedings of the 47th International Conference on Software Engineering (ICSE 2025)},
  year={2025},
  organization={IEEE}
}
