This repository for our ASE 2024 paper "How Effective Do Code Language Models Understand Poor-Readability Code?" contains the benchmark suite, the results, the procedures for acquiring and preparing the materials, and the source code of our automatic scoring tool. We hope this artifact motivates and supports future research on code summarization.
- Script to construct perturbed datasets from the source data.
- Automatic inference scripts. Models: CodeBERT, CodeT5, CodeLlama. Programming languages: Go, Java, Python. Data types: source data and perturbation-generated data.
- Script for automatic scoring. Scoring targets: the inference results of CodeBERT, CodeT5, CodeLlama, and GPT-4o. Evaluation metrics: BLEUScore, BERTScore, and P-value.
Experiments are conducted using Python 3.9.7 on an Ubuntu 22.04.1 server.
To set up the project, clone the repository, navigate to its root directory, and install all required packages:

```bash
git clone https://github.com/ythere-y/PoorCodeSumEval.git
cd PoorCodeSumEval
pip install -r requirements.txt
```
- Get the CodeXGLUE dataset from https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text
- Get TL-CodeSum from https://github.com/xing-hu/TL-CodeSum
- Get DeepCom from https://github.com/xing-hu/EMSE-DeepCom
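As a quick sanity check (a sketch, not part of the repository's scripts), the CodeXGLUE data can also be loaded directly with the Hugging Face `datasets` library; the field names below follow the dataset card:

```python
# Sketch: load the CodeXGLUE code-to-text data via Hugging Face `datasets`.
from datasets import load_dataset

# "python" is one language config; "go" and "java" are also available.
ds = load_dataset("google/code_x_glue_ct_code_to_text", "python", split="test")
print(ds[0]["code"])       # source function
print(ds[0]["docstring"])  # reference summary
```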
- `process_data/RobustCodeSum.py` processes Python code.
- `process_data/RobustCodeSumGo.py` processes Go code.
- `process_data/RobustCodeSumJava.py` processes Java code.
Take Python & CodeXGLUE as an example to construct the IOE perturbation dataset.
In `process_data/RobustCodeSum.py`, point `DATASET_PATH` at your local copy of the dataset and edit the main function:

```python
DATASET_PATH = "path_to_code_x/code_x_glue_ct_code_to_text"

if __name__ == "__main__":
    robust = PythonRobustCodeSum()
    robust.gen_IOE()
```
Then run:

```bash
python process_data/RobustCodeSum.py
```
The resulting dataset will be saved to `local_data/single/semantic/IOE/python/CSN`.
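To confirm the construction step produced output, you can list the files under the target directory (a hypothetical check, not part of the repository's scripts):

```python
# Hypothetical sanity check: list whatever the IOE construction step wrote.
from pathlib import Path

out_dir = Path("local_data/single/semantic/IOE/python/CSN")
for p in sorted(out_dir.rglob("*")):
    print(p.relative_to(out_dir))
```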
Run inference with CodeBERT, CodeT5, or CodeLlama-7b via the scripts in `tasks/`.
Take Go & CodeLlama & FNE as an example to conduct inference.
Edit the main function of `tasks/single_llama_task.py` as follows to set the language and the dataset type:
```python
if __name__ == "__main__":
    lang_name = "go"
    limit = 2000
    single_dataset_gen(
        partition_name="single",
        type_name="semantic",
        mode_name="FNE",
        task_name="work",
        lang_name=lang_name,
        limit=limit,
    )
```
Then run:

```bash
python tasks/single_llama_task.py
```
The output includes the inference results of CodeLlama-7b on the Go dataset with FNE perturbation, along with the reference summaries. The results will be saved to `ref_and_gen/codellama-7b/single/semantic/FNE/go/work_gen_[0-2000].json` and `ref_and_gen/codellama-7b/single/semantic/FNE/go/work_ref_[0-2000].json`.
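To inspect a few generated/reference pairs, the JSON files can be read back directly. This sketch assumes each file is a JSON list of strings; verify against the actual files, since their layout is defined by the inference script:

```python
# Sketch: print the first few generated/reference summary pairs.
# Assumes each file is a JSON list of strings (verify against the real files).
import json

base = "ref_and_gen/codellama-7b/single/semantic/FNE/go"
with open(f"{base}/work_gen_[0-2000].json") as f:
    generated = json.load(f)
with open(f"{base}/work_ref_[0-2000].json") as f:
    references = json.load(f)

for gen, ref in zip(generated[:3], references[:3]):
    print("GEN:", gen)
    print("REF:", ref)
```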
The scripts in `scores/` calculate BLEUScore, BERTScore, and P-values. Configure the main function of `scores/bleu_BERTScore.py`:
```python
if __name__ == "__main__":
    reset_summary("CodeLlama-7b-hf")
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        print(f"start scoring model : {model_name}, lang : {lang_name}")
        AllBLEUScore(model_name, lang_name, task_name, start_point, limit)
        AllBERTScore(model_name, lang_name, task_name, start_point, limit)
    t1 = time.time()
```
Then run:

```bash
python scores/bleu_BERTScore.py
```
Description: This script reads the inference results from the model's default path and calculates the BLEU and BERTScore values. The detailed results will be saved to `scores/CodeLlama-7b-hf`, and a summary of the scores will be saved to `scores/CodeLlama-7b-hf/summary.json`.
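For intuition about what the two metrics measure, here is a standalone illustration on a single summary pair. It is not the repository's scoring code; it uses the `nltk` and `bert-score` packages directly:

```python
# Standalone illustration (not the repo's scoring code): sentence-level BLEU
# via nltk and BERTScore via the bert-score package, on one toy pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

ref = "returns the absolute value of a number"
gen = "return the absolute value of the given number"

bleu = sentence_bleu(
    [ref.split()], gen.split(),
    smoothing_function=SmoothingFunction().method4,
)
P, R, F1 = bert_score([gen], [ref], lang="en")
print(f"BLEU: {bleu:.4f}  BERTScore F1: {F1.item():.4f}")
```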
Configure `analysis_and_log()` in `scores/significant.py`:

```python
def analysis_and_log():
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    score_name = "BERTScore"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        ALLSignificant(model_name, lang_name, task_name, start_point, score_name, limit)
```
Then run:

```bash
python scores/significant.py
```
Description: This script reads the BERTScore results from the model's default path and calculates the P-values. The detailed results are printed directly.
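The exact statistical test is defined in `scores/significant.py`; as an illustration of the general idea, a paired non-parametric test such as the Wilcoxon signed-rank test can compare per-sample scores before and after perturbation (toy numbers below, not results from the paper):

```python
# Illustration only: paired Wilcoxon signed-rank test on toy BERTScore values
# for the same samples before vs. after perturbation. The repo's actual test
# lives in scores/significant.py and may differ.
from scipy.stats import wilcoxon

scores_original = [0.91, 0.88, 0.93, 0.85, 0.90]   # toy numbers
scores_perturbed = [0.87, 0.86, 0.90, 0.80, 0.88]  # toy numbers

stat, p_value = wilcoxon(scores_original, scores_perturbed)
print(f"p-value: {p_value:.4f}")
```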
Explanations of common questions and additional experiments on the P-value can be found in appendix.pdf.