This repository for our ASE 2024 paper "How Effective Do Code Language Models Understand Poor-Readability Code?" contains the benchmark suite, the results, the procedures for acquiring and preparing the materials, and the source code of our automatic scoring tool. We hope this artifact motivates and supports future research on code summarization.
- Script to construct perturbed datasets from the source data.
- Automatic inference scripts. Models: CodeBERT, CodeT5, CodeLlama. Programming languages: Go, Java, Python. Data types: source data and perturbation-generated data.
- Script for automatic scoring. Scoring targets: the inference results of CodeBERT, CodeT5, CodeLlama, and GPT-4o. Evaluation metrics: BLEUScore, BERTScore, and P-value.
Experiments are conducted using Python 3.9.7 on an Ubuntu 22.04.1 server.
To set up the project, clone the repository, navigate to its root directory, and install all required packages:

```bash
git clone https://github.com/ythere-y/PoorCodeSumEval.git
cd PoorCodeSumEval
pip install -r requirements.txt
```
- Get the CodeXGLUE dataset from https://huggingface.co/datasets/google/code_x_glue_ct_code_to_text
- Get TL-CodeSum from https://github.com/xing-hu/TL-CodeSum
- Get DeepCom from https://github.com/xing-hu/EMSE-DeepCom
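As a quick sanity check (a sketch, not part of the repository's scripts), the CodeXGLUE data can also be loaded directly with the Hugging Face `datasets` library; the field names below follow the dataset card:

```python
# Sketch: load the CodeXGLUE code-to-text data via Hugging Face `datasets`.
from datasets import load_dataset

# "python" is one language config; "go" and "java" are also available.
ds = load_dataset("google/code_x_glue_ct_code_to_text", "python", split="test")
print(ds[0]["code"])       # source function
print(ds[0]["docstring"])  # reference summary
```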
- `process_data/RobustCodeSum.py` processes Python code.
- `process_data/RobustCodeSumGo.py` processes Go code.
- `process_data/RobustCodeSumJava.py` processes Java code.
Take Python & CodeXGLUE as an example to construct the IOE perturbation dataset.
In `process_data/RobustCodeSum.py`, point `DATASET_PATH` at your local copy of the dataset and edit the main function:

```python
DATASET_PATH = "path_to_code_x/code_x_glue_ct_code_to_text"

if __name__ == "__main__":
    robust = PythonRobustCodeSum()
    robust.gen_IOE()
```
Then run:

```bash
python process_data/RobustCodeSum.py
```
The resulting dataset will be saved to `local_data/single/semantic/IOE/python/CSN`.
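To confirm the construction step produced output, you can list the files under the target directory (a hypothetical check, not part of the repository's scripts):

```python
# Hypothetical sanity check: list whatever the IOE construction step wrote.
from pathlib import Path

out_dir = Path("local_data/single/semantic/IOE/python/CSN")
for p in sorted(out_dir.rglob("*")):
    print(p.relative_to(out_dir))
```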
Run inference with CodeBERT, CodeT5, or CodeLlama-7b via the scripts in `tasks/`.
Take Go & CodeLlama & FNE as an example to conduct inference.
Edit the main function of `tasks/single_llama_task.py` as follows to set the language and the dataset type:
```python
if __name__ == "__main__":
    lang_name = "go"
    limit = 2000
    single_dataset_gen(
        partition_name="single",
        type_name="semantic",
        mode_name="FNE",
        task_name="work",
        lang_name=lang_name,
        limit=limit,
    )
```
Then run:

```bash
python tasks/single_llama_task.py
```
The output includes the inference results of CodeLlama-7b on the Go dataset with FNE perturbation, along with the reference summaries. The results will be saved to `ref_and_gen/codellama-7b/single/semantic/FNE/go/work_gen_[0-2000].json` and `ref_and_gen/codellama-7b/single/semantic/FNE/go/work_ref_[0-2000].json`.
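To inspect a few generated/reference pairs, the JSON files can be read back directly. This sketch assumes each file is a JSON list of strings; verify against the actual files, since their layout is defined by the inference script:

```python
# Sketch: print the first few generated/reference summary pairs.
# Assumes each file is a JSON list of strings (verify against the real files).
import json

base = "ref_and_gen/codellama-7b/single/semantic/FNE/go"
with open(f"{base}/work_gen_[0-2000].json") as f:
    generated = json.load(f)
with open(f"{base}/work_ref_[0-2000].json") as f:
    references = json.load(f)

for gen, ref in zip(generated[:3], references[:3]):
    print("GEN:", gen)
    print("REF:", ref)
```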
The scripts in `scores/` calculate BLEUScore, BERTScore, and P-values. Configure the main function of `scores/bleu_BERTScore.py`:
```python
if __name__ == "__main__":
    reset_summary("CodeLlama-7b-hf")
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        print(f"start scoring model : {model_name}, lang : {lang_name}")
        AllBLEUScore(model_name, lang_name, task_name, start_point, limit)
        AllBERTScore(model_name, lang_name, task_name, start_point, limit)
    t1 = time.time()
```
Then run:

```bash
python scores/bleu_BERTScore.py
```
Description: This script reads the inference results from the model's default path and calculates the BLEU and BERTScore values. The detailed results will be saved to `scores/CodeLlama-7b-hf`, and a summary of the scores will be saved to `scores/CodeLlama-7b-hf/summary.json`.
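For intuition about what the two metrics measure, here is a standalone illustration on a single summary pair. It is not the repository's scoring code; it uses the `nltk` and `bert-score` packages directly:

```python
# Standalone illustration (not the repo's scoring code): sentence-level BLEU
# via nltk and BERTScore via the bert-score package, on one toy pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

ref = "returns the absolute value of a number"
gen = "return the absolute value of the given number"

bleu = sentence_bleu(
    [ref.split()], gen.split(),
    smoothing_function=SmoothingFunction().method4,
)
P, R, F1 = bert_score([gen], [ref], lang="en")
print(f"BLEU: {bleu:.4f}  BERTScore F1: {F1.item():.4f}")
```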
Configure `analysis_and_log()` in `scores/significant.py`:

```python
def analysis_and_log():
    model_name = "CodeLlama-7b-hf"
    task_name = "work"
    score_name = "BERTScore"
    start_point = 0
    limit = 2000
    for lang_name in ["python", "go", "java"]:
        ALLSignificant(model_name, lang_name, task_name, start_point, score_name, limit)
```
Then run:

```bash
python scores/significant.py
```
Description: This script reads the BERTScore results from the model's default path and calculates the P-values. The detailed results are printed directly.
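The exact statistical test is defined in `scores/significant.py`; as an illustration of the general idea, a paired non-parametric test such as the Wilcoxon signed-rank test can compare per-sample scores before and after perturbation (toy numbers below, not results from the paper):

```python
# Illustration only: paired Wilcoxon signed-rank test on toy BERTScore values
# for the same samples before vs. after perturbation. The repo's actual test
# lives in scores/significant.py and may differ.
from scipy.stats import wilcoxon

scores_original = [0.91, 0.88, 0.93, 0.85, 0.90]   # toy numbers
scores_perturbed = [0.87, 0.86, 0.90, 0.80, 0.88]  # toy numbers

stat, p_value = wilcoxon(scores_original, scores_perturbed)
print(f"p-value: {p_value:.4f}")
```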
Explanations of common questions and additional experiments on the P-value can be found in appendix.pdf.