# DOCE

This repo contains the code for our arXiv paper:

DOCE: Finding the Sweet Spot for Execution-Based Code Generation
Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

Contact person: Haau-Sing Li

## Usage

1. Install the packages from `requirements*.txt`.
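
If you use pip, something like `pip install -r requirements.txt` (repeated for each `requirements*.txt` file) should work; adjust to your own environment as needed.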

2. Inference on HumanEval/MBPP

```bash
python3 codegen/generate.py \
    --model ${model} \
    --bs ${batch_size} \
    --temperature ${temperature} \
    --n_samples ${num_of_samples_for_reranking} \
    --dataset ${humaneval/mbpp} \
    --resume \
    --root ${path_to_store_output}
```

3. Evaluation

```bash
evalplus.evaluate \
    --dataset ${humaneval/mbpp} \
    --samples ${path_to_generated_samples} \
    --parallel 30 \
    --test-details
```

4. Get execution outputs of generated samples (for MBR-Exec)

```bash
python3 evalplus/gen_outputs.py \
    --gen_dir ${model_name_plus_temperature} \
    --dataset ${humaneval/mbpp} \
    --gen_fast
```

5. Self-Debugging

You should get execution feedback first:

```bash
python3 evalplus/error_feedback.py \
    --gen_dir ${model_name_plus_temperature} \
    --dataset ${humaneval/mbpp}
```

Then we can do self-debugging:

```bash
python3 codegen/ape_sd_ut.py \
    --model ${model} \
    --bs ${batch_size} \
    --temperature ${temperature} \
    --n_samples ${num_of_samples_for_reranking} \
    --dataset ${humaneval/mbpp} \
    --resume \
    --root ${path_to_store_output} \
    --debugging_turn ${ith_debugging_turn}
```

6. For MBR and N-Best reranking, please refer to our notebooks for now; a generic sketch of the MBR-Exec selection rule is shown below.
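
The core of MBR-Exec is to keep the candidate whose execution outputs (collected in step 4) agree most with the other candidates' outputs on the same test inputs. The snippet below is a minimal, generic sketch of that selection rule, not the notebook implementation; the data layout (`candidate_outputs[i][t]` holding candidate `i`'s output on test input `t`) is an assumption for illustration.

```python
from typing import Sequence


def mbr_exec_select(candidate_outputs: Sequence[Sequence[str]]) -> int:
    """Return the index of the candidate whose execution outputs agree
    with the other candidates most often (consensus selection)."""
    n = len(candidate_outputs)
    best_idx, best_score = 0, -1
    for i in range(n):
        # Utility of candidate i: number of (other candidate, test input)
        # pairs on which the outputs match exactly.
        score = sum(
            out_i == out_j
            for j in range(n) if j != i
            for out_i, out_j in zip(candidate_outputs[i], candidate_outputs[j])
        )
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Toy example: three candidates, two test inputs.
outputs = [["1", "4"], ["1", "4"], ["2", "4"]]
print(mbr_exec_select(outputs))  # -> 0
```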

We will release our generated candidates soon, in case you want to save compute.

Our code is built upon EvalPlus.