This prototype synthesizes verified code with an LLM.
Using Monte Carlo Tree Search (MCTS), it explores the space of possible generation of a verified program, and it checks at every step that it's on the right track by calling the verifier.
This prototype uses Dafny, Coq, Lean, Scala or Rust.
Logs for example runs can be found in the log directory. Scroll to the very end of a log to see a chosen solution. Note that the linked solution is optimal for the problem.
By using this technique, weaker models that might not even know the generated language all that well can compete with stronger models.
This project relies on GPU access. It has been tested on a multi-GPU machine with two NVIDIA A100s.
Clone the github repo. Note that it has linked submodules so it should be cloned with the following command:
git clone --recurse-submodules https://github.com/namin/llm-verified-with-monte-carlo-tree-search.git
Note that you will be prompted to paste your huggingface authentication token.
conda create --name llm-verified python=3.10
conda activate llm-verified
pip install -r requirements.txt
huggingface-cli login
(If you want to use the singularity sandbox) Download the llm-verified
docker image (created by the Dockerfile here and pushed to the hub) and put it in ~/singularity
.
mkdir -p ~/singularity
cd ~/singularity
singularity pull docker://namin/llm-verified
(If you want to use the docker alternative) Download the Docker image: docker pull namin/llm-verified
. Use the command line --docker_sandbox
.
(If you want to use Dafny) Install Dafny: Download a binary here.
Please confirm that Dafny verification works before experimenting, for example by using the okdafny.py
script we provide. We have sometimes run into issues where z3 was not set up correctly, and Dafny skipped the verification process without returning an error code. Should you run into these problems, compile z3 from source and place the executable into your PATH.
(If you want to use Coq) Install Coq: Install opam, then:
opam init
opam install coq
opam install "coq-serapi>=8.10.0+0.7.0"
opam repo add coq-released https://coq.inria.fr/opam/released
opam install coq-hammer
(If you want to use Lean) Install Lean: See detailed instructions here. Then,
git clone https://github.com/leanprover-community/repl.git
and add require mathlib from git "https://github.com/leanprover-community/mathlib4"
to the file repl/lakefile.lean
and run
curl https://raw.githubusercontent.com/leanprover-community/mathlib4/master/lean-toolchain -o lean-toolchain
lake update
lake build
in the repl
directory.
To run the default experiment configuration, which uses the Phind-CodeLlama-34B-v2
LLM to solve the problem_opt0
as specified in prompts.py
in Dafny, do:
python run.py
Note that cmdline.py
handles the usage of command line arguments for the codebase. To see what arguments are available, do:
python cmdline.py --help
To run with specific arguments, simply add a --[argument name] [desired argument value]
for each argument after the original command to run. For example, to execute run.py
as before but with Coq instead of Dafny, do:
python run.py --language Coq
For the run that interacts with the user, do:
python run_user.py
For the run that feeds back verifier info, do:
python run_verifier_feedback.py
For the PPO trainer (slow!), do:
python run_ppo.py
For the DPO trainer, have a set of triples in datasets/gen.jsonl
.
These can be obtained by concatenating existing triple files:
cd datasets
cat pm_gen.jsonl pm_proof_gen.jsonl opt0.jsonl >gen.jsonl
or by running the DPO triple generator on an existiting problem:
python run_dpo_gen.jsonl
Once you have the file datasets/gen.jsonl
, you can run the DPO trainer with the following.
You can set the CUDA_VISIBLE_DEVICES
environment variable to whichever device number you want, informed by availability from nvidia-smi
.
export CUDA_VISIBLE_DEVICES=1
python run_dpo.py
The resulting model will be in ./my_dpo_model
and can be loaded with the command-line argument --base_model_name ./my_dpo_model
.
For a run that selects the next completion diversely, do:
python run_diversity.py
For a more extensive prompt populated with the goal to prove (Coq only) -- do:
python run_focus.py --language Coq
For a more extensive prompt which creates lemmas from failures (Coq only) -- do:
python run_meta.py --language Coq
To a run a certain flavor of experiment on one of the prompts used in the VMCTS paper, do something like the following. For example, to run the logic in run_intermediate_expansion.py
on opt0
, do:
python experiments.py --experiment_name run_intermediate_expansion.py --n_trials 10 --mins_timeout 10 --language Dafny --problem_name problem_opt0 --seed 42 --remove_hints True
To run a certain flavor of experiment on the Clover benchmark dataset, do something like the following. For example, to run the logic in run_intermediate_expansion.py
on Clover, do:
python experiments_clover.py --experiment_name run_intermediate_expansion.py
To log the results of any experiment in the log/ directory, place the following before the experiment command:
./log.sh log/name_of_log_file.txt
For example, to log the results of the intermediate expansion experiment on opt0
you could run:
./log.sh log/intermediate-expansion-01.txt python experiments.py --experiment_name run_intermediate_expansion.py --n_trials 10 --mins_timeout 10 --language Dafny --problem_name problem_opt0 --seed 42 --remove_hints True
Add your unit tests to the test_dict
, the final element of the prompt tuple.
Each key-value pair in this dictionary represents one or more unit tests. The key is some
string that has to occur in the generated code for the tests to run. The value is
a set of tests (think of assertions) that make up the test. Within the value string,
use the test(bool)
function. Example: {'evaluate': "test(fac(5) == 120)"}
.
You can either put each single assertion on its own line (test(fac(5) == 120)\ntest(fac(4) == 30)
),
in its own key-value-pair, or and
all of them together (test(fac(5) == 120 and fac(4) == 30)
).
@misc{brandfonbrener2024vermctssynthesizingmultistepprograms,
title={VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search},
author={David Brandfonbrener and Simon Henniger and Sibi Raja and Tarun Prasad and Chloe Loughridge and Federico Cassano and Sabrina Ruixin Hu and Jianang Yang and William E. Byrd and Robert Zinkov and Nada Amin},
year={2024},
eprint={2402.08147},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2402.08147},
}