Authors: Jiatong Li*, Junxian Li*, Yunqing Liu, Dongzhan Zhou, and Qing Li (* Equal Contribution)
- Arxiv: https://arxiv.org/abs/2412.14642
- Huggingface Dataset: https://huggingface.co/datasets/Duke-de-Artois/TOMG-Bench
- PaperWithCode: https://paperswithcode.com/dataset/tomg-bench
- Project Page: https://phenixace.github.io/tomgbench/
In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench comprises three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task is further divided into three subtasks, each containing 5,000 test samples. Given the inherent complexity of open molecule generation, we also develop an automated evaluation system that measures both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations of, and potential areas for improvement in, text-guided molecule discovery. Furthermore, with the help of OpenMolIns, a specialized instruction-tuning dataset proposed to address the challenges raised by TOMG-Bench, Llama3.1-8B outperforms all open-source general LLMs and even surpasses GPT-3.5-turbo by 46.5% on TOMG-Bench.
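As a rough illustration of what such an evaluation pipeline starts from, the sketch below checks whether a generated SMILES string parses into a valid molecule with RDKit. The use of RDKit here and the helper name `is_valid_smiles` are our own illustration; the benchmark's actual metrics are implemented in the evaluation scripts and defined in the paper.

```python
# Hedged illustration: checking whether a generated SMILES string is a valid molecule
# with RDKit. This only shows the basic validity check that molecule-generation
# evaluation pipelines typically build on; it is not the benchmark's full metric.
from rdkit import Chem

def is_valid_smiles(smiles: str) -> bool:
    """Return True if RDKit can parse the SMILES into a molecule object."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None

print(is_valid_smiles("CCO"))       # True: ethanol parses correctly
print(is_valid_smiles("C1=CC=CC"))  # False: ring bond 1 is opened but never closed
```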
Model | #Parameters | Avg. Acc (%) | Weighted Avg. Acc (%) |
---|---|---|---|
Claude-3.5 (Anthropic, 2024b) | - | 51.10 | 35.92 |
Gemini-1.5-pro (Deepmind, 2024) | - | 52.25 | 34.80 |
GPT-4-turbo (Achiam et al., 2023) | - | 50.74 | 34.23 |
GPT-4o (Achiam et al., 2023) | - | 49.08 | 32.29 |
Claude-3 (Anthropic, 2024a) | - | 46.14 | 30.47 |
OpenMolIns-large (Llama-3.1-8B) | 8B | 43.10 | 27.22 |
OpenMolIns-xlarge (Galactica-125M) | 125M | 44.48 | 25.73 |
Llama3-70B-Instruct (Int4) (Dubey et al., 2024) | 70B | 38.54 | 23.93 |
OpenMolIns-large (Galactica-125M) | 125M | 39.28 | 23.42 |
OpenMolIns-medium (Galactica-125M) | 125M | 34.54 | 19.89 |
GPT-3.5-turbo (Achiam et al., 2023) | - | 28.93 | 18.58 |
OpenMolIns-small (Galactica-125M) | 125M | 24.17 | 15.18 |
Llama3.1-8B-Instruct (Dubey et al., 2024) | 8B | 26.26 | 14.09 |
Llama3-8B-Instruct (Dubey et al., 2024) | 8B | 26.40 | 13.75 |
chatglm-9B (GLM et al., 2024) | 9B | 18.50 | 13.13(7) |
OpenMolIns-light (Galactica-125M) | 125M | 20.95 | 13.13(6) |
OpenMolIns-large (Llama3.2-1B) | 1B | 14.11 | 8.10 |
yi-1.5-9B (Young et al., 2024) | 9B | 14.10 | 7.32 |
Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) | 7B | 11.17 | 4.81 |
BioT5-base (Pei et al., 2023) | 250M | 24.19 | 4.21 |
MolT5-large (Edwards et al., 2022) | 780M | 23.11 | 2.89 |
Llama-3.2-1B-Instruct (Dubey et al., 2024) | 1B | 3.95 | 1.99 |
MolT5-base (Edwards et al., 2022) | 250M | 11.11 | 1.30(0) |
MolT5-small (Edwards et al., 2022) | 80M | 11.55 | 1.29(9) |
Qwen2-7B-Instruct (Yang et al., 2024) | 7B | 0.18 | 0.15 |
This repository contains the code for TOMG-Bench, a benchmark for evaluating LLMs on text-based open molecule generation tasks. The benchmark consists of three main tasks, each divided into three subtasks, and each subtask comprises 5,000 data samples. The dataset is categorized as follows (a loading sketch is shown after the list):
- MolCustom
  - AtomNum
  - FunctionalGroup
  - BondNum
- MolEdit
  - AddComponent
  - DelComponent
  - SubComponent
- MolOpt
  - LogP
  - MR
  - QED
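Below is a minimal sketch of loading one subtask from the Hugging Face Hub. The configuration name `MolCustom-AtomNum` and the `test` split are assumptions for illustration; please check the dataset page linked above for the actual configuration and split names.

```python
# Hedged sketch: loading one TOMG-Bench subtask with the `datasets` library.
# The config/split names used here are illustrative assumptions, not guaranteed
# to match the layout on https://huggingface.co/datasets/Duke-de-Artois/TOMG-Bench.
from datasets import load_dataset

dataset = load_dataset("Duke-de-Artois/TOMG-Bench", "MolCustom-AtomNum", split="test")
print(len(dataset))   # expected: 5,000 samples per subtask
print(dataset[0])     # one text instruction describing the molecule to generate
```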
We adopt different evaluation metrics for different tasks; the metrics for each subtask are described in the corresponding subtask's README file.
The leaderboard is based on the weighted average accuracy metric, which is discussed in our paper.
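As a toy illustration of the idea (not the exact formula, which is defined in the paper), the sketch below assumes that each subtask's accuracy is scaled by a quality weight, such as the similarity or novelty of the generated molecules, before averaging across subtasks; the subtask names and numbers are hypothetical.

```python
# Toy sketch of a weighted average accuracy. Assumption: each subtask's accuracy is
# scaled by a per-subtask quality score in [0, 1] and then averaged; see the paper
# for the exact weighting used by TOMG-Bench.
def weighted_average_accuracy(results: dict[str, tuple[float, float]]) -> float:
    """results maps subtask name -> (accuracy, quality weight), both in [0, 1]."""
    weighted = [acc * weight for acc, weight in results.values()]
    return sum(weighted) / len(weighted)

# Hypothetical numbers for illustration only.
example = {
    "MolEdit-AddComponent": (0.62, 0.70),
    "MolOpt-LogP": (0.55, 0.68),
    "MolCustom-AtomNum": (0.31, 0.85),
}
print(f"{weighted_average_accuracy(example):.4f}")
```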
- To query proprietary models, please refer to `query_openai`.
- To evaluate the performance of an open-source general LLM, please refer to `run_query_vllm`.
- To evaluate the performance of a ChEBI-20 fine-tuned LLM, please refer to `run_query_biot5` and `run_query_molt5`.
- To train on our OpenMolIns dataset, please refer to `train`.
- To evaluate your own model on our benchmark, please refer to `run_query_template`.
If your model achieves strong performance on our benchmark and you would like it added to the leaderboard, please send us your results (including the raw output files) by email. We will update the leaderboard once we have verified the results.
@misc{li2024tomgbenchevaluatingllmstextbased,
title={TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation},
author={Jiatong Li and Junxian Li and Yunqing Liu and Dongzhan Zhou and Qing Li},
year={2024},
eprint={2412.14642},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.14642},
}