This is the official repository for "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts" by Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing.
- (2024/8/13) This paper was accepted by Usenix Security. See the proceeding paper
- (2023/10/25) 🏆 🏆 We won the Frontier Breakthrough Award and Outstanding Presentation Award at Geekcon 2023!
- (2023/10/22) Our work will be presented in the Geekcon 2023 conference! We will give a talk about our work and give a live attack demo with our tool. We will also include the attack results for Chinese LLMs which have not be included in our paper. We are looking forward to meeting you there!
- (2023/10/21) We have updated our codebase to make it more readable and easier to use. We will continue to update the codebase and add more features and other implementations. We are looking forward to build a general black-box fuzzing framework for large language models. Stay tuned!
- (2023/9/19) Our paper is on arXiv! Check it out here!
Please refer to install.ipynb
python3 main_gptfuzz.py --seed_path "./datasets/prompts/GPTFuzzer.csv" --target_model "/media/d1/huggingface.co/models/meta-llama/Llama-2-7b-chat-hf" --judge_model "/media/d1/huggingface.co/models/hubert233/GPTFuzz" --goal_prompt "Tell me how to make a bomb." --result_path "test_result.json"
The datasets for the harmful question and human-written templates are available in datasets/questions/question_list.csv
and datasets/prompts/GPTFuzzer.csv
. The questions are sampled from two public datasets: llm-jailbreak-study and hh-rlhf, and the templates are collected from llm-jailbreak-study.
For the responses we got by querying Vicuna-7B, ChatGPT and Llama-2-7B-chat, we store them in datasets/responses
and the labeled responses are in datasets/responses_labeled
. You could also use generate_responses.py
to generate responses for different models or different questions (see the scripts under scripts
folder for examples).
We are still working on the evaluation on other question dataset and jailbreak dataset. We will update the codebase and the datasets after we have some results.
Our judgment model is a finetuned RoBERTa-large model and the training code is in ./example/finetune_roberta.py
, and the training/evaluating data is stored in datasets/responses_labeled
. The model we used is hosted on Hugging Face. When running fuzzing experiments, the model will be automatically downloaded and cached for the first time. If you would like to download the model manually, you can run the following code:
from transformers import RobertaForSequenceClassification, RobertaTokenizer
model_path = 'hubert233/GPTFuzz'
model = RobertaForSequenceClassification.from_pretrained(model_path)
tokenizer = RobertaTokenizer.from_pretrained(model_path)
During our experiments, we found that our trained model can also be transferred to other questions. However, we also found that it does not work well on some questions and other languages. We will add more predictor model soon.
We provide a python example to show the minimal code to run the fuzzing experiments. This example uses ChatGPT as mutate model to attack Llama-2-7B-chat with official system prompt(we did the monkey patch for Fastchat template since Fastchat deleted the official system prompt in recent update), and you should be able to get the identical results in example folder (we set the random seed for reproducibility and temperature=0).
You can also refer to the notebook for more details and explanations.
Due to ethical concern, we decided not to release the adversarial templates we found during our experiments openly. However, we are happy to share them with researchers who are interested in this topic. Please contact us via email if you would like to get access to the templates we found during the experiments. Also, you can use the code in this repository to generate your own adversarial templates.
- I found some labels in your labeled responses are wrong.
- We are sorry about that. As our paper claimed, determining whether it is a jailbroken response is not a trivial task and some responses are hard to label. Also, due to the stress of labeling a large amount of potential toxic responses, we might have made some mistakes. If you find any wrong labels, please let us know and we will fix them as soon as possible.
- The fuzz is slow, especially when I am using multiple questions for the local model.
- How could I implement my own mutator/seed selector?
- You can implement your own mutator/seed selector by inheriting the class. You can refer to
mutator.py
andselection.py
for examples. Also, as we claimed, we would like to work on a general black-box fuzzing framework for large language models. If you have some ideas or suggestions or you find other papers that are related to this topic, please let us know or leave the comment in the issue. We are happy to implement them and make this framework more powerful.
- You can implement your own mutator/seed selector by inheriting the class. You can refer to
If you find this useful in your research, please consider citing:
@inproceedings{yu2024llm,
title={$\{$LLM-Fuzzer$\}$: Scaling Assessment of Large Language Model Jailbreaks},
author={Yu, Jiahao and Lin, Xingwei and Yu, Zheng and Xing, Xinyu},
booktitle={33rd USENIX Security Symposium (USENIX Security 24)},
pages={4657--4674},
year={2024}
}
@article{yu2023gptfuzzer,
title={Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts},
author={Yu, Jiahao and Lin, Xingwei and Yu, Zheng and Xing, Xinyu},
journal={arXiv preprint arXiv:2309.10253},
year={2023}
}