Understanding and Enhancing the Transferability of Jailbreaking Attacks

Official implementation of Understanding and Enhancing the Transferability of Jailbreaking Attacks (ICLR 2025).

Abstract

Content Warning: This paper contains examples of harmful language.

Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analyzing their impact on the model's intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent $\textit{distributional dependency}$ within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.

Figure. The procedure of Perceived-importance Flatten (PiF) Method.

Requirements

This codebase is written for python3 and 'pytorch'.
To install necessary python packages, run pip install -r requirements.txt.

Experiments

Data

Please download and place all datasets into the data directory.

Training

Generate jailbreaking attack based on MLM (Bert)

python3 PiF_MLM.py --gen_model_path ../bert-large-uncased --tgt_model_path ../Llama-2-13b-chat-hf --opt_objective ASR --interation 20 --output_dir PiF_From_Bert_To_Llama-2-13B

Generate jailbreaking attack based on CLM (Llama)

python3 PiF_CLM.py --gen_model_path ../Llama-2-7b-chat-hf --tgt_model_path ../Llama-2-13b-chat-hf --opt_objective ASR --interation 20 --output_dir PiF_From_Llama-2-7B_To_Llama-2-13B

Jailbreaking GPT evaluated by keyword ASR

python3 PiF_MLM.py --gen_model_path ../bert-large-uncased --tgt_model_path gpt-4-0613 --opt_objective ASR --interation 50 --output_dir PiF_From_Bert_To_GPT

Jailbreaking GPT evaluated by keyword ASR+GPT

python3 PiF_MLM.py --gen_model_path ../bert-large-uncased --tgt_model_path gpt-4-0613 --opt_objective ASR+GPT --interation 50 --output_dir PiF_From_Bert_To_GPT_ASR+GPT

License and Contributing

This README is formatted based on paperswithcode.
Feel free to post issues via Github.

Reference

If you find the code useful in your research, please consider citing our paper:

@inproceedings{
lin2025understanding,
title={Understanding and Enhancing the Transferability of Jailbreaking Attacks},
author={Runqi Lin and Bo Han and Fengwang Li and Tongliang Liu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data		data
LICENSE		LICENSE
Method.png		Method.png
PiF_CLM.py		PiF_CLM.py
PiF_MLM.py		PiF_MLM.py
README.md		README.md
attack_clm.py		attack_clm.py
attack_mlm.py		attack_mlm.py
eval.py		eval.py
eval_template.py		eval_template.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Abstract

Requirements

Experiments

Data

Training

License and Contributing

Reference

About

Releases

Packages

Contributors 2

Languages

License

tmllab/2025_ICLR_PiF

Folders and files

Latest commit

History

Repository files navigation

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Abstract

Requirements

Experiments

Data

Training

License and Contributing

Reference

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages