Stars
Robust recipes to align language models with human and AI preferences
PyTorch implementation of adversarial attacks [torchattacks]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Universal and Transferable Attacks on Aligned Language Models