Zhenhong Zhou ydyjya

LLM Safety

96 followers · 9 following

Beijing University of Posts and Telecommunications
Beijing
https://www.zhihu.com/people/warrior-18-53

Achievements

Stars

deepseek-ai / DeepSeek-R1

82,015 10,586 Updated Feb 24, 2025

geekan / MetaGPT

🌟 The Multi-Agent Framework: First AI Software Company, Towards Natural Language Programming

Python 47,776 5,667 Updated Feb 19, 2025

IAAR-Shanghai / Awesome-Attention-Heads

An awesome repository & A comprehensive survey on interpretability of LLM attention heads.

TeX 313 9 Updated Feb 12, 2025

ydyjya / SafetyHeadAttribution

Python 14 Updated Oct 19, 2024

hijkzzz / Awesome-LLM-Strawberry

A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.

6,497 362 Updated Feb 22, 2025

GIGABaozi / AED

The code for AED, a method that helps LLMs defend against jailbreaks

Python 4 Updated Jul 29, 2024

boyiwei / alignment-attribution-code

[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Python 70 9 Updated Oct 4, 2024

IS2Lab / S-Eval

S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

52 3 Updated Feb 17, 2025

pillowsofwind / Knowledge-Conflicts-Survey

[EMNLP 2024] The official GitHub repo for the survey paper "Knowledge Conflicts for LLMs: A Survey"

104 5 Updated Sep 21, 2024

HoagyC / sparse_coding

Using sparse coding to find distributed representations used by neural networks.

Jupyter Notebook 214 29 Updated Nov 10, 2023

openai / sparse_autoencoder

Python 423 45 Updated Jul 19, 2024

ydyjya / LLM-IHS-Explanation

Jupyter Notebook 41 3 Updated Jun 13, 2024

JailbreakBench / jailbreakbench

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]

Python 300 33 Updated Sep 26, 2024

alexandrasouly / strongreject

Repository for "StrongREJECT for Empty Jailbreaks" paper

Jupyter Notebook 118 5 Updated Nov 3, 2024

karpathy / llm.c

LLM training in simple, raw C/CUDA

Cuda 25,791 2,955 Updated Oct 2, 2024

huggingface / trl

Train transformer language models with reinforcement learning.

Python 11,978 1,613 Updated Feb 25, 2025

openai / transformer-debugger

Python 4,059 241 Updated Jun 4, 2024

OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.

Python 8,356 833 Updated Feb 22, 2025

chawins / llm-sp

Papers and resources related to the security and privacy of LLMs 🤖

Python 479 35 Updated Nov 27, 2024

HowieHwong / TrustLLM

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models

Python 518 48 Updated Feb 18, 2025

CHATS-lab / persuasive_jailbreaker

Persuasive Jailbreaker: we can persuade LLMs to jailbreak them!

HTML 283 19 Updated Oct 10, 2024

mlabonne / llm-course

Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.

Jupyter Notebook 47,055 4,993 Updated Jan 22, 2025

CLUEbenchmark / SuperCLUE-Safety

SC-Safety: A multi-round adversarial safety benchmark for Chinese large language models

119 9 Updated Mar 15, 2024

kaixindelele / ChatPaper

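Use ChatGPT to summarize arXiv papers. Accelerates the full research workflow: uses ChatGPT for full-paper summarization, professional translation, polishing, peer review, and review responses.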

Python 18,743 1,945 Updated Apr 4, 2024

eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)

Python 2,397 200 Updated Aug 11, 2024

meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

Python 2,918 486 Updated Feb 14, 2025

nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Python 9,344 731 Updated Aug 5, 2024

HillZhang1999 / llm-hallucination-survey

Reading list of hallucination in LLMs. Check out our new survey paper: "Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models"

988 52 Updated Nov 21, 2024

wangcunxiang / LLM-Factuality-Survey

The repository for the survey paper "Survey on Large Language Models Factuality: Knowledge, Retrieval and Domain-Specificity"

332 29 Updated Apr 25, 2024

mistralai / mistral-inference

Official inference library for Mistral models

Jupyter Notebook 10,015 893 Updated Nov 12, 2024