Preference Alignment

This repository explores methods for aligning language models with human preferences. While supervised fine-tuning adapts models to specific tasks or domains, preference alignment ensures that their outputs match human expectations and values. The focus here is on two innovative algorithms: Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO).


Overview

Preference alignment methods typically involve multiple stages:

  1. Supervised Fine-Tuning (SFT): Adapts models to specific tasks or domains.
  2. Preference Alignment: Refines outputs to better align with human preferences.

This repository highlights:

  • DPO, a simpler alternative to traditional RLHF (Reinforcement Learning from Human Feedback), which directly optimizes model behavior using preference data.
  • ORPO, a novel approach that combines instruction tuning and preference alignment into a unified, single-stage process.

Key Components

Direct Preference Optimization (DPO)

DPO simplifies preference alignment by eliminating the need for separate reward models and reinforcement learning. Instead, it directly optimizes language models using preference datasets, offering a more stable and efficient alternative to RLHF. This repository includes:

  • A detailed implementation of DPO using pre-trained models.
  • Notebooks for fine-tuning models on preference datasets to improve alignment.
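
To make the mechanism concrete, here is a minimal sketch of the DPO objective, assuming you already have sequence-level log-probabilities of the chosen and rejected completions under the trainable policy and a frozen reference model. The function and argument names are illustrative, not taken from the notebooks:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sequence-level log-probabilities of the chosen / rejected completions
    under the trainable policy and a frozen reference model (all tensors)."""
    # Implicit rewards are beta-scaled log-ratios between policy and reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: prefer chosen over rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Here β plays the role of the KL-regularisation strength from RLHF: larger values keep the policy closer to the reference model, smaller values let the preference data pull it further away.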

Odds Ratio Preference Optimization (ORPO)

ORPO introduces a unified approach to instruction tuning and preference alignment. Its key characteristics are:

  • A negative log-likelihood loss combined with an odds ratio term at the token level (sketched in code below the list).
  • A single-stage training process that eliminates the need for a reference model.
  • Improved computational efficiency and strong performance on benchmarks.
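
As a rough illustration of that combination, the sketch below assumes the log-probabilities are length-normalised (average per token) and uses an illustrative weighting value; names are hypothetical rather than taken from the notebooks:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_chosen, lam=0.1):
    """chosen_logps / rejected_logps: average per-token log-probabilities of the
    chosen and rejected completions under the policy; nll_chosen: the usual
    SFT negative log-likelihood on the chosen completion."""
    # log-odds = log(p / (1 - p)), computed from log-probabilities
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: push the odds of the chosen completion above the rejected one
    or_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Single objective: SFT loss plus a weighted preference penalty
    return nll_chosen + lam * or_term
```

Because this single weighted term replaces both the reward model and the reference model of a classic RLHF pipeline, ORPO trains in one stage with noticeably less overhead.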

This repository demonstrates:

  • ORPO fine-tuning workflows.
  • Comparisons between ORPO and DPO on various datasets.

Notebooks Overview

This repository includes interactive notebooks to demonstrate DPO and ORPO in practice:

DPO Training

  • Description: Fine-tune models using preference datasets with the DPOTrainer.
  • What's Inside:
    • Fine-tuning a model with trl-lib/ultrafeedback_binarized (a minimal usage sketch follows this list).
    • Experimenting with your own preference datasets.
  • Notebook: DPO Fine-Tuning
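
The notebook is the reference; for orientation, a DPO run with trl might look roughly like the sketch below. The base model and hyperparameters are placeholders, and argument names can differ between trl versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder base model; the notebook may use a different checkpoint
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs (prompt, chosen, rejected)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="dpo-model",
    beta=0.1,                       # strength of the implicit KL regularisation
    per_device_train_batch_size=2,  # illustrative values only
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # a frozen copy serves as the reference model by default
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,     # older trl releases expect `tokenizer=` instead
)
trainer.train()
```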

ORPO Training

  • Description: Explore the unified approach of ORPO for both instruction tuning and preference alignment.
  • What's Inside:
    • Training a model using instruction and preference data (see the sketch after this list).
    • Experimenting with different loss weightings and configurations.
    • Comparing ORPO results with DPO.
  • Notebook: ORPO Fine-Tuning
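
For orientation, an ORPO run with trl could look roughly like this; the base model, dataset choice, and hyperparameters are illustrative rather than the notebook's exact configuration:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Placeholder base model; ORPO is typically applied to a non-aligned checkpoint
model_name = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Illustrative dataset with (prompt, chosen, rejected) pairs
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

orpo_args = ORPOConfig(
    output_dir="orpo-model",
    beta=0.1,                       # weight of the odds-ratio term relative to the NLL loss
    per_device_train_batch_size=2,  # illustrative values only
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset,
    processing_class=tokenizer,     # no separate reference model is required
)
trainer.train()
```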

Why Explore DPO and ORPO?

  • DPO: Provides a straightforward and efficient alternative to RLHF, making preference alignment accessible without complex setups.
  • ORPO: Offers an innovative single-stage training method, reducing computational overhead while achieving strong results.

This repository is a practical guide to implementing these algorithms and understanding their potential for improving language model alignment.
