[Paper] [🤗 Huggingface Dataset] [🤗 Checkpoints] [Twitter] [Website]
JPO, a new objective for aligning LLMs that optimizes preferences over joint instruction-response pairs.
Code for the Paper "Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization".
If you have any questions or suggestions, please don't hesitate to let us know. You can comment on Twitter or open an issue on this repository.
Traditional conditional feedback approaches are limited in capturing the complex, multifaceted nature of human preferences. Hence, we collect human and AI preferences jointly over instruction-response pairs, i.e., (I1, R1) vs. (I2, R2). Joint preferences subsume conditional preferences when I1 = I2. To learn from joint preferences, we introduce a new preference optimization objective: intuitively, it upweights the joint probability of the preferred instruction-response pair over the rejected one. If the instructions are identical, JPO boils down to DPO!
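For intuition, here is a minimal PyTorch sketch of a DPO-style sigmoid contrast applied to joint instruction-response pairs. The exact form of the objective, the `beta` hyperparameter, and the variable names are illustrative assumptions based on the description above, not the reference implementation in this repo.

```python
import torch
import torch.nn.functional as F

def jpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of a joint preference loss (assumed DPO-like form).

    Each *_logps tensor holds log pi(R | I) summed over response tokens,
    where (I_chosen, R_chosen) and (I_reject, R_reject) may use *different*
    instructions. When the instructions are identical, this reduces to the
    standard DPO objective.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Upweight the preferred joint pair relative to the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```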
Basic installation:
conda create -n jpo python=3.10
conda activate jpo
pip install -r requirements.txt
- We upload the sample data used for the project in this folder of the repo.
- SFT data, `train_sft.jsonl`/`val_sft.jsonl`, for openai_tldr and anthropic_helpful.
- Conditional rankings data, `train_conditional_pref.jsonl`/`val_conditional_pref.jsonl`, from ChatGPT-0125 for openai_tldr and anthropic_helpful.
- Joint rankings data over non-identical instructions, `train_joint_pref.jsonl`, from ChatGPT-0125 for openai_tldr and anthropic_helpful.
- By default, the JPO approach works on both conditional and joint rankings. Hence, we train our models on the combination of the two.
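Each of these files is in jsonl format, with one JSON object per line. A minimal sketch for inspecting them (the file path below is a placeholder; point it at any of the files described above):

```python
import json

def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, one per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder path; replace with the file you want to inspect.
samples = load_jsonl("train_sft.jsonl")
print(len(samples), "examples")
print(samples[0])
```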
🤗 Data is uploaded on Huggingface: Link
We upload the trained SFT and JPO checkpoints to Huggingface 🤗.
| Model | Link |
|---|---|
| SFT (TLDR) | Link |
| SFT (Anthropic-Helpful) | Link |
| JPO-LORA (TLDR) | Link |
| JPO-LORA (Anthropic-Helpful) | Link |
Note: The LORA checkpoints contain only the adapter parameters. You need to pass the path to the SFT model in the `base_model_name_or_path` argument in the `adapter_config.json` file inside the LORA folders.
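For reference, a minimal sketch of loading such an adapter on top of the SFT base model with the `peft` library; the paths are placeholders and this is an assumed setup, not part of the released code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder paths: the downloaded SFT checkpoint and the LORA adapter folder.
sft_path = "path/to/sft-model"
adapter_path = "path/to/jpo-lora-adapter"

base_model = AutoModelForCausalLM.from_pretrained(sft_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(sft_path)

# Attach the JPO LORA adapter; adapter_config.json's base_model_name_or_path
# must point at the same SFT checkpoint.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
```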
We provide the steps to acquire conditional rankings and joint rankings over instruction-response pairs.
- Note that the joint rankings are acquired over non-identical instructions.
- For conditional rankings (identical instruction, two responses), assume that you have a jsonl file where each line is of the form:
{'instruction': [instruction], 'response_0': [response_0], 'response_1': [response_1]}
- Run the following Python command:
OPENAI_API_KEY=[OAI KEY] python jpo/ai_feedback.py --mode single --input_data <data_file.jsonl> --output_data <output_data_file.jsonl>
- For joint rankings (non-identical instructions), assume that you have a jsonl file where each line is of the form:
{'instruction_0': [instruction_0], 'response_0': [response_0], 'instruction_1': [instruction_1], 'response_1': [response_1]}
- Run the following Python command:
OPENAI_API_KEY=[OAI KEY] python jpo/ai_feedback.py --mode pair --input_data <data_file.jsonl> --output_data <output_data_file.jsonl>
- In the output file, our AI feedback appends a `feedback` attribute to each instance. Here, `feedback = 0` indicates that the first response is preferred, `feedback = 1` indicates that the second response is preferred, and `feedback = 2` implies that both responses are equally good/bad.
- Once the output data files are ready, you can process them to be of the form (say `pref.jsonl`):
{'i_chosen': [chosen instruction], 'r_chosen': [chosen response], 'i_reject': [rejected instruction], 'r_reject': [rejected response]}
While creating this file, remove the instances where both responses are considered equal (see the sketch below this list).
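A minimal sketch of this conversion step, assuming the field names described above (it handles both the single-instruction and the pair output files; the file names are placeholders, so adapt as needed):

```python
import json

def build_pref_file(feedback_path, out_path="pref.jsonl"):
    """Convert AI-feedback output into the pref.jsonl format described above.

    'single' mode files carry one 'instruction' for both responses, while
    'pair' mode files carry 'instruction_0'/'instruction_1'.
    Ties (feedback == 2) are dropped.
    """
    with open(feedback_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            ex = json.loads(line)
            fb = ex["feedback"]
            if fb == 2:  # both responses equally good/bad -> skip
                continue
            chosen, reject = (0, 1) if fb == 0 else (1, 0)
            record = {
                "i_chosen": ex.get(f"instruction_{chosen}", ex.get("instruction")),
                "r_chosen": ex[f"response_{chosen}"],
                "i_reject": ex.get(f"instruction_{reject}", ex.get("instruction")),
                "r_reject": ex[f"response_{reject}"],
            }
            fout.write(json.dumps(record) + "\n")

# Example (placeholder file names):
# build_pref_file("output_data_file.jsonl", "train_pref.jsonl")
```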
- Download the base Mistral-7B model from Huggingface https://huggingface.co/mistralai/Mistral-7B-v0.1.
- We present the sample data for SFT in the data folder for the openai TL;DR and anthropic-helpful datasets. The file names are `train_sft.jsonl` and `val_sft.jsonl`.
- Follow the steps mentioned in this sft readme to install the relevant dependencies and perform full finetuning of the Mistral-7B model.
- While we perform full finetuning, huggingface TRL (here) supports finetuning 7B models with low-rank adaptation. Feel free to check it out.
- We provide the SFT models for the TLDR and anthropic-helpful datasets in the ModelZOO.
- Once you have an SFT model, you can use our inference.py to sample from it.
- We provide the test data files for the tldr and helpful datasets. Specifically, these files are of the form:
{'instruction': [instruction], 'response': [response]}
- Run the following Python command (the default sampling temperature is 0.001; a rough transformers-only sketch of this step is shown after this list):
CUDA_VISIBLE_DEVICES=1 python jpo/inference.py --model_path <path to sft model> --test_file [path to test file] --output_file [output file name] --temp 0.001
Note: We explicitly add the summarization format used to finetune the SFT model in the above script. Feel free to edit it.
- The above code will generate a file of the form:
{'instruction': [instruction], 'outputs': [outputs]}
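For intuition only, here is a rough transformers-only sketch of this sampling step. The prompt template and the model path are placeholders; the released jpo/inference.py applies the actual summarization format used during SFT and may differ in other details.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/sft-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(instruction, max_new_tokens=128):
    # Placeholder prompt template; the released script uses the exact
    # summarization format from SFT instead.
    prompt = f"{instruction}\nTL;DR:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.001,  # near-greedy, matching --temp 0.001
    )
    # Strip the prompt tokens and decode only the generated continuation.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```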
- We provide the sample data for conditional rankings (two responses, identical instruction) and joint rankings over non-identical instructions for both datasets.
- For example, the conditional rankings data for anthropic-helpful is in train_pref and val_pref.
- Joint rankings over non-identical instructions are present in the train_joint_pref file.
- As conditional rankings are a special case of joint rankings (identical instructions), JPO can utilize this diverse set of preferences.
- Hence, you can train on just the conditional rankings, just the joint rankings (non-identical instructions), or the merge of both (the setup proposed in the paper).
- We will start with the SFT model and align it with the JPO algorithm. Run the following for alignment:
CUDA_VISIBLE_DEVICES=1 python train.py --model_name_or_path <path to sft model> --dataset_name <path to train_pref file> --eval_dataset_name <path to val_pref file> --per_device_train_batch_size 8 --gradient_accumulation_steps 4 --output_dir <location where you want to save the model ckpt> --joint_distribution True --max_steps 1000
- Note 1: You can use the same command to align the LLM with DPO instead of JPO by removing the `--joint_distribution True` flag.
- Note 2: We recommend increasing the number of steps when you merge the conditional and joint rankings data, so that the number of epochs stays the same. On a single GPU, steps per epoch = dataset size / (batch_size * gradient_accumulation_steps), so scale --max_steps accordingly (see the sketch below this list).
- This code will save the LORA adapters (instead of the complete model) in the `output_dir`.
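As a rough illustration of Note 2 (the dataset sizes below are placeholders, not the actual sizes of our files):

```python
def max_steps_for_epochs(dataset_size, per_device_batch_size,
                         grad_accum_steps, num_epochs):
    """Compute --max_steps so a single-GPU run covers `num_epochs` epochs."""
    steps_per_epoch = dataset_size // (per_device_batch_size * grad_accum_steps)
    return steps_per_epoch * num_epochs

# Placeholder sizes: a conditional-only set of 32,000 pairs vs. a merged
# conditional + joint set of 64,000 pairs, batch size 8, grad accum 4.
print(max_steps_for_epochs(32_000, 8, 4, 1))  # 1000 steps for one epoch
print(max_steps_for_epochs(64_000, 8, 4, 1))  # 2000 steps for one epoch
```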
- Once you have a JPO-aligned model, you can use our inference.py to sample from it.
- We provide the test data files for the tldr and helpful datasets. Specifically, these files are of the form:
{'instruction': [instruction], 'response': [response]}
- Run the following Python command (the default sampling temperature is 0.001):
CUDA_VISIBLE_DEVICES=1 python jpo/inference.py --model_path <path to JPO model (path to output dir)> --test_file [path to test file] --output_file [output file name] --temp 0.001
Note: The trained LORA checkpoint for JPO automatically picks up the path to the SFT model, as it is mentioned in its `adapter_config.json`. We explicitly add the summarization format used to finetune the SFT model in the above script; feel free to edit it.
- The above code will generate a file of the form:
{'instruction': [instruction], 'outputs': [outputs]}
- Sample outputs are presented in outputs.
- We aim to assess the win-rate of the model generations when compared against the gold responses.
- The code is mostly the same as AI feedback generation.
- Run the following command:
OPENAI_API_KEY=[OAI KEY] python jpo/auto_eval.py --input_data <generations file from previous section> --model_name <model_name> --test_data <test file with gold responses> --master_data <output file name (say master.json)>
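If you want to compute a win rate from the resulting master file yourself, here is a heavily hedged sketch. It assumes the master file is a JSON list whose records carry a `feedback`-style label following the same convention as `ai_feedback.py` (0/1 = one side preferred, 2 = tie); verify the actual output format of `auto_eval.py` before relying on this.

```python
import json

def win_rate(master_path, model_label=0):
    """Fraction of decisive comparisons in which the model generation wins.

    Assumes each record carries a 'feedback' label (0/1 = one side preferred,
    2 = tie) and that `model_label` marks the model side; ties are excluded.
    These are assumptions about the output format, not guarantees.
    """
    with open(master_path) as f:
        records = json.load(f)  # assuming master.json is a JSON list
    decisive = [r for r in records if r.get("feedback") in (0, 1)]
    wins = sum(r["feedback"] == model_label for r in decisive)
    return wins / max(len(decisive), 1)

print(win_rate("master.json"))
```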