Contrib12 - preferences chapter, other cleaning and additions (#23)
natolambert authored Dec 19, 2024
1 parent 11636aa commit 0b49668
Showing 21 changed files with 233 additions and 33 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ build/html/
*.log
.DS_Store
**/.DS_Store
.vscode/
2 changes: 1 addition & 1 deletion Makefile
@@ -15,7 +15,7 @@ METADATA_ARGS = --metadata-file $(METADATA)
IMAGES = $(shell find images -type f)
TEMPLATES = $(shell find templates/ -type f)
COVER_IMAGE = images/cover.png
MATH_FORMULAS = --webtex
MATH_FORMULAS = --mathjax # --webtex is the default for PDF/ebook; consider resetting if issues arise.
BIBLIOGRAPHY = --bibliography=chapters/bib.bib --citeproc --csl=templates/ieee.csl

# Chapters content
2 changes: 1 addition & 1 deletion README.md
@@ -1,5 +1,5 @@
# RLHF Book
Built on **Pandoc book template**.
Built on [**Pandoc book template**](https://github.com/wikiti/pandoc-book-template).

[![Code License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/wikiti/pandoc-book-template/blob/master/LICENSE.md)
[![Content License](https://img.shields.io/badge/license-CC--BY--NC--SA--4.0-lightgrey)](https://github.com/natolambert/rlhf-book/blob/main/LICENSE-Content.md)
13 changes: 12 additions & 1 deletion chapters/01-introduction.md
@@ -6,14 +6,24 @@ Its early applications were often in control problems and other traditional doma
RLHF became most known through the release of ChatGPT and the subsequent rapid development of large language models (LLMs) and other foundation models.

The basic pipeline for RLHF involves three steps.
First, a language model that can follow user preferences must be trained (see Chapter 9).
First, a language model that can follow user questions must be trained (see Chapter 9).
Second, human preference data must be collected for the training of a reward model of human preferences (see Chapter 7).
Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapters 3 and 11).
This book details key decisions and basic implementation examples for each step in this process.
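
To make the flow between these three steps concrete, a minimal sketch is shown below. Every name in it is a hypothetical placeholder rather than a real library API; each stage is detailed in the chapters referenced above.

```python
# A minimal sketch of the three-step RLHF pipeline (hypothetical names only).

def train_sft(base_model, instruction_data):
    """Step 1: finetune the base model to follow user questions (Chapter 9)."""
    ...

def train_reward_model(sft_model, preference_data):
    """Step 2: fit a reward model on human preference comparisons (Chapter 7)."""
    ...

def rl_optimize(policy, reward_model, prompts):
    """Step 3: sample generations, rate them with the reward model, and update
    the policy with an RL optimizer of choice (Chapters 3 and 11)."""
    ...

def rlhf_pipeline(base_model, instruction_data, preference_data, prompts):
    sft_model = train_sft(base_model, instruction_data)
    reward_model = train_reward_model(sft_model, preference_data)
    return rl_optimize(sft_model, reward_model, prompts)
```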

RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured.
Early breakthrough experiments with RLHF were applied to deep reinforcement learning [@christiano2017deep], summarization [@stiennon2020learning], following instructions [@ouyang2022training], parsing web information for question answering [@nakano2021webgpt], and "alignment" [@bai2022training].

In modern language model training, RLHF is one component of post-training.
Post-training is a more complete set of techniques and best practices for making language models more useful for downstream tasks [@lambert2024t].
Post-training can be summarized as using three optimization methods:

1. Instruction / Supervised Finetuning (SFT),
2. Preference Finetuning (PreFT), and
3. Reinforcement Finetuning (RFT).

This book focuses on the second area, **preference finetuning**, which is more complex than instruction tuning and far more established than Reinforcement Finetuning.

## Scope of This Book

This book hopes to touch on each of the core steps of doing canonical RLHF implementations.
@@ -49,6 +59,7 @@ This book has the following chapters following this Introduction:
1. Constitutional AI
2. Synthetic Data
3. Evaluation
4. Reasoning and Reinforcement Finetuning

**Open Questions (TBD)**:

49 changes: 37 additions & 12 deletions chapters/02-preferences.md
@@ -1,8 +1,8 @@

# [Incomplete] Human Preferences for RLHF
# The Nature of Preferences

The core of reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, is designed to optimize machine learning models in domains where specifically designing a reward function is hard.
The motivation for using humans as the reward signals is to obtain a indirect metric for the target reward.
The motivation for using humans as the reward signals is to obtain an indirect metric for the target reward and *align* the downstream model to human preferences.

The use of human-labeled feedback data integrates the history of many fields.
Using human data alone is a well-studied problem, but in the context of RLHF it sits at the intersection of multiple long-standing fields of study [@lambert2023entangled].
@@ -14,22 +14,47 @@ As an approximation, modern RLHF is the convergence of three areas of developmen
3. Modern deep learning systems.

Together, each of these areas brings specific assumptions about what a preference is and how it can be optimized, which dictates the motivations and design of RLHF problems.
In practice, RLHF methods are motivated and studied from the perspective of empirical alignment -- maximizing model performance on specific skills instead of measuring the calibration to specific values.
Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for ``pluralistic alignment'' across populations, such as position papers [@conitzer2024social], [@mishra2023ai], new datasets [@kirk2024prism], and personalization methods [@poddar2024personalizing].

## The Origins of Reward Models: Costs vs. Rewards vs. Preferences
The goal of this chapter is to illustrate how these complex motivations result in presumptions about the nature of the tools used in RLHF that often do not apply in practice.
The specifics of obtaining data for RLHF are discussed further in Chapter 6, and of using it for reward modeling in Chapter 7.
For an extended version of this chapter, see [@lambert2023entangled].

### Specifying objectives: from logic of utility to reward functions
## The path to optimizing preferences

### Implementing optimal utility
A popular phrasing for the design of Artificial Intelligence (AI) systems is that of a rational agent maximizing a utility function [@russell2016artificial].
The idea of a **rational agent** provides a lens on decision making: the agent is able to act in the world and impact its future behavior and returns, which serve as a measure of goodness in the world.

### Steering preferences
The lens of **utility** began in the study of analog circuits used to optimize behavior over a finite time horizon [@widrow1960adaptive].
Large portions of optimal control adopted this lens, often studying dynamic problems framed as minimizing a cost function over a certain horizon -- a framing often associated with solving for a clear, optimal behavior.
Reinforcement learning, inspired by the literature on operant conditioning, animal behavior, and the *Law of Effect* [@skinner2019behavior],[@thorndike1927law], studies how to elicit behaviors from agents by reinforcing positive behaviors.

### Value alignment's role in RLHF
Reinforcement learning from human feedback combines these lenses, pairing the RL theory of learning and change -- i.e. that behaviors can be learned by reinforcing them -- with a suite of methods designed for quantifying preferences.

## From Design to Implementation
### Quantifying preferences

Many of the principles discussed earlier in this chapter are further specified in the process of implementing the modern RLHF stack, adjusting the meaning of RLHF.
The core of RLHF's motivation is the ability to optimize a model of human preferences, which therefore needs to be quantified.
To do this, RLHF builds on an extensive literature that assumes human decisions and preferences can be quantified.
Early philosophers discussed the existence of preferences, such as in Aristotle's *Topics*, Book Three, and substantive forms of this reasoning emerged later with *The Port-Royal Logic* [@arnauld1861port]:

## Limitations of RLHF
> To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and evil in itself, but also the probability that it happens or does not happen.

These ideas progressed through Bentham's *Hedonic Calculus* [@bentham1823hedonic], which proposed that all of life's considerations can be weighed, and Ramsey's *Truth and Probability* [@ramsey2016truth], which applied a quantitative model to preferences.
This direction, drawing on advancements in decision theory, culminated in the von Neumann-Morgenstern (VNM) utility theorem, which gives credence to designing utility functions that assign relative preferences for an individual, which are then used to make decisions.

This theorem is core to the assumption that the components of RLHF are learning to model and dictate preferences.
RLHF is designed to optimize these personal utility functions with reinforcement learning.
In this context, many of the presumptions around the RL problem formulation come down to the difference between a preference function and a utility function.
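
As a toy illustration of this assumption (an invented example, not drawn from the literature), a scalar utility function over outcomes induces a complete preference ordering -- the kind of structure RLHF implicitly relies on when learning from comparisons:

```python
# A toy utility function over hypothetical response outcomes. Under the VNM
# view, these scalar values induce a complete preference ordering.
utilities = {
    "helpful_and_honest": 1.0,
    "helpful_but_evasive": 0.4,
    "refusal": 0.1,
}

def prefers(a: str, b: str) -> bool:
    """True if outcome `a` is preferred to outcome `b` under this utility."""
    return utilities[a] > utilities[b]

# The induced ranking and a pairwise preference derived from it.
print(sorted(utilities, key=utilities.get, reverse=True))
print(prefers("helpful_and_honest", "refusal"))  # True
```

The pairwise `prefers` query is all a preference function provides; the stronger assumption is that consistent scalar utilities sit behind it.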

### On the possibility of preferences

Across fields of study, many critiques exist on the nature of preferences.
Some of the most prominent critiques are summarized below:

- **Arrow's impossibility theorem** [@arrow1950difficulty] states that no voting system can aggregate multiple preferences while maintaining certain reasonable criteria.
- **The impossibility of interpersonal comparison** [@harsanyi1977rule] highlights how different individuals have different relative magnitudes of preferences and they cannot be easily compared (as is done in most modern reward model training).
- **Preferences can change over time** [@pettigrew2019choosing].
- **Preferences can vary across contexts**.
- **The utility functions derived from aggregating preferences can reduce corrigibility** [@soares2015corrigibility] of downstream agents (i.e. the ability of an agent's behavior to be corrected by its designer).

The specifics of obtaining data for RLHF are discussed further in Chapter 6.
For an extended version of this chapter, see [@lambert2023entangled].
43 changes: 39 additions & 4 deletions chapters/03-optimization.md
@@ -1,8 +1,43 @@

# [Incomplete] Problem Formulation
# Problem Formulation

## Maximizing Expected Reward
The optimization of reinforcement learning from human feedback (RLHF) builds on top of the standard RL setup.
In RL, an agent takes actions, $a$, sampled from a policy, $\pi$, with respect to the state of the environment, $s$, to maximize reward, $r$.
Traditionally, the environment evolves with respect to a transition or dynamics function $p(s_{t+1}|s_t,a_t)$.
Hence, across a finite episode, the goal of an RL agent is to solve the following optimization:

TODO: The idea of a "KL Budget" for optimization
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],$$

## Example: Mitigating Safety
where $\gamma$ is a discount factor between 0 and 1 that balances the desirability of near-term versus future rewards.
Multiple methods for optimizing this expression are discussed in Chapter 11.
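
As a small worked example (with numbers chosen purely for illustration), the discounted sum inside $J(\pi)$ can be computed for a single sampled trajectory:

```python
# Discounted return for one finite trajectory: sum_t gamma^t * r(s_t, a_t).
def discounted_return(rewards: list[float], gamma: float) -> float:
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.5]           # per-step rewards r(s_t, a_t)
print(discounted_return(rewards, 0.99))  # 0.99**2 * 1.0 + 0.99**3 * 0.5 ≈ 1.465
```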

![Standard RL loop](images/rl.png){#fig:rl width=320px .center}

A standard illustration of the RL loop is shown in @fig:rl, along with how it compares to the RLHF loop in @fig:rlhf.


## Manipulating the standard RL setup

There are multiple core changes from the standard RL setup to that of RLHF:

1. Switching from a reward function to a reward model. In RLHF, a learned model of human preferences, $r_\theta(s_t, a_t)$ (or any other classification model), is used instead of an environmental reward function. This gives the designer a substantial increase in the flexibility of the approach and control over the final results.
2. No state transitions exist. In RLHF, the initial states for the domain are prompts sampled from a training dataset, and the ``action'' is the completion to that prompt. In standard practice, this action does not impact the next state and is only scored by the reward model.
3. Response-level rewards. Often referred to as a bandit problem, reward attribution in RLHF is done for an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained, per-token manner.

Given the single-turn nature of the problem, the optimization can be re-written without the time horizon and discount factor (and with the reward model substituted in):
$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t) \right].$$

In many ways, the result is that while RLHF is heavily inspired by RL optimizers and problem formulations, the actual implementation is very distinct from traditional RL.
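
A minimal sketch of this bandit-style loop is shown below, assuming hypothetical `policy.generate` and `reward_model.score` interfaces (these names are illustrative, not a specific library API):

```python
# One pass over a batch of prompts in the bandit-style RLHF formulation:
# each prompt is an initial "state," the full completion is the single
# action, and the reward model assigns one scalar per response.
def rlhf_rollouts(policy, reward_model, prompts):
    rollouts = []
    for prompt in prompts:
        completion = policy.generate(prompt)              # the "action"
        reward = reward_model.score(prompt, completion)   # response-level reward
        rollouts.append((prompt, completion, reward))
        # No state transition: the next prompt comes from the dataset,
        # independent of the completion just produced.
    return rollouts
```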

![Standard RLHF loop](images/rlhf.png){#fig:rlhf}

## Finetuning and regularization

RLHF is implemented starting from a strong base model, which creates a need to keep the optimization from straying too far from the initial policy.
In order to succeed in a finetuning regime, RLHF techniques employ multiple types of regularization to control the optimization.
The most common change to the optimization function is to add a distance penalty on the difference between the current RLHF policy and the starting point of the optimization:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t)\right] - \beta \mathcal{D}_{KL}(\pi^{\text{RL}}(\cdot|s_t) \| \pi^{\text{ref}}(\cdot|s_t)).$$

Within this formulation, much of the study of RLHF training goes into understanding how to spend a certain ``KL budget'' as measured by the distance from the initial model.
For more details, see Chapter 8 on Regularization.
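
As a sketch of how this penalty can be applied in practice (assuming the summed sequence log-probabilities under the current policy and the reference model are available, and using a single-sample estimate of the KL term; implementations vary):

```python
# KL-penalized reward for a single completion: r_theta minus beta times a
# one-sample estimate of KL(pi_RL || pi_ref), i.e. log pi_RL(y|x) - log pi_ref(y|x).
def penalized_reward(rm_score: float,
                     logprob_policy: float,
                     logprob_ref: float,
                     beta: float = 0.1) -> float:
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# A high reward-model score is discounted when the policy has drifted
# far from the reference model on this completion.
print(penalized_reward(rm_score=2.0, logprob_policy=-48.0, logprob_ref=-55.0))  # 1.3
```

Tuning $\beta$ is effectively how the ``KL budget'' mentioned above is spent.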
2 changes: 1 addition & 1 deletion chapters/04-related-works.md
@@ -40,6 +40,6 @@ Anthropic continued to use it extensively for early versions of Claude [@bai2022
## 2023 to Present: ChatGPT Era

Since OpenAI launched ChatGPT [@openai2022chatgpt], RLHF has been used extensively in leading language models.
It is well known to be used in Anthropic's Constitutional AI for Claude [@bai2022constitutional], Meta's Llama 2 [@touvron2023llama] and Llama 3 [@dubey2024llama], Nvidia's Nemotron [@adler2024nemotron], and more.
It is well known to be used in Anthropic's Constitutional AI for Claude [@bai2022constitutional], Meta's Llama 2 [@touvron2023llama] and Llama 3 [@dubey2024llama], Nvidia's Nemotron [@adler2024nemotron], Ai2's Tülu 3 [@lambert2024t], and more.

Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process reward for intermediate reasoning steps [@lightman2023let], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [@rafailov2024direct], learning from execution feedback from code or math [@kumar2024training],[@singh2023beyond], and other online reasoning methods inspired by OpenAI's o1 [@openai2024o1].
2 changes: 1 addition & 1 deletion chapters/05-setup.md
@@ -1,4 +1,4 @@
# Problem Setup
# Definitions

This chapter includes all the definitions, symbols, and operations frequently used in the RLHF process.

Expand Down
2 changes: 2 additions & 0 deletions chapters/07-reward-models.md
@@ -83,6 +83,8 @@ RewardBench (biased, but gives a good overview): [@lambert2023entangled] [@zhou2

New reward model training methods, with aspect-conditioned models [@wang2024interpretable], high quality human datasets [@wang2024helpsteer2] [@wang2024helpsteer2p], scaling [@adler2024nemotron], extensive experimentation [@touvron2023llama], debiasing data [@park2024offsetbias],

Evaluations

## Recommendations

There is a strong tendency in the literature to train for only one epoch; otherwise the reward model overfits
3 changes: 2 additions & 1 deletion chapters/11-policy-gradients.md
@@ -1,6 +1,7 @@
# [Incomplete] Policy Gradient Algorithms

The algorithms that popularized RLHF for language models were policy-gradient reinforcement learning algoritms.

The algorithms that popularized RLHF for language models were policy-gradient reinforcement learning algorithms.
These algorithms, such as PPO and Reinforce, use recently generated samples to update their model rather than storing scores in a replay buffer.
In this section we will cover the fundamentals of the policy gradient algorithms and how they are used in the modern RLHF framework.

Expand Down
2 changes: 1 addition & 1 deletion chapters/13-cai.md
@@ -1 +1 @@
# [Incomplete] Constitutional AI
# [Incomplete] Constitutional AI and AI Feedback
1 change: 1 addition & 0 deletions chapters/14-reasoning.md
@@ -0,0 +1 @@
# [Incomplete] Reasoning and Reinforcement Finetuning
File renamed without changes.
File renamed without changes.
File renamed without changes.
