Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by Community
Stanford
OSS
Added 2 June 2026
Overview
Direct Preference Optimization (DPO) is a method for fine-tuning language models on human preference data without a reinforcement learning loop. It expresses the reward implicitly through the policy itself, as the log-ratio between the policy and a frozen reference model, so the language model acts as both the policy and the reward model. Alignment then reduces to a simple binary cross-entropy loss on pairs of preferred and rejected responses.
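The loss described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed, and the function names and the default `beta` value are illustrative choices rather than anything fixed by this entry.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Binary cross-entropy loss for one preference pair.

    Each argument is the summed token log-probability of a full response.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss pushes the chosen response's implicit reward above the rejected one's.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Once the policy favors the chosen response more than the reference does,
# the loss drops below log(2).
loss_after_update = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

In practice the log-probabilities would come from a forward pass of each model over a batch, and the loss would be averaged over that batch.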
Best for
Researchers and developers who need a straightforward, stable method to align language models with human preferences without the overhead of reinforcement learning.
Use cases
- Aligning large language models with human preferences using pairwise comparisons
- Fine-tuning models for safer and more helpful responses without complex RL pipelines
- Replacing PPO-based RLHF in scenarios where training stability and simplicity are priorities
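For the pairwise-comparison use case, the training data is a set of (prompt, preferred response, rejected response) records. The sketch below shows one hypothetical way to represent them. The field names `prompt`, `chosen`, and `rejected` and the helper `pairs_from_ranking` are assumptions made for illustration, as is the idea of expanding a ranked list into pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise comparison: `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def pairs_from_ranking(prompt: str, ranked_responses: list[str]) -> list[PreferencePair]:
    """Expand a best-first ranking into all implied pairwise comparisons."""
    # combinations() preserves input order, so `better` always ranks above `worse`.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = pairs_from_ranking(
    "Explain what a reward model is.",
    ["clear, correct answer", "correct but rambling answer", "incorrect answer"],
)
```

A ranking of three responses yields three pairs here. Whether expanding rankings this way is appropriate depends on how the annotations were collected.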
Notes
Pros
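- Simpler and cheaper to run than PPO-based RLHF: no separate reward model and no sampling from the policy during training, only a frozen reference model
- Training is generally more stable than PPO because it reduces to a supervised, classification-style loss
- Directly optimizes the policy from preference data, so there is no separately learned reward model to exploit, although the policy can still over-optimize its implicit reward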
Cons
- Requires high-quality pairwise preference data, which can be expensive to collect
- Assumes the Bradley-Terry preference model, in which every pairwise preference derives from a single scalar reward and is therefore transitive, limiting expressiveness
- May not generalize well to complex or multi-dimensional preference criteria
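The transitivity limitation listed above follows from the Bradley-Terry preference model that DPO is derived under, where the probability of preferring one response over another depends only on the difference of two scalar rewards. The sketch below illustrates this with made-up reward values. The names `bt_prob` and `prefers` are hypothetical.

```python
import math
from itertools import permutations


def bt_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that a is preferred over b: sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Illustrative scalar rewards for three responses (made-up values).
rewards = {"a": 2.0, "b": 0.5, "c": -1.0}


def prefers(x: str, y: str) -> bool:
    """True if x is more likely than not to be preferred over y."""
    return bt_prob(rewards[x], rewards[y]) > 0.5


# Whenever x beats y and y beats z, x also beats z. This holds for any
# choice of scalar rewards, because preference reduces to comparing numbers.
transitive = all(
    prefers(x, z)
    for x, y, z in permutations(rewards, 3)
    if prefers(x, y) and prefers(y, z)
)
```

A cyclic preference such as a over b, b over c, and c over a would require r_a > r_b > r_c > r_a, which no assignment of scalar rewards can satisfy. Annotator data containing such cycles can only be fit approximately.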
Indexed from awesome-llm and enriched against its public facts.