O Open Source Frameworks medium

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

by Community

Stanford

OSS

Added 2 June 2026

Overview

Direct Preference Optimization (DPO) is a method for fine-tuning language models on human preference data without a reinforcement learning loop. It reparameterizes the reward in terms of the policy itself, measured relative to a frozen reference model, so the language model implicitly serves as its own reward model. Alignment then reduces to a simple binary cross-entropy loss on preference pairs.
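The binary cross-entropy loss on a single preference pair can be sketched as follows. This is a minimal, framework-free illustration rather than the reference implementation: the function names, the scalar inputs (summed log-probabilities of each full response under the policy and under the frozen reference model), and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a scalar."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not a recommended setting
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full response given the prompt.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the chosen and rejected log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Binary cross-entropy with the "chosen is preferred" label.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy raises the chosen response's likelihood: loss drops below log(2).
    print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

In practice the loss is averaged over a batch of pairs and the log-probabilities come from a tensor library, but the arithmetic per pair is the same.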

Best for
Researchers and developers who need a straightforward, stable method to align language models with human preferences without the overhead of reinforcement learning.

Use cases

Aligning large language models with human preferences using pairwise comparisons
Fine-tuning models for safer and more helpful responses without complex RL pipelines
Replacing RLHF in scenarios where training stability and simplicity are priorities
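The pairwise comparisons these use cases rely on can be pictured as simple records. The sketch below is illustrative only: the `PreferencePair` class, its field names (`prompt`, `chosen`, `rejected`), and the `is_usable` check are hypothetical and do not describe the schema of any particular library or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: two responses to the same prompt (hypothetical schema)."""

    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def is_usable(pair: PreferencePair) -> bool:
    """Basic sanity check: non-empty fields and two distinct responses."""
    fields = (pair.prompt, pair.chosen, pair.rejected)
    if any(not text.strip() for text in fields):
        return False
    # A pair with identical responses carries no preference signal.
    return pair.chosen != pair.rejected


if __name__ == "__main__":
    pair = PreferencePair(
        prompt="Explain what a hash map is.",
        chosen="A hash map stores key-value pairs and looks keys up by hashing them.",
        rejected="It is a map.",
    )
    print(is_usable(pair))
```

Filtering out degenerate pairs before training is a common hygiene step; the exact checks worth running depend on how the preference data was collected.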

Notes

Pros

Simpler and more computationally efficient than PPO-based RLHF, requiring no separate reward model and no RL optimizer such as PPO
Training is generally stable and uses standard supervised learning techniques
Directly optimizes the policy from preference data, with no separately learned reward model for the policy to exploit

Cons

Requires high-quality pairwise preference data, which can be expensive to collect
Assumes preferences are transitive and can be captured by pairwise comparisons (a Bradley-Terry style model with a single scalar reward), limiting expressiveness
May not generalize well to complex or multi-dimensional preference criteria

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Built with (1 entry)

O OSS Obs medium

PyTorch

Community

Tensors and Dynamic neural networks in Python with strong GPU acceleration

★ 100,318 updated 23d ago

Pairs with (2 entries)

O OSS Framework medium

Axolotl

Community

Go ahead and axolotl questions

★ 11,997 updated 23d ago

O OSS Framework medium

unslothai

Community

Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.

★ 65,515 updated 23d ago

Alternative to (1 entry)

O OSS Framework medium

OpenRLHF

Community

An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)

★ 9,583 updated 27d ago
