Training · Intermediate
Also known as: Direct Preference Optimization · Ottimizzazione diretta delle preferenze (Italian)

DPO

/dee-pee-oh/

An alignment technique that trains a model directly on preference pairs to favor a better answer over a worse one, without the separate reward model that RLHF requires.


In practice

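It needs only pairs of answers to the same prompt, labeled "better" and "worse". Its training loop is simpler and more stable than PPO's: a single classification-style loss computed against a frozen reference model (typically the SFT checkpoint), with no reward model to train and no sampling during training. In recent years it has replaced PPO-based RLHF in many open-source projects (Zephyr, Tülu, Llama variants). It is often the cheapest way to align a fine-tuned model.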
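To make the "better/worse" training signal concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probability of each answer under the model being trained and under a frozen reference model; the function names, the example numbers, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch).

    Each argument is the summed token log-probability of one full answer
    given the prompt. `beta` (assumed default) controls how far the trained
    model may drift from the reference model.
    """
    # Implicit rewards: how much more the trained model likes each answer
    # than the frozen reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin), margin


# Before any training the policy equals the reference, so the margin is 0
# and the loss equals log(2).
loss_at_init, margin_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy shifts probability toward the "better" answer,
# the margin becomes positive and the loss drops.
loss_improved, margin_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

In a real training run the same expression would be averaged over a batch and minimized by gradient descent on the trained model's parameters only; the reference model stays frozen, which is why no separate reward model or PPO-style rollout is needed.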

Related terms

RLHF PPO SFT Alignment Fine-tuning

Seen in the wild

3 entries mentioning it

November 21, 2024
Allen AI's Tülu 3: the first fully open post-training pipeline
Medium

October 25, 2023
Zephyr-7B: DPO on Mistral 7B beats Llama-2-70B-chat on MT-Bench
High

September 27, 2022
Hugging Face Inference Endpoints: deploy LLMs in two clicks
Medium
