AImpact
Training · Intermediate
Also known as: Direct Preference Optimization · Ottimizzazione diretta delle preferenze (Italian)

DPO

/dee-pee-oh/

An alignment technique that teaches a model to prefer a better answer over a worse one. Unlike RLHF, it does not train a separate reward model.


In practice

DPO needs only pairs of answers to the same prompt, labeled "better" and "worse", and its training loop is simpler and more stable than PPO's. In recent years it has replaced PPO-based RLHF in many open-source projects (Zephyr, Tulu, Llama variants). It is often the cheapest way to align a fine-tuned model.
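
To make the "better/worse" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probability of each answer under the model being trained (the policy) and under a frozen reference copy of it. The function names, argument names, and the value `beta=0.1` are illustrative assumptions, not part of any specific library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (better, worse) pair of answers.

    Each argument is the total log-probability of an answer given the prompt.
    """
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy raises the "better" answer
    # relative to the "worse" one, compared with the reference.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities for one labeled pair.
loss = dpo_loss(
    policy_chosen_logp=-10.0,
    policy_rejected_logp=-20.0,
    ref_chosen_logp=-12.0,
    ref_rejected_logp=-15.0,
)
print(f"DPO loss: {loss:.4f}")
```

When the policy is identical to the reference, the loss equals log 2; it drops below that as the policy shifts probability toward the preferred answer. In a real training run the same expression would be averaged over a batch and minimized with gradient descent, with no reward model or sampling loop involved.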
