Skip to content
AImpact
IT EN
Training Intermediate Also known as: Proximal Policy Optimization · Ottimizzazione di policy prossimale

PPO

/pee-pee-oh/

A reinforcement learning algorithm that updates the model in small steps, preventing it from drifting too far from the previous version.

ShareLinkedInX

In practice

It was the engine behind RLHF in the early ChatGPT: it maximizes human reward without letting the model diverge. Notoriously hard to stabilize and rich in hyperparameters. That is why many open-source teams now prefer DPO, which gets similar results with less effort.

Related terms

← All terms