Training Intermediate Also known as: Proximal Policy Optimization · Ottimizzazione di policy prossimale

PPO

/pee-pee-oh/

A reinforcement learning algorithm that updates the policy (the model being trained) in small, bounded steps, so each new version cannot drift too far from the previous one.


In practice

PPO was the engine behind RLHF in the early versions of ChatGPT: it maximizes a reward signal learned from human preferences while keeping the model from diverging. It is notoriously hard to stabilize and has many hyperparameters to tune. That is why many open-source teams now prefer DPO, which reaches similar results with less effort.
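To make the "small steps" idea concrete, here is a minimal sketch of the clipped surrogate loss at the core of PPO, in plain Python. The function name `ppo_clip_loss` and the default `clip_eps=0.2` are illustrative assumptions, not taken from any particular library. A full RLHF setup would typically add further terms (for example a value loss and a KL penalty against a reference model), which are left out here.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss (to be minimized) over a batch.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the previous policy. advantages: advantage estimates.
    clip_eps is an assumed, commonly used clipping range.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic choice: take the smaller of the two objectives,
        # so large policy changes earn no extra credit.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimize, while PPO maximizes the objective.
    return -total / len(advantages)
```

With a positive advantage and a probability ratio of 2, the objective is capped at 1.2 times the advantage, so pushing the policy further in that direction brings no additional gain. This cap is what keeps each update close to the previous policy.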

Related terms

RLHF · DPO · Alignment · Loss Function

Seen in the wild

2 entries mentioning it

January 22, 2025

Microsoft 365 Copilot Autonomous Agents: Sales, IT, and HR work without constant oversight

High
August 15, 2024

Zendesk AI Suite: autonomous agents for end-to-end customer support

Medium
