In practice
It was the engine behind RLHF in the early ChatGPT: it maximizes human reward without letting the model diverge. Notoriously hard to stabilize and rich in hyperparameters. That is why many open-source teams now prefer DPO, which gets similar results with less effort.