In practice
It needs only pairs of answers labeled "better" and "worse", and its training loop is simpler and more stable than PPO's. In recent years it has replaced PPO-based RLHF in many open-source projects (Zephyr, Tulu, Llama variants). It is often the cheapest way to align a fine-tuned model.
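To make the "simpler loop" point concrete, here is a minimal sketch of the per-pair loss, assuming the method discussed here is Direct Preference Optimization (DPO). The function name `dpo_pair_loss`, the default `beta`, and the log-probability values are illustrative assumptions, not taken from any particular project. In a real setup the four log-probabilities would come from the model being trained and a frozen reference copy.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one (better, worse) answer pair.

    Each argument is the summed log-probability of a full answer given the
    prompt, under either the trained policy or the frozen reference model.
    beta (an assumed value here) controls how far the policy may drift
    from the reference.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the better and the worse answer.
    x = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(x)) written as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities for a single labeled pair.
baseline = dpo_pair_loss(-5.0, -7.0, -5.0, -7.0)   # policy == reference
improved = dpo_pair_loss(-4.0, -8.0, -5.0, -7.0)   # policy favors "better"
print(baseline, improved)
```

When the policy matches the reference, the loss is log 2; it falls as the policy raises the "better" answer relative to the "worse" one. There is no reward model, no sampling during training, and no value function, which is where the simplicity relative to PPO comes from.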