Training Intermediate Also known as: Reinforcement Learning from Human Feedback

RLHF

/ar-el-aitch-ef/

A training technique where humans rate and rank model responses, and these preferences are used to steer learning toward more helpful and safer answers.

ShareLinkedIn X

In practice

It is the step that turned ChatGPT into something useful versus a raw predictive model. For API users RLHF has already been done by the provider. Knowing about it explains why more 'aligned' models sometimes refuse legitimate requests.

Related terms

RLAIF Constitutional AI Alignment

Seen in the wild

5 entries mentioning it

October 25, 2023

Zephyr-7B: DPO on Mistral 7B beats Llama-2-70B-chat on MT-Bench

High
July 18, 2023

Llama 2: weights become commercially usable

Landmark
November 30, 2022

ChatGPT: AI lands in everyone's browser

Landmark
January 27, 2022

InstructGPT: the fine-tuning that teaches GPT to obey

High
December 16, 2021

WebGPT: OpenAI teaches GPT-3 to browse the web

High

← All terms