Reinforcement Learning from Human Feedback (RLHF)
Encord Computer Vision Glossary
Reinforcement learning from human feedback (RLHF) is a machine learning technique that trains a "reward model" directly from human feedback. The reward model then serves as the reward function for optimizing an agent's policy with a reinforcement learning algorithm such as Proximal Policy Optimization (PPO). RLHF can improve the robustness and exploration of RL agents, particularly when the reward function is sparse or noisy.
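To make the reward-model idea concrete, here is a minimal sketch of learning a reward function from pairwise human preferences. It assumes a linear reward over hand-made feature vectors and a Bradley-Terry style logistic loss, which is a common choice but not one this entry specifies. The feature vectors, learning rate, and function names are all hypothetical, chosen only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Assumed linear reward model: r(x) = w . x
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    # Bradley-Terry style loss: -log sigmoid(r(preferred) - r(rejected))
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    # Plain stochastic gradient descent on the pairwise loss
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d(loss)/dw = -(1 - sigmoid(margin)) * (preferred - rejected)
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical data: each pair is (features of the output a human preferred,
# features of the output the human rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.9, 0.3], [0.1, 0.4]),
]
weights = train_reward_model(pairs, n_features=2)
```

After training, `reward(weights, x)` scores the preferred output in each pair above the rejected one. In a full RLHF pipeline, a score of this kind would stand in for the environment reward when the policy is optimized with an algorithm such as PPO.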
Humans provide feedback by ranking examples of the agent's behavior. These rankings can then be converted into scores for the individual outputs using a rating system such as Elo.
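As an illustration of how rankings might be turned into scores, the sketch below treats every ranking as a set of pairwise "matches" and applies the standard Elo update to each one. The starting rating of 1000, the K-factor of 32, and the response names are assumptions made for this example, not values taken from any particular RLHF system.

```python
from itertools import combinations

def expected_score(rating_a, rating_b):
    # Standard Elo expected score of A against B
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    # Move the winner up and the loser down by the same amount
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta

def ratings_from_rankings(rankings, initial=1000.0, k=32.0):
    ratings = {}
    for ranking in rankings:
        for item in ranking:
            ratings.setdefault(item, initial)
        # combinations() keeps input order, so the first item of each
        # pair is the one the human ranked higher.
        for better, worse in combinations(ranking, 2):
            update_elo(ratings, better, worse, k)
    return ratings

# Hypothetical rankings from three annotators, best output first
rankings = [
    ["response_a", "response_b", "response_c"],
    ["response_a", "response_c", "response_b"],
    ["response_b", "response_a", "response_c"],
]
ratings = ratings_from_rankings(rankings)
```

With these rankings, `response_a` ends up with the highest rating and `response_c` with the lowest. Elo updates depend on the order in which comparisons are processed, so the exact numbers would change if the rankings were shuffled.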
RLHF has found use in multiple natural language processing domains, such as conversational agents, text summarization, and natural language understanding. In conventional reinforcement learning, agents learn from their own actions against a predefined "reward function." Applying this approach to natural language processing tasks is challenging because the rewards are often difficult to define or measure, particularly in complex tasks that involve human values or preferences. RLHF can help language models generate answers that are consistent with these complex values, produce more extensive responses, and reject questions that are irrelevant or outside of their knowledge area. OpenAI's ChatGPT and InstructGPT, as well as DeepMind's Sparrow, are examples of RLHF-trained language models.
RLHF has also found application in other fields, including the development of video game bots. OpenAI and DeepMind, for instance, trained agents to play Atari games using human preferences as a guide. The agents performed strongly in the tested environments, with some even outperforming human players.