TL;DR: In RLHF, there is a tension between the reward learning phase, which uses human preferences in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?
Figure 1:
This diagram illustrates the difference between reinforcement learning from absolute feedback and from relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.
Large Language Models (LLMs) have powered increasingly capable virtual assistants, such as GPT-4, Claude-2, Bard, and Bing Chat. These systems can answer complex user queries, write code, and even produce poetry. The technique underlying these impressive virtual assistants is Reinforcement Learning from Human Feedback (RLHF). RLHF aims to align the model with human values and eliminate unintended behaviors, which often arise because the model is exposed to a large quantity of low-quality data during its pretraining phase.
Proximal Policy Optimization (PPO), the dominant RL optimizer in this process, has been reported to exhibit instability and implementation complications. More importantly, there is a persistent discrepancy in the RLHF pipeline: although the reward model is trained using comparisons between different responses, the RL fine-tuning stage operates on individual responses without making any comparisons. This inconsistency can exacerbate problems, especially in the challenging domain of language generation.
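To make this mismatch concrete, here is a sketch under the usual RLHF assumptions (a Bradley-Terry reward model $r_\phi$ and a KL-regularized RL objective; the notation is ours and is not spelled out above). The reward model is fit only to comparisons between a preferred response $y_w$ and a dispreferred one $y_l$:

$$\mathcal{L}_{\mathrm{RM}}(\phi) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big],$$

while the RL fine-tuning stage then optimizes the absolute score of a single response $y$:

$$\max_{\theta}\;\mathbb{E}_{x,\;y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid x)\big].$$

The first objective only ever sees reward differences, so any prompt-dependent offset in $r_\phi$ cancels out; the second objective is sensitive to exactly such offsets, since it never compares responses to the same prompt.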
Given this backdrop, an intriguing question arises: is it possible to design an RL algorithm that learns in a comparative manner? To explore this, we introduce Pairwise Proximal Policy Optimization (P3O), a method that harmonizes the training processes of the reward learning stage and the RL fine-tuning stage of RLHF, offering a satisfying resolution to this issue.
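As a rough illustration of what "learning in a comparative manner" can look like, here is a minimal sketch of a pairwise policy-gradient loss in PyTorch. This is an illustration of the idea under our own assumptions, not the full P3O algorithm (which additionally uses PPO-style clipping and KL control); all names below are hypothetical.

```python
import torch

def pairwise_pg_loss(logp_a: torch.Tensor,
                     logp_b: torch.Tensor,
                     reward_a: torch.Tensor,
                     reward_b: torch.Tensor) -> torch.Tensor:
    """Policy-gradient surrogate driven by the *relative* reward of two
    responses sampled for the same prompt.

    logp_a, logp_b: summed token log-probs of responses y_a, y_b under the
        current policy, one entry per prompt (shape [batch]).
    reward_a, reward_b: reward-model scores of y_a and y_b (shape [batch]).
    """
    # Only the reward difference matters, so any prompt-level offset in the
    # reward model cancels out, mirroring how the reward model was trained.
    relative_reward = (reward_a - reward_b).detach()
    # Raise the log-likelihood of the better response relative to the worse
    # one, weighted by how much better it is.
    return -(relative_reward * (logp_a - logp_b)).mean()

# Toy usage with dummy tensors standing in for policy log-probs and rewards.
batch = 4
logp_a = torch.randn(batch, requires_grad=True)
logp_b = torch.randn(batch, requires_grad=True)
reward_a, reward_b = torch.randn(batch), torch.randn(batch)
loss = pairwise_pg_loss(logp_a, logp_b, reward_a, reward_b)
loss.backward()  # gradients flow into logp_a / logp_b, i.e., into the policy
```

Unlike a per-response PPO update, the update direction here depends only on reward differences between two responses to the same prompt, matching the comparative signal the reward model was trained on.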