OpenAI's Reinforcement Finetuning: RL + Science — A New God or Thanos?
Discover OpenAI's Reinforcement Finetuning (RFT), combining RLHF and expert data for breakthroughs in medical diagnosis, decision-making, and scientific challenges.
On December 6, 2024, at 11 a.m. California time, OpenAI released a new Reinforcement Finetuning (RFT) method for building expert models. This approach allows users to solve decision-making problems in specialized domains, such as medical diagnosis or rare disease detection, by fine-tuning on as few as a few dozen to a few thousand training cases.
The training data is formatted similarly to common instruction-tuning datasets: each case consists of a prompt, multiple candidate options, and a correct answer. At the same time, OpenAI launched a Reinforcement Finetuning research program, encouraging scholars and experts to upload unique datasets from their fields to test this fine-tuning method.
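To make the data format concrete, here is a minimal sketch of what a single training case in this multiple-choice style might look like, written out as JSON Lines. The field names (`prompt`, `options`, `correct_answer`) and the medical example are illustrative assumptions, not OpenAI's actual schema.

```python
import json

# Illustrative (hypothetical) training case: a prompt, several candidate
# options, and the single correct answer. Field names are assumptions for
# illustration only, not OpenAI's actual RFT schema.
example = {
    "prompt": (
        "A patient presents with progressive muscle weakness, elevated "
        "creatine kinase, and a family history of early cardiac death. "
        "Which diagnosis is most likely?"
    ),
    "options": [
        "Duchenne muscular dystrophy",
        "Amyotrophic lateral sclerosis",
        "Myasthenia gravis",
        "Guillain-Barré syndrome",
    ],
    "correct_answer": "Duchenne muscular dystrophy",
}

# RFT-style datasets are small: a few dozen to a few thousand cases,
# one JSON object per line.
with open("rft_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```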
This method builds upon techniques already widely used in alignment, mathematics, and coding. Its foundation is Reinforcement Learning from Human Feedback (RLHF), which aligns large models with human-preference data. In RLHF, each training example consists of a question together with two candidate answers and a label indicating which answer the annotator preferred. These preference pairs are used to train a reward model. Once the reward model is established, a reinforcement learning algorithm such as PPO fine-tunes the model parameters against its scores (DPO reaches a similar result directly from the preference pairs, without an explicit reward model), so that the model produces content more aligned with user preferences.
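To illustrate the reward-modelling step, here is a minimal PyTorch sketch of the standard pairwise (Bradley-Terry) objective used to fit a reward model on preference pairs. The random feature tensors and the tiny linear head are placeholders of my own; in a real RLHF pipeline the reward model is a full language model with a scalar head, and PPO then optimizes the policy against its scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardModel(nn.Module):
    """Toy reward model: maps a (question, answer) feature vector to a scalar reward."""

    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

# Stand-ins for encoded (question, preferred answer) and (question, rejected answer)
# pairs; a real pipeline would encode the text with the language model itself.
chosen_features = torch.randn(32, 16)
rejected_features = torch.randn(32, 16)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    r_chosen = model(chosen_features)
    r_rejected = model(rejected_features)
    # Bradley-Terry pairwise loss: push the preferred answer's reward
    # above the rejected answer's reward.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.4f}")
```

Once such a reward model is trained, its scalar scores serve as the reward signal that PPO maximizes while keeping the fine-tuned policy close to the original model.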