
OpenAI Reinforcement Fine-Tuning: Less Data, Better Results

A few weeks ago, OpenAI announced Reinforcement Fine-Tuning (RFT). This post covers the
technical details of how it works, as well as the types of tasks for which it is a major breakthrough,
one that lets LLMs be applied to even more complex custom tasks.
[Embedded video: Reinforcement Fine-Tuning—12 Days of OpenAI: Day 2]
At a high level, RFT helps a reasoning model such as o1 adapt its thinking to new domains more
effectively. And importantly, compared to standard supervised fine-tuning (SFT), RFT allows
models to learn from very small datasets—“a few dozen” examples, according to OpenAI.
This is a significant breakthrough. Training and inference costs have dropped precipitously, but
collecting enough high-quality labeled data remains a bottleneck for applying AI to complex,
novel tasks. So reducing the data required by one or more orders of magnitude is a big deal!
RFT: How Does It Work?
OpenAI’s internal implementation of RFT is not public, but we can make an educated guess
about how it works based on their public description and related research. Here are the high-level steps:
1. The user uploads a dataset with easily verifiable outputs (this part is important!). Tasks like
classification or information extraction, where there is a clear “right” or “wrong” answer,
work well. OpenAI puts it this way: “Reinforcement Fine-Tuning excels at tasks where the
outcome has an objectively ‘correct’ answer that most experts would agree with.” In the
launch video, OpenAI’s example task involved mapping a medical case report to a specific
genetic mutation.
2. Using that dataset, the RFT training process performs the following loop:
1. Take a batch of dataset inputs, and generate a reasoning trace and output for each one. As a
toy example, suppose each input is a short piece of text, the dataset label is the single emotion
it expresses, and the model is asked to output a ranked list of the three emotions it considers
most likely.
2. The grader scores each generated output. The earlier the correct answer appears in the list, the
higher the score. In our example we give a score of 1 if the correct answer is in the first
position, 0.5 if it’s in the second position, and 0 if it’s in the third position or not listed.
3. Use PPO or a similar reinforcement learning technique to update the model weights,
favoring generated outputs that receive higher grades.
4. Repeat until the model stops improving. (A rough sketch of this loop appears below.)
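Since OpenAI’s implementation is not public, the following is only a minimal sketch of that loop in Python. Everything here is an assumption: generate_with_reasoning, grade, and ppo_update are hypothetical stand-ins for the real sampling, grading, and policy-update components, not actual APIs.

```python
import random

def generate_with_reasoning(model, prompt, k=4):
    """Sample k candidate (reasoning trace, output) pairs for one prompt.
    Here `model` is any callable that returns such a pair."""
    return [model(prompt) for _ in range(k)]

def rft_training_loop(model, dataset, grade, ppo_update, steps=100, batch_size=8):
    """dataset: a list of (prompt, reference_answer) pairs whose answers
    are easy to verify, as described in step 1 above."""
    for _ in range(steps):
        batch = random.sample(dataset, batch_size)
        rollouts = []
        for prompt, reference in batch:
            for reasoning, output in generate_with_reasoning(model, prompt):
                # The grader scores each generated output against the reference.
                reward = grade(output, reference)
                rollouts.append((prompt, reasoning, output, reward))
        # PPO-style update that shifts probability mass toward higher-graded outputs.
        model = ppo_update(model, rollouts)
    return model
```

In a real system the update step would adjust model weights with a PPO-style objective; the stub above just shows where the graded rollouts feed in.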
Let’s Talk About That Grading Function!
You may be wondering why, in our toy example, we ask the model to output a ranked list,
instead of simply asking it to predict the most likely emotion—after all, the dataset output we’re
training on just includes a single emotion. We do this because it allows our grader to assign
partial credit to an answer that includes the correct response in a later position.
It turns out that being able to assign partial credit is really useful.
In reinforcement learning terms, this makes our reward function more dense. By giving partial
credit for a correct answer that isn’t in the top position, we provide a more granular reward signal
—one that acknowledges the model is “on the right track” even if it isn’t fully correct. This helps
stabilize and speed up training, since the model doesn’t have to wait for a perfect response to
receive positive feedback. Instead, it learns from incremental improvements toward the correct
output.
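To make that concrete, here is the toy grader written out in Python. The function name and list representation are my own illustration; only the scoring scheme (1 for first position, 0.5 for second, 0 otherwise) comes from the example above.

```python
def grade_ranked_emotions(ranked_emotions, correct_emotion):
    """Toy grader: 1.0 if the labeled emotion is ranked first, 0.5 if it
    is second, and 0.0 if it is third or missing entirely."""
    if correct_emotion not in ranked_emotions:
        return 0.0
    position = ranked_emotions.index(correct_emotion)
    return {0: 1.0, 1: 0.5}.get(position, 0.0)

# The correct label is ranked second, so the model earns partial credit
# instead of a flat zero.
print(grade_ranked_emotions(["joy", "sadness", "anger"], "sadness"))  # 0.5
```

Partial credit like this is exactly what makes the reward signal denser: a second-place answer earns 0.5 instead of nothing.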
When Should You Use RFT?
There are three main qualifications that make a task a good match for RFT:
1. The task is difficult. (If the task is simple, you may not need any fine-tuning at all.)
2. Outputs are easy to verify. Because RFT requires grading each output, the task should have
a clear verification mechanism. This works well for classification or structured information
extraction, but might be less feasible for tasks like summarization or open-ended
conversation.
3. Labeled data is hard to collect. If you have a lot of pre-labeled data, you will likely get good
results with SFT and won’t need to resort to RFT, which is more complicated, slower, and more
expensive at both training and inference time.
One interesting implication of point (3) above is that for very high-volume tasks, RFT may be a
useful “stepping stone” towards a more-optimized classical SFT model. Let’s say that you have 5
million PDFs, and you need to perform a complicated data extraction task on each of them. You
might design your pipeline in the following way (a rough code sketch follows the list):
1. Use an expert human to hand-label 50-100 examples.
2. Use those examples as the training data to create an RFT model that performs the task well.
3. Use your new RFT model to machine-label an additional 20K examples.
4. Use those 20K examples to train a simpler, faster LLM to do the same task using SFT.
5. Use your simpler, faster LLM to label the remaining ~5M documents.
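As a rough illustration, here is that bootstrapping pipeline sketched in Python. All of the function names are hypothetical stubs invented to show the data flow, not real training APIs.

```python
def train_rft_model(seed_examples):
    """Steps 1-2: fine-tune a reasoning model with RFT on the ~50-100
    hand-labeled examples. Stubbed here with a trivial stand-in model."""
    return lambda doc: f"extracted-fields-for:{doc}"

def train_sft_model(machine_labeled_examples):
    """Step 4: distill the task into a simpler, faster model with standard SFT."""
    return lambda doc: f"extracted-fields-for:{doc}"

def run_pipeline(documents, seed_examples):
    rft_model = train_rft_model(seed_examples)                          # steps 1-2
    machine_labeled = [(d, rft_model(d)) for d in documents[:20_000]]   # step 3
    sft_model = train_sft_model(machine_labeled)                        # step 4
    return {d: sft_model(d) for d in documents}                         # step 5
```

The expensive RFT model only ever labels the 20K intermediate examples; the cheaper SFT model handles the remaining millions of documents.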
The Road Ahead: Open Source RFT?
Hopefully this post has helped build intuition around the types of scenarios where RFT is
potentially useful. But there’s one more thing I’m excited to share!
We’re leading an active project to develop an open-source RFT implementation to fine-tune
reasoning models like Qwen’s QwQ. Early results are promising, but we want to test on more
datasets before releasing. If you’re a researcher interested in collaborating on this project—or
you have a dataset well suited to RFT—please reach out to me directly at kyle@openpipe.ai. I’d
love to get you involved!
Kyle Corbitt
OpenPipe (S23)
kyle@openpipe.ai