Reinforcement Learning for Reasoning in LLMs: Where are we?
Introduction
In the evolving landscape of large language models (LLMs), enhancing reasoning capabilities has become a focal point. Reinforcement learning (RL), particularly with verifiable rewards (RLVR), offers a promising avenue for refining these models. Among the emerging techniques, Group Relative Policy Optimization (GRPO) stands out for its innovative approach to fine-tuning LLMs.
GRPO is a variant of the Proximal Policy Optimization (PPO) algorithm, tailored for scenarios involving verifiable (binary) rewards. Unlike traditional PPO, GRPO emphasizes group-relative advantages, optimizing the model by comparing the performance of different actions within a group context. This method has shown promise in tasks requiring precise reasoning, such as mathematical problem-solving and code generation.
The algorithm operates by adjusting the model’s policy to favor actions that yield higher relative rewards, effectively amplifying successful outcomes over iterations. Instead of requiring a reward model, the learning process is guided by well-defined reward functions encouraging the model to refine its reasoning.
It works as follows (a minimal code sketch of the advantage computation follows these steps):
The model generates groups of responses.
Each response is scored on correctness or another metric by a predefined reward function rather than an LLM-based reward model.
The average score of the group is computed.
Each response's score is compared to the group average.
The model is reinforced to favour higher-scoring responses.
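Here is a minimal sketch of the group-relative advantage computation described in the steps above, assuming the common mean-and-standard-deviation normalization used in GRPO; the binary correctness rewards are purely illustrative.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each response's reward is compared to the
    mean of its group and normalized by the group's standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of 6 sampled answers scored by a binary correctness reward.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(grpo_advantages(group_rewards))
# Correct answers get positive advantages, incorrect ones negative;
# the policy update then up-weights the higher-scoring responses.
```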
In this post, I’ll focus on mathematical reasoning. For more details, please refer to my earlier post on this topic.
Journey so far
DeepSeek’s research on R1-Zero demonstrated that reasoning can emerge as a learned behavior through pure reinforcement learning (GRPO), without human supervision or predefined instructions. This opens up avenues to train reasoner models without requiring large swathes of reasoning data, and a particularly interesting paradigm of small reasoning models emerges. This recipe has been adapted in several studies involving small LLMs (wilcbb, TinyZero, etc.), which show that reasoning can emerge through pure RL on problem-answer data without requiring any CoT.
In the original work, the DeepSeek team proposed ‘distilling’ reasoning into smaller models by supervised fine-tuning on the outputs of DeepSeek-R1. It was shown that this type of distillation gives better results than pure RL. This observation has been noted in other studies as well, e.g., s1K-32B achieves better accuracy than o1-preview on math benchmarks by supervised fine-tuning on a well-curated set of 1K problems paired with reasoning traces. My own experiments on using LLMs for recommendation systems showed better performance from SFT than from RL-based fine-tuning.
Whether to supervised fine-tune (distill) or to reinforcement-learning fine-tune is an interesting question that has seen a lot of attention recently. In a related vein, a deeper understanding of RL-based fine-tuning and the reasoning gains it delivers is a hot topic. In DeepScaleR, a small, distilled reasoning model is fine-tuned using GRPO to enhance reasoning. The DeepSeek-R1-Distill-Qwen-1.5B model, which is Qwen 1.5B trained on DeepSeek-R1 traces, is further fine-tuned using RL-GRPO on a well-curated dataset to achieve better performance than o1-preview. This was noted in the original DeepSeek-R1 paper:
Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.
What works and what doesn’t?
In a recent work, researchers investigated the behaviour of RL fine-tuning on a small, distilled 1.5B-parameter model. They curated a compact, high-quality dataset tailored to mathematical reasoning from two sources - s1K and DeepScaleR. Both datasets are filtered to remove trivial problems that can be solved by small LLMs. Multiple experiments on fine-tuning the DeepSeek-R1-Distill-Qwen-1.5B model using GRPO are conducted to better understand how its reasoning capabilities evolve. They further impose constraints, such as limiting the number of generations to 6 per step and a maximum completion length of 4096 tokens.
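As a rough illustration of such a constrained GRPO setup, here is a sketch using Hugging Face TRL's GRPOTrainer. This is my own assumption of a comparable configuration, not the paper's actual training stack; argument names can change across TRL versions, and the dataset file and string-match reward below are placeholders.

```python
# Sketch of a constrained GRPO run with Hugging Face TRL (assumed API;
# the dataset path and the naive reward check are placeholders).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(completions, answer, **kwargs):
    # Binary verifiable reward: 1 if the reference answer string appears in
    # the completion, else 0. A real setup would use a proper math verifier.
    return [1.0 if ans in comp else 0.0 for comp, ans in zip(completions, answer)]

config = GRPOConfig(
    output_dir="grpo-math-1.5b",
    num_generations=6,              # 6 sampled responses per prompt, as in the paper
    max_completion_length=4096,     # the token budget discussed above
    per_device_train_batch_size=6,  # must be divisible by num_generations
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=correctness_reward,
    args=config,
    # expects a dataset with "prompt" (and here "answer") columns
    train_dataset=load_dataset("json", data_files="curated_math.jsonl", split="train"),
)
trainer.train()
```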
They find that with only 7000 training examples and a $42 compute budget, RL fine-tuning can lead to strong improvements, outperforming OpenAI's o1-preview on the AIME24 math benchmark. However, what makes this work interesting is their observation of learning instability with prolonged training. The insights from their experiments are:
Small LLMs can achieve fast reasoning improvements within the first 50–100 training steps using a small high-quality dataset. But the performance quickly drops with prolonged training under strict completion length requirements. It is postulated that the problems in the data are too complex and the model runs out of the token budget before producing the final answer.
Mixing easier and harder problems helps stabilise the training. The model produces shorter responses and performance improves in the initial steps. Easier problems encourage concise reasoning, while more challenging ones allow for complexity. However, performance still degrades over time due to persistent challenges with length constraints and multilingual tendencies.
Instead of the plain accuracy reward, they tried a cosine-scheduled reward where the accuracy reward is scaled based on response length. Shorter correct solutions receive higher rewards, while longer incorrect solutions are penalized less severely, pushing for concise yet accurate reasoning (a sketch of this reward shape is given below). This helps control output length more effectively and improves training consistency, but it slightly reduces peak performance compared to standard accuracy-based rewards.
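As a small sketch of what such a length-aware reward could look like: the exact schedule and reward constants used in the paper are not reproduced here, so the values below are illustrative assumptions.

```python
import math

def cosine_length_scaled_reward(is_correct, gen_len, max_len=4096,
                                r_correct_short=1.0, r_correct_long=0.5,
                                r_wrong_short=-1.0, r_wrong_long=0.0):
    # Interpolate between a "short" and a "long" reward value with a cosine
    # schedule as the generation length grows toward max_len.
    r_short, r_long = ((r_correct_short, r_correct_long) if is_correct
                       else (r_wrong_short, r_wrong_long))
    t = min(gen_len, max_len) / max_len
    return r_long + 0.5 * (r_short - r_long) * (1 + math.cos(math.pi * t))

# Short correct answers earn the most; long wrong answers are penalized least.
print(cosine_length_scaled_reward(True, 500))    # ~0.98
print(cosine_length_scaled_reward(True, 4000))   # ~0.50
print(cosine_length_scaled_reward(False, 500))   # ~-0.96
print(cosine_length_scaled_reward(False, 4000))  # ~-0.00
```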
This work shows that RL is effective for small distilled models. However, the complexity of the problems necessitates extending the maximum completion length, particularly with multilingual base models.
A sober look
A new paper revisits recent claims about the benefits of GRPO-based reinforcement learning (RL) for improving distilled language models. Previous research reported substantial improvements when applying RL methods to language models. Yet, upon closer scrutiny, the authors find that these reported gains may largely be attributed to statistical noise rather than genuine advancements. They demonstrate that results obtained on smaller benchmarks, such as AIME24, are highly sensitive to minor experimental variations, leading to fluctuations in performance scores of several percentage points.
In this work, they consider various small (1.5B and 7B) models which have been RL-fine-tuned from DeepSeek-R1-Distill models, like the ones discussed in the previous section. These models are evaluated under different evaluation design choices to understand the stability of the results.
Random Seed: Experiments show high standard deviation - ranging from 5 to 15 percentage points across seeds, particularly for smaller benchmarks like AIME24, where solving a single extra problem moves the score by about 3% (the benchmark has only 30 problems). Thus, single-seed evaluations are not trustworthy; reported numbers should be averaged over multiple seeds. In fact, they showed that bootstrapping over multiple runs helps stabilise results, and they propose it as a standard for reliable evaluation (a minimal sketch is given after this list).
Sampling Parameters: The effects of temperature and top_p on generation are well known. The authors observed up to 15% variation in performance from changing temperature and up to 8% variation from changing top_p. The recommendation is to set temperature and top_p to each model’s optimal values to ensure fair and stable evaluation.
Hardware and Software: A somewhat surprising observation in this paper is that seemingly harmless choices, like the GPU type and the evaluation framework, can also shift benchmark scores.
Generation Length: The length of the generation, controlled by max_new_tokens, is an important parameter - too small a budget can cause premature stopping, which is a failure case. This is especially true for complex problems, which can have longer reasoning traces and solution steps.
Prompt: The input prompt is another design choice that can affect performance - prompt engineering as an expertise is, after all, well established. However, it is often brushed off in academic benchmarks, and understandably so, as this is an infinite space. In this paper, they tried three different prompts and observed variation in results.
I would love to go a bit deeper into this, as prompting seems like an easy way to extract more juice from the model without any additional training. I also wonder how automatic prompt tuning would fare here. I will share some of my experiments in this direction in a future post.
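To make the seed and bootstrapping point concrete, here is a minimal sketch of resampling per-problem outcomes to get a confidence interval. This is a simplified illustration of my own, not the authors' exact protocol (they bootstrap over multiple evaluation runs).

```python
import random

def bootstrap_accuracy_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for benchmark accuracy from
    per-problem 0/1 outcomes (pooled over evaluation runs/seeds)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(outcomes[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(outcomes) / n, (lo, hi)

# 30 AIME-style problems with 18 solved: the point estimate is 60%, but the
# bootstrap interval is wide, which is why single-run AIME24 scores are noisy.
outcomes = [1] * 18 + [0] * 12
print(bootstrap_accuracy_ci(outcomes))
```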
Based on these observations, best practices for evaluation are proposed in this work. Furthermore, they evaluated several RL fine-tuning efforts built on top of the DeepSeek-R1-Distill-Qwen-1.5B model. It is shown that for most models (except DeepScaleR), when tested in more controlled, standardized evaluation frameworks, the previously touted performance boosts diminish significantly, frequently falling within statistical margins of error. When applied to base models (Qwen or Qwen-Math) instead of the distilled model, RL-based fine-tuning showed significant gains. However, these gains do not surpass those achieved via supervised instruction tuning, as reported in the original Qwen papers. Additionally, the gains seen with RL often fail to generalize effectively to novel benchmarks or tasks (e.g., AIME25).
Fine-tuning on reasoning traces from larger models through supervised learning consistently delivers substantial and generalizable improvements across benchmarks. The original DeepSeek-R1 paper also pointed out the superiority of SFT on traces over pure RL. This reinforces the case for SFT on reasoning traces as a mature fine-tuning paradigm. Furthermore, this work shows that current RL-based methods are highly prone to overfitting, whereas supervised fine-tuning (SFT) models demonstrate stronger generalization and robustness.
Overall, while RL may offer some utility in improving the performance of smaller distilled models under specific conditions, this careful, controlled study suggests that the benefits of RL have been exaggerated in reporting. More rigorous evaluation standards are needed to accurately assess the effectiveness of RL and other training methods.
Conclusion
An immediate takeaway from this discussion is to be careful when evaluating the performance of models. Personally, I found the effect of evaluation design choices like prompt, base model, data, etc. an interesting research direction that I intend to explore in more detail in upcoming posts. Furthermore, supervised fine-tuning, especially on reasoning traces of large reasoning models, continues to be a formidable training paradigm. The research discussed in this post is focused on mathematical reasoning and benchmarks. I would love to see interesting work involving RL applications in other tasks - do share!
If you find this work useful, I would appreciate it if you cite it as:
@misc{verma2025rl-reasoning,
title={Reinforcement Learning for Reasoning in LLMs},
author={Janu Verma},
year={2025},
url={https://januverma.substack.com/p/reinforcement-learning-for-reasoning},
note={Incomplete Distillation}
}