Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding†, Zhiyuan Liu† †: Corresponding Author

<aside> 💡

TL;DR

Training small reasoning models with RL has become a race toward complexity, using multi-stage pipelines, dynamic schedules, and curriculum learning. We ask: Is this complexity necessary? We show that JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using roughly half the compute of more sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We open-source our models and evaluation scripts as validated starting points for practitioners and researchers.

🌐 GitHub, 🔎 Eval Logs, 🤗 JustRL Collection

</aside>

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. — Antoine de Saint-Exupéry, Airman's Odyssey

Figure 1: (a) The AIME24 (avg@32) performance curve for scaling from DeepSeek-R1-Distill-Qwen-1.5B into JustRL-DeepSeek-1.5B, from 28% to 58% over 4,000 steps; (b) from OpenMath-Nemotron-1.5B into our 1.5B reasoning SOTA model JustRL-Nemotron-1.5B, showing its training journey to the final 70+% score over 3,000 steps.

Introduction

Recent advances in Large Language Models (LLMs), such as OpenAI's o1 [1] and DeepSeek-R1 [2], have demonstrated the remarkable effectiveness of large-scale Reinforcement Learning with Verifiable Rewards (RLVR) for challenging reasoning tasks in mathematics and coding. For smaller models in the 1-10B parameter range, researchers have increasingly turned to reinforcement learning to push performance boundaries beyond what distillation alone can achieve. Over the past year, we've seen a proliferation of methods attempting to stabilize and improve RL training for small language models (SLMs): multi-stage training pipelines, dynamic hyperparameter schedules, adaptive temperature controls, response length penalties, and various forms of data curation and filtering [3,4,5,6,7,8].

This proliferation of techniques raises an important question: Is this complexity necessary? The accumulated “best practices” may be fighting each other rather than addressing the fundamental challenges of RL [9]. In this blog post, we explore whether stable, competitive training can be achieved with a simpler approach. We apply a minimal setup to two popular 1.5B reasoning models, using single-stage training with fixed hyperparameters derived from common practice. The results match or exceed those of more complex approaches while using roughly half the compute. Importantly, we achieve this without multi-stage pipelines or dynamic schedules, suggesting that simpler approaches may be sufficient when applied at adequate scale. Moreover, the training process itself is stable: smooth, monotonic improvement over 4,000+ steps without the collapses or oscillations often cited as motivation for complex interventions.

Our goal is not to argue against all techniques or claim we've found the optimal approach. Rather, we provide evidence that simpler baselines deserve more attention than they've received. We offer a simple recipe with a minimal set of tricks that can improve the performance of models approaching their distillation limits. The field may benefit from establishing what's fundamentally sufficient before layering on additional complexity. By open-sourcing our models and evaluation scripts, we hope to provide a reliable foundation that others can build upon, whether for practical deployment or as a baseline for developing and validating new techniques.

The Landscape: RL for Small Reasoning Models

Since DeepSeek-R1's release in early 2025, the community has rapidly advanced RL for small language models in mathematical reasoning. The past year has seen a flourishing of approaches, each introducing techniques to stabilize training and push performance boundaries. These works fall into three main families based on their foundation models: DeepSeek-R1-Distill-Qwen-1.5B, OpenMath-Nemotron-1.5B, and Qwen3-1.7B, all starting from distilled bases.

The evolution reveals a clear trend toward increasing sophistication. Early works like STILL explored hyperparameter tuning and reference model resets. Subsequent approaches introduced multi-stage training with progressive context lengthening (DeepScaleR, FastCuRL), alternating between CoT compression and extension across five stages with varying data and batch sizes (FastCuRL), or dividing training into eight stages with scheduled length penalties (ProRL). Later works incorporated hundreds of rollouts per example (BroRL), question augmentation with partial solutions (QuestA), dynamic dataset filtering (POLARIS), and test-time context extrapolation (e3, POLARIS). A summary of RL techniques for various SLMs is shown in the following table.

Techniques compared: entropy control (clip higher / stop gradient), hyperparameter tuning, training prompt tuning, KL reference model resets, length control (lengthening / penalty), adaptive temperature, rollout rescue mechanisms, dynamic sampling, and split training stages.

| Model | Backbone | Date |
| --- | --- | --- |
| STILL-3-1.5B-Preview | DeepSeek-R1-Distill-Qwen-1.5B | Jan 2025 |
| DeepScaleR-1.5B-Preview | DeepSeek-R1-Distill-Qwen-1.5B | Feb 2025 |
| FastCuRL-1.5B-V3 | DeepSeek-R1-Distill-Qwen-1.5B | Mar 2025 |
| ProRL-Nemotron-Research-Reasoning-Qwen-1.5B-v1 | DeepSeek-R1-Distill-Qwen-1.5B | May 2025 |
| e3-1.7B | Qwen3-1.7B | Jun 2025 |
| Polaris-1.7B-Preview | Qwen3-1.7B | Jul 2025 |
| Archer-Math-1.5B (Not Released) | DeepSeek-R1-Distill-Qwen-1.5B | Jul 2025 |
| ProRL-Nemotron-Research-Reasoning-Qwen-1.5B-v2 | DeepSeek-R1-Distill-Qwen-1.5B | Aug 2025 |
| QuestA-Nemotron-1.5B | OpenMath-Nemotron-1.5B | Sep 2025 |
| BroRL (Not Released) | DeepSeek-R1-Distill-Qwen-1.5B | Oct 2025 |
| JustRL-DeepSeek-1.5B (Ours) | DeepSeek-R1-Distill-Qwen-1.5B | Nov 2025 |
| JustRL-Nemotron-1.5B (Ours) | OpenMath-Nemotron-1.5B | Nov 2025 |

The pattern is striking: nearly every work employs multiple techniques from a growing toolkit—multi-stage training, adaptive hyperparameters, length penalties, dynamic sampling, and various stabilization mechanisms. While these methods achieve strong results, each represents a different combination of design choices, making it difficult to isolate which elements truly matter. The engineering complexity also raises a practical question: Is there a simpler path that still achieves competitive performance?

JustRL: Simplicity at Scale

Our approach is deliberately simple. We constrain ourselves to the fundamentals of RL, avoiding the multi-stage pipelines, dynamic schedules, and specialized techniques that have become common in recent work. The goal is to establish what's sufficient before adding complexity.

Training Setup: What We Use (and Don't Use)

Core algorithm: We use standard GRPO with binary outcome rewards—nothing more. The reward signal comes from a lightweight rule-based verifier from DAPO, without symbolic math libraries like SymPy that could add computational overhead.
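To make this concrete, here is a minimal sketch of the two pieces the paragraph above describes: a rule-based binary outcome reward and GRPO's group-relative advantages computed from those rewards. The function names (`binary_reward`, `grpo_advantages`) and the boxed-answer matching rule are illustrative assumptions rather than our actual implementation; the DAPO-style verifier applies a more thorough rule set, and the PPO-style clipped policy update that consumes these advantages is omitted.

```python
import re
import torch


def binary_reward(response: str, gold_answer: str) -> float:
    """Rule-based outcome reward: 1.0 if the final boxed answer matches the
    reference after light normalization, 0.0 otherwise. (Illustrative stand-in
    for the DAPO-style verifier, which uses a more thorough rule set.)"""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0
    pred = match.group(1).strip().replace(" ", "").lower()
    gold = gold_answer.strip().replace(" ", "").lower()
    return 1.0 if pred == gold else 0.0


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages used by GRPO.

    `rewards` has shape (num_prompts, group_size): one row per prompt, one
    column per sampled rollout. Each reward is baselined against its own
    group's mean and scaled by the group's std, so no value network is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Example: 2 prompts x 4 rollouts each, with binary outcome rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Note that with binary rewards, a group whose rollouts are all correct or all incorrect yields zero advantage for every member, so such prompts contribute no policy gradient.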

What we keep simple: