RAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
We run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B, on the Gym-Sokoban task.
About the Sokoban task (from the official repo): Sokoban is Japanese for "warehouse keeper" and a traditional video game. The game is a transportation puzzle, where the player has to push all boxes in the room onto the storage locations/targets. The possibility of making irreversible mistakes makes these puzzles especially challenging for Reinforcement Learning algorithms, which mostly lack the ability to think ahead.
NOTE: See the Visualization section for details. The maximum reward of this environment is 10.9. The action space is 0-4 (0: Stand, 1: Up, 2: Down, 3: Left, 4: Right).
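For illustration, here is a minimal sketch of that action mapping; the `parse_action` helper and the text format it expects are assumptions for this example, not the repository's actual parser:

```python
# Discrete action ids used by the Sokoban environment described above.
ACTION_LOOKUP = {0: "Stand", 1: "Up", 2: "Down", 3: "Left", 4: "Right"}

def parse_action(model_output: str) -> int:
    """Map a model's free-text action to a discrete id (illustrative only)."""
    text = model_output.strip().lower()
    for action_id, name in ACTION_LOOKUP.items():
        if name.lower() in text:
            return action_id
    return 0  # fall back to "Stand" if nothing matches

print(parse_action("I push the box Up"))  # -> 1
```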
The loss curves have not converged yet (our compute is currently limited), but we already see some trends:
- Instruct-finetuned models do not hold a significant advantage over pretrained-only models, although they do better at the start.
- 3B models perform better than 0.5B models, but the advantage is also not pronounced at around 40 steps.
- Interestingly, the R1-distilled 1.5B model currently does worse than the 0.5B models.
We plan to release complete wandb plots for these experiment runs; you can also try it yourself, and your run may even be faster than ours (for the reasons above).
conda create -n ragen python=3.9 -y
conda activate ragen
git clone [email protected]:ZihanWang314/ragen.git
cd ragen
# setup install
pip install -e . # includes verl-ragen (by us) and verl-core (by the verl team)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# Optional: to install flash-attn, you may need to install cuda-toolkit first if you don't have it
# WARNING: Building flash-attn from source is resource-intensive
# An attempted build on an NVIDIA RTX 3080 machine (32GB RAM) did not complete after 24 hours
# Consider a more powerful setup (or a pre-built wheel, if one is available for your environment)
conda install -c "nvidia/label/cuda-12.1.0" cuda-toolkit -y
export CUDA_HOME=$CONDA_PREFIX # e.g. /opt/conda/envs/ragen
pip3 install flash-attn --no-build-isolation
pip install -r requirements.txt # other packages
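A quick, generic way to sanity-check the install before moving on (not part of the repository; adjust as needed):

```python
# Verify the PyTorch build and the optional flash-attn install.
import torch

print("torch:", torch.__version__, "| built against CUDA:", torch.version.cuda)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```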
On the Gym-Sokoban task, we create 10k examples for training and run for <=1 epoch.
# Sokoban env settings; they determine game difficulty
# it's normal to see some SOKOBAN errors; the data will still be created correctly
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
python scripts/dataset_curation.py \
--output data/sokoban \
--seed 10000 \
--train_size 10000 \
--test_size 10 \
--prefix qwen-instruct # we find this prompt format also works for base models
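To take a quick look at what the script produced, a minimal sketch; it assumes the output lands as parquet files (e.g. a hypothetical `train.parquet`) under `data/sokoban`, consistent with the parquet-based data handling used elsewhere in the repo — adjust the path to whatever you actually see on disk:

```python
# Peek at the curated Sokoban data (output filename is an assumption; adjust if needed).
import pandas as pd

df = pd.read_parquet("data/sokoban/train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0])  # inspect the first training example
```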
export DATA_DIR=data/sokoban
export DIM_X=6
export DIM_Y=6
export NUM_BOXES=1
export MAX_STEPS=5
export SEARCH_DEPTH=30
export CUDA_VISIBLE_DEVICES=0
export BASE_MODEL=Qwen/Qwen2.5-0.5B
export EXPERIMENT_NAME=test-qwen2.5-0.5b
# To resume from (or visualize) a saved checkpoint instead, point BASE_MODEL at it, e.g.:
# export BASE_MODEL=checkpoints/Agent-R1/test-qwen2.5-0.5b-instruct-1mbsz/actor/global_step_100
# export EXPERIMENT_NAME=test-qwen2.5-0.5b-imagetest
export MICRO_BATCH_SIZE=1
export TRAIN_BATCH_SIZE=128 # 256
export PPO_BATCH_SIZE=64 # 128
export MAX_START_LENGTH=400 # max length of the first-round prompt
export MAX_RESPONSE_LENGTH=100
export MAX_OBS_LENGTH=120
export MAX_TURNS=5
export NUM_UPDATE_PER_ROLL=1 # roll out a batch, then update the model N times; currently not implemented
export LOG_MODE="['wandb']" # or 'console'
export GCP=True # gradient checkpointing
export N_GPUS=1
export ROLLOUT_TP_SIZE=1
bash ./train.sh # more arguments in this file
# default config file is verl/trainer/config/ppo_trainer.yaml
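As a rough, assumption-laden sanity check on the batch settings above (how verl actually partitions rollouts and updates may differ; this is only back-of-the-envelope arithmetic):

```python
# Back-of-the-envelope batch arithmetic for the settings above (assumptions only).
TRAIN_BATCH_SIZE = 128  # prompts rolled out per training step
PPO_BATCH_SIZE = 64     # samples consumed per PPO update
MICRO_BATCH_SIZE = 1    # samples per forward/backward pass
N_GPUS = 1

# If micro-batches are accumulated, one PPO update would need roughly:
accum_steps = PPO_BATCH_SIZE // (MICRO_BATCH_SIZE * N_GPUS)
print(accum_steps, "forward/backward passes per update")  # 64 with these settings
```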
- By setting the following arguments in `train.sh`, you can visualize the trajectories (a sketch for stitching the saved frames into a GIF follows this list):
logging.log_images=True # set to True to log images
logging.log_image_dir=.log.debug/trajectory # set to the directory to save images
logging.log_image_step_size=1 # save image every _ steps
logging.log_n_image_per_batch=8 # save _ images per batch
- You may also need to install fonts for the figures to display correctly:
sudo apt-get install fonts-noto-cjk
- Example image for one trajectory:
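If you want to flip through a logged trajectory as a single animation, here is a small sketch; the PNG frame layout under `logging.log_image_dir` is an assumption, so adjust the glob to match what is actually written to disk:

```python
# Stitch logged trajectory frames into one GIF (frame layout is an assumption).
from pathlib import Path
from PIL import Image

frames = [Image.open(p) for p in sorted(Path(".log.debug/trajectory").glob("**/*.png"))]
if frames:
    frames[0].save("trajectory.gif", save_all=True, append_images=frames[1:],
                   duration=500, loop=0)
    print(f"wrote trajectory.gif with {len(frames)} frames")
else:
    print("no frames found; check logging.log_image_dir")
```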
Please see the cases/ directory. There are only a limited number of cases for now, including reward hacking and the "suck moment"; we will add more cases soon.
*: Project Lead; †: Advising. Remaining authors are in alphabetical order.
We thank DeepSeek for providing the DeepSeek-R1 model and ideas. We thank the veRL team for their infrastructure. We thank the TinyZero team for their discoveries that inspired our early exploration. We thank Yiping Lu, Runxin Xu, and Kyunghyun Cho for insightful discussions.
@misc{RAGEN,
author = {Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Manling Li},
title = {RAGEN: A General-Purpose Reasoning Agent Training Framework},
year = {2025},
organization = {GitHub},
url = {https://github.com/ZihanWang314/ragen},
}
Phase 1: Understanding the Repository
- Explore the file structure: (Completed) I've examined the file structure and identified key directories and files related to model loading, training, data handling, and configuration.
- Identify the model loading mechanism: (Completed) The DeepSeek model clone is implemented using Megatron-LM, as confirmed by the files in `verl/models/llama/megatron`.
- Understand the training process: (Completed) The repository supports both PPO and SFT, with training scripts located in `verl/trainer`. `main_ppo.py` is used for PPO training, and `fsdp_sft_trainer.py` is used for SFT training.
- Analyze the data loading and formatting: (Completed) The `ragen/utils/dataset.py` file defines a custom `Dataset` class for trajectory-based data, and `verl/utils/dataset/sft_dataset.py` defines a class for SFT data. The `transform` method in `ragen/utils/dataset.py` and the `__getitem__` method in `verl/utils/dataset/sft_dataset.py` are used to format the data.
- Identify the fine-tuning task: (In Progress) The fine-tuning task depends on the configuration files used with the training scripts. The `cases` directory contains example data for reward hacking and "suck moment" scenarios, which may provide clues about the fine-tuning task.
Phase 2: Replicating Results Locally
- Identify the necessary dependencies: (Completed) The `requirements.txt` file lists the required Python packages.
- Set up the environment:
  - I will guide the user to create a new virtual environment using `conda` or `venv`.
  - I will guide the user to install the required packages using `pip install -r requirements.txt`.
  - I will guide the user to ensure that CUDA is properly configured and that PyTorch can access the GPU (a minimal check follows this list).
- Run the training script:
  - I will guide the user to run the PPO training script using `python verl/trainer/main_ppo.py --config-path verl/trainer/config --config-name ppo_trainer`.
  - I will guide the user to run the SFT training script using `python verl/trainer/fsdp_sft_trainer.py --config-path verl/trainer/config --config-name sft_trainer`.
  - I will guide the user to modify the config files as needed.
- Verify the results:
  - I will guide the user to monitor the training logs and metrics.
  - I will guide the user to compare the results with the original paper or repository.
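For the CUDA/GPU check mentioned in the environment-setup step above, a minimal generic snippet:

```python
# Confirm that PyTorch can reach the GPU before launching training.
import torch

if torch.cuda.is_available():
    print("GPUs:", torch.cuda.device_count(), "|", torch.cuda.get_device_name(0))
else:
    print("CUDA not available; check drivers, CUDA_HOME, and the installed torch build")
```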
Phase 3: Fine-tuning on Custom Tasks
- Understand the data format:
  - For SFT, the data should be in parquet files with columns for prompts and responses. The `SFTDataset` class in `verl/utils/dataset/sft_dataset.py` provides more details on the expected format (a toy example follows this list).
  - For PPO, the data should be in a trajectory format, as handled by the `Dataset` class in `ragen/utils/dataset.py`.
- Modify the training script:
  - I will guide the user to modify the configuration files to use their own data files.
  - I will guide the user to modify the `RewardManager` class in `verl/trainer/main_ppo.py` to use their own reward function.
- Run the fine-tuning script:
  - I will guide the user to run the modified training script with their own data.
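As a toy illustration of the SFT data format described above: the column names `prompt`/`response` and the `<think>`/`<answer>` tags are assumptions for this sketch — check the `SFTDataset` class in `verl/utils/dataset/sft_dataset.py` for the keys and formatting it actually expects:

```python
# Write a toy SFT parquet file (column names and tags are assumptions; verify against SFTDataset).
import os
import pandas as pd

os.makedirs("data/custom_sft", exist_ok=True)
rows = [
    {
        "prompt": "You are in a Sokoban room. The box is to your right, with the target beyond it. What do you do?",
        "response": "<think>Pushing right moves the box onto the target.</think><answer>Right</answer>",
    },
]
pd.DataFrame(rows).to_parquet("data/custom_sft/train.parquet")
```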
- Environment setup completed:
  - Conda environment 'ragen' created with Python 3.9
  - CUDA toolkit 12.1.0 installed via conda
  - CUDA_HOME set to the conda environment path
- Current status:
  - Attempted building the flash-attn wheel for 24 hours without success
  - Build process showed normal indicators (CPU usage, nvcc processes)
  - The RTX 3080 machine (32GB RAM) appears insufficient for the compilation
  - A more powerful setup is likely required
- Start Anaconda:
  # Launch Anaconda Navigator or use the command line
  conda activate base
- Navigate to the project:
  cd path/to/ragen
  conda activate ragen
- Check the environment:
  # Verify the CUDA toolkit
  nvcc --version
  # Check running processes
  tasklist | findstr "python pip conda nvcc cl"
- Resume installation:
  # Set the environment variable
  set CUDA_HOME=%CONDA_PREFIX%
  # Attempt the flash-attn installation
  pip3 install flash-attn --no-build-isolation
- Monitor progress:
  - Check Task Manager for CPU/GPU usage
  - Watch for the nvcc.exe process
  - A normal build shows high CPU and low GPU usage
Note: If the build fails after >1 hour, consider:
- Checking system logs for errors
- Verifying all CUDA prerequisites (a quick pre-flight check follows below)
- Trying a different flash-attn version
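Before retrying the build, a small standard-library check that the prerequisites mentioned above are visible to the shell that will run pip (works on both Linux and Windows):

```python
# Quick pre-flight check before retrying the flash-attn build.
import os
import shutil

print("CUDA_HOME:", os.environ.get("CUDA_HOME", "<not set>"))
print("nvcc on PATH:", shutil.which("nvcc") or "<not found>")
print("cl.exe on PATH (Windows builds):", shutil.which("cl") or "<not found>")
```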