
A single call to algo.train() runs one iteration of the loop:
1. Rollouts: Each EnvRunner steps its environment(s), producing a batch of episodes. The exploration policy uses the latest weights from the central RLModule.

2. Postprocessing: Connectors compute returns, advantages, and any algorithm-specific batch-level fields.

3. Update: The collected batch flows to the Learner(s). Each Learner runs the loss and updates weights.

4. Sync: Updated weights are broadcast back to all EnvRunners.

5. Metrics: Per-iteration metrics (episode return, learner stats, sampler timings) are aggregated and returned.

Configure the loop

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .env_runners(num_env_runners=8, num_envs_per_env_runner=4)
    .learners(num_learners=2, num_gpus_per_learner=1)
    .training(train_batch_size=8000, num_epochs=10, minibatch_size=512)
)
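
To run the loop, build an Algorithm from the config and call train() repeatedly. A minimal sketch, assuming the new API stack; "CartPole-v1" is an illustrative environment choice, and build_algo() is the newer entry point (older Ray versions use config.build()):

algo = (
    config
    .environment("CartPole-v1")  # illustrative environment, not part of the config above
    .build_algo()  # config.build() on older Ray versions
)

for _ in range(5):
    result = algo.train()  # one full rollout -> postprocess -> update -> sync iteration

algo.stop()  # shut down EnvRunner and Learner actors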

Inspect a single iteration

result = algo.train()
print(result["env_runners"]["episode_return_mean"])  # mean return over recent episodes
print(result["learners"]["__all_modules__"]["total_loss"])  # loss aggregated over all modules
print(result["env_runners"]["sample"])  # time spent collecting rollouts
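
The result dict is nested several levels deep; printing it whole is the quickest way to discover which keys are available. A small sketch using the standard library:

from pprint import pprint

pprint(result, depth=2)  # show the top two levels of the metrics dict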

Custom loops

For full control, drop down below algo.train() to the per-iteration logic that algo.step invokes (on the legacy stack this lives in Algorithm.training_step). The new-stack equivalent is to subclass the algorithm and override the iteration logic.
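
A minimal sketch of the subclass-and-override pattern, assuming training_step() as the override point; the class name and the pass-through body are purely illustrative:

from ray.rllib.algorithms.ppo import PPO

class MyPPO(PPO):  # hypothetical subclass name
    def training_step(self):
        # Custom sampling/update logic goes here. This sketch simply
        # delegates to the built-in PPO iteration and returns its results.
        return super().training_step()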

Stop conditions

Use Ray Tune’s stop config:
from ray import tune

tune.run(
    "PPO",
    config=config.to_dict(),
    stop={"env_runners/episode_return_mean": 200, "training_iteration": 100},
)
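
Without Tune, the same conditions can be checked in a plain Python loop. A sketch that reuses the algo and the result keys from the sections above:

# Manual stop loop mirroring the Tune stop dict above.
for _ in range(100):  # training_iteration limit
    result = algo.train()
    if result["env_runners"]["episode_return_mean"] >= 200:
        break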

Next steps

Checkpoints

Save and restore RLlib state.

Offline RL

Skip rollouts and train from logged data.