RLlib includes implementations of major RL algorithms. Each has a
Config class that exposes the algorithm’s hyperparameters in a typed builder.
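For example, a PPO run built through its config class might look like the following minimal sketch (assuming a recent Ray/RLlib install; the env ID and hyperparameter values are illustrative, not tuned):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Each setter returns the config object, so calls chain into a typed builder.
config = (
    PPOConfig()
    .environment("CartPole-v1")      # any registered Gymnasium env ID
    .training(lr=3e-4, gamma=0.99)   # algorithm hyperparameters
)

algo = config.build()   # construct the PPO Algorithm from the config
print(algo.train())     # one training iteration; returns a dict of metrics
```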
## On-policy

- **PPO**: Default choice for continuous and discrete control; stable and easy to tune.
- **APPO**: Asynchronous PPO; higher throughput on large-scale clusters (scaling sketch below).
- **IMPALA**: Distributed actor-critic with V-trace; used at scale for game and robotics tasks.
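Throughput for the asynchronous algorithms comes from parallel sampling. A hedged sketch of scaling out APPO (the worker counts are illustrative; older RLlib versions spell this `.rollouts(num_rollout_workers=...)` instead of `.env_runners(...)`):

```python
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("CartPole-v1")
    # Collect experience on several parallel env-runner actors,
    # each stepping multiple environment copies.
    .env_runners(num_env_runners=8, num_envs_per_env_runner=4)
)
algo = config.build()
```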
## Off-policy

- **DQN / Rainbow**: Discrete action spaces; uses a replay buffer.
- **SAC**: Continuous control; entropy-regularized actor-critic (configuration sketch below).
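As a sketch of an off-policy setup, SAC pairs the usual training parameters with a replay-buffer config (the capacity and `tau` values here are illustrative assumptions, not tuned defaults):

```python
from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment("Pendulum-v1")   # a continuous-control env
    .training(
        tau=0.005,                                    # soft target-network update rate
        replay_buffer_config={"capacity": 100_000},   # off-policy replay buffer
    )
)
```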
## Offline RL

- **Behavior Cloning (BC)**: Supervised pre-training from logged trajectories (sketch below).
- **MARWIL**: Imitation with advantage weighting for higher-quality demonstrations.
- **CQL**: Conservative Q-learning for offline datasets.
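Offline algorithms read logged experience instead of stepping a live environment. A minimal BC sketch (the input path is a placeholder, and the exact `offline_data` options depend on the RLlib version and data format):

```python
from ray.rllib.algorithms.bc import BCConfig

config = (
    BCConfig()
    .environment("CartPole-v1")                    # supplies observation/action spaces
    .offline_data(input_="/tmp/logged-episodes")   # placeholder path to recorded data
)
```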
## Multi-agent

Most algorithms support multi-agent training via the multi-agent API: specify a set of policies and a mapping function from agent ID to policy (see the sketch after the table below).

## Pick an algorithm
| Use case | Recommended starting point |
|---|---|
| Continuous control | SAC |
| Discrete action spaces | PPO or DQN |
| Many parallel envs, simple network | IMPALA |
| Imitation from logs | BC, MARWIL |
| Offline dataset, no env access | CQL |
| Multi-agent | PPO with multi-agent config |
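For the multi-agent row above, a hedged sketch of policies plus a mapping function (the env name, policy IDs, and mapping rule are illustrative; the exact `policy_mapping_fn` signature varies across RLlib versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_multi_agent_env")   # placeholder: a registered multi-agent env
    .multi_agent(
        # Two policies; every agent ID is routed to exactly one of them.
        policies={"attacker", "defender"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: (
            "attacker" if str(agent_id).startswith("attacker") else "defender"
        ),
    )
)
```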
## Next steps

- **Training**: Inside an RLlib training iteration.
- **RL modules**: Custom policy networks.