Off-policy algorithms (DQN, SAC, APEX-DQN) use replay buffers to store past experience. The Learner samples from the buffer instead of from fresh rollouts each step.

Default buffer

Each off-policy algorithm sets a sensible default buffer.
from ray.rllib.algorithms.dqn import DQNConfig

# DQN picks its default replay buffer automatically; no replay_buffer_config needed.
config = DQNConfig().environment("CartPole-v1")
algo = config.build()
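
To see what that default is before overriding it, you can print the buffer settings stored on the config (a quick check, assuming the replay_buffer_config attribute exposed by recent RLlib config objects):

# Show the algorithm's default buffer settings, e.g. type and capacity.
print(config.replay_buffer_config)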

Configure buffer size and sampling

config = config.training(
    replay_buffer_config={
        "type": "PrioritizedEpisodeReplayBuffer",
        "capacity": 1_000_000,  # max number of stored env timesteps
        "alpha": 0.6,  # prioritization strength (0.0 = uniform sampling)
        "beta": 0.4,  # importance-sampling correction for the sampling bias
    },
    train_batch_size=512,  # size of each batch sampled from the buffer
)
Buffer type                         Behavior
EpisodeReplayBuffer                 Uniform sampling.
PrioritizedEpisodeReplayBuffer      Prioritized sampling with importance-sampling weights.
MultiAgentEpisodeReplayBuffer       Multi-agent variant.
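
With the buffer configured, training proceeds as usual: rollouts fill the buffer and the Learner samples batches of train_batch_size from it. A minimal loop, assuming the DQN config built above:

# Build the algorithm and run a few training iterations; each train() call
# collects new experience into the buffer and samples from it for updates.
algo = config.build()
for _ in range(5):
    results = algo.train()
algo.stop()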

Custom buffers

Subclass ReplayBuffer and reference it from the config:
config = config.training(replay_buffer_config={"type": MyBuffer, ...})
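
For illustration only, a subclass might look like the sketch below. The name MyBuffer and the added counter are made up, and the exact add()/sample() signatures of the ReplayBuffer base class vary between Ray versions, so verify them before relying on this:

from ray.rllib.utils.replay_buffers import ReplayBuffer

class MyBuffer(ReplayBuffer):
    """Hypothetical buffer that counts how many timesteps were ever added."""

    def __init__(self, capacity=100_000, **kwargs):
        super().__init__(capacity=capacity, **kwargs)
        self.total_timesteps_added = 0

    def add(self, batch, **kwargs):
        # Track total inflow, then defer to the base implementation.
        self.total_timesteps_added += len(batch)
        super().add(batch, **kwargs)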

Best practices

For SAC and DQN, start with a buffer capacity of 10–50× the train batch size and tune from there. Larger buffers stabilize training but increase the time to first improvement.
Off-policy algorithms are memory-hungry. Keep an eye on ray.cluster_resources()["object_store_memory"] and tune target_max_block_size if you spill heavily.
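
A rough sketch of both practices; the numbers are illustrative starting points, not tuned recommendations:

import ray

ray.init(ignore_reinit_error=True)

# Rule of thumb: buffer capacity of roughly 10-50x the train batch size.
train_batch_size = 512
capacity = 50 * train_batch_size  # 25,600 timesteps; tune from here

# Total object store memory (bytes) available across the cluster.
print(ray.cluster_resources()["object_store_memory"])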

Next steps

Algorithms

Off-policy algorithms in RLlib.

Offline RL

Train from a static dataset.