Offline RL learns from a static dataset of (s, a, r, s', done) transitions instead of fresh environment rollouts. Use it when an environment is expensive, dangerous, or simply unavailable.

Dataset format

Offline RL reads from Ray Data datasets containing the standard columns (observations, actions, rewards, next observations, and done flags). Use ray.rllib.offline.OfflineData to write or read RLlib-format data.
import ray

# Load previously recorded transitions from Parquet with Ray Data.
ds = ray.data.read_parquet("s3://bucket/offline-trajectories/")
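
If you produce your own data, a toy row might look like the sketch below. The column names here follow RLlib's SampleBatch convention ("obs", "actions", "rewards", "new_obs", "terminateds") and are an assumption to verify against the format your RLlib version expects:

import ray

# Hedged sketch: SampleBatch-style column names; check them against
# your RLlib version before writing real data.
toy = ray.data.from_items([
    {"obs": [0.02, -0.01, 0.03, 0.0], "actions": 1, "rewards": 1.0,
     "new_obs": [0.02, 0.18, 0.03, -0.28], "terminateds": False},
])
toy.write_parquet("/tmp/offline-toy/")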

Behavior cloning (BC)

import gymnasium as gym
from ray.rllib.algorithms.bc import BCConfig

# The spaces must match those the data was collected with;
# CartPole-v1's spaces serve as an example here.
env = gym.make("CartPole-v1")
obs_space, act_space = env.observation_space, env.action_space

config = (
    BCConfig()
    .environment(observation_space=obs_space, action_space=act_space)
    .offline_data(input_=ds)
)
algo = config.build()
algo.train()
BC fits a policy to imitate the demonstrator’s action distribution: plain supervised learning on the dataset’s (observation, action) pairs.
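
A quick way to sanity-check the cloned policy is to query it for an action. compute_single_action is RLlib’s convenience inference call (deprecated on the newest API stack, where you would run the RLModule’s forward_inference instead):

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, _ = env.reset()
# Ask the trained BC policy for an action on a live observation.
action = algo.compute_single_action(obs)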

MARWIL

Like BC, but weights each transition by the demonstrator’s advantage: actions that led to higher returns get more weight in the imitation loss.
from ray.rllib.algorithms.marwil import MARWILConfig

# beta controls the advantage weighting; beta=0.0 reduces MARWIL to BC.
config = MARWILConfig().offline_data(input_=ds).training(beta=1.0)
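
To see what the weighting does, here is a small conceptual illustration, not RLlib internals: MARWIL scales each action’s log-likelihood by exp(beta * advantage), so high-advantage actions dominate the loss:

import math

beta = 1.0
for advantage in (-1.0, 0.0, 2.0):
    # exp(beta * A) is the per-transition weight on the log-likelihood.
    print(advantage, round(math.exp(beta * advantage), 2))
# -1.0 -> 0.37, 0.0 -> 1.0, 2.0 -> 7.39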

CQL

Conservative Q-learning. Learns a Q-function with a regularizer that pushes down the value of out-of-distribution actions, making the resulting policy stay close to the dataset’s behavior policy.
from ray.rllib.algorithms.cql import CQLConfig
config = CQLConfig().offline_data(input_=ds)
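
The conservative regularizer itself is easy to state. The following is a conceptual PyTorch sketch of the penalty CQL adds to the usual TD loss, not RLlib’s internal implementation: it pushes down Q-values of sampled (potentially out-of-distribution) actions while leaving the dataset’s actions unpenalized:

import torch

def cql_penalty(q_sampled, q_data):
    # q_sampled: [batch, num_actions] Q-values for candidate actions
    #            sampled from the policy (possible OOD actions).
    # q_data:    [batch] Q-values for the actions in the dataset.
    # logsumexp soft-maximizes over candidates; subtracting the dataset
    # Q-values keeps in-distribution actions from being pushed down.
    return (torch.logsumexp(q_sampled, dim=1) - q_data).mean()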

Evaluate against a real env

To check how an offline-trained policy performs in a live environment, turn on online evaluation:
config = (
    config
    .evaluation(
        evaluation_interval=1,          # evaluate after every training iteration
        evaluation_num_env_runners=2,   # parallel env runners for evaluation
        evaluation_config={"env": "CartPole-v1"},  # evaluate in a real env
    )
)
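
The evaluation metrics appear under the "evaluation" key of each training result. The exact metric names vary across RLlib versions (for example "episode_reward_mean" on the old API stack), so printing the sub-dict is a safe way to discover them:

algo = config.build()
result = algo.train()
# Inspect the evaluation sub-dict; metric names differ across versions.
print(result["evaluation"])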

Best practices

Mix BC pre-training with online fine-tuning when an environment is available. Start with BC to get a reasonable initial policy, then switch to PPO or SAC, as sketched after this section.
The quality of offline RL depends heavily on dataset coverage. If the dataset doesn’t contain near-optimal trajectories, BC won’t learn an optimal policy.
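
A minimal sketch of such a handoff follows. It assumes both algorithms are configured with the same model architecture so the weights line up, and that bc_config is the BCConfig from the BC section above; get_policy(), get_weights(), and set_weights() are old-API-stack calls (on the new API stack you would transfer the RLModule state instead):

from ray.rllib.algorithms.ppo import PPOConfig

# bc_config: the BCConfig from the BC section above.
bc_algo = bc_config.build()
for _ in range(20):
    bc_algo.train()  # imitation pre-training

# Copy the cloned weights into a fresh PPO algorithm, then fine-tune online.
ppo_algo = PPOConfig().environment("CartPole-v1").build()
ppo_algo.get_policy().set_weights(bc_algo.get_policy().get_weights())
ppo_algo.train()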

Next steps

Algorithms: see the online algorithms too.

Replay buffers: for online off-policy training.