
Ray Train is designed to keep long-running jobs alive through worker, node, and transient storage failures.

What gets retried

When any worker fails (process crash, OOM, node loss), Ray Train tears down the entire worker group, restarts the workers, and restores training from the last reported checkpoint. The number of restart attempts is bounded by FailureConfig:
from ray.train import RunConfig, FailureConfig

run_config = RunConfig(failure_config=FailureConfig(max_failures=3))
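As a minimal sketch (assuming a train_loop_per_worker like the one defined in the next section), the run_config is passed to the trainer alongside the scaling configuration:
from ray.train import RunConfig, FailureConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_config = RunConfig(failure_config=FailureConfig(max_failures=3))

trainer = TorchTrainer(
    train_loop_per_worker,  # the training function run on each worker
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=run_config,
)
trainer.fit()  # the worker group is restarted automatically on failure, up to 3 times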

Resuming from a checkpoint

After a restart, each worker can recover state from the latest reported checkpoint:
import torch
import ray.train


def train_loop_per_worker(config):
    model = build_model()  # user-defined model and optimizer setup
    optim = build_optim(model)
    start_epoch = 0

    # Returns the latest reported checkpoint after a restart, otherwise None.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(f"{ckpt_dir}/state.pt")
            model.load_state_dict(state["model"])
            optim.load_state_dict(state["optim"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        ...
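The per-epoch body is elided above. For completeness, here is a hedged sketch of the reporting side, assuming the same state.pt layout; save_and_report is a hypothetical helper that writes the state to a temporary directory and reports it, making it the checkpoint restored above:
import os
import tempfile

import torch
import ray.train

def save_and_report(model, optim, epoch, metrics):
    # Hypothetical helper: write framework-native state, then report it so it
    # becomes the checkpoint that ray.train.get_checkpoint() returns after a restart.
    with tempfile.TemporaryDirectory() as tmp_dir:
        torch.save(
            {"model": model.state_dict(), "optim": optim.state_dict(), "epoch": epoch},
            os.path.join(tmp_dir, "state.pt"),
        )
        ray.train.report(metrics, checkpoint=ray.train.Checkpoint.from_directory(tmp_dir))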

Resuming from a previous run

To resume a run after the driver process exits:
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer.restore(
    "s3://bucket/runs/my-run",
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
restore reads the run’s metadata from the given path, including the latest checkpoint, and re-creates the trainer so that fit() picks up where the previous run stopped.
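One common pattern, sketched here under the assumption that the run name ("my-run") and storage path stay fixed across launches, is to restore when a previous run exists and create a fresh trainer otherwise; can_restore reports whether the path holds a restorable run:
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_path = "s3://bucket/runs/my-run"  # assumed run location

if TorchTrainer.can_restore(run_path):
    # Pick up the interrupted run, including its latest checkpoint.
    trainer = TorchTrainer.restore(
        run_path,
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
else:
    # No previous run found: start a new one under the same name and path.
    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        run_config=RunConfig(name="my-run", storage_path="s3://bucket/runs"),
    )
trainer.fit()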

Spot instances

For preemptible / spot workers, set:
FailureConfig(max_failures=-1)  # unlimited retries
Ray Train will keep restarting the job whenever workers come back. Pair with a small sync_period so checkpoints land in durable storage frequently.
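A sketch of that combination, with a placeholder bucket and a 60-second sync period:
from ray.train import RunConfig, FailureConfig, SyncConfig

run_config = RunConfig(
    storage_path="s3://bucket/runs",  # durable storage for checkpoints (placeholder)
    failure_config=FailureConfig(max_failures=-1),  # retry as long as workers return
    sync_config=SyncConfig(sync_period=60),  # sync to durable storage roughly every 60 s
)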

Storage failures

If storage_path is temporarily unreachable (S3 throttling, network blip), Ray retries the upload. If checkpoints are persisted locally first, training continues even if uploads lag.

Best practices

Test your recovery path. Kill a worker mid-training and verify that the run resumes from the latest checkpoint with the same metrics. It’s the only way to know your serialization is correct.
Don’t use Python pickle as your final serialization format. Use framework-native saves (torch.save, model.save_pretrained) so checkpoints stay portable.

Next steps

Checkpointing

Checkpoint API in depth.

Run config

Failure handling and retries.