
Ray Train is designed to keep long-running jobs alive through worker, node, and transient storage failures.

What gets retried

When any worker fails (process crash, OOM, node loss), Ray Train tears down the entire worker group, restarts the workers, and restores training from the last reported checkpoint. The number of restart attempts is bounded by FailureConfig:
from ray.train import RunConfig, FailureConfig

run_config = RunConfig(failure_config=FailureConfig(max_failures=3))
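As a minimal sketch (assuming a train_loop_per_worker like the one defined in the next section), the run_config is passed to the trainer alongside the scaling configuration:
from ray.train import RunConfig, FailureConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_config = RunConfig(failure_config=FailureConfig(max_failures=3))

trainer = TorchTrainer(
    train_loop_per_worker,  # the training function run on each worker
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=run_config,
)
trainer.fit()  # the worker group is restarted automatically on failure, up to 3 times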

Resuming from a checkpoint

After a restart, each worker can recover state from the latest reported checkpoint:
import torch
import ray.train


def train_loop_per_worker(config):
    model = build_model()  # user-defined model and optimizer setup
    optim = build_optim(model)
    start_epoch = 0

    # Returns the latest reported checkpoint after a restart, otherwise None.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(f"{ckpt_dir}/state.pt")
            model.load_state_dict(state["model"])
            optim.load_state_dict(state["optim"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        ...
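The per-epoch body is elided above. For completeness, here is a hedged sketch of the reporting side, assuming the same state.pt layout; save_and_report is a hypothetical helper that writes the state to a temporary directory and reports it, making it the checkpoint restored above:
import os
import tempfile

import torch
import ray.train

def save_and_report(model, optim, epoch, metrics):
    # Hypothetical helper: write framework-native state, then report it so it
    # becomes the checkpoint that ray.train.get_checkpoint() returns after a restart.
    with tempfile.TemporaryDirectory() as tmp_dir:
        torch.save(
            {"model": model.state_dict(), "optim": optim.state_dict(), "epoch": epoch},
            os.path.join(tmp_dir, "state.pt"),
        )
        ray.train.report(metrics, checkpoint=ray.train.Checkpoint.from_directory(tmp_dir))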

Resuming from a previous run

To resume a run after the driver process exits:
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer.restore(
    "s3://bucket/runs/my-run",
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
restore reads the run’s metadata from the given path, including the latest checkpoint, and re-creates the trainer so that fit() picks up where the previous run stopped.
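One common pattern, sketched here under the assumption that the run name ("my-run") and storage path stay fixed across launches, is to restore when a previous run exists and create a fresh trainer otherwise; can_restore reports whether the path holds a restorable run:
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_path = "s3://bucket/runs/my-run"  # assumed run location

if TorchTrainer.can_restore(run_path):
    # Pick up the interrupted run, including its latest checkpoint.
    trainer = TorchTrainer.restore(
        run_path,
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
else:
    # No previous run found: start a new one under the same name and path.
    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        run_config=RunConfig(name="my-run", storage_path="s3://bucket/runs"),
    )
trainer.fit()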

Spot instances

For preemptible / spot workers, set:
FailureConfig(max_failures=-1)  # unlimited retries
Ray Train will keep restarting the job whenever workers come back. Pair with a small sync_period so checkpoints land in durable storage frequently.
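A sketch of that combination, with a placeholder bucket and a 60-second sync period:
from ray.train import RunConfig, FailureConfig, SyncConfig

run_config = RunConfig(
    storage_path="s3://bucket/runs",  # durable storage for checkpoints (placeholder)
    failure_config=FailureConfig(max_failures=-1),  # retry as long as workers return
    sync_config=SyncConfig(sync_period=60),  # sync to durable storage roughly every 60 s
)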

Storage failures

If storage_path is temporarily unreachable (S3 throttling, network blip), Ray retries the upload. If checkpoints are persisted locally first, training continues even if uploads lag.

Best practices

Test your recovery path. Kill a worker mid-training and verify that the run resumes from the latest checkpoint with the same metrics. It’s the only way to know your serialization is correct.
Don’t use Python pickle as your final serialization format. Use framework-native saves (torch.save, model.save_pretrained) so checkpoints stay portable.

Next steps

Checkpointing

Checkpoint API in depth.

Run config

Failure handling and retries.