Ray Train is designed to keep long-running jobs alive through worker, node, and transient storage failures.
What gets retried
When any worker fails (process crash, OOM, node loss), Ray Train tears down the entire training group, restores the last checkpoint, and restarts the workers. The number of retries is bounded by FailureConfig:
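A minimal sketch of bounding retries, assuming a per-worker training function and a storage bucket of your own (the function body and bucket path are placeholders, not part of the original text):

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    ...  # per-worker training loop (placeholder)


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(
        name="fault-tolerant-run",
        storage_path="s3://my-bucket/ray-train",  # placeholder bucket
        # Retry the training group up to 3 times after a worker or node failure.
        failure_config=FailureConfig(max_failures=3),
    ),
)
result = trainer.fit()
```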
Resuming from a checkpoint
After a restart, each worker can recover state from the latest reported checkpoint:
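A sketch of in-loop recovery, assuming a simple PyTorch model and optimizer; ray.train.get_checkpoint() returns the latest reported checkpoint, or None on a fresh start:

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint


def train_func():
    model = torch.nn.Linear(10, 1)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    start_epoch = 0

    # On restart, every worker sees the latest reported checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 10):
        ...  # forward / backward / optimizer step
        # Report a checkpoint each epoch so a restart can resume from it.
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                },
                os.path.join(tmp_dir, "state.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )
```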
Resuming from a previous run
To resume a run after the driver process exits, use restore: it reads the run's metadata, including the latest checkpoint, and re-creates the trainer.
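A sketch of driver-level resumption, assuming the run was written to an experiment path of your own (the path below is a placeholder):

```python
from ray.train.torch import TorchTrainer

experiment_path = "s3://my-bucket/ray-train/fault-tolerant-run"  # placeholder path

if TorchTrainer.can_restore(experiment_path):
    # Rebuild the trainer from the run's metadata and latest checkpoint.
    trainer = TorchTrainer.restore(experiment_path)
    result = trainer.fit()
```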
Spot instances
For preemptible / spot workers, set sync_period so checkpoints land in durable storage frequently.
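A sketch under the assumption that sync_period is set through ray.train.SyncConfig and that retries are made generous to absorb preemptions (the values shown are illustrative):

```python
from ray.train import FailureConfig, RunConfig, SyncConfig

run_config = RunConfig(
    storage_path="s3://my-bucket/ray-train",        # durable storage (placeholder bucket)
    sync_config=SyncConfig(sync_period=60),         # sync every 60 seconds (illustrative)
    failure_config=FailureConfig(max_failures=-1),  # retry indefinitely on preemption
)
```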
Storage failures
If storage_path is temporarily unreachable (S3 throttling, a network blip), Ray retries the upload. If checkpoints are persisted locally first, training continues even if uploads lag.
Best practices
Next steps
Checkpointing: Checkpoint API in depth.
Run config: Failure handling and retries.