Ray is designed to recover transparently from worker, node, and network failures. This page describes the fault-tolerance model and the knobs you can use to tune it.Documentation Index
Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Task fault tolerance
By default, Ray retries failed tasks up to three times. A failure includes:- The worker process crashing (segfault, OOM, the task calling
os._exit) - The node hosting the worker dying
- The driver disconnecting and reconnecting
max_retries:
retry_exceptions=True (or a list of exception types) to retry on application errors:
Lineage reconstruction
If a task’s result is lost due to a node failure, Ray can re-execute the producing task by replaying its lineage. This is enabled by default for tasks.Actor fault tolerance
By default, an actor that crashes is not restarted. Setmax_restarts to restart automatically:
max_restarts: how many times Ray will restart the actor process.max_task_retries: how many times Ray will retry each method call submitted to a restarted actor.
RayActorError.
Detached actors
Detached actors (lifetime="detached") outlive their creating job. Restart them carefully — there is no driver to manage their lifecycle.
Object fault tolerance
Object store data is replicated only by lineage reconstruction (recomputing from the producing task). For long-lived intermediate results, persist them to durable storage (S3, GCS, NFS) rather than relying on lineage.Node fault tolerance
If a worker node dies:- All tasks scheduled on it are retried (up to
max_retries). - All actors on it are killed; restartable actors are restarted elsewhere.
- All objects whose only copy lived on the dead node are reconstructed via lineage if possible.
- Without GCS fault tolerance, the cluster terminates.
- With external GCS storage (Redis), the head node restarts and the cluster keeps running.
Object spilling and disk failures
When the object store fills, Ray spills cold objects to disk. If disk fails and a spilled object is needed, Ray reconstructs from lineage when possible, or raises an error.Best practices
Next steps
Patterns
Application-level fault tolerance patterns.
Cluster fault tolerance
GCS, head-node, and worker-node fault tolerance.