Ray is designed to recover transparently from worker, node, and network failures. This page describes the fault-tolerance model and the knobs you can use to tune it.

Task fault tolerance

By default, Ray retries failed tasks up to three times. Failures that trigger a retry include:
  • The worker process crashing (for example, a segfault, an out-of-memory kill, or the task calling os._exit)
  • The node hosting the worker dying
  • The driver disconnecting and reconnecting
Tune retries with max_retries:
@ray.remote(max_retries=10)
def flaky():
    ...
Set retry_exceptions=True (or a list of exception types) to retry on application errors:
@ray.remote(max_retries=5, retry_exceptions=[ConnectionError])
def network_call():
    ...
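When every retry fails, the original exception surfaces at ray.get wrapped in ray.exceptions.RayTaskError. A minimal end-to-end sketch, with a contrived failure rate for illustration:
import random
import ray

ray.init()

@ray.remote(max_retries=5, retry_exceptions=[ConnectionError])
def network_call():
    # Fails often enough that some retries are exercised.
    if random.random() < 0.5:
        raise ConnectionError("transient network error")
    return "ok"

try:
    print(ray.get(network_call.remote()))
except ray.exceptions.RayTaskError:
    # Raised only after all 5 retries have failed.
    print("gave up after retries")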

Lineage reconstruction

If a task’s result is lost because of a node failure, Ray can recompute it by re-executing the task that produced it and, transitively, any upstream tasks whose outputs are also missing. Lineage reconstruction is enabled by default for tasks.

Actor fault tolerance

By default, an actor that crashes is not restarted. Set max_restarts to restart automatically:
@ray.remote(max_restarts=5, max_task_retries=3)
class Worker:
    ...
  • max_restarts: how many times Ray will restart the actor process.
  • max_task_retries: how many times Ray will retry an actor method call that was lost or interrupted because the actor died; the call is resubmitted once the actor has restarted.
After all restarts are exhausted, calls to the actor raise RayActorError.
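From the caller's side, an exhausted actor looks like this; a minimal sketch (the do_work method is illustrative, and in practice the actor would have to crash repeatedly to reach this point):
import ray

ray.init()

@ray.remote(max_restarts=5, max_task_retries=3)
class Worker:
    def do_work(self):
        ...

w = Worker.remote()
try:
    ray.get(w.do_work.remote())
except ray.exceptions.RayActorError:
    # Raised once the actor process has died and all 5 restarts are used up.
    ...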

Detached actors

Detached actors (lifetime="detached") outlive the job that created them. Configure their restarts deliberately, since there is no driver left to manage their lifecycle if they fail.
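A minimal sketch of a detached, restartable actor that another job can look up by name (the name shared_cache and the cache methods are illustrative):
import ray

ray.init()

@ray.remote(max_restarts=-1)  # -1 restarts the actor indefinitely
class SharedCache:
    def __init__(self):
        # Reruns on every restart, so in-memory state starts empty again.
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

# Created once; survives after the creating driver exits.
SharedCache.options(name="shared_cache", lifetime="detached").remote()

# Any later job can retrieve it by name.
cache = ray.get_actor("shared_cache")
ray.get(cache.put.remote("answer", 42))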

Object fault tolerance

Objects in the object store are not replicated. If the only copy of an object is lost, Ray recovers it through lineage reconstruction (recomputing it from the producing task). For long-lived intermediate results, persist them to durable storage (S3, GCS, NFS) rather than relying on lineage, as sketched below.
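One way to do that is to have the producing task write its result to durable storage and return only a path; a sketch assuming a shared, durable mount at /mnt/shared (the path and the placeholder computation are illustrative):
import pickle
import ray

CHECKPOINT_DIR = "/mnt/shared/checkpoints"  # assumed shared, durable mount (e.g., NFS)

@ray.remote
def produce(part: int) -> str:
    result = [i * i for i in range(part * 1000, (part + 1) * 1000)]  # placeholder work
    path = f"{CHECKPOINT_DIR}/part-{part}.pkl"
    with open(path, "wb") as f:
        pickle.dump(result, f)
    # The returned path is tiny, so losing it is cheap; the heavy data lives on disk.
    return path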

Node fault tolerance

If a worker node dies:
  • All tasks scheduled on it are retried (up to max_retries).
  • All actors on it are killed; restartable actors are restarted elsewhere.
  • All objects whose only copy lived on the dead node are reconstructed via lineage if possible.
If the head node dies:
  • Without GCS fault tolerance, the cluster terminates.
  • With external GCS storage (Redis), the head node restarts and the cluster keeps running.
See cluster-level GCS fault tolerance.

Object spilling and disk failures

When the object store fills up, Ray spills cold objects to local disk. If that disk fails and a spilled object is later needed, Ray reconstructs it from lineage when possible, or raises an error.
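The spill location can be pointed at a more reliable local disk through _system_config when starting Ray; a sketch (the directory path is illustrative):
import json
import ray

ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/nvme/ray-spill"}}
        )
    }
)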

Best practices

For long pipelines, checkpoint intermediate results to durable storage. Lineage reconstruction works, but a single lost object can cascade into re-executing a long chain of upstream tasks, adding significant latency.
Actors with max_restarts > 0 that hold external state (open files, network connections) must re-establish that state in __init__, because Ray reruns __init__ on every restart; see the sketch below.
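A minimal sketch, using sqlite3 as a stand-in for any external connection (the database path and schema are illustrative):
import sqlite3
import ray

@ray.remote(max_restarts=5, max_task_retries=3)
class DatabaseWriter:
    def __init__(self, db_path: str):
        # Runs again after every restart, so the connection is always re-established here.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS rows (value TEXT)")

    def write(self, value: str):
        self.conn.execute("INSERT INTO rows (value) VALUES (?)", (value,))
        self.conn.commit()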

Next steps

Patterns

Application-level fault tolerance patterns.

Cluster fault tolerance

GCS, head-node, and worker-node fault tolerance.