Ray is designed to recover transparently from worker, node, and network failures. This page describes the fault-tolerance model and the knobs you can use to tune it.

Task fault tolerance

By default, Ray retries failed tasks up to three times. Failures that trigger a retry include:
  • The worker process crashing (for example, a segfault, an out-of-memory kill, or the task calling os._exit)
  • The node hosting the worker dying
  • The driver disconnecting and reconnecting
Tune retries with max_retries:
@ray.remote(max_retries=10)
def flaky():
    ...
Set retry_exceptions=True (or a list of exception types) to retry on application errors:
@ray.remote(max_retries=5, retry_exceptions=[ConnectionError])
def network_call():
    ...
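When every retry fails, the original exception surfaces at ray.get wrapped in ray.exceptions.RayTaskError. A minimal end-to-end sketch, with a contrived failure rate for illustration:
import random
import ray

ray.init()

@ray.remote(max_retries=5, retry_exceptions=[ConnectionError])
def network_call():
    # Fails often enough that some retries are exercised.
    if random.random() < 0.5:
        raise ConnectionError("transient network error")
    return "ok"

try:
    print(ray.get(network_call.remote()))
except ray.exceptions.RayTaskError:
    # Raised only after all 5 retries have failed.
    print("gave up after retries")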

Lineage reconstruction

If a task’s result is lost because of a node failure, Ray can recompute it by re-executing the task that produced it and, transitively, any upstream tasks whose outputs are also missing. Lineage reconstruction is enabled by default for tasks.

Actor fault tolerance

By default, an actor that crashes is not restarted. Set max_restarts to restart automatically:
@ray.remote(max_restarts=5, max_task_retries=3)
class Worker:
    ...
  • max_restarts: how many times Ray will restart the actor process.
  • max_task_retries: how many times Ray will retry an actor method call that was lost or interrupted because the actor died; the call is resubmitted once the actor has restarted.
After all restarts are exhausted, calls to the actor raise RayActorError.
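From the caller's side, an exhausted actor looks like this; a minimal sketch (the do_work method is illustrative, and in practice the actor would have to crash repeatedly to reach this point):
import ray

ray.init()

@ray.remote(max_restarts=5, max_task_retries=3)
class Worker:
    def do_work(self):
        ...

w = Worker.remote()
try:
    ray.get(w.do_work.remote())
except ray.exceptions.RayActorError:
    # Raised once the actor process has died and all 5 restarts are used up.
    ...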

Detached actors

Detached actors (lifetime="detached") outlive the job that created them. Configure their restarts deliberately, since there is no driver left to manage their lifecycle if they fail.
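A minimal sketch of a detached, restartable actor that another job can look up by name (the name shared_cache and the cache methods are illustrative):
import ray

ray.init()

@ray.remote(max_restarts=-1)  # -1 restarts the actor indefinitely
class SharedCache:
    def __init__(self):
        # Reruns on every restart, so in-memory state starts empty again.
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

# Created once; survives after the creating driver exits.
SharedCache.options(name="shared_cache", lifetime="detached").remote()

# Any later job can retrieve it by name.
cache = ray.get_actor("shared_cache")
ray.get(cache.put.remote("answer", 42))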

Object fault tolerance

Objects in the object store are not replicated. If the only copy of an object is lost, Ray recovers it through lineage reconstruction (recomputing it from the producing task). For long-lived intermediate results, persist them to durable storage (S3, GCS, NFS) rather than relying on lineage, as sketched below.
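One way to do that is to have the producing task write its result to durable storage and return only a path; a sketch assuming a shared, durable mount at /mnt/shared (the path and the placeholder computation are illustrative):
import pickle
import ray

CHECKPOINT_DIR = "/mnt/shared/checkpoints"  # assumed shared, durable mount (e.g., NFS)

@ray.remote
def produce(part: int) -> str:
    result = [i * i for i in range(part * 1000, (part + 1) * 1000)]  # placeholder work
    path = f"{CHECKPOINT_DIR}/part-{part}.pkl"
    with open(path, "wb") as f:
        pickle.dump(result, f)
    # The returned path is tiny, so losing it is cheap; the heavy data lives on disk.
    return path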

Node fault tolerance

If a worker node dies:
  • All tasks scheduled on it are retried (up to max_retries).
  • All actors on it are killed; restartable actors are restarted elsewhere.
  • All objects whose only copy lived on the dead node are reconstructed via lineage if possible.
If the head node dies:
  • Without GCS fault tolerance, the cluster terminates.
  • With external GCS storage (Redis), the head node restarts and the cluster keeps running.
See cluster-level GCS fault tolerance.

Object spilling and disk failures

When the object store fills up, Ray spills cold objects to local disk. If that disk fails and a spilled object is later needed, Ray reconstructs it from lineage when possible, or raises an error.
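The spill location can be pointed at a more reliable local disk through _system_config when starting Ray; a sketch (the directory path is illustrative):
import json
import ray

ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/nvme/ray-spill"}}
        )
    }
)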

Best practices

For long pipelines, checkpoint intermediate results to durable storage. Lineage reconstruction works, but a single lost object can cascade into re-executing a long chain of upstream tasks, adding significant latency.
Actors with max_restarts > 0 that hold external state (open files, network connections) must re-establish that state in __init__, because Ray reruns __init__ on every restart; see the sketch below.
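A minimal sketch, using sqlite3 as a stand-in for any external connection (the database path and schema are illustrative):
import sqlite3
import ray

@ray.remote(max_restarts=5, max_task_retries=3)
class DatabaseWriter:
    def __init__(self, db_path: str):
        # Runs again after every restart, so the connection is always re-established here.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS rows (value TEXT)")

    def write(self, value: str):
        self.conn.execute("INSERT INTO rows (value) VALUES (?)", (value,))
        self.conn.commit()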

Next steps

Patterns

Application-level fault tolerance patterns.

Cluster fault tolerance

GCS, head-node, and worker-node fault tolerance.