
This page collects the most frequent issues users hit when running Ray and how to debug them.

Workers crash with RayOutOfMemoryError

A task or actor exceeded its memory budget.
  • Check the worker logs in /tmp/ray/session_latest/logs/worker-*.log.
  • Reduce batch size or split work into smaller tasks.
  • Increase the worker’s memory request: @ray.remote(memory=4 * 1024**3).
  • Use streaming APIs (ray.data.Dataset) for datasets that don’t fit in memory.

Object store fills up

Symptom: tasks block indefinitely, or you see “Object store memory used reached threshold”.
  • Reduce the number of in-flight refs by gating with ray.wait.
  • Increase object_store_memory at ray.init or via ray start --object-store-memory.
  • Enable spilling to disk so objects that don’t fit in memory are written to local storage, e.g. via ray.init:
    _system_config={"automatic_object_spilling_enabled": True}
    

ray.get hangs

  • Check the dashboard for stuck tasks (ray dashboard).
  • A task may be waiting on a missing resource (e.g., num_gpus=1 requested but no GPUs available). Inspect cluster resources with ray.cluster_resources().
  • A circular dependency between tasks can cause a deadlock.

Actors don’t restart

Ensure max_restarts is set:
@ray.remote(max_restarts=3, max_task_retries=3)
class A: ...
After max_restarts is exceeded, calls raise RayActorError. Catch and recreate the actor explicitly.

Slow runtime_env setup

  • First-time package installs are cached per node; subsequent tasks with the same runtime_env reuse the cached environment.
  • Use uv instead of pip for faster installs:
    runtime_env={"uv": ["torch"]}
    
  • For large environments, build a Docker image and use the container runtime env.

Cluster autoscaler doesn’t add nodes

  • Check the autoscaler log: tail -f /tmp/ray/session_latest/logs/monitor*.
  • Confirm the resource you’re requesting is advertised by some node type in the cluster config.
  • On Kubernetes, verify the RayCluster CR has enableInTreeAutoscaling: true.

“Connection refused” on ray.init(address="auto")

  • Confirm a head node is running: ray status.
  • Check that RAY_ADDRESS points at the right host.
  • Firewall: open ports 6379 (GCS), 10001 (client server), and the worker ports.

High driver memory

The driver holds references for every ObjectRef created. Drop refs you no longer need.
del ref  # let the object store reclaim it

Profiling

Use ray timeline to dump a Chrome trace of recently executed tasks:
ray timeline -o /tmp/timeline.json
Or call ray.timeline() from Python.

Next steps

Observability

Dashboard, metrics, and logs.

Fault tolerance

Failure recovery semantics.