This page collects the most frequent issues users hit when running Ray and how to debug them.
Workers crash with RayOutOfMemoryError
A task or actor exceeded its memory budget.
- Check the worker logs in /tmp/ray/session_latest/logs/worker-*.log.
- Reduce batch size or split work into smaller tasks.
- Increase the worker’s memory request: @ray.remote(memory=4 * 1024**3).
- Use streaming APIs (ray.data.Dataset) for datasets that don’t fit in memory.
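As a minimal sketch of splitting work into smaller tasks, the helper below cuts an input into fixed-size chunks and submits one task per chunk with an explicit memory request. The names `chunked`, `run_in_small_tasks`, and `process_chunk` are hypothetical, and the per-chunk work (a sum) is only a placeholder. Note that `memory` is a request the scheduler uses for placement; treat it as an assumption that it prevents a crash in your setup.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_in_small_tasks(items: Sequence[int], size: int) -> List[int]:
    """Submit one Ray task per chunk; requires Ray to be installed."""
    import ray  # imported lazily so `chunked` works without Ray

    # Each task requests 4 GiB from the scheduler, matching the bullet above.
    @ray.remote(memory=4 * 1024**3)
    def process_chunk(chunk: List[int]) -> int:
        return sum(chunk)  # hypothetical per-chunk work

    refs = [process_chunk.remote(chunk) for chunk in chunked(items, size)]
    return ray.get(refs)
```

Smaller chunks lower the peak memory of each task at the cost of more scheduling overhead, so the chunk size is worth tuning against the worker logs.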
Object store fills up
Symptom: tasks block indefinitely, or you see “Object store memory used reached threshold”.
- Reduce the number of in-flight refs by gating with ray.wait.
- Increase object_store_memory at ray.init or via ray start --object-store-memory.
- Enable spilling to disk.
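The sketch below illustrates two of the points above: building a filesystem spilling configuration for `ray.init`, and capping the number of in-flight refs with `ray.wait`. The helper names and the `/tmp/spill` directory are assumptions for illustration, and the `_system_config` key is an internal option whose shape may differ between Ray versions.

```python
import json
from typing import Any, Dict, Iterable, List


def spilling_init_kwargs(directory: str) -> Dict[str, Any]:
    """Build keyword arguments for ray.init that spill objects to `directory`."""
    return {
        "_system_config": {
            # Ray expects the spilling configuration as a JSON string.
            "object_spilling_config": json.dumps(
                {"type": "filesystem", "params": {"directory_path": directory}}
            )
        }
    }


def init_with_spilling(directory: str = "/tmp/spill") -> None:
    """Start Ray with spilling to `directory`; requires Ray to be installed."""
    import ray  # lazy import keeps the helper above usable without Ray

    ray.init(**spilling_init_kwargs(directory))


def bounded_map(remote_fn: Any, inputs: Iterable[Any], max_in_flight: int) -> List[Any]:
    """Run `remote_fn` over `inputs`, keeping at most `max_in_flight` refs pending.

    `remote_fn` is assumed to be a function decorated with @ray.remote.
    Results are returned in completion order, not input order.
    """
    import ray

    in_flight: List[Any] = []
    results: List[Any] = []
    for item in inputs:
        if len(in_flight) >= max_in_flight:
            # Block until one task finishes before submitting another.
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(remote_fn.remote(item))
    results.extend(ray.get(in_flight))
    return results
```

Because `bounded_map` drops each ref as soon as its result is fetched, it also keeps the driver from accumulating references.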
ray.get hangs
- Check the dashboard for stuck tasks (ray dashboard).
- A task may be waiting on a missing resource (e.g., num_gpus=1 requested but no GPUs available). Inspect cluster resources with ray.cluster_resources().
- A circular dependency between tasks can cause a deadlock.
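One way to tell a missing resource apart from a slow task is to bound the wait and compare the request against what the cluster advertises. The sketch below assumes Ray is already initialized; `unmet_resources` and `get_or_report` are hypothetical helper names, and the caller passes in the resource shape the task asked for.

```python
from typing import Any, Dict


def unmet_resources(requested: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Return the shortfall for each requested resource the cluster cannot cover."""
    shortfall: Dict[str, float] = {}
    for name, amount in requested.items():
        missing = amount - available.get(name, 0.0)
        if missing > 0:
            shortfall[name] = missing
    return shortfall


def get_or_report(ref: Any, requested: Dict[str, float], timeout_s: float = 30.0) -> Any:
    """Fetch `ref`, or raise with a resource report if it does not finish in time."""
    import ray  # lazy import; requires Ray to be installed and initialized
    from ray.exceptions import GetTimeoutError

    try:
        return ray.get(ref, timeout=timeout_s)
    except GetTimeoutError:
        gap = unmet_resources(requested, ray.cluster_resources())
        raise RuntimeError(f"ray.get timed out; unmet resources: {gap or 'none'}")
```

For example, `unmet_resources({"GPU": 1}, {"CPU": 8})` reports a shortfall for `GPU`, which points at a scheduling problem. An empty report suggests a slow task or a deadlock instead.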
Actors don’t restart
Ensure max_restarts is set on the actor. Once max_restarts is exceeded, calls raise RayActorError. Catch the error and recreate the actor explicitly.
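A minimal sketch of the catch-and-recreate pattern follows. `call_with_recreate` is a hypothetical helper that is independent of Ray. `counter_example` shows how it could be wired to an actor declared with `max_restarts`. The `Counter` actor and the retry counts are illustrative assumptions.

```python
from typing import Callable, Tuple, Type, TypeVar

H = TypeVar("H")
R = TypeVar("R")


def call_with_recreate(
    make_handle: Callable[[], H],
    call: Callable[[H], R],
    fatal_error: Type[BaseException],
    max_recreations: int = 1,
) -> Tuple[H, R]:
    """Run `call(handle)`; on `fatal_error`, build a fresh handle and retry."""
    handle = make_handle()
    for attempt in range(max_recreations + 1):
        try:
            return handle, call(handle)
        except fatal_error:
            if attempt == max_recreations:
                raise  # out of recreations, surface the error to the caller
            handle = make_handle()  # the old actor is gone, so start a new one
    raise AssertionError("unreachable for non-negative max_recreations")


def counter_example() -> int:
    """Usage with a Ray actor; requires Ray to be installed."""
    import ray
    from ray.exceptions import RayActorError

    # max_restarts lets Ray restart the actor process up to 3 times;
    # max_task_retries retries method calls interrupted by a restart.
    @ray.remote(max_restarts=3, max_task_retries=2)
    class Counter:
        def __init__(self) -> None:
            self.value = 0

        def increment(self) -> int:
            self.value += 1
            return self.value

    _, result = call_with_recreate(
        Counter.remote,
        lambda actor: ray.get(actor.increment.remote()),
        RayActorError,
    )
    return result
```

A recreated actor starts from a fresh `__init__`, so any state that must survive has to be checkpointed outside the actor.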
Slow runtime_env setup
- First-time package installs are cached per node. Subsequent tasks reuse the venv.
- Use uv instead of pip for faster installs.
- For large environments, build a Docker image and use the container runtime env.
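As a sketch of the uv option, the helper below builds a `runtime_env` dictionary with a `uv` field listing requirements. This assumes a Ray version whose runtime environments accept the `uv` key. The helper name and the example package pin are hypothetical.

```python
from typing import Any, Dict, List


def uv_runtime_env(packages: List[str]) -> Dict[str, Any]:
    """Build a runtime_env that asks Ray to install `packages` with uv."""
    return {"uv": list(packages)}


def init_with_uv(packages: List[str]) -> None:
    """Start Ray with the uv-based environment; requires Ray to be installed."""
    import ray  # lazy import keeps the helper above usable without Ray

    ray.init(runtime_env=uv_runtime_env(packages))
```

A call such as `init_with_uv(["requests==2.32.3"])` would apply the environment to the whole job. The same dictionary can also be passed per task or actor through the `runtime_env` option.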
Cluster autoscaler doesn’t add nodes
- Check the autoscaler log: tail -f /tmp/ray/session_latest/logs/monitor*.
- Confirm the resource you’re requesting is advertised by some node type in the cluster config.
- On Kubernetes, verify the RayCluster CR has enableInTreeAutoscaling: true.
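The check on advertised resources can be sketched as a small function over the `available_node_types` section of a cluster launcher config, assuming it has already been loaded into a dictionary. The function name is hypothetical, and only explicitly declared `resources` entries are considered. Node types that rely on auto-detected resources are reported as not fitting.

```python
from typing import Any, Dict, List, Mapping


def node_types_that_fit(
    request: Mapping[str, float],
    available_node_types: Mapping[str, Mapping[str, Any]],
) -> List[str]:
    """Return the node type names whose declared resources cover `request`."""
    fits: List[str] = []
    for name, spec in available_node_types.items():
        resources: Dict[str, float] = dict(spec.get("resources") or {})
        if all(resources.get(key, 0) >= amount for key, amount in request.items()):
            fits.append(name)
    return fits
```

If the returned list is empty for a pending request, no declared node type can satisfy it, which would explain why the autoscaler does not add nodes.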
“Connection refused” on ray.init(address="auto")
- Confirm a head node is running: ray status.
- Check that RAY_ADDRESS points at the right host.
- Firewall: open ports 6379 (GCS) and 10001 (client server), plus the worker ports.
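To separate a firewall problem from a head node that is not running, a plain TCP probe from the client machine is often enough. The sketch below uses only the standard library. The function names are hypothetical, and it probes only the two fixed ports named above, since the worker port range depends on how the cluster was started.

```python
import socket
from typing import Dict


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ray_ports(host: str) -> Dict[int, bool]:
    """Probe the GCS port (6379) and the client server port (10001) on `host`."""
    return {port: port_is_open(host, port) for port in (6379, 10001)}
```

If both ports are closed from the client but `ray status` succeeds on the head node itself, the firewall is the likely cause.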
High driver memory
The driver holds references for every ObjectRef created. Drop refs you no longer need.
Profiling
Use ray timeline to dump a Chrome trace for an interval, or call ray.timeline() from Python.
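Besides opening the trace in a viewer, the dumped events can be summarized directly. The sketch below assumes the file is a JSON array in the Chrome trace event format, where complete events have `ph` set to `"X"` and a `dur` field in microseconds. The helper names and the output path are hypothetical.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def total_duration_by_name(events: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Sum the `dur` field of complete ("X") events, grouped by event name."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return dict(totals)


def dump_and_summarize(path: str = "/tmp/ray-timeline.json") -> Dict[str, float]:
    """Write the timeline to `path` and summarize it; requires a running Ray."""
    import ray  # lazy import keeps the summary helper usable without Ray

    ray.timeline(filename=path)
    with open(path) as f:
        return total_duration_by_name(json.load(f))
```

Sorting the summary by total duration gives a quick view of which task names dominate the interval.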
Next steps
- Observability: Dashboard, metrics, and logs.
- Fault tolerance: Failure recovery semantics.