
This page collects the most frequent issues users hit when running Ray and how to debug them.

Workers crash with RayOutOfMemoryError

A task or actor exceeded its memory budget.
  • Check the worker logs in /tmp/ray/session_latest/logs/worker-*.log.
  • Reduce batch size or split work into smaller tasks.
  • Increase the worker’s memory request: @ray.remote(memory=4 * 1024**3).
  • Use streaming APIs (ray.data.Dataset) for datasets that don’t fit in memory.

Object store fills up

Symptom: tasks block indefinitely, or you see “Object store memory used reached threshold”.
  • Reduce the number of in-flight refs by gating with ray.wait.
  • Increase object_store_memory at ray.init or via ray start --object-store-memory.
  • Enable spilling to disk so objects that don’t fit in memory are written to local storage, e.g. via ray.init:
    _system_config={"automatic_object_spilling_enabled": True}
    

ray.get hangs

  • Check the dashboard for stuck tasks (ray dashboard).
  • A task may be waiting on a missing resource (e.g., num_gpus=1 requested but no GPUs available). Inspect cluster resources with ray.cluster_resources().
  • A circular dependency between tasks can cause a deadlock.

Actors don’t restart

Ensure max_restarts is set:
@ray.remote(max_restarts=3, max_task_retries=3)
class A: ...
After max_restarts is exceeded, calls raise RayActorError. Catch and recreate the actor explicitly.

Slow runtime_env setup

  • First-time package installs are cached per node; subsequent tasks with the same runtime_env reuse the cached environment.
  • Use uv instead of pip for faster installs:
    runtime_env={"uv": ["torch"]}
    
  • For large environments, build a Docker image and use the container runtime env.

Cluster autoscaler doesn’t add nodes

  • Check the autoscaler log: tail -f /tmp/ray/session_latest/logs/monitor*.
  • Confirm the resource you’re requesting is advertised by some node type in the cluster config.
  • On Kubernetes, verify the RayCluster CR has enableInTreeAutoscaling: true.

“Connection refused” on ray.init(address="auto")

  • Confirm a head node is running: ray status.
  • Check that RAY_ADDRESS points at the right host.
  • Firewall: open ports 6379 (GCS), 10001 (client server), and the worker ports.

High driver memory

The driver holds references for every ObjectRef created. Drop refs you no longer need.
del ref  # let the object store reclaim it

Profiling

Use ray timeline to dump a Chrome trace of recently executed tasks:
ray timeline -o /tmp/timeline.json
Or call ray.timeline() from Python.

Next steps

Observability

Dashboard, metrics, and logs.

Fault tolerance

Failure recovery semantics.