

For clusters past ~100 nodes, several defaults need adjusting and new failure modes appear.

Head node sizing

Give the head node at least 16 CPUs, 32+ GiB of RAM, and a fast disk: the GCS, dashboard, and autoscaler all run there, and their load grows with cluster size.

GCS fault tolerance

Configure external Redis storage so head failures don’t tear down the cluster:
rayStartParams:
  redis-address: "redis.internal:6379"
  external-redis: "true"
For HA, run Redis in HA mode (Sentinel, cluster mode, or a managed service).
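Before relying on GCS fault tolerance, it's worth confirming the head node can actually reach the external Redis endpoint. A minimal TCP-level probe (this is a sketch, not part of Ray; it only checks that the port accepts connections, not the Redis protocol or auth):

```python
import socket

def redis_reachable(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """TCP-level probe of an external Redis endpoint.

    Only checks that the port accepts connections; it does not speak
    the Redis protocol or verify authentication.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run this from the head node against your actual Redis host, e.g. `redis_reachable("redis.internal")`.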

Object store

Workers spill to disk when the object store fills. Use fast local SSDs as the spill target:
rayStartParams:
  temp-dir: "/local-ssd/ray"
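Alongside the spill directory, you can cap how much node memory the object store claims with the object-store-memory start parameter. The value is in bytes and the figure below is purely illustrative; size it to your nodes:

```yaml
rayStartParams:
  temp-dir: "/local-ssd/ray"
  object-store-memory: "200000000000"  # ~200 GB, illustrative; size to your node
```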

Autoscaler

  • Use multiple worker types to avoid waiting on a single VM family.
  • Lower idle_timeout_minutes to save cost; raise it to keep a warm pool for spiky workloads.
  • Cap each type with max_workers so a runaway autoscaler can’t drain the budget.
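On the VM cluster launcher, these knobs live in the cluster YAML. A sketch with the fields above (the type names and resource shapes are illustrative):

```yaml
max_workers: 120           # global cap across all worker types
idle_timeout_minutes: 5
available_node_types:
  cpu_workers:
    min_workers: 0
    max_workers: 100       # per-type cap
    resources: {"CPU": 16}
    node_config: {}        # cloud-specific instance settings go here
  gpu_workers:
    min_workers: 0
    max_workers: 8
    resources: {"CPU": 16, "GPU": 1}
    node_config: {}
```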

Networking

  • Place all nodes in the same VPC and availability zone (or use enhanced networking) to minimize latency and cross-AZ transfer costs.
  • Open the dashboard, GCS, and worker ports between nodes; restrict the dashboard to a bastion or VPN.
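The relevant default ports (double-check against your Ray version; most are configurable via ray start flags):

```text
6379    GCS on the head node (--port)
8265    dashboard (--dashboard-port)
10001   Ray Client server (--ray-client-server-port)
worker, node-manager, and object-manager ports: dynamic by default;
pin them (e.g. --min-worker-port / --max-worker-port) for strict firewalls
```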

Logging

Aggregate /tmp/ray/session_latest/logs/ from all nodes to durable storage (S3, CloudWatch, Google Cloud Logging, etc.). Ray's built-in deduplication (RAY_DEDUP_LOGS=1) reduces log volume by collapsing identical lines emitted by many workers.
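As a building block for a per-node collection step, a helper like the following (hypothetical, stdlib-only) stages the session's log files into a per-host folder; a real pipeline would then ship that folder to S3 or your log service:

```python
import shutil
import socket
from pathlib import Path

def collect_ray_logs(session_dir: str, dest_root: str) -> int:
    """Copy every log file from a Ray session directory into a
    per-host folder under dest_root. Returns the number of files copied."""
    dest = Path(dest_root) / socket.gethostname()
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    # Matches raylet.log as well as rotated files like gcs_server.log.1.
    for log in Path(session_dir).glob("*.log*"):
        shutil.copy2(log, dest / log.name)
        copied += 1
    return copied
```

On a node this would be invoked as `collect_ray_logs("/tmp/ray/session_latest/logs", "/mnt/log-staging")`, typically from cron or a sidecar.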

Health checks

Use ray status and dashboard alerts to detect:
  • Node count drift (autoscaler misbehavior)
  • Sustained high object-store spilling
  • Worker OOM kills
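For node count drift specifically, you can script a check against ray.nodes(), which returns one dict per node (including recently dead ones) with an "Alive" boolean. A sketch, with a hypothetical 10% drift threshold:

```python
def alive_node_count(nodes) -> int:
    """Count live nodes given the output of ray.nodes().

    Each entry is a dict with an "Alive" boolean; dead nodes linger
    in the list for a while after they disappear."""
    return sum(1 for n in nodes if n.get("Alive"))

def drifted(nodes, expected: int) -> bool:
    """Flag drift when the alive count is more than ~10% away from the
    node count you asked the autoscaler for (threshold is illustrative)."""
    return abs(alive_node_count(nodes) - expected) > max(1, expected // 10)
```

In a monitoring job you would call `ray.init(address="auto")` and pass `ray.nodes()` in, alerting when `drifted(...)` holds for several consecutive checks.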

Updates

For long-running clusters, prefer rolling worker updates (drain-and-replace) over full restarts. KubeRay supports this natively; on VMs, do it manually by replacing one worker group at a time.
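One manual drain-and-replace round on VMs can be sketched as follows (hostnames are placeholders; adapt to your provisioning tooling):

```shell
# 1. Stop Ray on the outgoing worker. Note that `ray stop` terminates any
#    work still running on that node, so wait for it to go idle first.
ssh old-worker 'ray stop'
# 2. Bring up the replacement and join it to the cluster.
ssh new-worker 'ray start --address=head.internal:6379'
# 3. Confirm with `ray status` on the head that the new node registered
#    before moving on to the next worker group.
```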

Next steps

  • Logging: configure log aggregation.
  • Configuring autoscaling: tune scaling speed and idle timeouts.