For clusters past ~100 nodes, several defaults need adjusting and new failure modes appear.
Head node sizing
Allocate the head node 16 CPUs, 32+ GiB RAM, and a fast disk; the GCS, dashboard, and autoscaler all run there, and their load scales with cluster size.
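A minimal sketch for the VM cluster launcher, assuming AWS; the instance type, disk size, and node-type name are illustrative, and `resources: {"CPU": 0}` is an optional extra that keeps application tasks off the head:

```yaml
# Illustrative head node type (AWS assumed; names and sizes are examples).
available_node_types:
  head.default:
    node_config:
      InstanceType: m5.4xlarge            # 16 vCPUs, 64 GiB RAM
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs: {VolumeSize: 200, VolumeType: gp3}  # fast disk for GCS state and logs
    resources: {"CPU": 0}                 # keep application tasks off the head
head_node_type: head.default
```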
GCS fault tolerance
Configure external Redis storage so a head-node failure doesn't tear down the cluster:
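A sketch of how this can look with the VM cluster launcher, assuming you have provisioned and secured a Redis instance yourself (the endpoint below is a placeholder):

```yaml
# Sketch: point the GCS at external Redis so cluster state survives a head restart.
head_start_ray_commands:
  - ray stop
  - export RAY_REDIS_ADDRESS=my-redis.internal:6379 && ray start --head --port=6379 --dashboard-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml
```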
Object store
Workers spill to disk when the object store fills. Use fast local SSDs as the spill target:
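For example (a sketch; the NVMe mount path is an assumption, and the spilling config is passed as JSON to `--system-config` on the head node, from where it applies cluster-wide):

```yaml
# Sketch: spill objects to a local NVMe mount instead of the default under /tmp.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/nvme/ray_spill\"}}"}'
```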
Autoscaler
- Use multiple worker types to avoid waiting on a single VM family.
- Lower `idle_timeout_minutes` for cost; raise it to keep a warm pool for spiky workloads.
- Cap each type with `max_workers` so a runaway autoscaler can't drain the budget (see the sketch below).
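A sketch combining the three knobs above, assuming the VM cluster launcher on AWS (instance types and caps are illustrative):

```yaml
idle_timeout_minutes: 5        # lower for cost, raise to keep a warm pool
max_workers: 400               # global cap across all worker types
available_node_types:
  worker.m5:
    node_config: {InstanceType: m5.8xlarge}
    min_workers: 0
    max_workers: 200           # per-type cap
  worker.c5:
    node_config: {InstanceType: c5.9xlarge}
    min_workers: 0
    max_workers: 200           # second VM family so scale-up isn't blocked on one
```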
Networking
- Place all nodes in the same VPC and availability zone (or use enhanced networking) for high throughput.
- Open the dashboard, GCS, and worker ports between nodes; restrict the dashboard to a bastion or VPN (see the port-pinning sketch below).
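One way to keep firewall rules narrow is to pin Ray's ports explicitly. A sketch using the common defaults, with the dashboard bound to localhost so it is only reachable through a tunnel (all values are adjustable):

```yaml
# Sketch: fixed ports so security groups can be scoped tightly.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --dashboard-host=127.0.0.1 --dashboard-port=8265 --min-worker-port=10002 --max-worker-port=19999 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --min-worker-port=10002 --max-worker-port=19999
```

With the dashboard bound to localhost, `ray dashboard cluster.yaml` sets up the SSH port-forward from your bastion or workstation.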
Logging
Aggregate `/tmp/ray/session_latest/logs/` across all nodes to durable storage (S3, CloudWatch, GCP Logging, etc.). The built-in `RAY_DEDUP_LOGS=1` reduces log volume by deduping identical lines from many workers.
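A minimal sketch for the VM launcher, assuming the AWS CLI is installed and `s3://my-bucket` is a placeholder for your own bucket; in production you would more likely run a log agent (Fluent Bit, the CloudWatch agent) instead of cron:

```yaml
# Sketch: sync each node's Ray logs to S3 every minute (bucket is a placeholder).
setup_commands:
  - (crontab -l 2>/dev/null; echo '* * * * * aws s3 sync /tmp/ray/session_latest/logs s3://my-bucket/ray-logs/$(hostname)/') | crontab -
```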
Health checks
Use `ray status` and dashboard alerts (see the cron sketch after this list) to detect:
- Node count drift (autoscaler misbehavior)
- Sustained high object-store spilling
- Worker OOM kills
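As a starting point, a sketch that snapshots `ray status` on the head so an external alerter can diff node counts over time (the alerting pipeline itself is an assumption left to you):

```yaml
# Sketch: record autoscaler status every 5 minutes for an external alerter to watch.
head_setup_commands:
  - (crontab -l 2>/dev/null; echo '*/5 * * * * ray status > /tmp/ray-status-latest.txt 2>&1') | crontab -
```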
Updates
For long-running clusters, prefer rolling worker updates (drain-and-replace) over full restarts. KubeRay supports this natively; on VMs, do it manually by replacing one worker group at a time.
Next steps
- Logging: configure log aggregation.
- Configuring autoscaling: tune scaling speed and idle timeouts.