For clusters past ~100 nodes, several defaults need adjusting and new failure modes appear.
Head node sizing
Allocate the head node 16 CPUs, 32+ GiB RAM, and a fast disk; the GCS, dashboard, and autoscaler all run there, and their load scales with cluster size.
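A minimal sketch for the VM cluster launcher, assuming AWS; the instance type, disk size, and node-type name are illustrative, and `resources: {"CPU": 0}` is an optional extra that keeps application tasks off the head:

```yaml
# Illustrative head node type (AWS assumed; names and sizes are examples).
available_node_types:
  head.default:
    node_config:
      InstanceType: m5.4xlarge            # 16 vCPUs, 64 GiB RAM
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs: {VolumeSize: 200, VolumeType: gp3}  # fast disk for GCS state and logs
    resources: {"CPU": 0}                 # keep application tasks off the head
head_node_type: head.default
```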
GCS fault tolerance
Configure external Redis storage so a head-node failure doesn't tear down the cluster:
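A sketch of how this can look with the VM cluster launcher, assuming you have provisioned and secured a Redis instance yourself (the endpoint below is a placeholder):

```yaml
# Sketch: point the GCS at external Redis so cluster state survives a head restart.
head_start_ray_commands:
  - ray stop
  - export RAY_REDIS_ADDRESS=my-redis.internal:6379 && ray start --head --port=6379 --dashboard-host=0.0.0.0 --autoscaling-config=~/ray_bootstrap_config.yaml
```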
Object store
Workers spill to disk when the object store fills. Use fast local SSDs as the spill target:
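For example (a sketch; the NVMe mount path is an assumption, and the spilling config is passed as JSON to `--system-config` on the head node, from where it applies cluster-wide):

```yaml
# Sketch: spill objects to a local NVMe mount instead of the default under /tmp.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/nvme/ray_spill\"}}"}'
```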
Autoscaler
- Use multiple worker types to avoid waiting on a single VM family.
- Lower `idle_timeout_minutes` for cost; raise it to keep a warm pool for spiky workloads.
- Cap each type with `max_workers` so a runaway autoscaler can't drain the budget (see the sketch below).
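A sketch combining the three knobs above, assuming the VM cluster launcher on AWS (instance types and caps are illustrative):

```yaml
idle_timeout_minutes: 5        # lower for cost, raise to keep a warm pool
max_workers: 400               # global cap across all worker types
available_node_types:
  worker.m5:
    node_config: {InstanceType: m5.8xlarge}
    min_workers: 0
    max_workers: 200           # per-type cap
  worker.c5:
    node_config: {InstanceType: c5.9xlarge}
    min_workers: 0
    max_workers: 200           # second VM family so scale-up isn't blocked on one
```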
Networking
- Place all nodes in the same VPC and availability zone (or use enhanced networking) for high throughput.
- Open the dashboard, GCS, and worker ports between nodes; restrict the dashboard to a bastion or VPN (see the port-pinning sketch below).
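One way to keep firewall rules narrow is to pin Ray's ports explicitly. A sketch using the common defaults, with the dashboard bound to localhost so it is only reachable through a tunnel (all values are adjustable):

```yaml
# Sketch: fixed ports so security groups can be scoped tightly.
head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --dashboard-host=127.0.0.1 --dashboard-port=8265 --min-worker-port=10002 --max-worker-port=19999 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --min-worker-port=10002 --max-worker-port=19999
```

With the dashboard bound to localhost, `ray dashboard cluster.yaml` sets up the SSH port-forward from your bastion or workstation.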
Logging
Aggregate `/tmp/ray/session_latest/logs/` across all nodes to durable storage (S3, CloudWatch, GCP Logging, etc.). The built-in `RAY_DEDUP_LOGS=1` reduces log volume by deduping identical lines from many workers.
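A minimal sketch for the VM launcher, assuming the AWS CLI is installed and `s3://my-bucket` is a placeholder for your own bucket; in production you would more likely run a log agent (Fluent Bit, the CloudWatch agent) instead of cron:

```yaml
# Sketch: sync each node's Ray logs to S3 every minute (bucket is a placeholder).
setup_commands:
  - (crontab -l 2>/dev/null; echo '* * * * * aws s3 sync /tmp/ray/session_latest/logs s3://my-bucket/ray-logs/$(hostname)/') | crontab -
```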
Health checks
Use `ray status` and dashboard alerts (see the cron sketch after this list) to detect:
- Node count drift (autoscaler misbehavior)
- Sustained high object-store spilling
- Worker OOM kills
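As a starting point, a sketch that snapshots `ray status` on the head so an external alerter can diff node counts over time (the alerting pipeline itself is an assumption left to you):

```yaml
# Sketch: record autoscaler status every 5 minutes for an external alerter to watch.
head_setup_commands:
  - (crontab -l 2>/dev/null; echo '*/5 * * * * ray status > /tmp/ray-status-latest.txt 2>&1') | crontab -
```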
Updates
For long-running clusters, prefer rolling worker updates (drain-and-replace) over full restarts. KubeRay supports this natively; on VMs, do it manually by replacing one worker group at a time.
Next steps
- Logging: configure log aggregation.
- Configuring autoscaling: tune scaling speed and idle timeouts.