How many head nodes do I need?

One. The head is a single point of administration for the cluster. For high availability, configure GCS fault tolerance with external Redis storage.

Can I run Ray without Kubernetes?

Yes. The cluster launcher provisions VMs directly on AWS, GCP, or Azure, and can also manage on-premises machines. Run ray up cluster.yaml to launch the cluster.

How do I size the head node?

Allocate enough CPU/memory for the GCS and dashboard. For most clusters this is 4 CPUs and 8–16 GiB of RAM. The head doesn’t need to run user code; in fact, for clusters above ~100 workers, prefer dedicating the head to control-plane work.

Can workers join an existing cluster after it’s started?

Yes. Run ray start --address=<head>:6379 on the new node, where <head> is the head node's address. The autoscaler does this automatically for the nodes it launches.

How does Ray handle preemption?

Ray treats a preempted worker like a crashed one: tasks retry, restartable actors restart, and lineage reconstruction recreates lost objects. For best behavior, set max_retries on tasks and max_restarts on actors, and write checkpoints to durable storage so retried work can resume instead of starting over.
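A retry only helps if the retried work can pick up where it left off. The following is a minimal sketch of that checkpoint-and-resume pattern in plain Python, not Ray's own checkpointing API. The helper names (save_checkpoint, load_checkpoint, sum_with_checkpoints) and the fail_at hook are hypothetical, and the checkpoint path is assumed to live on storage that survives the node. In a Ray program, a loop like this would typically run inside a task or actor method that has max_retries or max_restarts set.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def sum_with_checkpoints(items, path, fail_at=None):
    """Sum items, checkpointing after each one.

    fail_at is a hypothetical hook that simulates a preemption by raising
    before the item at that index is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated preemption")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "progress.json")
    data = [1, 2, 3, 4, 5]
    try:
        sum_with_checkpoints(data, ckpt, fail_at=3)  # first attempt is "preempted"
    except RuntimeError:
        pass
    print(sum_with_checkpoints(data, ckpt))  # the retry resumes at index 3
```

Writing to a temporary file and renaming it means a reader never sees a half-written checkpoint, which matters when the interruption can arrive at any moment.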

How do I limit concurrent jobs on a cluster?

Use Ray Jobs and the Job submission API’s queueing options. On Kubernetes, RayJob supports queue limits and priority via Kueue.

Why is my cluster not autoscaling?

Common causes:
  • The autoscaler isn’t enabled. On KubeRay, set enableInTreeAutoscaling: true on the RayCluster.
  • The requested resource isn’t advertised by any worker node type.
  • A node type is at its maxWorkers cap.
  • The cloud provider denied the launch (quota, capacity).
Check the autoscaler log on the head node:
tail -f /tmp/ray/session_latest/logs/monitor*
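The checklist above can be sketched as a small triage helper. This is an illustrative sketch only: diagnose_scale_up and the node_types dictionary shape (resources, max_workers, running) are hypothetical and do not mirror any Ray or KubeRay API. You would fill the dictionary in by hand from your own cluster config and current node counts.

```python
def diagnose_scale_up(request, node_types, autoscaler_enabled=True):
    """Return a likely reason a resource request cannot trigger a scale-up.

    request:    resources one task or actor asks for, e.g. {"GPU": 1}
    node_types: hypothetical summary, name -> {"resources", "max_workers", "running"}
    """
    if not autoscaler_enabled:
        return "autoscaler is not enabled"

    # Node types whose advertised resources can satisfy the whole request.
    fitting = [
        name
        for name, spec in node_types.items()
        if all(spec["resources"].get(k, 0) >= v for k, v in request.items())
    ]
    if not fitting:
        return "no worker node type advertises the requested resources"

    # Of those, keep the types that still have room under their cap.
    with_room = [
        name
        for name in fitting
        if node_types[name]["running"] < node_types[name]["max_workers"]
    ]
    if not with_room:
        return "every fitting node type is at its max_workers cap"

    return "config looks fine; check the autoscaler log for cloud launch errors"


if __name__ == "__main__":
    node_types = {
        "cpu_worker": {"resources": {"CPU": 8}, "max_workers": 10, "running": 2},
        "gpu_worker": {"resources": {"CPU": 8, "GPU": 1}, "max_workers": 2, "running": 2},
    }
    print(diagnose_scale_up({"GPU": 1}, node_types))
```

The checks run in the same order as the list, from cheapest to verify to most external, so the first message returned is usually the first thing worth fixing.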

How do I run multiple jobs concurrently?

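Each ray.init() call from a separate driver process creates a new job. Use namespaces to isolate jobs logically; named actors are scoped to their namespace, so jobs in different namespaces don't collide on actor names:
import ray
ray.init(address="auto", namespace="job_1")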

Where do logs go?

Ray writes logs to /tmp/ray/session_latest/logs/ on every node. Aggregate them with your platform’s log collector (Fluent Bit, CloudWatch, Google Cloud Logging, etc.).
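For a quick look on a single node before a collector is in place, a short script can print the tail of each log file. This is a minimal sketch using only the standard library. tail_logs is a hypothetical helper, not part of Ray, and it assumes the default log directory named above.

```python
import glob
import os
from collections import deque


def tail_logs(log_dir="/tmp/ray/session_latest/logs", pattern="*", lines=5):
    """Return {filename: last `lines` lines} for matching files in log_dir."""
    tails = {}
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, errors="replace") as f:
            # A bounded deque keeps only the final lines while streaming the file.
            last = deque(f, maxlen=lines)
        tails[os.path.basename(path)] = [line.rstrip("\n") for line in last]
    return tails


if __name__ == "__main__":
    # Show the end of the autoscaler (monitor) logs, if any exist on this node.
    for name, last_lines in tail_logs(pattern="monitor*").items():
        print(f"== {name} ==")
        print("\n".join(last_lines))
```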

Next steps

Kubernetes troubleshooting

Issues specific to KubeRay.

Observability

Cluster-wide metrics and logs.