How many head nodes do I need?
One. The head is a single point of administration for the cluster. For high availability, configure GCS fault tolerance with external Redis storage.
Can I run Ray without Kubernetes?
Yes. The cluster launcher provisions nodes directly on AWS, GCP, Azure, or on-premises hardware. Run ray up cluster.yaml.
How do I size the head node?
Allocate enough CPU/memory for the GCS and dashboard. For most clusters this is 4 CPUs and 8–16 GiB of RAM. The head doesn’t need to run user code; in fact, for clusters above ~100 workers, prefer dedicating the head to control-plane work.
Can workers join an existing cluster after it’s started?
Yes. Run ray start --address=<head>:6379 on a new node. The autoscaler does this automatically.
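To confirm that a newly started node has registered, one option is to connect a driver and inspect ray.nodes(). The sketch below assumes Ray is installed and a cluster is reachable through address="auto". The helper names are hypothetical, and the "Alive" and "NodeManagerAddress" keys are assumed to match the fields ray.nodes() reports.

```python
from typing import Any


def alive_node_addresses(nodes: list[dict[str, Any]]) -> list[str]:
    """Return the addresses of nodes marked alive, sorted for stable output."""
    return sorted(n["NodeManagerAddress"] for n in nodes if n.get("Alive"))


def print_alive_nodes() -> None:
    """Connect to the running cluster and print one address per live node."""
    # Imported lazily so the helper above can be used without Ray installed.
    import ray

    # address="auto" attaches to an existing cluster instead of starting one.
    ray.init(address="auto")
    for address in alive_node_addresses(ray.nodes()):
        print(address)
```

Calling print_alive_nodes() from a script on the head node before and after running ray start on the new machine should show one additional address.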
How does Ray handle preemption?
Preempted workers look like crashed workers. Tasks retry, restartable actors restart, and lineage reconstruction recreates lost objects. For best behavior, set max_retries (tasks) and max_restarts (actors), and use durable storage for checkpoints.
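A retried task starts from scratch unless it saves its own progress. The following is a minimal sketch of a checkpoint-and-resume loop that a retried task could use. It assumes the checkpoint path sits on storage that survives the node (for example a shared filesystem); all function names are hypothetical. The same pattern applies to actors created with max_restarts.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state via a temp file so a preemption mid-write cannot corrupt it."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # atomic rename when both paths share a filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def accumulate(path_str: str, num_steps: int) -> int:
    """Sum 0..num_steps-1, resuming from the last checkpoint after a retry."""
    path = Path(path_str)
    state = load_checkpoint(path)
    for step in range(state["next_step"], num_steps):
        state = {"next_step": step + 1, "total": state["total"] + step}
        save_checkpoint(path, state)
    return state["total"]


def run_on_ray(path_str: str, num_steps: int) -> int:
    """Run accumulate as a Ray task that is retried if its worker dies."""
    # Imported lazily so the helpers above work without Ray installed.
    import ray

    # max_retries lets Ray re-run the task after its worker is preempted.
    remote_accumulate = ray.remote(max_retries=3)(accumulate)
    return ray.get(remote_accumulate.remote(path_str, num_steps))
```

Because each retry reloads the checkpoint first, a preemption costs at most the step that was in flight.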
How do I limit concurrent jobs on a cluster?
Use Ray Jobs and the Job submission API’s queueing options. On Kubernetes, RayJob supports queue limits and priority via Kueue.
Why is my cluster not autoscaling?
Common causes:
- The autoscaler isn’t enabled (enableInTreeAutoscaling: true on RayCluster).
- The requested resource isn’t advertised by any worker node type.
- A node type is at its maxWorkers cap.
- The cloud provider denied the launch (quota, capacity).
To see which cause applies, follow the autoscaler logs on the head node: tail -f /tmp/ray/session_latest/logs/monitor*
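The resource and maxWorkers causes listed above can be reasoned about before reading any logs. The sketch below is an illustration of that reasoning, not the autoscaler's actual implementation. NodeType, its fields, and diagnose are hypothetical names, and the node type data would have to be copied by hand from your cluster configuration.

```python
from dataclasses import dataclass


@dataclass
class NodeType:
    name: str
    resources: dict[str, float]  # what one node of this type advertises
    max_workers: int             # the configured cap for this node type
    running: int = 0             # nodes of this type currently up


def diagnose(request: dict[str, float], node_types: list[NodeType]) -> str:
    """Explain why a single resource request can or cannot trigger scale-up."""
    # A node type fits if it advertises at least the requested amount
    # of every requested resource.
    fitting = [
        nt for nt in node_types
        if all(nt.resources.get(k, 0) >= v for k, v in request.items())
    ]
    if not fitting:
        return "no node type advertises the requested resources"
    if all(nt.running >= nt.max_workers for nt in fitting):
        return "every matching node type is at its maxWorkers cap"
    return "a matching node type has headroom; check the monitor logs for provider errors"


if __name__ == "__main__":
    types = [
        NodeType("cpu-worker", {"CPU": 8}, max_workers=4, running=4),
        NodeType("gpu-worker", {"CPU": 8, "GPU": 1}, max_workers=2, running=1),
    ]
    print(diagnose({"GPU": 2}, types))
```

In this example no single node type offers two GPUs, so the request could never be satisfied regardless of quota.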
How do I run multiple jobs concurrently?
Each ray.init() call from a separate process creates a new job. Use namespaces to isolate them logically.
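A minimal sketch of namespace isolation follows, assuming Ray is installed and a cluster is already running. The function name, the Counter actor, and the "counter" actor name are hypothetical. The sketch ignores the race in which two drivers in the same namespace create the actor at the same moment.

```python
def increment_shared_counter(namespace: str) -> int:
    """Connect a driver in `namespace` and bump a named actor that lives there."""
    # Imported lazily so this module can be loaded without Ray installed.
    import ray

    # Each driver process that calls ray.init() is a separate job.
    ray.init(address="auto", namespace=namespace)

    @ray.remote
    class Counter:
        def __init__(self) -> None:
            self.value = 0

        def increment(self) -> int:
            self.value += 1
            return self.value

    try:
        # Named actors are looked up within the driver's own namespace.
        counter = ray.get_actor("counter")
    except ValueError:
        # No actor with that name in this namespace yet, so create it.
        # lifetime="detached" keeps it alive after this driver exits.
        counter = Counter.options(name="counter", lifetime="detached").remote()
    return ray.get(counter.increment.remote())
```

Running this from two processes with different namespace values, such as "team-a" and "team-b", should give each its own counter, while two processes sharing a namespace increment the same one.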
Where do logs go?
Logs are written to /tmp/ray/session_latest/logs/ on every node. Aggregate them with your platform’s log collector (Fluent Bit, CloudWatch, Google Cloud Logging, etc.).
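For a quick look at a single node before a collector is in place, the log directory can be read directly. The sketch below uses only the standard library. tail_logs is a hypothetical helper, and the default "*.log" pattern is an assumption that will miss files with other extensions.

```python
from collections import deque
from pathlib import Path


def tail_logs(log_dir: str, pattern: str = "*.log", lines: int = 20) -> dict[str, list[str]]:
    """Return the last `lines` lines of each file in log_dir matching pattern."""
    tails: dict[str, list[str]] = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as fh:
            # A deque with maxlen keeps only the most recent lines in memory.
            tails[path.name] = [ln.rstrip("\n") for ln in deque(fh, maxlen=lines)]
    return tails
```

For example, tail_logs("/tmp/ray/session_latest/logs", pattern="monitor*") returns the most recent lines of each monitor log file on that node.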
Next steps
Kubernetes troubleshooting
Issues specific to KubeRay.
Observability
Cluster-wide metrics and logs.