## Pods stuck Pending

`kubectl describe pod` shows the reason. The most common causes:
- Insufficient resources in the cluster. Check node capacity and any cluster-autoscaler logs.
- Missing node selector / toleration. GPU/TPU pods often need both.
- PVC unbound. Verify the storage class exists and the volume can provision.
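A quick triage sequence for these three causes; this is a sketch, and the pod name is a placeholder:

```bash
# The Events section at the bottom explains why scheduling failed.
kubectl describe pod <pending-pod>

# Compare the pod's requests against what each node can still offer.
kubectl describe nodes | grep -A 5 "Allocated resources"

# If the pod is waiting on storage, inspect the claim and its storage class.
kubectl get pvc
kubectl get storageclass
```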
## Head pod CrashLoopBackOff

Check the head logs. Common causes:
- Insufficient memory for the GCS.
- A misconfigured `rayStartParams` (e.g., conflicting ports).
- Image pull failure.
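To pull the crash evidence, a minimal sketch; the pod name is a placeholder, and the `ray-head` container name matches the logs commands used elsewhere on this page:

```bash
# Logs from the previous (crashed) head container usually contain the real error.
kubectl logs <head-pod> -c ray-head --previous

# Exit code and reason for the most recent termination.
kubectl describe pod <head-pod> | grep -A 5 "Last State"
```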
## Workers can't connect to head

Workers report `ConnectionRefusedError` to the GCS.
- Check the head pod is `Running` and the head service exists: `kubectl get svc <cluster>-head-svc`.
- Confirm `RAY_GCS_RPC_SERVER_PORT` and the head's `GcsServer` port match.
- For multi-namespace clusters, ensure DNS resolves the headless service.
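A connectivity-check sketch, assuming Ray's default GCS port 6379 and a worker image that ships `nc`; adjust both if your setup differs:

```bash
# Confirm the head service exists and note its cluster IP and ports.
kubectl get svc <cluster>-head-svc -o wide

# Probe the GCS port from inside a worker pod.
kubectl exec <worker-pod> -- nc -zv <cluster>-head-svc 6379
```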
## RayService stuck deploying

`pendingServiceStatus.applicationStatuses` shows per-application progress.
- Verify `runtime_env.working_dir` is reachable from worker pods.
- Check Serve logs for import errors: `kubectl logs <head-pod> -c ray-head | grep -i serve`.
- A long-running deployment may exceed `deploymentUnhealthySecondThreshold`; bump it for slow-loading models.
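To watch per-application progress directly, a sketch; the RayService name is a placeholder, and the field path follows the status field named above, though its exact shape can vary by KubeRay version:

```bash
# Per-application deployment status from the pending service status.
kubectl get rayservice <name> -o jsonpath='{.status.pendingServiceStatus.applicationStatuses}'

# Serve-related errors in the head logs.
kubectl logs <head-pod> -c ray-head | grep -i serve
```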
## Autoscaler not adding nodes

- Confirm `enableInTreeAutoscaling: true`.
- Check the autoscaler container logs: `kubectl logs <head-pod> -c autoscaler`.
- Verify the requested resource is advertised by some worker group.
- Cluster-autoscaler isn't adding Kubernetes nodes? Check its events with `kubectl events -n kube-system | grep -i autoscaler`.
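A sketch for the first two checks; the names are placeholders, and the `enableInTreeAutoscaling` field is assumed to sit at the top level of the RayCluster spec:

```bash
# Is in-tree autoscaling enabled on this RayCluster?
kubectl get raycluster <cluster> -o jsonpath='{.spec.enableInTreeAutoscaling}'

# Recent autoscaler decisions and errors.
kubectl logs <head-pod> -c autoscaler | tail -n 50
```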
## Out-of-memory kills

If a container shows `Reason: OOMKilled`, raise its memory request/limit or reduce per-task memory in your Ray code.
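To confirm the kill before changing limits, a sketch; the pod name and container index are placeholders:

```bash
# Reason for the container's last termination; expect "OOMKilled".
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# If confirmed, raise resources.requests.memory / resources.limits.memory
# in the affected group's pod template, or reduce per-task memory in Ray.
```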
## Image pull errors

- For private registries, attach an `imagePullSecret`, as sketched below:
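A sketch, assuming a secret named `regcred` and placeholder registry credentials; the RayCluster field paths are an assumption based on the CRD embedding a standard pod template:

```bash
# Create the registry credential in the cluster's namespace.
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>

# Then reference it from each pod template in the RayCluster spec, e.g.:
#   spec.headGroupSpec.template.spec.imagePullSecrets: [{name: regcred}]
#   (and likewise under each workerGroupSpec)
```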
## Next steps

- Configuration: settings reference.
- Cluster FAQ: general Ray cluster questions.