Pods stuck Pending

kubectl describe pod <pod> shows the reason in its Events section. The most common causes:
  • Insufficient resources in the cluster. Check node capacity and, if a cluster-autoscaler is installed, its logs.
  • Missing node selector or toleration. GPU and TPU pods often need both.
  • PVC unbound. Verify the storage class exists and the volume can provision.
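The checks above can be gathered in one helper. This is a minimal sketch: the function name pending_report and its namespace argument are illustrative, not part of any Ray or KubeRay tooling.

```shell
# List Pending pods, the scheduler events that explain them, and PVC state.
# $1 = namespace (hypothetical argument; defaults to "default").
pending_report() {
  ns="${1:-default}"
  # Pods the scheduler has not placed yet.
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  # FailedScheduling events carry the scheduler's explanation,
  # for example insufficient CPU/memory or an untolerated taint.
  kubectl get events -n "$ns" --field-selector=reason=FailedScheduling \
    --sort-by=.lastTimestamp
  # A claim that stays in Pending status has not been bound to a volume.
  kubectl get pvc -n "$ns"
}
```

Call it as pending_report <namespace>.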

Head pod CrashLoopBackOff

Check the head logs from the previous (crashed) container instance:
kubectl logs <head-pod> -c ray-head --previous
Common causes:
  • Insufficient memory for the GCS.
  • Misconfigured rayStartParams (e.g., conflicting ports).
  • Image pull failure.
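To tell these causes apart, it can help to read why the container last terminated. The sketch below assumes the head container is named ray-head, as in the log command above; the function name last_exit is illustrative.

```shell
# Print the last termination reason and exit code of one container.
# $1 = pod name, $2 = container name (defaults to ray-head).
last_exit() {
  pod="$1"
  container="${2:-ray-head}"
  # JSONPath filter selecting the container's last terminated state.
  status=".status.containerStatuses[?(@.name==\"$container\")].lastState.terminated"
  kubectl get pod "$pod" \
    -o jsonpath="{$status.reason}{\"\\t\"}{$status.exitCode}{\"\\n\"}"
}
```

A reason of OOMKilled points at memory; a reason of Error with a non-zero exit code usually means the start command itself failed, so the logs are the next place to look.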

Workers can’t connect to head

Workers report a ConnectionRefusedError when connecting to the GCS.
  • Check that the head pod is Running and that the head service exists: kubectl get svc <cluster>-head-svc.
  • Confirm RAY_GCS_RPC_SERVER_PORT and the head’s GcsServer port match.
  • For multi-namespace clusters, ensure DNS resolves the headless service.
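One way to test connectivity is to open a TCP connection to the GCS port from inside a worker pod. This sketch makes several assumptions: the cluster domain is cluster.local, the GCS listens on 6379 (Ray's default), and the worker image has python on its PATH. The function names are illustrative.

```shell
# Build the in-cluster DNS name of the head service.
# $1 = RayCluster name, $2 = namespace (defaults to "default").
head_fqdn() {
  printf '%s-head-svc.%s.svc.cluster.local\n' "$1" "${2:-default}"
}

# Attempt a TCP connection to the GCS port from a worker pod.
# $1 = worker pod, $2 = RayCluster name, $3 = namespace, $4 = port.
check_gcs() {
  worker="$1"; cluster="$2"; ns="${3:-default}"; port="${4:-6379}"
  kubectl exec -n "$ns" "$worker" -- python -c \
    "import socket; socket.create_connection(('$(head_fqdn "$cluster" "$ns")', $port), timeout=5); print('GCS reachable')"
}
```

If the command fails with a name-resolution error, the problem is DNS; if it fails with a refused connection, the service resolves but nothing is listening on that port.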

RayService stuck deploying

kubectl get rayservice <name> -o jsonpath='{.status}'
In the output, pendingServiceStatus.applicationStatuses shows per-application progress.
  • Verify runtime_env.working_dir is reachable from worker pods.
  • Check Serve logs for import errors: kubectl logs <head-pod> -c ray-head | grep -i serve.
  • A slow-starting deployment may exceed deploymentUnhealthySecondThreshold; raise it for slow-loading models.
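The threshold can be raised without editing the manifest by hand. This is a sketch that assumes your KubeRay version exposes deploymentUnhealthySecondThreshold directly under spec; check the RayService CRD installed in your cluster before relying on it. The function names and the example value are illustrative.

```shell
# Emit a JSON merge patch that sets the threshold, in seconds.
threshold_patch() {
  printf '{"spec":{"deploymentUnhealthySecondThreshold":%d}}\n' "$1"
}

# Apply the patch to a RayService.
# $1 = RayService name, $2 = threshold in seconds.
raise_threshold() {
  kubectl patch rayservice "$1" --type merge -p "$(threshold_patch "$2")"
}
```

For example, raise_threshold <name> 900 would allow 15 minutes before a deployment is treated as unhealthy.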

Autoscaler not adding nodes

  • Confirm the RayCluster spec sets enableInTreeAutoscaling: true.
  • Check the autoscaler container logs: kubectl logs <head-pod> -c autoscaler.
  • Verify the requested resource is advertised by some worker group.
  • Cluster-autoscaler isn’t adding Kubernetes nodes? Check its events with kubectl events -n kube-system | grep -i autoscaler.
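The first few checks can be scripted. A minimal sketch, assuming the head container is named ray-head as elsewhere on this page; the function name autoscaler_check is illustrative.

```shell
# $1 = RayCluster name, $2 = head pod name.
autoscaler_check() {
  cluster="$1"; head_pod="$2"
  # Prints "true" when the in-tree autoscaler is enabled on the RayCluster.
  kubectl get raycluster "$cluster" \
    -o jsonpath='{.spec.enableInTreeAutoscaling}{"\n"}'
  # Shows node resources and pending resource demands as Ray sees them.
  kubectl exec "$head_pod" -c ray-head -- ray status
}
```

If ray status lists a pending demand for a resource that no worker group advertises, the autoscaler has nothing it can scale to satisfy it.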

Out-of-memory kills

kubectl describe pod <worker> | grep -A5 "Last State"
If the output shows Reason: OOMKilled, raise the container’s memory request and limit, or reduce per-task memory in your Ray code.
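To scan a whole namespace instead of one pod, the same field can be read with JSONPath and filtered. This is a sketch; the function names oom_filter and list_oom_pods are illustrative.

```shell
# Read "name<TAB>reasons" rows on stdin and print the names of pods
# whose last termination reason includes OOMKilled.
oom_filter() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# List pods in a namespace (default: "default") that were OOM-killed.
list_oom_pods() {
  kubectl get pods -n "${1:-default}" -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | oom_filter
}
```

Only the most recent termination of each container is recorded in Last State, so a pod that was OOM-killed earlier and later restarted for another reason will not appear.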

Image pull errors

  • For private registries, add imagePullSecrets to the pod template:
    template:
      spec:
        imagePullSecrets:
          - { name: ghcr-secret }
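The secret referenced above has to exist in the same namespace as the pods. A minimal sketch of creating it follows. It assumes the registry is ghcr.io, which is inferred from the secret name and not stated here; the function name and its arguments are illustrative.

```shell
# Create the registry credential that imagePullSecrets refers to.
# $1 = registry username, $2 = access token, $3 = namespace (default: "default").
create_pull_secret() {
  kubectl create secret docker-registry ghcr-secret \
    --namespace "${3:-default}" \
    --docker-server=ghcr.io \
    --docker-username="$1" \
    --docker-password="$2"
}
```

Passing the token as an argument leaves it in shell history; reading it from a file or environment variable is safer in practice.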
    

Next steps

Configuration

Settings reference.

Cluster FAQ

General Ray cluster questions.