## Pods stuck Pending

`kubectl describe pod` shows the reason. The most common causes:
- Insufficient resources in the cluster. Check node capacity and any cluster-autoscaler logs.
- Missing node selector / toleration. GPU/TPU pods often need both.
- PVC unbound. Verify the storage class exists and the volume can provision.
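A quick triage sequence for these three causes; this is a sketch, and the pod name is a placeholder:

```bash
# The Events section at the bottom explains why scheduling failed.
kubectl describe pod <pending-pod>

# Compare the pod's requests against what each node can still offer.
kubectl describe nodes | grep -A 5 "Allocated resources"

# If the pod is waiting on storage, inspect the claim and its storage class.
kubectl get pvc
kubectl get storageclass
```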
## Head pod CrashLoopBackOff

Check the head logs. Common causes:
- Insufficient memory for the GCS.
- A misconfigured `rayStartParams` (e.g., conflicting ports).
- Image pull failure.
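To pull the crash evidence, a minimal sketch; the pod name is a placeholder, and the `ray-head` container name matches the logs commands used elsewhere on this page:

```bash
# Logs from the previous (crashed) head container usually contain the real error.
kubectl logs <head-pod> -c ray-head --previous

# Exit code and reason for the most recent termination.
kubectl describe pod <head-pod> | grep -A 5 "Last State"
```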
## Workers can't connect to head

Workers report `ConnectionRefusedError` to the GCS.
- Check the head pod is `Running` and the head service exists: `kubectl get svc <cluster>-head-svc`.
- Confirm `RAY_GCS_RPC_SERVER_PORT` and the head's `GcsServer` port match.
- For multi-namespace clusters, ensure DNS resolves the headless service.
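A connectivity-check sketch, assuming Ray's default GCS port 6379 and a worker image that ships `nc`; adjust both if your setup differs:

```bash
# Confirm the head service exists and note its cluster IP and ports.
kubectl get svc <cluster>-head-svc -o wide

# Probe the GCS port from inside a worker pod.
kubectl exec <worker-pod> -- nc -zv <cluster>-head-svc 6379
```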
## RayService stuck deploying

`pendingServiceStatus.applicationStatuses` shows per-application progress.
- Verify `runtime_env.working_dir` is reachable from worker pods.
- Check Serve logs for import errors: `kubectl logs <head-pod> -c ray-head | grep -i serve`.
- A long-running deployment may exceed `deploymentUnhealthySecondThreshold`; bump it for slow-loading models.
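To watch per-application progress directly, a sketch; the RayService name is a placeholder, and the field path follows the status field named above, though its exact shape can vary by KubeRay version:

```bash
# Per-application deployment status from the pending service status.
kubectl get rayservice <name> -o jsonpath='{.status.pendingServiceStatus.applicationStatuses}'

# Serve-related errors in the head logs.
kubectl logs <head-pod> -c ray-head | grep -i serve
```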
## Autoscaler not adding nodes

- Confirm `enableInTreeAutoscaling: true`.
- Check the autoscaler container logs: `kubectl logs <head-pod> -c autoscaler`.
- Verify the requested resource is advertised by some worker group.
- Cluster-autoscaler isn't adding Kubernetes nodes? Check its events with `kubectl events -n kube-system | grep -i autoscaler`.
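A sketch for the first two checks; the names are placeholders, and the `enableInTreeAutoscaling` field is assumed to sit at the top level of the RayCluster spec:

```bash
# Is in-tree autoscaling enabled on this RayCluster?
kubectl get raycluster <cluster> -o jsonpath='{.spec.enableInTreeAutoscaling}'

# Recent autoscaler decisions and errors.
kubectl logs <head-pod> -c autoscaler | tail -n 50
```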
## Out-of-memory kills

If a container shows `Reason: OOMKilled`, raise its memory request/limit or reduce per-task memory in your Ray code.
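To confirm the kill before changing limits, a sketch; the pod name and container index are placeholders:

```bash
# Reason for the container's last termination; expect "OOMKilled".
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# If confirmed, raise resources.requests.memory / resources.limits.memory
# in the affected group's pod template, or reduce per-task memory in Ray.
```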
## Image pull errors

- For private registries, attach an `imagePullSecret`, as sketched below:
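A sketch, assuming a secret named `regcred` and placeholder registry credentials; the RayCluster field paths are an assumption based on the CRD embedding a standard pod template:

```bash
# Create the registry credential in the cluster's namespace.
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>

# Then reference it from each pod template in the RayCluster spec, e.g.:
#   spec.headGroupSpec.template.spec.imagePullSecrets: [{name: regcred}]
#   (and likewise under each workerGroupSpec)
```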
## Next steps

- Configuration: settings reference.
- Cluster FAQ: general Ray cluster questions.