
KubeRay supports the in-tree Ray autoscaler — the same logic Ray uses on VM clusters, deployed as a sidecar on the head pod.

Enable autoscaling

spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default            # or Aggressive, Conservative
    idleTimeoutSeconds: 60            # scale a node in after 60s of inactivity
    resources:                        # resources for the autoscaler sidecar container
      limits:   { cpu: "1", memory: "1Gi" }
      requests: { cpu: "500m", memory: "512Mi" }
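
For context, the fragment above sits at the top level of a RayCluster spec. A minimal sketch, with a placeholder cluster name and image tag:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaling        # placeholder name
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0    # placeholder image tag
  workerGroupSpecs:
    - groupName: cpu
      replicas: 1
      minReplicas: 0
      maxReplicas: 32
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0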

Worker group bounds

workerGroupSpecs:
  - groupName: cpu
    replicas: 1            # initial count
    minReplicas: 0         # min after scale-in
    maxReplicas: 32        # cap on scale-out

Multiple worker groups

Define one group per node profile (CPU, GPU, high-memory). The autoscaler picks the cheapest group that satisfies a pending resource request.

workerGroupSpecs:
  - groupName: cpu
    minReplicas: 0
    maxReplicas: 50
  - groupName: gpu
    minReplicas: 0
    maxReplicas: 8
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              limits:
                nvidia.com/gpu: 1
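
The autoscaler derives each group's schedulable Ray resources from the resource limits in its pod template (or from rayStartParams overrides such as num-cpus), so give every group explicit resources; otherwise pending tasks can't be matched to the group. A sketch for the cpu group, with illustrative values:

workerGroupSpecs:
  - groupName: cpu
    minReplicas: 0
    maxReplicas: 50
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              limits:
                cpu: "4"         # illustrative; sets the Ray node's CPU capacity
                memory: 16Gi
              requests:
                cpu: "4"
                memory: 16Gi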

Triggers

  • A task or actor with unfulfilled resource requests.
  • An autoscaling Serve deployment whose target_ongoing_requests is exceeded (see the sketch after this list).
  • A placement group that can’t fit on existing nodes.
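
For the Serve trigger, the threshold comes from the deployment's autoscaling config. A sketch of a RayService serveConfigV2 fragment, assuming a hypothetical application and deployment named app and Model:

serveConfigV2: |
  applications:
    - name: app
      import_path: app:deployment_graph   # hypothetical import path
      deployments:
        - name: Model
          autoscaling_config:
            target_ongoing_requests: 2    # add replicas when average ongoing requests per replica exceeds 2
            min_replicas: 1
            max_replicas: 10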

Cool-down

idleTimeoutSeconds controls how long a node must be idle before being terminated. Long timeouts smooth out spiky traffic; short timeouts save money.
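
For example, an interactive workload with spiky traffic might hold idle workers longer; the value here is illustrative:

autoscalerOptions:
  idleTimeoutSeconds: 300   # keep idle workers for 5 minutes before scale-in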

Cluster-autoscaler vs Ray autoscaler

You can run both. The Ray autoscaler asks Kubernetes for new pods; if Kubernetes can’t schedule them on existing nodes, the cluster autoscaler adds nodes from your cloud provider.
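
Note that the cluster autoscaler only reacts to pods that are unschedulable, which in turn depends on the worker pod template declaring resource requests. A sketch, with illustrative values:

workerGroupSpecs:
  - groupName: cpu
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              requests:            # without requests, new pods may pack onto existing nodes
                cpu: "2"
                memory: 8Gi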

Next steps

  • GPU: GPU node pools.
  • Troubleshooting: Why isn’t my cluster scaling?