Prerequisites

  • A Kubernetes cluster with GPU node pools (EKS + p4 nodes, GKE + accelerator nodes, AKS + N-series, etc.).
  • The NVIDIA device plugin or your cloud provider’s GPU operator installed, so that nodes advertise the nvidia.com/gpu resource (a quick check is sketched below).
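
If you are unsure whether the device plugin is working, check that nodes report an allocatable nvidia.com/gpu count. A minimal sketch using the official kubernetes Python client (an extra dependency, not part of Ray; it assumes a reachable kubeconfig):

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable nvidia.com/gpu")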

GPU worker group

workerGroupSpecs:
  - groupName: gpu
    replicas: 0
    minReplicas: 0
    maxReplicas: 8
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb   # GKE example
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: ray-worker
            image: rayproject/ray:2.43.0-gpu
            resources:
              requests:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: 1
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: 1

Image

Use the -gpu variant of the Ray image (CUDA and cuDNN preinstalled), or build your own from an nvidia/cuda:* base image.

Request GPUs from tasks/actors

import ray

@ray.remote(num_gpus=1)
def predict(batch):
    ...

The Ray scheduler matches the request to a GPU worker. If none exists, the autoscaler adds one.
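
Inside the task, Ray restricts the process to its assigned device via CUDA_VISIBLE_DEVICES, and ray.get_gpu_ids() reports which GPU was granted. A minimal sketch (gpu_info is illustrative; run it from a node in the cluster, e.g. as a Ray job):

import os
import ray

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_gpus=1)
def gpu_info():
    # Ray sets CUDA_VISIBLE_DEVICES to the GPU reserved for this task.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(gpu_info.remote()))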

Multi-GPU per actor

@ray.remote(num_gpus=2)
class BigModel:
    ...
Combine with placement groups for tensor-parallel models that span GPUs.
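
A sketch of that pattern, assuming two model shards that each need 2 GPUs (ModelShard, the bundle sizes, and the PACK strategy are illustrative; adjust them to your parallelism):

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Reserve two 2-GPU bundles up front; the autoscaler adds workers until they fit.
pg = placement_group([{"GPU": 2}, {"GPU": 2}], strategy="PACK")
ray.get(pg.ready())  # blocks until the GPUs are reserved

@ray.remote(num_gpus=2)
class ModelShard:
    def gpu_ids(self):
        return ray.get_gpu_ids()

# Pin each shard to the placement group so the GPUs stay co-scheduled.
shards = [
    ModelShard.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    for _ in range(2)
]
print(ray.get([s.gpu_ids.remote() for s in shards]))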

Different GPU types

Tag custom resources to differentiate GPU classes:
rayStartParams:
  resources: '"{\"H100\": 1}"'

@ray.remote(num_gpus=1, resources={"H100": 1})
def heavy_inference(batch):
    ...
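
Once an H100 worker has joined, the custom tag appears alongside GPU in the cluster totals; a quick sketch to confirm before scheduling on it:

import ray

ray.init(address="auto")
# Custom resources show up next to CPU/GPU/memory in the cluster totals.
# With replicas: 0 the tag appears only after the autoscaler adds an H100 worker.
print(ray.cluster_resources().get("H100", 0))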

Verify

kubectl exec -it <head-pod> -- ray status
The output should list GPUs in the cluster’s resource summary.
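
For an end-to-end check from Python, a one-off GPU task can run nvidia-smi on a worker (a sketch; it assumes nvidia-smi is available inside the worker container, which the NVIDIA container runtime normally provides):

import subprocess
import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def smoke_test():
    # Lists the GPUs visible inside the worker container.
    return subprocess.check_output(["nvidia-smi", "-L"]).decode()

print(ray.get(smoke_test.remote()))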

Next steps

  • TPU: Run on Google Cloud TPUs.
  • Autoscaling: Scale GPU pools on demand.