
KubeRay supports Google Cloud TPU pods (v4, v5e, v5p) on GKE.

Prerequisites

  • A GKE cluster with a TPU node pool.
  • A Ray image that includes the TPU runtime (libtpu), or a custom image with libtpu installed.

TPU node pool

Provision a TPU node pool through gcloud:
gcloud container node-pools create tpu-pool \
  --cluster=my-cluster \
  --machine-type=ct5lp-hightpu-4t \
  --node-locations=us-west4-a \
  --num-nodes=1
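
Once the node pool is up, you can confirm the TPU nodes registered with the expected labels (a quick sanity check; this assumes kubectl is pointed at my-cluster):
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice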

Worker group

Add a TPU worker group to the RayCluster spec. The nodeSelector must match the node pool's accelerator type and topology, and the container requests the node's TPU chips through google.com/tpu:

workerGroupSpecs:
  - groupName: tpu
    replicas: 0
    minReplicas: 0
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x2
        containers:
          - name: ray-worker
            image: rayproject/ray:2.43.0
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4
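
For context, a minimal sketch of where this worker group sits in a full RayCluster manifest. It assumes the KubeRay autoscaler is enabled (enableInTreeAutoscaling) so the group can scale between minReplicas and maxReplicas; the cluster name is a placeholder:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: tpu-cluster              # placeholder name
spec:
  enableInTreeAutoscaling: true  # lets the TPU group scale 0 -> 4
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0
  workerGroupSpecs:
    # ... the TPU worker group shown above ...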

Request TPUs from a task

Ray detects the TPU chips on each worker and advertises them as a custom TPU resource, which tasks and actors can reserve:

import ray

@ray.remote(resources={"TPU": 4})
def train_step():
    import jax
    # jax.devices() lists the TPU chips visible to this task.
    print(jax.devices())
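
From the driver, you can check that TPU workers have joined before scheduling work (ray.cluster_resources() is a standard Ray API; the snippet below is just a convenience check):
import ray

# Connect to the running cluster; adjust the address for your setup.
ray.init()
# Autoscaled TPU workers appear under the "TPU" key once they join.
print(ray.cluster_resources().get("TPU", 0))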

JAX example

A complete example that runs a jitted matrix multiply on a TPU worker and returns the result to the driver:

import jax
import ray

@ray.remote(resources={"TPU": 4})
def jax_demo():
    @jax.jit
    def fn(x):
        return x @ x.T
    return fn(jax.numpy.ones((1024, 1024))).sum()

print(ray.get(jax_demo.remote()))

Tips

TPU slices initialize slowly compared to GPU nodes. Set idleTimeoutSeconds higher (e.g., 600) to avoid scale-down churn during interactive use.
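
idleTimeoutSeconds is set under autoscalerOptions in the RayCluster spec; a minimal fragment (600 is just an illustrative value):
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 600  # keep idle TPU workers for 10 minutes before scale-down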

Next steps

  • GPU: the GPU equivalent of this setup.
  • Storage: mount GCS for TPU workflows.