
KubeRay supports Google Cloud TPU pods (v4, v5e, v5p) on GKE.

Prerequisites

  • A GKE cluster with a TPU node pool.
  • A Ray image that includes the TPU runtime (libtpu), or a custom image with libtpu installed.

TPU node pool

Provision a TPU node pool through gcloud:
gcloud container node-pools create tpu-pool \
  --cluster=my-cluster \
  --machine-type=ct5lp-hightpu-4t \
  --node-locations=us-west4-a \
  --num-nodes=1
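
Once the node pool is up, you can confirm the TPU nodes registered with the expected labels (a quick sanity check; this assumes kubectl is pointed at my-cluster):
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice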

Worker group

Add a TPU worker group to the RayCluster spec. The nodeSelector must match the node pool's accelerator type and topology, and the container requests the node's TPU chips through google.com/tpu:

workerGroupSpecs:
  - groupName: tpu
    replicas: 0
    minReplicas: 0
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
          cloud.google.com/gke-tpu-topology: 2x2
        containers:
          - name: ray-worker
            image: rayproject/ray:2.43.0
            resources:
              requests:
                google.com/tpu: 4
              limits:
                google.com/tpu: 4
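
For context, a minimal sketch of where this worker group sits in a full RayCluster manifest. It assumes the KubeRay autoscaler is enabled (enableInTreeAutoscaling) so the group can scale between minReplicas and maxReplicas; the cluster name is a placeholder:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: tpu-cluster              # placeholder name
spec:
  enableInTreeAutoscaling: true  # lets the TPU group scale 0 -> 4
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0
  workerGroupSpecs:
    # ... the TPU worker group shown above ...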

Request TPUs from a task

Ray detects the TPU chips on each worker and advertises them as a custom TPU resource, which tasks and actors can reserve:

import ray

@ray.remote(resources={"TPU": 4})
def train_step():
    import jax
    # jax.devices() lists the TPU chips visible to this task.
    print(jax.devices())
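
From the driver, you can check that TPU workers have joined before scheduling work (ray.cluster_resources() is a standard Ray API; the snippet below is just a convenience check):
import ray

# Connect to the running cluster; adjust the address for your setup.
ray.init()
# Autoscaled TPU workers appear under the "TPU" key once they join.
print(ray.cluster_resources().get("TPU", 0))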

JAX example

A complete example that runs a jitted matrix multiply on a TPU worker and returns the result to the driver:

import jax
import ray

@ray.remote(resources={"TPU": 4})
def jax_demo():
    @jax.jit
    def fn(x):
        return x @ x.T
    return fn(jax.numpy.ones((1024, 1024))).sum()

print(ray.get(jax_demo.remote()))

Tips

TPU slices initialize slowly compared to GPU nodes. Set idleTimeoutSeconds higher (e.g., 600) to avoid scale-down churn during interactive use.
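
idleTimeoutSeconds is set under autoscalerOptions in the RayCluster spec; a minimal fragment (600 is just an illustrative value):
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    idleTimeoutSeconds: 600  # keep idle TPU workers for 10 minutes before scale-down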

Next steps

  • GPU: the GPU equivalent of this setup.
  • Storage: mount GCS for TPU workflows.