A RayJob creates an ephemeral RayCluster, runs a single job on it, and (optionally) tears the cluster down when the job finishes.
## Manifest

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-job
spec:
  entrypoint: python my_script.py
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    rayVersion: "2.43.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.43.0
              resources:
                requests: { cpu: "2", memory: "4Gi" }
                limits: { cpu: "2", memory: "4Gi" }
    workerGroupSpecs:
      - groupName: cpu
        replicas: 2
        minReplicas: 2
        maxReplicas: 4
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.43.0
                resources:
                  requests: { cpu: "4", memory: "8Gi" }
                  limits: { cpu: "4", memory: "8Gi" }
```
Apply the manifest and watch the job:

```shell
kubectl apply -f rayjob.yaml
kubectl get rayjob my-job -w
```
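Once applied, progress can be read from the RayJob status. A sketch, assuming the manifest above; the `.status.jobStatus` field is populated by KubeRay, and the submitter is typically a Kubernetes Job sharing the RayJob's name:

```shell
# Print the job status reported by KubeRay
# (e.g. PENDING, RUNNING, SUCCEEDED, FAILED).
kubectl get rayjob my-job -o jsonpath='{.status.jobStatus}'

# Stream driver logs from the submitter Job's pod.
kubectl logs -f job/my-job
```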
## Provide the script

Mount your code via one of:

- A remote working directory in `runtimeEnvYAML`:

  ```yaml
  runtimeEnvYAML: |
    working_dir: "https://my.bucket.s3.amazonaws.com/job.zip"
    pip:
      - torch==2.1.0
  ```

- A custom Docker image with the script baked in.
- A ConfigMap mounted into the container.
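For the ConfigMap option, a minimal sketch; the ConfigMap name `job-code` and the mount path are illustrative, not prescribed by KubeRay:

```yaml
# First create the ConfigMap from the local script:
#   kubectl create configmap job-code --from-file=my_script.py
# Then mount it into the head pod template (repeat for workers if
# the script must be importable there too):
spec:
  entrypoint: python /home/ray/job/my_script.py
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              volumeMounts:
                - name: job-code
                  mountPath: /home/ray/job
          volumes:
            - name: job-code
              configMap:
                name: job-code
```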
## Resubmit on failure
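A failed job can be retried automatically. A minimal sketch, assuming a KubeRay version that supports `spec.backoffLimit` (the field sets how many additional submission attempts are made before the RayJob is marked failed):

```yaml
spec:
  entrypoint: python my_script.py
  backoffLimit: 3   # retry up to 3 more times on failure
  shutdownAfterJobFinishes: true
```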
## Connect to a long-running cluster

If you don’t want a fresh cluster per job, point the RayJob at an existing RayCluster instead of providing a `rayClusterSpec`:

```yaml
spec:
  clusterSelector:
    ray.io/cluster: my-cluster # use an existing RayCluster
```
## Schedule with Kueue

For batch queuing, install Kueue and reference a local queue from the RayJob’s labels:

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
```
## Next steps

- RayService: long-running serving deployments.
- User guides: storage, GPU, autoscaling, observability.