A RayJob creates an ephemeral RayCluster, runs a single job on it, and (optionally) tears the cluster down when the job finishes.
## Manifest

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-job
spec:
  entrypoint: python my_script.py
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    rayVersion: "2.43.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.43.0
              resources:
                requests: { cpu: "2", memory: "4Gi" }
                limits: { cpu: "2", memory: "4Gi" }
    workerGroupSpecs:
      - groupName: cpu
        replicas: 2
        minReplicas: 2
        maxReplicas: 4
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.43.0
                resources:
                  requests: { cpu: "4", memory: "8Gi" }
                  limits: { cpu: "4", memory: "8Gi" }
```
Apply the manifest and watch the job:

```shell
kubectl apply -f rayjob.yaml
kubectl get rayjob my-job -w
```
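Once applied, progress can be read from the RayJob status. A sketch, assuming the manifest above; the `.status.jobStatus` field is populated by KubeRay, and the submitter is typically a Kubernetes Job sharing the RayJob's name:

```shell
# Print the job status reported by KubeRay
# (e.g. PENDING, RUNNING, SUCCEEDED, FAILED).
kubectl get rayjob my-job -o jsonpath='{.status.jobStatus}'

# Stream driver logs from the submitter Job's pod.
kubectl logs -f job/my-job
```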
## Provide the script

Mount your code via one of:

- A remote working directory in `runtimeEnvYAML`:

  ```yaml
  runtimeEnvYAML: |
    working_dir: "https://my.bucket.s3.amazonaws.com/job.zip"
    pip:
      - torch==2.1.0
  ```

- A custom Docker image with the script baked in.
- A ConfigMap mounted into the container.
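For the ConfigMap option, a minimal sketch; the ConfigMap name `job-code` and the mount path are illustrative, not prescribed by KubeRay:

```yaml
# First create the ConfigMap from the local script:
#   kubectl create configmap job-code --from-file=my_script.py
# Then mount it into the head pod template (repeat for workers if
# the script must be importable there too):
spec:
  entrypoint: python /home/ray/job/my_script.py
  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              volumeMounts:
                - name: job-code
                  mountPath: /home/ray/job
          volumes:
            - name: job-code
              configMap:
                name: job-code
```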
## Resubmit on failure
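A failed job can be retried automatically. A minimal sketch, assuming a KubeRay version that supports `spec.backoffLimit` (the field sets how many additional submission attempts are made before the RayJob is marked failed):

```yaml
spec:
  entrypoint: python my_script.py
  backoffLimit: 3   # retry up to 3 more times on failure
  shutdownAfterJobFinishes: true
```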
## Connect to a long-running cluster

If you don’t want a fresh cluster per job, point the RayJob at an existing RayCluster instead of providing a `rayClusterSpec`:

```yaml
spec:
  clusterSelector:
    ray.io/cluster: my-cluster # use an existing RayCluster
```
## Schedule with Kueue

For batch queuing, install Kueue and reference a local queue from the RayJob’s labels:

```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
```
## Next steps

- RayService: long-running serving deployments.
- User guides: storage, GPU, autoscaling, observability.