This guide collects the practices that show up in mature Ray Serve deployments.
## Deployment options
| Option | When to use |
|---|---|
| `serve deploy <config.yaml>` against a long-running Ray cluster | Self-managed clusters; quick deploys; no Kubernetes. |
| RayService on Kubernetes (KubeRay) | Production. Declarative config, rolling upgrades, integration with ingress controllers. |
| Anyscale Platform | Managed Ray with built-in observability and autoscaling. |
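For the first option, `serve deploy` takes a config file; a minimal sketch (the module path and application name are placeholders):

```yaml
# config.yaml -- deployed with: serve deploy config.yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000
applications:
  - name: app
    import_path: app.main:app   # illustrative "module:variable" path
    route_prefix: /
```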
## Health checks

Configure liveness and readiness probes to point at Serve's `/-/healthz` and `/-/routes` endpoints, and define `check_health` methods on your deployments for application-level health.
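Assuming the Serve proxy serves HTTP on its default port 8000, the probes might look like this sketch (port and timings are illustrative, not prescriptive):

```yaml
# Probe configuration for a container running the Serve proxy.
readinessProbe:
  httpGet:
    path: /-/routes
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /-/healthz
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```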
## Rolling updates

Update the `serveConfigV2` field in your RayService manifest (or push a new YAML via `serve deploy`). KubeRay performs a zero-downtime rolling update.
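The field being updated sits here in the RayService manifest; a trimmed sketch (real manifests also define the cluster spec, elided below):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-service
spec:
  serveConfigV2: |
    applications:
      - name: app
        import_path: app.main:app
        route_prefix: /
  # rayClusterConfig: ... (head and worker group specs)
```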
## Observability
- Metrics: Serve exports per-deployment QPS, latency, replica count, queue depth, and errors. Scrape with Prometheus and visualize in Grafana.
- Logs: Each replica logs to the Ray log directory; aggregate via fluentbit or your platform’s log collector.
- Tracing: Use OpenTelemetry middleware on your FastAPI app for distributed traces.
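For the metrics bullet, a Prometheus scrape-config sketch, assuming Ray's default metrics export port of 8080 (verify against your cluster's `--metrics-export-port`; hostnames are illustrative):

```yaml
scrape_configs:
  - job_name: ray
    static_configs:
      - targets: ["head-node:8080", "worker-node:8080"]
```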
## Capacity planning
- Establish per-replica latency and throughput on representative traffic.
- Set `max_ongoing_requests` to the maximum concurrency a replica handles before latency degrades.
- Set autoscaling `target_ongoing_requests` to ~50% of `max_ongoing_requests`.
- Provision enough nodes so the autoscaler can reach `max_replicas` for peak traffic.
- Add headroom (`downscale_delay_s`) to tolerate brief spikes.
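As a back-of-envelope check on these knobs (a sketch, not part of Serve's API; the function name is mine), Little's law ties them together: steady-state concurrency ≈ arrival rate × latency, and dividing by the per-replica target gives the replica count the autoscaler must be able to reach.

```python
import math

def replicas_needed(peak_qps: float, latency_s: float,
                    target_ongoing_requests: float) -> int:
    """Little's law: in-flight requests ~= arrival rate * latency.

    Dividing that concurrency by the per-replica target yields the
    replica count the autoscaler must reach at peak.
    """
    concurrency = peak_qps * latency_s
    return max(1, math.ceil(concurrency / target_ongoing_requests))

# 500 QPS at 200 ms per request with target_ongoing_requests=4:
print(replicas_needed(500, 0.2, 4))  # -> 25
```

If the result is close to `max_replicas`, there is no headroom for spikes; raise the cap or provision more nodes.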
## Failure handling
- Replica crash: Ray restarts the replica automatically. The proxy routes around it during the restart.
- Node loss: Replicas on the lost node are recreated elsewhere. Cluster autoscaler adds capacity if needed.
- Head node loss: Without GCS fault tolerance the cluster terminates. With external GCS storage, the head restarts and resumes.
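Enabling GCS fault tolerance in KubeRay has historically meant pointing the head at an external Redis; the annotation and environment variable below reflect that older pattern and may differ in newer KubeRay versions, so treat this as a sketch:

```yaml
metadata:
  annotations:
    ray.io/ft-enabled: "true"   # GCS fault-tolerance switch (older KubeRay)
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              env:
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379   # external GCS storage
```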
## Security
- Use RayService with TLS termination at the ingress.
- Enable authentication in your FastAPI app (or via an auth proxy).
- Restrict the Ray dashboard’s exposure on production clusters.
## Cost control
- Use spot/preemptible workers for stateless inference; configure long enough graceful-shutdown periods.
- Right-size your replica resource requests; over-allocating GPUs wastes money fast.
- Set `max_replicas` carefully; autoscaling can hide accidental traffic floods that double your bill.
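The shutdown and sizing knobs above live per deployment in the Serve config; a hedged fragment (deployment name and values are illustrative):

```yaml
applications:
  - name: app
    import_path: app.main:app
    deployments:
      - name: Model
        ray_actor_options:
          num_gpus: 1                     # request only what a replica uses
        graceful_shutdown_timeout_s: 60   # ride out spot preemption drains
        autoscaling_config:
          min_replicas: 1
          max_replicas: 10                # cap the bill even under a flood
```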
## Next steps

- **Monitoring**: metric and log details.
- **Multi-app**: run independent applications side by side.