This guide collects the practices that show up in mature Ray Serve deployments.
## Deployment options
| Option | When to use |
|---|---|
| `serve deploy <config.yaml>` against a long-running Ray cluster | Self-managed clusters; quick deploys; no Kubernetes. |
| RayService on Kubernetes (KubeRay) | Production. Declarative config, rolling upgrades, integration with ingress controllers. |
| Anyscale Platform | Managed Ray with built-in observability and autoscaling. |
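For the first option, `serve deploy` takes a config file; a minimal sketch (the module path and application name are placeholders):

```yaml
# config.yaml -- deployed with: serve deploy config.yaml
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000
applications:
  - name: app
    import_path: app.main:app   # illustrative "module:variable" path
    route_prefix: /
```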
## Health checks

Configure liveness and readiness probes to point at Serve's `/-/healthz` and `/-/routes` endpoints, and define `check_health` methods on your deployments for application-level health.
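Assuming the Serve proxy serves HTTP on its default port 8000, the probes might look like this sketch (port and timings are illustrative, not prescriptive):

```yaml
# Probe configuration for a container running the Serve proxy.
readinessProbe:
  httpGet:
    path: /-/routes
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /-/healthz
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```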
## Rolling updates

Update the `serveConfigV2` field in your RayService manifest (or push a new YAML via `serve deploy`). KubeRay performs a zero-downtime rolling update.
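The field being updated sits here in the RayService manifest; a trimmed sketch (real manifests also define the cluster spec, elided below):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: my-service
spec:
  serveConfigV2: |
    applications:
      - name: app
        import_path: app.main:app
        route_prefix: /
  # rayClusterConfig: ... (head and worker group specs)
```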
## Observability
- Metrics: Serve exports per-deployment QPS, latency, replica count, queue depth, and errors. Scrape with Prometheus and visualize in Grafana.
- Logs: Each replica logs to the Ray log directory; aggregate via fluentbit or your platform’s log collector.
- Tracing: Use OpenTelemetry middleware on your FastAPI app for distributed traces.
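For the metrics bullet, a Prometheus scrape-config sketch, assuming Ray's default metrics export port of 8080 (verify against your cluster's `--metrics-export-port`; hostnames are illustrative):

```yaml
scrape_configs:
  - job_name: ray
    static_configs:
      - targets: ["head-node:8080", "worker-node:8080"]
```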
## Capacity planning
- Establish per-replica latency and throughput on representative traffic.
- Set `max_ongoing_requests` to the maximum concurrency a replica handles before latency degrades.
- Set autoscaling `target_ongoing_requests` to ~50% of `max_ongoing_requests`.
- Provision enough nodes so the autoscaler can reach `max_replicas` for peak traffic.
- Add headroom (`downscale_delay_s`) to tolerate brief spikes.
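As a back-of-envelope check on these knobs (a sketch, not part of Serve's API; the function name is mine), Little's law ties them together: steady-state concurrency ≈ arrival rate × latency, and dividing by the per-replica target gives the replica count the autoscaler must be able to reach.

```python
import math

def replicas_needed(peak_qps: float, latency_s: float,
                    target_ongoing_requests: float) -> int:
    """Little's law: in-flight requests ~= arrival rate * latency.

    Dividing that concurrency by the per-replica target yields the
    replica count the autoscaler must reach at peak.
    """
    concurrency = peak_qps * latency_s
    return max(1, math.ceil(concurrency / target_ongoing_requests))

# 500 QPS at 200 ms per request with target_ongoing_requests=4:
print(replicas_needed(500, 0.2, 4))  # -> 25
```

If the result is close to `max_replicas`, there is no headroom for spikes; raise the cap or provision more nodes.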
## Failure handling
- Replica crash: Ray restarts the replica automatically. The proxy routes around it during the restart.
- Node loss: Replicas on the lost node are recreated elsewhere. Cluster autoscaler adds capacity if needed.
- Head node loss: Without GCS fault tolerance the cluster terminates. With external GCS storage, the head restarts and resumes.
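Enabling GCS fault tolerance in KubeRay has historically meant pointing the head at an external Redis; the annotation and environment variable below reflect that older pattern and may differ in newer KubeRay versions, so treat this as a sketch:

```yaml
metadata:
  annotations:
    ray.io/ft-enabled: "true"   # GCS fault-tolerance switch (older KubeRay)
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              env:
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379   # external GCS storage
```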
## Security
- Use RayService with TLS termination at the ingress.
- Enable authentication in your FastAPI app (or via an auth proxy).
- Restrict the Ray dashboard’s exposure on production clusters.
## Cost control
- Use spot/preemptible workers for stateless inference; configure long enough graceful-shutdown periods.
- Right-size your replica resource requests; over-allocating GPUs wastes money fast.
- Set `max_replicas` carefully; autoscaling can hide accidental traffic floods that double your bill.
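The shutdown and sizing knobs above live per deployment in the Serve config; a hedged fragment (deployment name and values are illustrative):

```yaml
applications:
  - name: app
    import_path: app.main:app
    deployments:
      - name: Model
        ray_actor_options:
          num_gpus: 1                     # request only what a replica uses
        graceful_shutdown_timeout_s: 60   # ride out spot preemption drains
        autoscaling_config:
          min_replicas: 1
          max_replicas: 10                # cap the bill even under a flood
```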
## Next steps

- **Monitoring**: metric and log details.
- **Multi-app**: run independent applications side by side.