This guide collects the practices that show up in mature Ray Serve deployments.

Deployment options

| Option | When to use |
| --- | --- |
| serve deploy <config.yaml> against a long-running Ray cluster | Self-managed clusters; quick deploys; no Kubernetes. |
| RayService on Kubernetes (KubeRay) | Production. Declarative config, rolling upgrades, integration with ingress controllers. |
| Anyscale Platform | Managed Ray with built-in observability and autoscaling. |

Health checks

Configure liveness and readiness probes to point at Serve’s /-/healthz and /-/routes:
```yaml
livenessProbe:
  httpGet: { path: /-/healthz, port: 8000 }
readinessProbe:
  httpGet: { path: /-/routes, port: 8000 }
```
Add per-deployment check_health methods for application-level health.
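
A minimal sketch of an application-level check; the failure condition here is a placeholder for whatever dependency your replica actually relies on:

```python
from ray import serve

@serve.deployment(
    health_check_period_s=10,   # how often Serve calls check_health
    health_check_timeout_s=30,  # mark unhealthy if the check hangs this long
)
class Model:
    def __init__(self):
        self.dependency_ok = True  # placeholder for a real dependency probe

    async def check_health(self):
        # Raising an exception marks this replica unhealthy; Serve
        # restarts it and routes traffic around it in the meantime.
        if not self.dependency_ok:
            raise RuntimeError("replica dependency unavailable")

    async def __call__(self, request) -> str:
        return "ok"

app = Model.bind()
```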

Rolling updates

Update the serveConfigV2 block in your RayService manifest and reapply it; KubeRay performs a zero-downtime rolling update. On a self-managed cluster, push the updated YAML with serve deploy instead.

Observability

  • Metrics: Serve exports per-deployment QPS, latency, replica count, queue depth, and errors. Scrape with Prometheus and visualize in Grafana.
  • Logs: Each replica writes to the Ray log directory on its node; aggregate them with Fluent Bit or your platform’s log collector.
  • Tracing: Add OpenTelemetry middleware to your FastAPI ingress for distributed traces, as sketched below.
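
A minimal tracing sketch, assuming the opentelemetry-instrumentation-fastapi package is installed and an OTel exporter/TracerProvider is configured elsewhere; the class and route names are illustrative:

```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from ray import serve

app = FastAPI()
# Wrap the app in OpenTelemetry ASGI middleware so each HTTP request
# through this ingress produces a span (exporter setup omitted here).
FastAPIInstrumentor.instrument_app(app)

@serve.deployment
@serve.ingress(app)
class Ingress:
    @app.get("/predict")
    async def predict(self) -> dict:
        return {"result": "ok"}

serve_app = Ingress.bind()
```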

Capacity planning

  1. Establish per-replica latency and throughput on representative traffic.
  2. Set max_ongoing_requests to the maximum concurrency a replica handles before latency degrades.
  3. Set autoscaling target_ongoing_requests to ~50% of max_ongoing_requests.
  4. Provision enough nodes so the autoscaler can reach max_replicas for peak traffic.
  5. Set a downscale delay (downscale_delay_s) so replicas are not torn down during brief lulls, keeping headroom for spikes. The sketch below ties these knobs together.
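
A sketch combining these settings on one deployment; the numbers are illustrative placeholders, not recommendations:

```python
from ray import serve

@serve.deployment(
    # Step 2: cap per-replica concurrency where latency starts to degrade.
    max_ongoing_requests=10,
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 20,  # step 4: the cluster must have room for this
        # Step 3: target roughly half of max_ongoing_requests.
        "target_ongoing_requests": 5,
        # Step 5: delay scale-down so brief lulls don't shed capacity.
        "downscale_delay_s": 300,
    },
)
class Model:
    async def __call__(self, request) -> str:
        return "ok"

app = Model.bind()
```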

Failure handling

  • Replica crash: Ray restarts the replica automatically. The proxy routes around it during the restart.
  • Node loss: Replicas on the lost node are recreated elsewhere. Cluster autoscaler adds capacity if needed.
  • Head node loss: Without GCS fault tolerance, the cluster goes down with the head. With external GCS storage (e.g. Redis), the head restarts and the cluster resumes serving.

Security

  • Use RayService with TLS termination at the ingress.
  • Enable authentication in your FastAPI app (or via an auth proxy); a sketch follows this list.
  • Restrict the Ray dashboard’s exposure on production clusters.
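
One way to gate requests in-app, as an illustration only; the hard-coded header check below is a placeholder for a real scheme such as JWT validation:

```python
from fastapi import Depends, FastAPI, Header, HTTPException
from ray import serve

app = FastAPI()

async def require_token(authorization: str = Header(default="")):
    # Placeholder check; substitute real token validation.
    if authorization != "Bearer my-secret-token":
        raise HTTPException(status_code=401, detail="unauthorized")

@serve.deployment
@serve.ingress(app)
class SecureIngress:
    @app.get("/predict", dependencies=[Depends(require_token)])
    async def predict(self) -> dict:
        return {"result": "ok"}

serve_app = SecureIngress.bind()
```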

Cost control

  • Use spot/preemptible workers for stateless inference, and set graceful-shutdown periods long enough to drain in-flight requests (see the sketch after this list).
  • Right-size your replica resource requests; over-allocating GPUs wastes money fast.
  • Set max_replicas carefully; autoscaling can quietly absorb an accidental traffic flood and double your bill.
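
A sketch of the graceful-shutdown knobs mentioned above; the values are illustrative and should be sized to your request latency and spot reclaim notice:

```python
from ray import serve

@serve.deployment(
    # Give in-flight requests time to finish when a spot node is reclaimed.
    graceful_shutdown_timeout_s=120,
    # Interval at which a draining replica re-checks for remaining requests.
    graceful_shutdown_wait_loop_s=2,
)
class Model:
    async def __call__(self, request) -> str:
        return "ok"

app = Model.bind()
```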

Next steps

  • Monitoring: metric and log details.
  • Multi-app: run independent applications side by side.