Each deployment in Ray Serve can autoscale independently. The autoscaler observes per-replica request load and adjusts the replica count to keep load near a target.
Enable autoscaling
`target_ongoing_requests` is the desired number of in-flight requests per replica. The autoscaler scales out when the observed per-replica value is above this target and scales in when it is below.
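The rule above can be sketched in plain Python. This is a simplified model for building intuition, not Ray Serve's actual implementation: the `desired_replicas` helper and the numeric values are hypothetical. The dict uses the field names from this page; in Ray Serve, a dict of this shape is passed as the `autoscaling_config` argument of `@serve.deployment`.

```python
import math

# Hypothetical values for illustration; in Ray Serve a dict like this is
# passed as the `autoscaling_config` argument of `@serve.deployment`.
autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_ongoing_requests": 5,
}


def desired_replicas(total_ongoing_requests: float, config: dict) -> int:
    """Simplified model: replicas needed to hold per-replica load at the target."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    # Clamp the result to the configured bounds.
    return max(config["min_replicas"], min(config["max_replicas"], raw))


print(desired_replicas(40, autoscaling_config))   # 40 in flight / target 5 -> 8
print(desired_replicas(3, autoscaling_config))    # light load -> min_replicas (1)
print(desired_replicas(500, autoscaling_config))  # heavy load -> max_replicas (10)
```

With 40 requests in flight and a target of 5 per replica, the model asks for 8 replicas. Fewer replicas would push each one above the target, which is the condition that triggers a scale-out.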
Full configuration
| Field | Effect |
|---|---|
| `min_replicas` / `max_replicas` | Bounds for the autoscaler. |
| `initial_replicas` | Starting replica count when the deployment first comes up. |
| `target_ongoing_requests` | Per-replica target for in-flight requests. |
| `metrics_interval_s` | How often replicas report load to the controller. |
| `look_back_period_s` | Window of recent load to consider. |
| `smoothing_factor` | Multiplier on the scaling delta. Values below 1 dampen, values above 1 amplify. |
| `upscale_delay_s` / `downscale_delay_s` | Cool-down between scaling actions. |
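To see how the look-back window and `smoothing_factor` might interact, here is a minimal sketch under stated assumptions. It averages the load samples from the window, computes the replica count that would hit the target, and multiplies the change by the smoothing factor. The `smoothed_replicas` function and its inputs are hypothetical, the reporting interval and scaling delays are not modeled, and the real controller logic may differ in detail.

```python
import math


def smoothed_replicas(
    current_replicas: int,
    recent_total_requests: list[float],  # load samples inside the look-back window
    target_ongoing_requests: float,
    smoothing_factor: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Illustrative only: average the window, then scale the replica delta."""
    # Average total in-flight requests over the look-back window.
    avg_total = sum(recent_total_requests) / len(recent_total_requests)
    # Replica count that would put per-replica load exactly on target.
    ideal = avg_total / target_ongoing_requests
    # Apply the smoothing factor to the change, not to the absolute count.
    proposed = current_replicas + smoothing_factor * (ideal - current_replicas)
    # Round up and clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, math.ceil(proposed)))


samples = [60.0, 80.0, 100.0]  # hypothetical window; average is 80
for factor in (0.5, 1.0, 2.0):
    print(factor, smoothed_replicas(4, samples, 5, factor, 1, 20))
```

Starting from 4 replicas with an ideal count of 16, a factor of 0.5 moves halfway to 10 and a factor of 1.0 goes straight to 16. A factor of 2.0 overshoots to 28, which `max_replicas` caps at 20.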
Cluster autoscaler integration
When deployment scaling outpaces the available cluster resources, Ray’s cluster autoscaler kicks in to add nodes. On Kubernetes, ensure the RayCluster is configured with autoscaling enabled and worker node groups that can satisfy your deployment’s resource requests.
Tuning tips
Next steps
Configure deployment: All deployment-level options.
Production guide: Capacity planning end-to-end.