Autoscaling

Enable autoscaling
Full configuration
Cluster autoscaler integration
Tuning tips
Next steps

Each deployment in Ray Serve can autoscale independently. The autoscaler observes per-replica request load and adjusts the replica count to keep load near a target.

Enable autoscaling

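from ray import serve


# num_replicas="auto" enables autoscaling; autoscaling_config overrides the defaults.
@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,  # lower bound on replicas
        "max_replicas": 32,  # upper bound on replicas
        "target_ongoing_requests": 5,  # desired in-flight requests per replica
    },
)
class Service:
    ...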

target_ongoing_requests is the desired number of in-flight requests per replica. The autoscaler scales out when the observed value is higher and scales in when it’s lower.
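To make the scale-out and scale-in rule concrete, here is a minimal sketch of a target-tracking calculation. It is a simplified model, not Ray Serve's actual implementation; the function name `desired_replicas` and its parameter names are hypothetical and chosen only for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_ongoing_per_replica: float,
    target_ongoing_requests: float,
    min_replicas: int = 1,
    max_replicas: int = 32,
) -> int:
    """Simplified target-tracking rule; illustrative, not Ray Serve's code."""
    # Total in-flight requests across the whole deployment.
    total_ongoing = current_replicas * observed_ongoing_per_replica
    # Replicas needed so that each one carries roughly the target load.
    needed = math.ceil(total_ongoing / target_ongoing_requests)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


# Observed load is twice the target, so the sketch asks for more replicas.
print(desired_replicas(current_replicas=4, observed_ongoing_per_replica=10,
                       target_ongoing_requests=5))
# Observed load is half the target, so the sketch asks for fewer replicas.
print(desired_replicas(current_replicas=4, observed_ongoing_per_replica=2.5,
                       target_ongoing_requests=5))
```

In this model, 4 replicas each seeing 10 in-flight requests against a target of 5 yield 8 replicas, and the same 4 replicas seeing 2.5 each yield 2.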

Full configuration

autoscaling_config={
    "min_replicas": 1,
    "max_replicas": 32,
    "initial_replicas": 4,
    "target_ongoing_requests": 5,
    "metrics_interval_s": 10,
    "look_back_period_s": 30,
    "smoothing_factor": 1.0,
    "downscale_delay_s": 600,
    "upscale_delay_s": 30,
}

Field	Effect
`min_replicas` / `max_replicas`	Bounds for the autoscaler.
`initial_replicas`	Starting replica count when the deployment first comes up.
`target_ongoing_requests`	Per-replica target for in-flight requests.
`metrics_interval_s`	How often replicas report load to the controller.
`look_back_period_s`	Window of recent load to consider.
`smoothing_factor`	Multiplier on the scaling delta. Values below 1 dampen, values above 1 amplify.
`upscale_delay_s` / `downscale_delay_s`	How long load must stay above (or below) the target before the autoscaler scales up (or down).
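The effect of `smoothing_factor` is easier to see with numbers. The sketch below assumes the factor scales the gap between the observed load ratio and 1.0 before the replica count is computed. That is one reading of "multiplier on the scaling delta", and the helper name `smoothed_desired_replicas` is hypothetical rather than part of the Ray Serve API.

```python
import math


def smoothed_desired_replicas(
    current_replicas: int,
    observed_ongoing_per_replica: float,
    target_ongoing_requests: float,
    smoothing_factor: float = 1.0,
) -> int:
    """Illustrative model of smoothing; an assumption, not Ray Serve's code."""
    # Ratio > 1 means replicas are overloaded, < 1 means underloaded.
    error_ratio = observed_ongoing_per_replica / target_ongoing_requests
    # Scale the distance from 1.0: factors below 1 dampen, above 1 amplify.
    smoothed_ratio = 1 + (error_ratio - 1) * smoothing_factor
    # Never return fewer than one replica in this sketch.
    return max(1, math.ceil(current_replicas * smoothed_ratio))


# Same load (2x the target on 4 replicas) under three smoothing factors.
for factor in (0.5, 1.0, 2.0):
    print(factor, smoothed_desired_replicas(4, 10, 5, smoothing_factor=factor))
```

Under these assumptions, a factor of 0.5 moves from 4 to 6 replicas, 1.0 moves to 8, and 2.0 moves to 12, before any `min_replicas` / `max_replicas` clamping.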

Cluster autoscaler integration

When deployment scaling outpaces the available cluster resources, Ray’s cluster autoscaler kicks in to add nodes. On Kubernetes, ensure the RayCluster is configured with autoscaling enabled and worker node groups that can satisfy your deployment’s resource requests.

Tuning tips

Start with target_ongoing_requests set to half the per-replica concurrency limit (max_ongoing_requests). This leaves headroom for spikes.
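As a quick worked example of the halving rule, the sketch below derives a starting target from a concurrency limit. The helper `starting_target` and the value of 10 for `max_ongoing_requests` are hypothetical, chosen only to show the arithmetic.

```python
def starting_target(max_ongoing_requests: int) -> int:
    """Half the per-replica concurrency limit, but never below 1."""
    return max(1, max_ongoing_requests // 2)


# Hypothetical per-replica concurrency limit for this example.
max_ongoing_requests = 10

autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 32,
    # 10 // 2 = 5: each replica can absorb a burst of up to twice the target.
    "target_ongoing_requests": starting_target(max_ongoing_requests),
}

print(autoscaling_config["target_ongoing_requests"])
```

The resulting dictionary can be passed as `autoscaling_config` to the deployment; treat the value as a starting point and adjust it against measured latency.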

A long downscale_delay_s keeps spare capacity around (good for spiky traffic, expensive otherwise). Start at 600 seconds and shrink if your traffic pattern allows.
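The trade-off can be illustrated with a small model of a delay gate. The sketch assumes a scale-in is allowed only after load has stayed below target for the full `downscale_delay_s`, and that any return to target resets the timer. The class `DownscaleGate` is hypothetical and is not Ray Serve's implementation.

```python
from typing import Optional


class DownscaleGate:
    """Illustrative delay gate; an assumption about behavior, not Ray's code."""

    def __init__(self, downscale_delay_s: float = 600.0) -> None:
        self.downscale_delay_s = downscale_delay_s
        # Timestamp at which load first dropped below target, if it has.
        self._below_since: Optional[float] = None

    def allow_downscale(self, now_s: float, below_target: bool) -> bool:
        if not below_target:
            # Load recovered: reset the timer and keep the spare capacity.
            self._below_since = None
            return False
        if self._below_since is None:
            self._below_since = now_s
        # Permit scale-in only after a sustained quiet period.
        return now_s - self._below_since >= self.downscale_delay_s


gate = DownscaleGate(downscale_delay_s=600)
# A spike at t=400 resets the timer, so replicas survive until t=1100.
for t, below in [(0, True), (300, True), (400, False), (500, True), (1100, True)]:
    print(t, gate.allow_downscale(t, below))
```

With spiky traffic the timer keeps resetting, so replicas stay warm between bursts. With steady low traffic the same delay is simply 600 seconds of paid idle capacity, which is the case for shrinking it.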

Next steps

Configure deployment

All deployment-level options.

Production guide

Capacity planning end-to-end.
