Each deployment in Ray Serve can autoscale independently. The autoscaler observes per-replica request load and adjusts the replica count to keep load near a target.

Enable autoscaling

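from ray import serve

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,   # lower bound on replicas
        "max_replicas": 32,  # upper bound on replicas
        "target_ongoing_requests": 5,  # desired in-flight requests per replica
    },
)
class Service:
    ...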
target_ongoing_requests is the desired number of in-flight requests per replica. The autoscaler scales out when the observed value is higher and scales in when it’s lower.
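
To make the scale-out and scale-in rule concrete, here is a minimal sketch of the arithmetic, assuming a simplified model in which the autoscaler picks just enough replicas for each one to carry about the target load. The function name `desired_replicas` is hypothetical and not part of Ray Serve, and the real autoscaler applies further smoothing and delays.

```python
import math


def desired_replicas(total_ongoing_requests, target_ongoing_requests,
                     min_replicas, max_replicas):
    """Simplified model (an assumption, not Ray Serve's code): choose enough
    replicas that each one carries roughly the target number of requests."""
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# 40 in-flight requests at a target of 5 per replica
print(desired_replicas(40, 5, min_replicas=1, max_replicas=32))   # 8
# Very light load falls back to min_replicas
print(desired_replicas(2, 5, min_replicas=1, max_replicas=32))    # 1
# Heavy load is capped at max_replicas
print(desired_replicas(500, 5, min_replicas=1, max_replicas=32))  # 32
```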

Full configuration

autoscaling_config={
    "min_replicas": 1,
    "max_replicas": 32,
    "initial_replicas": 4,
    "target_ongoing_requests": 5,
    "metrics_interval_s": 10,
    "look_back_period_s": 30,
    "smoothing_factor": 1.0,
    "downscale_delay_s": 600,
    "upscale_delay_s": 30,
}
| Field | Effect |
| --- | --- |
| min_replicas / max_replicas | Lower and upper bounds for the autoscaler. |
| initial_replicas | Starting replica count when the deployment first comes up. |
| target_ongoing_requests | Per-replica target for in-flight requests. |
| metrics_interval_s | How often, in seconds, replicas report load to the controller. |
| look_back_period_s | Window of recent load, in seconds, that the autoscaler averages over. |
| smoothing_factor | Multiplier on the scaling delta. Values below 1 dampen, values above 1 amplify. |
| upscale_delay_s / downscale_delay_s | Delay, in seconds, before the autoscaler acts on a scale-up or scale-down decision. |
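
The interaction between look_back_period_s and smoothing_factor can be sketched as follows. This is an illustrative model only: the helper names `windowed_average` and `smoothed_replicas` are hypothetical, and the formula (scale the deviation from the target by the smoothing factor) is an assumption about how a multiplier on the scaling delta might behave, not a copy of Ray Serve's implementation.

```python
import math


def windowed_average(samples, now_s, look_back_period_s):
    """Average per-replica load over samples no older than the look-back window.

    `samples` is a list of (timestamp_s, ongoing_requests_per_replica) pairs.
    """
    recent = [load for t, load in samples if now_s - t <= look_back_period_s]
    return sum(recent) / len(recent) if recent else 0.0


def smoothed_replicas(current_replicas, avg_load, target_ongoing_requests,
                      smoothing_factor=1.0):
    """Assumed model: scale the deviation from the target by smoothing_factor."""
    error_ratio = avg_load / target_ongoing_requests
    smoothed_ratio = 1 + (error_ratio - 1) * smoothing_factor
    return max(1, math.ceil(current_replicas * smoothed_ratio))


samples = [(0, 20.0), (10, 8.0), (20, 10.0), (30, 12.0)]
avg = windowed_average(samples, now_s=40, look_back_period_s=30)
print(avg)  # 10.0 (the sample at t=0 is outside the window)

# Load is twice the target of 5, starting from 4 replicas:
print(smoothed_replicas(4, avg, 5, smoothing_factor=1.0))  # 8
print(smoothed_replicas(4, avg, 5, smoothing_factor=0.5))  # 6 (dampened)
print(smoothed_replicas(4, avg, 5, smoothing_factor=2.0))  # 12 (amplified)
```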

Cluster autoscaler integration

When deployment scaling outpaces the available cluster resources, Ray’s cluster autoscaler kicks in to add nodes. On Kubernetes, ensure the RayCluster is configured with autoscaling enabled and worker node groups that can satisfy your deployment’s resource requests.
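
As a rough capacity check, the arithmetic below estimates how many extra worker nodes a given replica count would require. It is a back-of-the-envelope sketch under stated assumptions: replicas request only CPUs, nodes are identical, and bin-packing, memory, and GPUs are ignored. The function name `extra_nodes_needed` and all numbers are hypothetical, and this is not how the cluster autoscaler itself computes demand.

```python
import math


def extra_nodes_needed(replicas, cpus_per_replica, cluster_cpus, cpus_per_node):
    """Estimate additional identical nodes needed to fit all replicas."""
    required_cpus = replicas * cpus_per_replica
    shortfall = max(0, required_cpus - cluster_cpus)
    return math.ceil(shortfall / cpus_per_node)


# 32 replicas at 2 CPUs each on a 16-CPU cluster with 8-CPU worker nodes
print(extra_nodes_needed(32, 2, cluster_cpus=16, cpus_per_node=8))  # 6
# 4 replicas already fit, so no new nodes are needed
print(extra_nodes_needed(4, 2, cluster_cpus=16, cpus_per_node=8))   # 0
```

If the estimate at max_replicas exceeds what your worker node groups are allowed to grow to, replicas will stay pending until capacity is available.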

Tuning tips

Start with target_ongoing_requests at about half the per-replica concurrency limit (max_ongoing_requests). This leaves headroom for spikes.
A long downscale_delay_s keeps spare capacity around, which helps with spiky traffic but costs more when traffic is steady. Start at 600 seconds and shorten it if your traffic pattern allows.
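
The tips above can be turned into a starting configuration with a small helper. This is a sketch, not a recommended default: `starting_autoscaling_config` is a hypothetical function, and the bounds are placeholders to adjust for your workload.

```python
def starting_autoscaling_config(max_ongoing_requests, min_replicas=1,
                                max_replicas=32):
    """Build a starting autoscaling_config dict from the tuning tips."""
    return {
        "min_replicas": min_replicas,
        "max_replicas": max_replicas,
        # Half the per-replica concurrency limit, but never below 1.
        "target_ongoing_requests": max(1, max_ongoing_requests // 2),
        # Conservative starting point; shorten if traffic is steady.
        "downscale_delay_s": 600,
    }


config = starting_autoscaling_config(max_ongoing_requests=10)
print(config["target_ongoing_requests"])  # 5
```

The resulting dict can be passed as autoscaling_config to @serve.deployment, alongside the same max_ongoing_requests value.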

Next steps

Configure deployment: All deployment-level options.
Production guide: Capacity planning end-to-end.