Configure a Deployment

Replicas
Resources
Concurrency
Batching
Health check
Reconfigure
Logging
Graceful shutdown
YAML override
Next steps

The @serve.deployment decorator accepts options that control replica count, resource allocation, autoscaling, batching, and lifecycle hooks. Override any of them at deploy time via YAML.

Replicas

@serve.deployment(num_replicas=4)
class Service:
    ...

For autoscaling, use num_replicas="auto" and provide an autoscaling_config:

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={"min_replicas": 1, "max_replicas": 16, "target_ongoing_requests": 5},
)
class Service:
    ...

Resources

@serve.deployment(ray_actor_options={"num_cpus": 2, "num_gpus": 1, "memory": 4 * 1024**3})
class GPUService:
    ...

ray_actor_options follows the same shape as @ray.remote(...). Custom resources work too:

ray_actor_options={"resources": {"high_memory": 1}}

Concurrency

@serve.deployment(max_ongoing_requests=10)
class Service:
    ...

Caps the number of in-flight requests per replica. Excess requests queue.

Batching

@serve.deployment
class BatchedService:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def __call__(self, requests):
        return [self.process(r) for r in requests]

@serve.batch collects up to max_batch_size requests (or waits batch_wait_timeout_s seconds, whichever is sooner) and calls the method once with the list.

Health check

@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class Service:
    def check_health(self):
        if not self._is_ready:
            raise RuntimeError("not ready")

check_health runs periodically. If it raises, the replica is restarted.

Reconfigure

@serve.deployment
class Service:
    def reconfigure(self, config: dict):
        self._threshold = config["threshold"]

When you push a new config via serve deploy, Ray calls reconfigure(new_user_config) on each replica without recreating them.

Logging

@serve.deployment(logging_config={"log_level": "INFO", "encoding": "JSON"})
class Service:
    ...

Graceful shutdown

@serve.deployment(graceful_shutdown_timeout_s=30, graceful_shutdown_wait_loop_s=2)
class Service:
    ...

Replicas finish draining in-flight requests before exiting.

YAML override

Most options can be overridden in the deploy config:

deployments:
  - name: Service
    num_replicas: 8
    ray_actor_options:
      num_cpus: 4
    autoscaling_config:
      min_replicas: 2
      max_replicas: 32

Next steps

Autoscaling

Tune the autoscaling controller.

Production guide

Run Serve in production.

Model Composition HTTP Guide

⌘I

Ray Data

Ray Train

Ray Tune

Ray Serve

Ray RLlib

Ray LLM

Configure a Deployment

Replicas

Resources

Concurrency

Batching

Health check

Reconfigure

Logging

Graceful shutdown

YAML override

Next steps

Autoscaling

Production guide

Ray Data

Ray Train

Ray Tune

Ray Serve

Ray RLlib

Ray LLM

Documentation Index

​Replicas

​Resources

​Concurrency

​Batching

​Health check

​Reconfigure

​Logging

​Graceful shutdown

​YAML override

​Next steps

Autoscaling

Production guide

Replicas

Resources

Concurrency

Batching

Health check

Reconfigure

Logging

Graceful shutdown

YAML override

Next steps