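The @serve.deployment decorator accepts options that control replica count, resource allocation, autoscaling, concurrency, health checks, logging, and shutdown behavior. Batching is configured separately with @serve.batch, and the check_health and reconfigure lifecycle hooks are methods you define on the deployment class. Most decorator options can be overridden at deploy time via YAML.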

Replicas

Set a fixed number of replicas with num_replicas:
from ray import serve

@serve.deployment(num_replicas=4)
class Service:
    ...
For autoscaling, use num_replicas="auto" and provide an autoscaling_config:
@serve.deployment(
    num_replicas="auto",
    autoscaling_config={"min_replicas": 1, "max_replicas": 16, "target_ongoing_requests": 5},
)
class Service:
    ...
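To build intuition for target_ongoing_requests, here is a simplified model rather than Serve's actual autoscaling policy. It assumes the autoscaler aims for roughly the total number of ongoing requests divided by the target, clamped to the configured bounds, and it ignores smoothing and upscale or downscale delays. The function name desired_replicas is hypothetical.

```python
import math

# Same values as the autoscaling_config above.
autoscaling_config = {"min_replicas": 1, "max_replicas": 16, "target_ongoing_requests": 5}

def desired_replicas(total_ongoing_requests: int, config: dict) -> int:
    """Simplified model: replicas needed to keep each one near the target load."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    # Clamp to the configured bounds.
    return max(config["min_replicas"], min(config["max_replicas"], raw))

print(desired_replicas(40, autoscaling_config))    # 8
print(desired_replicas(0, autoscaling_config))     # 1 (clamped to min_replicas)
print(desired_replicas(1000, autoscaling_config))  # 16 (clamped to max_replicas)
```

Under this model, a lower target_ongoing_requests yields more replicas for the same load.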

Resources

@serve.deployment(ray_actor_options={"num_cpus": 2, "num_gpus": 1, "memory": 4 * 1024**3})
class GPUService:
    ...
ray_actor_options takes the same resource keys as @ray.remote(...). The memory value is in bytes, so 4 * 1024**3 requests 4 GiB per replica. Custom resources work too:
ray_actor_options={"resources": {"high_memory": 1}}
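Because memory is given in bytes, a small helper can make the intent easier to read. The sketch below builds an options dictionary like the ones above in plain Python. The gib helper is hypothetical and not part of Ray.

```python
def gib(n: float) -> int:
    """Convert gibibytes to bytes."""
    return int(n * 1024**3)

ray_actor_options = {
    "num_cpus": 2,
    "num_gpus": 1,
    "memory": gib(4),                 # 4294967296 bytes
    "resources": {"high_memory": 1},  # custom resource; a node must advertise it
}

print(ray_actor_options["memory"])  # 4294967296
```

The resulting dictionary can be passed as ray_actor_options=ray_actor_options in the decorator.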

Concurrency

@serve.deployment(max_ongoing_requests=10)
class Service:
    ...
max_ongoing_requests caps the number of in-flight requests that each replica handles at once. Requests beyond the cap wait in a queue until a replica has a free slot.
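The effect of a per-replica cap can be illustrated with a plain asyncio analogy. This is not how Serve routes requests internally. It only shows the observable behavior, using a semaphore as a stand-in for one replica's slots; run_requests and handle are hypothetical names.

```python
import asyncio

MAX_ONGOING_REQUESTS = 10  # mirrors max_ongoing_requests=10 above

async def run_requests(num_requests: int) -> int:
    """Send num_requests concurrently and return the peak number in flight."""
    slots = asyncio.Semaphore(MAX_ONGOING_REQUESTS)
    in_flight = 0
    peak = 0

    async def handle(_request_id: int) -> None:
        nonlocal in_flight, peak
        async with slots:              # excess requests wait here
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for real work
            in_flight -= 1

    await asyncio.gather(*(handle(i) for i in range(num_requests)))
    return peak

peak_in_flight = asyncio.run(run_requests(25))
print(peak_in_flight)  # 10
```

All 25 requests complete, but no more than 10 are ever processed at the same time.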

Batching

@serve.deployment
class BatchedService:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def __call__(self, requests):
        return [self.process(r) for r in requests]
@serve.batch collects up to max_batch_size requests, or waits batch_wait_timeout_s seconds, whichever comes first, and then calls the method once with the list. The method must return a list with one result per request, in the same order.
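The "size or timeout, whichever comes first" rule can be sketched with a plain asyncio queue. This is an illustration of the idea, not Serve's implementation. It assumes the timer starts when the first request of a batch arrives, and collect_batch is a hypothetical name.

```python
import asyncio

async def collect_batch(queue: asyncio.Queue, max_batch_size: int, batch_wait_timeout_s: float) -> list:
    """Wait for a first item, then gather more until the batch is full or the timeout elapses."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_wait_timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # timeout reached with a partial batch
    return batch

async def demo():
    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    full = await collect_batch(queue, max_batch_size=3, batch_wait_timeout_s=0.05)
    partial = await collect_batch(queue, max_batch_size=3, batch_wait_timeout_s=0.05)
    return full, partial

full_batch, partial_batch = asyncio.run(demo())
print(full_batch)     # [0, 1, 2]  (hit max_batch_size)
print(partial_batch)  # [3, 4]     (hit the timeout)
```

A larger batch_wait_timeout_s tends to produce fuller batches at the cost of added latency for the first request in each batch.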

Health check

@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class Service:
    def check_health(self):
        if not self._is_ready:
            raise RuntimeError("not ready")
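Serve calls check_health every health_check_period_s seconds. If the method raises an exception, or does not return within health_check_timeout_s seconds, Serve marks the replica unhealthy and restarts it.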
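The contract is simply "return normally when healthy, raise when not". The sketch below exercises that contract in plain Python without a running cluster. The is_healthy function is a hypothetical stand-in for the caller, and the constructor is an assumption added so the example is self-contained.

```python
class Service:
    def __init__(self) -> None:
        self._is_ready = False  # assumed initial state for this sketch

    def check_health(self) -> None:
        if not self._is_ready:
            raise RuntimeError("not ready")

def is_healthy(replica: Service) -> bool:
    """Hypothetical stand-in for the caller: any exception means unhealthy."""
    try:
        replica.check_health()
    except Exception:
        return False
    return True

svc = Service()
before = is_healthy(svc)
svc._is_ready = True
after = is_healthy(svc)
print(before, after)  # False True
```

Keeping check_health cheap and side-effect free helps it stay well within the timeout.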

Reconfigure

@serve.deployment(user_config={"threshold": 0.5})
class Service:
    def reconfigure(self, config: dict):
        self._threshold = config["threshold"]
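When you push a config with an updated user_config via serve deploy, Serve calls reconfigure(new_user_config) on each running replica instead of recreating it. Serve also calls reconfigure once right after a replica is constructed whenever user_config is set.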
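To show why this is useful, the sketch below mimics two successive user_config values being applied to the same object. It uses a plain class without the decorator so that it runs without a cluster. The __call__ method and the threshold values are assumptions for illustration.

```python
class Service:
    def __init__(self) -> None:
        self._threshold = None

    def reconfigure(self, config: dict) -> None:
        self._threshold = config["threshold"]

    def __call__(self, score: float) -> bool:
        # Hypothetical request handler that depends on the configured value.
        return score >= self._threshold

svc = Service()
svc.reconfigure({"threshold": 0.5})  # initial user_config
first = svc(0.7)
svc.reconfigure({"threshold": 0.9})  # updated user_config, same object
second = svc(0.7)
print(first, second)  # True False
```

Expensive state built in the constructor, such as loaded model weights, survives the update because the object is not recreated.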

Logging

@serve.deployment(logging_config={"log_level": "INFO", "encoding": "JSON"})
class Service:
    ...

Graceful shutdown

@serve.deployment(graceful_shutdown_timeout_s=30, graceful_shutdown_wait_loop_s=2)
class Service:
    ...
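On shutdown, a replica drains its in-flight requests before exiting. It checks every graceful_shutdown_wait_loop_s seconds whether work remains, and it is forcefully stopped if it has not finished within graceful_shutdown_timeout_s seconds.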
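The trade-off behind the shutdown timeout can be sketched with plain asyncio: work that finishes within the timeout completes, and anything still running afterward is cut off. This is an analogy rather than Serve's shutdown logic, it does not model the wait loop, and drain is a hypothetical name.

```python
import asyncio

async def drain(tasks, timeout_s: float):
    """Wait up to timeout_s for tasks to finish, then cancel the rest."""
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for task in pending:
        task.cancel()  # stand-in for a forced stop
    # Let cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)

async def demo():
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes in time
    slow = asyncio.create_task(asyncio.sleep(10))    # exceeds the timeout
    return await drain({fast, slow}, timeout_s=0.2)

finished, cancelled = asyncio.run(demo())
print(finished, cancelled)  # 1 1
```

A timeout somewhat longer than the slowest expected request gives in-flight work a chance to complete.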

YAML override

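Most options can be overridden in the Serve config file that you pass to serve deploy. A fixed num_replicas cannot be combined with autoscaling_config, so set num_replicas to auto (or omit it) when overriding the autoscaling bounds:
deployments:
  - name: Service
    num_replicas: auto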
    ray_actor_options:
      num_cpus: 4
    autoscaling_config:
      min_replicas: 2
      max_replicas: 32

Next steps

Autoscaling

Tune the autoscaling controller.

Production guide

Run Serve in production.