The @serve.deployment decorator accepts options that control replica count, resource allocation, autoscaling, batching, and lifecycle hooks. You can override most of them at deploy time in the YAML config.
Replicas
from ray import serve

@serve.deployment(num_replicas=4)
class Service:
    ...
For autoscaling, use num_replicas="auto" and provide an autoscaling_config:
@serve.deployment(
    num_replicas="auto",
    autoscaling_config={"min_replicas": 1, "max_replicas": 16, "target_ongoing_requests": 5},
)
class Service:
    ...
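As a rough mental model of how target_ongoing_requests drives scaling, the sketch below computes a desired replica count from the total number of in-flight requests. It is a simplification under stated assumptions: desired_replicas is a hypothetical helper, not a Serve API, and the real autoscaling controller also applies smoothing and delays that are not modeled here.

```python
import math


def desired_replicas(total_ongoing_requests: int, config: dict) -> int:
    """Simplified model (not Serve's implementation): aim for
    target_ongoing_requests in-flight requests per replica, then clamp
    the result to [min_replicas, max_replicas]."""
    raw = math.ceil(total_ongoing_requests / config["target_ongoing_requests"])
    return max(config["min_replicas"], min(config["max_replicas"], raw))


autoscaling_config = {"min_replicas": 1, "max_replicas": 16, "target_ongoing_requests": 5}

print(desired_replicas(0, autoscaling_config))    # 1  (floor at min_replicas)
print(desired_replicas(23, autoscaling_config))   # 5  (ceil(23 / 5))
print(desired_replicas(500, autoscaling_config))  # 16 (capped at max_replicas)
```

Under this model, raising target_ongoing_requests makes each replica absorb more load before the deployment scales out.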
Resources
@serve.deployment(ray_actor_options={"num_cpus": 2, "num_gpus": 1, "memory": 4 * 1024**3})
class GPUService:
    ...
ray_actor_options follows the same shape as @ray.remote(...). Custom resources work too:
ray_actor_options = {"resources": {"high_memory": 1}}
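Because each replica reserves the resources listed in ray_actor_options, a quick capacity estimate can help when sizing a cluster. The sketch below assumes a hypothetical node with 16 CPUs, 4 GPUs, and 64 GiB of memory. replicas_per_node is an illustrative helper, not part of Ray, and it only handles flat numeric entries (not the nested "resources" dict).

```python
def replicas_per_node(node: dict, per_replica: dict) -> int:
    """Number of replicas that fit on one node, limited by the scarcest
    resource. Entries with a zero request are ignored."""
    return min(
        int(node.get(name, 0) // amount)
        for name, amount in per_replica.items()
        if amount > 0
    )


GIB = 1024 ** 3
ray_actor_options = {"num_cpus": 2, "num_gpus": 1, "memory": 4 * GIB}
node = {"num_cpus": 16, "num_gpus": 4, "memory": 64 * GIB}  # hypothetical node

print(replicas_per_node(node, ray_actor_options))  # 4: GPUs run out first
```

Here the CPU budget would allow 8 replicas and the memory budget 16, so the GPU count is the binding constraint.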
Concurrency
@serve.deployment(max_ongoing_requests=10)
class Service:
    ...
max_ongoing_requests caps the number of in-flight requests per replica. Excess requests wait in a queue until a replica has a free slot.
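The capping behaviour can be illustrated with plain asyncio. This is only an analogy, assuming a semaphore as a stand-in for the per-replica limit. It does not reproduce how Serve routes or queues requests, and all names in it are hypothetical.

```python
import asyncio

MAX_ONGOING_REQUESTS = 10


async def main() -> int:
    slots = asyncio.Semaphore(MAX_ONGOING_REQUESTS)  # stand-in for the per-replica cap
    in_flight = 0
    peak = 0

    async def handle(request_id: int) -> None:
        nonlocal in_flight, peak
        async with slots:  # requests beyond the cap wait here
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated work
            in_flight -= 1

    # Send 50 requests at once; only 10 may run concurrently.
    await asyncio.gather(*(handle(i) for i in range(50)))
    return peak


peak_in_flight = asyncio.run(main())
print(peak_in_flight)  # 10
```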
Batching
@serve.deployment
class BatchedService:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def __call__(self, requests):
        return [self.process(r) for r in requests]
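@serve.batch collects requests until it has max_batch_size of them or batch_wait_timeout_s seconds have passed, whichever comes first, then calls the method once with the list. The method must return a list with one result per request, in the same order.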
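The "full batch or timeout, whichever comes first" rule can be sketched in plain asyncio. This is a minimal model, not Serve's implementation: collect_batch is a hypothetical helper, and it assumes the timer starts when the first request of a batch arrives.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, max_batch_size: int,
                        batch_wait_timeout_s: float) -> list:
    """Wait for one item, then keep collecting until the batch is full
    or batch_wait_timeout_s has elapsed since the first item arrived."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_wait_timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # timeout hit: ship a partial batch
    return batch


async def main() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(40):
        queue.put_nowait(i)
    full = await collect_batch(queue, max_batch_size=32, batch_wait_timeout_s=0.05)
    partial = await collect_batch(queue, max_batch_size=32, batch_wait_timeout_s=0.05)
    return len(full), len(partial)


sizes = asyncio.run(main())
print(sizes)  # (32, 8)
```

With 40 queued requests, the first batch fills immediately and the second is released after the timeout with the remaining 8. A larger batch_wait_timeout_s trades latency for fuller batches.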
Health check
@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class Service:
    def check_health(self):
        if not self._is_ready:
            raise RuntimeError("not ready")
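Serve calls check_health every health_check_period_s seconds and treats a call as failed if it raises or does not return within health_check_timeout_s seconds. A replica that fails its health check is restarted.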
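A health check can cover more than a readiness flag. The sketch below adds a staleness check on a background worker. It is a plain-Python stand-in so it runs without a cluster; in Serve the class would carry @serve.deployment. The heartbeat mechanism, mark_ready, and the 60-second threshold are assumptions made for illustration.

```python
import time


class Service:
    """Plain-Python stand-in; in Serve this class would carry @serve.deployment."""

    MAX_HEARTBEAT_AGE_S = 60.0  # hypothetical staleness threshold

    def __init__(self) -> None:
        self._is_ready = False
        self._last_heartbeat = time.monotonic()

    def mark_ready(self) -> None:
        self._is_ready = True

    def heartbeat(self) -> None:
        # Called by a (hypothetical) background worker whenever it makes progress.
        self._last_heartbeat = time.monotonic()

    def check_health(self) -> None:
        # Returning normally means healthy; raising means unhealthy.
        if not self._is_ready:
            raise RuntimeError("not ready")
        age = time.monotonic() - self._last_heartbeat
        if age > self.MAX_HEARTBEAT_AGE_S:
            raise RuntimeError(f"worker stalled: no heartbeat for {age:.0f}s")


svc = Service()
svc.mark_ready()
svc.check_health()  # passes: ready and the heartbeat is fresh
```

Keeping check_health cheap matters, since it has to finish well inside health_check_timeout_s.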
User config
@serve.deployment
class Service:
    def reconfigure(self, config: dict):
        self._threshold = config["threshold"]
When you push a new user_config via serve deploy, Serve calls reconfigure(new_user_config) on each running replica instead of recreating it.
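Since reconfigure runs on live replicas, it can be worth validating the incoming config before changing any state. The following is a minimal sketch using a plain-Python stand-in for the deployment class. The [0, 1] range, the 0.5 default, and the __call__ logic are hypothetical.

```python
class Service:
    """Plain-Python stand-in; in Serve this class would carry @serve.deployment."""

    def __init__(self) -> None:
        self._threshold = 0.5  # hypothetical default until a config arrives

    def reconfigure(self, config: dict) -> None:
        threshold = config.get("threshold")
        if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
            # Reject bad values before touching state so the old setting stays in effect.
            raise ValueError(f"threshold must be a number in [0, 1], got {threshold!r}")
        self._threshold = float(threshold)

    def __call__(self, score: float) -> bool:
        return score >= self._threshold


svc = Service()
svc.reconfigure({"threshold": 0.8})
print(svc(0.7))  # False
print(svc(0.9))  # True
```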
Logging
@serve.deployment(logging_config={"log_level": "INFO", "encoding": "JSON"})
class Service:
    ...
Graceful shutdown
@serve.deployment(graceful_shutdown_timeout_s=30, graceful_shutdown_wait_loop_s=2)
class Service:
    ...
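On shutdown, a replica stops accepting new requests and drains in-flight ones. It checks every graceful_shutdown_wait_loop_s seconds whether work remains, and is forcefully killed if it has not exited after graceful_shutdown_timeout_s seconds.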
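The interplay of the two timeouts can be modeled with a small asyncio loop. This is a rough sketch, not Serve's shutdown code: drain is a hypothetical helper, and task cancellation stands in for a forced kill.

```python
import asyncio


async def drain(tasks: set, timeout_s: float, wait_loop_s: float) -> bool:
    """Poll every wait_loop_s seconds until no task is pending or timeout_s
    has elapsed. Returns True if everything finished in time."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while any(not t.done() for t in tasks):
        if loop.time() >= deadline:
            for t in tasks:
                t.cancel()  # stands in for a forced kill
            return False
        await asyncio.sleep(wait_loop_s)
    return True


async def main() -> tuple:
    # Short requests finish well inside the timeout.
    quick = {asyncio.create_task(asyncio.sleep(0.02)) for _ in range(3)}
    drained = await drain(quick, timeout_s=0.5, wait_loop_s=0.01)
    # Long requests outlive the timeout and are cut off.
    slow = {asyncio.create_task(asyncio.sleep(10)) for _ in range(3)}
    killed = await drain(slow, timeout_s=0.05, wait_loop_s=0.01)
    return drained, killed


results = asyncio.run(main())
print(results)  # (True, False)
```

In this model, graceful_shutdown_timeout_s should exceed the longest request you expect to serve, or those requests will be cut off mid-flight.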
YAML override
Most options can be overridden in the deploy config:
deployments:
  - name: Service
    # An integer num_replicas cannot be combined with autoscaling_config.
    num_replicas: auto
    ray_actor_options:
      num_cpus: 4
    autoscaling_config:
      min_replicas: 2
      max_replicas: 32
Next steps
Autoscaling: Tune the autoscaling controller.
Production guide: Run Serve in production.