Composition is one of Ray Serve's strengths. A single application can chain deployments, fan out to ensembles, and route conditionally, all with low-overhead Python-level calls between components (no extra HTTP hop).
DeploymentHandle
The primitive for inter-deployment calls. A handle is a remote reference to another deployment’s replicas.
```python
from ray import serve


@serve.deployment
class Tokenizer:
    def __call__(self, text: str) -> list[int]:
        return tokenize(text)  # tokenize() is a user-provided helper


@serve.deployment
class Classifier:
    def __init__(self, tokenizer: serve.DeploymentHandle):
        self._tokenizer = tokenizer

    async def __call__(self, text: str) -> str:
        tokens = await self._tokenizer.remote(text)
        return classify(tokens)  # classify() is a user-provided helper


app = Classifier.bind(Tokenizer.bind())
serve.run(app)
```
Handle calls are async: .remote() returns a response object right away, and awaiting it gives you the result directly (no ray.get needed).
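Outside an async context (for example, a plain driver script), you can block on the response instead of awaiting it. A minimal sketch, reusing the app defined above:

```python
from ray import serve

# serve.run deploys the app and returns a handle to the ingress deployment.
handle = serve.run(app)

# .remote() returns a response object; .result() blocks until the call completes.
response = handle.remote("some input text")
print(response.result())
```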
Ensemble
Run multiple models in parallel and aggregate results.
```python
import asyncio


@serve.deployment
class Ensemble:
    def __init__(self, model_a, model_b):
        self._a, self._b = model_a, model_b

    async def __call__(self, x):
        # Fan out to both models concurrently, then average the results.
        a, b = await asyncio.gather(self._a.remote(x), self._b.remote(x))
        return (a + b) / 2


app = Ensemble.bind(ModelA.bind(), ModelB.bind())
```
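ModelA and ModelB stand in for any deployments that return numbers; here is a minimal, hypothetical pair you could use to try the ensemble end to end:

```python
@serve.deployment
class ModelA:
    def __call__(self, x: float) -> float:
        return x * 2  # placeholder model


@serve.deployment
class ModelB:
    def __call__(self, x: float) -> float:
        return x + 1  # placeholder model


app = Ensemble.bind(ModelA.bind(), ModelB.bind())
print(serve.run(app).remote(3.0).result())  # (6.0 + 4.0) / 2 = 5.0
```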
Conditional routing
Route requests to one of several models based on a feature.
```python
@serve.deployment
class Router:
    def __init__(self, fast, slow):
        self._fast, self._slow = fast, slow

    async def __call__(self, request):
        if request["urgent"]:
            return await self._fast.remote(request)
        return await self._slow.remote(request)
```
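Wiring the router up looks the same as the earlier examples; FastModel and SlowModel are assumed names for two deployments defined elsewhere:

```python
app = Router.bind(FastModel.bind(), SlowModel.bind())
handle = serve.run(app)

# Urgent requests go to the fast model, everything else to the slow one.
handle.remote({"urgent": True, "text": "triage this"}).result()
```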
Streaming responses
For long-running generative models, stream tokens back to the client.
```python
from starlette.responses import StreamingResponse


@serve.deployment
class TokenStreamer:
    async def stream(self, prompt: str):
        # generate() is a user-provided async token generator.
        async for token in generate(prompt):
            yield token

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return StreamingResponse(self.stream(prompt), media_type="text/event-stream")
```
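On the client side, any HTTP client that supports streaming can consume tokens as they arrive. A minimal sketch with requests, assuming the app is running at the default local address:

```python
import requests

with requests.post(
    "http://localhost:8000/",
    json={"prompt": "Once upon a time"},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```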
Multiplexing
A single replica can host multiple models, with traffic routed by a model ID. Useful when serving many small models with low individual traffic.
```python
@serve.multiplexed(max_num_models_per_replica=10)
async def get_model(model_id: str):
    return load_model(model_id)  # load_model() is a user-provided loader


@serve.deployment
class MultiModel:
    async def __call__(self, request):
        # Serve reads the model ID from the serve_multiplexed_model_id request header.
        model = await get_model(serve.get_multiplexed_model_id())
        return model(await request.json())
```
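Clients pick a model by setting the serve_multiplexed_model_id request header; Serve also uses that header to route the request to a replica that already has the model loaded. A minimal sketch (the model ID "summarizer-v1" is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:8000/",
    json={"text": "some input"},
    headers={"serve_multiplexed_model_id": "summarizer-v1"},  # example model ID
)
print(resp.text)
```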
Best practices
- Prefer composition over a giant deployment. Smaller deployments scale independently and recover more cleanly.
- Inter-deployment latency is low but not zero. Watch your hop count on latency-sensitive paths.
Next steps
- HTTP guide: detailed HTTP request and response handling.
- Autoscaling: independent autoscaling per deployment.