Composition is one of Ray Serve's strengths. A single application can chain deployments, fan out to ensembles, and route conditionally, all with low-overhead native calls between components that skip the HTTP stack entirely.

DeploymentHandle

The primitive for inter-deployment calls: a handle is a remote reference to another deployment's replicas, injected automatically when one bound deployment is passed into another's constructor.
from ray import serve

@serve.deployment
class Tokenizer:
    def __call__(self, text: str) -> list[int]:
        return tokenize(text)  # placeholder for your tokenization logic

@serve.deployment
class Classifier:
    def __init__(self, tokenizer: serve.DeploymentHandle):
        # Serve resolves the bound Tokenizer into a handle at runtime.
        self._tokenizer = tokenizer

    async def __call__(self, text: str) -> str:
        tokens = await self._tokenizer.remote(text)
        return classify(tokens)  # placeholder for your model inference

app = Classifier.bind(Tokenizer.bind())
serve.run(app)
Calls through a handle are async: .remote() returns a DeploymentResponse that you await to get the result directly (no ray.get needed).
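
A handle call returns before its result is ready, so responses can be chained: a DeploymentResponse passed as an argument to another handle call is resolved by Serve before the downstream deployment runs. A minimal sketch, assuming a classifier variant whose __call__ accepts token IDs directly (the Pipeline name is illustrative):
@serve.deployment
class Pipeline:
    def __init__(self, tokenizer: serve.DeploymentHandle, classifier: serve.DeploymentHandle):
        self._tokenizer = tokenizer
        self._classifier = classifier

    async def __call__(self, text: str) -> str:
        # No await on the tokenizer call: the DeploymentResponse itself is
        # forwarded, and Serve resolves it before invoking the classifier.
        tokens = self._tokenizer.remote(text)
        return await self._classifier.remote(tokens)

app = Pipeline.bind(Tokenizer.bind(), Classifier.bind())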

Ensemble

Run multiple models in parallel and aggregate results.
import asyncio

@serve.deployment
class Ensemble:
    def __init__(self, model_a, model_b):
        self._a, self._b = model_a, model_b

    async def __call__(self, x):
        # Fan out to both models concurrently, then average their outputs.
        a, b = await asyncio.gather(self._a.remote(x), self._b.remote(x))
        return (a + b) / 2

app = Ensemble.bind(ModelA.bind(), ModelB.bind())
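
The same pattern extends to an arbitrary fan-out. A sketch under the same assumptions (WideEnsemble and ModelC are illustrative names):
@serve.deployment
class WideEnsemble:
    def __init__(self, *models):
        self._models = models

    async def __call__(self, x):
        # One concurrent call per model handle, results in handle order.
        results = await asyncio.gather(*(m.remote(x) for m in self._models))
        return sum(results) / len(results)

app = WideEnsemble.bind(ModelA.bind(), ModelB.bind(), ModelC.bind())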

Conditional routing

Route requests to one of several models based on a feature.
@serve.deployment
class Router:
    def __init__(self, fast, slow):
        self._fast, self._slow = fast, slow

    async def __call__(self, request):
        if request["urgent"]:
            return await self._fast.remote(request)
        return await self._slow.remote(request)
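
Bind the router the same way as any other composition; FastModel and SlowModel here are illustrative stand-ins for the deployments being routed between:
app = Router.bind(FastModel.bind(), SlowModel.bind())
serve.run(app)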

Streaming responses

For long-running generative models, stream tokens back to the client.
from starlette.responses import StreamingResponse

@serve.deployment
class TokenStreamer:
    async def stream(self, prompt: str):
        # generate() stands in for your async token generator.
        async for token in generate(prompt):
            yield token

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return StreamingResponse(self.stream(prompt), media_type="text/event-stream")
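
On the client side, the stream can be consumed incrementally. A sketch using the requests library (the URL and payload are illustrative):
import requests

response = requests.post(
    "http://localhost:8000/",
    json={"prompt": "Once upon a time"},
    stream=True,
)
# Print tokens as they arrive instead of waiting for the full response.
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)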

Multiplexing

A single replica can host multiple models, with traffic routed by a model ID. Useful when serving many small models with low individual traffic.
@serve.multiplexed(max_num_models_per_replica=10)
async def get_model(model_id: str):
    return load_model(model_id)  # placeholder for your model-loading logic

@serve.deployment
class MultiModel:
    async def __call__(self, request):
        # Serve reads the model ID from the serve_multiplexed_model_id
        # request header and routes to a replica that has the model loaded.
        model = await get_model(serve.get_multiplexed_model_id())
        return model(await request.json())
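
Clients select a model by setting the serve_multiplexed_model_id header. A minimal sketch (URL, payload, and model ID are illustrative):
import requests

resp = requests.post(
    "http://localhost:8000/",
    json={"text": "hello"},
    headers={"serve_multiplexed_model_id": "model_42"},
)
print(resp.text)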

Best practices

Prefer composition over a giant deployment. Smaller deployments scale independently and recover more cleanly.
Inter-deployment latency is low but not zero. Watch your hop count for latency-sensitive paths.

Next steps

HTTP guide

Detailed HTTP request and response handling.

Autoscaling

Independent autoscaling per deployment.