Documentation Index

Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

Ray Serve lets you compose machine learning models, business logic, and asynchronous I/O into autoscaling HTTP and gRPC endpoints, all in plain Python. It’s framework-agnostic, scales horizontally across a Ray cluster, and is built for the long tail of production serving needs: multi-model pipelines, streaming responses, batch inference, and incremental upgrades.

Why Ray Serve

Python-first

Define deployments as plain Python classes. No protobufs or YAML are needed to express your application logic.

Multi-model composition

Chain models, ensembles, and business logic into a single endpoint, with components calling each other directly through low-latency Python handles.
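
For example, a minimal composition sketch (the Summarizer and Translator deployments and their methods are illustrative, not part of Ray Serve) passes one bound deployment into another and awaits handle calls between them:

from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

@serve.deployment
class Translator:
    def translate(self, text: str) -> str:
        # Placeholder for a real model call.
        return text.upper()

@serve.deployment
class Summarizer:
    def __init__(self, translator: DeploymentHandle):
        # Handle to a downstream deployment; calls go through Serve, not HTTP.
        self.translator = translator

    async def __call__(self, request: Request) -> str:
        text = request.query_params["text"]
        summary = text[:50]  # Placeholder "summary".
        # Remote call to the Translator deployment; await the result.
        return await self.translator.translate.remote(summary)

app = Summarizer.bind(Translator.bind())
serve.run(app)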

Autoscaling

Replicas scale up and down based on traffic and queue depth.
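
As a rough sketch (the replica counts and target value are illustrative, and the target key has changed names across Ray releases), autoscaling is configured per deployment:

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Ongoing requests per replica the autoscaler tries to maintain
        # (older releases call this target_num_ongoing_requests_per_replica).
        "target_ongoing_requests": 5,
    },
)
class Model:
    def __call__(self, request) -> str:
        return "ok"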

Production primitives

Rolling updates, health checks, gRPC, FastAPI integration, and observability built in.
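
For instance, here is a hedged sketch of the FastAPI integration (the routes and handler names are made up for illustration):

from fastapi import FastAPI
from ray import serve

api = FastAPI()

@serve.deployment
@serve.ingress(api)
class Api:
    @api.get("/healthz")
    def health(self) -> dict:
        return {"status": "ok"}

    @api.get("/predict")
    def predict(self, x: float) -> dict:
        return {"y": x * 2}

app = Api.bind()
# Deploy with serve.run(app) or the serve CLI.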

A minimal deployment

from ray import serve
from starlette.requests import Request

@serve.deployment
class Hello:
    def __call__(self, request: Request) -> str:
        return f"Hello, {request.query_params['name']}"

# Deploys the application and starts serving HTTP on port 8000.
serve.run(Hello.bind())
# With the process still running:
# curl "http://localhost:8000/?name=Ray"

Concepts

Key concepts

Deployments, replicas, applications, and the controller.

Develop and deploy

From serve.run in development to serve deploy in production.
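
As a sketch of that workflow (the hello.py module name and app variable are illustrative, and CLI flags can vary across Ray versions):

# hello.py -- expose the application as a module-level bound deployment.
from ray import serve

@serve.deployment
class Hello:
    def __call__(self, request) -> str:
        return "Hello"

app = Hello.bind()

# Development: run the app locally from its import path.
#   serve run hello:app
# Production: generate a config file and deploy it to a running cluster.
#   serve build hello:app -o serve_config.yaml
#   serve deploy serve_config.yaml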

Model composition

Compose deployments into pipelines and DAGs.

Autoscaling

Scale replicas based on traffic.

Use cases

LLM serving

Serve LLMs with vLLM, TensorRT-LLM, or custom backends.
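
One hedged sketch of the custom-backend route (the model name, sampling parameters, and single-GPU sizing are illustrative; recent Ray releases also ship dedicated LLM serving APIs on top of this pattern) wraps a vLLM engine in a deployment:

from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model: str):
        # Offline vLLM engine; use vLLM's async engine for token streaming.
        from vllm import LLM, SamplingParams

        self.llm = LLM(model=model)
        self.sampling = SamplingParams(max_tokens=256, temperature=0.7)

    def __call__(self, request: Request) -> str:
        prompt = request.query_params["prompt"]
        outputs = self.llm.generate([prompt], self.sampling)
        return outputs[0].outputs[0].text

app = VLLMDeployment.bind(model="Qwen/Qwen2.5-0.5B-Instruct")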

Multi-app deployments

Run independent applications side by side.
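
For illustration (the application names and route prefixes are made up), each application is deployed under its own name and route prefix and can be updated or scaled independently:

from ray import serve

@serve.deployment
class ImageClassifier:
    def __call__(self, request) -> str:
        return "cat"

@serve.deployment
class TextSummarizer:
    def __call__(self, request) -> str:
        return "summary"

serve.run(ImageClassifier.bind(), name="images", route_prefix="/images")
serve.run(TextSummarizer.bind(), name="text", route_prefix="/text")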