Documentation Index

Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

Ray Serve lets you compose machine learning models, business logic, and asynchronous I/O into autoscaling HTTP and gRPC endpoints, all in plain Python. It’s framework-agnostic, scales horizontally across a Ray cluster, and is built for the long tail of production serving needs: multi-model pipelines, streaming responses, batch inference, and incremental upgrades.

Why Ray Serve

Python-first

Define deployments as plain Python classes. No protobufs or YAML are needed to express your application logic.

Multi-model composition

Chain models, ensembles, and business logic into a single endpoint, with components calling each other directly through low-latency Python handles.
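
For example, a minimal composition sketch (the Summarizer and Translator deployments and their methods are illustrative, not part of Ray Serve) passes one bound deployment into another and awaits handle calls between them:

from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

@serve.deployment
class Translator:
    def translate(self, text: str) -> str:
        # Placeholder for a real model call.
        return text.upper()

@serve.deployment
class Summarizer:
    def __init__(self, translator: DeploymentHandle):
        # Handle to a downstream deployment; calls go through Serve, not HTTP.
        self.translator = translator

    async def __call__(self, request: Request) -> str:
        text = request.query_params["text"]
        summary = text[:50]  # Placeholder "summary".
        # Remote call to the Translator deployment; await the result.
        return await self.translator.translate.remote(summary)

app = Summarizer.bind(Translator.bind())
serve.run(app)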

Autoscaling

Replicas scale up and down based on traffic and queue depth.
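
As a rough sketch (the replica counts and target value are illustrative, and the target key has changed names across Ray releases), autoscaling is configured per deployment:

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Ongoing requests per replica the autoscaler tries to maintain
        # (older releases call this target_num_ongoing_requests_per_replica).
        "target_ongoing_requests": 5,
    },
)
class Model:
    def __call__(self, request) -> str:
        return "ok"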

Production primitives

Rolling updates, health checks, gRPC, FastAPI integration, and observability built in.
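
For instance, here is a hedged sketch of the FastAPI integration (the routes and handler names are made up for illustration):

from fastapi import FastAPI
from ray import serve

api = FastAPI()

@serve.deployment
@serve.ingress(api)
class Api:
    @api.get("/healthz")
    def health(self) -> dict:
        return {"status": "ok"}

    @api.get("/predict")
    def predict(self, x: float) -> dict:
        return {"y": x * 2}

app = Api.bind()
# Deploy with serve.run(app) or the serve CLI.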

A minimal deployment

from ray import serve
from starlette.requests import Request

@serve.deployment
class Hello:
    def __call__(self, request: Request) -> str:
        return f"Hello, {request.query_params['name']}"

# Deploys the application and starts serving HTTP on port 8000.
serve.run(Hello.bind())
# With the process still running:
# curl "http://localhost:8000/?name=Ray"

Concepts

Key concepts

Deployments, replicas, applications, and the controller.

Develop and deploy

From serve.run in development to serve deploy in production.
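
As a sketch of that workflow (the hello.py module name and app variable are illustrative, and CLI flags can vary across Ray versions):

# hello.py -- expose the application as a module-level bound deployment.
from ray import serve

@serve.deployment
class Hello:
    def __call__(self, request) -> str:
        return "Hello"

app = Hello.bind()

# Development: run the app locally from its import path.
#   serve run hello:app
# Production: generate a config file and deploy it to a running cluster.
#   serve build hello:app -o serve_config.yaml
#   serve deploy serve_config.yaml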

Model composition

Compose deployments into pipelines and DAGs.

Autoscaling

Scale replicas based on traffic.

Use cases

LLM serving

Serve LLMs with vLLM, TensorRT-LLM, or custom backends.
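
One hedged sketch of the custom-backend route (the model name, sampling parameters, and single-GPU sizing are illustrative; recent Ray releases also ship dedicated LLM serving APIs on top of this pattern) wraps a vLLM engine in a deployment:

from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self, model: str):
        # Offline vLLM engine; use vLLM's async engine for token streaming.
        from vllm import LLM, SamplingParams

        self.llm = LLM(model=model)
        self.sampling = SamplingParams(max_tokens=256, temperature=0.7)

    def __call__(self, request: Request) -> str:
        prompt = request.query_params["prompt"]
        outputs = self.llm.generate([prompt], self.sampling)
        return outputs[0].outputs[0].text

app = VLLMDeployment.bind(model="Qwen/Qwen2.5-0.5B-Instruct")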

Multi-app deployments

Run independent applications side by side.
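
For illustration (the application names and route prefixes are made up), each application is deployed under its own name and route prefix and can be updated or scaled independently:

from ray import serve

@serve.deployment
class ImageClassifier:
    def __call__(self, request) -> str:
        return "cat"

@serve.deployment
class TextSummarizer:
    def __call__(self, request) -> str:
        return "summary"

serve.run(ImageClassifier.bind(), name="images", route_prefix="/images")
serve.run(TextSummarizer.bind(), name="text", route_prefix="/text")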