Ray Serve lets you compose Python models, business logic, and asynchronous I/O into autoscaling HTTP and gRPC endpoints. It's framework-agnostic, scales horizontally across the cluster, and is designed for the long tail of production serving needs: multi-model pipelines, streaming responses, batch inference, and incremental upgrades.

## Documentation Index

Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
## Why Ray Serve
- **Python-first**: Define deployments as plain Python classes; no protobufs and no YAML are needed to write your application logic.
- **Multi-model composition**: Chain models, ensembles, and business logic into a single endpoint, with low-latency in-process calls between components.
- **Autoscaling**: Replicas scale up and down based on traffic and queue depth.
- **Production primitives**: Rolling updates, health checks, gRPC, FastAPI integration, and observability built in.
## A minimal deployment
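A deployment is an ordinary Python class wrapped with the `@serve.deployment` decorator. The sketch below follows the standard getting-started pattern; the `Greeter` name and the greeting logic are illustrative.

```python
from starlette.requests import Request

from ray import serve


@serve.deployment
class Greeter:
    async def __call__(self, request: Request) -> str:
        # Pull an optional ?name=... query parameter and reply in plain text.
        name = request.query_params.get("name", "world")
        return f"Hello, {name}!"


# Bind constructor arguments (none here) to form the application.
app = Greeter.bind()

# Deploy to a local Ray instance; Serve listens on http://localhost:8000/
# by default. serve.run returns once the application is live, so keep the
# process alive (or use the `serve run` CLI) to continue serving.
serve.run(app)
```

A request like `curl "http://localhost:8000/?name=Serve"` should return `Hello, Serve!`.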
## Concepts

- **Key concepts**: Deployments, replicas, applications, and the controller.
- **Develop and deploy**: From `serve.run` in development to `serve deploy` in production.
- **Model composition**: Compose deployments into pipelines and DAGs (see the sketch after this list).
- **Autoscaling**: Scale replicas based on traffic (configuration sketch below).
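As a rough sketch of composition: a deployment receives handles to other deployments in its constructor and calls them with `.remote()`. The `Translator`/`Summarizer` pairing mirrors the pattern in the Ray docs; the model logic here is a stand-in.

```python
from starlette.requests import Request

from ray import serve
from ray.serve.handle import DeploymentHandle


@serve.deployment
class Translator:
    def translate(self, text: str) -> str:
        # Stand-in for a real translation model.
        return text.upper()


@serve.deployment
class Summarizer:
    def __init__(self, translator: DeploymentHandle):
        # The handle routes calls to Translator replicas directly,
        # without an extra HTTP hop.
        self.translator = translator

    async def __call__(self, request: Request) -> str:
        text = (await request.body()).decode()
        summary = text[:80]  # stand-in for a real summarization model
        # Handle calls return a response you can await.
        return await self.translator.translate.remote(summary)


# Only Summarizer is exposed over HTTP; Translator is an internal node.
app = Summarizer.bind(Translator.bind())
```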
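And a sketch of the autoscaling knobs. The option names below (`autoscaling_config`, `target_ongoing_requests`, `max_ongoing_requests`) follow recent Ray releases; the values are illustrative, not recommendations.

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        # Target load per replica; Serve adds replicas when the average
        # number of ongoing requests per replica exceeds this.
        "target_ongoing_requests": 5,
    },
    # Upper bound on requests queued or running on a single replica.
    max_ongoing_requests=10,
)
class Model:
    async def __call__(self, request) -> str:
        return "ok"


app = Model.bind()
```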
## Use cases

- **LLM serving**: Serve LLMs with vLLM, TensorRT-LLM, or custom backends.
- **Multi-app deployments**: Run independent applications side by side (see the sketch after this list).
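A sketch of the multi-app pattern, assuming a Ray version with named applications: each `serve.run` call deploys an independent application under its own route prefix. The app and class names are illustrative.

```python
from ray import serve


@serve.deployment
class TextApp:
    async def __call__(self, request) -> str:
        return "text service"


@serve.deployment
class ImageApp:
    async def __call__(self, request) -> str:
        return "image service"


# Each named application gets its own route prefix and can be
# deployed, upgraded, or deleted independently of the others.
serve.run(TextApp.bind(), name="text_app", route_prefix="/text")
serve.run(ImageApp.bind(), name="image_app", route_prefix="/image")
```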