This walkthrough builds a Ray Serve application from scratch.
Install
pip install -U "ray[serve]"
A minimal deployment
from starlette.requests import Request

from ray import serve

@serve.deployment
class Greeter:
    def __init__(self, greeting: str = "Hello"):
        self.greeting = greeting

    # Over HTTP, Serve passes the Starlette request object to __call__.
    async def __call__(self, request: Request) -> str:
        name = request.query_params["name"]
        return f"{self.greeting}, {name}!"

serve.run(Greeter.bind(greeting="Hi"))
serve.run starts the local Serve runtime, deploys Greeter, and exposes it at http://localhost:8000.
Call the deployment
import requests

response = requests.get("http://localhost:8000/", params={"name": "world"})
print(response.text)
Custom HTTP handler
For full control over the request and response, accept a Starlette Request:
from starlette.requests import Request

from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request: Request):
        body = await request.json()
        return {"echo": body}

serve.run(Echo.bind())
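Because the handler only touches request.json(), its logic can be checked without running a server. Below is a quick unit-test sketch using a hand-rolled stand-in for the Starlette request (FakeRequest is invented here for illustration; it is not part of Serve or Starlette):

```python
import asyncio

class FakeRequest:
    """Minimal stand-in for starlette.requests.Request in a unit test."""
    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        return self._payload

# Undecorated copy of the handler logic, for a quick sanity check.
class Echo:
    async def __call__(self, request):
        body = await request.json()
        return {"echo": body}

result = asyncio.run(Echo()(FakeRequest({"msg": "hi"})))
assert result == {"echo": {"msg": "hi"}}
```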
FastAPI integration
from fastapi import FastAPI

from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Service:
    @app.get("/items/{item_id}")
    def get_item(self, item_id: int):
        return {"item_id": item_id}

serve.run(Service.bind())
@serve.ingress(app) mounts a FastAPI router as the deployment’s HTTP interface.
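The decorator-registration pattern that @app.get relies on can be sketched in a few lines of plain Python. The toy below (ToyRouter is invented here; it is not FastAPI's implementation) only shows the idea of collecting handlers into a route table at class-definition time:

```python
class ToyRouter:
    """Toy decorator-based route registry (illustrative only)."""
    def __init__(self):
        self.routes = {}

    def get(self, path):
        def register(fn):
            # Record the handler under (method, path); return it unchanged.
            self.routes[("GET", path)] = fn
            return fn
        return register

toy_app = ToyRouter()

@toy_app.get("/items/{item_id}")
def get_item(item_id: int):
    return {"item_id": item_id}

assert ("GET", "/items/{item_id}") in toy_app.routes
assert toy_app.routes[("GET", "/items/{item_id}")](7) == {"item_id": 7}
```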
Multiple replicas
@serve.deployment(num_replicas=4)
class Service:
    ...
Ray Serve runs four copies of the deployment, load-balanced behind the same endpoint.
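One way to picture replica load balancing is a router spreading requests over identical workers. The sketch below is purely illustrative (Replica and the round-robin policy are invented for this example; Serve's real router uses its own scheduling policy):

```python
from itertools import cycle

class Replica:
    """Stand-in for one deployment replica (illustrative only)."""
    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.handled = 0

    def handle(self, request: str) -> str:
        self.handled += 1
        return f"replica-{self.replica_id}: {request}"

# Four replicas behind one endpoint; requests spread round-robin.
replicas = [Replica(i) for i in range(4)]
router = cycle(replicas)

responses = [next(router).handle(f"req-{n}") for n in range(8)]

# Each replica handled an equal share of the traffic.
assert all(r.handled == 2 for r in replicas)
```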
Resource requests
@serve.deployment(ray_actor_options={"num_cpus": 2, "num_gpus": 1})
class GPUService:
    def __init__(self):
        self.model = load_gpu_model()  # placeholder for your model-loading code
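Conceptually, ray_actor_options tells the scheduler which nodes are able to host a replica. A toy feasibility check (invented here; not Ray's scheduler) makes the idea concrete:

```python
def can_place(node: dict, request: dict) -> bool:
    """True if the node has enough of every requested resource."""
    return all(node.get(res, 0) >= amt for res, amt in request.items())

nodes = [
    {"num_cpus": 8, "num_gpus": 0},   # CPU-only node
    {"num_cpus": 16, "num_gpus": 2},  # GPU node
]
request = {"num_cpus": 2, "num_gpus": 1}

placements = [can_place(n, request) for n in nodes]
assert placements == [False, True]  # only the GPU node qualifies
```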
Bind and run
Service.bind(args) creates a deployment handle — a graph node, not a running deployment. serve.run(handle) materializes the graph and starts the replicas.
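That deferred-construction idea can be sketched in plain Python. Node, bind, and run below are toy stand-ins invented for illustration, not Ray's actual classes:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    """Toy graph node: a recipe for building an object later."""
    cls: type
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

def bind(cls, *args, **kwargs) -> Node:
    # Nothing is instantiated yet; we only record what to build.
    return Node(cls, args, kwargs)

def run(node: Node) -> Any:
    # "Materialize" the graph: only now is the class constructed.
    return node.cls(*node.args, **node.kwargs)

class Greeter:
    def __init__(self, greeting: str = "Hello"):
        self.greeting = greeting

handle = bind(Greeter, greeting="Hi")
assert isinstance(handle, Node)   # still just a description
greeter = run(handle)
assert greeter.greeting == "Hi"   # instantiated only at run()
```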
Stop
Call serve.shutdown() to shut Serve down, deleting all applications and their deployments.
Next steps
Key concepts Deployments, applications, and the controller.
Model composition Chain deployments into pipelines.