This walkthrough builds a Ray Serve application from scratch.
Install
pip install -U "ray[serve]"
A minimal deployment
from starlette.requests import Request

from ray import serve

@serve.deployment
class Greeter:
    def __init__(self, greeting: str = "Hello"):
        self.greeting = greeting

    # Over HTTP, Serve passes the Starlette request object to __call__.
    async def __call__(self, request: Request) -> str:
        name = request.query_params["name"]
        return f"{self.greeting}, {name}!"

serve.run(Greeter.bind(greeting="Hi"))
serve.run starts the local Serve runtime, deploys Greeter, and exposes it at http://localhost:8000.
Call the deployment
import requests

response = requests.get("http://localhost:8000/", params={"name": "world"})
print(response.text)
Custom HTTP handler
For full control over the request and response, accept a Starlette Request:
from starlette.requests import Request

from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request: Request):
        body = await request.json()
        return {"echo": body}

serve.run(Echo.bind())
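Because the handler only touches request.json(), its logic can be checked without running a server. Below is a quick unit-test sketch using a hand-rolled stand-in for the Starlette request (FakeRequest is invented here for illustration; it is not part of Serve or Starlette):

```python
import asyncio

class FakeRequest:
    """Minimal stand-in for starlette.requests.Request in a unit test."""
    def __init__(self, payload):
        self._payload = payload

    async def json(self):
        return self._payload

# Undecorated copy of the handler logic, for a quick sanity check.
class Echo:
    async def __call__(self, request):
        body = await request.json()
        return {"echo": body}

result = asyncio.run(Echo()(FakeRequest({"msg": "hi"})))
assert result == {"echo": {"msg": "hi"}}
```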
FastAPI integration
from fastapi import FastAPI

from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Service:
    @app.get("/items/{item_id}")
    def get_item(self, item_id: int):
        return {"item_id": item_id}

serve.run(Service.bind())
@serve.ingress(app) mounts a FastAPI router as the deployment’s HTTP interface.
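The decorator-registration pattern that @app.get relies on can be sketched in a few lines of plain Python. The toy below (ToyRouter is invented here; it is not FastAPI's implementation) only shows the idea of collecting handlers into a route table at class-definition time:

```python
class ToyRouter:
    """Toy decorator-based route registry (illustrative only)."""
    def __init__(self):
        self.routes = {}

    def get(self, path):
        def register(fn):
            # Record the handler under (method, path); return it unchanged.
            self.routes[("GET", path)] = fn
            return fn
        return register

toy_app = ToyRouter()

@toy_app.get("/items/{item_id}")
def get_item(item_id: int):
    return {"item_id": item_id}

assert ("GET", "/items/{item_id}") in toy_app.routes
assert toy_app.routes[("GET", "/items/{item_id}")](7) == {"item_id": 7}
```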
Multiple replicas
@serve.deployment(num_replicas=4)
class Service:
    ...
Ray Serve runs four copies of the deployment, load-balanced behind the same endpoint.
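One way to picture replica load balancing is a router spreading requests over identical workers. The sketch below is purely illustrative (Replica and the round-robin policy are invented for this example; Serve's real router uses its own scheduling policy):

```python
from itertools import cycle

class Replica:
    """Stand-in for one deployment replica (illustrative only)."""
    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.handled = 0

    def handle(self, request: str) -> str:
        self.handled += 1
        return f"replica-{self.replica_id}: {request}"

# Four replicas behind one endpoint; requests spread round-robin.
replicas = [Replica(i) for i in range(4)]
router = cycle(replicas)

responses = [next(router).handle(f"req-{n}") for n in range(8)]

# Each replica handled an equal share of the traffic.
assert all(r.handled == 2 for r in replicas)
```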
Resource requests
@serve.deployment(ray_actor_options={"num_cpus": 2, "num_gpus": 1})
class GPUService:
    def __init__(self):
        self.model = load_gpu_model()  # placeholder for your model-loading code
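Conceptually, ray_actor_options tells the scheduler which nodes are able to host a replica. A toy feasibility check (invented here; not Ray's scheduler) makes the idea concrete:

```python
def can_place(node: dict, request: dict) -> bool:
    """True if the node has enough of every requested resource."""
    return all(node.get(res, 0) >= amt for res, amt in request.items())

nodes = [
    {"num_cpus": 8, "num_gpus": 0},   # CPU-only node
    {"num_cpus": 16, "num_gpus": 2},  # GPU node
]
request = {"num_cpus": 2, "num_gpus": 1}

placements = [can_place(n, request) for n in nodes]
assert placements == [False, True]  # only the GPU node qualifies
```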
Bind and run
Service.bind(args) creates a deployment handle — a graph node, not a running deployment. serve.run(handle) materializes the graph and starts the replicas.
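That deferred-construction idea can be sketched in plain Python. Node, bind, and run below are toy stand-ins invented for illustration, not Ray's actual classes:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    """Toy graph node: a recipe for building an object later."""
    cls: type
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)

def bind(cls, *args, **kwargs) -> Node:
    # Nothing is instantiated yet; we only record what to build.
    return Node(cls, args, kwargs)

def run(node: Node) -> Any:
    # "Materialize" the graph: only now is the class constructed.
    return node.cls(*node.args, **node.kwargs)

class Greeter:
    def __init__(self, greeting: str = "Hello"):
        self.greeting = greeting

handle = bind(Greeter, greeting="Hi")
assert isinstance(handle, Node)   # still just a description
greeter = run(handle)
assert greeter.greeting == "Hi"   # instantiated only at run()
```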
Stop
Call serve.shutdown() to shut Serve down, deleting all applications and their deployments.
Next steps
Key concepts Deployments, applications, and the controller.
Model composition Chain deployments into pipelines.