Ray Serve Key Concepts

Deployment
Replica
Application
DeploymentHandle
Controller
Proxy
Ingress deployment
Configuration
Next steps

Deployment

A deployment is a Python class (or function) that handles incoming requests, typically for model inference. Serve instantiates it as one or more replicas, independent worker processes that share the deployment's traffic.

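from ray import serve

@serve.deployment
class Predictor:
    def __init__(self, model_uri: str):
        # load_model is a placeholder for your own model-loading code.
        self.model = load_model(model_uri)

    def __call__(self, request):
        # Called once per request routed to this replica.
        return self.model.predict(request)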

Replica

A replica is one running instance of a deployment. Each replica is a Ray actor. Ray Serve manages replicas: it scales them, health-checks them, restarts them, and routes requests to them.
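
To make the deployment/replica relationship concrete, here is a toy model in plain Python with no Ray involved. ToyDeployment and Counter are invented for this illustration, and the round-robin policy is chosen only for simplicity; it is not a description of how Serve schedules requests. The point is that each replica is a separate instance with its own state, and calls are spread across them.

```python
import itertools


class Counter:
    """Stand-in for a deployment class; each instance keeps its own state."""

    def __init__(self):
        self.handled = 0

    def __call__(self, request):
        self.handled += 1
        return f"handled {request}"


class ToyDeployment:
    """Hypothetical helper: N independent instances of one class."""

    def __init__(self, cls, num_replicas):
        self.replicas = [cls() for _ in range(num_replicas)]
        # Round-robin is used here only to keep the sketch short.
        self._next = itertools.cycle(self.replicas)

    def handle(self, request):
        # Pick the next replica and let it process the request.
        return next(self._next)(request)


deployment = ToyDeployment(Counter, num_replicas=3)
results = [deployment.handle(i) for i in range(6)]
per_replica = [replica.handled for replica in deployment.replicas]
print(per_replica)  # [2, 2, 2]
```

In a real cluster the instances live in separate actor processes, so state kept on self is per replica and is not shared between them.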

Application

An application is the unit you deploy with serve.run or serve deploy. It’s a graph of deployments, with one designated ingress deployment that handles incoming HTTP/gRPC requests.

serve.run(Service.bind(), name="my-app", route_prefix="/svc")

You can run multiple applications on the same cluster, each at a different route_prefix.

DeploymentHandle

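A handle lets one deployment call another directly from Python, without going through HTTP. The call is sent to one of the target deployment's replicas, which run as separate actors.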

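from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Pipeline:
    def __init__(self, downstream: DeploymentHandle):
        self._downstream = downstream

    async def __call__(self, request):
        # .remote() returns immediately; awaiting it yields the downstream result.
        result = await self._downstream.remote(request)
        return result

# MyModel is another @serve.deployment class, defined elsewhere.
# Binding it as an argument gives Pipeline a handle to it at runtime.
pipeline = Pipeline.bind(MyModel.bind())
serve.run(pipeline)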
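
Because Pipeline only relies on downstream.remote(...) returning something awaitable, its logic can be exercised without a cluster by passing in a stub. The sketch below is one way to do that; StubHandle is a hypothetical test double, not part of Ray, and the @serve.deployment decorator is left off so the snippet runs without Ray installed.

```python
import asyncio


class Pipeline:
    # Same shape as the Pipeline deployment above, minus the decorator.
    def __init__(self, downstream):
        self._downstream = downstream

    async def __call__(self, request):
        result = await self._downstream.remote(request)
        return result


class StubHandle:
    """Hypothetical stand-in for a handle: records calls, echoes input."""

    def __init__(self):
        self.calls = []

    async def remote(self, request):
        self.calls.append(request)
        return {"echo": request}


stub = StubHandle()
pipeline = Pipeline(stub)
output = asyncio.run(pipeline("hello"))
print(output)  # {'echo': 'hello'}
```

This only checks the composition logic. Serialization, routing, and replica behavior still need a test against a running Serve instance.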

Controller

A single Serve controller actor manages all applications: launching replicas, applying configuration, and monitoring health. The controller is created once when Serve starts and persists until Serve is shut down or the cluster is torn down.

Proxy

By default, an HTTP/gRPC proxy runs on every node and routes incoming traffic to the right deployment replicas based on the route prefix and method.
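
As a rough illustration of routing by route prefix, the sketch below picks the application whose prefix is the longest match for a request path. It is a plain-Python toy that assumes longest-prefix matching; match_route and the routes table are invented for the example and are not Serve's proxy code.

```python
def match_route(path, routes):
    """Return the app whose route prefix is the longest match for `path`."""
    best_app, best_len = None, -1
    for prefix, app in routes.items():
        normalized = prefix.rstrip("/")
        # "/svc" matches "/svc" and "/svc/...", but not "/svcx".
        if path == normalized or path.startswith(normalized + "/"):
            if len(normalized) > best_len:
                best_app, best_len = app, len(normalized)
    return best_app


# Hypothetical routing table: route_prefix -> application name.
routes = {"/": "default", "/svc": "my-app", "/svc/v2": "my-app-v2"}

print(match_route("/svc/predict", routes))     # my-app
print(match_route("/svc/v2/predict", routes))  # my-app-v2
print(match_route("/svcx", routes))            # default
```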

Ingress deployment

The deployment at the root of an application is the ingress — it receives external requests and (optionally) calls into other deployments via handles.

Configuration

Configuration lives in two places:

In-code (@serve.deployment(num_replicas=4, ...)): bundled with your Python code.
YAML (serve config): override at deploy time without changing code.
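
The relationship between the two sources can be pictured as a merge in which deploy-time values win over in-code values. The sketch below is a toy model of that precedence, not Serve's implementation; effective_options is a hypothetical helper, and the dictionaries stand in for the decorator arguments and the parsed config file.

```python
def effective_options(in_code, overrides):
    """Deploy-time overrides take precedence over in-code options."""
    return {**in_code, **overrides}


# What the code declares, e.g. @serve.deployment(num_replicas=4, ...).
in_code = {"num_replicas": 4, "ray_actor_options": {"num_cpus": 1}}

# What a deploy-time config entry for the same deployment might set.
config_overrides = {"num_replicas": 8}

effective = effective_options(in_code, config_overrides)
print(effective["num_replicas"])  # 8
```

Options that the config does not mention, such as ray_actor_options here, keep their in-code values.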

Next steps

Develop and deploy: Promote local Serve apps to production.
Model composition: Build multi-deployment pipelines.
