Deployment

A deployment is a Python class (or function) that handles inference requests. The class is instantiated as one or more replicas — independent worker processes — that share traffic.
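from ray import serve

@serve.deployment
class Predictor:
    def __init__(self, model_uri: str):
        # load_model is a placeholder for your own model-loading function.
        self.model = load_model(model_uri)

    def __call__(self, request):
        # Called once per incoming request.
        return self.model.predict(request)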

Replica

A replica is one running instance of a deployment. Each replica is a Ray actor. Ray Serve manages replicas — scaling, health-checking, restarting, and routing requests to them.
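To make the idea of replicas sharing traffic concrete, here is a minimal, stdlib-only sketch. It does not use Ray and is not how Serve's router is implemented: `ToyReplica`, `ToyReplicaPool`, and the round-robin policy are all assumptions chosen for illustration.

```python
import itertools


class ToyReplica:
    """Stand-in for one running instance of a deployment (hypothetical)."""

    def __init__(self, replica_id: int):
        self.replica_id = replica_id
        self.handled = 0  # number of requests this replica has served

    def handle(self, request: str) -> str:
        self.handled += 1
        return f"replica-{self.replica_id} handled {request}"


class ToyReplicaPool:
    """Spreads requests over replicas in round-robin order (assumed policy)."""

    def __init__(self, num_replicas: int):
        self.replicas = [ToyReplica(i) for i in range(num_replicas)]
        self._next = itertools.cycle(self.replicas)

    def route(self, request: str) -> str:
        # Each call goes to the next replica in the cycle.
        return next(self._next).handle(request)


pool = ToyReplicaPool(num_replicas=3)
responses = [pool.route(f"req-{i}") for i in range(6)]
```

Each toy replica keeps its own state (`handled`), which mirrors the point that replicas are independent instances rather than shared objects.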

Application

An application is the unit you deploy with serve.run or serve deploy. It’s a graph of deployments, with one designated ingress deployment that handles incoming HTTP/gRPC requests.
serve.run(Service.bind(), name="my-app", route_prefix="/svc")
You can run multiple applications on the same cluster, each at a different route_prefix.

DeploymentHandle

A handle lets one deployment call another directly from Python, without going through HTTP. The call is asynchronous: .remote() returns a response that you await.
from ray import serve

@serve.deployment
class Pipeline:
    def __init__(self, downstream: serve.DeploymentHandle):
        # Serve passes in a handle to the bound downstream deployment.
        self._downstream = downstream

    async def __call__(self, request):
        result = await self._downstream.remote(request)
        return result

# MyModel is another deployment, defined elsewhere.
pipeline = Pipeline.bind(MyModel.bind())
serve.run(pipeline)
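Because the pipeline only ever awaits `.remote()` on its handle, its logic can be exercised without a cluster by passing in a stand-in. The sketch below assumes exactly that: `StubHandle` is a hypothetical test double, not part of Ray Serve, and the `Pipeline` class is reproduced without the `@serve.deployment` decorator so the snippet runs with the standard library alone.

```python
import asyncio


class StubHandle:
    """Hypothetical test double that mimics only the awaitable `.remote()` call."""

    def __init__(self, fn):
        self._fn = fn

    async def remote(self, *args, **kwargs):
        # A real handle would send the call to a replica; here we call locally.
        return self._fn(*args, **kwargs)


class Pipeline:
    """Same shape as the Pipeline above, minus the deployment decorator."""

    def __init__(self, downstream):
        self._downstream = downstream

    async def __call__(self, request):
        result = await self._downstream.remote(request)
        return result


# Stand in for the downstream model with a plain function.
pipeline = Pipeline(StubHandle(lambda text: text.upper()))
output = asyncio.run(pipeline("hello"))
```

This kind of stub only checks the composition logic. It says nothing about serialization, scheduling, or error handling in a real deployment.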

Controller

A single Serve controller actor manages all applications: launching replicas, applying configuration, and monitoring health. The controller is created once when Serve starts and persists until Serve is shut down.
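One way to picture the controller's job is as a reconciliation loop: compare the desired number of replicas with what is actually running, then act to close the gap. The sketch below is a generic control-loop pattern written with the standard library, not Serve's actual implementation; `ReplicaState` and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    healthy: bool


def reconcile(target: int, running: list[ReplicaState]) -> list[str]:
    """Return the actions needed to reach `target` healthy replicas."""
    actions = []
    healthy = [r for r in running if r.healthy]
    # Unhealthy replicas are replaced rather than repaired in this sketch.
    for r in running:
        if not r.healthy:
            actions.append(f"stop {r.name}")
    if len(healthy) < target:
        # Scale up: start enough new replicas to reach the target.
        actions += ["start new replica"] * (target - len(healthy))
    else:
        # Scale down: stop the surplus healthy replicas.
        actions += [f"stop {r.name}" for r in healthy[target:]]
    return actions


actions = reconcile(
    target=3,
    running=[ReplicaState("r0", True), ReplicaState("r1", False)],
)
```

With one healthy and one unhealthy replica and a target of three, the sketch stops the unhealthy replica and starts two new ones.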

Proxy

By default, an HTTP/gRPC proxy runs on every node that hosts at least one replica. It routes incoming traffic to the right deployment replicas based on the route prefix and method.
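Routing by route prefix can be sketched as a longest-prefix match over the registered applications. The function below is an illustration under that assumption, not the proxy's real code: `match_route` is a hypothetical helper, and every route and application name except `/svc` and `my-app` is made up.

```python
from typing import Optional


def match_route(path: str, routes: dict[str, str]) -> Optional[str]:
    """Return the app whose route prefix is the longest match for `path`."""
    best = None
    for prefix in routes:
        # Normalise so "/svc" matches "/svc" and "/svc/x" but not "/svcs".
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    return routes[best] if best is not None else None


routes = {"/": "default-app", "/svc": "my-app", "/svc/admin": "admin-app"}

app_for_predict = match_route("/svc/predict", routes)
app_for_admin = match_route("/svc/admin/users", routes)
```

Matching on whole path segments avoids sending `/svcs` to the application mounted at `/svc`. In this sketch that path falls through to the application at `/`.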

Ingress deployment

The deployment at the root of an application is the ingress — it receives external requests and (optionally) calls into other deployments via handles.

Configuration

Configuration lives in two places:
  • In-code (@serve.deployment(num_replicas=4, ...)): bundled with your Python code.
  • YAML (serve config): override at deploy time without changing code.
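The relationship between the two sources can be pictured as a layered merge, with deploy-time values taking precedence over in-code defaults. The sketch below models that with plain dictionaries; `effective_options` is a hypothetical helper and the option values are made up for illustration.

```python
# Options as they might appear in @serve.deployment(...) in code.
code_options = {"num_replicas": 4, "ray_actor_options": {"num_cpus": 1}}

# Options as they might appear for the same deployment in a config file.
config_overrides = {"num_replicas": 8}


def effective_options(in_code: dict, from_config: dict) -> dict:
    """Config values win; anything not overridden keeps its in-code value."""
    return {**in_code, **from_config}


options = effective_options(code_options, config_overrides)
```

Here `num_replicas` ends up as 8 while `ray_actor_options` keeps its in-code value, and the in-code dictionary itself is left unchanged.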

Next steps

Develop and deploy

Promote local Serve apps to production.

Model composition

Build multi-deployment pipelines.