Ray Serve has first-class support for serving LLMs through ray.serve.llm, a high-level API that wraps vLLM (and other engines) in autoscaling Serve deployments.

Install

pip install -U "ray[serve,llm]" vllm

Quickstart

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": 1,
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 4096},
)

app = build_openai_app({"llm_configs": [config]})
serve.run(app, name="llm")

The resulting deployment exposes an OpenAI-compatible HTTP API at /v1/chat/completions.
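
To check that the endpoint is up, you can list the registered models. A minimal smoke test, assuming Serve is listening on its default port 8000 and exposes the standard OpenAI-compatible /v1/models route:

import requests

# Lists the model ids the router will accept in the "model" field.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])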

Multi-GPU and tensor parallelism

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-70B-Instruct"},
    # One replica spanning four GPUs; num_gpus matches tensor_parallel_size.
    deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": 4}},
    engine_kwargs={"tensor_parallel_size": 4, "dtype": "bfloat16"},
)
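
Because the per-replica GPU count and tensor_parallel_size move together, it can help to derive both from one value. tp_config below is a hypothetical helper, not part of ray.serve.llm:

from ray.serve.llm import LLMConfig

def tp_config(model_id: str, tp: int) -> LLMConfig:
    # Hypothetical wrapper: one replica whose GPU allocation always
    # equals the tensor-parallel degree.
    return LLMConfig(
        model_loading_config={"model_id": model_id},
        deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": tp}},
        engine_kwargs={"tensor_parallel_size": tp, "dtype": "bfloat16"},
    )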

Multiple models on one endpoint

config_a = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    # ... deployment_config, engine_kwargs as in the quickstart ...
)
config_b = LLMConfig(
    model_loading_config={"model_id": "mistralai/Mistral-7B-Instruct-v0.3"},
    # ... deployment_config, engine_kwargs as in the quickstart ...
)
app = build_openai_app({"llm_configs": [config_a, config_b]})

The router dispatches each request to the matching backend based on the model field in the request body.
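
The same client can hit either backend just by changing the model name. A sketch, assuming the two configs above are deployed and Serve is on the default port:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# One endpoint, two backends: the router matches on the model name.
for model_id in ["meta-llama/Llama-3.1-8B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"]:
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Say hi in one word."}],
    )
    print(model_id, "->", response.choices[0].message.content)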

Multi-LoRA

Hot-swap LoRA adapters without reloading the base model:

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        # Up to 8 adapters resident per replica at once.
        "max_num_adapters_per_replica": 8,
        # Adapters are fetched on demand from this prefix.
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
)

Send a request whose model field combines the base model id and the adapter name (for example model="meta-llama/Llama-3.1-8B-Instruct:my-lora-id"); Serve loads the adapter from dynamic_lora_loading_path on first use.
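
A sketch of such a request, assuming an adapter uploaded to s3://bucket/loras/my-lora-id:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "<base model id>:<adapter name>" routes to the base model and pulls the
# adapter from dynamic_lora_loading_path the first time it's requested.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:my-lora-id",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)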

Autoscaling

deployment_config={
    # "auto" enables autoscaling; autoscaling_config overrides its defaults.
    "num_replicas": "auto",
    "autoscaling_config": {"min_replicas": 1, "max_replicas": 8, "target_ongoing_requests": 4},
    "ray_actor_options": {"num_gpus": 1},
}
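
Serve adds replicas until each has roughly target_ongoing_requests in flight, so this config tops out around 8 × 4 = 32 concurrent requests. Dropped into the quickstart config, it looks like this (same model and engine_kwargs assumed as above):

from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": "auto",
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 8, "target_ongoing_requests": 4},
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 4096},
)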

OpenAI client compatibility

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(model=..., messages=..., stream=True)
for chunk in stream:
    # The final chunk's delta carries no content, so guard against None.
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Best practices

Match tensor_parallel_size to your GPU count per replica. For most workloads, 1 GPU per replica with multiple replicas yields better throughput than larger TP groups.

LLM replicas have multi-minute startup times (loading weights, compiling kernels). Set min_replicas high enough to absorb a burst before autoscaling can react.

Next steps

Ray LLM: Train, fine-tune, and serve LLMs.

Autoscaling: Tune the autoscaling controller.