Documentation Index
Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
ray.serve.llm builds an OpenAI-compatible router on top of Ray Serve, with one or more LLM backends behind it.
Single model
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": "auto",
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4, "target_ongoing_requests": 4},
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 8192, "dtype": "bfloat16"},
)

serve.run(build_openai_app({"llm_configs": [config]}))
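Once serve.run returns, the router speaks the OpenAI API and can be queried with any OpenAI client. A minimal sketch, assuming Ray Serve's default local HTTP port (8000) and a placeholder API key:

from openai import OpenAI

# Point the client at the Ray Serve endpoint; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)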
Multiple models
small = LLMConfig(model_loading_config={"model_id": "..."})
large = LLMConfig(
    model_loading_config={"model_id": "..."},
    engine_kwargs={"tensor_parallel_size": 4},
)

serve.run(build_openai_app({"llm_configs": [small, large]}))
The router dispatches to the right backend based on the model field in the request.
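Reusing the client from the sketch above, the same endpoint serves both backends; only the model value changes. The IDs below are placeholders for whatever you set as model_id:

# Discover which model IDs the router currently serves.
for model in client.models.list():
    print(model.id)

# Each request goes to the backend whose model_id matches the "model" field.
client.chat.completions.create(
    model="small-model-id",
    messages=[{"role": "user", "content": "Hello"}],
)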
Multi-LoRA
LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
)
Send a request with the model field set to the base model ID and the adapter name joined by a colon, for example model="meta-llama/Llama-3.1-8B-Instruct:my-lora"; the adapter is pulled from dynamic_lora_loading_path and loaded on demand.
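A sketch of such a request, reusing the client above and assuming an adapter uploaded to s3://bucket/loras/my-lora:

# "<base model_id>:<adapter name>" selects the LoRA adapter; it's fetched
# from dynamic_lora_loading_path on first use and cached on the replica.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:my-lora",
    messages=[{"role": "user", "content": "Hello from my adapter"}],
)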
Production deployment
Deploy as a RayService for zero-downtime updates and integrated autoscaling:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: my_llm_app:app
  rayClusterConfig: ...
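The import_path points at a module and a variable. A minimal sketch of what my_llm_app.py (a hypothetical module made available to the cluster, for example via the container image or a runtime_env) could contain:

# my_llm_app.py -- referenced from serveConfigV2 as my_llm_app:app
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 4}},
)

# KubeRay starts this application from the RayService spec; no serve.run here.
app = build_openai_app({"llm_configs": [config]})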
Best practices
Match tensor_parallel_size to the number of GPUs each replica should use. For models that fit on one GPU, prefer multiple replicas with tensor_parallel_size=1; reserve tensor parallelism for models too large for a single GPU.
LLM replicas can take minutes to load weights. Set min_replicas high enough to absorb a burst before the autoscaler can react.
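Putting both notes together, an illustrative config for a model that needs four GPUs per replica; all values here are assumptions to adapt to your hardware and traffic:

LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-70B-Instruct"},
    # Each replica spans four GPUs via tensor parallelism.
    engine_kwargs={"tensor_parallel_size": 4},
    deployment_config={
        "autoscaling_config": {
            # Keep enough warm capacity to absorb a burst while new
            # replicas spend minutes pulling and loading weights.
            "min_replicas": 2,
            "max_replicas": 8,
            "target_ongoing_requests": 4,
        },
    },
)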
Next steps
Configuration: all engine and deployment options.
Multi-LoRA: hot-swap LoRA adapters at request time.