LoRA adapters add a small, model-specific layer on top of a frozen base model. Ray Serve can load many adapters per replica and route each request to the right one.
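
As a rough sketch of the idea (illustrative NumPy only, not Ray Serve code, with hypothetical shapes): the base weight stays frozen, and the adapter contributes a low-rank update that is scaled and added at inference time.

import numpy as np

# Hypothetical shapes: a frozen base weight plus a rank-16 adapter for it.
d_out, d_in, rank, alpha = 1024, 1024, 16, 32
W = np.random.randn(d_out, d_in)   # frozen base weight
A = np.random.randn(rank, d_in)    # adapter "down" projection (trained)
B = np.zeros((d_out, rank))        # adapter "up" projection (trained)

# Effective weight at inference: base plus a small, adapter-specific delta.
W_eff = W + (alpha / rank) * (B @ A)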

Configure

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
    engine_kwargs={"enable_lora": True, "max_lora_rank": 16},
)
serve.run(build_openai_app({"llm_configs": [config]}))

dynamic_lora_loading_path is the directory or S3 prefix Ray Serve looks in for adapters. Each adapter is a subdirectory whose name becomes the model ID.
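
For example, a layout like the following would expose one adapter per subdirectory (the names here are hypothetical; the files are the usual PEFT-style adapter artifacts):

s3://bucket/loras/
    my-lora-id/                      # served as model "my-lora-id"
        adapter_config.json          # adapter metadata
        adapter_model.safetensors    # adapter weights
    another-adapter/
        ...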

Call with an adapter

from openai import OpenAI

# Point the OpenAI client at the Serve ingress.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
client.chat.completions.create(
    model="my-lora-id",
    messages=[{"role": "user", "content": "Hi"}],
)

If my-lora-id isn’t loaded, Serve loads it from dynamic_lora_loading_path. If the per-replica adapter cache is full, the least-recently-used adapter is evicted.
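
A quick way to observe this lazy-load behavior is to time back-to-back requests to the same adapter; the first call typically includes the fetch from dynamic_lora_loading_path, while the second hits the adapter already resident on the replica (a sketch only, reusing the client from above):

import time

for attempt in (1, 2):
    start = time.perf_counter()
    client.chat.completions.create(
        model="my-lora-id",
        messages=[{"role": "user", "content": "Hi"}],
    )
    print(f"request {attempt}: {time.perf_counter() - start:.2f}s")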

Best practices

Group requests by adapter when possible; adapter switches inside vLLM aren't free. Sticky-session routing helps, or use a router that hashes on adapter ID. A minimal grouping sketch follows these notes.
Loading a LoRA adapter from network storage adds latency to the first request after an eviction, so keep max_num_adapters_per_replica at least as large as your hot adapter set.
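
A minimal sketch of the grouping advice, assuming a hypothetical in-process queue of (adapter ID, prompt) pairs and the client from above; sorting keeps consecutive calls on the same adapter instead of alternating between adapters:

from itertools import groupby

# Hypothetical queued requests: (adapter_id, prompt).
pending = [
    ("adapter-a", "Hi"),
    ("adapter-b", "Hello"),
    ("adapter-a", "Summarize this"),
]

# Send all requests for one adapter before moving on to the next.
for adapter_id, batch in groupby(sorted(pending), key=lambda r: r[0]):
    for _, prompt in batch:
        client.chat.completions.create(
            model=adapter_id,
            messages=[{"role": "user", "content": prompt}],
        )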

Next steps

Serving

Production deployment guidance.

Configuration

Engine and resource tuning.