LoRA adapters add a small, model-specific layer on top of a frozen base model. Ray Serve can load many adapters per replica and route each request to the right one.
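
As a rough sketch of the idea (illustrative NumPy only, not Ray Serve code, with hypothetical shapes): the base weight stays frozen, and the adapter contributes a low-rank update that is scaled and added at inference time.

import numpy as np

# Hypothetical shapes: a frozen base weight plus a rank-16 adapter for it.
d_out, d_in, rank, alpha = 1024, 1024, 16, 32
W = np.random.randn(d_out, d_in)   # frozen base weight
A = np.random.randn(rank, d_in)    # adapter "down" projection (trained)
B = np.zeros((d_out, rank))        # adapter "up" projection (trained)

# Effective weight at inference: base plus a small, adapter-specific delta.
W_eff = W + (alpha / rank) * (B @ A)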

Configure

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
    engine_kwargs={"enable_lora": True, "max_lora_rank": 16},
)
serve.run(build_openai_app({"llm_configs": [config]}))

dynamic_lora_loading_path is the directory or S3 prefix Ray Serve looks in for adapters. Each adapter is a subdirectory whose name becomes the model ID.
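
For example, a layout like the following would expose one adapter per subdirectory (the names here are hypothetical; the files are the usual PEFT-style adapter artifacts):

s3://bucket/loras/
    my-lora-id/                      # served as model "my-lora-id"
        adapter_config.json          # adapter metadata
        adapter_model.safetensors    # adapter weights
    another-adapter/
        ...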

Call with an adapter

from openai import OpenAI

# Point the OpenAI client at the Serve ingress.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")
client.chat.completions.create(
    model="my-lora-id",
    messages=[{"role": "user", "content": "Hi"}],
)

If my-lora-id isn’t loaded, Serve loads it from dynamic_lora_loading_path. If the per-replica adapter cache is full, the least-recently-used adapter is evicted.
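
A quick way to observe this lazy-load behavior is to time back-to-back requests to the same adapter; the first call typically includes the fetch from dynamic_lora_loading_path, while the second hits the adapter already resident on the replica (a sketch only, reusing the client from above):

import time

for attempt in (1, 2):
    start = time.perf_counter()
    client.chat.completions.create(
        model="my-lora-id",
        messages=[{"role": "user", "content": "Hi"}],
    )
    print(f"request {attempt}: {time.perf_counter() - start:.2f}s")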

Best practices

Group requests by adapter when possible; adapter switches inside vLLM aren't free. Sticky-session routing helps, or use a router that hashes on adapter ID. A minimal grouping sketch follows these notes.
Loading a LoRA adapter from network storage adds latency to the first request after an eviction, so keep max_num_adapters_per_replica at least as large as your hot adapter set.
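
A minimal sketch of the grouping advice, assuming a hypothetical in-process queue of (adapter ID, prompt) pairs and the client from above; sorting keeps consecutive calls on the same adapter instead of alternating between adapters:

from itertools import groupby

# Hypothetical queued requests: (adapter_id, prompt).
pending = [
    ("adapter-a", "Hi"),
    ("adapter-b", "Hello"),
    ("adapter-a", "Summarize this"),
]

# Send all requests for one adapter before moving on to the next.
for adapter_id, batch in groupby(sorted(pending), key=lambda r: r[0]):
    for _, prompt in batch:
        client.chat.completions.create(
            model=adapter_id,
            messages=[{"role": "user", "content": prompt}],
        )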

Next steps

Serving

Production deployment guidance.

Configuration

Engine and resource tuning.