

ray.serve.llm builds an OpenAI-compatible router on top of Ray Serve, with one or more LLM backends behind it.

Single model

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": "auto",
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4, "target_ongoing_requests": 4},
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 8192, "dtype": "bfloat16"},
)
serve.run(build_openai_app({"llm_configs": [config]}))
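Once the app is running, the router serves the standard OpenAI REST API under Serve's HTTP endpoint (port 8000 by default). A minimal sketch of a chat-completions request body; the port and /v1 path are Serve defaults assumed here, and the HTTP call itself is left commented out:

```python
import json
from urllib import request

# Chat-completions payload; the "model" field must match the model_id
# configured in the LLMConfig above.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
body = json.dumps(payload)

# Assumes Serve's default local HTTP address; uncomment to send for real.
# req = request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(request.urlopen(req).read().decode())
```

Any OpenAI-compatible client (for example the official `openai` Python package pointed at the same base URL) works the same way.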

Multiple models

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

small = LLMConfig(model_loading_config={"model_id": "..."})
large = LLMConfig(
    model_loading_config={"model_id": "..."},
    engine_kwargs={"tensor_parallel_size": 4},
)
serve.run(build_openai_app({"llm_configs": [small, large]}))
The router dispatches to the right backend based on the model field in the request.
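The dispatch amounts to a lookup keyed on the request's model field. A toy sketch of that idea (illustrative only, not Ray's internal routing code; the model and deployment names are made up):

```python
# Hypothetical model-id -> backend deployment table.
backends = {
    "small-model": "small_deployment",
    "large-model": "large_deployment",
}

def route(request_body: dict) -> str:
    """Pick a backend deployment from the request's "model" field."""
    model_id = request_body["model"]
    try:
        return backends[model_id]
    except KeyError:
        raise ValueError(f"unknown model: {model_id}")

target = route({"model": "large-model", "messages": []})
```

Requests naming a model that no configured backend serves are rejected rather than silently routed.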

Multi-LoRA

from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
)
Send a request whose model field names the adapter (for example model="meta-llama/Llama-3.1-8B-Instruct:my-lora"); the replica loads the adapter from dynamic_lora_loading_path on first use and serves it.

Production deployment

Deploy as a RayService for zero-downtime updates and integrated autoscaling:
apiVersion: ray.io/v1
kind: RayService
metadata: { name: llm }
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: my_llm_app:app
  rayClusterConfig: ...

Best practices

Match tensor_parallel_size to the GPUs per replica. Use multiple replicas (each TP=1) for small models; use TP for models too large for one GPU.
LLM replicas can take minutes to load weights. Set min_replicas high enough to absorb a burst before the autoscaler can react.
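A back-of-the-envelope way to size min_replicas, assuming each replica comfortably handles target_ongoing_requests concurrent requests (the burst figure below is illustrative):

```python
import math

def min_replicas_for_burst(burst_requests: int, target_ongoing_requests: int) -> int:
    """Replicas needed to absorb a burst without waiting for scale-up."""
    return math.ceil(burst_requests / target_ongoing_requests)

# E.g. a burst of 30 concurrent requests with target_ongoing_requests=4
# needs ceil(30 / 4) = 8 warm replicas.
needed = min_replicas_for_burst(30, 4)
```

Since new replicas can take minutes to pull weights, sizing min_replicas this way trades idle GPU cost for burst headroom.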

Next steps

Configuration

All engine and deployment options.

Multi-LoRA

Hot-swap LoRA adapters at request time.