Documentation Index
Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
ray.serve.llm builds an OpenAI-compatible router on top of Ray Serve, with one or more LLM backends behind it.
Single model
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": "auto",
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4, "target_ongoing_requests": 4},
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 8192, "dtype": "bfloat16"},
)

serve.run(build_openai_app({"llm_configs": [config]}))
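Once serve.run returns, the router speaks the OpenAI API and can be queried with any OpenAI client. A minimal sketch, assuming Ray Serve's default local HTTP port (8000) and a placeholder API key:

from openai import OpenAI

# Point the client at the Ray Serve endpoint; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)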
Multiple models
small = LLMConfig(model_loading_config={"model_id": "..."})
large = LLMConfig(
    model_loading_config={"model_id": "..."},
    engine_kwargs={"tensor_parallel_size": 4},
)

serve.run(build_openai_app({"llm_configs": [small, large]}))
The router dispatches to the right backend based on the model field in the request.
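Reusing the client from the sketch above, the same endpoint serves both backends; only the model value changes. The IDs below are placeholders for whatever you set as model_id:

# Discover which model IDs the router currently serves.
for model in client.models.list():
    print(model.id)

# Each request goes to the backend whose model_id matches the "model" field.
client.chat.completions.create(
    model="small-model-id",
    messages=[{"role": "user", "content": "Hello"}],
)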
Multi-LoRA
LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
)
Send a request with the model field set to the base model ID and the adapter name joined by a colon, for example model="meta-llama/Llama-3.1-8B-Instruct:my-lora"; the adapter is pulled from dynamic_lora_loading_path and loaded on demand.
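A sketch of such a request, reusing the client above and assuming an adapter uploaded to s3://bucket/loras/my-lora:

# "<base model_id>:<adapter name>" selects the LoRA adapter; it's fetched
# from dynamic_lora_loading_path on first use and cached on the replica.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:my-lora",
    messages=[{"role": "user", "content": "Hello from my adapter"}],
)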
Production deployment
Deploy as a RayService for zero-downtime updates and integrated autoscaling:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: my_llm_app:app
  rayClusterConfig: ...
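The import_path points at a module and a variable. A minimal sketch of what my_llm_app.py (a hypothetical module made available to the cluster, for example via the container image or a runtime_env) could contain:

# my_llm_app.py -- referenced from serveConfigV2 as my_llm_app:app
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 4}},
)

# KubeRay starts this application from the RayService spec; no serve.run here.
app = build_openai_app({"llm_configs": [config]})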
Best practices
Match tensor_parallel_size to the number of GPUs each replica should use. For models that fit on one GPU, prefer multiple replicas with tensor_parallel_size=1; reserve tensor parallelism for models too large for a single GPU.
LLM replicas can take minutes to load weights. Set min_replicas high enough to absorb a burst before the autoscaler can react.
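Putting both notes together, an illustrative config for a model that needs four GPUs per replica; all values here are assumptions to adapt to your hardware and traffic:

LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-70B-Instruct"},
    # Each replica spans four GPUs via tensor parallelism.
    engine_kwargs={"tensor_parallel_size": 4},
    deployment_config={
        "autoscaling_config": {
            # Keep enough warm capacity to absorb a burst while new
            # replicas spend minutes pulling and loading weights.
            "min_replicas": 2,
            "max_replicas": 8,
            "target_ongoing_requests": 4,
        },
    },
)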
Next steps
Configuration: all engine and deployment options.
Multi-LoRA: hot-swap LoRA adapters at request time.