Ray Serve has first-class support for serving LLMs through ray.serve.llm, a high-level API that wraps vLLM (and other engines) in autoscaling Serve deployments.
Install
pip install -U "ray[serve,llm]" vllm
Quickstart
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": 1,
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 4096},
)

app = build_openai_app({"llm_configs": [config]})
serve.run(app, name="llm")
The resulting deployment exposes an OpenAI-compatible HTTP API at /v1/chat/completions.
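Once serve.run returns, any HTTP client can call it; a minimal sketch with requests, assuming Serve's default address of localhost:8000:

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
# Standard OpenAI chat completion response shape.
print(response.json()["choices"][0]["message"]["content"])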
Multi-GPU and tensor parallelism
LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-70B-Instruct"},
    deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": 4}},
    engine_kwargs={"tensor_parallel_size": 4, "dtype": "bfloat16"},
)
Multiple models on one endpoint
config_a = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    ...
)
config_b = LLMConfig(
    model_loading_config={"model_id": "mistralai/Mistral-7B-Instruct-v0.3"},
    ...
)

app = build_openai_app({"llm_configs": [config_a, config_b]})
The router dispatches each request to the matching backend based on the model field in the request body.
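As a sketch of how a client selects between them (assuming the app also exposes the standard OpenAI /v1/models listing endpoint):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Discover the model IDs served by this endpoint.
for model in client.models.list():
    print(model.id)

# The model field picks the backend the router sends this request to.
client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Hello"}],
)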
Multi-LoRA
Hot-swap LoRA adapters without reloading the base model:
LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    lora_config={
        "max_num_adapters_per_replica": 8,
        "dynamic_lora_loading_path": "s3://bucket/loras/",
    },
)
Send a request with model="my-lora-id" to load and use the adapter.
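A minimal sketch with the OpenAI client, where my-lora-id stands in for an adapter stored under the configured dynamic_lora_loading_path:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "my-lora-id" is a placeholder adapter name; the adapter is fetched from
# dynamic_lora_loading_path and applied on top of the base model.
client.chat.completions.create(
    model="my-lora-id",
    messages=[{"role": "user", "content": "Hello"}],
)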
Autoscaling
deployment_config = {
    "num_replicas": "auto",
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 4,
    },
    "ray_actor_options": {"num_gpus": 1},
}
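A sketch of where this plugs in, reusing the quickstart config (values are illustrative, not tuned recommendations):

config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    deployment_config={
        "num_replicas": "auto",
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 8,
            "target_ongoing_requests": 4,
        },
        "ray_actor_options": {"num_gpus": 1},
    },
    engine_kwargs={"max_model_len": 4096},
)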
OpenAI client compatibility
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
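The call returns a standard OpenAI response object; for example, to read the generated text:

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
# The generated text is in the first choice's message.
print(response.choices[0].message.content)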
Streaming
stream = client.chat.completions.create(model=..., messages=..., stream=True)
for chunk in stream:
    # The final chunk's delta may have no content, so guard against None.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Best practices
Match tensor_parallel_size to the number of GPUs assigned to each replica. For models that fit on a single GPU, one GPU per replica with multiple replicas usually yields better throughput than larger tensor-parallel groups.
LLM replicas have multi-minute startup times (loading weights, compiling kernels). Set min_replicas high enough to absorb a burst before autoscaling can react.
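For example, a rough sketch of an autoscaling_config tuned with slow replica startup in mind; the values are illustrative, and initial_replicas, upscale_delay_s, and downscale_delay_s are standard Serve autoscaling options:

deployment_config = {
    "num_replicas": "auto",
    "autoscaling_config": {
        # Keep a floor of warm replicas because new replicas need minutes
        # to load weights before they can serve traffic.
        "min_replicas": 2,
        "initial_replicas": 2,
        "max_replicas": 8,
        "target_ongoing_requests": 4,
        # Scale up eagerly, scale down slowly so brief lulls don't tear
        # down capacity that is expensive to bring back.
        "upscale_delay_s": 30,
        "downscale_delay_s": 600,
    },
    "ray_actor_options": {"num_gpus": 1},
}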
Next steps
Ray LLM: train, fine-tune, and serve LLMs.
Autoscaling: tune the autoscaling controller.