## Documentation Index

Fetch the complete documentation index at https://ray-preview.mintlify.app/llms.txt. Use this file to discover all available pages before exploring further.
## Install

```shell
pip install -U "ray[serve,llm]" vllm
```
## Define and run

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    # Hugging Face model to serve.
    model_loading_config={"model_id": "Qwen/Qwen2.5-1.5B-Instruct"},
    # Ray Serve deployment options: one replica, one GPU per replica.
    deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": 1}},
    # Extra keyword arguments forwarded to the vLLM engine.
    engine_kwargs={"max_model_len": 4096},
)

# Build an OpenAI-compatible app from the config and deploy it.
serve.run(build_openai_app({"llm_configs": [config]}))
```
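Once the deployment is ready, the app serves an OpenAI-compatible HTTP API on `localhost:8000` by default. From another shell you can sanity-check it by listing the served models (a sketch assuming the default Serve address and port):

```shell
# List models on the OpenAI-compatible endpoint; the output should
# include Qwen/Qwen2.5-1.5B-Instruct once the deployment is up.
curl http://localhost:8000/v1/models
```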
## Call the endpoint

```python
from openai import OpenAI

# The server is OpenAI-compatible, so the standard client works;
# no real API key is needed for a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
)
print(resp.choices[0].message.content)
```
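Chat requests are stateless: the server only sees the `messages` list you send, so a multi-turn conversation is built by appending each exchange client-side and resending the whole list. A minimal sketch of that bookkeeping (pure Python, no server required; `history` and `add_turn` are illustrative names, not part of the Ray or OpenAI APIs):

```python
# Running chat history; each turn appends the user message and the
# assistant's reply so the next request carries the full context.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def add_turn(history, user_text, assistant_text):
    """Append one user/assistant exchange to the history (illustrative helper)."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

add_turn(history, "Say hi", "Hi there!")
# The next call would send the accumulated list:
#   client.chat.completions.create(model=..., messages=history)
print(len(history))  # → 3
```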
## Where to go next

- **Serving**: Production serving guide.
- **Batch inference**: Run over a dataset of prompts.