
Install

pip install -U "ray[serve,llm]" vllm
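
To confirm the install, a quick version check (optional):

python -c "import ray, vllm; print(ray.__version__, vllm.__version__)"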

Define and run
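
The snippet below configures a single vLLM-backed replica on one GPU and serves it behind an OpenAI-compatible app.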

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen2.5-1.5B-Instruct"},  # ID clients request at the endpoint
    deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": 1}},  # one replica on one GPU
    engine_kwargs={"max_model_len": 4096},  # forwarded to the vLLM engine
)
app = build_openai_app({"llm_configs": [config]})
serve.run(app, blocking=True)  # block so the endpoint stays up while you call it
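
Run this in its own terminal or session; with blocking=True the process stays in the foreground while Ray Serve hosts the API at http://localhost:8000/v1, Ray Serve's default HTTP port.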

Call the endpoint
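
The app speaks the OpenAI chat-completions protocol, so the standard openai client works against the local base URL: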

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # the local server does not check the key; any placeholder works
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
)
print(resp.choices[0].message.content)
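
Because the endpoint is OpenAI-compatible, standard client features such as streaming also work; a minimal sketch, assuming the deployment above is still running:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,  # yield tokens as the engine generates them
)
for chunk in stream:
    # each chunk carries a delta with the next slice of the reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()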

Where to go next

Serving: production serving guide.

Batch inference: run over a dataset of prompts.