
Install

pip install -U "ray[serve,llm]" vllm
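
To confirm the install, a quick version check (optional):

python -c "import ray, vllm; print(ray.__version__, vllm.__version__)"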

Define and run
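
The snippet below configures a single vLLM-backed replica on one GPU and serves it behind an OpenAI-compatible app.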

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

config = LLMConfig(
    model_loading_config={"model_id": "Qwen/Qwen2.5-1.5B-Instruct"},  # ID clients request at the endpoint
    deployment_config={"num_replicas": 1, "ray_actor_options": {"num_gpus": 1}},  # one replica on one GPU
    engine_kwargs={"max_model_len": 4096},  # forwarded to the vLLM engine
)
app = build_openai_app({"llm_configs": [config]})
serve.run(app, blocking=True)  # block so the endpoint stays up while you call it
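
Run this in its own terminal or session; with blocking=True the process stays in the foreground while Ray Serve hosts the API at http://localhost:8000/v1, Ray Serve's default HTTP port.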

Call the endpoint
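
The app speaks the OpenAI chat-completions protocol, so the standard openai client works against the local base URL: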

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # the local server does not check the key; any placeholder works
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
)
print(resp.choices[0].message.content)
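
Because the endpoint is OpenAI-compatible, standard client features such as streaming also work; a minimal sketch, assuming the deployment above is still running:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,  # yield tokens as the engine generates them
)
for chunk in stream:
    # each chunk carries a delta with the next slice of the reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()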

Where to go next

Serving: production serving guide.

Batch inference: run over a dataset of prompts.