ray.data.llm runs inference over a Ray Dataset using vLLM (or other engines) as a stage of a Ray Data pipeline.
Pipeline
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 1, "max_model_len": 4096},
    concurrency=4,  # number of engine replicas
)
processor = build_llm_processor(
    config,
    # preprocess maps each input row to a chat request plus sampling params.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    # postprocess maps each engine output row back to the columns you keep.
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)
ds = ray.data.read_parquet("s3://bucket/prompts.parquet")
ds = processor(ds)
ds.write_parquet("s3://bucket/outputs/")
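To sanity-check the pipeline before committing to a full run, you can apply the processor to a small slice first. A sketch, reusing the placeholder bucket path from above:

import ray

# Run the processor on a handful of rows and inspect the output columns.
sample = ray.data.read_parquet("s3://bucket/prompts.parquet").limit(4)
print(processor(sample).take(2))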
OpenAI-compatible endpoints
from ray.data.llm import OpenAIChatCompletionsProcessorConfig, build_llm_processor
config = OpenAIChatCompletionsProcessorConfig(
    base_url="https://api.openai.com/v1",
    api_key="...",
    model="gpt-4o-mini",
    concurrency=8,
)
processor = build_llm_processor(config, preprocess=..., postprocess=...)
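The preprocess and postprocess hooks are elided above. A minimal sketch of what they might look like, assuming the same prompt column as the vLLM example; the "response" field on the output row is an assumption here, not a confirmed part of this processor's row schema:

processor = build_llm_processor(
    config,
    # Build an OpenAI-style chat payload from each input row.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
    ),
    # "response" is a hypothetical output field; check the processor's actual
    # row schema in the Configuration docs before relying on it.
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["response"]},
)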
Multi-GPU
vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs={"tensor_parallel_size": 4},  # 4 GPUs per replica
    concurrency=2,  # 2 replicas, so 8 GPUs total
)
Each replica claims tensor_parallel_size GPUs. Ray Data places replicas across the cluster.
Best practices
Set concurrency to available_gpus / tensor_parallel_size. Over-subscribing creates contention; under-subscribing wastes compute.
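As a sketch, you can compute this at runtime from the cluster's GPU count (assuming a homogeneous cluster where every GPU can host a shard of the model):

import ray

ray.init()
tensor_parallel_size = 4
available_gpus = int(ray.cluster_resources().get("GPU", 0))
# One replica per tensor_parallel_size GPUs; floor division avoids over-subscribing.
concurrency = max(1, available_gpus // tensor_parallel_size)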
Next steps
Configuration: engine knobs and tuning.
Working with LLMs (Ray Data): detailed patterns.