
ray.data.llm runs batch LLM inference over a Ray Dataset, using vLLM (or other engines) as a stage in a Ray Data pipeline.

Pipeline

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    # One GPU per engine replica; max_model_len caps context length and
    # drives KV-cache memory use, so set it only as high as the workload needs.
    engine_kwargs={"tensor_parallel_size": 1, "max_model_len": 4096},
    # Number of engine replicas Ray Data runs in parallel.
    concurrency=4,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat-style request plus sampling parameters.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    # The engine writes its output to the "generated_text" column.
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

# Read prompts, run batch inference, and write the results back out.
ds = ray.data.read_parquet("s3://bucket/prompts.parquet")
ds = processor(ds)
ds.write_parquet("s3://bucket/outputs/")
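
Ray Data executes lazily, so nothing runs until the dataset is written or consumed. To sanity-check a pipeline before a full run, you can materialize a single row with take (standard Ray Data; call it in place of the write):

print(ds.take(1))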

OpenAI-compatible endpoints

from ray.data.llm import OpenAIChatCompletionsProcessorConfig, build_llm_processor

config = OpenAIChatCompletionsProcessorConfig(
    base_url="https://api.openai.com/v1",
    api_key="...",  # supply your real key, e.g. via an environment variable
    model="gpt-4o-mini",
    concurrency=8,
)
processor = build_llm_processor(config, preprocess=..., postprocess=...)
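
The preprocess and postprocess hooks follow the same row contract as the vLLM example. A minimal sketch, assuming the processor also surfaces model output in a generated_text column:

processor = build_llm_processor(
    config,
    # Chat-style request body, mirroring the vLLM example above.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
    ),
    # Assumes the response lands in "generated_text", as in the vLLM example.
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)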

Multi-GPU

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    # Shard the model across 4 GPUs per replica.
    engine_kwargs={"tensor_parallel_size": 4},
    concurrency=2,
)
Each replica claims tensor_parallel_size GPUs, so this configuration occupies 2 x 4 = 8 GPUs in total. Ray Data places the replicas across the cluster.

Best practices

Set concurrency to available_gpus / tensor_parallel_size. Over-subscribing creates contention; under-subscribing wastes compute.
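
A minimal sketch of deriving that value at runtime from Ray's cluster resource report (ray.cluster_resources is standard Ray; the tensor_parallel_size value is taken from the multi-GPU example above):

import ray

ray.init()

tensor_parallel_size = 4
# Total GPUs visible to the cluster; 0 if none are available.
available_gpus = int(ray.cluster_resources().get("GPU", 0))
concurrency = max(1, available_gpus // tensor_parallel_size)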

Next steps

Configuration: engine knobs and tuning.

Working with LLMs (Ray Data): detailed patterns.