Ray Data ships with first-class support for LLM batch inference via ray.data.llm. The integration handles tokenization, batching, GPU placement, and model sharding.

Quick example

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 1, "max_model_len": 4096},
    concurrency=4,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

ds = ray.data.read_parquet("s3://bucket/prompts.parquet")
ds = processor(ds)
ds.write_parquet("s3://bucket/responses/")

Configure the engine

Common knobs on vLLMEngineProcessorConfig:
model_source: Hugging Face model ID or local path.
engine_kwargs: Forwarded to vllm.LLM; includes tensor_parallel_size, max_model_len, dtype, and quantization.
concurrency: Number of vLLM replicas to run in parallel.
batch_size: Per-replica batch size.
accelerator_type: Pin to a specific GPU type ("H100", "A100", etc.).
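
For example, a configuration exercising these knobs might look like the following; the batch size and accelerator type here are illustrative values, not recommendations:
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 1,   # GPUs per replica
        "max_model_len": 4096,       # cap on prompt + generation length
        "dtype": "bfloat16",
    },
    concurrency=4,               # four vLLM replicas in parallel
    batch_size=64,               # rows sent to each replica per batch
    accelerator_type="A100",     # only place replicas on A100 GPUs
)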

Multi-GPU and tensor parallelism

For models too large for one GPU, set tensor_parallel_size:
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs={"tensor_parallel_size": 4, "dtype": "bfloat16"},
    concurrency=2,
)
Each replica claims tensor_parallel_size GPUs. Ray Data places replicas across the cluster.
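
As a quick sanity check, the total GPU footprint of a configuration is the product of these two settings. The helper below is a hypothetical illustration, not part of the API:
def total_gpus_required(config):
    # Each replica pins tensor_parallel_size GPUs, and `concurrency`
    # replicas run at the same time.
    return config.concurrency * config.engine_kwargs["tensor_parallel_size"]

total_gpus_required(config)  # 2 replicas * 4 GPUs each = 8 GPUs for the config above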

Custom prompts

preprocess runs per row and returns a dict with messages (chat-template input) or prompt (raw text), plus sampling_params.
def preprocess(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a translator."},
            {"role": "user", "content": f"Translate to French: {row['text']}"},
        ],
        "sampling_params": {"temperature": 0.2, "max_tokens": 200},
    }
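
To use it, pass the function to build_llm_processor in place of the inline lambda from the quick example. This sketch reuses the config from above; the input path and column names are hypothetical:
processor = build_llm_processor(
    config,                 # vLLMEngineProcessorConfig from the quick example
    preprocess=preprocess,  # the translation prompt builder defined above
    postprocess=lambda row: {"text": row["text"], "french": row["generated_text"]},
)

ds = ray.data.read_parquet("s3://bucket/texts.parquet")  # hypothetical dataset with a "text" column
ds = processor(ds)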

OpenAI-compatible endpoints

For external APIs, use OpenAIChatCompletionsProcessorConfig:
from ray.data.llm import OpenAIChatCompletionsProcessorConfig

config = OpenAIChatCompletionsProcessorConfig(
    base_url="https://api.openai.com/v1",
    api_key="...",
    model="gpt-4o-mini",
    concurrency=8,
)
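
The config drops into the same build_llm_processor pipeline. The sketch below assumes the processor accepts messages and emits generated_text like the vLLM processor above; check the API reference before relying on those column names:
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
    ),
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

ds = processor(ray.data.read_parquet("s3://bucket/prompts.parquet"))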

Best practices

Set concurrency to match the number of GPUs available for inference, divided by tensor_parallel_size. Over-subscribing creates contention; under-subscribing wastes GPUs.
Long prompts and high max_tokens values reduce throughput. Trim prompts to the smallest length that preserves accuracy.
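
As a rough illustration of the concurrency guidance, one option is to derive concurrency from the GPUs the cluster actually reports; the numbers below are illustrative:
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

tensor_parallel_size = 4
available_gpus = int(ray.cluster_resources().get("GPU", 0))

# e.g. 16 GPUs / tensor_parallel_size of 4 -> concurrency of 4
concurrency = max(1, available_gpus // tensor_parallel_size)

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs={"tensor_parallel_size": tensor_parallel_size},
    concurrency=concurrency,
)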

Next steps

Batch inference: general-purpose batch inference patterns.

Ray LLM: train, fine-tune, and serve LLMs with Ray.