Ray Data ships with first-class support for LLM batch inference via ray.data.llm. The integration handles tokenization, batching, GPU placement, and model sharding.

Quick example

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 1, "max_model_len": 4096},
    concurrency=4,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

ds = ray.data.read_parquet("s3://bucket/prompts.parquet")
ds = processor(ds)
ds.write_parquet("s3://bucket/responses/")

Configure the engine

Common knobs on vLLMEngineProcessorConfig:
model_source: Hugging Face model ID or local path.
engine_kwargs: Forwarded to vllm.LLM; includes tensor_parallel_size, max_model_len, dtype, and quantization.
concurrency: Number of vLLM replicas to run in parallel.
batch_size: Per-replica batch size.
accelerator_type: Pin to a specific GPU type ("H100", "A100", etc.).
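
For example, a configuration exercising these knobs might look like the following; the batch size and accelerator type here are illustrative values, not recommendations:
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 1,   # GPUs per replica
        "max_model_len": 4096,       # cap on prompt + generation length
        "dtype": "bfloat16",
    },
    concurrency=4,               # four vLLM replicas in parallel
    batch_size=64,               # rows sent to each replica per batch
    accelerator_type="A100",     # only place replicas on A100 GPUs
)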

Multi-GPU and tensor parallelism

For models too large for one GPU, set tensor_parallel_size:
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs={"tensor_parallel_size": 4, "dtype": "bfloat16"},
    concurrency=2,
)
Each replica claims tensor_parallel_size GPUs. Ray Data places replicas across the cluster.
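
As a quick sanity check, the total GPU footprint of a configuration is the product of these two settings. The helper below is a hypothetical illustration, not part of the API:
def total_gpus_required(config):
    # Each replica pins tensor_parallel_size GPUs, and `concurrency`
    # replicas run at the same time.
    return config.concurrency * config.engine_kwargs["tensor_parallel_size"]

total_gpus_required(config)  # 2 replicas * 4 GPUs each = 8 GPUs for the config above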

Custom prompts

preprocess runs per row and returns a dict with messages (chat-template input) or prompt (raw text), plus sampling_params.
def preprocess(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a translator."},
            {"role": "user", "content": f"Translate to French: {row['text']}"},
        ],
        "sampling_params": {"temperature": 0.2, "max_tokens": 200},
    }
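
To use it, pass the function to build_llm_processor in place of the inline lambda from the quick example. This sketch reuses the config from above; the input path and column names are hypothetical:
processor = build_llm_processor(
    config,                 # vLLMEngineProcessorConfig from the quick example
    preprocess=preprocess,  # the translation prompt builder defined above
    postprocess=lambda row: {"text": row["text"], "french": row["generated_text"]},
)

ds = ray.data.read_parquet("s3://bucket/texts.parquet")  # hypothetical dataset with a "text" column
ds = processor(ds)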

OpenAI-compatible endpoints

For external APIs, use OpenAIChatCompletionsProcessorConfig:
from ray.data.llm import OpenAIChatCompletionsProcessorConfig

config = OpenAIChatCompletionsProcessorConfig(
    base_url="https://api.openai.com/v1",
    api_key="...",
    model="gpt-4o-mini",
    concurrency=8,
)
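
The config drops into the same build_llm_processor pipeline. The sketch below assumes the processor accepts messages and emits generated_text like the vLLM processor above; check the API reference before relying on those column names:
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
    ),
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

ds = processor(ray.data.read_parquet("s3://bucket/prompts.parquet"))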

Best practices

Set concurrency to match the number of GPUs available for inference, divided by tensor_parallel_size. Over-subscribing creates contention; under-subscribing wastes GPUs.
Long prompts and high max_tokens values reduce throughput. Trim prompts to the smallest length that preserves accuracy.
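
As a rough illustration of the concurrency guidance, one option is to derive concurrency from the GPUs the cluster actually reports; the numbers below are illustrative:
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

tensor_parallel_size = 4
available_gpus = int(ray.cluster_resources().get("GPU", 0))

# e.g. 16 GPUs / tensor_parallel_size of 4 -> concurrency of 4
concurrency = max(1, available_gpus // tensor_parallel_size)

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    engine_kwargs={"tensor_parallel_size": tensor_parallel_size},
    concurrency=concurrency,
)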

Next steps

Batch inference: general-purpose batch inference patterns.

Ray LLM: train, fine-tune, and serve LLMs with Ray.