Ray Data ships with first-class support for LLM batch inference via ray.data.llm. The integration handles tokenization, batching, GPU placement, and model sharding.
Quick example
```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure the vLLM engine: one GPU per replica, four replicas.
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 1, "max_model_len": 4096},
    concurrency=4,
)

processor = build_llm_processor(
    config,
    # preprocess maps each input row to chat messages plus sampling parameters.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    # postprocess maps each output row to the columns you want to keep.
    postprocess=lambda row: {"prompt": row["prompt"], "response": row["generated_text"]},
)

ds = ray.data.read_parquet("s3://bucket/prompts.parquet")
ds = processor(ds)
ds.write_parquet("s3://bucket/responses/")
```
Common knobs on vLLMEngineProcessorConfig:
| Field | Purpose |
| --- | --- |
| `model_source` | Hugging Face model ID or local path. |
| `engine_kwargs` | Forwarded to `vllm.LLM`. Includes `tensor_parallel_size`, `max_model_len`, `dtype`, `quantization`. |
| `concurrency` | Number of vLLM replicas to run in parallel. |
| `batch_size` | Per-replica batch size. |
| `accelerator_type` | Pin to a specific GPU type (`"H100"`, `"A100"`, etc.). |
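These knobs compose. A minimal sketch combining several of them (the values are illustrative, not recommendations):

```python
from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "tensor_parallel_size": 1,
        "max_model_len": 4096,
        "dtype": "bfloat16",
    },
    concurrency=4,            # four vLLM replicas across the cluster
    batch_size=64,            # rows handed to each replica per batch
    accelerator_type="A100",  # schedule replicas only on A100 GPUs
)
```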
Multi-GPU and tensor parallelism
For models too large for one GPU, set tensor_parallel_size:
```python
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-70B-Instruct",
    # Shard the model across four GPUs per replica.
    engine_kwargs={"tensor_parallel_size": 4, "dtype": "bfloat16"},
    concurrency=2,
)
```
Each replica claims tensor_parallel_size GPUs, and Ray Data places replicas across the cluster, so the configuration above needs 2 × 4 = 8 GPUs in total.
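Before launching a large job, it can be worth sanity-checking that the cluster actually has concurrency × tensor_parallel_size GPUs. A minimal sketch using ray.cluster_resources() (the assertion is illustrative):

```python
import ray

ray.init()  # or ray.init(address="auto") to attach to a running cluster

tensor_parallel_size = 4
concurrency = 2
gpus_needed = concurrency * tensor_parallel_size  # 8 GPUs for the config above

gpus_available = ray.cluster_resources().get("GPU", 0)
assert gpus_available >= gpus_needed, (
    f"need {gpus_needed} GPUs, cluster has {gpus_available}"
)
```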
Custom prompts
preprocess runs per row and returns a dict with messages (chat-template input) or prompt (raw text), plus sampling_params.
```python
def preprocess(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a translator."},
            {"role": "user", "content": f"Translate to French: {row['text']}"},
        ],
        "sampling_params": {"temperature": 0.2, "max_tokens": 200},
    }
```
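As noted above, preprocess can instead return a raw prompt string for models without a chat template. A minimal sketch of that variant (the prompt wording is illustrative):

```python
def preprocess(row):
    # Raw-text completion input rather than chat messages; suited to
    # base (non-instruct) models that have no chat template.
    return {
        "prompt": f"Translate to French: {row['text']}\nFrench:",
        "sampling_params": {"temperature": 0.2, "max_tokens": 200},
    }
```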
OpenAI-compatible endpoints
For external APIs, use OpenAIChatCompletionsProcessorConfig:
```python
from ray.data.llm import OpenAIChatCompletionsProcessorConfig

config = OpenAIChatCompletionsProcessorConfig(
    base_url="https://api.openai.com/v1",
    api_key="...",    # your API key
    model="gpt-4o-mini",
    concurrency=8,    # parallel request workers
)
```
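The resulting config plugs into build_llm_processor the same way as the vLLM config above; a sketch under that assumption (the exact input and output column names may vary by Ray version, so treat them as placeholders):

```python
import ray
from ray.data.llm import build_llm_processor

processor = build_llm_processor(
    config,
    # Assumed row contract: the same messages/sampling_params shape as the
    # vLLM example; verify against the docs for your Ray version.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params={"temperature": 0.0, "max_tokens": 256},
    ),
    postprocess=lambda row: {"response": row["generated_text"]},
)

ds = processor(ray.data.read_parquet("s3://bucket/prompts.parquet"))
```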
Best practices
Set concurrency to the number of GPUs available for inference divided by tensor_parallel_size (for example, 8 GPUs with tensor_parallel_size=2 supports concurrency=4). Over-subscribing creates contention; under-subscribing wastes GPUs.
Long prompts and high max_tokens values reduce throughput. Trim prompts to the smallest length that preserves accuracy.
Next steps
- Batch inference: general-purpose batch inference patterns.
- Ray LLM: train, fine-tune, and serve LLMs with Ray.