Documentation Index

Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

Ray’s LLM stack combines Ray Data (for batch inference), Ray Train (for fine-tuning), and Ray Serve (for production serving) into a single workflow. The ray.serve.llm and ray.data.llm modules wrap engines like vLLM in Ray-native APIs.
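As a concrete sketch of what a Ray-native wrapper around a serving engine looks like, the following Ray Serve config deploys an OpenAI-compatible LLM app. This is an illustrative fragment, not a canonical config: the `build_openai_app` import path and `LLMConfig` fields follow recent Ray releases, and the model id, model source, and replica counts are placeholder values that vary by deployment.

```yaml
# Sketch of a Ray Serve config for an LLM deployment (field names per recent
# Ray releases; model id/source and replica counts are placeholders).
applications:
- name: llm_app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-model                       # placeholder id clients will reference
          model_source: Qwen/Qwen2.5-0.5B-Instruct # placeholder Hugging Face model
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
```

A config like this would typically be launched with `serve deploy`; the equivalent can also be built in Python via `ray.serve.llm.LLMConfig` and `build_openai_app`.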

Quickstart

Run your first LLM batch inference job.

Serving

Deploy LLMs behind an OpenAI-compatible API.
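Once a deployment is live, clients talk to it like any OpenAI-compatible endpoint. The sketch below builds such a chat-completion request using only the standard library; the endpoint URL and model id are placeholders for whatever your deployment exposes, and the final send is left commented out since it requires a running server.

```python
import json
from urllib import request

# Placeholder endpoint: an OpenAI-compatible deployment typically exposes
# a /v1/chat/completions route on the Serve HTTP port.
url = "http://localhost:8000/v1/chat/completions"

# Standard OpenAI-style chat payload; "my-model" is a placeholder for the
# model id registered with the deployment.
payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "Say hello."},
    ],
}

body = json.dumps(payload).encode()
req = request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req) would send the request once the deployment is running.
```

The same payload works with the official `openai` client by pointing its `base_url` at the deployment.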

Batch inference

Score prompts at scale with Ray Data.

Configuration

Tune tensor parallelism, quantization, and throughput.