Ray Data provides flexible and performant APIs for distributed data processing. It targets the data-loading, transformation, and batch-inference stages of ML workflows: the parts that wrap around training and serving.

Documentation Index
Fetch the complete documentation index at: https://ray-preview.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Why Ray Data
Designed for ML
Built-in operators for common ML transforms: shuffling, batching, GPU inference, image and audio decoding, sharding for distributed training.
Streaming execution
Process datasets larger than memory by streaming blocks through a pipeline of parallel operators.
Heterogeneous compute
Mix CPU and GPU stages in the same pipeline; Ray Data places each stage on the appropriate resource.
Native Ray integration
Pass datasets to Ray Train, run transformations inside Ray actors, and pipe results into Ray Serve.
When to use Ray Data
Ray Data sits between low-level Ray Core APIs and full-featured DataFrame libraries.
- Ray Data vs. PyTorch DataLoader: Ray Data shards across nodes natively, supports heterogeneous CPU/GPU stages, and pipelines I/O with compute. PyTorch DataLoader is limited to a single node per training worker.
- Ray Data vs. Spark/Dask: Ray Data is purpose-built for ML — first-class support for GPU stages, model sharding, and integration with Ray Train and Ray Serve.
- Ray Data vs. Ray Tasks/Actors: For ad-hoc parallelism, use tasks. For loading and transforming large datasets, use Ray Data; it handles streaming, backpressure, and shuffling for you.
Quick example
Common workflows
Quickstart
Read, transform, and consume a dataset.
Loading data
Read from Parquet, CSV, JSON, images, databases, and custom sources.
Transforming data
Apply user-defined functions over batches and rows.
Batch inference
Run a model over a dataset on CPUs and GPUs.
Working with LLMs
Score prompts at scale with Ray Data and vLLM.
Performance
Tune throughput and latency.
Next steps
Key concepts
Datasets, blocks, operators, and execution.
Quickstart
Run a Ray Data pipeline end-to-end.