
Ray Data provides flexible and performant APIs for distributed data processing. It targets the data-loading, transformation, and batch-inference stages of ML workflows — the parts that wrap around training and serving.

Why Ray Data

Designed for ML

Built-in operators for common ML transforms: shuffling, batching, GPU inference, image and audio decoding, sharding for distributed training.

Streaming execution

Process datasets larger than memory by streaming blocks through a pipeline of parallel operators.

Heterogeneous compute

Mix CPU and GPU stages in the same pipeline; Ray Data places each stage on the appropriate resource.

Native Ray integration

Pass datasets to Ray Train, run transformations inside Ray actors, and pipe results into Ray Serve.

When to use Ray Data

Ray Data sits between low-level Ray Core APIs and full-featured DataFrame libraries.
  • Ray Data vs. PyTorch DataLoader: Ray Data shards across nodes natively, supports heterogeneous CPU/GPU stages, and pipelines IO with compute. PyTorch DataLoader parallelizes only with worker processes on a single machine and can't span a cluster.
  • Ray Data vs. Spark/Dask: Ray Data is purpose-built for ML — first-class support for GPU stages, model sharding, and integration with Ray Train and Ray Serve.
  • Ray Data vs. Ray Tasks/Actors: For ad-hoc parallelism, use tasks. For loading and transforming large datasets, use Ray Data; it handles streaming, backpressure, and shuffling for you.

Quick example

import ray

# Read a Parquet file from a public S3 bucket (anonymous access).
ds = ray.data.read_parquet("s3://anonymous@air-example-data/iris.parquet")

# Select and rename columns with a vectorized batch transform, then
# repartition into 8 blocks for downstream parallelism.
ds = (
    ds
    .map_batches(lambda batch: {"label": batch["target"], "features": batch["features"]})
    .repartition(8)
)

# Stream batches into a user-defined training step.
for batch in ds.iter_batches(batch_size=256):
    train_step(batch)

Common workflows

Quickstart

Read, transform, and consume a dataset.

Loading data

Read from Parquet, CSV, JSON, images, databases, and custom sources.

Transforming data

Apply user-defined functions over batches and rows.

Batch inference

Run a model over a dataset on CPUs and GPUs.

Working with LLMs

Score prompts at scale with Ray Data and vLLM.

Performance

Tune throughput and latency.

Next steps

Key concepts

Datasets, blocks, operators, and execution.

Quickstart

Run a Ray Data pipeline end-to-end.