
Ray Data provides flexible and performant APIs for distributed data processing. It targets the data-loading, transformation, and batch-inference stages of ML workflows — the parts that wrap around training and serving.

Why Ray Data

Designed for ML

Built-in operators for common ML transforms: shuffling, batching, GPU inference, image and audio decoding, sharding for distributed training.

Streaming execution

Process datasets larger than memory by streaming blocks through a pipeline of parallel operators.

Heterogeneous compute

Mix CPU and GPU stages in the same pipeline; Ray Data places each stage on the appropriate resource.

Native Ray integration

Pass datasets to Ray Train, run transformations inside Ray actors, and pipe results into Ray Serve.

When to use Ray Data

Ray Data sits between low-level Ray Core APIs and full-featured DataFrame libraries.
  • Ray Data vs. PyTorch DataLoader: Ray Data shards across nodes natively, supports heterogeneous CPU/GPU stages, and pipelines IO with compute. PyTorch DataLoader parallelizes only with worker processes on a single machine and can't span a cluster.
  • Ray Data vs. Spark/Dask: Ray Data is purpose-built for ML — first-class support for GPU stages, model sharding, and integration with Ray Train and Ray Serve.
  • Ray Data vs. Ray Tasks/Actors: For ad-hoc parallelism, use tasks. For loading and transforming large datasets, use Ray Data; it handles streaming, backpressure, and shuffling for you.

Quick example

import ray

# Read a Parquet file from a public S3 bucket (anonymous access).
ds = ray.data.read_parquet("s3://anonymous@air-example-data/iris.parquet")

# Select and rename columns with a vectorized batch transform, then
# repartition into 8 blocks for downstream parallelism.
ds = (
    ds
    .map_batches(lambda batch: {"label": batch["target"], "features": batch["features"]})
    .repartition(8)
)

# Stream batches into a user-defined training step.
for batch in ds.iter_batches(batch_size=256):
    train_step(batch)

Common workflows

Quickstart

Read, transform, and consume a dataset.

Loading data

Read from Parquet, CSV, JSON, images, databases, and custom sources.

Transforming data

Apply user-defined functions over batches and rows.

Batch inference

Run a model over a dataset on CPUs and GPUs.

Working with LLMs

Score prompts at scale with Ray Data and vLLM.

Performance

Tune throughput and latency.

Next steps

Key concepts

Datasets, blocks, operators, and execution.

Quickstart

Run a Ray Data pipeline end-to-end.