

Dataset

A Dataset is a logical sequence of records partitioned into blocks. Datasets are lazy: building one with read_* or map_batches records the operation but doesn’t execute it. Execution happens when you iterate (iter_batches, iter_rows), materialize (take, to_pandas, materialize), or write (write_parquet).
import ray

def transform(batch):  # placeholder: identity transform over a dict of column arrays
    return batch

ds = ray.data.read_parquet("s3://bucket/data/")  # nothing executed yet
ds = ds.map_batches(transform)                   # still lazy
for batch in ds.iter_batches():                  # execution starts
    ...

Block

A block is a contiguous slice of a dataset, materialized as an Arrow table, pandas DataFrame, or dict of NumPy arrays. Blocks are the unit of work for parallel operators; Ray Data shards work at block granularity. Tune block size with ctx.target_max_block_size:
import ray

ctx = ray.data.DataContext.get_current()
ctx.target_max_block_size = 128 * 1024 * 1024  # 128 MiB
Block size trades off scheduling overhead against parallelism and memory headroom: smaller blocks mean more tasks and more scheduling overhead but finer-grained parallelism and a smaller per-task memory footprint; larger blocks reduce overhead at the cost of coarser parallelism and more memory per task.

Operator

An operator is one stage in a Ray Data pipeline: map_batches, filter, flat_map, sort, groupby, random_shuffle, repartition, aggregate, write_*. Each operator consumes blocks from upstream and produces blocks for downstream.
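
For example, a pipeline is simply a chain of operators on a Dataset. A minimal sketch (the bucket paths, the add_one function, and the "value" column are illustrative):
import ray

def add_one(batch):                      # operates on one block's worth of rows
    batch["value"] = batch["value"] + 1  # assumes an integer "value" column
    return batch

ds = (
    ray.data.read_parquet("s3://bucket/data/")   # source operator
      .map_batches(add_one)                      # batch-level transform operator
      .filter(lambda row: row["value"] > 0)      # row-level operator
      .repartition(100)                          # redistributes rows into 100 blocks
)
ds.write_parquet("s3://bucket/out/")             # sink operator triggers execution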

Streaming execution

Ray Data uses a streaming executor that pipelines operators. Stages execute concurrently — while one stage transforms a block, the next reads, and so on. This avoids materializing the full dataset between stages and keeps the cluster busy. The executor enforces backpressure: if a downstream stage is slow, upstream stages stop producing new blocks.
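
A rough sketch of how this plays out, reusing the transform placeholder from the Dataset example. The object store cap is one knob the streaming executor uses for backpressure; treat the exact setting as an assumption and check it against your Ray version:
import ray

ctx = ray.data.DataContext.get_current()
# Cap the object store memory the executor may keep in flight; once the cap is
# reached, upstream operators pause producing blocks until downstream catches up.
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3  # 2 GiB

ds = ray.data.read_parquet("s3://bucket/data/").map_batches(transform)
ds.write_parquet("s3://bucket/out/")  # read, transform, and write overlap block by block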

Resources

Each operator can declare resource requests just like a Ray task or actor.
ds.map_batches(score, num_gpus=1, batch_size=64)
Mix CPU and GPU stages freely — Ray Data places each stage on a node that satisfies the request.
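
For instance, a CPU preprocessing stage can feed a GPU scoring stage in the same pipeline. In this sketch, preprocess and score are hypothetical functions and the path is a placeholder:
import ray

def preprocess(batch):   # hypothetical CPU-side feature prep
    return batch

def score(batch):        # hypothetical GPU inference; each task reserves one GPU
    return batch

ds = (
    ray.data.read_parquet("s3://bucket/data/")
      .map_batches(preprocess)                         # CPU stage (1 CPU per task by default)
      .map_batches(score, num_gpus=1, batch_size=64)   # GPU stage
)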

Iterators

The terminal step of a Ray Data pipeline is usually an iterator:
Method                            Use case
iter_batches(batch_format=...)    Loop over batches as Arrow, pandas, or NumPy.
iter_torch_batches()              Yield PyTorch tensors directly.
iter_tf_batches()                 Yield TensorFlow tensors.
iter_rows()                       Loop over individual records.
to_torch_dataset()                Get a PyTorch IterableDataset.
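
For example, assuming a ds built as in the earlier sections (the batch_size values are arbitrary):
# Dict of NumPy arrays per batch.
for batch in ds.iter_batches(batch_format="numpy", batch_size=1024):
    ...

# PyTorch tensors per batch, ready to move to a device.
for batch in ds.iter_torch_batches(batch_size=256):
    ...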

Materialization

Call ds.materialize() to execute the pipeline and force the full dataset into memory (or spill). Use this when you iterate over a dataset multiple times, or when upstream stages are non-deterministic and you want to fix their output.
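
A small sketch, assuming the ds from earlier: pinning a shuffled dataset so every epoch sees the same order:
shuffled = ds.random_shuffle().materialize()   # executes and pins the shuffled blocks
for epoch in range(3):
    for batch in shuffled.iter_batches():      # no re-shuffle between epochs
        ...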

Sources and sinks

Read:  read_parquet, read_csv, read_json, read_images, read_text, read_binary_files, read_tfrecords, read_webdataset, read_sql, read_databricks_tables, from_pandas, from_numpy, from_huggingface
Write: write_parquet, write_csv, write_json, write_images, write_tfrecords, write_sql
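
As a quick illustration, in-memory sources pair with file sinks the same way file sources do (the DataFrame contents and output path are placeholders):
import pandas as pd
import ray

df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
ds = ray.data.from_pandas(df)        # in-memory source
ds.write_parquet("/tmp/ray_out/")    # file sink; placeholder path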

Next steps

Quickstart

Build a full pipeline.

Internals

How the streaming executor schedules and recovers work.