

Dataset

A Dataset is a logical sequence of records partitioned into blocks. Datasets are lazy: building one with read_* or map_batches records the operation but doesn’t execute it. Execution happens when you iterate (iter_batches, iter_rows), materialize (take, to_pandas, materialize), or write (write_parquet).
import ray

def transform(batch):  # placeholder: identity transform over a dict of column arrays
    return batch

ds = ray.data.read_parquet("s3://bucket/data/")  # nothing executed yet
ds = ds.map_batches(transform)                   # still lazy
for batch in ds.iter_batches():                  # execution starts
    ...

Block

A block is a contiguous slice of a dataset, materialized as an Arrow table, pandas DataFrame, or dict of NumPy arrays. Blocks are the unit of work for parallel operators; Ray Data shards work at block granularity. Tune block size with ctx.target_max_block_size:
import ray

ctx = ray.data.DataContext.get_current()
ctx.target_max_block_size = 128 * 1024 * 1024  # 128 MiB
Block size trades off scheduling overhead against parallelism and memory headroom: smaller blocks mean more tasks and more scheduling overhead but finer-grained parallelism and a smaller per-task memory footprint; larger blocks reduce overhead at the cost of coarser parallelism and more memory per task.

Operator

An operator is one stage in a Ray Data pipeline: map_batches, filter, flat_map, sort, groupby, random_shuffle, repartition, aggregate, write_*. Each operator consumes blocks from upstream and produces blocks for downstream.
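
For example, a pipeline is simply a chain of operators on a Dataset. A minimal sketch (the bucket paths, the add_one function, and the "value" column are illustrative):
import ray

def add_one(batch):                      # operates on one block's worth of rows
    batch["value"] = batch["value"] + 1  # assumes an integer "value" column
    return batch

ds = (
    ray.data.read_parquet("s3://bucket/data/")   # source operator
      .map_batches(add_one)                      # batch-level transform operator
      .filter(lambda row: row["value"] > 0)      # row-level operator
      .repartition(100)                          # redistributes rows into 100 blocks
)
ds.write_parquet("s3://bucket/out/")             # sink operator triggers execution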

Streaming execution

Ray Data uses a streaming executor that pipelines operators. Stages execute concurrently — while one stage transforms a block, the next reads, and so on. This avoids materializing the full dataset between stages and keeps the cluster busy. The executor enforces backpressure: if a downstream stage is slow, upstream stages stop producing new blocks.
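
A rough sketch of how this plays out, reusing the transform placeholder from the Dataset example. The object store cap is one knob the streaming executor uses for backpressure; treat the exact setting as an assumption and check it against your Ray version:
import ray

ctx = ray.data.DataContext.get_current()
# Cap the object store memory the executor may keep in flight; once the cap is
# reached, upstream operators pause producing blocks until downstream catches up.
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3  # 2 GiB

ds = ray.data.read_parquet("s3://bucket/data/").map_batches(transform)
ds.write_parquet("s3://bucket/out/")  # read, transform, and write overlap block by block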

Resources

Each operator can declare resource requests just like a Ray task or actor.
ds.map_batches(score, num_gpus=1, batch_size=64)
Mix CPU and GPU stages freely — Ray Data places each stage on a node that satisfies the request.
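
For instance, a CPU preprocessing stage can feed a GPU scoring stage in the same pipeline. In this sketch, preprocess and score are hypothetical functions and the path is a placeholder:
import ray

def preprocess(batch):   # hypothetical CPU-side feature prep
    return batch

def score(batch):        # hypothetical GPU inference; each task reserves one GPU
    return batch

ds = (
    ray.data.read_parquet("s3://bucket/data/")
      .map_batches(preprocess)                         # CPU stage (1 CPU per task by default)
      .map_batches(score, num_gpus=1, batch_size=64)   # GPU stage
)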

Iterators

The terminal step of a Ray Data pipeline is usually an iterator:
Method                            Use case
iter_batches(batch_format=...)    Loop over batches as Arrow, pandas, or NumPy.
iter_torch_batches()              Yield PyTorch tensors directly.
iter_tf_batches()                 Yield TensorFlow tensors.
iter_rows()                       Loop over individual records.
to_torch_dataset()                Get a PyTorch IterableDataset.
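
For example, assuming a ds built as in the earlier sections (the batch_size values are arbitrary):
# Dict of NumPy arrays per batch.
for batch in ds.iter_batches(batch_format="numpy", batch_size=1024):
    ...

# PyTorch tensors per batch, ready to move to a device.
for batch in ds.iter_torch_batches(batch_size=256):
    ...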

Materialization

Call ds.materialize() to execute the pipeline and force the full dataset into memory (or spill). Use this when you iterate over a dataset multiple times, or when upstream stages are non-deterministic and you want to fix their output.
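
A small sketch, assuming the ds from earlier: pinning a shuffled dataset so every epoch sees the same order:
shuffled = ds.random_shuffle().materialize()   # executes and pins the shuffled blocks
for epoch in range(3):
    for batch in shuffled.iter_batches():      # no re-shuffle between epochs
        ...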

Sources and sinks

Read:  read_parquet, read_csv, read_json, read_images, read_text, read_binary_files, read_tfrecords, read_webdataset, read_sql, read_databricks_tables, from_pandas, from_numpy, from_huggingface
Write: write_parquet, write_csv, write_json, write_images, write_tfrecords, write_sql
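
As a quick illustration, in-memory sources pair with file sinks the same way file sources do (the DataFrame contents and output path are placeholders):
import pandas as pd
import ray

df = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
ds = ray.data.from_pandas(df)        # in-memory source
ds.write_parquet("/tmp/ray_out/")    # file sink; placeholder path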

Next steps

Quickstart

Build a full pipeline.

Internals

How the streaming executor schedules and recovers work.