Dataset
A Dataset is a logical sequence of records partitioned into blocks. Datasets are lazy: building one with read_* or map_batches records the operation but doesn't execute it. Execution happens when you iterate (iter_batches, iter_rows), materialize (take, to_pandas, materialize), or write (write_parquet).
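A minimal sketch of this laziness, using a synthetic ray.data.range() source (assumes a local Ray installation):

```python
import ray

# Building the dataset only records the operations; nothing runs yet.
ds = ray.data.range(10_000).map_batches(lambda batch: {"id": batch["id"] * 2})

# Consumption triggers execution: take() runs just enough of the
# pipeline to produce the requested rows.
print(ds.take(3))
```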
Block
A block is a contiguous slice of a dataset, materialized as an Arrow table, pandas DataFrame, or numpy dict. Blocks are the unit of work for parallel operators — Ray Data shards work at block granularity. Tune block size with ctx.target_max_block_size:
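For example, via the current DataContext (a sketch; the 64 MiB value is illustrative, not a recommendation):

```python
import ray

# DataContext holds execution-wide settings, including the target
# block size (in bytes) used when splitting output blocks.
ctx = ray.data.DataContext.get_current()
ctx.target_max_block_size = 64 * 1024 * 1024  # 64 MiB
```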
Operator
An operator is one stage in a Ray Data pipeline: map_batches, filter, flat_map, sort, groupby, random_shuffle, repartition, aggregate, write_*. Each operator consumes blocks from upstream and produces blocks for downstream.
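A short pipeline chaining a few of these operators over a synthetic source (a sketch):

```python
import ray

ds = (
    ray.data.range(1_000)
    # Transform rows in vectorized batches (dicts of numpy arrays by default).
    .map_batches(lambda batch: {"id": batch["id"], "double": batch["id"] * 2})
    # Keep only rows matching a predicate.
    .filter(lambda row: row["double"] % 4 == 0)
    # Redistribute the surviving rows into a fixed number of blocks.
    .repartition(8)
)
print(ds.take(5))
```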
Streaming execution
Ray Data uses a streaming executor that pipelines operators. Stages execute concurrently — while one stage transforms a block, the next reads, and so on. This avoids materializing the full dataset between stages and keeps the cluster busy. The executor enforces backpressure: if a downstream stage is slow, upstream stages stop producing new blocks.
Resources
Each operator can declare resource requests just like a Ray task or actor.
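For example, per-operator requests can be passed through map_batches, which forwards them as Ray remote args (a sketch; the half-CPU value is illustrative):

```python
import ray

def add_one(batch):
    batch["id"] = batch["id"] + 1
    return batch

# Each task running add_one requests 0.5 CPUs; a GPU stage could pass
# num_gpus=1 instead.
ds = ray.data.range(1_000).map_batches(add_one, num_cpus=0.5)
```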
Iterators
The terminal step of a Ray Data pipeline is usually an iterator (see the sketch after the table):
| Method | Use case |
|---|---|
| iter_batches(batch_format=...) | Loop over batches as Arrow, pandas, or numpy. |
| iter_torch_batches() | Yield PyTorch tensors directly. |
| iter_tf_batches() | Yield TensorFlow tensors. |
| iter_rows() | Loop over individual records. |
| to_torch_dataset() | Get a PyTorch IterableDataset. |
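A sketch of consuming a dataset with iter_batches and iter_torch_batches (the latter assumes PyTorch is installed):

```python
import ray

ds = ray.data.range(1_000)

# Loop over batches as dicts of numpy arrays.
for batch in ds.iter_batches(batch_size=128, batch_format="numpy"):
    print(batch["id"].shape)

# Or yield torch tensors directly.
for batch in ds.iter_torch_batches(batch_size=128):
    print(batch["id"].dtype)
```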
Materialization
Call ds.materialize() to force the full dataset into memory (or spill it to the object store). Use this when downstream stages are non-deterministic and you want to fix the dataset so every consumer sees the same records.
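One common use, sketched below: running an upstream random_shuffle once and pinning its output (this assumes that re-iterating an unmaterialized dataset would otherwise re-execute the shuffle):

```python
import ray

ds = ray.data.range(1_000).random_shuffle()

# materialize() executes the pipeline once and pins the resulting blocks.
ds = ds.materialize()

# Both passes now iterate over the same shuffled records.
for _ in range(2):
    print(ds.take(3))
```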
Sources and sinks
| Read | Write |
|---|---|
read_parquet, read_csv, read_json, read_images, read_text, read_binary_files, read_tfrecords, read_webdataset, read_sql, read_databricks_tables, from_pandas, from_numpy, from_huggingface | write_parquet, write_csv, write_json, write_images, write_tfrecords, write_sql |
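A minimal read-transform-write sketch (the local paths are hypothetical placeholders; any supported filesystem or cloud URI works):

```python
import ray

# Hypothetical input and output locations.
ds = ray.data.read_parquet("/tmp/input")

# Placeholder transform: pass batches through unchanged.
ds = ds.map_batches(lambda batch: batch)

ds.write_parquet("/tmp/output")
```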
Next steps
Quickstart
Build a full pipeline.
Internals
How the streaming executor schedules and recovers work.