Ray Data's iterators stream blocks from a dataset, decoding them into the format your code consumes.
iter_batches
The general-purpose iterator. Returns batches in the type selected by batch_format (a dict of column arrays by default).

| batch_format | Type returned |
|---|---|
| "numpy" (default) | dict[str, np.ndarray] |
| "pandas" | pd.DataFrame |
| "pyarrow" | pa.Table |
iter_torch_batches
Yields PyTorch tensors directly, with optional dtype and device casts.
iter_tf_batches
Equivalent for TensorFlow.
iter_rows
For row-by-row iteration. Slower, but useful for debugging or for non-vectorized libraries.
Prefetching
Ray Data prefetches blocks ahead of the consumer to overlap I/O with compute. Tune with prefetch_batches:
Sharding for distributed training
Use streaming_split to produce one independent iterator per training worker.
If you pass a Dataset to TorchTrainer or TFTrainer, sharding happens automatically.
Local shuffle
For randomization during training, enable a local shuffle buffer per iterator (the local_shuffle_buffer_size argument on the batch iterators).
Next steps
- Shuffling: Choose between local and global shuffles.
- Performance tips: Tune iterator throughput.