Ray Data offers three levels of shuffling, trading randomness against cost, plus a hash-based shuffle that underlies key-based operations like groupby and sort.

randomize_block_order

Cheap, metadata-only shuffle: it only randomizes the order in which blocks are emitted by the executor; no rows move between blocks. Useful when the dataset is already pre-shuffled at write time and you want some run-to-run variation.
ds = ds.randomize_block_order(seed=42)

Local shuffle buffer

When iterating, Ray Data fills an in-memory buffer and yields rows from it in random order. This trades a fixed memory cost for per-batch randomness: within a buffer, any row can land in any batch.
for batch in ds.iter_torch_batches(
    batch_size=64,
    local_shuffle_buffer_size=10_000,
    local_shuffle_seed=42,
):
    ...
Use this for training when full randomness across the dataset isn’t necessary.

Full distributed shuffle

The most expensive option: a global shuffle across the cluster, in which every row has equal probability of ending up in any block.
ds = ds.random_shuffle(seed=42)
Use sparingly. The cost is roughly equivalent to a network sort on the dataset.

Hash-based shuffle

groupby and sort perform a hash- or range-partitioned shuffle, used when grouping by key or producing a globally ordered output.
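A minimal sketch of both triggers; the synthetic dataset, the bucket column, and the modulo grouping key are illustrative assumptions, not part of the original example:
import ray

ds = ray.data.range(10_000)  # synthetic dataset with an "id" column

# Derive a hypothetical grouping key.
ds = ds.add_column("bucket", lambda df: df["id"] % 10)

# groupby triggers a hash-partitioned shuffle keyed on "bucket".
counts = ds.groupby("bucket").count()

# sort triggers a range-partitioned shuffle that yields a globally ordered output.
ordered = ds.sort("id")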

Best practices

For training, the typical pattern is: shuffle the data once at write time (file-level), call randomize_block_order once per epoch, and set a local_shuffle_buffer_size of 10,000–100,000 rows for per-batch randomness, as sketched below. Reserve random_shuffle for cases that genuinely need uniform randomness.
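A minimal sketch of that layered pattern, assuming the Parquet files were already shuffled when written; the path, epoch count, buffer size, and seeds are placeholders:
import ray

num_epochs = 2  # placeholder

# Hypothetical dataset that was shuffled at write time (file-level).
ds = ray.data.read_parquet("s3://bucket/pre-shuffled/")

for epoch in range(num_epochs):
    # Cheap per-epoch variation: reorder blocks only, no rows move.
    epoch_ds = ds.randomize_block_order(seed=epoch)

    # Per-batch randomness from a bounded in-memory buffer.
    for batch in epoch_ds.iter_torch_batches(
        batch_size=64,
        local_shuffle_buffer_size=50_000,
        local_shuffle_seed=epoch,
    ):
        ...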

Next steps

Iterating

Iterator options in training loops.

Performance tips

Profiling shuffle stages.