Ray Data offers three levels of shuffling, trading randomness against cost, plus a hash-based shuffle that underlies key-based operations like groupby and sort.

randomize_block_order

Cheap, metadata-only shuffle: it only randomizes the order in which blocks are emitted by the executor; no rows move between blocks. Useful when the dataset is already pre-shuffled at write time and you want some run-to-run variation.
ds = ds.randomize_block_order(seed=42)

Local shuffle buffer

When iterating, Ray Data fills an in-memory buffer and yields rows from it in random order. This trades a fixed memory cost for per-batch randomness: within a buffer, any row can land in any batch.
for batch in ds.iter_torch_batches(
    batch_size=64,
    local_shuffle_buffer_size=10_000,
    local_shuffle_seed=42,
):
    ...
Use this for training when full randomness across the dataset isn’t necessary.

Full distributed shuffle

The most expensive option: a global shuffle across the cluster, in which every row has equal probability of ending up in any block.
ds = ds.random_shuffle(seed=42)
Use sparingly. The cost is roughly equivalent to a network sort on the dataset.

Hash-based shuffle

groupby and sort perform a hash- or range-partitioned shuffle, used when grouping by key or producing a globally ordered output.
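A minimal sketch of both triggers; the synthetic dataset, the bucket column, and the modulo grouping key are illustrative assumptions, not part of the original example:
import ray

ds = ray.data.range(10_000)  # synthetic dataset with an "id" column

# Derive a hypothetical grouping key.
ds = ds.add_column("bucket", lambda df: df["id"] % 10)

# groupby triggers a hash-partitioned shuffle keyed on "bucket".
counts = ds.groupby("bucket").count()

# sort triggers a range-partitioned shuffle that yields a globally ordered output.
ordered = ds.sort("id")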

Best practices

For training, the typical pattern is: shuffle the data once at write time (file-level), call randomize_block_order once per epoch, and set a local_shuffle_buffer_size of 10,000–100,000 rows for per-batch randomness, as sketched below. Reserve random_shuffle for cases that genuinely need uniform randomness.
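A minimal sketch of that layered pattern, assuming the Parquet files were already shuffled when written; the path, epoch count, buffer size, and seeds are placeholders:
import ray

num_epochs = 2  # placeholder

# Hypothetical dataset that was shuffled at write time (file-level).
ds = ray.data.read_parquet("s3://bucket/pre-shuffled/")

for epoch in range(num_epochs):
    # Cheap per-epoch variation: reorder blocks only, no rows move.
    epoch_ds = ds.randomize_block_order(seed=epoch)

    # Per-batch randomness from a bounded in-memory buffer.
    for batch in epoch_ds.iter_torch_batches(
        batch_size=64,
        local_shuffle_buffer_size=50_000,
        local_shuffle_seed=epoch,
    ):
        ...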

Next steps

Iterating

Iterator options in training loops.

Performance tips

Profiling shuffle stages.