This page collects the most common levers for improving Ray Data pipeline performance.
## Profile first
`ds.stats()` shows per-stage time, throughput, and memory. Look for:
- A stage that dominates wall-clock time
- A stage with low CPU/GPU utilization
- High spilling activity (object store pressure)
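For example, a minimal sketch of producing the report; the S3 path and the transform are placeholders:

```python
import ray

# Hypothetical pipeline; the path and the transform are placeholders.
ds = ray.data.read_parquet("s3://my-bucket/data/")
ds = ds.map_batches(lambda batch: batch)

materialized = ds.materialize()  # run the pipeline so stats are populated
print(materialized.stats())      # per-stage wall-clock time, throughput, and memory
```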
## Tune batch size
Bigger batches → better GPU utilization. Start with 32–256 for inference; tune upward until GPU memory is the limit.
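As a hedged sketch, the `batch_size` argument of `map_batches` controls this; the UDF, the `id` column, and `ds` (from the profiling example above) are placeholders:

```python
import numpy as np

def run_model(batch):
    # Placeholder for a real inference call.
    batch["pred"] = np.zeros(len(batch["id"]))
    return batch

predictions = ds.map_batches(
    run_model,
    batch_size=128,  # start in the 32-256 range; grow until GPU memory is the limit
)
```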
## Tune block size

Larger blocks reduce scheduling overhead but increase memory per task. The default of 128 MiB is a good starting point.
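For example, via the `DataContext`; the 256 MiB value is illustrative:

```python
from ray.data import DataContext

ctx = DataContext.get_current()
# Default is 128 MiB; larger blocks mean fewer tasks but more memory per task.
ctx.target_max_block_size = 256 * 1024 * 1024
```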
## Right-size concurrency

Set `concurrency` to match available resources. For GPU stages: `num_gpus_in_cluster / num_gpus_per_replica`.
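For example, with 8 GPUs in the cluster and one GPU per replica, reusing the placeholder UDF from above:

```python
predictions = ds.map_batches(
    run_model,        # placeholder UDF from the batch-size example
    batch_size=128,
    num_gpus=1,       # each replica uses one GPU
    concurrency=8,    # num_gpus_in_cluster / num_gpus_per_replica = 8 / 1
)
```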
## Avoid Python row UDFs
`map` (row-level) is significantly slower than `map_batches` (block-level). Vectorize with NumPy or pandas where possible.
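A hedged before/after sketch, assuming a numeric column named `x`:

```python
import numpy as np

# Row-level UDF: one Python call per row (slow).
slow = ds.map(lambda row: {**row, "x2": row["x"] * 2})

# Block-level UDF: one vectorized NumPy call per batch (fast).
def double(batch):
    batch["x2"] = np.asarray(batch["x"]) * 2
    return batch

fast = ds.map_batches(double, batch_format="numpy")
```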
## Use the right batch format
| Format | When to use |
|---|---|
| `numpy` | Tensor work; lowest overhead. |
| `pandas` | Tabular operations. |
| `pyarrow` | Zero-copy with Parquet. |
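The format is selected per stage with the `batch_format` argument of `map_batches`; in this sketch the columns `a` and `b` are assumptions:

```python
import pandas as pd

def add_ratio(batch: pd.DataFrame) -> pd.DataFrame:
    # Tabular work on a pandas DataFrame; columns "a" and "b" are placeholders.
    batch["ratio"] = batch["a"] / batch["b"]
    return batch

ds = ds.map_batches(add_ratio, batch_format="pandas")
```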
## Pre-shuffle data at write time
Global shuffles (`random_shuffle`) are expensive. Shuffle once when writing data and use `randomize_block_order` plus a local shuffle buffer at read time.
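A hedged read-time sketch; the path is a placeholder for data that was already shuffled when it was written:

```python
ds = ray.data.read_parquet("s3://my-bucket/pre-shuffled/")  # shuffled once at write time
ds = ds.randomize_block_order()  # cheap: reorders blocks only, no data movement

for batch in ds.iter_batches(
    batch_size=128,
    local_shuffle_buffer_size=10_000,  # approximate shuffle within a rolling buffer of rows
):
    ...
```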
## Pin GPU stages
Ensure GPU stages don’t run on CPU-only nodes:
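For example, a sketch that requests a GPU per worker so the scheduler excludes CPU-only nodes; the predictor is a placeholder:

```python
class Predictor:
    def __init__(self):
        self.model = None  # placeholder: load the model onto the GPU here

    def __call__(self, batch):
        return batch       # placeholder inference

predictions = ds.map_batches(
    Predictor,
    batch_size=128,
    num_gpus=1,     # each worker requires one GPU, so it can't land on CPU-only nodes
    concurrency=4,  # number of replicas
)
```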
## Materialize when reusing

If you iterate over the same dataset multiple times, `materialize()` it after the expensive transforms:
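A hedged sketch; the transform is a placeholder:

```python
def expensive_transform(batch):
    return batch  # placeholder for a costly preprocessing step

ds = ds.map_batches(expensive_transform).materialize()  # pay the cost once

for epoch in range(3):
    for batch in ds.iter_batches(batch_size=128):
        ...  # each epoch reuses the materialized result
```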
## Adjust prefetch
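For example, the `prefetch_batches` argument of `iter_batches` controls how many batches are loaded ahead of the consumer; the value here is illustrative:

```python
for batch in ds.iter_batches(
    batch_size=128,
    prefetch_batches=4,  # fetch ahead so the consumer (e.g. a GPU) isn't starved
):
    ...
```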
## Limit object store usage
For very large datasets, increase the object store size:
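For example, when starting Ray; the 50 GiB figure is illustrative:

```python
import ray

# Reserve a larger object store on this node; size it to the workload.
ray.init(object_store_memory=50 * 1024**3)
```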
## Diagnose spilling

Excessive spilling shows up in `ds.stats()` and in dashboard metrics. Reduce in-flight data by the following levers (a combined sketch appears after the list):
- Lowering `concurrency` on slow stages
- Lowering `target_max_block_size`
- Adding `materialize()` after a memory-heavy stage so subsequent stages stream from disk
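A hedged sketch combining these levers; the UDF and the values are placeholders:

```python
from ray.data import DataContext

DataContext.get_current().target_max_block_size = 64 * 1024 * 1024  # smaller blocks in flight

def heavy_transform(batch):
    return batch  # placeholder for a memory-heavy UDF

ds = (
    ds.map_batches(heavy_transform, concurrency=2)  # cap parallelism on the slow stage
      .materialize()                                # checkpoint before downstream stages
)
```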
## Next steps
- **Internals**: How execution and scheduling work.
- **Observability**: Cluster-level metrics for Ray Data.