This page collects the most common levers for improving Ray Data pipeline performance.
## Profile first
`ds.stats()` shows per-stage time, throughput, and memory. Look for:
- A stage that dominates wall-clock time
- A stage with low CPU/GPU utilization
- High spilling activity (object store pressure)
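For example, a minimal sketch of producing the report; the S3 path and the transform are placeholders:

```python
import ray

# Hypothetical pipeline; the path and the transform are placeholders.
ds = ray.data.read_parquet("s3://my-bucket/data/")
ds = ds.map_batches(lambda batch: batch)

materialized = ds.materialize()  # run the pipeline so stats are populated
print(materialized.stats())      # per-stage wall-clock time, throughput, and memory
```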
## Tune batch size
Bigger batches → better GPU utilization. Start with 32–256 for inference; tune upward until GPU memory is the limit.
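As a hedged sketch, the `batch_size` argument of `map_batches` controls this; the UDF, the `id` column, and `ds` (from the profiling example above) are placeholders:

```python
import numpy as np

def run_model(batch):
    # Placeholder for a real inference call.
    batch["pred"] = np.zeros(len(batch["id"]))
    return batch

predictions = ds.map_batches(
    run_model,
    batch_size=128,  # start in the 32-256 range; grow until GPU memory is the limit
)
```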
## Tune block size

Larger blocks reduce scheduling overhead but increase memory per task. The default of 128 MiB is a good starting point.
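For example, via the `DataContext`; the 256 MiB value is illustrative:

```python
from ray.data import DataContext

ctx = DataContext.get_current()
# Default is 128 MiB; larger blocks mean fewer tasks but more memory per task.
ctx.target_max_block_size = 256 * 1024 * 1024
```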
## Right-size concurrency

Set `concurrency` to match available resources. For GPU stages: `num_gpus_in_cluster / num_gpus_per_replica`.
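For example, with 8 GPUs in the cluster and one GPU per replica, reusing the placeholder UDF from above:

```python
predictions = ds.map_batches(
    run_model,        # placeholder UDF from the batch-size example
    batch_size=128,
    num_gpus=1,       # each replica uses one GPU
    concurrency=8,    # num_gpus_in_cluster / num_gpus_per_replica = 8 / 1
)
```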
## Avoid Python row UDFs
`map` (row-level) is significantly slower than `map_batches` (block-level). Vectorize with NumPy or pandas where possible.
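A hedged before/after sketch, assuming a numeric column named `x`:

```python
import numpy as np

# Row-level UDF: one Python call per row (slow).
slow = ds.map(lambda row: {**row, "x2": row["x"] * 2})

# Block-level UDF: one vectorized NumPy call per batch (fast).
def double(batch):
    batch["x2"] = np.asarray(batch["x"]) * 2
    return batch

fast = ds.map_batches(double, batch_format="numpy")
```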
## Use the right batch format
| Format | When to use |
|---|---|
| `numpy` | Tensor work; lowest overhead. |
| `pandas` | Tabular operations. |
| `pyarrow` | Zero-copy with Parquet. |
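The format is selected per stage with the `batch_format` argument of `map_batches`; in this sketch the columns `a` and `b` are assumptions:

```python
import pandas as pd

def add_ratio(batch: pd.DataFrame) -> pd.DataFrame:
    # Tabular work on a pandas DataFrame; columns "a" and "b" are placeholders.
    batch["ratio"] = batch["a"] / batch["b"]
    return batch

ds = ds.map_batches(add_ratio, batch_format="pandas")
```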
## Pre-shuffle data at write time
Global shuffles (`random_shuffle`) are expensive. Shuffle once when writing data and use `randomize_block_order` plus a local shuffle buffer at read time.
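A hedged read-time sketch; the path is a placeholder for data that was already shuffled when it was written:

```python
ds = ray.data.read_parquet("s3://my-bucket/pre-shuffled/")  # shuffled once at write time
ds = ds.randomize_block_order()  # cheap: reorders blocks only, no data movement

for batch in ds.iter_batches(
    batch_size=128,
    local_shuffle_buffer_size=10_000,  # approximate shuffle within a rolling buffer of rows
):
    ...
```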
## Pin GPU stages
Ensure GPU stages don’t run on CPU-only nodes:
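For example, a sketch that requests a GPU per worker so the scheduler excludes CPU-only nodes; the predictor is a placeholder:

```python
class Predictor:
    def __init__(self):
        self.model = None  # placeholder: load the model onto the GPU here

    def __call__(self, batch):
        return batch       # placeholder inference

predictions = ds.map_batches(
    Predictor,
    batch_size=128,
    num_gpus=1,     # each worker requires one GPU, so it can't land on CPU-only nodes
    concurrency=4,  # number of replicas
)
```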
## Materialize when reusing

If you iterate over the same dataset multiple times, `materialize()` it after the expensive transforms:
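A hedged sketch; the transform is a placeholder:

```python
def expensive_transform(batch):
    return batch  # placeholder for a costly preprocessing step

ds = ds.map_batches(expensive_transform).materialize()  # pay the cost once

for epoch in range(3):
    for batch in ds.iter_batches(batch_size=128):
        ...  # each epoch reuses the materialized result
```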
## Adjust prefetch
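For example, the `prefetch_batches` argument of `iter_batches` controls how many batches are loaded ahead of the consumer; the value here is illustrative:

```python
for batch in ds.iter_batches(
    batch_size=128,
    prefetch_batches=4,  # fetch ahead so the consumer (e.g. a GPU) isn't starved
):
    ...
```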
## Limit object store usage
For very large datasets, increase the object store size:
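For example, when starting Ray; the 50 GiB figure is illustrative:

```python
import ray

# Reserve a larger object store on this node; size it to the workload.
ray.init(object_store_memory=50 * 1024**3)
```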
## Diagnose spilling

Excessive spilling shows up in `ds.stats()` and in dashboard metrics. Reduce in-flight data by the following levers (a combined sketch appears after the list):
- Lowering `concurrency` on slow stages
- Lowering `target_max_block_size`
- Adding `materialize()` after a memory-heavy stage so subsequent stages stream from disk
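A hedged sketch combining these levers; the UDF and the values are placeholders:

```python
from ray.data import DataContext

DataContext.get_current().target_max_block_size = 64 * 1024 * 1024  # smaller blocks in flight

def heavy_transform(batch):
    return batch  # placeholder for a memory-heavy UDF

ds = (
    ds.map_batches(heavy_transform, concurrency=2)  # cap parallelism on the slow stage
      .materialize()                                # checkpoint before downstream stages
)
```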
## Next steps
- **Internals**: How execution and scheduling work.
- **Observability**: Cluster-level metrics for Ray Data.