Understanding Ray Data internals helps when tuning pipelines and debugging slow stages.
Logical and physical plans
A Dataset builds a logical plan as you call operators (read_*, map_batches, filter, and so on). When execution starts, Ray Data optimizes the plan (fusing adjacent operators, reordering filters and projections) and produces a physical plan of operators connected by streaming queues.
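The optimization step can be sketched in plain Python. This is an illustrative toy only, not Ray Data code: a logical plan is a list of ops, and the "optimizer" collapses runs of adjacent map ops into one fused op. All names here are invented for illustration.

```python
# Toy sketch (not Ray Data APIs): fuse adjacent "map" ops in a logical
# plan, loosely mirroring what Ray Data's optimizer does before execution.

def fuse_adjacent_maps(plan):
    """Collapse consecutive map ops into single fused ops."""
    fused = []
    for op, fn in plan:
        if op == "map" and fused and fused[-1][0] == "map":
            prev_fn = fused[-1][1]
            # Compose the two functions so one worker applies both.
            fused[-1] = ("map", lambda x, f=prev_fn, g=fn: g(f(x)))
        else:
            fused.append((op, fn))
    return fused

logical = [
    ("read", None),
    ("map", lambda x: x + 1),
    ("map", lambda x: x * 2),
]
physical = fuse_adjacent_maps(logical)
assert len(physical) == 2        # read + one fused map
assert physical[1][1](3) == 8    # (3 + 1) * 2
```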
Streaming executor
The executor pipelines operators across the cluster. Each operator is implemented as a Ray task or actor that consumes blocks from upstream queues and produces blocks for downstream queues. Key properties:
- Backpressure: when a downstream queue is full, upstream operators stall instead of producing more data.
- Spilling: when memory pressure rises, the executor spills cold blocks to local disk.
- Locality-aware scheduling: operators are placed near their inputs when possible.
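Backpressure can be illustrated with a bounded queue. This is a minimal sketch in plain Python, not Ray Data code: the producer blocks as soon as the queue is full, so it can never run further ahead of the consumer than the queue capacity allows.

```python
# Sketch only (not Ray Data internals): backpressure via a bounded queue.
import queue
import threading

q = queue.Queue(maxsize=2)   # small downstream queue capacity
consumed = []

def producer():
    for block in range(6):
        q.put(block)         # stalls here whenever the queue is full
    q.put(None)              # sentinel: no more blocks

def consumer():
    while True:
        block = q.get()
        if block is None:
            break
        consumed.append(block)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
assert consumed == list(range(6))   # all blocks arrive, in order
```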
Operator fusion
Adjacent operators with compatible resource requirements are fused into a single Ray task to reduce scheduling overhead. For example, two consecutive map_batches calls on the same resources execute in one worker per block rather than passing blocks between workers through the object store.
You can disable fusion for debugging:
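A minimal sketch, assuming a Ray 2.x-style `DataContext`. The `optimize_fuse_stages` flag is an assumption here: it existed on older releases, and the exact field name has changed across Ray versions, so verify against the `ray.data.DataContext` fields of your installed release.

```python
# Sketch only: disabling operator fusion for debugging. The flag name
# below is assumed from older Ray releases; check your version's
# ray.data.DataContext for the current field.
import ray

ctx = ray.data.DataContext.get_current()
ctx.optimize_fuse_stages = False  # assumed flag name; version-dependent

ds = ray.data.range(1000).map_batches(lambda b: b).map_batches(lambda b: b)
print(ds.stats())  # inspect whether the two map stages ran fused
```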
Block sizing
The executor targets blocks of target_max_block_size bytes (default 128 MiB). Larger blocks reduce per-block overhead but increase memory pressure.
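The trade-off is easiest to see as arithmetic. This is illustrative only: 128 MiB is the default mentioned above, while the 10 GiB dataset size is an invented example value.

```python
# Illustrative arithmetic: how a target block size translates into a
# block count (and thus scheduling overhead vs. per-block memory).
import math

target_max_block_size = 128 * 1024 * 1024   # 128 MiB, the stated default
dataset_bytes = 10 * 1024 ** 3              # hypothetical 10 GiB dataset

num_blocks = math.ceil(dataset_bytes / target_max_block_size)
assert num_blocks == 80   # 10 GiB / 128 MiB = 80 blocks to schedule
```

Halving the target block size would double the block count (more scheduling work), while doubling it would halve the count at the cost of larger in-flight blocks.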
Memory management
Ray Data tracks the memory used by in-flight blocks and pauses upstream operators when the budget is exceeded. The budget is split across:
- The Ray object store (configured at ray.init)
- Operator output buffers
- Per-task memory requests
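The pause/resume behavior can be sketched as an admission check against a byte budget. This is a toy illustration in plain Python, not Ray Data's actual accounting; the class and budget values are invented.

```python
# Toy sketch (not Ray Data code): admit blocks only while in-flight
# bytes stay under budget; otherwise the upstream producer must pause.
class MemoryBudget:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.in_flight = 0

    def try_admit(self, block_bytes):
        """Admit a block if it fits; False signals upstream to pause."""
        if self.in_flight + block_bytes > self.budget:
            return False
        self.in_flight += block_bytes
        return True

    def release(self, block_bytes):
        """Called when a downstream operator consumes the block."""
        self.in_flight -= block_bytes

budget = MemoryBudget(256)
assert budget.try_admit(128)
assert budget.try_admit(128)
assert not budget.try_admit(1)   # over budget: producer pauses
budget.release(128)
assert budget.try_admit(1)       # resumes once memory is freed
```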
Fault tolerance
Failed read or map tasks are retried automatically. Lost intermediate blocks are reconstructed via lineage. For long pipelines, materialize intermediate datasets to avoid expensive reconstruction: materialize forces full execution and pins the result in the object store.
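The automatic-retry behavior can be sketched in plain Python. This is an illustration only, not Ray internals: the retry count and the flaky task are invented example values.

```python
# Sketch only (not Ray internals): retry a transiently failing task,
# the behavior Ray applies automatically to failed read/map tasks.
def run_with_retries(task, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_retries:
                raise   # out of retries: surface the failure

failures = {"left": 2}   # example: task fails twice, then succeeds

def flaky_read():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("transient read failure")
    return "block"

assert run_with_retries(flaky_read) == "block"
```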
Stats and tracing
ds.stats() returns per-stage timing, throughput, and memory.
ds.iter_internal_ref_bundles() exposes raw block refs for advanced use.
Configuration
Next steps
- Performance tips: Diagnose and fix slow pipelines.
- Key concepts: Operator and execution overview.