
Understanding Ray Data internals helps when tuning pipelines and debugging slow stages.

Logical and physical plans

A Dataset builds a logical plan as you call operators (read_*, map_batches, filter, etc.). When execution starts, Ray Data optimizes the plan (fusing adjacent ops, reordering filters and projections) and produces a physical plan of operators connected by streaming queues.
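The lazy-plan idea can be illustrated with a minimal sketch (these are hypothetical classes, not Ray's internals): calling operators only appends to a recorded plan, and no data is read or transformed until execution begins.

```python
# Illustrative sketch of lazy plan building; not Ray's actual classes.
class Dataset:
    def __init__(self, plan=()):
        self._plan = list(plan)  # the logical plan: recorded, not executed

    def map_batches(self, fn):
        # Returns a new Dataset with one more operator appended; fn never runs here.
        return Dataset(self._plan + [("map_batches", fn)])

    def filter(self, fn):
        return Dataset(self._plan + [("filter", fn)])


def read_items(items):
    """Stand-in for a read_* operator: records a read, touches no data yet."""
    return Dataset([("read", list(items))])


# Building a pipeline records three operators but performs no work:
ds = read_items(range(5)).map_batches(lambda b: b).filter(lambda x: x > 1)
```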

Streaming executor

The executor pipelines operators across the cluster. Each operator is implemented as a Ray task or actor that consumes blocks from upstream queues and produces blocks for downstream queues. Key properties:
  • Backpressure: when a downstream queue is full, upstream operators stall instead of producing more data.
  • Spilling: when memory pressure rises, the executor spills cold blocks to local disk.
  • Locality-aware scheduling: operators are placed near their inputs when possible.
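Backpressure, in particular, falls out naturally from bounded queues. A toy sketch (an illustration, not Ray's executor): a bounded `queue.Queue` between a producer ("upstream operator") and a consumer ("downstream operator") makes `put()` block whenever the queue is full, which is exactly the stall behavior described above.

```python
import queue
import threading

def run_pipeline(num_blocks, queue_size=2):
    """Toy producer/consumer pipeline with a bounded streaming queue."""
    q = queue.Queue(maxsize=queue_size)
    consumed = []

    def producer():
        for i in range(num_blocks):
            q.put(i)    # stalls while the downstream queue is full (backpressure)
        q.put(None)     # sentinel: no more blocks

    def consumer():
        while True:
            block = q.get()
            if block is None:
                break
            consumed.append(block)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return consumed
```

The producer can never run more than `queue_size` blocks ahead of the consumer, bounding in-flight memory regardless of how fast upstream produces.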

Operator fusion

Adjacent operators with compatible resource requirements are fused into a single Ray task to reduce scheduling overhead. For example, two consecutive map_batches calls with the same resource requirements run in one worker per block rather than passing intermediate blocks through the object store. You can disable fusion for debugging:
ctx = ray.data.DataContext.get_current()
ctx.optimizer_enabled = False
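Conceptually, fusion is function composition: a hypothetical sketch of what the optimizer does when it merges two per-batch functions, so each block is transformed once inside a single worker.

```python
def fuse(f, g):
    """Return one per-batch function equivalent to applying f, then g."""
    def fused(batch):
        # The intermediate result of f never leaves the worker process,
        # which is what saves the object-store round trip.
        return g(f(batch))
    return fused

double = lambda batch: [x * 2 for x in batch]
add_one = lambda batch: [x + 1 for x in batch]

fused_op = fuse(double, add_one)
```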

Block sizing

The executor targets blocks of target_max_block_size bytes (default 128 MiB). Larger blocks reduce overhead but increase memory pressure.
ctx.target_max_block_size = 256 * 1024 * 1024  # 256 MiB
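A back-of-envelope helper (an assumption for illustration; the real executor also bounds rows per block and may produce uneven blocks) shows the trade-off:

```python
import math

def num_blocks(dataset_bytes, target_max_block_size=128 * 1024 * 1024):
    """Approximate block count if data splits at target_max_block_size."""
    return max(1, math.ceil(dataset_bytes / target_max_block_size))

# A 10 GiB dataset at the default 128 MiB target yields 80 blocks;
# raising the target to 256 MiB halves that to 40 larger blocks,
# with fewer tasks to schedule but more memory held per task.
```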

Memory management

Ray Data tracks the memory used by in-flight blocks and pauses upstream operators when the budget is exceeded. The budget is split across:
  • The Ray object store (configured at ray.init)
  • Operator output buffers
  • Per-task memory requests
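The pause mechanism can be sketched with a toy accounting class (an illustration of the idea, not Ray's memory accounting): track bytes held by in-flight blocks and signal upstream to pause once the budget would be exceeded.

```python
class MemoryBudget:
    """Toy in-flight memory tracker that refuses blocks over budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.in_flight = 0

    def admit(self, block_bytes):
        """Account for a new block; return False to pause the upstream operator."""
        if self.in_flight + block_bytes > self.budget:
            return False
        self.in_flight += block_bytes
        return True

    def release(self, block_bytes):
        """Called when a downstream operator consumes and frees a block."""
        self.in_flight -= block_bytes
```

Once downstream consumers release blocks, `admit` succeeds again and the paused upstream resumes, which is the steady-state behavior of the streaming executor.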

Fault tolerance

Failed read or map tasks are retried automatically. Lost intermediate blocks are reconstructed via lineage. For long pipelines, materialize intermediate datasets to avoid expensive reconstruction:
ds = ds.materialize()
materialize forces full execution and pins the result in the object store.
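Lineage-based reconstruction can be sketched conceptually (an illustration only; Ray reconstructs lost objects from task lineage inside the core runtime): each block remembers how it was produced, so a lost block is recomputed on demand rather than checkpointed.

```python
def make_block(source, transforms):
    """Recompute a block by replaying its recorded transforms."""
    block = list(source)
    for fn in transforms:
        block = [fn(x) for x in block]
    return block


class LineageBlock:
    """Toy block that can be recomputed from its lineage after loss."""

    def __init__(self, source, transforms):
        self.source = source          # original input, assumed recoverable
        self.transforms = transforms  # recorded lineage
        self.data = make_block(source, transforms)

    def lose(self):
        self.data = None              # simulate a node failure losing the block

    def get(self):
        if self.data is None:         # reconstruct from lineage on demand
            self.data = make_block(self.source, self.transforms)
        return self.data
```

The cost of `get()` after a loss grows with the length of the transform chain, which is why materializing long pipelines caps reconstruction cost.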

Stats and tracing

ds.stats() returns per-stage timing, throughput, and memory. ds.iter_internal_ref_bundles() exposes raw block refs for advanced use.

Configuration

ctx = ray.data.DataContext.get_current()
ctx.target_max_block_size = 128 * 1024 * 1024
ctx.execution_options.resource_limits.cpu = 64
ctx.execution_options.resource_limits.gpu = 8
ctx.execution_options.locality_with_output = True

Next steps

  • Performance tips: diagnose and fix slow pipelines.
  • Key concepts: operator and execution overview.