Ray Train accepts data in three forms: Ray Datasets, framework-native loaders, or custom iterators inside train_loop_per_worker.
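The first two options are shown below; the third is simply any iterator you build yourself inside the loop. A minimal sketch, where my_worker_files and load_batch are hypothetical placeholders for per-worker I/O:

def train_loop_per_worker(config):
    # Custom iterator: any Python generator works inside the loop.
    # Each worker is responsible for reading its own slice of the data.
    def batches():
        for path in my_worker_files:   # hypothetical per-worker file list
            yield load_batch(path)     # hypothetical loading function

    for batch in batches():
        ...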

Ray Datasets

The recommended option for large or distributed datasets.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

ds = ray.data.read_parquet("s3://bucket/train/")
valid_ds = ray.data.read_parquet("s3://bucket/valid/")  # validation split; path is illustrative

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ds, "valid": valid_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
Inside the loop:
import ray.train

def train_loop_per_worker(config):
    # Each worker gets its own streaming shard of the "train" dataset.
    train_shard = ray.train.get_dataset_shard("train")
    for batch in train_shard.iter_torch_batches(batch_size=64):
        # Batches arrive as dicts mapping column names to torch.Tensors.
        x, y = batch["features"], batch["label"]
        ...
Each worker receives a unique shard. Ray Data streams blocks from storage to the worker.
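Conceptually, this sharding is the same as splitting the dataset into one streaming iterator per worker. A sketch of the underlying mechanism, assuming a 4-worker job:

import ray

ds = ray.data.read_parquet("s3://bucket/train/")

# streaming_split returns one DataIterator per worker; get_dataset_shard
# hands each worker the iterator matching its rank. Each iterator streams
# a disjoint subset of the dataset's blocks.
shards = ds.streaming_split(4, equal=True)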

Local shuffle buffer

train_shard.iter_torch_batches(
    batch_size=64,
    local_shuffle_buffer_size=10_000,
)
Samples are drawn at random from a rolling 10,000-row buffer, giving per-batch randomness without paying for a global shuffle.
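If full randomness is worth the cost, a global shuffle can be applied to the dataset before it is passed to the trainer; a sketch using the same ds as above:

# Global alternative: reorders all rows across the cluster. Much more
# expensive than a local buffer, but maximally random.
ds = ray.data.read_parquet("s3://bucket/train/").random_shuffle()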

Framework-native loaders

For datasets that fit in memory on each worker, use the framework's own loader. Wrap it with ray.train.torch.prepare_data_loader, which adds a DistributedSampler and moves each batch to the worker's device:
import ray.train.torch
from torch.utils.data import DataLoader

loader = DataLoader(my_dataset, batch_size=64)
loader = ray.train.torch.prepare_data_loader(loader)
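A minimal end-to-end sketch of the same pattern inside the training loop; the TensorDataset and its shapes are illustrative:

import torch
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch

def train_loop_per_worker(config):
    # A small dataset that fits in each worker's memory.
    my_dataset = TensorDataset(torch.randn(1_000, 8), torch.randn(1_000, 1))
    loader = DataLoader(my_dataset, batch_size=64, shuffle=True)
    # Adds a DistributedSampler and moves each batch to this worker's device.
    loader = ray.train.torch.prepare_data_loader(loader)
    for x, y in loader:
        ...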

Multi-dataset training

Pass several keys in datasets:
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds, "valid": valid_ds, "test": test_ds},
)
Inside the loop:
train_shard = ray.train.get_dataset_shard("train")
valid_shard = ray.train.get_dataset_shard("valid")
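A sketch of how the two shards might be used together, with a validation pass after each training epoch (the epoch count and batch size are illustrative):

import ray.train

def train_loop_per_worker(config):
    train_shard = ray.train.get_dataset_shard("train")
    valid_shard = ray.train.get_dataset_shard("valid")
    for epoch in range(3):
        for batch in train_shard.iter_torch_batches(batch_size=64):
            ...  # forward/backward pass
        # By default every dataset is sharded, so each worker validates
        # on its own slice of "valid".
        for batch in valid_shard.iter_torch_batches(batch_size=64):
            ...  # accumulate validation metrics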

Validation only on rank 0

To validate on the full dataset, run the pass only on the rank-0 worker. Here valid_ds is the driver-side Dataset captured from the enclosing scope rather than passed through datasets, so it is never sharded:
import torch.distributed as dist

# Inside train_loop_per_worker, after Ray Train has initialized the
# torch.distributed process group.
if dist.get_rank() == 0:
    for batch in valid_ds.iter_torch_batches():
        ...
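Alternatively, keep passing the validation set through datasets and tell Ray Train not to split it. A sketch, assuming ray.train.DataConfig (available in recent Ray releases):

from ray.train import DataConfig

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds, "valid": valid_ds},
    # Only "train" is split; every worker sees the full "valid" dataset
    # via ray.train.get_dataset_shard("valid").
    dataset_config=DataConfig(datasets_to_split=["train"]),
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)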

Next steps

Ray Data: build the input pipeline.
Performance: tune data-loading throughput.