Ray Train accepts data in three forms: Ray Datasets, framework-native loaders, or custom iterators inside train_loop_per_worker.
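The first two forms are covered below. The third needs no Ray-specific API: any plain Python iterable constructed inside the loop works. A minimal sketch, where the config key and file parsing are purely illustrative:

def train_loop_per_worker(config):
    # Custom iterator: any Python iterable defined inside the loop.
    for path in config["local_files"]:  # hypothetical config key
        with open(path) as f:
            for line in f:
                ...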
Ray Datasets
The recommended option for large or distributed datasets.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

ds = ray.data.read_parquet("s3://bucket/train/")

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ds, "valid": valid_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
Inside the loop:
def train_loop_per_worker(config):
    train_shard = ray.train.get_dataset_shard("train")
    for batch in train_shard.iter_torch_batches(batch_size=64):
        x, y = batch["features"], batch["label"]
        ...
Each worker receives a unique shard. Ray Data streams blocks from storage to the worker.
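A quick way to see the sharding is to have every worker report its rank and how many rows its shard streamed. A minimal sketch of a loop body, assuming the "label" column from the example above:

def train_loop_per_worker(config):
    rank = ray.train.get_context().get_world_rank()
    shard = ray.train.get_dataset_shard("train")
    rows = 0
    for batch in shard.iter_torch_batches(batch_size=1024):
        rows += len(batch["label"])  # count rows seen by this worker
    print(f"worker {rank} streamed {rows} rows")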
Local shuffle buffer
train_shard.iter_torch_batches(
    batch_size=64,
    local_shuffle_buffer_size=10_000,
)
Provides per-batch randomness without paying for a global shuffle.
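If full randomization is required instead, a global shuffle can be applied to the dataset before passing it to the trainer, at a much higher cost since every row is redistributed. A sketch using the ds from above:

# Global shuffle: redistributes all rows across the cluster.
ds = ray.data.read_parquet("s3://bucket/train/").random_shuffle()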
Framework-native loaders
For datasets that fit on each worker, use a framework-native loader. Wrap it with ray.train.torch.prepare_data_loader to add a DistributedSampler and move batches to the correct device:
from torch.utils.data import DataLoader

loader = DataLoader(my_dataset, batch_size=64)
loader = ray.train.torch.prepare_data_loader(loader)
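prepare_data_loader is typically called inside train_loop_per_worker, once the worker group is set up. A sketch under that assumption (my_dataset and my_model are placeholders); calling set_epoch on the sampler, when present, reshuffles the data each epoch:

import torch
from torch.utils.data import DataLoader

def train_loop_per_worker(config):
    loader = DataLoader(my_dataset, batch_size=64, shuffle=True)
    loader = ray.train.torch.prepare_data_loader(loader)
    model = ray.train.torch.prepare_model(my_model)  # placeholder model
    for epoch in range(config.get("epochs", 1)):
        if hasattr(loader, "sampler") and hasattr(loader.sampler, "set_epoch"):
            loader.sampler.set_epoch(epoch)  # reshuffle per epoch
        for x, y in loader:  # batches arrive on the worker's device
            ...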
Multi-dataset training
Pass several keys in datasets:
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds, "valid": valid_ds, "test": test_ds},
)
Inside:
train_shard = ray.train.get_dataset_shard("train")
valid_shard = ray.train.get_dataset_shard("valid")
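Both shards can then be iterated inside the same loop, for example one training pass followed by one validation pass per epoch (num_epochs is illustrative):

for epoch in range(num_epochs):
    for batch in train_shard.iter_torch_batches(batch_size=64):
        ...  # training step
    for batch in valid_shard.iter_torch_batches(batch_size=64):
        ...  # validation step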
Validation only on rank 0
Read the full (unsharded) dataset on the coordinator:
import torch.distributed as dist

if dist.get_rank() == 0:
    for batch in valid_ds.iter_torch_batches():
        ...
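The rank check can also use Ray Train's context API instead of torch.distributed, which works the same way for non-Torch frameworks:

# Equivalent rank check via the Train context.
if ray.train.get_context().get_world_rank() == 0:
    ...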
Next steps
Ray Data: build the input pipeline.
Performance: tune data-loading throughput.