Ray Train sets up the PyTorch distributed environment so you don’t have to. The same
train_loop_per_worker runs across N workers; you just pick the parallelism strategy.
DistributedDataParallel (DDP)
The default. Each worker holds a full model replica; gradients are all-reduced at every backward pass. prepare_model moves the model to the worker’s GPU and wraps it in torch.nn.parallel.DistributedDataParallel.
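A minimal sketch of a DDP training loop (the toy model, synthetic batch, and worker count are illustrative):

```python
import torch
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    # prepare_model moves the model to this worker's GPU and wraps it in DDP.
    model = ray.train.torch.prepare_model(torch.nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(config["num_epochs"]):
        X = torch.randn(32, 10, device=device)  # stand-in for a real batch
        y = torch.randn(32, 1, device=device)
        loss = torch.nn.functional.mse_loss(model(X), y)
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across workers here
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```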
Fully Sharded Data Parallel (FSDP)
For models too large to fit on one GPU. FSDP shards model parameters, gradients, and optimizer state across workers. See RayFSDPStrategy for a higher-level integration.
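One sketch, assuming you wrap the model with PyTorch’s FSDP yourself inside the loop (Ray Train has already initialized the process group; the model is a placeholder):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import ray.train.torch

def train_loop_per_worker(config):
    device = ray.train.torch.get_device()
    torch.cuda.set_device(device)
    # Wrap the (placeholder) model in FSDP; parameters, gradients, and
    # optimizer state are sharded across the Ray Train workers.
    model = FSDP(torch.nn.Linear(10, 1).to(device))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    # ...training steps proceed as in the DDP example...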
Tensor parallelism
For very large models, combine FSDP with tensor parallelism. Use libraries like Megatron-LM, DeepSpeed, or PyTorch’s DeviceMesh. Ray Train provides the process group; the partitioning strategy is up to your library of choice.
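For instance, with PyTorch’s DeviceMesh you could carve the Ray-provided process group into data-parallel and tensor-parallel dimensions (the 2×4 layout over 8 workers is purely illustrative):

```python
from torch.distributed.device_mesh import init_device_mesh

def train_loop_per_worker(config):
    # Ray Train initialized the default process group across 8 workers;
    # split it into 2 data-parallel groups x 4 tensor-parallel ranks.
    mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
    tp_group = mesh.get_group("tp")  # hand this to your tensor-parallel library
```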
Process group initialization
Ray Train initializes the default process group before train_loop_per_worker runs. You can use any PyTorch distributed primitive directly:
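For example, an all-reduce works out of the box inside the loop:

```python
import torch
import torch.distributed as dist
import ray.train.torch

def train_loop_per_worker(config):
    # The default process group already exists; no init_process_group needed.
    device = ray.train.torch.get_device()
    t = torch.ones(1, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert t.item() == dist.get_world_size()  # each worker contributed a 1
```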
Communication backend
By default, Ray uses nccl for GPU workers and gloo for CPU workers. Override with TorchConfig:
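For example, to force the gloo backend even on GPU workers (reusing the train_loop_per_worker defined above):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,
    # nccl is the GPU default; override it explicitly with gloo.
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
```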
Multi-node training
Just bump num_workers past one node’s GPU count. Ray Train places workers across nodes and configures the process group with the correct master address.
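For instance, on a cluster of 4-GPU nodes (an illustrative cluster shape), requesting 16 workers spreads training across 4 nodes:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,
    # 16 GPU workers on 4-GPU nodes -> 4 nodes; Ray Train picks the master
    # address and wires up the process group automatically.
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()
```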
Next steps
Data loading
Feed Ray Datasets to PyTorch workers.
Fault tolerance
Recover from worker failures.