Ray Train turns existing single-process training code into a distributed training job. You write a train_loop_per_worker that looks almost identical to a single-GPU script, and Ray Train runs it as N parallel workers, each on its own GPU, possibly across many nodes.
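A minimal sketch of that pattern with the PyTorch integration, assuming a toy model and synthetic data (the model, dataset, and hyperparameters are illustrative, not part of this page):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Ordinary single-GPU-style code; Ray Train runs one copy per worker.
    model = nn.Linear(10, 1)
    model = ray.train.torch.prepare_model(model)  # move to device, wrap in DDP

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    loader = ray.train.torch.prepare_data_loader(loader)  # add distributed sampling

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for epoch in range(config["epochs"]):
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


# N parallel workers; use_gpu=True assumes GPU workers are available.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 64, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```

The training loop itself stays single-process in style; only the trainer and its scaling configuration are Ray-specific.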
Why Ray Train
Framework agnostic
Native integrations with PyTorch, PyTorch Lightning, Hugging Face Transformers and Accelerate, JAX, TensorFlow, XGBoost, and LightGBM.
Scale up
Run on a laptop for development; scale to many GPUs and nodes by changing one line of configuration (see the sketch after this list).
Fault tolerance
Automatic checkpointing, resumption, and worker restart on failure.
Composable
Combine with Ray Data for distributed data loading, Ray Tune for hyperparameter tuning, and Ray Serve for deployment.
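For the scaling point above, the line in question is typically the ScalingConfig passed to the trainer. A minimal sketch of the development-to-cluster change, assuming a TorchTrainer like the one in the quickstart above (the worker counts are illustrative):

```python
from ray.train import ScalingConfig

# Laptop or single-machine development: a couple of CPU workers.
dev_scaling = ScalingConfig(num_workers=2, use_gpu=False)

# Multi-node GPU cluster: only this line changes.
cluster_scaling = ScalingConfig(num_workers=32, use_gpu=True)

# trainer = TorchTrainer(train_loop_per_worker, scaling_config=cluster_scaling)
```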
At a glance
Quickstarts by framework
PyTorch
Native distributed PyTorch with prepare_model / prepare_data_loader.
Lightning
Distributed training with the Lightning trainer.
Transformers
Distributed fine-tuning of Hugging Face models.
Concepts
Key concepts
Trainers, workers, scaling configs, run configs; the sketch after this list shows how they fit together.
Checkpointing
Save and resume training state.
Data loading
Pipe Ray Datasets into training workers.
Fault tolerance
Recover from worker, node, and storage failures.
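The concept cards above correspond to a small set of API objects. A combined sketch, assuming a toy linear model and a synthetic Ray Dataset (all names and values are illustrative): checkpoints are saved through ray.train.report and restored through ray.train.get_checkpoint, each worker reads its data shard via ray.train.get_dataset_shard, and FailureConfig controls automatic retries.

```python
import os
import tempfile

import torch
import torch.nn as nn

import ray
import ray.train.torch
from ray.train import (
    Checkpoint,
    CheckpointConfig,
    FailureConfig,
    RunConfig,
    ScalingConfig,
)
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = ray.train.torch.prepare_model(nn.Linear(1, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Resume from the latest checkpoint if the run was restarted.
    start_epoch = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    # Each worker streams its own shard of the Ray Dataset.
    shard = ray.train.get_dataset_shard("train")

    for epoch in range(start_epoch, config["epochs"]):
        for batch in shard.iter_torch_batches(batch_size=32):
            pred = model(batch["x"].float().unsqueeze(1))
            loss = nn.functional.mse_loss(pred, batch["y"].float().unsqueeze(1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Save a checkpoint alongside the reported metrics.
        with tempfile.TemporaryDirectory() as tmp:
            torch.save({"model": model.state_dict(), "epoch": epoch},
                       os.path.join(tmp, "state.pt"))
            ray.train.report(
                {"epoch": epoch, "loss": loss.item()},
                checkpoint=Checkpoint.from_directory(tmp),
            )


train_ds = ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(1000)])

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 3},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    run_config=RunConfig(
        name="demo-run",
        failure_config=FailureConfig(max_failures=3),      # retry on worker failure
        checkpoint_config=CheckpointConfig(num_to_keep=2),  # keep latest checkpoints
    ),
)
result = trainer.fit()
```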