
Ray Train turns existing single-process training code into a distributed training job. You write a train_loop_per_worker function that looks almost identical to a single-GPU script, and Ray Train runs it as N parallel workers, each on its own GPU, possibly spread across many nodes.

Why Ray Train

Framework agnostic

Native integrations with PyTorch, PyTorch Lightning, Hugging Face Transformers and Accelerate, JAX, TensorFlow, XGBoost, and LightGBM.

Scale up

Run on a laptop for development; scale to many GPUs and nodes by changing one line of configuration.
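
For example, the same training script can move from local debugging to a multi-GPU cluster run by swapping only the ScalingConfig. A minimal sketch; the worker counts are illustrative:

from ray.train import ScalingConfig

# Local development: one CPU worker.
scaling_config = ScalingConfig(num_workers=1, use_gpu=False)

# Cluster run: 16 workers, one GPU each, scheduled across nodes by Ray.
scaling_config = ScalingConfig(num_workers=16, use_gpu=True)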

Fault tolerance

Automatic checkpointing, resumption, and worker restart on failure.

Composable

Combine with Ray Data for distributed data loading, Ray Tune for hyperparameter tuning, and Ray Serve for deployment.

At a glance

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    import torch
    from ray.train.torch import prepare_data_loader, prepare_model

    # prepare_model wraps the model in DistributedDataParallel and
    # moves it to this worker's GPU.
    model = MyModel()  # your existing torch.nn.Module
    model = prepare_model(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # prepare_data_loader adds a DistributedSampler so each worker
    # trains on its own shard, and moves batches to the worker's device.
    data_loader = ...  # your torch.utils.data.DataLoader
    data_loader = prepare_data_loader(data_loader)

    for epoch in range(config["epochs"]):
        for batch in data_loader:
            ...  # forward pass, loss, backward pass, optimizer step

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()

Quickstarts by framework

PyTorch

Native distributed PyTorch with prepare_model / prepare_data_loader.

Lightning

Distributed training with the PyTorch Lightning Trainer.

Transformers

Distributed fine-tuning of Hugging Face models.

Concepts

Key concepts

Trainers, workers, scaling configs, run configs.
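
As a rough sketch of how these pieces fit together (the run name and storage path are placeholders):

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # runs once on every worker
    # ScalingConfig: how many workers, and what resources each gets.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # RunConfig: experiment-level settings, e.g. where results and
    # checkpoints are persisted.
    run_config=RunConfig(
        name="my-run",                           # placeholder
        storage_path="s3://my-bucket/ray-runs",  # placeholder
    ),
)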

Checkpointing

Save and resume training state.
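
A minimal sketch of reporting a checkpoint from inside the training loop; the metric names and file name are illustrative:

import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_loop_per_worker(config):
    ...  # create model, optimizer, data loader
    for epoch in range(config["epochs"]):
        ...  # train for one epoch, producing `loss`
        # Write training state to a local directory, then report it;
        # Ray Train persists the checkpoint alongside the metrics.
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            ray.train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )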

Data loading

Pipe Ray Datasets into training workers.
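
For example, a Ray Dataset passed to the trainer can be read inside the loop with get_dataset_shard. A sketch; the data path is a placeholder:

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker streams its own shard of the "train" dataset.
    shard = ray.train.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=32):
        ...  # batch is a dict of column name -> torch.Tensor

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ray.data.read_parquet("s3://my-bucket/data")},  # placeholder path
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)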

Fault tolerance

Recover from worker, node, and storage failures.
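
Automatic restarts are opted into through FailureConfig; a minimal sketch:

from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(
        # Retry the run up to 3 times after worker or node failures.
        # On restart, the loop can resume from ray.train.get_checkpoint().
        failure_config=FailureConfig(max_failures=3),
    ),
)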