
Ray Train turns existing single-process training code into a distributed training job. You write a train_loop_per_worker function that looks almost identical to a single-GPU script, and Ray Train runs it as N parallel workers, each on its own GPU, possibly spread across many nodes.

Why Ray Train

Framework agnostic

Native integrations with PyTorch, PyTorch Lightning, Hugging Face Transformers and Accelerate, JAX, TensorFlow, XGBoost, and LightGBM.

Scale up

Run on a laptop for development; scale to many GPUs and nodes by changing one line of configuration.
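
For example, the same training script can move from local debugging to a multi-GPU cluster run by swapping only the ScalingConfig. A minimal sketch; the worker counts are illustrative:

from ray.train import ScalingConfig

# Local development: one CPU worker.
scaling_config = ScalingConfig(num_workers=1, use_gpu=False)

# Cluster run: 16 workers, one GPU each, scheduled across nodes by Ray.
scaling_config = ScalingConfig(num_workers=16, use_gpu=True)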

Fault tolerance

Automatic checkpointing, resumption, and worker restart on failure.

Composable

Combine with Ray Data for distributed data loading, Ray Tune for hyperparameter tuning, and Ray Serve for deployment.

At a glance

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    import torch
    from ray.train.torch import prepare_data_loader, prepare_model

    # prepare_model wraps the model in DistributedDataParallel and
    # moves it to this worker's GPU.
    model = MyModel()  # your existing torch.nn.Module
    model = prepare_model(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # prepare_data_loader adds a DistributedSampler so each worker
    # trains on its own shard, and moves batches to the worker's device.
    data_loader = ...  # your torch.utils.data.DataLoader
    data_loader = prepare_data_loader(data_loader)

    for epoch in range(config["epochs"]):
        for batch in data_loader:
            ...  # forward pass, loss, backward pass, optimizer step

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()

Quickstarts by framework

PyTorch

Native distributed PyTorch with prepare_model / prepare_data_loader.

Lightning

Distributed training with the PyTorch Lightning Trainer.

Transformers

Distributed fine-tuning of Hugging Face models.

Concepts

Key concepts

Trainers, workers, scaling configs, run configs.
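
As a rough sketch of how these pieces fit together (the run name and storage path are placeholders):

from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # runs once on every worker
    # ScalingConfig: how many workers, and what resources each gets.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # RunConfig: experiment-level settings, e.g. where results and
    # checkpoints are persisted.
    run_config=RunConfig(
        name="my-run",                           # placeholder
        storage_path="s3://my-bucket/ray-runs",  # placeholder
    ),
)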

Checkpointing

Save and resume training state.
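
A minimal sketch of reporting a checkpoint from inside the training loop; the metric names and file name are illustrative:

import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_loop_per_worker(config):
    ...  # create model, optimizer, data loader
    for epoch in range(config["epochs"]):
        ...  # train for one epoch, producing `loss`
        # Write training state to a local directory, then report it;
        # Ray Train persists the checkpoint alongside the metrics.
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            ray.train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )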

Data loading

Pipe Ray Datasets into training workers.
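
For example, a Ray Dataset passed to the trainer can be read inside the loop with get_dataset_shard. A sketch; the data path is a placeholder:

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker streams its own shard of the "train" dataset.
    shard = ray.train.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=32):
        ...  # batch is a dict of column name -> torch.Tensor

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ray.data.read_parquet("s3://my-bucket/data")},  # placeholder path
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)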

Fault tolerance

Recover from worker, node, and storage failures.
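
Automatic restarts are opted into through FailureConfig; a minimal sketch:

from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=RunConfig(
        # Retry the run up to 3 times after worker or node failures.
        # On restart, the loop can resume from ray.train.get_checkpoint().
        failure_config=FailureConfig(max_failures=3),
    ),
)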